Gemini 3 is here. Is it state of the art? How will it feel for normal humans inside Search and the Google’s Workspace apps? How does it compare to GPT-5.1, Claude Sonnet 4.5, DeepSeek, and friends?

Let’s see.

TL;DR

What Gemini 3 actually is

Gemini 1 pushed “native multimodality” and longer context. Gemini 2 and 2.5 leaned into reasoning and early agents (with Gemini 2.5 Pro sitting at the top of LMArena for half a year).

Gemini 3 tries to combine all of that into a single flagship:

From a technical perspective, there are three pillars here that matter:

1. Reasoning benchmarks

Evaluation chart for Gemini 3 Deep Think

Gemini 3 Pro surpasses Gemini 2.5 Pro on every major AI benchmark worth caring about.

Those benchmarks stress conceptual reasoning, science, and math. The takeaway: Gemini 3 sits clearly in the frontier band. You can argue about which inch of that band wins your favorite leaderboard; you cannot argue that Gemini 3 belongs in that tier.

2. Multimodal and context

Gemini 3 goes hard on multimodal:

These are tests where the model has to understand and reason over images and videos. Gemini 3’s longer context (over models like GPT 5.1) can ingest long papers, codebases, or lecture series and supposedly still reason accurately across visual and textual information in one pass.

Looking at Google’s focus on “generative interfaces,” it’s not hard to see where they’re planning to go: Layout-heavy answers, embedded simulations, and on-the-fly UI that responds to your query.

3. Agents, coding, and “plan anything”

On the coding side:

On longer-term planning:

How this plays for regular users

Everyone else cares about “what happens when I open Search or the Google Docs”. Here is how this release hits you if you live in Google land.

Search: AI Mode from day one

Gemini 3 now runs AI Mode in Search from launch, which is a first for Google. Earlier model versions shipped into Search weeks or months after model announcements.

Changes you will see:

This feels powerful and also brutal for publishers. More of the “answer” lives inside Google’s glass box. Reuters calls the redesign a further hit to sites that rely on search traffic for revenue.

Gemini app: less flattery, more spine

The Verge calls Gemini 3 “less prone to empty flattery” than ChatGPT and highlights that Google explicitly tuned responses to be concise, direct, and less sycophantic.

Google’s own blog leans into this: Gemini 3 aims to trade cliché and flattery for “genuine insight,” with the model more willing to give blunt corrections instead of mirroring your opinion.

For normal use, that should mean:

For the everyperson, this is welcome, and is a clear response to OpenAI’s own model tweaking away from sycophancy. AI that acts as a mirror for human ego has limited use.

How Gemini 3 stacks up against GPT-5.1, Claude Sonnet 4.5, DeepSeek, and others

Independent and semi-independent sources paint this rough picture:

Skywork’s pre-release comparative testing between GPT-5.1 and Gemini 3 lines up with this: GPT-5.1 leads for pure text work, long-form writing, steady API behavior, and some enterprise automation, while Gemini 3 feels stronger for visual work, multimodal flows, and large-document analysis (with that sweet 1M token context).

My read (getting down to it)

If you ignore marketing and look at use case clusters:

So does Gemini 3 “win”? For many users inside Google’s world, yes, because integration matters more than a few percentage points. For teams already locked into OpenAI or Anthropic ecosystems, Gemini 3 exists more as competitive pressure than a drop-in replacement.

The smarter move, as always, is getting cozy with multiple models: use Gemini 3 where its multimodal and context strengths matter, GPT-5.1 or Claude where their ecosystems and coding strengths shine, DeepSeek where cost and math dominate. While the foundation labs continue to duke it out, it’s clear that the consumer wins when embracing it all at once.


Antigravity: Google’s agent-first IDE

google-antigravity

I’d be remised to not include a short blurb in here on Antigravity (which probably deserves its own post later). Antigravity is Google’s new developer environment (yeah, another one) that centers agents rather than autocomplete snippets. It feels like a direct response to Cursor’s agent-based overhaul earlier this Fall.

Core traits:

Artifacts. As agents work, they emit Artifacts: task lists, detailed plans, screenshots, browser recordings, and other traces. Google argues these are easier to audit than raw tool-call logs.

Model mix. Gemini 3 Pro sits at the center, backed by the Gemini 2.5 Computer Use model for browser control and the Nano Banana image editing model. Antigravity also supports Claude Sonnet 4.5 and a GPT-OSS model for teams that want multi-vendor setups.

Availability and cost. Public preview, free to use with “generous” Gemini 3 Pro rate limits that reset every five hours.

If you already use Cursor, Copilot Workspace, or similar tools, Antigravity fits the same trend: IDEs where agents manage work streams, with humans supervising and steering rather than typing every line.

The differentiator here is tight integration with Google’s own models, browser, and safety stack plus explicit support for artifact-based traceability (which matters if you ever need to defend an agent’s behavior to a security team).


Gemini 3 is an impressive showing for Google, and is the first time I’ve seen a responsibly clear product vision on their AI-side. I like the emphasis on reduced flattery, explicit planning benchmarks, and real agent products instead of loose demos. I do worry about AI Mode in Search turning the open web into a corpus feeder with no path back for publishers and about agentic systems that touch browsers, terminals, and personal data with incomplete transparency.

For builders and power users, though, the practical message is simple: Gemini 3 belongs in your toolbelt. Treat it as one strong system among several, wire it into workflows that benefit from multimodal context and strong planning, and keep a human hand on the kill switch while the industry figures out how aggressive agent automation should become.

Originally published on the Handy AI newsletter →