The significance of Gemini 3: AI has moved past the "hallucination stage" and is approaching human capabilities, with "human-machine collaboration" shifting from "humans correcting AI" to "humans guiding AI's work."

Wallstreetcn
2025.11.19 00:40

Ethan Mollick believes that the emergence of Gemini 3 signifies the rise of "agent models" with autonomous capabilities. Although it is not flawless, the errors it makes are no longer baseless "hallucinations," but are closer to the biases humans have in judgment or intent understanding. As AI capabilities improve, human-machine collaboration is evolving from "humans fixing AI's mistakes" to "humans guiding AI's work."

Google's latest release of the Gemini 3 model marks a key turning point in the field of artificial intelligence.

Wallstreetcn previously reported that on Tuesday, November 18, Eastern Time, Google officially launched Gemini 3, its most powerful and most anticipated artificial intelligence (AI) model to date. It became available the same day across Google Search, the Gemini app, and multiple developer platforms, and is already being put to work in several of the company's revenue-generating products.

Google executives emphasized at the launch event that Gemini 3 leads several popular industry rankings that measure AI model performance. Demis Hassabis, CEO of Google's AI research lab DeepMind, called Gemini 3 "the best multimodal understanding model in the world" and the company's most powerful agentic and code-generation model to date.

According to an in-depth evaluation by Wharton School professor Ethan Mollick, the release of Gemini 3 and its accompanying tool "Antigravity" demonstrates astonishing "agentic" capabilities. Compared with the GPT-3 model of three years ago, AI no longer merely generates text; it can write code, build interactive applications, and execute multi-step tasks.

Mollick pointed out that this leap from "description" to "action" means AI is evolving from a conversational partner into a general-purpose tool that can operate a computer and complete real work.

Mollick's conclusion is that we are transitioning from the "chatbot era" to the "digital colleague era." Although Gemini 3 is not flawless, the mistakes it makes are no longer baseless "hallucinations," but are closer to human biases in judgment or intent understanding.

This indicates that the model of human-machine collaboration is changing; what humans will do is no longer just correct AI's basic errors, but provide it with higher-level strategic guidance and direction. This may be the biggest transformation in the AI field since the release of ChatGPT.

Below is the full text of Ethan Mollick's evaluation article:

Three years, from GPT-3 to Gemini 3

From chatbot to agent

Ethan Mollick

November 19, 2025

I have been testing Google's newly released Gemini 3 model. It is outstanding. But rather than listing benchmark numbers, I decided to let the AI itself demonstrate how far we have come in these three years.

It has been nearly three years since ChatGPT was released. A few days before its launch, I wrote my first article on this Substack about OpenAI's earlier GPT-3 model. Then ChatGPT burst onto the scene, and I immediately wrote another: "I am usually cautious about technology predictions, but this time it will change our world faster and more dramatically than we expect. No longer just automating repetitive and dangerous work, the first jobs to be disrupted by AI may be those more focused on analysis, creativity, writing, and communication."

Looking back today, I was right.

(Illustration: A screenshot of my first article after using GPT-3.5 three years ago)

I could explain in words how different the original ChatGPT is from the Gemini 3 Google released today, but there is no need. I handed that screenshot to Gemini 3 along with a single sentence:

Me: "Create something to show everyone how much AI has progressed in the past three years."

Gemini: "Of course. Your tweet is from November 2022—the eve of the modern AI boom. At that time, the biggest 'wow factor' was that AI could write a coherent piece of text or compose a little poem for a silly prompt like 'candy-powered superluminal spaceship escaping from otters.'

To demonstrate progress, I not only described it in words but also created an interactive, playable 'candy-powered superluminal spaceship simulator' for you.

In 2022, AI could only describe the engine; by 2025, AI can write the engine, design the interface, and even let you pilot the spaceship yourself."

It really did create a mini-game: I had to escape the pursuing otters using candy energy, complete with little poems and various funny status updates. Readers of this blog probably already know that AI can write code, but the more interesting question is: once you stop seeing Gemini 3 as a "chat box" and start seeing it as "something else," what else can it do?

A tool that writes code is not just a "code-writing tool"

Alongside Gemini 3, Google also released Antigravity. Programmers will recognize it at a glance: like Claude Code and OpenAI Codex, it is an agentic tool that gets access to your computer and writes programs autonomously under your guidance.

If you're not a programmer, you might overlook it; I think that would be a mistake. "Writing code" does not mean "being a programmer"; it means "being able to accomplish any task that can only be done on a computer." That completely changes what these tools actually are.

Gemini 3 excels at writing code, and this matters even if you don't consider yourself a "programmer." A basic insight in the AI field is that everything you do on a computer ultimately comes down to code; as long as AI can write code, it can build dashboards, scrape websites, create slide decks, read files, and more. That makes "code-writing agents" a universal tool (a toy sketch after the screenshot below makes this concrete). Antigravity productizes the concept: it gives me an "inbox," I assign tasks to the agents, and they notify me when they need approval or assistance.

(Illustration: Four agents running simultaneously, one working, one waiting for my response)
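To make the "universal tool" point concrete, here is a toy Python sketch, not anything Antigravity actually runs: the same few lines read a file, fetch a web page, and produce a small report, three superficially different "computer tasks" that are all just code (the file names are hypothetical).

```python
# Toy illustration: three "different" computer tasks in one short script.
import csv
import urllib.request

# 1. Read a file (hypothetical data.csv with a "value" column).
with open("data.csv", newline="") as f:
    values = [float(row["value"]) for row in csv.DictReader(f)]

# 2. Scrape a website (fetch the raw HTML).
html = urllib.request.urlopen("https://example.com").read().decode()

# 3. Build a tiny "dashboard" (a minimal HTML report).
with open("report.html", "w") as f:
    f.write(f"<h1>Report</h1><p>Mean value: {sum(values) / len(values):.2f}</p>"
            f"<p>Fetched {len(html)} bytes from example.com</p>")
```

An agent that can write and run scripts like this can, in principle, chain any number of such tasks together.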

I communicate with them not through code, but in English; they use code to do the work for me. Gemini 3 excels at planning, knowing what to do and when to ask for guidance. For example, I put all the past newsletter drafts in one folder and instructed:

"Create a beautiful webpage for me that summarizes all my predictions about AI, and also search online for which ones were right and which were wrong."

It read all the documents, ran the code, and first presented me with an editable proposal—this was the first time it asked me something, and its understanding was so accurate that it surprised me. I made a few minor adjustments and let it proceed.

Then it searched the web, built the site, took over the browser to check the results, and packaged the final product for me. I provided feedback like I would to a real colleague, and it continued to iterate.

It is not perfect; agents have not reached that level yet. I did not encounter hallucinations, but there were places that needed my correction. Those errors, however, were more like the judgment biases or misunderstandings a human colleague might have than the outrageous hallucinations of earlier AI. Importantly, I felt in control of the AI's decisions, because it regularly checked in to confirm its work and its process was clearly visible to me. The experience felt more like managing a teammate than talking to an AI through a chat interface.

PhD-Level Intelligence?

Antigravity is not the only surprise. Another shock is that it demonstrated true "judgment."

I often complain that AI benchmarks have become a mess. Gemini 3 leads most rankings (it may not outperform the $200 GPT-5 Pro, but when the "Deep Think" version of Gemini 3 arrives, it might turn the tables). The industry loves the slogan "PhD-level intelligence." I decided to put it to the test.

I threw at it a pile of old files from a crowdfunding research project I did ten years ago, with file names like "project_final_seriously_this_time_done.xls" and data in the ancient Stata format. My instruction was a single sentence:

"Understand the data structure on your own, clean up the STATA files, and prepare for new analysis."

It really did restore the damaged data and thoroughly understood the complicated setup.
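For readers curious what that kind of cleanup looks like in practice, here is a minimal Python sketch under stated assumptions: pandas is available, and the file name is hypothetical. This is an illustration of the task, not the agent's actual code.

```python
# Minimal sketch: load a legacy Stata file, inspect it, clean it, re-save it.
import pandas as pd

# convert_categoricals=False avoids crashes when old value labels
# are inconsistent or damaged.
df = pd.read_stata("project_final_v2.dta", convert_categoricals=False)

# Understand the data structure before touching anything.
print(df.shape)
print(df.dtypes)

# Basic cleanup: drop fully empty rows/columns, normalize column names.
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Save in a modern format, ready for new analysis.
df.to_csv("crowdfunding_clean.csv", index=False)
```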

Then I gave it a typical "second-year PhD paper" task, without any hints:

"Very well, now write an original paper using this data. Conduct in-depth research in the field, elevate the theme to the level of entrepreneurship or strategic theory, perform rigorous statistics, and format it as a journal article."

It chose a topic, proposed hypotheses, ran the statistics, created charts, and formatted the text, successfully striking the hardest balance of all: picking a topic that is both interesting and feasible. I vaguely told it to "enrich and improve it further" and ultimately received a 14-page paper.

(Illustration: The first two pages of the paper)

Even more astonishingly, it created an index: using NLP, it mathematically compared each project description against a large corpus of other descriptions to measure the "uniqueness" of a crowdfunding project's creativity. It wrote the code itself and verified the results.
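The article does not show the agent's code, but an index like this is commonly built from TF-IDF vectors and cosine similarity. Here is a self-contained sketch of that approach using scikit-learn; it is a plausible reconstruction, not the agent's actual method.

```python
# Sketch: score each project description's uniqueness against the corpus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "A smartwatch that tracks your sleep and fitness goals",
    "An illustrated cookbook of family recipes from Sicily",
    "A smartwatch with fitness tracking and sleep analysis",
]

# Represent each description as a TF-IDF vector.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Pairwise cosine similarity between all descriptions.
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, 0.0)  # ignore self-similarity

# Uniqueness = 1 minus the average similarity to every other description.
uniqueness = 1.0 - sim.sum(axis=1) / (len(descriptions) - 1)
for desc, u in zip(descriptions, uniqueness):
    print(f"{u:.3f}  {desc}")
```

The two near-duplicate smartwatch pitches score low; the one-of-a-kind cookbook scores high.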

So, can this be considered "PhD-level"?

If you mean "capable of doing the work of a competent graduate student at a top university," the answer is partly "yes." But it also has the classic shortcomings of graduate students: flawed statistical methods, theoretical leaps that are too large, an evidence chain that is not rigorous enough. We have moved past the "hallucination" stage and into a subtler, more "human" territory of flaws. Interestingly, when I gave open-ended suggestions the way a teacher would (e.g., "add more crowdfunding literature to ground the method"), it improved markedly; with more guidance, it might well approach "PhD" level.

What is Gemini 3?

It is an outstanding "thinking plus execution" partner that billions of people around the world can easily access. It is also a mirror reflecting several trends at once: AI progress has not slowed, agents are on the rise, and humans need to learn how to manage "smart AI."

Three years ago, we were amazed that "machines can write otter poetry"; less than 1,000 days later, I was debating statistical methods with an intelligent agent that built its own research environment.

The era of chatbots is giving way to the era of "digital colleagues."

Indeed, Gemini 3 is still not perfect and requires "human managers" who can direct and verify. But "humans in the loop" is evolving from "humans cleaning up after AI" to "humans guiding AI to work"—this may be the biggest paradigm shift since the release of ChatGPT.

Easter egg:

I asked Gemini to "create a Substack cover image purely with code, checking the dimensions first." It first searched online for the specifications, then drew the image purely with mathematics, completing the task seamlessly.
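For illustration, here is what "drawing an image purely with mathematics" can look like in Python with Pillow. The 1456x1048 dimensions are an assumption about Substack's recommended cover size, not a verified specification.

```python
# Sketch: generate a cover image from interfering sine waves, no assets needed.
import math
from PIL import Image

W, H = 1456, 1048  # assumed Substack cover dimensions
img = Image.new("RGB", (W, H))
px = img.load()

for y in range(H):
    for x in range(W):
        # Each pixel's color comes purely from arithmetic on its coordinates.
        v = math.sin(x / 90.0) + math.sin(y / 60.0) + math.sin((x + y) / 140.0)
        r = int(128 + 100 * math.sin(v))
        g = int(128 + 100 * math.sin(v + 2.0))
        b = int(128 + 100 * math.sin(v + 4.0))
        px[x, y] = (r, g, b)

img.save("substack_cover.png")
```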

Obligatory Warning:

Granting AI agents access to your computer carries risks: they may move or delete files without asking, or even leak documents. Versions of these tools designed for non-programmers will make this much safer; at this stage, caution is essential.