I Tried Gemma 4 Against GPT and Claude for a Week

Author: Kuldeepsinh Jadeja

Published: April 29, 2026

Categories: Programming, Chatgpt, Artificial-intelligence, Google, Technology

I can genuinely feel my perception of using AI changing.

Beyond Benchmarks

Real developer tasks — coding, debugging, writing — and what actually matters in practice.

A week of actual work. Not screenshots.

The numbers didn’t feel real at first.
The benchmarks looked suspicious.

89.2% on AIME. 80% on LiveCodeBench v6. A Codeforces Elo that jumped from 110 to 2,150 in a single generation. It reads like someone forgot to sanity-check the slide before publishing it.

Google DeepMind released Gemma 4 on April 2, 2026. The numbers read like they had been optimized for headlines.

So instead of arguing about benchmarks (which I’ve done before, and it goes nowhere), I decided to stop reading and start testing.

I spent a week running Gemma 4 against GPT and Claude on the kinds of tasks I actually do: debugging, writing component logic, explaining unfamiliar APIs, and drafting docs.

This is Gemma 4 vs GPT vs Claude from someone who was genuinely skeptical going in.

What Gemma 4 Actually Is (Before We Compare Anything)

Gemma 4 is Google DeepMind’s latest family of open-weight models. Four sizes: E2B, E4B, 26B MoE, and 31B Dense. Apache 2.0 license, which means you can use it commercially, self-host it, fine-tune it, and ship it inside a product without negotiating with anyone.

The 31B Dense is the flagship. That’s what I tested most heavily.

The MoE variant, with 26B total parameters but only 3.8B active per inference, is the one I’d actually use in production. It matches roughly 97% of the 31B’s quality at a fraction of the compute cost. I’ll come back to that.
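
To put "a fraction of the compute cost" into rough numbers, here is a quick back-of-the-envelope sketch. It assumes per-token compute scales roughly with the number of active parameters, which is a simplification (it ignores attention, routing overhead, and everything else). The function name is made up for illustration.

```typescript
// Parameter counts in billions, taken from the figures quoted above.
const MOE_TOTAL_B = 26;
const MOE_ACTIVE_B = 3.8;
const DENSE_B = 31;

// Hypothetical helper: express `part` as a percentage of `whole`,
// rounded to one decimal place.
function sharePercent(part: number, whole: number): string {
  return ((part / whole) * 100).toFixed(1);
}

// Share of the MoE's own parameters that are active at a time.
console.log(`Active share of the MoE: ${sharePercent(MOE_ACTIVE_B, MOE_TOTAL_B)}%`);

// Active MoE parameters compared with the 31B Dense model's parameter count.
console.log(`Active MoE params vs 31B Dense: ${sharePercent(MOE_ACTIVE_B, DENSE_B)}%`);
```

Under that assumption, the MoE touches roughly 15% of its own weights per step, or about 12% of what the dense model uses. Treat it as an order-of-magnitude guide, not a measurement.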

What I actually did (nothing fancy)

I didn’t run synthetic benchmarks. I ran the things that were already on my to-do list:

  • A custom React hook with some messy state management
  • An async race condition in a Node.js service that had been bugging me for two days
  • A piece of the Next.js App Router docs I kept avoiding because I didn’t fully understand it

Also: writing test cases for a utility function, a short technical spec, some changelogs, and PR descriptions. The unglamorous stuff that takes more time than it should.

I rotated between Gemma 4, GPT, and Claude without overthinking it. Same prompts where possible. Sometimes not. I wasn’t trying to be fair, just honest.

It’s the same kind of work I described when I wrote about automating 80% of my workflow with AI: not glamorous, but time-consuming.


Where Gemma 4 Genuinely Surprised Me

The coding tasks.

That’s where I least expected it to hold up, and it did.

On the React hook, Gemma 4 31B produced clean, idiomatic code on the first pass. No hallucinated APIs. No incorrect hook dependency arrays. It even flagged a stale closure issue I hadn’t explicitly mentioned.

That part felt… slightly uncomfortable, honestly. Like it inferred more than I intended.
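
If "stale closure" is unfamiliar, here is a minimal sketch of the pattern in plain TypeScript. This is not my actual hook and it doesn't use React at all; `render` is a made-up stand-in for a component render, and it only shows why a callback that never gets re-created keeps seeing old state.

```typescript
// Hypothetical stand-in for a component render: each call builds a handler
// that closes over the `count` value passed to that particular render.
function render(count: number): () => string {
  return () => `count is ${count}`;
}

let count = 0;

// Handler created during the "first render". If it is kept around and never
// re-created (for example, because a dependency was left out of a hook's
// dependency array), it keeps the value it captured here.
const staleHandler = render(count);

// State changes, and a handler created after the change sees the new value.
count = 1;
const freshHandler = render(count);

console.log(staleHandler()); // count is 0
console.log(freshHandler()); // count is 1
```

In a real hook the fix is usually to list the captured value as a dependency so the callback is rebuilt when it changes.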

The async bug was more interesting. I pasted ~80 lines, described the symptoms, and Gemma 4 correctly identified that the problem was in how I was handling Promise.allSettled results. Specifically, I wasn’t checking whether a result was rejected before accessing its value property.

GPT got there too, eventually, but asked three clarifying questions first. Claude nailed it in one shot and explained it more clearly. Gemma 4 nailed it in one shot with less explanation.

Test suite. All green. First pass.

I’m not sure if “less explanation” is a feature or a bug. Depends on the day.
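
For reference, the kind of guard the Promise.allSettled bug called for looks roughly like this. It's a minimal sketch, not the code from my service: `splitSettled`, `demo`, and the sample tasks are all made up for illustration. The point is that only fulfilled results carry a `value`; rejected ones carry a `reason` instead.

```typescript
// Hypothetical helper: split settled results into successful values
// and failure reasons, checking `status` before touching `value`.
function splitSettled<T>(
  results: PromiseSettledResult<T>[],
): { values: T[]; reasons: unknown[] } {
  const values: T[] = [];
  const reasons: unknown[] = [];
  for (const result of results) {
    if (result.status === "fulfilled") {
      // Only fulfilled entries have a `value` property.
      values.push(result.value);
    } else {
      // Rejected entries have `reason` instead of `value`.
      reasons.push(result.reason);
    }
  }
  return { values, reasons };
}

// Usage sketch with made-up tasks: one succeeds, one fails.
async function demo(): Promise<void> {
  const tasks: Promise<string>[] = [
    Promise.resolve("user-1"),
    Promise.reject(new Error("upstream timeout")),
  ];
  const results = await Promise.allSettled(tasks);
  const { values, reasons } = splitSettled(results);
  console.log(`${values.length} succeeded, ${reasons.length} failed`);
}

demo();
```

Checking `status` first also lets TypeScript narrow the type, so the compiler flags any attempt to read `value` on a rejected result.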

For generating test cases, Gemma 4 was genuinely fast and solid. Edge cases included, structure was sensible, no unnecessary padding. That’s probably where I’d trust it the most without thinking twice.

Where GPT and Claude Still Win

Here’s the part that got buried in every benchmark post I read.

When things get a bit ambiguous, Gemma 4 starts to feel like it’s skipping steps.

I asked all three something architectural: where to put shared validation logic in a slightly messy service setup.

Claude leaned into the tradeoffs. It kind of walked around the problem a bit before landing somewhere.

GPT did a version of that too.

Gemma 4 gave a reasonable answer, but it felt like it jumped straight to the conclusion. Less “thinking with me”, more “here’s what you probably want”.

It’s not wrong. Just… lighter.

And yeah, writing is another gap.

For changelogs and PR descriptions, Claude still sounds like someone who’s done this before. There’s a rhythm to it. GPT is close.

Gemma 4 gets the job done, but you can feel the difference if you read it twice.

Also: long prompts.

I had one messy ~600-word spec with constraints buried everywhere. Claude followed all of it. GPT missed a couple small things. Gemma 4 missed more than that, not dramatically, but enough that I had to go back and fix it.

That surprised me more than the reasoning gap, actually.

The benchmark question (the annoying part)

Let me be direct about something.

Gemma 4 31B scores 84.3% on GPQA Diamond, a benchmark of graduate-level scientific reasoning. Claude’s top models and GPT sit above that on the same benchmark. And yes, in practice, the gap does show up. Not on simple tasks. On tasks where you need a model to reason carefully through something genuinely ambiguous, the frontier proprietary models are still playing a different game.

But here’s the thing I kept coming back to: for maybe 70–75% of my actual daily developer work, Gemma 4 was good enough. And “good enough, running locally, at zero per-token cost” is a different value proposition than “slightly better, $X per million tokens, data leaving my machine.”

I can’t speak to every use case. In the tasks I ran, the proprietary models were ahead on the complex reasoning end, and the gap was real, just not as wide as I’d assumed.

For teams handling sensitive code, or for anyone watching API costs quietly double over the last year, that narrowed gap changes the calculation. The math looks different when “a bit worse” also means “free, local, and yours.”

Not either/or. Just routing.

The Hybrid Workflow That Actually Makes Sense

I didn’t replace anything.

I just… split the work differently.

I run Gemma 4 locally for routine tasks: test generation, boilerplate, quick explanations, PR descriptions. The 26B MoE is my choice, at 40+ tokens/sec on a decent GPU with a manageable VRAM footprint. I switch to Claude or GPT for anything where the reasoning depth actually matters: complex architectural decisions, subtle debugging, long-document understanding.

Claude Code and similar tools now support fast model switching. The workflow is mature. It’s not an either/or choice.

Honestly, I think this is how most developers will end up using open models — not as replacements for the cloud APIs, but as a cheaper first pass that handles the 70% and routes the harder stuff up. The “open models will kill proprietary AI” framing misses this. They’ll coexist, and the tools will just get better at switching between them automatically.
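
The split described above can be sketched as a tiny routing function. This is an illustration of the idea, not a tool I actually run: the task names, the `Target` type, and `routeTask` are all hypothetical, and a real setup would still need to call whichever local runtime and cloud API you use.

```typescript
// Hypothetical task categories, mirroring the split described above.
type TaskKind =
  | "test-generation"
  | "boilerplate"
  | "quick-explanation"
  | "pr-description"
  | "architecture"
  | "subtle-debugging"
  | "long-document";

// Where a task gets sent: a locally hosted open model or a cloud API.
type Target = "local" | "cloud";

// Routine work that the local model handles as a cheaper first pass.
const ROUTINE_TASKS: ReadonlySet<TaskKind> = new Set<TaskKind>([
  "test-generation",
  "boilerplate",
  "quick-explanation",
  "pr-description",
]);

// Route routine tasks locally and everything else up to the cloud models.
function routeTask(kind: TaskKind): Target {
  return ROUTINE_TASKS.has(kind) ? "local" : "cloud";
}

console.log(routeTask("test-generation")); // local
console.log(routeTask("architecture")); // cloud
```

Defaulting unknown or harder work to the cloud side is a deliberate choice here: a wrong "local" guess costs a redo, while a wrong "cloud" guess only costs a few tokens.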

This wasn’t a planned system. It just sort of emerged after a few days.

And now it feels obvious.

One thing that might matter more than people are saying

There’s an article to be written about how much the Apache 2.0 license actually matters.

Gemma 3 had restrictions. Llama still has usage thresholds for large commercial deployments. Gemma 4 is clean. You can slot it into a product, fine-tune it on your codebase, and deploy it internally — without reading the fine print twice.

That’s not exciting to talk about, but if you’ve ever had to deal with compliance or enterprise constraints, it’s the difference between “interesting” and “usable”.

Still not cancelling the Claude subscription.

So… is there a winner?

Not really.

Gemma 4 vs GPT vs Claude isn’t a single-winner situation. It’s a question of what you need it for.

If you’re building something where the AI model runs locally, costs matter, or data privacy is non-negotiable, Gemma 4 is now a serious option in a way that previous open models weren’t. The benchmark numbers, for once, roughly held up in my experience.

Gemma 4 is good enough for a lot of real work. More than I expected.

GPT and Claude are still better when things get complicated. Also, more consistent.

Both of those things can be true at the same time.

My honest take: Gemma 4 made me rethink how I split work between local and cloud models. It didn’t make me cancel my Claude subscription.

Both of those things can be true.

Small afterthought

I went into this expecting to dismiss Gemma 4.

Didn’t happen.

Didn’t switch entirely either.

It just… earned a place in the workflow.

That’s a quieter outcome than the benchmarks suggest, but probably a more useful one.

Questions I had to Google before writing this:

Is Gemma 4 better than GPT for coding?

For straightforward to medium-complexity coding tasks, Gemma 4 31B is competitive with GPT and sometimes faster. For complex multi-file refactoring or deep architectural reasoning, GPT and Claude still have an edge in my experience.

Can Gemma 4 replace Claude for developers?

Not entirely, at least not yet. For 70–75% of routine developer tasks, Gemma 4 handles things well — especially if you’re running it locally. But for nuanced reasoning, writing quality, and complex instruction following, Claude is still ahead.

How does Gemma 4 31B perform on real tasks?

Better than I expected going in. Coding and test generation were strong. Complex reasoning and long-prompt instruction following were where it showed more limitations compared to proprietary frontier models.

What is the best way to run Gemma 4 locally?

The 26B MoE variant is the practical choice for most developers — it achieves roughly 97% of the 31B’s quality with significantly lower memory requirements, and runs at 40+ tokens/sec on a single modern GPU.



I Tried Gemma 4 Against GPT and Claude for a Week was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.
