Best Local LLM for 8GB VRAM Coding? I Tried Everything.

Author: Kuldeepsinh Jadeja

Published: May 23, 2026

Categories:

Programming

Machine-learning

Artificial-intelligence

Software-development

Best Local LLM for 8GB VRAM Coding? I Tried Everything | Kuldeepsinh Jadeja — **Best Local LLM for 8GB VRAM Coding? I Tried Everything.**

LOCAL AI / BUDGET HARDWARE

Gemma 4 killed my KV cache. Dense Qwen crawled at 3 tok/s. After weeks of testing local LLMs on a GTX 1070, one setup finally stopped fighting me.

After weeks of testing Gemma 4, dense Qwen, BeeLlama.cpp, and every “8GB-friendly” setup Reddit recommended, only one actually felt usable for real coding work.

If you just want the answer:

The best local LLM for 8GB VRAM coding right now is probably Qwen3.6–35B-A3B running through llama.cpp with--n-cpu-moe enabled.

Still loud. Still refusing to die. Turns out still useful. | Kuldeepsinh Jadeja — **Still loud. Still refusing to die. Turns out still useful.**

Every local LLM thread eventually turns into someone benchmarking a 70B model on a 4090 and calling it “surprisingly accessible.”

Meanwhile my GTX 1070 sounds like it’s preparing for takeoff every time llama.cpp starts loading weights.

So I went looking for the best local LLM for 8GB VRAM coding because honestly, I got tired of reading advice written by people whose GPU costs more than my entire desktop.

And after a few weeks of testing different setups, I think most “8GB compatible” recommendations are technically true in the same way sleeping in your car is technically a housing solution.

Yes, the models load.

That’s not the same thing as usable.

There’s a huge difference between:

the model generates tokens

and

you can leave it running on a coding task without the whole setup collapsing into a 3 tok/s swamp.

That gap matters way more than benchmarks do.

Especially for agentic coding.

Why Most Local LLM Setups on 8GB VRAM Fall Apart

The weird part is that loading the model usually isn’t the problem anymore.

Quantization got good enough that almost anything can be bullied into memory if you compress it hard enough and throw enough system RAM at llama.cpp.

The problems show up later.

Usually around the point where your coding agent has:

Opened twelve files
Dumped stack traces into context
Retried the same broken test four times
and quietly accumulated 40k tokens of garbage memory

That’s when the “works great on my machine” setups start dying.

Two things consistently killed performance for me:

1️⃣ KV Cache Growth

This is the real VRAM killer on long coding sessions.

Fresh context benchmarks are misleading because almost every model feels fast at 4k context. The collapse happens later, when the KV cache starts eating memory alongside the model itself.

Gemma 4 was especially bad here.

Around 30–40k context, I could literally watch VRAM disappear in nvtop while generation slowed to a crawl.

And coding agents hit long context fast. Faster than most people expect.

A single debugging session can accumulate:

Tool calls
Logs
Diffs
Retries
Terminal output
File contents
Reasoning traces

Suddenly your “small task” is dragging around half a novel worth of context.

2️⃣ PCIe Bottlenecks on Dense Models

Dense models on 8GB feel terrible once offloading starts.

Technically the setup works. Practically llama.cpp starts hauling weights back and forth between VRAM and RAM every forward pass like it’s carrying groceries up apartment stairs.

I tested Dense Qwen3.6–27B this way.

Around 3 tok/s.

Which sounds survivable until the agent makes a mistake and spends the next several minutes slowly explaining the mistake it slowly made.

That was the point where I realized most people defining “usable” have a much higher tolerance for suffering than I do.

The Candidates I Actually Tested (And What Broke)

Gemma 4 26B MoE

This one genuinely tricked me at first.

On paper it looks perfect for budget hardware:

MoE architecture
Smaller active parameter count
Reasonable quant sizes
Decent coding performance

Short sessions felt great.

Then context grew and the KV cache started eating the machine alive.

At around 32k tokens, performance became inconsistent enough that I stopped trusting unattended runs entirely.

That became the pattern with almost every “8GB-ready” setup:

fine at the beginning,
terrible once the session became realistic.

Single-turn coding help? Sure.

Long agentic sessions? I gave up on it pretty quickly.

I actually spent a full week trying to make Gemma 4 behave locally because on paper it looked like the perfect budget setup. The longer I tested it though, the stranger the KV cache behavior became. I’ll probably write that up separately because some of the degradation patterns were genuinely weird.

I Ran Gemma 4 Locally. Here’s What Nobody’s Telling You.

Three candidates. Two failures. One setup that held up | Kuldeepsinh Jadeja — **Three candidates. Two failures. One setup that held up.**

Dense Qwen3.6–27B

I understand why people love this model.

I also understand why most of those people own more VRAM than I do.

With 32GB system RAM, I got it running through layer offloading in llama.cpp.

The problem was speed.

Or more specifically: accumulated waiting.

I gave it a normal coding task:

Refactor a module
Write tests
Fix failures

The full loop took around 40 minutes.

Not because the model was dumb.
Mostly because inference felt like wading through wet cement.

The psychological effect surprised me more than the benchmark numbers did.

You stop experimenting when every mistake costs another ten minutes.

That part surprised me more than the benchmarks did, honestly.

Slow inference changes how you think. You become weirdly conservative. You stop trying ideas because every failed attempt turns into another coffee break.

BeeLlama.cpp

This one felt like driving a heavily modified project car.

Sometimes incredible.

Sometimes concerning.

The throughput gains are real. Short-context speed noticeably improved on my GTX 1070, even compared to standard llama.cpp.

But I also hit:

One hard segfault
One silent inference hang
And one session where my coding agent waited forever because the server quietly died mid-task

That matters more for coding agents than regular chat.

If I’m sitting there babysitting inference, fine.

If I leave for coffee and come back to a dead process after an hour-long task, the performance gains stop feeling very important.

I still think BeeLlama.cpp is worth watching though. It just feels early.

The One Setup That Actually Worked

The weird answer ended up being the model I originally ignored:

Qwen3.6–35B-A3B.

I ended up using a GGUF quant through Hugging Face, mostly because the llama.cpp workflow gave me more control than Ollama did.

The name alone made me assume there was absolutely no way this thing would behave on 8GB VRAM.

But because it’s MoE and only around 3B parameters stay active during inference, the real footprint is much smaller than the headline number suggests.

That changes everything about how the system behaves under load.

More importantly: The KV cache behavior stayed stable.

That was the real breakthrough.

Not peak benchmark speed.
Not first-token latency.
Just consistency during ugly, realistic, long-running coding sessions.

Fresh context on my GTX 1070 sits around:

35–40 tok/s early
20-ish tok/s during longer sessions

And honestly? Once the speed stays predictable, your brain stops thinking about the hardware constantly.

That’s the first time local coding on 8GB stopped feeling like a science experiment to me.

The llama.cpp Flag That Made the Biggest Difference

This was the command that finally stopped fighting me:

./llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe \
  --ctx-size 65536 \
  --flash-attn \
  -ctk q4_0 \
  -ctv q4_0 \
  -t 8

The important flag is:

--n-cpu-moe

That one matters way more than most people realize.

Without it, inactive MoE experts still create enough VRAM pressure that the KV cache starts competing for memory during long sessions.

With it, memory behavior becomes predictable enough that sessions stop randomly collapsing near the 30k-token mark.

That stability mattered more than raw speed.

One flag. The difference between running out of VRAM at 30k tokens and running clean past 60k | Kuldeepsinh Jadeja — **One flag. The difference between running out of VRAM at 30k tokens and running clean past 60k.**

Is Local Agentic Coding on 8GB VRAM Actually Viable?

Honestly?

Yeah. More than I expected.

Not “replace a cloud 4090 cluster” viable.
Not “run giant autonomous repo agents for 14 hours” viable.

But absolutely viable for:

Focused coding sessions
Feature implementation
Debugging
Test iteration
Medium-sized refactors

Especially if you pair:

8GB VRAM
32GB RAM
Qwen3.6–35B-A3B
llama.cpp
KV cache quantization

The funniest part is that the setup people kept dismissing as “too weak” was the first one that actually let me stop thinking about hardware and just code.

That’s probably the real benchmark.

38 tok/s at 12k context. Not a 4090. Good enough | Kuldeepsinh Jadeja — **38 tok/s at 12k context. Not a 4090. Good enough.**

I’ve also been experimenting with replacing cloud coding assistants entirely with local setups. That’s probably the next rabbit hole because the tradeoffs get weird fast once you start comparing local models against tools like Copilot or Claude Code directly.

I Replaced GitHub Copilot With a Local Setup. Here’s What Nobody Tells You

If you’ve been waiting to try local agentic coding until you could “afford a proper GPU” — you don’t have to wait.

The setup exists.

It’s just buried underneath a lot of benchmark screenshots from people running hardware most normal developers don’t own.

Honestly, that was probably the most useful thing I learned from all of this.

A message from our Founder

Hey, Sunil here. I wanted to take a moment to thank you for reading until the end and for being a part of this community. Did you know that our team run these publications as a volunteer effort to over 3.5m monthly readers? We don’t receive any funding, we do this to support the community.

If you want to show some love, please take a moment to follow me on LinkedIn, TikTok, Instagram. You can also subscribe to our weekly newsletter. And before you go, don’t forget to clap and follow the writer️!

Best Local LLM for 8GB VRAM Coding? I Tried Everything. was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

Read full article on Medium