Gemma is Crazy Fast

I've been playing around with local LLMs for the past year, trying out the smaller models at various quants to see how they fare on normal consumer GPUs. I have nothing crazy, just an RX 7600 XT, which is a few years old now but has 16GB of VRAM (bought specifically for LLMs, as I don't game), in the hope that local models would eventually scale down well. Well, I just tried Gemma 4, and I think we might be there?

Gemma 4 is the latest Gemma release from Google, and it comes in several parameter sizes.

The 31B model has all parameters active; the rest activate only a subset per token. They also have a nice 126k/256k context size. Being MoE models dramatically helps speed up tok/s.
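The intuition behind that speedup: token generation is mostly memory-bandwidth bound, so what matters is how many weight bytes get read per token, i.e. the active parameters, not the total. A rough back-of-envelope sketch in Python (the bandwidth and bytes-per-param numbers are illustrative assumptions, not measurements):

```python
# Rough, bandwidth-bound estimate of generation speed.
# All numbers below are illustrative assumptions, not measurements.

def est_tok_per_s(active_params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float) -> float:
    """tokens/s ~= memory bandwidth / bytes of weights read per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW = 288.0   # assumed GB/s for an RX 7600 XT-class card
BPP = 0.55   # assumed bytes/param at a ~4.4 bpw quant like IQ4_XS

dense = est_tok_per_s(26.0, BPP, BW)  # dense: all ~26B params read per token
moe = est_tok_per_s(4.0, BPP, BW)     # MoE: only ~4B active params per token

print(f"dense ~{dense:.0f} t/s, MoE ~{moe:.0f} t/s")
```

The model ignores compute, KV-cache reads, and partial CPU offload, but it shows why a 26B-A4B MoE can generate tokens several times faster than a dense model of the same total size.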

Testing it out

Anyway, I have a very simple vibe-coded script (the only things I vibe-code are scripts) to kick off local models with a ROCm-backed llama.cpp build I compile myself, but here's a snippet of the actual command I tested. It's a bit of a game using just the CLI when setting the context size and GPU offload, and you also have to make sure the quant you grab is actually good.

GGML_VK_VISIBLE_DEVICES=0 HIP_VISIBLE_DEVICES=0 ROCM_HOME=/opt/rocm llama-cli \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_XS \
--ctx-size 16384 \
--n-gpu-layers 25 \
--temp 1 \
--top-p 0.95 \
--top-k 64 \
--min-p 0.0 \
--cache-type-k bf16 \
--cache-type-v bf16
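The core of a launcher script like mine is just assembling the environment and argv. A minimal Python sketch of that idea, mirroring the command above (the defaults and model tag are taken from it; everything else is an assumption, not my actual script):

```python
import os
import subprocess

def build_llama_cmd(model: str, ctx: int = 16384,
                    gpu_layers: int = 25) -> list[str]:
    """Assemble a llama-cli invocation matching the command above."""
    return [
        "llama-cli",
        "-hf", model,
        "--ctx-size", str(ctx),
        "--n-gpu-layers", str(gpu_layers),
        "--temp", "1",
        "--top-p", "0.95",
        "--top-k", "64",
        "--min-p", "0.0",
        "--cache-type-k", "bf16",
        "--cache-type-v", "bf16",
    ]

def run(model: str) -> None:
    # Pin the first GPU for both the Vulkan and HIP backends.
    env = dict(os.environ,
               GGML_VK_VISIBLE_DEVICES="0",
               HIP_VISIBLE_DEVICES="0",
               ROCM_HOME="/opt/rocm")
    subprocess.run(build_llama_cmd(model), env=env, check=True)

# Usage:
# run("unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_XS")
```

Keeping the sampling flags in one place makes it easy to sweep ctx and gpu_layers until the model fits in VRAM.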

I don't use local models for tasks like coding or research yet. If I did, I'd probably just use LM Studio or Unsloth's new tool. Running this I get [ Prompt: 83.3 t/s | Generation: 25.2 t/s ], which is pretty good for this GPU!
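To put those numbers in wall-clock terms, here's a quick calculation for a hypothetical request (the prompt and reply lengths are assumptions I picked, not a real workload):

```python
# Turn [ Prompt: 83.3 t/s | Generation: 25.2 t/s ] into end-to-end latency
# for an assumed request size.
prompt_tps, gen_tps = 83.3, 25.2
prompt_tokens, reply_tokens = 2000, 500  # assumed workload

latency_s = prompt_tokens / prompt_tps + reply_tokens / gen_tps
print(f"~{latency_s:.0f}s end to end")  # roughly 44s for this example
```

Prompt processing dominates for long contexts, which is why both numbers matter, not just the generation speed.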

OpenRouter has some benchmarks for this specific model here. It's interesting that Terminal-Bench is pretty low, which makes me think this may not be great for agentic coding. Google's official model card doesn't include that benchmark, but it shows pretty good results on BIG-bench and LiveCodeBench. I'm not sure how reliable these benchmark comparison sites are, but it matches or beats Sonnet on a lot!

Broader Thoughts

Intel recently released an "affordable" 32GB GPU for around $1k, which would be enough for the dense model, or for this MoE one with more context than 16k. I think in the next 2-3 years, assuming local models don't fall out of fashion, things will be smart and fast enough that a local LLM server will make sense.

You can currently grab a Framework desktop with 128GB of RAM for about $3k, but that has roughly 256GB/s of memory bandwidth, while the Arc card (which is GDDR6, not GDDR7) has about 2.4x that at 608GB/s. I think a 2-3 year old Epyc server with either that card or a 3090 is in my future. I'd love to explore use cases other than coding. I'm reading Accelerando as I type this and, while it's fiction, it's giving me a lot of ideas about the future of agents :)
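Since generation is roughly bandwidth-bound, those memory numbers translate almost directly into relative t/s, all else (model, quant, backend) being equal. A quick sanity check on the ratio:

```python
# Bandwidth ratio between the two options; generation speed should scale
# roughly proportionally when the whole model fits in memory.
framework_bw = 256.0  # GB/s, Framework desktop unified memory
arc_bw = 608.0        # GB/s, the 32GB Arc card

ratio = arc_bw / framework_bw
print(f"Arc has about {ratio:.1f}x the bandwidth")
```

The Framework's advantage is capacity (128GB fits much bigger models), so the trade is capacity vs. speed.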