Gemma 4 is Google’s latest open-weight language model — a significant leap from Gemma 3 with better reasoning, longer context, and improved coding performance. Unlike cloud APIs, running it locally means zero data leaves your machine. Perfect for proprietary code, air-gapped environments, or just avoiding subscription fees.
Gemma 4 comes in four sizes: E2B and E4B for edge devices (phones, Raspberry Pi, IoT), and 26B MoE plus 31B Dense for workstations. All models are multimodal (vision + audio, even on the edge models), support 140+ languages, and now use the permissive Apache 2.0 license.
:::note[TL;DR]
- Gemma 4 comes in four sizes: E2B, E4B (edge/mobile), 26B MoE, and 31B Dense (workstation/server)
- E2B/E4B run on phones, Raspberry Pi, Jetson Nano with 128K context
- 26B MoE activates only 3.8B params for fast inference; 31B Dense for maximum quality with 256K context
- All models are multimodal (vision + audio, even on edge) and support 140+ languages
- Install Ollama, then `ollama pull gemma4:27b` — models download automatically on first use
- Apple Silicon gets GPU acceleration; NVIDIA needs ~24GB+ VRAM for the 31B model
- Now under Apache 2.0 license (not Google’s custom license) — truly open for commercial use

:::
Prerequisites
Before installing Gemma 4, check your hardware:
Minimum (CPU only):
- 4 GB RAM for E2B models (edge/IoT)
- 8 GB RAM for E4B models
- 16 GB RAM for 26B MoE models
- 32 GB RAM for 31B Dense models
Edge/Mobile (E2B/E4B):
- Runs on Raspberry Pi 4/5, NVIDIA Jetson Orin Nano
- Android phones with 6GB+ RAM
- iOS devices (via Core ML)
- 128K context window
Better performance (GPU):
- Apple Silicon Mac (M1/M2/M3/M4) — Metal acceleration works out of the box
- NVIDIA GPU with 8+ GB VRAM for E4B models
- NVIDIA GPU with 16+ GB VRAM for 26B MoE
- NVIDIA GPU with 24+ GB VRAM for 31B Dense
- 256K context window for 26B/31B models
Key Features:
- Multimodal: Vision + audio understanding on all models
- Multilingual: Native support for 140+ languages
- Agentic: Native function calling and structured JSON output
- License: Apache 2.0 (fully permissive for commercial use)
- Context: 128K (E2B/E4B) or 256K (26B/31B) tokens
Install Ollama
If you don’t have Ollama yet, install it first:
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from ollama.com. Runs as a background service.
Verify installation:
ollama --version
Download and Run Gemma 4
Ollama makes this trivial. Models download on first use and cache for future runs.
# Run the E2B model (edge/IoT, ~2GB, fastest on limited hardware)
ollama run gemma4:2b
# Run the E4B model (edge/IoT, ~3GB, better quality than E2B)
ollama run gemma4:4b
# Run the 26B MoE model (desktop, activates 3.8B params, fast inference)
ollama run gemma4:27b
# Run the 31B Dense model (workstation, maximum quality, 256K context)
ollama run gemma4:31b
The Scenario: You’re deploying an AI assistant on a Raspberry Pi 5 at a remote factory. You pull `gemma4:2b`, get local vision + audio processing with 128K context, and it all runs offline without internet. The E2B model handles OCR from camera feeds and voice commands natively.
First launch downloads the model weights:
- E2B: ~2GB
- E4B: ~3GB
- 26B MoE: ~16GB (fits on 80GB H100 unquantized, ~7GB quantized)
- 31B Dense: ~19GB (fits on 80GB H100 unquantized, ~8GB quantized)
Subsequent starts are instant.
Available Model Variants
Gemma 4 offers quantized variants for different VRAM constraints:
| Variant | Effective Size | VRAM Needed | Best For | Context |
|---|---|---|---|---|
| `gemma4:2b` (E2B) | ~2 GB | 3-4 GB | Raspberry Pi, IoT, phones | 128K |
| `gemma4:4b` (E4B) | ~3 GB | 4-6 GB | Edge devices, Jetson Nano | 128K |
| `gemma4:27b` (26B MoE) | ~16 GB (activates 3.8B) | 12-16 GB | Fast desktop inference | 256K |
| `gemma4:31b` (31B Dense) | ~19 GB | 24+ GB | Maximum quality, fine-tuning | 256K |
| `gemma4:27b-q4_K_M` | ~7 GB | 8-10 GB | Mid-range GPUs (26B MoE) | 256K |
| `gemma4:31b-q4_K_M` | ~8 GB | 10-12 GB | High-end consumer GPUs | 256K |
Key difference: The 26B MoE activates only 3.8 billion parameters during inference — delivering exceptional tokens/second while still having 26B total capacity. The 31B Dense uses all parameters for maximum quality.
Pull a quantized variant:
ollama pull gemma4:31b-q4_K_M
:::tip
The q4_K_M quantization uses 4-bit precision with intelligent mixing. You lose ~2-3% quality but save 30-40% VRAM. Most users won’t notice the difference for everyday coding tasks.
:::
Hardware-Specific Setup
Apple Silicon (M1/M2/M3/M4)
No configuration needed. GPU acceleration works automatically via Metal:
ollama run gemma4:27b
On an M2 Pro with 16GB unified memory, the 26B MoE model runs at roughly 30 tokens/second. The 31B Dense model also runs on M-series chips with 24GB+ RAM, though you may need to close other apps.
NVIDIA GPUs
A native install only needs a recent NVIDIA driver; if you run Ollama in Docker, also install the NVIDIA Container Toolkit. Verify the GPU is being used:
ollama ps # Shows if GPU is being used
:::warning
If you see “CUDA out of memory” errors, your model is too large for your VRAM. Kill the process with ollama stop gemma4:27b and switch to a smaller variant or quantized version.
:::
CPU-Only Systems
Gemma 4 runs on CPU if you lack a compatible GPU. It’s slower but functional:
# Force CPU mode if needed
export OLLAMA_NO_GPU=1
ollama run gemma4:2b
Expect 2-5 tokens/second on a modern CPU for the E2B model. Usable for simple queries on edge devices.
Edge Devices (Raspberry Pi, Jetson Nano)
The E2B and E4B models are engineered specifically for edge:
# On Raspberry Pi 5 with 8GB RAM
ollama run gemma4:2b
# On NVIDIA Jetson Orin Nano
ollama run gemma4:4b
Features on edge:
- Vision: Process camera frames locally for OCR, object detection
- Audio: Native speech recognition and understanding
- Offline: Works without internet after initial download
- Low latency: Near-zero response time for real-time applications
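The factory scenario above can be sketched with the Ollama Python client (`pip install ollama`). This is a minimal sketch, not a tested recipe: the `gemma4:2b` tag follows this guide's naming, and `frame.jpg` stands in for a real camera capture. The `images` field on a chat message is how the Ollama client passes image files to multimodal models.

```python
def vision_message(prompt: str, image_path: str) -> dict:
    """Build a multimodal chat message; the Ollama client accepts image
    file paths (or raw bytes) in the `images` field."""
    return {"role": "user", "content": prompt, "images": [image_path]}

def ocr_frame(image_path: str, model: str = "gemma4:2b") -> str:
    """Ask the local edge model to read the text out of one camera frame."""
    import ollama  # lazy import: needs `pip install ollama` and a running server
    response = ollama.chat(
        model=model,
        messages=[vision_message("Read any text visible in this image.", image_path)],
    )
    return response["message"]["content"]
```

On the Pi, `ocr_frame("frame.jpg")` then returns the recognized text with no network round-trip.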
Using the REST API
Ollama exposes a REST API at localhost:11434, plus an OpenAI-compatible endpoint under /v1:
Basic chat completion
curl http://localhost:11434/api/chat -d '{
"model": "gemma4:31b",
"messages": [
{ "role": "user", "content": "Explain recursion in Python" }
],
"stream": false
}'
Generate (single prompt)
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:31b",
"prompt": "Write a Python function to reverse a linked list",
"stream": false
}'
OpenAI-compatible endpoint
Any library that works with OpenAI can point to Ollama:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # required but ignored
)
response = client.chat.completions.create(
model='gemma4:27b',
messages=[{'role': 'user', 'content': 'Refactor this function'}]
)
print(response.choices[0].message.content)
Python SDK Usage
Install the official Ollama Python library:
pip install ollama
Basic usage:
import ollama
response = ollama.chat(
model='gemma4:31b',
messages=[
{'role': 'user', 'content': 'Write a bash script to find large files'}
]
)
print(response['message']['content'])
Streaming for real-time output:
stream = ollama.chat(
model='gemma4:31b',
messages=[{'role': 'user', 'content': 'Tell me a joke'}],
stream=True,
)
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
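The agentic side (native function calling, listed under Key Features) can be sketched with the same library. Two assumptions here: the `tools=` parameter of `ollama.chat`, which recent versions of the Python client support, and a made-up `get_weather` tool standing in for real logic.

```python
def get_weather(city: str) -> str:
    """Toy local tool the model can request; replace with real logic."""
    return f"Sunny in {city}"

# JSON-schema tool description the model sees
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a tool call the model returned."""
    registry = {"get_weather": get_weather}
    fn = registry[tool_call["function"]["name"]]
    return fn(**tool_call["function"]["arguments"])

def run_agent_turn(prompt: str, model: str = "gemma4:31b") -> list:
    import ollama  # lazy import: needs `pip install ollama` and a running server
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    return [dispatch(c) for c in response["message"].get("tool_calls", [])]
```

The model decides whether to call the tool; `dispatch` runs whatever calls come back.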
IDE Integration
Continue.dev (VS Code / JetBrains)
Add to your Continue config:
{
"models": [
{
"title": "Gemma 4 31B (Local)",
"provider": "ollama",
"model": "gemma4:31b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Gemma 4 26B MoE Autocomplete",
"provider": "ollama",
"model": "gemma4:27b"
}
}
The Scenario: You’re on a plane with no Wi-Fi. Open VS Code, hit Tab for autocomplete, and Gemma 4 suggests the next line. Local AI doesn’t need the internet.
Cursor
In Cursor settings, add a custom OpenAI-compatible model:
- Base URL: `http://localhost:11434/v1`
- Model: `gemma4:31b`
Claude Code
Pipe files to your local Gemma 4 instance:
claude -p "Review this code for bugs" < src/utils/parser.ts
Useful Commands
ollama list # show downloaded models
ollama pull gemma4:31b # download a specific variant
ollama rm gemma4:27b # remove a model to free space
ollama show gemma4:31b # model info and parameters
ollama ps # show running models
ollama stop gemma4:31b # stop a running model
ollama run gemma4:4b "prompt" # one-shot, non-interactive
Performance Comparison
Approximate tokens/second on different hardware:
| Hardware | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | 8 t/s | 4 t/s | N/A | N/A |
| M2 Pro (16GB) | 45 t/s | 35 t/s | 30 t/s | 15 t/s |
| RTX 4090 (24GB) | 90 t/s | 75 t/s | 65 t/s | 35 t/s |
| RTX 3060 (12GB) | 30 t/s | 25 t/s | 20 t/s | N/A |
| CPU (i7-12700K) | 5 t/s | 3 t/s | <1 t/s | <1 t/s |
Numbers are approximate — actual speed varies by prompt length and context window usage. The 26B MoE model activates only 3.8B parameters during inference, making it surprisingly fast for its size.
Prompting Tips
Gemma 4 responds well to direct, specific prompts:
For coding:
You are an expert Python developer. Write a clean, documented function that [task]. Include type hints and a docstring.
For explanation:
Explain [topic] as if I'm a senior developer who knows [related tech] but is new to this specific concept. Be concise.
For review:
Review this code for bugs, performance issues, and style violations. Rate each on severity (low/medium/high).
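These patterns are easy to keep as reusable templates. A small sketch; the placeholder names (`task`, `topic`, `known_tech`, `code`) are mine, not part of any API:

```python
# The three prompt patterns above, kept as fill-in templates.
TEMPLATES = {
    "coding": ("You are an expert Python developer. Write a clean, documented "
               "function that {task}. Include type hints and a docstring."),
    "explain": ("Explain {topic} as if I'm a senior developer who knows "
                "{known_tech} but is new to this specific concept. Be concise."),
    "review": ("Review this code for bugs, performance issues, and style "
               "violations. Rate each on severity (low/medium/high).\n\n{code}"),
}

def build_prompt(kind: str, **fields: str) -> str:
    """Fill one of the templates; raises KeyError if a field is missing."""
    return TEMPLATES[kind].format(**fields)
```

For example, `build_prompt("coding", task="parses a CSV file")` yields a ready-to-send coding prompt.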
Troubleshooting
”Error: model not found”
Run `ollama pull` with the exact tag you tried to use (e.g. `ollama pull gemma4:31b`) to download the weights first.
Out of memory errors
Switch to a smaller model or quantized variant. Use Activity Monitor (macOS) or nvidia-smi (Linux) to check memory usage.
Slow performance
- Verify GPU acceleration: `ollama ps` should show the model
- Try a smaller model variant
- Close other memory-heavy applications
- Check thermal throttling on laptops
API connection refused
Ensure Ollama server is running:
ollama serve # starts the server
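If you're scripting against the API, you can make the same check programmatically before sending requests: a running Ollama server answers its root path with HTTP 200 ("Ollama is running"). A stdlib-only sketch:

```python
import urllib.request
import urllib.error

def ollama_up(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at `host` with HTTP 200."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Call `ollama_up()` before your first request and fall back to `ollama serve` if it returns False.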
Summary
- Gemma 4 runs fully offline via Ollama — no API keys, no data leaks
- Four sizes: E2B and E4B for edge/mobile (128K context), 26B MoE and 31B Dense for workstations (256K context)
- 26B MoE activates only 3.8B parameters for fast inference; 31B Dense for maximum quality
- Quantized variants (`q4_K_M`) save VRAM with minimal quality loss
- Apple Silicon gets automatic GPU acceleration; NVIDIA needs sufficient VRAM
- Multimodal: Vision + audio understanding on all models
- Multilingual: Native support for 140+ languages
- Apache 2.0 license — fully permissive for commercial use
- OpenAI-compatible API works with existing tools and libraries
Frequently Asked Questions
What’s the difference between Gemma 3 and Gemma 4?
Gemma 4 improves reasoning, coding performance, and instruction following. The 31B Dense model ranks #3 on the Arena AI open-source leaderboard, outperforming models 20x its size. Key upgrades include:
- Multimodal support (vision + audio) on all models
- 140+ languages natively
- 128K context (E2B/E4B) or 256K context (26B/31B)
- Apache 2.0 license (was Google’s restrictive custom license)
- Native function calling and agentic workflow support
Can I run Gemma 4 without internet after the initial download?
Yes. Once you ollama pull the model, it runs entirely offline. The weights are stored in ~/.ollama/models/. No cloud connection required for inference. This is ideal for air-gapped environments, privacy-sensitive work, or deployments without reliable internet.
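If you want to audit how much disk those offline weights use, a quick sketch that walks the model store path mentioned above (the `~/.ollama/models/` default; override it if you relocated the store):

```python
from pathlib import Path

def downloaded_model_size_gb(models_dir: str = "~/.ollama/models") -> float:
    """Sum the size of everything under Ollama's local model store, in GB."""
    root = Path(models_dir).expanduser()
    if not root.exists():
        return 0.0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e9
```

Handy before pulling another 19 GB variant onto a small disk.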
Which Gemma 4 size should I choose?
- E2B (2B effective): Raspberry Pi, IoT devices, phones, real-time edge processing with vision/audio
- E4B (4B effective): Jetson Nano, Android devices, better quality than E2B while still edge-friendly
- 26B MoE (Mixture of Experts): Desktop workstations, fast inference (activates only 3.8B params), coding assistants
- 31B Dense: High-end GPUs, maximum quality, fine-tuning, complex reasoning tasks
How does the 26B MoE model work?
MoE (Mixture of Experts) means the model has 26 billion total parameters but only activates 3.8 billion during each inference pass. It routes each token to the most relevant “expert” sub-networks. This gives you fast tokens-per-second comparable to a 4B model, with the quality of a much larger model.
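The routing idea can be illustrated with a toy top-k gate. This is a sketch of the general MoE mechanism, not Gemma's actual router; the scores and `k=2` are made up:

```python
import math

def softmax(scores: list) -> list:
    """Turn raw router scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores: list, k: int = 2) -> list:
    """Pick the indices of the top-k experts for one token; only those
    experts' parameters run for this token."""
    probs = softmax(router_scores)
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
```

With scores `[0.1, 2.0, -1.0, 0.5]`, `route` selects experts 1 and 3, so the other experts' weights never execute for that token. That is why per-token compute tracks the activated 3.8B, not the full 26B.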
Can I use Gemma 4 for commercial projects?
Yes. Gemma 4 uses the Apache 2.0 license — the same permissive license used by Android, Kubernetes, and TensorFlow. You can use it commercially, modify it, distribute it, and even build proprietary products on top of it. No usage restrictions, no attribution requirements beyond the license text.
What to Read Next
- How to Install Ollama and Run LLMs Locally — deeper dive into Ollama setup and configuration
- Qwen Coder Cheatsheet — comparison with the leading local coding model