Gemma 4 is Google’s latest open-weight language model — a significant leap from Gemma 3 with better reasoning, longer context, and improved coding performance. Unlike cloud APIs, running it locally means zero data leaves your machine. Perfect for proprietary code, air-gapped environments, or just avoiding subscription fees.
Gemma 4 comes in four sizes: E2B and E4B for edge devices (phones, Raspberry Pi, IoT), and 26B MoE plus 31B Dense for workstations. All models are multimodal (vision + audio, even on the edge models), support 140+ languages, and now use the permissive Apache 2.0 license.
:::note[TL;DR]
- Gemma 4 comes in four sizes: E2B, E4B (edge/mobile), 26B MoE, and 31B Dense (workstation/server)
- E2B/E4B run on phones, Raspberry Pi, Jetson Nano with 128K context
- 26B MoE activates only 3.8B params for fast inference; 31B Dense for maximum quality with 256K context
- All models are multimodal (vision + audio, even on edge) and support 140+ languages
- Install Ollama, then `ollama pull gemma4:27b` — models download automatically on first use
- Apple Silicon gets GPU acceleration; NVIDIA needs ~24GB+ VRAM for the 31B model
- Now under Apache 2.0 license (not Google’s custom license) — truly open for commercial use

:::
Prerequisites
Before installing Gemma 4, check your hardware:
Minimum (CPU only):
- 4 GB RAM for E2B models (edge/IoT)
- 8 GB RAM for E4B models
- 16 GB RAM for 26B MoE models
- 32 GB RAM for 31B Dense models
Edge/Mobile (E2B/E4B):
- Runs on Raspberry Pi 4/5, NVIDIA Jetson Orin Nano
- Android phones with 6GB+ RAM
- iOS devices (via Core ML)
- 128K context window
Better performance (GPU):
- Apple Silicon Mac (M1/M2/M3/M4) — Metal acceleration works out of the box
- NVIDIA GPU with 8+ GB VRAM for E4B models
- NVIDIA GPU with 16+ GB VRAM for 26B MoE
- NVIDIA GPU with 24+ GB VRAM for 31B Dense
- 256K context window for 26B/31B models
Key Features:
- Multimodal: Vision + audio understanding on all models
- Multilingual: Native support for 140+ languages
- Agentic: Native function calling and structured JSON output
- License: Apache 2.0 (fully permissive for commercial use)
- Context: 128K (E2B/E4B) or 256K (26B/31B) tokens
Install Ollama
If you don’t have Ollama yet, install it first:
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from ollama.com. Runs as a background service.
Verify installation:
ollama --version
Download and Run Gemma 4
Ollama makes this trivial. Models download on first use and cache for future runs.
# Run the E2B model (edge/IoT, ~2GB, fastest on limited hardware)
ollama run gemma4:2b
# Run the E4B model (edge/IoT, ~3GB, better quality than E2B)
ollama run gemma4:4b
# Run the 26B MoE model (desktop, activates 3.8B params, fast inference)
ollama run gemma4:27b
# Run the 31B Dense model (workstation, maximum quality, 256K context)
ollama run gemma4:31b
The Scenario: You’re deploying an AI assistant on a Raspberry Pi 5 at a remote factory. You pull `gemma4:2b`, get local vision + audio processing with 128K context, and it all runs offline without internet. The E2B model handles OCR from camera feeds and voice commands natively.
First launch downloads the model weights:
- E2B: ~2GB
- E4B: ~3GB
- 26B MoE: ~16GB (fits on 80GB H100 unquantized, ~7GB quantized)
- 31B Dense: ~19GB (fits on 80GB H100 unquantized, ~8GB quantized)
Subsequent starts are instant.
Available Model Variants
Gemma 4 offers quantized variants for different VRAM constraints:
| Variant | Effective Size | VRAM Needed | Best For | Context |
|---|---|---|---|---|
| `gemma4:2b` (E2B) | ~2 GB | 3-4 GB | Raspberry Pi, IoT, phones | 128K |
| `gemma4:4b` (E4B) | ~3 GB | 4-6 GB | Edge devices, Jetson Nano | 128K |
| `gemma4:27b` (26B MoE) | ~16 GB (activates 3.8B) | 12-16 GB | Fast desktop inference | 256K |
| `gemma4:31b` (31B Dense) | ~19 GB | 24+ GB | Maximum quality, fine-tuning | 256K |
| `gemma4:27b-q4_K_M` | ~7 GB | 8-10 GB | Mid-range GPUs (26B MoE) | 256K |
| `gemma4:31b-q4_K_M` | ~8 GB | 10-12 GB | High-end consumer GPUs | 256K |
Key difference: The 26B MoE activates only 3.8 billion parameters during inference — delivering exceptional tokens/second while still having 26B total capacity. The 31B Dense uses all parameters for maximum quality.
Pull a quantized variant:
ollama pull gemma4:31b-q4_K_M
:::tip
The q4_K_M quantization uses 4-bit precision with intelligent mixing. You lose ~2-3% quality but save 30-40% VRAM. Most users won’t notice the difference for everyday coding tasks.
:::
Hardware-Specific Setup
Apple Silicon (M1/M2/M3/M4)
No configuration needed. GPU acceleration works automatically via Metal:
ollama run gemma4:27b
On an M2 Pro with 16GB unified memory, the 26B MoE model runs at roughly 30 tokens/second. The 31B Dense model also runs on M-series chips with 24GB+ RAM, though you may need to close other apps.
NVIDIA GPUs
A native install only needs a recent NVIDIA driver; if you run Ollama in Docker, also install the NVIDIA Container Toolkit. Verify the GPU is being used:
ollama ps # Shows if GPU is being used
:::warning
If you see “CUDA out of memory” errors, your model is too large for your VRAM. Kill the process with ollama stop gemma4:27b and switch to a smaller variant or quantized version.
:::
CPU-Only Systems
Gemma 4 runs on CPU if you lack a compatible GPU. It’s slower but functional:
# Force CPU mode if needed
export OLLAMA_NO_GPU=1
ollama run gemma4:2b
Expect 2-5 tokens/second on a modern CPU for the E2B model. Usable for simple queries on edge devices.
Edge Devices (Raspberry Pi, Jetson Nano)
The E2B and E4B models are engineered specifically for edge:
# On Raspberry Pi 5 with 8GB RAM
ollama run gemma4:2b
# On NVIDIA Jetson Orin Nano
ollama run gemma4:4b
Features on edge:
- Vision: Process camera frames locally for OCR, object detection
- Audio: Native speech recognition and understanding
- Offline: Works without internet after initial download
- Low latency: Near-zero response time for real-time applications
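The factory scenario above can be sketched with the Ollama Python client (`pip install ollama`). This is a minimal sketch, not a tested recipe: the `gemma4:2b` tag follows this guide's naming, and `frame.jpg` stands in for a real camera capture. The `images` field on a chat message is how the Ollama client passes image files to multimodal models.

```python
def vision_message(prompt: str, image_path: str) -> dict:
    """Build a multimodal chat message; the Ollama client accepts image
    file paths (or raw bytes) in the `images` field."""
    return {"role": "user", "content": prompt, "images": [image_path]}

def ocr_frame(image_path: str, model: str = "gemma4:2b") -> str:
    """Ask the local edge model to read the text out of one camera frame."""
    import ollama  # lazy import: needs `pip install ollama` and a running server
    response = ollama.chat(
        model=model,
        messages=[vision_message("Read any text visible in this image.", image_path)],
    )
    return response["message"]["content"]
```

On the Pi, `ocr_frame("frame.jpg")` then returns the recognized text with no network round-trip.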
Using the REST API
Ollama exposes a REST API at localhost:11434, plus an OpenAI-compatible endpoint under /v1:
Basic chat completion
curl http://localhost:11434/api/chat -d '{
"model": "gemma4:31b",
"messages": [
{ "role": "user", "content": "Explain recursion in Python" }
],
"stream": false
}'
Generate (single prompt)
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:31b",
"prompt": "Write a Python function to reverse a linked list",
"stream": false
}'
OpenAI-compatible endpoint
Any library that works with OpenAI can point to Ollama:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # required but ignored
)
response = client.chat.completions.create(
model='gemma4:27b',
messages=[{'role': 'user', 'content': 'Refactor this function'}]
)
print(response.choices[0].message.content)
Python SDK Usage
Install the official Ollama Python library:
pip install ollama
Basic usage:
import ollama
response = ollama.chat(
model='gemma4:31b',
messages=[
{'role': 'user', 'content': 'Write a bash script to find large files'}
]
)
print(response['message']['content'])
Streaming for real-time output:
stream = ollama.chat(
model='gemma4:31b',
messages=[{'role': 'user', 'content': 'Tell me a joke'}],
stream=True,
)
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
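The agentic side (native function calling, listed under Key Features) can be sketched with the same library. Two assumptions here: the `tools=` parameter of `ollama.chat`, which recent versions of the Python client support, and a made-up `get_weather` tool standing in for real logic.

```python
def get_weather(city: str) -> str:
    """Toy local tool the model can request; replace with real logic."""
    return f"Sunny in {city}"

# JSON-schema tool description the model sees
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a tool call the model returned."""
    registry = {"get_weather": get_weather}
    fn = registry[tool_call["function"]["name"]]
    return fn(**tool_call["function"]["arguments"])

def run_agent_turn(prompt: str, model: str = "gemma4:31b") -> list:
    import ollama  # lazy import: needs `pip install ollama` and a running server
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    return [dispatch(c) for c in response["message"].get("tool_calls", [])]
```

The model decides whether to call the tool; `dispatch` runs whatever calls come back.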
IDE Integration
Continue.dev (VS Code / JetBrains)
Add to your Continue config:
{
"models": [
{
"title": "Gemma 4 31B (Local)",
"provider": "ollama",
"model": "gemma4:31b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Gemma 4 26B MoE Autocomplete",
"provider": "ollama",
"model": "gemma4:27b"
}
}
The Scenario: You’re on a plane with no Wi-Fi. Open VS Code, hit Tab for autocomplete, and Gemma 4 suggests the next line. Local AI doesn’t need the internet.
Cursor
In Cursor settings, add a custom OpenAI-compatible model:
- Base URL: `http://localhost:11434/v1`
- Model: `gemma4:31b`
Claude Code
Pipe files to your local Gemma 4 instance:
claude -p "Review this code for bugs" < src/utils/parser.ts
Useful Commands
ollama list # show downloaded models
ollama pull gemma4:31b # download a specific variant
ollama rm gemma4:27b # remove a model to free space
ollama show gemma4:31b # model info and parameters
ollama ps # show running models
ollama stop gemma4:31b # stop a running model
ollama run gemma4:4b "prompt" # one-shot, non-interactive
Performance Comparison
Approximate tokens/second on different hardware:
| Hardware | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | 8 t/s | 4 t/s | N/A | N/A |
| M2 Pro (16GB) | 45 t/s | 35 t/s | 30 t/s | 15 t/s |
| RTX 4090 (24GB) | 90 t/s | 75 t/s | 65 t/s | 35 t/s |
| RTX 3060 (12GB) | 30 t/s | 25 t/s | 20 t/s | N/A |
| CPU (i7-12700K) | 5 t/s | 3 t/s | <1 t/s | <1 t/s |
Numbers are approximate — actual speed varies by prompt length and context window usage. The 26B MoE model activates only 3.8B parameters during inference, making it surprisingly fast for its size.
Prompting Tips
Gemma 4 responds well to direct, specific prompts:
For coding:
You are an expert Python developer. Write a clean, documented function that [task]. Include type hints and a docstring.
For explanation:
Explain [topic] as if I'm a senior developer who knows [related tech] but is new to this specific concept. Be concise.
For review:
Review this code for bugs, performance issues, and style violations. Rate each on severity (low/medium/high).
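These patterns are easy to keep as reusable templates. A small sketch; the placeholder names (`task`, `topic`, `known_tech`, `code`) are mine, not part of any API:

```python
# The three prompt patterns above, kept as fill-in templates.
TEMPLATES = {
    "coding": ("You are an expert Python developer. Write a clean, documented "
               "function that {task}. Include type hints and a docstring."),
    "explain": ("Explain {topic} as if I'm a senior developer who knows "
                "{known_tech} but is new to this specific concept. Be concise."),
    "review": ("Review this code for bugs, performance issues, and style "
               "violations. Rate each on severity (low/medium/high).\n\n{code}"),
}

def build_prompt(kind: str, **fields: str) -> str:
    """Fill one of the templates; raises KeyError if a field is missing."""
    return TEMPLATES[kind].format(**fields)
```

For example, `build_prompt("coding", task="parses a CSV file")` yields a ready-to-send coding prompt.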
Troubleshooting
”Error: model not found”
Run `ollama pull` with the exact tag you tried to use (e.g. `ollama pull gemma4:31b`) to download the weights first.
Out of memory errors
Switch to a smaller model or quantized variant. Use Activity Monitor (macOS) or nvidia-smi (Linux) to check memory usage.
Slow performance
- Verify GPU acceleration: `ollama ps` should show the model
- Try a smaller model variant
- Close other memory-heavy applications
- Check thermal throttling on laptops
API connection refused
Ensure Ollama server is running:
ollama serve # starts the server
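If you're scripting against the API, you can make the same check programmatically before sending requests: a running Ollama server answers its root path with HTTP 200 ("Ollama is running"). A stdlib-only sketch:

```python
import urllib.request
import urllib.error

def ollama_up(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at `host` with HTTP 200."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Call `ollama_up()` before your first request and fall back to `ollama serve` if it returns False.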
Summary
- Gemma 4 runs fully offline via Ollama — no API keys, no data leaks
- Four sizes: E2B and E4B for edge/mobile (128K context), 26B MoE and 31B Dense for workstations (256K context)
- 26B MoE activates only 3.8B parameters for fast inference; 31B Dense for maximum quality
- Quantized variants (`q4_K_M`) save VRAM with minimal quality loss
- Apple Silicon gets automatic GPU acceleration; NVIDIA needs sufficient VRAM
- Multimodal: Vision + audio understanding on all models
- Multilingual: Native support for 140+ languages
- Apache 2.0 license — fully permissive for commercial use
- OpenAI-compatible API works with existing tools and libraries
Frequently Asked Questions
What’s the difference between Gemma 3 and Gemma 4?
Gemma 4 improves reasoning, coding performance, and instruction following. The 31B Dense model ranks #3 on the Arena AI open-source leaderboard, outperforming models 20x its size. Key upgrades include:
- Multimodal support (vision + audio) on all models
- 140+ languages natively
- 128K context (E2B/E4B) or 256K context (26B/31B)
- Apache 2.0 license (was Google’s restrictive custom license)
- Native function calling and agentic workflow support
Can I run Gemma 4 without internet after the initial download?
Yes. Once you ollama pull the model, it runs entirely offline. The weights are stored in ~/.ollama/models/. No cloud connection required for inference. This is ideal for air-gapped environments, privacy-sensitive work, or deployments without reliable internet.
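If you want to audit how much disk those offline weights use, a quick sketch that walks the model store path mentioned above (the `~/.ollama/models/` default; override it if you relocated the store):

```python
from pathlib import Path

def downloaded_model_size_gb(models_dir: str = "~/.ollama/models") -> float:
    """Sum the size of everything under Ollama's local model store, in GB."""
    root = Path(models_dir).expanduser()
    if not root.exists():
        return 0.0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e9
```

Handy before pulling another 19 GB variant onto a small disk.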
Which Gemma 4 size should I choose?
- E2B (2B effective): Raspberry Pi, IoT devices, phones, real-time edge processing with vision/audio
- E4B (4B effective): Jetson Nano, Android devices, better quality than E2B while still edge-friendly
- 26B MoE (Mixture of Experts): Desktop workstations, fast inference (activates only 3.8B params), coding assistants
- 31B Dense: High-end GPUs, maximum quality, fine-tuning, complex reasoning tasks
How does the 26B MoE model work?
MoE (Mixture of Experts) means the model has 26 billion total parameters but only activates 3.8 billion during each inference pass. It routes each token to the most relevant “expert” sub-networks. This gives you fast tokens-per-second comparable to a 4B model, with the quality of a much larger model.
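The routing idea can be illustrated with a toy top-k gate. This is a sketch of the general MoE mechanism, not Gemma's actual router; the scores and `k=2` are made up:

```python
import math

def softmax(scores: list) -> list:
    """Turn raw router scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores: list, k: int = 2) -> list:
    """Pick the indices of the top-k experts for one token; only those
    experts' parameters run for this token."""
    probs = softmax(router_scores)
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
```

With scores `[0.1, 2.0, -1.0, 0.5]`, `route` selects experts 1 and 3, so the other experts' weights never execute for that token. That is why per-token compute tracks the activated 3.8B, not the full 26B.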
Can I use Gemma 4 for commercial projects?
Yes. Gemma 4 uses the Apache 2.0 license — the same permissive license used by Android, Kubernetes, and TensorFlow. You can use it commercially, modify it, distribute it, and even build proprietary products on top of it. No usage restrictions, no attribution requirements beyond the license text.
What to Read Next
- How to Install Ollama and Run LLMs Locally — deeper dive into Ollama setup and configuration
- Qwen Coder Cheatsheet — comparison with the leading local coding model