:::note[TL;DR]
- Ollama installs as a local model server — one command on macOS/Linux, installer on Windows
- Run a model with `ollama run llama3.2` — it downloads automatically on first use (~2–5 GB)
- The REST API at `http://localhost:11434` is OpenAI-compatible — any existing OpenAI library works with zero code changes
- Apple Silicon gets GPU acceleration automatically; NVIDIA needs the container toolkit
- Models run fully offline after the initial download — no API key, no data leaves your machine
:::
## Prerequisites
- macOS, Linux, or Windows
- At least 8 GB RAM for 7B models (16 GB recommended)
- ~5–10 GB free disk space per model
- Docker (optional — only needed for Open WebUI)
Ollama is a tool that lets you download and run large language models on your own hardware. No API key. No internet connection required after download. No data leaving your machine. You get a local model server with a REST API you can call from any application.
In 2026, local LLMs are good enough for real work. Llama 3.3, Mistral, Gemma 3, Qwen 2.5, and Phi-4 all run well on a modern laptop or desktop. If you have a GPU, even better.
## What you need
**Minimum (CPU only):**
- 8 GB RAM for 7B parameter models
- 16 GB RAM for 13B parameter models
**Better performance:**
- Apple Silicon Mac (M1/M2/M3/M4) — excellent performance via Metal GPU
- NVIDIA GPU with 8+ GB VRAM — runs models at full GPU speed
- AMD GPU (ROCm support on Linux)
## Install Ollama
**macOS:**

```shell
brew install ollama
# or download the app from ollama.com
```

**Linux:**

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:** Download the installer from ollama.com. Ollama runs as a background service.
Verify the installation:

```shell
ollama --version
```
## Download and run a model
```shell
# Run a model (downloads automatically on first use)
ollama run llama3.2

# Start chatting — type your message and press Enter
# Type /bye to exit
```
That’s it. The first run downloads the model (~2–5 GB depending on the model); subsequent runs start instantly from the local cache.
## Popular models to try
| Model | Size | Best for |
|---|---|---|
| `llama3.2` | 3B (2 GB) | Fast responses, everyday tasks |
| `llama3.3` | 70B (43 GB) | Best open-source quality, needs 64 GB RAM |
| `mistral` | 7B (4 GB) | Good general use, fast |
| `gemma3` | 4B (3 GB) | Google’s efficient model |
| `phi4` | 14B (9 GB) | Microsoft’s compact high-quality model |
| `qwen2.5-coder` | 7B (4 GB) | Code generation and completion |
| `deepseek-r1` | 7B (5 GB) | Reasoning and math |
| `nomic-embed-text` | — | Text embeddings for RAG |
```shell
# Pull without running
ollama pull mistral

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2
```
:::tip
For code assistance on confidential projects, `qwen2.5-coder` is the strongest coding-focused model available through Ollama. Pull it with `ollama pull qwen2.5-coder` — it handles code review, completion, and refactoring without any data leaving your machine.
:::
**The scenario:** You’re working on a client project with sensitive code and you need code review help. You can’t paste it into ChatGPT — confidentiality agreement. You install Ollama, pull `qwen2.5-coder`, and get solid code suggestions with zero data leaving the machine. Problem solved.
## Run as a server
Ollama runs as a local REST API server on `http://localhost:11434`. You can call it from your own apps:
```shell
# Start the server (usually starts automatically on install)
ollama serve

# Chat via API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain recursion in simple terms" }
  ],
  "stream": false
}'
```
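If you'd rather not shell out to curl, the same chat request works from Python's standard library. A minimal sketch, assuming the default server address; the `build_chat_request` helper here is illustrative, not part of the Ollama API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address


def build_chat_request(model: str, content: str, stream: bool = False) -> dict:
    """Build the JSON body for POST /api/chat with a single user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
    }


def chat(model: str, content: str) -> str:
    """Send a non-streaming chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, content)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Non-streaming responses carry the full reply in message.content
    return data["message"]["content"]
```

With the server running, `chat("llama3.2", "Explain recursion in simple terms")` returns the reply as a plain string.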
Generate endpoint (single prompt, no chat history):
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a Python function to check if a string is a palindrome",
  "stream": false
}'
```
## Use with Python
```shell
pip install ollama
```

```python
import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])
```
Streaming:

```python
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
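Under the hood, the raw streaming API emits newline-delimited JSON: one object per chunk, with `done: true` on the last. Reassembling it by hand looks roughly like this (the canned `raw` transcript below is made up for illustration):

```python
import json


def join_stream(lines) -> str:
    """Concatenate message.content across a sequence of streamed JSON lines."""
    text = ""
    for line in lines:
        chunk = json.loads(line)
        text += chunk.get("message", {}).get("content", "")
        if chunk.get("done"):  # final chunk signals end of stream
            break
    return text


# Canned example of what streamed lines look like
raw = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(join_stream(raw))  # → Hello!
```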
## Use with OpenAI-compatible clients
Ollama exposes an OpenAI-compatible API at `/v1`. Any library that supports OpenAI can point to Ollama instead:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
```
## Use with Open WebUI (chat interface)
Open WebUI gives you a ChatGPT-like browser interface for your local models:
```shell
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
Open `http://localhost:3000`. It automatically detects your Ollama models.
## Useful commands

```shell
ollama run llama3.2           # run a model interactively
ollama run llama3.2 "prompt"  # one-shot, non-interactive
ollama list                   # list local models
ollama pull mistral           # download a model
ollama rm mistral             # delete a model
ollama show llama3.2          # model info and parameters
ollama ps                     # show running models
ollama stop llama3.2          # stop a running model
```
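In scripts, the tabular output of `ollama list` can be parsed with a few lines of Python. A sketch assuming the usual NAME / ID / SIZE / MODIFIED columns; the sample output below is made up:

```python
def parse_ollama_list(output: str) -> list[dict]:
    """Parse `ollama list` output into dicts (assumes the standard column layout)."""
    models = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            models.append({
                "name": parts[0],
                "id": parts[1],
                "size": " ".join(parts[2:4]),  # e.g. "4.1 GB"
            })
    return models


sample = """NAME             ID            SIZE      MODIFIED
llama3.2:latest  a80c4f17acd5  2.0 GB    2 days ago
mistral:latest   f974a74358d6  4.1 GB    5 days ago"""

for m in parse_ollama_list(sample):
    print(m["name"], m["size"])
```

In practice you would feed it real output, e.g. `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`.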
## Performance tips

- **Apple Silicon:** GPU acceleration via Metal works automatically. No setup needed.
- **NVIDIA GPU:** Install the NVIDIA Container Toolkit for maximum speed.
- **RAM matters more than you think:** Models need to fit in RAM (or VRAM). If you only have 8 GB, stick to 3B–7B models.
- **Quantization:** Ollama serves quantized models (Q4, Q5, Q8) by default — they use less memory with minimal quality loss. `ollama pull llama3.3:70b-instruct-q4_K_M` gets the 70B at lower precision.

:::warning
Models must fit in RAM (or VRAM) to run at usable speed. If a model is larger than available memory, Ollama falls back to CPU-only inference, which is extremely slow — often 1–3 tokens per second. On 8 GB RAM, stick to 3B–7B models; on 16 GB, up to 13B. Check `ollama show <model>` to see the size before pulling.
:::
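The RAM numbers above follow from simple arithmetic: memory ≈ parameters × bits-per-weight ÷ 8, plus runtime overhead for the KV cache and buffers. A rough sketch; the 20% overhead factor is a ballpark assumption, not an Ollama constant:

```python
def estimate_model_gb(params_billion: float, quant_bits: int,
                      overhead: float = 1.2) -> float:
    """Rough memory estimate: weights at quant_bits per parameter, plus overhead."""
    weight_bytes = params_billion * 1e9 * quant_bits / 8
    return round(weight_bytes * overhead / 1e9, 1)


# A 7B model at Q4 lands around 4 GB, a 70B at Q4 around 42 GB,
# which lines up with the sizes in the model table above.
print(estimate_model_gb(7, 4))
print(estimate_model_gb(70, 4))
```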
## Summary

- Ollama installs as a local server on macOS, Linux, and Windows — one command to install, one to run a model
- Models download automatically on first run and are cached for subsequent use
- The REST API at `http://localhost:11434` is OpenAI-compatible — any library that supports OpenAI can point to Ollama
- Apple Silicon Macs get GPU acceleration automatically; NVIDIA GPUs need the container toolkit for full speed
- Open WebUI gives you a ChatGPT-like browser interface for your local Ollama models with one Docker command
## Frequently Asked Questions

### Which Ollama model is best for coding?
`qwen2.5-coder` (7B or 32B) is the strongest coding-focused model available through Ollama in 2026. For general coding assistance on a machine with limited RAM, `mistral` (7B) is a reliable fallback. If you have a GPU with 24+ GB VRAM, `llama3.3:70b` gives near-GPT-4 quality.
### Can I use Ollama with LangChain or LlamaIndex?
Yes. Both have Ollama integrations. LangChain: `from langchain_ollama import ChatOllama`. LlamaIndex: `from llama_index.llms.ollama import Ollama`. Or use the OpenAI-compatible endpoint with `base_url="http://localhost:11434/v1"`.
### Does Ollama work without internet after the initial download?
Yes. Once a model is downloaded (`ollama pull`), it runs entirely offline. The model weights are stored locally in `~/.ollama/models/`. No data is sent anywhere during inference.
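Since the weights live under `~/.ollama/models/`, you can audit how much disk your local models occupy with a few lines of standard-library Python. A sketch; point it at any directory:

```python
from pathlib import Path


def dir_size_gb(path: str) -> float:
    """Total size of all regular files under path, in GB."""
    root = Path(path).expanduser()
    total = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return round(total / 1e9, 2)


# Example: dir_size_gb("~/.ollama/models")
```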
## What to Read Next
- What Is an LLM? — understand what you’re running locally before optimizing it
- MCP vs Function Calling — connect your local Ollama models to tools and external data