:::note[TL;DR]
- Ollama installs as a local model server — one command on macOS/Linux, installer on Windows
- Run a model with `ollama run llama3.2` — it downloads automatically on first use (~2–5 GB)
- The REST API at `http://localhost:11434` is OpenAI-compatible — any existing OpenAI library works with zero code changes
- Apple Silicon gets GPU acceleration automatically; NVIDIA needs the container toolkit
- Models run fully offline after the initial download — no API key, no data leaves your machine
:::
## Prerequisites
- macOS, Linux, or Windows
- At least 8 GB RAM for 7B models (16 GB recommended)
- ~5–10 GB free disk space per model
- Docker (optional — only needed for Open WebUI)
Ollama is a tool that lets you download and run large language models on your own hardware. No API key. No internet connection required after download. No data leaving your machine. You get a local model server with a REST API you can call from any application.
In 2026, local LLMs are good enough for real work. Llama 3.3, Mistral, Gemma 3, Qwen 2.5, and Phi-4 all run well on a modern laptop or desktop. If you have a GPU, even better.
## What you need
**Minimum (CPU only):**
- 8 GB RAM for 7B parameter models
- 16 GB RAM for 13B parameter models
**Better performance:**
- Apple Silicon Mac (M1/M2/M3/M4) — excellent performance via Metal GPU
- NVIDIA GPU with 8+ GB VRAM — runs models at full GPU speed
- AMD GPU (ROCm support on Linux)
## Install Ollama
**macOS:**

```shell
brew install ollama
# or download the app from ollama.com
```

**Linux:**

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:** Download the installer from ollama.com. Ollama runs as a background service.
Verify the installation:

```shell
ollama --version
```
## Download and run a model
```shell
# Run a model (downloads automatically on first use)
ollama run llama3.2

# Start chatting — type your message and press Enter
# Type /bye to exit
```
That’s it. The first run downloads the model (~2–5 GB depending on the model); subsequent runs start instantly from the local cache.
## Popular models to try
| Model | Size | Best for |
|---|---|---|
| `llama3.2` | 3B (2 GB) | Fast responses, everyday tasks |
| `llama3.3` | 70B (43 GB) | Best open-source quality, needs 64 GB RAM |
| `mistral` | 7B (4 GB) | Good general use, fast |
| `gemma3` | 4B (3 GB) | Google’s efficient model |
| `phi4` | 14B (9 GB) | Microsoft’s compact high-quality model |
| `qwen2.5-coder` | 7B (4 GB) | Code generation and completion |
| `deepseek-r1` | 7B (5 GB) | Reasoning and math |
| `nomic-embed-text` | — | Text embeddings for RAG |
```shell
# Pull without running
ollama pull mistral

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2
```
:::tip
For code assistance on confidential projects, `qwen2.5-coder` is the strongest coding-focused model available through Ollama. Pull it with `ollama pull qwen2.5-coder` — it handles code review, completion, and refactoring without any data leaving your machine.
:::
**The scenario:** You’re working on a client project with sensitive code and you need code review help. You can’t paste it into ChatGPT — confidentiality agreement. You install Ollama, pull `qwen2.5-coder`, and get solid code suggestions with zero data leaving the machine. Problem solved.
## Run as a server
Ollama runs as a local REST API server on `http://localhost:11434`. You can call it from your own apps:
```shell
# Start the server (usually starts automatically on install)
ollama serve

# Chat via API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain recursion in simple terms" }
  ],
  "stream": false
}'
```
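If you'd rather not shell out to curl, the same chat request works from Python's standard library. A minimal sketch, assuming the default server address; the `build_chat_request` helper here is illustrative, not part of the Ollama API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address


def build_chat_request(model: str, content: str, stream: bool = False) -> dict:
    """Build the JSON body for POST /api/chat with a single user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
    }


def chat(model: str, content: str) -> str:
    """Send a non-streaming chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, content)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Non-streaming responses carry the full reply in message.content
    return data["message"]["content"]
```

With the server running, `chat("llama3.2", "Explain recursion in simple terms")` returns the reply as a plain string.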
Generate endpoint (single prompt, no chat history):
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a Python function to check if a string is a palindrome",
  "stream": false
}'
```
## Use with Python
```shell
pip install ollama
```

```python
import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])
```
Streaming:

```python
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
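Under the hood, the raw streaming API emits newline-delimited JSON: one object per chunk, with `done: true` on the last. Reassembling it by hand looks roughly like this (the canned `raw` transcript below is made up for illustration):

```python
import json


def join_stream(lines) -> str:
    """Concatenate message.content across a sequence of streamed JSON lines."""
    text = ""
    for line in lines:
        chunk = json.loads(line)
        text += chunk.get("message", {}).get("content", "")
        if chunk.get("done"):  # final chunk signals end of stream
            break
    return text


# Canned example of what streamed lines look like
raw = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(join_stream(raw))  # → Hello!
```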
## Use with OpenAI-compatible clients
Ollama exposes an OpenAI-compatible API at `/v1`. Any library that supports OpenAI can point to Ollama instead:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
```
## Use with Open WebUI (chat interface)
Open WebUI gives you a ChatGPT-like browser interface for your local models:
```shell
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
Open `http://localhost:3000`. It automatically detects your Ollama models.
## Useful commands

```shell
ollama run llama3.2           # run a model interactively
ollama run llama3.2 "prompt"  # one-shot, non-interactive
ollama list                   # list local models
ollama pull mistral           # download a model
ollama rm mistral             # delete a model
ollama show llama3.2          # model info and parameters
ollama ps                     # show running models
ollama stop llama3.2          # stop a running model
```
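In scripts, the tabular output of `ollama list` can be parsed with a few lines of Python. A sketch assuming the usual NAME / ID / SIZE / MODIFIED columns; the sample output below is made up:

```python
def parse_ollama_list(output: str) -> list[dict]:
    """Parse `ollama list` output into dicts (assumes the standard column layout)."""
    models = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            models.append({
                "name": parts[0],
                "id": parts[1],
                "size": " ".join(parts[2:4]),  # e.g. "4.1 GB"
            })
    return models


sample = """NAME             ID            SIZE      MODIFIED
llama3.2:latest  a80c4f17acd5  2.0 GB    2 days ago
mistral:latest   f974a74358d6  4.1 GB    5 days ago"""

for m in parse_ollama_list(sample):
    print(m["name"], m["size"])
```

In practice you would feed it real output, e.g. `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`.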
## Performance tips

- **Apple Silicon:** GPU acceleration via Metal works automatically. No setup needed.
- **NVIDIA GPU:** Install the NVIDIA Container Toolkit for maximum speed.
- **RAM matters more than you think:** Models need to fit in RAM (or VRAM). If you only have 8 GB, stick to 3B–7B models.
- **Quantization:** Ollama serves quantized models (Q4, Q5, Q8) by default — they use less memory with minimal quality loss. `ollama pull llama3.3:70b-instruct-q4_K_M` gets the 70B at lower precision.

:::warning
Models must fit in RAM (or VRAM) to run at usable speed. If a model is larger than available memory, Ollama falls back to CPU-only inference, which is extremely slow — often 1–3 tokens per second. On 8 GB RAM, stick to 3B–7B models; on 16 GB, up to 13B. Check `ollama show <model>` to see the size before pulling.
:::
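The RAM numbers above follow from simple arithmetic: memory ≈ parameters × bits-per-weight ÷ 8, plus runtime overhead for the KV cache and buffers. A rough sketch; the 20% overhead factor is a ballpark assumption, not an Ollama constant:

```python
def estimate_model_gb(params_billion: float, quant_bits: int,
                      overhead: float = 1.2) -> float:
    """Rough memory estimate: weights at quant_bits per parameter, plus overhead."""
    weight_bytes = params_billion * 1e9 * quant_bits / 8
    return round(weight_bytes * overhead / 1e9, 1)


# A 7B model at Q4 lands around 4 GB, a 70B at Q4 around 42 GB,
# which lines up with the sizes in the model table above.
print(estimate_model_gb(7, 4))
print(estimate_model_gb(70, 4))
```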
## Summary

- Ollama installs as a local server on macOS, Linux, and Windows — one command to install, one to run a model
- Models download automatically on first run and are cached for subsequent use
- The REST API at `http://localhost:11434` is OpenAI-compatible — any library that supports OpenAI can point to Ollama
- Apple Silicon Macs get GPU acceleration automatically; NVIDIA GPUs need the container toolkit for full speed
- Open WebUI gives you a ChatGPT-like browser interface for your local Ollama models with one Docker command
## Frequently Asked Questions

### Which Ollama model is best for coding?
`qwen2.5-coder` (7B or 32B) is the strongest coding-focused model available through Ollama in 2026. For general coding assistance on a machine with limited RAM, `mistral` (7B) is a reliable fallback. If you have a GPU with 24+ GB VRAM, `llama3.3:70b` gives near-GPT-4 quality.
### Can I use Ollama with LangChain or LlamaIndex?
Yes. Both have Ollama integrations. LangChain: `from langchain_ollama import ChatOllama`. LlamaIndex: `from llama_index.llms.ollama import Ollama`. Or use the OpenAI-compatible endpoint with `base_url="http://localhost:11434/v1"`.
### Does Ollama work without internet after the initial download?
Yes. Once a model is downloaded (`ollama pull`), it runs entirely offline. The model weights are stored locally in `~/.ollama/models/`. No data is sent anywhere during inference.
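Since the weights live under `~/.ollama/models/`, you can audit how much disk your local models occupy with a few lines of standard-library Python. A sketch; point it at any directory:

```python
from pathlib import Path


def dir_size_gb(path: str) -> float:
    """Total size of all regular files under path, in GB."""
    root = Path(path).expanduser()
    total = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return round(total / 1e9, 2)


# Example: dir_size_gb("~/.ollama/models")
```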
## What to Read Next
- What Is an LLM? — understand what you’re running locally before optimizing it
- MCP vs Function Calling — connect your local Ollama models to tools and external data