How to Install Ollama and Run LLMs Locally

TL;DR

Ollama installs as a local model server — one command on macOS/Linux, installer on Windows
Run a model with ollama run llama3.2 — it downloads automatically on first use (~2–5 GB)
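The REST API at http://localhost:11434 includes an OpenAI-compatible endpoint at /v1; existing OpenAI client code works once you change the base URL and pass a placeholder API key
Apple Silicon gets GPU acceleration automatically; NVIDIA GPUs need a current NVIDIA driver, plus the NVIDIA Container Toolkit only if you run Ollama in Docker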
Models run fully offline after the initial download — no API key, no data leaves your machine

Prerequisites

macOS, Linux, or Windows
At least 8 GB RAM for 7B models (16 GB recommended)
~5–10 GB free disk space per model
Docker (optional — only needed for Open WebUI)

Ollama is a tool that lets you download and run large language models on your own hardware. No API key. No internet connection required after download. No data leaving your machine. You get a local model server with a REST API you can call from any application.

In 2026, local LLMs are good enough for real work. Llama 3.3, Mistral, Gemma 3, Qwen 2.5, and Phi-4 all run well on a modern laptop or desktop. If you have a GPU, even better.

What you need

Minimum (CPU only):

8 GB RAM for 7B parameter models
16 GB RAM for 13B parameter models

Better performance:

Apple Silicon Mac (M1/M2/M3/M4) — excellent performance via Metal GPU
NVIDIA GPU with 8+ GB VRAM — runs models at full GPU speed
AMD GPU (ROCm support on Linux)

Install Ollama

macOS:

bash

brew install ollama
# or download the app from ollama.com

Linux:

bash

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com. Once installed, Ollama runs as a background service.

Verify installation:

bash

ollama --version
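
The CLI check confirms the binary is installed; a quick request to the server confirms it is also running. This is a minimal sketch, assuming the default address http://localhost:11434 and the /api/version endpoint. The function name ollama_version is made up for this example.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not JSON.
        return None


version = ollama_version()
print(f"Ollama server version: {version}" if version else "Ollama server not reachable")
```

If this prints "not reachable" right after installation, starting the server manually with ollama serve is the first thing to try.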

Download and run a model

bash

# Run a model (downloads automatically on first use)
ollama run llama3.2

# Start chatting — type your message and press Enter
# Type /bye to exit

That’s it. The first run downloads the model (~2–5 GB depending on the model). Subsequent runs skip the download and load the model from local disk.

Popular models to try

| Model | Size | Best for |
| --- | --- | --- |
| `llama3.2` | 3B (2 GB) | Fast responses, everyday tasks |
| `llama3.3` | 70B (43 GB) | Best open-source quality, needs 64 GB RAM |
| `mistral` | 7B (4 GB) | Good general use, fast |
| `gemma3` | 4B (3 GB) | Google’s efficient model |
| `phi4` | 14B (9 GB) | Microsoft’s compact high-quality model |
| `qwen2.5-coder` | 7B (4 GB) | Code generation and completion |
| `deepseek-r1` | 7B (5 GB) | Reasoning and math |
| `nomic-embed-text` | - | Text embeddings for RAG |

bash

# Pull without running
ollama pull mistral

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2
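
The same inventory that ollama list prints is also available over HTTP, which is handy when a script needs to check what is installed. The sketch below assumes the /api/tags endpoint and a response containing a models list with name and size (in bytes) fields. The helper names and the sample payload are invented for illustration, and the sizes in it are made up.

```python
import json
import urllib.request


def summarize_models(tags_payload):
    """Turn an /api/tags response into (name, size in GB) pairs, largest first."""
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in tags_payload.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of locally stored models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Illustrative payload with made-up sizes, shaped like an /api/tags response.
sample = {"models": [
    {"name": "llama3.2:latest", "size": 2_000_000_000},
    {"name": "mistral:latest", "size": 4_100_000_000},
]}
for name, gigabytes in summarize_models(sample):
    print(f"{name:<20} {gigabytes} GB")
```

With a server running, summarize_models(fetch_tags()) should give the real list instead of the sample.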

Pro Tip

For code assistance on confidential projects, qwen2.5-coder is the strongest coding-focused model available through Ollama. Pull it with ollama pull qwen2.5-coder — it handles code review, completion, and refactoring without any data leaving your machine.

The scenario: You’re working on a client project with sensitive code and you need code review help. You can’t paste it into ChatGPT — confidentiality agreement. You install Ollama, pull qwen2.5-coder, and get solid code suggestions with zero data leaving the machine. Problem solved.

Run as a server

Ollama runs as a local REST API server on http://localhost:11434. You can call it from your own apps:

bash

# Start the server (usually starts automatically on install)
ollama serve

bash

# Chat via API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain recursion in simple terms" }
  ],
  "stream": false
}'
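
The curl call above translates directly to Python's standard library, with no extra packages. This is a sketch rather than a reference client: the function names are invented here, and it assumes the non-streaming response carries the reply under message.content.

```python
import json
import urllib.request


def build_chat_request(model, prompt, base_url="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant's text out of a non-streaming /api/chat response."""
    return json.loads(response_body)["message"]["content"]


def chat_once(model, prompt):
    """Send one prompt and return the reply; requires a running Ollama server."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        return extract_reply(resp.read())
```

Calling chat_once("llama3.2", "Explain recursion in simple terms") should return the same text the curl example prints, provided the model has been pulled.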

Generate endpoint (single prompt, no chat history):

bash

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a Python function to check if a string is a palindrome",
  "stream": false
}'
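
Both examples set "stream": false. Without it, the server streams its answer as one JSON object per line, each carrying a fragment of text, and the final object has done set to true. The sketch below shows one way to reassemble such a stream. The function name is invented, and the sample lines are hand-written to show the shape, not real model output.

```python
import json


def collect_stream(lines):
    """Join the text fragments from a streamed /api/generate response."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written lines shaped like a streamed reply (not real model output).
sample_lines = [
    '{"response": "def is_pal", "done": false}',
    '{"response": "indrome(s): ...", "done": false}',
    '{"response": "", "done": true}',
]
print(collect_stream(sample_lines))
```

The function accepts any iterable of lines, so an open HTTP response object can be passed in directly in place of the sample list.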

Use with Python

bash

pip install ollama

python

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])

Streaming:

python

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
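
Each chat call is stateless: the model only sees the messages passed in that call. To hold a multi-turn conversation, the caller keeps the history and resends it every time. A minimal sketch of that pattern follows. The helper make_conversation is invented for this example, and fake_chat is a stand-in so the code runs without a model installed.

```python
def make_conversation(chat_fn, model="llama3.2"):
    """Return a function that sends a message along with all earlier turns."""
    history = []

    def send(user_text):
        history.append({"role": "user", "content": user_text})
        response = chat_fn(model=model, messages=history)
        reply = response["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        return reply

    return send


# Stand-in for ollama.chat so the sketch runs without a model installed.
def fake_chat(model, messages):
    return {"message": {"role": "assistant", "content": f"turn {len(messages)}"}}


send = make_conversation(fake_chat)
print(send("Hello"))
print(send("And again"))
```

Passing ollama.chat in place of fake_chat should give a real conversation in which the second reply can refer back to the first message.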

Use with OpenAI-compatible clients

Ollama exposes an OpenAI-compatible API at /v1. Any library that supports OpenAI can point to Ollama instead:

python

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
# The reply uses the standard OpenAI response shape
print(response.choices[0].message.content)
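
Because only the base URL and API key differ, one codebase can switch between a hosted API and local Ollama through configuration. The sketch below is one possible arrangement. The USE_LOCAL_LLM environment variable and the openai_client_kwargs helper are hypothetical names chosen for this example, not part of Ollama or the OpenAI library.

```python
import os


def openai_client_kwargs(env=None):
    """Pick OpenAI client settings: local Ollama or the hosted API."""
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM") == "1":  # hypothetical switch, not an Ollama setting
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    return {"api_key": env.get("OPENAI_API_KEY", "")}


print(openai_client_kwargs({"USE_LOCAL_LLM": "1"}))
```

The result can be unpacked straight into the client constructor, for example OpenAI(**openai_client_kwargs()). The model name passed to each request still has to match whichever backend is in use.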

Use with Open WebUI (chat interface)

Open WebUI gives you a ChatGPT-like browser interface for your local models:

bash

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000. With Ollama running on the host at its default port, Open WebUI detects your models automatically.

Useful commands

bash

ollama run llama3.2          # run a model interactively
ollama run llama3.2 "prompt" # one-shot, non-interactive
ollama list                  # list local models
ollama pull mistral          # download a model
ollama rm mistral            # delete a model
ollama show llama3.2         # model info and parameters
ollama ps                    # show running models
ollama stop llama3.2         # stop a running model
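
The information behind ollama ps is also exposed over HTTP, which makes it possible to check from a script whether a loaded model sits fully on the GPU. The sketch below assumes the /api/ps endpoint returns a models list whose entries carry size and size_vram fields in bytes; treat those field names as an assumption to verify against your Ollama version. The helper name and the sample numbers are made up.

```python
def gpu_share(ps_payload):
    """Map each loaded model to the fraction of it held in GPU memory."""
    shares = {}
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        shares[model["name"]] = model.get("size_vram", 0) / size if size else 0.0
    return shares


# Illustrative payload with made-up numbers, shaped like an /api/ps response.
sample = {"models": [
    {"name": "llama3.2:latest", "size": 4_000_000_000, "size_vram": 4_000_000_000},
    {"name": "phi4:latest", "size": 10_000_000_000, "size_vram": 5_000_000_000},
]}
for name, share in gpu_share(sample).items():
    print(f"{name}: {share:.0%} on GPU")
```

A share well below 100% means part of the model is running on the CPU, which usually explains slow generation.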

Performance tips

Apple Silicon: GPU acceleration via Metal works automatically. No setup needed.
NVIDIA GPU: Ollama uses the GPU automatically once a current NVIDIA driver is installed. The NVIDIA Container Toolkit is needed only when you run Ollama inside Docker.

Warning

Models must fit in RAM (or VRAM) to run at usable speed. If a model does not fit in VRAM, Ollama runs part or all of it on the CPU, which is extremely slow, often 1–3 tokens per second; if it does not fit in RAM either, it may swap heavily or fail to load. On 8 GB RAM, stick to 3B–7B models. On 16 GB, up to 13B. Check a model's download size on its ollama.com library page before pulling, or with ollama list afterwards.

RAM matters more than you think: Models need to fit in RAM (or VRAM). If you only have 8 GB, stick to 3B-7B models.
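Quantization: Ollama's default model tags are already quantized (typically 4-bit), so they use far less memory than full-precision weights with modest quality loss. Other quantization levels are published as separate tags; for example, ollama pull llama3.3:70b-instruct-q4_K_M pulls the 70B model at 4-bit precision.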
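
The memory guidance above follows from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8 to get bytes. The sketch below is a lower bound, not a sizing tool. Real quantized files run somewhat larger because some tensors are kept at higher precision, and inference needs extra memory for the context on top of the weights. The function name is invented for this example.

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in gigabytes."""
    return params_billion * bits_per_weight / 8


for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    q4 = estimate_weights_gb(params, 4)
    fp16 = estimate_weights_gb(params, 16)
    print(f"{name}: ~{q4:.1f} GB at 4-bit, ~{fp16:.1f} GB at 16-bit")
```

By this estimate a 7B model needs about 3.5 GB at 4-bit versus 14 GB at 16-bit, which is why quantized 7B models are workable on an 8 GB machine and unquantized ones are not.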

Summary

Ollama installs as a local server on macOS, Linux, and Windows — one command to install, one to run a model
Models download automatically on first run and are cached for subsequent use
The REST API at http://localhost:11434 includes an OpenAI-compatible endpoint at /v1, so any library that supports OpenAI can point to Ollama by changing its base URL
Apple Silicon Macs get GPU acceleration automatically; NVIDIA GPUs need a current NVIDIA driver, and the NVIDIA Container Toolkit only when Ollama runs in Docker
Open WebUI gives you a ChatGPT-like browser interface for your local Ollama models with one Docker command

Frequently Asked Questions

Which Ollama model is best for coding?

qwen2.5-coder (7B or 32B) is the strongest coding-focused model available through Ollama in 2026. For general coding assistance on a machine with limited RAM, mistral (7B) is a reliable fallback. If you have enough memory for a roughly 43 GB model (about 64 GB of RAM, or a comparable amount of VRAM), llama3.3:70b gives near-GPT-4 quality.

Can I use Ollama with LangChain or LlamaIndex?

Yes. Both have Ollama integrations. LangChain: from langchain_ollama import ChatOllama. LlamaIndex: from llama_index.llms.ollama import Ollama. Or use the OpenAI-compatible endpoint with base_url="http://localhost:11434/v1".

Does Ollama work without internet after the initial download?

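Yes. Once a model is downloaded (ollama pull), it runs entirely offline. The model weights are stored locally, by default in ~/.ollama/models/ (Linux installs that run Ollama as a system service use /usr/share/ollama/.ollama/models instead). No data is sent anywhere during inference.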
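
Since the weights live on local disk, it can be useful to see how much space they occupy. The sketch below sums file sizes under the models directory. It assumes the default per-user location ~/.ollama/models; adjust the path if your install stores models elsewhere. The function name is invented for this example.

```python
from pathlib import Path


def directory_size_gb(path):
    """Sum the sizes of all regular files under path, in gigabytes."""
    root = Path(path).expanduser()
    total = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total / 1e9


# Assumed default location; not every install uses this path.
models_dir = Path("~/.ollama/models").expanduser()
if models_dir.exists():
    print(f"{models_dir}: {directory_size_gb(models_dir):.1f} GB")
else:
    print(f"{models_dir} not found")
```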
