How to Install Ollama and Run LLMs Locally

By Vishnu

TL;DR
  • Ollama installs as a local model server — one command on macOS/Linux, installer on Windows
  • Run a model with ollama run llama3.2 — it downloads automatically on first use (~2–5 GB)
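  • The REST API at http://localhost:11434 includes an OpenAI-compatible endpoint under /v1, so existing OpenAI client libraries work once you change the base URL
  • Apple Silicon gets GPU acceleration automatically; NVIDIA GPUs need current NVIDIA drivers (plus the NVIDIA Container Toolkit only if you run Ollama in Docker)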
  • Models run fully offline after the initial download — no API key, no data leaves your machine

Prerequisites

  • macOS, Linux, or Windows
  • At least 8 GB RAM for 7B models (16 GB recommended)
  • ~5–10 GB free disk space per model
  • Docker (optional — only needed for Open WebUI)

Ollama is a tool that lets you download and run large language models on your own hardware. No API key. No internet connection required after download. No data leaving your machine. You get a local model server with a REST API you can call from any application.

In 2026, local LLMs are good enough for real work. Llama 3.2, Mistral, Gemma 3, Qwen 2.5, and Phi-4 all run well on a modern laptop or desktop, and larger models such as Llama 3.3 (70B) run on machines with enough memory. If you have a GPU, even better.

What you need

Minimum (CPU only):

  • 8 GB RAM for 7B parameter models
  • 16 GB RAM for 13B parameter models

Better performance:

  • Apple Silicon Mac (M1/M2/M3/M4) — excellent performance via Metal GPU
  • NVIDIA GPU with 8+ GB VRAM — runs models at full GPU speed
  • AMD GPU (ROCm support on Linux)

Install Ollama

macOS:

bash
brew install ollama
# or download the app from ollama.com

Linux:

bash
curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com. After installation, Ollama runs in the background.

Verify installation:

bash
ollama --version
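
If you script your setup, the same check can be done from Python. This is a minimal sketch using only the standard library; the helper name `ollama_version` is hypothetical, and it simply shells out to `ollama --version` when the binary is found on your PATH.

```python
import shutil
import subprocess


def ollama_version():
    """Return the text printed by `ollama --version`, or None if ollama is not on PATH."""
    binary = shutil.which("ollama")
    if binary is None:
        return None
    result = subprocess.run(
        [binary, "--version"],
        capture_output=True,
        text=True,
        timeout=10,
        check=False,
    )
    # Prefer stdout; fall back to stderr in case the text is printed there
    return result.stdout.strip() or result.stderr.strip()


version = ollama_version()
print(version if version is not None else "ollama not found on PATH")
```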

Download and run a model

bash
# Run a model (downloads automatically on first use)
ollama run llama3.2

# Start chatting — type your message and press Enter
# Type /bye to exit

That’s it. The first run downloads the model (~2–5 GB depending on the model). Subsequent runs load it from the local cache with no download.

| Model | Size | Best for |
|---|---|---|
| llama3.2 | 3B (2 GB) | Fast responses, everyday tasks |
| llama3.3 | 70B (43 GB) | Best open-source quality, needs 64 GB RAM |
| mistral | 7B (4 GB) | Good general use, fast |
| gemma3 | 4B (3 GB) | Google’s efficient model |
| phi4 | 14B (9 GB) | Microsoft’s compact high-quality model |
| qwen2.5-coder | 7B (4 GB) | Code generation and completion |
| deepseek-r1 | 7B (5 GB) | Reasoning and math |
| nomic-embed-text | | Text embeddings for RAG |

bash
# Pull without running
ollama pull mistral

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

Pro Tip

For code assistance on confidential projects, qwen2.5-coder is the strongest coding-focused model available through Ollama. Pull it with ollama pull qwen2.5-coder — it handles code review, completion, and refactoring without any data leaving your machine.

The scenario: You’re working on a client project with sensitive code and you need code review help. You can’t paste it into ChatGPT — confidentiality agreement. You install Ollama, pull qwen2.5-coder, and get solid code suggestions with zero data leaving the machine. Problem solved.

Run as a server

Ollama runs as a local REST API server on http://localhost:11434. You can call it from your own apps:

bash
# Start the server (usually starts automatically on install)
ollama serve

bash
# Chat via API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain recursion in simple terms" }
  ],
  "stream": false
}'

Generate endpoint (single prompt, no chat history):

bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a Python function to check if a string is a palindrome",
  "stream": false
}'
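
The same two requests can be made from Python without extra dependencies. The sketch below is an illustration rather than an official client: it builds the JSON bodies from the curl examples and posts them with urllib. The helper names (`build_chat_payload`, `build_generate_payload`, `post_json`) are hypothetical; the endpoints and fields are the ones used in the curl commands.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this article


def build_chat_payload(model, user_message):
    """Body for /api/chat with a single user message, streaming disabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }


def build_generate_payload(model, prompt):
    """Body for /api/generate: one prompt, no chat history."""
    return {"model": model, "prompt": prompt, "stream": False}


def post_json(path, payload, base_url=OLLAMA_URL, timeout=120):
    """POST a JSON payload to the Ollama server and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_json("/api/chat", build_chat_payload("llama3.2", "Explain recursion in simple terms"))` returns a dictionary whose reply text is under `["message"]["content"]`; for `/api/generate` the text is under `["response"]`.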

Use with Python

bash
pip install ollama

python
import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Why is the sky blue?'}
    ]
)
print(response['message']['content'])

Streaming:

python
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
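
When you need the complete reply as well as the live output (for example, to append it to a chat history), you can accumulate the chunks as they arrive. A minimal sketch, assuming each chunk has the same `['message']['content']` shape used in the loop above; `collect_stream` is a hypothetical helper, not part of the ollama package.

```python
def collect_stream(chunks, echo=True):
    """Print streamed pieces as they arrive and return the full reply as one string."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)
```

Usage would look like `full_reply = collect_stream(ollama.chat(model='llama3.2', messages=[...], stream=True))`.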

Use with OpenAI-compatible clients

Ollama exposes an OpenAI-compatible API at /v1. Any library that supports OpenAI can point to Ollama instead:

python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
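
Because only the base URL and key differ, one way to switch an application between a hosted API and local Ollama is to read both from configuration. The sketch below is an assumption-laden example: the environment variable names `LLM_BASE_URL` and `LLM_API_KEY` and the helper `openai_client_settings` are hypothetical, chosen only for illustration.

```python
import os

OLLAMA_BASE_URL = "http://localhost:11434/v1"


def openai_client_settings(env=None):
    """Return keyword arguments for OpenAI(...), defaulting to local Ollama."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", OLLAMA_BASE_URL),
        # Ollama ignores the key, but the client still needs a non-empty value
        "api_key": env.get("LLM_API_KEY", "ollama"),
    }
```

The client is then created with `OpenAI(**openai_client_settings())`, and the rest of the code stays the same whichever backend is configured.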

Use with Open WebUI (chat interface)

Open WebUI gives you a ChatGPT-like browser interface for your local models:

bash
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000. It automatically detects your Ollama models.

Useful commands

bash
ollama run llama3.2          # run a model interactively
ollama run llama3.2 "prompt" # one-shot, non-interactive
ollama list                  # list local models
ollama pull mistral          # download a model
ollama rm mistral            # delete a model
ollama show llama3.2         # model info and parameters
ollama ps                    # show running models
ollama stop llama3.2         # stop a running model
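
The information behind `ollama list` is also available over HTTP, which helps when an application needs to confirm that a model is present before calling it. The sketch below assumes the `/api/tags` endpoint returns a JSON object with a `models` array whose entries carry `name` and `size` (in bytes); the helper names are hypothetical.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434", timeout=10):
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_models(tags):
    """Turn an /api/tags reply into a sorted list of (name, size in GB) pairs."""
    summary = []
    for model in tags.get("models", []):
        size_gb = round(model["size"] / 1e9, 1)
        summary.append((model["name"], size_gb))
    return sorted(summary)
```

With the server running, `summarize_models(fetch_tags())` gives a compact view of what is installed and how much space each model takes.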

Performance tips

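  • Apple Silicon: GPU acceleration through Metal works automatically. No setup needed.
  • NVIDIA GPU: Install current NVIDIA drivers so Ollama can use the GPU. The NVIDIA Container Toolkit is only needed if you run Ollama inside Docker.

Warning

Models must fit in RAM (or VRAM) to run at usable speed. If a model does not fit in GPU memory, Ollama runs part or all of it on the CPU, which is much slower (often 1–3 tokens per second for CPU-only inference), and a model larger than system RAM may swap heavily or fail to load. On 8 GB RAM, stick to 3B–7B models. On 16 GB, up to 13B. Check the download size on the model's page at ollama.com before pulling; ollama list shows the size of models you already have.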

  • RAM matters more than you think: Models need to fit in RAM (or VRAM). If you only have 8 GB, stick to 3B-7B models.
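  • Quantization: Ollama serves quantized models by default (typically 4-bit, such as Q4_K_M), which use much less memory with minimal quality loss. Other precisions such as Q5 and Q8 are available as explicit tags on most models. To request a specific quantization, include it in the tag, for example ollama pull llama3.3:70b-instruct-q4_K_M for the 70B model at 4-bit precision.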
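
A rough way to see why memory and quantization matter is to estimate the size of the weights as parameters × bits per weight ÷ 8. The sketch below assumes about 4.85 bits per weight for a Q4_K_M-style quantization, 16 bits for unquantized FP16, and 2 GB of headroom for context and runtime overhead. All three figures and the helper names are assumptions for illustration, and real memory use varies with context length.

```python
def weights_size_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion, bits_per_weight, memory_gb, headroom_gb=2.0):
    """True if the weights plus an assumed headroom fit in the given memory."""
    return weights_size_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


# Nominal parameter counts taken from the model table earlier in the article
for name, params in [("llama3.2", 3), ("mistral", 7), ("phi4", 14), ("llama3.3", 70)]:
    q4 = weights_size_gb(params, 4.85)
    fp16 = weights_size_gb(params, 16)
    print(f"{name}: ~{q4:.1f} GB at ~4.85 bits, ~{fp16:.0f} GB at FP16")
```

The 4-bit estimates land close to the download sizes listed in the model table, while the FP16 figures show why unquantized 7B and larger models are impractical on an 8 GB machine.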

Summary

  • Ollama installs as a local server on macOS, Linux, and Windows — one command to install, one to run a model
  • Models download automatically on first run and are cached for subsequent use
  • The REST API at http://localhost:11434 includes an OpenAI-compatible endpoint under /v1, so any library that supports OpenAI can point to Ollama by changing its base URL
  • Apple Silicon Macs get GPU acceleration automatically; NVIDIA GPUs need current drivers, and the NVIDIA Container Toolkit only when Ollama runs in Docker
  • Open WebUI gives you a ChatGPT-like browser interface for your local Ollama models with one Docker command

Frequently Asked Questions

Which Ollama model is best for coding?

qwen2.5-coder (7B or 32B) is the strongest coding-focused model available through Ollama in 2026. For general coding assistance on a machine with limited RAM, mistral (7B) is a reliable fallback. If you have enough memory for it (the 43 GB download needs roughly 64 GB of RAM, or several GPUs to hold it entirely in VRAM), llama3.3:70b gives near-GPT-4 quality.

Can I use Ollama with LangChain or LlamaIndex?

Yes. Both have Ollama integrations. LangChain: from langchain_ollama import ChatOllama. LlamaIndex: from llama_index.llms.ollama import Ollama. Or use the OpenAI-compatible endpoint with base_url="http://localhost:11434/v1".

Does Ollama work without internet after the initial download?

Yes. Once a model is downloaded (ollama pull), it runs entirely offline. The model weights are stored locally in ~/.ollama/models/. No data is sent anywhere during inference.
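
To see how much disk space your downloaded models use, you can total the files under that directory. A minimal sketch, assuming the default `~/.ollama/models` location mentioned above (it may differ if Ollama runs as a system service or is configured otherwise); `dir_size_bytes` is a hypothetical helper.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


models_dir = Path.home() / ".ollama" / "models"
if models_dir.is_dir():
    print(f"{models_dir}: {dir_size_bytes(models_dir) / 1e9:.1f} GB")
else:
    print(f"{models_dir} not found")
```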


Written By

Vishnu

Founder & Principal Architect at MeshWorld. Senior engineer and instructor specializing in AI agent systems, scalable web architecture, and modern development workflows.
