Running language models locally used to mean wrestling with Python environments, downloading 50GB model files, and praying your GPU had enough VRAM. In 2026, it's as easy as installing an app and typing one command. Here's everything you need to know about running LLMs on your own hardware.
Why Run Locally
Privacy. Your data never leaves your machine. If you're working with proprietary code, customer data, medical records, or anything sensitive, local inference means zero exposure to third-party servers. No terms of service, no data retention policies to read, no trust required.
Cost. Cloud API calls add up. If you're running thousands of requests per day for batch processing, embeddings, or automated pipelines, local inference costs nothing after the hardware investment. A $500 GPU pays for itself in weeks of heavy API usage.
Offline access. Planes, trains, spotty coffee shop wifi — local models don't care. Your AI assistant works anywhere your laptop goes.
Latency. No network round trip. For interactive use cases — code completion, chat, text processing — local models respond in milliseconds. The speed of a fast local model often beats a cloud API even when the cloud model is smarter.
Customization. Fine-tune models on your data, adjust parameters at inference time, chain models together without rate limits, run experimental architectures nobody hosts. Full control.
Hardware Requirements
The bottom line: you need enough RAM (or VRAM) to hold the model. If a model doesn't fit in memory it can't run, and more memory means you can run bigger, smarter models.
CPU-Only (Runs on Any Machine)
- 8GB RAM: 1-3B parameter models (basic tasks, fast)
- 16GB RAM: 7-8B parameter models (good quality, moderate speed)
- 32GB RAM: 13B parameter models (strong quality, slower)
- 64GB RAM: 30-34B parameter models (near cloud quality)
CPU inference is surprisingly usable for smaller models. A modern M-series Mac or a recent Intel/AMD processor runs 7B models at 20-40 tokens per second — perfectly fine for chat and code completion.
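The sizing tables above follow a simple rule of thumb: memory ≈ parameters × bits per weight ÷ 8, plus roughly 20% overhead for the KV cache and runtime. A quick sketch — the 1.2 overhead factor is a ballpark assumption, not a measured constant:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int = 4,
                    overhead: float = 1.2) -> float:
    """Estimate RAM/VRAM needed: quantized weights plus ~20% for
    KV cache and runtime overhead (ballpark, not exact)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ≈ 1GB
    return weight_gb * overhead

print(round(model_memory_gb(7), 1))    # 7B at 4-bit → ~4.2 GB, fits in 8GB RAM
print(round(model_memory_gb(34), 1))   # 34B at 4-bit → ~20.4 GB, wants 24GB VRAM
```

This is why 4-bit quantization (the Ollama default) matters so much: it's the difference between a 34B model needing 20GB versus 80GB at full F16 precision.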
GPU-Accelerated (Recommended for Serious Use)
- 8GB VRAM (RTX 4060): 7B models at full speed
- 12GB VRAM (RTX 4070): 13B models comfortably
- 16GB VRAM (RTX 4080): 13B at full context, 30B quantized
- 24GB VRAM (RTX 4090): 30B+ models, the sweet spot for local AI
- Apple Silicon: Unified memory is a cheat code — a 36GB M3 Pro runs 30B models using both CPU and GPU seamlessly
NVIDIA GPUs with CUDA support give the best performance. Apple Silicon is the best value because system RAM and GPU memory are shared. AMD GPUs work through ROCm but driver support is still inconsistent.
Setting Up Ollama
Ollama is the fastest way to start. It's a command-line tool that downloads, manages, and runs models with one command.
Install
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download from ollama.ai
Run Your First Model
# Download and start chatting
ollama run llama3:8b
# List available models
ollama list
# Pull a model without running
ollama pull codellama:13b
# Set parameters inside an interactive session
ollama run llama3:8b
>>> /set parameter temperature 0.2
>>> /set parameter num_ctx 8192
That's it. No Python, no CUDA toolkit, no dependency hell. Ollama handles model downloading, quantization selection, and GPU detection automatically.
Ollama as an API Server
Ollama runs an OpenAI-compatible API server on port 11434:
# Start the server (runs automatically after install)
ollama serve
# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [{"role": "user", "content": "Explain closures in JavaScript"}]
  }'
Because it's OpenAI-compatible, you can point any tool that supports custom OpenAI endpoints at Ollama. VS Code extensions, Python scripts, custom apps — they all just work.
Modelfile for Custom Configuration
# Modelfile
FROM llama3:8b
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
SYSTEM """
You are a senior TypeScript developer. You write clean, well-typed code.
Always use strict TypeScript. Prefer functional patterns.
Keep responses concise and code-focused.
"""ollama create my-code-assistant -f Modelfile
ollama run my-code-assistantSetting Up LM Studio
LM Studio adds a GUI on top of local model management. If you prefer clicking to typing, it's the better choice.
Download from lmstudio.ai, install, and launch. The interface lets you:
- Discover models — Browse and search Hugging Face models filtered by what runs on your hardware
- Download with one click — Select quantization level based on your available memory
- Chat interface — Test models immediately in a built-in chat UI
- Local server — Run an OpenAI-compatible server with a toggle switch
- Compare models — Run the same prompt through multiple models side by side
LM Studio's biggest advantage is model discovery. It shows you estimated RAM usage for each quantization level and warns you if a model won't fit on your hardware before you download it.
Best Models for Different Tasks (March 2026)
Code Generation
- Qwen 2.5 Coder 32B — The best local code model. Handles complex multi-file tasks, understands project context, generates correct TypeScript. Needs 24GB VRAM for full speed.
- CodeLlama 13B — Solid for completions and single-file tasks. Runs well on 12GB VRAM.
- DeepSeek Coder V3 Lite — Strong code reasoning in a 16B package. Good balance of quality and speed.
General Chat and Writing
- Llama 3 70B (quantized) — Near cloud-quality responses for general tasks. Needs 48GB+ RAM or multiple GPUs.
- Llama 3 8B — The go-to for fast, general-purpose local AI. Runs on almost anything.
- Mistral Small — Excellent reasoning for its size. Fast on consumer hardware.
Embeddings and RAG
- nomic-embed-text — High-quality embeddings, runs on CPU at 1000+ docs/second. Perfect for local RAG pipelines.
- mxbai-embed-large — Larger embedding model with better semantic understanding.
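A local RAG pipeline needs two pieces: embeddings from the model and a similarity measure. A sketch against Ollama's native embeddings endpoint — the endpoint path and response shape below assume a recent Ollama with nomic-embed-text pulled:

```python
import json
import math
from urllib import request

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Get an embedding vector from the local Ollama server."""
    data = json.dumps({"model": model, "prompt": text}).encode()
    req = request.Request("http://localhost:11434/api/embeddings", data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# Usage (with Ollama running): rank documents by similarity to a query
# query_vec = embed("local LLM inference")
# scores = [(doc, cosine(query_vec, embed(doc))) for doc in documents]
```

For more than a few hundred documents, store the vectors in a local index (SQLite, Chroma, or similar) instead of re-embedding on every query.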
Summarization and Analysis
- Phi-3 Medium — Microsoft's model punches above its weight for analysis tasks. Runs well on 16GB.
- Gemma 2 27B — Google's model is strong at following complex instructions.
Integrating with Your Dev Workflow
As a Git Commit Message Generator
# In your .bashrc or .zshrc
function ai-commit() {
  local diff=$(git diff --cached)
  if [ -z "$diff" ]; then
    echo "Nothing staged." >&2
    return 1
  fi
  local message=$(echo "$diff" | ollama run llama3:8b \
    "Generate a concise git commit message for this diff. One line, imperative mood, under 72 characters. Output only the message, nothing else.")
  git commit -m "$message"
}
As a Code Review Pre-commit Hook
#!/usr/bin/env python3
# .git/hooks/pre-commit
import subprocess
import sys

import requests

diff = subprocess.check_output(["git", "diff", "--cached"]).decode()
response = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "qwen2.5-coder:32b",
    "messages": [
        {"role": "system", "content": "Review this diff for bugs. Reply LGTM if clean, or list issues."},
        {"role": "user", "content": diff},
    ],
})
review = response.json()["choices"][0]["message"]["content"]
if "LGTM" not in review:
    print(f"AI Review Concerns:\n{review}")
    sys.exit(1)
As a Documentation Generator
# Generate JSDoc for all functions in a file
ollama run qwen2.5-coder:32b "Add JSDoc comments to every exported function. Keep existing code unchanged. Output the complete file." < src/utils.ts > src/utils-documented.ts
Performance Tips
Use quantized models. Q4_K_M quantization reduces model size by 75% with minimal quality loss. It's the default in Ollama for good reason. Only use full precision (F16/F32) if you need maximum quality and have the VRAM for it.
Match context length to your needs. Default context is usually 2048-4096 tokens. Increasing it uses more memory and slows generation. Set num_ctx to what you actually need.
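Per-request overrides go in the "options" object of Ollama's native /api/generate endpoint, so you can raise num_ctx only for the calls that need it. A sketch of the payload shape (llama3:8b assumed):

```python
import json

def build_generate_payload(prompt: str, num_ctx: int = 4096,
                           model: str = "llama3:8b") -> dict:
    """Payload for Ollama's native /api/generate endpoint; 'options'
    overrides the model's defaults for this request only."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # only allocate the context you need
    }

# POST this to http://localhost:11434/api/generate
print(json.dumps(build_generate_payload("Summarize: ...", num_ctx=8192), indent=2))
```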
Keep models loaded. Ollama keeps models in memory by default. The first response after loading takes a few seconds; subsequent responses are near-instant. Don't restart the server between requests.
Use GPU offloading. If your model is too big for VRAM, Ollama automatically splits layers between GPU and CPU. More GPU layers means faster inference. Check with ollama ps to see what's loaded where.
Batch your requests. For processing multiple documents, send requests in parallel. Local models don't have rate limits — your only bottleneck is compute.
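Fanning out requests is a few lines with a thread pool. A sketch, with the worker function parameterized so you can swap in any per-document task (the endpoint and model name are the same assumptions as earlier examples):

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

def summarize(doc: str, model: str = "llama3:8b") -> str:
    """One blocking request; Ollama queues concurrent requests internally."""
    payload = {"model": model,
               "messages": [{"role": "user",
                             "content": f"Summarize in one sentence:\n{doc}"}]}
    req = request.Request(URL, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def process_all(docs: list[str], fn=summarize, workers: int = 4) -> list[str]:
    # No rate limits locally — fan out and let compute be the only bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, docs))

# Usage (with Ollama running):
# summaries = process_all(documents)
```

Tune `workers` to your hardware: past the point where the GPU is saturated, extra threads just queue.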
When Local Beats Cloud (and When It Doesn't)
Local wins for: privacy-sensitive work, high-volume batch processing, offline usage, code completion (latency matters), prototyping and experimentation, embeddings generation.
Cloud wins for: tasks requiring the smartest models (Claude Opus, GPT-4 class), long-context work (100K+ tokens), multimodal tasks (vision, audio), when you don't want to manage hardware, one-off complex reasoning.
The best setup uses both. Local models handle the high-volume, latency-sensitive, privacy-critical workloads. Cloud APIs handle the tasks that need maximum intelligence. You don't have to choose — run Ollama locally and keep your Anthropic API key for the hard problems.
Local AI has crossed the usability threshold. The models are good enough, the tools are polished enough, and consumer hardware is powerful enough. If you haven't tried running models locally since 2024, you'll be surprised how far it's come.