How to Run LLMs Locally on Your Machine
Running AI models locally gives you privacy, speed, and zero API costs. Here's what hardware you need, which tools to use, and how to choose the right model.
Every API call sends your data to someone else’s server. Every token costs money. Every request depends on an internet connection and a third party’s uptime. Running models locally eliminates all three concerns. Your data stays on your machine, inference is free after the initial setup, and you don’t need Wi-Fi to use it.
Local models aren’t as capable as frontier models like GPT-4o or Claude 3.5 Sonnet. But for many tasks, they don’t need to be.
What “Running Locally” Actually Means
When you run a model locally, you download the model weights (a large file, typically 2-30GB) and run inference on your own hardware. The model loads into memory (RAM for CPU inference, VRAM for GPU inference), and you send it prompts just like you would with an API. The difference is everything happens on your machine. No network calls. No usage fees. No data leaving your laptop.
The key constraint is memory. A model’s weights need to fit in memory to run. A 7-billion-parameter model at full precision (16-bit floating point) requires about 14GB of memory. At 4-bit quantization (more on this below), that drops to about 4GB. If the weights don’t fit in your available memory, the model either won’t load or will be painfully slow as it swaps between memory and disk.
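That arithmetic is easy to sanity-check. A minimal estimator for weight memory (weights only; in practice the KV cache and activations add overhead on top):

```python
# Rule of thumb from the text: bytes ≈ parameter_count × bits_per_weight / 8.
def weight_memory_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 16-bit → ~14.0 GB, 8-bit → ~7.0 GB, 4-bit → ~3.5 GB
```

Compare the result against your free RAM (or VRAM) before downloading a model, and leave a few GB of headroom for the context window.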
Hardware Requirements
For experimentation and light use (7B models):
- 8GB RAM minimum, 16GB recommended
- Any modern CPU works (Apple Silicon is particularly good at this)
- No GPU required, but inference will be slower (5-15 tokens/second on CPU vs. 30-80 on GPU)
For serious local use (13B-34B models):
- 32GB RAM or 12GB+ VRAM (GPU)
- Apple M2/M3 with 32GB unified memory handles 13B models comfortably
- NVIDIA RTX 3090/4090 (24GB VRAM) can run 34B models at decent speed
For large models (70B+):
- 64GB+ RAM or multiple GPUs
- These models are impractical for most personal hardware
- The 70B class is where local models start approaching API quality for specific tasks
Apple Silicon deserves special mention. The unified memory architecture means the CPU and GPU share the same memory pool, so a MacBook with 32GB of unified memory can use all 32GB for a model. On a traditional PC, you’re limited to whatever VRAM your GPU has (typically 8-24GB), with slower CPU/RAM as overflow.
Ollama: The Easiest Starting Point
Ollama is to local LLMs what Docker is to containers. It manages model downloads, handles quantization variants, and provides a simple CLI and API. Install it, run a command, and you have a local model running.
After installing from ollama.com, running a model is one command:
ollama run llama3.2
This downloads the model (if you don’t have it) and starts an interactive chat. That’s it. Behind the scenes, Ollama handles model format conversion, optimal memory allocation, and choosing the right quantization for your hardware.
Ollama also exposes a local HTTP API on port 11434, so you can integrate it into applications the same way you’d use the OpenAI API. Many tools (LangChain, LlamaIndex, Open WebUI) support Ollama as a backend, so you can swap between local and cloud models with a configuration change.
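Because that API is plain HTTP and JSON, calling it needs nothing beyond the standard library. A minimal sketch, assuming Ollama is running on its default port with llama3.2 already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3.2") -> dict:
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.2") -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # the completed text comes back in the "response" field
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Why is the sky blue? Answer in one sentence."))
```

The same request shape works from any language with an HTTP client, which is why so many frameworks can treat Ollama as a drop-in backend.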
Choosing a Model
Not all local models are good at the same things. The right choice depends on what you’re doing and what hardware you have.
General-purpose conversation and reasoning: Llama 3.2 (1B, 3B) and Mistral (7B) are the best starting points. They handle a wide range of tasks competently: summarization, Q&A, writing, light analysis. The 1B variant runs on almost anything. The 3B variant still fits comfortably in 8GB of RAM and delivers noticeably better quality.
Code generation and understanding: CodeLlama and DeepSeek-Coder are trained specifically on code. They outperform general models on programming tasks because a larger proportion of their training data is source code. If you’re using a local model for coding assistance, use a code-specific model.
Small and fast (for constrained hardware): Phi-3 Mini (3.8B) and Gemma 2B are designed to be small without being useless. They’re the best option for machines with 8GB RAM or for applications where latency matters more than quality. They handle structured tasks (extraction, classification, simple Q&A) surprisingly well for their size.
Long context: Most local models have 4K-8K context windows. Some variants (Yarn-Llama, Mistral’s extended versions) support 32K-128K. If you need to process long documents locally, check the context length before downloading.
Quantization: Trading Precision for Size
Full-precision model weights use 16 bits per parameter. Quantization reduces this to 8, 4, or even 2 bits. The result: smaller files, less memory, faster inference, and (usually) a modest quality loss.
- 16-bit (no quantization): Full quality. A 7B model needs ~14GB.
- 8-bit (Q8): Nearly imperceptible quality loss. Size cut in half (~7GB for 7B).
- 4-bit (Q4): Noticeable but acceptable quality loss for most tasks. Size cut to ~4GB for 7B.
- 2-bit (Q2): Significant quality degradation. Only useful for the largest models where 4-bit doesn’t fit.
The sweet spot for most people is 4-bit quantization (specifically Q4_K_M in GGUF format). You lose maybe 5-10% quality compared to full precision, but the model runs on consumer hardware. Going to 8-bit is worth it if you have the memory, because the quality improvement is noticeable on tasks requiring nuance or precise reasoning.
Ollama handles this automatically. When you run ollama run llama3.2, it downloads a 4-bit quantized version by default. If you have the memory to spare, you can request a different quantization level through the model tag (for example, a q8_0 variant); the available tags for each model are listed in the Ollama library.
When Local Models Beat APIs
Privacy-sensitive work. Legal documents, medical records, proprietary code, internal communications. If the data can’t leave your machine, local is your only option. No terms of service to worry about. No data retention policies. No risk of training data leakage.
High-volume, low-complexity tasks. If you’re processing 10,000 documents with a simple extraction task, API costs add up quickly. A local model running the same extraction at 20 tokens/second is free after setup. The total time might be longer, but the cost is zero.
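A back-of-envelope check on that tradeoff (the per-document output length here is an assumption for illustration, not a figure from the text):

```python
# How long does the 10,000-document batch job take on local hardware?
docs = 10_000
tokens_per_doc = 100       # assumed output per extraction (illustrative)
tokens_per_second = 20     # CPU-class throughput from the text

total_seconds = docs * tokens_per_doc / tokens_per_second
print(f"~{total_seconds / 3600:.1f} hours of wall-clock time, zero marginal cost")
```

Roughly 14 hours under these assumptions: slow, but it can run overnight unattended, and doubling the document count doubles the time rather than the bill.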
Offline use. Planes, trains, places with unreliable internet. Once the model is downloaded, it runs entirely offline.
Experimentation. Try 20 different prompts on 100 test cases without worrying about a $50 API bill. Local inference removes the financial friction from experimentation, which means you experiment more, which means you learn faster.
When APIs Win
Quality ceiling. Frontier models (GPT-4o, Claude 3.5 Sonnet) are still significantly better than any model you can run locally, especially for complex reasoning, nuanced writing, and multi-step problem solving. If quality is the top priority, APIs win.
Speed at scale. Cloud providers run models on specialized hardware (clusters of A100/H100 GPUs). They can serve requests in parallel to thousands of users. Your single machine generates one response at a time.
Zero maintenance. No model updates to manage, no hardware to maintain, no memory issues to debug. You call an endpoint and get a response. The provider handles everything else.
Running Local Models in Applications
The practical pattern for most developers: use Ollama as a local inference server and call it from your application via HTTP, just like you’d call OpenAI. The request format is nearly identical. Many client libraries support both, so you can develop locally with Ollama and deploy with an API by changing one configuration value.
This also enables local RAG. Combine Ollama with a local vector database (Chroma, for example) and a local embedding model, and you have a complete RAG pipeline that runs entirely on your machine. No API calls, no data leaving your network, no per-query costs.
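The shape of that pipeline can be sketched with a toy keyword-overlap retriever standing in for the embedding model and vector store (a real setup would embed documents and query Chroma, but the retrieve-then-generate structure is the same):

```python
# Toy local-RAG sketch: retrieve relevant docs, then build a grounded prompt.
# Word overlap is a crude stand-in for embedding similarity.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    context = "\n\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The resulting prompt is then sent to the local model like any other (for example, through Ollama's HTTP API on port 11434), so every stage of retrieval and generation stays on your machine.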
The Bigger Picture
Local LLMs aren’t a replacement for cloud APIs. They’re a complement. The developers who get the most out of AI use both: local models for privacy, experimentation, and high-volume work; cloud APIs for maximum quality when it matters. Knowing when to use each is a practical skill that saves money, protects data, and removes dependence on any single provider.
Chapter 4 of Get Insanely Good at AI covers the mechanics of running and fine-tuning local models, including quantization tradeoffs, hardware optimization, and building local inference pipelines for production use.