Run a capable AI model entirely on your own laptop. No account, no API key, no per-token bill, and nothing leaving the machine. That is a local LLM, and as of 2026 you can have one answering questions in about five minutes. This guide explains how it works, what hardware you need, which models to pick, and how to start.
It assumes no background beyond knowing what a chatbot is. By the end you will understand the three pieces that make a local model run, be able to read a model name like qwen3-8b-Q4_K_M.gguf, and know exactly what your machine can handle.
What you'll learn
- What a local LLM is, and the honest case for and against running one
- The three-part stack: the model file, the runtime, and your hardware
- How quantization shrinks a 14 GB model to fit in 4 GB
- Exactly which models your RAM or GPU can run, with a memory ladder
- The five-minute path to your first running model
What is a local LLM?
When you use a hosted assistant, your prompt travels to a company's datacenter, runs on their GPUs, and the answer comes back over the network. You rent the model by the token, and the provider sees everything you send.
A local LLM inverts that. The model's weights, the billions of numbers that encode what it knows, are released as a file you can download. You run that file on your own GPU, your own Mac, or even your own CPU. The model never phones home. There is no account, no rate limit, and no usage meter.
This is possible because of open-weight models: capable models whose weights are published under licenses that let you download and run them. Meta's Llama, Alibaba's Qwen, Google's Gemma, Mistral, DeepSeek, and Microsoft's Phi are the best-known families. They are not the absolute frontier, but as of 2026 the gap is small enough that a good local model handles most everyday work well.
Why run an LLM locally?
Four reasons drive most people to local models, and one set of trade-offs sends them back to the cloud. Both are worth stating plainly.
- Privacy. Your prompts and documents never leave the machine. For legal, medical, financial, or proprietary work, this is often the only acceptable answer.
- Cost at volume. After the download, inference is free per use. If you process millions of tokens a day (classification, extraction, drafting), local hardware pays for itself fast.
- Offline and always-on. No network, no outage, no deprecation. A model you have on disk works on a plane and still works in five years.
- Control. You pick the exact version, set your own system prompt, and nothing changes under you. No silent model swaps, no new safety filter mid-project.
The honest case against: the largest hosted models are still better at the hardest reasoning and coding, you supply and maintain the hardware, and getting comfortable with the setup takes an afternoon. For occasional use of the smartest possible model, the cloud is simpler. Most people end up using both, and choosing per task.
How a local LLM actually works
Running a model locally means assembling three pieces. Once you see them as separate layers, every tool and tutorial makes sense.
The model file holds the weights. For local use it is almost always a single .gguf file. A name like qwen3-8b-Q4_K_M.gguf tells you the family (Qwen3), the size (8B = 8 billion parameters), and the quantization (Q4_K_M). More on size and quantization below.
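To make the naming pattern concrete, here is a minimal sketch that splits a file name into its three parts. It assumes the name follows a `<family>-<size>b-<quant>.gguf` layout like the example in this guide; real repositories vary, and the `parse_model_filename` helper and `ModelName` type are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional


class ModelName(NamedTuple):
    family: str
    size_billions: float
    quant: Optional[str]


# Assumed layout: <family>-<size>b-<quant>.gguf
# This is an illustrative convention, not a rule every publisher follows.
_PATTERN = re.compile(
    r"^(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?)b(?:-(?P<quant>[A-Za-z0-9_]+))?\.gguf$",
    re.IGNORECASE,
)


def parse_model_filename(filename: str) -> ModelName:
    """Split a GGUF file name into family, size in billions, and quant label."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized model file name: {filename!r}")
    return ModelName(
        family=match.group("family"),
        size_billions=float(match.group("size")),
        quant=match.group("quant"),  # None when the name carries no quant label
    )


name = parse_model_filename("qwen3-8b-Q4_K_M.gguf")
print(name.family, name.size_billions, name.quant)
```

Names that do not fit the assumed layout raise a `ValueError` rather than guessing, which is usually the safer behavior when you are sorting downloaded files.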
The runtime is the program that reads the file, loads it into memory, and does the math that produces each next token. Ollama and LM Studio are the popular ones. Both are built on a project called llama.cpp, which is the engine doing the real work.
The hardware is where the model lives while it runs. The single most important number is memory: GPU memory (VRAM) on a PC, or unified memory on an Apple Silicon Mac. The whole model has to fit, so memory decides which models you can run at all.
The models: open weights you can download
Model names look cryptic but follow a pattern: a family (who made it), a size in billions of parameters, and sometimes a variant. Bigger is generally smarter and slower. These are the families worth knowing as of 2026.
| Family | Common sizes | Good for | License |
|---|---|---|---|
| Qwen3 (Alibaba) | 8B · 14B · 30B | The default all-rounder; strong reasoning and multilingual | Apache 2.0 |
| Llama 3.3 (Meta) | 8B · 70B | The largest ecosystem of fine-tunes and tooling | Llama community |
| Gemma 3 (Google) | 4B · 12B · 27B | Images and text, 140+ languages, long context | Gemma terms |
| Mistral | 7B · small/large | Efficient, strong instruction-following | Apache 2.0 |
| DeepSeek | varied · MoE | Code and step-by-step reasoning | Open weight |
| Phi-4-mini (Microsoft) | 3.8B | Runs on a laptop with no dedicated GPU | MIT |
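You rarely need to hunt for files yourself: a runtime like Ollama downloads a model by name (for example, ollama run qwen3). When you do browse, Hugging Face is the main repository, and the names there follow the same family-size-quant pattern.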
Quantization: how a 14 GB model fits in 4 GB
A model's weights are numbers. Stored at 16 bits each (2 bytes per weight), a 7-billion-parameter model is about 14 GB, too big for most consumer GPUs. Quantization stores those numbers with fewer bits. It is the single trick that makes local models practical.
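The arithmetic behind that figure is short enough to check yourself. The sketch below is a rough estimate only: it counts the weights and nothing else, and the `model_size_gb` helper is a hypothetical name. Real quantized files come out somewhat larger than the nominal bit count suggests, because some layers are kept at higher precision and the file carries metadata.

```python
def model_size_gb(parameters: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = parameters * bits_per_weight / 8
    return bytes_total / 1e9


# A 7-billion-parameter model at 16 bits per weight.
print(model_size_gb(7e9, 16))  # 14.0

# The same model at a nominal 4 bits per weight.
print(model_size_gb(7e9, 4))   # 3.5
```

Halving the bits halves the size, which is why the jump from 16 bits to about 4 bits is what brings a 7B model within reach of a consumer GPU.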
Reading a quant label: the number after Q is the approximate bits per weight (lower = smaller, faster, slightly less accurate). K means a modern, grouped method. M is the medium variant (there are S and L too). When in doubt, pick Q4_K_M. It is the format behind most local LLM downloads for a reason.
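The label rules above can be written as a small decoder. This is a sketch that covers only the pattern described here (a bit count, an optional K, and an optional S, M, or L); other label styles exist, such as Q8_0, and this hypothetical `decode_quant_label` helper deliberately rejects them rather than guessing.

```python
import re
from typing import Any, Dict

_VARIANTS = {"S": "small", "M": "medium", "L": "large"}

# Matches labels shaped like Q4_K_M, Q5_K_S, Q6_K, or Q4.
_LABEL = re.compile(r"Q(\d+)(?:_(K))?(?:_([SML]))?")


def decode_quant_label(label: str) -> Dict[str, Any]:
    """Decode a quant label into nominal bits, grouped-method flag, and variant."""
    match = _LABEL.fullmatch(label.upper())
    if match is None:
        raise ValueError(f"label not covered by this sketch: {label!r}")
    return {
        "bits": int(match.group(1)),            # nominal bits per weight
        "k_quant": match.group(2) is not None,  # True for the grouped "K" methods
        "variant": _VARIANTS.get(match.group(3)),  # None if no S/M/L suffix
    }


print(decode_quant_label("Q4_K_M"))
```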
What can your hardware run?
One rule covers most of it: a model has to fit in memory, with room to spare for the conversation. A Q4_K_M model needs roughly its file size in memory, plus 10-20% for context. Here is the ladder from a modest laptop to a serious workstation.
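The fit rule is easy to turn into a quick check. The sketch below applies the guide's rule of thumb (file size plus 10-20% for context) using the upper end of that range by default; the function names and the example file sizes are hypothetical, and actual context overhead depends on how long your conversations run.

```python
def memory_needed_gb(file_size_gb: float, context_overhead: float = 0.20) -> float:
    """File size plus a margin for context, per the rule of thumb above."""
    return file_size_gb * (1 + context_overhead)


def fits(file_size_gb: float, available_gb: float, context_overhead: float = 0.20) -> bool:
    """True if the model plus its context margin fits in the available memory."""
    return memory_needed_gb(file_size_gb, context_overhead) <= available_gb


# Hypothetical example: a 5 GB model file on a machine with 8 GB of GPU memory.
print(fits(5.0, 8.0))
# The same machine with a hypothetical 7 GB file.
print(fits(7.0, 8.0))
```

Using the 20% figure errs on the side of caution. Pass `context_overhead=0.10` if you keep conversations short.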
The tools that run the model
You do not interact with the model file directly. A runtime does that. The four below cover almost every use, and the first two are where beginners should start.
| Tool | Best for | Interface |
|---|---|---|
| Ollama | Developers; one command to run, plus an OpenAI-compatible API to build on | Command line + API |
| LM Studio | Beginners; browse, download, and chat with no terminal | Desktop app (GUI) |
| llama.cpp | The engine under the others; maximum control and odd hardware | Command line / library |
| vLLM | Serving a model to many users at once, in production | Server |
A useful fact: Ollama and LM Studio both run on llama.cpp, so they deliver similar speed on the same hardware and read the same GGUF files. The choice is mostly about how you like to work. On Apple Silicon, both can use Apple's MLX backend for extra speed.
Quickstart: your first local model in five minutes
The fastest path is Ollama. It installs in one step, downloads the model for you, and gives you a chat prompt. Here is the whole thing.
- Install Ollama. Download it for macOS, Windows, or Linux from the official site and run the installer. On Linux it is a single command from their homepage.
- Pull and run a model. Open a terminal and run one line. Ollama downloads the model the first time, then drops you into a chat:
    ollama run qwen3
  Start with an 8B model. If your machine is light on memory, try ollama run phi4-mini instead.
- Chat. Type a question at the >>> prompt. The first token takes a moment while the model loads into memory; after that it streams. Type /bye to exit.
- Build on it (optional). Ollama also serves an OpenAI-compatible API at http://localhost:11434. Point any tool that speaks the OpenAI format at that URL and it will use your local model with no code changes:
    curl http://localhost:11434/v1/chat/completions \
      -d '{"model":"qwen3","messages":[{"role":"user","content":"Hello"}]}'
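If you would rather call that endpoint from a script than from curl, the same request can be built with Python's standard library. This is a minimal sketch: it assumes Ollama is running locally with the qwen3 model already pulled, and that the response follows the usual OpenAI chat-completions shape. The `build_chat_request` and `ask` helpers are hypothetical names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # the local address given in this guide


def build_chat_request(prompt: str, model: str = "qwen3",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "qwen3") -> str:
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once the local server is running:
#   print(ask("Hello"))
```

Building the request separately from sending it lets you inspect or log exactly what leaves your script, which is handy when a tool and the server disagree about the payload.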
Or have your agent set it up
The fastest path of all is to let a coding agent do the work. Modern setup is less about memorizing commands and more about handing an assistant a clear instruction, then confirming each step. Paste this into Claude, a terminal agent like Claude Code, or any AI assistant, fill in your machine's details, and let it walk you through the install:
You are my local-LLM setup assistant. Get a working local model running on my machine, one step at a time.
MY MACHINE: [your OS, total RAM, and GPU model or Apple Silicon chip].
Rules:
- Recommend one model and quantization that fits my memory. A Q4_K_M model needs about its file size in memory, plus 10 to 20 percent for context.
- Give exact commands for my platform only. Do not assume a GPU I did not mention.
- Walk me through installing Ollama, then pulling and running the model. One step at a time, and wait for me to confirm each works before the next.
- After it runs, show me how to call it from the OpenAI-compatible API at http://localhost:11434.
- If a step errors, diagnose it from the exact output I paste back.
Start by telling me which model you recommend for my machine, and why.
That is the same logic this guide walks through, handed to a model that can run it with you. It is the difference between reading a manual and having someone who has read it sitting next to you.
Which model should you run?
There is no single best model, only the best fit for your machine and your task. Use the memory ladder above to find your size class, then this flow to pick within it.
Not sure which to pick? Hand the decision to a model that tracks the current landscape and knows your constraints:
Recommend the single best local LLM for my hardware as of today.
MY MACHINE: [your OS, RAM, and GPU/VRAM or Apple Silicon chip].
WHAT I MAINLY DO: [for example: general chat and writing, coding, or data extraction].
Give me one recommendation, not a list: the model, size, and quantization, the approximate memory it needs, the rough tokens per second to expect, and the exact ollama run command. If a newer model has replaced these since mid-2026, say so and pick that instead.
Local vs cloud: an honest comparison
This is not a contest with a winner. Each side is better at different things, and most serious users keep both. Here is the trade-off in full.
| | Local LLM | Hosted (cloud) LLM |
|---|---|---|
| Privacy | Total; nothing leaves your machine | Prompts sent to a provider |
| Cost | Hardware + electricity, then free per use | Per-token or subscription, forever |
| Top quality | Very good, close on most tasks | Best on the hardest reasoning and code |
| Setup | An afternoon to get comfortable | Instant; nothing to install |
| Offline | Works with no network | Needs a connection |
| Scaling to many users | You manage the hardware | Elastic; someone else's problem |
A simple way to decide: if the work is private, repetitive, or high-volume, lean local. If you need the smartest possible answer once in a while and setup time matters more than data control, lean cloud. The two are not rivals; they are tools for different jobs.
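For the high-volume case, the cost side of that decision is a break-even calculation. The sketch below is illustrative only: every input is something you supply, the `breakeven_days` helper is a hypothetical name, and the numbers in the example are placeholders, not real prices.

```python
from typing import Optional


def breakeven_days(hardware_cost: float,
                   tokens_per_day: float,
                   hosted_price_per_million: float,
                   electricity_per_day: float = 0.0) -> Optional[float]:
    """Days until local hardware costs less than paying per token.

    Returns None when hosted use is cheaper per day than local electricity,
    in which case the hardware never pays for itself on cost alone.
    """
    hosted_per_day = tokens_per_day / 1e6 * hosted_price_per_million
    daily_saving = hosted_per_day - electricity_per_day
    if daily_saving <= 0:
        return None
    return hardware_cost / daily_saving


# Placeholder inputs, chosen only to show the shape of the calculation.
days = breakeven_days(
    hardware_cost=2000.0,
    tokens_per_day=5_000_000,
    hosted_price_per_million=2.0,
    electricity_per_day=0.5,
)
print(days)
```

The calculation ignores your own setup and maintenance time, so treat the result as a lower bound on how long the hardware takes to pay off.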
Common mistakes and limits
- Picking a model too big for your memory. If it spills past your GPU or unified memory, it slows to a crawl. Drop a size or a quant before blaming the model.
- Expecting frontier coding from a 7B model. Small models are good at a lot, but they are not the largest hosted models. Match expectations to size.
- Ignoring context limits. Long documents fill the context window and eat memory. Watch both, and chunk large inputs.
- Treating generated code as trusted. A local model is still a model. Review what it writes the way you would review anything from the internet.
- Downloading from unknown sources. Stick to official repositories on Hugging Face or the model the runtime pulls for you.
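Chunking large inputs, mentioned in the list above, can be as simple as splitting on words with a little overlap so context carries across chunk boundaries. This is a rough sketch: it counts words rather than tokens (a real tokenizer would be more accurate), and the `chunk_words` helper is a hypothetical name.

```python
from typing import List


def chunk_words(text: str, max_words: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so a sentence cut at a
    boundary still appears whole in one of them.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


print(chunk_words("a b c d e", max_words=3, overlap=1))  # ['a b c', 'c d e']
```

Pick `max_words` well below the model's context window so there is room left for your instructions and the model's answer.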
Frequently asked questions
What is a local LLM?
A local LLM is a large language model that runs entirely on your own computer instead of a company's servers. You download the model's weights as a file, run it with a tool like Ollama or LM Studio, and every prompt and response stays on your machine. There are no API keys, no per-token charges, and no internet connection required once the model is downloaded.
Can my computer run a local LLM?
Most modern computers can run a small model. The practical floor is about 16 GB of RAM, or any Apple Silicon Mac. A laptop with 8 GB of GPU memory comfortably runs a 7-8B model at usable speed. Larger models need more memory: roughly 24 GB for a 30B model and 48 GB or more for a 70B model. Quantization is what makes this fit.
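Those thresholds can be written as a small lookup. The sketch below uses only the figures from the answer above (8 GB for a 7-8B model, about 24 GB for 30B, 48 GB or more for 70B) and treats them as rough floors; it makes no distinction between GPU memory and unified memory, which is a simplification, and the `largest_size_class` helper is a hypothetical name.

```python
from typing import Optional

# (minimum memory in GB, size class), largest first.
# Figures taken from the answer above; they are rough guides, not guarantees.
LADDER = [
    (48, "70B"),
    (24, "30B"),
    (8, "7-8B"),
]


def largest_size_class(memory_gb: float) -> Optional[str]:
    """Return the largest size class the given memory is likely to hold."""
    for minimum_gb, size_class in LADDER:
        if memory_gb >= minimum_gb:
            return size_class
    return None  # below the smallest rung listed here


print(largest_size_class(16))  # 7-8B
```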
What is the best local LLM to run?
As of 2026 the common default is the Qwen3 family (8B, 14B, or 30B) for its quality and permissive license. Llama 3.3 has the largest ecosystem, Gemma 3 handles images and many languages, Mistral and DeepSeek are strong on reasoning and code, and Phi-4-mini runs without a dedicated GPU. The best choice depends on your hardware and task.
Is running an LLM locally free?
The software and the open-weight models are free to download and run. You pay only in hardware and electricity. After the one-time download there is no per-token cost and no subscription, which is why local models win for high-volume or privacy-sensitive work. The trade-off is that you supply the compute.
Ollama vs LM Studio: which should I use?
Use Ollama if you are a developer who wants a command line and an OpenAI-compatible API to build on. Use LM Studio if you want a graphical app to browse, download, and chat with models without a terminal. Both run the same GGUF files at similar speed because both are built on llama.cpp underneath.
What is GGUF and what does Q4_K_M mean?
GGUF is the standard file format for a local model: one file holding the weights plus the metadata a runtime needs. Q4_K_M is a quantization level that stores most weights at about 4 bits instead of 16, cutting the file to roughly a quarter of full size while keeping the most important layers at higher precision. It loses only a few percent of quality and is the most-downloaded format for local use.
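You can confirm that a download really is a GGUF file by reading its first few bytes. The sketch below assumes the header layout used by GGUF version 2 and later as I understand it: a 4-byte magic of "GGUF", a 32-bit version, then 64-bit counts of tensors and metadata entries, all little-endian. The `read_gguf_header` helper is a hypothetical name, and the file path in the usage note is a placeholder.

```python
import struct

_HEADER = struct.Struct("<4sIQQ")  # magic, version, tensor count, metadata count


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2+ layout)."""
    if len(data) < _HEADER.size:
        raise ValueError("need at least 24 bytes to read the header")
    magic, version, tensor_count, metadata_count = _HEADER.unpack(data[:_HEADER.size])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file: magic bytes do not match")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_count": metadata_count,
    }


# Usage with a real file (path is a placeholder):
#   with open("model.gguf", "rb") as f:
#       print(read_gguf_header(f.read(24)))
```

A file that fails the magic check is either corrupted or not a GGUF file at all, which is a cheap first test before handing it to a runtime.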
Are local LLMs as good as ChatGPT or Claude?
The best open models you can run on a high-end machine are close to frontier quality on many everyday tasks, but the largest hosted models still lead on the hardest reasoning, long-context, and coding work. The honest framing is fit, not rank: a model that is private, free per use, and always available can be the better tool even when a hosted model would score higher on a benchmark.
Is it safe and private to run an LLM locally?
Yes, and that privacy is the main reason to run one. Because the model runs on your hardware, your prompts and outputs never leave the machine and nothing is logged by a provider. Download model files from reputable sources such as the official repositories on Hugging Face, and review any code a model generates the way you would review code from the internet.
Glossary
- Parameters. The numbers a model learned during training, counted in billions (the "B" in 8B). More parameters generally means more capability and more memory.
- Weights. Another word for the parameters: the actual values that get downloaded and loaded into memory to run the model.
- Open weight. A model whose weights are published for download. Not always the same as open source, and the license sets the terms of use.
- GGUF. The standard single-file format for a local model, holding the weights plus the metadata a runtime needs to load it.
- Quantization. Storing weights with fewer bits to shrink the file and memory footprint, at a small cost in quality. Q4_K_M is the common default.
- VRAM. The memory on a graphics card. On a PC it sets the ceiling on model size. On a Mac the equivalent is unified memory.
- Unified memory. On Apple Silicon, one memory pool shared by the CPU and GPU, which lets a Mac run larger models than a similar PC GPU.
- Context window. How much text (your prompt plus the conversation) the model can consider at once, measured in tokens. Bigger contexts use more memory.
- Token. A chunk of text, roughly a few characters. Models read and write in tokens, and speed is measured in tokens per second.
- Inference. Running a trained model to get an answer, as opposed to training it. Everything on this page is about inference.
- Mixture of experts (MoE). A design where only part of the model runs for each token, so a large model can generate quickly. The full set of weights still has to fit in memory.
Where to go next
You now have the full picture: the stack, the models, the memory math, and a five-minute path to a running model. Two directions from here.
If the economics interest you (what compute actually costs, and how local stacks up against hosted inference at scale), read the GPU and infra economics playbook and the broader guide to AI infrastructure. For the daily moves in models, chips, and tooling, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public.
This guide is part of The Primer, our growing library of ground-up explainers. We re-check every one against the live landscape each month, so the numbers and model names stay current.