How to Run a Local LLM: A Complete Beginner's Guide

Run a capable AI model entirely on your own laptop. No account, no API key, no per-token bill, and nothing leaving the machine. That is a local LLM, and as of 2026 you can have one answering questions in about five minutes. This guide explains how it works, what hardware you need, which models to pick, and how to start.

It assumes no background beyond knowing what a chatbot is. By the end you will understand the three pieces that make a local model run, be able to read a model name like qwen3-8b-Q4_K_M.gguf, and know exactly what your machine can handle.

What you'll learn

What a local LLM is, and the honest case for and against running one
The three-part stack: the model file, the runtime, and your hardware
How quantization shrinks a 14 GB model to fit in 4 GB
Exactly which models your RAM or GPU can run, with a memory ladder
The five-minute path to your first running model

What is a local LLM?

When you use a hosted assistant, your prompt travels to a company's datacenter, runs on their GPUs, and the answer comes back over the network. You rent the model by the token, and the provider sees everything you send.

A local LLM inverts that. The model's weights, the billions of numbers that encode what it knows, are released as a file you can download. You run that file on your own GPU, your own Mac, or even your own CPU. The model never phones home. There is no account, no rate limit, and no usage meter.

This is possible because of open-weight models: capable models whose weights are published under licenses that let you download and run them. Meta's Llama, Alibaba's Qwen, Google's Gemma, Mistral, DeepSeek, and Microsoft's Phi are the best-known families. They are not the absolute frontier, but as of 2026 the gap is small enough that a good local model handles most everyday work well.

Open weights, not open source. "Open weight" means you can download and run the model. It does not always mean you can see the training data or use it commercially without limits. Check the license: Apache 2.0 (Qwen and many Mistral models) and MIT (Phi) are the most permissive, while Llama and Gemma carry their own community terms.

Why run an LLM locally?

Four reasons drive most people to local models, and one set of trade-offs sends them back to the cloud. Both are worth stating plainly.

Privacy. Your prompts and documents never leave the machine. For legal, medical, financial, or proprietary work, this is often the only acceptable answer.
Cost at volume. After the download, inference is free per use. If you process millions of tokens a day (classification, extraction, drafting), local hardware pays for itself fast.
Offline and always-on. No network, no outage, no deprecation. A model you have on disk works on a plane and still works in five years.
Control. You pick the exact version, set your own system prompt, and nothing changes under you. No silent model swaps, no new safety filter mid-project.

The honest case against: the largest hosted models are still better at the hardest reasoning and coding, you supply and maintain the hardware, and although the first model runs in minutes, getting comfortable with the setup takes an afternoon. For occasional use of the smartest possible model, the cloud is simpler. Most people end up using both, and choosing per task.

How a local LLM actually works

Running a model locally means assembling three pieces. Once you see them as separate layers, every tool and tutorial makes sense.

Three layers, assembled on one machine. The runtime loads the model file into memory on your hardware, then turns your prompt into a stream of tokens. Swap any layer (a bigger model, a faster runtime, more memory) and the other two still fit.

The model file holds the weights. For local use it is almost always a single .gguf file. The name tells you the family (Qwen3), the size (8B = 8 billion parameters), and the quantization (Q4_K_M). More on sizes and quantization below.

The runtime is the program that reads the file, loads it into memory, and does the math that produces each next token. Ollama and LM Studio are the popular ones. Both are built on a project called llama.cpp, which is the engine doing the real work.

The hardware is where the model lives while it runs. The single most important number is memory: GPU memory (VRAM) on a PC, or unified memory on an Apple Silicon Mac. The whole model has to fit, so memory decides which models you can run at all.

The models: open weights you can download

Model names look cryptic but follow a pattern: a family (who made it), a size in billions of parameters, and sometimes a variant. Bigger is generally smarter and slower. These are the families worth knowing as of 2026.

Family	Common sizes	Good for	License
Qwen3 (Alibaba)	8B · 14B · 30B	The default all-rounder; strong reasoning and multilingual	Apache 2.0
Llama 3.1 / 3.3 (Meta)	8B (3.1) · 70B (3.3)	The largest ecosystem of fine-tunes and tooling	Llama community
Gemma 3 (Google)	4B · 12B · 27B	Images and text, 140+ languages, long context	Gemma terms
Mistral	7B · small/large	Efficient, strong instruction-following	Apache 2.0
DeepSeek	varied · MoE	Code and step-by-step reasoning	Open weight
Phi-4-mini (Microsoft)	3.8B	Runs on a laptop with no dedicated GPU	MIT

Where to get them. You rarely download these by hand. A runtime like Ollama fetches the right file for you with one command (ollama run qwen3). When you do browse, Hugging Face is the main repository, and the names there follow the same family-size-quant pattern.

Quantization: how a 14 GB model fits in 4 GB

A model's weights are numbers. Stored at their original 16-bit precision (2 bytes each), a 7-billion-parameter model is about 14 GB (7 billion × 2 bytes), too big for most consumer GPUs. Quantization stores those numbers with fewer bits. It is the single trick that makes local models practical.

The trade-off curve. Going from 16-bit to 4-bit roughly quarters the file and the memory needed, while keeping around 95% of quality. "K_M" means a smart mix that keeps the most important layers at higher precision. For most people, Q4_K_M is the right starting point; step up to Q5 or Q8 only if you have memory to spare and want the last few percent.
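The arithmetic behind those figures is simple enough to check yourself. The sketch below is a rough estimate only: it multiplies parameter count by bits per weight and ignores metadata and the mixed precision that K_M variants use, so real files come out somewhat larger. The function name is illustrative, not part of any tool.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 10**9 bytes).

    Assumption: every weight is stored at the same bit width, which is
    a simplification of what real quantized files do.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Walk a 7B model down the precision ladder described above.
for bits in (16, 8, 5, 4):
    print(f"7B at {bits}-bit: about {model_size_gb(7, bits):.1f} GB")
```

At 16 bits this gives the 14 GB quoted above, and at 4 bits it gives 3.5 GB, which is where the "fits in 4 GB" headline comes from.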

Reading a quant label: the number after Q is the approximate bits per weight (lower = smaller, faster, slightly less accurate). K means a modern, grouped method. M is the medium variant (there are S and L too). When in doubt, pick Q4_K_M. It is the format behind most local LLM downloads for a reason.
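To make the naming pattern concrete, here is a minimal sketch that splits a file name like qwen3-8b-Q4_K_M.gguf into its parts. It assumes the exact family-size-quant shape used in this guide; real file names on Hugging Face vary, and other label shapes such as Q8_0 are not handled. The ModelName type and parse_model_name function are hypothetical helpers written for this example.

```python
import re
from typing import NamedTuple, Optional


class ModelName(NamedTuple):
    family: str             # who made it, e.g. "qwen3"
    params_billions: float  # the number before "b"
    bits: int               # the number after "Q"
    k_quant: bool           # True when the label contains "_K"
    variant: Optional[str]  # "S", "M", or "L" if present


# Assumed shape: <family>-<size>b-Q<bits>[_K][_S|_M|_L].gguf
NAME_RE = re.compile(
    r"^(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?)b-"
    r"Q(?P<bits>\d+)(?P<k>_K)?(?:_(?P<variant>[SML]))?\.gguf$",
    re.IGNORECASE,
)


def parse_model_name(filename: str) -> ModelName:
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized model file name: {filename!r}")
    variant = match.group("variant")
    return ModelName(
        family=match.group("family"),
        params_billions=float(match.group("size")),
        bits=int(match.group("bits")),
        k_quant=match.group("k") is not None,
        variant=variant.upper() if variant else None,
    )


print(parse_model_name("qwen3-8b-Q4_K_M.gguf"))
```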

What can your hardware run?

One rule covers most of it: a model has to fit in memory, with room to spare for the conversation. A Q4_K_M model needs roughly its file size in memory, plus 10-20% for context. The ladder runs from a modest laptop to a serious workstation: about 8 GB runs a 7-8B model, roughly 24 GB a 30B model, and 48 GB or more a 70B model.

The memory ladder. On a PC the number is your GPU's VRAM. On a Mac it is unified memory, shared between CPU and GPU, which is why a 64 GB MacBook can run a 70B model that will not fit on a 24 GB graphics card. If a model is too big, drop to a smaller size or a lower quant before giving up.

Apple Silicon is unusually good at this. Because the CPU and GPU share one pool of memory, a Mac can load models far larger than its price-equivalent PC GPU. A 32 GB Mac runs 30B-class models comfortably, and Apple's MLX framework has become one of the fastest ways to run them.
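The fit rule from this section (file size plus 10-20% for context) can be sketched as a small check. The file sizes in the table below are illustrative assumptions derived from parameters × 4 bits, not measured downloads; real Q4_K_M files are typically somewhat larger, so treat a narrow pass as a maybe.

```python
from typing import Dict, Optional

# Assumed rough Q4 file sizes in GB (parameters x 4 bits / 8), for illustration only.
ROUGH_Q4_SIZES_GB: Dict[str, float] = {
    "8B": 4.0,
    "14B": 7.0,
    "30B": 15.0,
    "70B": 35.0,
}


def memory_needed_gb(file_size_gb: float, context_overhead: float = 0.20) -> float:
    """File size plus headroom for context (the guide's 10-20%; 20% by default)."""
    return file_size_gb * (1.0 + context_overhead)


def largest_that_fits(
    candidates: Dict[str, float],
    available_gb: float,
    context_overhead: float = 0.20,
) -> Optional[str]:
    """Return the biggest candidate that fits in memory, or None if none do."""
    fitting = {
        name: size
        for name, size in candidates.items()
        if memory_needed_gb(size, context_overhead) <= available_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for memory_gb in (8, 16, 24, 64):
    print(f"{memory_gb} GB -> {largest_that_fits(ROUGH_Q4_SIZES_GB, memory_gb)}")
```

Pass your GPU's VRAM on a PC, or your unified memory on a Mac, as available_gb. On a Mac, leave some of that pool for the operating system and other apps.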

The tools that run the model

You do not interact with the model file directly. A runtime does that. The four below cover almost every use, and the first two are where beginners should start.

Tool	Best for	Interface
Ollama	Developers; one command to run, plus an OpenAI-compatible API to build on	Command line + API
LM Studio	Beginners; browse, download, and chat with no terminal	Desktop app (GUI)
llama.cpp	The engine under the others; maximum control and odd hardware	Command line / library
vLLM	Serving a model to many users at once, in production	Server

A useful fact: Ollama and LM Studio both run on llama.cpp, so they read the same GGUF files and perform similarly on the same hardware. The choice is mostly about how you like to work. On Apple Silicon, both can use Apple's MLX backend for extra speed.

Quickstart: your first local model in five minutes

The fastest path is Ollama. It installs in one step, downloads the model for you, and gives you a chat prompt. Here is the whole thing.

Install Ollama. Download it for macOS, Windows, or Linux from the official site and run the installer. On Linux it is a single command from their homepage.
Pull and run a model. Open a terminal and run one line. Ollama downloads the model the first time, then drops you into a chat:
```
ollama run qwen3
```
Start with an 8B model. If your machine is light on memory, try ollama run phi4-mini instead.
Chat. Type a question at the >>> prompt. The first token takes a moment while the model loads into memory; after that it streams. Type /bye to exit.
Build on it (optional). Ollama also serves an OpenAI-compatible API under http://localhost:11434/v1. Point any tool that speaks the OpenAI format at that base URL (most clients also want a placeholder API key, which Ollama ignores) and it will use your local model with no other code changes:
```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Hello"}]}'
```
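If you would rather call the API from a script, the same request as the curl call above can be sketched with only Python's standard library. This assumes Ollama is running on its default port with the qwen3 model already pulled, and that the response follows the usual OpenAI chat-completions shape; the function names are illustrative.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible base URL on the default port.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(
    prompt: str, model: str = "qwen3", base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "qwen3") -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as response:
        body = json.load(response)
    # Assumed response shape: choices[0].message.content, as in the OpenAI format.
    return body["choices"][0]["message"]["content"]
```

With the server running, print(chat("Hello")) prints the model's reply. If nothing is listening on port 11434, urlopen raises a URLError.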

Prefer a window over a terminal? Install LM Studio instead, open the model browser, search for "Qwen3 8B", click download, and chat. Same model, same speed, no command line.

Or have your agent set it up

The fastest path of all is to let a coding agent do the work. Modern setup is less about memorizing commands and more about handing an assistant a clear instruction, then confirming each step. Paste this into Claude, a terminal agent like Claude Code, or any AI assistant, fill in your machine, and let it walk you through the install:

Prompt · paste into your agent

You are my local-LLM setup assistant. Get a working local model running on my machine, one step at a time.

MY MACHINE: [your OS, total RAM, and GPU model or Apple Silicon chip].

Rules:
- Recommend one model and quantization that fits my memory. A Q4_K_M model needs about its file size in memory, plus 10 to 20 percent for context.
- Give exact commands for my platform only. Do not assume a GPU I did not mention.
- Walk me through installing Ollama, then pulling and running the model. One step at a time, and wait for me to confirm each works before the next.
- After it runs, show me how to call it from the OpenAI-compatible API at http://localhost:11434/v1.
- If a step errors, diagnose it from the exact output I paste back.

Start by telling me which model you recommend for my machine, and why.

That is the same logic this guide walks through, handed to a model that can run it with you. It is the difference between reading a manual and having someone who has read it sitting next to you.

Which model should you run?

There is no single best model, only the best fit for your machine and your task. Use the memory ladder above to find your size class, then this flow to pick within it.

Match the model to the job, then to the memory. The cheapest way to choose is empirical: pull two candidates and run your real prompts through each. Quality on your task beats any benchmark leaderboard.
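The empirical comparison can be as simple as a loop. The sketch below runs every prompt through every candidate and records the answer and the wall-clock time. The ask callable is a placeholder you supply (for example, a function that calls your local runtime's API), and the stand-in used here only echoes text so the example runs without a model.

```python
import time
from typing import Callable, Dict, List


def compare_models(
    models: List[str],
    prompts: List[str],
    ask: Callable[[str, str], str],
) -> Dict[str, List[dict]]:
    """Run each prompt through each model; collect answers and timings."""
    results: Dict[str, List[dict]] = {}
    for model in models:
        rows = []
        for prompt in prompts:
            start = time.perf_counter()
            answer = ask(model, prompt)
            elapsed = time.perf_counter() - start
            rows.append({"prompt": prompt, "answer": answer, "seconds": elapsed})
        results[model] = rows
    return results


def fake_ask(model: str, prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"[{model}] would answer: {prompt}"


report = compare_models(
    ["qwen3", "phi4-mini"],
    ["Summarize this contract clause.", "Extract the dates from this email."],
    fake_ask,
)
for model, rows in report.items():
    for row in rows:
        print(f"{model}: {row['seconds']:.3f}s  {row['answer']}")
```

Read the answers side by side rather than trusting the timings alone. The point of the exercise is quality on your own prompts.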

Not sure which to pick? Hand the decision to a model that tracks the current landscape and knows your constraints:

Prompt · let your agent pick

Recommend the single best local LLM for my hardware as of today.

MY MACHINE: [your OS, RAM, and GPU/VRAM or Apple Silicon chip].
WHAT I MAINLY DO: [for example: general chat and writing, coding, or data extraction].

Give me one recommendation, not a list: the model, size, and quantization, the approximate memory it needs, the rough tokens per second to expect, and the exact ollama run command. If a newer model has replaced the common defaults since mid-2026, say so and pick that instead.

Local vs cloud: an honest comparison

This is not a contest with a winner. Each side is better at different things, and most serious users keep both. Here is the trade-off in full.

	Local LLM	Hosted (cloud) LLM
Privacy	Total; nothing leaves your machine	Prompts sent to a provider
Cost	Hardware + electricity, then free per use	Per-token or subscription, forever
Top quality	Very good, close on most tasks	Best on the hardest reasoning and code
Setup	An afternoon to get comfortable	Instant; nothing to install
Offline	Works with no network	Needs a connection
Scaling to many users	You manage the hardware	Elastic; someone else's problem

A simple way to decide: if the work is private, repetitive, or high-volume, lean local. If you need the smartest possible answer once in a while and setup time matters more than data control, lean cloud. The two are not rivals; they are tools for different jobs.

Common mistakes and limits

Picking a model too big for your memory. If it spills past your GPU or unified memory, it slows to a crawl. Drop a size or a quant before blaming the model.
Expecting frontier coding from a 7B model. Small models are good at a lot, but they are not the largest hosted models. Match expectations to size.
Ignoring context limits. Long documents fill the context window and eat memory. Watch both, and chunk large inputs.
Treating generated code as trusted. A local model is still a model. Review what it writes the way you would review anything from the internet.
Downloading from unknown sources. Stick to official repositories on Hugging Face or the model the runtime pulls for you.
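For the context-limit point above, chunking a large input can be sketched as follows. This is a rough approach, not a tokenizer: it assumes about four characters per token, which is only a rule of thumb for English text, and it packs whole paragraphs into each chunk where it can. The function name and defaults are illustrative.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 2000, chars_per_token: int = 4) -> List[str]:
    """Split text into chunks that stay under an approximate token budget.

    Assumption: chars_per_token is a rough average, not an exact count.
    """
    max_chars = max_tokens * chars_per_token
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Hard-split any single paragraph that is itself over the budget.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ] or [""]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # still fits: keep packing
            else:
                chunks.append(current)  # budget reached: start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks


document = "First paragraph. " * 50 + "\n\n" + "Second paragraph. " * 50
for index, chunk in enumerate(chunk_text(document, max_tokens=200)):
    print(index, len(chunk))
```

Set max_tokens well below the model's context window so there is room left for your instructions and the model's answer.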

Frequently asked questions

What is a local LLM?

A local LLM is a large language model that runs entirely on your own computer instead of a company's servers. You download the model's weights as a file, run it with a tool like Ollama or LM Studio, and every prompt and response stays on your machine. There are no API keys, no per-token charges, and no internet connection required once the model is downloaded.

Can my computer run a local LLM?

Most modern computers can run a small model. The practical floor is about 16 GB of RAM, or any Apple Silicon Mac. A laptop with 8 GB of GPU memory comfortably runs a 7-8B model at usable speed. Larger models need more memory: roughly 24 GB for a 30B model and 48 GB or more for a 70B model. Quantization is what makes this fit.

What is the best local LLM to run?

As of 2026 the common default is the Qwen3 family (8B, 14B, or 30B) for its quality and permissive license. Llama 3.3 has the largest ecosystem, Gemma 3 handles images and many languages, Mistral and DeepSeek are strong on reasoning and code, and Phi-4-mini runs without a dedicated GPU. The best choice depends on your hardware and task.

Is running an LLM locally free?

The software and the open-weight models are free to download and run. You pay only in hardware and electricity. After the one-time download there is no per-token cost and no subscription, which is why local models win for high-volume or privacy-sensitive work. The trade-off is that you supply the compute.

Ollama vs LM Studio: which should I use?

Use Ollama if you are a developer who wants a command line and an OpenAI-compatible API to build on. Use LM Studio if you want a graphical app to browse, download, and chat with models without a terminal. Both run the same GGUF files at similar speed because both are built on llama.cpp underneath.

What is GGUF and what does Q4_K_M mean?

GGUF is the standard file format for a local model: one file holding the weights plus the metadata a runtime needs. Q4_K_M is a quantization level that stores most weights at about 4 bits instead of 16, cutting the file to roughly a quarter of full size while keeping the most important layers at higher precision. It loses only a few percent of quality and is the most-downloaded format for local use.

Are local LLMs as good as ChatGPT or Claude?

The best open models you can run on a high-end machine are close to frontier quality on many everyday tasks, but the largest hosted models still lead on the hardest reasoning, long-context, and coding work. The honest framing is fit, not rank: a model that is private, free per use, and always available can be the better tool even when a hosted model would score higher on a benchmark.

Is it safe and private to run an LLM locally?

Yes, and that privacy is the main reason to run one. Because the model runs on your hardware, your prompts and outputs never leave the machine and nothing is logged by a provider. Download model files from reputable sources such as the official repositories on Hugging Face, and review any code a model generates the way you would review code from the internet.

Glossary

Parameters: The numbers a model learned during training, counted in billions (the "B" in 8B). More parameters generally means more capability and more memory.
Weights: Another word for the parameters: the actual values that get downloaded and loaded into memory to run the model.
Open weight: A model whose weights are published for download. Not always the same as open source, and the license sets the terms of use.
GGUF: The standard single-file format for a local model, holding the weights plus the metadata a runtime needs to load it.
Quantization: Storing weights with fewer bits to shrink the file and memory footprint, at a small cost in quality. Q4_K_M is the common default.
VRAM: The memory on a graphics card. On a PC it sets the ceiling on model size. On a Mac the equivalent is unified memory.
Unified memory: On Apple Silicon, one memory pool shared by the CPU and GPU, which lets a Mac run larger models than a similar PC GPU.
Context window: How much text (your prompt plus the conversation) the model can consider at once, measured in tokens. Bigger contexts use more memory.
Token: A chunk of text, roughly four characters of English on average. Models read and write in tokens, and speed is measured in tokens per second.
Inference: Running a trained model to get an answer, as opposed to training it. Everything on this page is about inference.
Mixture of experts (MoE): A design where only part of the model runs for each token, so a large model can generate quickly. All of its weights must still fit in memory.

Where to go next

You now have the full picture: the stack, the models, the memory math, and a five-minute path to a running model. Two directions from here.

If the economics interest you (what compute actually costs, and how local stacks up against hosted inference at scale), read the GPU and infra economics playbook and the broader guide to AI infrastructure. For the daily moves in models, chips, and tooling, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public.

This guide is part of The Primer, our growing library of ground-up explainers. We re-check every one against the live landscape each month, so the numbers and model names stay current.
