nextbig.dev
Vancouver, B.C. · Intelligence on AI and the machines that run it
Beginner · Updated June 2026

How to Run a Local LLM: A Complete Beginner's Guide

A local large language model runs entirely on your own hardware. No API keys, no per-token bills, no data leaving your machine. Here is what that means, what you need to run one, and how to start in about five minutes.


Run a capable AI model entirely on your own laptop. No account, no API key, no per-token bill, and nothing leaving the machine. That is a local LLM, and as of 2026 you can have one answering questions in about five minutes. This guide explains how it works, what hardware you need, which models to pick, and how to start.

It assumes no background beyond knowing what a chatbot is. By the end you will understand the three pieces that make a local model run, be able to read a model name like qwen3-8b-Q4_K_M.gguf, and know exactly what your machine can handle.

What you'll learn

  • What a local LLM is, and the honest case for and against running one
  • The three-part stack: the model file, the runtime, and your hardware
  • How quantization shrinks a 14 GB model to fit in 4 GB
  • Exactly which models your RAM or GPU can run, with a memory ladder
  • The five-minute path to your first running model

What is a local LLM?

When you use a hosted assistant, your prompt travels to a company's datacenter, runs on their GPUs, and the answer comes back over the network. You rent the model by the token, and the provider sees everything you send.

A local LLM inverts that. The model's weights, the billions of numbers that encode what it knows, are released as a file you can download. You run that file on your own GPU, your own Mac, or even your own CPU. The model never phones home. There is no account, no rate limit, and no usage meter.

This is possible because of open-weight models: capable models whose weights are published under licenses that let you download and run them. Meta's Llama, Alibaba's Qwen, Google's Gemma, Mistral, DeepSeek, and Microsoft's Phi are the best-known families. They are not the absolute frontier, but as of 2026 the gap is small enough that a good local model handles most everyday work well.

Open weights, not open source. "Open weight" means you can download and run the model. It does not always mean you can see the training data or use it commercially without limits. Check the license: Apache 2.0 (Qwen, Mistral) and MIT (Phi-4-mini) are the most permissive, while Llama and Gemma carry their own community terms.

Why run an LLM locally?

Four reasons drive most people to local models, and one set of trade-offs sends them back to the cloud. Both are worth stating plainly.

The honest case against: the largest hosted models are still better at the hardest reasoning and coding, you supply and maintain the hardware, and getting comfortable with the setup takes an afternoon. For occasional use of the smartest possible model, the cloud is simpler. Most people end up using both, and choosing per task.

How a local LLM actually works

Running a model locally means assembling three pieces. Once you see them as separate layers, every tool and tutorial makes sense.

[Figure: the local LLM stack, all on your machine, nothing leaves it. Your prompt → 1 · The model file (qwen3-8b-Q4_K_M.gguf, the weights, ~4.7 GB on disk) → 2 · The runtime (Ollama, LM Studio, llama.cpp, or MLX; loads the model and runs inference) → 3 · The hardware (GPU memory, Apple Silicon unified memory, or CPU + system RAM) → tokens stream back as the reply.]
Three layers, assembled on one machine. The runtime loads the model file into memory on your hardware, then turns your prompt into a stream of tokens. Swap any layer (a bigger model, a faster runtime, more memory) and the other two still fit.

The model file holds the weights. For local use it is almost always a single .gguf file. The name tells you the family (Qwen3), the size (8B = 8 billion parameters), and the quantization (Q4_K_M). More on both below.
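
As a rough illustration of that naming pattern, the sketch below splits a file name into its three parts. The helper `parse_model_name` and its regular expression are hypothetical, written only for this example; real file names on Hugging Face vary, so treat it as a reading aid rather than a parser you can rely on.

```python
import re

# Hypothetical pattern for names shaped like family-size-quant.gguf.
# Real-world names vary; this covers only the simple form shown above.
NAME_PATTERN = re.compile(
    r"^(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?)b-(?P<quant>[A-Za-z0-9_]+)\.gguf$",
    re.IGNORECASE,
)


def parse_model_name(filename: str) -> dict:
    """Split a GGUF file name into family, size in billions, and quantization."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized model file name: {filename}")
    return {
        "family": match.group("family"),
        "billions_of_parameters": float(match.group("size")),
        "quantization": match.group("quant"),
    }


print(parse_model_name("qwen3-8b-Q4_K_M.gguf"))
# {'family': 'qwen3', 'billions_of_parameters': 8.0, 'quantization': 'Q4_K_M'}
```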

The runtime is the program that reads the file, loads it into memory, and does the math that produces each next token. Ollama and LM Studio are the popular ones. Both are built on a project called llama.cpp, which is the engine doing the real work.

The hardware is where the model lives while it runs. The single most important number is memory: GPU memory (VRAM) on a PC, or unified memory on an Apple Silicon Mac. The whole model has to fit, so memory decides which models you can run at all.

The models: open weights you can download

Model names look cryptic but follow a pattern: a family (who made it), a size in billions of parameters, and sometimes a variant. Bigger is generally smarter and slower. These are the families worth knowing as of 2026.

Family | Common sizes | Good for | License
Qwen3 (Alibaba) | 8B · 14B · 30B | The default all-rounder; strong reasoning and multilingual | Apache 2.0
Llama 3.x (Meta) | 8B (Llama 3.1) · 70B (Llama 3.3) | The largest ecosystem of fine-tunes and tooling | Llama community
Gemma 3 (Google) | 4B · 12B · 27B | Images and text, 140+ languages, long context | Gemma terms
Mistral | 7B · small/large | Efficient, strong instruction-following | Apache 2.0
DeepSeek | varied · MoE | Code and step-by-step reasoning | Open weight
Phi-4-mini (Microsoft) | 3.8B | Runs on a laptop with no dedicated GPU | MIT

Where to get them. You rarely download these by hand. A runtime like Ollama fetches the right file for you with one command (ollama run qwen3). When you do browse, Hugging Face is the main repository, and the names there follow the same family-size-quant pattern.

Quantization: how a 14 GB model fits in 4 GB

A model's weights are numbers. Stored at full 16-bit precision (two bytes per weight), a 7-billion-parameter model is about 14 GB, too big for most consumer GPUs. Quantization stores those numbers with fewer bits. It is the single trick that makes local models practical.
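
To make "fewer bits" concrete, here is a minimal Python sketch that squeezes a handful of weights into signed 4-bit integers and back. It is a deliberately simplified round-to-nearest scheme with one shared scale, not the grouped method real GGUF quantization uses, and the function names and sample weights are made up for illustration.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to signed 4-bit codes (-8..7) that share one scale factor."""
    # The largest magnitude maps to code 7; fall back to 1.0 if all weights are 0.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07]      # illustrative values only
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print(codes)                              # [2, -7, 5, 1]
print([round(r, 3) for r in restored])    # close to, not equal to, the originals
```

Each restored weight lands within half a step (`scale / 2`) of its original. That small rounding error is the quality cost quantization trades for the smaller file.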

[Figure: file size of a 7B model, by precision]
  • FP16 (full precision): ~14 GB · 100% quality
  • Q8_0 (8-bit): ~7 GB · ~99%
  • Q4_K_M (4-bit, recommended): ~4.4 GB · ~96%
Q4_K_M is the default: about a quarter of the size, a few percent of quality lost, and the most-downloaded format for local use.
The trade-off curve. Going from 16-bit to 4-bit roughly quarters the file and the memory needed, while keeping around 95% of quality. "K_M" means a smart mix that keeps the most important layers at higher precision. For most people, Q4_K_M is the right starting point; step up to Q5 or Q8 only if you have memory to spare and want the last few percent.
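
The size figures above follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below reproduces them; the function names are illustrative, and it assumes decimal gigabytes and ignores file metadata.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate file size in GB: parameters x bits, divided by 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def effective_bits(params_billions: float, size_gb: float) -> float:
    """Work backwards from a file size to the average bits stored per weight."""
    return size_gb * 8 / params_billions


for label, bits in [("FP16", 16), ("Q8_0", 8), ("4-bit, nominal", 4)]:
    print(f"{label}: ~{model_size_gb(7, bits):.1f} GB")
# FP16: ~14.0 GB
# Q8_0: ~7.0 GB
# 4-bit, nominal: ~3.5 GB

# The ~4.4 GB quoted for Q4_K_M implies about 5 bits per weight on average.
print(f"~{effective_bits(7, 4.4):.1f} bits per weight")
# ~5.0 bits per weight
```

The gap between the nominal 3.5 GB and the quoted ~4.4 GB is consistent with K_M keeping some layers at higher precision.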

Reading a quant label: the number after Q is roughly the bits per weight (lower = smaller, faster, slightly less accurate). K means a modern, grouped method. M is the medium variant (there are S and L too). When in doubt, pick Q4_K_M. It is the format behind most local LLM downloads for a reason.
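
Those reading rules can be written down as a small decoder. This is a sketch with a hypothetical helper, `read_quant_label`, that handles only the parts of a label described here; any other suffix (such as the `0` in Q8_0) is simply left uninterpreted.

```python
# Variant letters named in this guide.
VARIANTS = {"S": "small", "M": "medium", "L": "large"}


def read_quant_label(label: str) -> dict:
    """Decode a label such as Q4_K_M into bits, K-method flag, and variant."""
    parts = label.upper().split("_")
    head, rest = parts[0], parts[1:]
    if not head.startswith("Q") or not head[1:].isdigit():
        raise ValueError(f"not a quant label: {label}")
    return {
        "nominal_bits": int(head[1:]),   # the number after Q
        "k_method": "K" in rest,         # modern, grouped method
        "variant": next((VARIANTS[p] for p in rest if p in VARIANTS), None),
    }


print(read_quant_label("Q4_K_M"))
# {'nominal_bits': 4, 'k_method': True, 'variant': 'medium'}
print(read_quant_label("Q8_0"))
# {'nominal_bits': 8, 'k_method': False, 'variant': None}
```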

What can your hardware run?

One rule covers most of it: a model has to fit in memory, with room to spare for the conversation. A Q4_K_M model needs roughly its file size in memory, plus 10-20% for context. Here is the ladder from a modest laptop to a serious workstation.

Memory available | Typical hardware | Largest comfortable model (Q4_K_M) | Notes
8 GB | entry GPU / 16 GB Mac | 7-8B | fast, great for chat and drafting · 40+ tok/s
16 GB | RTX 4060 Ti / 24 GB Mac | 13-14B | noticeably stronger reasoning
24 GB | RTX 4090 · the sweet spot | 30-32B | approaches frontier quality on everyday tasks
48 GB | dual GPU / 64 GB Mac | 70B | the heavyweight open models
64 GB+ | high-memory Mac | 70B + headroom | or large mixture-of-experts models
Rule of thumb: a Q4_K_M model needs about its file size in memory, plus 10-20% for the conversation.
The memory ladder. On a PC the number is your GPU's VRAM. On a Mac it is unified memory, shared between CPU and GPU, which is why a 64 GB MacBook can run a 70B model that will not fit on a 24 GB graphics card. If a model is too big, drop to a smaller size or a lower quant before giving up.
Apple Silicon is unusually good at this. Because the CPU and GPU share one pool of memory, a Mac can load models far larger than its price-equivalent PC GPU. A 32 GB Mac runs 30B-class models comfortably, and Apple's MLX framework has become one of the fastest ways to run them.
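
The rule of thumb and the ladder are easy to turn into a quick check. The sketch below is one way to do it, with illustrative function names; the thresholds are copied from the ladder's GPU memory column, so Mac owners should compare against the Mac figures listed alongside each rung instead.

```python
def memory_needed_gb(file_size_gb: float) -> tuple[float, float]:
    """Rule of thumb: file size plus 10-20% for the conversation."""
    return file_size_gb * 1.10, file_size_gb * 1.20


# (GPU memory in GB, largest comfortable Q4_K_M model), from the ladder above.
LADDER = [
    (8, "7-8B"),
    (16, "13-14B"),
    (24, "30-32B"),
    (48, "70B"),
    (64, "70B + headroom or large mixture-of-experts models"),
]


def largest_comfortable_model(memory_gb: float):
    """Return the highest ladder rung this much memory reaches, or None."""
    best = None
    for threshold, model in LADDER:
        if memory_gb >= threshold:
            best = model
    return best


low, high = memory_needed_gb(4.7)            # the ~4.7 GB Qwen3 8B file
print(f"needs about {low:.1f}-{high:.1f} GB")
# needs about 5.2-5.6 GB
print(largest_comfortable_model(12))         # 7-8B
print(largest_comfortable_model(24))         # 30-32B
```

Anything under 8 GB returns `None`, which is the cue to try a smaller model or a lower quant.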

The tools that run the model

You do not interact with the model file directly. A runtime does that. The four below cover almost every use, and the first two are where beginners should start.

Tool | Best for | Interface
Ollama | Developers; one command to run, plus an OpenAI-compatible API to build on | Command line + API
LM Studio | Beginners; browse, download, and chat with no terminal | Desktop app (GUI)
llama.cpp | The engine under the others; maximum control and odd hardware | Command line / library
vLLM | Serving a model to many users at once, in production | Server

A useful fact: Ollama and LM Studio both build on llama.cpp, so they read the same GGUF files and run them at comparable speed on the same hardware. The choice is mostly about how you like to work. On Apple Silicon, both can use Apple's MLX backend for extra speed.

Quickstart: your first local model in five minutes

The fastest path is Ollama. It installs in one step, downloads the model for you, and gives you a chat prompt. Here is the whole thing.

  1. Install Ollama. Download it for macOS, Windows, or Linux from the official site and run the installer. On Linux it is a single command from their homepage.
  2. Pull and run a model. Open a terminal and run one line. Ollama downloads the model the first time, then drops you into a chat:
    ollama run qwen3
    Start with an 8B model. If your machine is light on memory, try ollama run phi4-mini instead.
  3. Chat. Type a question at the >>> prompt. The first token takes a moment while the model loads into memory; after that it streams. Type /bye to exit.
  4. Build on it (optional). Ollama also serves an OpenAI-compatible API under http://localhost:11434/v1. Point any tool that speaks the OpenAI format at that base URL and it will use your local model, usually with no other code changes (clients that insist on an API key accept any placeholder value):
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"qwen3","messages":[{"role":"user","content":"Hello"}]}'
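
The same request can be made from Python with nothing but the standard library. This is a minimal sketch, assuming Ollama is running on its default port with the qwen3 model already pulled; the helper names are made up for this example, and `extract_reply` assumes the reply follows the standard OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def build_chat_request(prompt: str, model: str = "qwen3") -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, model: str = "qwen3") -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("Hello"))` sends the same message as the curl command. Building the request separately from sending it keeps the payload easy to inspect before anything touches the network.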
Prefer a window over a terminal? Install LM Studio instead, open the model browser, search for "Qwen3 8B", click download, and chat. Same model, same speed, no command line.

Or have your agent set it up

The fastest path of all is to let a coding agent do the work. Modern setup is less about memorizing commands and more about handing an assistant a clear instruction, then confirming each step. Paste this into Claude, a terminal agent like Claude Code, or any AI assistant, fill in your machine, and let it walk you through the install:

Prompt · paste into your agent
You are my local-LLM setup assistant. Get a working local model running on my machine, one step at a time.

MY MACHINE: [your OS, total RAM, and GPU model or Apple Silicon chip].

Rules:
- Recommend one model and quantization that fits my memory. A Q4_K_M model needs about its file size in memory, plus 10 to 20 percent for context.
- Give exact commands for my platform only. Do not assume a GPU I did not mention.
- Walk me through installing Ollama, then pulling and running the model. One step at a time, and wait for me to confirm each works before the next.
- After it runs, show me how to call it from the OpenAI-compatible API at http://localhost:11434.
- If a step errors, diagnose it from the exact output I paste back.

Start by telling me which model you recommend for my machine, and why.

That is the same logic this guide walks through, handed to a model that can run it with you. It is the difference between reading a manual and having someone who has read it sitting next to you.

Which model should you run?

There is no single best model, only the best fit for your machine and your task. Use the memory ladder above to find your size class, then this flow to pick within it.

[Figure: a decision flow for choosing a local model by task and hardware. What is the job?]
  • Chat, writing, Q&A (the everyday case): Qwen3 8B / 14B or Llama 3.1 8B
  • Code & reasoning (harder tasks): Qwen3 30B / DeepSeek; needs 24 GB+ memory
  • Light hardware (no GPU, <16 GB): Phi-4-mini / Gemma 4B; runs on CPU
Then tune the quant to fit: start at Q4_K_M, step up to Q5 or Q8 only if memory allows. Download two, run the same prompt through both, and keep the one you like.
Match the model to the job, then to the memory. The cheapest way to choose is empirical: pull two candidates and run your real prompts through each. Quality on your task beats any benchmark leaderboard.
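
For readers who think in code, the flow can be sketched as a short function. This is one possible reading, not a definitive rule: it treats "light hardware" as no GPU and under 16 GB of memory, falls back to the everyday pick when a coding task has less than 24 GB available, and returns only the first-named option from the chat branch. The function name and parameters are hypothetical.

```python
def pick_model(task: str, memory_gb: float, has_gpu: bool = True) -> str:
    """Suggest a starting model for a task ('chat' or 'code') and memory size."""
    # Light hardware: no GPU and under 16 GB of memory, so stay small.
    if not has_gpu and memory_gb < 16:
        return "Phi-4-mini or Gemma 4B"
    # Harder coding and reasoning work needs 24 GB or more.
    if task == "code" and memory_gb >= 24:
        return "Qwen3 30B or DeepSeek"
    # Everyday chat, writing, and Q&A: size the model to the memory ladder.
    return "Qwen3 14B" if memory_gb >= 16 else "Qwen3 8B"


print(pick_model("chat", 8))                   # Qwen3 8B
print(pick_model("code", 24))                  # Qwen3 30B or DeepSeek
print(pick_model("chat", 8, has_gpu=False))    # Phi-4-mini or Gemma 4B
```

Whatever it suggests is only a starting candidate; running your own prompts through two models remains the real test.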

Not sure which to pick? Hand the decision to a model that tracks the current landscape and knows your constraints:

Prompt · let your agent pick
Recommend the single best local LLM for my hardware as of today.

MY MACHINE: [your OS, RAM, and GPU/VRAM or Apple Silicon chip].
WHAT I MAINLY DO: [for example: general chat and writing, coding, or data extraction].

Give me one recommendation, not a list: the model, size, and quantization, the approximate memory it needs, the rough tokens per second to expect, and the exact ollama run command. If a newer model has replaced these since mid-2026, say so and pick that instead.

Local vs cloud: an honest comparison

This is not a contest with a winner. Each side is better at different things, and most serious users keep both. Here is the trade-off in full.

Factor | Local LLM | Hosted (cloud) LLM
Privacy | Total; nothing leaves your machine | Prompts sent to a provider
Cost | Hardware + electricity, then free per use | Per-token or subscription, forever
Top quality | Very good, close on most tasks | Best on the hardest reasoning and code
Setup | An afternoon to get comfortable | Instant; nothing to install
Offline | Works with no network | Needs a connection
Scaling to many users | You manage the hardware | Elastic; someone else's problem

A simple way to decide: if the work is private, repetitive, or high-volume, lean local. If you need the smartest possible answer once in a while and setup time matters more than data control, lean cloud. The two are not rivals; they are tools for different jobs.

Common mistakes and limits

Frequently asked questions

What is a local LLM?

A local LLM is a large language model that runs entirely on your own computer instead of a company's servers. You download the model's weights as a file, run it with a tool like Ollama or LM Studio, and every prompt and response stays on your machine. There are no API keys, no per-token charges, and no internet connection required once the model is downloaded.

Can my computer run a local LLM?

Most modern computers can run a small model. The practical floor is about 16 GB of RAM, or any Apple Silicon Mac. A laptop with 8 GB of GPU memory comfortably runs a 7-8B model at usable speed. Larger models need more memory: roughly 24 GB for a 30B model and 48 GB or more for a 70B model. Quantization is what makes this fit.

What is the best local LLM to run?

As of 2026 the common default is the Qwen3 family (8B, 14B, or 30B) for its quality and permissive license. Llama 3.3 has the largest ecosystem, Gemma 3 handles images and many languages, Mistral and DeepSeek are strong on reasoning and code, and Phi-4-mini runs without a dedicated GPU. The best choice depends on your hardware and task.

Is running an LLM locally free?

The software and the open-weight models are free to download and run. You pay only in hardware and electricity. After the one-time download there is no per-token cost and no subscription, which is why local models win for high-volume or privacy-sensitive work. The trade-off is that you supply the compute.

Ollama vs LM Studio: which should I use?

Use Ollama if you are a developer who wants a command line and an OpenAI-compatible API to build on. Use LM Studio if you want a graphical app to browse, download, and chat with models without a terminal. Both run the same GGUF files at comparable speed because both are built on llama.cpp underneath.

What is GGUF and what does Q4_K_M mean?

GGUF is the standard file format for a local model: one file holding the weights plus the metadata a runtime needs. Q4_K_M is a quantization level that stores most weights at about 4 bits instead of 16, cutting the file to roughly a quarter of full size while keeping the most important layers at higher precision. It loses only a few percent of quality and is the most-downloaded format for local use.

Are local LLMs as good as ChatGPT or Claude?

The best open models you can run on a high-end machine are close to frontier quality on many everyday tasks, but the largest hosted models still lead on the hardest reasoning, long-context, and coding work. The honest framing is fit, not rank: a model that is private, free per use, and always available can be the better tool even when a hosted model would score higher on a benchmark.

Is it safe and private to run an LLM locally?

Yes, and that privacy is the main reason to run one. Because the model runs on your hardware, your prompts and outputs never leave the machine and nothing is logged by a provider. Download model files from reputable sources such as the official repositories on Hugging Face, and review any code a model generates the way you would review code from the internet.

Glossary

Parameters
The numbers a model learned during training, counted in billions (the "B" in 8B). More parameters generally means more capability and more memory.
Weights
Another word for the parameters: the actual values that get downloaded and loaded into memory to run the model.
Open weight
A model whose weights are published for download. Not always the same as open source, and the license sets the terms of use.
GGUF
The standard single-file format for a local model, holding the weights plus the metadata a runtime needs to load it.
Quantization
Storing weights with fewer bits to shrink the file and memory footprint, at a small cost in quality. Q4_K_M is the common default.
VRAM
The memory on a graphics card. On a PC it sets the ceiling on model size. On a Mac the equivalent is unified memory.
Unified memory
On Apple Silicon, one memory pool shared by the CPU and GPU, which lets a Mac run larger models than a similar PC GPU.
Context window
How much text (your prompt plus the conversation) the model can consider at once, measured in tokens. Bigger contexts use more memory.
Token
A chunk of text, roughly a few characters. Models read and write in tokens, and speed is measured in tokens per second.
Inference
Running a trained model to get an answer, as opposed to training it. Everything on this page is about inference.
Mixture of experts (MoE)
A design where only part of the model runs for each token, so a large model can generate quickly. All of its weights still have to fit in memory.

Where to go next

You now have the full picture: the stack, the models, the memory math, and a five-minute path to a running model. Two directions from here.

If the economics interest you (what compute actually costs, and how local stacks up against hosted inference at scale), read the GPU and infra economics playbook and the broader guide to AI infrastructure. For the daily moves in models, chips, and tooling, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public.

This guide is part of The Primer, our growing library of ground-up explainers. We re-check every one against the live landscape each month, so the numbers and model names stay current.
