Local AI Core Concepts

TL;DR

Local AI has 6 building blocks: Models (the brain), Parameters (model size/capability), Quantization (compression to fit your hardware), Inference Engines (the runtime), Context Windows (how much text fits), and the OpenAI-compatible API (how apps talk to your model). Master these and you can evaluate any local AI setup.

Concept Map

Here's how the pieces fit together. You pick a model, quantize it to fit your hardware, load it into an inference engine, and interact through a CLI, API, or web UI.

Explain Like I'm 12

Think of a local AI model like a brain in a jar:

Model = the brain itself (Llama, Mistral, etc.)
Parameters = how big the brain is (bigger = smarter but heavier)
Quantization = shrinking the brain so it fits in your jar (your computer's memory)
Inference engine = the life support machine that makes the brain think (Ollama, llama.cpp)
Context window = how much the brain can remember at once during a conversation
API = the phone line apps use to talk to the brain

Cheat Sheet

Concept	What It Does	Key Terms
Model	Pre-trained neural network for text generation	Llama 3, Mistral, Gemma, DeepSeek, Phi, Qwen
Parameters	Model size — more = smarter but needs more memory	1B, 7B, 13B, 70B
Quantization	Compress model to use less memory	Q4_K_M, Q5_K_M, Q8_0, GGUF, AWQ, GPTQ
Inference Engine	Software that loads and runs the model	Ollama, llama.cpp, vLLM, TGI
Context Window	Max tokens the model can see at once	4K, 8K, 32K, 128K tokens
OpenAI-compat API	Standard API so any app can talk to your model	`/v1/chat/completions`

The Building Blocks

1. Models

A model is a pre-trained neural network that generates text. Open-source models are released by companies and researchers for anyone to download and run. The landscape evolves fast, but these are the major families:

Model Family	By	Sizes	Best For
Llama 3 / 3.1	Meta	8B, 70B, 405B	General purpose, reasoning, coding
Mistral / Mixtral	Mistral AI	7B, 8x7B, 8x22B	Efficient, fast, multilingual
Gemma 2	Google	2B, 9B, 27B	Compact, good quality per parameter
DeepSeek V3 / Coder	DeepSeek	7B, 33B, 67B	Coding, math, reasoning
Phi-3 / Phi-4	Microsoft	3.8B, 14B	Small but capable, runs on edge devices
Qwen 2.5	Alibaba	0.5B – 72B	Multilingual, strong benchmarks

Tip: Start with Llama 3.1 8B or Gemma 2 9B for your first local model. They're the best quality-to-size ratio for 8 GB hardware. Graduate to 70B when you want near-cloud quality.

2. Parameters & Model Sizes

Parameters are the "knowledge weights" of the model. More parameters = more knowledge and better reasoning, but more memory required.

Size	FP16 Memory	Q4 Memory	Quality Level
1-3B	2-6 GB	1-2 GB	Basic tasks, autocomplete
7-8B	14-16 GB	4-5 GB	Good for chat, coding, summaries
13-14B	26-28 GB	8-9 GB	Better reasoning, fewer mistakes
30-34B	60-68 GB	18-20 GB	Near-cloud quality
70B	140 GB	40 GB	Excellent — approaches GPT-4 on many tasks

Info: The rule of thumb: each billion parameters needs ~2 GB at FP16 or ~0.6 GB at Q4 (4-bit quantization). Plus overhead for context (KV cache), which grows with longer conversations.

3. Quantization

Quantization compresses a model's weights from high-precision floats (16-bit) to lower-precision integers (8-bit, 4-bit). This is the magic that makes 70B models run on consumer GPUs.

Quant Level	Bits	Size Reduction	Quality Impact
FP16 (none)	16	Baseline	Full quality
Q8_0	8	~50%	Near-zero loss
Q5_K_M	5	~69%	Very slight loss
Q4_K_M	4	~75%	Minimal loss — best value
Q3_K_M	3	~81%	Noticeable degradation
Q2_K	2	~87%	Significant loss, not recommended

Tip: Q4_K_M is the sweet spot for most users — 75% smaller with minimal quality loss. Use Q5_K_M if you have the VRAM, or Q8_0 for near-perfect quality.

GGUF is the standard file format for quantized models. It's used by llama.cpp, Ollama, and LM Studio. You'll find GGUF files on Hugging Face (the "GitHub of AI models").

4. Inference Engines

An inference engine loads the model into memory and generates text. It handles tokenization, GPU acceleration, and the API server.

Engine	Best For	Key Feature
Ollama	Easiest setup, general use	One-command install, model library, OpenAI-compat API
llama.cpp	Maximum control, advanced users	CPU + GPU inference, GGUF native, highly optimized
LM Studio	GUI-first, non-developers	Desktop app with built-in chat UI and model browser
vLLM	Production serving, high throughput	PagedAttention, continuous batching, multi-GPU
text-generation-inference	Production serving (Hugging Face)	Docker-based, tensor parallelism, flash attention

Info: Ollama wraps llama.cpp with a user-friendly CLI and model library. If Ollama is "Docker for AI models," then llama.cpp is the underlying engine. Most users should start with Ollama.

5. Context Window

The context window is how much text the model can see at once — your system prompt, conversation history, and the current message all count. Measured in tokens (roughly 3/4 of a word).

# Check a model's context window in Ollama
ollama show llama3.1 --modelfile | grep num_ctx

Context Size	Tokens	Approx. Text	VRAM Overhead
Small	4,096	~3,000 words	Low
Medium	8,192	~6,000 words	Moderate
Large	32,768	~25,000 words	High
Extra-large	128,000	~100,000 words	Very high

Warning: Larger context windows eat more VRAM (for the KV cache). A 7B model at 4K context might use 5 GB, but at 32K context it could use 8-10 GB. If you're running out of memory, reduce the context window first.

6. OpenAI-Compatible API

Most local inference engines expose an OpenAI-compatible REST API. This means any app built for ChatGPT (code editors, chatbots, automation tools) can work with your local model by just changing the base URL.

# Same API format as OpenAI, but pointed at localhost
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain Docker in one sentence"}]
  }'

# Python with the official OpenAI library
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Point to Ollama
    api_key="not-needed"                    # Local = no auth
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain Docker in one sentence"}]
)
print(response.choices[0].message.content)

Tip: This is incredibly powerful. Tools like Continue (VS Code), Aider, and Open WebUI all support OpenAI-compatible endpoints. Set base_url to your local server and they "just work."

Test Yourself

What does "7B" mean in "Llama 3.1 7B"?

7 billion parameters. Parameters are the learned weights in the neural network. More parameters generally means better quality but requires more memory and compute. A 7B model at Q4 quantization needs roughly 4-5 GB of VRAM.

Why is Q4_K_M the most popular quantization level?

It's the best balance of size and quality. Q4_K_M reduces model size by ~75% while keeping quality nearly indistinguishable from the full-precision model for most tasks. Going lower (Q3, Q2) noticeably degrades output quality; going higher (Q5, Q8) needs more memory for diminishing returns.

What is GGUF?

GGUF (GPT-Generated Unified Format) is the standard file format for quantized models used by llama.cpp, Ollama, and LM Studio. It packages the model weights, tokenizer, and metadata into a single file. You download GGUF files from Hugging Face to run models locally.

How does the OpenAI-compatible API help with local AI adoption?

Any application built for OpenAI's API (ChatGPT) can work with your local model by just changing the base URL from api.openai.com to localhost:11434. This means VS Code extensions, chatbots, and automation tools designed for cloud AI work with local models without code changes.

Why does increasing the context window use more VRAM?

The model needs to store a KV (key-value) cache for every token in the context. Longer context = more tokens to cache = more memory. A 7B model might use 5 GB with 4K context but 10 GB with 32K context. If you're memory-constrained, reduce context before reducing quantization quality.

Local AI Core Concepts

Concept Map

Cheat Sheet

The Building Blocks

1. Models

2. Parameters & Model Sizes

3. Quantization

4. Inference Engines

5. Context Window

6. OpenAI-Compatible API

Test Yourself

Related Topics