Local AI Core Concepts

TL;DR

Local AI has 6 building blocks: Models (the brain), Parameters (model size/capability), Quantization (compression to fit your hardware), Inference Engines (the runtime), Context Windows (how much text fits), and the OpenAI-compatible API (how apps talk to your model). Master these and you can evaluate any local AI setup.

Concept Map

Here's how the pieces fit together. You pick a model, quantize it to fit your hardware, load it into an inference engine, and interact through a CLI, API, or web UI.

[Diagram: Local AI concept map showing relationships between models, quantization, inference engines, APIs, and UIs]

Explain Like I'm 12

Think of a local AI model like a brain in a jar:

  • Model = the brain itself (Llama, Mistral, etc.)
  • Parameters = how big the brain is (bigger = smarter but heavier)
  • Quantization = shrinking the brain so it fits in your jar (your computer's memory)
  • Inference engine = the life support machine that makes the brain think (Ollama, llama.cpp)
  • Context window = how much the brain can remember at once during a conversation
  • API = the phone line apps use to talk to the brain

Cheat Sheet

| Concept | What It Does | Key Terms |
|---|---|---|
| Model | Pre-trained neural network for text generation | Llama 3, Mistral, Gemma, DeepSeek, Phi, Qwen |
| Parameters | Model size; more means smarter but needs more memory | 1B, 7B, 13B, 70B |
| Quantization | Compress model to use less memory | Q4_K_M, Q5_K_M, Q8_0, GGUF, AWQ, GPTQ |
| Inference Engine | Software that loads and runs the model | Ollama, llama.cpp, vLLM, TGI |
| Context Window | Max tokens the model can see at once | 4K, 8K, 32K, 128K tokens |
| OpenAI-compat API | Standard API so any app can talk to your model | /v1/chat/completions |


The Building Blocks

1. Models

A model is a pre-trained neural network that generates text. Open-source models are released by companies and researchers for anyone to download and run. The landscape evolves fast, but these are the major families:

| Model Family | By | Sizes | Best For |
|---|---|---|---|
| Llama 3 / 3.1 | Meta | 8B, 70B, 405B | General purpose, reasoning, coding |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B | Efficient, fast, multilingual |
| Gemma 2 | Google | 2B, 9B, 27B | Compact, good quality per parameter |
| DeepSeek LLM / Coder | DeepSeek | 6.7B, 33B, 67B | Coding, math, reasoning |
| Phi-3 / Phi-4 | Microsoft | 3.8B, 14B | Small but capable, runs on edge devices |
| Qwen 2.5 | Alibaba | 0.5B – 72B | Multilingual, strong benchmarks |

Tip: Start with Llama 3.1 8B or Gemma 2 9B for your first local model. They're the best quality-to-size ratio for 8 GB hardware. Graduate to 70B when you want near-cloud quality.

2. Parameters & Model Sizes

Parameters are the "knowledge weights" of the model. More parameters = more knowledge and better reasoning, but more memory required.

| Size | FP16 Memory | Q4 Memory | Quality Level |
|---|---|---|---|
| 1-3B | 2-6 GB | 1-2 GB | Basic tasks, autocomplete |
| 7-8B | 14-16 GB | 4-5 GB | Good for chat, coding, summaries |
| 13-14B | 26-28 GB | 8-9 GB | Better reasoning, fewer mistakes |
| 30-34B | 60-68 GB | 18-20 GB | Near-cloud quality |
| 70B | 140 GB | 40 GB | Excellent; approaches GPT-4 on many tasks |

Info: The rule of thumb: each billion parameters needs ~2 GB at FP16 or ~0.6 GB at Q4 (4-bit quantization). Plus overhead for context (KV cache), which grows with longer conversations.
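
The rule of thumb above can be turned into a quick back-of-the-envelope estimator. A minimal sketch; the per-billion figures are the approximations from this section (not exact file sizes), and KV-cache overhead is excluded:

```python
def model_memory_gb(params_billions: float, quant: str = "fp16") -> float:
    """Rough memory needed for model weights alone (no KV cache).

    Rule of thumb: ~2 GB per billion parameters at FP16,
    ~1 GB at Q8 (1 byte per parameter), ~0.6 GB at Q4.
    """
    gb_per_billion = {"fp16": 2.0, "q8": 1.0, "q4": 0.6}
    return params_billions * gb_per_billion[quant]

print(model_memory_gb(8, "q4"))    # an 8B model at Q4 fits in ~5 GB
print(model_memory_gb(70, "fp16")) # 70B at full precision: ~140 GB
```

Compare the results against the table above before buying hardware; real files add a little overhead for metadata and the tokenizer.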

3. Quantization

Quantization compresses a model's weights from high-precision floats (16-bit) to lower-precision integers (8-bit, 4-bit). This is the magic that makes 70B models run on consumer GPUs.

| Quant Level | Bits | Size Reduction | Quality Impact |
|---|---|---|---|
| FP16 (none) | 16 | Baseline | Full quality |
| Q8_0 | 8 | ~50% | Near-zero loss |
| Q5_K_M | 5 | ~69% | Very slight loss |
| Q4_K_M | 4 | ~75% | Minimal loss; best value |
| Q3_K_M | 3 | ~81% | Noticeable degradation |
| Q2_K | 2 | ~87% | Significant loss, not recommended |

Tip: Q4_K_M is the sweet spot for most users: roughly 75% smaller with minimal quality loss. Use Q5_K_M if you have the VRAM, or Q8_0 for near-perfect quality.
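
The size-reduction percentages above are just the bit-width ratio relative to FP16, which a one-liner makes explicit. Note that real K-quants mix bit widths within a file and carry metadata, so actual downloads deviate a little from these idealized numbers:

```python
def size_reduction(bits: float, baseline_bits: float = 16.0) -> float:
    """Fraction of weight memory saved vs FP16, ignoring format overhead."""
    return 1.0 - bits / baseline_bits

for name, bits in [("Q8_0", 8), ("Q5_K_M", 5), ("Q4_K_M", 4), ("Q3_K_M", 3), ("Q2_K", 2)]:
    print(f"{name}: ~{size_reduction(bits):.1%} smaller")
```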

GGUF is the standard file format for quantized models. It's used by llama.cpp, Ollama, and LM Studio. You'll find GGUF files on Hugging Face (the "GitHub of AI models").
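
After a download finishes, you can sanity-check that you actually got a GGUF file by reading its header. Per the published GGUF spec, the file starts with the 4-byte magic `GGUF` followed by a little-endian uint32 format version; this is a sketch of that check, not a full parser:

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic doesn't match."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

A truncated or renamed download (e.g. an HTML error page saved as `.gguf`) fails this check immediately, long before the inference engine tries to load it.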

4. Inference Engines

An inference engine loads the model into memory and generates text. It handles tokenization, GPU acceleration, and the API server.

| Engine | Best For | Key Feature |
|---|---|---|
| Ollama | Easiest setup, general use | One-command install, model library, OpenAI-compat API |
| llama.cpp | Maximum control, advanced users | CPU + GPU inference, GGUF native, highly optimized |
| LM Studio | GUI-first, non-developers | Desktop app with built-in chat UI and model browser |
| vLLM | Production serving, high throughput | PagedAttention, continuous batching, multi-GPU |
| text-generation-inference | Production serving (Hugging Face) | Docker-based, tensor parallelism, flash attention |

Info: Ollama wraps llama.cpp with a user-friendly CLI and model library. If Ollama is "Docker for AI models," then llama.cpp is the underlying engine. Most users should start with Ollama.

5. Context Window

The context window is how much text the model can see at once: your system prompt, conversation history, and the current message all count toward it. It's measured in tokens; a token is roughly 3/4 of a word.
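
The 3/4-of-a-word figure is a heuristic (the real count depends on the model's tokenizer), but it's good enough for sizing a context window. A hypothetical helper:

```python
def words_to_tokens(words: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate: 1 token is ~3/4 of a word, so tokens = words / 0.75."""
    return round(words / words_per_token)

print(words_to_tokens(3000))  # ~4,000 tokens: roughly a 4K context window
```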

```shell
# Check a model's context window in Ollama
ollama show llama3.1 | grep -i "context length"
```

| Context Size | Tokens | Approx. Text | VRAM Overhead |
|---|---|---|---|
| Small | 4,096 | ~3,000 words | Low |
| Medium | 8,192 | ~6,000 words | Moderate |
| Large | 32,768 | ~25,000 words | High |
| Extra-large | 128,000 | ~100,000 words | Very high |

Warning: Larger context windows eat more VRAM (for the KV cache). A 7B model at 4K context might use 5 GB, but at 32K context it could use 8-10 GB. If you're running out of memory, reduce the context window first.
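
The KV-cache growth behind this warning can be estimated directly from the model's architecture: each token stores one key and one value vector per layer. A sketch, assuming a Llama-3.1-8B-style configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and fp16 cache entries:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_entry: int = 2) -> float:
    """Approximate KV-cache size: 2 tensors (K and V) per layer per token,
    each of shape [n_kv_heads, head_dim], at fp16 (2 bytes) by default."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_entry
    return per_token * context_tokens / 1024**3

# Llama-3.1-8B-style config: 32 layers, 8 KV heads, head_dim 128
print(kv_cache_gib(32, 8, 128, 4_096))   # ~0.5 GiB at 4K context
print(kv_cache_gib(32, 8, 128, 32_768))  # ~4 GiB at 32K context
```

This is why shrinking the context window is usually the first fix for out-of-memory errors: the weights are a fixed cost, but the cache scales linearly with context length.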

6. OpenAI-Compatible API

Most local inference engines expose an OpenAI-compatible REST API. This means any app built for ChatGPT (code editors, chatbots, automation tools) can work with your local model by just changing the base URL.

```shell
# Same API format as OpenAI, but pointed at localhost
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain Docker in one sentence"}]
  }'
```

```python
# Python with the official OpenAI library
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Point to Ollama
    api_key="not-needed"                   # Local = no auth
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain Docker in one sentence"}]
)
print(response.choices[0].message.content)
```

Tip: This is incredibly powerful. Tools like Continue (VS Code), Aider, and Open WebUI all support OpenAI-compatible endpoints. Set base_url to your local server and they "just work."

Test Yourself

What does "7B" mean in "Llama 3.1 7B"?

7 billion parameters. Parameters are the learned weights in the neural network. More parameters generally means better quality but requires more memory and compute. A 7B model at Q4 quantization needs roughly 4-5 GB of VRAM.

Why is Q4_K_M the most popular quantization level?

It's the best balance of size and quality. Q4_K_M reduces model size by ~75% while keeping quality nearly indistinguishable from the full-precision model for most tasks. Going lower (Q3, Q2) noticeably degrades output quality; going higher (Q5, Q8) needs more memory for diminishing returns.

What is GGUF?

GGUF (GPT-Generated Unified Format) is the standard file format for quantized models used by llama.cpp, Ollama, and LM Studio. It packages the model weights, tokenizer, and metadata into a single file. You download GGUF files from Hugging Face to run models locally.

How does the OpenAI-compatible API help with local AI adoption?

Any application built for OpenAI's API (ChatGPT) can work with your local model by just changing the base URL from api.openai.com to localhost:11434. This means VS Code extensions, chatbots, and automation tools designed for cloud AI work with local models without code changes.

Why does increasing the context window use more VRAM?

The model needs to store a KV (key-value) cache for every token in the context. Longer context = more tokens to cache = more memory. A 7B model might use 5 GB with 4K context but 10 GB with 32K context. If you're memory-constrained, reduce context before reducing quantization quality.