intermediate

LLMs & AI Infrastructure

Learn how LLMs work — from neural networks and transformers to quantization, inference engines, and GPU hardware for self-hosted AI.

45-60 min
Updated 2026-02-23
6 Topics
LLMs · Transformers · Quantization · vLLM · GPU Hardware · Deployment

Learning Outcomes

  1. Explain how Large Language Models process text through neural networks, transformers, and attention mechanisms using everyday analogies
  2. Compare model families (Qwen, DeepSeek, Llama, Mistral) and quantization formats (AWQ, GPTQ, GGUF) to select the right model for a given task
  3. Analyze how VRAM, tensor parallelism, and inference engines (vLLM, llama.cpp) determine what models can run on specific hardware
  4. Evaluate the tradeoffs between quantization levels, model sizes, and quality to make informed deployment decisions
  5. Design a multi-model GPU serving strategy using vLLM sleep mode, PagedAttention, and model-switching workflows
  6. Architect scaling plans from personal hardware to corporate GPU clusters using container orchestration and multi-node inference

Introduction — Why Run Your Own AI?

The Cloud AI Problem

Every time you ask ChatGPT a question, type a prompt into Claude, or use Google's Gemini, your words travel across the internet to a datacenter you do not control. Someone else's servers process your thoughts. Someone else's policies decide what you can and cannot ask.

This is convenient. It is also a compromise.

Three problems define the cloud AI experience: privacy, cost, and freedom.

The Privacy Problem

When you send a message to a cloud AI, you are sharing that data with the provider. Your business plans, personal conversations, code, medical questions — all of it passes through servers owned by OpenAI, Google, or Anthropic. Even with privacy policies, your data exists on infrastructure you cannot audit.

For individuals, this might be acceptable. For businesses handling sensitive data — medical records, legal documents, proprietary code — it can be a dealbreaker.

The Cost Problem

Cloud AI is not free, and it adds up fast.

Service | Monthly Cost | Annual Cost
ChatGPT Plus | $20/month | $240/year
Claude Pro | $20/month | $240/year
GPT-4o API (moderate) | $50-200/month | $600-2,400/year
Enterprise API (heavy) | $500-5,000/month | $6,000-60,000/year

Subscription prices as of Feb 2026. API cost ranges are estimates based on moderate to heavy usage patterns. OpenAI ChatGPT pricing, Anthropic Claude pricing. Last verified 2026-02-23.

For a single user, $20/month feels reasonable. For a team of 50 engineers using AI-assisted coding, the bill becomes $1,000-5,000/month in API costs alone.[est.]

The Freedom Problem

Cloud providers can change their models, raise prices, add content filters, or discontinue services at any time. You have no control over model availability, response quality, or censorship policies.

When OpenAI deprecated GPT-3.5-turbo, applications that depended on it broke. When providers add new safety filters, workflows that previously worked stop functioning.

Self-hosted AI eliminates all three problems. Your data never leaves your hardware. The cost is fixed (electricity + hardware amortization). And the models you download today will work identically forever — no one can change them remotely.

Note

Self-hosted AI is not a replacement for cloud AI in every scenario. Cloud providers offer frontier models (GPT-4o, Claude Opus) with capabilities that currently exceed what runs on consumer hardware. The goal is to understand both options and choose wisely.

The Self-Hosted Alternative

Running your own AI means downloading open-weight language models and serving them on your own GPUs. The software stack looks like this:

Self-Hosted AI Stack

Your Applications (Chat UI, Coding Assistant, Voice)
        ↓ OpenAI-compatible API
Inference Engine (vLLM): serves the model via HTTP API
        ↓ GPU compute
GPU Hardware (Consumer, Prosumer, Datacenter)

Real-World Analogy: Cloud AI is like renting a car every day. Self-hosted AI is like buying your own car — higher upfront cost, but it is always in your driveway, no one reads your GPS history, and it works even when the internet is down.

What Is an LLM? — The Brain Analogy

Language Models: The Core Idea

A Large Language Model is a program that predicts the next word in a sequence. That is it. Every seemingly intelligent conversation, every piece of generated code, every creative story — all of it emerges from a system that is extraordinarily good at one task: given some text, predict what comes next.

When you type "The capital of France is", the model predicts "Paris" because it has seen millions of examples where those words are followed by "Paris." But this simple mechanism, scaled to billions of parameters and trillions of training examples, produces behavior that looks remarkably like understanding.

Neural Networks: Layers of Math That Learn

Under the hood, an LLM is a neural network — a mathematical structure loosely inspired by the brain. A neural network consists of layers of artificial "neurons," each connected to neurons in the next layer.

Neural Network Structure

Input Layer (tokens) → Hidden Layers (billions of connections) → Output Layer (next-word probability distribution over 50,000+ options)

Each connection between neurons has a weight — a number that controls how strongly one neuron influences the next. A model with 32 billion parameters has 32 billion of these weights. During training, each weight is adjusted slightly to make the model's predictions better.

Real-World Analogy: Think of a neural network as a massive telephone switchboard. Each connection has a dial (weight) that controls how strong the signal is. During training, the AI turns these billions of dials slightly, getting better at routing information from question to answer. During inference (when you chat), the dials are locked in place — the AI just uses what it learned.

Neurons, Weights, and Activation Functions

Each neuron performs three simple operations:

  1. Multiply each input by its weight
  2. Sum all the weighted inputs together
  3. Apply an activation function to decide whether to "fire"

The activation function introduces non-linearity — without it, stacking layers would be mathematically equivalent to a single layer. Individually, these operations are trivially simple. The power comes from scale — 32 billion weights organized into hundreds of layers create emergent capabilities that no individual weight explains.
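To make those three operations concrete, here is a toy sketch in Python of a single neuron with a ReLU activation. The inputs, weights, and bias are made-up numbers chosen only for illustration.

Single Neuron Sketch (Python)
def neuron(inputs, weights, bias):
    # 1. Multiply each input by its weight, 2. sum everything up
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # 3. Apply an activation function (ReLU: pass the signal only if positive)
    return max(0.0, weighted_sum)

# Illustrative numbers only
print(neuron([0.5, -1.2, 3.0], [0.8, 0.1, 0.4], bias=0.2))  # 1.68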

Training vs Inference: Learning vs Using

These are the two fundamental phases of any AI model's life:

Training is the learning phase. The model reads enormous amounts of text (trillions of tokens) and adjusts its weights to minimize prediction errors. Training a frontier model costs millions of dollars in GPU compute.

Inference is the using phase. The weights are frozen. You send a prompt, the model processes it through its layers, and it generates a response.

Aspect | Training | Inference
Purpose | Learn patterns from data | Generate responses
Weights | Constantly changing | Frozen (read-only)
Cost | $1M-$100M+ (frontier) | $0.001-$0.06 per 1K tokens
Duration | Weeks to months | Milliseconds to seconds
Hardware | Thousands of GPUs | 1-8 GPUs (for most models)
Who does it | AI labs (OpenAI, Meta, Alibaba) | You (on your own hardware)

Training cost range ($1M-$100M+) and inference cost range ($0.001-$0.06/1K tokens) are industry estimates reflecting frontier model training (e.g., GPT-4, Llama 3) and major API providers. GPU counts and durations are order-of-magnitude estimates. As of early 2026.

Best Practice

You will never need to train a model from scratch. Self-hosted AI is about inference — downloading pre-trained models and running them. This makes the hardware requirements vastly more accessible.

How Training Actually Works: A Simplified Example

Imagine the model sees: "The cat sat on the ___"

  1. The model predicts "table" with 60% confidence
  2. The correct answer was "mat"
  3. The error (loss) is calculated: the model was wrong
  4. Through backpropagation, every weight in the network is adjusted slightly to make "mat" more likely next time
  5. This happens billions of times across trillions of training examples

After seeing enough examples, the model learns not just facts but patterns, reasoning strategies, and even coding conventions.
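As a heavily simplified sketch of that loop, the toy below nudges a single "dial" so the probability of the correct word rises. Real training applies the same idea to billions of weights at once via backpropagation; the stand-in model and numbers here are invented purely for illustration.

Toy Training Step Sketch (Python)
weight = 0.10            # one "dial" in the network
learning_rate = 0.01

def predicted_prob_of_mat(w):
    # stand-in for the whole network: a higher weight means a higher P("mat")
    return min(max(w, 0.0), 1.0)

for step in range(3):
    prob = predicted_prob_of_mat(weight)
    loss = (1.0 - prob) ** 2           # error: we wanted P("mat") to be 1
    gradient = -2.0 * (1.0 - prob)     # derivative of the loss for this toy model
    weight -= learning_rate * gradient # nudge the dial to make "mat" more likely
    print(f"step {step}: P('mat') = {prob:.3f}, loss = {loss:.3f}")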

The Transformer Revolution — How Modern AI Thinks

The Problem with Older Architectures

Before 2017, the dominant architecture for processing text was the Recurrent Neural Network (RNN). These models processed text one word at a time, from left to right, carrying a "hidden state" that summarized everything they had seen so far.

This had a fatal flaw: by the time the model reached the 500th word, information about the 1st word was severely degraded. Long documents turned into mush.

Self-Attention: Looking at Everything Simultaneously

In 2017, a research paper titled "Attention Is All You Need" introduced the transformer architecture. Its key innovation was self-attention — the ability for every word in a sequence to directly attend to every other word, regardless of distance.

RNN vs Transformer Processing

Traditional (RNN)

The → cat → sat → on → the → mat

One word at a time: information decays over distance

Transformer (Self-Attention)

The · cat · sat · on · the · mat (processed all at once)

Every word sees every other word simultaneously

Key, Query, and Value: The Attention Mechanism

Self-attention works through three learned transformations of each word's representation:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information should I share?"

Real-World Analogy: Imagine you are in a library doing research. Your Query is your research question. You scan the Key (title) of every book on the shelf simultaneously. The books with matching keys get high attention scores. Then you read the Value (content) of the highest-scoring books.
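For readers who want to see the arithmetic, the sketch below implements single-head scaled dot-product attention with NumPy on random vectors. The projection matrices that real models learn in order to produce Q, K, and V are omitted; the shapes and values are illustrative only.

Scaled Dot-Product Attention Sketch (Python)
import numpy as np

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how well each Query matches each Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                              # blend Values by attention weight

rng = np.random.default_rng(0)                      # 3 tokens, 4-dimensional vectors
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(self_attention(Q, K, V).shape)                # (3, 4): one blended vector per token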

Multi-Head Attention: Multiple Perspectives

Multi-head attention runs multiple attention mechanisms in parallel, each with different learned weights. A model with 64 attention heads examines the text from 64 different perspectives simultaneously.

Multi-Head Attention

Input: "The developer fixed the bug that crashed the server"

Head 1 (Syntax): developer ↔ fixed
Head 2 (Coreference): bug ↔ that
Head 3 (Causality): crashed ↔ server
Head 4 (Semantics): fixed bug ↔ crashed

…up to 64+ heads analyzing different relationships

Real-World Analogy: Think of reading a mystery novel. Old AI read one word at a time, forgetting earlier clues by the end. Transformers read the entire book at once, with "spotlight teams" (attention heads). One team tracks the suspect, another tracks the murder weapon, another tracks the timeline. They all share notes to solve the mystery together.

Positional Encoding: Knowing Word Order

Since self-attention processes all words simultaneously, it has no inherent sense of order. Positional encoding adds a unique mathematical signature to each position in the sequence. Modern models use Rotary Position Embeddings (RoPE), which encode relative distances between tokens rather than absolute positions.

The AI Timeline

  1. "Attention Is All You Need"

    Transformers invented at Google — the architecture that powers all modern AI.

  2. GPT-3 (175B parameters)

    The scale breakthrough — in-context learning emerges for the first time.

  3. ChatGPT — AI Goes Mainstream

    Transformers meet 100 million users in 2 months.¹ The world changes.

  4. The Open-Source Explosion

    Llama, Mistral, Qwen release competitive models. MoE scales to 685B (DeepSeek V3.2) on consumer budgets.

  5. Reasoning Models Change the Game

    DeepSeek-R1, Qwen-QwQ, and Llama 4 launch with MoE architecture. Chain-of-thought reasoning emerges.

  6. Specialists Beat Generalists

    Open-weight models close the gap with proprietary (80.2% vs 80.9% on SWE-bench).² Model selection shifts from "biggest" to "best specialist for the task." Agentic coding agents go mainstream.

Note

Every model you will encounter — Qwen, Llama, Mistral, DeepSeek, GPT, Claude — is a transformer. The architecture from that 2017 paper is the foundation of the entire modern AI industry.

Tokens, Context Windows, and the KV Cache

What Are Tokens?

LLMs do not process words. They process tokens — subword pieces that balance vocabulary size with representational efficiency.

Tokenization Example
"Hello, world!"        → ["Hello", ",", " world", "!"]        = 4 tokens
"def fibonacci(n):"    → ["def", " fib", "onacci", "(n", "):"] = 5 tokens
"Supercalifragilistic" → ["Super", "cal", "ifrag", "ilistic"]  = 4 tokens
"192.168.1.20"         → ["192", ".", "168", ".", "1", ".", "20"] = 7 tokens

A typical English word averages about 1.3 tokens. Code tends to be more token-dense than prose.
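If you want to see real token splits, the sketch below uses the Hugging Face transformers library (assumed to be installed); the model ID is just one example, and exact splits vary from tokenizer to tokenizer.

Tokenizer Inspection Sketch (Python)
from transformers import AutoTokenizer

# Any Hugging Face model ID with a tokenizer works; this one is an example.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

for text in ["Hello, world!", "def fibonacci(n):", "192.168.1.20"]:
    tokens = tok.tokenize(text)
    print(f"{text!r} -> {tokens} = {len(tokens)} tokens")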

Context Windows: How Much the AI Remembers

The context window is the total amount of text the model can process in a single interaction — your input prompt plus the model's response. Everything outside the context window simply does not exist to the model.

Model | Context Window | ~Word Count | Use Case
GPT-3 (2020) | 4,096 tokens | ~3,000 | Short conversations
Llama 3.3 70B | 131,072 tokens | ~98,000 | Documents, entire codebases
Qwen2.5-Coder-32B | 131,072 tokens | ~98,000 | Entire codebases
Qwen3-Coder (Next) | 262,144 tokens | ~196,000 | Massive repositories
Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 | Books, video transcripts

GPT-3 context is historical. Meta Llama 3.3 model card, Qwen2.5-Coder model card, Google Gemini docs. Last verified 2026-02-23.

Real-World Analogy: The context window is like a desk. A 4K context window is a school desk — you can fit a few pages. A 128K context window is a conference table — you can spread out entire codebases. A 1M context window is a warehouse floor.

The KV Cache: Memory for Attention

When the model generates text, it needs to compute attention between the new token and every previous token. The Key-Value (KV) cache stores the Key and Value matrices for all previous tokens. When generating the next token, the model only needs to compute the new token's Query and compare it against cached Keys — no re-computation needed.

KV Cache Optimization

Without KV Cache

Token 1: process 1 token

Token 2: re-process 1+2 = 3 ops

Token 3: re-process 1+2+3 = 6 ops

Token 1000: 500,500 ops

O(n²) — catastrophically slow

With KV Cache

Token 1: process 1, cache K₁V₁

Token 2: process 1, lookup = 2 ops

Token 3: process 1, lookup = 2 ops

Token 1000: 2 ops

O(n) — linear, fast
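A toy generation loop makes the caching idea concrete: each step computes Key/Value vectors only for the newest token and appends them to the cache instead of recomputing everything. The learned projections are replaced by identity stand-ins and all numbers are illustrative.

KV Cache Sketch (Python)
import numpy as np

d = 8                              # head dimension (illustrative)
k_cache, v_cache = [], []          # grows by one entry per generated token
rng = np.random.default_rng(0)

def generate_step(new_token_vector):
    k_cache.append(new_token_vector)          # stand-in for W_K @ embedding
    v_cache.append(new_token_vector)          # stand-in for W_V @ embedding
    q = new_token_vector                      # stand-in for W_Q @ embedding
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)               # attend over ALL cached tokens
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ V                        # attention output for the new token

for _ in range(5):
    generate_step(rng.normal(size=d))
print(f"cache holds {len(k_cache)} K/V entries")   # 5: one per generated token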

Temperature and Sampling: Controlling Creativity

Temperature scales the probability distribution:

  • Temperature 0.0: Always picks the highest-probability token (deterministic)
  • Temperature 0.7: Moderate randomness (good default for most tasks)
  • Temperature 1.5: High randomness (creative, sometimes incoherent)
Temperature Effects Example
Prompt: "The programmer wrote a function to"

Temperature 0.0 → "calculate the sum of two numbers"     (predictable)
Temperature 0.7 → "parse JSON data from the API response" (balanced)
Temperature 1.5 → "transcend the boundaries of recursion"  (creative)

For coding tasks, low temperature (0.1-0.3) produces more reliable, deterministic output. For creative writing, higher temperature (0.7-1.0) adds variety.
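The sketch below shows how temperature reshapes the next-token distribution before sampling. The three candidate words and their logits are invented for illustration.

Temperature Scaling Sketch (Python)
import numpy as np

vocab = ["calculate", "parse", "transcend"]
logits = np.array([2.0, 1.5, 0.2])             # raw model scores (made up)

def probs(logits, temperature):
    scaled = logits / max(temperature, 1e-6)   # temperature near 0 approaches argmax
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

for T in (0.1, 0.7, 1.5):
    p = probs(logits, T)
    print(f"T={T}: " + ", ".join(f"{w}={pi:.2f}" for w, pi in zip(vocab, p)))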

Parameters, Model Sizes, and the MoE Revolution

What Are Parameters?

Parameters are the learnable weights in a neural network — the billions of “dials” we discussed in Section 2. When someone says “Qwen2.5-Coder-32B,” the “32B” means 32 billion parameters.

More parameters means more capacity to store knowledge and patterns. But it also means more VRAM required, slower inference, and diminishing returns at extreme scale.

Parameter Count vs Intelligence

Here is the counterintuitive truth: more parameters does not always mean smarter.

A 32B coding specialist might score 92.7% on HumanEval (a coding benchmark). A model with far more parameters, like GPT-4o, scores roughly the same on coding tasks. How can a smaller model match a larger one? Specialization.

Factor | Effect on Quality
Training data quality | More impactful than model size
Data specialization | Coding models trained on trillions of code tokens
Architecture optimization | Better attention patterns, efficient heads
Training recipe | Learning rate schedules, curriculum ordering
Parameter count | Important but not decisive

A 32B model trained exclusively on high-quality code outperforms a 70B model trained on general internet text — at coding tasks. The specialist beats the generalist at its specialty.

Real-World Analogy

A 32B coding model is like a surgeon who spent 10 years studying only hearts. A 70B general model is like a doctor who studied everything. For heart surgery, you want the specialist. For a general checkup, the generalist is fine.

Dense vs Mixture of Experts (MoE)

In a dense model, every parameter is active for every token. A 70B dense model processes every token through all 70 billion parameters.

In a Mixture of Experts (MoE) model, only a fraction of parameters are active per token. A router network decides which “experts” (subsets of the model) handle each token.

Dense vs MoE Architecture

Dense Model (70B)

All 70B parameters active for every token

100% active — maximum VRAM, slower

MoE Model (Mixtral 8×7B)

Router picks 2 of 8 experts per token

~28% of parameters active per token: less compute, faster generation

MoE achieves the knowledge capacity of a large model with the compute cost of a smaller one. Mixtral 8x7B has 47B total parameters but only activates 13B per token, giving it the speed of a 13B model with the knowledge of a much larger one.
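A toy routing layer shows the mechanic: a router scores every expert for the incoming token and only the top-k experts actually compute. The random matrices below stand in for learned router and expert weights; nothing here reflects Mixtral's real dimensions.

MoE Routing Sketch (Python)
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d = 8, 2, 16
router = rng.normal(size=(d, num_experts))                        # stand-in router weights
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]   # stand-in expert layers

def moe_layer(token):
    scores = token @ router                        # one score per expert
    chosen = np.argsort(scores)[-top_k:]           # keep only the top-k experts
    gates = np.exp(scores[chosen]); gates /= gates.sum()
    # only the chosen experts do any work for this token
    return sum(g * (token @ experts[i]) for g, i in zip(gates, chosen))

out = moe_layer(rng.normal(size=d))
print(out.shape)    # (16,): same output size, but only 2 of 8 experts computed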

Hospital Analogy

A dense model is like having every doctor examine every patient — thorough but slow and expensive. An MoE model is like a triage system — the receptionist (router) sends you to the right specialist (expert). Fewer doctors work on each case, but they are the RIGHT doctors, so care quality stays high while throughput doubles.

The MoE Tradeoff

Aspect | Dense | MoE
Total parameters | All active | Many, but most idle
Active parameters | 100% | 10-30% per token
VRAM (weights) | Full model size | Full model size (all experts loaded)
VRAM (compute) | Full model | Only active experts
Inference speed | Proportional to total params | Proportional to active params
Knowledge capacity | Limited by total params | Larger (distributed across experts)
Best example | Llama 3.3 70B | DeepSeek V3.2 (685B total, ~37B active)

MoE Range

The range of MoE architectures is enormous. DeepSeek V3.2 has 685B parameters but activates only 37B per token. Qwen3-Coder-Next has 80B parameters but activates only 3B per token — making it fast enough to run on consumer GPUs while retaining the knowledge capacity of a much larger model.

Scaling Laws and Diminishing Returns

Research by OpenAI and others has established scaling laws: predictable relationships between model size, training data, compute budget, and performance.

The key finding: doubling model size does not double performance. Each doubling yields progressively smaller improvements. Going from 7B to 13B is a massive jump. Going from 70B to 140B is a modest one.

This is why the industry is shifting focus from “make it bigger” to “make it smarter”:

  • Better training data (quality over quantity)
  • Specialized training (code, math, reasoning)
  • Architectural innovations (MoE, efficient attention)
  • Post-training optimization (RLHF, DPO, reasoning chains)

Quantization Deep Dive — Shrinking Models Without Losing Intelligence

What Is Quantization?

Neural network weights are numbers. In full precision, each weight is stored as a 32-bit floating-point number (FP32), using 4 bytes of storage. Quantization reduces this precision — storing weights in fewer bits — to shrink the model and speed up inference.

A 32B parameter model in FP32 requires:

VRAM Calculation
32,000,000,000 parameters × 4 bytes = 128 GB

That does not fit on any single consumer GPU. Quantization is what makes large models accessible.

The Precision Ladder

Precision Ladder (size of a 32B-parameter model at each precision):

  • FP32: 128 GB
  • FP16/BF16: 64 GB
  • INT8: 32 GB
  • INT4/NF4: 16 GB
  • INT3: 12 GB
  • INT2: 8 GB

Image Compression Analogy

Quantization is like image compression. A RAW photo (FP32) is 25MB with every pixel perfect. JPEG at 95% quality (FP16) is 5MB and looks identical. JPEG at 80% quality (INT8) is 2MB and you might notice artifacts in extreme zoom. JPEG at 50% quality (INT4) is 500KB — great for thumbnails, do not print it on a billboard. For AI models, modern INT4 methods like AWQ are more like JPEG at 90% — the compression algorithm is smart enough to keep what matters.

AWQ: Activation-Aware Weight Quantization

AWQ is a state-of-the-art 4-bit quantization method developed by MIT researchers. It works by observing which weights matter most during actual model inference (the “activation-aware” part) and protecting those weights from aggressive quantization.

How AWQ works:

  1. Run the model on calibration data (real prompts and responses)
  2. Identify which weights produce the largest activations (most important weights)
  3. Scale important weights up before quantization (protecting them from precision loss)
  4. Quantize all weights to 4-bit
  5. During inference, the scaling is reversed mathematically

The result: 75% reduction in model size with minimal quality degradation.
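The sketch below is not AWQ itself (it omits the calibration data and activation-aware scaling), but it shows the basic quantize/dequantize step every 4-bit method shares and where the rounding error comes from. The weight values are random placeholders.

Round-to-Nearest 4-Bit Quantization Sketch (Python)
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=8).astype(np.float32)       # pretend layer weights

scale = np.abs(weights).max() / 7                  # map into the signed int4 range [-7, 7]
q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)     # what gets stored (4-bit)
restored = q.astype(np.float32) * scale            # what inference multiplies with

print("original :", np.round(weights, 4))
print("restored :", np.round(restored, 4))
print("max error:", float(np.abs(weights - restored).max()))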

AWQ Quality Preservation

Perplexity values are representative of AWQ quantization impact.
Metric | FP16 (Baseline) | AWQ (4-bit) | Degradation
HumanEval (coding) | 92.7% | 92.7% | 0.0%
MBPP (coding) | 90.2% | 89.8% | 0.4%
Perplexity | 5.12 | 5.39 | 5.3%
Model Size | ~64 GB | ~18 GB | -72%

Benchmark data for Qwen2.5-Coder-32B-Instruct. Qwen2.5-Coder-32B-Instruct-AWQ model card. Last verified 2026-02-23.

Zero degradation on HumanEval. The coding benchmark scores are identical because AWQ preserves the weights that matter for code generation.

Quantization Format Comparison

Format | Bits | Quality | Speed | GPU Support | Best For
AWQ | 4-bit | Excellent | Very Fast (Marlin) | NVIDIA only | Production serving via vLLM
GPTQ | 4-bit | Good | Fast | NVIDIA only | Alternative to AWQ, broad support
GGUF | 2-8 bit | Variable | Medium | CPU + GPU | llama.cpp, Ollama, Apple Silicon
EXL2 | Variable | Excellent | Very Fast | NVIDIA only | ExLlamaV2, flexible bit allocation
BNB (NF4) | 4-bit | Good | Medium | NVIDIA | QLoRA fine-tuning

When to use which:

  • AWQ + vLLM: Best choice for NVIDIA GPU production serving. This is what most dual-GPU setups should use.
  • GGUF + llama.cpp: Best for Apple Silicon (M-series Macs) or CPU-only servers. Also good for mixed CPU+GPU inference.
  • EXL2: Best for single-GPU setups with ExLlamaV2. Offers per-layer bit allocation for maximum quality at a target size.
  • GPTQ: Legacy format, still widely available. Use when AWQ is not available for your model.

Marlin Kernel Acceleration

Marlin is an NVIDIA-optimized CUDA kernel specifically designed for 4-bit quantized inference. It is not a quantization method — it is a speed optimization for already-quantized models.

Standard 4-bit inference: the GPU must unpack 4-bit weights to 16-bit, compute, then handle the results. Marlin performs the computation directly on 4-bit data using specialized CUDA instructions, eliminating the unpack step.

Metric | Standard 4-bit | Marlin 4-bit | Improvement
Throughput (tokens/sec) | 68 | 741 | 10.9x faster
Latency per token | 14.7ms | 1.35ms | 10.9x faster
Quality | Baseline | Identical | 0% change

Throughput measured on A100 80GB with Llama-class models. Exact numbers depend on model architecture, batch size, and GPU. IST-DASLab Marlin GitHub. Last verified 2026-02-23.

Critical point: Marlin changes speed, not quality. The model produces identical outputs whether using Marlin or standard kernels. It is purely a computational optimization.

vLLM automatically uses Marlin kernels when serving AWQ-quantized models on supported NVIDIA GPUs.

Best Practice

For NVIDIA GPU deployments, always use AWQ-quantized models with vLLM. You get Marlin acceleration automatically — near-full quality at 4x less VRAM and 10x+ faster throughput.
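As a minimal sketch of that best practice, the snippet below loads an AWQ checkpoint through vLLM's offline Python API. The model name, tensor_parallel_size, and memory setting are examples to adapt to your hardware, and flag names can shift between vLLM versions.

vLLM AWQ Loading Sketch (Python)
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",   # AWQ checkpoint from Hugging Face
    quantization="awq",                            # Marlin kernels are picked up when supported
    tensor_parallel_size=2,                        # split the model across 2 GPUs (TP=2)
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Write a Python function that reverses a string."], params)
print(outputs[0].outputs[0].text)

Recent vLLM releases expose the same configuration as an OpenAI-compatible HTTP server (for example via the vllm serve command with matching flags), which is the mode used for multi-user serving.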

Inference Engines — The Software That Makes It Fast

What Is an Inference Engine?

An inference engine is the software that loads a model into GPU memory and serves it via an API. Think of it as the “runtime” — just like Python is the runtime for Python scripts, vLLM is the runtime for language models.

The inference engine handles:

  • Loading model weights into GPU VRAM
  • Processing incoming prompts (tokenization, attention computation)
  • Managing the KV cache across multiple concurrent requests
  • Returning generated text via an API endpoint

vLLM: The Production Standard

vLLM (pronounced “v-L-L-M”) is the most widely used inference engine for GPU-based deployments. Created at UC Berkeley, it introduced two key innovations that make it dramatically faster than alternatives.

PagedAttention: Traditional inference engines allocate a contiguous block of GPU memory for each request’s KV cache. If the request might need up to 128K tokens, the engine reserves memory for 128K tokens — even if the actual conversation only uses 2K tokens. This wastes enormous amounts of VRAM.

PagedAttention borrows the concept of virtual memory from operating systems. Instead of contiguous allocation, it stores KV cache entries in fixed-size “pages” scattered across GPU memory. Pages are allocated on demand and freed immediately when no longer needed.

PagedAttention vs Traditional Memory Allocation

Traditional Memory Allocation: Request 1 uses 35% of its reservation, Request 2 uses 20%, and Request 3 is rejected because no contiguous space remains. Total VRAM utilization: ~25%.

PagedAttention: pages belonging to R1, R2, and R3 are interleaved wherever free pages exist, with only a small pool left free. Total VRAM utilization: ~85%, and Request 3 is accepted.

Airline Seat Analogy

Traditional memory allocation is like reserving an entire row in a movie theater for your group, even if only 3 seats are used. PagedAttention is like airline seat assignment — any empty seat anywhere in the plane can be assigned to any passenger. This means vLLM can serve 2-4x more concurrent users with the same GPU memory.
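The toy allocator below captures the core PagedAttention idea: KV-cache memory is carved into fixed-size pages, a request grabs a new page only when its current one fills, and finished requests hand their pages straight back. Page and pool sizes are invented for illustration.

Paged KV-Cache Allocation Sketch (Python)
PAGE_SIZE = 16                       # tokens per page (illustrative)
free_pages = list(range(64))         # 64 pages of KV-cache memory
page_tables = {}                     # request id -> list of page ids
token_counts = {}                    # request id -> tokens generated so far

def append_token(request_id):
    n = token_counts.get(request_id, 0)
    if n % PAGE_SIZE == 0:                         # current page full (or first token)
        if not free_pages:
            raise MemoryError("no free pages; request must wait")
        page_tables.setdefault(request_id, []).append(free_pages.pop())
    token_counts[request_id] = n + 1

def finish(request_id):
    free_pages.extend(page_tables.pop(request_id, []))   # pages reusable immediately
    token_counts.pop(request_id, None)

for _ in range(40): append_token("req-1")          # 40 tokens -> 3 pages
for _ in range(5):  append_token("req-2")          # 5 tokens  -> 1 page
finish("req-1")
print(len(free_pages), "pages free")               # 63: req-1's pages were returned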

Continuous Batching: Instead of waiting for all requests in a batch to finish before starting new ones, vLLM adds new requests to the running batch as old ones complete. This keeps GPU utilization high even with variable-length requests.

vLLM Performance

Metric | vLLM | Ollama | Improvement
Concurrent throughput (16 users) | 793 tokens/sec | 41 tokens/sec | 19.3x
P99 latency | 80ms | 1,200ms | 15x
Max concurrent users | 50+ | 3-5 | 10x+
GPU memory efficiency | 85-95% | 40-60% | ~2x

Numbers measured with AWQ-quantized 32B-class models on A100/H100 GPUs. Concurrent throughput assumes 16 simultaneous users. Exact performance varies by model, hardware, and workload. vLLM documentation. Last verified 2026-02-23.

vLLM is designed for production serving — multiple concurrent users with guaranteed low latency. Ollama is designed for single-user simplicity.

Tensor Parallelism: Splitting Across GPUs

When a model is too large for a single GPU, tensor parallelism (TP) splits the model across multiple GPUs. Each GPU holds a portion of each layer and they communicate to produce the final result.

Tensor Parallelism

Single GPU (Model fits entirely)

GPU 0 (32GB)

All 64 layers loaded

Full model weights

Tensor Parallel = 2 (Model split across 2 GPUs)

GPU 0 (32GB)

Left half of weights

64 layers × 50%

PCIe / NVLink

GPU 1 (32GB)

Right half of weights

64 layers × 50%

Combined: 64GB usable VRAM

Mural Painting Analogy

Imagine a team of two artists painting one large mural. Rather than each painting the whole thing (which would not fit on one easel), they split the canvas in half. Artist 1 paints the left side, Artist 2 paints the right side, and they coordinate at the seam. That is tensor parallelism — two GPUs each hold half the model and compute in parallel.

Interconnect Bandwidth

Interconnect | Bandwidth | Typical Use
PCIe 5.0 | 64 GB/s per direction | Consumer GPUs (RTX 5090)
NVLink (H100) | 900 GB/s | Datacenter GPUs
NVSwitch (GB200) | 1,800 GB/s | Multi-GPU fabric

For inference, PCIe 5.0 bandwidth is sufficient — the bottleneck is VRAM, not interconnect speed. For training, NVLink’s 14x higher bandwidth becomes essential.

vLLM Sleep Mode: Instant Model Switching

A breakthrough feature in vLLM 0.8+ is sleep mode, which enables near-instant switching between different models on the same GPU.

Sleep Mode vs Traditional Model Switching

Traditional

1. Stop model (5–10s)

2. Unload VRAM (10–20s)

3. Load from disk (30–120s)

4. Warm up (5–10s)

50–160 seconds

Sleep Mode

1. Offload to CPU RAM (1–3s)

2. Load new weights (3–10s)

4–13 seconds

10–40× faster

L1 Sleep

1. Offload to CPU RAM (0.1–0.5s)

2. Reload from RAM (0.5–6s)

0.6–6.5 seconds

18–200× faster

Sleep Mode API (HTTP)
# Put current model to sleep (offload to CPU RAM)
POST /v1/sleep
{"level": "l1"}   # L1 = weights in CPU RAM (fast reload)

# Wake up and reload
POST /v1/wake

This is a game-changer for setups with limited VRAM that need to switch between models — for example, switching between a coding model during work hours and a general chat model in the evening.

Engine Comparison

Feature | vLLM | llama.cpp | Ollama | TensorRT-LLM
Primary Use | Production GPU serving | Universal (CPU/GPU) | Easy local use | Maximum NVIDIA speed
GPU Support | NVIDIA (CUDA) | NVIDIA, AMD, Apple | NVIDIA, Apple | NVIDIA only
CPU Support | No | Yes (excellent) | Yes (via llama.cpp) | No
Multi-GPU | Tensor parallelism | Limited | No | Full support
Concurrent Users | 50+ | 1-3 | 1-3 | 50+
Quantization | AWQ, GPTQ, FP8 | GGUF (2-8 bit) | GGUF | FP8, INT4
Setup Complexity | Medium | Low | Very Low | High
Best For | Multi-user production | Mac/CPU deployments | Personal single-user | Maximum performance

Note

Ollama uses llama.cpp under the hood and adds a friendly CLI interface. It is excellent for getting started quickly but not designed for production multi-user serving.

Model Families and How to Choose

Major Open-Weight Model Families

The open-weight model ecosystem has exploded since Meta released Llama 2 in 2023. Dozens of families now compete across coding, reasoning, and general intelligence. Here are the most significant families as of early 2026:

Qwen (Alibaba Cloud, China)

The Qwen family from Alibaba’s research lab has become a dominant force in open-weight AI. Qwen2.5 and Qwen3 models consistently top benchmarks across coding, math, and general intelligence.

  • Qwen2.5-Coder-32B-Instruct: A strong code generation specialist. Trained on 5.5 trillion code tokens. Scores 92.7% on HumanEval. 128K context window fits entire codebases. Best suited for code completion and generation tasks rather than agentic tool-use workflows.
  • Qwen3-32B: General-purpose powerhouse with hybrid thinking modes (can switch between fast response and deep reasoning).
  • Qwen3-Coder-Next: 80B MoE model (only ~3B active per token) with 262K context. Explicitly trained for agentic coding with tool-use and recovery behaviors.
  • Qwen3-Coder-30B-A3B: Mid-range agentic coding model (30B total, 3B active). Strong tool-calling capability with smaller VRAM footprint.

License: Apache 2.0 (fully permissive, commercial use allowed, no restrictions).

DeepSeek (DeepSeek AI, China)

DeepSeek made headlines with V3 and the R1 reasoning model, both trained at a fraction of typical costs.

  • DeepSeek-R1-Distill-Qwen-32B: A 32B model distilled from the 671B R1 reasoning model. Inherits deep chain-of-thought reasoning. Beats OpenAI’s o1-mini on math and logic benchmarks.[3]
  • DeepSeek V3.2: 685B MoE (37B active per token). Scores 73.1% on SWE-bench Verified and 74.2% on Aider Polyglot — the highest agentic coding scores among open-weight models.[4]
  • DeepSeek V3.2-Speciale: Variant optimized for specialized tasks with enhanced reasoning.

License: MIT (fully permissive).

Llama (Meta, USA)

Meta’s Llama family established the open-weight movement. Llama 3.3 and 4.0 represent the latest generations.

  • Llama 3.3 70B: The general-purpose workhorse. Strong across all tasks, well-supported by every inference engine. Reliable tool-calling support with the llama3_json parser in vLLM.
  • Llama 4 Scout (109B MoE): 109B total parameters, ~17B active. Latest-generation model with improved reasoning and native function calling support.
  • Llama 4 Maverick (400B MoE): 400B total parameters, ~17B active. Flagship model requiring multi-GPU or aggressive quantization for consumer deployment.

License: Llama Community License (free for commercial use under 700M monthly active users).

Mistral (Mistral AI, France)

  • Mixtral 8x7B: 47B total parameters, 13B active. The original “cheap but capable” MoE model.
  • Mistral Large (2): 123B dense model, competitive with GPT-4.
  • Mistral Small 3.1 24B: Efficient model for resource-constrained setups.

License: Apache 2.0.

Yi (01.AI, China)

  • Yi-Coder-9B: Good coding quality for its size. Fits on a single 12GB GPU.
  • Yi-34B: Strong general model at a moderate size.

License: Apache 2.0.

GLM (Zhipu AI, China)

Zhipu AI’s GLM family focuses on bilingual Chinese/English capability and has emerged as a strong coding contender.

  • GLM-4.7: 120B+ parameters with strong coding and reasoning. Scores 94.2% on HumanEval.
  • GLM-4.7-Flash: Lighter variant with thinking and tool-calling capabilities. Scores 59.2% on SWE-bench Verified.
  • GLM-5: 744B MoE (40B active). Scores 77.8% on SWE-bench — one of the highest among open models.[5]

License: MIT.

MiniMax (MiniMax AI, China)

  • MiniMax M2.5: Scores 80.2% on SWE-bench Verified — the highest among all open-weight models, within 0.7 points of Claude Opus 4.5.[2]
  • MiniMax PRISM: Official uncensored variant for unrestricted use cases.

License: MIT.

Kimi (Moonshot AI, China)

  • Kimi-Dev-72B: Strong development-focused model with function-calling support.
  • Kimi K2 / K2.5: Latest generation. K2.5 scores 76.8% on SWE-bench Verified.

License: Modified MIT (commercial use allowed; attribution required above 100M users).

GPT-OSS (OpenAI, USA)

OpenAI’s first open-weight releases, focused on accessibility and transparency.

  • gpt-oss-20b: Compact model requiring only 16GB VRAM in its native MXFP4 format. Designed for reliable tool calling.
  • gpt-oss-120b: Larger variant requiring ~80GB VRAM. Stronger coding capability.

License: Apache 2.0.

IBM Granite (IBM, USA)

  • Granite 3.3 8B / 34B: Optimized for enterprise deployment, SQL generation, and business analytics.
  • Granite 4.0: Latest generation with improved code generation.
  • Trained exclusively on license-permissible data — the safest choice for IP-sensitive deployments.

License: Apache 2.0.

Microsoft Phi-4 (Microsoft, USA)

  • Phi-4 (14B): Matches much larger models on reasoning benchmarks despite its compact size. Only ~7GB VRAM with AWQ quantization.
  • Phi-4-mini (3.8B): Ultra-compact with built-in function calling support.

License: MIT.

The Comprehensive Model Comparison

Model | Params | Active | Context | HumanEval | SWE-bench | VRAM (AWQ) | Best For
Qwen3-Coder-Next | 80B | ~3B | 262K | ~93% | 70.6% | ~48GB | Agentic coding
Qwen3-Coder-30B-A3B | 30B | ~3B | 128K | ~90% | -- | ~18GB | Lightweight agentic
Qwen2.5-Coder-32B | 32B | 32B | 128K | 92.7% | -- | ~18GB | Code generation
DeepSeek-R1-Distill-32B | 32B | 32B | 32K | 79.2% | -- | ~18GB | Reasoning, math
Llama 3.3 70B | 70B | 70B | 128K | 88.4% | -- | ~40GB | General purpose
Llama 4 Scout | 109B | ~17B | 128K | ~89% | -- | ~55GB | Latest general
GLM-4.7 | 120B+ | 120B+ | 128K | 94.2% | -- | ~60GB+ | Bilingual coding
MiniMax M2.5 | -- | -- | -- | -- | 80.2% | -- | Maximum SWE-bench
Phi-4 | 14B | 14B | 16K | ~82% | -- | ~7GB | Compact reasoning
GPT-OSS-20B | 20B | 20B | -- | -- | 34.0% | ~16GB | Reliable tool calling
IBM Granite 34B | 34B | 34B | 32K | ~86% | -- | ~18GB | Enterprise, IP-safe
DeepSeek V3.2 | 685B | 37B | 128K | -- | 73.1% | ~350GB | Maximum quality

HuggingFace model cards for each model. SWE-bench scores from SWE-bench Verified leaderboard. VRAM estimates assume AWQ 4-bit quantization. Llama 3.3, DeepSeek V3, GLM-5, Qwen2.5-Coder, SWE-bench Verified leaderboard. Last verified 2026-02-23.

Commercial Model Comparison

Model | Provider | HumanEval | Cost per 1M tokens | Privacy | Customizable
GPT-4o | OpenAI | ~92% | $2.50 / $10 | No | No
Claude Opus 4.5 | Anthropic | ~90% | $5 / $25 | No | No
Gemini 2.5 Pro | Google | ~88% | $1.25 / $10 | No | No
Qwen2.5-Coder-32B | Self-hosted | 92.7% | $0 (hardware cost) | Yes | Yes
DeepSeek-R1-32B | Self-hosted | 79.2% | $0 (hardware cost) | Yes | Yes

API pricing (input/output per 1M tokens). HumanEval scores from respective model cards. Pricing is highly fluid — check provider pages for current rates. As of Feb 2026. OpenAI pricing, Anthropic pricing, Google Gemini pricing. Last verified 2026-02-23.

The self-hosted models match or exceed commercial APIs on coding tasks, with zero per-token costs and full data privacy. The tradeoff is upfront hardware investment and maintenance responsibility.

Understanding Licenses

License | Commercial Use | Modify | Distribute | Patent Grant | Notable Models
Apache 2.0 | Yes | Yes | Yes | Yes | Qwen, Mistral, Yi
MIT | Yes | Yes | Yes | No | DeepSeek
Llama Community | Yes (< 700M MAU) | Yes | Yes | No | Llama 3
Proprietary | Via API only | No | No | No | GPT-4, Claude

Open-weight vs open-source: “Open-weight” means the model weights are publicly available. “Open-source” means the weights, training code, AND training data are all available. Most “open” models are actually open-weight — the training process is proprietary.

Apache 2.0 Gold Standard

Apache 2.0 is the gold standard for open models. It grants a patent license (protecting you from patent lawsuits by the model creator), allows commercial use with no restrictions, and requires only that you include the license notice.

How to Choose: The Decision Framework

Step 1: Define Your Task

  • Code generation (autocomplete, scaffolding) → Qwen2.5-Coder-32B or Yi-Coder-9B
  • Agentic coding (tool use, file editing, debugging) → Qwen3-Coder-Next or Qwen3-Coder-30B-A3B
  • Reasoning/Math → DeepSeek-R1-Distill-32B (chain-of-thought reasoning)
  • General chat → Qwen3-32B, Llama 4 Scout, or Llama 3.3 70B
  • Lightweight/fast → Phi-4 or Yi-Coder-9B (5-7GB VRAM)
  • Enterprise (IP-sensitive) → IBM Granite (license-permissible training data)

Step 2: Check VRAM Budget

  • Single 12GB GPU → Yi-Coder-9B, Phi-4, or smaller
  • Single 24GB GPU → Qwen2.5-Coder-32B-AWQ, Qwen3-Coder-30B-A3B-AWQ, DeepSeek-R1-32B-AWQ
  • Dual 32GB GPUs → Qwen3-Coder-Next-AWQ (80B MoE), Llama 3.3 70B-AWQ, Llama 4 Scout-AWQ (all TP=2)
  • 80GB+ (H100) → Up to 120B FP16, 400B+ in AWQ

Step 3: Consider Context Needs

  • Short conversations (< 4K) → Any model
  • Code files (8K-32K) → Need 32K+ context model
  • Full codebases (32K-128K) → Qwen2.5-Coder, Qwen3-32B, Llama (128K native)
  • Massive repositories (128K-262K) → Qwen3-Coder-Next (262K context)
  • Ultra-long documents (1M+) → MiniMax, Kimi (API), Gemini API

Vehicle Analogy

Choosing an LLM is like choosing a vehicle for specific terrain. A code generation model is a Formula 1 car — unbeatable on a smooth racetrack (writing code), but it cannot navigate rough terrain (agentic tool use). An agentic model is a rally car — handles varied terrain. Llama 3.3 70B is an SUV — comfortable everywhere, master of nothing specific. DeepSeek-R1 is a chess grandmaster’s limousine — takes longer to respond but the answer is deeply thought through. The key is matching vehicle to terrain.

GPU Hardware — From Gaming Cards to Datacenter Supercomputers

GPU Architecture Fundamentals

A GPU (Graphics Processing Unit) was originally designed to render video game graphics — thousands of simple calculations in parallel. It turns out this same architecture is perfect for neural network inference, which also requires massive parallelism.

Modern AI GPUs contain three types of processing units:

  • CUDA Cores / Stream Processors: General-purpose parallel processors. Handle the basic matrix multiplications that drive neural network computation. Thousands per GPU.
  • Tensor Cores / Matrix Accelerators: Specialized units that perform matrix multiply-and-accumulate operations in a single clock cycle. 4-16x faster than CUDA cores for AI workloads. This is where actual inference math happens.
  • VRAM (Video RAM): High-bandwidth memory attached directly to the GPU. This is where model weights, the KV cache, and intermediate computations live. VRAM is the single most important specification for AI inference.

Why VRAM Is the Bottleneck

For AI inference, the processing pipeline is:

  1. Load model weights from VRAM into compute units
  2. Load input tokens
  3. Compute attention and feed-forward layers
  4. Store KV cache entries back to VRAM
  5. Output next token

The bottleneck is almost always step 1 — moving data from VRAM to compute units. This is called being memory-bandwidth-bound. More VRAM means you can load larger models. Faster VRAM bandwidth means you can feed the compute units faster.

Model Doesn't Fit in VRAM?

  • Option A: Don't run it
  • Option B: Quantize it (AWQ, GPTQ) to make it smaller
  • Option C: Split across multiple GPUs (tensor parallelism)
  • Option D: Offload to CPU RAM (10–100× slower)

Consumer Tier: Gaming GPUs for AI

GPU | VRAM | Bandwidth | Tensor Cores | FP16 TFLOPS | Power | Price
RTX 4060 Ti | 16GB GDDR6X | 288 GB/s | 128 | 22.1 | 165W | $400
RTX 4080 | 16GB GDDR6X | 717 GB/s | 304 | 48.7 | 320W | $1,200
RTX 4090 | 24GB GDDR6X | 1,008 GB/s | 512 | 82.6 | 450W | $1,600
RTX 5090 | 32GB GDDR7 | 1,792 GB/s | 512+ | 104.8 | 575W | $2,000

MSRP prices; actual retail may vary. RTX 5090 specs based on launch specifications. NVIDIA GeForce product pages. Last verified 2026-02-23.

Key insights for consumer GPUs:

  • RTX 4090 (24GB): Runs most 7B-13B models in FP16, 32B models in AWQ/4-bit. The workhorse of the hobbyist AI community.
  • RTX 5090 (32GB): Runs 32B models in AWQ comfortably with room for large KV caches. Two in tandem (64GB total) can run 70B AWQ models.
  • Multi-GPU: Consumer GPUs communicate via PCIe 5.0 (64 GB/s). Fast enough for inference but limits training efficiency.
  • Limitation: No NVLink support. Consumer motherboards typically support 2 GPUs maximum for AI workloads.

Bandwidth Advantage

The RTX 5090’s GDDR7 provides 1,792 GB/s bandwidth — 78% more than the RTX 4090’s GDDR6X. This directly translates to faster token generation for memory-bound inference.

Prosumer Tier: Apple Silicon and AMD

Apple Silicon (M-series)

Chip | Unified Memory | Bandwidth | GPU Cores | Neural Engine | Best For
M3 Pro | 36GB | 150 GB/s | 18 | 16-core | Small models (7-13B)
M3 Max | 128GB | 400 GB/s | 40 | 16-core | Medium models (32-70B)
M3 Ultra | 192GB | 800 GB/s | 80 | 32-core | Large models (70B FP16)
M4 Ultra (2026) | 256GB+ | 1,000+ GB/s | 80+ | 32-core+ | Very large models
M4 Ultra specs are pre-release estimates. Apple Mac Studio specs. Last verified 2026-02-23.

Apple Silicon’s unique advantage is unified memory — the CPU and GPU share the same memory pool. A Mac Studio M3 Ultra with 192GB can load a 70B model in FP16 without quantization. No NVIDIA GPU at any consumer price point can do this.

The downside: Apple’s GPU architecture is optimized for different workloads than NVIDIA’s Tensor Cores. Token-for-token, NVIDIA GPUs are faster, but Apple Silicon can load larger models.

AMD MI300X

Spec | MI300X
VRAM | 192GB HBM3
Bandwidth | 5,300 GB/s
FP16 TFLOPS | 653.7
Power | 750W
Price | ~$10,000-15,000

Price is estimate from enterprise channels. AMD Instinct MI300X product page. Last verified 2026-02-23.

The MI300X is AMD’s datacenter GPU with a massive 192GB of HBM3 memory. It can run 70B models in full FP16 on a single card. However, software support (ROCm) lags behind NVIDIA’s CUDA ecosystem.

Datacenter Tier: The NVIDIA AI Factory

GPU | VRAM | Bandwidth | Interconnect | FP16 TFLOPS | Power | Price
A100 (2020) | 80GB HBM2e | 2,039 GB/s | NVLink 600GB/s | 312 | 400W | ~$10,000
H100 SXM (2023) | 80GB HBM3 | 3,350 GB/s | NVLink 900GB/s | 989 | 700W | ~$25,000-35,000
H200 (2024) | 141GB HBM3e | 4,800 GB/s | NVLink 900GB/s | 989 | 700W | ~$30,000-40,000
B200 (2025) | 192GB HBM3e | 8,000 GB/s | NVLink 1,800GB/s | 2,250+ | 1,000W | ~$35,000-50,000
GB200 NVL72 | 13.5TB agg. | 72x NVLink | NVSwitch fabric | 162,000+ | 120kW | ~$3,000,000+

H100/H200 FP16 TFLOPS is dense Tensor Core (989 TFLOPS); sparsity doubles it. Prices are estimates — datacenter GPUs are typically sold through enterprise channels. NVIDIA A100, H100, H200, B200. Last verified 2026-02-23.

Key datacenter-only features:

  • HBM (High Bandwidth Memory): Stacked memory chips with 3-8x the bandwidth of consumer GDDR. An H100’s 3,350 GB/s feeds data to Tensor Cores fast enough to keep them fully utilized.
  • NVLink: A direct GPU-to-GPU interconnect with 14x the bandwidth of PCIe 5.0. Makes tensor parallelism across multiple GPUs nearly as efficient as a single larger GPU.
  • NVSwitch: A fabric switch connecting up to 576 GPUs in a single domain with full bisection bandwidth. The GB200 NVL72 rack operates as a single logical machine with 13.5TB of aggregate GPU memory.

The Interconnect Hierarchy

Interconnect Speed Hierarchy

Interconnect | Bandwidth | Typical Scale
PCIe 5.0 | 64 GB/s | Consumer GPUs (RTX 5090)
NVLink (H100) | 900 GB/s | Datacenter (14× PCIe)
NVSwitch (GB200) | 1,800 GB/s | Multi-GPU fabric (28× PCIe)

  • PCIe 5.0 (consumer): Fine for 2-GPU tensor parallelism on inference. The ~2% overhead is negligible.
  • NVLink (datacenter): Essential for 4-8 GPU tensor parallelism and training.
  • NVSwitch (mega-scale): Required for 72+ GPU configurations.

VRAM Budgeting: The Math That Matters

Every GPU deployment needs a VRAM budget:

VRAM Budget Formula
Total VRAM Needed = Model Weights + KV Cache + Activation Memory + CUDA Overhead

Where:
  Model Weights = Parameters × Bytes_per_Parameter
    - FP16: params × 2 bytes
    - INT4/AWQ: params × 0.5 bytes

  KV Cache = 2 × num_layers × num_heads × head_dim × max_context × batch_size × 2 bytes

  Activation Memory ≈ 1-2 GB (varies by batch size)

  CUDA Overhead ≈ 0.5-1 GB (driver, context, buffers)

Example: Qwen3-Coder-Next-AWQ on Dual 32GB GPUs (TP=2)

VRAM Budget Example 1: Calculation
Model Weights (AWQ 4-bit): 80B × 0.5 bytes = 40 GB
  (Note: 80B total params, but ALL must be loaded even though only 3B active)
  Per GPU (TP=2): 40 / 2 = 20 GB

KV Cache (32K context, 1 user): ~2 GB
  Per GPU: 1 GB

Activation Memory: ~1 GB per GPU
CUDA Overhead: ~0.5 GB per GPU

Total per GPU: 20 + 1 + 1 + 0.5 = 22.5 GB
Available per GPU: 32 GB
Remaining per GPU: 9.5 GB  ← Fits with headroom

Example: Llama 3.3 70B-AWQ on Dual 32GB GPUs (TP=2)

VRAM Budget Example 2: Calculation
Model Weights (AWQ 4-bit): 70B × 0.5 bytes = 35 GB
  Per GPU (TP=2): 35 / 2 = 17.5 GB

KV Cache (8K context, 1 user): ~2.8 GB
  Per GPU: 1.4 GB

Activation Memory: ~1.5 GB per GPU
CUDA Overhead: ~0.5 GB per GPU

Total per GPU: 17.5 + 1.4 + 1.5 + 0.5 = 20.9 GB
Available per GPU: 32 GB
Remaining per GPU: 11.1 GB  ← Fits, less headroom
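The two examples above follow the same arithmetic, so here it is as a small helper. The function name and the activation/overhead defaults are mine, and the result is a rough sanity check rather than a measurement.

VRAM Budget Calculator Sketch (Python)
def vram_per_gpu(params_billions, bytes_per_param, kv_cache_gb,
                 tensor_parallel=1, activation_gb=1.5, overhead_gb=0.5):
    weights_gb = params_billions * bytes_per_param    # AWQ 4-bit ~= 0.5 bytes/param
    return (weights_gb + kv_cache_gb) / tensor_parallel + activation_gb + overhead_gb

# Llama 3.3 70B AWQ, 8K context, split across two 32 GB GPUs (TP=2)
need = vram_per_gpu(70, 0.5, kv_cache_gb=2.8, tensor_parallel=2)
print(f"{need:.1f} GB per GPU needed; fits on 32 GB: {need < 32}")   # 20.9 GB, True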

Best Practice

Always calculate your VRAM budget before deploying a model. The formula is simple: if weights + cache + overhead > available VRAM, the model will either fail to load or crash mid-inference.

Power and Cooling at Scale

GPU | Power Draw | Annual Electricity (24/7) | Cooling Requirement
RTX 4090 | 450W | ~$400/year | Standard air cooling
RTX 5090 | 575W | ~$500/year | Enhanced air cooling
Dual RTX 5090 | 1,150W | ~$1,000/year | Good airflow, possibly liquid
8x H100 DGX | 10,200W | ~$9,000/year | Dedicated liquid cooling
H100 SXM | 700W | ~$600/year | Liquid cooling recommended
GB200 NVL72 | 120,000W | ~$105,000/year | Industrial liquid cooling

Derived calculations. Power draw from NVIDIA spec sheets. Annual electricity assumes 24/7 operation at $0.12/kWh (U.S. national average residential rate). Actual costs vary by location and usage pattern. Last verified 2026-02-23.

At the consumer level, power is manageable. At the datacenter level, power and cooling become the dominant operating costs — often exceeding the hardware amortization cost.

Emerging Technologies and the Future

The Attention Problem

The transformer’s self-attention mechanism has a fundamental limitation: it scales quadratically with context length. Processing a 128K token context requires 128K × 128K = 16 billion attention computations. Doubling the context to 256K quadruples this to 64 billion.

This O(n²) scaling is why long-context inference is so expensive and why researchers are exploring alternatives.

Flash Attention: Making Attention Faster (Today)

Flash Attention (by Tri Dao, Stanford) does not change the math of attention — it changes how the computation is organized in GPU memory. By reordering memory access patterns to maximize GPU cache utilization, Flash Attention computes exact attention 2-4x faster with 5-20x less memory overhead.

Flash Attention is already integrated into vLLM and is the default for all modern inference engines.

Gated DeltaNet: Beyond Transformers

Gated DeltaNet is a linear attention architecture that achieves O(n) scaling — doubling context length only doubles cost, not quadruples it.

Scaling Comparison
Context | Transformer O(n²) | DeltaNet O(n)
4K tokens | 16M ops | 4K ops
32K tokens | 1B ops | 32K ops
128K tokens | 16B ops | 128K ops
1M tokens | 1T ops | 1M ops

At 1M tokens: transformer is 1,000,000× more expensive

Gated DeltaNet is still in the research phase, but early results show quality approaching transformers on standard benchmarks. If it works at scale, it could enable million-token context windows on consumer hardware.

YaRN: Extending Context Without Retraining

YaRN (Yet another RoPE extension method) allows models trained with short context windows to be extended to longer contexts without full retraining. It modifies the positional encoding (RoPE) to handle positions beyond the original training range.

A model trained with 4K context can be extended to 32K or even 128K using YaRN, with some quality degradation at extreme extensions. This is valuable because training long-context models is extremely expensive.

Speculative Decoding: Two Models, One Speed

Speculative decoding uses a small, fast model (the “draft” model) to predict multiple tokens ahead, then verifies them with the large, accurate model in a single forward pass.

Speculative Decoding

Without Speculative Decoding

Large model generates one token at a time:

Token 1 → [500ms] → Token 2 → [500ms] → Token 3 → [500ms]

Total: 1,500ms

With Speculative Decoding

Small model drafts 3 tokens: [50ms]

Large model verifies all 3 in one pass: [600ms]

Total: 650ms (2.3× faster)

The key insight: verifying multiple tokens in parallel is nearly as fast as generating one token. If the draft model’s predictions are mostly correct, the total throughput improves 2-3x.
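A toy greedy version of the idea looks like this; both "models" are stand-in functions returning hard-coded tokens, so the sketch only shows the accept-until-first-mismatch logic, not real probability-based verification.

Speculative Decoding Sketch (Python)
def draft_model(prompt, n=3):
    return ["the", "quick", "fox"]                # fast, sometimes-wrong guesses (stand-in)

def large_model_next_tokens(prompt, k):
    return ["the", "quick", "brown"]              # what the big model would produce (stand-in)

def speculative_step(prompt):
    drafts = draft_model(prompt)
    targets = large_model_next_tokens(prompt, len(drafts))   # ONE big-model pass
    accepted = []
    for d, t in zip(drafts, targets):
        if d == t:
            accepted.append(d)                    # draft confirmed: kept essentially for free
        else:
            accepted.append(t)                    # first mismatch: take the big model's token, stop
            break
    return accepted

print(speculative_step("The"))   # ['the', 'quick', 'brown'] from a single big-model pass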

Model Distillation: Teaching Small Models to Think Big

Distillation trains a small “student” model to mimic a large “teacher” model. The student learns not just the correct answers but the teacher’s probability distributions — capturing nuanced knowledge that would require a much larger dataset to learn directly.

Notable example: DeepSeek-R1-Distill-Qwen-32B is a 32B model distilled from the 671B DeepSeek-R1. It inherits the 671B model’s deep reasoning abilities despite being 21x smaller. This is why it beats OpenAI’s o1-mini on reasoning benchmarks.[3]

LoRA and QLoRA: Customizing Models

LoRA (Low-Rank Adaptation) allows you to fine-tune a model by training only a tiny fraction of its parameters. Instead of modifying all 32 billion weights, LoRA adds small trainable matrices (typically 0.1-1% of total parameters) that adjust the model’s behavior.

QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning on a single consumer GPU:

Full Fine-Tuning vs QLoRA

Full Fine-Tuning

Train all 32B parameters

Requires: 128GB+ VRAM

Hardware: Multiple A100/H100 GPUs

Cost: $1,000–10,000+

QLoRA Fine-Tuning

Train ~32M parameters (0.1%)

Model in 4-bit quantization

Requires: 24GB VRAM (single RTX 4090)

Cost: $5–50 electricity
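The heart of LoRA fits in a few lines: the big weight matrix stays frozen and only two small matrices are trained, whose product is added to the output. The dimensions and rank below are illustrative, not tied to any specific model.

LoRA Forward Pass Sketch (Python)
import numpy as np

rng = np.random.default_rng(0)
d, rank = 1024, 8
W = rng.normal(size=(d, d))              # frozen pretrained weight (never updated)
A = rng.normal(size=(rank, d)) * 0.01    # trainable, tiny
B = np.zeros((d, rank))                  # trainable, starts at zero (no change at first)

def lora_forward(x, scale=1.0):
    return x @ W.T + scale * (x @ A.T @ B.T)   # original path + low-rank adjustment

x = rng.normal(size=(1, d))
print(lora_forward(x).shape)                               # (1, 1024)
trainable = A.size + B.size
print(f"trainable: {trainable:,} of {W.size:,} weights ({trainable / W.size:.2%})")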

Use cases for fine-tuning:

  • Adapting a general model to your coding style
  • Teaching a model domain-specific terminology
  • Improving performance on a narrow task (e.g., SQL generation)

RAG: Giving AI Access to Your Documents

Retrieval-Augmented Generation (RAG) solves a fundamental limitation: LLMs only know what was in their training data. They cannot access your private documents, recent emails, or local files.

RAG Pipeline

User Query

“What was our Q4 revenue?”

Embedding Model

Convert query to vector

Vector Database

Search similar docs

Retrieved Context

“Q4 2025 revenue was $12.3M, up 23% from Q3…”

LLM Prompt

Context + Question → Model

Answer

“Your Q4 revenue was $12.3M, a 23% increase…”
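A minimal end-to-end sketch of that pipeline is below. Word overlap stands in for a real embedding model and vector database, and the documents, query, and helper names are invented for illustration.

RAG Retrieval Sketch (Python)
def embed(text):
    return set(text.lower().split())            # toy "embedding": a bag of words

def similarity(a, b):
    return len(a & b) / max(len(a | b), 1)      # Jaccard overlap as a stand-in score

documents = [
    "Q4 2025 revenue was $12.3M, up 23% from Q3.",
    "The holiday party is scheduled for December 19.",
    "Headcount grew to 85 employees in Q4 2025.",
]

def retrieve(query, k=1):
    q = embed(query)
    return sorted(documents, key=lambda doc: similarity(q, embed(doc)), reverse=True)[:k]

query = "What was our Q4 revenue?"
context = " ".join(retrieve(query))
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer using only the context."
print(prompt)   # this augmented prompt is what actually gets sent to the LLM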

RAG vs Fine-Tuning vs Prompt Engineering

Method | Changes Model? | Updates Real-Time? | Cost | Best For
Prompt Engineering | No | N/A | Free | Adjusting behavior, few-shot examples
RAG | No | Yes | Low | Access to private/changing documents
Fine-Tuning | Yes | No (requires retrain) | Medium | Permanent behavior changes

Tool Use and Function Calling

Modern LLMs can use external tools through function calling. Instead of only generating text, the model can output structured requests to execute code, search the web, query databases, or call APIs.

How Function Calling Works

When a model supports tool use, it follows a four-step process:

  1. Detect when a tool is needed — the model recognizes that the user’s request requires external action
  2. Generate structured tool calls — instead of text, the model outputs JSON matching defined tool schemas
  3. Wait for tool results — the external system executes the tool and returns results
  4. Incorporate results — the model reads the tool output and crafts its final response
Function Calling with Search Example
User: "What is the latest version of Python?"

LLM thinks: "I need current information. Let me search."

LLM outputs: {"tool": "search", "query": "latest Python version 2026"}

System executes: SearXNG search → returns results

LLM receives: "Python 3.13.2 released January 2026"

LLM responds: "The latest stable version of Python is 3.13.2,
              released in January 2026."
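In OpenAI-compatible APIs (which vLLM-style servers implement), the available tools are declared with a JSON schema like the one below; the tool name and parameters here are illustrative, matching the search example above.

Tool Declaration Sketch (Python)
search_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the web for current information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"},
            },
            "required": ["query"],
        },
    },
}

# Passed as tools=[search_tool] in a chat completion request. If the model decides
# a search is needed, its reply contains a structured tool call (function name plus
# JSON arguments) instead of plain text, which your code then executes.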

Models with Native Tool Use Support

Not all models are trained for function calling. Tool-use capability requires specific training. Models designed for tool use include:

  • Qwen3-Coder-Next / Qwen3-Coder-30B-A3B: Trained specifically for agentic coding workflows with recovery behaviors
  • Llama 4 Scout/Maverick: Native function calling support
  • IBM Granite: Strong enterprise function-calling benchmarks
  • Salesforce xLAM: Purpose-built for multi-turn tool use
  • OpenAI gpt-oss: Designed for reliable agentic tasks

Agentic Coding: Beyond Code Completion

Agentic coding is the next evolution beyond code completion. Instead of generating code snippets in a chat window, agentic systems use tool calls to read project structures, create and modify files, run tests, execute shell commands, debug iteratively, and use version control.

Agentic Workflow Example
User: "Fix the authentication bug in the login system"

Agent → tool_call: read_file("src/auth/login.py")
Agent → reads file, identifies the issue
Agent → tool_call: write_file("src/auth/login.py", fixed_code)
Agent → tool_call: run_command("pytest tests/test_auth.py -v")
Agent → reads test results: all pass
Agent → responds: "Fixed! The issue was missing input validation on line 47."

Deep Dive Available

Agentic coding, tool-call parsers, and model selection for agent workflows are covered in depth in Lesson 05: Coding LLMs & Agentic AI. This section introduces the concept — the full technical deep dive is next.

Tool Use vs RAG: When to Use Which

Aspect | RAG | Tool Use
Purpose | Access static documents | Execute dynamic operations
Response time | Fast (vector search) | Variable (depends on tool)
Data freshness | Depends on index updates | Real-time
Best for | Knowledge bases, manuals | Web search, code execution, APIs
Example | "What's in our Q4 report?" | "What's the weather right now?"

Use RAG when the answer is in documents you control. Use tool calling when the answer requires real-time data or computation.

Deployment Patterns — From Single GPU to Enterprise Clusters

Level 1: Single GPU, Single User

Single GPU Deployment

User

(browser)

Inference Engine

(Ollama / vLLM)

GPU

(1×24GB)

Typical setup: Ollama on a single RTX 4090 running a 32B AWQ model. Perfect for personal use — chatting, coding assistance, document analysis. Setup time: 30 minutes.

Limitations: One user at a time. Cannot run models larger than VRAM allows. No redundancy.

Level 2: Multi-GPU, Single Node

Multi-GPU Single Node (TP=2)

User

(browser)

vLLM

(TP=2)

GPU 0

(32GB)

GPU 1

(32GB)

64GB combined VRAM

Benefits over Level 1:

  • Run 70B models (AWQ) that do not fit on one GPU
  • Model-switching strategies (sleep mode, multiple compose files)
  • Can serve 5-10 concurrent users with vLLM

Model-switching pattern: When you need different models for different tasks, use mode-switching scripts. With vLLM sleep mode, switching takes 4-13 seconds instead of 50-160 seconds.

Level 3: Multi-GPU, Multi-Node

Multi-Node Pipeline Parallelism

Users

Load Balancer

Node 1

Layers 1–32 · 8× H100 (640GB)

Network

Node 2

Layers 33–64 · 8× H100 (640GB)

1.28TB combined GPU memory

Pipeline parallelism has higher latency than tensor parallelism (network round-trips between nodes), but it is the only way to serve models that do not fit in a single node’s GPU memory.

Level 4: Kubernetes + vLLM Autoscaling

Kubernetes Autoscaled Deployment

Kubernetes Cluster

vLLM Pod 1

(8×H100)

vLLM Pod 2

(8×H100)

vLLM Pod 3

(8×H100)

← Auto-scaled →

Shared Model Storage (NFS)

Horizontal Pod Autoscaler (HPA)

Scale up: >80% GPU · Scale down: <20% GPU

How it works:

  1. A Kubernetes Horizontal Pod Autoscaler monitors GPU utilization
  2. When demand exceeds capacity, new vLLM pods are launched
  3. Each pod attaches to a GPU node and loads the model from shared storage
  4. A load balancer distributes requests across all active pods
  5. When demand drops, pods are scaled down to save resources

Self-Hosted vs API: The Cost Analysis

| Usage Level | API Cost (GPT-4o) | Self-Hosted (8×H100) | Winner |
| --- | --- | --- | --- |
| 100K tokens/day | $45/month | $3,500/month | API |
| 1M tokens/day | $450/month | $3,500/month | API |
| 10M tokens/day | $4,500/month | $3,500/month | Self-hosted |
| 100M tokens/day | $45,000/month | $3,500/month | Self-hosted (13×) |
| 1B tokens/day | $450,000/month | $7,000/month | Self-hosted (64×) |

Derived calculations. API costs based on GPT-4o pricing ($2.50/$10 per 1M tokens, blended ~$5/1M). Self-hosted assumes 8×H100 colocation at ~$3,500-$7,000/month (varies by provider and commitment term). Actual crossover depends on token mix, model choice, and infrastructure costs. As of Feb 2026.

The crossover point for H100-class hardware is approximately 5-10 million tokens per day. Below that, API is cheaper. Above that, self-hosting wins and the savings compound rapidly with scale.
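
If you want to run the numbers for your own workload, the crossover is simple arithmetic. The sketch below uses an assumed monthly hosting cost and a range of blended API rates; plug in your own figures, since the effective rate shifts with your input/output token mix.

```python
# Back-of-the-envelope break-even between API usage and self-hosting.
# All inputs are assumptions -- substitute your own blended API rate
# and monthly infrastructure cost.
def breakeven_tokens_per_day(api_cost_per_1m_tokens: float,
                             monthly_hosting_cost: float,
                             days_per_month: int = 30) -> float:
    """Daily token volume at which self-hosting matches the API bill."""
    monthly_tokens = monthly_hosting_cost / api_cost_per_1m_tokens * 1_000_000
    return monthly_tokens / days_per_month

# Example: assumed $3,500/month for an 8xH100 node, swept over blended rates.
for rate in (5.0, 10.0, 15.0):  # $ per 1M tokens (blended input/output)
    tokens = breakeven_tokens_per_day(rate, 3_500)
    print(f"${rate:>5.2f}/1M tokens -> break-even at ~{tokens / 1e6:.1f}M tokens/day")
```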

For consumer hardware, the crossover is much lower:

| Usage Level | API Cost (GPT-4o) | Self-Hosted (Dual RTX 5090) | Winner |
| --- | --- | --- | --- |
| Any amount | $20+/month | ~$80/month electricity | Self-hosted |

Derived calculation. Electricity assumes dual RTX 5090 at 1,150W total, ~8 hrs/day usage at $0.12/kWh. Hardware cost ($4,000-5,000) is a one-time purchase. API baseline is ChatGPT Plus subscription ($20/month). As of Feb 2026.

The consumer GPU advantage: hardware is a one-time purchase ($4,000-5,000 for a dual RTX 5090 setup), and ongoing costs are just electricity. The payback period depends on the API spend it replaces: against heavy API usage (hundreds of dollars per month) the hardware can pay for itself within months, while against a single $20/month subscription it takes considerably longer.

Data Sovereignty and Compliance

For regulated industries, self-hosting is not just about cost — it is about legal requirements:

  • HIPAA (Healthcare): Protected health information cannot be shared with a vendor without a Business Associate Agreement (BAA). Not every AI API provider offers one, and those that do may limit which services are covered.
  • GDPR (EU): Restricts transfers of personal data outside the EU. Self-hosting on EU-located servers keeps processing under your direct control and simplifies compliance.
  • ITAR (Defense): Certain technical data cannot leave the United States or be accessed by foreign persons. Self-hosted AI on air-gapped networks is often the only practical option.
  • Financial Regulations: Many banks and funds require all data processing to occur on audited infrastructure.

Best Practice

Even if you start with cloud APIs for convenience, architect your application with an OpenAI-compatible API interface. This way, switching from cloud to self-hosted later requires only changing the base URL — no code changes.
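
Below is a minimal sketch of this pattern using the official OpenAI Python client; the local base URL, placeholder API key, and self-hosted model name are assumptions for a hypothetical vLLM server.

```python
# The same client code works against the cloud API and a self-hosted
# OpenAI-compatible server (vLLM, llama.cpp server, etc.) -- only the
# base_url, api_key, and model name change.
import os
from openai import OpenAI

if os.environ.get("USE_SELF_HOSTED") == "1":
    client = OpenAI(base_url="http://localhost:8000/v1",  # assumed local vLLM endpoint
                    api_key="not-needed")                  # vLLM ignores the key unless one is configured
    model = "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"          # assumed local model
else:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    model = "gpt-4o"

reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize our deployment options."}],
)
print(reply.choices[0].message.content)
```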

Hands-On Exercises and Summary

Exercise 1: VRAM Budget Calculator

beginner10 minutes

Objective: Calculate whether specific models fit on different GPU configurations.

Hardware:

  • Config A: Single RTX 4090 (24GB VRAM)
  • Config B: Dual RTX 5090 (32GB × 2 = 64GB total, TP=2)
  • Config C: Single H100 SXM (80GB VRAM)

Models:

  • Model 1: Yi-Coder-9B (FP16) — 9B × 2 bytes = 18GB weights
  • Model 2: Qwen2.5-Coder-32B-AWQ (4-bit) — 32B × 0.5 bytes = 16GB weights
  • Model 3: Llama 3.3 70B-AWQ (4-bit) — 70B × 0.5 bytes = 35GB weights

Instructions:

  1. For each model, calculate total VRAM needed: Weights + KV cache at 8K context (~0.5GB for 9B, ~1.3GB for 32B, ~2.8GB for 70B) + Overhead (~1.5GB)
  2. For TP=2 configurations, divide weights and KV cache by 2 per GPU
  3. Fill in the compatibility table (a small helper script for checking the arithmetic follows the table)

| Model | Total VRAM | Config A (24GB) | Config B (64GB TP=2) | Config C (80GB) |
| --- | --- | --- | --- | --- |
| Yi-Coder-9B FP16 | ? | Fits / No | Fits / No | Fits / No |
| Qwen2.5-Coder-32B AWQ | ? | Fits / No | Fits / No | Fits / No |
| Llama 3.3 70B AWQ | ? | Fits / No | Fits / No | Fits / No |
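
To check your answers, the step 1 arithmetic is easy to script. The helper below uses the same rough constants given above (weights + KV cache at 8K context + ~1.5GB overhead) and a simple per-GPU split for TP=2; real memory use varies by engine and settings.

```python
# Rough VRAM estimator matching the exercise's assumptions. KV-cache figures
# are the approximate 8K-context values given above; real usage varies with
# architecture, dtype, and context length.
OVERHEAD_GB = 1.5

def vram_needed_gb(params_b: float, bytes_per_param: float, kv_cache_gb: float) -> float:
    """Total VRAM (GB) for weights + KV cache + runtime overhead."""
    return params_b * bytes_per_param + kv_cache_gb + OVERHEAD_GB

def fits(total_gb: float, gpu_gb: float, tensor_parallel: int = 1) -> bool:
    """Does the model fit when weights and cache are split across TP GPUs?"""
    per_gpu = (total_gb - OVERHEAD_GB) / tensor_parallel + OVERHEAD_GB
    return per_gpu <= gpu_gb

models = {
    "Yi-Coder-9B FP16":      vram_needed_gb(9,  2.0, 0.5),
    "Qwen2.5-Coder-32B AWQ": vram_needed_gb(32, 0.5, 1.3),
    "Llama 3.3 70B AWQ":     vram_needed_gb(70, 0.5, 2.8),
}
for name, total in models.items():
    print(f"{name:24s} {total:5.1f} GB | A(24GB): {fits(total, 24)} | "
          f"B(2x32GB TP=2): {fits(total, 32, 2)} | C(80GB): {fits(total, 80)}")
```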

Exercise 2: Model Selection Decision

intermediate15 minutes

Objective: Choose the right model for three real-world scenarios.

Available Models: Qwen3-Coder-Next-AWQ (~48GB, agentic coding, 262K context), Qwen2.5-Coder-32B-AWQ (~18GB, code generation, 128K), DeepSeek-R1-Distill-32B-AWQ (~18GB, reasoning, 32K), Llama 3.3 70B-AWQ (~40GB, general purpose, 128K), Yi-Coder-9B (~5GB, lightweight, 128K).

Scenario A: You need to refactor a complex 2,000-line Python file with multiple classes and async functions.

Scenario B: You are analyzing a legal contract and need to identify potential risks and contradictions between clauses.

Scenario C: You are running a Discord bot that needs to answer quick questions from 20 concurrent users about a game wiki.

Exercise 3: Multi-Model Deployment Strategy

advanced20 minutes

Objective: Design a complete model-switching deployment for a dual 32GB GPU setup.

Scenario: You manage a dual-GPU server (64GB total VRAM) that needs to support three workloads:

  • Morning (8am-12pm): Software development — need a coding assistant
  • Afternoon (1pm-5pm): Document analysis and strategy — need deep reasoning
  • Evening (6pm-10pm): Casual conversation and voice assistant — need a fast chat model

Your Task:

  1. Select models for each time slot. Calculate VRAM requirements and verify all three fit.
  2. Design the switching sequence. Write the vLLM sleep mode API calls needed to switch between modes.
  3. Calculate switching overhead: cold start vs sleep mode L2 vs L1.
  4. Document edge cases: mid-conversation switches, urgent model access, 24GB VRAM constraint.

Deliverable: A one-page deployment plan with model selections, VRAM calculations, API sequences, and timing estimates.

Key Takeaways

After completing this lesson, you understand:

  • LLMs are transformer-based neural networks that predict the next token, with intelligence emerging from scale and specialization
  • Self-attention allows every token to attend to every other token, enabling understanding across long contexts
  • The KV cache trades VRAM for speed, and grows linearly with context length
  • AWQ quantization reduces model size by 75% with minimal quality loss, enabling large models on consumer hardware
  • vLLM’s PagedAttention and continuous batching make production-grade serving possible on personal GPUs
  • Model selection is about matching the right specialist to the right task, not choosing the biggest model
  • GPU hardware ranges from $400 consumer cards to $3M datacenter racks, with VRAM being the critical bottleneck at every tier
  • Emerging technologies (Flash Attention, speculative decoding, linear attention) will continue making AI more accessible
  • Self-hosting becomes cost-effective at surprisingly low usage levels on consumer hardware
  • Architect your applications with OpenAI-compatible APIs to freely switch between cloud and self-hosted

Resources and Further Reading

Foundational Papers:

  • “Attention Is All You Need” (Vaswani et al., 2017) — The transformer paper
  • “Language Models are Few-Shot Learners” (Brown et al., 2020) — GPT-3 / scaling laws
  • “AWQ: Activation-aware Weight Quantization” (Lin et al., 2023) — Quantization method
  • “Efficient Memory Management for LLM Serving with PagedAttention” (Kwon et al., 2023) — vLLM

Community Resources:

  • r/LocalLLaMA (Reddit) — Self-hosted AI community
  • Hugging Face Open LLM Leaderboard — Model benchmarks
  • LMSys Chatbot Arena — Blind model comparisons

Next Lesson Preview

Continue to Lesson 05: Coding LLMs & Agentic AI to dive deep into coding benchmarks, the 22 model families landscape, agentic tool calling, vLLM parser matching, censorship analysis, and how to select and deploy the right coding model for any scenario.

Sources and References

Model Cards and Specifications

  [1] Meta Llama 3.3 70B Instruct — HuggingFace. 70B params, 128K context. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct (as of 2026-02-23, verified 2026-02-23)
  [2] DeepSeek V3 — HuggingFace. 685B MoE, 37B active. SWE-bench Verified: 73.1%, Aider Polyglot: 74.2%. https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 (as of 2026-02-23, verified 2026-02-23)
  [3] GLM-5 — HuggingFace. 744B MoE, 40B active. SWE-bench: 77.8%. Also: arXiv paper (2602.15763v1). https://huggingface.co/zai-org/GLM-5 (as of 2026-02-23, verified 2026-02-23)
  [4] Qwen2.5-Coder-32B-Instruct — HuggingFace. HumanEval: 92.7%, MBPP: 90.2%. https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct (as of 2026-02-23, verified 2026-02-23)
  [5] DeepSeek-R1 — HuggingFace. R1-Distill-Qwen-32B outperforms o1-mini on AIME 2024 math benchmark. https://huggingface.co/deepseek-ai/DeepSeek-R1 (as of 2026-02-23, verified 2026-02-23)

Benchmarks and Rankings

  [6] SWE-bench Verified Leaderboard. Open vs proprietary gap: MiniMax M2.5 at 80.2% vs Claude Opus 4.5 at 80.9%. Rankings are point-in-time. https://www.swebench.com/ (as of 2026-02-23, verified 2026-02-23)
  [7] HumanEval benchmark — OpenAI. Individual model scores sourced from respective model cards. https://github.com/openai/human-eval (as of 2026-02-23, verified 2026-02-23)

Pricing

  [8] OpenAI API Pricing. GPT-4o: $2.50/$10 per 1M tokens. ChatGPT Plus: $20/month. Pricing as of Feb 2026. https://openai.com/api/pricing/ (as of 2026-02-23, verified 2026-02-23)
  [9] Anthropic Pricing. Claude Pro: $20/month. Claude Opus 4.5 API: $5/$25 per 1M tokens. Pricing as of Feb 2026. https://www.anthropic.com/pricing (as of 2026-02-23, verified 2026-02-23)
  [10] Google Gemini API Pricing. Gemini 2.5 Pro: $1.25/$10 per 1M tokens (≤200K context). Pricing as of Feb 2026. https://ai.google.dev/gemini-api/docs/pricing (as of 2026-02-23, verified 2026-02-23)

Hardware

  [11] NVIDIA A100 Tensor Core GPU. 80GB HBM2e, 400W TDP, 312 TFLOPS FP16. https://www.nvidia.com/en-us/data-center/a100/ (as of 2026-02-23, verified 2026-02-23)
  [12] NVIDIA H100 Tensor Core GPU. 80GB HBM3, 700W TDP, 989 TFLOPS FP16 dense. https://www.nvidia.com/en-us/data-center/h100/ (as of 2026-02-23, verified 2026-02-23)
  [13] NVIDIA H200 Tensor Core GPU. 141GB HBM3e, 700W TDP, 4.8 TB/s bandwidth. https://www.nvidia.com/en-us/data-center/h200/ (as of 2026-02-23, verified 2026-02-23)
  [14] NVIDIA B200 / DGX B200. 192GB HBM3e, 1,000W TDP, 8 TB/s bandwidth. https://www.nvidia.com/en-us/data-center/dgx-b200/ (as of 2026-02-23, verified 2026-02-23)
  [15] AMD Instinct MI300X. 192GB HBM3, 5,300 GB/s bandwidth. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html (as of 2026-02-23, verified 2026-02-23)

Software and Tools

  [16] vLLM Documentation. PagedAttention, continuous batching, sleep mode features. https://docs.vllm.ai/ (as of 2026-02-23, verified 2026-02-23)
  [17] Marlin CUDA Kernel — IST-DASLab. 4-bit quantized inference acceleration. https://github.com/IST-DASLab/marlin (as of 2026-02-23, verified 2026-02-23)

Industry Reports

  [18] Reuters: ChatGPT sets record for fastest-growing user base. 100 million monthly active users by January 2023, per UBS/Similarweb data. Published Feb 1, 2023. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/ (as of 2023-02-01, verified 2026-02-23)

Methodology Notes

Derived calculations (electricity costs, break-even analyses) use the following assumptions:

  • Electricity rate: $0.12/kWh (approximate U.S. national average residential rate)
  • API cost blending: ~$5 per 1M tokens (average of input/output for GPT-4o class models)
  • Self-hosted monthly costs: based on colocation estimates including power, cooling, and hardware amortization
  • Team API estimates ($1,000-5,000/month for 50 engineers): based on moderate individual API usage of $20-100/month per developer

All volatile data in this lesson was last verified on 2026-02-23. Benchmark scores, API pricing, and model rankings change frequently. Quarterly review recommended.