Understand how large language models work, from tokenization to context windows to hallucination.
LLMs predict text; they don't understand it. Here's how large language models work under the hood, from training to transformers to next-token prediction, and why it matters for how you use them.
Temperature controls how random or deterministic an AI model's output is. Here's what it does technically, how it relates to top-p and top-k, and when to adjust it.
Context windows determine how much an AI model can 'see' at once. Here's what they are technically, how attention scales, and practical strategies for working within their limits.
Tokenization isn't just a technical detail. It shapes how LLMs process your input. Understanding it changes the way you write prompts.
AI hallucination isn't a bug you can patch. It's a consequence of how language models work. Here's what causes it, how to measure it, and what actually reduces it.
Streaming LLM responses reduces perceived latency and improves UX. Here's how server-sent events work, how to implement streaming with OpenAI and Anthropic, and what to watch for in production.
Up next
Master system prompts, few-shot techniques, chain of thought reasoning, and structured output.
Get Insanely Good at AI
Chapter 2: How AI Actually Works goes deeper into the mechanics: tokenization, transformers, and next-token prediction. The understanding that changes how you work with every AI tool.