AI Engineering · 8 min read

What Is an LLM? How Large Language Models Actually Work

LLMs predict text; they don't understand it. Here's how large language models work under the hood, from training to transformers to next-token prediction, and why it matters for how you use them.

An LLM doesn’t reason. It doesn’t understand. It predicts the next token. That single fact explains most of what large language models can do, most of what they can’t, and why they fail in the ways they do.

What an LLM Actually Is

A large language model is a neural network trained to predict the next token in a sequence. You give it text. It converts that text into tokens (chunks of characters, roughly four characters per token in English). Then it computes a probability distribution over its entire vocabulary and picks the next token. Repeat. That’s generation.
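That loop can be sketched in a few lines. This is a toy illustration, not a real model: `fake_model` is a hypothetical stand-in that returns one score (logit) per vocabulary entry, and decoding here is greedy (always pick the most likely token) rather than sampled.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fake_model(tokens, vocab_size=5):
    """Hypothetical stand-in for an LLM: one logit per vocab entry.
    Here it simply favors token id (len(tokens) % vocab_size)."""
    favored = len(tokens) % vocab_size
    return [3.0 if i == favored else 0.0 for i in range(vocab_size)]

def generate(prompt_tokens, steps):
    """The decode loop: score the vocab, pick a token, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        probs = softmax(fake_model(tokens))
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        tokens.append(next_token)
    return tokens

print(generate([0, 1], 3))  # [0, 1, 2, 3, 4]
```

Real models sample from the distribution (temperature, top-p) instead of always taking the argmax, which is why the same prompt can produce different outputs.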

The model has no concept of “truth” or “meaning.” It learned statistical patterns from massive amounts of text. When you ask “What is the capital of France?”, it doesn’t retrieve a fact. It generates “Paris” because that sequence of tokens appeared overwhelmingly in its training data after that question pattern. It’s right for the wrong reasons. When the pattern is weak or ambiguous, it still generates something. Tokenization shapes exactly how the model sees your input, and different token sequences can produce different outputs even when the meaning is identical.
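To see how tokenization shapes input, here is a toy greedy longest-match tokenizer (a crude stand-in for BPE; real tokenizers learn their vocabularies from data). Two strings with the same meaning split into different token sequences:

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer: a toy stand-in for BPE."""
    tokens = []
    i = 0
    longest = max(len(p) for p in vocab)
    while i < len(text):
        for size in range(min(len(text) - i, longest), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:  # unknown single chars pass through
                tokens.append(piece)
                i += size
                break
    return tokens

vocab = {"rate", " limit", "rate-", "limit", " "}
print(tokenize("rate limit", vocab))   # ['rate', ' limit']
print(tokenize("rate-limit", vocab))   # ['rate-', 'limit']
```

Same concept, different token sequences. The model sees only the token ids, so the two inputs can steer it toward different outputs.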

This matters because it reframes how you should think about prompts. You’re not asking a system that “knows” things. You’re steering a probability engine toward outputs that match what you want.

How LLMs Are Trained

Training happens in stages.

Pre-training is the foundation. The model sees trillions of tokens from the open web, books, code, and other text. The objective is simple: given a sequence of tokens, predict the next one. No labels. No human feedback. Just raw text and a loss function that penalizes wrong predictions. The model learns grammar, facts (as statistical correlations), reasoning patterns (as common token sequences), and the structure of different domains. A model trained on code learns that function is often followed by a name and parentheses. One trained on Wikipedia learns that capital cities follow certain question patterns. This phase is expensive. GPT-3 cost millions in compute. Modern frontier models cost hundreds of millions.
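The loss function behind pre-training is cross-entropy on the next token: the penalty is the negative log of the probability the model assigned to the token that actually came next. A minimal sketch:

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy for one prediction: -log(prob assigned to the true next token)."""
    return -math.log(probs[target_id])

# Model puts 90% on the correct token: small loss.
confident = next_token_loss([0.05, 0.90, 0.05], target_id=1)
# Model puts only 10% on it: large loss.
unsure = next_token_loss([0.45, 0.10, 0.45], target_id=1)
print(round(confident, 3), round(unsure, 3))  # 0.105 2.303
```

Training is nothing more than nudging billions of weights so this number goes down, averaged over trillions of tokens.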

Fine-tuning narrows the model for specific behavior. After pre-training, the model can complete text, but it doesn’t follow instructions. Fine-tuning on instruction-response pairs teaches it to do what users ask. “Summarize this” gets paired with good summaries. “Write Python code for X” gets paired with correct code. The model learns to map instruction patterns to desired output patterns.
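An instruction-tuning dataset is, at its core, pairs flattened back into a single text stream. The pairs and the template below are hypothetical; real datasets and chat templates vary by model family:

```python
# Hypothetical instruction/response pairs, formatted for training.
pairs = [
    {"instruction": "Summarize: The cat sat on the mat.",
     "response": "A cat rested on a mat."},
    {"instruction": "Write Python code that adds two numbers.",
     "response": "def add(a, b):\n    return a + b"},
]

def to_training_text(pair):
    """Flatten one pair into the single text stream the model is trained on."""
    return (f"### Instruction:\n{pair['instruction']}\n"
            f"### Response:\n{pair['response']}")

print(to_training_text(pairs[0]))
```

The model is still doing next-token prediction; it has just seen enough of these patterns that "instruction followed by response" becomes the most probable continuation.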

RLHF (Reinforcement Learning from Human Feedback) refines behavior further. Humans rank model outputs (better vs. worse). A reward model learns to predict those rankings. The main model is then optimized to maximize that reward. This is how models learn to be helpful, avoid harmful content, and produce outputs humans prefer. RLHF doesn’t add knowledge. It shapes style, tone, and alignment with human preferences. Some models use alternative approaches (constitutional AI, direct preference optimization), but the goal is the same: align model behavior with what humans want.
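The reward model at the center of RLHF is typically trained with a Bradley-Terry style loss on ranked pairs: low loss when it scores the human-preferred output above the rejected one. A minimal sketch of that objective:

```python
import math

def preference_loss(reward_better, reward_worse):
    """Bradley-Terry style loss on one ranked pair: low when the reward
    model agrees with the human ranking."""
    # sigmoid of the reward gap = predicted probability the ranking is correct
    p_correct = 1.0 / (1.0 + math.exp(-(reward_better - reward_worse)))
    return -math.log(p_correct)

agrees = preference_loss(2.0, -1.0)     # model agrees with humans: small loss
disagrees = preference_loss(-1.0, 2.0)  # model disagrees: large loss
print(round(agrees, 3), round(disagrees, 3))  # 0.049 3.049
```

Direct preference optimization (DPO) uses essentially this same pairwise loss, but applies it to the language model directly instead of training a separate reward model.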

The Transformer Architecture

Every modern LLM is built on the transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. The key innovation is self-attention: the ability to look at every token in the input sequence and weight how much each token should influence the output.

Before transformers, models processed text sequentially. RNNs and LSTMs had to pass information through a chain, one token at a time. Long-range dependencies (a pronoun referring to something 50 tokens back) were hard to capture. Attention solves this. Each token can directly attend to every other token. The model learns which relationships matter. “The cat sat on the mat because it was tired” requires linking “it” to “cat.” Attention lets the model do that in one step.
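Scaled dot-product attention is compact enough to write out. This sketch uses toy 2-dimensional vectors and plain lists (a real implementation uses matrices on a GPU, plus learned projections that produce the queries, keys, and values):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over toy 2-d token vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of this token's query to every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # softmax turns scores into attention weights that sum to 1
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # output = weighted mix of every token's value vector
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three tokens; the query aligns most strongly with the third key,
# so the third value dominates the output.
q = [[1.0, 0.0]]
k = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
print(attention(q, k, v))
```

Every token attends to every other token in one step; there is no chain to pass information through, which is exactly what RNNs lacked.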

Transformers stack many attention layers. Each layer refines the representation. Early layers capture local patterns (phrases, syntax). Deeper layers capture broader structure (topic, discourse, reasoning chains). The result is a representation of the input that encodes both local and global context. This is why transformers scaled so well: add more layers, add more parameters, train on more data, and capability improves in predictable ways.

The cost is quadratic in sequence length. For n tokens, attention does n² comparisons. Double the context window and you quadruple the compute. That’s why context limits exist and why very long contexts are slower and more expensive. Models that support 200K or 2M tokens use optimizations (sparse attention, sliding windows) that approximate full attention without the full cost.
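The quadratic blowup is easy to see with concrete numbers:

```python
def attention_pairs(n_tokens):
    """Full self-attention compares every token with every token: n^2 pairs."""
    return n_tokens * n_tokens

print(attention_pairs(4_000))   # 16,000,000 comparisons
print(attention_pairs(8_000))   # 64,000,000 -- double the context, 4x the work
```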

What LLMs Can and Can’t Do

LLMs excel at pattern-heavy tasks. They’ve seen millions of examples of code, summaries, translations, and Q&A. They generate text that fits those patterns. Code generation works because code has repetitive structure: def, return, if, else appear in predictable arrangements. Summarization works because summaries follow predictable formats. Creative writing works because the model learned narrative patterns from fiction. The more examples of a pattern in the training data, the more reliably the model reproduces it.

They struggle with precise computation. Ask an LLM to multiply two 10-digit numbers. It might get it right. It might not. It’s not doing arithmetic. It’s generating tokens that look like arithmetic. The model has no internal calculator. It learned that “123 * 456” is often followed by something that looks like a product, but it doesn’t compute the product. Same for dates, citations, and anything requiring exact lookup. For exact tasks, use tools. Let the model write code; run the code to get the answer. Give it a database or API when it needs real-time data.
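The tool-use pattern looks like this in miniature. The generated code string is hypothetical model output; in a real system you would sandbox execution rather than `exec` untrusted text directly:

```python
# Instead of asking the model for the product, have it emit code and run that.
# Python integers are arbitrary-precision, so the answer is exact.
model_written_code = "result = 7_316_529_841 * 9_204_817_356"  # hypothetical LLM output

namespace = {}
exec(model_written_code, namespace)  # never exec untrusted output in production
print(namespace["result"])
```

The model only has to produce the right *code*, a pattern it has seen millions of times; the interpreter does the arithmetic it cannot.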

Hallucination is the flip side of next-token prediction. When the model doesn’t have a strong pattern, it still generates. It invents citations, makes up API endpoints, and confidently states wrong facts. The output is fluent. It reads well. It’s wrong. This isn’t a bug. It’s the model doing exactly what it was trained to do: produce plausible next tokens.

Model Sizes and Parameters

“Parameters” are the learned weights in the neural network. Each connection between neurons has a weight. A 7-billion-parameter model has 7 billion of these numbers. More parameters generally mean more capacity to capture patterns, but the relationship isn’t linear. A 70B model isn’t 10x smarter than a 7B model. Diminishing returns kick in.

Rough scale: 7B models (Llama 3.2, Mistral) run on consumer hardware and handle many tasks well. 70B models (Llama 3 70B, Qwen) approach frontier quality for specific domains. Frontier models (GPT-4, Claude, Gemini) are hundreds of billions of parameters and require datacenter-scale infrastructure. Memory scales with parameters: a 7B model at 16-bit precision needs about 14GB; at 4-bit quantization, about 4GB. That’s why 7B models run on laptops and 70B models need serious hardware.
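The memory arithmetic from the paragraph above is a one-liner: parameters times bits per weight, divided by 8 to get bytes. This counts weights only; activations and the KV cache add more on top:

```python
def model_memory_gb(params_billions, bits_per_weight):
    """Rough weight-only memory footprint in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_memory_gb(7, 16))  # 14.0 GB at 16-bit precision
print(model_memory_gb(7, 4))   # 3.5 GB at 4-bit quantization
```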

Parameter count isn’t everything. Training data quality, architecture choices, and fine-tuning matter as much. A well-trained 7B model can outperform a poorly tuned 70B model on narrow tasks. “Large” in “large language model” refers to parameter count, but the real differentiator is the combination of scale, data, and alignment.

Why This Matters for Using AI Well

Understanding that LLMs predict tokens, not truth, changes how you use them.

Verify outputs. Don’t trust factual claims without checking. The model has no concept of “I made that up.” It will state fabrications with the same confidence as facts.

Use the right tool for the task. Pattern matching: let the model do it. Exact computation, lookups, or verification: use code, databases, or APIs. Embeddings and retrieval augment generation when you need the model to ground its output in real data.

Design prompts for the mechanism. You’re steering probability distributions. Clear, specific instructions work better than vague ones. Put critical constraints early in the context. Match your phrasing to patterns the model has seen.

Understand the limits. Context windows, tokenization, and attention mechanics all constrain what the model can “see” and how it processes your input. When something behaves oddly, the answer often lies in these mechanics.

The goal isn’t to become a researcher. It’s to build intuition that makes you better at prompting, debugging, and building systems that use LLMs effectively. When you know that the model predicts tokens, you stop expecting it to “think” and start designing for what it actually does. When you understand attention and context limits, you stop blaming the model for “forgetting” and start structuring your prompts accordingly. Get Insanely Good at AI covers these mechanics in depth: how models work, why they fail, and how to use them in production. The foundation matters. The rest builds on it.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.