Here’s something that used to keep me up at night: why does ChatGPT feel instant, while my own attempts at running a large language model on a cloud GPU felt like waiting for dial-up internet to load a JPEG in 1997?
The answer, as it turns out, has very little to do with raw computing power. It’s about memory. Specifically, it’s about moving bytes around in clever ways that would make a logistics expert weep with joy. Welcome to the bizarre, beautiful world of LLM inference optimization.
The Compute Tax: Why LLM Inference is Hard
Let me paint you a picture. You’ve got this magnificent neural network with 70 billion parameters. Each parameter is a number. Each number needs to be fetched from memory, multiplied, added, and the result stored somewhere. Simple enough, right?
Here’s the twist that makes everything complicated: autoregressive decoding.
When an LLM generates text, it doesn’t spit out a whole sentence at once. It predicts one token at a time. Think of it like a chef who has to make a five-course meal, but they can only cook one ingredient at a time, and they have to taste everything before adding the next ingredient. “First I’ll add salt… tastes… okay now pepper… tastes… now garlic…”
This means that for every single token the model generates, it needs to:
- Load the entire model from memory (yes, all 70 billion parameters)
- Do some math
- Produce one measly token
- Repeat
For a 100-token response, that’s loading the model 100 times. Each load means moving well over a hundred gigabytes through your GPU’s memory bus. And here’s the kicker — memory bandwidth has improved far more slowly than compute power. NVIDIA’s GPU floating-point performance grew 80x between 2012 and 2022, but memory bandwidth? Only 17x.
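To make that concrete, here’s a rough back-of-envelope calculation. The numbers (70B parameters at 16-bit precision, roughly 3.35 TB/s of HBM bandwidth for an H100-class GPU) are ballpark assumptions, and we’re ignoring the fact that a model this size would really be sharded across several GPUs:

```python
# Back-of-envelope: why decoding is memory-bandwidth-bound.
# All figures are rough assumptions for illustration only.

params = 70e9                 # 70B parameters
bytes_per_param = 2           # FP16/BF16 weights
hbm_bandwidth = 3.35e12       # ~3.35 TB/s, H100-class HBM (approximate)

weight_bytes = params * bytes_per_param          # ~140 GB streamed per decode step
time_per_token = weight_bytes / hbm_bandwidth    # bandwidth-bound lower bound, seconds

print(f"Weights to stream per token: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound floor: {time_per_token * 1e3:.0f} ms/token "
      f"(~{1 / time_per_token:.0f} tokens/s), before any batching tricks")
```

That floor has nothing to do with how fast the GPU can multiply; it’s purely how fast it can read its own weights.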
This is what engineers call the “Memory Wall,” and it’s been the bane of AI researchers’ existence for years. Your GPU might have the computational power of a small sun, but it spends most of its time sitting idle, drumming its fingers on the table, waiting for data to arrive from memory.
It’s like having a Formula 1 car stuck in city traffic. All that horsepower, nowhere to go.
The Memory Anchor: Optimizing the KV Cache
Trading VRAM for Velocity
Before we fix the memory wall, we need to understand a crucial concept: the KV Cache (Key-Value Cache).
Remember how I said the model generates one token at a time? Well, here’s a slightly horrifying fact: without caching, the model would have to recompute everything for every token it generates. If you’re generating the 50th token, the model would re-process all 49 previous tokens from scratch. That’s not a traffic jam — that’s purgatory.
The KV cache is the solution. It stores intermediate computations (specifically, the “keys” and “values” from the attention mechanism) so the model doesn’t have to redo work. But this creates a new problem: memory management.
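Before we get to the memory problem, it helps to see how little is actually going on mechanically. Here’s a minimal single-attention-head sketch in NumPy (dimensions made up) of what one decode step does with the cache:

```python
import numpy as np

class SingleHeadKVCache:
    """Minimal single-attention-head cache: two arrays that only ever grow."""

    def __init__(self, head_dim):
        self.head_dim = head_dim
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def decode_step(self, q_new, k_new, v_new):
        """Append this token's key/value once, then attend the new query against
        everything cached so far. No previous token is ever re-processed."""
        self.keys = np.vstack([self.keys, k_new])
        self.values = np.vstack([self.values, v_new])
        scores = self.keys @ q_new / np.sqrt(self.head_dim)  # one score per cached token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values
```

Every new token appends one more row of keys and one more row of values per layer, and those rows sit in GPU memory for as long as the request lives. Which brings us to the problem.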
Picture this: you’re running a server handling thousands of concurrent requests. Each request has its own KV cache. Some requests need long responses (big cache), some need short ones (small cache). Some requests finish early, some take forever. It’s like trying to park cars of wildly different sizes in a parking garage where cars keep arriving and leaving unpredictably.
Traditional systems pre-allocated memory for the maximum possible sequence length. Running a model that supports 8,000 tokens? Every request gets 8,000 tokens’ worth of memory, even if it only needs 50. The result? 60-80% of KV cache memory was wasted through fragmentation and over-allocation.
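How much memory are we talking about? Here’s a rough sizing sketch; the dimensions (80 layers, 8 KV heads of size 128, 16-bit values) are ballpark figures for a 70B-class model, not any specific checkpoint:

```python
# Rough KV cache sizing: reserved vs. actually used (illustrative numbers).
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # keys + values
reserved = 8_000 * bytes_per_token   # pre-allocated for the maximum sequence length
used = 50 * bytes_per_token          # what a short request actually needs

print(f"Per token: {bytes_per_token / 1024:.0f} KiB")
print(f"Reserved per request: {reserved / 1e6:.0f} MB, used: {used / 1e6:.1f} MB "
      f"({used / reserved:.1%} of the reservation)")
```

Roughly 2.6 GB reserved so that a 50-token answer can use about 16 MB of it. Multiply by thousands of concurrent requests and the parking garage fills up with empty reserved spaces.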
PagedAttention: How vLLM Changed Everything
In 2023, a team at UC Berkeley looked at this mess and said, “Wait, haven’t operating systems solved this problem already?”
They were right. The same engineers who figured out how to manage memory in your computer’s RAM decades ago had already cracked this nut. The solution? Paging.
PagedAttention, implemented in vLLM, breaks the KV cache into small, fixed-size “pages” (or blocks) that can be stored anywhere in memory. Instead of requiring one contiguous chunk of VRAM for each request, the cache becomes a scattered collection of blocks linked together by a lookup table.
Think of it like switching from a library where every book series must sit on adjacent shelves to one where books can go anywhere, and you just keep a catalog of where each one is. Suddenly, you can fit way more books.
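Here’s a toy version of that catalog, assuming nothing about vLLM’s actual internals beyond the basic idea: a shared pool of fixed-size blocks plus a per-request block table mapping logical token positions to physical blocks.

```python
BLOCK_SIZE = 16  # tokens per block (small and fixed; 16 is illustrative)

class BlockManager:
    """Toy paged-KV allocator: a free list of physical blocks + per-request block tables."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block IDs not in use
        self.block_tables = {}                      # request_id -> [physical block IDs]

    def append_token(self, request_id, token_index):
        """Reserve space for one more token; grab a new block only on a block boundary."""
        table = self.block_tables.setdefault(request_id, [])
        if token_index % BLOCK_SIZE == 0:           # previous block is full (or first token)
            table.append(self.free_blocks.pop())
        block = table[token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE      # where this token's K/V actually lives

    def free(self, request_id):
        """Request finished: every block goes straight back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```

The attention kernel then follows the block table instead of assuming the cache is contiguous, and memory sharing (see the results below) falls out naturally: two tables can point at the same physical block, with a reference count deciding when it really returns to the free list.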
The results were staggering:
- Memory waste dropped from 60-80% to under 4%
- Throughput improved 2-4x with the same hardware
- Memory sharing between requests became possible (if two users ask similar questions, they can share cache blocks)
But wait, there’s more. Quantization takes this further by shrinking the numbers themselves.
Quantization: Shrinking the Cache Without Losing the Logic
Here’s a fun fact: neural networks are surprisingly robust to imprecision. You can take values normally stored as 16- or 32-bit floating-point numbers, squeeze them into 8 bits, and the model barely notices.
Modern KV cache quantization comes in several flavors:
FP8 Quantization: Shrinks numbers from 16 bits to 8 bits. Works on newer NVIDIA GPUs (Ada Lovelace and Hopper architectures). Typical accuracy loss? Minimal. Memory savings? 50%.
INT8 Quantization: Takes it further with integer representation. Recent research shows you can achieve 4x memory reduction with reconstruction errors below 0.004. That’s like photocopying a photocopy and still being able to read the text perfectly.
NVFP4 (on Blackwell GPUs): The new kid on the block. Cuts memory footprint by 50% compared to FP8, lets you double your context length or batch size, with less than 1% accuracy loss.
It’s like discovering you can fit twice as many books in your library by using thinner paper, and somehow the words are still just as readable.
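The core move is simple enough to fit in a few lines. Here’s a generic symmetric per-tensor INT8 round trip in NumPy, purely as a sketch of the idea; real engines use finer-grained scales (per-channel or per-token) and the fancier formats above:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8: one float scale plus an int8 payload."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

kv_fp16 = np.random.randn(1024, 128).astype(np.float16)   # a fake slab of cached keys
q, scale = quantize_int8(kv_fp16)

print(f"Bytes: {kv_fp16.nbytes} -> {q.nbytes} (plus one scale)")
print(f"Mean abs reconstruction error: {np.abs(dequantize_int8(q, scale) - kv_fp16).mean():.4f}")
```

Half the bytes, one extra float to remember, and an error small enough that the attention scores downstream barely move.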
Speculative Decoding: Two Heads are Faster Than One
Using Draft Models to Leapfrog Sequential Latency
Remember our chef who tastes after every ingredient? What if we hired a junior chef to guess the next five ingredients while the head chef is busy?
That’s speculative decoding in a nutshell.
The setup: you have two models. A tiny, fast “draft” model, and your big, accurate “target” model. The draft model is like an eager intern — quick but occasionally wrong. The target model is the senior partner who has to approve everything.
Here’s the Draft and Verify cycle:
- Draft Phase: The small model races ahead and predicts the next 5-8 tokens
- Verify Phase: The big model looks at all those predictions in parallel and says “yes, yes, yes, no, no”
- Accept: All tokens up to the first rejection are kept
- Repeat: Start drafting again from the last accepted token
The magic here is parallelism. While autoregressive decoding forces the big model to work sequentially (one token at a time), verification can happen all at once. If the draft model guessed correctly, you just generated 5 tokens in the time it normally takes to generate 1.
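In code, the loop looks roughly like this. Both draft_model and target_model are hypothetical stand-ins for whatever greedy-decoding interface you have, and the acceptance rule shown is the simple “keep until the first mismatch” greedy variant rather than the full rejection-sampling scheme from the papers:

```python
def speculative_generate(prompt_tokens, draft_model, target_model,
                         draft_len=5, max_new_tokens=100):
    """Greedy draft-and-verify loop (sketch). Assumed interfaces:
    draft_model.greedy_decode(tokens, n) -> n drafted token ids,
    target_model.next_tokens(tokens, draft) -> n + 1 predictions, where
    prediction i is the big model's next token given tokens + draft[:i]."""
    tokens = list(prompt_tokens)
    generated = 0
    while generated < max_new_tokens:
        # Draft phase: the small model runs draft_len cheap sequential steps.
        draft = draft_model.greedy_decode(tokens, draft_len)

        # Verify phase: one parallel forward pass of the big model scores every
        # drafted position at once.
        target_next = target_model.next_tokens(tokens, draft)

        # Accept drafted tokens until the first disagreement; at the mismatch we
        # keep the big model's own choice, so every cycle advances by >= 1 token.
        accepted = []
        for drafted, wanted in zip(draft, target_next):
            if drafted == wanted:
                accepted.append(drafted)
            else:
                accepted.append(wanted)
                break
        else:
            # Every draft matched: the verification pass also hands us a bonus token.
            accepted.append(target_next[draft_len])

        tokens.extend(accepted)
        generated += len(accepted)

    return tokens
```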
When it works well, speculative decoding achieves 2-3x speedups. Apple’s recent Mirror Speculative Decoding technique pushes this to 2.8-5.8x by getting even more clever with parallel execution across different accelerators.
But here’s the honest truth: it’s fragile. The effectiveness depends heavily on:
- How well the draft model matches the target model’s “thinking”
- Batch sizes (works best with small batches)
- The specific task (some tasks are more predictable than others)
When the draft model’s guesses are wrong most of the time, you’ve essentially added overhead for nothing. It’s like hiring an intern who keeps suggesting ingredients the head chef hates — more work, same result.
Still, for latency-sensitive single-user scenarios (like a chatbot), speculative decoding can feel like magic.
Architectural Shortcuts: FlashAttention & Kernel Fusion
Squeezing Every FLOP Out of the GPU
Let’s get a bit more technical. Inside every transformer model, there’s an operation called “attention.” It’s the secret sauce that lets the model understand context — relating each word to every other word in the input.
The problem? Naive attention implementations are horrifically memory-inefficient.
Standard attention computes a giant matrix of attention scores, stores it in memory, does some operations on it, and then reads it back out. For a sequence of 8,000 tokens, this matrix has 64 million entries. Writing and reading that matrix from the GPU’s high-bandwidth memory (HBM) takes forever in GPU-time.
FlashAttention, created by Tri Dao and team, asked: “What if we just… didn’t store that matrix?”
The key insight is tiling. Instead of computing the entire attention matrix at once, FlashAttention breaks it into small blocks that fit in the GPU’s fast on-chip SRAM (think of it as L1 cache, but for a GPU). It computes attention for each block, updates a running result, and never materializes the full matrix.
It’s like reading a book by only looking at one paragraph at a time, remembering just enough to understand the story, rather than photocopying every page first.
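Here’s the arithmetic behind that trick for a single query vector, in NumPy. This is only the online-softmax recurrence that FlashAttention is built on, not the kernel itself; the real thing tiles queries too and keeps everything in on-chip SRAM:

```python
import numpy as np

def tiled_attention(q, K, V, block=256):
    """Attention for one query vector, processing K/V in tiles with a running softmax.

    Produces exactly softmax(q @ K.T / sqrt(d)) @ V, but never materializes the
    full row of scores.
    """
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far (numerical stability)
    l = 0.0               # running sum of exp(score - m)
    acc = np.zeros(d)     # running weighted sum of value rows

    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)                   # correct earlier partial results
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ V[start:start + block]
        m = m_new
    return acc / l

# Sanity check against the naive "build the whole score row" version.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((2048, 64))
V = rng.standard_normal((2048, 64))
scores = K @ q / np.sqrt(64)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
assert np.allclose(tiled_attention(q, K, V), weights @ V)
```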
The results:
- Exact same mathematical output (no approximation)
- 2-4x faster than standard attention
- Memory usage scales linearly with sequence length instead of quadratically
FlashAttention-3, optimized for NVIDIA’s H100 GPUs, takes this further with:
- Asynchronous execution: While one part of the chip is computing, another is loading the next chunk of data. No waiting.
- Warp specialization: Different groups of GPU threads specialize in different tasks (loading vs. computing), like a pit crew where everyone has one job and executes it perfectly.
- FP8 support: Lower precision for even faster math.
FlashAttention-3 achieves 75% of the H100’s theoretical maximum throughput. For context, FlashAttention-2 managed only around 35% on the same hardware. That’s like tuning a car engine to get twice the horsepower with the same fuel.
Kernel fusion extends this principle beyond attention. The idea: instead of running separate GPU programs (kernels) for each operation — load data, compute something, store result, load again, compute something else — you fuse multiple operations into a single kernel. One load, multiple computations, one store.
Every time you avoid a round trip to HBM, you win. It’s death by a thousand optimizations, but they add up.
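A toy accounting makes the payoff concrete. The tensor size here is made up, and in practice compilers and hand-written kernels do the fusing for you; this is bookkeeping, not GPU code:

```python
# Toy accounting of HBM traffic for three elementwise ops on the same activation
# tensor (bias add, activation, residual add). Illustrative numbers only.

elements = 4096 * 8192          # one activation tensor for a single layer (made up)
bytes_each = 2                  # FP16
tensor_bytes = elements * bytes_each

# Unfused: each kernel reads its input from HBM and writes its output back.
unfused_traffic = 3 * (2 * tensor_bytes)     # 3 kernels x (roughly 1 read + 1 write)

# Fused: read the input once, do all three ops in registers, write the result once.
fused_traffic = 2 * tensor_bytes

print(f"Unfused: {unfused_traffic / 1e6:.0f} MB of HBM traffic")
print(f"Fused:   {fused_traffic / 1e6:.0f} MB  (~{unfused_traffic / fused_traffic:.0f}x less)")
```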
Continuous Batching: Maximizing the Pipeline
Why Waiting for a Full Batch is a Legacy Mistake
Here’s how batching used to work in the dark ages (circa 2021):
- Collect N requests
- Wait until ALL of them finish
- Return results
- Collect next N requests
- Repeat
See the problem? If one request in your batch needs 500 tokens and another needs 10, the short request sits around waiting for the long one to finish. The GPU is processing the long request while the short request’s user is drumming their fingers.
This is static batching, and it’s terrible.
Continuous batching (also called iteration-level scheduling) fixes this elegantly:
- Process all requests token by token
- The moment a request finishes, immediately slot in a new one
- Never wait for the whole batch to complete
Imagine a restaurant where tables are cleared and reseated the moment each party leaves, rather than waiting for all parties to finish simultaneously. The kitchen (GPU) stays continuously busy.
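A sketch of the scheduling loop, with a hypothetical engine.step() standing in for one forward pass that advances every active request by a single token:

```python
from collections import deque

def continuous_batching_loop(requests, engine, max_batch=32):
    """Toy iteration-level scheduler: one decode step per loop, swap requests freely.

    `engine.step(batch)` and `request.deliver_result()` are assumed stand-ins, not
    any particular serving framework's API.
    """
    waiting = deque(requests)  # requests that haven't started yet
    running = []               # requests currently in the batch

    while running or waiting:
        # Fill freed slots immediately; never wait for the rest of the batch.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        finished = engine.step(running)        # one new token for every running request

        # Retire finished requests right away so new ones get in next iteration.
        for request in finished:
            request.deliver_result()
            running.remove(request)
```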
The implementation details matter:
- Chunked prefill: Break long initial prompts into smaller pieces that play nice with ongoing generation
- Ragged batching: Handle variable-length sequences without padding (no wasted computation)
- Dynamic scheduling: Smart algorithms decide which requests to prioritize
The numbers speak for themselves: continuous batching can deliver up to 23x throughput improvement over naive static batching. That’s not a typo. Twenty-three times.
Combined with PagedAttention, FlashAttention, and speculative decoding, you get inference servers that would have seemed like science fiction just a few years ago.
The Bigger Picture
What strikes me about all these optimizations is how they’re fundamentally about not doing work.
- PagedAttention: Don’t waste memory on empty space
- Quantization: Don’t use more bits than you need
- Speculative decoding: Don’t compute sequentially when you can verify in parallel
- FlashAttention: Don’t read and write more than necessary
- Continuous batching: Don’t let the GPU sit idle
Every breakthrough comes from someone looking at a system and asking, “Wait, why are we doing it this way?”
The teams at UC Berkeley (vLLM), Stanford (FlashAttention), and various research labs have essentially rebuilt LLM inference from first principles, questioning every assumption about how neural networks should run.
The result? Models that used to require server farms can now run on single machines. Responses that took seconds now take milliseconds. And this is just the beginning.
The memory wall is still there. Autoregressive decoding is still fundamentally sequential. But bit by bit, clever engineering keeps finding new ways to make intelligence cheaper and faster.
And somewhere, a GPU that used to spend 80% of its time waiting for memory is now actually doing the math it was built to do.
Sources
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- The Architecture Behind vLLM: How PagedAttention Improves Memory Utilization
- How PagedAttention resolves memory waste of LLM systems - Red Hat Developer
- Get 3× Faster LLM Inference with Speculative Decoding - BentoML
- An Introduction to Speculative Decoding for Reducing Latency in AI Inference - NVIDIA
- Mirror Speculative Decoding - Apple Machine Learning Research
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Continuous Batching from First Principles - Hugging Face
- Achieve 23x LLM Inference Throughput & Reduce p50 Latency - Anyscale
- GPU-Accelerated INT8 Quantization for KV Cache Compression
- Optimizing Inference with NVFP4 KV Cache - NVIDIA
- Memory Bandwidth and Compute Bottlenecks in LLM Inference
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference