Here’s something that used to keep me up at night: why does ChatGPT feel instant, while my own attempts at running a large language model on a cloud GPU felt like waiting for dial-up internet to load a JPEG in 1997?

The answer, as it turns out, has very little to do with raw computing power. It’s about memory. Specifically, it’s about moving bytes around in clever ways that would make a logistics expert weep with joy. Welcome to the bizarre, beautiful world of LLM inference optimization.

The Compute Tax: Why LLM Inference is Hard

Let me paint you a picture. You’ve got this magnificent neural network with 70 billion parameters. Each parameter is a number. Each number needs to be fetched from memory, multiplied, added, and the result stored somewhere. Simple enough, right?

Here’s the twist that makes everything complicated: autoregressive decoding.

When an LLM generates text, it doesn’t spit out a whole sentence at once. It predicts one token at a time. Think of it like a chef who has to make a five-course meal, but they can only cook one ingredient at a time, and they have to taste everything before adding the next ingredient. “First I’ll add salt… tastes… okay now pepper… tastes… now garlic…”

This means that for every single token the model generates, it needs to:

  1. Stream every one of the model’s weights out of GPU memory (yes, all 70 billion parameters)
  2. Do some math
  3. Produce one measly token
  4. Repeat

For a 100-token response, that’s loading the model 100 times. Each load requires moving hundreds of gigabytes through your GPU’s memory bus. And here’s the kicker — memory bandwidth improves much slower than compute power. NVIDIA’s GPU floating-point performance grew 80x between 2012 and 2022, but memory bandwidth? Only 17x.
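
To make that concrete, here’s a rough back-of-envelope calculation in Python. The numbers are assumptions for illustration (a 70-billion-parameter model stored in 16-bit precision, and roughly 3.3 TB/s of HBM bandwidth, which is in the ballpark of an H100), not measurements from any particular setup.

```python
# Back-of-envelope: how long is one decode step if it is purely limited by
# memory bandwidth? All numbers below are illustrative assumptions.

params = 70e9                  # 70B parameters
bytes_per_param = 2            # FP16/BF16 weights
hbm_bandwidth = 3.3e12         # ~3.3 TB/s, roughly H100-class HBM

weight_bytes = params * bytes_per_param           # bytes streamed per generated token
seconds_per_token = weight_bytes / hbm_bandwidth  # lower bound that ignores compute entirely

print(f"Weights streamed per token: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound time per token: {seconds_per_token * 1e3:.1f} ms")
print(f"Ceiling on tokens/sec for one stream: {1 / seconds_per_token:.0f}")
# ~140 GB per token, ~42 ms per token, ~24 tokens/sec, no matter how many FLOPS you have
```

One division tells the whole story: for a single stream of text, the ceiling on tokens per second is set by how fast you can move the weights, not by how fast you can multiply them.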

This is what engineers call the “Memory Wall,” and it’s been the bane of AI researchers’ existence for years. Your GPU might have the computational power of a small sun, but it spends most of its time sitting idle, drumming its fingers on the table, waiting for data to arrive from memory.

It’s like having a Formula 1 car stuck in city traffic. All that horsepower, nowhere to go.

The Memory Anchor: Optimizing the KV Cache

Trading VRAM for Velocity

Before we fix the memory wall, we need to understand a crucial concept: the KV Cache (Key-Value Cache).

Remember how I said the model generates one token at a time? Well, here’s a slightly horrifying fact: without caching, the model would have to recompute everything for every token it generates. If you’re generating the 50th token, the model would re-process all 49 previous tokens from scratch. That’s not a traffic jam — that’s purgatory.

The KV cache is the solution. It stores intermediate computations (specifically, the “keys” and “values” from the attention mechanism) so the model doesn’t have to redo work. But this creates a new problem: memory management.
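
To see what “storing the keys and values” actually looks like, here’s a toy single-head attention decode step with a KV cache in plain NumPy. It’s a minimal sketch of the idea, not how any real library implements it (those are multi-head, batched, and live on the GPU):

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive step of single-head attention with a KV cache.

    x_new: embedding of the newest token, shape (d,)
    k_cache, v_cache: keys/values for all previous tokens, shape (t, d)
    Returns the attention output and the grown caches.
    """
    q = x_new @ W_q                      # query for the new token only
    k = x_new @ W_k                      # key/value for the new token...
    v = x_new @ W_v
    k_cache = np.vstack([k_cache, k])    # ...appended to the cache instead of
    v_cache = np.vstack([v_cache, v])    # recomputing them for every old token

    scores = k_cache @ q / np.sqrt(len(q))    # attention scores over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax
    out = weights @ v_cache                   # weighted sum of cached values
    return out, k_cache, v_cache

# Tiny usage example with random weights.
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
for token_embedding in rng.normal(size=(5, d)):   # "generate" 5 tokens
    out, k_cache, v_cache = decode_step(token_embedding, W_q, W_k, W_v, k_cache, v_cache)
print(k_cache.shape)  # (5, 8): the cache grows by one row per generated token
```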

Picture this: you’re running a server handling thousands of concurrent requests. Each request has its own KV cache. Some requests need long responses (big cache), some need short ones (small cache). Some requests finish early, some take forever. It’s like trying to park cars of wildly different sizes in a parking garage where cars keep arriving and leaving unpredictably.

Traditional systems pre-allocated memory for the maximum possible sequence length. Running a model that supports 8,000 tokens? Every request gets 8,000 tokens’ worth of memory, even if it only needs 50. The result? 60-80% of KV cache memory was wasted through fragmentation and over-allocation.
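
How big is that cache, anyway? Here’s a rough sizing sketch. The configuration is an assumption chosen to resemble a Llama-2-70B-style model (80 layers, 8 KV heads from grouped-query attention, head dimension 128, FP16 cache values); swap in your own numbers.

```python
# Rough KV cache sizing for a Llama-2-70B-style configuration (an assumption):
# 80 layers, 8 KV heads (grouped-query attention), head dimension 128, FP16 values.
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # 2 = keys + values
reserved = 8_000 * bytes_per_token       # old-style: reserve the maximum up front
actually_used = 50 * bytes_per_token     # what a short 50-token response really needs

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KB")
print(f"Reserved per request: {reserved / 1e9:.2f} GB")
print(f"Used by a 50-token response: {actually_used / 1e6:.1f} MB "
      f"({100 * actually_used / reserved:.1f}% of the reservation)")
```

Multiply a couple of gigabytes of mostly-empty reservation by hundreds of concurrent requests, and that 60-80% figure stops looking surprising.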

PagedAttention: How vLLM Changed Everything

In 2023, a team at UC Berkeley looked at this mess and said, “Wait, haven’t operating systems solved this problem already?”

They were right. Operating systems cracked this exact problem decades ago, when they figured out how to manage your computer’s RAM. The solution? Paging, the same trick that powers virtual memory.

PagedAttention, implemented in vLLM, breaks the KV cache into small, fixed-size “pages” (or blocks) that can be stored anywhere in memory. Instead of requiring one contiguous chunk of VRAM for each request, the cache becomes a scattered collection of blocks linked together by a lookup table.

Think of it like switching from a library where every book series must sit on adjacent shelves to one where books can go anywhere, and you just keep a catalog of where each one is. Suddenly, you can fit way more books.
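
The bookkeeping behind that catalog is essentially a page table. Here’s a deliberately tiny sketch of the idea; the class names are made up for illustration, and real vLLM adds reference counting, copy-on-write, prefix sharing, and eviction on top:

```python
BLOCK_SIZE = 16  # tokens per KV cache block (16 happens to be vLLM's default, too)

class BlockAllocator:
    """Toy allocator: hands out fixed-size physical blocks from a free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def free_blocks(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a request's logical token positions to scattered physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:                 # current block is full (or none yet)
            self.block_table.append(self.allocator.alloc())   # grab exactly one more block
        self.num_tokens += 1

    def release(self):
        self.allocator.free_blocks(self.block_table)
        self.block_table.clear()

# A request that generates 50 tokens holds ceil(50 / 16) = 4 blocks, not 8,000 tokens' worth.
alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(50):
    seq.append_token()
print(len(seq.block_table))  # 4
seq.release()                # blocks go straight back to the pool for other requests
```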

The results were staggering:

  • Memory waste dropped from 60-80% to under 4%
  • Throughput improved 2-4x with the same hardware
  • Memory sharing between requests became possible (if two users ask similar questions, they can share cache blocks)

But wait, there’s more. Quantization takes this further by shrinking the numbers themselves.

Quantization: Shrinking the Cache Without Losing the Logic

Here’s a fun fact: neural networks are surprisingly robust to imprecision. You can take numbers stored in 16 or 32 bits of floating point, squeeze them into 8 bits, and the model barely notices.

Modern KV cache quantization comes in several flavors:

FP8 Quantization: Shrinks numbers from 16 bits to 8 bits. Works on newer NVIDIA GPUs (Ada Lovelace and Hopper architectures). Typical accuracy loss? Minimal. Memory savings? 50%.

INT8 Quantization: Takes it further with integer representation. Recent research shows you can achieve 4x memory reduction with reconstruction errors below 0.004. That’s like photocopying a photocopy and still being able to read the text perfectly.

NVFP4 (on Blackwell GPUs): The new kid on the block. Cuts memory footprint by 50% compared to FP8, lets you double your context length or batch size, with less than 1% accuracy loss.

It’s like discovering you can fit twice as many books in your library by using thinner paper, and somehow the words are still just as readable.
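
If you want to see the mechanics, here’s a minimal NumPy sketch of symmetric per-tensor INT8 quantization of a fake KV cache block. Production schemes are typically per-channel or per-block and considerably more careful; this is just the shape of the idea:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: store int8 values plus one scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv_block = rng.normal(scale=0.1, size=(16, 128)).astype(np.float32)  # a fake KV cache block

q, scale = quantize_int8(kv_block)
recovered = dequantize(q, scale)

print(f"Memory: {kv_block.nbytes} bytes -> {q.nbytes} bytes (plus one float for the scale)")
print(f"Mean absolute reconstruction error: {np.abs(kv_block - recovered).mean():.5f}")
```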

Speculative Decoding: Two Heads Are Faster Than One

Using Draft Models to Leapfrog Sequential Latency

Remember our chef who tastes after every ingredient? What if we hired a junior chef to guess the next five ingredients while the head chef is busy?

That’s speculative decoding in a nutshell.

The setup: you have two models. A tiny, fast “draft” model, and your big, accurate “target” model. The draft model is like an eager intern — quick but occasionally wrong. The target model is the senior partner who has to approve everything.

Here’s the Draft and Verify cycle:

  1. Draft Phase: The small model races ahead and predicts the next 5-8 tokens
  2. Verify Phase: The big model looks at all those predictions in parallel and says “yes, yes, yes, no, no”
  3. Accept: All tokens up to the first rejection are kept
  4. Repeat: Start drafting again from the last accepted token

The magic here is parallelism. While autoregressive decoding forces the big model to work sequentially (one token at a time), verification can happen all at once. If the draft model guessed correctly, you just generated 5 tokens in the time it normally takes to generate 1.
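
Here’s the skeleton of that draft-and-verify loop as runnable Python. The “models” are fake stand-ins (real ones would be two LLMs sharing a tokenizer), and the acceptance rule is the simple greedy exact-match variant rather than the full rejection-sampling scheme from the speculative decoding papers:

```python
import random
random.seed(0)

VOCAB = list(range(100))

def draft_model(context):
    """Stand-in for the small, fast model: a cheap guess at the next token."""
    return (context[-1] * 7 + 3) % 100

def target_model(contexts):
    """Stand-in for the big model: scores many contexts in one parallel pass.
    It agrees with the draft about 70% of the time and disagrees otherwise."""
    return [(ctx[-1] * 7 + 3) % 100 if random.random() < 0.7 else random.choice(VOCAB)
            for ctx in contexts]

def speculative_decode(prompt, num_tokens, k=5):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + num_tokens:
        # 1. Draft phase: the small model races ahead k tokens, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # 2. Verify phase: the big model checks all k positions in ONE parallel pass.
        verified = target_model([tokens + draft[:i] for i in range(k)])

        # 3. Accept drafted tokens up to the first disagreement, then take the
        #    target model's own token at the position where they diverged.
        n_accepted = 0
        while n_accepted < k and draft[n_accepted] == verified[n_accepted]:
            n_accepted += 1
        tokens += draft[:n_accepted]
        if n_accepted < k:
            tokens.append(verified[n_accepted])
        # 4. Repeat, drafting again from the last accepted token.
    return tokens[len(prompt):len(prompt) + num_tokens]

print(speculative_decode(prompt=[1], num_tokens=20))
```

In the toy above, the stand-in draft agrees with the stand-in target about 70% of the time, so most iterations accept several tokens at once.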

When it works well, speculative decoding achieves 2-3x speedups. Apple’s recent Mirror Speculative Decoding technique pushes this to 2.8-5.8x by getting even more clever with parallel execution across different accelerators.

But here’s the honest truth: it’s fragile. The effectiveness depends heavily on:

  • How well the draft model matches the target model’s “thinking”
  • Batch sizes (works best with small batches)
  • The specific task (some tasks are more predictable than others)

When the draft model’s guesses are wrong most of the time, you’ve essentially added overhead for nothing. It’s like hiring an intern who keeps suggesting ingredients the head chef hates — more work, same result.

Still, for latency-sensitive single-user scenarios (like a chatbot), speculative decoding can feel like magic.

Architectural Shortcuts: FlashAttention & Kernel Fusion

Squeezing Every FLOP Out of the GPU

Let’s get a bit more technical. Inside every transformer model, there’s an operation called “attention.” It’s the secret sauce that lets the model understand context — relating each word to every other word in the input.

The problem? Naive attention implementations are horrifically memory-inefficient.

Standard attention computes a giant matrix of attention scores, stores it in memory, does some operations on it, and then reads it back out. For a sequence of 8,000 tokens, that matrix has 64 million entries per attention head. Writing all of that out to the GPU’s high-bandwidth memory (HBM) and reading it back takes forever in GPU-time.

FlashAttention, created by Tri Dao and team, asked: “What if we just… didn’t store that matrix?”

The key insight is tiling. Instead of computing the entire attention matrix at once, FlashAttention breaks it into small blocks that fit in the GPU’s fast on-chip SRAM (think of it as L1 cache, but for a GPU). It computes attention for each block, updates a running result, and never materializes the full matrix.

It’s like reading a book by only looking at one paragraph at a time, remembering just enough to understand the story, rather than photocopying every page first.
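
The running-result trick is usually called online softmax, and it’s small enough to sketch in NumPy. This version handles one query at a time and walks over the keys and values block by block, keeping only a running max, a running normalizer, and a running weighted sum; the real kernel does this for tiles of queries and keys at once, in CUDA, with far more care:

```python
import numpy as np

def attention_one_query_tiled(q, K, V, block=128):
    """Attention output for one query without ever materializing its full score row."""
    d = len(q)
    m = -np.inf          # running max of scores seen so far (numerical stability)
    l = 0.0              # running softmax normalizer
    acc = np.zeros(d)    # running, not-yet-normalized weighted sum of values

    for start in range(0, len(K), block):
        K_blk, V_blk = K[start:start + block], V[start:start + block]
        s = K_blk @ q / np.sqrt(d)        # scores for this block only

        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)       # shrink the old state to agree with the new max
        p = np.exp(s - m_new)             # this block's softmax numerators

        l = l * rescale + p.sum()
        acc = acc * rescale + p @ V_blk
        m = m_new

    return acc / l

# Sanity check against the naive "materialize the whole score row" version.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K, V = rng.normal(size=(8192, 64)), rng.normal(size=(8192, 64))

scores = K @ q / np.sqrt(64)              # the full 8192-entry row we avoided above
weights = np.exp(scores - scores.max())
naive = (weights / weights.sum()) @ V

print(np.allclose(attention_one_query_tiled(q, K, V), naive))  # True: exact, no approximation
```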

The results:

  • Exact same mathematical output (no approximation)
  • 2-4x faster than standard attention
  • Memory usage scales linearly with sequence length instead of quadratically

FlashAttention-3, optimized for NVIDIA’s H100 GPUs, takes this further with:

  • Asynchronous execution: While one part of the chip is computing, another is loading the next chunk of data. No waiting.
  • Warp specialization: Different groups of GPU threads specialize in different tasks (loading vs. computing), like a pit crew where everyone has one job and executes it perfectly.
  • FP8 support: Lower precision for even faster math.

FlashAttention-3 achieves 75% of the H100’s theoretical maximum throughput. For context, FlashAttention-2 managed around 35% on the same hardware. That’s like tuning a car engine to get twice the horsepower with the same fuel.

Kernel fusion extends this principle beyond attention. The idea: instead of running separate GPU programs (kernels) for each operation — load data, compute something, store result, load again, compute something else — you fuse multiple operations into a single kernel. One load, multiple computations, one store.

Every time you avoid a round trip to HBM, you win. It’s death by a thousand optimizations, but they add up.
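
Here’s a loose analogy in NumPy. Python can’t actually fuse GPU kernels, but the difference between writing out full-size intermediates and processing data in chunks small enough to stay “hot” captures the spirit of what a fused kernel does inside a single launch:

```python
import numpy as np

def unfused(x, w, b):
    """Three separate passes: each step writes a full-size intermediate to memory."""
    y = x @ w                      # pass 1: write the whole matmul result
    y = y + b                      # pass 2: read it all back, write it all again
    return np.maximum(y, 0.0)      # pass 3: one more full read and write

def chunked(x, w, b, chunk=1024):
    """One pass per chunk: matmul, bias, and ReLU applied while the chunk is 'hot'.
    A real fused GPU kernel does this inside a single launch; this is just the idea."""
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for i in range(0, x.shape[0], chunk):
        blk = x[i:i + chunk] @ w                        # small intermediate, cache-sized
        np.maximum(blk + b, 0.0, out=out[i:i + chunk])  # finish it before moving on
    return out

rng = np.random.default_rng(0)
x, w, b = rng.normal(size=(4096, 512)), rng.normal(size=(512, 512)), rng.normal(size=512)
print(np.allclose(unfused(x, w, b), chunked(x, w, b)))  # True: same math, fewer round trips
```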

Continuous Batching: Maximizing the Pipeline

Why Waiting for a Full Batch Is a Legacy Mistake

Here’s how batching used to work in the dark ages (circa 2021):

  1. Collect N requests
  2. Wait until ALL of them finish
  3. Return results
  4. Collect next N requests
  5. Repeat

See the problem? If one request in your batch needs 500 tokens and another needs 10, the short request sits around waiting for the long one to finish. The GPU is processing the long request while the short request’s user is drumming their fingers.

This is static batching, and it’s terrible.

Continuous batching (also called iteration-level scheduling) fixes this elegantly:

  • Process all requests token by token
  • The moment a request finishes, immediately slot in a new one
  • Never wait for the whole batch to complete

Imagine a restaurant where tables are cleared and reseated the moment each party leaves, rather than waiting for all parties to finish simultaneously. The kitchen (GPU) stays continuously busy.
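
The core scheduling loop is simple enough to sketch. Everything here is invented for illustration (one-token “steps” stand in for real forward passes); production schedulers like vLLM’s also juggle KV cache blocks, chunked prefill, and priorities:

```python
import random
from collections import deque
random.seed(0)

MAX_BATCH = 8

class Request:
    def __init__(self, rid, tokens_needed):
        self.rid = rid
        self.tokens_needed = tokens_needed   # some want 10 tokens, some want 500
        self.generated = 0

    def done(self):
        return self.generated >= self.tokens_needed

waiting = deque(Request(i, random.randint(10, 500)) for i in range(100))
running = []
iterations = 0

while waiting or running:
    # Admit new requests the moment a slot frees up; never wait for the whole batch.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One iteration: every running request advances by exactly one token.
    for req in running:
        req.generated += 1
    iterations += 1

    # Retire finished requests immediately so their slots are reused next iteration.
    running = [r for r in running if not r.done()]

print(f"Served 100 requests in {iterations} single-token iterations, "
      f"never holding more than {MAX_BATCH} in flight")
```

The static-batching version of this loop would only admit new requests once all eight slots had finished, which is exactly how short requests end up waiting on long ones.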

The implementation details matter:

  • Chunked prefill: Break long initial prompts into smaller pieces that play nice with ongoing generation
  • Ragged batching: Handle variable-length sequences without padding (no wasted computation)
  • Dynamic scheduling: Smart algorithms decide which requests to prioritize

The numbers speak for themselves: continuous batching can deliver up to 23x throughput improvement over naive static batching. That’s not a typo. Twenty-three times.

Combined with PagedAttention, FlashAttention, and speculative decoding, you get inference servers that would have seemed like science fiction just a few years ago.

The Bigger Picture

What strikes me about all these optimizations is how they’re fundamentally about not doing work.

  • PagedAttention: Don’t waste memory on empty space
  • Quantization: Don’t use more bits than you need
  • Speculative decoding: Don’t compute sequentially when you can verify in parallel
  • FlashAttention: Don’t read and write more than necessary
  • Continuous batching: Don’t let the GPU sit idle

Every breakthrough comes from someone looking at a system and asking, “Wait, why are we doing it this way?”

The teams at UC Berkeley (vLLM), Stanford (FlashAttention), and various research labs have essentially rebuilt LLM inference from first principles, questioning every assumption about how neural networks should run.

The result? Models that used to require server farms can now run on single machines. Responses that took seconds now take milliseconds. And this is just the beginning.

The memory wall is still there. Autoregressive decoding is still fundamentally sequential. But bit by bit, clever engineering keeps finding new ways to make intelligence cheaper and faster.

And somewhere, a GPU that used to spend 80% of its time waiting for memory is now actually doing the math it was built to do.

