Context Window Management Is a New Engineering Discipline

LLMs have finite context. Managing what goes in — and when — is now a first-class engineering concern, not a prompt hack.

Memory management was once considered a niche systems concern. Then applications got complex enough that ignoring it meant your program crashed, leaked, or silently corrupted state. The field figured out allocation strategies, garbage collection, cache hierarchies, and eviction policies. It took decades and became foundational.

We are at the beginning of that same arc with LLM context windows. Right now, most teams treat context as an afterthought — stuff the relevant content in, hope the model picks out what matters, and debug hallucinations as if they were model failures. They are not model failures. They are context engineering failures.

What a context window actually is

A context window is not a bucket you fill. It is the only working memory an LLM has during a single inference call. Everything the model "knows" for that call (the system prompt, the conversation history, the retrieved documents, the tool outputs, the examples) has to fit inside it. When the window fills up, something has to give: either the request is rejected, or some layer of your stack drops content to make it fit. Usually you don't control what.

Modern frontier models have large windows: 128K tokens, 200K tokens, in some cases more. That sounds like a lot until you're running a multi-step agent that retrieves five documents per step, keeps a running scratchpad, includes a detailed system prompt, and appends tool call logs. You burn through 128K tokens faster than you think, and as the context grows toward the limit, the model's use of it degrades. Position matters. Studies on long-context models have repeatedly found that information in the middle of a long context is retrieved less reliably than information at the start or end, the "lost in the middle" phenomenon. A full context window is not a well-utilized context window.
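
To make the burn rate concrete, here is a back-of-the-envelope sketch. Every number in it (chunk size, scratchpad growth, log size, system prompt size) is an assumption chosen for illustration, not a measurement, and it assumes nothing is evicted between steps.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    """Hypothetical per-step token costs for an agent loop."""
    retrieved_docs: int = 5        # documents retrieved per step
    tokens_per_doc: int = 1_500    # assumed average size of one retrieved document
    scratchpad_growth: int = 800   # assumed tokens added to the scratchpad per step
    tool_log: int = 600            # assumed tokens of tool call logs per step

    @property
    def total(self) -> int:
        # Tokens added to the context by a single step.
        return (self.retrieved_docs * self.tokens_per_doc
                + self.scratchpad_growth
                + self.tool_log)


def steps_until_full(window: int, system_prompt: int, cost: StepCost) -> int:
    """Whole steps that fit in the window if nothing is ever evicted."""
    available = window - system_prompt
    return max(available // cost.total, 0)


if __name__ == "__main__":
    cost = StepCost()
    print(cost.total, steps_until_full(128_000, 3_000, cost))
```

With these made-up numbers each step adds 8,900 tokens, so a 128K window with a 3,000-token system prompt holds 14 steps before something has to be dropped.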

Why naive RAG fails here

Retrieval-augmented generation is the current standard answer to context limits. You embed your documents, index them, and retrieve the top-K chunks by semantic similarity at query time. This works well in demos. It degrades in production for a specific reason: retrieval optimizes for semantic similarity, not for what the model needs at this step.

Say an agent is three steps into a workflow. It's just extracted structured data from a PDF and needs to validate it against a business rule. A semantic similarity search retrieves the five document chunks most similar to the query — which are often the same five chunks every time, because the query is similar. What the model actually needs might be the exception list for that specific rule category, which is in a chunk that scored 0.61 on similarity instead of 0.87.

Naive RAG treats retrieval as a search problem. Context management treats it as a scheduling problem: given what this particular model call needs to accomplish, what information maximizes the probability of a correct, useful output?

Those are different problems with different solutions.
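
A minimal sketch of the difference, assuming each chunk carries hypothetical metadata tags and each pipeline step declares which tags it needs. The `Chunk` type, the tag names, and the `need_weight` blend are all illustrative choices, not an established algorithm.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    similarity: float                      # score from the vector search
    tags: frozenset = field(default_factory=frozenset)


def select_for_step(chunks, needed_tags, k=2, need_weight=0.5):
    """Rank chunks by similarity plus a bonus for matching what this step needs."""
    needed = frozenset(needed_tags)

    def score(chunk):
        # Fraction of the step's declared needs that this chunk covers.
        overlap = len(chunk.tags & needed) / len(needed) if needed else 0.0
        return chunk.similarity + need_weight * overlap

    return sorted(chunks, key=score, reverse=True)[:k]


if __name__ == "__main__":
    candidates = [
        Chunk("rule-overview", 0.87, frozenset({"rule"})),
        Chunk("rule-faq", 0.85, frozenset({"rule"})),
        Chunk("rule-exceptions", 0.61, frozenset({"rule", "exceptions"})),
    ]
    picked = select_for_step(candidates, needed_tags={"exceptions"})
    print([c.chunk_id for c in picked])
```

Pure similarity ranking would drop the 0.61 exception list when k is 2. Once the validation step declares that it needs exceptions, that chunk ranks first.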

Chunking is necessary but not sufficient

The standard advice for RAG is "chunk better." Use overlapping windows. Respect sentence boundaries. Store hierarchical summaries alongside raw chunks. This is all correct and none of it is enough.

Chunking determines what units are available for retrieval. It says nothing about how much context the model actually needs to reason correctly, which chunks depend on each other to be coherent, or whether the sum of the top-K chunks exceeds the usable window even if it fits in the technical limit.

Consider a 10,000-word technical specification with dependencies between sections. Section 4 defines terms used in Section 7. If your agent retrieves Section 7 without Section 4, it's working with an incomplete semantic context even if both chunks individually look relevant. Overlap helps with sentences. It doesn't help with semantic dependencies across a large document.
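
One way to handle the Section 4 and Section 7 case is to record dependencies between sections and expand whatever retrieval returns. The sketch below assumes a hand-maintained `depends_on` map, which is hypothetical; in practice that map might be derived from cross-references in the document.

```python
def expand_with_dependencies(retrieved, depends_on):
    """Return retrieved sections plus everything they depend on.

    Dependencies are placed before the sections that use them, so
    definitions appear in the context ahead of their first use.
    """
    ordered = []
    seen = set()

    def visit(section):
        if section in seen:
            return
        seen.add(section)  # mark before recursing so cycles terminate
        for dep in depends_on.get(section, ()):
            visit(dep)
        ordered.append(section)

    for section in retrieved:
        visit(section)
    return ordered


if __name__ == "__main__":
    depends_on = {
        "section-7": ["section-4"],   # Section 7 uses terms defined in Section 4
        "section-4": ["section-1"],
    }
    print(expand_with_dependencies(["section-7"], depends_on))
```

The expansion costs tokens, so in a real pipeline it would have to be weighed against the budget for the step.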

The deeper issue is that chunking is a data structuring decision made offline, but context management is a runtime decision made per-call. What the model needs varies by task, query, and step in a pipeline. Treating chunk selection as a one-time data engineering problem means you've hardcoded a retrieval strategy that may be wrong for most of your actual queries.

The analogy to memory management

In systems programming, you can't just allocate memory without thinking about lifetime. When does this data become irrelevant? Who owns it? What happens when the reference is no longer valid? Engineers who don't answer these questions ship programs that leak.

LLM context has the same structure. Every token in the context window has a lifetime. The conversation history from ten turns ago may be irrelevant to the current task. The retrieved document chunk that was useful at step two is noise at step seven. The detailed system prompt that's necessary for open-ended queries is overhead for a focused extraction task.

Memory management in systems gave us allocators, garbage collectors, and RAII. The LLM equivalent is starting to take shape: context compressors that summarize history rather than truncating it, dynamic retrieval that re-queries mid-pipeline rather than front-loading all context, tiered context where high-priority information is placed at window boundaries, and context budgets that limit what each agent step can consume.
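
Of the ideas above, placing high-priority information at the window boundaries is the easiest to sketch. The function below assumes each item already has a numeric priority (how that priority is assigned is left open) and alternates items between the front and the back so the least important ones land in the middle.

```python
def arrange_for_boundaries(items):
    """Order (priority, text) pairs so high priorities sit at the start and end.

    Higher priority means more important. The lowest-priority items end up
    in the middle of the returned list.
    """
    ranked = sorted(items, key=lambda item: item[0], reverse=True)
    front, back = [], []
    for index, item in enumerate(ranked):
        # Alternate: most important to the front, next to the back, and so on.
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]


if __name__ == "__main__":
    items = [
        (3, "retrieved chunk B"),
        (5, "task instructions"),
        (1, "old tool log"),
        (4, "current user request"),
        (2, "retrieved chunk C"),
    ]
    for priority, text in arrange_for_boundaries(items):
        print(priority, text)
```

Here the task instructions come first, the current user request comes last, and the old tool log sits in the middle, where unreliable recall costs the least.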

None of this is standard yet. Most production AI systems have none of it. That's where we are in the arc.

What first-class context engineering looks like

The teams getting this right share a few practices that the others lack.

They instrument context usage. Every LLM call logs what was in the context, how many tokens it consumed, and what the model did with it. When a failure happens, they can inspect the exact context state rather than guessing. This is the equivalent of heap profiling — you can't fix what you can't observe.
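
A minimal sketch of what such a log record could contain. The field names are illustrative, and `rough_token_count` is a crude word-count stand-in; a real system would use the tokenizer that matches its model.

```python
import hashlib
import json
import time


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def context_record(call_id: str, sections: dict) -> dict:
    """Build a loggable snapshot of what went into one LLM call.

    `sections` maps a section name (for example "system" or "retrieved")
    to the text placed in the context for that section.
    """
    per_section = {name: rough_token_count(text) for name, text in sections.items()}
    # Hash of the exact context so identical contexts can be matched across calls.
    digest = hashlib.sha256(
        json.dumps(sections, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "call_id": call_id,
        "timestamp": time.time(),
        "tokens_by_section": per_section,
        "total_tokens": sum(per_section.values()),
        "context_sha256": digest,
    }


if __name__ == "__main__":
    record = context_record("call-001", {
        "system": "You are a validator.",
        "retrieved": "Rule 12 applies to invoices.",
    })
    print(json.dumps(record, indent=2))
```

Storing the full context text alongside the record (not shown here) is what makes it possible to inspect the exact context state after a failure.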

They treat context as a resource with a budget. Each step in a pipeline gets an allocation: this much for system instructions, this much for retrieved content, this much for conversation history. When a step exceeds its budget, the system compresses before it truncates. Compression preserves meaning. Truncation just removes tokens.
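
The compress-before-truncate rule might look like this in outline. The `summarize` callable is supplied by the caller and is hypothetical here (in practice it would probably be another model call), and token counting is approximated by word count.

```python
from typing import Callable, Dict


def count_tokens(text: str) -> int:
    """Word count as a rough stand-in for a real tokenizer."""
    return len(text.split())


def truncate(text: str, budget: int) -> str:
    """Last resort: keep only the first `budget` words."""
    return " ".join(text.split()[:budget])


def fit_section(text: str, budget: int, summarize: Callable[[str], str]) -> str:
    """Return text that fits the budget, compressing before truncating."""
    if count_tokens(text) <= budget:
        return text
    compressed = summarize(text)
    if count_tokens(compressed) <= budget:
        return compressed
    # Compression was not enough, so fall back to truncation.
    return truncate(compressed, budget)


def build_context(sections: Dict[str, str], budgets: Dict[str, int],
                  summarize: Callable[[str], str]) -> Dict[str, str]:
    """Apply each section's budget independently."""
    return {name: fit_section(text, budgets[name], summarize)
            for name, text in sections.items()}


if __name__ == "__main__":
    # Toy summarizer for the demo: keep only the first sentence.
    first_sentence = lambda text: text.split(". ")[0] + "."
    sections = {"history": "User asked about refunds. Then we discussed shipping at length."}
    print(build_context(sections, {"history": 5}, first_sentence))
```

Truncation only happens if the summary is still over budget, so the fallback path is a signal worth logging.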

They separate what the model needs from what you have available. Having a 200K-token document doesn't mean 200K tokens should go into the context. The question is: what is the minimum context required for this step to succeed? Anything beyond that is noise that competes for attention.

They version context strategies alongside code. The system prompt is version-controlled. The retrieval strategy is reviewed when the task changes. Context bugs are tracked as engineering bugs, not model quality issues. This is more an organizational change than a technical one.
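
One lightweight way to make a context strategy reviewable is to express it as a plain, immutable value that lives in the repository and can be fingerprinted. The `ContextStrategy` fields below are assumptions about what a strategy might include, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ContextStrategy:
    """Hypothetical description of how context is assembled for one task."""
    name: str
    version: str
    system_prompt: str
    top_k: int
    history_budget_tokens: int

    def fingerprint(self) -> str:
        # Short, stable hash of every field, suitable for attaching to logs.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


if __name__ == "__main__":
    strategy = ContextStrategy(
        name="invoice-validation",
        version="2",
        system_prompt="You validate extracted invoice fields against business rules.",
        top_k=5,
        history_budget_tokens=2_000,
    )
    print(strategy.version, strategy.fingerprint())
```

Logging the fingerprint with each model call ties a bad output to the exact strategy that produced its context, the same way a commit hash ties a crash to a build.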

The cost of getting this wrong

Context bugs fail quietly. The model produces a plausible-sounding output that's wrong because a critical piece of information was evicted, placed in the low-attention middle of the window, or contradicted by a stale chunk from a previous step. These bugs don't throw exceptions. They don't show up in error logs. They show up as incorrect decisions made by systems that everyone trusts.

In high-stakes applications such as legal reasoning, medical triage, and financial analysis, a context management failure is not a minor quality issue. It's a systemic reliability failure that conventional testing struggles to catch, because the failure mode depends on what happens to be in the window at a specific point in time.

This is why context window management is becoming a discipline rather than a prompt engineering tip. The stakes are high enough, the failure modes are subtle enough, and the solutions are specialized enough that it needs to be treated as what it is: a foundational engineering problem, not a model tuning problem.

We built virtual memory because programs needed more address space than physical RAM could provide, and naively running out was unacceptable. We'll build the equivalent for LLM context because the same logic applies. The only question is how much production damage happens in the meantime.
