Tokens & Context Window
Tokens are the small units a model reads, and the context window is the working space available in a single call. System instructions, the current question, earlier turns, retrieved documents, tool results, and even the answer being generated all have to fit inside that same space.
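The key point is that every part of the packet counts against the same limit. Here is a minimal sketch of that accounting, assuming a toy whitespace tokenizer and a made-up 50-token window; real models use subword tokenizers (e.g. BPE), so actual counts differ.

```python
# Everything in one call shares a single token budget.
# Toy whitespace tokenizer for illustration only; real models
# use subword schemes, so actual counts will differ.

def count_tokens(text: str) -> int:
    """Crude token count: one token per whitespace-separated word."""
    return len(text.split())

CONTEXT_WINDOW = 50  # hypothetical limit for illustration

parts = {
    "system": "You are a concise assistant.",
    "history": "User asked about shipping times. Assistant answered in two sentences.",
    "retrieved": "Doc: Standard shipping takes 3-5 business days.",
    "question": "How long does express shipping take?",
}

used = sum(count_tokens(p) for p in parts.values())
room_for_answer = CONTEXT_WINDOW - used
print(f"input tokens: {used}, room left for the answer: {room_for_answer}")
```

Whatever the system, history, and retrieved documents consume, the answer has to fit in what is left.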
[Architecture diagram: dashed-line animations indicate the flow direction of data or requests.]
In early demos, a short prompt is often enough, so the limit barely feels real. In a production app, policy text, chat history, retrieval hits, and tool output start piling into the same request. Then useful instructions get crowded out, supporting evidence gets cut off, or there is not enough room left for the answer. The real problem is not just that the prompt is long. It is that the system has to keep choosing which context survives this turn.
When LLM usage was mostly short one-shot prompting, the context window looked like a model spec. As chat assistants, RAG pipelines, and agent loops became common, it started behaving more like an application constraint that shapes request design end to end.
An application bundles system instructions, the current request, recent history, retrieved passages, and tool results into one input packet. The model reads that packet as tokens and must also spend part of the same budget on the answer. That means more input usually leaves less room for output, and longer answers force the system to trim context more aggressively. As the limit gets closer, teams summarize older turns, reduce retrieval payloads, or split the work across multiple calls.
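That input/output trade-off can be made concrete with a small budgeting helper; the function name, window size, and token counts below are illustrative assumptions, not any particular API.

```python
# Sketch of the input/output trade-off: the more tokens the input
# packet uses, the fewer remain for the answer.

def plan_output_budget(input_tokens: int, context_window: int,
                       desired_output: int) -> int:
    """Return how many output tokens this call can actually afford."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("input alone overflows the context window")
    return min(desired_output, remaining)

# A heavy input packet in an 8k window leaves little room to answer.
print(plan_output_budget(input_tokens=7500, context_window=8192,
                         desired_output=2000))  # -> 692
```

When the affordable output drops below what the task needs, that is the signal to summarize history, shrink retrieval payloads, or split the work across calls.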
The context window is about how much can fit inside this call. Memory is about what should survive into the next call. Context engineering is about what deserves to be packed first into the limited space available right now. They often appear together, but if a request keeps overflowing or getting truncated, the context window is the first concept to look at.
In practice, teams rarely dump long sources into the model unchanged. They summarize first, keep only the passages that matter to the current answer, and feed back only the important fields from large tool results. A bigger window helps, but it does not remove the budgeting problem. If teams keep stuffing more into every request, cost and latency rise while the most important instructions become easier to lose in the noise.
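One common shape for that budgeting is greedy priority packing: order the pieces by importance and keep only what fits. This is a sketch under assumed priorities and token costs, not a fixed recipe.

```python
# Priority-based packing: keep the most important pieces first,
# drop the rest when the budget runs out. Priorities and token
# costs here are illustrative assumptions.

def pack_context(pieces, budget):
    """pieces: (name, token_cost) pairs, most important first.
    Returns the names that fit within the token budget."""
    kept, used = [], 0
    for name, cost in pieces:
        if used + cost <= budget:
            kept.append(name)
            used += cost
    return kept

pieces = [
    ("system instructions", 300),
    ("current question", 80),
    ("top retrieval hit", 900),
    ("summarized history", 400),
    ("full tool output", 2500),  # too big: dropped, not the instructions
]
print(pack_context(pieces, budget=2000))
```

The design choice that matters is the ordering: because instructions and the current question come first, it is the oversized tool output that gets sacrificed, rather than the content that steers the answer.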