Tokens & Context Window
Tokens are the small units a model reads, and the context window is the working space available in a single call. System instructions, the current question, earlier turns, retrieved documents, tool results, and even the answer being generated all have to fit inside that same space.
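The key point is that every part of the packet counts against the same limit. Here is a minimal sketch of that accounting, assuming a toy whitespace tokenizer and a made-up 50-token window; real models use subword tokenizers (e.g. BPE), so actual counts differ.

```python
# Everything in one call shares a single token budget.
# Toy whitespace tokenizer for illustration only; real models
# use subword schemes, so actual counts will differ.

def count_tokens(text: str) -> int:
    """Crude token count: one token per whitespace-separated word."""
    return len(text.split())

CONTEXT_WINDOW = 50  # hypothetical limit for illustration

parts = {
    "system": "You are a concise assistant.",
    "history": "User asked about shipping times. Assistant answered in two sentences.",
    "retrieved": "Doc: Standard shipping takes 3-5 business days.",
    "question": "How long does express shipping take?",
}

used = sum(count_tokens(p) for p in parts.values())
room_for_answer = CONTEXT_WINDOW - used
print(f"input tokens: {used}, room left for the answer: {room_for_answer}")
```

Whatever the system, history, and retrieved documents consume, the answer has to fit in what is left.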
[Architecture diagram: dashed-line animations indicate the flow direction of data or requests.]
In early demos, a short prompt is often enough, so the limit barely feels real. In a production app, policy text, chat history, retrieval hits, and tool output start piling into the same request. Then useful instructions get crowded out, supporting evidence gets cut off, or there is not enough room left for the answer. The real problem is not just that the prompt is long. It is that the system has to keep choosing which context survives this turn.
When LLM usage was mostly short one-shot prompting, the context window looked like a model spec. As chat assistants, RAG pipelines, and agent loops became common, it started behaving more like an application constraint that shapes request design end to end.
An application bundles system instructions, the current request, recent history, retrieved passages, and tool results into one input packet. The model reads that packet as tokens and must also spend part of the same budget on the answer. That means more input usually leaves less room for output, and longer answers force the system to trim context more aggressively. As the limit gets closer, teams summarize older turns, reduce retrieval payloads, or split the work across multiple calls.
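That input/output trade-off can be made concrete with a small budgeting helper; the function name, window size, and token counts below are illustrative assumptions, not any particular API.

```python
# Sketch of the input/output trade-off: the more tokens the input
# packet uses, the fewer remain for the answer.

def plan_output_budget(input_tokens: int, context_window: int,
                       desired_output: int) -> int:
    """Return how many output tokens this call can actually afford."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("input alone overflows the context window")
    return min(desired_output, remaining)

# A heavy input packet in an 8k window leaves little room to answer.
print(plan_output_budget(input_tokens=7500, context_window=8192,
                         desired_output=2000))  # -> 692
```

When the affordable output drops below what the task needs, that is the signal to summarize history, shrink retrieval payloads, or split the work across calls.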
The context window is about how much can fit inside this call. Memory is about what should survive into the next call. Context engineering is about what deserves to be packed first into the limited space available right now. They often appear together, but if a request keeps overflowing or getting truncated, the context window is the first concept to look at.
In practice, teams rarely dump long sources into the model unchanged. They summarize first, keep only the passages that matter to the current answer, and feed back only the important fields from large tool results. A bigger window helps, but it does not remove the budgeting problem. If teams keep stuffing more into every request, cost and latency rise while the most important instructions become easier to lose in the noise.
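One common shape for that budgeting is greedy priority packing: order the pieces by importance and keep only what fits. This is a sketch under assumed priorities and token costs, not a fixed recipe.

```python
# Priority-based packing: keep the most important pieces first,
# drop the rest when the budget runs out. Priorities and token
# costs here are illustrative assumptions.

def pack_context(pieces, budget):
    """pieces: (name, token_cost) pairs, most important first.
    Returns the names that fit within the token budget."""
    kept, used = [], 0
    for name, cost in pieces:
        if used + cost <= budget:
            kept.append(name)
            used += cost
    return kept

pieces = [
    ("system instructions", 300),
    ("current question", 80),
    ("top retrieval hit", 900),
    ("summarized history", 400),
    ("full tool output", 2500),  # too big: dropped, not the instructions
]
print(pack_context(pieces, budget=2000))
```

The design choice that matters is the ordering: because instructions and the current question come first, it is the oversized tool output that gets sacrificed, rather than the content that steers the answer.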