Skip to main content
AI Glossary

What is a context window?

A context window is the maximum amount of text an AI language model can consider at one time, measured in tokens. It holds the system instructions, your conversation history, retrieved documents, and the model's own reply. Everything outside that limit is invisible to the model, so it cannot reason over what it can't currently see.
Last updated June 2, 2026

A context window is the working memory of a large language model (LLM). When you send a message, the model loads everything it needs to answer into this window: the system prompt that defines its behaviour, the running conversation, any files or search results pulled in, and the response it is generating. The window has a fixed ceiling measured in tokens, and the model has no awareness of anything beyond it.

What is a context window in simple terms?

Think of the context window as the model's desk. It can only read and reason over the papers spread on the desk right now. Anything filed away in a cabinet across the room (an earlier part of the chat, a document you sent yesterday) is out of sight until something deliberately fetches it back onto the desk. The desk has a hard size limit, and that limit is the context window.

Context is measured in tokens, not words or characters. A token is a chunk of text, roughly three-quarters of a word in English. So a 128,000-token context window holds somewhere around 90,000 to 100,000 words of mixed input and output combined. That budget is shared: a longer system prompt or a big pasted document leaves less room for the conversation and the answer.

Why does context window size matter?

Context window size sets the ceiling on how much a model can reason about in a single request. A small window forces you to summarise or chop up long documents; a large window lets the model read an entire contract, a long thread, or a codebase in one pass and connect details that sit far apart. For real tasks, the practical effects break down like this:

  • Longer documents fit whole, so the model does not lose facts that were trimmed to make room.
  • Conversations stay coherent because earlier turns remain visible instead of being forgotten.
  • More retrieved evidence (search results, files) can be supplied, which tends to reduce hallucination.
  • Cost and latency rise with the number of tokens processed, so a bigger window is not automatically the right choice for every task.
  • Models can also struggle to use the middle of a very long window well, so quantity of context is not the same as quality of attention.
~3/4 of a wordapproximate English text represented by a single token, the unit a context window is measured inSource: OpenAI tokenizer guidance, 2024

Frontier models in 2026 commonly offer windows from roughly 128,000 tokens up to a million or more, depending on the model. Larger does not always mean better for a given job: a focused, well-chosen window of relevant text usually beats a giant window stuffed with marginally related material.

What happens when the context window is exceeded?

When a request would exceed the limit, something has to give. Either the system refuses the request with a token-limit error, or it silently drops the oldest or least relevant content to make the input fit. In a long chat, that usually means the earliest messages fall out of the window first, so the model can appear to 'forget' what you told it near the start of a session.

Well-built assistants do not just let context fall off a cliff. They actively manage what stays in the window using techniques like these:

  1. Summarising older turns into a compact recap that costs far fewer tokens than the full transcript.
  2. Archiving finished conversations and retrieving only the relevant pieces later, on demand.
  3. Storing durable facts about you in a separate memory layer and injecting just the ones that matter for the current request.
  4. Starting a fresh thread when the topic clearly changes, so unrelated history stops eating the budget.

How does MiyoMind manage long conversations?

MiyoMind keeps conversations coherent without endlessly inflating the context window. Rather than replaying every message verbatim, it distils older chats into a rolling briefing, archives completed threads, and remembers the facts that matter to you in a dedicated long-term memory layer, then feeds only the relevant pieces back into the window for each turn.

  • Rolling chat briefing: recent and long-term conversation context is compressed into a short, continuously updated summary so the assistant stays oriented across sessions without resending the entire history.
  • Conversation archiving: finished chats are archived and can be pulled back via recall when something earlier becomes relevant again, instead of clogging the active window.
  • Long-term memory: durable details you have shared are stored separately, encrypted at rest with AES-256-GCM, and a small top-K selection is injected into context when it helps.
  • Smart new threads: when a conversation goes idle or the topic shifts, MiyoMind can start a fresh thread so stale context stops competing for space.

Because of this, you can chat with Miyo across WhatsApp, Telegram, Discord, or the web dashboard over days and weeks, and it keeps the thread of what matters without you re-explaining yourself or hitting a wall the moment a single conversation grows long. MiyoMind also routes across frontier models from several providers, so the right model and its window are matched to the task rather than relying on one fixed limit.

Frequently asked questions

What is a context window in an LLM?

It is the maximum span of text an LLM can read and reason over in a single request, measured in tokens. The window holds the instructions, conversation history, any retrieved material, and the generated reply. Anything outside it is invisible to the model for that request.

How is a context window measured?

In tokens, not words. A token is a short chunk of text, roughly three-quarters of a word in English. A 128,000-token window therefore holds around 90,000 to 100,000 words of combined input and output, and both your prompt and the model's answer count toward that budget.

What happens when you exceed the context window?

The request either fails with a token-limit error or the system trims content to fit, usually by dropping the oldest or least relevant text first. In a long chat this can make the model appear to forget what was said early in the session.

Is a bigger context window always better?

Not necessarily. A larger window lets a model see more at once, but it costs more tokens, adds latency, and models can use the middle of a very long window less reliably. A focused window of genuinely relevant text often beats a huge one filled with marginal material.

How does MiyoMind handle conversations longer than the context window?

MiyoMind distils older messages into a rolling briefing, archives completed chats for on-demand recall, and keeps durable facts in an encrypted long-term memory layer. It injects only the relevant pieces into each request, so long-running conversations stay coherent without overflowing the window.

How does a context window relate to AI memory?

The context window is short-term working memory that resets each request, while a memory layer is durable storage that persists across sessions. MiyoMind pairs the two: long-term memory holds what matters, and only the relevant slice is loaded into the context window when needed.

Meet your new assistant

Already in WhatsApp, Telegram, Discord, and the web. 100 free credits every month — no card required.