Why Large Context in LLMs Doesn’t Save You From Hallucinations
Published on: June 25, 2025

The Myth of “More Context = Fewer Hallucinations”
It seems to me that lately the excitement about increasing context size has died down a bit, but hasn’t completely faded. It’s unclear exactly why: either the flood of LLM news has dulled everyone’s senses and no longer sparks much delight, or users and engineers have semi-intuitively sensed that a really big context window doesn’t buy them much.
I’m inclined to believe that the very idea of “large context will solve my problems” is still alive, so let’s talk today about why this is another trap, why we’re worried about context size in the first place, what we’re actually trying to solve, and how to solve it properly.
Why do engineers fall into this trap? Because engineers are people. People don’t always hold the freshest and most correct viewpoints. We always lack some of the knowledge needed to interpret things with minimal detachment from reality, and we don’t always like to strain ourselves to think slowly, reason rationally, and model the problem.
This isn’t necessarily about laziness; it’s also about, for example, looming deadlines and other pressures.
What are we actually solving when we hope that a larger context will help our AI system become… more accurate? This striving for response accuracy is most often driven by trying to escape emerging hallucinations – we wanted accurate answers “by default”, but while choosing the architecture we never even thought about how to achieve that.
Often this approach leads to the request context bloating up, with information that is, if not outright unnecessary, then heavily diluted, getting into the context. There it is – the cause, and the bridge to hoping for salvation through a longer context.
Anatomy of the Problem: Why Long Context Doesn’t Work
Let’s consider a very popular scenario where we want to build an NLI (natural language interface) chatbot based on an LLM, “well, just like ChatGPT,” but with our data: with a database, or with our large and complex API.
Often such a task is too abstract; it seems that integrating our API into an AI system is simple – look, there are many agent frameworks, just take one and do it. We close our eyes to what exact goal we’re trying to achieve: does anyone even need this chatbot? If needed, what value does it provide? What are our accuracy requirements? Will we integrate any other features there?
Intuitively, we expect such a project to produce few hallucinations, because the task seems simple: go to the API, fetch data, give it to the end user in a readable format.
This is indeed a simple task when we’re talking about simple APIs – weather, Pokemon, or any others with a limited, well-designed response format that is understandable without a lot of additional context in the form of documentation, annotations, or specifications.
How many such APIs have you seen? Especially in business?
In the intuitive-approach scenario, we simply and quickly, using some framework and, most likely, an Agent-type architecture with function calls, build an AI interface around our API, and it… doesn’t work even remotely adequately, accurately, or sensibly.
We try to fix the prompt, adjust here and there, and at some point the results get a bit better, or even quite good! We joyfully assemble a release, deploy it to a demo environment, still joyfully call colleagues over to try the product, and then it turns out that nothing works properly: the system thinks it’s 1743 and it’s defeating the French in Bavaria, and the response format and structure drift badly from call to call.
Bug-eyed, you launch the same build locally, and everything works… well, normally.
Alice in Hallucination Land. It’s unclear what to do with this and how to keep developing such a system effectively. You start investigating deeper, and it turns out that function calling is essentially a kind of “patch”: JSON descriptions of each function get into the context along with user requests, and the LLM tries to choose the right function to get the data needed for the response. You made 75 such functions, each with possibly 20 arguments.
Digging a bit deeper, it turns out that you barely process the responses from your function calls at all: every endpoint of your API returns reeeeally long JSONs, which you save to the conversation history and… send along with subsequent requests.
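Roughly what this looks like in code – a minimal sketch (all names are hypothetical, OpenAI-style message roles assumed) of an agent loop that quietly bloats its own context by appending every raw tool response to the history:

import json

history = [{"role": "system", "content": "You are an assistant for our orders API."}]

def call_orders_api() -> dict:
    # Pretend this hits a real endpoint that returns a huge, unfiltered payload.
    return {"orders": [{"id": i, "audit_trail": ["..."] * 50} for i in range(200)]}

def handle_turn(user_message: str) -> None:
    history.append({"role": "user", "content": user_message})
    raw = call_orders_api()
    # The trap: the raw JSON goes straight into the history...
    history.append({"role": "tool", "content": json.dumps(raw)})
    # ...and the whole history is re-sent to the LLM on every subsequent turn.
    print("characters in next request:", sum(len(m["content"]) for m in history))

handle_turn("Which of my orders shipped yesterday?")
handle_turn("And which ones are delayed?")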
Hallucinations come exactly from here: it’s very difficult to force an AI system to do “everything at once” by throwing a lot of garbage into the request and folding your hands in a prayer gesture.
You think: “Well, if everything gets into context, then… we need bigger context! Let’s take a more expensive model,” – and… nothing will change substantially. But why?
Let’s look at some technological reasons.
If these technological reasons related to context in LLMs don’t interest you, feel free to scroll to the subtitle “Do we even need this memory?”
The “Lost-in-the-middle” Effect and Positional Biases
LLMs demonstrate a U-shaped attention curve: information at the beginning and end of context is processed well, but data in the middle gets “lost.”
Our wonderful models have a “built-in” bias toward token position in context: they’re predominantly trained on relatively short texts, and the model “gets used to” the fact that all the most valuable information – the question and its details – is located either at the beginning of the prompt or at the end.
Increasing context doesn’t help, but rather creates an even larger “dead zone” in the middle.
It turns out that information critical for response accuracy, if it falls in the middle of a long prompt, can simply go unconsidered.
Example: our imperfect agent collects a lot of data from function calls, gets a lot of big JSONs, and ultimately makes one final call to prepare the response. In this call, all the most valuable stuff – the data obtained from the API – can partially fall right into this “dead zone.”
In addition to this, last year there was a series of studies revealing interesting nuances about context accuracy. For example, one study of open-source models found that the effective context length often doesn’t exceed even half of the declared training length.
You might say: “Well, that’s open-source.” Yes, but all the models we love and use are based on the same architecture, which means they probably have similar problems. This is confirmed at least empirically – hallucination problems occur often and are hotly debated.
Attention Degradation in Long Sequences
Imagine you’re taking part in a large online seminar. For the first 15 minutes you listen attentively, take notes, try your best not to miss anything important.
At best, after an hour (but in reality…) your attention starts to scatter more and more: you think about lunch, about games on Steam, occasionally returning to the conversation. And the longer the Zoom call goes on without breaks, the more distracted you get.
LLMs work surprisingly similarly when processing long context.
The attention mechanism in transformers is not an unlimited resource. As sequence length increases, the model starts “listening to itself” more instead of the original context.
The “Hallucinate at the Last” Phenomenon
Other research shows that hallucinations concentrate at the end of long generations. For example, if you ask a model to write a long document summary, the beginning will be quite accurate, based on the source text, but toward the end the model will start glitching, adding details that weren’t in the original data at all.
The reason is that as generation progresses, the model pays more and more attention to its own previous tokens rather than the original context. It’s like if during a speech you started quoting not the source material, but what you said 10 minutes ago, then quoting a quote from a quote… By the end you get into a game of broken telephone with yourself: “I never repeat myself, I never repeat myself, I never repeat myself…”
In the context of our agent, here’s how it might look: the model made several function calls, got data, and started forming a response. At the beginning of the response it still remembers and uses the real data from the API. But the longer its own response becomes, the more it relies on its previous formulations and on patterns from training. Result: the beginning of the response is correct, but toward the end “facts” appear that never came from any of our API calls.
Additionally, it looks very funny when the agent “remembers” functions it didn’t call, or data that “should be correct” according to its “internal representations” (the model under the hood).
Accumulation of Computational Errors
Every operation in a neural network is a mathematical approximation. When we work with very long sequences or use model compression (quantization), small rounding and approximation errors start to accumulate. Though it seems the latter doesn’t threaten us if we use the large, powerful models from OpenAI, Google, or Anthropic.
Most of an LLM’s work is one long chain of such approximations connected sequentially. If each step carries an error of 0.1%, then after 100 steps the accumulated error can already be around 10%.
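A back-of-the-envelope check of that compounding claim (purely illustrative arithmetic, not a model of real transformer numerics):

per_step_error = 0.001            # 0.1% relative error per step
steps = 100
compounded = (1 + per_step_error) ** steps - 1
print(f"{compounded:.1%}")        # -> 10.5%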
In short, the longer the context and the more computational operations required to process it, the higher the probability that accumulated errors will distort the result.
The nastiest thing here is that these errors are practically impossible to catch during development of your system – they manifest stochastically and depend on the specific combination of input data.
You’ve probably already had the thought that all these problems can “work” together, creating an effect where increasing context not only doesn’t solve the hallucination problem, but potentially makes it worse.
This is exactly why experienced AI system engineers have long stopped trying to stuff “everything at once” into context and moved to more engineering solutions for managing memory and dialogue state. For the same reason, these same engineers didn’t have a wow effect when, for example, Gemini 2.5 Pro came out.
Do We Even Need This Memory?
Yes, conceptually it’s exactly about memory. We essentially use the model context head-on as a “memory element”: we try to pile into it all the information needed to process a specific request.
By default it intuitively seems to us that our AI system should remember the entire history of every conversation with it, in exhaustive detail, on each iteration – roughly the way Claude Desktop works, for example. We would very much like our own AI interface to be just as smart, attentive, and clear. But as we’ve already seen, a large number of pitfalls are hidden from our eyes.
And in general, purely technically, a client backed by a “rich”, well-built model like Claude and our own assistant, into which we hastily stuffed our API, are conceptually different things.
Fortunately, this is quite easy to verify: if you already have an agent system with a bunch of problems and hallucinations, take all your function calls (as they are) and “convert” them, using an LLM (vibe coding), into an MCP server. Connect this MCP server to Claude Desktop and try to communicate with your API this way.
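A minimal sketch of such a wrapper, assuming the official MCP Python SDK (the "mcp" package) and a hypothetical orders endpoint; the point is to wrap your existing functions as-is, warts and all:

import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("our-legacy-api")

@mcp.tool()
def get_order(order_id: str) -> dict:
    """Fetch a single order, returning the raw (probably oversized) payload unchanged."""
    response = httpx.get(f"https://api.example.com/orders/{order_id}")  # hypothetical endpoint
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; point Claude Desktop's config at this script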
With high probability, no significant improvement will occur; the main thing is not to cheat – keep all the complexity and redundancy of the functions and their descriptions.
Serious Memory
In the minimal variant, we save conversation history with our system in a database – relational or document-oriented, doesn’t matter much in light of today’s topic.
External memory systems, RAG, knowledge graphs and the like are a separate and interesting topic that we won’t touch today. Accuracy pains also arise in vector or hybrid search in RAG-like systems, but the context problem can show up there too, for example when we have “pulled too much from storage” or are still trying to crudely cross an Agent architecture with RAG through function calls.
And Let’s Still Take a Step Back
All the technical nuances and LLM context limitations are interesting and educational, but are they the root causes of hallucinations? In a vacuum – yes, but it’s quite obvious that we step on these rakes as a result of our earlier decisions (or their absence).
Analyze practically any problem that arises when building an AI system, and at the very bottom you prosaically find good old gaps in the technical requirements and in the modeling of the real business task, not just in the tech solution itself.
Such gaps arise from forced haste, or simply from cognitive biases: instead of pausing and thinking the task through in detail, we tend to take the first “intuitive” solution that comes to mind as-is.
Let’s continue with our example: an LLM chatbot around our API, from which we want deep memory and understanding of the dialogue history.
Why? For what? On closer examination, the overwhelming majority of use cases are covered by just the last 3-5 “question-answer” pairs. Possibly even fewer, especially if you have high requirements for accuracy and data freshness. In the race for good conversation memory, we shoot ourselves in the foot, jeopardizing what actually matters to us – accuracy.
Function Calling Pitfalls
LLM providers and their function-calling APIs thread the entire chain of function calls and their responses through the context. Agent frameworks follow this interface and save all these messages in their data structures.
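For illustration, here is roughly what a single function-calling turn adds to the message history in OpenAI-style chat APIs (the ids and payloads are made up):

turn = [
    {"role": "user", "content": "Where is order #12345?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "get_order", "arguments": '{"order_id": "12345"}'}}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"id": "12345", "shipped_date": "2025-01-20", "...": "hundreds more fields"}'},
    {"role": "assistant", "content": "Your order #12345 was shipped on 2025-01-20."},
]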
You might think:
- Well, if it’s needed – then it’s needed, we’ll store all intermediate calls!
- And in general this is good: on each dialogue iteration the LLM will get the entire context of previous calls! This just has to work well when the user asks clarifying questions! Everything is already in context!
Yes, this should work well and will work well… when your functions respond with short, informative, concise JSONs or something like that. We’ve already looked at the problem of long responses from your API: the dialogue context can bloat extremely quickly – the first response will be nice and precise, but the next ones will be all about hallucinations.
What can we do? Well, first, return again to the level of project requirements and ask questions:
- Do we even need all these functions?
- Do we need all arguments in the function interface?
- What client requests should these functions answer?
- What are we trying to achieve after all?
- What should be the main responsibility of our AI Agent?
Having thrown out the unnecessary ones, you can go further and start cleaning your functions’ responses of all the garbage, null fields, and other stuff that won’t help the LLM answer the question at hand.
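A minimal sketch of such cleaning before anything reaches the LLM; the drop-list of noise fields is hypothetical, and a dict that becomes empty only after cleaning is intentionally left alone to keep the sketch short:

NOISE_FIELDS = {"audit_trail", "internal_flags", "hateoas_links"}  # whatever never helps answers

def is_empty(value) -> bool:
    return value is None or value == "" or value == [] or value == {}

def clean(value):
    # Recursively drop known-noise keys, nulls, and empty containers from a tool response.
    if isinstance(value, dict):
        return {k: clean(v) for k, v in value.items()
                if k not in NOISE_FIELDS and not is_empty(v)}
    if isinstance(value, list):
        return [clean(v) for v in value if not is_empty(v)]
    return value

raw = {"id": 42, "status": "shipped", "audit_trail": ["..."], "promo": None, "tags": []}
print(clean(raw))  # -> {'id': 42, 'status': 'shipped'}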
And now, attention, a bonus for those who care about accuracy and freshness:
Try not saving intermediate function-call results at all. Yes, this is a trade-off: if the user asks a question served by the same function we just called, we’ll likely have to call it again. But purely semantically, pairs of user question and final system response may be more than enough to maintain a high level of conversation “awareness.”
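A minimal sketch of that filter, assuming OpenAI-style message roles: persist only the user/assistant pairs and drop all intermediate tool traffic before it ever reaches storage:

def persistable_history(messages: list[dict]) -> list[dict]:
    # Keep plain user/assistant turns; drop tool results and assistant messages
    # that merely request tool calls.
    return [m for m in messages
            if m.get("role") in ("user", "assistant") and not m.get("tool_calls")]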
Small Tips About Dialogue History
Processing and Compression
If you’ve still decided you need history, then you need to figure out exactly how you’ll track it and how you’ll store it.
When developing NLI chatbots, we often mix apples with oranges – we use message history for two things:
- Dialogue context: for each new request we look at the recent messages.
- Dialogue history for the UI: the frontend fetches it for display.
Apples and oranges can be separated. For example, for each dialogue we can maintain a separate record – a fact graph with something like mem0 – or periodically make a separate LLM request to “compress” the dialogue history and use that blob as additional context: much shorter and more concise than the full history of all messages (and tool calls, if those are still saved and considered valuable).
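A minimal sketch of such periodic compression, using the OpenAI Python SDK as an example provider; the model name and prompt are placeholders:

from openai import OpenAI

client = OpenAI()

def compress_history(messages: list[dict]) -> str:
    # Turn the raw message list into a short, reusable summary blob.
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this dialogue into a short list of facts "
                                          "and open questions. Do not invent anything."},
            {"role": "user", "content": transcript},
        ],
    )
    return result.choices[0].message.content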
All these solutions strongly depend on the task; personally I found the apples-plus-oranges approach in the database quite acceptable for an MVP, as long as message size and the depth and cleanliness of the pulled context stay adequate.
In case your “chatbot” pursues goals of modeling complex client communication scenarios and there can be several such scenarios, consider tracking dialogue state. There’s also nothing super complex here: it’s just a state machine or, even simpler, an enum status based on which a transparent decision is made about which scenario to follow.
An active prompt or the main/large part of the prompt usually depends on such a switch.
Why are these tips so short, and why am I not elaborating on them in depth? Not only because they’re off-topic, but because they would likely add unneeded complexity to your system! This is just FYI.
Monitoring and Evaluation System
If you’ve decided to monitor something, it’s better to watch what really affects users.
Metrics for Conversation Awareness
Forget about immediately rushing to all sorts of semantic similarity scores and other academic metrics.
In addition to basic must-haves (is the server alive?), it’s worth thinking about collecting metrics such as:
- How many times a user re-asks the same thing – if often, then the system doesn’t understand context.
- How quickly the goal is achieved – time from first message to “Thank you, chatbot, you helped me!”
- Percentage of “I didn’t understand” responses – both from your system and from the user reacting to the final result of a dialogue iteration.
- Number of sessions per user – do people come back, or do they solve the task on the first try (if this metric is applicable at all; your company or startup might not have any other interface).
A reasonable question might arise here: how do you monitor this at all? Hint below – structured logs can be analyzed quite simply, and even automatically, using the same LLMs.
Another good thing to implement in the very first steps is a proper interface for user feedback – it will be invaluable.
A pro tip: combine the two – feedback plus structured logs – for your convenience and quicker debugging.
Runtime Hallucination Detection
In my humble opinion, the most reliable hallucination detector is structured logging. Every so-called “fact” in the response should be tied to a specific API call, a known data source, a function call, an AI pipeline step’s “decisions”, and so on.
Response: "Your order #12345 was shipped yesterday"
Source: GET /orders/12345, field "shipped_date": "2025-01-20"
...
If you can’t tie or conveniently find this information in logs at all, this is a huge blocker and needs urgent fixing.
How?
LLM-based, so-called “hallucination detectors” are like treating alcoholism with vodka.
There’s a simpler and better option – structured logging. It is the opposite of rocket science, and you can easily implement it in one commit today.
Structured logging is very prosaic: you simply define a dictionary, and on each iteration of communication with your system you fill its instance with key data about your AI pipeline decisions.
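A minimal sketch of such a per-interaction dictionary; the field names are just an example:

import json, time, uuid

def new_interaction_log(user_message: str) -> dict:
    return {
        "interaction_id": str(uuid.uuid4()),
        "started_at": time.time(),
        "user_message": user_message,
        "pipeline_steps": [],   # e.g. {"step": "route_intent", "decision": "orders_lookup"}
        "function_calls": [],   # e.g. {"name": "get_order", "source_fields_used": [...]}
        "final_response": None,
    }

log = new_interaction_log("Where is my order #12345?")
log["pipeline_steps"].append({"step": "route_intent", "decision": "orders_lookup"})
log["function_calls"].append({"name": "get_order", "args": {"order_id": "12345"},
                              "source_fields_used": ["shipped_date"]})
log["final_response"] = "Your order #12345 was shipped yesterday"
print(json.dumps(log, indent=2))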
The benefit of structured logging especially shines when you have a real AI pipeline, where it’s sometimes extremely difficult to understand at which stage exactly something went wrong, rather than a one-shot-prompt agent for a Pokemon API.
Save the resulting JSON at least to wherever you keep your logs, but you can also write it to a database – it’ll come in handy, trust me! (Remember – feedback plus these logs!)
Each such log becomes something like a “passport” of every interaction with your system, which describes what it “thought” at each stage, what function it decided to call, and possibly even includes an explanation of why (if you use the Custom Chain of Thoughts pattern).
A/B Testing Dialogue Systems
Generally, in AI system testing great emphasis is placed on prompt testing – evaluation tests that need to be run after each change to those same prompts. This is all very good and necessary; it can protect you from unpleasant surprises after the next deploy.
But what should we test if we’re striving for actual improvements in an LLM-based product? Architectural decisions! The greater impact more often comes from things a bit further away than just tweaking the prompts.
Examples of useful splits for testing (a small bucketing sketch follows the list):
- History depth: use 3 recent messages vs 10 recent ones.
- Function calling: a dumb agent approach vs a slightly more complex one based on semantic routing.
- Intermediate function results (filtered!): save vs don’t save.
- Context compression: yes vs no.
- And so on.
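A minimal sketch of deterministic variant assignment via hash-based bucketing, so the same user always lands in the same architectural variant during an experiment (names are hypothetical):

import hashlib

VARIANTS = ["history_depth_3", "history_depth_10"]

def assign_variant(user_id: str, experiment: str = "history-depth-2025") -> str:
    # Hashing experiment+user gives a stable, uniformly distributed bucket without extra storage.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

print(assign_variant("user-42"))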
Conclusion
Key Takeaways, As Usual: Think Before You Do
Large context practically never solves the problem of bad architecture. Complex memory systems don’t solve the problem of bad, ungrounded requirements. Trendy frameworks don’t solve the problem of our unwillingness to think.
In light of today’s topic – context bloat and conversation awareness – before running off to write code for “memory,” answer these questions for yourself:
- What exactly needs to be remembered?
- Who needs this information?
- What will happen if the system does not remember it?
- How much does this cost versus what pain does it solve?
- Does it solve anything at all?
Often it turns out that 80% of “memory problems” aren’t problems at all. And the remaining 20% are solved with a couple of hours of work wiring in a database or, even better, with prompt engineering or plain old-school engineering patterns.
Evolution from Just “Big Models” to “Smart Systems”
A smart system isn’t the one with the biggest context or the most complex architecture. A smart system is one that solves the user’s task with minimal effort and maximum reliability – that undead neural system that works as an amplifier of the living neural network in our heads.
Success metrics should come from the business world, not from our techie caves. Did the user solve the task? Did they spend less/more time? Did they come back again? And not just accuracy on synthetic benchmarks – NOTHING NEW.
Small Checklist for Getting Started
- Remove all garbage from context – a bunch of problems will disappear without additional effort.
- If you maintain history in a database, load only what’s needed – don’t stuff everything into each request. First understand what “needed” means and how much of it is required on each conversation iteration.
- Don’t save intermediate function-call results if accuracy is more important than convenience. At minimum, don’t save them raw. And don’t save them if they’re duplicated in the final system response anyway – why would you?
- Measure simple metrics – re-asks, time to solution, user returns, etc.
- Add complexity then and only then when simple solutions stop working.
Start by recording (user_message: ..., assistant_response: ...) pairs in a database and loading the last 3-5 pairs for each new request. If that isn’t enough – then think about summarization, RAG and other stuff.
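A minimal sketch of that starting point, using SQLite from the Python standard library (table and column names are just an example):

import sqlite3

db = sqlite3.connect("chat_history.db")
db.execute("""CREATE TABLE IF NOT EXISTS turns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    conversation_id TEXT,
    user_message TEXT,
    assistant_response TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")

def save_turn(conversation_id: str, user_message: str, assistant_response: str) -> None:
    db.execute("INSERT INTO turns (conversation_id, user_message, assistant_response) VALUES (?, ?, ?)",
               (conversation_id, user_message, assistant_response))
    db.commit()

def last_pairs(conversation_id: str, n: int = 5) -> list[tuple[str, str]]:
    rows = db.execute("""SELECT user_message, assistant_response FROM turns
                         WHERE conversation_id = ? ORDER BY id DESC LIMIT ?""",
                      (conversation_id, n)).fetchall()
    return rows[::-1]  # oldest first, ready to prepend to the next request's context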
So be it: start with the popular (and, for complex systems, very unreliable) Agent pattern – if your use case needs only a few simple functions that return correct, simple data structures, an agent will probably be enough.
Especially if you don’t have high accuracy and reliability requirements (you’re not a fintech, just a convenient chatbot).
And if the system works reliably and brings benefit, users won’t care about your context size, your architecture, or which model is under the hood.