When long chats drift: hidden errors in AI-assisted coding
2025-12-26
admin
## How context drift sneaks in

I learned the hard way that a long chat is not a single, stable memory. The model still sees earlier turns, but attention favors recent tokens, so constraints you asserted an hour ago get quietly deprioritized. I would start a session by telling the assistant which framework, which version, and that we prefer existing helpers. Later in the same thread it would start suggesting APIs from a different ecosystem, and I would only notice after a test failed. The change is subtle: suggestions keep sounding plausible, so you keep accepting them until something breaks in CI.

## Concrete failures I ran into

One time the model swapped our HTTP client mid-session. Early messages were explicitly about requests and sync code; after several prompts it began returning async httpx examples. I merged the change because the diff looked trivial. Tests passed locally but failed on staging, where the event loop was different. The root cause was a lost constraint: "stay synchronous" sat at the top of the chat but no longer influenced later completions.

Another example was assumptions about language features. I told the assistant we were on Python 3.9. After many turns it suggested match/case snippets without a reminder. The code compiled locally for me because I had 3.10 installed, but not for teammates. That one cost a couple of hours tracing a syntax error back to the assistant (a compile check against the pinned interpreter, sketched further down, would have caught it before review).

These are not dramatic hallucinations. They are small shifts that only show up when you run the whole system.

## Why small errors compound

Long conversations create a chain of micro-decisions. One prompt changes variable names. The next builds on that change and assumes a different module layout. If any tool call in the chain returns partial or malformed output, the model often fills the gaps with the most likely continuation. I had a search tool time out and the assistant continued as if the search had returned exactly what it needed. The code it produced referenced functions that never existed in our codebase.

When model outputs feed scripts, CI, or other tools, silent or partial failures become amplification points. A missing check, a slightly wrong import, or a forgotten header becomes a new assumption later in the thread. That is why I now treat generation and validation as separate steps: I ask the model to draft, then run deterministic checks and require explicit evidence or a source for any claim about library behavior, often using a structured research flow when I need to verify API details.

## Operational changes that actually reduced incidents

I changed three things first: force resets, log everything, and make tool outputs mandatory checkpoints.

For resets, I now split large tasks into multiple chats and explicitly restate constraints in each new session. That sounds annoying, but it beats debugging a drifted session.

Logging matters. I write the assistant output and every tool response into our tracing layer so I can replay where a suggestion originated. The replay was the only way I found the moment a decision flipped from sync to async in that HTTP client incident. A minimal version of that trace log is sketched below.

I also started using explicit guardrails inside prompts: pin the runtime, the versions, and the style guide as a short checklist the model must reference before producing code (see the preamble sketch below). When I need a second opinion, I dump the same prompt into another model in a shared chat workspace so I can see divergence patterns. That comparison often surfaces hidden assumptions faster than any single answer.
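Our tracing layer is internal, but at its core it is just an append-only JSONL log keyed by session id, so a drifted decision can be replayed turn by turn. Here is a minimal stand-in, with illustrative field names rather than our production schema:

```python
# Minimal stand-in for the tracing layer: append every assistant output
# and tool response to a JSONL file so a session can be replayed later.
# Field names are illustrative, not a production schema.
import json
import time
from pathlib import Path

TRACE_FILE = Path("chat_trace.jsonl")

def log_event(session_id: str, role: str, content: str) -> None:
    """role is e.g. 'assistant', 'tool:search', 'user'."""
    record = {
        "ts": time.time(),
        "session": session_id,
        "role": role,
        "content": content,
    }
    with TRACE_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def replay(session_id: str):
    """Yield every logged event for one session, in order."""
    with TRACE_FILE.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["session"] == session_id:
                yield record
```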
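The guardrail checklist is nothing clever either: a block of pinned constraints that gets prepended to every prompt in the session. A sketch, with the pinned values being examples drawn from the incidents above rather than a recommendation:

```python
# Guardrail preamble prepended to every prompt in a session.
# The pinned values below are examples; fill in your own stack.
CONSTRAINTS = {
    "runtime": "Python 3.9 (no match/case, no 3.10+ syntax)",
    "http_client": "requests, synchronous only; do not introduce httpx or asyncio",
    "style": "follow the repo style guide; prefer existing helpers over new deps",
}

def with_guardrails(prompt: str) -> str:
    """Render the pinned constraints as a checklist ahead of the actual request."""
    checklist = "\n".join(f"- {name}: {rule}" for name, rule in CONSTRAINTS.items())
    return (
        "Before writing code, restate and respect these constraints:\n"
        f"{checklist}\n\n"
        f"{prompt}"
    )
```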
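And the check that would have caught the match/case incident: byte-compile whatever the assistant changed with the interpreter you actually pin, not whatever happens to be installed locally. A minimal sketch, assuming a `python3.9` binary is on PATH:

```python
# Pre-merge check: byte-compile the files the assistant touched with the
# pinned interpreter, so syntax that only exists in newer Python
# (match/case, for example) fails here instead of on a teammate's machine.
# Assumes `python3.9` is on PATH; adjust to whatever version you pin.
import subprocess
import sys

PINNED_PYTHON = "python3.9"

def check_files(paths):
    failures = 0
    for path in paths:
        result = subprocess.run(
            [PINNED_PYTHON, "-m", "py_compile", path],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures += 1
            print(f"{path}: does not compile under {PINNED_PYTHON}")
            print(result.stderr.strip())
    return failures

if __name__ == "__main__":
    sys.exit(1 if check_files(sys.argv[1:]) else 0)
```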
## When to stop the chat and run tests

For verification I require the model to include citations or exact function signatures for API changes, and then I check those against the docs using a focused research flow. A quick check of a claimed signature against the installed package (sketched at the end of this post) catches most invented helpers before they reach a diff.

My heuristic now is simple. If the session goes beyond a handful of turns or touches multiple subsystems, stop and validate. Run unit tests. Check that imports and runtime versions match every environment the code will run in. If a suggested change requires new dependencies, treat that as a new project and open a fresh conversation that lists the dependency policy up front.

I still let the model draft and explore, but I do not let drafts propagate without explicit verification and logs to trace them back. That reduces the chance of a small context drift turning into a deployed outage.
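What that signature check looks like in practice is tiny. A sketch, assuming the model's claim names a dotted path such as `requests.get`; C extensions without introspectable signatures will still need the docs:

```python
# Verify a claimed function against the installed package before trusting it.
# Catches invented helpers and silently changed signatures; objects without
# introspectable signatures raise ValueError and fall back to the docs.
import importlib
import inspect

def verify_callable(dotted_path: str) -> str:
    """Return the real signature of e.g. 'requests.get', or raise if it doesn't exist."""
    module_name, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)  # AttributeError if the model made it up
    return f"{dotted_path}{inspect.signature(obj)}"

# Compare the printed signature with what the chat claimed:
# print(verify_callable("requests.get"))
```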
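The stop-and-validate step itself is also scriptable. A minimal sketch of a pre-merge gate: confirm that every module the new code imports resolves in the current environment, then run the test suite; the module list is something you pass in, not something this detects automatically, and the script name is hypothetical.

```python
# Stop-and-validate gate: confirm that the modules the new code imports
# resolve in this environment, then run the unit tests. Nonzero exit on failure.
import importlib.util
import subprocess
import sys

def missing_modules(names):
    missing = []
    for name in names:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:  # parent package itself is absent
            missing.append(name)
    return missing

def validate(names):
    missing = missing_modules(names)
    if missing:
        print(f"unresolved imports in this environment: {missing}")
        return 1
    # Let the test suite's exit code decide the rest.
    return subprocess.run([sys.executable, "-m", "pytest", "-q"]).returncode

if __name__ == "__main__":
    # Example: python validate_session.py requests mypkg.utils
    sys.exit(validate(sys.argv[1:]))
```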
Tags: how-to, tutorial, guide, dev.to, ai, python