
I am working on an AI workflow which, like many others, has a classifier step. It looks at a user prompt and routes it to the appropriate specialist agent: support, sales, or feedback. The classifier's system prompt had this line:
Respond only with the exact query type. No explanation. No formatting.
During daily editing this line was accidentally lost.
After deployment, the classifier started returning markdown instead of a clean enum value:
**feedback**
At the time there was neither validation nor a safe fallback value, so the workflow failed at this step every time.
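To make the failure concrete, here is a minimal sketch of how such a routing step can break. It assumes the classifier output is mapped to an enum with a strict lookup; `QueryType` and `parseQueryType` are hypothetical names for illustration, not taken from the actual workflow.

```kotlin
// Hypothetical routing targets; the real workflow's types may differ.
enum class QueryType { SUPPORT, SALES, FEEDBACK }

// Strict parsing: valueOf throws IllegalArgumentException for any
// string that is not exactly one of the enum names.
fun parseQueryType(raw: String): QueryType =
    QueryType.valueOf(raw.trim().uppercase())

fun main() {
    // Clean output, as produced with the original system prompt.
    println(parseQueryType("feedback"))        // FEEDBACK

    // Markdown-wrapped output, as produced after the line was lost.
    val result = runCatching { parseQueryType("**feedback**") }
    println(result.isFailure)                  // true
}
```

Nothing in the code changed between the two calls; only the shape of the model output did.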
Why can’t you just “add validation”?
First of all, this is a runtime issue. It fails when a real LLM is called, not in unit tests.
Second, the change that caused the bug is not in the code; it’s in the system prompt.
Third, and most important: this exact issue could be caught by validation, but what if the change happened inside a complex JSON response? Or what if a node always returns markdown, but its meaning is now completely different? And what if the form is still correct, but the enum distribution shifted from 50:50 to 1:99?
In practice most AI work lives in the prototyping and research phase. Prompts change weekly. Models get swapped. Workflows are restructured. The output shapes are being discovered. Under such conditions proper validation is a big challenge in itself.
What actually happens here is that the implicit contract between the LLM and the code is silently broken.
I’d call this behavioral drift. It has three properties that make it especially nasty:
- It looks correct to humans. **feedback** next to feedback doesn’t read as a bug.
- It often passes automated checks. Both feedback and **feedback** are valid strings. Schemas describe what’s allowed, not what was normal.
- Safe fallbacks make things worse. If a **feedback** value is silently classified as "General request", days might pass before you notice something is wrong.
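The third property can be sketched as follows. This assumes a lenient router with a catch-all default; `Route` and `routeLeniently` are hypothetical names, and `GENERAL` stands in for the "General request" bucket.

```kotlin
// Hypothetical routes, with GENERAL as the catch-all fallback.
enum class Route { SUPPORT, SALES, FEEDBACK, GENERAL }

// Lenient parsing: anything unrecognized silently becomes GENERAL.
fun routeLeniently(raw: String): Route =
    Route.values().firstOrNull { it.name.equals(raw.trim(), ignoreCase = true) }
        ?: Route.GENERAL

fun main() {
    println(routeLeniently("feedback"))        // FEEDBACK
    // No exception and no log entry: the request is simply misrouted.
    println(routeLeniently("**feedback**"))    // GENERAL
}
```

The workflow keeps running, which is exactly why the problem can go unnoticed.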
Don’t fight it, monitor it
How do we deal with such bugs then?
Instead of adding more and more layers of sophisticated validation, one can compare the workflow’s behaviour before and after every change to the code or to the system prompts. This is a kind of git diff applied to workflow behaviour.
Imagine that, after dropping the aforementioned line from the system prompt, we get a notification:
Classification result format changed from "scalar string" to "markdown".
This is a serious alert worth immediate checking.
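A format-change alert like this one could be produced by a very coarse check. The sketch below is not the tool's implementation: `OutputShape`, `detectShape`, `dominantShape`, and `shapeAlert` are hypothetical names, and the heuristics (a few markdown markers, leading and trailing braces for JSON) are deliberately simple assumptions.

```kotlin
// Coarse output shapes a profile might record; the categories are illustrative.
enum class OutputShape { SCALAR_STRING, MARKDOWN, JSON }

// A few common markdown markers: bold, inline code, ATX headings.
private val markdownMarkers = Regex("""\*\*|__|`|^#{1,6}\s""", RegexOption.MULTILINE)

fun detectShape(output: String): OutputShape {
    val text = output.trim()
    return when {
        (text.startsWith("{") && text.endsWith("}")) ||
            (text.startsWith("[") && text.endsWith("]")) -> OutputShape.JSON
        markdownMarkers.containsMatchIn(text) -> OutputShape.MARKDOWN
        else -> OutputShape.SCALAR_STRING
    }
}

// Most frequent shape among a node's recorded outputs, or null if there are none.
fun dominantShape(outputs: List<String>): OutputShape? =
    outputs.map(::detectShape).groupingBy { it }.eachCount()
        .maxByOrNull { it.value }?.key

// Returns an alert message when the dominant shape differs between two runs.
fun shapeAlert(before: List<String>, after: List<String>): String? {
    val old = dominantShape(before)
    val current = dominantShape(after)
    return if (old != current) "Output shape changed from $old to $current" else null
}

fun main() {
    val baseline = listOf("feedback", "sales", "support")
    val candidate = listOf("**feedback**", "**sales**", "**support**")
    println(shapeAlert(baseline, candidate))
    // Output shape changed from SCALAR_STRING to MARKDOWN
}
```

The point of comparing shapes rather than validating them is that no one has to decide in advance which shape is the right one; the baseline run defines what was normal.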
In a more complex case the monitoring system might warn you that
Distribution of node responses changed from 30/30/40 to 10/0/90
which is a behavioural drift probably concealing a serious problem.
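One way to quantify such a shift is the total variation distance between the two label distributions. This is a sketch of the general idea, not necessarily what the tool does; the function names and the 0.2 alert threshold are assumptions chosen for illustration.

```kotlin
import kotlin.math.abs

// Share of each label among the observed outputs of a node.
fun distribution(labels: List<String>): Map<String, Double> =
    labels.groupingBy { it }.eachCount()
        .mapValues { (_, count) -> count.toDouble() / labels.size }

// Total variation distance: 0.0 means identical distributions,
// 1.0 means they share no labels at all.
fun totalVariation(p: Map<String, Double>, q: Map<String, Double>): Double =
    (p.keys + q.keys).sumOf { abs((p[it] ?: 0.0) - (q[it] ?: 0.0)) } / 2

fun main() {
    // The 30/30/40 and 10/0/90 distributions from the example alert.
    val before = mapOf("support" to 0.30, "sales" to 0.30, "feedback" to 0.40)
    val after = mapOf("support" to 0.10, "sales" to 0.0, "feedback" to 0.90)

    val drift = totalVariation(before, after)
    val threshold = 0.2 // arbitrary illustrative threshold
    if (drift > threshold) println("Distribution drift detected: $drift")
}
```

For the example distributions the distance is about 0.5, meaning half of the probability mass moved to a different label. A real comparison would also need enough samples per node for the difference to be meaningful.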
Anomaly Detector tool
To obtain some peace of mind during rapid AI workflow prototyping I’ve implemented a small Kotlin/JVM tool which (in its current version) does the following:
- captures the most important characteristics of each step’s output as “workflow node profiles”;
- upon request, compares two workflow versions and flags possible anomalies, ordered by severity.
Example findings:
- a node is missing (never visited) in the new version;
- a node is still present, but its output form changed drastically;
- a node started to route to a different target most of the time.
I tried to keep the tool minimally invasive. To collect profiling information you just need to add checkpoints like this to your code:
detector.checkpoint(step = "classify-query", output = response.content)
You can find the tool with full documentation here:
https://github.com/minogin/ai-anomaly-detector
The bigger point
Non-deterministic software is still software. It’s time to expand our toolkit for these kinds of systems, so that we can answer not just deterministic-world questions such as “is this node output correct?”, but also questions like “did this node’s output distribution change significantly?”
If you’ve shipped LLM workflows in production, you’ve probably seen something like this. Please share your experience: what non-deterministic problems did you face, and what was the cure?