Observability and Evals: How to Detect Early When Your Assistant Deviates
17-03-2026
1. What to measure: beyond the prompt
An observable system measures:
- LLM inputs and outputs (text, size, tokenization).
- Tools used (which ones, how many times, with what inputs).
- Latencies: per turn, per tool, per component.
- Tokens used: per message, per session, per type.
Each turn must generate a complete technical trace.
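The per-turn trace described above can be sketched as a small record type. This is a minimal, illustrative schema, not a standard: the field names (user_id, tool_calls, etc.) are assumptions for the example.

```python
# Minimal sketch of a per-turn technical trace; the schema is illustrative.
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class TurnTrace:
    user_id: str
    conversation_id: str
    turn_id: int
    input_text: str = ""
    output_text: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: list = field(default_factory=list)  # e.g. {"name", "args", "ms"}
    latency_ms: float = 0.0
    started_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize the complete trace for the logging backend."""
        return json.dumps(asdict(self))

# One trace per turn: inputs, outputs, tools, latencies, tokens.
trace = TurnTrace("u-42", "c-7", 1, input_text="What's my balance?")
trace.tool_calls.append({"name": "get_balance", "args": {"account": "main"}, "ms": 220})
record = json.loads(trace.to_json())
```

Keeping the trace as one flat, serializable record makes it cheap to ship to whatever log store or dashboard you already use.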
2. Traceability per turn and user
Every interaction must be linked to:
- userId
- conversationId
- turnId
This makes it possible to reconstruct sequences, detect failures, and debug precisely. Additionally, the following should be stored:
- Reason codes for decisions.
- Logs of tools called and results.
- Fallback or error events.
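Reason codes, tool logs, and fallback events can all share the same key triple, so any event can be tied back to its turn. A hedged sketch, with illustrative event names and reason codes:

```python
# Sketch: structured events keyed by (userId, conversationId, turnId).
# Event names and reason codes here are illustrative, not a standard.
import json
import logging

logger = logging.getLogger("assistant.trace")

def log_event(user_id, conversation_id, turn_id, event, **detail):
    # One structured line per event; the shared key triple lets you
    # reconstruct the full sequence of a conversation later.
    payload = {"userId": user_id, "conversationId": conversation_id,
               "turnId": turn_id, "event": event, **detail}
    logger.info(json.dumps(payload))
    return payload

evt = log_event("u-42", "c-7", 3, "fallback",
                reason_code="NO_TOOL_MATCH", fallback_to="generic_answer")
```

Emitting events as structured JSON lines (rather than free text) is what makes the later filtering, sampling, and KPI computation practical.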
3. Evals: deterministic vs conversational
- Deterministic: outputs are validated against an expected reference. Useful for tools, logic, and business rules.
- Conversational: response quality is evaluated through human labels or evaluator models. They measure relevance, tone, coverage.
Both types should be part of the continuous validation pipeline.
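A deterministic eval is just a set of cases with expected references. The sketch below assumes a hypothetical `run_tool` dispatcher and an invented `add_vat` tool; the structure, not the names, is the point.

```python
# Sketch of a deterministic eval: tool outputs validated against
# expected references. run_tool and add_vat are hypothetical stand-ins.
def run_tool(name, args):
    # Stand-in for the real tool dispatcher.
    if name == "add_vat":
        return round(args["net"] * 1.21, 2)
    raise KeyError(name)

CASES = [
    {"tool": "add_vat", "args": {"net": 100.0}, "expected": 121.0},
    {"tool": "add_vat", "args": {"net": 19.99}, "expected": 24.19},
]

def run_deterministic_evals(cases):
    """Return a pass/fail report suitable for a CI gate."""
    failures = [c for c in cases
                if run_tool(c["tool"], c["args"]) != c["expected"]]
    return {"total": len(cases), "failed": len(failures), "failures": failures}

report = run_deterministic_evals(CASES)
```

Because the check is exact, this kind of eval can run on every commit; conversational evals, which need human labels or evaluator models, run on a slower cadence.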
4. Thresholds, alerts, and action
Observing is not enough: action is required. Define:
- Critical KPIs: like cost per turn, fallback ratio, correct tool ratio.
- Thresholds: acceptable values by context or environment.
- Alerts: automatic, with defined channels.
- Actions: rollback, restart, safe fallback.
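The KPI-threshold-alert loop can be sketched as a simple check that returns breaches for the alerting channel to act on. The KPI names and limits below are illustrative:

```python
# Sketch of KPI threshold checks; KPI names and limits are illustrative.
THRESHOLDS = {
    "tool_success_rate": ("min", 0.90),   # alert if it drops below
    "fallback_rate":     ("max", 0.15),   # alert if it rises above
    "cost_per_turn_eur": ("max", 0.05),
}

def check_kpis(metrics, thresholds=THRESHOLDS):
    """Return the breached KPIs so an alert (and action) can fire."""
    breaches = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue  # KPI not reported this window
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append({"kpi": name, "value": value, "limit": limit})
    return breaches

breaches = check_kpis({"tool_success_rate": 0.84,
                       "fallback_rate": 0.10,
                       "cost_per_turn_eur": 0.07})
```

Each breach can then be routed to its channel (Slack, dashboard, pager) and mapped to an action: rollback, restart, or safe fallback.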
5. Circuit breakers and degradation
In the face of critical deviations:
- Apply circuit breakers that disable faulty routes.
- Activate controlled degradation: basic responses, only deterministic flow, no tools.
This protects the user experience and prevents repeated errors.
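A circuit breaker with controlled degradation can be sketched in a few lines. This is a minimal illustration (fixed failure count, no automatic reset), not a production implementation:

```python
# Sketch of a circuit breaker: after max_failures consecutive errors
# the route is disabled and a degraded, tool-free answer is served.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False  # open circuit = faulty route disabled

    def call(self, fn, *args, degraded=lambda: "basic response"):
        if self.open:
            return degraded()  # controlled degradation: deterministic, no tools
        try:
            result = fn(*args)
            self.failures = 0  # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True  # disable the route
            return degraded()

def flaky_tool():
    raise RuntimeError("tool down")

cb = CircuitBreaker(max_failures=2)
outputs = [cb.call(flaky_tool) for _ in range(3)]
```

A real deployment would also add a half-open state that periodically probes the route and closes the circuit once the tool recovers.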
Critical KPIs Table
| KPI | Definition | Threshold | Alert | Suggested Action |
|---|---|---|---|---|
| Tool success rate | % of tool calls that succeed | >90% | Slack/Email | Review arguments and schema |
| Fallback rate | % of turns ending in fallback | <15% | Dashboard | Refine intent classifier |
| Cost per turn | Tokens × price | <0.05 € | Cloud logs | Limit context size |
| Average time per tool | Average ms per execution | <1200 ms | Grafana | Review the slow tool |
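As a worked example, the "cost per turn" KPI in the table is just tokens × price. The per-token prices below are illustrative placeholders, not real pricing:

```python
# Worked example of the cost-per-turn KPI: tokens × price.
# Prices are assumed placeholders (€ per 1,000 tokens), not real pricing.
PRICE_PER_1K_EUR = {"input": 0.0005, "output": 0.0015}

def cost_per_turn(input_tokens, output_tokens, prices=PRICE_PER_1K_EUR):
    return (input_tokens / 1000) * prices["input"] + \
           (output_tokens / 1000) * prices["output"]

# 1,200 input tokens + 800 output tokens for one turn:
cost = cost_per_turn(1200, 800)  # 0.0006 + 0.0012 = 0.0018 €
```

Computed per turn and aggregated per session, this is the number the <0.05 € threshold in the table is checked against.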
Operational Checklist
- Regular session sampling (automated).
- Redaction or anonymization of PII.
- Accessible and updated metrics dashboard.
Frequently Asked Questions
- How often should evals be run?
  It depends on volume. Ideally, continuously for deterministic evals and every few days for conversational ones.
- How should data be labeled for evals?
  You can use human annotators, internal QA flows, or specialized evaluator models. Consistency is key.
Conclusion
Quality is not an accident: it is measured, traced, and improved. Observability and evals are not optional extras, but pillars for an LLM-based assistant to survive in production. At Lean Mind, we work with our clients to instrument these systems from day one, ensuring stability, traceability, and continuous improvement.
