Observability and Evals: How to Detect Early When Your Assistant Deviates
17-03-2026
1. What to measure: beyond the prompt
An observable system measures:
- LLM inputs and outputs (text, size, tokenization).
- Tools used (which ones, how many times, with what inputs).
- Latencies: per turn, per tool, per component.
- Tokens used: per message, per session, per type.
Each turn must generate a complete technical trace.
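The per-turn trace described above can be sketched as a small record type. This is a minimal, illustrative schema, not a standard: the field names (user_id, tool_calls, etc.) are assumptions for the example.

```python
# Minimal sketch of a per-turn technical trace; the schema is illustrative.
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class TurnTrace:
    user_id: str
    conversation_id: str
    turn_id: int
    input_text: str = ""
    output_text: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: list = field(default_factory=list)  # e.g. {"name", "args", "ms"}
    latency_ms: float = 0.0
    started_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize the complete trace for the logging backend."""
        return json.dumps(asdict(self))

# One trace per turn: inputs, outputs, tools, latencies, tokens.
trace = TurnTrace("u-42", "c-7", 1, input_text="What's my balance?")
trace.tool_calls.append({"name": "get_balance", "args": {"account": "main"}, "ms": 220})
record = json.loads(trace.to_json())
```

Keeping the trace as one flat, serializable record makes it cheap to ship to whatever log store or dashboard you already use.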
2. Traceability per turn and user
Every interaction must be linked to:
- userId
- conversationId
- turnId
This makes it possible to reconstruct sequences, detect failures, and debug precisely. Additionally, the following should be stored:
- Reason codes for decisions.
- Logs of tools called and results.
- Fallback or error events.
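Reason codes, tool logs, and fallback events can all share the same key triple, so any event can be tied back to its turn. A hedged sketch, with illustrative event names and reason codes:

```python
# Sketch: structured events keyed by (userId, conversationId, turnId).
# Event names and reason codes here are illustrative, not a standard.
import json
import logging

logger = logging.getLogger("assistant.trace")

def log_event(user_id, conversation_id, turn_id, event, **detail):
    # One structured line per event; the shared key triple lets you
    # reconstruct the full sequence of a conversation later.
    payload = {"userId": user_id, "conversationId": conversation_id,
               "turnId": turn_id, "event": event, **detail}
    logger.info(json.dumps(payload))
    return payload

evt = log_event("u-42", "c-7", 3, "fallback",
                reason_code="NO_TOOL_MATCH", fallback_to="generic_answer")
```

Emitting events as structured JSON lines (rather than free text) is what makes the later filtering, sampling, and KPI computation practical.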
3. Evals: deterministic vs conversational
- Deterministic: outputs are validated against an expected reference. Useful for tools, logic, and business rules.
- Conversational: response quality is evaluated through human labels or evaluator models. They measure relevance, tone, coverage.
Both types should be part of the continuous validation pipeline.
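A deterministic eval is just a set of cases with expected references. The sketch below assumes a hypothetical `run_tool` dispatcher and an invented `add_vat` tool; the structure, not the names, is the point.

```python
# Sketch of a deterministic eval: tool outputs validated against
# expected references. run_tool and add_vat are hypothetical stand-ins.
def run_tool(name, args):
    # Stand-in for the real tool dispatcher.
    if name == "add_vat":
        return round(args["net"] * 1.21, 2)
    raise KeyError(name)

CASES = [
    {"tool": "add_vat", "args": {"net": 100.0}, "expected": 121.0},
    {"tool": "add_vat", "args": {"net": 19.99}, "expected": 24.19},
]

def run_deterministic_evals(cases):
    """Return a pass/fail report suitable for a CI gate."""
    failures = [c for c in cases
                if run_tool(c["tool"], c["args"]) != c["expected"]]
    return {"total": len(cases), "failed": len(failures), "failures": failures}

report = run_deterministic_evals(CASES)
```

Because the check is exact, this kind of eval can run on every commit; conversational evals, which need human labels or evaluator models, run on a slower cadence.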
4. Thresholds, alerts, and action
Observing is not enough: action is required. Define:
- Critical KPIs: like cost per turn, fallback ratio, correct tool ratio.
- Thresholds: acceptable values by context or environment.
- Alerts: automatic, with defined channels.
- Actions: rollback, restart, safe fallback.
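The KPI-threshold-alert loop can be sketched as a simple check that returns breaches for the alerting channel to act on. The KPI names and limits below are illustrative:

```python
# Sketch of KPI threshold checks; KPI names and limits are illustrative.
THRESHOLDS = {
    "tool_success_rate": ("min", 0.90),   # alert if it drops below
    "fallback_rate":     ("max", 0.15),   # alert if it rises above
    "cost_per_turn_eur": ("max", 0.05),
}

def check_kpis(metrics, thresholds=THRESHOLDS):
    """Return the breached KPIs so an alert (and action) can fire."""
    breaches = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue  # KPI not reported this window
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append({"kpi": name, "value": value, "limit": limit})
    return breaches

breaches = check_kpis({"tool_success_rate": 0.84,
                       "fallback_rate": 0.10,
                       "cost_per_turn_eur": 0.07})
```

Each breach can then be routed to its channel (Slack, dashboard, pager) and mapped to an action: rollback, restart, or safe fallback.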
5. Circuit breakers and degradation
In the face of critical deviations:
- Apply circuit breakers that disable faulty routes.
- Activate controlled degradation: basic responses, only deterministic flow, no tools.
This protects the user experience and prevents repeated errors.
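A circuit breaker with controlled degradation can be sketched in a few lines. This is a minimal illustration (fixed failure count, no automatic reset), not a production implementation:

```python
# Sketch of a circuit breaker: after max_failures consecutive errors
# the route is disabled and a degraded, tool-free answer is served.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False  # open circuit = faulty route disabled

    def call(self, fn, *args, degraded=lambda: "basic response"):
        if self.open:
            return degraded()  # controlled degradation: deterministic, no tools
        try:
            result = fn(*args)
            self.failures = 0  # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True  # disable the route
            return degraded()

def flaky_tool():
    raise RuntimeError("tool down")

cb = CircuitBreaker(max_failures=2)
outputs = [cb.call(flaky_tool) for _ in range(3)]
```

A real deployment would also add a half-open state that periodically probes the route and closes the circuit once the tool recovers.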
Critical KPIs Table
| KPI | Definition | Threshold | Alert | Suggested Action |
|---|---|---|---|---|
| Tool success rate | % of tool calls that succeed | >90% | Slack/Email | Review arguments and schema |
| Fallback rate | % of turns ending in fallback | <15% | Dashboard | Refine intent classifier |
| Cost per turn | Tokens × price | <0.05 € | Cloud logs | Limit context size |
| Average time per tool | Average ms per execution | <1200 ms | Grafana | Review the slow tool |
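As a worked example, the "cost per turn" KPI in the table is just tokens × price. The per-token prices below are illustrative placeholders, not real pricing:

```python
# Worked example of the cost-per-turn KPI: tokens × price.
# Prices are assumed placeholders (€ per 1,000 tokens), not real pricing.
PRICE_PER_1K_EUR = {"input": 0.0005, "output": 0.0015}

def cost_per_turn(input_tokens, output_tokens, prices=PRICE_PER_1K_EUR):
    return (input_tokens / 1000) * prices["input"] + \
           (output_tokens / 1000) * prices["output"]

# 1,200 input tokens + 800 output tokens for one turn:
cost = cost_per_turn(1200, 800)  # 0.0006 + 0.0012 = 0.0018 €
```

Computed per turn and aggregated per session, this is the number the <0.05 € threshold in the table is checked against.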
Operational Checklist
- Regular session sampling (automated).
- Redaction or anonymization of PII.
- Accessible and updated metrics dashboard.
Frequently Asked Questions
- How often should evals be run?
  It depends on volume. Ideally, continuously for deterministic evals and every few days for conversational ones.
- How should data be labeled for evals?
  You can use human annotators, internal QA flows, or specialized evaluator models. Consistency is key.
Conclusion
Quality is not an accident: it is measured, traced, and improved. Observability and evals are not optional extras, but pillars for an LLM-based assistant to survive in production. At Lean Mind, we work with our clients to instrument these systems from day one, ensuring stability, traceability, and continuous improvement.
