Observability and Evals: How to Detect Early When Your Assistant Deviates

17-03-2026

By Cristian Suarez Vera

1. What to measure: beyond the prompt

An observable system measures:

  • LLM inputs and outputs (text, size, tokenization).
  • Tools used (which ones, how many times, with what inputs).
  • Latencies: per turn, per tool, per component.
  • Tokens used: per message, per session, per type.

Each turn must generate a complete technical trace.
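The measurements above can be sketched as a per-turn trace record. This is a minimal illustration, not the schema of any particular tracing library; all field names are assumptions:

```python
from dataclasses import dataclass, field, asdict
import json
import time

@dataclass
class TurnTrace:
    """One complete technical trace per assistant turn (illustrative fields)."""
    user_id: str
    conversation_id: str
    turn_id: str
    input_text: str = ""
    output_text: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: list = field(default_factory=list)  # e.g. {"name", "args", "latency_ms"}
    latency_ms: float = 0.0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

trace = TurnTrace(user_id="u-1", conversation_id="c-9", turn_id="t-3",
                  input_text="hello", input_tokens=2)
start = time.perf_counter()
# ... the LLM call and tool executions would happen here ...
trace.latency_ms = (time.perf_counter() - start) * 1000
print(trace.to_json())
```

Emitting one such record per turn is what makes the later sections (evals, thresholds, alerts) possible: every KPI in this article can be aggregated from these fields.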

2. Traceability per turn and user

Every interaction must be linked to:

  • userId
  • conversationId
  • turnId

This makes it possible to reconstruct sequences, detect failures, and debug precisely. Additionally, the following should be stored:

  • Reason codes for decisions.
  • Logs of tools called and results.
  • Fallback or error events.
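One simple way to store these is a structured log event keyed by the three IDs. A minimal sketch, assuming a JSON-lines log format (the event names and `reason_code` values are illustrative):

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("assistant")

def log_event(user_id: str, conversation_id: str, turn_id: str, event: str, **extra) -> dict:
    """Emit one structured event linked to userId/conversationId/turnId."""
    record = {"userId": user_id, "conversationId": conversation_id,
              "turnId": turn_id, "event": event, **extra}
    log.info(json.dumps(record))
    return record

# A tool call, its result, and a fallback with a reason code:
log_event("u-42", "c-7", "t-3", "tool_call", tool="search", result="ok")
log_event("u-42", "c-7", "t-3", "fallback", reason_code="LOW_CONFIDENCE")
```

Because every event carries the same three keys, a single query over the logs reconstructs the full sequence of any conversation.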

3. Evals: deterministic vs conversational

  • Deterministic: outputs are validated against an expected reference. Useful for tools, logic, and business rules.
  • Conversational: response quality is evaluated through human labels or evaluator models. These measure relevance, tone, and coverage.

Both types should be part of the continuous validation pipeline.
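A deterministic eval can be as simple as comparing outputs against expected references. The harness below is a sketch; the date-normalizer under test is a hypothetical stand-in for any tool or business rule:

```python
def run_deterministic_evals(cases: list[dict], system_under_test) -> float:
    """Compare outputs against expected references; return the pass rate."""
    passed = 0
    for case in cases:
        actual = system_under_test(case["input"])
        if actual == case["expected"]:
            passed += 1
        else:
            print(f"FAIL {case['input']!r}: expected {case['expected']!r}, got {actual!r}")
    return passed / len(cases)

# Hypothetical tool under test: a date normalizer.
def normalize(s: str) -> str:
    return s.replace("/", "-")

cases = [
    {"input": "17/03/2026", "expected": "17-03-2026"},
    {"input": "01/01/2025", "expected": "01-01-2025"},
]
rate = run_deterministic_evals(cases, normalize)
```

Running this on every commit is cheap, which is why deterministic evals belong in the continuous pipeline while conversational evals can run on a slower cadence.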

4. Thresholds, alerts, and action

Observing is not enough: action is required. Define:

  • Critical KPIs: such as cost per turn, fallback ratio, and correct-tool ratio.
  • Thresholds: acceptable values by context or environment.
  • Alerts: automatic, with defined channels.
  • Actions: rollback, restart, safe fallback.
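The KPIs, thresholds, and alerts above can be wired together in a few lines. This is an illustrative sketch; the threshold values mirror the KPI table later in this article, and the alert channel is left as a stub:

```python
# Each KPI maps to (limit, direction): "above" means the value must stay
# above the limit, "below" means it must stay below it.
THRESHOLDS = {
    "tool_success_rate": (0.90, "above"),
    "fallback_rate": (0.15, "below"),
    "cost_per_turn_eur": (0.05, "below"),
}

def check_kpis(metrics: dict) -> list:
    """Return the list of KPIs that breached their threshold."""
    breaches = []
    for kpi, (limit, direction) in THRESHOLDS.items():
        value = metrics[kpi]
        ok = value >= limit if direction == "above" else value <= limit
        if not ok:
            breaches.append(kpi)
    return breaches

alerts = check_kpis({"tool_success_rate": 0.87,
                     "fallback_rate": 0.22,
                     "cost_per_turn_eur": 0.03})
# Each breach would then trigger its defined channel (Slack, email)
# and action (rollback, restart, safe fallback).
```

Keeping thresholds in data rather than code makes it easy to tune them per environment (staging vs production) without redeploying.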

5. Circuit breakers and degradation

When a critical deviation occurs:

  • Apply circuit breakers that disable faulty routes.
  • Activate controlled degradation: basic responses, deterministic flow only, no tools.

This protects the user experience and prevents repeated errors.
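A minimal circuit breaker that disables a faulty route and falls back to a degraded response might look like this. The trip policy (consecutive-failure count) and the tool/fallback functions are assumptions for illustration:

```python
class CircuitBreaker:
    """Disable a route after N consecutive failures (illustrative policy)."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, route, fallback, *args):
        if self.open:
            return fallback(*args)  # degraded mode: skip the faulty route
        try:
            result = route(*args)
            self.failures = 0       # a success resets the counter
            return result
        except Exception:
            self.failures += 1
            return fallback(*args)

def flaky_tool(query: str) -> str:
    raise RuntimeError("tool down")

def safe_fallback(query: str) -> str:
    return "Sorry, I can't use that tool right now."

breaker = CircuitBreaker(max_failures=2)
for _ in range(3):
    answer = breaker.call(flaky_tool, safe_fallback, "query")
```

A production breaker would also re-close after a cooldown (half-open state), but even this simple version stops a failing tool from degrading every subsequent turn.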

Critical KPIs Table

| KPI | Definition | Threshold | Alert channel | Suggested action |
|---|---|---|---|---|
| Tool success rate | % of tool calls that succeed | > 90% | Slack/Email | Review arguments and schema |
| Fallback rate | % of turns with fallback | < 15% | Dashboard | Refine the intent classifier |
| Cost per turn | tokens × price | < 0.05 € | Cloud logs | Limit context |
| Average time per tool | average ms per execution | < 1200 ms | Grafana | Review the slow tool |

Operational Checklist

  • Regular session sampling (automated).
  • Redaction or anonymization of PII.
  • Accessible and updated metrics dashboard.

Frequently Asked Questions

  • How often should evals be run?

    • It depends on volume. Ideally, run deterministic evals continuously and conversational evals every few days.
  • How to label data for evals?

    • You can use human annotators, internal QA flows, or specialized evaluator models. Consistency is key.

Conclusion

Quality is not an accident: it is measured, traced, and improved. Observability and evals are not optional extras, but pillars for an LLM-based assistant to survive in production. At Lean Mind, we work with our clients to instrument these systems from day one, ensuring stability, traceability, and continuous improvement.