Every agent execution is automatically evaluated by an LLM judge that scores performance from 0-100. These scores drive automatic improvement suggestions and help you track agent quality over time.

How It Works

  1. Agent Executes: Your agent completes a run (triggered by a user, schedule, or event).
  2. Automatic Evaluation: Immediately after completion, an LLM judge analyzes the execution and assigns a score (0-100).
  3. Score Recorded: The score is logged and displayed in the run history.
  4. Improvement Suggestions: Low-scoring runs automatically trigger AI-generated improvement suggestions.

No configuration needed - Evaluations run automatically on every execution. The system learns what “good” looks like based on your agent’s goals and past performance.
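
In code terms, the pipeline is roughly "run, judge, record, suggest". The sketch below shows what the judge step might look like; it is a minimal illustration, and the llm_client object, its complete() method, and the JSON reply format are assumptions, not the platform's actual API.

```python
import json


def judge_execution(llm_client, agent_goal: str, transcript: str) -> int:
    """Ask an LLM judge to score a completed agent run from 0 to 100."""
    prompt = (
        "You are evaluating an AI agent's execution.\n"
        f"Agent goal: {agent_goal}\n"
        f"Execution transcript:\n{transcript}\n\n"
        "Score the run from 0 to 100, considering accuracy, completeness, "
        'tool usage, and output quality. Reply as JSON: {"score": <int>, "reason": "..."}'
    )
    reply = llm_client.complete(prompt)  # hypothetical chat-completion call
    result = json.loads(reply)
    return max(0, min(100, int(result["score"])))  # clamp to the documented 0-100 range
```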

What Gets Evaluated

The LLM judge assesses multiple dimensions:

  • Accuracy: Did the agent correctly assess the situation and reach the right conclusion?
  • Completeness: Did it gather all necessary information before deciding?
  • Tool Usage: Were tools used appropriately and efficiently?
  • Output Quality: Is the response clear, well-formatted, and actionable?
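
The documentation names these dimensions but not how the judge weighs or combines them, so the equal weights in the sketch below are an assumption, shown only to illustrate how per-dimension scores could fold into a single 0-100 value.

```python
# Assumed rubric: the dimensions come from the docs, the equal weights do not.
RUBRIC = {
    "accuracy": 0.25,        # right assessment and conclusion
    "completeness": 0.25,    # gathered necessary information before deciding
    "tool_usage": 0.25,      # tools used appropriately and efficiently
    "output_quality": 0.25,  # clear, well-formatted, actionable output
}


def combine(dimension_scores: dict[str, float]) -> float:
    """Fold per-dimension scores (each 0-100) into one overall 0-100 score."""
    return sum(weight * dimension_scores[name] for name, weight in RUBRIC.items())
```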

Evaluation Scores

Scores range from 0-100:
  • 90-100: Excellent - Agent performed optimally
  • 75-89: Good - Minor improvements possible
  • 60-74: Acceptable - Some issues identified
  • Below 60: Needs improvement
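
If you export scores for your own reporting, these bands translate directly into code. The labels below come from the list above; the helper itself is just a convenience, not part of the product.

```python
def score_band(score: int) -> str:
    """Map a 0-100 evaluation score to the documented quality bands."""
    if score >= 90:
        return "Excellent - agent performed optimally"
    if score >= 75:
        return "Good - minor improvements possible"
    if score >= 60:
        return "Acceptable - some issues identified"
    return "Needs improvement"


print(score_band(82))  # -> "Good - minor improvements possible"
```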

Improvement Workflow

When multiple runs show similar low-scoring patterns, the system automatically:
  1. Analyzes the execution to identify specific issues
  2. Generates improvement suggestions for your system prompt
  3. Shows the “Suggested Improvements” panel (see AI-Suggested Improvements)
  4. Lets you review and apply fixes with one click
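
The exact conditions that count as "multiple runs with similar low-scoring patterns" are not documented, but the trigger can be approximated as in the sketch below. The window size, cutoff, and minimum count are assumptions chosen for illustration (the cutoff matches the "Needs improvement" band).

```python
LOW_SCORE = 60    # assumed cutoff, aligned with the "Needs improvement" band
WINDOW = 5        # assumed number of recent runs to inspect
MIN_LOW_RUNS = 3  # assumed number of low runs that counts as a pattern


def should_suggest_improvements(recent_scores: list[int]) -> bool:
    """Return True when enough recent runs score poorly to warrant prompt suggestions."""
    window = recent_scores[-WINDOW:]
    return sum(1 for score in window if score < LOW_SCORE) >= MIN_LOW_RUNS
```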

AI-Suggested Improvements

Learn how to review and apply automatic improvement suggestions

Best Practices

  • When you see a low score, click into the run to understand what went wrong. The evaluation breakdown shows exactly what needs improvement.
  • Don’t ignore the “Suggested Improvements” panel. Most issues can be fixed by accepting the AI-generated prompt changes.
  • Evaluations are automatic, but you can also provide thumbs up/down feedback to give additional signal about what’s working.

What Makes a Good Score?

Don’t aim for perfect 100s - Agents don’t need to be perfect; they need to be useful. A consistent 80-90 range usually indicates a well-tuned agent.
Focus on:
  • Consistency: Is the agent reliable across different scenarios?
  • Improvement: Are scores trending upward after applying suggestions?
  • User satisfaction: Do evaluation scores align with user feedback?
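
If you pull scores out of the run history, a few lines of Python are enough to watch these signals yourself. The statistics and the example values below are illustrative; they are not produced by the platform.

```python
from statistics import mean, pstdev


def score_health(scores: list[int]) -> dict:
    """Summarize consistency and trend for a series of evaluation scores."""
    half = len(scores) // 2
    return {
        "average": mean(scores),
        "spread": pstdev(scores),                                # lower spread = more consistent
        "improving": mean(scores[half:]) > mean(scores[:half]),  # later runs vs. earlier runs
    }


# Example with made-up scores: reports the average, the spread, and whether
# the more recent half of runs outperforms the earlier half.
print(score_health([72, 78, 81, 85, 88, 86]))
```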