How It Works
1. Agent Executes: Your agent completes a run (triggered by user, schedule, or event)
2. Automatic Evaluation: Immediately after completion, an LLM judge analyzes the execution and assigns a score (0-100)
3. Score Recorded: The score is logged and displayed in run history
4. Improvement Suggestions: Low-scoring runs automatically trigger AI-generated improvement suggestions
No configuration needed - Evaluations run automatically on every execution. The system learns what “good” looks like based on your agent’s goals and past performance.
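As an illustration of the loop above, here is a minimal Python sketch of a post-run evaluation hook. It is not the platform’s actual implementation: the `RunRecord` structure, the `judge_run` stub, and the threshold of 60 are assumptions standing in for your own run data, judge-model call, and cutoff.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    run_id: str
    transcript: str           # full execution trace of the agent run
    score: int | None = None  # 0-100, filled in by the LLM judge

def judge_run(transcript: str) -> int:
    """Stand-in for the LLM judge call: a real system would send the transcript
    plus the agent's goals to a judge model and parse a 0-100 score from its reply."""
    return 82  # placeholder score for this sketch

def on_run_completed(run: RunRecord, history: list[RunRecord]) -> None:
    # Step 2: evaluate immediately after the run finishes.
    run.score = judge_run(run.transcript)
    # Step 3: record the score so it shows up in run history.
    history.append(run)
    # Step 4: low scores get flagged for improvement suggestions.
    if run.score < 60:
        print(f"Run {run.run_id} scored {run.score}: flagged for improvement suggestions")

history: list[RunRecord] = []
on_run_completed(RunRecord(run_id="run-001", transcript="...agent execution trace..."), history)
print([(r.run_id, r.score) for r in history])
```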
What Gets Evaluated
The LLM judge assesses multiple dimensions:
- Accuracy: Did the agent correctly assess the situation and reach the right conclusion?
- Completeness: Did it gather all necessary information before deciding?
- Tool Usage: Were tools used appropriately and efficiently?
- Output Quality: Is the response clear, well-formatted, and actionable?
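To make the dimensions concrete, here is a hedged sketch of how they could be expressed as a rubric inside a judge prompt. The `RUBRIC` dictionary and `build_judge_prompt` helper are illustrative names, not part of the product’s API.

```python
# The four dimensions above, expressed as a rubric a judge prompt could score against.
RUBRIC = {
    "Accuracy": "Did the agent correctly assess the situation and reach the right conclusion?",
    "Completeness": "Did it gather all necessary information before deciding?",
    "Tool Usage": "Were tools used appropriately and efficiently?",
    "Output Quality": "Is the response clear, well-formatted, and actionable?",
}

def build_judge_prompt(agent_goal: str, transcript: str) -> str:
    """Assemble a judge prompt that asks for a single 0-100 score."""
    criteria = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    return (
        f"You are evaluating an agent whose goal is: {agent_goal}\n\n"
        f"Execution transcript:\n{transcript}\n\n"
        f"Assess the run on these dimensions:\n{criteria}\n\n"
        "Reply with a single overall score from 0 to 100."
    )

print(build_judge_prompt("Triage inbound support tickets", "...transcript..."))
```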
Evaluation Scores
Scores range from 0-100:
- 90-100: Excellent - Agent performed optimally
- 75-89: Good - Minor improvements possible
- 60-74: Acceptable - Some issues identified
- Below 60: Needs improvement
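A small sketch of how these bands map from a raw 0-100 score; the `score_band` function is purely illustrative.

```python
def score_band(score: int) -> str:
    """Map a 0-100 judge score to the bands listed above."""
    if score >= 90:
        return "Excellent - agent performed optimally"
    if score >= 75:
        return "Good - minor improvements possible"
    if score >= 60:
        return "Acceptable - some issues identified"
    return "Needs improvement"

assert score_band(95).startswith("Excellent")
assert score_band(58) == "Needs improvement"
```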
Improvement Workflow
When multiple runs show similar low-scoring patterns, the system automatically:
- Analyzes the executions to identify specific issues
- Generates improvement suggestions for your system prompt
- Shows the “Suggested Improvements” panel (see AI-Suggested Improvements)
- Lets you review and apply fixes with one click
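A rough sketch of the pattern-detection step under simple assumptions: `LOW_SCORE_THRESHOLD` and `PATTERN_MIN_COUNT` are made-up knobs, and `suggest_prompt_fix` stands in for the AI-generated suggestion step.

```python
LOW_SCORE_THRESHOLD = 60  # assumed cutoff for "low-scoring"
PATTERN_MIN_COUNT = 3     # assumed number of similar low runs before suggesting a fix

def find_low_scoring_pattern(recent_runs: list[dict]) -> list[dict]:
    """Return recent low-scoring runs, but only if there are enough of them
    to look like a pattern rather than a one-off."""
    low = [r for r in recent_runs if r["score"] < LOW_SCORE_THRESHOLD]
    return low if len(low) >= PATTERN_MIN_COUNT else []

def suggest_prompt_fix(low_runs: list[dict]) -> str:
    """Stand-in for the AI-generated suggestion: a real system would ask an LLM
    to diagnose the shared failure and propose a system-prompt edit."""
    issues = "; ".join(r["judge_notes"] for r in low_runs)
    return f"Suggested system-prompt change to address: {issues}"

recent = [
    {"score": 52, "judge_notes": "skipped a required lookup"},
    {"score": 48, "judge_notes": "answered before checking ticket status"},
    {"score": 55, "judge_notes": "did not call the CRM tool"},
]
pattern = find_low_scoring_pattern(recent)
if pattern:
    print(suggest_prompt_fix(pattern))
```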
See AI-Suggested Improvements to learn how to review and apply automatic improvement suggestions.
Best Practices
Review Low-Scoring Runs
When you see a low score, click into the run to understand what went wrong. The evaluation breakdown shows exactly what needs improvement.
Apply Suggested Improvements
Don’t ignore the “Suggested Improvements” panel. Most issues can be fixed by accepting the AI-generated prompt changes.
Track Trends, Not Individual Runs
One low score isn’t a problem. Look for patterns: if the average score is declining, or is consistently low for certain scenarios, take action.
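One way to watch the trend rather than individual runs is a rolling average over recent scores. The sketch below is illustrative; the window size and drop threshold are assumptions you would tune to your own volume.

```python
from statistics import mean

def rolling_average(scores: list[int], window: int = 10) -> list[float]:
    """Average score over a sliding window of recent runs."""
    return [mean(scores[max(0, i - window + 1): i + 1]) for i in range(len(scores))]

def is_declining(scores: list[int], window: int = 10, drop: float = 5.0) -> bool:
    """Flag the agent when the rolling average has fallen by more than
    `drop` points compared with one window earlier."""
    avg = rolling_average(scores, window)
    return len(avg) >= 2 * window and avg[-1] < avg[-window] - drop

scores = [88, 90, 85, 87, 84, 82, 79, 76, 74, 72, 70, 69, 68, 66, 71, 65, 64, 66, 63, 62]
print(rolling_average(scores)[-1])  # current rolling average
print(is_declining(scores))         # True: the trend, not one run, is the signal
```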
Complement with Manual Feedback
Evaluations are automatic, but you can also provide thumbs up/down feedback to give additional signal about what’s working.
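If you export run scores alongside thumbs up/down feedback, a quick agreement check like the sketch below can surface where the judge and your users disagree. The field names (`score`, `thumbs`) and the 75-point cutoff are assumptions about your own export format, not a built-in report.

```python
def feedback_alignment(runs: list[dict]) -> float:
    """Fraction of user-rated runs where the judge score and the thumbs agree:
    thumbs-up on runs scoring >= 75, thumbs-down on runs scoring below."""
    rated = [r for r in runs if r.get("thumbs") in ("up", "down")]
    if not rated:
        return 0.0
    agree = sum(1 for r in rated if (r["thumbs"] == "up") == (r["score"] >= 75))
    return agree / len(rated)

runs = [
    {"score": 92, "thumbs": "up"},
    {"score": 81, "thumbs": "up"},
    {"score": 55, "thumbs": "down"},
    {"score": 78, "thumbs": "down"},  # mismatch worth investigating
]
print(f"Judge/user agreement: {feedback_alignment(runs):.0%}")
```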
What Makes a Good Score?
Don’t aim for perfect 100s - agents don’t need to be perfect, they need to be useful. A consistent 80-90 range usually indicates a well-tuned agent. Instead of chasing a number, look for:
- Consistency: Is the agent reliable across different scenarios?
- Improvement: Are scores trending upward after applying suggestions?
- User satisfaction: Do evaluation scores align with user feedback?