Every agent execution is automatically evaluated by an LLM judge that scores performance from 0-100. These scores drive automatic improvement suggestions and help you track agent quality over time.

How It Works

  1. Agent Executes: Your agent completes a run (triggered by a user, schedule, or event).
  2. Automatic Evaluation: Immediately after completion, an LLM judge analyzes the execution and assigns a score (0-100).
  3. Score Recorded: The score is logged and displayed in the run history.
  4. Improvement Suggestions: Low-scoring runs automatically trigger AI-generated improvement suggestions.

No configuration needed - Evaluations run automatically on every execution. The system learns what “good” looks like based on your agent’s goals and past performance.
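
In code terms, the pipeline is roughly "run, judge, record, suggest". The sketch below shows what the judge step might look like; it is a minimal illustration, and the llm_client object, its complete() method, and the JSON reply format are assumptions, not the platform's actual API.

```python
import json


def judge_execution(llm_client, agent_goal: str, transcript: str) -> int:
    """Ask an LLM judge to score a completed agent run from 0 to 100."""
    prompt = (
        "You are evaluating an AI agent's execution.\n"
        f"Agent goal: {agent_goal}\n"
        f"Execution transcript:\n{transcript}\n\n"
        "Score the run from 0 to 100, considering accuracy, completeness, "
        'tool usage, and output quality. Reply as JSON: {"score": <int>, "reason": "..."}'
    )
    reply = llm_client.complete(prompt)  # hypothetical chat-completion call
    result = json.loads(reply)
    return max(0, min(100, int(result["score"])))  # clamp to the documented 0-100 range
```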

What Gets Evaluated

The LLM judge assesses multiple dimensions:

  • Accuracy: Did the agent correctly assess the situation and reach the right conclusion?
  • Completeness: Did it gather all necessary information before deciding?
  • Tool Usage: Were tools used appropriately and efficiently?
  • Output Quality: Is the response clear, well-formatted, and actionable?
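
The documentation names these dimensions but not how the judge weighs or combines them, so the equal weights in the sketch below are an assumption, shown only to illustrate how per-dimension scores could fold into a single 0-100 value.

```python
# Assumed rubric: the dimensions come from the docs, the equal weights do not.
RUBRIC = {
    "accuracy": 0.25,        # right assessment and conclusion
    "completeness": 0.25,    # gathered necessary information before deciding
    "tool_usage": 0.25,      # tools used appropriately and efficiently
    "output_quality": 0.25,  # clear, well-formatted, actionable output
}


def combine(dimension_scores: dict[str, float]) -> float:
    """Fold per-dimension scores (each 0-100) into one overall 0-100 score."""
    return sum(weight * dimension_scores[name] for name, weight in RUBRIC.items())
```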

Evaluation Scores

Scores range from 0-100:
  • 90-100: Excellent - Agent performed optimally
  • 75-89: Good - Minor improvements possible
  • 60-74: Acceptable - Some issues identified
  • Below 60: Needs improvement
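
If you export scores for your own reporting, these bands translate directly into code. The labels below come from the list above; the helper itself is just a convenience, not part of the product.

```python
def score_band(score: int) -> str:
    """Map a 0-100 evaluation score to the documented quality bands."""
    if score >= 90:
        return "Excellent - agent performed optimally"
    if score >= 75:
        return "Good - minor improvements possible"
    if score >= 60:
        return "Acceptable - some issues identified"
    return "Needs improvement"


print(score_band(82))  # -> "Good - minor improvements possible"
```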

Improvement Workflow

When multiple runs show similar low-scoring patterns, the system automatically:
  1. Analyzes the execution to identify specific issues
  2. Generates improvement suggestions for your system prompt
  3. Shows the “Suggested Improvements” panel (see AI-Suggested Improvements)
  4. Lets you review and apply fixes with one click
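
The exact conditions that count as "multiple runs with similar low-scoring patterns" are not documented, but the trigger can be approximated as in the sketch below. The window size, cutoff, and minimum count are assumptions chosen for illustration (the cutoff matches the "Needs improvement" band).

```python
LOW_SCORE = 60    # assumed cutoff, aligned with the "Needs improvement" band
WINDOW = 5        # assumed number of recent runs to inspect
MIN_LOW_RUNS = 3  # assumed number of low runs that counts as a pattern


def should_suggest_improvements(recent_scores: list[int]) -> bool:
    """Return True when enough recent runs score poorly to warrant prompt suggestions."""
    window = recent_scores[-WINDOW:]
    return sum(1 for score in window if score < LOW_SCORE) >= MIN_LOW_RUNS
```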

AI-Suggested Improvements

Learn how to review and apply automatic improvement suggestions

Best Practices

  • When you see a low score, click into the run to understand what went wrong. The evaluation breakdown shows exactly what needs improvement.
  • Don’t ignore the “Suggested Improvements” panel. Most issues can be fixed by accepting the AI-generated prompt changes.
  • Evaluations are automatic, but you can also provide thumbs up/down feedback to give additional signal about what’s working.

What Makes a Good Score?

Don’t aim for perfect 100s - Agents don’t need to be perfect; they need to be useful. A consistent 80-90 range usually indicates a well-tuned agent.
Focus on:
  • Consistency: Is the agent reliable across different scenarios?
  • Improvement: Are scores trending upward after applying suggestions?
  • User satisfaction: Do evaluation scores align with user feedback?
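
If you pull scores out of the run history, a few lines of Python are enough to watch these signals yourself. The statistics and the example values below are illustrative; they are not produced by the platform.

```python
from statistics import mean, pstdev


def score_health(scores: list[int]) -> dict:
    """Summarize consistency and trend for a series of evaluation scores."""
    half = len(scores) // 2
    return {
        "average": mean(scores),
        "spread": pstdev(scores),                                # lower spread = more consistent
        "improving": mean(scores[half:]) > mean(scores[:half]),  # later runs vs. earlier runs
    }


# Example with made-up scores: reports the average, the spread, and whether
# the more recent half of runs outperforms the earlier half.
print(score_health([72, 78, 81, 85, 88, 86]))
```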