
# Automatic Evaluations

> Every agent run is automatically scored to identify improvement opportunities

Every agent execution is automatically evaluated by an LLM judge that scores performance from 0 to 100. These scores drive automatic improvement suggestions and help you track agent quality over time.

## How It Works

<Steps>
  <Step title="Agent Executes">
    Your agent completes a run (triggered by user, schedule, or event)
  </Step>

  <Step title="Automatic Evaluation">
    Immediately after completion, an LLM judge analyzes the execution and assigns a score from 0 to 100.
  </Step>

  <Step title="Score Recorded">
    The score is logged and displayed in the run history.
  </Step>

  <Step title="Improvement Suggestions">
    Low-scoring runs automatically trigger AI-generated improvement suggestions
  </Step>
</Steps>

<Note>
  **No configuration needed** - Evaluations run automatically on every execution. The system learns what "good" looks like based on your agent's goals and past performance.
</Note>

## What Gets Evaluated

The LLM judge assesses multiple dimensions:

<CardGroup cols={2}>
  <Card title="Accuracy" icon="bullseye">
    Did the agent correctly assess the situation and reach the right conclusion?
  </Card>

  <Card title="Completeness" icon="list-check">
    Did it gather all necessary information before deciding?
  </Card>

  <Card title="Tool Usage" icon="wrench">
    Were tools used appropriately and efficiently?
  </Card>

  <Card title="Output Quality" icon="sparkles">
    Is the response clear, well-formatted, and actionable?
  </Card>
</CardGroup>

## Evaluation Scores

Scores range from 0 to 100:

* **90-100**: Excellent - Agent performed optimally
* **75-89**: Good - Minor improvements possible
* **60-74**: Acceptable - Some issues identified
* **Below 60**: Needs improvement
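
If you export scores and want to label them with the same bands, the mapping above can be sketched as a small helper. This is an illustrative sketch only: `score_band` is a hypothetical name, not part of any Cotool API, and it assumes scores are whole numbers from 0 to 100.

```python
def score_band(score: int) -> str:
    """Map a 0-100 evaluation score to the band names listed above.

    Hypothetical helper for illustration; the thresholds are copied
    from the score ranges in this section.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Acceptable"
    return "Needs improvement"
```

For example, `score_band(82)` returns `"Good"`, and any score under 60 returns `"Needs improvement"`.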

## Improvement Workflow

When multiple runs show similar low-scoring patterns, the system automatically:

1. **Analyzes the affected executions** to identify specific issues
2. **Generates improvement suggestions** for your system prompt
3. **Shows the "Suggested Improvements" panel** (see AI-Suggested Improvements)
4. **Lets you review and apply fixes** with one click

<Card title="AI-Suggested Improvements" icon="sparkles" href="/improving-agents/ai-suggested-improvements">
  Learn how to review and apply automatic improvement suggestions
</Card>

## Best Practices

<AccordionGroup>
  <Accordion title="Review Low-Scoring Runs" icon="magnifying-glass">
    When you see low scores, click into the run to understand what went wrong. The evaluation breakdown shows exactly what needs improvement.
  </Accordion>

  <Accordion title="Apply Suggested Improvements" icon="wand-magic-sparkles">
    Don't ignore the "Suggested Improvements" panel. Most issues can be fixed by accepting the AI-generated prompt changes.
  </Accordion>

  <Accordion title="Track Trends, Not Individual Runs" icon="chart-area">
    One low score isn't a problem. Look for patterns: if the average score is declining, or is consistently low for certain scenarios, take action.
  </Accordion>

  <Accordion title="Complement with Manual Feedback" icon="thumbs-up">
    Evaluations are automatic, but you can also provide thumbs up/down feedback to give additional signal about what's working.
  </Accordion>
</AccordionGroup>
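
One way to look at trends rather than individual runs is a rolling average over recent scores. The sketch below assumes you have a plain list of scores in chronological order; the function names, the window size, and the 5-point drop threshold are all illustrative assumptions, not values the product uses.

```python
from statistics import mean


def rolling_average(scores: list[float], window: int = 5) -> list[float]:
    """Average each run's score with the (window - 1) runs before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(scores[i - window + 1 : i + 1])
        for i in range(window - 1, len(scores))
    ]


def is_declining(scores: list[float], window: int = 5, drop: float = 5.0) -> bool:
    """Return True if the latest rolling average is at least `drop`
    points below the earliest one (an assumed, illustrative rule)."""
    averages = rolling_average(scores, window)
    return len(averages) >= 2 and averages[0] - averages[-1] >= drop
```

With a window of 3, a single outlier such as `[85, 40, 86, 84, 85, 86]` is not flagged, while a sustained slide such as `[90, 88, 91, 70, 65, 60]` is.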

## What Makes a Good Score?

<Warning>
  **Don't aim for perfect 100s** - Agents don't need to be perfect; they need to be useful. A consistent 80-90 range usually indicates a well-tuned agent.
</Warning>

Focus on:

* **Consistency**: Is the agent reliable across different scenarios?
* **Improvement**: Are scores trending upward after applying suggestions?
* **User satisfaction**: Do evaluation scores align with user feedback?
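
To check consistency and improvement together, you could compare the mean and spread of scores before and after applying a suggestion. This is a minimal sketch under the assumption that you have two lists of scores on hand; `summarize` and `compare` are hypothetical helpers, and population standard deviation is just one reasonable measure of spread.

```python
from statistics import mean, pstdev


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean shows overall quality; spread shows how consistent runs are."""
    return {"mean": mean(scores), "spread": pstdev(scores)}


def compare(before: list[float], after: list[float]) -> dict[str, float]:
    """Positive mean_change suggests improvement; negative spread_change
    suggests the agent became more consistent."""
    b, a = summarize(before), summarize(after)
    return {
        "mean_change": a["mean"] - b["mean"],
        "spread_change": a["spread"] - b["spread"],
    }
```

A higher mean with a lower spread after a change is the pattern to look for: the agent is scoring better and doing so more reliably.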
