How It Works
Criteria you define are injected into the agent’s system prompt so the agent actively tries to satisfy them. After every run, the Cotool Eval Harness independently grades each criterion as met or not met and explains why. Failed criteria can optionally trigger notifications to Slack, webhooks, or PagerDuty.
Adding Acceptance Criteria
- Navigate to your agent’s details page
- Click Edit
- Scroll to the Acceptance Criteria section
- Add each criterion as a clear, evaluable statement
- Click Save Changes
Good vs. Poor Criteria
| Good | Poor |
|---|---|
| "Check EDR telemetry before closing an alert as false positive" | "Investigate alerts properly" |
| "Never disable a detection rule without documenting justification" | "Be careful with rule changes" |
| "Correlate across at least two log sources before determining scope" | "Use tools correctly" |
| "Escalate to Tier 2 if lateral movement indicators are present" | "Don’t miss threats” |
Viewing Results
After each run, acceptance criteria results appear in two places:
- Eval score popover — click the evaluation score badge on any run to see which criteria passed or failed, along with the judge’s explanation for each
- Run issues warning — a warning icon appears next to runs that have failed criteria, combined with any critical issues the judge identified
Notifications
Acceptance criteria failures can trigger notifications through the same output destinations used for agent outputs (Slack, webhooks, PagerDuty). Notification configuration is independent from regular output delivery — you can enable one without the other.
Setting Up Notifications
- Click Edit on the agent details page
- In the Acceptance Criteria section, toggle Notify on failure
- Select one or more destinations from the dropdown (or create a new one)
- Click Save Changes
Destinations are shared across your organization. A Slack channel or webhook created for output delivery can also be used for acceptance criteria notifications.
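To give a feel for how a webhook destination might be consumed, here is a minimal sketch of parsing a failure notification and pulling out the failed criteria. The payload shape, field names, and values below are illustrative assumptions, not a documented schema — inspect the payload your webhook actually receives before depending on any field.

```python
import json

# Hypothetical payload for an acceptance-criteria failure notification.
# All field names here are assumptions for illustration only.
EXAMPLE_PAYLOAD = json.loads("""
{
  "agent": "phishing-triage",
  "run_url": "https://example.invalid/runs/123",
  "timestamp": "2024-01-01T00:00:00Z",
  "criteria": [
    {"text": "Check EDR telemetry before closing an alert as false positive",
     "met": false,
     "explanation": "The agent closed the alert without querying EDR."},
    {"text": "Escalate to Tier 2 if lateral movement indicators are present",
     "met": true,
     "explanation": "No lateral movement indicators were found."}
  ]
}
""")

def failed_criteria(payload):
    """Return (criterion text, judge explanation) pairs for every failed criterion."""
    return [(c["text"], c["explanation"])
            for c in payload.get("criteria", []) if not c.get("met", True)]

for text, why in failed_criteria(EXAMPLE_PAYLOAD):
    print(f"FAILED: {text}\n  judge: {why}")
```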
Notification Format
Slack messages include the agent name (linked to the run), a timestamp, a pass/fail summary, and each failed criterion with the judge’s explanation. Webhooks receive a JSON payload with the failure details.
Best Practices
Start with 2–3 criteria
Begin with the most important conditions and expand over time. Too many criteria can dilute signal and slow down the feedback loop.
Write falsifiable statements
Each criterion should be something the judge can clearly confirm or deny from the run transcript. “Always include a confidence score” is falsifiable; “produce good output” is not.
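To make the distinction concrete: a falsifiable criterion can, in principle, be confirmed or denied from the transcript by a yes/no check. The toy sketch below is purely illustrative (the harness actually uses an LLM judge, and the transcript format shown is an assumption):

```python
import re

def includes_confidence_score(transcript: str) -> bool:
    """Toy check: does the transcript mention 'confidence: <number>' anywhere?"""
    return re.search(r"confidence:\s*\d+(\.\d+)?", transcript, re.IGNORECASE) is not None

# "Always include a confidence score" yields a clear yes or no:
assert includes_confidence_score("Verdict: benign. Confidence: 0.92")
assert not includes_confidence_score("Verdict: benign. Looks fine to me.")
# By contrast, "produce good output" admits no such yes/no check.
```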
Use criteria to catch regressions
If a prompt change fixes a recurring issue, add a criterion that encodes the fix (e.g. “Never close an alert without checking log ingestion status”). This prevents the behavior from regressing silently.
Pair with suggested improvements
When criteria consistently fail, check the Suggested Improvements panel. The system may already have a prompt fix ready for you.
Limits
- Maximum 20 criteria per agent
- Each criterion can be up to 500 characters
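If you manage criteria programmatically, these limits can be checked before saving. A minimal sketch (the function name and messages are my own; only the two numeric limits come from the list above):

```python
MAX_CRITERIA = 20           # maximum criteria per agent
MAX_CRITERION_LENGTH = 500  # maximum characters per criterion

def validate_criteria(criteria: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the criteria fit the limits."""
    problems = []
    if len(criteria) > MAX_CRITERIA:
        problems.append(f"too many criteria: {len(criteria)} > {MAX_CRITERIA}")
    for i, text in enumerate(criteria):
        if len(text) > MAX_CRITERION_LENGTH:
            problems.append(f"criterion {i + 1} is {len(text)} characters "
                            f"(limit {MAX_CRITERION_LENGTH})")
    return problems

assert validate_criteria(["Never disable a detection rule without documenting justification"]) == []
assert validate_criteria(["x" * 501])[0].startswith("criterion 1")
```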