When agent runs score poorly on automatic evaluations, Cotool analyzes the issues and suggests specific prompt improvements. You simply review the suggested changes and accept or reject them - no manual prompt writing required.

How It Works

1. Pattern Detection
   System identifies repeated issues across multiple low-scoring runs
2. Issue Clustering
   Similar problems are grouped together to identify common root causes (see the sketch below)
3. Suggested Improvements Panel
   You see a panel showing detected patterns and a “Generate Diff” button
4. Generate Diff
   Click the button - AI generates specific prompt changes to fix the issues
5. Review Changes
   See a side-by-side diff of the current vs. improved prompt
6. Accept or Reject
   Apply the changes, or reject them if they don’t fit your needs
You don’t write the improvements - the AI analyzes what went wrong and generates the fix. Your role is to review and approve, not write from scratch.
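
Conceptually, steps 1-2 amount to grouping low-scoring runs by what their detected issues have in common and keeping only the patterns that repeat. The Python sketch below is purely illustrative - the EvalRun shape, the 70-point threshold, and the keyword signature are assumptions, not Cotool's actual implementation.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EvalRun:
    run_id: str
    score: int        # evaluation score out of 100
    issue_text: str   # issue description from the automatic evaluation

def find_repeated_issues(runs: list[EvalRun], threshold: int = 70) -> dict[str, list[EvalRun]]:
    """Group low-scoring runs by a crude issue signature and keep repeated patterns."""
    clusters: dict[str, list[EvalRun]] = defaultdict(list)
    for run in runs:
        if run.score >= threshold:
            continue  # only low-scoring runs feed the suggestion panel
        # A real system would use semantic similarity; a keyword signature keeps the sketch simple.
        signature = " ".join(sorted(set(run.issue_text.lower().split()))[:5])
        clusters[signature].append(run)
    # A pattern is only worth a suggestion when the same issue shows up in 2+ runs.
    return {sig: grouped for sig, grouped in clusters.items() if len(grouped) >= 2}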

Suggested Improvements Panel

When available, you’ll see a panel below the system prompt with suggested improvements. Here’s an example:
Issues Found
• Agent encountered empty log query results and immediately concluded the alert was noise without investigating why logs were missing
• Agent disabled the detection rule globally (enabled: false) rather than implementing targeted tuning or allow lists
• Agent marked the issue as 'Done' and took irreversible action without considering data ingestion issues or alternative explanations for missing logs

Suggested Change
• Require investigation before disabling rules
• Add validation for file hash existence
• Clarify severity thresholds

Citations
• Agent Execution: f9c63e37-dfa9-422f-bcc9-d27e36d255ca
• Eval Score: 60/100
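
For orientation, a suggestion card like the one above boils down to a small record of issues, proposed changes, and citations. The field names in this sketch are assumptions, not Cotool's actual schema:

from dataclasses import dataclass, field

@dataclass
class SuggestionCard:
    issues_found: list[str]        # observed problems from the low-scoring run
    suggested_changes: list[str]   # short titles for the proposed prompt edits
    cited_executions: list[str] = field(default_factory=list)  # agent execution IDs
    eval_score: int | None = None  # score of the run that triggered the card

# Populated with values from the example panel above:
card = SuggestionCard(
    issues_found=["Rule disabled globally instead of targeted tuning"],
    suggested_changes=["Require investigation before disabling rules"],
    cited_executions=["f9c63e37-dfa9-422f-bcc9-d27e36d255ca"],
    eval_score=60,
)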

Issue Categories

The system automatically detects and suggests fixes for:

Missing Steps

Agent skipped important investigation or validation steps
Fix: Add explicit steps to prompt

Poor Decision Logic

Agent made incorrect conclusions or didn’t follow criteria properly
Fix: Clarify decision criteria and edge cases

Tool Misuse

Wrong tools used, correct tools skipped, or inefficient patterns
Fix: Add tool usage guidance

Output Issues

Missing information, unclear formatting, or insufficient detail
Fix: Specify required output structure
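
Put another way, each detected category maps to a fix strategy. The lookup below just restates the four cards above in code form; the dictionary and the fallback string are illustrative, not part of the product:

FIX_STRATEGIES = {
    "missing_steps": "Add explicit steps to prompt",
    "poor_decision_logic": "Clarify decision criteria and edge cases",
    "tool_misuse": "Add tool usage guidance",
    "output_issues": "Specify required output structure",
}

def suggest_fix(category: str) -> str:
    """Return the documented fix strategy for a detected issue category."""
    # The fallback is an assumption for unknown categories.
    return FIX_STRATEGIES.get(category, "Review the run manually")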

Generating the Diff

When you click “Generate Diff”, the AI:
  1. Analyzes the specific issues identified
  2. Reviews your current system prompt
  3. Generates targeted changes to fix the problems
  4. Shows you a side-by-side comparison
Example Diff View: the panel shows two tabs, Before (Current) and After (Suggested). The Before (Current) tab contains the existing prompt, for example:

## Your Responsibilities
1. Analyze the alert details
2. Search logs for related activity
3. Determine if alert is valid
4. Update the ticket
Changes are highlighted in the diff view. Green shows additions, red shows removals, yellow shows modifications.
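
Mechanically, the comparison is a text diff between two prompt versions. Here is a minimal sketch using Python's standard difflib, with the Before prompt above and a hypothetical After (the added guidance is illustrative, not a generated suggestion):

import difflib

before = """## Your Responsibilities
1. Analyze the alert details
2. Search logs for related activity
3. Determine if alert is valid
4. Update the ticket"""

# Hypothetical improved prompt for the sake of the example
after = """## Your Responsibilities
1. Analyze the alert details
2. Search logs for related activity
   - If the search returns 0 results, verify data ingestion before concluding
3. Determine if alert is valid
4. Update the ticket"""

# '+' lines are additions and '-' lines are removals, matching the
# green/red highlighting described above.
for line in difflib.unified_diff(before.splitlines(), after.splitlines(),
                                 fromfile="current", tofile="suggested", lineterm=""):
    print(line)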

Multiple Improvements

You may see multiple suggestion cards if several low-scoring runs identified different issues:
Suggested Improvements (3)

1. Require investigation before disabling rules
   Eval: 60/100 | Run: f9c63e37...

2. Add validation for file hash existence  
   Eval: 55/100 | Run: a3d12f89...

3. Clarify severity thresholds
   Eval: 68/100 | Run: 7bd94c21...
You can:
  • Generate diffs for each individually
  • Address the lowest-scoring issue first (see the sketch below)
  • Reject suggestions that don’t apply to your use case
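
If you want to triage programmatically, sorting the cards by score puts the lowest-scoring issue first. The tuples below simply mirror the example list above:

# (title, eval score, run id) from the example cards above
suggestions = [
    ("Require investigation before disabling rules", 60, "f9c63e37..."),
    ("Add validation for file hash existence", 55, "a3d12f89..."),
    ("Clarify severity thresholds", 68, "7bd94c21..."),
]

# Lowest score first - the file-hash suggestion (55/100) comes out on top
for title, score, run_id in sorted(suggestions, key=lambda s: s[1]):
    print(f"{score}/100  {title}  (run {run_id})")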

What Gets Improved

Issue: Agent skipped checking user history before closing alert
Fix: “Before determining severity, always: 1) Check user’s recent activity…”

Issue: Agent failed when API returned no data
Fix: “If VirusTotal returns no data (404), classify as Medium and note that hash is unknown…”

Issue: Agent mishandled service account activity
Fix: “If user matches pattern ‘svc_’ or ‘service-’, verify activity against scheduled job list…”

Issue: Agent output was vague: “Alert looks suspicious”
Fix: “Always include: 1) Specific indicators found, 2) Confidence level (High/Medium/Low), 3) Recommended actions…”

Best Practices

• If you see the same issue across multiple runs, accept the suggestion. Repeated patterns indicate a real problem.
• Use Agent Builder to test the updated prompt with edge cases before relying on it in production.
• After accepting changes, watch evaluation scores for the next 10-20 runs to confirm improvement (a minimal monitoring sketch follows this list).
• Not every low score needs a prompt change. Sometimes the issue is data quality, API availability, or legitimate edge cases.
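
The "watch the next 10-20 runs" check reduces to comparing a post-change average against the pre-change baseline. This sketch is illustrative only; the function name and the 10-run minimum are assumptions:

def improvement_confirmed(baseline_avg: float, post_change_scores: list[float],
                          min_runs: int = 10) -> bool:
    """True once enough post-change runs exist and their average beats the baseline."""
    if len(post_change_scores) < min_runs:
        return False  # not enough data yet to judge the prompt change
    return sum(post_change_scores) / len(post_change_scores) > baseline_avg

# Example using the numbers from the flow below: baseline 65/100,
# fifteen post-change runs averaging 82/100
print(improvement_confirmed(65.0, [82.0] * 15))  # True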

Example: Real Improvement Flow

Run #1834 - Evaluation Score: 60/100

Issue Detected:
Agent encountered empty Splunk results and immediately marked alert as false positive without investigating why logs were missing. This could miss real threats if log ingestion is delayed.
Generated Diff:
## Investigation Steps
1. Search Splunk for related activity
+  - If query returns 0 results, verify:
+    a) Data ingestion status (check indexer health)
+    b) Time range (expand if near ingestion delay window)
+    c) Query syntax (test with broader search)
+  - Only conclude "no activity" after ruling out data issues

2. Assess findings
-  - No logs = likely false positive
+  - No logs = investigate further before concluding
+  - Document if missing logs prevent full assessment
You: Review diff, looks good ✅
Action: Click “Accept”
Result: Prompt v5 created, agent now handles empty query results properly
Validation: Next 15 runs average 82/100 (up from 65/100)

Manual Prompt Editing

You can always edit prompts manually instead of using suggestions:
  1. Go to Agent Settings → System Prompt
  2. Click “Edit”
  3. Make changes directly
  4. Save (creates new version)
But the AI-suggested flow is faster and often catches issues you might miss.