Manual feedback is optional: automatic evaluations already drive improvement suggestions. Use manual feedback when you want to provide additional context or when the automatic score doesn’t capture the full picture.
How It Works
After each agent run, users can:
- 👍 Thumbs up - Good execution, agent performed well
- 👎 Thumbs down - Poor execution, something went wrong
- 💬 Optional comment - Explain what was good or bad
Where to Provide Feedback
Feedback options appear in multiple places:

Agent Execution Detail Page:
- Click into any run from agent history
- Thumbs up/down buttons at the top
- Comment field for additional context

Chat Notifications:
- React with 👍 or 👎 emoji
- Reply in thread for comments

Email Notifications:
- Reply with “+1” (positive) or “-1” (negative)
- Include explanation in email body

API Invocation:
- Include feedback via API when invoking programmatically (a sketch follows this list)
- See the API documentation for details
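If your invocation path is programmatic, feedback can usually ride along in the same request. The sketch below is an assumption, not this product's actual API: the base URL, endpoint path, field names, and auth header are hypothetical placeholders, so check the API documentation for the real schema.

```python
# Hypothetical sketch: attaching feedback to a run via the API.
# The base URL, endpoint path, field names, and auth header are placeholders;
# consult the API documentation for the actual schema.
import requests

API_BASE = "https://example.invalid/api/v1"  # placeholder base URL
API_TOKEN = "YOUR_API_TOKEN"                 # placeholder credential

def send_run_feedback(run_id: str, thumbs_up: bool, comment: str | None = None) -> dict:
    """Post manual feedback (thumbs up/down plus optional comment) for one run."""
    payload = {
        "run_id": run_id,
        "sentiment": "positive" if thumbs_up else "negative",
    }
    if comment:
        payload["comment"] = comment

    response = requests.post(
        f"{API_BASE}/runs/{run_id}/feedback",  # hypothetical endpoint
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Example: flag a run that skipped an enrichment step.
# send_run_feedback("run_2048", thumbs_up=False, comment="Missed checking VirusTotal")
```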
How Feedback Complements Evaluations
| Automatic Evaluations | Manual Feedback |
|---|---|
| Every run, always | Optional, when users have an opinion |
| Objective LLM assessment | Subjective human judgment |
| Scores specific criteria | Overall impression |
| Drives AI suggestions | Adds context to patterns |
| Primary improvement method | Supplementary signal |
Think of manual feedback as a way to say “I agree/disagree with the automatic eval” or to highlight something the automatic system might have missed.
When to Provide Feedback
When Auto-Eval Seems Wrong
If a run scored 85/100 but you think it was actually poor (or vice versa), provide feedback to give a corrective signal.
For Subjective Qualities
Tone, professionalism, or style preferences that automatic evals might not capture well.
To Highlight Patterns
If you notice the same issue across multiple runs, feedback comments help identify the pattern.
For Business Impact
“This saved our analyst 2 hours” or “This missed a critical finding” - context the eval can’t measure.
Using Feedback Insights
Feedback is tracked alongside automatic evaluations:

On the Agent Dashboard:
- Filter by feedback (show only thumbs up/down runs); see the sketch after this list
- Sort by feedback sentiment
- See feedback comments

In improvement analysis:
- Common themes in thumbs-down comments inform prompt improvements
- Feedback helps validate that eval-driven changes are working
- Disagreement between eval scores and feedback indicates calibration needs
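To make those dashboard views concrete, here is a minimal sketch of the same filter-and-sort logic over hypothetical run records; the field names ("eval_score", "feedback", "comment") are assumptions, not the product's actual export format.

```python
# Hypothetical run records, roughly what a dashboard export might contain.
runs = [
    {"id": "run_2046", "eval_score": 92, "feedback": "up", "comment": None},
    {"id": "run_2047", "eval_score": 64, "feedback": None, "comment": None},
    {"id": "run_2048", "eval_score": 78, "feedback": "down", "comment": "Wrong audience tone"},
]

# Filter: keep only runs that received manual feedback (thumbs up or down).
with_feedback = [r for r in runs if r["feedback"] is not None]

# Sort by sentiment: thumbs-down first, so problem runs surface at the top.
by_sentiment = sorted(with_feedback, key=lambda r: r["feedback"] == "up")

for r in by_sentiment:
    print(r["id"], r["feedback"], r["comment"] or "")
# run_2048 down Wrong audience tone
# run_2046 up
```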
Feedback-Driven Improvements
While automatic evaluations drive most improvements, feedback helps identify issues to prioritize. An example of the pattern is sketched below.
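As a purely hypothetical illustration of such a pattern (the comments, theme, and resulting priority below are made up for this sketch, not taken from the product):

```python
# Hypothetical: several thumbs-down comments converge on one theme,
# which then becomes a prioritized candidate for the next prompt change.
from collections import Counter

negative_comments = [
    "Didn't check user login history before concluding",
    "Skipped user history again",
    "No mention of prior activity for this user",
]

# Naive keyword tally; a real analysis would use the platform's own theme detection.
themes = Counter(
    "user history" if ("history" in c.lower() or "prior activity" in c.lower()) else "other"
    for c in negative_comments
)

top_theme, count = themes.most_common(1)[0]
print(f"Recurring theme: {top_theme!r} ({count} thumbs-down comments)")
# -> candidate prompt improvement: always review user history before concluding
```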
Best Practices
Focus on Actionable Feedback
Good: “Agent didn’t check user history before concluding”
Less useful: “This was bad”
Don't Over-Think It
Feedback should take 5 seconds. If a run was clearly good or bad, just click thumbs up/down. Comments optional.
Use for Exceptions
You don’t need to give feedback on every run. Focus on runs that surprise you (much better or worse than expected).
Be Specific in Comments
When you do comment, mention specific issues or wins: “Missed checking VirusTotal” or “Perfect triage, saved me time”.
Feedback Metrics
Track feedback trends over time (a small computation sketch follows this list):
- Satisfaction Rate - % thumbs up out of total feedback given
- Feedback Volume - how many runs receive feedback (higher = more engagement)
- Sentiment Trend - is satisfaction improving or declining?
- Common Themes - text analysis of comments to identify patterns
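These metrics are straightforward to compute from raw feedback records. The sketch below assumes a hypothetical feedback log and field names; it is not the product's schema, just one way to derive the numbers.

```python
from collections import Counter

# Hypothetical feedback log: one entry per run that received manual feedback.
feedback_log = [
    {"week": "2024-W20", "sentiment": "up",   "comment": "Perfect triage, saved me time"},
    {"week": "2024-W20", "sentiment": "down", "comment": "Missed checking VirusTotal"},
    {"week": "2024-W21", "sentiment": "up",   "comment": None},
    {"week": "2024-W21", "sentiment": "up",   "comment": None},
]
total_runs = 250  # hypothetical run count for the same period

# Satisfaction rate: % thumbs up out of total feedback given.
ups = sum(1 for f in feedback_log if f["sentiment"] == "up")
satisfaction_rate = ups / len(feedback_log) * 100

# Feedback volume: share of runs that received any feedback at all.
feedback_volume = len(feedback_log) / total_runs * 100

# Sentiment trend: satisfaction rate per week.
per_week = Counter(f["week"] for f in feedback_log)
ups_per_week = Counter(f["week"] for f in feedback_log if f["sentiment"] == "up")
trend = {week: ups_per_week[week] / per_week[week] * 100 for week in sorted(per_week)}

print(f"Satisfaction rate: {satisfaction_rate:.0f}%")  # 75%
print(f"Feedback volume: {feedback_volume:.1f}%")      # 1.6%
print(f"Sentiment trend: {trend}")                     # {'2024-W20': 50.0, '2024-W21': 100.0}
```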
Correlation with Evaluations
Monitor how manual feedback aligns with automatic evaluations (a simple calibration check is sketched after this list):

Strong agreement (good) - the eval judge and users are seeing quality the same way.

Frequent disagreement - may mean:
- Evaluation criteria might need adjustment
- There’s a subjective quality not being measured
- Users value different things than the eval judges
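One simple offline calibration check is to compare each run's eval verdict (score above or below a chosen threshold) with its thumbs up/down. The records, field names, and 70-point threshold below are assumptions for illustration, not a built-in feature.

```python
# Hypothetical paired records: automatic eval score plus manual feedback sentiment.
pairs = [
    {"run": "run_2040", "eval_score": 91, "feedback": "up"},
    {"run": "run_2043", "eval_score": 55, "feedback": "down"},
    {"run": "run_2048", "eval_score": 78, "feedback": "down"},
]

THRESHOLD = 70  # assumed cutoff: a score >= 70 counts as a "good" run for the eval

def agrees(record: dict) -> bool:
    """True when the eval's verdict matches the human thumbs up/down."""
    eval_says_good = record["eval_score"] >= THRESHOLD
    human_says_good = record["feedback"] == "up"
    return eval_says_good == human_says_good

agreement_rate = sum(agrees(p) for p in pairs) / len(pairs) * 100
disagreements = [p["run"] for p in pairs if not agrees(p)]

print(f"Agreement: {agreement_rate:.0f}%")         # 67%
print(f"Review for calibration: {disagreements}")  # ['run_2048']
```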
Example: Feedback in Action
Run #2048 - Automatic Eval: 78/100, User Feedback: 👎 + comment
- Automatic eval scored it as “acceptable” (78/100)
- User feedback highlights a gap: audience-appropriate language
- This becomes a factor in future prompt improvements
When Feedback Overrides Evals
Manual feedback is particularly valuable when:
- Business context matters - “Technically correct but missed our policy”
- Audience matters - “Right answer, wrong communication style”
- Edge cases - “Eval couldn’t know this is a known false positive”
- Performance matters - “Correct but too slow for our SLA”