Manual feedback is optional - automatic evaluations already drive improvement suggestions. Use manual feedback when you want to provide additional context or when the automatic score doesn't capture the full picture.
How It Works
After each agent run, users can:
- 👍 Thumbs up - Good execution, agent performed well
- 👎 Thumbs down - Poor execution, something went wrong
- 💬 Optional comment - Explain what was good or bad
Where to Provide Feedback
Feedback options appear in multiple places:
Agent Execution Detail Page:
- Click into any run from agent history
- Thumbs up/down buttons at the top
- Comment field for additional context
- React with 👍 or 👎 emoji
- Reply in thread for comments
- Reply with "+1" (positive) or "-1" (negative)
- Include explanation in email body
- Include feedback via the API when invoking programmatically (see the sketch after this list)
- See API documentation for details
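The API option isn't detailed in this section, so the sketch below is a rough illustration only: the endpoint route, payload fields, and bearer-token auth are assumptions, not the documented contract - consult the API documentation for the real interface.

```python
import os
import requests

# Hypothetical sketch of attaching feedback to a run via the platform API.
# The route, field names, and auth scheme are assumptions for illustration.
API_BASE = os.environ.get("AGENT_API_BASE", "https://api.example.com")
API_TOKEN = os.environ["AGENT_API_TOKEN"]

def submit_feedback(run_id: str, positive: bool, comment: str | None = None) -> None:
    """Send a thumbs up/down (and optional comment) for a completed run."""
    payload = {"rating": "up" if positive else "down"}
    if comment:
        payload["comment"] = comment
    response = requests.post(
        f"{API_BASE}/runs/{run_id}/feedback",  # assumed route
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()

# Example: flag a run that skipped an enrichment step.
# submit_feedback("run_2048", positive=False, comment="Missed checking VirusTotal")
```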
How Feedback Complements Evaluations
| Automatic Evaluations | Manual Feedback |
|---|---|
| Every run, always | Optional, when users have an opinion |
| Objective LLM assessment | Subjective human judgment |
| Scores specific criteria | Overall impression |
| Drives AI suggestions | Adds context to patterns |
| Primary improvement method | Supplementary signal |
Think of manual feedback as a way to say "I agree/disagree with the automatic eval" or to highlight something the automatic system might have missed.
When to Provide Feedback
When Auto-Eval Seems Wrong
If a run scored 85/100 but you think it was actually poor (or vice versa), provide feedback to give corrective signal.
For Subjective Qualities
Tone, professionalism, or style preferences that automatic evals might not capture well.
To Highlight Patterns
If you notice the same issue across multiple runs, feedback comments help identify the pattern.
For Business Impact
"This saved our analyst 2 hours" or "This missed a critical finding" - context the eval can't measure.
Using Feedback Insights
Feedback is tracked alongside automatic evaluations:
On Agent Dashboard:
- Filter by feedback (show only thumbs up/down runs)
- Sort by feedback sentiment
- See feedback comments
- Common themes in thumbs-down comments inform prompt improvements
- Feedback helps validate that eval-driven changes are working
- Disagreement between eval scores and feedback signals that the evaluation criteria may need recalibration
Feedback-Driven Improvements
While automatic evaluations drive most improvements, feedback helps identify issues to prioritize (see the example pattern sketched below).
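As an illustration of how such a pattern might surface, the following sketch counts recurring keywords in thumbs-down comments and maps the most frequent theme to a candidate prompt change. The comments, keywords, and suggested changes are invented for this example, not platform output.

```python
from collections import Counter

# Hypothetical feedback records: thumbs-down comments collected from the dashboard.
thumbs_down_comments = [
    "Agent didn't check user history before concluding",
    "Skipped user history again - verdict was premature",
    "Missed checking VirusTotal",
    "No look at prior tickets/user history",
]

# Naive theme extraction: count which known issue keywords recur.
themes = {
    "user history": "Add explicit step: review user history before giving a verdict",
    "virustotal": "Add explicit step: enrich indicators via VirusTotal",
}

counts = Counter()
for comment in thumbs_down_comments:
    for keyword in themes:
        if keyword in comment.lower():
            counts[keyword] += 1

# The most frequent theme becomes the prioritized prompt improvement.
top_theme, hits = counts.most_common(1)[0]
print(f"Pattern: '{top_theme}' mentioned in {hits} thumbs-down comments")
print(f"Suggested prompt change: {themes[top_theme]}")
```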
Best Practices
Focus on Actionable Feedback
Good: "Agent didn't check user history before concluding"
Less useful: "This was bad"
Don't Over-Think It
Feedback should take 5 seconds. If a run was clearly good or bad, just click thumbs up/down. Comments optional.
Use for Exceptions
You don't need to leave feedback on every run. Focus on runs that surprise you (much better or worse than expected).
Be Specific in Comments
When you do comment, mention specific issues or wins: "Missed checking VirusTotal" or "Perfect triage, saved me time".
Feedback Metrics
Track feedback trends over time (see the computation sketch below):
Satisfaction Rate
% thumbs up out of total feedback given
Feedback Volume
How many runs receive feedback (higher = more engagement)
Sentiment Trend
Is satisfaction improving or declining?
Common Themes
Text analysis of comments to identify patterns
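If run feedback can be exported, these metrics reduce to simple aggregations. The sketch below computes satisfaction rate, feedback volume, and a per-month sentiment trend from a small, invented list of feedback records; the field names are assumptions rather than a documented export format.

```python
from collections import defaultdict

# Hypothetical feedback export: one record per run that received feedback.
feedback = [
    {"run_id": "run_101", "month": "2024-05", "thumbs_up": True},
    {"run_id": "run_102", "month": "2024-05", "thumbs_up": False},
    {"run_id": "run_103", "month": "2024-06", "thumbs_up": True},
    {"run_id": "run_104", "month": "2024-06", "thumbs_up": True},
]
total_runs = 120  # total runs in the period, with or without feedback

# Satisfaction rate: % thumbs up out of total feedback given.
satisfaction = sum(r["thumbs_up"] for r in feedback) / len(feedback) * 100

# Feedback volume: share of runs that received any feedback at all.
volume = len(feedback) / total_runs * 100

# Sentiment trend: satisfaction rate per month, to see if it is improving.
by_month = defaultdict(list)
for r in feedback:
    by_month[r["month"]].append(r["thumbs_up"])
trend = {m: sum(v) / len(v) * 100 for m, v in sorted(by_month.items())}

print(f"Satisfaction rate: {satisfaction:.0f}%")
print(f"Feedback volume:   {volume:.0f}% of runs")
print(f"Sentiment trend:   {trend}")
```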
Correlation with Evaluations
Monitor how manual feedback aligns with automatic evaluations (see the sketch after this list):
Strong Agreement (good) - eval scores reflect what users actually value
Frequent Disagreement - may mean one of the following:
- Evaluation criteria might need adjustment
- There's a subjective quality not being measured
- Users value different things than the eval judges
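A minimal way to check this alignment, assuming runs can be exported with both their eval score and thumbs rating, is to compare the average eval score of thumbs-up runs against thumbs-down runs. The records and the agreement threshold below are invented for illustration.

```python
# Hypothetical export of runs that have both an automatic eval score and a
# manual thumbs rating; field names are assumptions for illustration.
runs = [
    {"eval_score": 92, "thumbs_up": True},
    {"eval_score": 88, "thumbs_up": True},
    {"eval_score": 81, "thumbs_up": False},
    {"eval_score": 45, "thumbs_up": False},
]

ups = [r["eval_score"] for r in runs if r["thumbs_up"]]
downs = [r["eval_score"] for r in runs if not r["thumbs_up"]]

gap = sum(ups) / len(ups) - sum(downs) / len(downs)
print(f"Avg eval score, thumbs up:   {sum(ups) / len(ups):.1f}")
print(f"Avg eval score, thumbs down: {sum(downs) / len(downs):.1f}")

# A small or negative gap means eval scores and human judgment disagree,
# which is a signal to revisit the evaluation criteria.
if gap < 10:
    print("Low agreement - consider recalibrating evaluation criteria")
```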
Example: Feedback in Action
Run #2048 - Automatic Eval: 78/100
User Feedback: 👎 + Comment
- Automatic eval scored it as "acceptable" (78/100)
- User feedback highlights a gap: audience-appropriate language
- This becomes a factor in future prompt improvements
When Feedback Overrides Evals
Manual feedback is particularly valuable when:
- Business context matters - "Technically correct but missed our policy"
- Audience matters - "Right answer, wrong communication style"
- Edge cases - "Eval couldn't know this is a known false positive"
- Performance matters - "Correct but too slow for our SLA"