Benchmarks
Benchmark System
Section titled “Benchmark System”The benchmark system evaluates your bot’s quality by running test dialogs and scoring each response.
How Benchmarks Work
Section titled “How Benchmarks Work”Plugy maintains a corpus of test dialogs that cover common support scenarios. During each self-learning iteration:
- Test dialogs are sent to your bot
- Each response is scored on four dimensions (Focus, Empathy, Consistency, Experience)
- Scores are combined into the B-score metric
- Results determine whether proposed improvements are approved
Dialog Corpus
Section titled “Dialog Corpus”The benchmark corpus includes:
- Standard dialogs — Common support scenarios (billing, technical, account issues)
- Multi-turn dialogs — Test context retention across conversation turns
- Edge case dialogs — Test handling of unusual or adversarial inputs
Scoring
Section titled “Scoring”Each response is evaluated on:
| Component | What It Checks |
|---|---|
| Focus | Does the answer address the customer’s actual question? |
| Empathy | Is the tone appropriate for the situation? |
| Consistency | Does it align with the knowledge base and prior answers? |
| Experience | Does it use conversation history effectively? |
B-Score Formula
Section titled “B-Score Formula”The B-score is a weighted geometric mean:
B = Focus^0.20 * Empathy^0.25 * Consistency^0.35 * Experience^0.20Weights are configurable per project.
Benchmark Rotation
Section titled “Benchmark Rotation”To prevent the bot from overfitting to static test dialogs:
- A core set of golden dialogs is always included
- Recent real conversations are periodically rotated in
- A holdout set is reserved for overfitting detection
- If performance drops on the holdout set, the improvement is rejected
This ensures your bot improves on real customer scenarios, not just test cases.