Benchmarks
Benchmark System
The benchmark system evaluates your bot’s quality by running test dialogs and scoring each response.
How Benchmarks Work
Plugy maintains a corpus of test dialogs that cover common support scenarios. During each self-learning iteration:
- Test dialogs are sent to your bot
- Each response is scored on four dimensions (Focus, Empathy, Consistency, Experience)
- Scores are combined into the B-score metric
- Results determine whether proposed improvements are approved
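The four steps above can be sketched as a simple evaluation loop. This is an illustrative outline, not Plugy’s actual implementation; the function and parameter names (`run_benchmark`, `score_response`, `threshold`) are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical container for the four scoring dimensions.
@dataclass
class Scores:
    focus: float
    empathy: float
    consistency: float
    experience: float

def run_benchmark(bot, dialogs, score_response, compute_b_score, threshold=0.75):
    """Send each test dialog to the bot, score every response,
    and decide whether the proposed improvement clears the bar."""
    b_scores = []
    for dialog in dialogs:
        response = bot(dialog)                      # 1. send the test dialog
        scores = score_response(dialog, response)   # 2. score four dimensions
        b_scores.append(compute_b_score(scores))    # 3. combine into B-score
    mean_b = sum(b_scores) / len(b_scores)
    return mean_b >= threshold                      # 4. approve or reject
```

In practice the approval criterion may be more elaborate (e.g. comparing against the previous iteration rather than a fixed threshold), but the loop shape is the same.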
Dialog Corpus
The benchmark corpus includes:
- Standard dialogs — Common support scenarios (billing, technical, account issues)
- Multi-turn dialogs — Test context retention across conversation turns
- Edge case dialogs — Test handling of unusual or adversarial inputs
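For concreteness, corpus entries of the three kinds might be represented like this. The schema and field names are purely illustrative assumptions, not Plugy’s documented format:

```python
# Hypothetical corpus-entry schema; field names are illustrative only.
standard_dialog = {
    "id": "billing-refund-001",
    "kind": "standard",          # standard | multi_turn | edge_case
    "turns": [
        {"role": "customer", "text": "I was charged twice this month."},
    ],
}

multi_turn_dialog = {
    "id": "account-reset-007",
    "kind": "multi_turn",        # exercises context retention across turns
    "turns": [
        {"role": "customer", "text": "I can't log in."},
        {"role": "bot", "text": "Have you tried resetting your password?"},
        {"role": "customer", "text": "Yes, and the reset email never arrives."},
    ],
}
```

Multi-turn entries carry the full prior exchange so the bot’s use of conversation history can be scored, not just its final answer.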
Scoring
Each response is evaluated on:
| Component | What It Checks |
|---|---|
| Focus | Does the answer address the customer’s actual question? |
| Empathy | Is the tone appropriate for the situation? |
| Consistency | Does it align with the knowledge base and prior answers? |
| Experience | Does it use conversation history effectively? |
B-Score Formula
The B-score is a weighted geometric mean:
B = Focus^0.20 * Empathy^0.25 * Consistency^0.35 * Experience^0.20

Weights are configurable per project.
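The formula translates directly into code. A minimal sketch, using the default weights from the formula above (the function name and per-project override mechanism are assumptions):

```python
import math

def b_score(scores, weights=None):
    """Weighted geometric mean of the four component scores,
    each expected in (0, 1]."""
    # Default weights match the documented formula; projects may override.
    weights = weights or {"focus": 0.20, "empathy": 0.25,
                          "consistency": 0.35, "experience": 0.20}
    return math.prod(scores[k] ** w for k, w in weights.items())
```

Because this is a geometric mean, one very low component drags the whole B-score down sharply, and Consistency’s 0.35 weight makes it the most influential dimension under the defaults.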
Benchmark Rotation
To prevent the bot from overfitting to static test dialogs:
- A core set of golden dialogs is always included
- Recent real conversations are periodically rotated in
- A holdout set is reserved for overfitting detection
- If performance drops on the holdout set, the improvement is rejected
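The rotation and holdout logic above can be sketched as follows. This is an assumed outline, not Plugy’s implementation; names like `build_benchmark`, `accept_improvement`, and `tolerance` are hypothetical:

```python
import random

def build_benchmark(golden, recent_real, n_rotating=20, seed=None):
    """Assemble one benchmark run: golden dialogs are always included,
    plus a rotating sample of recent real conversations."""
    rng = random.Random(seed)
    rotating = rng.sample(recent_real, min(n_rotating, len(recent_real)))
    return list(golden) + rotating

def accept_improvement(b_main_new, b_main_old,
                       b_holdout_new, b_holdout_old, tolerance=0.0):
    """Reject an improvement if holdout performance drops,
    even when the main benchmark score went up."""
    if b_holdout_new < b_holdout_old - tolerance:
        return False  # overfitting detected on the reserved holdout set
    return b_main_new > b_main_old
```

The holdout set is never used for approval decisions directly; it only serves as a tripwire, which is what makes it a useful overfitting signal.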
This ensures your bot improves on real customer scenarios, not just test cases.