Benchmarks

The benchmark system evaluates your bot’s quality by running test dialogs and scoring each response.

Plugy maintains a corpus of test dialogs that cover common support scenarios. During each self-learning iteration:

  1. Test dialogs are sent to your bot
  2. Each response is scored on four dimensions (Focus, Empathy, Consistency, Experience)
  3. Scores are combined into the B-score metric
  4. Results determine whether proposed improvements are approved
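The iteration above can be sketched as a small driver loop. This is an illustrative sketch only: `bot.reply()`, `evaluate()`, and the mean-score approval rule are assumed names, not the real Plugy API.

```python
# Hypothetical sketch of one self-learning benchmark iteration.
# bot.reply() and evaluate() are illustrative stand-ins: evaluate() is
# assumed to return the combined B-score for a single response.

def run_iteration(bot, test_dialogs, evaluate, baseline: float) -> bool:
    """Send each test dialog to the bot, score every response, and
    approve the proposed improvement only if the mean B-score does
    not fall below the current baseline."""
    scores = []
    for dialog in test_dialogs:                    # 1. test dialogs are sent to the bot
        response = bot.reply(dialog)
        scores.append(evaluate(dialog, response))  # 2-3. score the response -> B-score
    mean_b = sum(scores) / len(scores)
    return mean_b >= baseline                      # 4. approve or reject the improvement
```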

The benchmark corpus includes:

  • Standard dialogs — Common support scenarios (billing, technical, account issues)
  • Multi-turn dialogs — Test context retention across conversation turns
  • Edge case dialogs — Test handling of unusual or adversarial inputs

Each response is evaluated on:

| Component | What It Checks |
| --- | --- |
| Focus | Does the answer address the customer’s actual question? |
| Empathy | Is the tone appropriate for the situation? |
| Consistency | Does it align with the knowledge base and prior answers? |
| Experience | Does it use conversation history effectively? |

The B-score is a weighted geometric mean:

B = Focus^0.20 * Empathy^0.25 * Consistency^0.35 * Experience^0.20

Weights are configurable per project.
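A minimal implementation of the formula, assuming scores in (0, 1] and weights that sum to 1 (as the defaults above do). Computing in log space is a design choice for numerical stability, not something the source specifies.

```python
import math

# Default weights from the B-score formula; projects may override them.
DEFAULT_WEIGHTS = {"focus": 0.20, "empathy": 0.25, "consistency": 0.35, "experience": 0.20}

def b_score(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted geometric mean: B = prod(score_i ** w_i).

    A zero on any weighted component drives the whole score to zero,
    which is the point of a geometric mean: the bot cannot trade a
    failed dimension against strong ones.
    """
    if any(scores[k] == 0 for k in weights):
        return 0.0
    return math.exp(sum(w * math.log(scores[k]) for k, w in weights.items()))
```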

To prevent the bot from overfitting to static test dialogs:

  1. A core set of golden dialogs is always included
  2. Recent real conversations are periodically rotated in
  3. A holdout set is reserved for overfitting detection
  4. If performance drops on the holdout set, the improvement is rejected
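The corpus rotation and holdout guard above can be sketched as follows. All names here (`build_corpus`, `approve`, the sampling strategy) are assumptions for illustration, not the actual Plugy implementation.

```python
import random

def build_corpus(golden, recent_real, rotate_n, seed=None):
    """Hypothetical corpus assembly: the golden dialogs are always
    included, plus a rotating random sample of recent real conversations."""
    rng = random.Random(seed)
    rotated = rng.sample(list(recent_real), min(rotate_n, len(recent_real)))
    return list(golden) + rotated

def approve(candidate_score, baseline_score, holdout_candidate, holdout_baseline):
    """Overfitting guard: reject the improvement whenever holdout
    performance drops, even if the visible benchmark score improved."""
    if holdout_candidate < holdout_baseline:
        return False
    return candidate_score >= baseline_score
```

Keeping the holdout set out of the visible benchmark is what makes the guard work: the bot never sees those dialogs during improvement, so a drop there signals memorization of the test set rather than genuine progress.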

This ensures your bot improves on real customer scenarios, not just test cases.