Benchmarks

Benchmark System

The benchmark system evaluates your bot’s quality by running test dialogs and scoring each response.

Plugy maintains a corpus of test dialogs that cover common support scenarios. During each self-learning iteration:

Test dialogs are sent to your bot
Each response is scored on four dimensions (Focus, Empathy, Consistency, Experience)
Scores are combined into the B-score metric
Results determine whether proposed improvements are approved

The benchmark corpus includes:

Standard dialogs — Common support scenarios (billing, technical, account issues)
Multi-turn dialogs — Test context retention across conversation turns
Edge case dialogs — Test handling of unusual or adversarial inputs

Each response is evaluated on:

Component	What It Checks
Focus	Does the answer address the customer’s actual question?
Empathy	Is the tone appropriate for the situation?
Consistency	Does it align with the knowledge base and prior answers?
Experience	Does it use conversation history effectively?

The B-score is a weighted geometric mean:

B = Focus^0.20 * Empathy^0.25 * Consistency^0.35 * Experience^0.20

Weights are configurable per project.

To prevent the bot from overfitting to static test dialogs:

This ensures your bot improves on real customer scenarios, not just test cases.