Evaluate
Delivering AI solutions that customers can trust starts with rigorous evaluations. By using high-quality datasets and reliable metrics to test and optimize your AI, you can ensure consistent performance and earn the confidence of your customers.
Datasets
A dataset is a collection of examples used to evaluate your Ask-AI. An example is a pair consisting of a conversation input and a reference output.
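For illustration, an example can be modelled as a small record. The shapes below are a sketch only; the field names are assumptions, not Inconvo's API.

```ts
// Illustrative shapes; field names are assumptions, not Inconvo's API.

// A single example pairs a conversation input with the reference output
// the AI is expected to produce.
interface Example {
  input: string;           // the user's question
  referenceOutput: string; // the answer you consider correct
}

// A dataset is simply a collection of examples used for evaluation.
type Dataset = Example[];
```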
Example curation
From your existing customer-facing analytics, you likely have a good idea of what types of questions your users will ask and what the correct responses are. You will also want to cover the common edge cases and situations you can anticipate. Even 10-20 high-quality, manually curated examples can go a long way.
As outlined in the quickstart, you can use the playground to create examples.
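A hand-curated dataset might look like the sketch below, reusing the Example shape from above. The questions, answers, and edge cases are illustrative placeholders, not real data.

```ts
// A small curated dataset: one common question plus two edge cases.
// Contents are placeholders; draw questions and reference answers from
// your own customer-facing analytics.
const dataset: Example[] = [
  {
    input: "How many orders did we ship last week?",
    referenceOutput: "You shipped 312 orders last week.",
  },
  {
    // Edge case: a relative time range the AI must interpret correctly.
    input: "Compare this quarter's revenue to the same quarter last year.",
    referenceOutput:
      "Revenue this quarter is $1.2M, up 8% from $1.11M in the same quarter last year.",
  },
  {
    // Edge case: a question the data cannot answer; the correct
    // behaviour is to say so rather than guess.
    input: "What will our churn rate be next year?",
    referenceOutput: "I can't forecast future churn from the available data.",
  },
];
```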
Evaluator
Our evaluator determines whether Inconvo answers each example correctly.
We use LLM-as-judge to make this determination.
LLM-as-judge
LLM-as-judge evaluators use LLMs to score the application's output. We encode the grading criteria in the LLM prompt, and the judge compares the evaluated output to the example's reference output to determine correctness.
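The sketch below shows what such a judge might look like, assuming an OpenAI-compatible chat API. The prompt, model name, and pass/fail parsing are illustrative assumptions, not Inconvo's implementation.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// The grading criteria are encoded directly in the judge prompt.
const JUDGE_PROMPT = `You are grading an AI assistant's answer.
Compare the answer to the reference answer.
Grade CORRECT if the answer conveys the same facts as the reference,
even if the wording differs. Otherwise grade INCORRECT.
Respond with exactly one word: CORRECT or INCORRECT.`;

// Ask the judge model whether the evaluated output matches the reference.
async function judge(
  question: string,
  answer: string,
  referenceOutput: string
): Promise<boolean> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model choice
    messages: [
      { role: "system", content: JUDGE_PROMPT },
      {
        role: "user",
        content: `Question: ${question}\nAnswer: ${answer}\nReference answer: ${referenceOutput}`,
      },
    ],
  });
  return response.choices[0].message.content?.trim() === "CORRECT";
}
```

Using an LLM judge rather than exact string matching lets differently worded but factually equivalent answers still count as correct.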