Traditional approaches to evaluating model quality don’t scale well. Human review is slow and costly, rule-based checks are often too rigid for open-ended outputs, and conventional metrics frequently fail to capture what users actually value. Evaluators solve these problems by using LLM-based judges to assess response quality at scale. They can measure nuanced attributes such as accuracy, completeness, and tone across thousands of examples, while providing detailed, per-example feedback that makes it easy to pinpoint where a model is falling short. The result is a more reliable and consistent way to understand model performance.

The Oumi Agent simplifies evaluator creation by embedding ML expertise directly into your workflow: it uses natural language prompts to translate task goals into structured criteria, metrics, and edge cases, with no manual coding required. By removing the friction of defining scoring rubrics and handling complexity, it cuts development time from days to minutes and enables consistent, rigorous evaluation throughout the model lifecycle. Evaluators can also be reused and refined over time for compounding value.
HOW IT WORKS
- Your model generates responses to a set of prompts
- The evaluator reads each response and scores it based on criteria you define
- You get scores and feedback across your dataset, showing where your model excels and where it struggles (a minimal sketch of this loop follows below)
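Concretely, the loop can be pictured as a small scoring function: build a judge prompt from your criteria and the model's response, ask the judge model for a label, and map that label to a score. The sketch below is illustrative only; call_judge_model, the label scale, and the score mapping are hypothetical stand-ins, not the Oumi API.

# Minimal sketch of the evaluator loop. call_judge_model() is a hypothetical
# placeholder for a request to a hosted judge LLM; swap in your own client code.

CRITERIA = "Is the response accurate and complete?"
LABEL_SCORES = {"poor": 0.0, "acceptable": 0.5, "good": 0.8, "excellent": 1.0}

def call_judge_model(judge_prompt: str) -> str:
    # Placeholder: in practice this sends judge_prompt to the judge model
    # and returns its text output.
    return "good"

def evaluate(prompt: str, response: str) -> float:
    judge_prompt = (
        f"{CRITERIA}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Model response:\n{response}\n\n"
        f"Answer with exactly one of: {', '.join(LABEL_SCORES)}."
    )
    label = call_judge_model(judge_prompt).strip().lower()
    return LABEL_SCORES.get(label, 0.0)

dataset = [
    ("How do I reset my password?",
     "Click 'Forgot password' on the login page and follow the emailed link."),
]
scores = [evaluate(prompt, response) for prompt, response in dataset]
print(f"Average score: {sum(scores) / len(scores):.2f}")

Keeping the criteria, label scale, and score mapping explicit is what makes the same evaluator reusable across datasets and models.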
WHAT TO DEFINE IN AN EVALUATOR
- Evaluation criteria: A prompt describing what the judge should assess (e.g., “Is the response accurate and complete?”)
- Judgment labels: The rating scale the judge uses (e.g., “poor”, “acceptable”, “good”, “excellent”)
- Scoring: Numeric scores mapped to each label
- Judge model: Which hosted LLM runs the evaluation (e.g., GLM-5, Qwen3-235B-A22B-Instruct-2507); an illustrative definition combining these pieces follows below
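Put together, these four pieces amount to a small, declarative definition. The field names below are hypothetical and shown only to illustrate the shape; in practice you define them through the Builder or the Oumi Agent rather than writing them by hand.

# Illustrative evaluator definition; field names are hypothetical, not a
# documented Oumi schema.
evaluator = {
    "criteria": "Is the response accurate and complete?",
    "labels": ["poor", "acceptable", "good", "excellent"],
    "scores": {"poor": 0.0, "acceptable": 0.5, "good": 0.8, "excellent": 1.0},
    "judge_model": "Qwen3-235B-A22B-Instruct-2507",
}

Because the definition is just data, swapping the judge model or tweaking the criteria lets you reuse the rest of it across evaluations.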
BUILT-IN & CUSTOM EVALUATORS
Oumi includes built-in evaluators (such as instruction following, safety, topic adherence, and truthfulness) to help you quickly establish baselines and gather early feedback. You can review, edit, and reuse these evaluators across evaluations, or create custom ones with the Builder to define the exact inputs your judge should consider. You can also describe your desired evaluator in natural language with the Oumi Agent, specifying scoring criteria, selecting the evaluator model, and including additional dataset fields for context as needed. Custom evaluators are reusable and should focus on a single, clearly defined property to ensure consistent and reliable results.
WHY EVALUATORS MATTER
- Before training: Benchmark a base model to see where it falls short
- After training: Measure whether fine-tuning actually improved quality
- Compare models: Run the same evaluators on different models to see which performs better
- Identify failure modes: Find specific examples where the model struggles, then use those insights to improve your training data
EXAMPLE EVALUATOR AXES
For a customer support bot, you might create separate evaluators for each of the following axes (a sketch follows the list):
- Accuracy: Did the response contain correct information?
- Tone: Was the response empathetic and professional?
- Completeness: Did it fully address the customer’s question?
- Policy compliance: Did it follow company guidelines?
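Because each evaluator should target a single property, the four axes above map naturally onto four separate criteria prompts. The structure below is a hypothetical illustration of that split, not an Oumi-defined format.

# One single-property evaluator per axis; run each over the same dataset to
# get an independent score per axis. Names are illustrative only.
SUPPORT_BOT_EVALUATORS = {
    "accuracy": "Did the response contain correct information?",
    "tone": "Was the response empathetic and professional?",
    "completeness": "Did it fully address the customer's question?",
    "policy_compliance": "Did it follow company guidelines?",
}

Scoring each axis separately keeps the judge focused on one property at a time, which is what makes the results consistent and easy to act on.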
WHAT’S NEXT
- Defining evaluators: Establish criteria for measuring model performance
- Evaluator recipes: Save and reuse evaluator configurations