Documentation Index

Fetch the complete documentation index at: https://docs.oumi.ai/llms.txt

Use this file to discover all available pages before exploring further.

The Oumi Agent turns model evaluation into a structured, repeatable loop. It configures evaluation runs from your task description, selects appropriate evaluators, and surfaces failure modes so you can pinpoint issues and improve your model without manually reviewing individual outputs. Setting up a rigorous evaluation pipeline from scratch typically takes days of engineering work. With the Oumi Agent, the same setup takes minutes. Because every run is fully recorded and reproducible, you build an audit trail automatically, reducing the cost of debugging regressions and making it practical to evaluate continuously rather than only at release time.

HOW EVALUATIONS WORK

Evaluations are created using the LLM-as-a-Judge framework, in which a language model is prompted to analyze outputs and assign scores. The model performing this judgment is called the evaluator model. It is typically different from the model being tested, and can be either a strong general-purpose model or a smaller model fine-tuned specifically for evaluation tasks.

The model under evaluation is the system generating the (prompt, response) pairs being scored. This may be a baseline model or the current iteration of your custom model.

When you run an evaluation, Oumi executes an evaluation run, which scores the model against a defined dataset and aggregates the results. Evaluators score model outputs against a specific criterion (e.g., safety, quality, or correctness). Each evaluator operates independently, allowing you to assess different dimensions of model performance in a modular way.
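As a rough illustration of the LLM-as-a-Judge pattern, the sketch below scores a single (prompt, response) pair against one criterion. The `call_evaluator_model` function is a hypothetical stand-in for a real evaluator-model API call (it is stubbed with a trivial heuristic here so the example runs offline); it is not an Oumi API.

```python
def call_evaluator_model(judge_prompt: str) -> str:
    # Stub: a real implementation would send judge_prompt to the
    # evaluator model and return its raw text reply. The keyword
    # heuristic below exists only so this example is runnable.
    return "score: 1" if "refund" in judge_prompt.lower() else "score: 0"


def judge(prompt: str, response: str, criterion: str) -> int:
    """Score one (prompt, response) pair against a single criterion."""
    judge_prompt = (
        f"Criterion: {criterion}\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Reply with exactly 'score: 0' or 'score: 1'."
    )
    raw = call_evaluator_model(judge_prompt)
    # Parse the evaluator model's reply into a numeric score.
    return int(raw.split("score:")[1].strip())


print(judge("How do I get a refund?", "Visit the refund page.", "correctness"))
```

Because each criterion gets its own judge prompt, evaluators for safety, quality, and correctness can be run independently over the same pairs.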

EVALUATION WORKFLOW

An evaluation run in Oumi follows these steps:
  • Generate responses by running a model on a set of prompts.
  • Score each (prompt, response) pair using one or more evaluators.
  • Aggregate results to assess overall performance.
  • Extract higher-level failure modes to explain underperformance.
Each new project includes predefined evaluators to help you get started quickly. These evaluators are editable, optional, and reusable, providing strong baseline configurations for your evaluations.
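The first three workflow steps can be sketched as a small loop: generate, score, aggregate. Everything here (`toy_model`, the evaluator functions) is an illustrative stub, not an Oumi API; failure-mode extraction (step four) is omitted.

```python
from statistics import mean


def run_evaluation(model, prompts, evaluators):
    # 1. Generate responses by running the model on each prompt.
    responses = [model(p) for p in prompts]
    # 2. Score each (prompt, response) pair with every evaluator.
    per_pair = {
        name: [score(p, r) for p, r in zip(prompts, responses)]
        for name, score in evaluators.items()
    }
    # 3. Aggregate results per criterion to assess overall performance.
    return {name: mean(scores) for name, scores in per_pair.items()}


# Usage with stub components: a toy "model" and two toy evaluators,
# each scoring one criterion independently.
toy_model = lambda prompt: prompt.upper()
evaluators = {
    "non_empty": lambda p, r: 1.0 if r else 0.0,
    "echoes_prompt": lambda p, r: 1.0 if p.lower() in r.lower() else 0.0,
}
results = run_evaluation(toy_model, ["hello", "world"], evaluators)
print(results)  # {'non_empty': 1.0, 'echoes_prompt': 1.0}
```

Keeping each evaluator as an independent scoring function is what makes them modular: adding a new criterion means adding one entry to the `evaluators` mapping, leaving the run loop unchanged.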