OUMI DATASETS

The Oumi Agent makes it easy to generate high-quality datasets at any stage of the machine learning workflow. Using natural language prompts, you can create training data from scratch, synthesize datasets from failure modes, or analyze and refine existing data, without writing pipeline code. What traditionally requires weeks of manual collection, cleaning, and formatting can be completed in hours. The Oumi Agent automates the most time-consuming parts of dataset preparation, such as schema validation, format conversion, and iterative refinement, so your team spends less time on data plumbing and more time on model quality. Because datasets are generated on demand and scoped to your task, you also avoid the cost of sourcing or licensing large generic datasets that may not fit your use case.

STRUCTURE & CONTENTS

An Oumi dataset is a structured collection of prompts and responses used to either train a model or evaluate its performance. Depending on your workflow, a dataset may include:

Prompt–response pairs for supervised fine-tuning
Prompts only, where model outputs are generated and evaluated separately
Multi-turn conversations for dialogue-based training or benchmarking
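
To make the three shapes concrete, here is a minimal Python sketch that labels a record by the roles in its messages list. It assumes the messages layout shown under EXAMPLE USAGE; the function name classify_record and the returned labels are hypothetical and are not part of Oumi.

```python
from typing import Any


def classify_record(record: dict[str, Any]) -> str:
    """Label a record by the shape of its `messages` list (illustrative only)."""
    roles = [message["role"] for message in record.get("messages", [])]
    if not roles:
        return "empty"
    if "assistant" not in roles:
        # Prompts only: responses are generated and evaluated separately.
        return "prompt_only"
    if roles == ["user", "assistant"]:
        # A single prompt-response pair, as used for supervised fine-tuning.
        return "prompt_response"
    # Anything longer is treated as a multi-turn conversation.
    return "multi_turn"


pair = {"messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]}
prompt_only = {"messages": [{"role": "user", "content": "What is 2+2?"}]}
dialogue = {"messages": pair["messages"] + [
    {"role": "user", "content": "And 3+3?"},
    {"role": "assistant", "content": "3+3 equals 6."},
]}

for example in (pair, prompt_only, dialogue):
    print(classify_record(example))
```

A check like this can help when you want to confirm that a file contains only the kind of rows a given workflow expects.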

UPLOADING DATASETS

You can upload datasets directly into Oumi in a variety of common formats, including JSON, JSONL, CSV, and Parquet. All Oumi datasets follow a standardized internal format that defines how messages, roles, and metadata are structured. During upload, Oumi automatically validates and converts your data into this format, so it works with the training, evaluation, data synthesis, and analysis tools across the platform, as well as with modern machine learning pipelines.
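
Oumi performs this conversion for you, so no code is required. Purely as an illustration of what a conversion to the messages layout involves, the sketch below turns CSV rows into JSONL using only the Python standard library. The column names prompt and response, and the choice to fold every other column into metadata, are assumptions made for this example and do not describe Oumi's actual mapping.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str,
                 prompt_col: str = "prompt",
                 response_col: str = "response") -> str:
    """Convert CSV text to JSONL rows with `messages` and optional `metadata`."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        messages = [{"role": "user", "content": row[prompt_col]}]
        response = row.get(response_col)
        if response:
            # Rows without a response become prompt-only records.
            messages.append({"role": "assistant", "content": response})
        record = {"messages": messages}
        # Assumption: every remaining column is kept as row metadata.
        metadata = {key: value for key, value in row.items()
                    if key not in (prompt_col, response_col)}
        if metadata:
            record["metadata"] = metadata
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


sample_csv = "prompt,response,source\nWhat is 2+2?,2+2 equals 4.,math\n"
print(csv_to_jsonl(sample_csv))
```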

RAW FILES

Oumi also supports uploading raw files to ground your models in proprietary or domain-specific data. This allows you to incorporate internal documents, knowledge bases, or other private content into your workflows. To learn more, please see Uploading raw files.

EXAMPLE USAGE

Here’s an example of a properly formed dataset for Oumi in JSONL format, with one JSON object per line:

{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}], "metadata": {"source": "geography"}}
{"messages": [{"role": "user", "content": "How do I make pasta?"}, {"role": "assistant", "content": "Boil water and add pasta for 8-10 minutes."}], "metadata": {"source": "cooking"}}
{"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "2+2 equals 4."}], "metadata": {"source": "math"}}

In addition to the standard messages field, each row can include an optional metadata field: a dictionary of metadata for that data row.
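
Before uploading, you may want to sanity-check a file locally. The following is a minimal sketch of such a check, not Oumi's own validator: the function name validate_jsonl is hypothetical, and the allowed roles are limited to the two that appear in the example above, which is an assumption.

```python
import json

# Assumption: only the roles that appear in the example rows are accepted.
ALLOWED_ROLES = {"user", "assistant"}


def validate_jsonl(text: str) -> list[str]:
    """Return a list of human-readable problems found in JSONL text."""
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {lineno}: row must be a JSON object")
            continue
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {lineno}: 'messages' must be a non-empty list")
            continue
        for index, message in enumerate(messages):
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                errors.append(f"line {lineno}: message {index} has a missing or unknown role")
            elif not isinstance(message.get("content"), str):
                errors.append(f"line {lineno}: message {index} content must be a string")
        if "metadata" in row and not isinstance(row["metadata"], dict):
            errors.append(f"line {lineno}: 'metadata' must be a dictionary")
    return errors


good_row = ('{"messages": [{"role": "user", "content": "What is 2+2?"}, '
            '{"role": "assistant", "content": "2+2 equals 4."}], '
            '"metadata": {"source": "math"}}')
bad_row = '{"messages": [{"role": "user"}], "metadata": "math"}'

for problem in validate_jsonl(good_row + "\n" + bad_row):
    print(problem)
```

An empty result means every row has a messages list with recognized roles and string content, plus a dictionary-valued metadata field where one is present.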

More dataset examples

WHAT’S NEXT

Add datasets

Upload and import datasets into the Oumi platform.

Add raw files

Upload and import raw files to contextualize and ground your data.

Data explorer

Explore, inspect, and validate your datasets.

Recipes

Add new datasets using guided workflows.
