Documentation Index

Fetch the complete documentation index at: https://docs.oumi.ai/llms.txt

Use this file to discover all available pages before exploring further.

The Oumi Agent guides you through the final step of the model development lifecycle, from exporting your trained model to selecting and configuring the right inference target. Using your performance goals, latency requirements, and infrastructure constraints as inputs, it recommends whether to deploy locally or to a cloud provider and helps configure your serving setup accordingly.

Deployment decisions that typically involve days of benchmarking and infrastructure research can be made in minutes. By matching your requirements to proven deployment configurations, the Oumi Agent reduces the risk of over-provisioning compute and helps you avoid costly trial-and-error with inference settings. Oumi exports models in a standard format compatible with popular inference engines, so you retain full flexibility over where and how you serve.

DEPLOYMENT WORKFLOW

Deployment in Oumi follows a straightforward sequence:
  1. Export your trained model from the Oumi platform
  2. Choose an inference target: run locally on your own hardware, or deploy to a cloud provider
  3. Serve the model using a compatible inference engine (e.g., vLLM, Hugging Face Transformers)
  4. Monitor and iterate: re-evaluate and retrain as production data evolves
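Steps 1–3 of this sequence can be sketched in a couple of commands for the local path, assuming vLLM is installed and the exported checkpoint lives at `./exported-model` (a placeholder path, not a fixed Oumi location):

```shell
# Install the vLLM inference engine (step 3's serving backend).
pip install vllm

# Point vLLM at the exported checkpoint directory.
# This starts an OpenAI-compatible HTTP server on port 8000.
vllm serve ./exported-model --port 8000
```

Once the server is up, any OpenAI-compatible client can query it, which is what makes the same exported model portable across local and cloud targets.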

CHOOSING A DEPLOYMENT TARGET

The right deployment target depends on your latency requirements, data privacy needs, and infrastructure preferences.
                 Local Inference                                Cloud Inference
Best for         Development, testing, air-gapped environments  Production, high-throughput, scalable APIs
Hardware         Your own GPU or CPU                            Cloud GPU instances (AWS, GCP, Lambda, etc.)
Data privacy     Full control; data never leaves your machine   Depends on provider and configuration
Setup effort     Low; single command with vLLM                  Moderate; instance provisioning required
Scalability      Limited to local resources                     Scales horizontally on demand
Cost             Infrastructure you already own                 Pay-per-use or reserved instance pricing
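The comparison above amounts to a rule of thumb. A minimal sketch of that decision logic in Python (the function name and inputs are illustrative, not part of any Oumi API):

```python
def choose_target(data_must_stay_local: bool,
                  needs_horizontal_scaling: bool,
                  prototyping: bool) -> str:
    """Distill the local-vs-cloud comparison into a rule of thumb.

    Illustrative only -- this mirrors the trade-off table, not an Oumi API.
    """
    if data_must_stay_local:
        return "local"   # full control; data never leaves your machine
    if needs_horizontal_scaling:
        return "cloud"   # scales horizontally on demand
    if prototyping:
        return "local"   # low setup effort; single command with vLLM
    return "cloud"       # default to production-grade serving
```

Privacy is checked first because it is a hard constraint; scaling and setup effort are preferences that can trade off against each other.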

LOCAL INFERENCE

Run your exported model directly on your own hardware using vLLM or Hugging Face Transformers. This is the fastest way to get a model running after export and is ideal for iterative testing, internal tools, and privacy-sensitive workloads. Learn more about Local Inference →
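Because a locally served model exposes an OpenAI-compatible HTTP API (vLLM's default), querying it takes only the standard library. A sketch, assuming a server is already running on `localhost:8000` and `./exported-model` is a placeholder for your exported checkpoint path:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def query_local_server(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a locally running inference server and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires the vLLM server from the previous step to be running):
# payload = build_chat_request("./exported-model", "Summarize this document.")
# reply = query_local_server(payload)
```

Since the request format is the same one cloud providers accept, client code written against a local server carries over to a cloud deployment with only a base-URL change.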

CLOUD INFERENCE

Deploy your exported model to a cloud provider for scalable, production-grade serving. Oumi-exported models are compatible with several managed inference platforms and GPU cloud providers, including AWS Bedrock and Lambda. Learn more about Cloud Inference →

WHAT’S NEXT

Exporting your model

Download your trained model artifacts from Oumi.

Local inference

Serve your model on your own hardware with vLLM or Hugging Face.

Cloud inference

Deploy to AWS, Lambda, or another GPU cloud provider.