Documentation Index

Fetch the complete documentation index at: https://docs.oumi.ai/llms.txt

Use this file to discover all available pages before exploring further.

The Oumi Agent guides you through the final step of the model development lifecycle, from exporting your trained model to selecting and configuring the right inference target. Using your performance goals, latency requirements, and infrastructure constraints as inputs, it recommends whether to deploy locally or to a cloud provider and helps configure your serving setup accordingly.

Deployment decisions that typically involve days of benchmarking and infrastructure research can be made in minutes. By matching your requirements to proven deployment configurations, the Oumi Agent reduces the risk of over-provisioning compute and helps you avoid costly trial-and-error with inference settings. Oumi exports models in a standard format compatible with popular inference engines, so you retain full flexibility over where and how you serve.

DEPLOYMENT WORKFLOW

Deployment in Oumi follows a straightforward sequence:
  1. Export your trained model from the Oumi platform
  2. Choose an inference target: run locally on your own hardware, or deploy to a cloud provider
  3. Serve the model using a compatible inference engine (e.g., vLLM, Hugging Face Transformers)
  4. Monitor and iterate: re-evaluate and retrain as production data evolves
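Steps 1–3 of this sequence can be sketched in a couple of commands for the local path, assuming vLLM is installed and the exported checkpoint lives at `./exported-model` (a placeholder path, not a fixed Oumi location):

```shell
# Install the vLLM inference engine (step 3's serving backend).
pip install vllm

# Point vLLM at the exported checkpoint directory.
# This starts an OpenAI-compatible HTTP server on port 8000.
vllm serve ./exported-model --port 8000
```

Once the server is up, any OpenAI-compatible client can query it, which is what makes the same exported model portable across local and cloud targets.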

CHOOSING A DEPLOYMENT TARGET

The right deployment target depends on your latency requirements, data privacy needs, and infrastructure preferences.
                 Local Inference                                Cloud Inference
Best for         Development, testing, air-gapped environments  Production, high-throughput, scalable APIs
Hardware         Your own GPU or CPU                            Cloud GPU instances (AWS, GCP, Lambda, etc.)
Data privacy     Full control; data never leaves your machine   Depends on provider and configuration
Setup effort     Low; single command with vLLM                  Moderate; instance provisioning required
Scalability      Limited to local resources                     Scales horizontally on demand
Cost             Infrastructure you already own                 Pay-per-use or reserved instance pricing
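The comparison above amounts to a rule of thumb. A minimal sketch of that decision logic in Python (the function name and inputs are illustrative, not part of any Oumi API):

```python
def choose_target(data_must_stay_local: bool,
                  needs_horizontal_scaling: bool,
                  prototyping: bool) -> str:
    """Distill the local-vs-cloud comparison into a rule of thumb.

    Illustrative only -- this mirrors the trade-off table, not an Oumi API.
    """
    if data_must_stay_local:
        return "local"   # full control; data never leaves your machine
    if needs_horizontal_scaling:
        return "cloud"   # scales horizontally on demand
    if prototyping:
        return "local"   # low setup effort; single command with vLLM
    return "cloud"       # default to production-grade serving
```

Privacy is checked first because it is a hard constraint; scaling and setup effort are preferences that can trade off against each other.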

LOCAL INFERENCE

Run your exported model directly on your own hardware using vLLM or Hugging Face Transformers. This is the fastest way to get a model running after export and is ideal for iterative testing, internal tools, and privacy-sensitive workloads. Learn more about Local Inference →
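Because a locally served model exposes an OpenAI-compatible HTTP API (vLLM's default), querying it takes only the standard library. A sketch, assuming a server is already running on `localhost:8000` and `./exported-model` is a placeholder for your exported checkpoint path:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def query_local_server(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a locally running inference server and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires the vLLM server from the previous step to be running):
# payload = build_chat_request("./exported-model", "Summarize this document.")
# reply = query_local_server(payload)
```

Since the request format is the same one cloud providers accept, client code written against a local server carries over to a cloud deployment with only a base-URL change.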

CLOUD INFERENCE

Deploy your exported model to a cloud provider for scalable, production-grade serving. Oumi-exported models are compatible with several managed inference platforms and GPU cloud providers, including AWS Bedrock and Lambda. Learn more about Cloud Inference →

WHAT’S NEXT

Exporting your model

Download your trained model artifacts from Oumi.

Local inference

Serve your model on your own hardware with vLLM or Hugging Face.

Cloud inference

Deploy to AWS, Lambda, or another GPU cloud provider.