DeepInfra Brings $0.08/1M Inference to Hugging Face Hub

On April 29, 2026, Hugging Face officially added DeepInfra to its Inference Providers program. Developers can now route inference requests directly to DeepInfra’s serverless GPU infrastructure using standard Hugging Face access tokens. The integration provides immediate access to thousands of open-source models without requiring developers to provision or manage underlying compute clusters.

Unified API and Routing

Accessing DeepInfra requires modifying a single parameter in the Hugging Face InferenceClient. By passing provider="deepinfra" in the API call, requests are automatically routed to DeepInfra endpoints. The integration also supports intelligent routing policies. Developers looking to reduce LLM API costs can configure the client with :cheapest to select the provider offering the lowest price per output token, while latency-sensitive applications can use :fastest to target the highest tokens per second.

Supported Models and LoRA Integration

DeepInfra supports multiple modalities across the Hugging Face ecosystem. The provider hosts frontier large language models, including the 1.6T MoE DeepSeek-V4-Pro, Kimi K2, and Qwen3.5 parameters up to 397B. Multimodal visual language models and text-to-image architectures like FLUX.1-dev and the Stable Diffusion families are also fully supported.

For custom fine-tunes, DeepInfra offers serverless LoRA (Low-Rank Adaptation) inference. Developers can deploy adapters hosted on the Hub with a pricing premium of roughly 50 percent higher than the base model cost.

Performance and Economics

DeepInfra competes heavily on price-performance metrics for open-weights models. Benchmarks from late April 2026 show highly optimized throughput and latency characteristics for frontier deployments on the DeepInfra Turbo tier.

Metric	DeepInfra Turbo Performance
Blended Price (gpt-oss-120B)	$0.08 per 1 million tokens
Time-to-First-Token (TTFT)	0.49s to 0.77s
Throughput (gpt-oss-120B)	161 tokens/sec
LoRA Adapter Premium	~50% of base model cost

Platform Integration and Billing

The partnership centralizes billing and authentication. Developers initiate a serverless endpoint using the “Deploy on DeepInfra” interface directly from a model’s Hugging Face page. Because the system relies on the standard HF_TOKEN, usage is billed directly to the developer’s Hugging Face account at DeepInfra’s standard rates. This eliminates the need to manage multiple vendor accounts or payment methods.

DeepInfra was a launch partner for the DeepSeek-V4 Preview earlier in the month, signaling its capacity to host massive mixture-of-experts architectures. By joining the Hugging Face Hub program, DeepInfra enters a competitive roster of specialized AI inference providers that already includes Cerebras, Groq, Fal.ai, Together AI, and SambaNova.

If you manage inference infrastructure for open-weights models, map your current compute costs against the Hugging Face provider rates. The combination of serverless LoRA support and programmatic routing policies allows you to shift traffic to DeepInfra dynamically when cost or throughput thresholds require it.

DeepInfra Brings $0.08/1M Inference to Hugging Face Hub

Unified API and Routing

Supported Models and LoRA Integration

Performance and Economics

Platform Integration and Billing

Keep Reading

How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup

How to Serve DiffusionGemma Locally With vLLM

How to Route GPU GitHub Actions to Hugging Face Jobs

Cascaded Speech Pipeline Brings Reachy Mini Inference Local

How to Cut Checkpoint Time by 85% With TRL Delta Weight Sync