DeepInfra Brings $0.08/1M Inference to Hugging Face Hub
Developers can now route Hugging Face API requests directly to DeepInfra's serverless GPU infrastructure for high-performance model inference.
On April 29, 2026, Hugging Face officially added DeepInfra to its Inference Providers program. Developers can now route inference requests directly to DeepInfra’s serverless GPU infrastructure using standard Hugging Face access tokens. The integration provides immediate access to thousands of open-source models without requiring developers to provision or manage underlying compute clusters.
Unified API and Routing
Accessing DeepInfra requires modifying a single parameter in the Hugging Face InferenceClient. By passing provider="deepinfra" in the API call, requests are automatically routed to DeepInfra endpoints. The integration also supports intelligent routing policies. Developers looking to reduce LLM API costs can configure the client with :cheapest to select the provider offering the lowest price per output token, while latency-sensitive applications can use :fastest to target the highest tokens per second.
Supported Models and LoRA Integration
DeepInfra supports multiple modalities across the Hugging Face ecosystem. The provider hosts frontier large language models, including the 1.6T MoE DeepSeek-V4-Pro, Kimi K2, and Qwen3.5 parameters up to 397B. Multimodal visual language models and text-to-image architectures like FLUX.1-dev and the Stable Diffusion families are also fully supported.
For custom fine-tunes, DeepInfra offers serverless LoRA (Low-Rank Adaptation) inference. Developers can deploy adapters hosted on the Hub with a pricing premium of roughly 50 percent higher than the base model cost.
Performance and Economics
DeepInfra competes heavily on price-performance metrics for open-weights models. Benchmarks from late April 2026 show highly optimized throughput and latency characteristics for frontier deployments on the DeepInfra Turbo tier.
| Metric | DeepInfra Turbo Performance |
|---|---|
| Blended Price (gpt-oss-120B) | $0.08 per 1 million tokens |
| Time-to-First-Token (TTFT) | 0.49s to 0.77s |
| Throughput (gpt-oss-120B) | 161 tokens/sec |
| LoRA Adapter Premium | ~50% of base model cost |
Platform Integration and Billing
The partnership centralizes billing and authentication. Developers initiate a serverless endpoint using the “Deploy on DeepInfra” interface directly from a model’s Hugging Face page. Because the system relies on the standard HF_TOKEN, usage is billed directly to the developer’s Hugging Face account at DeepInfra’s standard rates. This eliminates the need to manage multiple vendor accounts or payment methods.
DeepInfra was a launch partner for the DeepSeek-V4 Preview earlier in the month, signaling its capacity to host massive mixture-of-experts architectures. By joining the Hugging Face Hub program, DeepInfra enters a competitive roster of specialized AI inference providers that already includes Cerebras, Groq, Fal.ai, Together AI, and SambaNova.
If you manage inference infrastructure for open-weights models, map your current compute costs against the Hugging Face provider rates. The combination of serverless LoRA support and programmatic routing policies allows you to shift traffic to DeepInfra dynamically when cost or throughput thresholds require it.
Get Insanely Good at AI
The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
Keep Reading
How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup
Learn how to analyze PyTorch profiler traces and implement Liger kernel fusion to significantly reduce memory bandwidth bottlenecks in transformer models.
How to Serve DiffusionGemma Locally With vLLM
Learn how to deploy Google's 26B text diffusion model on local hardware to achieve massive parallel generation speeds using vLLM and Hugging Face.
How to Route GPU GitHub Actions to Hugging Face Jobs
Offload your training and GPU-heavy CI workloads to Hugging Face Jobs using their new ephemeral GitHub runners and action integrations.
Cascaded Speech Pipeline Brings Reachy Mini Inference Local
Hugging Face released an offline conversational stack for the Reachy Mini robot that replaces cloud APIs with a local pipeline built on Gemma 4 and Qwen3-TTS.
How to Cut Checkpoint Time by 85% With TRL Delta Weight Sync
Learn how to configure TRL Delta Weight Sync to reduce trillion-parameter model checkpointing times by 85 percent using Hugging Face Hub Buckets.