DeepInfra Brings $0.08/1M Inference to Hugging Face Hub
Developers can now route Hugging Face API requests directly to DeepInfra's serverless GPU infrastructure for high-performance model inference.
On April 29, 2026, Hugging Face officially added DeepInfra to its Inference Providers program. Developers can now route inference requests directly to DeepInfra’s serverless GPU infrastructure using standard Hugging Face access tokens. The integration provides immediate access to thousands of open-source models without requiring developers to provision or manage underlying compute clusters.
Unified API and Routing
Accessing DeepInfra requires changing a single parameter in the Hugging Face InferenceClient: passing provider="deepinfra" routes requests automatically to DeepInfra endpoints. The integration also supports intelligent routing policies. Developers looking to reduce LLM API costs can configure the client with :cheapest to select the provider offering the lowest price per output token, while latency-sensitive applications can use :fastest to target the highest tokens per second.
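Here is a minimal sketch of that single-parameter change, using the huggingface_hub Python client; the model ID and prompt are illustrative, and exact parameter names may vary by client version.

```python
import os
from huggingface_hub import InferenceClient

# Pin the provider to DeepInfra; authentication uses a standard
# Hugging Face access token (HF_TOKEN).
client = InferenceClient(
    provider="deepinfra",
    api_key=os.environ["HF_TOKEN"],
)

# Illustrative model ID; any DeepInfra-hosted chat model works here.
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```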
Supported Models and LoRA Integration
DeepInfra supports multiple modalities across the Hugging Face ecosystem. The provider hosts frontier large language models, including the 1.6T-parameter MoE DeepSeek-V4-Pro, Kimi K2, and Qwen3.5 models at sizes up to 397B parameters. Multimodal vision-language models and text-to-image architectures such as FLUX.1-dev and the Stable Diffusion families are also fully supported.
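As a sketch of the multimodal path, the same client exposes a text_to_image method; the prompt is illustrative, and FLUX.1-dev availability through this route is an assumption based on the supported-model list above.

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(provider="deepinfra", api_key=os.environ["HF_TOKEN"])

# text_to_image returns a PIL.Image.Image object.
image = client.text_to_image(
    "an isometric render of a GPU datacenter at dusk",  # illustrative prompt
    model="black-forest-labs/FLUX.1-dev",
)
image.save("datacenter.png")
```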
For custom fine-tunes, DeepInfra offers serverless LoRA (Low-Rank Adaptation) inference. Developers can deploy adapters hosted on the Hub at a pricing premium of roughly 50 percent over the base model cost.
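In principle, a Hub-hosted adapter can be addressed by its repository ID just like a base model; the sketch below assumes that resolution behavior, and your-org/your-lora-adapter is a placeholder, not a real repository.

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(provider="deepinfra", api_key=os.environ["HF_TOKEN"])

# Placeholder adapter repo; the assumption is that the provider resolves
# the base model from the adapter's metadata and serves it serverlessly.
response = client.chat.completions.create(
    model="your-org/your-lora-adapter",
    messages=[{"role": "user", "content": "Reply in the fine-tuned style."}],
)
print(response.choices[0].message.content)
```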
Performance and Economics
DeepInfra competes primarily on price-performance for open-weights models. Benchmarks from late April 2026 show the following throughput and latency characteristics for frontier deployments on the DeepInfra Turbo tier.
| Metric | DeepInfra Turbo Performance |
|---|---|
| Blended Price (gpt-oss-120B) | $0.08 per 1M tokens |
| Time-to-First-Token (TTFT) | 0.49s to 0.77s |
| Throughput (gpt-oss-120B) | 161 tokens/sec |
| LoRA Adapter Premium | ~50% over base model price |
Platform Integration and Billing
The partnership centralizes billing and authentication. Developers initiate a serverless endpoint using the “Deploy on DeepInfra” interface directly from a model’s Hugging Face page. Because the system relies on the standard HF_TOKEN, usage is billed directly to the developer’s Hugging Face account at DeepInfra’s standard rates. This eliminates the need to manage multiple vendor accounts or payment methods.
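Because authentication runs through HF_TOKEN, the same flow works from Hugging Face's OpenAI-compatible router; this sketch assumes the ":deepinfra" model suffix for pinning the provider, and the model ID is illustrative.

```python
import os
from openai import OpenAI

# Hugging Face's OpenAI-compatible router; usage is billed to the
# Hugging Face account behind HF_TOKEN, not a separate DeepInfra account.
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

# The ":deepinfra" suffix pins the request to DeepInfra.
completion = client.chat.completions.create(
    model="openai/gpt-oss-120b:deepinfra",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```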
DeepInfra was a launch partner for the DeepSeek-V4 Preview earlier in the month, signaling its capacity to host massive mixture-of-experts architectures. By joining the Hugging Face Hub program, DeepInfra enters a competitive roster of specialized AI inference providers that already includes Cerebras, Groq, Fal.ai, Together AI, and SambaNova.
If you manage inference infrastructure for open-weights models, map your current compute costs against the Hugging Face provider rates. The combination of serverless LoRA support and programmatic routing policies allows you to shift traffic to DeepInfra dynamically when cost or throughput thresholds require it.
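As a rough sketch of that dynamic shifting, assuming the :cheapest and :fastest policy suffixes behave as described above, a thin routing helper might look like this; the selection logic and model ID are illustrative.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

def routed_model(model: str, latency_sensitive: bool) -> str:
    # Policy suffixes as described above: ":fastest" targets the highest
    # tokens/sec, ":cheapest" the lowest price per output token.
    return f"{model}:fastest" if latency_sensitive else f"{model}:cheapest"

completion = client.chat.completions.create(
    model=routed_model("openai/gpt-oss-120b", latency_sensitive=False),
    messages=[{"role": "user", "content": "Ping"}],
)
print(completion.choices[0].message.content)
```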