Multiverse Launches CompactifAI App and API
Multiverse rolled out an offline CompactifAI app and a public API portal to bring compressed AI models to edge devices and self-serve users.
Multiverse Computing has turned its compressed-model strategy into shipping products. The company's CompactifAI App now runs AI models locally and offline on edge devices, while a new public CompactifAI API portal exposes both compressed and original models through self-serve inference with token management and real-time usage monitoring. For developers, the shift matters because compression is moving from model research into deployable app and API surfaces.
Product Surface
The CompactifAI App is positioned for on-device inference with cloud fallback. Multiverse says users can run advanced AI models fully offline, then switch to cloud-based models via API when needed. The target use cases are straightforward: privacy-sensitive workloads and environments with unreliable connectivity.
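The on-device-first, cloud-fallback pattern the app describes can be sketched as a thin wrapper. This is a minimal illustration of the pattern only; `run_local` and `run_cloud` are hypothetical stand-ins, not CompactifAI APIs.

```python
# Local-first inference with cloud fallback: try the on-device model,
# and fall back to a hosted endpoint if local execution fails.
# `run_local` and `run_cloud` are placeholder callables (assumptions).

def infer_with_fallback(prompt, run_local, run_cloud):
    """Return (answer, source) -- preferring local, falling back to cloud."""
    try:
        return run_local(prompt), "local"
    except Exception:
        # e.g. model not installed, out of memory, feature unsupported offline
        return run_cloud(prompt), "cloud"

# Example: a local runner that fails, forcing the cloud path.
def broken_local(prompt):
    raise RuntimeError("local model unavailable")

result, source = infer_with_fallback("hello", broken_local, lambda p: p.upper())
print(source)   # cloud
print(result)   # HELLO
```

The same wrapper also covers the privacy case in reverse: keep sensitive prompts on the local path and only route non-sensitive traffic to the hosted endpoint.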
The important change is accessibility. Multiverse had already distributed compressed models through AWS channels in 2025, but the March 2026 launch moves the company closer to a standard developer workflow: install an app for local execution, or hit a public endpoint for hosted inference. If you are deciding between local and hosted inference in your own stack, this is the same operational split you already weigh when running LLMs locally or working to reduce LLM API costs.
Multiverse says CompactifAI uses quantum-inspired mathematics to compress models by up to 95% while keeping accuracy within a 2 to 3% margin of the original. The company also positions the platform around 50 to 80% lower inference costs, up to 2x faster inference, and near-complete accuracy retention.
API Catalog and Pricing
The new public portal matters because it puts prices next to model names. This is where compression becomes a buying decision instead of a benchmark claim.
| Model | Input price | Output price |
|---|---|---|
| BlackStar 10B | $0.02 / 1M tokens | $0.07 / 1M tokens |
| HyperNova 60B | $0.04 / 1M tokens | $0.14 / 1M tokens |
| OpenAI gpt-oss-20b | $0.03 / 1M tokens | $0.10 / 1M tokens |
| OpenAI gpt-oss-120b | $0.05 / 1M tokens | $0.23 / 1M tokens |
| Llama 3.3 70B Slim | $0.10 / 1M tokens | $0.21 / 1M tokens |
| Llama 3.3 70B | $0.15 / 1M tokens | $0.31 / 1M tokens |
| Mistral Small 3.1 Slim | $0.05 / 1M tokens | $0.08 / 1M tokens |
| Mistral Small 3.1 | $0.11 / 1M tokens | $0.17 / 1M tokens |
| Llama 3.1 8B Slim | $0.01 / 1M tokens | $0.07 / 1M tokens |
| Llama 3.1 8B | $0.02 / 1M tokens | $0.09 / 1M tokens |
| Llama 4 Scout Slim | $0.07 / 1M tokens | $0.10 / 1M tokens |
| Llama 4 Scout | $0.10 / 1M tokens | $0.14 / 1M tokens |
| DeepSeek R1 Slim | $0.28 / 1M tokens | $0.44 / 1M tokens |
Whisper Large V3 is also available for transcription at $0.00034 per minute.
The side-by-side structure is the key signal. Multiverse is selling compressed variants as drop-in economic alternatives to original models from OpenAI, Meta, DeepSeek, and Mistral families. If you build high-volume assistants, coding tools, or agent frameworks, visible token pricing plus self-serve access lowers the friction to benchmark compressed models against your current default.
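Because the slim and original variants sit side by side, the per-request math is easy to run before any benchmark. The snippet below uses the Llama 3.3 70B prices from the table above; the request shape (2,000 input / 500 output tokens) is an illustrative assumption.

```python
# Per-request cost comparison from the posted per-1M-token portal prices.
# The 2,000-in / 500-out token mix is an assumed, illustrative request shape.

PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "llama-3.3-70b":      (0.15, 0.31),
    "llama-3.3-70b-slim": (0.10, 0.21),
}

def request_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

original = request_cost("llama-3.3-70b", 2_000, 500)
slim = request_cost("llama-3.3-70b-slim", 2_000, 500)
print(f"original ${original:.6f}  slim ${slim:.6f}  saving {1 - slim/original:.0%}")
```

On this mix the slim variant comes out roughly a third cheaper per request; your own ratio will shift with your input/output token balance.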
HyperNova 60B 2602
The model doing most of the technical work behind this launch is HyperNova 60B 2602. Multiverse describes it as a 50% compressed version of gpt-oss-120b, with a 59B-parameter architecture and a single-GPU footprint.
The company says the model peaks at 32 GB of memory and fits on a single 40 GB GPU. Compared with gpt-oss-120b on an H200 Tensor Core GPU, Multiverse reports a 39.5% throughput gain and 36 to 51% latency improvements across TTFT, TPOT, and ITL. At 1,000 requests per second, it claims roughly 400 extra requests per second of headroom, or about 28% fewer GPUs for the same traffic.
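The headroom and GPU-savings figures are consistent with each other, which is worth a quick sanity check: a fixed traffic level divided by a 39.5% higher per-GPU throughput needs about 28% fewer GPUs.

```python
# Sanity-check the reported numbers: +39.5% per-GPU throughput implies
# 1 - 1/1.395 ≈ 28% fewer GPUs for the same traffic, and ~395 extra
# requests/sec of headroom on a 1,000 req/s baseline.

throughput_gain = 0.395
gpu_reduction = 1 - 1 / (1 + throughput_gain)

baseline_rps = 1_000
extra_headroom = baseline_rps * throughput_gain

print(f"GPUs needed for the same traffic: {gpu_reduction:.1%} fewer")
print(f"extra headroom at {baseline_rps} req/s: ~{extra_headroom:.0f} req/s")
```

So the "roughly 400 extra requests per second" and "about 28% fewer GPUs" claims are two views of the same throughput figure, not independent results.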
For teams building tool-using systems, the post-training work is the more relevant detail. HyperNova 60B 2602 adds tool-calling capability through targeted post-training based on knowledge distillation with synthetic examples from a larger teacher model. This is directly relevant to function calling and to evaluating agents, because lower-cost inference only matters if the model preserves action reliability.
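If the model exposes tool calling through the widely used OpenAI-compatible chat-completions schema, a request would look like the sketch below. That schema, the model name string, and the tool definition are all assumptions for illustration; check the portal's API docs for the actual contract.

```python
import json

# A tool-calling request in the common OpenAI-compatible schema.
# Whether the CompactifAI endpoint accepts exactly this shape is an
# assumption; the model name and the weather tool are illustrative.

payload = {
    "model": "hypernova-60b-2602",
    "messages": [
        {"role": "user", "content": "What's the weather in Donostia?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2))
```

When evaluating the distilled tool-calling, the thing to measure is whether the model reliably emits well-formed `tool_calls` with valid arguments, not just whether its free-text answers look reasonable.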
Benchmark Positioning
Multiverse’s benchmark deltas are strongest in tool-oriented evaluations.
| Benchmark | Prior score | HyperNova 60B 2602 |
|---|---|---|
| BFCL v4 | 25 | 62 |
| τ²-Bench | 12 | 61 |
| Terminal Bench | 8 | 16 |
| AA-LCR | 34 | 36 |
| IFBench | 56 | 60 |
| MMLU-Pro | 71 | 74 |
Those numbers support a specific market claim. Multiverse is not only arguing that smaller models are cheaper. It is arguing that compressed models can remain useful for agentic coding and tool use, which is a stricter bar than simple chatbot quality. If your workload depends on structured actions, retries, and long-running workflows, you should care more about these benchmark categories than about generic knowledge scores alone.
Multiverse says it now serves more than 100 customers worldwide, including Iberdrola, Bosch, and the Bank of Canada. The company is clearly aiming beyond research buyers and toward enterprises that want cost control, local execution, and clearer deployment options.
If you run inference at scale, test the slim variants against the originals on your actual tool-calling and latency workload, then model the GPU savings and token spend together. This launch is most useful when you treat compression as an infrastructure choice, not a model novelty.
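A slim-versus-original comparison needs only a small harness. This is a minimal latency sketch with stubbed endpoints; swap the stand-in callables for real API calls against your own prompts, and record tool-call success alongside the timings.

```python
import statistics
import time

# Minimal A/B latency harness. `fast` and `slow` are stand-ins that just
# sleep; replace them with real calls to the slim and original endpoints.

def bench(call, prompts, warmup=2):
    """Return median end-to-end latency in seconds over the prompt set."""
    for p in prompts[:warmup]:
        call(p)                              # warm caches/connections
    samples = []
    for p in prompts:
        t0 = time.perf_counter()
        call(p)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

fast = lambda p: time.sleep(0.001)           # stand-in: slim variant
slow = lambda p: time.sleep(0.003)           # stand-in: original model
prompts = ["example prompt"] * 10

print(f"slim median {bench(fast, prompts)*1e3:.1f} ms, "
      f"original median {bench(slow, prompts)*1e3:.1f} ms")
```

Medians (rather than means) keep one slow cold-start request from dominating the comparison.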