Multiverse Launches CompactifAI App and API
Multiverse rolled out an offline CompactifAI app and a public API portal to bring compressed AI models to edge devices and self-serve users.
Multiverse Computing has turned its compressed-model strategy into shipping products. The company's CompactifAI App now runs AI models locally and offline on edge devices, while a new public CompactifAI API portal exposes both compressed and original models through self-serve inference with token management and real-time usage monitoring. For developers, the shift matters because compression is moving from model research into deployable app and API surfaces.
Product Surface
The CompactifAI App is positioned for on-device inference with cloud fallback. Multiverse says users can run advanced AI models fully offline, then switch to cloud-based models via API when needed. The target use cases are straightforward: privacy-sensitive workloads and environments with unreliable connectivity.
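The on-device-first, cloud-fallback pattern the app describes can be sketched as a thin wrapper. This is a minimal illustration of the pattern only; `run_local` and `run_cloud` are hypothetical stand-ins, not CompactifAI APIs.

```python
# Local-first inference with cloud fallback: try the on-device model,
# and fall back to a hosted endpoint if local execution fails.
# `run_local` and `run_cloud` are placeholder callables (assumptions).

def infer_with_fallback(prompt, run_local, run_cloud):
    """Return (answer, source) -- preferring local, falling back to cloud."""
    try:
        return run_local(prompt), "local"
    except Exception:
        # e.g. model not installed, out of memory, feature unsupported offline
        return run_cloud(prompt), "cloud"

# Example: a local runner that fails, forcing the cloud path.
def broken_local(prompt):
    raise RuntimeError("local model unavailable")

result, source = infer_with_fallback("hello", broken_local, lambda p: p.upper())
print(source)   # cloud
print(result)   # HELLO
```

The same wrapper also covers the privacy case in reverse: keep sensitive prompts on the local path and only route non-sensitive traffic to the hosted endpoint.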
The important change is accessibility. Multiverse had already distributed compressed models through AWS channels in 2025, but the March 2026 launch moves the company closer to a standard developer workflow: install an app for local execution, or hit a public endpoint for hosted inference. If you are deciding between local and hosted inference in your own stack, this is the same operational split you already weigh when running LLMs locally or working to reduce LLM API costs.
Multiverse says CompactifAI uses quantum-inspired mathematics to compress models by up to 95% while keeping accuracy within a 2 to 3% margin of the original. The company also positions the platform around 50 to 80% lower inference costs, up to 2x faster inference, and near-complete accuracy retention.
API Catalog and Pricing
The new public portal matters because it puts prices next to model names. This is where compression becomes a buying decision instead of a benchmark claim.
| Model | Input price | Output price |
|---|---|---|
| BlackStar 10B | $0.02 / 1M tokens | $0.07 / 1M tokens |
| HyperNova 60B | $0.04 / 1M tokens | $0.14 / 1M tokens |
| OpenAI gpt-oss-20b | $0.03 / 1M tokens | $0.10 / 1M tokens |
| OpenAI gpt-oss-120b | $0.05 / 1M tokens | $0.23 / 1M tokens |
| Llama 3.3 70B Slim | $0.10 / 1M tokens | $0.21 / 1M tokens |
| Llama 3.3 70B | $0.15 / 1M tokens | $0.31 / 1M tokens |
| Mistral Small 3.1 Slim | $0.05 / 1M tokens | $0.08 / 1M tokens |
| Mistral Small 3.1 | $0.11 / 1M tokens | $0.17 / 1M tokens |
| Llama 3.1 8B Slim | $0.01 / 1M tokens | $0.07 / 1M tokens |
| Llama 3.1 8B | $0.02 / 1M tokens | $0.09 / 1M tokens |
| Llama 4 Scout Slim | $0.07 / 1M tokens | $0.10 / 1M tokens |
| Llama 4 Scout | $0.10 / 1M tokens | $0.14 / 1M tokens |
| DeepSeek R1 Slim | $0.28 / 1M tokens | $0.44 / 1M tokens |
Whisper Large V3 is also available for transcription at $0.00034 per minute.
The side-by-side structure is the key signal. Multiverse is selling compressed variants as drop-in economic alternatives to original models from OpenAI, Meta, DeepSeek, and Mistral families. If you build high-volume assistants, coding tools, or agent frameworks, visible token pricing plus self-serve access lowers the friction to benchmark compressed models against your current default.
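Because the slim and original variants sit side by side, the per-request math is easy to run before any benchmark. The snippet below uses the Llama 3.3 70B prices from the table above; the request shape (2,000 input / 500 output tokens) is an illustrative assumption.

```python
# Per-request cost comparison from the posted per-1M-token portal prices.
# The 2,000-in / 500-out token mix is an assumed, illustrative request shape.

PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "llama-3.3-70b":      (0.15, 0.31),
    "llama-3.3-70b-slim": (0.10, 0.21),
}

def request_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

original = request_cost("llama-3.3-70b", 2_000, 500)
slim = request_cost("llama-3.3-70b-slim", 2_000, 500)
print(f"original ${original:.6f}  slim ${slim:.6f}  saving {1 - slim/original:.0%}")
```

On this mix the slim variant comes out roughly a third cheaper per request; your own ratio will shift with your input/output token balance.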
HyperNova 60B 2602
The model doing most of the technical work behind this launch is HyperNova 60B 2602. Multiverse describes it as a 50% compressed version of gpt-oss-120b, with a 59B-parameter architecture and a single-GPU footprint.
The company says the model peaks at 32 GB of memory and fits on a single 40 GB GPU. Compared with gpt-oss-120b on an H200 Tensor Core GPU, Multiverse reports a 39.5% throughput gain and 36 to 51% latency improvements across TTFT, TPOT, and ITL. At 1,000 requests per second, it claims roughly 400 extra requests per second of headroom, or about 28% fewer GPUs for the same traffic.
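The headroom and GPU-savings figures are consistent with each other, which is worth a quick sanity check: a fixed traffic level divided by a 39.5% higher per-GPU throughput needs about 28% fewer GPUs.

```python
# Sanity-check the reported numbers: +39.5% per-GPU throughput implies
# 1 - 1/1.395 ≈ 28% fewer GPUs for the same traffic, and ~395 extra
# requests/sec of headroom on a 1,000 req/s baseline.

throughput_gain = 0.395
gpu_reduction = 1 - 1 / (1 + throughput_gain)

baseline_rps = 1_000
extra_headroom = baseline_rps * throughput_gain

print(f"GPUs needed for the same traffic: {gpu_reduction:.1%} fewer")
print(f"extra headroom at {baseline_rps} req/s: ~{extra_headroom:.0f} req/s")
```

So the "roughly 400 extra requests per second" and "about 28% fewer GPUs" claims are two views of the same throughput figure, not independent results.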
For teams building tool-using systems, the post-training work is the more relevant detail. HyperNova 60B 2602 adds tool-calling capability through targeted post-training based on knowledge distillation with synthetic examples from a larger teacher model. This is directly relevant to function calling and to evaluating agents, because lower-cost inference only matters if the model preserves action reliability.
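If the model exposes tool calling through the widely used OpenAI-compatible chat-completions schema, a request would look like the sketch below. That schema, the model name string, and the tool definition are all assumptions for illustration; check the portal's API docs for the actual contract.

```python
import json

# A tool-calling request in the common OpenAI-compatible schema.
# Whether the CompactifAI endpoint accepts exactly this shape is an
# assumption; the model name and the weather tool are illustrative.

payload = {
    "model": "hypernova-60b-2602",
    "messages": [
        {"role": "user", "content": "What's the weather in Donostia?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2))
```

When evaluating the distilled tool-calling, the thing to measure is whether the model reliably emits well-formed `tool_calls` with valid arguments, not just whether its free-text answers look reasonable.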
Benchmark Positioning
Multiverse’s benchmark deltas are strongest in tool-oriented evaluations.
| Benchmark | Prior score | HyperNova 60B 2602 |
|---|---|---|
| BFCL v4 | 25 | 62 |
| τ²-Bench | 12 | 61 |
| Terminal Bench | 8 | 16 |
| AA-LCR | 34 | 36 |
| IFBench | 56 | 60 |
| MMLU-Pro | 71 | 74 |
Those numbers support a specific market claim. Multiverse is not only arguing that smaller models are cheaper. It is arguing that compressed models can remain useful for agentic coding and tool use, which is a stricter bar than simple chatbot quality. If your workload depends on structured actions, retries, and long-running workflows, you should care more about these benchmark categories than about generic knowledge scores alone.
Multiverse says it now serves more than 100 customers worldwide, including Iberdrola, Bosch, and the Bank of Canada. The company is clearly aiming beyond research buyers and toward enterprises that want cost control, local execution, and clearer deployment options.
If you run inference at scale, test the slim variants against the originals on your actual tool-calling and latency workload, then model the GPU savings and token spend together. This launch is most useful when you treat compression as an infrastructure choice, not a model novelty.
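A slim-versus-original comparison needs only a small harness. This is a minimal latency sketch with stubbed endpoints; swap the stand-in callables for real API calls against your own prompts, and record tool-call success alongside the timings.

```python
import statistics
import time

# Minimal A/B latency harness. `fast` and `slow` are stand-ins that just
# sleep; replace them with real calls to the slim and original endpoints.

def bench(call, prompts, warmup=2):
    """Return median end-to-end latency in seconds over the prompt set."""
    for p in prompts[:warmup]:
        call(p)                              # warm caches/connections
    samples = []
    for p in prompts:
        t0 = time.perf_counter()
        call(p)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

fast = lambda p: time.sleep(0.001)           # stand-in: slim variant
slow = lambda p: time.sleep(0.003)           # stand-in: original model
prompts = ["example prompt"] * 10

print(f"slim median {bench(fast, prompts)*1e3:.1f} ms, "
      f"original median {bench(slow, prompts)*1e3:.1f} ms")
```

Medians (rather than means) keep one slow cold-start request from dominating the comparison.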