Stable Audio 3.0 Hits 6-Minute Tracks in 1.3 Seconds on H200

Stability AI has launched Stable Audio 3.0, a family of open-weight models that generate tracks up to 6 minutes 20 seconds long. Previous audio generation systems produced fixed-length outputs and padded short clips with silence. The version 3.0 architecture supports variable-length generation with per-second granularity. This design lowers compute costs for short samples and doubles the maximum track length of the April 2024 model.

Hardware Profiles and Inference Speeds

The Stable Audio 3.0 family splits workloads across four model variants. The 459M-parameter Stable Audio 3.0 Small SFX targets on-device sound effect generation on consumer hardware. The standard Stable Audio 3.0 Small composes full music tracks up to two minutes long. It completes a two-minute track in 0.44 seconds on an NVIDIA H200 GPU and in 3 seconds on a MacBook Pro M4 using CoreML.

The 1.4B-parameter Stable Audio 3.0 Medium extends the maximum track length to 6 minutes 20 seconds. It generates a full-length track in 1.31 seconds on an H200. The Medium variant requires approximately 6.5 GB of VRAM, making it viable on consumer GPUs like the RTX 4060 or 3060. The 2.7B-parameter Stable Audio 3.0 Large remains restricted to API users and enterprise customers running high-volume platforms.
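To put the quoted timings on a common scale, the short sketch below converts them into a real-time factor (seconds of audio produced per second of compute). The durations and timings are the figures reported above; the function name and dictionary layout are illustrative only.

```python
# Real-time factor = seconds of audio produced per second of wall-clock compute.
# The numbers are the figures quoted in the article; names are illustrative.

def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


BENCHMARKS = {
    "Small on H200": (120.0, 0.44),
    "Small on MacBook Pro M4 (CoreML)": (120.0, 3.0),
    "Medium on H200": (6 * 60 + 20, 1.31),  # 6 min 20 s = 380 s
}

for name, (audio_s, gen_s) in BENCHMARKS.items():
    print(f"{name}: about {real_time_factor(audio_s, gen_s):.0f}x real time")
```

On these figures, both H200 runs land in the high-200s multiples of real time, while the on-device M4 run is roughly 40x.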

Semantic-Acoustic Architecture

The models use a Semantic-Acoustic Autoencoder, a new latent diffusion architecture that projects audio into a compressed latent space. This approach maintains stereo 44.1 kHz fidelity while giving the user control over the semantic structure of a track.
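A quick back-of-the-envelope calculation shows why a compressed latent space matters at this output format. The sketch below only counts raw stereo samples at 44.1 kHz; the article does not state the autoencoder's compression ratio, so none is assumed here.

```python
# Raw sample count for stereo 44.1 kHz audio, as quoted in the article.
# The autoencoder's latent size is not given, so it is not modelled.

SAMPLE_RATE_HZ = 44_100
CHANNELS = 2  # stereo


def raw_sample_count(duration_seconds: float) -> int:
    """Total number of samples across both channels for the given duration."""
    return int(round(duration_seconds * SAMPLE_RATE_HZ)) * CHANNELS


full_track = raw_sample_count(6 * 60 + 20)  # maximum Medium track length
print(f"{full_track:,} raw samples in a 6:20 stereo track")
```

A full-length track is tens of millions of raw samples, which is the scale the diffusion model avoids working on directly by operating in the latent space.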

The generation pipeline incorporates adversarial post-training and distillation. These techniques reduce the required number of inference steps, achieving high audio quality without massive compute budgets.

Editing and Tool Integration

Stability AI shipped editing tools alongside the base generation capabilities. Audio Inpainting lets users mask specific segments of a track to modify targeted instruments or fix transition errors. Causal Continuation accepts an existing audio clip and has the model extend the track naturally beyond the original endpoint.
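The masking idea behind both tools can be sketched generically. The example below is not the Stable Audio 3.0 API, which the article does not show; `build_regen_mask` and the `frames_per_second` resolution are hypothetical names chosen for illustration. It marks which time frames a model would regenerate and which it would keep.

```python
import math
from typing import List, Tuple


def build_regen_mask(
    total_seconds: float,
    regions: List[Tuple[float, float]],
    frames_per_second: int = 10,  # hypothetical mask resolution
) -> List[bool]:
    """Return one flag per frame: True = regenerate, False = keep original."""
    n_frames = int(round(total_seconds * frames_per_second))
    mask = [False] * n_frames
    for start, end in regions:
        if not 0 <= start < end <= total_seconds:
            raise ValueError(f"invalid region: ({start}, {end})")
        first = int(start * frames_per_second)
        last = min(n_frames, math.ceil(end * frames_per_second))
        for i in range(first, last):
            mask[i] = True
    return mask


# Inpainting: regenerate only 0:42 to 0:47.5 of a two-minute track.
inpaint = build_regen_mask(120.0, [(42.0, 47.5)])

# Continuation: keep the first 30 s, generate everything up to 90 s.
continuation = build_regen_mask(90.0, [(30.0, 90.0)])

print(sum(inpaint), "frames to repaint;", sum(continuation), "frames to extend")
```

Seen this way, continuation is the special case where the regenerated region runs from the end of the existing clip to the new endpoint.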

The release includes documentation for LoRA (Low-Rank Adaptation). Developers can fine-tune the Small and Medium variants on proprietary audio libraries to create custom instruments or stylistic profiles. This mirrors the domain-specific customization seen in visual workflows like custom models in Adobe Firefly.

Licensing and Training Data

The dataset for Stable Audio 3.0 relies entirely on licensed data. Stability AI secured partnerships with Universal Music Group and Warner Music Group, supplementing the corpus with Creative Commons sources.

Organizations with under $1 million in annual revenue can deploy the open-weight models under the free Stability AI Community License. Larger organizations require an Enterprise License, which provides legal indemnification for generated outputs.

If you build audio generation applications, the split between the Small and Medium models dictates your infrastructure path. Standard two-minute background tracks can now run entirely on edge devices using the Small variant, avoiding cloud compute costs for those tracks. For full-length commercial tracks, budget roughly 6.5 GB of VRAM per concurrent stream for the Medium model and plan for enterprise licensing if your revenue exceeds the community threshold.
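The planning rule above can be written as a small helper. This is a minimal sketch under stated assumptions: each concurrent stream is budgeted the full approximate 6.5 GB (a real deployment that shares weights between streams may need less), and revenue of exactly $1 million is treated as requiring the Enterprise License, which the article does not specify. Function names are illustrative.

```python
# Capacity and licensing sketch based on the article's figures.
MEDIUM_VRAM_GB = 6.5                    # approximate per-stream figure
COMMUNITY_REVENUE_CAP_USD = 1_000_000   # "under $1 million in annual revenue"


def max_concurrent_streams(
    gpu_vram_gb: float,
    per_stream_gb: float = MEDIUM_VRAM_GB,
    reserved_gb: float = 0.0,  # headroom for other processes (assumption)
) -> int:
    """Number of Medium streams that fit if each needs its own VRAM budget."""
    usable = gpu_vram_gb - reserved_gb
    if usable < per_stream_gb:
        return 0
    return int(usable // per_stream_gb)


def required_license(annual_revenue_usd: float) -> str:
    """Community below the cap; the boundary case is assumed to be enterprise."""
    if annual_revenue_usd < COMMUNITY_REVENUE_CAP_USD:
        return "community"
    return "enterprise"


print(max_concurrent_streams(12.0))   # a 12 GB consumer card
print(required_license(250_000))
```

Under this budgeting rule, a 12 GB card fits a single Medium stream and a 24 GB card fits three.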
