
NVIDIA Releases Nemotron 3 Content Safety 4B

NVIDIA released Nemotron 3 Content Safety 4B, a multilingual, multimodal moderation model for text and images, on Hugging Face.

NVIDIA released Nemotron 3 Content Safety 4B, a new open moderation model for classifying text, images, or mixed text-image inputs as safe or unsafe across 12 languages. The Nemotron 3 Content Safety release matters if you build multimodal agents, because moderation now has to cover screenshots, PDFs, memes, mobile photos, and image-embedded text, not just plain chat prompts.

The model is published as nvidia/Nemotron-3-Content-Safety on Hugging Face, with a March 16 release date in the model card and a March 20 public write-up. NVIDIA positions it as a guard model for LLM and VLM pipelines, especially agent workflows that accept user uploads and produce tool-augmented responses.

Model Architecture

Nemotron 3 Content Safety is built from Gemma-3-4B-it, fine-tuned with LoRA and merged back into the base model. It is a 4B-parameter, decoder-only Transformer with a SigLIP vision encoder and a maximum context window of 128K tokens.

For multimodal moderation, those details matter. A 4B model is small enough to fit into tighter deployment budgets, while 128K context gives you room to classify long conversations, attached OCR text, and policy-heavy system context in one pass. If you already think in terms of context engineering, this is the moderation-layer version of the same problem.

The vision stack takes square images resized to 896 × 896 pixels. Input modalities are text and image.
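The 896 × 896 square target comes from the model card; in practice the model's Hugging Face processor handles resizing. As an illustration of what that preprocessing implies for non-square uploads like screenshots, here is a minimal sketch that computes a letterbox mapping (the letterbox strategy itself is an assumption, not documented behavior):

```python
# Illustrative sketch: how an arbitrary image maps onto the 896 x 896
# square input the vision encoder expects. The letterbox (scale then
# pad) strategy is an assumption for illustration; the model's own
# processor does the real preprocessing.

def letterbox_target(width: int, height: int, target: int = 896):
    """Scale the image to fit inside target x target, then pad to square.

    Returns (new_width, new_height, pad_x, pad_y).
    """
    scale = target / max(width, height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A 1920x1080 screenshot scales to 896x504 with 196px of vertical padding.
print(letterbox_target(1920, 1080))  # -> (896, 504, 0, 196)
```

The takeaway for agent builders: wide screenshots lose a lot of vertical resolution at this input size, which is worth checking when moderating image-embedded text.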

Output Format

Moderation output supports two modes. The default path is low-latency safe/unsafe classification. An optional richer mode adds violated safety categories.

The structured text output includes:

  • User Safety
  • Response Safety
  • Safety Categories

NVIDIA uses a 23-category taxonomy aligned with its Aegis content safety schema, including violence, sexual content, harassment, threat, PII/privacy, fraud/deception, malware, political/misinformation/conspiracy, unauthorized advice, and illegal activity.

This is a useful split for agent builders. You can run a fast binary gate in the hot path, then trigger category output only when you need policy routing, audit logging, or downstream enforcement. If your stack already depends on structured output and post-processing rules, category-rich moderation is much easier to operationalize than free-form refusal text.
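The two-tier pattern above can be sketched as a small routing function. The exact output format of the model is an assumption here (a JSON object carrying the three fields listed in the model card: "User Safety", "Response Safety", "Safety Categories"); adapt the parsing to the real output template:

```python
import json

# Hedged sketch of the binary-gate-plus-categories routing pattern.
# The JSON field names mirror the structured output fields the model
# card lists; the exact serialization format is an assumption.

def route(raw: str) -> dict:
    result = json.loads(raw)
    # Fast binary gate for the hot path: block if either side is unsafe.
    unsafe = (result.get("User Safety") == "unsafe"
              or result.get("Response Safety") == "unsafe")
    # Categories are only present in the richer mode; pass them along
    # for policy routing, audit logging, or downstream enforcement.
    return {
        "blocked": unsafe,
        "categories": result.get("Safety Categories", []),
    }

print(route('{"User Safety": "unsafe", "Safety Categories": ["Threat"]}'))
# -> {'blocked': True, 'categories': ['Threat']}
```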

Multilingual and Multimodal Coverage

Language support includes English, Arabic, German, Spanish, French, Hindi, Japanese, Thai, Dutch, Italian, Korean, and Chinese. NVIDIA also reports zero-shot generalization to additional languages including Portuguese, Swedish, Russian, Czech, Polish, and Bengali.

The more important shift is modality. This model is designed for content that arrives as text plus screenshots, scanned documents, diagrams, memes, and photos. For agent products, that closes a gap that text-only guard models leave open. A system that can safely moderate chat input but cannot inspect a screenshot upload is incomplete.

That also aligns with the direction of agent UX. As more products move from pure chat into computer use and multimodal workflows, guardrails have to sit beside the main model in every tool loop. The same pressure shows up in work on evaluating agents and AI agents versus chatbots.

Training Data and Tuning

Training data combines multilingual safety data from Nemotron-Safety-Guard-Dataset-v3, human-annotated multimodal English safety data translated into multiple languages, safe multimodal data from Nemotron-VLM-Dataset-v2, and synthetic data.

The model’s training set is about 86K samples, drawing from the much larger Nemotron-Safety-Guard-Dataset-v3 (~515K rows total) and other sources. Synthetic data accounts for roughly 10% of the training blend.

NVIDIA translated English-only text data into all 12 supported languages. For around 25% of training samples, category labels were removed and a /no_categories toggle was added, so the model learns when not to emit category labels.
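The category-dropping step can be sketched as a data-prep pass. The /no_categories toggle string comes from the release notes; the sample schema and the way the toggle is appended to the prompt are invented for illustration:

```python
import random

# Sketch of the ~25% category-dropping step: strip category labels and
# append the /no_categories toggle so the model learns to suppress
# category output on demand. Sample schema is hypothetical.

def drop_categories(samples, drop_rate=0.25, seed=0):
    rng = random.Random(seed)
    out = []
    for s in samples:
        if rng.random() < drop_rate:
            s = dict(s, categories=None,
                     prompt=s["prompt"] + " /no_categories")
        out.append(s)
    return out

data = [{"prompt": f"sample {i}", "categories": ["Violence"]} for i in range(1000)]
dropped = sum(1 for s in drop_categories(data) if s["categories"] is None)
print(f"{dropped / len(data):.0%} of samples had categories removed")
```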

Fine-tuning used LoRA with a grid search over learning rates (1e-7, 5e-6, 1e-5, 5e-5, 1e-4) and LoRA ranks (16, 32). The final run used 5 epochs, a 1e-4 learning rate, rank 16, and alpha 32.

Benchmark Positioning

Performance claims are framed around multimodal harmful-content classification. NVIDIA reports 84% average accuracy across Polyguard, RTP-LX, VLGuard, MM-SafetyBench, and FigStep.

It also claims roughly half the latency of larger multimodal safety models across mean, median, and P99 measurements, plus deployment feasibility on 8GB+ VRAM GPUs.

NVIDIA does not publish raw latency tables or a separate technical report alongside this release, so the practical takeaway is straightforward: the headline is credible enough to warrant evaluation, but you should benchmark it against your own image sizes, prompt templates, and policy thresholds before replacing an existing moderation tier. This is the same discipline you would apply to any LLM observability setup.
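Since the latency claim spans mean, median, and P99, your own harness should report the same three numbers. A minimal summary function over per-request latencies (the sample data below is synthetic):

```python
import statistics

# Benchmarking sketch: given per-request latencies in milliseconds from
# your own harness, compute the mean, median, and P99 that NVIDIA's
# "roughly half the latency" claim is framed in.

def latency_summary(samples_ms):
    ordered = sorted(samples_ms)
    # Nearest-rank P99; clamp for small sample counts.
    p99_idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_idx],
    }

samples = [120, 130, 125, 118, 410, 122, 128, 131, 119, 127]
print(latency_summary(samples))
# -> {'mean': 153.0, 'median': 126.0, 'p99': 410}
```

Note how the single 410 ms outlier dominates the P99 while barely moving the median, which is exactly why all three measurements matter when comparing guard models.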

Deployment Stack

Inference support includes Transformers and vLLM. The model card lists Transformers 4.57.1, vLLM >= 0.11.0, PyTorch 2.8.0, and Linux. NVIDIA lists compatibility with NVIDIA RTX PRO 6000 BSE, H100, and A100.

NVIDIA also says the model will be available as an NVIDIA NIM microservice in April 2026, giving developers a pre-packaged, GPU-optimized inference service for production moderation.

If you are deploying multimodal agents today, the immediate move is to test Nemotron 3 Content Safety as a front-door and tool-output classifier, then decide whether binary-only mode is enough for the hot path or whether category output gives you better routing, logging, and enforcement.
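The front-door-plus-tool-output pattern can be wired up as a guarded agent step. `classify` below is a stub standing in for an actual Nemotron 3 Content Safety call; the control flow is the point:

```python
# Wiring sketch: one moderation call at the front door (user input) and
# one on each tool output, using the binary gate in the hot path.

def classify(text: str, image=None) -> str:
    # Stub: a real implementation would call the moderation model with
    # the text (and any image) and parse its safe/unsafe verdict.
    return "unsafe" if "malware" in text.lower() else "safe"

def guarded_agent_step(user_input: str, run_tool):
    if classify(user_input) == "unsafe":
        return {"blocked": True, "stage": "input"}
    tool_output = run_tool(user_input)
    if classify(tool_output) == "unsafe":
        return {"blocked": True, "stage": "tool_output"}
    return {"blocked": False, "result": tool_output}

print(guarded_agent_step("summarize this report", lambda q: f"summary of: {q}"))
```

From here, swapping the binary verdict for category output only changes what the blocked branches return, not the shape of the loop.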
