
How to Fine-Tune Qwen3 on AMD MI300X Using ROCm

Learn how to configure ROCm 6.1 environment variables and use the Hugging Face stack to fine-tune Qwen3-1.7B on AMD hardware without CUDA.

The MedQA project debuted on May 8, 2026, demonstrating the seamless training of clinical AI models on AMD hardware without the NVIDIA CUDA ecosystem. Developed during the lablab.ai AMD Developer Hackathon, the architecture leverages the AMD Instinct MI300X to train a specialized medical reasoning model natively on ROCm 6.1+. You can use this exact approach to run your own fine-tuning workloads on AMD compute instances using the standard Hugging Face pipeline.

Running advanced machine learning workloads on AMD GPUs requires bypassing traditional CUDA-centric dependencies. This configuration eliminates the need for code rewrites when moving established training scripts to AMD hardware. The default Hugging Face libraries process the workload efficiently when the environment is directed to the correct hardware interfaces.

Bypassing CUDA with ROCm Variables

To run training scripts natively on ROCm, you must explicitly point your framework to the AMD hardware and override the default graphics architecture detection. Most AI frameworks assume a CUDA environment by default. You can override this behavior by setting specific environment variables before initializing your Python scripts.

Configure your environment to expose the AMD hardware and set the specific architecture version for the MI300X. Add the following variables to your setup:

```python
import os

os.environ["ROCR_VISIBLE_DEVICES"] = "0"
os.environ["HIP_VISIBLE_DEVICES"] = "0"
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "9.4.2"
```

The ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES variables instruct the runtime to target the first available AMD GPU, preventing the system from searching for NVIDIA hardware. The HSA_OVERRIDE_GFX_VERSION value is strictly required for the MI300X, which reports the gfx942 instruction set architecture. Setting it to 9.4.2 ensures compatibility with the instruction set used by this generation of Instinct accelerators.
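
After setting the variables, a quick sanity check confirms that PyTorch actually sees the accelerator. This is a minimal sketch assuming a ROCm-enabled PyTorch build, which exposes the HIP device through the familiar torch.cuda namespace:

```python
import torch

# ROCm builds of PyTorch reuse the torch.cuda namespace, so this
# returns True when the HIP runtime has found the AMD GPU.
print(torch.cuda.is_available())

# Should report the AMD device name rather than an NVIDIA part.
print(torch.cuda.get_device_name(0))
```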

Hardware Profile and Memory Utilization

The AMD Instinct MI300X features 192 GB of HBM3 memory and achieves a memory bandwidth of 5.3 TB/s. This immense memory capacity fundamentally changes how you configure the fine-tuning job. Instead of relying on typical memory reduction techniques like 4-bit or 8-bit quantization, the hardware can comfortably hold the base model, optimizer states, and gradients in higher precision.

The MedQA implementation uses Qwen3-1.7B as its base model. Given the 1.7-billion parameter count, the model footprint is small enough that the massive VRAM pool handles the entire process natively, so you can run the training phase in unquantized fp16. This avoids the precision degradation associated with heavily quantized weights, preserving the nuanced statistical relationships required for complex medical reasoning tasks.
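
As a minimal sketch of that setup, the following loads the base model in fp16 with no quantization config attached. The Qwen/Qwen3-1.7B repository id is assumed here as the base checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository id for the base checkpoint.
base_model_id = "Qwen/Qwen3-1.7B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# No BitsAndBytesConfig: the 192 GB HBM3 pool holds the fp16 weights,
# gradients, and optimizer states without 4-bit or 8-bit quantization.
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```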

Because of the high memory bandwidth and the lack of quantization overhead, the MI300X completes the Qwen3-1.7B fine-tuning pass in approximately five minutes.

Integrating the Hugging Face Stack

The MedQA training pipeline relies exclusively on the standard open-source ecosystem. You do not need specialized AMD forks of the core libraries. The ROCm 6.1+ implementation supports the standard Hugging Face stack natively.

Your environment must include the standard versions of Transformers, PEFT, TRL, and Accelerate. Once the environment variables are set, Accelerate detects the ROCm backend automatically and maps the tensor operations to the HIP implementation. The LoRA (Low-Rank Adaptation) method is handled entirely by the PEFT library.
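As a quick check that the stack has picked up the ROCm backend, you can inspect the device Accelerate resolves to; on a correctly configured MI300X it reports a cuda-style device backed by HIP rather than falling back to CPU. This is an illustrative sketch, not part of the MedQA pipeline itself:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# On a ROCm install this prints a cuda-indexed device (HIP-backed),
# confirming that tensor operations will run on the AMD GPU.
print(accelerator.device)
```
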

When configuring PEFT for LoRA on ROCm, use the exact same configuration blocks you would use in a CUDA environment. The adapter weights will attach to the Qwen3-1.7B attention layers exactly as specified in the standard documentation. The resulting model artifact, such as the HK2184/medqa-qwen3-lora weights generated by the project, operates as a standard Hugging Face model.
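
A LoRA configuration for this setup looks identical to what you would write on CUDA. The rank, alpha, and dropout values below are illustrative assumptions rather than the MedQA project's exact hyperparameters; the target module names follow the attention projections used by Qwen-style architectures:

```python
from peft import LoraConfig, get_peft_model

# Illustrative hyperparameters; the project's exact values may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention projection names used by Qwen-style architectures.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Attaches the low-rank adapters to the fp16 base model loaded earlier.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```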

Dataset Formatting for Explainability

The goal of fine-tuning in this context is to force the model to conform to a highly specific output structure. The MedQA implementation uses the MedMCQA dataset, which consists of multiple-choice questions from Indian medical entrance exams.

Generalist models often fail to provide consistent, machine-readable outputs alongside their chain of thought. To correct this, the training data must be formatted to demand two distinct outputs per prompt: the correct answer letter (A–D) and a clinical explanation justifying that letter.

Structure your training JSONL files to separate the instruction from the dual-output target. The model learns to output the deterministic answer letter first, followed by the reasoning text. This approach makes it easier to evaluate AI output programmatically using strict string matching on the first character, while preserving the explanatory text for human review.
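
A minimal formatting sketch is shown below. It assumes the publicly documented MedMCQA schema (question, opa through opd, cop for the correct option index, exp for the explanation) and the openlifescienceai/medmcqa dataset id on the Hub; the prompt template and output file name are illustrative, not the project's exact format:

```python
import json

from datasets import load_dataset

# Assumed Hub id and field names; adjust if your copy of MedMCQA differs.
dataset = load_dataset("openlifescienceai/medmcqa", split="train")

def to_record(example):
    letters = ["A", "B", "C", "D"]
    options = [example["opa"], example["opb"], example["opc"], example["opd"]]
    prompt = example["question"] + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip(letters, options)
    )
    # Deterministic answer letter first, clinical explanation second.
    answer = letters[example["cop"]]
    explanation = example.get("exp") or ""
    return {"prompt": prompt, "response": f"{answer}\n{explanation}".strip()}

with open("medqa_train.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(to_record(example)) + "\n")
```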

Tradeoffs and Application Context

While the MI300X provides exceptional training speed for smaller architectures, this approach involves specific hardware constraints. The environment variable overrides apply directly to this specific Instinct accelerator. Moving the workload to a consumer-grade AMD card or a different enterprise generation requires adjusting the HSA_OVERRIDE_GFX_VERSION value to match the target silicon.
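
If you do move to different silicon, the override must match that card's reported ISA. A few commonly cited mappings are sketched below as assumptions; confirm the value for your specific device (for example via rocminfo) before relying on it:

```python
# Commonly cited gfx-to-override mappings; verify against the ISA your
# card actually reports before exporting HSA_OVERRIDE_GFX_VERSION.
HSA_GFX_OVERRIDES = {
    "MI300X (gfx942)": "9.4.2",
    "MI250X (gfx90a)": "9.0.10",
    "RX 7900 XTX (gfx1100)": "11.0.0",
    "RX 6900 XT (gfx1030)": "10.3.0",
}
```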

The resulting 1.7B model serves a distinct architectural purpose. Recent clinical benchmarks show that frontier models routinely hit 96% accuracy or higher on medical evaluations. The Qwen3-1.7B MedQA model is not designed to beat these massive APIs in broad knowledge retrieval. Instead, it provides a highly capable, domain-specific agent that runs efficiently on localized, secure hardware, fulfilling the strict data privacy requirements common in clinical settings.

Deploy your trained LoRA weights using standard inference engines that support the ROCm backend. Configure your inference server to load the base Qwen3-1.7B model and apply the PEFT adapter weights at runtime to serve the specialized clinical endpoints.
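
A minimal runtime-loading sketch with Transformers and PEFT is shown below. It uses the HK2184/medqa-qwen3-lora adapter id mentioned above and assumes Qwen/Qwen3-1.7B as the base repository; the prompt is a placeholder:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen3-1.7B"          # assumed base checkpoint id
adapter_id = "HK2184/medqa-qwen3-lora"     # adapter produced by the project

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)

# Apply the LoRA adapter on top of the fp16 base weights at load time.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Placeholder multiple-choice prompt in the training format.
prompt = "A 45-year-old presents with...\nA. ...\nB. ...\nC. ...\nD. ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```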
