Tunix Hackathon Yields 1B- and 2B-Parameter Gemma Reasoning Models

Google released the results of its Tunix hackathon, showing how developers trained small Gemma models to produce reasoning traces on a strict compute budget.

On May 28, 2026, Google Developers AI published the results of the Google Tunix Hack, a competition in which more than 11,000 developers trained small base models to generate structured reasoning traces. Participants had to turn Gemma-2-2B and Gemma-3-1B into models that write out their reasoning inside <reasoning> tags before giving a final answer.
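To make the output format concrete, here is a minimal sketch of a format check for such traces. The article only specifies the <reasoning> tags; treating everything after the closing tag as the final answer, and the names split_trace and format_reward, are assumptions made for illustration rather than the competition's actual grader.

```python
import re

# Assumed layout: "<reasoning>...</reasoning>" followed by the final answer.
# Only the <reasoning> tag itself comes from the article.
_TRACE = re.compile(r"<reasoning>(.*?)</reasoning>(.*)", re.DOTALL)


def split_trace(completion: str):
    """Return (reasoning, answer), or None if the format is not followed."""
    match = _TRACE.fullmatch(completion.strip())
    if match is None:
        return None
    reasoning, answer = (part.strip() for part in match.groups())
    if not reasoning or not answer:
        return None  # an empty trace or a missing answer counts as malformed
    return reasoning, answer


def format_reward(completion: str) -> float:
    """Hypothetical binary reward: 1.0 for a well-formed trace, else 0.0."""
    return 1.0 if split_trace(completion) is not None else 0.0
```

A check like this is typically used as one term among several rewards, so that a model is not credited for a correct answer it produced without showing any reasoning.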

Compute Constraints and Tooling

The competition enforced strict resource limits to simulate real-world developer constraints. Participants had at most 9 hours to train their models on a single Kaggle TPU v5e-8 instance with 16 GB of HBM per core. These limits pushed submissions to optimize training efficiency rather than rely on brute-force scaling. Developers used Tunix, Google’s JAX-native library for LLM post-training, to run these pipelines.
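One simple way to stay inside a hard session limit like this is a wall-clock guard around the training loop. The sketch below is framework-agnostic and is not part of Tunix; train_with_deadline, step_fn, and reserve_seconds are hypothetical names, and the injectable clock exists only so the logic can be tested without waiting.

```python
import time


def train_with_deadline(step_fn, max_steps, budget_seconds,
                        reserve_seconds=0.0, clock=time.monotonic):
    """Run step_fn(step) until max_steps or until the time budget is used up.

    reserve_seconds is held back for work that must happen after training,
    such as saving a checkpoint. Returns the number of completed steps.
    """
    start = clock()
    completed = 0
    while completed < max_steps:
        # Check the clock before each step so no step starts past the deadline.
        if clock() - start >= budget_seconds - reserve_seconds:
            break
        step_fn(completed)
        completed += 1
    return completed
```

For a 9-hour session the budget would be 9 * 3600 = 32,400 seconds. The guard only checks between steps, so the reserve should be at least as long as the slowest single step plus the time needed to write the final checkpoint.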

Winning Reasoning Pipelines

The winning submissions showed that complex reasoning capabilities can be distilled into 1B and 2B models by combining supervised fine-tuning with Group Relative Policy Optimization (GRPO) and, in at least one entry, Simple Preference Optimization (SimPO).
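For readers unfamiliar with the two objectives, the sketch below shows their core arithmetic in plain Python. It follows the published formulations as commonly stated (group-normalized rewards for GRPO, a length-normalized log-probability margin for SimPO). It is not the Tunix API or any entrant's code, and the default beta and gamma values are placeholders.

```python
import math
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward within its group.

    rewards holds the scores of several completions sampled for one prompt.
    Population standard deviation is used here; implementations differ.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair, given per-token log-probabilities.

    Averaging per-token log-probs gives the length-normalized sequence score;
    no reference model is involved.
    """
    margin = beta * fmean(chosen_logps) - beta * fmean(rejected_logps) - gamma
    x = -margin
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Note that when every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason graded rewards can be more useful than pass/fail checks.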

The first-place entry, G-RaR (Rubrics as Rewards), combined LoRA Supervised Fine-Tuning (SFT) with GRPO. Instead of a binary correctness check, G-RaR used a larger Gemma-3-12B model as a judge to score intermediate reasoning steps against task-specific rubrics. This rubric-based scoring proved highly effective for structured logic tasks without requiring massive datasets.
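The general idea of a rubric reward can be sketched as a weighted average of per-criterion judge scores. The article does not describe G-RaR's actual rubrics, prompts, or aggregation, so everything below (the Criterion type, the weights, and the judge callable standing in for a call to a judge model) is a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    description: str  # what the judge is asked to check
    weight: float     # relative importance of this criterion


# A judge maps (reasoning text, criterion description) to a score in [0, 1].
# In practice this would wrap a call to a judge model; here it is any callable.
Judge = Callable[[str, str], float]


def rubric_reward(reasoning: str, rubric: Sequence[Criterion], judge: Judge) -> float:
    """Weighted average of clamped per-criterion scores, in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    score = 0.0
    for criterion in rubric:
        raw = judge(reasoning, criterion.description)
        score += criterion.weight * min(1.0, max(0.0, raw))  # clamp judge output
    return score / total_weight
```

Because the result is a graded value rather than pass or fail, two completions that both reach a wrong answer can still receive different rewards, giving a group-relative method like GRPO something to rank.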

The second-place entry, Pinocchio-1B, chained SFT, SimPO, and GRPO into an “Act” pipeline tuned to complete within the 9-hour TPU session. The third-place IDEA-E Distillation project introduced a TF-IDF-based reasoning reward. This reward kept the text generated inside the reasoning tags substantively different from the final answer, preventing the model from simply leaking the conclusion early. An honorable mention, Gemmatron, used synthetic data generated by Gemini 2.5 Pro and Flash to build a cognitive backbone via SFT before applying GRPO.
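One plausible reading of the TF-IDF reward is a penalty on lexical overlap between the reasoning and the answer. The sketch below scores a completion as one minus the TF-IDF cosine similarity of the two texts, with a smoothed IDF estimated from a reference corpus. The article does not give IDEA-E's formula, so the tokenizer, the smoothing, and the name divergence_reward are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(corpus):
    """Count, for each token, how many corpus documents contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def tfidf(text, df, n_docs):
    """Sparse TF-IDF vector using smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(tokenize(text))
    return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def divergence_reward(reasoning, answer, corpus):
    """Higher when the reasoning text differs lexically from the answer."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    sim = cosine(tfidf(reasoning, df, n_docs), tfidf(answer, df, n_docs))
    return max(0.0, 1.0 - sim)
```

On its own this sketch would give full reward to empty or off-topic reasoning, so a reward of this kind only makes sense alongside correctness and format terms.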

Tunix Post-Training Capabilities

The hackathon served as a large-scale stress test for Tunix v0.1.0. Built for high-performance TPU execution and integration with MaxText, the library supports SFT, PPO, GRPO, GSPO-token, and DPO. Early benchmarking during the event showed that Tunix implementations improved the GSM8K pass@1 accuracy of Gemma-2-2B-IT by approximately 12%. For developers, Tunix offers a dedicated JAX-native alternative to other post-training frameworks.

The event also resulted in the public release of domain-specific reasoning models tailored for medical, chemistry, legal, and robotics use cases. The winning recipes, training datasets, and codebases are now available in the Tunix GitHub repository and on Kaggle. If you deploy lightweight models for domain-specific applications, you can use these open-source pipelines to add structured reasoning traces to your 1B and 2B parameter deployments without increasing inference hardware requirements.
