Blog
AI engineering insights, practical advice, and things I'm learning.
AI Engineering
How to Run In-Loop Model Evaluations With olmo-eval
Learn how to set up olmo-eval to test large language model checkpoints during the training process using vLLM, LiteLLM, and Docker-based agent sandboxes.
LLM Evaluation · Model Training · vLLM · LiteLLM
AI Engineering
How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup
Learn how to analyze PyTorch profiler traces and implement Liger kernel fusion to significantly reduce memory bandwidth bottlenecks in transformer models.
PyTorch · Kernel Fusion · Inference Optimization
AI Engineering
How to Serve DiffusionGemma Locally With vLLM
Learn how to deploy Google's 26B text diffusion model on local hardware to achieve massive parallel generation speeds using vLLM and Hugging Face.
Diffusion Models · Local Deployment · vLLM Inference
AI Engineering
How to Route GPU GitHub Actions to Hugging Face Jobs
Offload your training and GPU-heavy CI workloads to Hugging Face Jobs using their new ephemeral GitHub runners and action integrations.
GitHub Actions · Hugging Face · GPU Computing
AI Engineering
How to Call Claude 4.5 via Apple Foundation Models Framework
Learn how to integrate Claude 4.5 into your Swift applications using Apple's new Foundation Models framework for hybrid on-device and cloud processing.
Claude 4.5 · Apple Foundation Models · Swift Programming
AI Engineering
How to Provision Google Colab GPUs From the Command Line
Learn how to install the Google Colab CLI, provision high-performance remote GPUs from your local terminal, and execute headless machine learning workflows.
Google Colab · GPU Provisioning · Command Line Interface
AI Engineering
How to Stop OCR Degeneration With DharmaOCR Lite 3B
Dharma-AI's new DharmaOCR models apply DPO to eliminate autoregressive looping. Learn how to configure the 3B parameter model for structured JSON extraction.
Optical Character Recognition · Direct Preference Optimization · Structured Data Extraction
AI Engineering
How to Find GPU Gaps in PyTorch 2.12 With torch.profiler
Learn how to identify performance bottlenecks and idle GPU lanes using the native torch.profiler in PyTorch 2.12 across Blackwell and AMD hardware.
PyTorch · GPU Optimization · Performance Profiling
AI Engineering
How to Automate Google Pay Integrations With MCP
Connect your AI development environment to real-time merchant data and documentation using the new Google Pay and Wallet Developer MCP server.
MCP Server · Google Pay · Workflow Automation
AI Engineering
How to Cut Checkpoint Time by 85% With TRL Delta Weight Sync
Learn how to configure TRL Delta Weight Sync to reduce trillion-parameter model checkpointing times by 85 percent using Hugging Face Hub Buckets.
Hugging Face · Checkpointing · TRL Library
AI Engineering
How to Run Gemma 4 On-Device with LiteRT-LM
Learn how to configure LiteRT-LM to deploy the Gemma 4 model family locally across mobile, desktop, and edge environments with constrained JSON decoding.
Gemma 4 · LiteRT-LM · On-Device AI
AI Engineering
How to Fine-Tune Cosmos Predict 2.5 for Robotics With LoRA
Learn how to adapt NVIDIA's 2B and 14B Cosmos Predict 2.5 world foundation models using parameter-efficient fine-tuning methods like LoRA and DoRA.
Fine-Tuning · LoRA DoRA · World Models
AI Engineering
How to Scale PyTorch Training With AWS Building Blocks
Learn how to configure AWS infrastructure and Hugging Face tools to optimize large-scale foundation model pre-training and inference workflows.
PyTorch · AWS Cloud · Foundation Models
AI Engineering
How to Fine-Tune Qwen3 on AMD MI300X Using ROCm
Learn how to configure ROCm 6.1 environment variables and use the Hugging Face stack to fine-tune Qwen3-1.7B on AMD hardware without CUDA.
Fine-Tuning · AMD ROCm · Qwen3
AI Engineering
How to Implement Event-Driven Webhooks in the Gemini API
Learn how to configure static and dynamic webhooks in the Gemini API to eliminate polling overhead for long-running AI operations and agent workflows.
Gemini API · Webhooks · Event-Driven Architecture
AI Engineering
How to Build Cross-Modal RAG Pipelines With Gemini Embedding 2
Learn how to process text, images, video, and audio into a single semantic vector space using Google's natively multimodal Gemini Embedding 2 model.
Multimodal RAG · Gemini Embedding 2 · Vector Databases
AI Engineering
Google Graduates LiteRT NPU Acceleration to Production
Learn how to configure LiteRT for hardware-accelerated on-device AI inference using Google's production-ready NPU capabilities.
LiteRT · On-Device AI · NPU Acceleration
AI Engineering
Build Real-Time Voice Agents with Cloudflare Agents SDK
Learn how to integrate low-latency voice interactions into your AI agents using Cloudflare's new @cloudflare/voice package and Durable Objects.
Cloudflare Workers · Voice AI · STT
AI Engineering
Build a Fast Multilingual OCR with Nemotron-OCR-v2
Learn how to deploy NVIDIA Nemotron-OCR-v2 for high-speed document extraction across six languages using synthetic data and GPU acceleration.
NVIDIA Nemotron · Multilingual OCR · Synthetic Data
AI Engineering
Train Multimodal Sentence Transformers for Visual Retrieval
Learn how to fine-tune multimodal embedding and reranker models for text, image, and audio using the updated Sentence Transformers library.
Sentence Transformers · Multimodal AI · Embedding Models