Blog

AI engineering insights, practical advice, and things I'm learning.

AI Engineering

How to Run In-Loop Model Evaluations With olmo-eval

Learn how to set up olmo-eval to test large language model checkpoints during the training process using vLLM, LiteLLM, and Docker-based agent sandboxes.

LLM Evaluation · Model Training · vLLM · LiteLLM

AI Engineering

How to Fuse PyTorch MLP Kernels for a 30% Inference Speedup

Learn how to analyze PyTorch profiler traces and implement Liger kernel fusion to significantly reduce memory bandwidth bottlenecks in transformer models.

PyTorch · Kernel Fusion · Inference Optimization

AI Engineering

How to Serve DiffusionGemma Locally With vLLM

Learn how to deploy Google's 26B text diffusion model on local hardware to achieve massive parallel generation speeds using vLLM and Hugging Face.

Diffusion Models · Local Deployment · vLLM Inference

AI Engineering

How to Route GPU GitHub Actions to Hugging Face Jobs

Offload your training and GPU-heavy CI workloads to Hugging Face Jobs using their new ephemeral GitHub runners and action integrations.

GitHub Actions · Hugging Face · GPU Computing

AI Engineering

How to Call Claude 4.5 via Apple Foundation Models Framework

Learn how to integrate Claude 4.5 into your Swift applications using Apple's new Foundation Models framework for hybrid on-device and cloud processing.

Claude 4.5 · Apple Foundation Models · Swift Programming

AI Engineering

How to Provision Google Colab GPUs From the Command Line

Learn how to install the Google Colab CLI, provision high-performance remote GPUs from your local terminal, and execute headless machine learning workflows.

Google Colab · GPU Provisioning · Command-Line Interface

AI Engineering

How to Stop OCR Degeneration With DharmaOCR Lite 3B

Dharma-AI's new DharmaOCR models apply DPO to eliminate autoregressive looping. Learn how to configure the 3B parameter model for structured JSON extraction.

Optical Character Recognition · Direct Preference Optimization · Structured Data Extraction

AI Engineering

How to Find GPU Gaps in PyTorch 2.12 With torch.profiler

Learn how to identify performance bottlenecks and idle GPU lanes using the native torch.profiler in PyTorch 2.12 across Blackwell and AMD hardware.

PyTorch · GPU Optimization · Performance Profiling

AI Engineering

How to Automate Google Pay Integrations With MCP

Connect your AI development environment to real-time merchant data and documentation using the new Google Pay and Wallet Developer MCP server.

MCP Server · Google Pay · Workflow Automation

AI Engineering

How to Cut Checkpoint Time by 85% With TRL Delta Weight Sync

Learn how to configure TRL Delta Weight Sync to reduce trillion-parameter model checkpointing times by 85 percent using Hugging Face Hub Buckets.

Hugging Face · Checkpointing · TRL Library

AI Engineering

How to Run Gemma 4 On-Device with LiteRT-LM

Learn how to configure LiteRT-LM to deploy the Gemma 4 model family locally across mobile, desktop, and edge environments with constrained JSON decoding.

Gemma 4 · LiteRT-LM · On-Device AI

AI Engineering

How to Fine-Tune Cosmos Predict 2.5 for Robotics With LoRA

Learn how to adapt NVIDIA's 2B and 14B Cosmos Predict 2.5 world foundation models using parameter-efficient fine-tuning methods like LoRA and DoRA.

Fine-Tuning · LoRA and DoRA · World Models

AI Engineering

How to Scale PyTorch Training With AWS Building Blocks

Learn how to configure AWS infrastructure and Hugging Face tools to optimize large-scale foundation model pre-training and inference workflows.

PyTorch · AWS Cloud · Foundation Models

AI Engineering

How to Fine-Tune Qwen3 on AMD MI300X Using ROCm

Learn how to configure ROCm 6.1 environment variables and use the Hugging Face stack to fine-tune Qwen3-1.7B on AMD hardware without CUDA.

Fine-Tuning · AMD ROCm · Qwen3

AI Engineering

How to Implement Event-Driven Webhooks in the Gemini API

Learn how to configure static and dynamic webhooks in the Gemini API to eliminate polling overhead for long-running AI operations and agent workflows.

Gemini API · Webhooks · Event-Driven Architecture

AI Engineering

How to Build Cross-Modal RAG Pipelines With Gemini Embedding 2

Learn how to process text, images, video, and audio into a single semantic vector space using Google's natively multimodal Gemini Embedding 2 model.

Multimodal RAG · Gemini Embedding 2 · Vector Databases

AI Engineering

Google Graduates LiteRT NPU Acceleration to Production

Learn how to configure LiteRT for hardware-accelerated on-device AI inference using Google's production-ready NPU capabilities.

LiteRT · On-Device AI · NPU Acceleration

AI Engineering

Build Real-Time Voice Agents with Cloudflare Agents SDK

Learn how to integrate low-latency voice interactions into your AI agents using Cloudflare's new @cloudflare/voice package and Durable Objects.

Cloudflare Workers · Voice AI · STT

AI Engineering

Build a Fast Multilingual OCR with Nemotron-OCR-v2

Learn how to deploy NVIDIA Nemotron-OCR-v2 for high-speed document extraction across six languages using synthetic data and GPU acceleration.

NVIDIA Nemotron · Multilingual OCR · Synthetic Data

AI Engineering

Train Multimodal Sentence Transformers for Visual Retrieval

Learn how to fine-tune multimodal embedding and reranker models for text, image, and audio using the updated Sentence Transformers library.

Sentence Transformers · Multimodal AI · Embedding Models