Blog
Machine Learning and Recommender Systems
-
Claude Tool Use Cheatsheet
A one-page reference for the Messages API tool-use loop: request shape, response blocks, stop_reason, tool_choice, schema rules, server tools, and common gotchas.
-
Semantic IDs: Replacing Random Item IDs with Content-Derived Tokens
How YouTube compresses a 2048-dim video embedding into 8 integer tokens with RQ-VAE, and uses SentencePiece to plug them into a production ranking model. A detailed walkthrough of Singh et al. 2023.
-
From Reasoning to Agentic Thinking: What Changes and Why It's Hard
The shift from static reasoning (o1, R1) to agentic thinking changes everything: RL infrastructure, environment design, reward signals, and what "good thinking" even means.
-
TurboQuant Animated: Watch Vector Quantization Happen
Interactive 2D and 3D animations showing every step of TurboQuant: normalize, rotate, quantize, reconstruct. Add your own points and see the compression error in real time.
-
Seeing Tensors: PyTorch Operations Animated
Interactive animations showing exactly how reshape, transpose, broadcasting, matmul, and softmax transform tensors. Watch the data move.
-
Attention Residuals: When Residual Connections Learn to Be Selective
Standard residual connections accumulate every layer's output with equal weight, causing dilution. AttnRes replaces this with softmax attention over depth, the same linear-to-softmax transition that transformed sequence modeling. A linear algebra perspective.
-
Diffusion Models: The Intuition Behind Noise-Based Generation
How diffusion models generate images by learning to reverse noise. The forward process, reverse process, U-Net architecture, classifier-free guidance, and latent diffusion explained from first principles.
-
CLIP: Connecting Vision and Language
How CLIP learns a shared embedding space for images and text using contrastive learning. Architecture, training, zero-shot classification, and why it matters for modern multimodal AI.
-
Attention Mechanisms: From Vanilla to GQA and Beyond
How self-attention works in LLMs, why it is expensive, and how MQA, GQA, sliding window, and FlashAttention make it practical.
-
Test-Time Scaling: Spend Compute When It Matters
You can scale LLMs at inference, not just training. A guide to the two paradigms of test-time compute, the verification bottleneck, and when thinking longer actually helps.
-
Teaching LLMs to Use Tools
How supervised fine-tuning turns a text generator into a tool-using agent. Data formats, special tokens, loss masking, and incremental complexity.
-
Beyond GRPO: New Policy Optimization Methods for LLMs
What GRPO gets wrong and how Dr.GRPO, DAPO, GSPO, and SAPO fix it. The unified view of policy optimization for LLM alignment.
-
Recursive Reasoning with Tiny Networks
How a 7M-parameter network beats billion-parameter LLMs on reasoning puzzles by recursively refining its answer. The key ideas behind Tiny Recursion Models (TRM).
-
DeltaNet: Linear Transformers with the Delta Rule
An intuition-first guide to DeltaNet: why linear attention has a memory problem, how the delta rule fixes it, and the chunkwise parallel algorithm that makes it trainable on GPUs.
-
PyTorch Ops You Actually Use: From Tensors to Transformers
The operations that actually matter when you implement models in PyTorch: dimensions, view, einsum, broadcasting, matmul, and how they compose into a working transformer.
-
Mamba: Replacing Attention with Selective State Spaces
How Mamba replaces the quadratic attention bottleneck with linear-time selective state space models. SSMs, the selection mechanism, hardware-aware selective scan, and the trade-offs vs. transformers.
-
Vector Addition, Matrix Multiplication, and What They Mean in LLMs
A geometric guide to the two primitive vector operations inside transformers: addition as translation, multiplication as reshaping space, and why LLMs need both.
-
How Visual Language Models Are Trained
The three-stage training recipe behind LLaVA, InternVL, Qwen-VL, and other VLMs. Vision encoders, projection layers, multi-stage training, and data formats explained.
-
Training MoE Right: Making Every Expert Count
The techniques that prevent expert collapse in Mixture-of-Experts LLMs: load balancing losses, routing strategies, shared experts, auxiliary-loss-free methods, and fine-grained expert segmentation.
-
Feel the AGI: Supervised Fine-Tuning in Your Browser
Fine-tune a 14M parameter language model in your browser. Load Pythia-14M, train on instruction-completion pairs, and watch it learn.
-
Scaling Laws for LLMs: The 3 Knobs You Actually Have
Scaling is not "bigger model." It's a budget allocation problem across parameters, tokens, and compute. A first-principles guide to what scaling laws actually say.
-
WTF Is Happening Inside a Transformer (Linear Algebra Edition)
An intuition-first guide to what transformers actually compute. Q, K, V demystified, attention as a data-dependent mixing matrix, and the MLP's expand-activate-compress pattern.
-
DeepSeek's Technical Playbook: From MLA to Conditional Memory
A deep dive into DeepSeek's key innovations: Multi-head Latent Attention, sparse MoE, sparse attention, scalable RL, and the Engram conditional memory architecture.
-
Reward Modeling and DPO: Learning What "Good" Means
How reward models turn human preferences into training signal, and how DPO skips the reward model entirely. Bradley-Terry, preference data, and offline alignment explained.
-
Positional Encodings for LLMs: From Sinusoidal to RoPE
How transformers understand token order. An intuition-first guide covering sinusoidal positional encodings and Rotary Position Embeddings (RoPE).
-
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
How DeepSeek uses idle decode-side NICs to double KV-Cache loading throughput in prefill-decode disaggregated serving.
-
Reinforcement Learning for LLMs
An intuition-first guide to the RL concepts behind RLHF, PPO, and GRPO: the background you need before diving into alignment algorithms.
-
PPO & GRPO for LLM Alignment
A first-principles guide to PPO and GRPO for LLM alignment, for ML engineers with minimal RL background.
-
Hashing for large scale similarity
Machine Learning
-
Implementing Matrix Factorisation using Tensorflow
My Quora response
-
How exactly is machine learning used in recommendation engines?
My Quora response