A growing collection of projects.
From a single-thread baseline to a vectorized fused CUDA kernel
Build a self-improving AI agent from scratch
A faithful, efficient PyTorch implementation of a decoder-only LM: RMSNorm, RoPE, GQA, SwiGLU, KV cache
A faithful implementation of MLA, DeepSeekMoE with aux-loss-free balancing, MTP heads, and FP8
Applying autoencoders to collaborative filtering