A physics-based humanoid simulation and reinforcement-learning training pipeline for bipedal locomotion, built on MuJoCo, Gymnasium, and Stable-Baselines3. It runs on both Apple Silicon and Linux, and spans everything from classical control baselines to deep-RL policies that learn to walk from scratch.
Humanoid locomotion is a proving ground for reward design, and most of the real lessons come from failure. This project documents them honestly: policies that learn to cheat rather than walk, silent normalization bugs that hid an already-working policy, and a negative result on using LLMs to iterate on reward functions.
Deep-dives on the build:
MuJoCo · Python · Gymnasium · Stable-Baselines3 (PPO / SAC / TD3) · NumPy