We trained a Unitree G1 humanoid (29 actuated joints, position-controlled) to walk by imitating a reference motion clip in MuJoCo, then stripped the reference down to see how little is enough.
The minimal version walked fastest. Stripping the reference signal did not just match the full-clip baseline — it beat it almost 2×. The constraint stopped being a corset.
This post reports the ablation, explains why fewer reference frames produce a faster gait, and flags the failure modes the minimal version still has (foreshadowing the reward-hacking story).
DeepMimic-style imitation learning is the workhorse of modern humanoid locomotion. Recent papers (PHC, HOVER, OmniH2O) keep pushing in one direction: more mocap data. 100,000 to 25 million reference frames is now standard for high-fidelity humanoid control.
We asked the opposite question. How little can you get away with and still produce a viable bipedal gait?
This is not just academic. Real-world humanoid deployment is bottlenecked on mocap availability. Capturing a high-quality motion-capture session is slow, expensive, and constrained to studio environments. If two static poses suffice for the most basic walking behavior, the implication ripples downstream: any system that can extract two poses from a photograph, a video frame, or a textbook diagram has enough information to seed a humanoid walking policy.
That is a wildly different starting point than what the field currently assumes.
Robot. Unitree G1, 29 actuated joints. The G1 is position-controlled at the actuator level: each joint runs a built-in PD loop (kp = 500) and accepts target angles, not torques. We learned this the hard way early in the project — applying our own external PD on top of the built-in one is mathematically equivalent to a much stiffer system with the wrong dynamics, and the robot falls after about 50 steps. With the built-in PD respected, the same policy code stays upright.
Simulator. MuJoCo at 50 Hz control / 500 Hz physics, using the official Menagerie G1 model.
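Concretely, the 50 Hz control / 500 Hz physics split means each policy action is held for ten physics substeps. A minimal sketch of that loop (the model path is an assumption, and this is the generic decimation pattern rather than our exact wrapper):

```python
import mujoco

model = mujoco.MjModel.from_xml_path("unitree_g1/scene.xml")  # Menagerie scene path is an assumption
data = mujoco.MjData(model)

PHYSICS_DT = model.opt.timestep                      # 1/500 s in our configuration
CONTROL_HZ = 50
N_SUBSTEPS = round((1.0 / CONTROL_HZ) / PHYSICS_DT)  # 10 physics steps per policy action

def apply_action(target_angles):
    """Hold one policy action for a full control period."""
    data.ctrl[:] = target_angles                     # position actuators: targets, not torques
    for _ in range(N_SUBSTEPS):
        mujoco.mj_step(model, data)
```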
Algorithm. PPO from Stable-Baselines3, 256×256 MLP policy. 16 parallel environments. Each ablation variant trained for ~10–20 million PPO steps, ~90–180 minutes on a single 4 GB laptop GPU. No multi-node training, no extravagant compute footprint.
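A minimal sketch of the training setup, assuming Stable-Baselines3 defaults for anything not listed above (the env factory, file names, and hyperparameters beyond network size and environment count are placeholders):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecNormalize

# make_walk_env is a placeholder for our Gymnasium wrapper around the G1 MuJoCo scene.
venv = SubprocVecEnv([make_walk_env for _ in range(16)])
venv = VecNormalize(venv, norm_obs=True, norm_reward=True)

model = PPO("MlpPolicy", venv, policy_kwargs=dict(net_arch=[256, 256]), verbose=1)
model.learn(total_timesteps=20_000_000)   # top of the ~10-20M range stated above

model.save("g1_walk")
venv.save("g1_walk_vecnormalize.pkl")     # this file must stay paired with the checkpoint
```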
Reward. Standard DeepMimic kernels — 0.65 pose-tracking, 0.15 velocity-tracking, 0.10 root orientation, 0.10 root height — plus a task reward (alive bonus + forward velocity). The mix is intentionally vanilla. The point of the experiment is to vary the reference data fed into the pose-tracking term, not the reward shape.
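A sketch of that mix with DeepMimic-style exponential kernels. The 0.65 / 0.15 / 0.10 / 0.10 weights are as described; the kernel scales and the task-term weights below are illustrative assumptions:

```python
import numpy as np

def walking_reward(qpos, qvel, ref_qpos, ref_qvel,
                   root_tilt_err, root_height, ref_height, forward_vel):
    # DeepMimic-style kernels: each term decays exponentially with squared error.
    r_pose   = np.exp(-2.0  * np.sum((qpos - ref_qpos) ** 2))    # joint-angle tracking
    r_vel    = np.exp(-0.1  * np.sum((qvel - ref_qvel) ** 2))    # joint-velocity tracking
    r_orient = np.exp(-10.0 * root_tilt_err ** 2)                # root orientation error (rad)
    r_height = np.exp(-40.0 * (root_height - ref_height) ** 2)   # root height tracking

    r_mimic = 0.65 * r_pose + 0.15 * r_vel + 0.10 * r_orient + 0.10 * r_height
    r_task  = 0.1 + 1.0 * forward_vel                            # alive bonus + forward velocity
    return r_mimic + r_task
```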
Reference data. Unitree's pre-retargeted LAFAN1 motion clip walk3_subject1. We auto-segment a single straight-walking cycle (~50 frames per cycle at 50 Hz), yaw-rotate it so the initial heading is +x, and use it as the canonical reference.
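A minimal sketch of the heading alignment, assuming a (T, 3) root-position array and a wxyz root quaternion for frame 0; cycle segmentation and the rotation of the orientation channel are omitted:

```python
import numpy as np

def align_heading_to_x(root_pos, root_quat0):
    """Yaw-rotate a clip's root positions so frame 0 faces +x."""
    w, x, y, z = root_quat0                          # wxyz layout is an assumption
    yaw0 = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    c, s = np.cos(-yaw0), np.sin(-yaw0)
    undo_yaw = np.array([[c, -s, 0.0],
                         [s,  c, 0.0],
                         [0.0, 0.0, 1.0]])
    return (root_pos - root_pos[0]) @ undo_yaw.T     # recentre, then rotate onto the +x heading
```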
Three reference-density variants:
| Variant | What we feed the pose-tracking term |
|---|---|
| Full reference | The entire 50-frame cycle, phase-synchronized to the simulation clock. |
| 10 keyframes | 10 evenly-spaced poses from the cycle, linearly interpolated back up to 50 Hz before the policy sees them. |
| 2 poses | Just mid-stance and mid-swing (extracted by hip-pitch argmax / argmin), interpolated to 50 Hz. |
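A concrete sketch of the 2-pose preprocessing, assuming the cycle is a (50, n_joints) array of joint angles at 50 Hz and that the interpolation runs stance to swing and back over one cycle:

```python
import numpy as np

def make_two_pose_reference(cycle_qpos, hip_pitch_idx, n_frames=50):
    # Mid-stance and mid-swing, picked as the hip-pitch extrema over the cycle.
    i_stance = int(np.argmax(cycle_qpos[:, hip_pitch_idx]))
    i_swing  = int(np.argmin(cycle_qpos[:, hip_pitch_idx]))
    stance, swing = cycle_qpos[i_stance], cycle_qpos[i_swing]

    # Linearly interpolate the pair back up to a full 50-frame cycle so the
    # pose-tracking term still sees one target pose per control step.
    half = n_frames // 2
    t = np.linspace(0.0, 1.0, half, endpoint=False)[:, None]
    first_half  = (1 - t) * stance + t * swing
    second_half = (1 - t) * swing + t * stance
    return np.concatenate([first_half, second_half], axis=0)   # (n_frames, n_joints)
```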
Anti-cheat. Two terminations protect against pathological policies discovering the reward structure: pelvis height below 0.3 m ends the episode (no knee-walking), and large pose-tracking error ends the episode (no micro-shuffle that ignores the reference). Without these guards, every reward signal we tried produced one of those two failure modes — a story we will tell in detail in a later post on reward hacking.
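A sketch of the two guards. The 0.3 m pelvis threshold is as described; the pose-error metric and its threshold below are assumptions:

```python
import numpy as np

PELVIS_MIN_HEIGHT = 0.3   # metres: below this the policy is effectively knee-walking
POSE_ERROR_MAX = 1.5      # radians of summed joint error: value is an assumption

def should_terminate(pelvis_height, qpos, ref_qpos):
    if pelvis_height < PELVIS_MIN_HEIGHT:
        return True                                            # no knee-walking
    if float(np.sum(np.abs(qpos - ref_qpos))) > POSE_ERROR_MAX:
        return True                                            # no micro-shuffle that ignores the reference
    return False
```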
Three baselines side-by-side: pure-RL (no imitation), full-mimic (50 frames), and a decaying-mimic variant that phases the imitation reward out over training (the pose-tracking weight decays linearly from 1.0 to 0.0 over 10M steps):
Pure RL learns a micro-step shuffle within minutes and stops improving. The mimic policies discover real walking. The point of the side-by-side is to show that the reference signal is doing genuine work — it is not just a regularizer.
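For reference, the decaying-mimic schedule is just a linear ramp on the pose-tracking weight; a one-line sketch, assuming the weight is recomputed from the global step count:

```python
def pose_tracking_weight(global_step, decay_steps=10_000_000):
    # Linear decay from 1.0 to 0.0 over the first 10M steps, then pure task reward.
    return max(0.0, 1.0 - global_step / decay_steps)
```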
Below, each of the three reference-density variants on the G1, recorded after training to 1000-step survival. Watch what changes — and what doesn't.
Full reference (50 frames). 0.69 m/s, 13.87 m. This is the post-VecNormalize-fix gold-standard policy (we tagged it v12_gold in our archives; the story of the bug that hid it is its own post on reproducibility).
10 keyframes. 0.46 m/s, 9.31 m. Same pipeline, 80% less reference data:
2 poses only. 1.22 m/s, 24.58 m. The minimal version. Just mid-stance and mid-swing, interpolated to 50 Hz:
All three policies were evaluated over five deterministic episodes with the same seed and the same clip_obs configuration to make the numbers comparable.
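A sketch of that evaluation protocol. The archive file names and the env factory are assumptions; the deterministic flag and the model/VecNormalize pairing are the parts that matter:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

venv = DummyVecEnv([make_walk_env])                                   # same env factory as training
venv = VecNormalize.load("archives/v12_gold/vecnormalize.pkl", venv)  # file name is an assumption
venv.training = False                                                 # freeze the running obs statistics
venv.norm_reward = False

model = PPO.load("archives/v12_gold/model.zip")                       # file name is an assumption
venv.seed(0)                                                          # same seed for every variant
for episode in range(5):
    obs = venv.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = venv.step(action)
```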
A note before the results, because it would be misleading to skip it.
The G1's position-controlled actuators are not just a calibration detail — they are a load-bearing assumption that broke everything we tried in the first two weeks. PyBullet examples and most RL-humanoid tutorials assume torque control: your action is a 29-dimensional torque vector, the simulator integrates it, you read the next state. The G1 in MuJoCo Menagerie uses position actuators with built-in PD loops at the joint level. Your action is a target angle, not a torque, and the simulator's internal PD does the actual integration.
If you ignore this — which we did, repeatedly — and stack your own PD controller on top, the system becomes equivalent to one with a much higher effective stiffness than either you or the simulator expects. Episodes that look stable for the first 50 steps fall apart catastrophically afterward, and no amount of reward tuning fixes it because the underlying dynamics are wrong.
The journal entry the day we figured this out was a single line:
Adding PD on top → robot falls after ~50 steps. This is mathematically equivalent to a much stiffer system with wrong dynamics.
Once we respected the actuator type, the same policy code stayed upright for 1000-step episodes. The mocap-ablation results that follow assume this fix is in place. If your humanoid policy collapses around step 50 with no visible cause, check the actuator type before touching the reward function.
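Checking takes a few lines with MuJoCo's Python bindings (the model path is an assumption):

```python
import mujoco

model = mujoco.MjModel.from_xml_path("mujoco_menagerie/unitree_g1/g1.xml")  # path is an assumption
for i in range(model.nu):
    name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_ACTUATOR, i)
    kp = model.actuator_gainprm[i, 0]           # <position> actuators store kp here
    biastype = model.actuator_biastype[i]       # affine bias means a built-in PD: ctrl is a target angle
    print(f"{name}: kp={kp}, biastype={biastype}")
```

If the gains are already in the model, the policy's action should be a target angle and nothing else; stacking another PD loop on top is how you get the step-50 collapse described above.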
The headline result — 2 poses beating 50 frames — sounds like a paradox. It is not. Here is what is actually happening.
The DeepMimic pose-tracking term does two jobs at once:

1. It tells the policy which pose comes next. The phase ordering breaks the left-right symmetry of the gait and encodes the direction of travel.
2. It pins down the amplitude and shape of every intermediate pose: how far the hip swings, how much the knee flexes, frame by frame.
When we drop from 50 frames to 2 poses, we lose job (2) entirely. The pose-tracking term still rewards the policy for being near one of the two poses, but it no longer constrains the trajectory between them. The policy is free to choose any amplitude that interpolates the endpoints, and PPO — with a velocity reward in the task term — picks the trajectory that maximizes forward speed.
In other words: the reference signal in the minimal variant is doing one job, not two, and the velocity reward is doing the second. The policy ends up running between two fixed poses instead of tracking a careful phase trajectory.
A quote from the project journal that captures this:
Temporal ordering of reference encodes direction. Static poses are direction-ambiguous. The two-pose variant works because the velocity reward also encodes direction — without it, the same two poses produce backward walking just as readily.
We confirmed this directly. A separate experiment (v14) used a single hand-encoded pose-pair from Winter's gait-analysis textbook — no mocap, no video, just hip and knee angles read off a static diagram — and ran the same pipeline with no velocity-reward bias. The robot walked backward at -0.18 m/s, perfectly stable, for 1000 steps. Two poses without temporal ordering are directionally ambiguous; the policy picks an arbitrary direction. We cover this in detail in the reward-hacking post — for now, just keep in mind that the 2-pose result here is using the full DeepMimic ordering (mid-stance → mid-swing), which is what produces forward motion.
This decomposition — direction vs. amplitude — explains the speed-up. The full reference forces a particular hip swing magnitude and a particular knee flexion shape. Those constraints add up. They pin the policy to "what walking looks like in the LAFAN1 dataset." When we strip them away, the policy finds a faster solution: longer strides, less torso wobble, more aggressive hip drive between the two endpoints.
There is also a less obvious mechanism: PPO's exploration. The full-reference pose-tracking term has a sharp gradient — small deviations from the per-frame target produce large reward drops. That gradient is helpful early in training (it gives the policy a strong "what to do next" signal from a randomly initialized network) but harmful late in training, because it disincentivizes the policy from exploring trajectories that deviate from the reference. The 2-pose variant has a much flatter pose-tracking landscape: as long as the policy passes near the two endpoints, the reward is roughly the same. That flatness invites exploration in the trajectory between the poses, and the velocity reward provides a strong gradient in the direction of "go faster." PPO ends up optimizing primarily for velocity, with pose-tracking acting as a soft directional constraint.
We can read this directly in the action statistics. The full-reference policy uses ~30% of its hip-pitch actuation range over a gait cycle; the 2-pose policy uses ~67%. The same is true for knee flexion. The minimal-reference policy is more aggressive across the board — the reference signal is not pulling it back to a particular pose at every timestep.
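The utilization metric itself is cheap to compute from a logged rollout; a sketch, assuming a (T, nu) log of commanded targets and the model's ctrlrange as the normalizer:

```python
import numpy as np

def range_utilization(ctrl_log, model, actuator_id):
    """Fraction of an actuator's ctrlrange swept over one logged gait cycle."""
    low, high = model.actuator_ctrlrange[actuator_id]
    swept = ctrl_log[:, actuator_id].max() - ctrl_log[:, actuator_id].min()
    return swept / (high - low)   # ~0.30 (full reference) vs ~0.67 (2 poses) for hip pitch
```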
The 2-pose policy is faster, but it is not strictly better along every axis.
Style. The 2-pose gait looks tense. Arms barely swing. The torso stays remarkably rigid. The full-reference policy looks closer to natural human walking even though it is slower. If you are training a robot for video demos or human-robot interaction, the full reference is still the right tool.
Variance. Across five evaluation episodes with the same seed, the 2-pose policy shows higher variance in episode-level velocity and trajectory shape than the 10-keyframe or full-reference policies. It works, but it is "spikier" — small perturbations push it into less stable regimes.
Lateral drift. The minimal variant ends each 1000-step episode 4.56 m off the straight line (vs. 2.44 m for full reference). The minimal reference does not constrain lateral foot placement well; the velocity reward alone does not pin the trajectory in 2D.
Sim-to-real risk. All of these experiments are in MuJoCo. A faster, less constrained policy may transfer worse to hardware because it relies more heavily on sim-specific exploits of the dynamics. We have not tested this, and it is the open question we are most curious about. The full-reference policy has more "wisdom" baked in from human motion; the 2-pose policy is more "discovered" — and discovered policies tend to overfit the simulator.
Reset robustness. Every policy in the ablation was reset from a single nominal standing pose. We did not measure how each one handles off-distribution starting states (e.g., dropped from 30 cm, started with one foot off the ground, pushed mid-stride). Our intuition: the full-reference policy is more robust because the reference clip implicitly defines a "good state" the policy learns to converge to; the 2-pose policy is more brittle because that anchor is much weaker. We come back to this in the LLM-reward-iteration post — robustness turns out to be a state-space-coverage problem, not a reward-design problem, and reference data is one cheap way to expand the coverage.
The high-data direction in humanoid imitation learning is well-established. PHC (NVIDIA, 2023) trains on the entire AMASS dataset — millions of frames spanning thousands of motions. HOVER and OmniH2O push further, with general-purpose humanoid controllers built on 25M+ frames. The premise is that humanoid morphology is too high-dimensional and too unstable to be learned with sparse supervision; only a dense, motion-rich reference signal can produce policies that generalize and look natural.
That premise is true for general-purpose control. If you want a single policy that can walk, run, dance, recover from a push, and stand up after a fall, you do need lots of data.
Our experiment lives in a different regime. We are asking what the minimum-viable reference is for a specific skill (walking) on a specific morphology (G1). The 2-pose result does not contradict PHC; it complements it by showing how cheap the floor is. A general-purpose controller plausibly still needs 25M frames. A walking-only policy plausibly needs 2 poses plus a velocity reward.
The interesting question is what sits in the middle. How many skills can you train into a single humanoid policy with, say, one carefully-chosen pose-pair per skill — walking, turning, sit-to-stand — and a global velocity / orientation reward? We have not run that experiment yet. The 2-pose result here suggests it is worth running.
If you are working on humanoid locomotion via imitation learning, the practical takeaway is that the mocap budget you need is much smaller than the literature suggests.
The deeper claim is methodological. For a robot to walk at all, the reference signal primarily needs to break the symmetry between left and right. Anything beyond that is style, not substance. The velocity reward — which is essentially free, you get it from the simulator without any mocap — does the rest.
If you are reaching for 25 million reference frames before you have tried 2 poses, you are starting your search at the wrong end of the data axis.
A few things we did not test that we would like to see: sim-to-real transfer of the 2-pose policy, reset robustness across the three variants, and the one-pose-pair-per-skill experiment sketched above.
Everything in this post is reproducible from the archived checkpoints. The eval script is src/humanoid_sim/scripts/full_analysis.py; pass it a model .zip and the matching VecNormalize file (we cover why that pair must stay together in the reproducibility post) and you get the metrics that the bar charts above came from.
Numbers shown in this post are verbatim from:
- archives/v12_gold/metrics.json — full reference, 0.693 m/s, 13.87 m
- archives/v9d_keyframes/metrics.json — 10 keyframes, 0.464 m/s, 9.31 m
- archives/v9e_posepair/metrics.json — 2 poses, 1.222 m/s, 24.58 m

The chart does not round past two decimal places; the prose rounds to two decimal places (0.69, 0.46, 1.22).
Two follow-up posts in this series get at the obvious next questions.
Once you remove the reference signal, the policy finds creative ways to "walk" that are not walking — knee-crawling, backward shuffling, single-leg hopping disguised as bipedal motion. Five Ways a Humanoid Cheats at Walking is the taxonomy.
And before any of these results were trustworthy, we had to fix a subtle reproducibility bug that hid a working policy for two weeks. The VecNormalize Trap is the diagnostic post — recommended reading if you are running PPO on humanoids and trust your training-time metrics.