AI/ML · 2026-04-18 · 14 min read · By Abhishek Nair, Fractional CTO for Deep Tech & AI

Two Poses Are Enough: How Much Mocap Data Does a Humanoid Need to Walk?

Tags: Reinforcement Learning · Humanoid Robotics · MuJoCo · DeepMimic · RL · PPO

TL;DR

We trained a Unitree G1 humanoid (29 actuated joints, position-controlled) to walk by imitating a reference motion clip in MuJoCo, then stripped the reference down to see how little is enough.

  • Full reference (50 frames of motion capture): 0.69 m/s, 13.87 m covered in 1000 simulation steps.
  • 10 keyframes (20% of the original data): 0.46 m/s, 9.31 m.
  • 2 poses — just mid-stance and mid-swing, extracted by hip-pitch argmax/argmin: 1.22 m/s, 24.58 m.

The minimal version walked fastest. Stripping the reference signal did not just match the full-clip baseline — it beat it almost 2×. The constraint stopped being a corset.

This post reports the ablation, explains why fewer reference frames produce a faster gait, and flags the failure modes the minimal version still has (foreshadowing the reward-hacking story).


Why this matters

DeepMimic-style imitation learning is the workhorse of modern humanoid locomotion. Recent papers (PHC, HOVER, OmniH2O) keep pushing in one direction: more mocap data. 100,000 to 25 million reference frames is now standard for high-fidelity humanoid control.

We asked the opposite question. How little can you get away with and still produce a viable bipedal gait?

This is not just academic. Real-world humanoid deployment is bottlenecked on mocap availability. A high-quality motion-capture session is slow, expensive, and confined to studio environments. If two static poses suffice for the most basic walking behavior, the implication ripples downstream: any system that can extract two poses from a photograph, a video frame, or a textbook diagram has enough information to seed a humanoid walking policy.

That is a wildly different starting point than what the field currently assumes.

The setup

Robot. Unitree G1, 29 actuated joints. The G1 is position-controlled at the actuator level: each joint runs a built-in PD loop (kp = 500) and accepts target angles, not torques. We learned this the hard way early in the project — applying our own external PD on top of the built-in one is mathematically equivalent to a much stiffer system with the wrong dynamics, and the robot falls after about 50 steps. With the built-in PD respected, the same policy code stays upright.

Simulator. MuJoCo at 50 Hz control / 500 Hz physics, using the official Menagerie G1 model.

Algorithm. PPO from Stable-Baselines3, 256×256 MLP policy. 16 parallel environments. Each ablation variant trained for ~10–20 million PPO steps, ~90–180 minutes on a single 4 GB laptop GPU. No multi-node training, no extravagant compute footprint.
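For concreteness, the training setup above can be summarized as a Stable-Baselines3-style config sketch. Field names here are illustrative, not a dump of the project's actual training script:

```python
# Hedged sketch of the PPO setup described above; these are the knobs the
# post names, not the repo's real config file.
ppo_config = dict(
    policy="MlpPolicy",
    policy_kwargs=dict(net_arch=[256, 256]),  # 256x256 MLP policy
    n_envs=16,                                # parallel environments
    total_timesteps=20_000_000,               # ~10-20M PPO steps per ablation variant
)
```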

Reward. Standard DeepMimic kernels — 0.65 pose-tracking, 0.15 velocity-tracking, 0.10 root orientation, 0.10 root height — plus a task reward (alive bonus + forward velocity). The mix is intentionally vanilla. The point of the experiment is to vary the reference data fed into the pose-tracking term, not the reward shape.
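The weighted sum above can be sketched with the usual DeepMimic-style exponential kernels. The 0.65/0.15/0.10/0.10 weights are from the post; the kernel scales and the alive/velocity coefficients in the task term are illustrative placeholders, not the project's actual values:

```python
import numpy as np

# Sketch of the DeepMimic-style reward mix. Weights match the post;
# kernel scales (2.0, 0.1, 5.0, 40.0) and task coefficients are illustrative.
def mimic_reward(q, q_ref, v, v_ref, quat_err, h, h_ref, fwd_vel):
    r_pose   = np.exp(-2.0 * np.sum((q - q_ref) ** 2))    # joint-angle tracking
    r_vel    = np.exp(-0.1 * np.sum((v - v_ref) ** 2))    # joint-velocity tracking
    r_orient = np.exp(-5.0 * quat_err ** 2)               # root orientation error (rad)
    r_height = np.exp(-40.0 * (h - h_ref) ** 2)           # root height error (m)
    r_task   = 0.5 + 1.0 * fwd_vel                        # alive bonus + forward velocity
    return 0.65 * r_pose + 0.15 * r_vel + 0.10 * r_orient + 0.10 * r_height + r_task
```

With perfect tracking and zero forward velocity, the imitation terms sum to 1.0 and the task term contributes only the alive bonus.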

Reference data. Unitree's pre-retargeted LAFAN1 motion clip walk3_subject1. We auto-segment a single straight-walking cycle (~50 frames per cycle at 50 Hz), yaw-rotate it so the initial heading is +x, and use it as the canonical reference.

Three reference-density variants:

  • Full reference: the entire 50-frame cycle, phase-synchronized to the simulation clock, fed to the pose-tracking term.
  • 10 keyframes: 10 evenly spaced poses from the cycle, linearly interpolated back up to 50 Hz before the policy sees them.
  • 2 poses: just mid-stance and mid-swing (extracted by hip-pitch argmax / argmin), interpolated to 50 Hz.
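Both sparse variants can be produced mechanically from the full cycle. A minimal numpy sketch, assuming the reference is a (50, 29) array of joint angles and `hip_pitch_idx` is an illustrative index for the hip-pitch joint:

```python
import numpy as np

def subsample_and_interp(ref, n_keyframes):
    """Keep n_keyframes evenly spaced poses, interpolate back to full length."""
    T, J = ref.shape
    key_t = np.linspace(0, T - 1, n_keyframes)
    keys = ref[np.round(key_t).astype(int)]
    full_t = np.arange(T)
    return np.stack([np.interp(full_t, key_t, keys[:, j]) for j in range(J)], axis=1)

def pose_pair(ref, hip_pitch_idx=0):
    """Mid-stance / mid-swing via hip-pitch argmax / argmin, interpolated back."""
    idx = np.sort([int(np.argmax(ref[:, hip_pitch_idx])),
                   int(np.argmin(ref[:, hip_pitch_idx]))])
    keys, key_t = ref[idx], idx.astype(float)
    full_t = np.arange(ref.shape[0])
    return np.stack([np.interp(full_t, key_t, keys[:, j])
                     for j in range(ref.shape[1])], axis=1)
```

Note that `np.interp` clamps outside the keyframe span, so the two-pose reference holds each endpoint pose flat at the cycle boundaries rather than wrapping; handling the wrap-around is left out of this sketch.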

Anti-cheat. Two terminations protect against pathological policies discovering the reward structure: pelvis height below 0.3 m ends the episode (no knee-walking), and large pose-tracking error ends the episode (no micro-shuffle that ignores the reference). Without these guards, every reward signal we tried produced one of those two failure modes — a story we will tell in detail in a later post on reward hacking.
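The two guards are cheap to implement. A sketch, where the 0.3 m pelvis threshold is from the post and the pose-error threshold is an illustrative placeholder:

```python
import numpy as np

MIN_PELVIS_HEIGHT = 0.3   # m; from the post -- below this is knee-walking
MAX_POSE_ERROR = 1.5      # rad, summed over joints -- illustrative threshold

def should_terminate(pelvis_z, q, q_ref):
    if pelvis_z < MIN_PELVIS_HEIGHT:                  # no knee-walking
        return True
    if np.sum(np.abs(q - q_ref)) > MAX_POSE_ERROR:    # no reference-ignoring shuffle
        return True
    return False
```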

Visual comparison

Three baselines side-by-side: pure-RL (no imitation), full-mimic (50 frames), and a decaying-mimic variant that phases the imitation reward out over training (the pose-tracking weight decays linearly from 1.0 to 0.0 over 10M steps):

Pure RL learns a micro-step shuffle within minutes and stops improving. The mimic policies discover real walking. The point of the side-by-side is to show that the reference signal is doing genuine work — it is not just a regularizer.
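The decaying-mimic schedule mentioned above is just a linear ramp on the pose-tracking weight:

```python
def mimic_weight(step, decay_steps=10_000_000):
    """Pose-tracking weight: linear decay from 1.0 to 0.0 over 10M steps,
    then held at zero (pure task reward)."""
    return max(0.0, 1.0 - step / decay_steps)
```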

Below, each of the three reference-density variants on the G1, recorded after training to 1000-step survival. Watch what changes — and what doesn't.

Full reference (50 frames). 0.69 m/s, 13.87 m. This is the post-VecNormalize-fix gold-standard policy (we tagged it v12_gold in our archives; the story of the bug that hid it is its own post on reproducibility).

10 keyframes. 0.46 m/s, 9.31 m. Same pipeline, 80% less reference data:

2 poses only. 1.22 m/s, 24.58 m. The minimal version. Just mid-stance and mid-swing, interpolated to 50 Hz:

All three policies were evaluated over five deterministic episodes with the same seed and the same clip_obs configuration to make the numbers comparable.
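One plausible way to compute the speed/distance pair from a logged root trajectory (the repo's full_analysis.py may define the metrics differently, e.g. from instantaneous velocities):

```python
import numpy as np

def episode_metrics(root_xy, dt=0.02):
    """Net displacement and mean speed over one episode (50 Hz control)."""
    dist = float(np.linalg.norm(root_xy[-1] - root_xy[0]))
    speed = dist / (len(root_xy) * dt)
    return speed, dist
```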

Where this almost stopped working

A note before the results, because it would be misleading to skip it.

The G1's position-controlled actuators are not just a calibration detail — they are a load-bearing assumption that broke everything we tried in the first two weeks. PyBullet examples and most RL-humanoid tutorials assume torque control: your action is a 29-dimensional torque vector, the simulator integrates it, you read the next state. The G1 in MuJoCo Menagerie uses position actuators with built-in PD loops at the joint level. Your action is a target angle, not a torque, and the simulator's internal PD does the actual integration.

If you ignore this — which we did, repeatedly — and stack your own PD controller on top, the system becomes equivalent to one with a much higher effective stiffness than either you or the simulator expects. Episodes that look stable for the first 50 steps fall apart catastrophically afterward, and no amount of reward tuning fixes it because the underlying dynamics are wrong.

The journal entry the day we figured this out was a single line:

Adding PD on top → robot falls after ~50 steps. This is mathematically equivalent to a much stiffer system with wrong dynamics.

Once we respected the actuator type, the same policy code stayed upright for 1000-step episodes. The mocap-ablation results that follow assume this fix is in place. If your humanoid policy collapses around step 50 with no visible cause, check the actuator type before touching the reward function.
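A cheap way to catch this before training: in MJCF, position-servo actuators appear as `<position>` elements (carrying a `kp` gain and a built-in PD loop), while raw torque actuators appear as `<motor>` elements. A crude string check on the model XML is enough to tell which regime you are in; loading the model through the mujoco bindings and inspecting actuator gains is the robust version:

```python
def actuator_style(mjcf_text: str) -> str:
    """Crude MJCF check: <position> actuators carry a built-in PD loop."""
    if "<position" in mjcf_text:
        return "position"  # actions are target angles; do NOT stack your own PD
    if "<motor" in mjcf_text:
        return "torque"    # actions are torques; an external PD is appropriate
    return "unknown"
```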

Why does less data produce faster walking?

The headline result — 2 poses beating 50 frames — sounds like a paradox. It is not. Here is what is actually happening.

The DeepMimic pose-tracking term does two jobs at once:

  1. Direction. The time-ordered sequence of poses tells the policy which way to walk and at what cadence. Even two poses encode alternation (left → right → left), which is enough to break the policy's natural symmetry and bias it toward a periodic gait.
  2. Amplitude and style. The per-frame joint targets constrain the gait to look like the reference — the same hip swing angle at the same phase, the same knee bend at the same fraction of the cycle.

When we drop from 50 frames to 2 poses, we lose job (2) entirely. The pose-tracking term still rewards the policy for being near one of the two poses, but it no longer constrains the trajectory between them. The policy is free to choose any amplitude that interpolates the endpoints, and PPO — with a velocity reward in the task term — picks the trajectory that maximizes forward speed.

In other words: the reference signal in the minimal variant is doing one job, not two, and the velocity reward is doing the second. The policy ends up running between two fixed poses instead of tracking a careful phase trajectory.

A quote from the project journal that captures this:

Temporal ordering of reference encodes direction. Static poses are direction-ambiguous. The two-pose variant works because the velocity reward also encodes direction — without it, the same two poses produce backward walking just as readily.

We confirmed this directly. A separate experiment (v14) used a single hand-encoded pose-pair from Winter's gait-analysis textbook — no mocap, no video, just hip and knee angles read off a static diagram — and ran the same pipeline with no velocity-reward bias. The robot walked backward at -0.18 m/s, perfectly stable, for 1000 steps. Two poses without temporal ordering are directionally ambiguous; the policy picks an arbitrary direction. We cover this in detail in the reward-hacking post — for now, just keep in mind that the 2-pose result here is using the full DeepMimic ordering (mid-stance → mid-swing), which is what produces forward motion.

This decomposition — direction vs. amplitude — explains the speed-up. The full reference forces a particular hip swing magnitude and a particular knee flexion shape. Those constraints add up. They pin the policy to "what walking looks like in the LAFAN1 dataset." When we strip them away, the policy finds a faster solution: longer strides, less torso wobble, more aggressive hip drive between the two endpoints.

There is also a less obvious mechanism: PPO's exploration. The full-reference pose-tracking term has a sharp gradient — small deviations from the per-frame target produce large reward drops. That gradient is helpful early in training (it gives the policy a strong "what to do next" signal from a randomly initialized network) but harmful late in training, because it disincentivizes the policy from exploring trajectories that deviate from the reference. The 2-pose variant has a much flatter pose-tracking landscape: as long as the policy passes near the two endpoints, the reward is roughly the same. That flatness invites exploration in the trajectory between the poses, and the velocity reward provides a strong gradient in the direction of "go faster." PPO ends up optimizing primarily for velocity, with pose-tracking acting as a soft directional constraint.

We can read this directly in the action statistics. The full-reference policy uses ~30% of its hip-pitch actuation range over a gait cycle; the 2-pose policy uses ~67%. The same is true for knee flexion. The minimal-reference policy is more aggressive across the board — the reference signal is not pulling it back to a particular pose at every timestep.
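The range-usage statistic is straightforward to compute from a logged joint trajectory, given the joint limits from the model:

```python
import numpy as np

def range_usage(joint_traj, joint_lo, joint_hi):
    """Fraction of a joint's allowed range swept over one gait cycle."""
    return float(joint_traj.max() - joint_traj.min()) / (joint_hi - joint_lo)
```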

What sparse reference does not fix

The 2-pose policy is faster, but it is not strictly better along every axis.

Style. The 2-pose gait looks tense. Arms barely swing. The torso stays remarkably rigid. The full-reference policy looks closer to natural human walking even though it is slower. If you are training a robot for video demos or human-robot interaction, the full reference is still the right tool.

Variance. Across five evaluation episodes with the same seed, the 2-pose policy shows higher variance in episode-level velocity and trajectory shape than the 10-keyframe or full-reference policies. It works, but it is "spikier" — small perturbations push it into less stable regimes.

Lateral drift. The minimal variant ends each 1000-step episode 4.56 m off the straight line (vs. 2.44 m for full reference). The minimal reference does not constrain lateral foot placement well; the velocity reward alone does not pin the trajectory in 2D.

Sim-to-real risk. All of these experiments are in MuJoCo. A faster, less constrained policy may transfer worse to hardware because it relies more heavily on sim-specific exploits of the dynamics. We have not tested this, and it is the open question we are most curious about. The full-reference policy has more "wisdom" baked in from human motion; the 2-pose policy is more "discovered" — and discovered policies tend to overfit the simulator.

Reset robustness. Every policy in the ablation was reset from a single nominal standing pose. We did not measure how each one handles off-distribution starting states (e.g., dropped from 30 cm, started with one foot off the ground, pushed mid-stride). Our intuition: the full-reference policy is more robust because the reference clip implicitly defines a "good state" the policy learns to converge to; the 2-pose policy is more brittle because that anchor is much weaker. We come back to this in the LLM-reward-iteration post — robustness turns out to be a state-space-coverage problem, not a reward-design problem, and reference data is one cheap way to expand the coverage.

How this relates to the literature

The high-data direction in humanoid imitation learning is well-established. PHC (NVIDIA, 2023) trains on the entire AMASS dataset — millions of frames spanning thousands of motions. HOVER and OmniH2O push further, with general-purpose humanoid controllers built on 25M+ frames. The premise is that humanoid morphology is too high-dimensional and too unstable to be learned with sparse supervision; only a dense, motion-rich reference signal can produce policies that generalize and look natural.

That premise is true for general-purpose control. If you want a single policy that can walk, run, dance, recover from a push, and stand up after a fall, you do need lots of data.

Our experiment lives in a different regime. We are asking what the minimum-viable reference is for a specific skill (walking) on a specific morphology (G1). The 2-pose result does not contradict PHC; it complements it by showing how cheap the floor is. A general-purpose controller plausibly still needs 25M frames. A walking-only policy plausibly needs 2 poses plus a velocity reward.

The interesting question is what sits in the middle. How many skills can you train into a single humanoid policy with, say, one carefully-chosen pose-pair per skill — walking, turning, sit-to-stand — and a global velocity / orientation reward? We have not run that experiment yet. The 2-pose result here suggests it is worth running.

Practical implications

If you are working on humanoid locomotion via imitation learning, the budget for mocap is much smaller than the literature suggests:

  • You have a full mocap clip (around 1 second of walking). Use the full reference. You get the most natural-looking gait, suitable for video demos, sim-to-real transfer attempts, and downstream learning of more complex skills.
  • You have a handful of keyframes (5–15). The 10-keyframe variant works. Slightly slower, slightly less natural, fully bipedal, and lower variance than the minimal version.
  • You have a single photograph or pose diagram. The 2-pose variant works. Faster but stylistically unconstrained. The ordering of the pair encodes direction; without ordering (or with a swap), the policy walks backward.

The deeper claim is methodological. For a robot to walk at all, the reference signal primarily needs to break the symmetry between left and right. Anything beyond that is style, not substance. The velocity reward — which is essentially free, you get it from the simulator without any mocap — does the rest.

If you are reaching for 25 million reference frames before you have tried 2 poses, you are starting your search at the wrong end of the data axis.

Open questions

A few things we did not test that we would like to see:

  • Non-walking skills. Does the minimal-reference result hold for turning, running, kicking, or stair-climbing? Each of those has a different symmetry structure — running has a flight phase, turning is intrinsically asymmetric — and the "direction vs. amplitude" decomposition may not split cleanly.
  • Reward shaping to close the style gap. Can a small auxiliary reward (penalize arm rigidity, reward natural cadence) close the visual gap between the 2-pose policy and the full-reference policy without sacrificing the speed gain? We tried an LLM-iterated reward search and learned something interesting about which kinds of reward shaping work and which do not — that is its own post.
  • Sim-to-real. Does a 2-pose policy transfer to a real G1 better or worse than a full-reference policy? Our intuition: worse, because of sim exploits, but with a domain-randomization wrapper the gap might close. We do not have hardware to test this directly.

Reproducibility

Everything in this post is reproducible from the archived checkpoints. The eval script is src/humanoid_sim/scripts/full_analysis.py; pass it a model .zip and the matching VecNormalize file (we cover why that pair must stay together in the reproducibility post) and you get the metrics that the bar charts above came from.

Numbers shown in this post are verbatim from:

  • archives/v12_gold/metrics.json — full reference, 0.693 m/s, 13.87 m
  • archives/v9d_keyframes/metrics.json — 10 keyframes, 0.464 m/s, 9.31 m
  • archives/v9e_posepair/metrics.json — 2 poses, 1.222 m/s, 24.58 m

The prose rounds to two decimal places (0.69, 0.46, 1.22); the metrics files above hold the unrounded values the bar charts use.

What's next

Two follow-up posts in this series get at the obvious next questions.

Once you remove the reference signal, the policy finds creative ways to "walk" that are not walking — knee-crawling, backward shuffling, single-leg hopping disguised as bipedal motion. Five Ways a Humanoid Cheats at Walking is the taxonomy.

And before any of these results were trustworthy, we had to fix a subtle reproducibility bug that hid a working policy for two weeks. The VecNormalize Trap is the diagnostic post — recommended reading if you are running PPO on humanoids and trust your training-time metrics.
