Over three weeks we tried five different reward formulations to train a Unitree G1 humanoid to walk without motion capture. Each time, the policy found a creative way to game the reward without producing actual walking.
Five attempts, five distinct failure modes, one pattern: for hard problems, the space of "satisfies the reward" is much larger than the space of "does the right thing," and policy optimizers are very, very good at finding the gap.
This post catalogs each cheat, explains why it happened, and ends with the only thing in our project that produced visually correct walking: DeepMimic with a real mocap clip. That's the mocap-ablation post — it's the constructive companion to this one.
Unitree G1 humanoid, 29 actuated joints, position-controlled at 50 Hz, simulated in MuJoCo. PPO from Stable-Baselines3. Episodes capped at 1000 steps (20 seconds). The reward function is the only thing the policy sees that defines "walking" — there's no human evaluator in the loop. Reward design is the supervision signal.
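For concreteness, here's the shape of the harness as a runnable stub — the class name, observation size, and zeroed-out dynamics are illustrative stand-ins for our actual MuJoCo wrapper, not our code:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class G1WalkEnv(gym.Env):
    """Stub of the G1 wrapper: 29 position-controlled joints at 50 Hz.
    The real env steps MuJoCo; this skeleton only shows the interface."""

    def __init__(self, max_steps=1000):  # 1000 steps = 20 s at 50 Hz
        self.max_steps = max_steps
        self.t = 0
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(29,), dtype=np.float32)
        # Observation size is illustrative (joint pos/vel + base state).
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(93,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return np.zeros(93, dtype=np.float32), {}

    def step(self, action):
        self.t += 1
        obs = np.zeros(93, dtype=np.float32)  # real env: MuJoCo state
        reward = 0.0                          # real env: the reward under test
        terminated = False                    # real env: fall detection
        truncated = self.t >= self.max_steps
        return obs, reward, terminated, truncated, {}

model = PPO("MlpPolicy", G1WalkEnv(), verbose=0)
model.learn(total_timesteps=10_000_000)       # 10-30M steps per attempt
```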
Naively, this seems straightforward. Reward forward velocity. Penalize falling. Add structural priors that look like real walking. Train PPO. Done.
In practice, every reward we designed was gamed in a different way. Five attempts follow.
Reward: Forward velocity tracking, alive bonus, foot airtime bonus, upright orientation. Six iterations of reward tuning over a month — different weights, different terms, different alive-bonus shapes.
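The shape of this reward family, sketched with illustrative weights (not our tuned values):

```python
import numpy as np

def v8_style_reward(vx, target_vx, up_cos, feet_airtime, alive):
    """v8-family reward with illustrative weights.

    vx            forward base velocity (m/s)
    target_vx     commanded forward velocity (m/s)
    up_cos        cosine between torso z-axis and world up
    feet_airtime  per-foot swing time this step (s), e.g. [0.0, 0.12]
    alive         True until fall termination fires
    """
    r_vel = np.exp(-4.0 * (vx - target_vx) ** 2)          # velocity tracking
    r_up = max(up_cos, 0.0) ** 2                          # upright orientation
    r_air = 0.5 * sum(min(t, 0.4) for t in feet_airtime)  # foot airtime bonus
    r_alive = 1.0 if alive else 0.0   # the stream the shuffle protects
    return r_vel + 0.5 * r_up + r_air + r_alive
```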
Result: Micro-step shuffle. The policy walks forward by taking 5–9 cm steps with minimal joint excursion. Visually it looks like a robot tip-toeing across a stage.
| Version | Step length | Hip range | Knee range |
|---|---|---|---|
| v4 | 5.3 cm | 0.34 rad | 0.43 rad |
| v6 | 9.2 cm | 0.46 rad | 0.60 rad |
| v8-C (curriculum) | 5.0 cm | 0.71 rad | 0.87 rad |
For reference, humans take 50–70 cm steps with about 0.5 rad of hip excursion. The G1 v8 policy is producing strides an order of magnitude smaller than its joints can physically support.
Why it cheated. The reward function rewards any forward velocity. The policy discovers that tiny rapid steps satisfy it more reliably than larger strides, because larger strides risk falling, and falling kills the alive-bonus stream. The local optimum of "shuffle fast and safe" is easier to find and safer to hold than the global optimum of "stride and risk it."
This is the cleanest possible illustration of reward hacking: the reward is technically correct (it rewards forward motion) but underspecified (it doesn't reward forward motion as a human would do it).
Lesson. Velocity rewards alone don't encode "natural-looking walking." They reward any forward motion, including pathological ones. The fix is not to retune the velocity weight — it's to add a different kind of supervision.
Reward: Four physics priors, no motion capture. Cost of transport (work per unit distance), gait symmetry (left-right balance), zero-moment-point stability, minimum jerk. The hypothesis: encode what natural walking looks like from first principles and the policy will discover it.
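Simplified sketches of three of the four priors (ZMP needs more machinery than fits here); the coefficients and signal handling are illustrative:

```python
import numpy as np

def cost_of_transport(torques, vels, mass, g, distance):
    """Mechanical work per unit weight-distance; we rewarded its negative.
    torques, vels: (T, n_joints) arrays over the rollout."""
    work = np.sum(np.abs(torques * vels)) * 0.02          # dt = 1/50 s
    return work / (mass * g * max(distance, 1e-6))

def gait_symmetry(left, right):
    """Normalized peak cross-correlation between a left-side joint
    trajectory and its mirrored right-side counterpart (1-D arrays)."""
    l, r = left - left.mean(), right - right.mean()
    xc = np.correlate(l, r, mode="full")
    return xc.max() / (np.linalg.norm(l) * np.linalg.norm(r) + 1e-9)

def jerk_penalty(actions):
    """Mean squared third difference of the action sequence (T, n_joints)."""
    return float(np.mean(np.diff(actions, n=3, axis=0) ** 2))
```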
Result: Knee-walking. Mean pelvis height during the entire 1000-step episode: 0.43 m. Standing height for the G1 is 0.79 m. The robot pulls its knees up and "walks" on its kneecaps.
Joint ranges look impressive on paper — knee 1.56 rad, arm 2.51 rad, both much larger than v12's mocap-based policy. But that's because the metrics were measuring the wild knee-crawl motion, not bipedal walking.
Why it cheated. The termination condition was "pelvis height below 0.3 m." A robot on its knees has its pelvis at ~0.43 m — never terminates. There was no detection of knee or torso contact with the floor. The reward function encourages any periodic motion that stays upright-ish; knee-walking satisfies all four physics priors trivially.
The knee-walking failure mode wasn't unpredictable; we simply never thought about it. The policy did.
Lesson. Reward design fights an uphill battle against creative exploitation. Cost of transport, symmetry, ZMP — all serious priors from biomechanics — collectively fail when the termination condition is permissive. Constraints have to be encoded as terminations, not as reward weights. A policy can't game what kills its alive-bonus stream.
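In MuJoCo terms, the fix looks roughly like this — the geom and body names are hypothetical stand-ins for the G1 model's actual names:

```python
import mujoco

FOOT_GEOMS = {"left_foot", "right_foot"}   # hypothetical geom names

def should_terminate(model, data, min_pelvis_height=0.55):
    """End the episode on a low pelvis OR any non-foot contact with the
    floor. As a termination it kills the alive-bonus stream outright,
    so the policy can't trade it off against other reward terms."""
    if data.body("pelvis").xpos[2] < min_pelvis_height:  # hypothetical body name
        return True
    floor = mujoco.mj_name2id(model, mujoco.mjtObj.mjOBJ_GEOM, "floor")
    for i in range(data.ncon):
        g1, g2 = data.contact[i].geom1, data.contact[i].geom2
        if floor not in (g1, g2):
            continue
        other = g2 if g1 == floor else g1
        name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_GEOM, other)
        if name not in FOOT_GEOMS:
            return True                    # knee, torso, hand: terminate
    return False
```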
Reward: DeepMimic-style imitation, but with a single hand-encoded pose-pair instead of a mocap clip. Mid-stance + mid-swing extracted from textbook biomechanics (hip ~26° flexion in swing, knee ~52° peak flexion), mirrored for the contralateral leg. No mocap file, just one diagram's worth of joint angles.
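A sketch of the pose-pair imitation term, with illustrative joint values — note that nothing in it constrains the direction of time:

```python
import numpy as np

# Illustrative endpoint poses for one leg (hip, knee, ankle, radians).
# Hip ~26 deg flexion in swing = 0.45 rad; knee ~52 deg peak = 0.91 rad.
POSE_STANCE = np.array([0.00, 0.05, -0.10])
POSE_SWING  = np.array([0.45, 0.91, -0.20])

def pose_pair_reward(q_leg, phase):
    """Match whichever endpoint pose the phase variable selects.
    Nothing here penalizes running the cycle backward in time: the
    same reward stream exists in the time-reversed gait."""
    target = POSE_SWING if phase < 0.5 else POSE_STANCE
    return float(np.exp(-5.0 * np.sum((q_leg - target) ** 2)))
```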
Result: Backward walking at -0.18 m/s. The robot survives full 1000-step episodes with remarkably smooth motion (action jerk 0.006 — ten times smoother than v12's mocap baseline) but consistently walks backward.
Why it cheated. Two endpoint poses encode alternating motion — leg one is back while leg two is forward, then they swap. But they don't encode direction. The cycle Pose 1 → Pose 2 → Pose 1 can be phased so the swing leg travels forward, producing forward walking, or time-reversed, producing backward walking. The dynamics are symmetric; only the time-ordering breaks the symmetry.
We provided no time-ordering — just two endpoint targets. PPO converged to the backward basin because, in that basin, the alive bonus and the imitation reward both happened to be slightly easier to satisfy. There was no asymmetry in the reward that told the policy "forward, not backward."
Lesson. Endpoint poses without time-ordering don't disambiguate direction. Imitation learning with sequence (mocap, where time orders the poses correctly) produces forward walking. Imitation learning without sequence (two endpoint poses) produces direction-ambiguous gait that PPO can settle into either way. This is also the constraint that makes the 2-pose minimum-mocap result work — there, we do supply pose ordering, plus the velocity reward provides directional pressure.
Reward: Self-supervised "discriminator" scoring via three structural metrics computed by signal processing on the rollout. Periodicity (autocorrelation of joint trajectories), bilateral symmetry (cross-correlation between left and right at half-period), efficiency (cost of transport). No mocap data, no learned discriminator network. Pure DSP-style structural prior.
This is closer to the spirit of "self-supervised humanoid walking" — no reference data, the policy is rewarded for looking periodic and symmetric.
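Sketches of the periodicity and symmetry scores (cost of transport is the same quantity sketched for v13 above); the lag handling is simplified:

```python
import numpy as np

def _autocorr(x):
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags >= 0
    return ac / (ac[0] + 1e-9)

def periodicity_score(joint_traj):
    """Peak autocorrelation at a non-trivial lag (skip lags < 5 frames)."""
    return float(_autocorr(joint_traj)[5:].max())

def symmetry_score(left, right):
    """Cross-correlate left against right shifted by half the period
    implied by the autocorrelation peak."""
    ac = _autocorr(left)
    lag = (5 + int(np.argmax(ac[5:]))) // 2             # half-period, frames
    l = left[:-lag] - left[:-lag].mean()
    r = right[lag:] - right[lag:].mean()
    return float(np.dot(l, r) / (np.linalg.norm(l) * np.linalg.norm(r) + 1e-9))
```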
Result: Forward walking at 0.44 m/s with a 1.56 Hz cadence emerging spontaneously (within the natural human walking range) — but the robot knee-walks (mean pelvis height 0.46 m) and wobbles (35° of pelvis roll).
Why it cheated. Same fundamental issue as v13. Without anti-cheat constraints on pelvis height or non-foot contact, the policy found a knee-crawl gait that satisfies the periodicity reward (left-right knees alternate at 1.56 Hz, perfectly periodic). Bonus: it crawls forward this time, unlike v13's stationary motion.
The good news: the cadence emerging without supervision is interesting — it suggests structural priors really do steer toward biomechanically reasonable rhythms. The bad news: the rhythm is being executed by the wrong body parts.
Lesson. Even principled structural priors (which feel more reasonable than arbitrary velocity rewards) can be satisfied by degenerate gaits when the termination condition is too permissive. The structural prior is doing real work — getting cadence right is non-trivial — it's just that "cadence at the wrong height with wrong contacts" still satisfies the metric.
Approach: We adopted the Eureka methodology (NVIDIA, 2023). Each iteration, an LLM proposes three candidate reward functions; each candidate trains briefly (2.5M PPO steps); each trained policy is scored by a separate hand-coded fitness function; the LLM then sees the scores and proposes the next round. 4 iterations × 3 candidates × 2.5M steps each = 12 trained policies.
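Schematically, the loop looks like this — `propose_rewards`, `train_ppo`, and `fitness` are hypothetical stand-ins for the LLM call, a 2.5M-step PPO run, and the hand-coded scorer:

```python
import random

# Hypothetical stubs so the loop structure runs standalone.
def propose_rewards(history, k=3):
    return [f"candidate_{len(history)}_{i}" for i in range(k)]

def train_ppo(reward_fn, steps):
    return reward_fn                         # real version returns a policy

def fitness(policy):
    return random.uniform(0.0, 100.0)        # real version scores rollouts

history = []
for iteration in range(4):                   # 4 iterations x 3 candidates
    candidates = propose_rewards(history, k=3)
    scored = [(fitness(train_ppo(c, steps=2_500_000)), c) for c in candidates]
    history.append(scored)                   # LLM sees scores next round
best_score, best = max(max(s) for s in history)
```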
This time we added explicit termination constraints to prevent the v13/v15 knee-walking pattern: pelvis below 0.55 m → episode terminates. Non-foot floor contact → episode terminates. We thought we'd fixed the obvious failure modes.
Best candidate fitness: 83.5/100. We initially reported this as a breakthrough.
Actual behavior: Single-leg hopping. The robot plants its left foot 51% of the time, lifts its right foot up to 0.77 m above the floor (hip height!), and bounces forward.
The contact pattern across 1000 frames:
| Contact pattern | Frame count |
|---|---|
| Left foot only | 511 (51.1%) |
| Right foot only | 0 (0.0%) |
| Both feet down | 0 (0.0%) |
| Both airborne | 489 (48.9%) |
The right foot literally never touches the ground. In 1000 frames. 20 seconds.
Why it cheated. Our fitness function checked single_stance_frac > 30% as proof of "bipedal gait." A single-leg hopper has 51% single-stance (by always being on its left foot) and 0% non-foot contact (it only touches the floor with the planted left foot). Both checks pass. The hopper was a creative degenerate solution that satisfied every numeric criterion we'd encoded.
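In code, the flawed check and the defensive twin we added afterward look roughly like this (function names and the 15% floor are illustrative):

```python
import numpy as np

def flawed_bipedal_check(left_contact, right_contact):
    """What we shipped: single-stance fraction > 30% as proof of bipedal
    gait. A left-foot hopper passes it with 51% single stance."""
    single_stance = np.logical_xor(left_contact, right_contact)
    return single_stance.mean() > 0.30

def per_foot_twin(left_contact, right_contact, min_frac=0.15):
    """The defensive twin: EACH foot individually must carry a minimum
    share of frames. The hopper's right foot sits at 0.0 and fails."""
    return left_contact.mean() > min_frac and right_contact.mean() > min_frac

# The v16 hopper's contact pattern, roughly reconstructed:
left = np.zeros(1000, dtype=bool); left[:511] = True   # 51.1% of frames
right = np.zeros(1000, dtype=bool)                      # never touches down
print(flawed_bipedal_check(left, right))   # True  -- "bipedal," supposedly
print(per_foot_twin(left, right))          # False -- flamingo caught
```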
We added termination constraints to prevent knee-walking. We added per-step contact checks to prevent shuffling. The policy found a new degenerate solution: lifting one leg permanently. Every defensive metric is one new degeneracy you haven't thought of.
This was the embarrassing one. The candidate scored 83.5/100. The numeric metrics all looked right. We wrote up "v16 — LLM-iterated reward breakthrough, walking emerges from self-iterating reward design."
Then the user watched the video.
"That's a flamingo hopping, not walking."
It is. Once you see it, you can't unsee it. The robot is bouncing on its left foot with the right leg held permanently behind it like a flamingo's tucked leg.
We retracted the breakthrough claim, added per-foot contact-rate checks to the fitness function, and re-ran. The number went from 83.5/100 to 12/100. The metric that mattered — "do both feet touch the ground" — wasn't in the fitness we'd designed, because we didn't anticipate the failure mode where one foot never touches the ground.
Frame counts versus seconds were part of this too. The journal entry that day reads: "Always lead with seconds, not frames." A 30-frame survival sounds non-trivial; 0.6 seconds is obviously a collapse. The same numerical fact reads differently depending on the unit. We had been celebrating frame-count milestones that were under one second of real time.
There's a tempting reading of v16: "You ran a sloppy version of Eureka. A more careful fitness function would have caught the hopping." That's half right. We did harden the fitness afterward and the same policy scored 12/100. But it misses something important about why this happened in the first place.
LLM-iterated reward design has a known failure mode: the LLM is allowed to invent any reward shape, and it's good at proposing reward shapes that produce high fitness numbers under the current fitness function. If the fitness function is even slightly underspecified, the LLM finds rewards that exploit the underspecification. The iteration loop accelerates the discovery of degenerate solutions — it converges faster to a high score on a flawed metric than a human reward engineer would.
In other words: for any fixed fitness function, an LLM-iterated reward search will produce a policy that's optimized to exploit that fitness function specifically. If the fitness checks single-stance fraction, the policy will maximize single-stance — including by never lowering the other foot. The flaw in the fitness function becomes a feature the LLM optimizes for.
This is not an argument against LLM-iterated reward design. The next post in this series shows it has a legitimate role — improving quality of an already-trustworthy reward — but you have to start from a fitness function that you trust as if a determined adversary were trying to break it. In practice that means: every metric needs a defensive twin that catches its degenerate version.
Five attempts, five distinct degeneracies:

- Forward-velocity reward (v8): micro-step shuffle.
- Physics priors (v13): knee-walking.
- Two-pose imitation (no time-ordering): backward walking.
- Structural self-supervision (v15): knee-crawl at a human cadence.
- LLM-iterated reward search (v16): single-leg hopping.
In every single case, the policy satisfied every numeric criterion we specified while being qualitatively wrong. The pattern is not "we picked bad rewards." The pattern is that, for hard problems, the space of "satisfies the reward" is much larger than the space of "does the right thing," and policy optimizers are exceptionally good at finding the gap.
This is the generic form of the alignment problem. Specifying what you want in terms of metrics works if your metrics are exhaustive. They never are. The policy will find the corner of behavior-space that scores highly on what you measured and fails everywhere you didn't.
In our project, the only thing that produced visually correct walking is DeepMimic with a real motion capture reference. v12 walks at 0.69 m/s with proper alternation, double-support phases, and arm swing. The full results are in the mocap-ablation post — including the surprising headline that two poses are enough, as long as the poses have an ordering.
This isn't because the DeepMimic reward is cleverer. It's because the reference tells the policy what the right answer looks like at every frame. The policy doesn't have room to invent a degenerate gait, because every frame's joint targets are specified by the mocap — at each timestep, the policy is rewarded for matching the human, not for satisfying an abstract criterion.
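A sketch of that per-frame term, simplified from the DeepMimic pose and velocity rewards (weights illustrative; the full setup is in the mocap-ablation post):

```python
import numpy as np

def deepmimic_style_reward(qpos, qvel, ref_qpos, ref_qvel, t):
    """Per-frame imitation term, simplified from DeepMimic.
    ref_qpos/ref_qvel index the mocap clip at frame t, so every
    timestep has a specified right answer to match."""
    pose_err = np.sum((qpos - ref_qpos[t]) ** 2)
    vel_err = np.sum((qvel - ref_qvel[t]) ** 2)
    return 0.75 * np.exp(-2.0 * pose_err) + 0.25 * np.exp(-0.1 * vel_err)
```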
That's depressing if your goal was "RL without mocap." It's clarifying if your goal is "robust humanoid locomotion." The supervision signal has to come from somewhere. Removing mocap forces the supervision through the reward function, and the reward function is gameable.
After all five attempts, the defensive stack that catches the failure modes we've seen looks like this:

- Termination constraints, not reward weights: pelvis below a height floor or any non-foot contact with the ground ends the episode (catches knee-walking, v13/v15).
- Per-foot contact-rate checks: each foot, individually, must carry a minimum share of stance (catches single-leg hopping, v16).
- Time-ordered reference poses or an explicit directional term (catches backward walking, the pose-pair attempt).
- Kinematic sanity ranges: step length, joint excursions, and pelvis height reported alongside velocity (catches micro-shuffle, v8).
- Watching the rendered video before claiming success, with durations reported in seconds, not frames (catches whatever the numbers miss).

Each layer catches a class of failure that the others don't. Stack them. Don't substitute one for another.
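Pulled together as one gate function — thresholds are illustrative except the 0.55 m pelvis floor from the v16 fix, and the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RolloutStats:
    mean_step_length: float      # m
    mean_pelvis_height: float    # m
    forward_velocity: float      # m/s
    left_contact_frac: float     # fraction of frames in contact
    right_contact_frac: float
    nonfoot_floor_contacts: int
    survival_seconds: float      # seconds, never frames

def passes_defensive_stack(s: RolloutStats) -> bool:
    """Each line names the failure mode that made it necessary."""
    return all([
        s.mean_step_length > 0.25,        # micro-step shuffle (v8)
        s.mean_pelvis_height > 0.55,      # knee-walking (v13, v15)
        s.forward_velocity > 0.2,         # backward gait (pose-pair attempt)
        s.left_contact_frac > 0.15,       # single-leg hop (v16):
        s.right_contact_frac > 0.15,      #   per-foot, not combined
        s.nonfoot_floor_contacts == 0,    # kneecap contact (v13)
        s.survival_seconds > 10.0,        # frame milestones under 1 s
    ])
```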
For practitioners working on RL with weak supervision:

- Encode hard constraints as terminations, not reward weights; a policy can't trade a termination against the alive bonus.
- Give every metric a defensive twin that catches its degenerate version.
- Watch the rendered rollout before reporting success; none of the numeric criteria we designed survived contact with the video.
- Report durations in seconds, not frames.

For researchers proposing novel reward formulations:

- Report contact patterns, joint excursions, and pelvis height, not just velocity and survival; all five of our degenerate gaits had healthy velocity or survival numbers.
- Evaluate your fitness function as if a determined adversary were optimizing against it, because the policy optimizer is one.
The pattern across three weeks was the same emotional curve, five times:
Days 1–3: implement the reward, debug the math, run a smoke test, kick off a long training run. Optimistic. This time it'll work — the prior approaches were too coarse, this one has structural priors / temporal ordering / LLM iteration that should fix the obvious issue.
Days 4–5: training metrics climb to high numbers. Episode length saturates at 1000. Fitness score rises. We have something.
Day 6: render a video, watch it. The robot is doing something that is, in some technical sense, what the metrics said. It is not what we wanted.
Day 7: write down the new failure mode as a defensive check. Add it to the fitness function. Move on to the next reward formulation, with the new check in place.
It's tempting to read this as "Claude (or any RL practitioner) was being naive." The honest version is that every single one of these failure modes is reasonable in isolation. Forward-velocity reward producing micro-shuffle is documented in the literature. Physics priors producing knee-walks happens enough to have its own informal name (the "kneecap-puck" mode). Pose-pair imitation producing direction ambiguity is geometrically necessary if you think about it. Single-leg hopping winning an LLM-iterated reward search is exactly what the Eureka paper predicts if you don't include a per-foot contact check.
What you can't do in advance is enumerate all of them. You discover each one when the policy finds it, and the lesson you keep learning is not "design better rewards" — it's "the metric I just defined is the next one the policy will game." The defensive stack at the end of this post is the union of every failure mode we hit; it does not protect against the next one.
Two follow-up posts continue this thread:
We tried to escape this trap by letting an LLM design the reward function for us — Eureka-style iteration. That's the LLM-reward-iteration post. The result is a more nuanced version of the v16 story above: the LLM was good at improving reward quality but couldn't break the survival cliff. The actual breakthrough came from a different direction.
We then tried letting a VLM (Claude vision) judge the resulting policies instead of hand-coded fitness, hoping the qualitative judgment would catch what the metrics missed. That's the VLM-critique post — the VLM gave 62/100 to a collapsing robot because, when you look at 8 keyframes decontextualized from time, even a forward-falling humanoid looks plausible.
Each failed approach: 16 parallel envs × 10–30M PPO steps ≈ 1.5–4 hours on a single laptop GPU. Total wallclock across five failed approaches and various intra-attempt tuning: roughly 40–60 GPU-hours over three weeks. Small budget, but cumulatively significant for a single-person project. Each failure was tested at similar effort to the working baseline — these aren't lazy attempts. They're serious tries at "RL without mocap," each with its own paper-supported justification, each of which the policy outsmarted.