We adapted the Eureka methodology (Ma et al., NVIDIA 2023) — using an LLM to iteratively propose reward functions for an RL agent — to cold-start humanoid bipedal walking on the Unitree G1. Over 24 candidates across two experimental rounds, the LLM proposed structurally diverse rewards: multiplicative stability gating, min-aggregated per-foot kinematics, curriculum-shaped weighting via a step counter, and joint-velocity stance/swing discrimination. The proposals were creative, the iteration loop converged, and the LLM correctly self-corrected on observed failure modes.
No candidate achieved sustained walking.
The persistent failure mode was a survival cliff at 50–70 steps that no reward formulation we tried could break — even with 4× extended training. The best candidate (Round 2, iteration 2, candidate 2) was a curriculum-shaped reward that produced real bipedal stepping at 0.44 m/s for 61 steps before falling. Then we trained the same reward for 4× longer and got a backward walker, bilaterally balanced to within 1.3% but headed the wrong way.
This post is, to our knowledge, the first reported negative result for Eureka on humanoid bipedal locomotion. The findings are laid out below.
Eureka is straightforward to describe. An LLM proposes K candidate reward functions given a task description and the history of prior attempts. Each candidate is trained briefly. The trained policies are scored by an independent fitness function. The LLM gets the fitness scores plus structured failure descriptions, and proposes new candidates. Loop until convergence.
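The loop above can be sketched in a few lines of Python. Here `propose_rewards`, `train_policy`, and `evaluate_fitness` are placeholders for the LLM call, the short PPO run, and the independent fitness function; none of these names come from the paper's codebase.

```python
def eureka_loop(task_description, propose_rewards, train_policy, evaluate_fitness,
                iterations=4, k=3):
    """Minimal Eureka-style outer loop: propose K candidate rewards, train each
    briefly, score with an independent fitness function, and feed the full
    history back to the LLM for the next round of proposals."""
    history = []  # (reward_source, fitness, failure_summary) tuples
    best = None
    for _ in range(iterations):
        candidates = propose_rewards(task_description, history, k=k)
        for cand in candidates:
            policy = train_policy(cand)               # short PPO run per candidate
            fitness, failures = evaluate_fitness(policy)
            history.append((cand, fitness, failures))
            if best is None or fitness > best[1]:
                best = (cand, fitness, failures)
    return best, history
```

The key design point is that the fitness function scoring the policies is separate from the reward the policy trains on, so the LLM cannot game its own evaluator.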
NVIDIA's 2023 paper validated this on manipulation tasks — pen-spinning, drawer opening, hand dexterity benchmarks. The LLM-designed rewards consistently matched or beat human-designed rewards on those tasks. It was a real result. The natural follow-up question: does this generalize to humanoid bipedal locomotion?
Our project is a single-person attempt to answer that.
Robot. Unitree G1, 29 actuated joints, MuJoCo simulation, 50 Hz control. Same setup as the rest of this series.
Hard constraints via termination. Following the painful lesson from the reward-hacking saga — that reward shaping cannot prevent degenerate solutions — we added two physics-level termination conditions that fire regardless of what reward the LLM proposes.
The reward function doesn't need to encode these constraints. Termination handles them. The LLM is free to focus on what walking looks like, not on what walking isn't.
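As an illustration of what reward-independent termination looks like, here is a sketch using two plausible stand-in conditions — a pelvis-height floor and a no-flight rule. These are assumed examples, not necessarily the project's actual two conditions.

```python
def should_terminate(state, min_pelvis_height=0.45, max_airborne_steps=10):
    """Illustrative physics-level termination check. Fires regardless of the
    reward function:
    - "fall": pelvis drops below a height floor
    - "flight": both feet off the ground for too many consecutive steps
    `state` is a dict in the style of the env's structured state; the keys
    and thresholds here are assumptions for illustration."""
    if state["pelvis_height"] < min_pelvis_height:
        return True, "fall"
    if state["airborne_steps"] > max_airborne_steps:
        return True, "flight"
    return False, None
```

Because termination ends the episode outright, a degenerate policy cannot accumulate reward in a forbidden regime no matter how the LLM weights the reward terms.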
Configurable env. A new UnitreeG1ConfigurableEnv accepts an arbitrary Python callable as the reward function. The env exposes a structured state dict (joint positions, velocities, foot contacts, pelvis quaternion, action history) to that callable. The LLM writes the function as Python source; we sandbox-execute it (AST whitelist, no imports, no dunders) to keep accidents and adversarial outputs contained.
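A minimal version of such an AST gate might look like the following; the node whitelist here is an illustrative subset, not the project's exact list.

```python
import ast

# Illustrative whitelist: enough for arithmetic reward expressions over a
# state dict, but no imports, no lambdas, no comprehensions, no loops.
ALLOWED_NODES = (
    ast.Module, ast.FunctionDef, ast.arguments, ast.arg, ast.Return,
    ast.Expr, ast.Assign, ast.AugAssign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare,
    ast.IfExp, ast.If, ast.Call, ast.Subscript, ast.Tuple, ast.List,
    ast.Dict, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.USub,
    ast.Lt, ast.Gt, ast.LtE, ast.GtE, ast.Eq, ast.And, ast.Or, ast.Not,
    ast.keyword, ast.Attribute,
)

def check_reward_source(source: str) -> None:
    """Reject LLM-written reward code that imports, touches dunders, or uses
    any AST node type outside the whitelist. Raises ValueError on violation."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed")
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        for name in (getattr(node, "id", None), getattr(node, "attr", None)):
            if isinstance(name, str) and name.startswith("__"):
                raise ValueError(f"dunder access not allowed: {name}")
```

A vetted source would then be exec'd with a restricted builtins dict; the AST pass catches the static escapes (imports, `__class__` ladders) before any code runs.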
Fitness function (independent of the reward). This is the part that v16 got wrong and v17 got right. The fitness scores a trained policy on 8 game-proof criteria, including a bilateral-symmetry check on left/right foot contact (|L−R|/total < 0.10). The last three criteria — bilateral, grounded, and double-support — were added after v16's single-leg hopper fooled an earlier version. With the strengthened fitness, the hopper correctly scores low: "BILATERAL ASYMMETRY: this is single-leg hopping, not walking."
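The bilateral criterion, for instance, reduces to a contact-count comparison over a rollout. A sketch, assuming per-frame boolean contact flags for each foot:

```python
def bilateral_balance(left_contact, right_contact, threshold=0.10):
    """Score left/right foot-contact balance over a rollout.
    Passes when |L - R| / total < threshold. A single-leg hopper
    (one foot never touching down) scores imbalance 1.0 and fails."""
    left, right = sum(left_contact), sum(right_contact)
    total = left + right
    if total == 0:
        return 1.0, False          # never grounded: worst-case imbalance
    imbalance = abs(left - right) / total
    return imbalance, imbalance < threshold
```

Run against the v16 hopper's contact trace (right foot never down), this returns an imbalance of 1.0, which is exactly the signal the earlier 5-criterion fitness lacked.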
LLM access. Claude Opus 4.7 via the Claude Agent SDK, using subscription credentials. Each prompt includes the full history of prior candidates — the rewards, the fitness scores, the structured failure summaries — so the LLM can iterate on observed failures rather than re-trying the same shape twice.
Training budget. 2.5M PPO steps per candidate, 16 parallel envs, roughly 20 minutes wallclock each. Four iterations × three candidates = 12 candidates per run, ~4 hours total. Single laptop GPU.
We ran two rounds.
Round 1 (v16) used a weaker 5-criterion fitness function. The best candidate scored 98/100 and was, on inspection, a single-leg hopper — the right foot literally never touched the ground in 1000 frames. We initially reported this as a breakthrough. The user watched the video and said "that's a flamingo hopping, not walking." The retraction story is told in the reward-hacking saga.
Round 2 (v17) used the strengthened 8-criterion fitness above. The bilateral, grounded, and double_support metrics correctly identified the v16 hopper as not-walking. With the gate hardened, we re-ran the Eureka loop. Every candidate was now actually scored on whether both feet were doing the work.
What follows are the Round 2 results.
| Iter | Cand | LLM's approach | Fitness | ep_len | vx (m/s) | Notes |
|---|---|---|---|---|---|---|
| 0 | 0 | Phase-clocked gait reference | 45.6 | 41 | 0.08 | hopping + airborne |
| 0 | 1 | Clock-free contact-shaping | 67.0 | 30 | 0.21 | forward stumble |
| 0 | 2 | Energy + biomechanics anti-phase | 62.3 | 52 | — | reverted to hopping |
| 1 | 0 | Survival-first design | 60.8 | 46 | 0.36 | bilateral fixed, no double-support |
| 1 | 1 | Min-aggregated per-foot kinematics | 70.1 | 44 | 0.29 | real bipedal stance visible |
| 1 | 2 | CoT + foot-strike rotation | 63.4 | 42 | 0.30 | symmetry partial |
| 2 | 0 | Multiplicative stability gating | 64.3 | 36 | 0.27 | 50% airborne |
| 2 | 1 | Joint-velocity stance/swing | 67.6 | 52 | 0.29 | better swing recognition |
| 2 | 2 | Curriculum via step counter | 73.8 | 61 | 0.44 | real bipedal stepping, falls at 1.2 s (step 61) |
| 3 | 0 | Aggressive stand-first curriculum | 64.9 | 68 | 0.28 | longest survival |
| 3 | 1 | Explicit gait-clock reference | 65.3 | 43 | 0.28 | mediocre |
| 3 | 2 | Joint-vel + CoT efficiency | 62.0 | — | — | tail of run |
The best candidate — iter 2 cand 2 — scored 73.8/100 with a curriculum-shaped reward. The LLM's own rationale for this design:
"Early in the episode the policy is rewarded almost entirely for posture and grounded contact, then velocity weight ramps up smoothly so the policy learns to stand stably first and only then is incentivized to translate forward, directly attacking the 'all priors die at step ~50' failure mode."
That's a competent piece of curriculum design. It produced a policy that walks forward at 0.44 m/s with 4% bilateral imbalance (near-symmetric stepping), proper left-right alternation, and visible real walking poses for 30–50 frames — before losing balance at step 61.
It is the closest thing to walking that any non-mocap method produced in this project.
It is also, unambiguously, not walking. 61 steps at 50 Hz control is 1.22 seconds.
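The curriculum shape the rationale describes can be sketched as follows. The state keys, term weights, and 50-step ramp are illustrative assumptions, not the candidate's actual source:

```python
def curriculum_reward(state, step, ramp_steps=50):
    """Curriculum-shaped reward sketch: posture and grounded contact dominate
    early in the episode, then the forward-velocity weight ramps in with the
    step counter, so the policy learns to stand before it is paid to move."""
    w_vel = min(1.0, step / ramp_steps)            # velocity weight: 0 -> 1
    posture = -abs(state["pelvis_tilt"])           # upright-posture term
    grounded = 1.0 if state["any_foot_contact"] else 0.0
    velocity = state["vx"]                         # forward-velocity term
    return (1.0 - w_vel) * (posture + grounded) + w_vel * velocity
```

At step 0 the reward is pure posture-plus-contact; by the end of the ramp it is pure forward velocity, which is the "stand stably first, translate forward later" schedule the rationale claims.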
Reading through the candidate rationales is, frankly, a more pleasant experience than reading through a human researcher's notebook, and the LLM did several things better than expected.
If the question was "can an LLM design reward functions thoughtfully?" the answer from this experiment is yes. The proposals are inventive, the iteration is principled, and the loop converges.
Every candidate fell by step 70. Some at 30, some at 68, all under 70. This is the persistent failure mode, and it is structural, not reward-design-related:
The reward can be everything from "track this specific gait-clock signal" to "reward bilateral foot contact at this fraction" — and the policy still falls in the same 50–70 step window. The bottleneck isn't what the reward is asking for. It's that the policy has nothing in its training distribution that lets it answer.
We took the best candidate (iter 2 cand 2, curriculum-shaped) and trained it for 4× the budget — 10M PPO steps instead of 2.5M. Hypothesis: maybe the reward is right and the network just needs more training time to internalize it.
Result. Fitness decreased to 60.9. Episode length increased to 105 steps. The robot now walks backward at -0.42 m/s, with near-perfect bilateral balance (1.3% imbalance) and a 1.0 double-support score.
The extra training collapsed the curriculum's forward-motion component into a stable backward-walking basin. The reward gave the policy enough time to discover that if you walk backward, you can stay up indefinitely. It found a local optimum that satisfied the new bilateral and grounded criteria better than the forward-walking attempt — at the cost of being directionally wrong.
This is the second nail in the coffin. The survival cliff isn't a budget problem. It's a state-space coverage problem. Throwing 4× compute at a reward that doesn't give the policy access to mid-walk recovery states just lets the policy discover a stabler degeneracy.
Eureka's manipulation-task success (pen-spinning, drawer opening) relies on a property that humanoid walking doesn't have: a path from random exploration to task completion that the LLM-designed reward can shape.
For pen-spinning, the policy can flail randomly and occasionally produce something pen-spin-shaped that the reward identifies as progress. The reward then shapes the policy toward that direction. Random exploration touches the success manifold often enough for the reward to do the steering.
For cold-start humanoid walking, random exploration produces falls. Falls terminate the episode within 5–15 steps before the policy can discover anything walk-shaped. The reward cannot shape what the policy never produces.
This is the state-space coverage problem, and it's well-studied in the imitation learning literature. DeepMimic's Reference State Initialization (RSI) solves it by starting episodes at random phases of a motion capture reference — so the policy sees mid-walk states from training step 1. The policy learns recovery skills because the trajectory distribution contains near-fall scenarios.
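The RSI mechanism itself is simple to sketch. This assumes an `env.set_state(qpos, qvel)` hook and a reference trajectory stored as a list of state pairs — illustrative interfaces, not this project's exact API:

```python
import random

def rsi_reset(env, reference_trajectory):
    """DeepMimic-style Reference State Initialization sketch: start each
    episode at a uniformly random phase of a reference motion, so the policy
    sees mid-walk states from training step 1 instead of only standing starts."""
    phase = random.random()                        # uniform phase in [0, 1)
    idx = int(phase * len(reference_trajectory))
    qpos, qvel = reference_trajectory[idx]
    env.set_state(qpos, qvel)                      # teleport sim to that state
    return phase
```

Because the start-state distribution covers the whole gait cycle, including near-fall configurations, the policy is forced to learn recovery from states it would otherwise never reach.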
Without mocap, there's no obvious source for RSI states. You can't initialize at "phase 0.3 of a walking cycle" if you don't have a walking cycle to point at.
We tried self-RSI: collect (qpos, qvel) snapshots from a brief-warmup policy's rollouts and use those as start states for the main training. The idea is to bootstrap coverage from whatever the warmup policy can produce, even if it's degenerate.
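A self-RSI state bank can be sketched like this; the `env.step`/`env.snapshot` interface is an assumption for illustration, and the caveat from the text applies — a hopping warmup policy fills the bank with hopping states:

```python
import random

def collect_self_rsi_states(env, policy, n_rollouts=10, max_steps=100):
    """Self-RSI sketch: roll out a warmup policy and bank (qpos, qvel)
    snapshots to use as start states for the main training run. The bank's
    quality is bounded by the warmup policy's gait quality."""
    bank = []
    for _ in range(n_rollouts):
        obs = env.reset()
        for _ in range(max_steps):
            obs, done = env.step(policy(obs))
            bank.append(env.snapshot())            # (qpos, qvel) copy
            if done:
                break
    return bank

def sample_start_state(bank):
    """Draw a start state for the main run's episode reset."""
    return random.choice(bank)
```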
It helped — survival jumped from the ~50-step cliff to about 200 steps. But the underlying warmup policy was a single-leg hopper, and the RSI distribution inherited that hopping bias. The self-RSI'd policy walked further but in single-leg mode.
State-space coverage from a hopping warmup gives you a better hopper, not a walker. The lesson here is that RSI works if the reference trajectory is good. If the only thing you can RSI from is a degenerate gait, you've just built a more robust version of the degeneracy.
To make self-RSI work for non-mocap walking, you'd need a non-degenerate warmup policy. Producing one is exactly the problem self-RSI was supposed to solve. The chicken-and-egg loop is the actual research challenge.
For practitioners considering LLM-iterated reward design: the infrastructure (g1_configurable_env.py, llm_reward_loop.py, walking_fitness.py) is task-agnostic. Plug in a different state dict and a different fitness function and the LLM iteration applies. If your problem is reward-design-limited, this is a good multiplier.

For researchers in humanoid locomotion specifically:
Two follow-ups that might actually break the survival cliff:
VLM critique replacing the hand-coded fitness. Have a vision-language model evaluate rollout naturalness and provide that as the failure summary. This addresses the evaluation problem (catching degenerate gaits that hand-coded metrics miss) but doesn't address the coverage problem. We pursued this — that's the VLM-critique post — and the result was its own form of negative finding.
LLM reward iteration on top of v12 (a working mocap policy), not from cold start. Use the LLM to refine an already-walking policy's gait quality instead of trying to build walking from scratch. This sidesteps the cold-start coverage issue entirely and should produce a publishable "natural gait refinement" result. We haven't run this yet. It's the most promising next experiment in this thread.
The general pattern: once you have a policy that can walk badly, LLM reward iteration is well-positioned to make it walk well. Building the bad walker from scratch is the hard part, and that's not what reward iteration is for.
Each candidate: 2.5M PPO steps at ~2200 fps aggregate throughput across 16 envs ≈ 20 minutes wallclock. 4 iterations × 3 candidates = 12 candidates ≈ 4 hours wallclock per round. Extended training: 5.7M steps ≈ 45 minutes. All on a single 4 GB laptop GPU.
LLM API cost: zero — used Claude Code subscription via the Agent SDK, no per-token charges. The whole experimental round (24 candidates, two fitness functions, extended training, plus the analysis runs) cost about 8 GPU-hours and a few hours of LLM dialogue.
This is a tractable methodology for a single researcher. The compute footprint is small. The conclusion — that reward iteration alone is insufficient for cold-start humanoid bipedal walking — is robust across our two runs and the extended training experiment.
The series wraps with one more post in this line of inquiry.
After hand-coded fitness functions kept missing degeneracies (v16 hopping) and after LLM-iterated rewards hit the survival cliff (this post), we tried the obvious next thing: let a vision-language model judge the rollouts. The hypothesis was that human-like qualitative judgment would catch what metrics couldn't.
It did and it didn't. The result is The VLM That Scored a Collapsing Robot 62/100 — another honest accounting of where the abstraction breaks. The pattern across the three diagnostic posts in this series (reward hacking, this one, and the VLM critique) is that every "let's automate the supervision" attempt produced a new class of failure, not zero failures. The reliable signal in this project remained the same as it started: a real motion-capture reference and the minimum-viable-mocap result that two carefully-chosen poses are enough.