AI/ML · 2026-05-06 · 15 min read · By Abhishek Nair - Fractional CTO for Deep Tech & AI

When LLMs Iterate on Rewards: A Negative Result From Humanoid Locomotion

#Reinforcement Learning · #Humanoid Robotics · #LLM Agents · #Eureka · #Reward Engineering · #MuJoCo

TL;DR

We adapted the Eureka methodology (Ma et al., NVIDIA 2023) — using an LLM to iteratively propose reward functions for an RL agent — to cold-start humanoid bipedal walking on the Unitree G1. Over 24 candidates across two experimental rounds, the LLM proposed structurally diverse rewards: multiplicative stability gating, min-aggregated per-foot kinematics, curriculum-shaped weighting via step counter, joint-velocity stance/swing discrimination. The proposals were creative, the iteration loop converged, the LLM correctly self-corrected on observed failure modes.

No candidate achieved sustained walking.

The persistent failure mode was a survival cliff at 50–70 steps that no reward formulation we tried could break — even with 4× extended training. The best candidate (Round 2, iteration 2, candidate 2) was a curriculum-shaped reward that produced real bipedal stepping at 0.44 m/s for 61 steps before falling. Then we trained the same reward for 4× longer and got a 1.3% bilaterally-balanced backward walker.

This post is the first negative result of Eureka on humanoid bipedal locomotion that we're aware of. The findings:

  • LLM reward iteration is a legitimate tool. It catches design failures we don't anticipate and proposes shapes a human researcher wouldn't.
  • It is not a substitute for state-space coverage. If your task has a cold-start coverage bottleneck (random exploration can't reach task completion), reward iteration explores the design space but doesn't solve the underlying RL problem.
  • The methodology is reusable. Different task, different state dict, different fitness — same loop.

Background: what Eureka does

Eureka is straightforward to describe. An LLM proposes K candidate reward functions given a task description and the history of prior attempts. Each candidate is trained briefly. The trained policies are scored by an independent fitness function. The LLM gets the fitness scores plus structured failure descriptions, and proposes new candidates. Loop until convergence.
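In code, the loop really is that small. The sketch below shows the structure only; it is not the project's actual llm_reward_loop.py, and llm_propose_rewards, sandbox_compile, train_ppo, and evaluate_fitness are placeholders for the pieces described later in this post.

```python
def eureka_loop(task_description, n_iterations=4, k_candidates=3):
    history = []  # (reward_source, fitness_score, failure_summary)
    for _ in range(n_iterations):
        # LLM proposes K reward functions, conditioned on all prior attempts.
        candidates = llm_propose_rewards(task_description, history, k=k_candidates)
        for source in candidates:
            reward_fn = sandbox_compile(source)             # AST-whitelisted exec
            policy = train_ppo(reward_fn, steps=2_500_000)   # brief training per candidate
            score, failure = evaluate_fitness(policy)        # independent fitness, not the reward
            history.append((source, score, failure))
    return max(history, key=lambda h: h[1])                  # best candidate by fitness
```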

NVIDIA's 2023 paper validated this on manipulation tasks — pen-spinning, drawer opening, hand dexterity benchmarks. The LLM-designed rewards consistently matched or beat human-designed rewards on those tasks. It was a real result. The natural follow-up question: does this generalize to humanoid bipedal locomotion?

Our project is a single-person attempt to answer that.

Setup

Robot. Unitree G1, 29 actuated joints, MuJoCo simulation, 50 Hz control. Same setup as the rest of this series.

Hard constraints via termination. Following the painful lesson from the reward-hacking saga — that reward shaping cannot prevent degenerate solutions — we added two physics-level termination conditions that fire regardless of what reward the LLM proposes:

  • Pelvis height below 0.55 m → episode terminates (prevents knee-walking)
  • Any non-foot body part contacts the floor → episode terminates (prevents crawling)

The reward function doesn't need to encode these constraints. Termination handles them. The LLM is free to focus on what walking looks like, not on what walking isn't.
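For reference, a minimal sketch of what those two checks can look like against MuJoCo's data structures. The body and geom names ("pelvis", "floor", the ankle links) are illustrative assumptions, not necessarily the project's exact identifiers.

```python
import mujoco

PELVIS_MIN_HEIGHT = 0.55  # metres; below this the robot is knee-walking
FOOT_BODIES = {"left_ankle_roll_link", "right_ankle_roll_link"}  # assumed body names

def should_terminate(model: mujoco.MjModel, data: mujoco.MjData) -> bool:
    # 1. Pelvis too low -> knee-walking, terminate.
    if data.body("pelvis").xpos[2] < PELVIS_MIN_HEIGHT:
        return True
    # 2. Any non-foot body in contact with the floor -> crawling, terminate.
    for i in range(data.ncon):
        g1, g2 = data.contact[i].geom1, data.contact[i].geom2
        names = (model.geom(g1).name, model.geom(g2).name)
        if "floor" in names:
            other = g2 if names[0] == "floor" else g1
            body = model.body(model.geom_bodyid[other]).name
            if body not in FOOT_BODIES:
                return True
    return False
```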

Configurable env. A new UnitreeG1ConfigurableEnv accepts an arbitrary Python callable as the reward function. The env exposes a structured state dict (joint positions, velocities, foot contacts, pelvis quaternion, action history) to that callable. The LLM writes the function as Python source; we sandbox-execute it (AST whitelist, no imports, no dunders) to keep accidents and adversarial outputs contained.
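The sandbox itself is mostly an AST walk. A minimal sketch of the idea follows; the real whitelist is stricter, and compute_reward is an assumed entry-point name, not necessarily the one the env expects.

```python
import ast

SAFE_BUILTINS = {"abs": abs, "min": min, "max": max, "sum": sum, "len": len}

def sandbox_compile(source: str, fn_name: str = "compute_reward"):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed in LLM reward code")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access is not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder names are not allowed")
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(compile(tree, "<llm_reward>", "exec"), namespace)
    return namespace[fn_name]  # the callable the env invokes each step
```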

Fitness function (independent of the reward). This is the part that v16 got wrong and v17 got right. The fitness scores a trained policy on eight criteria designed to resist gaming:

  • standing (15 pts) — pelvis ≥ 0.65 m for ≥80% of episode
  • forward_vel (15) — mean forward velocity ≥ 0.3 m/s
  • survival (15) — episode survives ≥ 800 / 1000 steps
  • foot_only (10) — no non-foot floor contact
  • heading (10) — yaw drift below 30°
  • bilateral (15) — left and right foot ground times roughly equal (|L−R|/total < 0.10)
  • grounded (10) — airborne fraction below 20%
  • double_support (10) — both feet on ground simultaneously at least 5% of the time

The last three were added after v16's single-leg hopper fooled an earlier version. With the strengthened fitness, the hopper correctly scores low: "BILATERAL ASYMMETRY: this is single-leg hopping, not walking."
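For concreteness, this is roughly how those three anti-hopper criteria can be computed from per-step foot-contact logs. It's a sketch under the thresholds listed above; the real walking_fitness.py may normalize the bilateral term differently.

```python
import numpy as np

def anti_hopper_scores(left_contact: np.ndarray, right_contact: np.ndarray) -> dict:
    """left_contact / right_contact: boolean arrays, one entry per sim step."""
    l, r = left_contact.sum(), right_contact.sum()

    # bilateral (15 pts): left and right ground time roughly equal.
    imbalance = abs(l - r) / max(l + r, 1)
    bilateral = 15 if imbalance < 0.10 else 0

    # grounded (10 pts): airborne (neither foot down) less than 20% of the time.
    airborne_frac = np.mean(~left_contact & ~right_contact)
    grounded = 10 if airborne_frac < 0.20 else 0

    # double_support (10 pts): both feet down at least 5% of the time.
    double_support = 10 if np.mean(left_contact & right_contact) >= 0.05 else 0

    return {"bilateral": bilateral, "grounded": grounded, "double_support": double_support}
```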

LLM access. Claude Opus 4.7 via the Claude Agent SDK, using subscription credentials. Each prompt includes the full history of prior candidates — the rewards, the fitness scores, the structured failure summaries — so the LLM can iterate on observed failures rather than re-trying the same shape twice.

Training budget. 2.5M PPO steps per candidate, 16 parallel envs, roughly 20 minutes wallclock each. Four iterations × three candidates = 12 candidates per run, ~4 hours total. Single laptop GPU.
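The per-candidate training call is nothing exotic. A sketch assuming a Stable-Baselines3-style setup; the hyperparameters here are illustrative, not the project's tuned values.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def train_candidate(make_env, reward_fn, total_steps=2_500_000, n_envs=16):
    # 16 parallel copies of the configurable env, each wired to the LLM's reward.
    vec_env = SubprocVecEnv([lambda: make_env(reward_fn) for _ in range(n_envs)])
    model = PPO("MlpPolicy", vec_env, n_steps=512, batch_size=4096, verbose=0)
    model.learn(total_timesteps=total_steps)  # ~20 minutes wallclock per candidate
    return model
```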

Round 1 vs Round 2

We ran two rounds.

Round 1 (v16) used a weaker 5-criterion fitness function. The best candidate scored 98/100 and was, on inspection, a single-leg hopper — the right foot literally never touched the ground in 1000 frames. We initially reported this as a breakthrough. The user watched the video and said "that's a flamingo hopping, not walking." The retraction story is told in the reward-hacking saga.

Round 2 (v17) used the strengthened 8-criterion fitness above. The bilateral, grounded, and double_support metrics correctly identified the v16 hopper as not-walking. With the gate hardened, we re-ran the Eureka loop. Every candidate was now actually scored on whether both feet were doing the work.

What follows are the Round 2 results.

Results: all 12 candidates

| Iter | Cand | LLM's approach | Fitness | ep_len | vx (m/s) | Notes |
|------|------|----------------|---------|--------|----------|-------|
| 0 | 0 | Phase-clocked gait reference | 45.6 | 41 | 0.08 | hopping + airborne |
| 0 | 1 | Clock-free contact-shaping | 67.0 | 30 | 0.21 | forward stumble |
| 0 | 2 | Energy + biomechanics anti-phase | 62.3 | 52 | | reverted to hopping |
| 1 | 0 | Survival-first design | 60.8 | 46 | 0.36 | bilateral fixed, no double-support |
| 1 | 1 | Min-aggregated per-foot kinematics | 70.1 | 44 | 0.29 | real bipedal stance visible |
| 1 | 2 | CoT + foot-strike rotation | 63.4 | 42 | 0.30 | symmetry partial |
| 2 | 0 | Multiplicative stability gating | 64.3 | 36 | 0.27 | 50% airborne |
| 2 | 1 | Joint-velocity stance/swing | 67.6 | 52 | 0.29 | better swing recognition |
| 2 | 2 | Curriculum via step counter | 73.8 | 61 | 0.44 | real bipedal stepping, falls at 1.2 s (step 61) |
| 3 | 0 | Aggressive stand-first curriculum | 64.9 | 68 | 0.28 | longest survival |
| 3 | 1 | Explicit gait-clock reference | 65.3 | 43 | 0.28 | mediocre |
| 3 | 2 | Joint-vel + CoT efficiency | 62.0 | | | tail of run |

The best candidate — iter 2 cand 2 — scored 73.8/100 with a curriculum-shaped reward. The LLM's own rationale for this design:

"Early in the episode the policy is rewarded almost entirely for posture and grounded contact, then velocity weight ramps up smoothly so the policy learns to stand stably first and only then is incentivized to translate forward, directly attacking the 'all priors die at step ~50' failure mode."

That's a competent piece of curriculum design. It produced a policy that walks forward at 0.44 m/s with 4% bilateral imbalance (near-symmetric stepping), proper left-right alternation, and visible real walking poses for 30–50 frames — before losing balance at step 61.

It is the closest thing to walking that any non-mocap method produced in this project.

It is also, unambiguously, not walking. 61 steps is 1.22 seconds.
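To make the shape concrete, here's a minimal sketch of a curriculum reward of this kind. The weights, ramp length, and state-dict keys are illustrative; the LLM's actual candidate isn't reproduced here.

```python
import numpy as np

def curriculum_reward(state: dict) -> float:
    # Velocity weight ramps from 0 to 1 over the first ~200 steps (4 s at 50 Hz).
    ramp = min(state["step_count"] / 200.0, 1.0)

    posture = np.exp(-5.0 * abs(state["pelvis_height"] - 0.79))           # stay tall
    contact = 0.5 * (state["left_foot_contact"] + state["right_foot_contact"])
    velocity = float(np.clip(state["forward_velocity"] / 0.5, 0.0, 1.0))  # target ~0.5 m/s

    # Early: mostly posture + grounded contact. Late: forward progress matters.
    return (1.0 - 0.6 * ramp) * (0.6 * posture + 0.4 * contact) + 0.6 * ramp * velocity
```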

What the LLM did well

Reading through the candidate rationales is, frankly, a more pleasant experience than reading through a human researcher's notebook. Four things the LLM did better than expected:

  1. It iterated on actual failure modes. After seeing iteration 1 candidate 1's structured failure ("Robot is falling: only survives 44/1000 steps"), iter 2 candidate 0 explicitly opens with: "directly attacking the prior failure mode of 'high vx, fall in 40 steps.'" The LLM is reading the prior round's failures and proposing changes that target them.
  2. It proposed structurally different rewards each iteration. No two candidates have the same shape. The 12 candidates vary multiplicative gating, curricula, min-aggregation, contact discrimination, gait-clock references. The LLM doesn't get stuck on a single template.
  3. It cited project memory. When relevant, the LLM referenced stored notes about the standing-still trap and prior hybrid CPG+RL results. It was reading the available context, not just the immediate failure log.
  4. It self-corrected on degenerate solutions. When iter 2 cand 2's curriculum approach hit the cliff at 61 steps, iter 3 candidates proposed more aggressive curriculum variants — which extended survival to 68 steps before regressing on gait quality. The LLM correctly diagnosed the survival cliff as the bottleneck, even though it couldn't fix it.

If the question was "can an LLM design reward functions thoughtfully?" the answer from this experiment is yes. The proposals are inventive, the iteration is principled, and the loop converges.

What didn't work — the survival cliff

Every candidate fell by step 70. Some at 30, some at 68, all under 70. This is the persistent failure mode, and it is structural, not reward-design-related:

  • Cold start exploration. Episodes start in the nominal standing pose with zero velocity. The policy has no momentum to work with, no body-angle history, and is immediately at the constraint boundary (pelvis ~0.79 m, ~0.24 m margin before termination at 0.55 m).
  • No recovery data in the trajectory distribution. PPO learns from the trajectories it sees. With cold starts, the policy never experiences mid-walk tilt or near-fall states. By the time the policy has learned to step forward at all, any small lateral wobble is fatal — there are no recovery skills in the policy because there are no recovery scenarios in the data.
  • Reward design doesn't fix this. Different rewards produce different fall styles — knee-buckle, backward stumble, asymmetric hop, forward over-rotation — but the failure mode is the same: ~50–70 steps in, the policy hits a state it has never seen during training and has no learned response.

The reward can be everything from "track this specific gait-clock signal" to "reward bilateral foot contact at this fraction" — and the policy still falls in the same 50–70 step window. The bottleneck isn't what the reward is asking for. It's that the policy has nothing in its training distribution that lets it answer.

Extending the training budget — a worse failure

We took the best candidate (iter 2 cand 2, curriculum-shaped) and trained it for 4× the budget — 10M PPO steps instead of 2.5M. Hypothesis: maybe the reward is right, the network just needs more training time to internalize it.

Result. Fitness decreased to 60.9. Episode length increased to 105 steps. The robot now walks backward at -0.42 m/s, with perfect bilateral balance (1.3% imbalance) and a 1.0 double-support score.

The extra training collapsed the curriculum's forward-motion component into a stable backward-walking basin. The reward gave the policy enough time to discover that if you walk backward, you can stay up indefinitely. It found a local optimum that satisfied the new bilateral and grounded criteria better than the forward-walking attempt — at the cost of being directionally wrong.

This is the second nail in the coffin. The survival cliff isn't a budget problem. It's a state-space coverage problem. Throwing 4× compute at a reward that doesn't give the policy access to mid-walk recovery states just lets the policy discover a stabler degeneracy.

Why Eureka is insufficient here

Eureka's manipulation-task success (pen-spinning, drawer opening) relies on a property that humanoid walking doesn't have: a path from random exploration to task completion that the LLM-designed reward can shape.

For pen-spinning, the policy can flail randomly and occasionally produce something pen-spin-shaped that the reward identifies as progress. The reward then shapes the policy toward that direction. Random exploration touches the success manifold often enough for the reward to do the steering.

For cold-start humanoid walking, random exploration produces falls. Falls terminate the episode within 5–15 steps before the policy can discover anything walk-shaped. The reward cannot shape what the policy never produces.

This is the state-space coverage problem, and it's well-studied in the imitation learning literature. DeepMimic's Reference State Initialization (RSI) solves it by starting episodes at random phases of a motion capture reference — so the policy sees mid-walk states from training step 1. The policy learns recovery skills because the trajectory distribution contains near-fall scenarios.
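A sketch of the RSI reset, assuming the reference is an array of (qpos, qvel) frames from a mocap walking cycle and the env exposes its MuJoCo model and data:

```python
import numpy as np
import mujoco

def rsi_reset(env, ref_qpos: np.ndarray, ref_qvel: np.ndarray):
    phase = np.random.randint(len(ref_qpos))  # random phase of the reference cycle
    env.data.qpos[:] = ref_qpos[phase]
    env.data.qvel[:] = ref_qvel[phase]
    mujoco.mj_forward(env.model, env.data)    # recompute derived quantities (xpos, contacts)
    return env._get_obs()                     # assumed observation helper
```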

Without mocap, there's no obvious source for RSI states. You can't initialize at "phase 0.3 of a walking cycle" if you don't have a walking cycle to point at.

Self-RSI — the half-step that didn't help

We tried self-RSI: collect (qpos, qvel) snapshots from a brief-warmup policy's rollouts and use those as start states for the main training. The idea is to bootstrap coverage from whatever the warmup policy can produce, even if it's degenerate.
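A sketch of the snapshot collection, assuming a Gymnasium-style env that exposes MuJoCo state and an SB3-style warmup policy; names here are illustrative.

```python
def collect_self_rsi_states(env, warmup_policy, n_episodes=50, stride=10):
    snapshots = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, t = False, 0
        while not done:
            action, _ = warmup_policy.predict(obs, deterministic=False)
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            if t % stride == 0:
                snapshots.append((env.data.qpos.copy(), env.data.qvel.copy()))
            t += 1
    # At reset time, sample one of these instead of the nominal standing pose.
    return snapshots
```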

It helped — survival jumped from the ~50-step cliff to about 200 steps. But the underlying warmup policy was a single-leg hopper, and the RSI distribution inherited that hopping bias. The self-RSI'd policy walked further but in single-leg mode.

State-space coverage from a hopping warmup gives you a better hopper, not a walker. The lesson here is that RSI works if the reference trajectory is good. If the only thing you can RSI from is a degenerate gait, you've just built a more robust version of the degeneracy.

To make self-RSI work for non-mocap walking, you'd need a non-degenerate warmup policy. Producing one is exactly the problem self-RSI was supposed to solve. The chicken-and-egg loop is the actual research challenge.

Implications

For practitioners considering LLM-iterated reward design:

  1. It's a real tool, especially for tasks where the reward design space is large and the failure modes are diverse. The LLM catches design failures that humans don't anticipate. The candidate variation across our 12 attempts is wider than what a single researcher would produce in the same time.
  2. It is not a substitute for state-space coverage. If your task has a coverage bottleneck — random exploration can't reach task completion — Eureka explores the design space but doesn't solve the underlying RL problem. Diagnose the bottleneck before deciding which tool to apply.
  3. The methodology is reusable. Our framework (g1_configurable_env.py, llm_reward_loop.py, walking_fitness.py) is task-agnostic. Plug in a different state dict and a different fitness function and the LLM iteration applies. If your problem is reward-design-limited, this is a good multiplier.

For researchers in humanoid locomotion specifically:

  1. First negative result of Eureka on cold-start humanoid bipedal locomotion that we're aware of. The methodology generalizes from manipulation, but the state-coverage limitation transfers. Reward iteration isn't enough.
  2. The LLM's reward proposals are surprisingly creative. Worth reading the rationales — many would have been reasonable human choices. Curriculum-by-step-counter, min-aggregated per-foot correctness, joint-velocity stance/swing discrimination — none are in the standard humanoid RL recipe book. The LLM is generating real signal about what reward shapes are plausible, even when none of them break the survival cliff.
  3. Strong constraints via termination remain the most reliable anti-cheat. Reward weights are gameable; physics-level termination is not. The two terminations we added at the start of this experiment kept every candidate honest — none of them tried to game pelvis height or non-foot contact, because they couldn't.

What we'd try next

Two follow-ups that might actually break the survival cliff:

VLM critique replacing the hand-coded fitness. Have a vision-language model evaluate rollout naturalness and provide that as the failure summary. This addresses the evaluation problem (catching degenerate gaits that hand-coded metrics miss) but doesn't address the coverage problem. We pursued this — that's the VLM-critique post — and the result was its own form of negative finding.

LLM reward iteration on top of v12 (a working mocap policy), not from cold start. Use the LLM to refine an already-walking policy's gait quality instead of trying to build walking from scratch. This sidesteps the cold-start coverage issue entirely and should produce a publishable "natural gait refinement" result. We haven't run this yet. It's the most promising next experiment in this thread.

The general pattern: once you have a policy that can walk badly, LLM reward iteration is well-positioned to make it walk well. Building the bad walker from scratch is the hard part, and that's not what reward iteration is for.

Compute footprint

Each candidate: 2.5M PPO steps at roughly 2,200 steps/s of aggregate throughput across 16 parallel envs ≈ 20 minutes wallclock. 4 iterations × 3 candidates = 12 candidates ≈ 4 hours wallclock per round. Extended training: 5.7M steps ≈ 45 minutes. All on a single 4 GB laptop GPU.

LLM API cost: zero — used Claude Code subscription via the Agent SDK, no per-token charges. The whole experiment (24 candidates across both rounds, two fitness functions, extended training, plus the analysis runs) cost about 8 GPU-hours and a few hours of LLM dialogue.

This is a tractable methodology for a single researcher. The compute footprint is small. The conclusion — that reward iteration alone is insufficient for cold-start humanoid bipedal walking — is robust across our two runs and the extended training experiment.

What's next

The series wraps with one more post in this line of inquiry.

After hand-coded fitness functions kept missing degeneracies (v16 hopping) and after LLM-iterated rewards hit the survival cliff (this post), we tried the obvious next thing: let a vision-language model judge the rollouts. The hypothesis was that human-like qualitative judgment would catch what metrics couldn't.

It did and it didn't. The result is The VLM That Scored a Collapsing Robot 62/100 — another honest accounting of where the abstraction breaks. The pattern across the three diagnostic posts in this series (reward hacking, this one, and the VLM critique) is that every "let's automate the supervision" attempt produced a new class of failure, not zero failures. The reliable signal in this project remained the same as it started: a real motion-capture reference and the minimum-viable-mocap result that two carefully-chosen poses are enough.
