A humanoid walking policy reported 1000-step episodes during training but only 84 steps when we evaluated the saved checkpoint offline. Same model, same configuration, same evaluation harness — a 12× reproducibility gap that took two weeks to diagnose.
Two bugs were responsible. First, Stable-Baselines3's VecNormalize keeps running observation statistics that drift continuously during training and don't survive the save/load cycle cleanly. The saved vecnorm.pkl snapshot didn't match what the saved model expected. Second, the training CLI default for the reference motion clip didn't match the env class's default, so offline eval silently used a different reference than training did.
Either bug alone would have produced a regression. Together they masked each other: fixing either one alone barely moved the numbers, which made the search even harder. The combined fix is in production now and reproduces 5/5 across seeds. The policy was always good. Our measurement was always wrong.
We trained a DeepMimic-style policy on the Unitree G1 humanoid using PPO from Stable-Baselines3. The training callback fired its periodic evaluation and reported exactly what we expected:
```
Eval num_timesteps=3950000, episode_reward=1052.34
mean_ep_length: 1000.0
```
A 1000-step episode at 50 Hz is 20 seconds of upright walking. The reward was where we expected. Everything looked clean. We saved the model:
```
archives/v9b_sync2/best_model.zip      # the PPO policy
archives/v9b_sync2/best_vecnorm.pkl    # the VecNormalize statistics
```
Then we ran the offline evaluation script:
```bash
python compute_gait_metrics.py --model archives/v9b_sync2/best_model.zip
```
Episode length 84. The robot fell after about 1.7 seconds.
Same model. Same checkpoint. Same configuration. Loaded with VecNormalize.load() from the same .pkl file the training callback had been using internally. A 12× reproducibility gap with no visible cause.
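For context, the offline script loads the checkpoint the standard SB3 way. A minimal sketch of that pattern, with make_env standing in for however the eval script constructs the raw environment:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# make_env is a placeholder for the eval script's env constructor.
venv = DummyVecEnv([make_env])
venv = VecNormalize.load("archives/v9b_sync2/best_vecnorm.pkl", venv)
venv.training = False      # freeze the running statistics during eval
venv.norm_reward = False   # report raw rewards
model = PPO.load("archives/v9b_sync2/best_model.zip")
```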
Two weeks of debugging happened between "84 steps offline" and "we have a hypothesis":
- We set training=False on the loaded VecNormalize so its statistics would stop updating during eval. No help.
- We matched the env construction exactly: same DummyVecEnv, same VecNormalize, same clip_obs argument. No help.
- We seeded everything through numpy.random.seed. No help.

At this point we were ready to assume PPO was nondeterministic in some new way and post a bug report to the SB3 GitHub. Worth noting: that assumption is almost always wrong. PPO is plenty deterministic given identical inputs. The issue is almost always that the inputs aren't identical, and you haven't noticed.
We added two diagnostic prints. One inside train_mimic.py just after model.learn() returned, when the eval callback had reported ep_len=1000. One inside the offline eval script, just after VecNormalize.load().
Both printed vec_env.obs_rms.mean and vec_env.obs_rms.var — the running statistics that VecNormalize uses to normalize observations.
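The diagnostic itself is two lines; a sketch, where vec_env is whichever VecNormalize instance each script holds:

```python
import numpy as np

# Same two prints in both places: after model.learn() in train_mimic.py,
# and after VecNormalize.load() in the offline eval script.
np.set_printoptions(precision=3, suppress=True)
print("obs_rms.mean[5:10]:", vec_env.obs_rms.mean[5:10])
print("obs_rms.var[5:10]: ", vec_env.obs_rms.var[5:10])
```

The two snapshots disagreed: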
```
training-time obs_rms.mean[5:10]: [-0.012, 0.183, -0.087, 0.412, 0.044]
offline       obs_rms.mean[5:10]: [-0.014, 0.169, -0.092, 0.388, 0.041]
```
5–15% drift on most dimensions. Small, but enough to push the policy's input distribution off-manifold for a high-DOF humanoid.
The root cause: VecNormalize keeps a running mean and variance of observations that update on every vec_env.step() call. The training env keeps updating them across training. The eval env, run through SB3's EvalCallback, also updates — because EvalCallback syncs the training env's obs_rms into the eval env on every callback fire.
When the model is saved at the best-eval point (e.g., 3.95M steps), the model weights are frozen at that snapshot. But the vecnorm pickle gets saved at the end of training (e.g., 30M steps) — a 26M-step drift later. The mean and variance the saved policy expects are not the mean and variance it gets when you reload the saved pickle.
The policy was trained to expect observations normalized one way. It gets observations normalized differently. Walking-as-trained becomes falling-when-loaded.
This isn't a bug in SB3 per se — it's a consequence of how VecNormalize is designed plus how EvalCallback saves snapshots. The library is doing what it says on the tin. The cost is that "saving a checkpoint" is now a fragile two-file operation where the two files must be from the exact same moment of training, and SB3 doesn't enforce that.
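If you do stay on VecNormalize, one mitigation is to snapshot the statistics at the exact moment EvalCallback saves a new best model, so the two files cannot drift apart. A sketch using SB3's callback_on_new_best hook; the class name and paths are ours, not part of the library:

```python
from stable_baselines3.common.callbacks import BaseCallback, EvalCallback

class SaveVecNormalizeOnBest(BaseCallback):
    """Snapshot the training env's VecNormalize stats whenever a new best model is saved."""

    def __init__(self, save_path: str):
        super().__init__()
        self.save_path = save_path

    def _on_step(self) -> bool:
        vec_env = self.model.get_vec_normalize_env()
        if vec_env is not None:
            vec_env.save(self.save_path)  # written at the same instant as best_model.zip
        return True

eval_callback = EvalCallback(
    eval_env,  # placeholder: the wrapped evaluation env
    best_model_save_path="archives/v9b_sync2",
    callback_on_new_best=SaveVecNormalizeOnBest("archives/v9b_sync2/best_vecnorm.pkl"),
    eval_freq=50_000,
)
# then: model.learn(total_timesteps, callback=eval_callback)
```

This keeps the two-file checkpoint internally consistent, but it is still a two-file checkpoint, which is why we ultimately replaced VecNormalize with fixed normalization (described below).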
Even after fixing the VecNormalize drift, offline eval still produced ~250-step episodes instead of 1000. Closer to the goal, still wrong. Two more days of bisecting later:
```python
# train_mimic.py CLI default
parser.add_argument("--clip-csv", type=str,
                    default="data/lafan1_g1/g1/walk1_subject5.csv")

# g1_mimic_env.py __init__ default
if clip_csv is None:
    clip_csv = "data/lafan1_g1/g1/walk3_subject1.csv"
```
The training script was invoked with --clip-csv walk1_subject5.csv — its own CLI default. The offline evaluation script constructed UnitreeG1MimicEnv() without passing clip_csv, falling through to the env class's __init__ default of walk3_subject1.csv. A different reference clip means different target poses at every timestep, which means the policy is attempting to imitate a sequence it never saw in training.
The two bugs were masking each other. Fixing only the VecNormalize drift showed marginal improvement. Fixing only the clip path showed no improvement. Fixing both — instantly reproducible.
The Unitree G1's observation space is fully characterized. Every dimension has known physical bounds — joint angles are bounded by their actuator ranges, linear velocity by the simulator's velocity cap, foot forces by body weight. There is no good reason to learn the normalization when we can write it down:
```python
# Joint positions: normalize to [-1, 1] via actuator control range
joint_lo, joint_hi = ctrl_range[:, 0], ctrl_range[:, 1]
obs_normalized[5:34] = (joint_pos - 0.5 * (joint_lo + joint_hi)) / (0.5 * (joint_hi - joint_lo))

# Linear velocity: typical range [-2, 2] m/s
obs_normalized[34:37] = lin_vel / 2.0

# Foot forces: range [0, ~500] N, centered at half body weight
obs_normalized[40:42] = (foot_force - 150.0) / 150.0

# ... etc for all 50 observation dimensions
```
This is fixed, physics-derived normalization. It does not change across saves, loads, processes, or training runs. It replaces VecNormalize entirely for the observation space. (Reward normalization, where applicable, can still be done adaptively or simply skipped — adaptive reward stats are less reproducibility-sensitive because they only affect the value baseline, not the policy input.)
We wired it in via a --normalize-obs flag in train_mimic.py. When set, the environment emits already-normalized observations and the VecNormalize wrapper is not used at all.
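The wiring is a few lines in train_mimic.py; a sketch under assumed names (the normalize_obs kwarg on the env is our shorthand for "emit pre-normalized observations"):

```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from g1_mimic_env import UnitreeG1MimicEnv

parser.add_argument("--normalize-obs", action="store_true",
                    help="env emits physics-normalized observations; skip VecNormalize")
args = parser.parse_args()

venv = DummyVecEnv([lambda: UnitreeG1MimicEnv(clip_csv=args.clip_csv,
                                              normalize_obs=args.normalize_obs)])
if not args.normalize_obs:
    # Legacy path: adaptive statistics, with all the save/load caveats above.
    venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=10.0)
```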
Make the env class refuse to guess: keep clip_csv=None in the signature, but reject a missing value at construction instead of silently falling back to a default, so the caller has to be deliberate. Log the resolved path prominently at construction so any mismatch is visible in logs even when nobody's looking for it:
```python
def __init__(self, clip_csv=None, ...):
    if clip_csv is None:
        raise ValueError("clip_csv must be specified explicitly")
    print(f"[env] reference clip: {clip_csv}")
```
The subtler version of this lesson: every CLI default should be matched against the corresponding library default at every invocation. If a parameter has a default in two places, you have two sources of truth. Either match them and assert equality at startup, or have one reference the other.
We added a startup assertion: if --clip-csv is provided on the CLI but the env's default differs, fail fast with a loud error rather than continue silently with the wrong reference.
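One way to express the assert-equality-at-startup rule is to compare the CLI parser's default against the env module's default before doing anything else. A sketch, assuming g1_mimic_env exposes its default as a module-level DEFAULT_CLIP_CSV constant (our name for it):

```python
import g1_mimic_env

args = parser.parse_args()
cli_default = parser.get_default("clip_csv")
if cli_default != g1_mimic_env.DEFAULT_CLIP_CSV:
    raise SystemExit(
        f"[train_mimic] reference-clip defaults diverged: "
        f"CLI default {cli_default!r} vs env default {g1_mimic_env.DEFAULT_CLIP_CSV!r}"
    )
```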
A few teammates have asked: why does SB3 ship VecNormalize if it has this footgun? The answer is that for general RL tasks — Atari, MuJoCo Humanoid v4, classic Gym envs — observation ranges are unknown to the user, the user does not want to characterize them, and an adaptive normalizer is the difference between "PPO trains" and "PPO diverges." It's the right default for the library's average user.
It becomes wrong as soon as you cross two thresholds. First, you know your observation space well enough to hand-write the normalization. Second, you care about being able to load a checkpoint and run it later — for offline eval, for downstream tasks, for sim-to-real deployment, or just for shipping a model to a teammate.
For humanoid locomotion research, both thresholds are crossed routinely. So the library default fights against the use case, and the right move is to opt out cleanly. Most teams I've talked to about this work around it with various forms of "always pickle the vecnorm next to the model and pray they stay in sync" — which works until it doesn't. The fix-it-at-the-source approach is more upfront work but turns a recurring fire into a non-issue.
With both fixes applied and the policy retrained on identical hyperparameters:
A note on which checkpoint this table reports: it's a re-trained policy under the corrected pipeline, not the same v12_gold archive used as the full-reference baseline in the mocap-ablation post. v12_gold was trained earlier and clocks 0.69 m/s; the post-fix retrain below clocks 0.90 m/s on the same setup. Different checkpoints, both reproducible.
| Metric | Training-time eval (reported) | Offline eval (5 seeds, deterministic) |
|---|---|---|
| Episode length | 1000 | 1000 / 1000 (5 of 5) |
| Forward velocity | 0.59 m/s | 0.90 m/s |
| Hip excursion range | 0.46 rad | 1.17 rad |
| Reproducibility | — | 5/5 identical |
Two things to notice about that table.
First, the offline numbers are better than the training-time numbers — higher velocity, larger hip range. That's because the originally-reported training numbers were also miscomputed by the same drift. The policy was always producing larger excursions than the buggy logging said. We just couldn't see them through our own normalization.
Second, 5/5 across seeds is the meaningful column. Before the fix, "did the policy work?" was a question we couldn't answer because the answer kept changing. Now the model + the deterministic normalization + the explicit reference clip form a closed loop. Every offline run produces the same numbers as the previous one. The model is the model.
Two weeks of debugging shows up as a line in the journal. The real cost is bigger.
Every result we had collected over the prior month of training runs was suddenly suspect. Did the v8 curriculum really work, or was the eval pipeline lying about it? Did the action-space ablations (29-dim, 14-dim, 6-dim) produce real differences, or were we measuring normalizer noise across different model sizes? We had to pick a subset of the prior results to re-run with the fix in place. Some of them held up; some of them changed materially. The 0.59 vs 0.90 m/s gap in the validation table above is one of the smaller surprises — others were ablations whose ordering flipped after the fix.
The hidden cost of an evaluation bug is not the time you spend on the bug. It's the time you spent collecting numbers you can't trust afterward. Catch evaluation bugs early or pay them back at compound interest.
Four takeaways that generalize beyond this particular incident:
VecNormalize is dangerous for reproducibility. If your observation space is well-characterized — physical limits known, sensor ranges fixed — prefer fixed normalization. You give up automatic statistics in exchange for determinism, and for sim work the trade is almost always worth it.
CLI defaults and library defaults must match exactly. If a parameter has a default in two places, you have two sources of truth, and they will diverge silently. Either reference one from the other or assert equality at startup. Diff-checking only the file that changed misses defaults silently chosen elsewhere.
Two-bug syndrome is real. When fixing bug A doesn't help, the most likely explanation is that bug B is hiding behind it. The combined symptom is "nothing helps" until both are fixed simultaneously. Treat "no change after fixing the obvious thing" as evidence of a second bug, not evidence that you fixed the wrong thing.
Treat training eval and offline eval as separate experiments until proven equivalent. Same code paths can produce different results because of in-process state (running statistics, JIT caches, random-state initialization order). Round-trip every model through serialize → reload → evaluate before trusting any number, even informal metrics.
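The round-trip check can be as small as this; a sketch where make_eval_env is a placeholder that must apply the same fixed normalization and reference clip as training:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv

# Serialize -> reload -> evaluate, deterministically, before trusting any number.
model = PPO.load("archives/v9b_sync2/best_model.zip")
env = DummyVecEnv([make_eval_env])
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, deterministic=True)
print(f"round-trip eval: {mean_reward:.1f} +/- {std_reward:.1f}")
```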
The fix described here trades automatic statistics for hand-coded physics-derived normalization. It's the right trade when your observation space is fully characterized — which for most simulated robot tasks it is.
If your observation space includes dimensions whose ranges you don't know in advance, VecNormalize is still the sensible default. Common cases: a custom proprioceptive channel from a real-world sensor with unmodeled drift, a vision backbone where the embedding distribution is data-dependent, or a residual-policy setup that operates on the output of another network with non-stationary statistics. In those situations you don't have ground truth to hand-code, and adaptive normalization is the price of admission.
The decision rule is simple: how well-characterized is your observation space? For MuJoCo / Gymnasium tasks where you know the physical limits, deterministic wins. For mixed simulated-and-real or learned-feature pipelines, VecNormalize still earns its keep — but you should treat the saved pkl as a critical part of the checkpoint and protect it as carefully as the model weights.
Two follow-up posts make use of the reproducibility this one bought.
The full mocap ablation results — including the surprising 2-pose-beats-50-frames headline — were only trustworthy once we could reproduce a checkpoint deterministically. Before the fix, we couldn't tell if "2 poses faster than 50" was a real result or a normalization artifact.
And once we could measure honestly, we discovered that the policy had been finding creative ways to "walk" that aren't walking — knee-walking, backward shuffling, single-leg hopping disguised as bipedal motion. That's the reward-hacking taxonomy.
The pattern across all three posts: every quantitative claim about a humanoid RL policy is downstream of an evaluation harness, and the evaluation harness is software you wrote. Trust it last.