Robotics · 2025-12-31 · 13 min read · By Abhishek Nair, Fractional CTO for Deep Tech & AI

From Flatland to Physics: Building a 3D Robot Simulation with RL in PyBullet

Tags: Robotics, Reinforcement Learning, PyBullet, Simulation, 3D Robotics, PPO

Originally published on Medium in December 2025. This is a re-publish with light voice edits and a minor scrub of an absolute filesystem path that appeared in the original code block. Canonical URL points to the Medium original.

What we're building

By the end of this project you'll have a simulation environment where a differential-drive robot navigates a warehouse cluttered with obstacles. The robot is equipped with a full sensor suite — lidar, RGB camera, depth camera, and IMU — and you'll have trained an RL agent using Proximal Policy Optimization (PPO) that consistently reaches its target.

The complete code is on GitHub: github.com/padawanabhi/pybullet_sim.

If you'd like a 2D primer covering the same ideas with simpler physics, see the 2D companion post — IR-SIM, classical control, and a smaller PPO setup.

Setting up PyBullet (the macOS caveat)

Before building anything, a quick install note. If you're on macOS, you might hit a wall: pip install pybullet may throw build errors because pip tries to compile from source rather than using pre-built wheels.

The fix is conda — more reliable for packages with complex compiled dependencies.

```bash
conda create -n pybullet_sim python=3.10
conda activate pybullet_sim
conda install -c conda-forge pybullet
pip install gymnasium stable-baselines3 numpy opencv-python
```

If you're on Linux or Windows, pip might work fine. If you see errors about "no matching distribution" or failed wheel builds, conda is your friend.

Building the robot: URDF fundamentals

Every robot in PyBullet starts as a URDF file — the Unified Robot Description Format, a standard way to describe robot geometry, joints, and physical properties. Our differential-drive robot is relatively simple: a rectangular body, two powered wheels, and a passive caster wheel for balance.

The key design decisions involve matching real-world proportions. Our robot has a wheelbase of 0.22 m (the distance between left and right wheels) and wheels with a 0.05 m radius. The body weighs 2 kg, with each wheel adding 0.1 kg. These numbers matter because physics simulation relies on realistic mass and inertia.

One important detail: wheel joints are defined as continuous rather than revolute. A revolute joint has angle limits; wheels need to spin freely forever. Continuous joints allow unlimited rotation in both directions.

```xml
<?xml version="1.0"?>
<robot name="diff_drive_robot">

  <!-- Base Link (Robot Body) -->
  <link name="base_link">
    <visual>
      <geometry><box size="0.3 0.2 0.1"/></geometry>
      <origin xyz="0 0 0.05" rpy="0 0 0"/>
      <material name="blue"><color rgba="0.2 0.2 0.8 1"/></material>
    </visual>
    <collision>
      <geometry><box size="0.3 0.2 0.1"/></geometry>
      <origin xyz="0 0 0.05" rpy="0 0 0"/>
    </collision>
    <inertial>
      <mass value="2.0"/>
      <origin xyz="0 0 0.05" rpy="0 0 0"/>
      <inertia ixx="0.01" ixy="0" ixz="0" iyy="0.01" iyz="0" izz="0.02"/>
    </inertial>
  </link>

  <!-- Left Wheel -->
  <link name="left_wheel">
    <visual>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
      <origin xyz="0 0 0" rpy="1.5708 0 0"/>
      <material name="black"><color rgba="0.1 0.1 0.1 1"/></material>
    </visual>
    <collision>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
      <origin xyz="0 0 0" rpy="1.5708 0 0"/>
    </collision>
    <inertial>
      <mass value="0.1"/>
      <origin xyz="0 0 0" rpy="1.5708 0 0"/>
      <inertia ixx="0.0001" ixy="0" ixz="0" iyy="0.0001" iyz="0" izz="0.0001"/>
    </inertial>
  </link>

  <!-- Right Wheel (mirror of left) -->
  <link name="right_wheel">
    <visual>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
      <origin xyz="0 0 0" rpy="1.5708 0 0"/>
      <material name="black"><color rgba="0.1 0.1 0.1 1"/></material>
    </visual>
    <collision>
      <geometry><cylinder radius="0.05" length="0.02"/></geometry>
      <origin xyz="0 0 0" rpy="1.5708 0 0"/>
    </collision>
    <inertial>
      <mass value="0.1"/>
      <origin xyz="0 0 0" rpy="1.5708 0 0"/>
      <inertia ixx="0.0001" ixy="0" ixz="0" iyy="0.0001" iyz="0" izz="0.0001"/>
    </inertial>
  </link>

  <!-- Caster Wheel (passive) -->
  <link name="caster_wheel">
    <visual>
      <geometry><sphere radius="0.025"/></geometry>
      <material name="gray"><color rgba="0.5 0.5 0.5 1"/></material>
    </visual>
    <collision><geometry><sphere radius="0.025"/></geometry></collision>
    <inertial>
      <mass value="0.05"/>
      <inertia ixx="0.00001" ixy="0" ixz="0" iyy="0.00001" iyz="0" izz="0.00001"/>
    </inertial>
  </link>

  <!-- Wheel joints (continuous = free rotation) -->
  <joint name="left_wheel_joint" type="continuous">
    <parent link="base_link"/><child link="left_wheel"/>
    <origin xyz="0 0.11 0.05" rpy="0 0 0"/>
    <axis xyz="0 1 0"/>
    <dynamics damping="0.1" friction="0.1"/>
  </joint>
  <joint name="right_wheel_joint" type="continuous">
    <parent link="base_link"/><child link="right_wheel"/>
    <origin xyz="0 -0.11 0.05" rpy="0 0 0"/>
    <axis xyz="0 1 0"/>
    <dynamics damping="0.1" friction="0.1"/>
  </joint>

  <!-- Caster joint (fixed for simplicity) -->
  <joint name="caster_joint" type="fixed">
    <parent link="base_link"/><child link="caster_wheel"/>
    <origin xyz="-0.12 0 0.025" rpy="0 0 0"/>
  </joint>

  <!-- Lidar mount -->
  <link name="lidar_link">
    <visual>
      <geometry><cylinder radius="0.03" length="0.02"/></geometry>
      <material name="red"><color rgba="0.8 0.2 0.2 1"/></material>
    </visual>
    <inertial>
      <mass value="0.01"/>
      <inertia ixx="0.00001" ixy="0" ixz="0" iyy="0.00001" iyz="0" izz="0.00001"/>
    </inertial>
  </link>
  <joint name="lidar_joint" type="fixed">
    <parent link="base_link"/><child link="lidar_link"/>
    <origin xyz="0.1 0 0.11" rpy="0 0 0"/>
  </joint>
</robot>
```

The sensor suite

A robot without sensors is just a paperweight with ambition. Our simulation includes four sensor types, each serving a distinct purpose.


Lidar is the primary navigation sensor. Using PyBullet's rayTestBatch(), we cast 36 rays in a 360° sweep around the robot, each with a 5 m range. The batch operation is crucial for performance — individual ray tests would be painfully slow. Each ray returns the distance to the nearest obstacle, giving us a 2D slice of the environment around the robot.
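
Here's roughly what the batched scan looks like. This is a minimal sketch — the helper name and the mounting height are mine, not the repo's exact code — but the `rayTestBatch` call and the hit-fraction math are standard PyBullet:

```python
import numpy as np
import pybullet as p

def lidar_scan(robot_id, num_rays=36, max_range=5.0, height=0.15):
    # Cast a 360-degree fan of rays around the robot base in one batch call.
    pos, orn = p.getBasePositionAndOrientation(robot_id)
    yaw = p.getEulerFromQuaternion(orn)[2]
    angles = yaw + np.linspace(0, 2 * np.pi, num_rays, endpoint=False)
    origin = [pos[0], pos[1], height]
    ray_from = [origin] * num_rays
    ray_to = [[pos[0] + max_range * np.cos(a),
               pos[1] + max_range * np.sin(a),
               height] for a in angles]
    results = p.rayTestBatch(ray_from, ray_to)
    # Each result's hitFraction (index 2) is 1.0 when nothing is hit in range,
    # so fraction * max_range gives the distance directly.
    return np.array([r[2] * max_range for r in results])
```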

RGB and depth cameras provide visual information. The RGB camera captures what a human would see; the depth camera measures distance per pixel. We use 320×240 resolution with a 60° field of view. Camera positioning took some trial and error — the camera sits above and behind the robot body, looking forward along the robot's heading.
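
As a rough sketch of how that rendering works with PyBullet's camera API — the offsets and clipping planes below are illustrative, not the repo's tuned values:

```python
import numpy as np
import pybullet as p

def capture_camera(robot_id, width=320, height=240, fov=60.0):
    # Place the camera above and behind the base, looking along the heading.
    pos, orn = p.getBasePositionAndOrientation(robot_id)
    yaw = p.getEulerFromQuaternion(orn)[2]
    forward = np.array([np.cos(yaw), np.sin(yaw), 0.0])
    eye = np.array(pos) + np.array([0, 0, 0.3]) - 0.2 * forward
    target = np.array(pos) + 1.0 * forward
    view = p.computeViewMatrix(eye.tolist(), target.tolist(), [0, 0, 1])
    proj = p.computeProjectionMatrixFOV(fov, width / height, 0.01, 10.0)
    _, _, rgb, depth, _ = p.getCameraImage(width, height, view, proj)
    return (np.reshape(rgb, (height, width, 4)),    # RGBA image
            np.reshape(depth, (height, width)))     # depth buffer
```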

IMU (inertial measurement unit) simulates an accelerometer and gyroscope. Real IMU sensors are noisy, so we add Gaussian noise to make the simulation more realistic. This becomes important when training RL agents — if your simulation is too clean, your trained policy might fail on real hardware.
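
A minimal version of the noisy-IMU idea, assuming we read base velocities from PyBullet and finite-difference them for acceleration (the class and the noise levels are made up for illustration):

```python
import numpy as np
import pybullet as p

class SimpleIMU:
    """Sketch: accel by finite-differencing base linear velocity, gyro from
    base angular velocity, both with added Gaussian noise."""

    def __init__(self, robot_id, dt, accel_std=0.05, gyro_std=0.01):
        self.robot_id, self.dt = robot_id, dt
        self.accel_std, self.gyro_std = accel_std, gyro_std
        self.prev_lin = np.zeros(3)

    def read(self):
        lin, ang = p.getBaseVelocity(self.robot_id)
        lin = np.asarray(lin)
        accel = (lin - self.prev_lin) / self.dt + np.random.normal(0, self.accel_std, 3)
        self.prev_lin = lin
        gyro = np.asarray(ang) + np.random.normal(0, self.gyro_std, 3)
        return accel, gyro
```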

Odometry computes the robot's velocity from wheel rotations. The differential-drive kinematics are straightforward: linear velocity is the average of both wheel speeds, and angular velocity depends on the difference between them.
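
In code that kinematics is two lines, using the wheel radius (0.05 m) and wheelbase (0.22 m) from our URDF (the helper name is mine):

```python
import pybullet as p

def odometry(robot_id, left_joint, right_joint, r=0.05, L=0.22):
    # Differential-drive forward kinematics:
    #   v = r * (w_R + w_L) / 2      w = r * (w_R - w_L) / L
    w_l = p.getJointState(robot_id, left_joint)[1]   # joint velocity (rad/s)
    w_r = p.getJointState(robot_id, right_joint)[1]
    v = r * (w_r + w_l) / 2.0
    omega = r * (w_r - w_l) / L
    return v, omega
```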

The sensor dashboard

PyBullet has built-in visualization for camera feeds that I initially overlooked. Instead of managing a separate matplotlib window (which kept stealing focus), you can enable sensor-buffer previews directly in the PyBullet GUI:

```python
p.configureDebugVisualizer(p.COV_ENABLE_RGB_BUFFER_PREVIEW, 1)
p.configureDebugVisualizer(p.COV_ENABLE_DEPTH_BUFFER_PREVIEW, 1)
p.configureDebugVisualizer(p.COV_ENABLE_SEGMENTATION_MARK_PREVIEW, 1)
```

This gives you RGB, depth, and segmentation views in tabs alongside the main simulation window. Much cleaner than fighting with window managers.

Navigation: learning to keep it simple

Here's where I made my first major mistake. My initial navigation controller was… ambitious. It had stuck detection, position-history tracking, oscillation prevention, and a state machine with more modes than a Swiss Army knife. The result? The robot would stop, spin in circles, and make no progress toward the goal.

The problem was competing behaviours. When multiple subsystems all have opinions about what the robot should do, you get confusion, not coordination.

The fix was radical simplification. I replaced everything with a basic PID controller focused on two things: point the robot at the goal, and move forward. Obstacle avoidance became a simple modification of the desired heading — when lidar detects something close in front, blend the goal direction with an avoidance direction based on which side has more clearance.
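
A minimal sketch of that loop, with a P-controller on heading error (the gain and speed limits here are illustrative, not the tuned values from the repo):

```python
import numpy as np

def heading_controller(goal_heading, robot_heading, kp=2.0, v_max=0.5):
    # Wrap the heading error to [-pi, pi] so the robot turns the short way round
    error = np.arctan2(np.sin(goal_heading - robot_heading),
                       np.cos(goal_heading - robot_heading))
    w = float(np.clip(kp * error, -1.5, 1.5))    # turn toward the goal
    v = v_max * max(0.0, float(np.cos(error)))   # forward only; slow while turning
    return v, w
```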

```python
# Simplified obstacle-avoidance logic
if min_front_distance < safety_distance:
    avoidance_weight = 1.0 - (min_front_distance / safety_distance)
    # blend() interpolates between the two headings by avoidance_weight
    desired_heading = blend(goal_heading, avoidance_heading, avoidance_weight)
```

The simplified controller reached the goal in 892 steps through a warehouse with 31 obstacles. Sometimes less really is more.

Path planning: A* and RRT

Once basic navigation worked, I added proper path planning. Two algorithms are implemented: A* and RRT.

A* operates on a discretized grid. Lidar scans get converted to an occupancy grid with 0.2 m resolution, and A* finds the optimal path using a Euclidean heuristic. It's fast — about 0.1 ms per planning cycle.
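
For concreteness, here's a compact grid A* in the same spirit — an illustrative sketch, not the repo's implementation. `start` and `goal` are (row, col) cells in the 0.2 m grid:

```python
import heapq
import itertools

import numpy as np

def astar(grid, start, goal):
    """Compact A* on an occupancy grid (True = blocked), Euclidean heuristic."""
    h = lambda a, b: float(np.hypot(a[0] - b[0], a[1] - b[1]))
    tie = itertools.count()  # tiebreaker so the heap never compares node tuples
    open_set = [(h(start, goal), next(tie), 0.0, start, None)]
    came_from, best_g = {}, {start: 0.0}
    while open_set:
        _, _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:            # already expanded via a cheaper path
            continue
        came_from[node] = parent
        if node == goal:                 # walk parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue
                nxt = (node[0] + dx, node[1] + dy)
                if not (0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]):
                    continue
                if grid[nxt]:            # cell occupied
                    continue
                ng = g + float(np.hypot(dx, dy))
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt, goal), next(tie), ng, nxt, node))
    return None                          # goal unreachable
```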

RRT (Rapidly-exploring Random Tree) works in continuous space. It grows a tree by randomly sampling positions and connecting them to the nearest existing node. RRT is slower (around 35 ms) but produces smoother paths in complex environments.
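
And a similarly compact RRT sketch in continuous 2D — again illustrative; `is_free()` is an assumed collision checker, e.g. backed by the same occupancy grid:

```python
import numpy as np

def rrt(start, goal, is_free, low, high, step=0.3, goal_tol=0.3, max_iters=2000):
    """Grow a tree by sampling points and extending from the nearest node."""
    rng = np.random.default_rng()
    nodes = [np.asarray(start, dtype=float)]
    parents = [-1]
    goal = np.asarray(goal, dtype=float)
    for _ in range(max_iters):
        # Sample uniformly in the workspace, with a 10% bias toward the goal
        sample = goal if rng.random() < 0.1 else rng.uniform(low, high)
        i_near = min(range(len(nodes)), key=lambda i: np.linalg.norm(nodes[i] - sample))
        direction = sample - nodes[i_near]
        new = nodes[i_near] + step * direction / (np.linalg.norm(direction) + 1e-9)
        if not is_free(new):
            continue
        nodes.append(new)
        parents.append(i_near)
        if np.linalg.norm(new - goal) < goal_tol:   # close enough: extract path
            path, i = [], len(nodes) - 1
            while i != -1:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None
```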

Interestingly, while A* plans faster, RRT led to faster overall navigation in my tests. The paths were more direct, requiring less replanning. This kind of counter-intuitive result is exactly why we run experiments rather than assuming.

The Gym wrapper: making it RL-ready

Reinforcement-learning frameworks like Stable-Baselines3 expect environments to follow the Gymnasium interface. This means implementing reset(), step(), and defining observation and action spaces.

Observation space is a 43-dimensional vector containing normalized lidar readings (36 values), heading error to goal, linear and angular velocity from odometry, IMU readings, and — critically — distance to goal. That last one seems obvious in hindsight, but I initially forgot it. The agent couldn't tell when it was close to success.
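
In the Gym wrapper this is a flat Box. The slot layout below is partly an assumption on my part, but the heading-error slot ([2]) and the lidar slots ([6:42]) match the indices used in the step() method shown later:

```python
from gymnasium import spaces
import numpy as np

# 43-dim observation: obs[2] = heading error, obs[6:42] = 36 lidar ranges;
# the remaining slots (distance to goal, velocities, IMU) are assumed here.
self.observation_space = spaces.Box(
    low=-np.inf, high=np.inf, shape=(43,), dtype=np.float32
)
```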

Action space is continuous with two values: linear velocity (0 to 0.5 m/s) and angular velocity (-1.5 to 1.5 rad/s). Linear velocity is clamped to be forward-only. Allowing backward movement would double the complexity without much benefit, and the simpler action space learns faster.

```python
self.action_space = spaces.Box(
    low=np.array([0.0, -1.5]),
    high=np.array([0.5, 1.5]),
    dtype=np.float32,
)
```

The RL training journey: from 0% to 100%

Now for the part that kept me up at night. Initial training results were dismal. The robot would move back and forth, spin in circles, or stop just short of the goal. Success rate: maybe 2 out of 5 episodes if I was lucky.

The culprit was a reward function that tried to do too much. I had bonuses for getting close, penalties for moving away, punishment for zero velocity, and harsh collision penalties. All these signals competed with each other, and the agent couldn't figure out what actually mattered.

The fix came through simplification. I stripped the reward function down to essentials.

  • Progress reward became the primary signal: +1.0 per metre closer to the goal. This is the clearest possible message — moving toward the target is good.
  • Goal reached gives a +50.0 bonus. Unmistakable "you did it" signal.
  • Collision penalty of -10.0 (reduced from -50.0). Harsh penalties can prevent exploration; the agent needs room to make mistakes and learn.
  • Time penalty of -0.01 per step encourages efficiency without being punitive.
  • Heading alignment of +0.1 for facing the goal when far away. This subtle reward guides early learning without dominating the signal.
  • Stuck detection only triggers if the robot makes less than 0.1 m of progress over 20 steps AND is more than 1 m from the goal. My original stuck detection was triggering thousands of times per episode on false positives.
```python
def step(self, action):
    v = np.clip(action[0], 0.0, 0.7)   # no backward driving
    w = np.clip(action[1], -2.0, 2.0)
    self.robot.set_velocity(v, w)
    for _ in range(int(self.time_step / (1 / 240))):
        p.stepSimulation()
    self.step_count += 1

    obs = self._get_obs()
    info = self._get_info()

    current_distance = info["distance_to_goal"]
    progress = self.prev_distance - current_distance
    self.prev_distance = current_distance
    self.recent_progress.append(progress)
    if len(self.recent_progress) > 20:
        self.recent_progress.pop(0)

    reward = 0.0

    # 1. PRIMARY: progress toward goal
    reward += progress

    # 1b. heading alignment when far from goal
    dtheta_abs = abs(obs[2])
    if current_distance > 1.0:
        reward += 0.1 * (1.0 - dtheta_abs)

    # 2. goal reached
    if current_distance < self.goal_threshold:
        reward += 50.0

    # 3. collision penalty
    if info["collision"]:
        reward -= 10.0

    # 4. time penalty (efficiency)
    reward -= 0.01

    # 5. stuck penalty (only if truly stuck and far from goal)
    if len(self.recent_progress) >= 20:
        if abs(sum(self.recent_progress[-20:])) < self.stuck_threshold and current_distance > 1.0:
            reward -= 0.1

    # obstacle proximity (lidar < 0.3 m)
    min_lidar = float(np.min(obs[6:42]))
    if min_lidar < 0.3:
        reward -= 1.0 * (0.3 - min_lidar)

    # excessive rotation
    if abs(action[1]) > 2.0:
        reward -= 0.2 * (abs(action[1]) - 2.0)

    terminated = bool(current_distance < self.goal_threshold)
    truncated = bool(self.step_count >= self.max_steps) or bool(info["collision"])
    return obs, reward, terminated, truncated, info
```

The final training configuration used PPO with 4 parallel environments on Apple Silicon (M2). After 1 million timesteps, the results spoke for themselves.
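
For reference, a minimal Stable-Baselines3 training script along these lines — `WarehouseEnv` and the save name are illustrative stand-ins for the repo's actual names:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env(lambda: WarehouseEnv(), n_envs=4)   # 4 parallel environments
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_warehouse_nav")
```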

Evaluation results (5 episodes):

  • Success rate: 5 / 5 (100%)
  • Average reward: 57.43 ± 3.83
  • Average episode length: 164 ± 58 steps
  • Average distance traveled: 3.06 m ± 1.09 m
```text
============================================================
Running 5 evaluation episodes
============================================================

Building warehouse environment...
Created warehouse with 31 obstacles

--- Episode 1/5 ---
Initial distance to goal: 3.56m
  Step 50:  dist=2.75m, reward=4.0
  Step 100: dist=1.70m, reward=9.4
  Step 150: dist=0.66m, reward=11.9
✓ SUCCESS | Steps: 159 | Reward: 61.9 | Final distance: 0.48m | Distance traveled: 3.10m

--- Episode 2/5 ---
Initial distance to goal: 2.72m
✓ SUCCESS | Steps: 120 | Reward: 58.2 | Final distance: 0.49m | Distance traveled: 2.26m

--- Episode 3/5 ---
Initial distance to goal: 5.10m
✓ SUCCESS | Steps: 275 | Reward: 50.3 | Final distance: 0.49m | Distance traveled: 5.17m

--- Episode 4/5 ---
✓ SUCCESS | Steps: 150 | Reward: 57.8 | Final distance: 0.48m

--- Episode 5/5 ---
✓ SUCCESS | Steps: 116 | Reward: 58.9 | Final distance: 0.49m

============================================================
EVALUATION RESULTS
Success rate: 5/5 (100.0%)
Average reward:         57.43 ± 3.83
Average episode length: 164.0 ± 57.9 steps
Average distance:       3.06m ± 1.09m
============================================================
```

Performance matters

A simulation that runs slowly is a simulation you won't iterate on. Initial performance was rough — under 1 FPS with all visualizations active. The bottleneck was surprising: camera image capture and debug line drawing.

The fix was to update expensive visualizations less frequently. Camera images update every 10 frames instead of every frame. Lidar rays are drawn every 10 frames, and only every tenth ray is shown. Path visualization updates every 10 frames with longer-lasting debug lines.
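
The pattern is simple frame-skipping. Something like this sketch — the helper names are hypothetical, but `addUserDebugLine` and its `lifeTime` argument are real PyBullet API:

```python
VIS_EVERY = 10  # refresh expensive debug visuals every N simulation steps

if self.step_count % VIS_EVERY == 0:
    self._update_camera_preview()             # expensive p.getCameraImage call
    for i in range(0, self.num_rays, 10):     # draw only every tenth lidar ray
        p.addUserDebugLine(
            ray_from[i], ray_to[i],
            lineColorRGB=[0, 1, 0],
            lifeTime=VIS_EVERY * self.time_step,  # persist until the next redraw
        )
```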

These changes brought the simulation from 0.8 FPS to over 4 FPS — still not blazing fast, but usable for development. When training headless (no GUI), performance is much better.

Extending the project

What I've shown here is a foundation. There's plenty of room to explore.

  • Different RL algorithms. I used PPO; the codebase supports SAC, TD3, and A2C. How do their learning curves compare? Does SAC's sample efficiency help with this environment?
  • More environment complexity. The warehouse has static obstacles. Add moving obstacles, multiple goals, or dynamic environments where obstacles appear and disappear.
  • Multi-goal training. Give the robot a sequence of waypoints or multiple goals to visit efficiently.
  • Curriculum learning. Start with an empty environment, gradually add obstacles as the agent improves.
  • Domain randomization. Randomize physics parameters, sensor noise, and obstacle configurations during training — prepares for sim-to-real transfer.

Lessons learned

A few principles stand out, looking back:

  • Simplicity wins. Both the navigation controller and the reward function improved dramatically when I removed complexity. It's tempting to add logic for every edge case, but each addition interacts with everything else.
  • Observation space matters. That missing distance-to-goal observation was a showstopper. The agent literally couldn't perceive success.
  • Tune your penalties carefully. Harsh punishments can shut down exploration. Moderate penalties guide behaviour without crushing the learning signal.
  • Progress is the primary reward. Everything else should support the core objective, not compete with it.
  • Test your detectors. My stuck detection was firing 32,000 times in 500 steps. That wasn't stuck detection — it was noise generation.

Wrapping up

We've traveled a long way from the 2D simplicity of IR-SIM. PyBullet brings real physics, real sensors, and real challenges. But the core principles remain: build incrementally, test constantly, and when things break (they will), simplify before adding complexity.

The complete project, including all training scripts and pre-trained models, is on GitHub. Clone it, break it, improve it.

Next time, maybe sim-to-real transfer — taking the virtual robot and making it control something you can actually touch. But that's a story for another post.
