Originally published on Medium in December 2025. This is a re-publish with light voice edits; the canonical URL points to the Medium original.
By the end of this guide you'll understand:

- How a differential drive robot moves, and what it physically can and cannot do
- How to set up a lightweight 2D simulation with IR-SIM
- How classical controllers (go-to-goal, reactive avoidance, potential fields) navigate, and where they fail
- How to wrap the simulator in a Gymnasium environment and train a PPO agent with Stable-Baselines3
- Why reward design and action-space design make or break an RL agent
I use IR-SIM (Intelligent Robot Simulator), a lightweight 2D simulation library that's perfect for learning. No 3D graphics card required, no complex physics engine — just pure, understandable robotics.
Most wheeled robots you've seen — from Roombas to warehouse AGVs — use what's called a differential drive system. It's beautifully simple: two wheels, each with its own motor. By controlling the speed of each wheel independently, you can make the robot go forward, backward, turn, or spin in place.
Think of it like a tank, or like steering a shopping cart by pushing harder on one side.
The math works like this: instead of controlling each wheel directly, you describe the robot's motion with two values, a linear velocity v (how fast it moves forward) and an angular velocity ω (how fast it turns).
From these two values you calculate the individual wheel speeds, and vice versa. The robot's position updates according to:
x_new = x + v × cos(θ) × dt
y_new = y + v × sin(θ) × dt
θ_new = θ + ω × dt
Where θ is the robot's heading (which direction it's facing) and dt is the time step.
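If you prefer code to equations, here's a minimal sketch of that update, plus the conversion from (v, ω) to individual wheel speeds. The wheel separation `wheel_base` is an assumed illustrative parameter, not something IR-SIM requires from you:

```python
import numpy as np

def step_diff_drive(x, y, theta, v, omega, dt):
    """Advance the robot one time step using the unicycle model above."""
    x_new = x + v * np.cos(theta) * dt
    y_new = y + v * np.sin(theta) * dt
    theta_new = theta + omega * dt
    return x_new, y_new, theta_new

def wheel_speeds(v, omega, wheel_base=0.3):
    """Convert body velocities (v, omega) into left/right wheel speeds."""
    v_left = v - omega * wheel_base / 2
    v_right = v + omega * wheel_base / 2
    return v_left, v_right
```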
Understanding differential drive is crucial because it defines what the robot can and cannot do: it can drive forward and backward and it can rotate, but it can never slide sideways. In robotics jargon, it's a nonholonomic system.
This constraint shapes everything else — from how we design controllers to what actions our RL agent can take.
IR-SIM is a Python library that simulates 2D robot environments. It handles the physics, visualization, and provides a clean interface for control.
```bash
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install ir-sim gymnasium stable-baselines3 numpy matplotlib pyyaml
```
The robotics equivalent of "Hello, World":
```python
import os
os.environ['MPLBACKEND'] = 'TkAgg'  # Important for macOS

from irsim.world import World
import matplotlib.pyplot as plt

world_config = {
    'world_name': 'my_first_robot',
    'height': 10,
    'width': 10,
    'step_time': 0.1,
    'sample_time': 0.1,
    'offset': [0, 0],
    'control_mode': 'keyboard'
}
world = World(**world_config)

robot_config = {
    'type': 'diff',        # differential drive
    'shape': 'circle',
    'radius': 0.3,
    'state': [2, 2, 0],    # x, y, theta
    'vel_min': [-1, -1],
    'vel_max': [1, 1],
    'goal': [8, 8]
}
world.add_robot(robot_config)

obstacle_config = {
    'type': 'obstacle',
    'shape': 'circle',
    'center': [5, 5],
    'radius': 0.5
}
world.add_obstacle(obstacle_config)

plt.ion()
for i in range(200):
    world.step()
    plt.pause(0.05)
```
Run this and you should see a circular robot and an obstacle. The robot won't move yet — we need to tell it what to do.
Before AI, roboticists solved navigation with clever control algorithms. Let's implement the simplest one: proportional control toward a goal.
Imagine you're walking toward a friend across a field. You naturally:

- turn until you're facing them
- walk faster while they're far away
- slow down and stop as you get close
That's essentially what go-to-goal does:
```python
import numpy as np

class GoToGoalController:
    def __init__(self, k_linear=0.5, k_angular=2.0):
        self.k_linear = k_linear    # forward speed gain
        self.k_angular = k_angular  # turning speed gain

    def compute_velocity(self, robot_state, goal):
        """
        robot_state: [x, y, theta]
        goal: [goal_x, goal_y]
        returns: [linear_vel, angular_vel]
        """
        x, y, theta = robot_state
        goal_x, goal_y = goal

        dx = goal_x - x
        dy = goal_y - y
        distance = np.sqrt(dx**2 + dy**2)

        angle_to_goal = np.arctan2(dy, dx)
        heading_error = angle_to_goal - theta
        heading_error = np.arctan2(np.sin(heading_error), np.cos(heading_error))

        if distance < 0.3:
            return [0.0, 0.0]

        linear_vel = np.clip(self.k_linear * distance, 0.0, 1.0)
        angular_vel = np.clip(self.k_angular * heading_error, -1.0, 1.0)
        return [linear_vel, angular_vel]
```
This works beautifully — until there's an obstacle in the way. The robot will drive straight into it because it only knows about the goal.
The simplest fix: if something is close and in front of you, turn away.
```python
def compute_velocity_with_avoidance(self, robot_state, goal, obstacles):
    x, y, theta = robot_state

    for obs in obstacles:
        obs_x, obs_y = obs['center']
        obs_radius = obs['radius']

        # Gap between the robot body (radius 0.3) and the obstacle surface
        dist = np.sqrt((obs_x - x)**2 + (obs_y - y)**2) - obs_radius - 0.3

        # Bearing to the obstacle relative to the heading, wrapped to [-pi, pi]
        angle_to_obs = np.arctan2(obs_y - y, obs_x - x)
        relative_angle = angle_to_obs - theta
        relative_angle = np.arctan2(np.sin(relative_angle), np.cos(relative_angle))

        # Obstacle is close and within a 60-degree cone ahead: veer away
        if dist < 0.5 and abs(relative_angle) < np.pi / 3:
            if relative_angle > 0:
                return [0.1, -0.8]  # obstacle to the left, turn right
            return [0.1, 0.8]       # obstacle to the right, turn left

    return self.compute_velocity(robot_state, goal)
```
This is reactive navigation — simple, fast, but limited. It can get stuck in corners and doesn't plan ahead.
What if instead of abrupt "if-then" rules we created a smooth force field that naturally guides the robot?
Imagine the goal is a magnet attracting the robot and obstacles are magnets repelling it. The robot follows the combined force.
Attractive potential (goal):
F_att = k_att × (goal_position − robot_position)
Repulsive potential (obstacles):
F_rep = k_rep × (1/distance − 1/threshold) × direction_away
The robot moves in the direction of the total force.
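Here is one way this could look in code, following the same interface as GoToGoalController above. It's a minimal sketch, not the repo's implementation: the gains (k_att, k_rep, threshold) and the force-to-velocity steering law at the end are my own illustrative choices.

```python
import numpy as np

class PotentialFieldController:
    def __init__(self, k_att=1.0, k_rep=0.5, threshold=1.0):
        self.k_att = k_att          # attraction gain toward the goal
        self.k_rep = k_rep          # repulsion gain away from obstacles
        self.threshold = threshold  # obstacles farther than this are ignored

    def compute_velocity(self, robot_state, goal, obstacles):
        x, y, theta = robot_state

        # Attractive force pulls straight toward the goal
        force = self.k_att * np.array([goal[0] - x, goal[1] - y])

        # Each nearby obstacle pushes the robot directly away from it
        for obs in obstacles:
            diff = np.array([x - obs['center'][0], y - obs['center'][1]])
            center_dist = np.linalg.norm(diff)
            surface_dist = max(center_dist - obs['radius'], 0.01)
            if surface_dist < self.threshold:
                direction_away = diff / max(center_dist, 1e-6)
                force += self.k_rep * (1.0 / surface_dist - 1.0 / self.threshold) * direction_away

        # Steer the differential drive along the total force vector
        desired_heading = np.arctan2(force[1], force[0])
        heading_error = np.arctan2(np.sin(desired_heading - theta),
                                   np.cos(desired_heading - theta))
        linear_vel = np.clip(np.linalg.norm(force), 0.0, 1.0)
        angular_vel = np.clip(2.0 * heading_error, -1.0, 1.0)
        return [linear_vel, angular_vel]
```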
Potential fields look elegant on paper but have a famous problem: local minima. Imagine two obstacles equally spaced on either side of the goal. The repulsive forces cancel out and the robot gets stuck.
This is where learning-based approaches shine — they can potentially find creative solutions that hand-crafted algorithms miss.
Here's where things get interesting. Instead of writing rules, what if we let the robot learn to navigate through trial and error?
Reinforcement learning has four components: an agent (here, the robot and its policy), an environment (the simulated world), actions (the velocity commands the agent can issue), and rewards (a scalar score for each step).
The agent takes actions, observes the results, and adjusts its strategy to maximize total reward. It's like training a dog with treats — except the "dog" is a neural network and the "treats" are positive numbers.
To use modern RL libraries like Stable-Baselines3, we wrap IR-SIM in a Gymnasium environment:
```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class DiffDriveNavEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(7,), dtype=np.float32
        )
        self.action_space = spaces.Box(
            low=np.array([0.0, -1.0]),   # forward only!
            high=np.array([1.0, 1.0]),
            dtype=np.float32
        )

    def step(self, action):
        self.world.set_robot_velocity(action)
        self.world.step()

        obs = self._get_observation()
        reward = self._compute_reward(obs)
        terminated = self._check_goal_reached()
        truncated = self._check_collision() or self.steps > 1000
        return obs, reward, terminated, truncated, {}
```
I chose 7 values that give the robot situational awareness:

- the normalized direction to the goal (2 values)
- the heading error toward the goal (1 value)
- the current linear and angular velocity (2 values)
- the distance to the nearest obstacle (1 value)
- the straight-line distance to the goal (1 value)
```python
def _get_observation(self):
    dx = self.goal[0] - self.robot.x
    dy = self.goal[1] - self.robot.y
    dist = np.sqrt(dx**2 + dy**2)

    dx_norm = dx / max(dist, 0.01)
    dy_norm = dy / max(dist, 0.01)

    angle_to_goal = np.arctan2(dy, dx)
    dtheta = angle_to_goal - self.robot.theta
    dtheta = np.arctan2(np.sin(dtheta), np.cos(dtheta))

    v = self.robot.linear_velocity
    w = self.robot.angular_velocity
    min_obstacle_dist = self._get_min_obstacle_distance()

    return np.array([
        dx_norm, dy_norm, dtheta, v, w, min_obstacle_dist, dist
    ], dtype=np.float32)
```
Here's the dirty secret of RL: the reward function is everything. A bad reward leads to bizarre, unexpected behavior. A good reward leads to elegant solutions.
I thought I was being clever:
```python
def _compute_reward_v1(self, obs):
    reward = 0
    reward += 100 / (obs[6] + 0.1)              # inverse distance
    reward -= 0.01                              # time penalty
    if obs[5] < 0.5:
        reward -= 10 * (0.5 - obs[5])           # near-obstacle penalty
    reward += 0.5 * abs(obs[3])                 # any movement bonus
    reward += 1.0 * (1 - abs(obs[2]) / np.pi)   # face goal bonus
    if self._check_collision():
        reward -= 200                           # collision penalty
    if obs[6] < 0.3:
        reward += 50                            # goal bonus
    return reward
```
Result: the robot spun in circles. Why?
With so many reward terms, the agent got confused: the dense inverse-distance term paid well just for loitering near the goal, the "any movement" bonus rewarded motion in any direction, and none of it required actually making progress. Circling near the goal collected a steady stream of reward without risking a collision or ever triggering the sparse goal bonus.
After reading papers on RL navigation, I simplified dramatically:
```python
def _compute_reward_v2(self, obs):
    progress = self.prev_distance - obs[6]
    self.prev_distance = obs[6]

    reward = 0
    reward += 10.0 * max(0.0, progress)       # primary: reward forward progress

    angular_vel = abs(obs[4])
    if angular_vel > 0.8:
        reward -= 5.0 * (angular_vel - 0.8)   # penalty: excessive spinning

    if obs[5] < 0.3:
        reward -= 5.0                         # penalty: too close to obstacles
    elif obs[5] < 0.5:
        reward -= 1.0

    if obs[6] < 0.3:
        reward += 100.0                       # goal bonus
    return reward
```
Key insights:

- Reward progress, not proximity: the agent only earns when the distance to the goal actually shrinks
- Penalize the failure mode you observed (excessive spinning) explicitly
- Keep the obstacle penalty simple and the number of terms small
- Save the big bonus for actually reaching the goal
There was another crucial change: forward-only movement.
My original action space allowed [-1, 1] for linear velocity. The robot could back up. The problem: it learned to back away from obstacles indefinitely instead of navigating around them.
By constraining linear velocity to [0, 1], the robot had to move forward and find a path. It couldn't escape by retreating.
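In Gymnasium terms the fix is a one-line change to the action space. The v2 bounds are the ones in the environment shown earlier; the v1 bounds are reconstructed here from the description above:

```python
from gymnasium import spaces
import numpy as np

# v1: linear velocity in [-1, 1]; the agent learned to escape by reversing
action_space_v1 = spaces.Box(low=np.array([-1.0, -1.0]),
                             high=np.array([1.0, 1.0]), dtype=np.float32)

# v2: linear velocity in [0, 1]; forward only, so it must steer around obstacles
action_space_v2 = spaces.Box(low=np.array([0.0, -1.0]),
                             high=np.array([1.0, 1.0]), dtype=np.float32)
```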
With the environment set up correctly, training is surprisingly straightforward:
```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

env = DiffDriveNavEnv(randomize=True)

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1,
    tensorboard_log="./logs/tensorboard/"
)

model.learn(
    total_timesteps=500_000,
    callback=CheckpointCallback(
        save_freq=20_000,
        save_path="./models/ppo/checkpoints/"
    )
)

model.save("./models/ppo/best_model")
```
Proximal Policy Optimization (PPO) is the workhorse of modern RL. It is:

- stable and forgiving of hyperparameter choices
- happy with continuous action spaces like our velocity commands
- well supported in Stable-Baselines3, with sensible defaults
It works by taking small, cautious steps in policy space — never changing too much at once.
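Concretely, those cautious steps come from PPO's clipped objective. Written in the same plain-text style as the earlier formulas (ε is the clip range, 0.2 by default in Stable-Baselines3):

ratio = π_new(action | state) / π_old(action | state)
L_clip = mean( min( ratio × advantage, clip(ratio, 1 − ε, 1 + ε) × advantage ) )

If an update would move the policy so far that the ratio leaves [1 − ε, 1 + ε], the clipping removes any incentive to push further in that direction.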
On my M1 Mac, 500,000 timesteps takes about 45 minutes. Using MPS (Apple's GPU acceleration) speeds things up significantly:
```python
import torch

def get_device():
    if torch.backends.mps.is_available():
        return "mps"   # Apple Silicon GPU
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    return "cpu"
```
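Stable-Baselines3 accepts a device argument, so wiring this in is a one-keyword change to the PPO setup above; a minimal sketch:

```python
model = PPO(
    "MlpPolicy",
    env,
    device=get_device(),  # "mps", "cuda", or "cpu"
    verbose=1,
)
```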
After fixing the reward function and the action space, here's what successful navigation looks like:
The path is remarkably direct — the agent learned to find efficient routes through obstacle fields without explicit path-planning algorithms.
One of the most interesting experiments was running all five controllers (Go-to-Goal, Potential Field, PPO, SAC, and TD3) on the same scenarios.
In this scenario the classical controllers actually outperform PPO in step count. Potential Field is particularly efficient at 100 steps. But watch what happens in a more challenging scenario:
This is the real story: PPO is more robust. When the obstacle configuration gets tricky, classical controllers fail — Go-to-Goal collides, Potential Field gets trapped in a local minimum. PPO finds a way through.
Notice how SAC (Soft Actor-Critic) shows the classic spinning behavior I mentioned earlier — that small spiral pattern is exactly what happens when the reward function doesn't penalize excessive rotation. (The SAC and TD3 runs were still trained on the old reward system.)
Go-to-Goal: fast, simple, interpretable. Fails on complex obstacle layouts.
Potential Field: smooth paths, no tuning. Local-minima trap, can time out.
PPO: robust, handles complex scenarios. Needs training, less interpretable.
SAC (needed retraining with the new reward): sample efficient in theory; struggled with this task.
TD3 (needed retraining with the new reward): good for continuous control; gave up too easily.
One more concept worth visualizing: odometry drift.
In the real world, robots estimate their position by counting wheel rotations. Small errors accumulate — wheels slip, measurements aren't perfect, and soon the robot thinks it's somewhere it's not.
In this simulation I added small random noise to wheel-encoder readings. Even with tiny errors per step, the estimated position drifts noticeably from the true path. This is why real robots need additional sensors (cameras, LIDAR, GPS) to correct their position estimates — a topic for another post.
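To make the drift concrete, here's roughly how I think about the experiment. This is a standalone sketch rather than the repo's code, and the noise level `noise_std` is an illustrative assumption:

```python
import numpy as np

def simulate_odometry_drift(steps=500, dt=0.1, noise_std=0.02, seed=0):
    """Integrate a true pose and a noisy odometry estimate side by side."""
    rng = np.random.default_rng(seed)
    true_pose = np.array([0.0, 0.0, 0.0])  # x, y, theta
    est_pose = np.array([0.0, 0.0, 0.0])

    for _ in range(steps):
        v, omega = 0.5, 0.2  # constant commanded velocities

        # Ground-truth motion
        true_pose += dt * np.array([v * np.cos(true_pose[2]),
                                    v * np.sin(true_pose[2]),
                                    omega])

        # Odometry integrates slightly wrong velocities (encoder noise)
        v_meas = v + rng.normal(0, noise_std)
        omega_meas = omega + rng.normal(0, noise_std)
        est_pose += dt * np.array([v_meas * np.cos(est_pose[2]),
                                   v_meas * np.sin(est_pose[2]),
                                   omega_meas])

    # Final gap between where the robot is and where it thinks it is
    return np.linalg.norm(true_pose[:2] - est_pose[:2])

print(f"Position error after 50 s: {simulate_odometry_drift():.2f} m")
```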
After weeks of debugging, retraining, and head-scratching, here are the lessons that stuck with me:
Before training RL, implement classical solutions. They validate your simulation, provide baselines, and their behaviour informs RL design (for example, the forward-only constraint).
When RL fails, resist the urge to tune hyperparameters randomly. Instead: inspect the observations the agent actually receives, verify the simulation physics, simplify the reward until the signal is unambiguous, and watch episodes to see which behaviour is really being rewarded.
Complex reward functions with many terms confuse the agent. Progress-based rewards with clear, consistent signals outperform elaborate multi-objective formulations.
Constraining the action space to match desired behaviour prevents the agent from learning counterproductive strategies. If your robot shouldn't back up, don't allow negative linear velocity.
The most valuable output isn't the final model — it's the journey. Diagnostic scripts and change logs help you understand what went wrong and why the fixes worked.
Want to try this?
```bash
# Clone and setup
git clone https://github.com/padawanabhi/robot_sim.git
cd robot_sim
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Test simulation
python scripts/01_test_simulation.py

# Train RL agent
python scripts/06_train_rl.py --algorithm ppo --timesteps 500000

# Evaluate trained model
python scripts/07_evaluate_rl.py --algorithm ppo --episodes 5

# Compare all controllers
python scripts/09_compare_controllers.py
```
To monitor training progress:

```bash
tensorboard --logdir logs/tensorboard
# then open http://localhost:6006
```
This project opened doors to several exciting directions: richer 3D physics, realistic sensors like LIDAR and cameras, and better navigation algorithms, which is exactly where the follow-up post below goes.
Robotics doesn't have to be intimidating. With a simple 2D simulator and some persistence, you can:

- simulate a differential drive robot and its world
- implement and understand classical controllers
- train an RL agent that out-navigates them in hard scenarios
- compare everything side by side and learn from the failures
The most valuable lesson wasn't about robots or RL — it was about systematic problem solving. When your agent spins in circles, you don't blame the algorithm. You dig into the observations, verify the physics, and simplify until you find the bug.
That's the real skill. And it applies far beyond robotics.
If you've followed along, the 3D follow-up, From Flatland to Physics, takes the same ideas into PyBullet with sensors, navigation algorithms, and PPO training on a 3D differential-drive robot.