Robotics · 2025-12-28 · 12 min read · By Abhishek Nair - Fractional CTO for Deep Tech & AI

From Zero to Robot: A Beginner's Guide to 2D Robot Simulation and Reinforcement Learning

#Robotics · #Reinforcement Learning · #Stable-Baselines3 · #Tutorial · #PPO · #IR-SIM

Originally published on Medium in December 2025. This is a re-publish with light voice edits; the canonical URL points to the Medium original.

What we're building

By the end of this guide you'll understand:

  1. How robots move — the math behind differential drive (two-wheeled) robots.
  2. Classical navigation — how traditional algorithms like "go-to-goal" and potential fields work.
  3. Reinforcement learning basics — training a neural network to navigate through trial and error.
  4. Why simple things are hard — the debugging journey that took me from 0% to 100% success rate.

I use IR-SIM (Intelligent Robot Simulator), a lightweight 2D simulation library that's perfect for learning. No 3D graphics card required, no complex physics engine — just pure, understandable robotics.

Part 1: Understanding how robots move

The differential drive model

Most wheeled robots you've seen — from Roombas to warehouse AGVs — use what's called a differential drive system. It's beautifully simple: two wheels, each with its own motor. By controlling the speed of each wheel independently, you can make the robot go forward, backward, turn, or spin in place.

Think of a tank's treads, or the way you steer a shopping cart by pushing harder on one side.


The math works like this:

  • Linear velocity (v) — how fast the robot moves forward or backward.
  • Angular velocity (ω) — how fast the robot rotates (spins).

From these two values you calculate individual wheel speeds, and vice versa. The robot's position updates according to:

x_new = x + v × cos(θ) × dt
y_new = y + v × sin(θ) × dt
θ_new = θ + ω × dt

Where θ is the robot's heading (which direction it's facing) and dt is the time step.
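These equations translate directly into code. A minimal sketch of one simulation step, where `step_pose` and `wheel_speeds` are illustrative helper names (IR-SIM does this internally) and `wheel_base` is the distance between the two wheels:

```python
import math

def step_pose(x, y, theta, v, omega, dt):
    """Advance the robot pose one time step using the update equations above."""
    x_new = x + v * math.cos(theta) * dt
    y_new = y + v * math.sin(theta) * dt
    theta_new = theta + omega * dt
    return x_new, y_new, theta_new

def wheel_speeds(v, omega, wheel_base):
    """Convert body velocities (v, omega) to left/right wheel speeds."""
    v_left = v - omega * wheel_base / 2
    v_right = v + omega * wheel_base / 2
    return v_left, v_right

# Driving straight (omega = 0): both wheels spin at the same speed
# and the robot advances along its current heading.
x, y, theta = step_pose(0.0, 0.0, 0.0, 1.0, 0.0, 0.1)
```

Note that equal and opposite wheel speeds (`v = 0`, `omega != 0`) produce a spin in place, which is exactly the "can spin, can't crab-walk" constraint described below.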

Why this matters

Understanding differential drive is crucial because it defines what the robot can and cannot do:

  • Can: move forward / backward.
  • Can: turn while moving (arc motion).
  • Can: spin in place.
  • Cannot: move sideways (no crab walking).

This constraint shapes everything else — from how we design controllers to what actions our RL agent can take.

Part 2: Setting up IR-SIM

IR-SIM is a Python library that simulates 2D robot environments. It handles the physics, visualization, and provides a clean interface for control.

Installation

```bash
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install ir-sim gymnasium stable-baselines3 numpy matplotlib pyyaml
```

Your first simulation

The robotics equivalent of "Hello, World":

```python
import os
os.environ['MPLBACKEND'] = 'TkAgg'  # Important for macOS

from irsim.world import World
import matplotlib.pyplot as plt

world_config = {
    'world_name': 'my_first_robot',
    'height': 10,
    'width': 10,
    'step_time': 0.1,
    'sample_time': 0.1,
    'offset': [0, 0],
    'control_mode': 'keyboard'
}
world = World(**world_config)

robot_config = {
    'type': 'diff',       # differential drive
    'shape': 'circle',
    'radius': 0.3,
    'state': [2, 2, 0],   # x, y, theta
    'vel_min': [-1, -1],
    'vel_max': [1, 1],
    'goal': [8, 8]
}
world.add_robot(robot_config)

obstacle_config = {
    'type': 'obstacle',
    'shape': 'circle',
    'center': [5, 5],
    'radius': 0.5
}
world.add_obstacle(obstacle_config)

plt.ion()
for i in range(200):
    world.step()
    plt.pause(0.05)
```

Run this and you should see a circular robot and an obstacle. The robot won't move yet — we need to tell it what to do.

Part 3: Classical navigation — go-to-goal controller

Before AI, roboticists solved navigation with clever control algorithms. Let's implement the simplest one: proportional control toward a goal.

The concept

Imagine you're walking toward a friend across a field. You naturally:

  1. Look at where they are.
  2. Turn to face them.
  3. Walk forward.

That's essentially what go-to-goal does:

  1. Calculate the angle to the goal.
  2. Compare it to your current heading.
  3. Turn to reduce the error.
  4. Move forward.

The code

```python
import numpy as np

class GoToGoalController:
    def __init__(self, k_linear=0.5, k_angular=2.0):
        self.k_linear = k_linear    # forward speed gain
        self.k_angular = k_angular  # turning speed gain

    def compute_velocity(self, robot_state, goal):
        """
        robot_state: [x, y, theta]
        goal: [goal_x, goal_y]
        returns: [linear_vel, angular_vel]
        """
        x, y, theta = robot_state
        goal_x, goal_y = goal

        dx = goal_x - x
        dy = goal_y - y
        distance = np.sqrt(dx**2 + dy**2)

        angle_to_goal = np.arctan2(dy, dx)
        heading_error = angle_to_goal - theta
        # Normalize the error to [-pi, pi]
        heading_error = np.arctan2(np.sin(heading_error), np.cos(heading_error))

        if distance < 0.3:  # close enough: stop
            return [0.0, 0.0]

        linear_vel = np.clip(self.k_linear * distance, 0.0, 1.0)
        angular_vel = np.clip(self.k_angular * heading_error, -1.0, 1.0)
        return [linear_vel, angular_vel]
```

This works beautifully — until there's an obstacle in the way. The robot will drive straight into it because it only knows about the goal.

Adding obstacle avoidance

The simplest fix: if something is close and in front of you, turn away.

```python
def compute_velocity_with_avoidance(self, robot_state, goal, obstacles):
    x, y, theta = robot_state

    for obs in obstacles:
        obs_x, obs_y = obs['center']
        obs_radius = obs['radius']

        # Edge-to-edge distance (robot radius is 0.3)
        dist = np.sqrt((obs_x - x)**2 + (obs_y - y)**2) - obs_radius - 0.3
        angle_to_obs = np.arctan2(obs_y - y, obs_x - x)
        relative_angle = abs(angle_to_obs - theta)

        # Obstacle close and roughly in front: turn away from it
        if dist < 0.5 and relative_angle < np.pi / 3:
            if angle_to_obs - theta > 0:
                return [0.1, -0.8]  # obstacle to the left: turn right
            return [0.1, 0.8]       # obstacle to the right: turn left

    return self.compute_velocity(robot_state, goal)
```

This is reactive navigation — simple, fast, but limited. It can get stuck in corners and doesn't plan ahead.

Part 4: Potential fields — a smoother approach

What if instead of abrupt "if-then" rules we created a smooth force field that naturally guides the robot?

The concept

Imagine the goal is a magnet attracting the robot and obstacles are magnets repelling it. The robot follows the combined force.


Attractive potential (goal):

F_att = k_att × (goal_position − robot_position)

Repulsive potential (obstacles), applied only when the obstacle is closer than a threshold distance:

F_rep = k_rep × (1/distance − 1/threshold) × direction_away

The robot moves in the direction of the total force.
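A hedged sketch of that force computation. The gains `k_att` and `k_rep`, the threshold, and the `(center, radius)` obstacle format are illustrative choices, not IR-SIM APIs:

```python
import numpy as np

def total_force(robot, goal, obstacles, k_att=1.0, k_rep=0.5, threshold=1.0):
    """Sum the goal's attractive pull and the repulsive push of nearby obstacles."""
    robot = np.asarray(robot, dtype=float)
    # Attractive: proportional to the vector from robot to goal
    force = k_att * (np.asarray(goal, dtype=float) - robot)

    for center, radius in obstacles:
        offset = robot - np.asarray(center, dtype=float)
        dist = np.linalg.norm(offset) - radius
        if 0 < dist < threshold:
            # Repulsive: grows as 1/dist inside the threshold, directed away
            direction_away = offset / np.linalg.norm(offset)
            force += k_rep * (1.0 / dist - 1.0 / threshold) * direction_away
    return force

# With no obstacles in range the force points straight at the goal
f = total_force([0.0, 0.0], [4.0, 0.0], obstacles=[])
```

With two obstacles placed symmetrically about the path, their sideways pushes cancel while their backward pushes add up, which is exactly how the local-minimum trap below arises.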

The limitation

Potential fields look elegant on paper but have a famous problem: local minima. Imagine two obstacles flanking the direct path to the goal. Their sideways pushes cancel, their combined backward push balances the goal's pull, and the net force drops to zero — the robot gets stuck at a spot that isn't the goal.

This is where learning-based approaches shine — they can potentially find creative solutions that hand-crafted algorithms miss.

Part 5: Enter reinforcement learning

Here's where things get interesting. Instead of writing rules, what if we let the robot learn to navigate through trial and error?

RL in 60 seconds

Reinforcement learning has four components:

  1. Agent — our robot's brain (a neural network).
  2. Environment — the simulation world.
  3. State — what the robot observes (position, obstacles, goal).
  4. Reward — a number telling the robot if it's doing well or poorly.

The agent takes actions, observes the results, and adjusts its strategy to maximize total reward. It's like training a dog with treats — except the "dog" is a neural network and the "treats" are positive numbers.
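The loop itself is short; all the intelligence lives in how actions get chosen. A toy sketch of the agent-environment cycle, where `ToyEnv` is a made-up one-dimensional stand-in for the simulator and the "agent" just acts randomly:

```python
import random

class ToyEnv:
    """A 1-D stand-in for the simulator: the state is the distance to the goal."""
    def reset(self):
        self.distance = 5.0
        return self.distance

    def step(self, action):
        # action: how far to move toward the goal this step
        self.distance = max(0.0, self.distance - action)
        reward = -self.distance          # closer to the goal = higher reward
        terminated = self.distance == 0.0
        return self.distance, reward, terminated

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for _ in range(20):
    action = random.uniform(0.0, 1.0)   # a trained policy would pick this instead
    state, reward, terminated = env.step(action)
    total_reward += reward
    if terminated:
        break
```

Swap the random choice for a neural network, and "adjusting the strategy to maximize total reward" is exactly what an algorithm like PPO does with the `(state, action, reward)` triples this loop produces.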


The Gymnasium interface

To use modern RL libraries like Stable-Baselines3, we wrap IR-SIM in a Gymnasium environment:

```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class DiffDriveNavEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(7,), dtype=np.float32
        )
        self.action_space = spaces.Box(
            low=np.array([0.0, -1.0]),  # forward only!
            high=np.array([1.0, 1.0]),
            dtype=np.float32
        )

    def step(self, action):
        self.world.set_robot_velocity(action)
        self.world.step()

        obs = self._get_observation()
        reward = self._compute_reward(obs)
        terminated = self._check_goal_reached()
        truncated = self._check_collision() or self.steps > 1000
        return obs, reward, terminated, truncated, {}
```

The observation: what does the robot see?

I chose 7 values that give the robot situational awareness:

```python
def _get_observation(self):
    dx = self.goal[0] - self.robot.x
    dy = self.goal[1] - self.robot.y
    dist = np.sqrt(dx**2 + dy**2)

    # Unit vector toward the goal
    dx_norm = dx / max(dist, 0.01)
    dy_norm = dy / max(dist, 0.01)

    # Heading error, normalized to [-pi, pi]
    angle_to_goal = np.arctan2(dy, dx)
    dtheta = angle_to_goal - self.robot.theta
    dtheta = np.arctan2(np.sin(dtheta), np.cos(dtheta))

    v = self.robot.linear_velocity
    w = self.robot.angular_velocity
    min_obstacle_dist = self._get_min_obstacle_distance()

    return np.array([
        dx_norm, dy_norm, dtheta, v, w, min_obstacle_dist, dist
    ], dtype=np.float32)
```

Part 6: The reward function — where everything goes wrong

Here's the dirty secret of RL: the reward function is everything. A bad reward leads to bizarre, unexpected behavior. A good reward leads to elegant solutions.

My first attempt (disaster)

I thought I was being clever:

```python
def _compute_reward_v1(self, obs):
    reward = 0
    reward += 100 / (obs[6] + 0.1)             # inverse distance
    reward -= 0.01                             # time penalty
    if obs[5] < 0.5:
        reward -= 10 * (0.5 - obs[5])          # near-obstacle penalty
    reward += 0.5 * abs(obs[3])                # any-movement bonus
    reward += 1.0 * (1 - abs(obs[2]) / np.pi)  # face-the-goal bonus
    if self._check_collision():
        reward -= 200                          # collision penalty
    if obs[6] < 0.3:
        reward += 50                           # goal bonus
    return reward
```

Result: the robot spun in circles. Why?

The problem: competing signals

With so many reward terms, the agent got confused:

  • Inverse distance created weird gradients.
  • Movement bonus encouraged any movement, including spinning.
  • No penalty for excessive rotation.

The fix: research-based simplicity

After reading papers on RL navigation, I simplified dramatically:

```python
def _compute_reward_v2(self, obs):
    progress = self.prev_distance - obs[6]
    self.prev_distance = obs[6]

    reward = 0
    reward += 10.0 * max(0.0, progress)        # primary: reward forward progress

    angular_vel = abs(obs[4])
    if angular_vel > 0.8:
        reward -= 5.0 * (angular_vel - 0.8)    # penalty: excessive spinning

    if obs[5] < 0.3:
        reward -= 5.0                          # penalty: too close to obstacles
    elif obs[5] < 0.5:
        reward -= 1.0

    if obs[6] < 0.3:
        reward += 100.0                        # goal bonus

    return reward
```

Key insights:

  1. Progress-based reward — reward closing the gap, not absolute distance.
  2. Rotation penalty — stop the spinning behavior explicitly.
  3. Few, clear signals — don't confuse the agent with complex objectives.

The action-space fix

There was another crucial change: forward-only movement.

My original action space allowed [-1, 1] for linear velocity. The robot could back up. The problem: it learned to back away from obstacles indefinitely instead of navigating around them.

By constraining linear velocity to [0, 1], the robot had to move forward and find a path. It couldn't escape by retreating.
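In Gymnasium terms the fix is a one-line change to the `Box` bounds. A minimal numpy sketch of the effect; the `clamp_action` helper is illustrative (in the real environment, the bounds live in `self.action_space`):

```python
import numpy as np

# Forward-only bounds: [linear velocity, angular velocity]
low = np.array([0.0, -1.0])
high = np.array([1.0, 1.0])

def clamp_action(action):
    """Project a raw policy output into the allowed action space."""
    return np.clip(action, low, high)

# A policy that tries to back away gets its linear velocity clamped to zero,
# so "retreat" is simply not expressible.
a = clamp_action(np.array([-0.7, 0.3]))
```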

Part 7: Training the agent

With the environment set up correctly, training is surprisingly straightforward:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

env = DiffDriveNavEnv(randomize=True)

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1,
    tensorboard_log="./logs/tensorboard/"
)

model.learn(
    total_timesteps=500_000,
    callback=CheckpointCallback(
        save_freq=20_000,
        save_path="./models/ppo/checkpoints/"
    )
)

model.save("./models/ppo/best_model")
```

PPO: why this algorithm?

Proximal Policy Optimization (PPO) is the workhorse of modern RL. It is:

  • Stable — doesn't diverge easily.
  • Reasonably sample efficient — learns quickly for an on-policy method.
  • General purpose — works for many problems.

It works by taking small, cautious steps in policy space — never changing too much at once.
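That caution is built into PPO's clipped surrogate objective (the standard form from the PPO paper), which caps how much the probability ratio between the new and old policies can influence a single update:

```latex
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

Here \(\hat{A}_t\) is the advantage estimate and \(\epsilon\) (typically 0.2) is the clip range: once the ratio \(r_t\) drifts outside \([1-\epsilon, 1+\epsilon]\), the objective stops rewarding further movement in that direction.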

Training time

On my M1 Mac, 500,000 timesteps takes about 45 minutes. Using MPS (Apple's GPU acceleration) speeds things up significantly:

```python
import torch

def get_device():
    if torch.backends.mps.is_available():
        return "mps"   # Apple Silicon GPU
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    return "cpu"
```

Part 8: The results

After fixing the reward function and the action space, here's what successful navigation looks like:

[Figure: the trained PPO agent's trajectory through the obstacle field]

The path is remarkably direct — the agent learned to find efficient routes through obstacle fields without explicit path-planning algorithms.

Comparing all approaches

One of the most interesting experiments was running all five controllers on the same scenarios:

[Figure: all five controllers on a simple scenario]

In this scenario the classical controllers actually outperform PPO in step count. Potential Field is particularly efficient at 100 steps. But watch what happens in a more challenging scenario:

[Figure: the same controllers on a harder obstacle layout]

This is the real story: PPO is more robust. When the obstacle configuration gets tricky, classical controllers fail — Go-to-Goal collides, Potential Field gets trapped in a local minimum. PPO finds a way through.

Notice how SAC (Soft Actor-Critic) shows the classic spinning behavior I mentioned earlier — that small spiral pattern is exactly what happens when the reward function doesn't penalize excessive rotation. (The SAC and TD3 runs were still trained on the old reward system.)

The algorithm comparison

Go-to-Goal: fast, simple, interpretable. Fails on complex obstacle layouts.

Potential Field: smooth paths, no tuning. Local-minima trap, can time out.

PPO: robust, handles complex scenarios. Needs training, less interpretable.

SAC (needed retraining with the new reward): sample efficient in theory; struggled with this task.

TD3 (needed retraining with the new reward): good for continuous control; gave up too easily.

Part 9: Bonus — understanding odometry drift

One more concept worth visualizing: odometry drift.

In the real world, robots estimate their position by counting wheel rotations. Small errors accumulate — wheels slip, measurements aren't perfect, and soon the robot thinks it's somewhere it's not.

[Figure: estimated position drifting away from the true path]

In this simulation I added small random noise to wheel-encoder readings. Even with tiny errors per step, the estimated position drifts noticeably from the true path. This is why real robots need additional sensors (cameras, LIDAR, GPS) to correct their position estimates — a topic for another post.
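You can reproduce the effect in a few lines: integrate the same wheel commands twice, once with perfect encoders and once with small Gaussian noise on each reading. The noise level and wheel base below are illustrative, not the values from my simulation:

```python
import math
import random

def integrate(commands, wheel_base=0.3, dt=0.1, noise_std=0.0, seed=0):
    """Dead-reckon a pose from left/right wheel speeds, optionally with encoder noise."""
    rng = random.Random(seed)
    x = y = theta = 0.0
    for v_left, v_right in commands:
        # Each encoder reading carries its own independent error
        v_left += rng.gauss(0.0, noise_std)
        v_right += rng.gauss(0.0, noise_std)
        v = (v_left + v_right) / 2.0
        omega = (v_right - v_left) / wheel_base
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += omega * dt
    return x, y

commands = [(1.0, 1.0)] * 200                   # drive straight for 20 seconds
true_pos = integrate(commands)                  # perfect encoders
est_pos = integrate(commands, noise_std=0.05)   # noisy encoders
drift = math.dist(true_pos, est_pos)            # how far the estimate wandered
```

The key point: the errors feed into `theta`, so a small early mistake rotates every subsequent step and the drift compounds rather than averaging out.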

Part 10: Lessons learned

After weeks of debugging, retraining and head-scratching:

1. Start with classical controllers

Before training RL, implement classical solutions. They validate your simulation, provide baselines, and their behavior informs RL design (for example, the forward-only constraint).

2. Debug systematically

When RL fails, resist the urge to tune hyperparameters randomly. Instead:

  • Verify environment physics.
  • Check observation values during episodes.
  • Ensure collision detection works.
  • Validate that reward signals make sense.

3. Simpler rewards are better

Complex reward functions with many terms confuse the agent. Progress-based rewards with clear, consistent signals outperform elaborate multi-objective formulations.

4. Action space design matters

Constraining the action space to match desired behavior prevents the agent from learning counterproductive strategies. If your robot shouldn't back up, don't allow negative linear velocity.

5. Document everything

The most valuable output isn't the final model — it's the journey. Diagnostic scripts and change logs help you understand what went wrong and why the fixes worked.

Getting started yourself

Want to try this?

```bash
# Clone and set up
git clone https://github.com/padawanabhi/robot_sim.git
cd robot_sim
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Test simulation
python scripts/01_test_simulation.py

# Train RL agent
python scripts/06_train_rl.py --algorithm ppo --timesteps 500000

# Evaluate trained model
python scripts/07_evaluate_rl.py --algorithm ppo --episodes 5

# Compare all controllers
python scripts/09_compare_controllers.py
```

Monitor training

```bash
tensorboard --logdir logs/tensorboard
# open http://localhost:6006
```

What's next

This project opened doors to several exciting directions:

  • Curriculum learning — start with easy environments (no obstacles), gradually increase difficulty.
  • Multi-robot coordination — train multiple agents to navigate without colliding with each other.
  • Sim-to-real transfer — deploy learned policies to physical robots like TurtleBot or custom Arduino builds.
  • 3D environments — graduate to Gazebo or Isaac Sim for 3D navigation.

Conclusion

Robotics doesn't have to be intimidating. With a simple 2D simulator and some persistence, you can:

  • Understand how robots move and think.
  • Implement classical control algorithms.
  • Train neural networks to navigate.
  • Learn the art of debugging ML systems.

The most valuable lesson wasn't about robots or RL — it was about systematic problem solving. When your agent spins in circles, you don't blame the algorithm. You dig into the observations, verify the physics, and simplify until you find the bug.

That's the real skill. And it applies far beyond robotics.

Resources

If you've followed along, the 3D follow-up, From Flatland to Physics, takes the same ideas into PyBullet with sensors, navigation algorithms, and PPO training on a 3D differential-drive robot.
