
Capstone Project: Voice-Controlled Delivery Robot

Congratulations on completing the Physical AI & Humanoid Robotics curriculum! This capstone project brings together everything you've learned, from ROS 2 and Gazebo simulation to NVIDIA Isaac and voice commands, to build a fully autonomous delivery robot.

Project Overview

Goal: Build a mobile robot that accepts voice commands to navigate to locations, pick up objects, and deliver them to specified destinations.

Duration: 2-4 weeks

Difficulty: Intermediate to Advanced

System Architecture

┌─────────────────────────────────────────────────────────┐
│                   Voice Command Layer                   │
│       Whisper STT → GPT-4 Intent → Action Planner       │
└─────────────────┬───────────────────────────────────────┘
┌─────────────────▼───────────────────────────────────────┐
│                 Navigation Stack (Nav2)                 │
│        SLAM → Path Planning → Obstacle Avoidance        │
└─────────────────┬───────────────────────────────────────┘
┌─────────────────▼───────────────────────────────────────┐
│                Perception & Manipulation                │
│       Camera → Object Detection → Gripper Control       │
└─────────────────┬───────────────────────────────────────┘
┌─────────────────▼───────────────────────────────────────┐
│                     Robot Platform                      │
│  Mobile Base + Robotic Arm (Simulated in Gazebo/Isaac)  │
└─────────────────────────────────────────────────────────┘

Learning Objectives

By completing this project, you will:

  1. ✅ Integrate multiple ROS 2 packages into a cohesive system
  2. ✅ Implement voice-based human-robot interaction
  3. ✅ Apply SLAM and autonomous navigation in realistic environments
  4. ✅ Use computer vision for object detection and localization
  5. ✅ Coordinate mobile base and manipulator for pick-and-place tasks
  6. ✅ Test and debug a complex robotics system end-to-end

Phase 1: Setup & Simulation Environment

1.1 Choose Your Platform

Option A: Gazebo + TurtleBot 4

  • Free and open-source
  • Wide community support
  • Runs on any Linux machine

Option B: Isaac Sim + Carter Robot

  • Photorealistic simulation
  • GPU acceleration
  • Requires NVIDIA RTX GPU

1.2 Install Dependencies

# For Gazebo + TurtleBot 4
sudo apt install ros-humble-turtlebot4-desktop
sudo apt install ros-humble-navigation2
sudo apt install ros-humble-nav2-bringup

# For voice control
pip install openai-whisper pyaudio
pip install openai # For GPT-4 intent parsing

# For perception
sudo apt install ros-humble-vision-msgs
pip install ultralytics # YOLOv8 for object detection
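
PyAudio needs the PortAudio development headers to build, and Whisper calls out to ffmpeg at runtime, so on Ubuntu you will likely also need the following system packages (package names are this guide's assumption):

# System dependencies for PyAudio and Whisper (Ubuntu/Debian)
sudo apt install portaudio19-dev ffmpeg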

1.3 Create Project Workspace

mkdir -p ~/capstone_ws/src
cd ~/capstone_ws/src
git clone -b humble https://github.com/turtlebot/turtlebot4.git   # humble branch to match the apt packages above
git clone -b humble https://github.com/ros-planning/navigation2.git
cd ~/capstone_ws
colcon build
source install/setup.bash
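
The voice controller in Phase 2 assumes its own ament_python package inside this workspace. A minimal sketch of creating it (the package name voice_controller matches the file paths used below):

cd ~/capstone_ws/src
ros2 pkg create voice_controller --build-type ament_python --dependencies rclpy std_msgs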

Phase 2: Voice Command Interface

2.1 Implement Speech Recognition

# src/voice_controller/voice_controller/voice_node.py
import wave

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import whisper
import pyaudio


class VoiceControllerNode(Node):
    def __init__(self):
        super().__init__('voice_controller')
        self.model = whisper.load_model("base")
        self.command_pub = self.create_publisher(String, '/voice_commands', 10)

        # Start listening loop
        self.timer = self.create_timer(5.0, self.listen_and_publish)

    def listen_and_publish(self):
        # Record 3 seconds of audio
        audio_file = self.record_audio(duration=3)

        # Transcribe
        result = self.model.transcribe(audio_file)
        command = result["text"].strip()

        # Publish command
        msg = String()
        msg.data = command
        self.command_pub.publish(msg)
        self.get_logger().info(f"Voice command: {command}")

    def record_audio(self, duration=3, path='/tmp/voice_command.wav'):
        # Minimal recording helper (not in the original listing): capture
        # `duration` seconds from the default microphone with PyAudio and
        # save it as a 16 kHz mono WAV file for Whisper.
        rate, chunk = 16000, 1024
        pa = pyaudio.PyAudio()
        stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                         input=True, frames_per_buffer=chunk)
        frames = [stream.read(chunk, exception_on_overflow=False)
                  for _ in range(int(rate / chunk * duration))]
        stream.stop_stream()
        stream.close()
        with wave.open(path, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
            wf.setframerate(rate)
            wf.writeframes(b''.join(frames))
        pa.terminate()
        return path


def main():
    rclpy.init()
    node = VoiceControllerNode()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == '__main__':
    main()

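To run the node with ros2 run, register it as a console script in the package's setup.py (a sketch matching the package layout assumed in Phase 1.3):

# setup.py (excerpt)
entry_points={
    'console_scripts': [
        'voice_node = voice_controller.voice_node:main',
    ],
},

After rebuilding with colcon build and re-sourcing the workspace, start the node with ros2 run voice_controller voice_node.
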
2.2 Intent Extraction

import json

import openai


def parse_delivery_command(command):
    """Extract a structured delivery task from natural language."""
    prompt = f"""
    Parse this delivery command: "{command}"

    Extract:
    - action: "deliver" | "pick" | "navigate"
    - object: item to deliver (e.g., "coffee cup")
    - destination: where to deliver (e.g., "office A")

    Return JSON only.
    """

    # Uses the pre-1.0 openai client interface; pin `openai<1.0`
    # or adapt this call to the newer OpenAI() client.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(response.choices[0].message.content)

Example:

  • Input: "Deliver the coffee cup to office A"
  • Output: {"action": "deliver", "object": "coffee cup", "destination": "office A"}
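
A thin bridge node can connect sections 2.1 and 2.2: it subscribes to the raw transcripts on /voice_commands, runs parse_delivery_command, and republishes the structured task. A minimal sketch; the /delivery_tasks topic and the JSON-in-a-String encoding are conventions assumed by this guide, not fixed by the curriculum:

import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class IntentBridgeNode(Node):
    """Turns raw voice transcripts into structured delivery tasks."""

    def __init__(self):
        super().__init__('intent_bridge')
        self.task_pub = self.create_publisher(String, '/delivery_tasks', 10)
        self.command_sub = self.create_subscription(
            String, '/voice_commands', self.on_command, 10)

    def on_command(self, msg):
        try:
            task = parse_delivery_command(msg.data)  # from section 2.2
        except Exception as exc:
            self.get_logger().warn(f"Could not parse '{msg.data}': {exc}")
            return
        out = String()
        out.data = json.dumps(task)
        self.task_pub.publish(out)


def main():
    rclpy.init()
    rclpy.spin(IntentBridgeNode())
    rclpy.shutdown()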

Phase 3: Autonomous Navigation

3.1 Launch SLAM Mapping

# Terminal 1: start the Gazebo simulation
ros2 launch turtlebot4_ignition_bringup ignition.launch.py

# Terminal 2: launch SLAM Toolbox
ros2 launch slam_toolbox online_async_launch.py

# Terminal 3: drive the robot around to build the map
ros2 run teleop_twist_keyboard teleop_twist_keyboard

3.2 Save Map

ros2 run nav2_map_server map_saver_cli -f ~/maps/office_map
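
Before the navigation node in 3.3 can send goals, Nav2 must be running against the saved map. One way to bring it up with the stock nav2_bringup launch file (exact launch files vary between TurtleBot 4 releases, so treat this as a sketch):

# Localization + Nav2 servers, using the map saved above
ros2 launch nav2_bringup bringup_launch.py map:=$HOME/maps/office_map.yaml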

3.3 Navigation Node

import numpy as np

from nav2_simple_commander.robot_navigator import BasicNavigator, TaskResult
from geometry_msgs.msg import PoseStamped


class DeliveryNavigator:
    def __init__(self):
        self.navigator = BasicNavigator()

        # Wait for Nav2 to become active (requires an initial pose,
        # e.g. set in RViz or via BasicNavigator.setInitialPose()).
        self.navigator.waitUntilNav2Active()

        # Define known locations as (x, y, yaw) in the map frame
        self.locations = {
            "office_a": (2.0, 3.0, 0.0),
            "office_b": (-1.0, 2.5, 1.57),
            "charging_station": (0.0, 0.0, 0.0),
        }

    def navigate_to(self, location_name):
        if location_name not in self.locations:
            return False

        goal = PoseStamped()
        goal.header.frame_id = 'map'
        goal.header.stamp = self.navigator.get_clock().now().to_msg()

        x, y, yaw = self.locations[location_name]
        goal.pose.position.x = x
        goal.pose.position.y = y
        # Convert yaw to a quaternion (rotation about z only)
        goal.pose.orientation.z = np.sin(yaw / 2)
        goal.pose.orientation.w = np.cos(yaw / 2)

        self.navigator.goToPose(goal)

        # Wait for navigation to complete (isTaskComplete() spins the
        # navigator node internally)
        while not self.navigator.isTaskComplete():
            feedback = self.navigator.getFeedback()  # could be used for progress logging

        return self.navigator.getResult() == TaskResult.SUCCEEDED

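A quick way to exercise the navigator on its own, assuming Nav2 is already up and an initial pose has been set (e.g. in RViz):

import rclpy

rclpy.init()
nav = DeliveryNavigator()
if nav.navigate_to("office_a"):
    print("Arrived at office_a")
else:
    print("Navigation failed")
rclpy.shutdown()
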
Phase 4: Object Detection & Manipulation

4.1 YOLOv8 Object Detector

from rclpy.node import Node
from ultralytics import YOLO
from sensor_msgs.msg import Image
from cv_bridge import CvBridge


class ObjectDetectorNode(Node):
    def __init__(self):
        super().__init__('object_detector')
        self.model = YOLO('yolov8n.pt')  # Nano model for speed
        self.bridge = CvBridge()

        self.image_sub = self.create_subscription(
            Image,
            '/camera/image_raw',
            self.image_callback,
            10
        )

    def image_callback(self, msg):
        # Convert ROS image to OpenCV
        cv_image = self.bridge.imgmsg_to_cv2(msg, "rgb8")

        # Detect objects
        results = self.model(cv_image)

        # Find target object (e.g., "cup")
        for detection in results[0].boxes:
            class_id = int(detection.cls[0])
            class_name = self.model.names[class_id]

            if class_name == "cup":
                # Get bounding box center
                x_center = int((detection.xyxy[0][0] + detection.xyxy[0][2]) / 2)
                y_center = int((detection.xyxy[0][1] + detection.xyxy[0][3]) / 2)

                self.get_logger().info(f"Found cup at ({x_center}, {y_center})")
                # TODO: Convert to 3D coordinates and approach

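The TODO above is yours to implement. If your simulated camera also publishes an aligned depth image and CameraInfo, one common approach is a pinhole back-projection of the detection's pixel center into the camera frame; the helper below is a sketch under those assumptions (you would still transform the result into the map or base frame with tf2):

import numpy as np

def pixel_to_camera_frame(u, v, depth_m, camera_info):
    # Back-project pixel (u, v) at range depth_m using the pinhole
    # intrinsics K = [fx 0 cx; 0 fy cy; 0 0 1] from sensor_msgs/CameraInfo.
    fx, fy = camera_info.k[0], camera_info.k[4]
    cx, cy = camera_info.k[2], camera_info.k[5]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])
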
4.2 Gripper Control (Placeholder)

from rclpy.node import Node
from std_msgs.msg import Float64


class GripperController(Node):
    def __init__(self):
        super().__init__('gripper_controller')
        self.gripper_pub = self.create_publisher(
            Float64,
            '/gripper/position',
            10
        )

    def open_gripper(self):
        msg = Float64()
        msg.data = 0.08  # Fully open
        self.gripper_pub.publish(msg)

    def close_gripper(self):
        msg = Float64()
        msg.data = 0.0  # Fully closed
        self.gripper_pub.publish(msg)

Phase 5: Integration & Testing

5.1 Main Orchestrator

from rclpy.node import Node


class DeliveryRobotOrchestrator(Node):
    def __init__(self):
        super().__init__('delivery_orchestrator')
        self.navigator = DeliveryNavigator()
        self.detector = ObjectDetectorNode()
        self.gripper = GripperController()

    def execute_delivery(self, task):
        # 1. Navigate to the pickup location
        #    NOTE: "pickup_zone" must be added to DeliveryNavigator.locations
        self.get_logger().info(f"Navigating to pick up {task['object']}...")
        success = self.navigator.navigate_to("pickup_zone")
        if not success:
            return False

        # 2. Detect and approach object
        self.get_logger().info(f"Detecting {task['object']}...")
        # TODO: Visual servoing to approach object

        # 3. Pick up object
        self.get_logger().info("Picking up object...")
        self.gripper.open_gripper()
        # TODO: Lower arm, close gripper

        # 4. Navigate to destination
        self.get_logger().info(f"Delivering to {task['destination']}...")
        success = self.navigator.navigate_to(task['destination'])
        if not success:
            return False

        # 5. Place object
        self.get_logger().info("Placing object...")
        self.gripper.open_gripper()

        return True

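For a first bench test you can drive the orchestrator with a hard-coded task; in the full system the task would arrive from the voice pipeline in Phase 2. A minimal sketch (the task fields follow the JSON schema from section 2.2):

import rclpy

def main():
    rclpy.init()
    robot = DeliveryRobotOrchestrator()
    # Hard-coded task for testing; normally parsed from a voice command
    task = {"action": "deliver", "object": "coffee cup", "destination": "office_b"}
    ok = robot.execute_delivery(task)
    print("Delivery succeeded" if ok else "Delivery failed")
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Remember that "pickup_zone" has to exist in DeliveryNavigator.locations before step 1 can succeed.
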
5.2 Test Scenarios

| Scenario | Command | Expected Behavior |
|---|---|---|
| Basic Delivery | "Deliver the cup to office A" | Navigate → Pick → Deliver → Return |
| Multi-Object | "Pick up the red box and blue ball" | Sequential pick-and-place |
| Failure Recovery | "Deliver the cup" (cup not found) | Report error, return to start |
| Obstacle Avoidance | Navigate with dynamic obstacles | Replan path around obstacles |

Success Criteria

Your capstone is complete when your robot can:

  1. ✅ Accept voice commands in natural language
  2. ✅ Navigate autonomously in a mapped environment
  3. ✅ Detect and localize objects using computer vision
  4. ✅ Pick up objects with a gripper (or simulated gripper)
  5. ✅ Deliver objects to specified locations
  6. ✅ Handle at least one failure case gracefully

Bonus Challenges

Want to take it further?

  • 🌟 Multi-Robot Coordination: Deploy 2+ robots with task allocation
  • 🌟 Sim-to-Real Transfer: Deploy on a physical robot (TurtleBot, Fetch, etc.)
  • 🌟 Human Tracking: Follow a person using skeleton tracking
  • 🌟 Dynamic Re-planning: Adapt to changing environments in real-time
  • 🌟 Battery Management: Return to charging station when battery low

Resources

Try It Yourself

Ask the chatbot:

  • "How do I integrate Nav2 with voice commands?"
  • "What's the best way to tune SLAM parameters for my environment?"
  • "Can I deploy this capstone project on a real robot?"

Good luck with your capstone project! You're ready to build intelligent physical AI systems. 🤖