Het Patel - Robotics Engineer

Hello, I'm

Het Patel

Robotics Engineer & AI Researcher

Building intelligent autonomous systems at UIUC

About Me

I'm a graduate student pursuing my Master's in Autonomy and Robotics at the University of Illinois Urbana-Champaign with a perfect 4.0 GPA. My passion lies in building intelligent autonomous systems that can navigate and interact with complex, unstructured environments.

With hands-on experience at leading organizations including ISRO, Samsung R&D, and Vinayak Technology, I've architected and deployed cutting-edge robotics solutions ranging from 300kg AMRs for construction sites to embedded systems for space missions.

I specialize in computer vision, sensor fusion, ROS/ROS2, and edge AI deployment, with a track record of delivering real-world impact through innovative robotics and AI solutions.

Education

University of Illinois Urbana-Champaign

Aug 2025 - Present

Master's in Autonomy and Robotics

GPA: 4.0/4.0

Relevant Coursework: Computer Vision, Principles of Safe Autonomy, Humanoid Robotics, Mobile Robotics


Vellore Institute of Technology

Sept 2020 - May 2024

B.Tech in Electronics and Communication Engineering

GPA: 3.8/4.0

Relevant Coursework: Analog and Digital Electronics, Communication Systems, Control Systems, Algorithms, Machine Learning


Work Experience

Robotics Engineer

Vinayak Technology

Ahmedabad, GJ

July 2024 – August 2025
  • Architected and deployed a ROS2-based software stack for a 300kg payload Autonomous Mobile Robot (AMR) designed for unstructured construction environments
  • Implemented a real-time multi-sensor fusion pipeline integrating LiDAR, IMU, and RGB-D camera data, enabling robust Hector SLAM and GMapping performance for precise localization and mapping
  • Developed and optimized path planning algorithms for construction sites, improving material delivery efficiency by 25% through dynamic obstacle avoidance and intelligent route management

Embedded Software Engineer

Indian Space Research Organization (ISRO)

Ahmedabad, GJ

May 2023 – Jan 2024
  • Engineered the core firmware for a Synthetic Aperture Radar's data storage subsystem, enabling high-speed, reliable data acquisition from the Solid State Recorder, critical for earth observation payloads used in the NASA-ISRO SAR (NISAR) mission
  • Contributed to the Gaganyaan human spaceflight mission by developing embedded software for the cabin display system, a key safety-critical HMI, leveraging OpenGL-ES and an embedded PetaLinux stack on a Xilinx-based board

Research and Development Intern

Samsung Research

Bangalore, KA

Dec 2022 – June 2023
  • Architected a computer vision system for a Samsung smart-refrigerator IoT edge device, deploying a suite of models (YOLOv4, R-CNN, BiT) optimized with TensorRT to achieve 96% accuracy in real-time food identification with <100ms latency
  • Engineered the supporting data pipeline to sync on-device detections with a cloud database, enabling automated inventory management and delivering a feature projected to reduce household food waste by up to 25%

Featured Projects

C.A.R.E. — Companion Autonomous Robotic Entity

Overview

C.A.R.E. bridges augmented reality and robotics to create an assistive robotic companion for individuals with mobility challenges. The system integrates Snap AR Spectacles with the Booster K1 robot, enabling intuitive human-robot interaction through an AR control interface. Built at Cal Hacks 12.0, this project demonstrates how cutting-edge AR and robotics technologies can work together to assist elderly and mobility-limited individuals.

What It Does

  • AR Control Interface: Users view live robot camera feeds through AR glasses with joystick and HUD overlay for intuitive control
  • Dual Operation Modes: Supports both autonomous patrol mode and manual control via head movements
  • AI-Powered Detection: Computer vision identifies and tracks people in real-time using Google Gemini AI
  • Low-Latency Communication: ROS2, WebSockets, and ngrok tunneling maintain responsive control with dual channels (low-bandwidth control, high-bandwidth video)
  • Head Tracking Navigation: User head orientation maps directly to robot movement commands for natural, hands-free control

Target Applications

  • Elderly and mobility-limited assistance - helping individuals navigate and interact with their environment
  • Security and patrol operations - autonomous monitoring with human oversight
  • Search and rescue missions - remote exploration in hazardous environments
  • Construction and infrastructure inspection - safe inspection of dangerous or hard-to-reach areas

Technologies Used

ROS2 Snap AR Spectacles Booster K1 Robot Google Gemini AI OpenCV WebSockets ngrok Snap Lens Studio Python

Technical Challenges & Solutions

  • Latency Reduction: Implemented dual WebSocket channels - one for low-bandwidth control signals and another for high-bandwidth video streaming - to minimize control lag while maintaining video quality (see the sketch after this list)
  • Real-time Tracking: Integrated Gemini detection with ROS2 navigation stack for smooth human-following behavior with predictive movement
  • AR Interface Design: Created minimal, intuitive HUD elements that provide essential information without overwhelming the user's field of view
  • Sensor Calibration: Developed mapping system to accurately translate AR camera rotation data to robot motor angle commands for precise head-tracking control
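
The dual-channel split can be illustrated with a short sketch. This is a minimal illustration only, assuming a `websockets`-based bridge in Python; the endpoint URLs, message schema, and the head-pose and frame callbacks are hypothetical placeholders, not the project's actual code.

```python
# Minimal sketch: separate sockets for control and video so large video
# frames never queue behind (or delay) the small control packets.
import asyncio
import json

import websockets  # pip install websockets


async def control_channel(uri: str, get_head_pose) -> None:
    """Send small, frequent control packets (low bandwidth, low latency)."""
    async with websockets.connect(uri) as ws:
        while True:
            yaw, pitch = get_head_pose()  # AR glasses head orientation
            # Map head orientation directly to velocity commands.
            cmd = {"linear": max(min(pitch * 0.5, 1.0), -1.0),
                   "angular": max(min(-yaw * 0.8, 1.0), -1.0)}
            await ws.send(json.dumps(cmd))
            await asyncio.sleep(0.02)  # ~50 Hz control loop


async def video_channel(uri: str, show_frame) -> None:
    """Receive large video frames on their own socket."""
    async with websockets.connect(uri) as ws:
        async for frame_bytes in ws:
            show_frame(frame_bytes)  # render in the AR HUD


async def main() -> None:
    # Hypothetical endpoints; stub callbacks stand in for the AR glasses.
    await asyncio.gather(
        control_channel("ws://robot.local:9000/control", lambda: (0.0, 0.0)),
        video_channel("ws://robot.local:9001/video", lambda f: None),
    )

if __name__ == "__main__":
    asyncio.run(main())
```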

Team

Built at Cal Hacks 12.0 by:

  • Het Patel - Booster Robot Control, Snap AR Integration, Vision Language Models
  • Sunny Deshpande - SLAM and Navigation
  • Atharv Mungale - Snap AR Software and Communication
  • Vetrivel Balaji - Snap AR Software and Communication

Language-Guided Humanoid Loco-Manipulation via Vision-Language-Action Models

Humanoid Loco-Manipulation Demo

Overview

Developed an advanced framework for humanoid loco-manipulation using Vision-Language-Action (VLA) models within the OmniGibson simulation environment. The system integrates state-of-the-art VLA models with SLAM-based navigation to enable language-guided task execution in realistic household environments, with a focus on waste sorting and object manipulation tasks across 20+ diverse household scenes from the BEHAVIOR-1K benchmark.

Problem Statement

Traditional robotic manipulation systems require extensive task-specific programming and struggle to generalize across different environments and tasks. Humanoid robots need to seamlessly combine locomotion (navigation) and manipulation (grasping, sorting) while understanding natural language commands. Existing systems lack the ability to perform zero-shot task execution in novel household environments with language-based instructions.

Key Features

  • Vision-Language-Action Models: Integration of GR00T N1 and OpenVLA models for end-to-end language-to-action translation
  • Loco-Manipulation: Unified framework combining SLAM-based navigation with precise object manipulation
  • Zero-Shot Task Execution: Ability to execute novel tasks without task-specific training
  • Semantic Scene Understanding: 85%+ accuracy in language-guided semantic reasoning
  • Realistic Simulation: 20+ photorealistic household scenes from BEHAVIOR-1K benchmark
  • Multi-Modal Perception: Integration of RGB-D cameras, proprioceptive sensors, and language inputs
  • Dynamic Object Interaction: Robust manipulation of various household objects with different properties

Technologies Used

GR00T N1 OpenVLA OmniGibson SLAM BEHAVIOR-1K ROS2 PyTorch Python

Technical Architecture

Vision-Language-Action Pipeline

  • Natural language command parsing and semantic understanding
  • RGB-D image processing for scene understanding and object detection
  • VLA model generates low-level robot actions from high-level commands
  • Real-time action execution with feedback control (closed loop sketched below)
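
A minimal sketch of this observe-predict-act loop, assuming a generic policy wrapper: `VLAPolicy`, its `predict_action` interface, and the `env` object are illustrative stand-ins for the OpenVLA/GR00T N1 integration, not the actual implementation.

```python
# Hedged sketch of the language-to-action loop in OmniGibson-style setups.
import numpy as np


class VLAPolicy:
    """Placeholder for a pretrained vision-language-action model."""

    def predict_action(self, rgb: np.ndarray, depth: np.ndarray,
                       proprio: np.ndarray, instruction: str) -> np.ndarray:
        # A real model returns a low-level action (e.g., end-effector
        # deltas plus a gripper command) conditioned on images + language.
        raise NotImplementedError


def run_task(env, policy: VLAPolicy, instruction: str, max_steps: int = 500):
    """Closed-loop execution: observe, query the VLA model, act, repeat."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.predict_action(obs["rgb"], obs["depth"],
                                       obs["proprio"], instruction)
        obs, done = env.step(action)  # feedback control closes the loop
        if done:
            break
```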

SLAM-Based Navigation Stack

  • Continuous robot pose estimation in household environments
  • Dynamic map building of 3D environment structure
  • Collision-free trajectory generation for locomotion
  • Seamless coordination between navigation and manipulation tasks

Manipulation Framework

  • Intelligent grasp pose generation for diverse object geometries
  • Precise end-effector control for object manipulation
  • Compliant manipulation with force/torque sensing
  • Multi-step manipulation sequences (pick, place, sort)

Results & Performance

  • Semantic Reasoning Accuracy: 85%+ in language-guided task understanding
  • Navigation Success Rate: High success in SLAM-based navigation across 20+ scenes
  • Manipulation Precision: Robust grasping and sorting of various household objects
  • Zero-Shot Generalization: Effective task execution without scene-specific training
  • Scene Diversity: Tested across kitchen, living room, bedroom, and bathroom environments
  • Object Variety: Successfully manipulated 50+ different object types

Challenges & Solutions

  • Challenge: Language-to-action translation - Mapping high-level natural language to low-level robot actions
    Solution: Leveraged pre-trained VLA models (GR00T N1, OpenVLA) with transfer learning for household domain
  • Challenge: Loco-manipulation coordination - Simultaneous control of locomotion and manipulation systems
    Solution: Developed hierarchical control architecture with SLAM for navigation and VLA for manipulation, coordinated through ROS2
  • Challenge: Sim-to-real gap - Simulation behavior differs from real-world physics
    Solution: Used high-fidelity OmniGibson simulation with realistic object physics and rendering

Team Members

Het Patel • Vardhan Dongre • Sunny Deshpande

Automated Solar Panel Cleaning Robot: Design, Implementation and Software Control System

Team

Author: Het Patel (20BEC1165)

Advisor: Dr. Sheena Christabel Pravin, Assistant Professor Senior Grade

Institution: School of Electronics Engineering, Vellore Institute of Technology, Chennai

Date: April 2024

Overview

Engineered and deployed an autonomous robotic system for grid-aware solar panel maintenance, leveraging an ESP32 microcontroller for edge computation and the Qt6 framework for cross-platform control software. The project addresses the critical challenge of dust accumulation on solar panels, which causes average energy losses of 4.4% annually and up to 20% during prolonged dry periods.

The robotic system features a precision-engineered stainless steel chassis with caterpillar track drive system, specialized roller brush cleaning mechanism, intelligent water delivery system, and automated wiper assembly, all controlled through an intuitive cross-platform dashboard providing real-time telemetry, performance monitoring, and autonomous path planning capabilities.

Problem Statement

Solar photovoltaic panels face significant efficiency degradation due to dust accumulation, bird droppings, and environmental debris. Studies show performance reductions of up to 32% within eight months in similar climatic regions. Traditional cleaning methods are inadequate:

  • Manual Cleaning: High labor costs, inconsistent schedules, safety risks at height
  • High-Pressure Water Spray: Requires 16-meter water head, excessive water consumption
  • Mechanical Methods: Can damage panel surface, reduce lifespan
  • Electrostatic Methods: Cannot remove particles >0.2mm, affected by panel tilt angle

With solar panel systems having 25-year lifespans and 6-year energy payback periods, maintaining peak efficiency is economically critical.

Key Features

Hardware Design

  • Robust Chassis: Stainless steel 304 Grade (1mm sheet) construction with CNC laser cutting, CNC bending, and TIG welding
  • Caterpillar Track Drive: Four 12V geared DC motors (50 RPM, 346.8 N-cm torque) with 40mm width track belts
  • Active Cleaning Mechanism: Roller brush assembly with 12V DC motor (100 RPM, 103 N-cm torque) in 3D-printed housing
  • Water Delivery System: Centrifugal pump (8W, 10 L/min) with four-nozzle distribution achieving 166.68 cm/s velocity
  • Automated Wiper: Rack-and-pinion mechanism with servo motor for surface drying
  • Braking System: Linear slider-crank mechanism with servo control for stable positioning on 10-15° inclines
  • Environmental Sensors: BME680 sensor for temperature, humidity, pressure, and air quality monitoring

Software & Control

  • Qt6 Cross-Platform Dashboard: Runs on Windows, macOS, and Linux with rich GUI components
  • UDP Communication: Sub-100ms latency with 99% packet delivery rate
  • Monitor Screen: Real-time energy generation charts, weather data visualization, battery tracking
  • Performance Analytics: Individual panel visualization with color-coded indicators and maintenance logs
  • Interactive Control: Direct robot control for motors, brake, pump, wipers, and brush
  • Autonomous Path Planning: Pre-defined waypoint navigation with serpentine pattern optimization
  • Firebase Integration: REST API for live sensor data streaming and cloud monitoring

Performance Achievements

  • 45% Water Consumption Reduction: Optimized nozzle flow rates versus manual methods
  • 15% Energy Yield Improvement: Through regular automated cleaning maintenance
  • 10-13 Minute Battery Runtime: 4200mAh LiPo battery (11.1V, 3S configuration)
  • 99% Communication Reliability: UDP protocol with acknowledgment achieving <100ms latency
  • Incline Navigation: Tested and verified operation on 10-15° solar panel inclinations

Technologies Used

Hardware

ESP32 Johnson DC Motors BTS7960 Driver L293D Driver BME680 Sensor LiPo Battery Stainless Steel 304 3D Printing (FDM)

Software

Qt6 Framework C++ Arduino/C UDP Protocol Firebase REST API WiFi

Manufacturing

CNC Laser Cutting CNC Bending TIG Welding CAD Design

Results & Performance

Quantitative Achievements

  • Water Efficiency: 45% reduction compared to manual/traditional automated methods
  • Energy Yield: 15% annual increase through consistent cleaning maintenance
  • Communication Latency: <100ms UDP round-trip time with 99% packet delivery
  • Battery Runtime: 10-13 minutes continuous operation on 4200mAh LiPo
  • Flow Rate: 10 L/min distributed across 4 nozzles (2.5 L/min per nozzle)
  • Incline Capability: Successfully tested on 10-15° solar panel inclinations
  • Brush Speed: 100 RPM with 103 N-cm torque for effective dirt removal
  • Drive Speed: 50 RPM with 346.8 N-cm torque per motor pair

Real-World Impact

  • Energy loss prevention: Mitigates the 4.4% average annual degradation caused by dust accumulation
  • Peak period protection: Prevents >20% efficiency drops during extended dry periods
  • Lifespan optimization: Maintains performance throughout 25-year panel lifetime
  • ROI improvement: Accelerates 6-year energy payback period through enhanced output

Technical Architecture

Hardware System

Chassis & Structure: Stainless steel 304 Grade (1mm sheet) with CNC laser cutting → CNC bending → TIG welding process chain for outdoor durability and corrosion resistance.

Drive Assembly: Four 12V DC motors in paired sets with 40mm caterpillar track belts and idler pulley tensioning. BTS7960 motor drivers (43A max current) provide power.

Cleaning System: Roller brush (12V, 100 RPM, 103 N-cm) with centrifugal pump (10 L/min) distributing water through 4 nozzles at 41.67 cm³/s each.

Auxiliary Mechanisms: Rack-and-pinion wiper system and slider-crank braking mechanism, both servo-actuated.

Software Architecture

Qt6 Dashboard: Multi-screen interface including Monitor (energy charts, weather data), Performance (panel analytics), Control (robot operation), and Automate (path planning) screens.

ESP32 Firmware: Main control loop receives UDP commands, executes motor/servo control, and sends status updates. Processes commands for MOVE_FORWARD, MOVE_BACKWARD, TURN_LEFT, TURN_RIGHT, BRUSH_ON, PUMP_ON, WIPER_ACTIVATE, and BRAKE_APPLY.

Path Planning: Serpentine pattern optimization with alternating left-to-right, right-to-left row traversal for minimum distance cleaning.
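
A minimal sketch of the serpentine pattern described above; the panel-array dimensions and lane width here are illustrative, not the deployed parameters.

```python
# Serpentine (boustrophedon) waypoint generation over a rectangular array.
def serpentine_waypoints(width_m: float, height_m: float, lane_m: float):
    """Alternate left-to-right / right-to-left rows so the robot covers
    the array with minimum back-tracking."""
    waypoints, y, left_to_right = [], 0.0, True
    while y <= height_m:
        row = [(0.0, y), (width_m, y)]
        waypoints.extend(row if left_to_right else row[::-1])
        left_to_right = not left_to_right
        y += lane_m  # shift one brush width per pass
    return waypoints


# Example: a 4 m x 2 m array cleaned in 0.5 m lanes.
print(serpentine_waypoints(4.0, 2.0, 0.5))
```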

Challenges and Solutions

Challenge 1: Traction on Inclined Panels

Problem: Standard wheels slip on smooth, tilted solar panel surfaces (10-15° inclination).

Solution: Implemented caterpillar track belt system with soft rubber compound material and idler pulleys for tension maintenance.

Result: Achieved reliable operation on 10-15° inclinations with four 12V motors providing sufficient climbing force.

Challenge 2: Uniform Water Distribution

Problem: Single-point water delivery creates uneven coverage and wastes water.

Solution: Designed four-nozzle distribution system with calculated flow rates and 2x velocity amplification.

Result: Uniform coverage enabling 45% water consumption reduction while maintaining cleaning efficacy.

Challenge 3: Real-Time Communication

Problem: Wireless control can experience packet loss and delays, compromising safety.

Solution: Implemented UDP protocol with acknowledgment system (500ms timeout) and 3-attempt retry logic.

Result: Achieved <100ms latency with 99% packet delivery rate for safe real-time control.
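
A dashboard-side sketch of this acknowledged-UDP scheme (500ms timeout, up to 3 attempts), written in Python for brevity rather than the project's Qt6/C++; the robot address and the ACK payload are assumptions, while the command strings follow the firmware vocabulary above.

```python
# Acknowledged UDP: send a command, wait for an ACK, retry on packet loss.
import socket

ROBOT_ADDR = ("192.168.1.50", 4210)  # hypothetical ESP32 address


def send_command(cmd: str, retries: int = 3, timeout_s: float = 0.5) -> bool:
    """Return True once the robot acknowledges the command."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)  # 500 ms acknowledgment window
        for _ in range(retries):
            sock.sendto(cmd.encode(), ROBOT_ADDR)
            try:
                reply, _ = sock.recvfrom(64)
                if reply == b"ACK":
                    return True
            except socket.timeout:
                continue  # lost packet: retry
    return False


if __name__ == "__main__":
    if not send_command("MOVE_FORWARD"):
        send_command("BRAKE_APPLY")  # fail safe if the link drops
```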

Challenge 4: Cross-Platform Deployment

Problem: Solar installations use diverse operating systems requiring universal compatibility.

Solution: Selected Qt6 framework for native cross-platform C++ development.

Result: Successfully deployed on Windows and Linux with identical features and reliable UDP communication.

Future Work

  • Enhanced Sensors: Wind speed and particulate matter sensors for environmental monitoring
  • AI Integration: Machine learning for predictive cleaning schedule optimization
  • Computer Vision: Camera-based dirt detection for targeted cleaning verification
  • Solar-Powered Operation: Self-charging capability for extended autonomous operation
  • Multi-Robot Coordination: Fleet management for large-scale solar farm deployments
  • Edge Detection: Ultrasonic/IR sensors for panel boundary detection and fall prevention
  • Weather Integration: Automatic scheduling based on weather API forecasts
  • Cloud Platform: Web-based monitoring dashboard for remote access

Medication and Multipurpose Drone for Wildlife Conservation

Overview

Engineered a hexacopter drone platform designed specifically for Kaziranga National Park to track and protect endangered one-horned rhinos. The drone combines autonomous flight capabilities, computer vision for wildlife detection, and bio-inspired mechanisms for extended surveillance operations.

Problem Statement

Kaziranga National Park houses 66.7% (2,413 out of ~3,600) of the world's one-horned rhinos. Despite conservation efforts, rhinos face critical threats:

  • Poaching for horns despite government anti-poaching measures
  • Seasonal floods trapping rhinos without food for extended periods
  • Injuries requiring medical attention in remote, inaccessible areas
  • Limited ground-based surveillance capabilities across vast park areas

Key Features

Autonomous Flight System

  • Flight Endurance: 40.54 minutes continuous operation
  • Coverage Area: 500 hectares per mission
  • Flight Controller: Pixhawk 2.4.8 for precise autonomous control
  • Navigation: GPS-based waypoint navigation with SLAM capabilities

Bio-Inspired Design Features

  • Falcon-Like Claw Mechanism: Enables perching on tree branches to conserve battery during stationary surveillance
  • Bat-Like Sonar: 6x ultrasonic sensors (HC-SR04) for 360° obstacle avoidance in dense forest environments
  • Power Management: Perching extends surveillance time by 3x through intelligent power conservation

Computer Vision & AI

  • Detection Model: YOLOv3 trained for rhino detection and tracking
  • Injury Assessment: Real-time computer vision to identify wounded rhinos
  • Surveillance: Continuous monitoring with automated alert system

Multiple Operation Modes

  • Medical Surveillance: Track injured rhinos and guide rescue teams to their location
  • Anti-Poaching Security: Continuous surveillance to deter poaching activities
  • Safari Assistance: Live video feed for tourists and location services for tour guides
  • Population Tracking: Automated rhino counting and movement pattern analysis

Mobile Application

  • Live video streaming from drone camera
  • Real-time rhino location tracking on map
  • User location services for safari groups
  • Remote drone control and mission planning

Technical Specifications

Hardware Components

  • Frame: Custom hexacopter design (8.4 kg total weight)
  • Motors: 6x TITAN T5010 300KV BLDC motors
  • ESC: 6x 30A Electronic Speed Controllers
  • Propellers: 18-inch with 6.5 pitch
  • Battery: TATTU 30,000mAh 6S 25C LiPo
  • Flight Controller: Pixhawk with buzzer and arming switch
  • Onboard Computer: Raspberry Pi 4B+ (4GB RAM)
  • Camera: FPV camera with real-time video transmission
  • Sensors: GPS (Ublox Neo M8N), 6x Ultrasonic sensors, LiDAR
  • Communication: VTX (10km range), 5.8GHz FPV antenna
  • Gripper: Custom servo-controlled claw mechanism

Performance Metrics

  • Thrust-to-Weight Ratio: 1.2:1
  • Total Thrust: 10,088.4 grams
  • Current Draw: 44.4A at cruising speed
  • Flight Time: 40.54 minutes calculated endurance
  • Project Cost: ₹116,444.22 (~$1,400 USD)

Software Architecture

Mission Planning

  • ArduPilot for autonomous waypoint navigation
  • Python-based mission planner for coverage optimization
  • Integration with Kaziranga population density heat maps
  • Automated path planning for maximum coverage efficiency

Computer Vision Pipeline

  1. Real-time video capture from FPV camera
  2. YOLOv3 object detection for rhino identification (sketched after this list)
  3. Horn detection for injury assessment
  4. Automated alert generation for park officials
  5. GPS coordinate logging for rescue operations
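
A hedged sketch of the detection step (item 2 above) using OpenCV's DNN module; the weight/config filenames and the class index for "rhino" are assumptions tied to the custom-trained model, not the deployed code.

```python
# YOLOv3 inference over the FPV stream via OpenCV's DNN module.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3-rhino.cfg", "yolov3-rhino.weights")
layers = net.getUnconnectedOutLayersNames()

cap = cv2.VideoCapture(0)  # FPV camera stream
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True)
    net.setInput(blob)
    for out in net.forward(layers):
        for det in out:
            scores = det[5:]
            cls, conf = int(np.argmax(scores)), float(np.max(scores))
            if cls == 0 and conf > 0.5:  # class 0 = rhino (assumed)
                # det[0:4] holds normalized centre x/y, width, height;
                # log GPS and raise the park-official alert here.
                print(f"rhino detected, confidence {conf:.2f}")
cap.release()
```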

ROS2 Integration

  • Sensor data fusion (LiDAR + GPS + IMU)
  • SLAM for real-time localization and mapping
  • Nav2 for path planning and obstacle avoidance
  • Multi-sensor coordination for autonomous flight

Technical Challenges & Solutions

  • Limited Flight Endurance: Implemented bio-inspired falcon claw perching mechanism allowing the drone to land on tree branches, conserving battery while maintaining surveillance. Extended effective operation time by 3x.
  • Dense Forest Navigation: Deployed bat-inspired ultrasonic sensor array (6 sensors) for 360-degree obstacle detection, enabling safe autonomous flight through dense vegetation.
  • Wildlife Detection Accuracy: Custom-trained YOLOv3 model on rhino dataset with specific focus on horn detection for injury assessment, achieving high detection rates in varied lighting conditions.
  • Remote Medical Assistance: Integrated GPS tracking with real-time video feed, enabling park officials to locate injured rhinos and dispatch medical teams with precise coordinates.

Impact & Results

  • Anti-Poaching: Continuous drone surveillance acts as deterrent for poaching activities
  • Faster Response: Real-time injured rhino detection enables immediate medical intervention
  • Conservation Data: Automated population tracking and movement pattern analysis
  • Tourism Enhancement: Live wildlife feed improves safari experience without disturbing animals
  • Cost-Effective: ₹116,444 system provides capabilities of much more expensive commercial drones

Future Enhancements

  • Extended Coverage: Upgrade to higher capacity batteries for longer flight times and larger area coverage
  • Advanced LiDAR: 3D terrain mapping and enhanced obstacle avoidance
  • Multi-Drone Coordination: Swarm-based surveillance for complete park coverage
  • Thermal Imaging: Night vision capabilities for 24/7 monitoring
  • Automated Medication Delivery: Payload system for remote medicine administration
  • AI-Based Analysis: Automated rhino counting and health assessment
  • Weather Resistance: Waterproofing for monsoon season operations
  • Expansion: Adapt system for other wildlife sanctuaries and endangered species

Technologies Used

ROS2 Python YOLOv3 OpenCV ArduPilot Pixhawk Raspberry Pi SLAM LiDAR GPS Navigation Ultrasonic Sensors Computer Vision Bio-inspired Robotics Embedded Systems Mobile App Development

Academic Context

Developed as a Control Systems course project (November 2022) demonstrating practical application of:

  • PID control for stable hexacopter flight (sketched below)
  • Sensor fusion and state estimation
  • Autonomous navigation and path planning
  • Real-world robotics system integration for wildlife conservation
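
A textbook PID sketch of the kind used for attitude stabilization; the gains and the 100 Hz loop rate are illustrative, not the flight controller's actual tuning.

```python
# Classic PID: proportional + integral + derivative terms on the error.
class PID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def update(self, setpoint: float, measured: float) -> float:
        err = setpoint - measured
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv


roll_pid = PID(kp=4.5, ki=0.05, kd=0.8, dt=0.01)  # 100 Hz attitude loop
motor_correction = roll_pid.update(setpoint=0.0, measured=0.03)
```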

Adaptive Vehicle Control Based on Pedestrian Behavior

Overview

Developed a predictive autonomous vehicle control framework that dynamically adjusts vehicle speed and behavior in real-time based on pedestrian behavioral cues, moving beyond traditional reactive obstacle avoidance systems. The system addresses the fundamental gap between reactive and predictive autonomous navigation by anticipating pedestrian intent rather than simply reacting to proximity.

Technologies Used

ROS2 GEM e2 Vehicle LiDAR (Ouster) RGB-D Camera (OAK-D) YOLOv11 DBSCAN Stanley Controller PID Control GNSS Python Sensor Fusion

Problem Statement

Traditional AV navigation systems treat pedestrians outside the roadway as static obstacles while cruising, relying on simple reactive braking once they begin to cross. This reactive approach cannot handle complex pedestrian interactions or anticipate human intent. Our project developed a control framework that dynamically adjusts vehicle speed and control in real-time based on pedestrian behavior cues, rather than proximity alone.

Key Features

  • Multi-Sensor Perception: LiDAR and RGB-D camera fusion for robust pedestrian detection
  • Pedestrian Behavior Prediction: Trajectory prediction, motion forecasting, and Time-to-Collision (TTC) calculation
  • Intelligent State Machine: Multi-phase decision system with CRUISE, STOP_YIELD, SLOW_CAUTION, and CREEP_PASS states
  • Real-time Adaptation: Dynamic speed and path adjustment based on predicted pedestrian behavior
  • Safety Controller: Emergency braking and velocity control with PID feedback
  • Stanley Controller: Precise lateral control for path following
  • Sensor Fusion: Weighted fusion of LiDAR (0.8 distance, 0.3 direction) and Camera (0.2 distance, 0.7 direction) data

Technical Architecture

Perception Stack

  • LiDAR Processing: Voxelization, ground filtering, outlier removal, DBSCAN clustering, tracking with EMA smoothing, geometric and motion-based human detection
  • RGB-D Processing: YOLOv11 object detection, depth extraction, pedestrian pose transformation to ego frame
  • Sensor Fusion: Time synchronization, data association with Euclidean distance matching (2.0m threshold), weighted averaging (see the sketch after this list)
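
A minimal sketch of the weighted fusion step with the stated weights and 2.0m association gate; the detection format and function names are illustrative.

```python
# LiDAR dominates range (0.8) and the camera dominates bearing (0.7),
# applied after a 2.0 m Euclidean association gate.
import math

GATE_M = 2.0
W_LIDAR_DIST, W_CAM_DIST = 0.8, 0.2
W_LIDAR_DIR, W_CAM_DIR = 0.3, 0.7


def fuse(lidar_det, cam_det):
    """Each detection is (distance_m, bearing_rad) in the ego frame."""
    dl, bl = lidar_det
    dc, bc = cam_det
    # Association gate: reject pairs whose positions disagree by > 2.0 m.
    dx = dl * math.cos(bl) - dc * math.cos(bc)
    dy = dl * math.sin(bl) - dc * math.sin(bc)
    if math.hypot(dx, dy) > GATE_M:
        return None
    dist = W_LIDAR_DIST * dl + W_CAM_DIST * dc
    bearing = W_LIDAR_DIR * bl + W_CAM_DIR * bc  # naive for small angles
    return dist, bearing


print(fuse((6.1, 0.10), (6.6, 0.12)))  # -> fused (distance, bearing)
```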

Prediction Module

  • Pedestrian trajectory buffering and smoothing
  • Motion prediction using historical trajectory data
  • Ego vehicle trajectory prediction
  • Time-to-Collision (TTC) calculation (sketched below)
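
A constant-velocity TTC sketch consistent with the module above; the actual prediction works on buffered, smoothed trajectories, so this is a simplified illustration.

```python
# TTC from relative position and velocity under a constant-velocity model.
import numpy as np


def time_to_collision(p_ped, v_ped, p_ego, v_ego):
    """Positions in m, velocities in m/s (2D ego-frame vectors).
    Returns TTC in seconds, or inf if the gap is not closing."""
    rel_p = np.asarray(p_ped, float) - np.asarray(p_ego, float)
    rel_v = np.asarray(v_ped, float) - np.asarray(v_ego, float)
    closing_speed = -np.dot(rel_p, rel_v) / (np.linalg.norm(rel_p) + 1e-6)
    if closing_speed <= 0.0:
        return float("inf")  # diverging: no predicted collision
    return float(np.linalg.norm(rel_p) / closing_speed)


# Pedestrian 10 m ahead, stepping toward a vehicle moving at 5 m/s: ~2 s.
print(time_to_collision((10, 2), (0, -1), (0, 0), (5, 0)))
```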

Planning & Control

  • High-Level Decision: Three-phase state machine (Critical Checks, Context, Recovery)
  • Safety Controller: Speed mapping (CRUISE → 5 m/s, SLOW → 2.5 m/s) and emergency braking
  • Stanley Controller: Minimize heading and cross-track error for lateral control (control law sketched below)
  • Velocity PID: Smooth acceleration/deceleration for longitudinal control
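
A compact sketch of the lateral/longitudinal split: the Stanley law combines heading error with cross-track error, while the state machine selects the target speed. The gain and the CREEP_PASS speed are assumptions; the CRUISE and SLOW speeds follow the mapping above.

```python
# Stanley steering plus the state-to-speed mapping.
import math

SPEED_MAP = {"CRUISE": 5.0, "SLOW_CAUTION": 2.5,
             "CREEP_PASS": 1.0, "STOP_YIELD": 0.0}  # CREEP speed assumed


def stanley_steering(heading_err: float, cross_track_err: float,
                     speed: float, k: float = 0.5) -> float:
    """delta = heading error + atan(k * e_ct / v)."""
    return heading_err + math.atan2(k * cross_track_err, max(speed, 0.1))


def target_speed(state: str) -> float:
    return SPEED_MAP.get(state, 0.0)  # unknown state: default to a stop


print(stanley_steering(0.05, 0.3, target_speed("SLOW_CAUTION")))
```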

Results & Performance

Experiment Type                  Experiments    Success Rate
Cruise Mode                      5              100% (5/5)
No Pedestrian w/ Sign            10             100% (10/10)
Crossing Pedestrian w/ Sign      10             90% (9/10)
Stationary Pedestrian            5              100% (5/5)
Crossing Pedestrian              10             90% (9/10)
Pedestrian Walking Along Road    10             80% (8/10)
Vehicle Stanley Control          8              87.5% (7/8)

Overall System Success: ~91% across all scenarios

Challenges & Solutions

  • Challenge: Human movement is inherently uncertain and unpredictable
    Solution: Implemented probabilistic trajectory prediction with motion smoothing and TTC-based early warning
  • Challenge: Sensor fusion with different modalities (LiDAR vs Camera)
    Solution: Developed weighted fusion approach leveraging LiDAR's distance accuracy and Camera's directional precision
  • Challenge: Real-time decision making with safety constraints
    Solution: Designed hierarchical state machine with critical safety checks, context-aware behavior, and recovery mechanisms

Team Members

Het Patel (hcp4) • Sunny Deshpande (sunnynd2) • Ansh Bhansali (anshb3) • Keisuke Ogawa (ogawa3)

Open-World Semantic-Based Zero-Shot 6D Pose Estimation Using SAM-3 and FoundationPose

Open-World Zero-Shot 6D Pose Estimation Demo

Overview

Developed a novel open-vocabulary 6D object pose tracking framework that extends NVIDIA's FoundationPose architecture to enable language-guided, zero-shot tracking of arbitrary objects without pre-registered CAD models. By integrating the Moondream2 vision-language model, SAM-3 segmentation, and on-the-fly 3D mesh generation from Objaverse-XL, the system achieves real-time, occlusion-robust pose estimation with dynamic target switching via natural language prompts. This enables robotic manipulators to seamlessly transition between tracking different objects (e.g., "grasp the red bottle" to "now grasp the blue cup") in unstructured environments without reinitialization.

Technologies Used

FoundationPose (CVPR 2024) SAM-3 Moondream2 VLM Objaverse-XL TripoSR YOLOv8 PyTorch 2.0/2.7 CUDA 11.8/12.6 NVDiffRast Gemini API Python

Problem Statement

Traditional 6D pose estimation methods face critical limitations that restrict their deployment in real-world robotic manipulation scenarios. NVIDIA's FoundationPose, while achieving zero-shot inference for unseen objects, requires pre-provided CAD models and manual mask annotation in the initial frame. It often fails in heavily occluded scenes (especially on the LineMOD dataset), where errors propagate through the mesh-matching and refinement stages. Furthermore, no existing system supports real-time, language-guided, multi-object pose estimation with dynamic target switching, a crucial capability for responsive robotic manipulation in novel environments.

Key Features

  • Open-Vocabulary Detection: Lightweight Moondream2 VLM for edge-compatible semantic scene understanding
  • Zero-Shot Mesh Generation: On-the-fly 3D proxy generation via Objaverse-XL retrieval (10M+ assets) and TripoSR
  • Language-Driven Segmentation: SAM-3 integration for text-prompt-based, occlusion-robust target segmentation
  • Dynamic Target Switching: Seamless mid-task object switching via natural language without reinitialization
  • Hierarchical Mesh Acquisition: Three-tier system (benchmark CAD → Objaverse retrieval → TripoSR generation)
  • Real-Time Tracking: Render-and-compare architecture with transformer-based pose refinement
  • Mesh Dictionary Caching: Asynchronous mesh fetching and caching to eliminate redundant queries
  • Multi-Object Support: Concurrent tracking of multiple objects with individual semantic labels

Technical Architecture

Stage 1: Semantic Scene Analysis

  • Moondream2 VLM generates comprehensive object inventory from RGB stream
  • Produces discrete candidate list (e.g., "red bottle," "blue cup," "black keyboard")
  • Gemini API fallback for semantic query enhancement when detection fails
  • Enables prompt-based object specification even for BOP unseen objects

Stage 2: On-the-Fly 3D Mesh Generation

  • Primary: Load ground-truth CAD models when available (benchmark datasets)
  • Retrieval: Query Objaverse-XL database via language-guided similarity search
  • Generation: TripoSR generates candidate mesh from single observed image
  • Selection: Mesh manager scores candidates based on silhouette, depth, and IoU alignment
  • Asynchronous fetching and caching eliminates redundant queries during multi-object tracking

Stage 3: Language-Driven Segmentation

  • SAM-3 accepts natural language prompts (e.g., "the red apple") to output pixel-level masks
  • Outperforms traditional R-CNN detectors in heavy clutter and occlusion
  • Temporal consistency for video tracking with frame-to-frame coherence
  • Simultaneous multi-object segmentation via distinct text prompts

Stage 4: Unified 6D Pose Estimation & Tracking

  • Render-and-compare: FoundationPose aligns retrieved mesh with video observation
  • Pose scoring: Uniform sampling with composite scoring (IoU + Depth + Silhouette; sketched below)
  • Iterative refinement: Transformer-based pose refinement for sub-frame accuracy
  • Dynamic switching: Instant mask + mesh updates enable seamless target transitions
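
A hedged sketch of the composite hypothesis score; the weights and the depth-error scale are illustrative assumptions, and the renderer that produces `rendered_mask`/`rendered_depth` for each pose hypothesis sits outside this sketch.

```python
# Score one pose hypothesis by comparing its rendering to the frame.
import numpy as np


def composite_score(rendered_mask, observed_mask,
                    rendered_depth, observed_depth,
                    w_iou=0.4, w_depth=0.3, w_sil=0.3):
    inter = np.logical_and(rendered_mask, observed_mask).sum()
    union = np.logical_or(rendered_mask, observed_mask).sum()
    iou = inter / max(union, 1)  # mask overlap term

    overlap = np.logical_and(rendered_mask, observed_mask)
    if overlap.any():
        err = np.abs(rendered_depth[overlap] - observed_depth[overlap])
        depth_term = float(np.exp(-err.mean() / 0.02))  # 2 cm scale, assumed
    else:
        depth_term = 0.0

    # Silhouette agreement: fraction of the rendered outline observed.
    sil = inter / max(rendered_mask.sum(), 1)
    return w_iou * iou + w_depth * depth_term + w_sil * sil
```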

Results & Performance

Overall Performance (44 Evaluations, 12 Scenes, 18 Objects, 787 Frames)

Metric               Mean ± Std
ADD AUC              71.61% ± 39.10%
ADD-S AUC            88.31% ± 28.56%
Rotation Error       44.23° ± 58.38°
Translation Error    2.44 cm ± 6.13 cm
Processing Time      13.08 s ± 0.15 s

Best Performing Objects (Top 5 by ADD-S AUC = 100%)

Object                        ADD AUC    Rotation (°)     Translation (cm)
Power Drill (4 scenes)        100.0%     2.0° ± 0.4°      0.24
Bleach Cleanser (3 scenes)    100.0%     2.4° ± 0.8°      0.34
Banana (2 scenes)             100.0%     5.3° ± 0.4°      0.40
Mustard Bottle (2 scenes)     90.0%      19.7° ± 25.6°    0.22
Pudding Box (1 scene)         100.0%     2.7°             0.24

Symmetric Object Performance (Demonstrating ADD-S Effectiveness)

Object                        ADD AUC    ADD-S AUC    Translation (cm)
Master Chef Can (4 scenes)    83.0%      99.0%        0.55
Bowl (2 scenes)               8.0%       100.0%       0.38
Mug (2 scenes)                98.0%      100.0%       0.48
Tuna Fish Can (4 scenes)      98.0%      98.0%        0.57

Key Insight: ADD-S metric successfully handles symmetric objects, converting Bowl's 8% ADD → 100% ADD-S, demonstrating robust pose recovery despite rotational ambiguity

Runtime Performance Breakdown

  • Average Processing Time: 13.08 seconds per frame
  • SAM-3 Segmentation: ~12.5 seconds (95.6% of total time)
  • FoundationPose Tracking: ~0.58 seconds (near real-time after initialization)
  • First Frame Registration: 13-16 seconds (includes mask generation + initial pose alignment)

Comparison with State-of-the-Art

Method            Zero-Shot Objects    Text Prompt    ADD-S AUC (%)
PoseCNN           No                   No             75.4
DenseFusion       No                   No             82.3
FoundationPose    Yes                  No             89.2
Ours (SAM3+FP)    Yes                  Yes            88.31 ± 28.56

Unique Contribution: Only method combining zero-shot capability with text-prompted segmentation for dynamic, interactive pose tracking. Achieves competitive 88.31% ADD-S AUC across 18 diverse objects without CAD model pre-registration.

Challenges & Solutions

  • Challenge: Direct 3D reconstruction (YOLOv8 + SAM + TripoSR) produced low-quality meshes with artifacts
    Solution: Shifted to retrieval-based strategy leveraging Objaverse-XL's 10M+ professionally designed meshes for clean, artifact-free geometry
  • Challenge: Symmetric objects (cylindrical items) exhibited rotational ambiguity
    Solution: Adopted ADD-S metric accounting for pose equivalence, achieving 76.5% ADD-S AUC despite 111° rotation error
  • Challenge: SAM-3 mask generation dominated processing time (~12.5 sec/frame)
    Solution: Modular architecture with FoundationPose achieving near real-time after initialization; future work targets GPU-accelerated SAM-3
  • Challenge: Retrieved meshes may differ in exact proportions from real instances
    Solution: Depth-based scale estimator adjusts mesh dimensions; composite scoring (IoU + Depth + Silhouette) selects best candidate

Team Members

Het Patel (hcp4) • Sunny Deshpande (sunnynd2) • Ansh Bhansali (anshb3) • Keisuke Ogawa (ogawa3)

Course: CS543 Computer Vision, University of Illinois Urbana-Champaign

Date: November 2024

Mobile Manipulator for Mars Missions

Overview

Designed and simulated a mobile manipulator robot for Mars exploration missions, focusing on sample collection and terrain navigation in challenging environments.

Technologies Used

ROS2 Simulation Mobile Manipulation Gazebo

Constructa-1 Construction Robot

Overview

Architected and deployed a ROS2-based software stack for a 300kg payload Autonomous Mobile Robot (AMR) designed for unstructured construction environments, built on LiDAR and multi-sensor fusion.

Technologies Used

ROS2 SLAM AMR LiDAR Sensor Fusion

Solid State Recorder for SAR Missions

Overview

Engineered the core firmware for a Synthetic Aperture Radar data storage subsystem at ISRO, enabling high-speed, reliable data acquisition critical for earth observation payloads.

Technologies Used

Embedded Systems Firmware C/C++ SAR ISRO

Gaganyaan Cabin Display System

Overview

Contributed to ISRO's human spaceflight mission by developing embedded software for the cabin display system, a key safety-critical HMI, using OpenGL-ES and PetaLinux.

Technologies Used

OpenGL-ES Embedded Linux HMI Petalinux ISRO

Detecting Food Item and Quantity

Overview

Architected a computer vision system for a Samsung IoT edge device, deploying YOLOv4, R-CNN, and BiT models optimized with TensorRT to achieve 96% accuracy in real-time food identification with <100ms latency.

Technologies Used

YOLOv4 TensorRT IoT Computer Vision Samsung

Regular Class Aircraft

Regular Class Aircraft Demo

Team

Team: Aviators International Team of VIT

Competition: SAE INDIA Aero Design Competition

Period: 2022-2023

Overview

As part of the Aviators International Team of VIT, I contributed to the design and development of a Regular Class Aircraft for the prestigious SAE INDIA Aero Design Competition. This project challenged us to create an aircraft capable of meeting strict performance requirements while maintaining stability and safety throughout its flight envelope.

Our team successfully engineered a parasol wing configuration aircraft optimized for maximum lift generation, constructed from lightweight yet robust materials including plywood and aluminum, enhanced through additive manufacturing techniques. The aircraft demonstrated exceptional performance characteristics including stable flight, precise control, and efficient power management.

Design Objectives

The competition requirements demanded strict adherence to several critical performance parameters:

  • Takeoff Performance: Complete takeoff sequence within 100 feet of runway
  • Flight Stability: Achieve and maintain stable flight within 400 feet altitude
  • Maneuverability: Demonstrate precise turning capability with controlled banking
  • Landing Safety: Execute safe landing procedures with minimal ground roll

Technical Specifications

Wing Configuration

  • Design Type: Parasol wing configuration for optimal lift-to-drag ratio
  • Aerodynamic Optimization: High-lift airfoil selection for low-speed performance
  • Wing Placement: Elevated mounting above fuselage for improved ground clearance and stability
  • Structural Integration: Efficient strut-braced design minimizing weight while maximizing strength

Materials & Construction

  • Primary Structure: Plywood for main structural components offering excellent strength-to-weight ratio
  • Reinforcement Elements: Aluminum components at critical stress points and connection interfaces
  • Advanced Manufacturing: Additive manufacturing (3D printing) for complex geometries and custom fittings
  • Surface Finish: Lightweight covering material for aerodynamic smoothness

Power System

  • Battery Configuration: Lithium Polymer (Li-Po) battery pack for high energy density
  • Propulsion: Single electric motor with optimized propeller selection
  • Flight Endurance: Approximately 14 minutes of continuous flight operation
  • Thrust Performance: Motor and propeller combination providing sufficient thrust for all competition requirements
  • Power Management: Electronic speed controller (ESC) for efficient motor control and battery protection

Key Features

  • Parasol Wing Advantage: Provides superior visibility from cockpit and improved stability compared to low-wing designs
  • Lightweight Construction: Optimized material selection achieving minimum weight without compromising structural integrity
  • Electric Propulsion: Clean, quiet operation with excellent throttle response and controllability
  • Modular Design: Easy assembly and disassembly for transport and maintenance
  • Competition Ready: Designed to meet all SAE INDIA Aero Design Competition specifications

Performance Achievements

  • Successfully met takeoff distance requirement of 100 feet
  • Achieved stable flight within 400 feet altitude envelope
  • Demonstrated precise turning and maneuvering capabilities
  • Completed safe landing procedures consistently
  • 14-minute flight endurance exceeding minimum mission requirements

Technologies & Skills

Aircraft Design Aerodynamics CAD Modeling Additive Manufacturing Structural Analysis Flight Dynamics Electric Propulsion Composite Materials SAE Competition Team Collaboration

Learning Outcomes

This project provided invaluable experience in:

  • Aerospace Engineering Fundamentals: Practical application of aerodynamic principles, structural mechanics, and flight dynamics
  • Design Process: Complete aircraft design lifecycle from conceptual design through detailed engineering to flight testing
  • Manufacturing Techniques: Hands-on experience with traditional woodworking, metalworking, and modern additive manufacturing
  • System Integration: Coordinating multiple subsystems (structure, propulsion, control surfaces) into cohesive aircraft design
  • Competition Experience: Working under strict requirements, timelines, and performance specifications
  • Team Dynamics: Collaborating with multidisciplinary team members to achieve common goals

Bank Cheque Processing System

Overview

Built an automated bank cheque processing system using OCR and image processing techniques for efficient cheque verification and data extraction.
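
A minimal sketch of the extraction step, assuming OpenCV plus Tesseract via `pytesseract`; the fixed crop coordinates depend on the cheque template and are placeholders, not the system's actual layout.

```python
# Binarize the scan, crop template fields, and OCR each region.
import cv2
import pytesseract  # pip install pytesseract (requires the tesseract binary)

img = cv2.imread("cheque.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu thresholding suppresses the background pattern before OCR.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Crop fixed fields (MICR line, amount box) from an assumed template layout.
h, w = binary.shape
micr_line = binary[int(0.88 * h):, :]                      # bottom strip
amount_box = binary[int(0.35 * h):int(0.5 * h), int(0.65 * w):]

print(pytesseract.image_to_string(micr_line))
print(pytesseract.image_to_string(amount_box))
```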

Technologies Used

OCR Image Processing OpenCV Python

Cozmo Clench

Cozmo Clench Robot Demo

Competition Details

Event: Cozmo Clench - Techfest, IIT Bombay

Year: 2022

Team: VIT Robotics Team

Overview

Developed an Arduino-based manually controlled rover robot for the Cozmo Clench robotics competition at Techfest, IIT Bombay. The robot was designed to navigate an arena, grip and manipulate blocks, and place them in designated target zones while overcoming various obstacles and challenges.

The competition challenged teams to design and build a manually controlled robot capable of navigating a 3000mm x 2500mm arena, gripping and lifting colored blocks, placing blocks in specific target zones, and operating within strict size and power constraints.

Key Features

Robot Design

  • Compact Dimensions: Robot designed within 300mm x 200mm x 300mm size constraints
  • Gripper Mechanism: Custom-designed claw mechanism for secure block manipulation
  • Sturdy Chassis: Robust frame construction for stability during block transport
  • Omnidirectional Movement: Four-wheel drive system for precise maneuvering

Control System

  • Arduino-based Control: Microcontroller-based architecture for motor control and sensor integration
  • Wireless Operation: Remote control system for manual robot operation
  • Power Management: Efficient 24V onboard power supply system
  • Motor Controllers: PWM-based motor drivers for smooth speed control

Mechanical Components

  • Gripper Assembly: Servo-controlled claw with adjustable grip strength
  • Drive Train: DC geared motors providing adequate torque for block manipulation
  • Structural Materials: Combination of metal and 3D-printed components
  • Sensor Integration: IR/ultrasonic sensors for obstacle detection

Technical Specifications

  • Maximum Dimensions: 300mm x 200mm x 300mm
  • Power Supply: 24V DC onboard battery system
  • Control: Manual wireless control with Arduino-based receiver
  • Gripper: Servo-controlled claw mechanism
  • Sensors: IR/Ultrasonic for obstacle detection
  • Motors: 4x DC geared motors for drive, 1-2 servos for gripper

Challenges and Solutions

Precise Block Gripping

Challenge: Achieving consistent and reliable grip on blocks of varying sizes

Solution: Designed adaptive gripper with rubber padding and adjustable servo angles for optimal grip force

Stability During Block Transport

Challenge: Robot tipping when lifting blocks due to center of gravity shift

Solution: Implemented low center of gravity design with counterweight and wide wheelbase for enhanced stability

Accurate Zone Placement

Challenge: Positioning blocks precisely in target zones under time pressure

Solution: Developed intuitive control mapping and practiced maneuvering patterns for efficient placement

Power Efficiency

Challenge: Battery drain during extended competition runs

Solution: Optimized power consumption through efficient motor selection and smart power management code

Competition Performance

  • Successfully completed block manipulation tasks
  • Demonstrated reliable gripper operation
  • Achieved consistent navigation and obstacle avoidance
  • Showcased robust mechanical and electrical design

Technologies & Skills

Arduino Embedded Systems C/C++ Robotics CAD Design 3D Printing Motor Control Wireless Communication Techfest IIT Bombay Mechanical Design

Learning Outcomes

This project provided hands-on experience in:

  • Robotics Design: End-to-end robot development from concept to competition
  • Embedded Systems: Arduino programming and hardware interfacing
  • Mechanical Engineering: CAD design, 3D printing, and mechanism development
  • Control Systems: Manual control interface and motor control algorithms
  • Team Collaboration: Working with cross-functional team on tight deadlines
  • Problem Solving: Rapid prototyping and iterative design improvements
  • Competition Experience: Performing under pressure in competitive environment

Home Automation Using Augmented Reality

Overview

Developed an AR-based home automation system enabling intuitive control of IoT devices through an augmented reality interface for seamless smart home management.

Technologies Used

Augmented Reality IoT Unity C#

Weather Prediction Using Extended Kalman Filter

Overview

Implemented an Extended Kalman Filter for accurate weather prediction and state estimation from noisy sensor data, improving forecasting accuracy.
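
A minimal scalar sketch of the EKF predict/update cycle; the process model (temperature relaxing toward a daily mean) and the noise values are illustrative assumptions, not the project's actual model.

```python
# Scalar EKF: linearize the process model, then predict and update.
def f(x: float) -> float:
    """Process model: temperature relaxes toward a 20 °C daily mean."""
    return x + 0.1 * (20.0 - x)


def F_jac(x: float) -> float:
    """df/dx, the linearization the EKF uses in the covariance predict."""
    return 1.0 - 0.1


def ekf_step(x, P, z, Q=0.05, R=1.0):
    # Predict
    x_pred = f(x)
    P_pred = F_jac(x) * P * F_jac(x) + Q
    # Update (measurement model h(x) = x, Jacobian H = 1)
    K = P_pred / (P_pred + R)          # Kalman gain
    x_new = x_pred + K * (z - x_pred)  # correct with the noisy reading
    P_new = (1.0 - K) * P_pred
    return x_new, P_new


x, P = 15.0, 4.0
for z in [16.2, 17.1, 18.4, 18.9]:  # noisy sensor readings
    x, P = ekf_step(x, P, z)
print(round(x, 2))
```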

Technologies Used

Kalman Filter State Estimation Python MATLAB

University of Illinois Urbana-Champaign

Master's in Autonomy and Robotics

August 2025 - Present

Program Overview

Pursuing a Master's degree in Autonomy and Robotics at one of the world's leading engineering institutions. The program focuses on advanced robotics systems, autonomous navigation, computer vision, and safe AI deployment in real-world environments.

Academic Performance

4.0
Current GPA

Maintaining perfect academic standing while engaging in cutting-edge research and coursework.

Relevant Coursework

Computer Vision
Principles of Safe Autonomy
Humanoid Robotics
Mobile Robotics

Request Academic Transcripts

For official academic transcripts and records, please send a request via email.

Request UIUC Transcripts

Vellore Institute of Technology

Bachelor of Technology in Electronics and Communication Engineering

September 2020 - May 2024

Program Overview

Completed a comprehensive undergraduate program in Electronics and Communication Engineering, with a strong focus on embedded systems, control systems, and machine learning applications. Gained hands-on experience through multiple internships at leading organizations including ISRO and Samsung R&D.

Academic Performance

3.8
Final GPA

Graduated with distinction, demonstrating excellence across core engineering and advanced elective courses.

Relevant Coursework

Analog and Digital Electronics
Communication Systems
Control Systems
Algorithms
Machine Learning

Request Academic Transcripts

For official academic transcripts and records from VIT, please send a request via email.

Request VIT Transcripts

Get In Touch

I'm always open to discussing new opportunities, collaborations, or exciting robotics and AI projects.

Location

Urbana, IL

Schedule a Meeting

Book a time slot