#Visual-Inertial Odometry #Semantic Segmentation #Robotics #Dynamic Indoor Navigation #Real-Time Systems #Jetson Orin

SA-VINS | Semantic Understanding-Aided VIO for Autonomous Robots in a Dynamic Indoor Environment

Improving navigation in dynamic indoor environments with real-time semantic reasoning

Abstract

This project presents a lightweight semantic front-end for Visual–Inertial Odometry (VIO) that suppresses features on independently moving objects before they corrupt state estimation. The approach segments potential dynamic objects, selectively promotes truly dynamic instances via interaction reasoning (mask overlap + depth proximity), and stabilizes mask coverage using depth-guided mask extension and short-term tracking. Integrated with OpenVINS, the system gates feature extraction and runs in real time on edge hardware. Across simulated warehouse scenes and real indoor experiments, it reduces Absolute Trajectory Error (ATE) by up to 34.4% in position and 16.8% in orientation versus standard VIO in dynamic environments, with no degradation in static scenes.

Introduction

Autonomous robots often operate in GPS-denied indoor environments such as warehouses or industrial facilities. Traditional Visual–Inertial Odometry (VIO) systems like OpenVINS provide efficient pose estimation but suffer from drift when visual features lie on independently moving objects. This challenge is amplified in dynamic scenes with humans, forklifts, and movable equipment.

Approach Overview

The solution is a real-time VIO front-end that integrates semantic understanding to suppress dynamic outlier features before they affect the filter. The system introduces:

  • A segmenter for potentially dynamic objects based on instance segmentation (YolactEdge).
  • A hierarchical heuristic classifier that promotes only interaction-linked objects to dynamic status.
  • A depth-guided mask extension and short-term tracking module to handle segmentation gaps.

Unlike methods that mask all movable classes, this approach selectively retains static-but-movable objects (e.g., parked forklifts) to preserve feature richness and tracking stability.

Methodology

The pipeline operates on a continuous camera stream and consists of:

  1. Instance Segmentation: YolactEdge fine-tuned on a synthetic warehouse dataset generated in NVIDIA Isaac Sim, exported to ONNX, and optimized with TensorRT for low-latency edge inference.
  2. Dynamic Outlier Classification: Interaction reasoning based on 2D mask overlap and depth proximity; recursive association chains propagate dynamic status (e.g., human → forklift → box). A sketch follows this list.
  3. Mask Post-Processing: Depth-based mask extension (stereo) to cover thin or reflective boundaries, plus a lightweight tracker to bridge transient segmentation loss (second sketch below).
  4. Integration with VIO: Dynamic masks gate feature extraction before optical flow and MSCKF updates in OpenVINS, so dynamic features never enter the state (third sketch below).
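
A minimal sketch of the interaction-reasoning step (2), assuming boolean instance masks and a metric depth map. The class names, thresholds, and the `masks_interact` helper here are illustrative placeholders, not the exact implementation:

```python
import numpy as np
from collections import deque

# Seed classes treated as inherently dynamic; movable-but-static classes
# (e.g., forklifts) are only promoted via interaction. Names are illustrative.
SEED_CLASSES = {"person"}

def masks_interact(mask_a, mask_b, depth, overlap_thresh=0.01, depth_thresh=0.3):
    """Two instances 'interact' if their 2D masks overlap and their median
    depths are within depth_thresh metres, suggesting physical contact.
    (In practice masks may be dilated first so near-contact counts too.)"""
    area = min(mask_a.sum(), mask_b.sum())
    if area == 0:
        return False
    overlap = np.logical_and(mask_a, mask_b).sum() / area
    if overlap < overlap_thresh:
        return False
    return abs(np.median(depth[mask_a]) - np.median(depth[mask_b])) < depth_thresh

def classify_dynamic(instances, depth):
    """Propagate dynamic status along interaction chains with a breadth-first
    search (e.g., human -> forklift -> box). Each instance is a dict with a
    boolean 'mask' (HxW) and a class label 'cls'."""
    dynamic = [inst["cls"] in SEED_CLASSES for inst in instances]
    queue = deque(i for i, d in enumerate(dynamic) if d)
    while queue:
        i = queue.popleft()
        for j in range(len(instances)):
            if not dynamic[j] and masks_interact(
                    instances[i]["mask"], instances[j]["mask"], depth):
                dynamic[j] = True   # promoted: in contact with a dynamic object
                queue.append(j)     # its own neighbours may now be dynamic too
    return dynamic
```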
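For step 3, a sketch of depth-guided mask extension under the assumption that a dilation ring around each mask is admitted wherever depth stays close to the object's median depth; the short-term tracker that bridges dropped detections is omitted, and `depth_tol` / `dilate_px` are made-up defaults:

```python
import numpy as np
import cv2

def extend_mask_with_depth(mask, depth, depth_tol=0.25, dilate_px=15):
    """Grow a boolean instance mask into neighbouring pixels whose depth is
    close to the object's median depth, recovering thin or reflective
    boundaries that the segmenter tends to clip."""
    if not mask.any():
        return mask
    obj_depth = np.median(depth[mask])
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    # Candidate ring: dilated mask minus the original mask.
    ring = cv2.dilate(mask.astype(np.uint8), kernel).astype(bool) & ~mask
    depth_consistent = np.abs(depth - obj_depth) < depth_tol
    return mask | (ring & depth_consistent)
```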
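And for step 4, the gating idea: per-instance dynamic masks are merged into a single detector mask so new features are only extracted in static regions, before optical flow or the MSCKF update ever sees them. The OpenVINS-side wiring is not shown; this sketch uses OpenCV's detector-mask convention (non-zero pixels are allowed):

```python
import numpy as np
import cv2

def build_feature_mask(shape, dynamic_masks, margin_px=5):
    """Combine per-instance dynamic masks into one exclusion mask for the VIO
    front-end. A small dilation margin keeps features off object boundaries."""
    blocked = np.zeros(shape, dtype=np.uint8)
    for m in dynamic_masks:
        blocked |= m.astype(np.uint8)
    if margin_px > 0:
        kernel = np.ones((margin_px, margin_px), np.uint8)
        blocked = cv2.dilate(blocked, kernel)
    # OpenCV convention: detect only where the mask is non-zero.
    return np.where(blocked > 0, 0, 255).astype(np.uint8)

# Usage: restrict detection to static regions before optical flow.
# gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# mask = build_feature_mask(gray.shape, dynamic_masks)
# pts = cv2.goodFeaturesToTrack(gray, maxCorners=200, qualityLevel=0.01,
#                               minDistance=10, mask=mask)
```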

Experiments & Results

Evaluation covered simulated warehouse environments (Gazebo) and real indoor scenes using an Intel RealSense T265 camera. Key findings:

  • Dynamic scenes: ATE position improved by 34.4% and ATE orientation by 16.8% versus OpenVINS (the position metric is sketched below).
  • Static scenes: Performance parity with standard VIO; no degradation.
  • Edge real-time: The semantic module ran at ~18 FPS on a Jetson Orin NX and ~52 FPS on a desktop GPU.
  • Motion direction sensitivity: Standard VIO degrades more with objects moving parallel to the camera's heading than with those moving perpendicular to it; the proposed method narrows this gap.
[Figure: Simulation environment and trajectory evaluation for dynamic scenes.]
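
For context on the reported numbers, position ATE is the RMSE of translational residuals after rigidly aligning the estimate to ground truth. A minimal sketch (Kabsch/Umeyama alignment without scale, assuming time-synchronized (N, 3) position arrays):

```python
import numpy as np

def ate_position(gt, est):
    """Position ATE: RMSE of translational residuals after a rigid (rotation +
    translation) alignment of the estimated trajectory to ground truth."""
    mu_g, mu_e = gt.mean(0), est.mean(0)
    # Kabsch: cross-covariance of centered point sets, then SVD.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    residuals = gt - (est @ R.T + t)
    return np.sqrt((residuals ** 2).sum(1).mean())
```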

Comparison with Existing Methods

Against DynaSLAM, this approach maintained tracking through texture-sparse segments and sustained real-time performance on edge hardware, whereas DynaSLAM showed high latency and frequent tracking failures in dynamic scenes on similar trajectories.

Conclusion & Future Work

Integrating semantic reasoning into VIO significantly improves robustness in dynamic indoor environments. Future directions include optical-flow consistency checks to catch residual false negatives, learned motion saliency to replace hand-tuned thresholds, and extension to optimization-based VIO frameworks with loop closure.