
EmbodiedSplat 🛋️: Online Feed-Forward Semantic 3DGS
for Open-Vocabulary 3D Scene Understanding

CVPR 2026

National University of Singapore

Build and understand at once! Given over 300 streaming images, EmbodiedSplat reconstructs a whole-scene open-vocabulary 3DGS in an online manner, with per-frame processing at up to 5-6 FPS. The reconstructed scene supports diverse perception tasks such as open-vocabulary 3D semantic segmentation, 2D-rendered semantic segmentation, and novel-view color synthesis with depth rendering.


💡 EmbodiedSplat simultaneously constructs an RGB and open-vocabulary semantic field in an online manner. During exploration, it even refines regions with noisy semantics by exploiting our novel Online Sparse Coefficient Field with a CLIP Global Codebook.

✔️ Online 3DGS   ✔️ Feed-Forward Design   ✔️ Near Real-time (5-6 FPS)  
✔️ Whole-scene (300+ images)   ✔️ Memory-Efficient   ✔️ Fast 3D Search

Example free-form queries: "I wanna hear the music." · "I wanna pee." · "Where can I sit?"


💡 EmbodiedSplat can localize 3D objects from free-form language queries during its exploration.
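This kind of open-vocabulary 3D localization typically ranks Gaussians by the similarity between a CLIP text embedding and each Gaussian's language feature. The following is a minimal sketch of that ranking step; the function name `localize_query` and the toy features are illustrative, not part of the released code.

```python
import numpy as np

def localize_query(text_embedding, gaussian_features, top_k=100):
    """Rank 3D Gaussians by cosine similarity to a text query embedding.

    text_embedding: (D,) CLIP text embedding of the free-form query.
    gaussian_features: (N, D) per-Gaussian language features.
    Returns indices of the top_k most relevant Gaussians.
    """
    t = text_embedding / np.linalg.norm(text_embedding)
    g = gaussian_features / np.linalg.norm(gaussian_features, axis=1, keepdims=True)
    scores = g @ t                       # (N,) cosine similarities
    return np.argsort(-scores)[:top_k]

# Toy example: 4 Gaussians with 3-dim features, query aligned with Gaussian 0.
feats = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])
print(localize_query(query, feats, top_k=2))  # → [0 2]
```

In practice the text embedding would come from the CLIP text encoder, and the selected Gaussians' means give the 3D location of the queried object.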

Abstract

Understanding a 3D scene immediately during exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D scene in an online, nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS framework for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from streaming images. Unlike existing open-vocabulary 3DGS methods, which are typically restricted to either offline or per-scene optimization settings, our objectives are two-fold: 1) reconstruct the semantic-embedded 3DGS of an entire scene from over 300 streaming images in an online manner; 2) generalize to novel scenes through a feed-forward design and support nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficient Field with a CLIP Global Codebook, which binds 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of the 3DGS through a 3D U-Net, compensating for the missing 3D geometric prior in 2D-oriented language embeddings. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and efficiency of our method. Code will be publicly available on the project website.

💡 Most semantic 3DGS works leverage a 2D rendering function to distill/bind language embeddings to 3DGS. We don't!

💡 Instead, we propose a distinct strategy, the Online Sparse Coefficient Field with a CLIP Global Codebook, which binds language embeddings while minimizing memory consumption and preserving the semantic capability of the 2D VLM.
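The memory-saving idea above can be sketched as follows: instead of storing a full D-dimensional CLIP embedding per Gaussian (e.g. D = 512), store only k indices into a shared global codebook plus k coefficients, and reconstruct the dense feature on demand. This is a simplified illustration under assumed top-k encoding; the paper's actual encoding/optimization of the coefficient field may differ.

```python
import numpy as np

def encode_sparse(clip_feature, codebook, k=4):
    """Approximate one Gaussian's CLIP embedding as a k-sparse combination
    of global codebook entries (assumed top-k scheme, for illustration).

    clip_feature: (D,) 2D CLIP embedding; codebook: (M, D) global codebook.
    Returns (indices, coefficients): the per-Gaussian sparse entry.
    """
    sims = codebook @ clip_feature        # (M,) similarity to each code
    idx = np.argsort(-sims)[:k]           # keep the k best-matching codes
    coef = sims[idx] / sims[idx].sum()    # normalized coefficients
    return idx, coef

def decode_sparse(idx, coef, codebook):
    """Reconstruct the dense feature from the sparse entry (e.g. at query time)."""
    return coef @ codebook[idx]           # (D,)

# Toy codebook of 4 orthonormal codes; the feature lies in their span.
cb = np.eye(4)
feat = np.array([0.8, 0.2, 0.0, 0.0])
idx, coef = encode_sparse(feat, cb, k=2)
rec = decode_sparse(idx, coef, cb)        # reconstructs feat exactly here
```

With N Gaussians, storage drops from O(N·D) floats to O(N·k) indices and coefficients plus one shared (M, D) codebook, which is the source of the memory savings.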

Contributions


  1. A novel framework for embodied 3D perception that enables online, whole-scene reconstruction of language-embedded 3DGS at up to 5-6 FPS inference speed.
  2. A combination of 2D CLIP features with rich semantic capabilities and 3D CLIP features with geometric priors.
  3. A Sparse Coefficient Field with a CLIP Global Codebook that stores per-Gaussian language embeddings compactly.
  4. Experimental results showing that our framework significantly surpasses existing semantic 3DGS methods in both segmentation performance and scene reconstruction time.

Overall Framework of EmbodiedSplat


Overall framework of EmbodiedSplat. We endow feed-forward 3DGS with semantic understanding by binding two types of CLIP features: 1) 2D semantic features are attached to each Gaussian via the Sparse Coefficient Field with a CLIP Global Codebook, effectively reducing memory consumption while preserving the semantic generalizability of CLIP. 2) 3D geometric-aware features are produced by aggregating the feature point cloud of the 3DGS through a 3D U-Net and a temporal-aware memory adapter. The two feature types compensate for each other across semantics and 3D geometry, yielding superior understanding compared to existing baselines.
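Before a point cloud of per-Gaussian features can be fed to a 3D U-Net, it is commonly grouped onto a voxel grid. The sketch below shows only that grouping step, averaging features per voxel; it is a stand-in for whatever sparse-convolution pipeline the actual system uses, and `voxel_pool` and the 0.05 m voxel size are assumptions for illustration.

```python
import numpy as np

def voxel_pool(centers, features, voxel_size=0.05):
    """Average per-Gaussian features within each voxel.

    centers: (N, 3) Gaussian means in meters; features: (N, D).
    Returns voxel coordinates (V, 3) and pooled features (V, D).
    """
    vox = np.floor(centers / voxel_size).astype(np.int64)       # (N, 3) voxel ids
    keys, inverse = np.unique(vox, axis=0, return_inverse=True)
    inverse = inverse.ravel()                                   # flatten for safety
    pooled = np.zeros((len(keys), features.shape[1]))
    np.add.at(pooled, inverse, features)                        # sum per voxel
    pooled /= np.bincount(inverse)[:, None]                     # sum -> mean
    return keys, pooled

# Toy example: first two Gaussians share a voxel, the third is separate.
centers = np.array([[0.01, 0.01, 0.01],
                    [0.02, 0.00, 0.00],
                    [0.20, 0.20, 0.20]])
feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
keys, pooled = voxel_pool(centers, feats)   # 2 occupied voxels remain
```

The pooled voxel features would then pass through the 3D U-Net (and, per the caption, a temporal-aware memory adapter) to produce the geometric-aware features.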

BibTeX