Changyu Lee

Research

3D Scene Understanding with Vision-Language Models
I am currently exploring how vision-language models (VLMs) can be applied to 3D scene understanding. By integrating visual perception with language-based reasoning, I am working on building systems that interpret complex environments and interact meaningfully with them. Specifically, my goal is to enable embodied agents to perform egocentric 3D perception and scene reconstruction in both static and dynamic settings. This involves leveraging multimodal data—such as RGB, depth, and natural language—to infer spatial semantics, estimate human motion and intent, and reconstruct actionable representations of physical environments. I am particularly interested in scalable simulation frameworks and dataset construction methods that support generalizable scene understanding and manipulation, ultimately bridging the gap between real-world perception and virtual embodiment.
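As a small, concrete illustration of the geometric side of this pipeline, the sketch below lifts an RGB-D frame into a colored point cloud via standard pinhole back-projection. It is a minimal example: the intrinsics, array shapes, and function name are assumptions for illustration, not a description of any particular system or dataset.

```python
# Minimal sketch: lift an RGB-D frame to a colored 3D point cloud with
# pinhole back-projection. The intrinsics (fx, fy, cx, cy) and image sizes
# are illustrative placeholders; real systems read them from calibration.
import numpy as np

def rgbd_to_pointcloud(rgb, depth, fx, fy, cx, cy):
    """rgb: (H, W, 3) uint8, depth: (H, W) meters -> (N, 6) [x, y, z, r, g, b]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth
    x = (u - cx) * z / fx                            # back-project to camera X
    y = (v - cy) * z / fy                            # back-project to camera Y
    valid = z > 0                                    # drop pixels with no depth
    pts = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    cols = rgb[valid].astype(np.float32) / 255.0
    return np.concatenate([pts, cols], axis=-1)

# Toy usage with synthetic data (shapes and intrinsics are placeholders).
rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
depth = np.random.uniform(0.5, 4.0, (480, 640)).astype(np.float32)
cloud = rgbd_to_pointcloud(rgb, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (N, 6)
```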
Mixed Reality (MR) and eXtended Reality (XR)
I am actively researching and developing immersive MR/XR systems that bridge the gap between physical and digital spaces. My work centers on creating intuitive spatial interfaces and adaptive interaction models that enhance presence, continuity, and embodied engagement in extended reality environments. This involves exploring multimodal input channels such as gaze, gesture, and natural language, together with context-aware feedback systems that respond fluidly to user intention and environmental dynamics. Through this work, I aim to contribute to the foundations of a persistent, scalable, and socially meaningful “XR wave” that redefines how people perceive, interact with, and inhabit both virtual and physical worlds.
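To make this interaction model more tangible, here is a deliberately simplified sketch of gaze-plus-gesture target selection: a gaze ray is intersected with object bounding spheres, and a pinch event confirms the choice. The SceneObject type, the pinch flag, and all numeric values are hypothetical stand-ins, not the API of any specific XR runtime.

```python
# Illustrative sketch only: fuse gaze (a ray) with a pinch gesture to select an
# object. Types and values are hypothetical, not tied to a real XR runtime API.
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    name: str
    center: np.ndarray   # (3,) world-space position
    radius: float        # bounding-sphere radius

def gaze_hit(origin, direction, obj):
    """Ray-sphere test: return distance along the gaze ray, or None on a miss."""
    d = direction / np.linalg.norm(direction)
    oc = obj.center - origin
    t = float(np.dot(oc, d))                      # closest approach along the ray
    if t < 0:
        return None
    closest = origin + t * d
    return t if np.linalg.norm(closest - obj.center) <= obj.radius else None

def select_target(origin, direction, objects, pinch_active):
    """Select the nearest gazed-at object, but only when a pinch confirms it."""
    if not pinch_active:
        return None
    hits = [(gaze_hit(origin, direction, o), o) for o in objects]
    hits = [(t, o) for t, o in hits if t is not None]
    return min(hits, key=lambda h: h[0])[1] if hits else None

objects = [SceneObject("lamp", np.array([0.0, 1.2, -2.0]), 0.3),
           SceneObject("chair", np.array([1.0, 0.5, -3.0]), 0.6)]
target = select_target(np.zeros(3), np.array([0.0, 0.55, -1.0]), objects, pinch_active=True)
print(target.name if target else "nothing selected")
```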
3D Generative AI
I am engaged in advancing the field of 3D Generative Artificial Intelligence, with a focus on developing models that synthesize high-fidelity, semantically meaningful three-dimensional content. My research aims to enable automated and controllable generation of 3D assets—ranging from individual objects to complex scenes—by leveraging recent progress in generative modeling, neural implicit representations, and differentiable rendering techniques.
At the core of this work is the pursuit of scalable frameworks that bridge data-driven creativity with spatial reasoning, supporting applications in extended reality (XR), simulation environments, and digital twin systems. By integrating generative 3D models into interactive platforms, I aim to streamline the content creation pipeline and contribute to the foundations of next-generation immersive systems. My long-term objective is to expand the expressive potential of AI in spatial computing, enabling richer human–AI collaboration in both virtual and physical dimensions.
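As one concrete instance of the neural implicit representations mentioned above, the sketch below fits a small coordinate MLP to the signed distance field of a sphere. The architecture sizes, analytic target, and training budget are illustrative assumptions, a minimal toy rather than any production model.

```python
# Minimal sketch of a neural implicit representation: a coordinate MLP that
# regresses the signed distance of a unit-scale sphere. All sizes and the
# analytic supervision signal are illustrative assumptions.
import torch
import torch.nn as nn

class ImplicitSDF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):            # xyz: (N, 3) -> sdf: (N, 1)
        return self.net(xyz)

def sphere_sdf(xyz, radius=0.5):
    """Analytic ground-truth SDF used only as a toy supervision signal."""
    return xyz.norm(dim=-1, keepdim=True) - radius

model = ImplicitSDF()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    pts = torch.rand(1024, 3) * 2.0 - 1.0           # sample points in [-1, 1]^3
    loss = (model(pts) - sphere_sdf(pts)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The zero level set of `model` now approximates the sphere surface and could be
# extracted with marching cubes or rendered with sphere tracing.
print(float(loss))
```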
Digital Twin & Virtual Simulation for Robotics with VLA Models
[Figure: GR00T N1 architecture]
I am conducting research on real-time digital twin systems that enable simulation, monitoring, and control of robotic agents in complex and dynamic environments. My focus lies in integrating vision, language, and action (VLA) models to develop embodied agents capable of high-level reasoning and autonomous decision-making across both virtual and physical domains. This involves building generalizable frameworks that support scalable simulation-to-reality transfer and grounded interaction through multimodal inputs.
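As a schematic of how VLA-driven control plugs into a digital twin, the sketch below shows an observe-decide-act loop in which a placeholder policy maps an image, proprioception, and a language instruction to joint commands. The Observation fields, the placeholder policy, and the toy simulator are assumptions for illustration and do not reflect the GR00T N1 interface or any specific framework of mine.

```python
# Schematic sketch of a vision-language-action (VLA) control loop for a digital
# twin. The policy below is a placeholder returning zero actions; in practice it
# would wrap a pretrained VLA model, whose exact API is not assumed here.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray          # (H, W, 3) camera frame from the simulator
    proprio: np.ndarray      # (D,) joint positions / velocities
    instruction: str         # natural-language task description

class PlaceholderVLAPolicy:
    def act(self, obs: Observation) -> np.ndarray:
        """Map (image, proprioception, instruction) to a joint-velocity command."""
        return np.zeros_like(obs.proprio)          # stand-in for model inference

def sim_step(obs, action, dt=0.05):
    """Trivial stand-in simulator: integrate the commanded joint velocities."""
    return Observation(obs.rgb, obs.proprio + dt * action, obs.instruction)

def run_episode(policy, instruction, steps=100):
    """Generic twin loop: observe -> decide -> act, mirroring the real robot."""
    obs = Observation(np.zeros((224, 224, 3), np.uint8), np.zeros(7), instruction)
    for _ in range(steps):
        action = policy.act(obs)
        obs = sim_step(obs, action)                # advance the simulated twin
    return obs

final = run_episode(PlaceholderVLAPolicy(), "pick up the red cube")
print(final.proprio)
```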