Haodong Li

I am a research intern at the University of Pennsylvania, supervised by Prof. Lingjie Liu, and an M.Phil. student at the Hong Kong University of Science and Technology (Guangzhou Campus), supervised by Prof. Ying-Cong Chen and Prof. Xin Tong. Previously, I received my B.Eng. degree in Automation from Zhejiang University in 2023.

My research interests include 3D vision and generative models. My personality type is ENTJ. Feel free to drop me an email if you would like to discuss anything!

I am actively seeking Ph.D. positions for Fall 2025!

Email  /  CV  /  Scholar  /  Twitter  /  Github


Research

*: Both authors contributed equally.
StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation
Haodong Li, Chen Wang, Jiahui Lei, Zhiyang Dou, Kostas Daniilidis, Jiatao Gu, Lingjie Liu

arXiv 2024
arXiv / Project Page / Github (Soon)

Video depth estimation is not merely an extension of image depth estimation: the consistency requirements for dynamic and static regions in videos are fundamentally different. To tackle these challenges, StereoDiff synergizes stereo matching with video depth diffusion models, achieving superior video depth estimation performance.

LOTUS: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
Jing He*, Haodong Li*, Wei Yin, Yixun Liang, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, Ying-Cong Chen

arXiv 2024
arXiv / Project Page / Github / Demo (Depth) / Demo (Normal)

Lotus is a diffusion-based visual foundation model with a simple yet effective adaptation protocol, aiming to fully leverage the powerful visual priors of pre-trained diffusion models for dense prediction. With minimal training data, Lotus achieves SoTA performance in two key geometry perception tasks, i.e., zero-shot monocular depth and normal estimation.

DisEnvisioner: Disentangled and Enriched Visual Prompt for Image Customization
Jing He*, Haodong Li*, Yongzhe Hu, Guibao Shen, Yingjie Cai, Weichao Qiu, Ying-Cong Chen

arXiv 2024
arXiv / Project Page / Github / Demo (Soon)

DisEnvisioner emphasizes the interpretation of subject-essential attributes: it effectively identifies and enhances subject-essential features while filtering out irrelevant information, enabling exceptional image customization without cumbersome tuning or reliance on multiple reference images.

DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model
Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, Xiaoxiao Long

arXiv 2024
arXiv / Project Page / Github

DOME is a diffusion-based world model that predicts future occupancy frames based on past observations. DOME exhibits two key features: (1) high-fidelity and long-duration generation enabled by its spatial-temporal diffusion transformer; and (2) fine-grained controllability, thanks to the trajectory encoder.

DIScene: Object Decoupling and Interaction Modeling for Complex Scene Generation
Xiao-Lei Li, Haodong Li, Hao-Xiang Chen, Tai-Jiang Mu, Shi-Min Hu

SIGGRAPH Asia 2024

DIScene is capable of generating complex 3D scenes with decoupled objects and clear interactions. Leveraging a learnable Scene Graph and a hybrid Mesh-Gaussian representation, it produces 3D scenes of superior quality. DIScene can also flexibly edit a 3D scene by changing interactive objects or their attributes, benefiting diverse applications.

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
Yixun Liang*, Xin Yang*, Jiantao Lin, Haodong Li, Xiaogang Xu, Ying-Cong Chen

CVPR 2024 Highlight
arXiv / Paper / Github / Demo / Video

We present LucidDreamer, a text-to-3D generation framework that distills high-fidelity textures and shapes from pretrained 2D diffusion models with a novel Interval Score Matching objective and an advanced 3D distillation pipeline. Together, these achieve superior 3D generation results with photorealistic quality in a short training time.

Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
Haodong Li, Hao Lu, Ying-Cong Chen

ECCV 2024
arXiv / Paper / Project Page

We introduce Bi-TTA, a method that leverages spatial and temporal consistency for appropriate self-supervision, coupled with novel prospective and retrospective adaptation strategies, enabling pre-trained rPPG models to adapt effectively to the target domain using only unannotated, instance-level target data.

Academic Service

Reviewer: ICLR 2025

Education

University of Pennsylvania (2024/06 - Now)
Research Intern
School of Engineering and Applied Science
Hong Kong University of Science and Technology (2023/09 - Now)
Master of Philosophy (M.Phil.) (General)
Information Hub, Guangzhou Campus
Zhejiang University (2019/09 - 2023/06)
Bachelor of Engineering (B.Eng.) in Automation
College of Control Science and Engineering