Wufei Ma

I am a PhD student at Johns Hopkins University, advised by Bloomberg Distinguished Professor Dr. Alan Yuille.

I received my B.S., summa cum laude, from Rensselaer Polytechnic Institute in 2020, with a double major in Computer Science and Mathematics. During my undergraduate years, I worked with Prof. Bülent Yener on discriminative and generative models for microstructure images and with Prof. Lirong Xia on preference learning from natural language.

I’ve had a great time at Google Research, Reality Labs, Research Asia, Frontier AI & Robotics (FAR) and AWS CV Science, Research, and have collaborated with many exceptional researchers.

news

Sep 18, 2025	One paper accepted to NeurIPS 2025.
Jun 26, 2025	One paper accepted to ICCV 2025.
Feb 26, 2025	Two papers accepted to CVPR 2025 (both as highlight).
Jan 23, 2025	One paper accepted to ICLR 2025.
Sep 26, 2024	One paper accepted to NeurIPS 2024.
Jul 04, 2024	Co-organizing OOD-CV workshop at ECCV 2024. Call for papers at ood-cv.org.
Jul 01, 2024	Two papers accepted to ECCV 2024.
Jun 10, 2024	I will present our Feint6K dataset at WINVU @ CVPR 2024.
Feb 28, 2024	One paper accepted to IEEE TMM.

selected publications

see all publications here

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso M de Melo, Jianwen Xie, and Alan Yuille

In Advances in Neural Information Processing Systems, 2025

3D Vision Vision-Language

We introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared across three stages: 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and enable us to study the factual errors made by LVLMs.

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Jieneng Chen, Celso M de Melo, and Alan Yuille

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2025

3D Vision Vision-Language

We present 3DSRBench, a comprehensive 3D spatial reasoning benchmark.

SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, and Alan Yuille

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Oct 2025

(Highlight, 3.0%)

3D Vision Vision-Language

We systematically study the impact of 3D-informed data, architecture, and training setups and present SpatialLLM, an LMM with advanced 3D spatial reasoning abilities.

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

Wufei Ma, Guofeng Zhang, Qihao Liu, Guanning Zeng, Adam Kortylewski, Yaoyao Liu, and Alan Yuille

In Advances in Neural Information Processing Systems, Oct 2024

Dataset 3D Vision

We present ImageNet3D, a large dataset for general-purpose object-level 3D understanding.

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, and Alan Yuille

In European Conference on Computer Vision, Oct 2024

(Strong Double Blind)

Dataset Vision-Language

We propose a novel task, retrieval from counterfactually augmented data, and a dataset, Feint6K, for video-text understanding.

Generating Images with 3D Annotations Using Diffusion Models

Wufei Ma*, Qihao Liu*, Jiahao Wang*, Angtian Wang, Xiaoding Yuan, Yi Zhang, Zihao Xiao, Guofeng Zhang, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu, and Alan Yuille

In The Twelfth International Conference on Learning Representations, Oct 2024

(Spotlight, 5%)

Dataset 3D Vision

We propose 3D-DST, which generates synthetic data with 3D ground truth by incorporating 3D geometry control into diffusion models. With our diverse prompt generation, we effectively improve both in-distribution (ID) and out-of-distribution (OOD) performance for various 2D and 3D vision tasks.