1GigaAI 2Zhejiang University 3Tsinghua University
4Institute of Automation, Chinese Academy of Sciences 5Peking University 6Monash University
* Equal contribution † Corresponding author
We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Existing approaches to driving scene synthesis suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without a 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to 800×424 resolution at 12 FPS) and the corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.
Overview of DriveGen3D. (a) Given textual and BEV layout conditions, our model first employs an accelerated Video Diffusion Transformer to synthesize a long driving video. (b) Next, a per-frame 3D Gaussian Splatting representation is built from the generated video frames to construct the entire scene.
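As a concrete illustration of this two-stage interface, the sketch below chains a video generator and a feed-forward Gaussian reconstructor. It is a minimal, hypothetical outline: the callables standing in for FastDrive-DiT and FastRecon3D, their argument lists, and the Gaussian parameter format are our own assumptions, not the released API.

```python
# Minimal, hypothetical sketch of the two-stage DriveGen3D pipeline.
# The callables `video_generator` and `gaussian_reconstructor` stand in for
# FastDrive-DiT and FastRecon3D; their signatures are assumptions, not the released API.
import torch


def generate_driving_scene(text_prompt, bev_layouts, video_generator,
                           gaussian_reconstructor, num_frames=233,
                           height=424, width=800):
    # Stage 1: synthesize a long driving video conditioned on text and BEV layouts.
    video = video_generator(text_prompt, bev_layouts, num_frames, height, width)
    # Stage 2: feed-forward per-frame 3D Gaussian reconstruction from the generated frames.
    gaussians = gaussian_reconstructor(video)
    return video, gaussians


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end (17-frame short clip).
    dummy_generator = lambda text, bev, t, h, w: torch.rand(t, 3, h, w)
    dummy_reconstructor = lambda video: [
        {"xyz": torch.rand(1000, 3), "opacity": torch.rand(1000, 1)}
        for _ in range(video.shape[0])
    ]
    bev = torch.zeros(17, 8, 200, 200)  # placeholder per-frame BEV layout maps
    video, gaussians = generate_driving_scene(
        "a rainy night drive through an intersection", bev,
        dummy_generator, dummy_reconstructor, num_frames=17)
    print(video.shape, len(gaussians))
```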
Text and BEV Layout Guided Generation
Figure 1: Comparison of video generation between MagicDriveDiT (615 s) and FastDrive-DiT (ours, 278 s), which combines diffusion-step acceleration and a quantized DiT.
TABLE I: Acceleration of the video generation inference process. 17f and 233f denote the number of frames in the generated videos.
Figure 2: Visualization of the input and output differences between consecutive timesteps. (a) All, (b) Conditional, (c) Unconditional.
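The near-identical inputs and outputs at neighboring timesteps visualized in Figure 2 are the usual motivation for caching-based step acceleration. The snippet below is a generic sketch of that idea rather than the paper's exact algorithm: a block's cached output is reused whenever the relative change of its input since the previous timestep falls below a threshold.

```python
# Generic sketch of feature caching across diffusion timesteps, motivated by the
# small per-step input/output differences in Figure 2. Illustrative assumption only;
# not the acceleration scheme described in the paper.
import torch


class CachedBlock(torch.nn.Module):
    def __init__(self, block, rel_threshold=0.05):
        super().__init__()
        self.block = block
        self.rel_threshold = rel_threshold
        self.prev_input = None
        self.prev_output = None

    def forward(self, x):
        if self.prev_input is not None:
            # Relative change of the block input w.r.t. the previous timestep.
            rel_change = (x - self.prev_input).norm() / (self.prev_input.norm() + 1e-8)
            if rel_change < self.rel_threshold:
                # Input barely changed: reuse the cached output and skip the compute.
                return self.prev_output
        out = self.block(x)
        self.prev_input, self.prev_output = x.detach(), out.detach()
        return out


# Usage: wrap an expensive transformer block once, then run the sampling loop as usual.
block = CachedBlock(torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()))
base = torch.rand(1, 64)
for t in range(10):                               # stand-in for the denoising loop
    x = base + 0.01 * t * torch.randn(1, 64)      # slowly varying input across steps
    y = block(x)
```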
Figure 3: Typical examples of tensor data distributions in different attention blocks of MagicDriveDiT.
TABLE II: Time cost of different attention blocks of MagicDriveDiT during inference.
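Together with the tensor distributions in Figure 3, the attention-block timings above suggest why quantizing the DiT (as referenced in Figure 1) is attractive. The snippet below is a minimal, generic sketch of symmetric per-tensor INT8 weight quantization for one linear projection; it illustrates the basic mechanics only and is not the quantization scheme used in the paper.

```python
# Minimal sketch of symmetric per-tensor INT8 weight quantization for a linear
# projection inside an attention block. Illustrative only; the paper's actual
# quantization scheme is not specified here.
import torch


def quantize_weight_int8(w: torch.Tensor):
    # Scale chosen so the largest-magnitude weight maps to +/-127.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale


# Example: quantize the query projection of a toy attention layer and check the error.
proj = torch.nn.Linear(64, 64, bias=False)
q, scale = quantize_weight_int8(proj.weight.data)
w_hat = dequantize(q, scale)
err = (proj.weight.data - w_hat).abs().mean().item()
print(f"mean absolute quantization error: {err:.6f}")
```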
Figure 4: Visualization of the multi-view reconstructed video from a generated 3D scene.
TABLE III: Comparison of our method against prior feed-forward and optimization-based methods. The last two rows show novel view rendering performance with either GT or generated video input.
Comparison with raw videos. By default, our reconstruction stage takes the generated videos as input. We compare with DrivingForward to assess the difference between using raw videos and generated videos.
Figure 5: Qualitative comparison between GT videos and generated videos for reconstruction.
@article{wang2025drivegen3d,
title={DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion},
author={Wang, Weijie and Zhu, Jiagang and Zhang, Zeyu and Wang, Xiaofeng and Zhu, Zheng and Zhao, Guosheng and Ni, Chaojun and Wang, Haoxiao and Huang, Guan and Chen, Xinze and Zhou, Yukun and Qin, Wenkang and Shi, Duochao and Li, Haoyun and Xiao, Yicheng and Chen, Donny Y. and Lu, Jiwen},
journal={arXiv preprint arXiv:2510.15264},
year={2025}
}