
DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion


Weijie Wang1,2*  Jiagang Zhu1,3*  Zeyu Zhang1  Xiaofeng Wang1,4  Zheng Zhu1†  Guosheng Zhao1,4 
Chaojun Ni1,5  Haoxiao Wang2  Guan Huang1  Xinze Chen1  Yukun Zhou1  Wenkang Qin1 
Duochao Shi2  Haoyun Li1,4  Yicheng Xiao4  Donny Y. Chen6  Jiwen Lu3

1GigaAI   2Zhejiang University   3Tsinghua University   
4Institute of Automation, Chinese Academy of Sciences   5Peking University   6Monash University

* Equal contribution   † Corresponding author

Abstract


We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to 800×424 resolution at 12 FPS) and the corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.

Method Overview




Overview of DriveGen3D. (a) Given textual and BEV layout conditions, our model first employs an accelerated Video Diffusion Transformer to synthesize a long driving video. (b) Next, a per-frame 3D Gaussian Splatting representation is used to construct the entire scene from the generated video frames.
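
The two-stage pipeline in the overview can be read as a simple feed-forward interface: a conditional video diffusion transformer produces the frames, and a feed-forward reconstructor lifts them into per-frame 3D Gaussians. The sketch below is a minimal, hypothetical illustration in PyTorch; the class and method names (FastDriveDiT.generate_video, FastRecon3D.reconstruct) and all tensor shapes are assumptions for illustration, not the released API.

import torch
import torch.nn as nn

class FastDriveDiT(nn.Module):
    """Stage 1 (hypothetical interface): text- and BEV-conditioned video diffusion.
    The real model is an accelerated video DiT; this stub just returns frames
    with the expected shape (T, C, H, W)."""
    def generate_video(self, text_emb, bev_layout, num_frames=17, size=(424, 800)):
        h, w = size
        return torch.rand(num_frames, 3, h, w)  # placeholder frames

class FastRecon3D(nn.Module):
    """Stage 2 (hypothetical interface): feed-forward per-frame 3D Gaussian prediction."""
    def reconstruct(self, frames):
        t, n = frames.shape[0], 4096  # frame count and an assumed Gaussian count per frame
        return {
            "xyz":      torch.rand(t, n, 3),  # Gaussian centers
            "scale":    torch.rand(t, n, 3),
            "rotation": torch.rand(t, n, 4),  # quaternions
            "opacity":  torch.rand(t, n, 1),
            "rgb":      torch.rand(t, n, 3),
        }

# End-to-end sketch: text + BEV layout -> long driving video -> dynamic 3D Gaussian scene.
text_emb = torch.rand(1, 77, 768)        # e.g. a CLIP/T5-style text embedding (assumed)
bev_layout = torch.rand(1, 8, 200, 200)  # rasterized BEV map/box channels (assumed format)
video = FastDriveDiT().generate_video(text_emb, bev_layout, num_frames=17)
scene = FastRecon3D().reconstruct(video)
print(video.shape, scene["xyz"].shape)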

Visualization of FastDrive-DiT


Text and BEV Layout Guided Generation

MagicDriveDiT 615s

FastDrive-DiT (Ours) 278s

Figure 1: Comparison of video generation between MagicDriveDiT and our FastDrive-DiT, which combines diffusion-step acceleration and a quantized DiT.


Quantitative Acceleration Results


TABLE I: Acceleration of the video generation inference. 17f and 233f denote the frame counts of the generated videos.
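
Table-I-style numbers come down to measuring wall-clock inference time as the frame count and the diffusion-step budget vary. The toy harness below only sketches that measurement; the per-step network, the latent resolution, and the step budgets (30 vs. 8) are placeholders, not the paper's actual settings.

import time
import torch
import torch.nn as nn

def time_generation(step_fn, num_frames, num_steps, latent_hw=(53, 100), latent_dim=4):
    """Wall-clock time of a toy denoising loop for a given frame count and step budget."""
    x = torch.randn(1, num_frames, latent_dim, *latent_hw)
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_steps):
            x = step_fn(x)  # stand-in for one DiT denoising step
    return time.perf_counter() - start

step = nn.Linear(100, 100)      # toy per-step network acting on the last latent dimension
for frames in (17, 233):        # the two frame counts reported in Table I
    for steps in (30, 8):       # assumed full vs. reduced diffusion-step budgets
        print(f"{frames}f, {steps} steps: {time_generation(step, frames, steps):.3f}s")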


Analysis of Diffusion Steps Acceleration


Figure 2: Visualization of the input and output differences across consecutive timesteps. (a) All, (b) Conditional, (c) Unconditional.
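
The analysis behind Figure 2 asks how much the DiT's inputs and outputs actually change from one timestep to the next, separately for the conditional and unconditional branches of classifier-free guidance; when consecutive steps are nearly identical, their computation can be cached or skipped with little loss. The snippet below is a minimal sketch of that kind of measurement, with a relative-L1 metric chosen for illustration rather than taken from the paper.

import torch

def consecutive_step_change(outputs):
    """Relative L1 change between model outputs at consecutive diffusion timesteps."""
    diffs = []
    for prev, cur in zip(outputs[:-1], outputs[1:]):
        rel = (cur - prev).abs().mean() / (prev.abs().mean() + 1e-8)
        diffs.append(rel.item())
    return diffs

# Toy example: synthetic outputs that drift slowly over ten timesteps.
torch.manual_seed(0)
base = torch.randn(1, 4, 32, 32)
outputs = [base + 0.01 * t * torch.randn_like(base) for t in range(10)]
print([round(d, 3) for d in consecutive_step_change(outputs)])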


Analysis of Quantization Acceleration


Figure 3: Typical examples of the data distribution of tensors in different attention blocks of MagicDriveDiT.


TABLE II: Time cost of different attention blocks of MagicDriveDiT during inference.
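
Figure 3 and Table II together motivate quantizing the attention blocks: they account for a large share of inference time, and their value distributions determine how much numerical precision can be dropped safely. The sketch below uses plain symmetric per-tensor int8 quantization; whether this matches the scheme used in FastDrive-DiT is an assumption. The point is only that quantization error tracks the tensor's distribution, so outlier-heavy blocks are the ones that need care.

import torch

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= scale * q with q in [-127, 127]."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.float() * scale

# Toy check: an outlier-heavy tensor quantizes worse than a well-behaved one,
# mirroring the block-wise distribution analysis in Figure 3.
torch.manual_seed(0)
smooth = torch.randn(4096)
outlier = torch.randn(4096)
outlier[0] = 50.0
for name, t in (("smooth", smooth), ("outlier", outlier)):
    q, s = quantize_int8(t)
    err = (dequantize_int8(q, s) - t).abs().mean().item()
    print(f"{name}: mean abs error = {err:.4f}")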


Results of FastRecon3D


Figure 4: Visualization of the multi-view reconstructed video from a generated 3D scene.


TABLE III: Comparison of our method against prior feed-forward and optimization-based methods. The last two rows show novel-view rendering performance with either ground-truth (GT) or generated videos as input.
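
Table III-style comparisons score novel-view renderings against the corresponding reference frames, with either ground-truth or generated videos feeding the reconstruction. A minimal PSNR helper for that kind of evaluation is sketched below; the random tensors and the 800×424 resolution are placeholders, and PSNR is only one of the metrics such tables typically report.

import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a rendered view and its reference frame."""
    mse = torch.mean((pred - target) ** 2).clamp(min=1e-12)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()

# Toy example standing in for one rendered novel view and its reference frame.
torch.manual_seed(0)
reference = torch.rand(3, 424, 800)
rendered = (reference + 0.02 * torch.randn_like(reference)).clamp(0, 1)
print(f"PSNR = {psnr(rendered, reference):.2f} dB")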


Comparison with Ground Truth Videos


Comparison with raw videos. By default, our reconstruction stage takes generated videos as input. We compare against DrivingForward to assess the difference between using raw videos and generated videos as reconstruction input.

Figure 5: Qualitative comparison between GT videos and generated videos for reconstruction.


Citation


@article{wang2025drivegen3d,
  title={DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion},
  author={Wang, Weijie and Zhu, Jiagang and Zhang, Zeyu and Wang, Xiaofeng and Zhu, Zheng and Zhao, Guosheng and Ni, Chaojun and Wang, Haoxiao and Huang, Guan and Chen, Xinze and Zhou, Yukun and Qin, Wenkang and Shi, Duochao and Li, Haoyun and Xiao, Yicheng and Chen, Donny Y. and Lu, Jiwen},
  journal={arXiv preprint arXiv:2510.15264},
  year={2025}
}