1GigaAI 2Zhejiang University 3Tsinghua University
4Institute of Automation, Chinese Academy of Sciences 5Peking University 6Monash University
* Equal contribution † Corresponding author
We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Existing approaches to driving scene synthesis suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without a 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to 800×424 resolution at 12 FPS) and the corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.
Overview of DriveGen3D. (a) Given textual and BEV layout conditions, our model first employs an accelerated Video Diffusion Transformer to synthesize a long driving video. (b) Next, a per-frame 3D Gaussian Splatting representation is built from the generated video frames to construct the entire scene.
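As a concrete illustration of this two-stage interface, the sketch below chains a video generator and a feed-forward Gaussian reconstructor. It is a minimal, hypothetical outline: the callables standing in for FastDrive-DiT and FastRecon3D, their argument lists, and the Gaussian parameter format are our own assumptions, not the released API.

```python
# Minimal, hypothetical sketch of the two-stage DriveGen3D pipeline.
# The callables `video_generator` and `gaussian_reconstructor` stand in for
# FastDrive-DiT and FastRecon3D; their signatures are assumptions, not the released API.
import torch


def generate_driving_scene(text_prompt, bev_layouts, video_generator,
                           gaussian_reconstructor, num_frames=233,
                           height=424, width=800):
    # Stage 1: synthesize a long driving video conditioned on text and BEV layouts.
    video = video_generator(text_prompt, bev_layouts, num_frames, height, width)
    # Stage 2: feed-forward per-frame 3D Gaussian reconstruction from the generated frames.
    gaussians = gaussian_reconstructor(video)
    return video, gaussians


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end (17-frame short clip).
    dummy_generator = lambda text, bev, t, h, w: torch.rand(t, 3, h, w)
    dummy_reconstructor = lambda video: [
        {"xyz": torch.rand(1000, 3), "opacity": torch.rand(1000, 1)}
        for _ in range(video.shape[0])
    ]
    bev = torch.zeros(17, 8, 200, 200)  # placeholder per-frame BEV layout maps
    video, gaussians = generate_driving_scene(
        "a rainy night drive through an intersection", bev,
        dummy_generator, dummy_reconstructor, num_frames=17)
    print(video.shape, len(gaussians))
```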
Text and BEV Layout Guided Generation
Figure 1: Comparison of video generation between MagicDriveDiT (615 s) and FastDrive-DiT (ours, 278 s), which combines diffusion-step acceleration and a quantized DiT.
TABLE I: Acceleration of the video generation inference process. 17f and 233f denote the number of frames in the generated videos.
Figure 2: Visualization of the input and output differences between consecutive timesteps. (a) All, (b) Conditional, (c) Unconditional.
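The near-identical inputs and outputs at neighboring timesteps visualized in Figure 2 are the usual motivation for caching-based step acceleration. The snippet below is a generic sketch of that idea rather than the paper's exact algorithm: a block's cached output is reused whenever the relative change of its input since the previous timestep falls below a threshold.

```python
# Generic sketch of feature caching across diffusion timesteps, motivated by the
# small per-step input/output differences in Figure 2. Illustrative assumption only;
# not the acceleration scheme described in the paper.
import torch


class CachedBlock(torch.nn.Module):
    def __init__(self, block, rel_threshold=0.05):
        super().__init__()
        self.block = block
        self.rel_threshold = rel_threshold
        self.prev_input = None
        self.prev_output = None

    def forward(self, x):
        if self.prev_input is not None:
            # Relative change of the block input w.r.t. the previous timestep.
            rel_change = (x - self.prev_input).norm() / (self.prev_input.norm() + 1e-8)
            if rel_change < self.rel_threshold:
                # Input barely changed: reuse the cached output and skip the compute.
                return self.prev_output
        out = self.block(x)
        self.prev_input, self.prev_output = x.detach(), out.detach()
        return out


# Usage: wrap an expensive transformer block once, then run the sampling loop as usual.
block = CachedBlock(torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()))
base = torch.rand(1, 64)
for t in range(10):                               # stand-in for the denoising loop
    x = base + 0.01 * t * torch.randn(1, 64)      # slowly varying input across steps
    y = block(x)
```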
Figure 3: Typical examples of tensor data distributions in different attention blocks of MagicDriveDiT.
TABLE II: Time cost of different attention blocks of MagicDriveDiT during inference.
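Together with the tensor distributions in Figure 3, the attention-block timings above suggest why quantizing the DiT (as referenced in Figure 1) is attractive. The snippet below is a minimal, generic sketch of symmetric per-tensor INT8 weight quantization for one linear projection; it illustrates the basic mechanics only and is not the quantization scheme used in the paper.

```python
# Minimal sketch of symmetric per-tensor INT8 weight quantization for a linear
# projection inside an attention block. Illustrative only; the paper's actual
# quantization scheme is not specified here.
import torch


def quantize_weight_int8(w: torch.Tensor):
    # Scale chosen so the largest-magnitude weight maps to +/-127.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale


# Example: quantize the query projection of a toy attention layer and check the error.
proj = torch.nn.Linear(64, 64, bias=False)
q, scale = quantize_weight_int8(proj.weight.data)
w_hat = dequantize(q, scale)
err = (proj.weight.data - w_hat).abs().mean().item()
print(f"mean absolute quantization error: {err:.6f}")
```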
Figure 4: Visualization of the multi-view reconstructed video from a generated 3D scene.
TABLE III: Comparison of our method against prior feed-forward and optimization-based methods. The last two rows show novel view rendering performance with either GT or generated video input.
Comparison with raw videos. By default, our reconstruction stage takes the generated videos as input. We compare with DrivingForward to assess the difference between using raw videos and generated videos.
Figure 5: Qualitative comparison between GT videos and generated videos for reconstruction.
@article{wang2025drivegen3d,
title={DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion},
author={Wang, Weijie and Zhu, Jiagang and Zhang, Zeyu and Wang, Xiaofeng and Zhu, Zheng and Zhao, Guosheng and Ni, Chaojun and Wang, Haoxiao and Huang, Guan and Chen, Xinze and Zhou, Yukun and Qin, Wenkang and Shi, Duochao and Li, Haoyun and Xiao, Yicheng and Chen, Donny Y. and Lu, Jiwen},
journal={arXiv preprint arXiv:2510.15264},
year={2025}
}