GameGen-X: Open-world Video Game Generation

Haoxuan Che1*, Xuanhua He2*, Quande Liu3✉, Cheng Jin1, Hao Chen1✉
1Hong Kong University of Science and Technology;
2University of Science and Technology of China;
3The Chinese University of Hong Kong

* Equal Contribution       ✉ Co-corresponding Authors

[arXiv]      [GitHub]      [Gallery]      [Hugging Face Space]     

For any inquiries, please email: hche@ust.hk, qdliu0226@gmail.com.

Overview of GameGen-X Functionality

Part I showcases the basic functionality of GameGen-X, and Part II its key features (0:47).

TL;DR: GameGen-X is a game content foundation model capable of generating creative and infinitely long game videos with AAA-game quality at multiple high resolutions, real-time control at 20 FPS at 320p, and open-domain generalization.

Demo Shout-out to "Journey to the West"
"Infinite" Generation with Control Signals (w/ 10x Video Acceleration)



A comparison between GameGen-X and prior or concurrent works.

|                            | GameGAN (2020) | Genie (2023) | DIAMOND (2024) | MarioVGG (2024) | GameNGen (2024) | Oasis (2024) | GameGen-X (2024)          |
|----------------------------|----------------|--------------|----------------|-----------------|-----------------|--------------|---------------------------|
| Video Length               | Infinite       | 2s           | Infinite       | 6 Frames        | Infinite        | Infinite     | Infinite                  |
| Training Domain            | 2D Games       | 2D Games     | Atari, CS:GO   | Mario           | DOOM            | Minecraft    | 150+ AAA Games            |
| Resolution                 | 360p           | 360p         | 280 x 150      | 64 x 48         | 240p            | 720p         | 720p                      |
| Real-time at 320p (by FPS) | Yes            | No           | Yes            | No              | Yes             | Yes          | Yes                       |
| Open-domain Generation     | No             | No           | No             | No              | No              | No           | Yes                       |
| Control Ability            | Character      | Character    | Character      | Character       | Character       | Character    | Environment and Character |

Abstract

We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. The model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. It also provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built the Open-World Video Game Dataset (OGameData) from scratch. It is the first and largest dataset for open-world game video generation and control, comprising over one million diverse gameplay video clips with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process consisting of pre-training and instruction tuning. First, the model is pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality, open-domain game video generation. Then, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only InstructNet is updated while the pre-trained foundation model remains frozen, enabling the integration of interactive controllability without loss of the diversity and quality of the generated content. GameGen-X represents a significant leap forward in open-world game design using generative models. It demonstrates the potential of generative models to serve as auxiliary tools to traditional rendering techniques, effectively merging creative generation with interactive capabilities.
The project will be available at https://github.com/GameGen-X/GameGen-X.
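The two-stage recipe described above — pre-train a foundation model, then tune only the InstructNet branch while the backbone stays frozen — can be sketched in a framework-agnostic way. The class and parameter names below are illustrative placeholders, not the actual GameGen-X implementation:

```python
# Minimal sketch of instruction tuning with a frozen foundation model.
# Module, FoundationModel/InstructNet naming, and the single-parameter
# "models" are hypothetical stand-ins, not the authors' code.

class Module:
    """Tiny stand-in for a neural module: parameters plus a trainable flag."""
    def __init__(self, params, trainable=True):
        self.params = dict(params)
        self.trainable = trainable

    def apply_gradient_step(self, grads, lr=0.1):
        # Frozen modules silently discard gradients; trainable ones update.
        if not self.trainable:
            return
        for name, g in grads.items():
            self.params[name] -= lr * g

# Stage 1: pre-trained foundation model (text-to-video + continuation).
foundation = Module({"w_video": 1.0})

# Stage 2: freeze the foundation, attach a control branch, tune only it.
foundation.trainable = False
instruct_net = Module({"w_control": 0.5})

foundation.apply_gradient_step({"w_video": 0.2})     # no-op: backbone frozen
instruct_net.apply_gradient_step({"w_control": 0.3}) # 0.5 - 0.1*0.3 ≈ 0.47

print(foundation.params["w_video"])      # still 1.0
print(instruct_net.params["w_control"])  # ≈ 0.47
```

Because the backbone never changes, the generation quality and diversity learned in stage 1 are preserved by construction; only the control pathway adapts to user signals.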

High-quality Game Generation

Character Generation    

Environment Generation    

Action Generation    

Event Generation    

Open-domain Generation    

Multi-modality Interactive Control

Structural Instruction Prompts    

Operation Signals    

Video Prompts    

Qualitative Comparison

In-domain Generation Comparison    

Open-domain Generation Comparison    

Control Comparison    

OGameData Showcase


OGameData Summary: OGameData is a comprehensive multi-genre open-world video game dataset containing generation and control subsets. We sourced over 32,000 videos from local engines and the internet, each ranging from several minutes to several hours in length. The dataset features more than 150 next-generation games across various genres, including open-world RPGs, FPS, racing, action-puzzle games, and more. It also covers different perspectives (first-person, third-person) and styles (realistic, Eastern traditional, cyberpunk, post-apocalyptic, Western fantasy, etc.). After a rigorous six-month selection process involving multiple human experts and advanced model-based algorithms, we curated over 4,000 hours of high-quality video clips, ranging from 720p to 4K resolution. These segments were meticulously annotated by GPT-4o, providing a rich source of labeled data for training and validation. We expect OGameData to become an invaluable resource for researchers and developers, enabling the exploration of applications such as generative AI for video games, interactive control, and immersive virtual environments. Its imminent open-source release will offer the scientific community unprecedented access to a broad spectrum of video game data, fostering innovation and collaboration across multiple disciplines.
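The curation step described above (keeping only high-resolution clips of usable length from a large raw pool) can be illustrated with a simple metadata filter. The record fields and thresholds here are hypothetical examples, not the actual OGameData pipeline:

```python
# Illustrative clip-filtering pass over video metadata records.
# Field names ("height", "seconds") and thresholds are hypothetical,
# not the real OGameData curation criteria.

def filter_clips(records, min_height=720, min_seconds=4.0):
    """Keep clips that meet minimum resolution and duration requirements."""
    kept = []
    for rec in records:
        if rec["height"] >= min_height and rec["seconds"] >= min_seconds:
            kept.append(rec)
    return kept

clips = [
    {"game": "open-world RPG", "height": 2160, "seconds": 8.0},   # 4K, kept
    {"game": "racing",         "height": 480,  "seconds": 10.0},  # too low-res
    {"game": "FPS",            "height": 1080, "seconds": 2.0},   # too short
]
print(len(filter_clips(clips)))  # 1 clip survives
```

In the real pipeline such automatic checks were combined with six months of human expert review before the surviving clips were captioned by GPT-4o.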



Clarifications: The OGameData dataset is available for informational purposes only. The copyright remains with the original owners of the videos. All videos in the OGameData dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos. You agree not to reproduce, duplicate, copy, sell, trade, resell, or exploit for any commercial purpose any portion of the videos or any portion of derived data. You agree not to further copy, publish, or distribute any portion of the OGameData dataset.


OGameData for Generation Training    

OGameData for Instruction Tuning    

Acknowledgements: Our project page is borrowed from DreamBooth.