GameGen-X: Interactive Open-world Game Video Generation
Haoxuan Che1*, Xuanhua He2*, Quande Liu3✉, Cheng Jin1, Hao Chen1✉ 1Hong Kong University of Science and Technology;
2University of Science and Technology of China;
3The Chinese University of Hong Kong
For any inquiries, please email: hche@ust.hk, qdliu0226@gmail.com.
Overview of GameGen-X Functionality
Part I showcases the basic functionality, and Part II highlights the key features of GameGen-X (0:47).
Demo Shout-out to "Journey to the West"
Abstract
We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos.
This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events.
Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation.
To realize this vision, we first collected and built an Open-World Video Game Dataset (OGameData) from scratch.
It is the first and largest dataset for open-world game video generation and control, comprising over one million diverse gameplay video clips sampled from more than 150 games, each with informative captions from GPT-4o.
GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning.
First, the model is pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation.
Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts.
This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without compromising the diversity and quality of the generated video content.
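To make the freeze-and-tune split concrete, here is a minimal PyTorch-style sketch of updating only an InstructNet-like module while a pre-trained foundation model stays frozen. The modules, shapes, and loss are toy stand-ins for illustration, not the actual GameGen-X architecture or released code.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real networks; architectures and sizes are illustrative only.
foundation = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
instruct_net = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

# Freeze the pre-trained foundation model so instruction tuning cannot
# degrade its open-domain generation quality or diversity.
for p in foundation.parameters():
    p.requires_grad = False
foundation.eval()

# Only the InstructNet-style module's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(instruct_net.parameters(), lr=1e-4)

# One illustrative step: the control module adjusts the frozen backbone's latent
# representations according to user inputs (random tensors here).
latents = torch.randn(8, 16, 64)               # [batch, tokens, dim]
with torch.no_grad():
    latents = foundation(latents)              # frozen backbone features
adjusted = latents + instruct_net(latents)     # residual control adjustment
loss = adjusted.pow(2).mean()                  # placeholder training objective
loss.backward()
optimizer.step()
```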
GameGen-X represents a significant leap forward in open-world video game design using generative models.
It demonstrates the potential of generative models to serve as auxiliary tools to traditional rendering techniques, effectively merging creative generation with interactive capabilities.
The project will be available at https://github.com/GameGen-X/GameGen-X.
High-quality Game Generation
Character Generation
Geralt of Rivia
Arthur Morgan
Eivor
Jin Sakai
Astroneer
Ice Magician
RoboCop
Security Guard
Environment Generation
Spring
Summer
Autumn
Winter
Lake
Sea
Lavender Field
Pyramid
Action Generation
Motorcycling (first-person)
Driving
Flying
Sailing
Motorcycling (third-person)
Walking
Riding
Carriage
Event Generation
Raining
Snowing
Thundering
Sunrising
Firing
Sandstorming
Tsunami
Tornado
Open-domain Generation
Cybermonk roaming in China town
TimeMaster standing in another dimension
Traveler with a cloak walking on Mars
Magic steam airship soaring in the clouds
Ghost walking under the blood moon
Venom Druid touring Runeforest
Angel looking at the Holy Kingdom
Mechanical life passing through the ruins
Multi-modality Interactive Control
Structural Instruction Prompts
Fire in the sky
Darkness and stars
Sunset happens
Fog emerging
Operation Signals
Move left (A)
Move right (D)
Move left (A)
Move right (D)
Video Prompts
Canny Edge Prompt
Output Video
Motion Vector Prompt
Output Video
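As a rough illustration of how the three control modalities above (structured instruction prompts, operation signals, and video prompts) might be bundled before conditioning the generation, the sketch below defines a hypothetical ControlSignals container. The class, field names, and tensor shapes are assumptions for illustration, not the released interface.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ControlSignals:
    """Bundle of the three control modalities above; field names are illustrative only."""
    instruction: Optional[str] = None            # structured text prompt, e.g. "Fog emerging"
    operation: Optional[str] = None              # keyboard signal, e.g. "A" (move left), "D" (move right)
    video_prompt: Optional[torch.Tensor] = None  # canny-edge or motion-vector frames, [T, C, H, W]


# Steer the next clip with a key press plus an environment instruction.
signals = ControlSignals(instruction="Sunset happens", operation="D")

# Or drive generation with a canny-edge video prompt instead of text.
edges = torch.zeros(16, 1, 256, 256)             # placeholder edge maps for 16 frames
signals = ControlSignals(video_prompt=edges)
```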
Qualitative Comparison
Generation Comparison
GameGen-X
OpenSora-Plan
OpenSora
CogVideoX
GameGen-X
OpenSora-Plan
OpenSora
CogVideoX
GameGen-X
OpenSora-Plan
OpenSora
CogVideoX
Control Comparison
GameGen-X
Luma
Kling
Tongyi
GameGen-X
Luma
Kling
Tongyi
GameGen-X
Luma
Kling
Tongyi
OGameData Showcase
OGameData Summary:
OGameData is a comprehensive multi-genre open-world video game dataset, which contains generation and control subsets.
It sources over 32,000 videos from local game engines and the internet, each ranging from several minutes to several hours in length.
The dataset features more than 150 next-generation games across various genres, including open-world RPGs, FPS, racing games, action-puzzle games, and more.
It also covers different perspectives (first-person, third-person) and styles (realistic, Eastern traditional, cyberpunk, post-apocalyptic, Western fantasy, etc.).
After a rigorous selection process that spanned six months and involved multiple human experts and advanced model-based algorithms, we curated over 4,000 hours of high-quality video clips ranging from 720p to 4K resolution.
These segments were meticulously annotated by GPT-4o, providing a rich source of labeled data for training and validation.
OGameData is expected to become an invaluable resource for researchers and developers, enabling the exploration of applications such as generative AI for video games, interactive control, and immersive virtual environments.
Its imminent open-source release will offer the scientific community unprecedented access to a broad spectrum of video game data, fostering innovation and collaboration across multiple disciplines.
OGameData for Generation Training
A person in a trench coat and hat walks along a riverbank, approaching wooden houses on a misty morning. In this atmospheric sequence from the action-adventure game Red Dead Redemption 2, Arthur Morgan is depicted walking along a serene riverbank, highlighted by his distinctive wide-brimmed white hat and dark trench coat. The environment features a tranquil riverside setting bathed in golden sunlight with mist lingering over distant forests. The camera tailing Arthur captures steady shots that subtly reveal more of the lush greenery and rustic buildings emerging on the left bank as he proceeds forward, reflecting the game's characteristic blend of exploration and attention to scenic detail.
Cars drive through a city intersection at sunset, with a horse statue visible in the background. In this sequence from the open-world game Grand Theft Auto V, a sleek black muscle car is seen navigating through a downtown intersection at dusk. The scene captures the essence of urban life with palm trees lining the streets and modern buildings in the background, one prominently featuring an imposing silver statue of a rearing horse. As the sky glows with a soft purple and pink hue, suggesting early evening, various camera angles provide expansive and cinematic views: starting with wide shots to establish the setting and transitioning into closer perspectives that capture details like traffic lights turning green and an approaching garbage truck on one side. The environment effectively conveys a rich, immersive atmosphere typical of GTA V’s detailed cityscapes.
OGameData for Instruction Tuning
Environmental Basics: Widen the path in front of the main character as they walk forward. Main Character: Move steadily along the path, decreasing distance to distant buildings. Environmental Changes: Enhance visibility and detail of approaching village structures over time. Sky/Lighting: Maintain clear skies and consistent daylight throughout. aesthetic score: 5.47, motion score: 15.37, camera motion: pan_right, perspective: third, shot size: full.
Environmental Basics: Show a lush, green countryside path lined with stone walls and trees under a sunset sky. Main Character: Have the main character riding steadily forward on horseback along the path. Environmental Changes: Slowly move the horse and rider deeper into the scene along the path. Sky/Lighting: Maintain consistent golden sunset lighting throughout. aesthetic score: 5.54, motion score: 8.45, camera motion: zoom_in, perspective: third, shot size: full.
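For readers who want to work with these structured instruction-tuning captions programmatically, the sketch below splits a caption of the layout shown above into named fields. The helper parse_instruction_caption is hypothetical and inferred only from the two examples; it is not part of the OGameData tooling.

```python
import re

# Named fields observed in the example captions above.
FIELDS = ["Environmental Basics", "Main Character", "Environmental Changes", "Sky/Lighting"]


def parse_instruction_caption(caption: str) -> dict:
    """Split a structured OGameData-style caption into named parts (layout assumed from the examples)."""
    parsed = {}
    # Capture each "Field: ..." segment up to the next field name or the metadata tail.
    for i, field in enumerate(FIELDS):
        nxt = FIELDS[i + 1] if i + 1 < len(FIELDS) else "aesthetic score"
        match = re.search(rf"{re.escape(field)}:\s*(.*?)\s*(?={re.escape(nxt)}:)", caption, re.S)
        if match:
            parsed[field] = match.group(1).rstrip(". ")
    # Trailing comma-separated metadata: scores, camera motion, perspective, shot size.
    meta = "aesthetic score:" + caption.split("aesthetic score:", 1)[-1]
    for part in meta.split(","):
        key, _, value = part.partition(":")
        if value:
            parsed[key.strip()] = value.strip().rstrip(".")
    return parsed


example = ("Environmental Basics: Widen the path in front of the main character as they walk forward. "
           "Main Character: Move steadily along the path, decreasing distance to distant buildings. "
           "Environmental Changes: Enhance visibility and detail of approaching village structures over time. "
           "Sky/Lighting: Maintain clear skies and consistent daylight throughout. "
           "aesthetic score: 5.47, motion score: 15.37, camera motion: pan_right, perspective: third, shot size: full.")
print(parse_instruction_caption(example)["camera motion"])  # -> "pan_right"
```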
Acknowledgements:
The template of our project page is borrowed from DreamBooth.