Haoxuan Che1*, Xuanhua He2*, Quande Liu3✉, Cheng Jin1, Hao Chen1✉
1Hong Kong University of Science and Technology;
2University of Science and Technology of China;
3The Chinese University of Hong Kong
For any inquiries, please email hche@ust.hk or qdliu0226@gmail.com.
Overview of GameGen-X Functionality
Part I showcases the basic functionality, and Part II presents the key features of GameGen-X (0:47).
Demo: A Shout-Out to "Journey to the West"
Abstract
We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos.
This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events.
Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation.
To realize this vision, we first collected and built an Open-World Video Game Dataset (OGameData) from scratch.
It is the first and largest dataset for open-world game video generation and control, comprising over one million diverse gameplay video clips sampled from over 150 games, with informative captions generated by GPT-4o.
GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning.
Firstly, the model was pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation.
Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts.
This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity and quality of generated video content.
GameGen-X represents a significant leap forward in open-world video game design using generative models.
It demonstrates the potential of generative models to serve as auxiliary tools to traditional rendering techniques, effectively merging creative generation with interactive capabilities.
The project will be available at https://github.com/GameGen-X/GameGen-X.
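The instruction-tuning setup described in the abstract, where only the added control module is updated while the pre-trained foundation model stays frozen, can be illustrated with a deliberately tiny sketch. This is not the InstructNet architecture: scalars stand in for both networks, and the names `predict`, `tune_control_branch`, `base_w`, and `ctrl_w` are hypothetical, chosen only for this example.

```python
# Toy illustration only: scalars stand in for the frozen foundation model and
# the trainable control branch. Nothing here reflects the real architecture.

def predict(x, base_w, ctrl_w, signal):
    """Frozen base output plus a control-dependent adjustment."""
    return base_w * x + ctrl_w * signal


def tune_control_branch(data, base_w, ctrl_w=0.0, lr=0.1, epochs=200):
    """Gradient descent on squared error that updates ctrl_w only."""
    for _ in range(epochs):
        for x, signal, target in data:
            error = predict(x, base_w, ctrl_w, signal) - target
            # d(error^2)/d(ctrl_w) = 2 * error * signal; base_w gets no update.
            ctrl_w -= lr * 2 * error * signal
    return base_w, ctrl_w


BASE_W = 2.0  # hypothetical "pre-trained" weight, kept frozen

# Targets follow 2.0 * x + 0.5 * signal, so the ideal ctrl_w is 0.5.
data = [(1.0, 1.0, 2.5), (2.0, -1.0, 3.5), (0.5, 0.0, 1.0)]

base_after, ctrl_after = tune_control_branch(data, BASE_W)
print(base_after, round(ctrl_after, 6))
```

Because the base weight never receives a gradient step, whatever the base model produces without a control signal is unchanged after tuning, which is the property the abstract relies on to preserve generation quality and diversity.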
High-quality Game Generation
Character Generation
Geralt of Rivia
Arthur Morgan
Eivor
Jin Sakai
Astroneer
Ice Magician
RoboCop
Security Guard
Environment Generation
Spring
Summer
Autumn
Winter
Lake
Sea
Lavender Field
Pyramid
Action Generation
Motorcycling (first-person)
Driving
Flying
Sailing
Motorcycling (third-person)
Walking
Riding
Carriage
Event Generation
Raining
Snowing
Thundering
Sunrise
Firing
Sandstorming
Tsunami
Tornado
Open-domain Generation
Cybermonk roaming in Chinatown
TimeMaster standing in another dimension
Traveler with a cloak walking on Mars
Magic steam airship soaring in the clouds
Ghost walking under the blood moon
Venom Druid touring Runeforest
Angel looking at the Holy Kingdom
Mechanical life passing through the ruins
Multi-modality Interactive Control
Structural Instruction Prompts
Fire on the sky
Dark and star
Sunset happens
Fog emerging
Operation Signals
Move left (A)
Move right (D)
Move left (A)
Move right (D)
Video Prompts
Canny Prompt
Output Video 1
Motion Vector
Output Video
Qualitative Comparison
Generation Comparison
GameGen-X
OpenSora-Plan
OpenSora
CogVideoX
GameGen-X
OpenSora-Plan
OpenSora
CogVideoX
GameGen-X
OpenSora-Plan
OpenSora
CogVideoX
Control Comparison
GameGen-X
Luma
Kling
Tongyi
GameGen-X
Luma
Kling
Tongyi
GameGen-X
Luma
Kling
Tongyi
OGameData Showcase
OGameData Summary:
OGameData is a comprehensive multi-genre open-world video game dataset that contains a generation subset and a control subset.
It draws on over 32,000 videos sourced from local engines and the internet, each ranging from several minutes to several hours in length.
The dataset features more than 150 next-generation games across various genres, including open-world RPGs, FPS, racing games, action-puzzle games, and more.
It also covers different perspectives (first-person, third-person) and styles (realistic, Eastern traditional, cyberpunk, post-apocalyptic, Western fantasy, etc.).
After a rigorous selection process that spanned six months and involved multiple human experts and advanced model algorithms, we have curated over 4,000 hours of high-quality video clips, ranging from 720p to 4K resolution.
These segments were meticulously annotated by GPT-4o, providing a rich source of labeled data for training and validation purposes.
The OGameData is expected to become an invaluable resource for researchers and developers, enabling the exploration of various applications such as video game generative AI development, interactive control, and immersive virtual environments.
Its imminent open-source release will offer the scientific community unprecedented access to a broad spectrum of video game data, fostering innovation and collaboration across multiple disciplines.
OGameData for Generation Training
A person in a trench coat and hat walks along a riverbank, approaching wooden houses on a misty morning. In this atmospheric sequence from the action-adventure game Red Dead Redemption 2, Arthur Morgan is depicted walking along a serene riverbank, highlighted by his distinctive wide-brimmed white hat and dark trench coat. The environment features a tranquil riverside setting bathed in golden sunlight with mist lingering over distant forests. The camera tailing Arthur captures steady shots that subtly reveal more of the lush greenery and rustic buildings emerging on the left bank as he proceeds forward, reflecting the game's characteristic blend of exploration and attention to scenic detail.
Cars drive through a city intersection at sunset, with a horse statue visible in the background. In this sequence from the open-world game Grand Theft Auto V, a sleek black muscle car is seen navigating through a downtown intersection at dusk. The scene captures the essence of urban life with palm trees lining the streets and modern buildings in the background, one prominently featuring an imposing silver statue of a rearing horse. As the sky glows with a soft purple and pink hue, suggesting early evening, various camera angles provide expansive and cinematic views: starting with wide shots to establish the setting and transitioning into closer perspectives that capture details like traffic lights turning green and an approaching garbage truck on one side. The environment effectively conveys a rich, immersive atmosphere typical of GTA V’s detailed cityscapes.
OGameData for Instruction Tuning
Environmental Basics: Widen the path in front of the main character as they walk forward. Main Character: Move steadily along the path, decreasing distance to distant buildings. Environmental Changes: Enhance visibility and detail of approaching village structures over time. Sky/Lighting: Maintain clear skies and consistent daylight throughout. aesthetic score: 5.47, motion score: 15.37, camera motion: pan_right, perspective: third, shot size: full.
Environmental Basics: Show a lush, green countryside path lined with stone walls and trees under a sunset sky. Main Character: Have the main character riding steadily forward on horseback along the path. Environmental Changes: Slowly move the horse and rider deeper into the scene along the path. Sky/Lighting: Maintain consistent golden sunset lighting throughout. aesthetic score: 5.54, motion score: 8.45, camera motion: zoom_in, perspective: third, shot size: full.
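The two instruction-tuning samples above share a flat `field: value` layout. A minimal sketch of how such a string could be split into a record follows. The field list is inferred only from these two samples and may not match the full OGameData schema, and `parse_annotation` is a hypothetical helper, not part of any released tooling.

```python
import re

# Field names as they appear in the two samples shown; the complete schema
# may differ (assumption).
FIELD_NAMES = [
    "Environmental Basics", "Main Character", "Environmental Changes",
    "Sky/Lighting", "aesthetic score", "motion score", "camera motion",
    "perspective", "shot size",
]


def parse_annotation(text):
    """Split an annotation string into a dict keyed by field name."""
    pattern = "(" + "|".join(re.escape(n) for n in FIELD_NAMES) + "):"
    # With a capturing group, re.split yields
    # [prefix, name1, value1, name2, value2, ...].
    parts = re.split(pattern, text)
    record = {}
    for name, value in zip(parts[1::2], parts[2::2]):
        # Drop surrounding whitespace and the trailing separator (. or ,).
        record[name] = value.strip().rstrip(".,").strip()
    for key in ("aesthetic score", "motion score"):
        if key in record:
            record[key] = float(record[key])
    return record


sample = (
    "Environmental Basics: Show a lush, green countryside path lined with "
    "stone walls and trees under a sunset sky. Main Character: Have the main "
    "character riding steadily forward on horseback along the path. "
    "Environmental Changes: Slowly move the horse and rider deeper into the "
    "scene along the path. Sky/Lighting: Maintain consistent golden sunset "
    "lighting throughout. aesthetic score: 5.54, motion score: 8.45, "
    "camera motion: zoom_in, perspective: third, shot size: full."
)

record = parse_annotation(sample)
print(record["camera motion"], record["motion score"])
```

Splitting on the known field names rather than on every colon or comma keeps commas inside the free-text instructions (such as "a lush, green countryside path") from being mistaken for separators.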
Acknowledgements:
Our project page template is borrowed from DreamBooth.