
Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents


Abstract

We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous SOTA model on open-world Minecraft tasks, is close to the generality of fresh humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling results at training time and test time confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.

Authors (27)

Zihao Wang

Xujing Li

Yining Ye

Junjie Fang

Haoming Wang

Longxiang Liu

+21 more

Submitted

October 27, 2025

arXiv Category

cs.AI


Key Contributions

Game-TARS is a generalist game agent trained with a unified, scalable action space anchored to human-aligned keyboard-mouse inputs, enabling large-scale continual pre-training across OS, web, and game domains. Key innovations include a decaying continual loss for reduced causal confusion and an efficient Sparse-Thinking strategy to balance reasoning depth and inference cost.
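The summary names a decaying continual loss but does not give its formula. The sketch below shows one plausible reading, stated as an assumption: when the same action repeats over consecutive steps, each further repeat receives a geometrically smaller loss weight, so the model gains less from simply copying its previous action. The function names and the `gamma` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def decaying_weights(actions: Sequence[str], gamma: float = 0.5) -> List[float]:
    """Assign weight gamma**k to the k-th consecutive repeat of an action.

    Assumption: this is an illustrative weighting scheme, not the
    paper's exact definition of its decaying loss.
    """
    weights: List[float] = []
    run = 0            # how many times the current action has repeated so far
    prev = object()    # sentinel that never equals a real action
    for action in actions:
        run = run + 1 if action == prev else 0
        weights.append(gamma ** run)
        prev = action
    return weights


def weighted_loss(per_step_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-step losses under the weights above."""
    total = sum(w * loss for w, loss in zip(weights, per_step_losses))
    return total / sum(weights)


# A run of three identical "w" presses, then a different action, then "w" again.
example_weights = decaying_weights(["w", "w", "w", "jump", "w"], gamma=0.5)
```

In this sketch the weight resets to 1.0 whenever the action changes, so steps where the agent must decide something new dominate the training signal.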

Business Value

Paves the way for more versatile AI agents that can automate complex tasks across various digital environments, from gaming to software control.

Paper Metadata

Innovation Type

Agent Architecture and Training Paradigm

Deployment Feasibility

Moderate. Training requires significant computational resources, and inference may as well, but the unified action space simplifies integration.

Limitations Addressed

Addresses the limitations of domain-specific agents and API/GUI-based approaches by creating a generalist agent capable of operating across diverse environments using natural human inputs.
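To make the idea of a unified keyboard-mouse action space concrete, here is a minimal sketch in which one small action vocabulary describes trajectories from a game, a browser, or a desktop application. The type names, fields, and the text serialization are all hypothetical illustrations. The summary does not describe the paper's actual action encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class KeyPress:
    """Press a set of keys together (hypothetical type)."""
    keys: Tuple[str, ...]  # e.g. ("w",) or ("ctrl", "c")


@dataclass(frozen=True)
class MouseMove:
    """Relative mouse movement in pixels (hypothetical type)."""
    dx: int
    dy: int


@dataclass(frozen=True)
class MouseClick:
    """Click a mouse button (hypothetical type)."""
    button: str  # "left", "right", or "middle"


Action = Union[KeyPress, MouseMove, MouseClick]


def serialize(action: Action) -> str:
    """Render an action as a flat string that a model could emit as text."""
    if isinstance(action, KeyPress):
        return "key(" + "+".join(action.keys) + ")"
    if isinstance(action, MouseMove):
        return f"move({action.dx},{action.dy})"
    if isinstance(action, MouseClick):
        return f"click({action.button})"
    raise TypeError(f"unsupported action: {action!r}")


# The same vocabulary covers walking in a game and copying text on a desktop.
trajectory: List[Action] = [
    KeyPress(("w",)),
    MouseMove(40, -10),
    MouseClick("left"),
    KeyPress(("ctrl", "c")),
]
encoded = [serialize(a) for a in trajectory]
```

Because no step refers to an application-specific API call or GUI element, trajectories from different environments can share one output format, which is the property the summary credits for scalable pre-training.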

Performance Gains

Achieves approximately 2 times the success rate on open-world Minecraft tasks compared to previous SOTA models.

Technical Tags

Foundation Models, Generalist Agents, Multimodal Agents, Unified Action Space, Continual Learning, Sparse Thinking, Human-Aligned Inputs, OS Interaction, Web Interaction, Game Playing, Minecraft, FPS Games

Research Topics

Artificial General Intelligence (AGI), Embodied AI, Multimodal Learning, Reinforcement Learning, Human-Computer Interaction

Methods & Architectures

Unified Action Space, Continual Pre-training, Decaying Continual Loss, Sparse-Thinking Strategy, Multimodal Input Processing, Foundation Models, Transformer-based Architectures

Applications & Tasks

Gaming, Robotics, Human-Computer Interaction, Operating Systems, Creating Generalist AI Agents, Scalable Agent Training, Handling Heterogeneous Domains, Balancing Reasoning Depth and Cost, Playing Diverse Games, Interacting with Operating Systems, Navigating Web Environments, Controlling Game Characters via Keyboard/Mouse

Datasets & Benchmarks

Benchmarks

Minecraft: ~2x success rate over SOTA • Web 3D Games: Close to human generality • FPS Benchmarks: Outperforms GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet

Success Rate, Generality, Performance Comparison

Related Fields

Reinforcement Learning, Artificial General Intelligence, Multimodal AI, Human-Computer Interaction, Game AI

Keywords

Foundation Models, Generalist Agent, Multimodal, Game Agent, Unified Action Space, Continual Learning, Sparse Thinking, Minecraft, OS Interaction, Web Interaction, AGI

Academic Context

Artificial General Intelligence (AGI), Embodied AI, Multimodal Learning, Reinforcement Learning, Human-Computer Interaction

Companies & Organizations

Companies Mentioned

Google (Gemini), OpenAI (GPT), Anthropic (Claude)

Commercial Potential

Potential Products

Advanced Game AI, Automated Software Agents, Virtual Assistants for Complex Tasks

Target Industries

Gaming, Software Development, Automation, Virtual Reality

Use Case Examples

An AI agent that can play a wide variety of video games, from strategy to FPS.

An agent that can navigate and interact with a desktop operating system like a human user.

Competitive Edge

Surpasses leading LLMs and prior SOTA game agents in generality and performance across diverse tasks by leveraging a unified action space and advanced training techniques.

Resource Requirements

Compute Needs

Very high compute requirements for pre-training on 500B tokens.

Data Requirements

Requires diverse multimodal data from OS, web, and games.

Deployment Constraints

Inference cost and complexity might be high for real-time applications.
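The trade-off behind the Sparse-Thinking strategy can be illustrated with simple arithmetic. The sketch below assumes that the agent emits a short action at every step and adds a longer reasoning trace on only a fraction of steps. The function name and all token counts are hypothetical and are not reported in the paper.

```python
def mean_tokens_per_step(think_fraction: float,
                         action_tokens: int,
                         reasoning_tokens: int) -> float:
    """Expected generated tokens per step when only some steps include reasoning.

    Assumption: every step costs `action_tokens`, and a fraction
    `think_fraction` of steps costs an extra `reasoning_tokens`.
    """
    if not 0.0 <= think_fraction <= 1.0:
        raise ValueError("think_fraction must be between 0 and 1")
    return action_tokens + think_fraction * reasoning_tokens


# Hypothetical numbers: 10 tokens per action, 200 tokens per reasoning trace.
always_think = mean_tokens_per_step(1.0, 10, 200)
sparse_think = mean_tokens_per_step(0.25, 10, 200)
never_think = mean_tokens_per_step(0.0, 10, 200)
```

Under these made-up numbers, reasoning on a quarter of the steps generates 60 tokens per step on average instead of 210, which is the kind of saving that matters for real-time use.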

Scalability

The unified action space and continual learning paradigm are designed for scalability across domains and tasks.

Production Readiness

Maturity Level

Research
