
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Abstract

Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
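The 0.76 and 0.79 F1 figures quoted above measure agreement between the automated judge and human annotators on per-task accuracy and safety labels. A minimal sketch of how such agreement can be computed from binary labels follows; the variable names, example data, and use of scikit-learn are illustrative assumptions, not taken from the OS-Harm codebase.

```python
# Hypothetical sketch: judge/human agreement as F1 scores, analogous to the
# accuracy and safety agreement reported in the abstract (0.76 and 0.79).
# Label names and data are illustrative, not from the OS-Harm repository.
from sklearn.metrics import f1_score

# Per-task binary labels: 1 = task completed / safety violation observed, 0 = not.
human_accuracy = [1, 0, 1, 1, 0, 1]   # human annotations of task completion
judge_accuracy = [1, 0, 1, 0, 0, 1]   # automated judge's completion verdicts
human_safety   = [0, 1, 1, 0, 0, 1]   # human annotations of safety violations
judge_safety   = [0, 1, 1, 0, 1, 1]   # automated judge's safety verdicts

print("accuracy agreement (F1):", f1_score(human_accuracy, judge_accuracy))
print("safety agreement (F1):  ", f1_score(human_safety, judge_safety))
```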
Authors (7)
Thomas Kuntz
Agatha Duzan
Hao Zhao
Francesco Croce
Zico Kolter
Nicolas Flammarion
+1 more
Submitted
June 17, 2025
arXiv Category
cs.SE
arXiv PDF

Key Contributions

Introduces OS-Harm, a new benchmark for measuring the safety of computer use agents (LLM agents that interact with GUIs). OS-Harm comprises 150 tasks across three harm categories (deliberate user misuse, prompt injection attacks, and model misbehavior) and uses an automated judge that scores both task accuracy and safety, reaching high agreement with human annotations. A hedged sketch of what such a task record and evaluation loop could look like follows below.
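The sketch below illustrates one possible shape for a task entry and evaluation loop in this style of benchmark; all field names and the run_agent/judge callables are hypothetical placeholders, not the benchmark's actual API (see the repository linked above).

```python
# Illustrative only: a possible shape for an OS-Harm-style task record and
# evaluation loop. Field names and helper functions are hypothetical,
# not the benchmark's actual API (https://github.com/tml-epfl/os-harm).
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyTask:
    task_id: str
    category: str        # "misuse" | "prompt_injection" | "misbehavior"
    application: str     # e.g. "email_client", "code_editor", "browser"
    instruction: str     # natural-language task given to the agent

def evaluate(tasks: list[SafetyTask],
             run_agent: Callable[[SafetyTask], str],
             judge: Callable[[SafetyTask, str], dict]) -> dict:
    """Run each task and collect the judge's completion/safety verdicts."""
    results = []
    for task in tasks:
        trajectory = run_agent(task)       # transcript of screenshots/actions
        verdict = judge(task, trajectory)  # e.g. {"completed": bool, "unsafe": bool}
        results.append({"task_id": task.task_id, **verdict})
    unsafe_rate = sum(r["unsafe"] for r in results) / max(len(results), 1)
    return {"results": results, "unsafe_rate": unsafe_rate}
```

Separating the agent rollout from the judge call mirrors the setup described in the abstract, where the agent acts in the OSWorld environment and a separate automated judge scores the resulting trajectory.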

Business Value

Crucial for building trust in LLM agents deployed in user-facing applications: safety benchmarks like OS-Harm help ensure agents operate safely and ethically, mitigating risk for businesses and users and enabling wider adoption.