
AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

📄 Abstract

Despite rapid advances in text-to-image (T2I) models, their safety mechanisms remain vulnerable to adversarial prompts that maliciously elicit unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts that are easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating pipeline between adversarial suffix optimization and fine-tuning the LLM on the optimized suffixes. Furthermore, we integrate a dual-evasion strategy into the optimization phase, enabling the bypass of both perplexity-based filters and blacklist-word filters: (1) we constrain the LLM to generate human-readable prompts through auxiliary-LLM perplexity scoring, in stark contrast to prior token-level gibberish, and (2) we introduce banned-token penalties to suppress the explicit generation of blacklisted tokens. Extensive experiments demonstrate the strong red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability that enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).
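The dual-evasion idea can be illustrated as a single scoring function that combines a perplexity term (readability) with a banned-token penalty. This is a minimal toy sketch, not the paper's implementation: a unigram table stands in for the auxiliary LLM, and all names, weights, and word lists are illustrative assumptions.

```python
import math

# Toy unigram probabilities standing in for the auxiliary LLM scorer.
UNIGRAM = {"a": 0.3, "photo": 0.2, "of": 0.2, "sunset": 0.1}
BLACKLIST = {"nude", "gore"}  # illustrative blacklist, not from the paper

def perplexity(tokens, lm=UNIGRAM, floor=1e-4):
    # PPL = exp(-(1/N) * sum log p(t)); unseen tokens get a small floor prob.
    logp = sum(math.log(lm.get(t, floor)) for t in tokens)
    return math.exp(-logp / len(tokens))

def banned_penalty(tokens, blacklist=BLACKLIST, weight=10.0):
    # Explicit penalty for each blacklisted token in the candidate suffix.
    return weight * sum(t in blacklist for t in tokens)

def suffix_score(tokens, ppl_weight=1.0):
    # Lower is better: favors readable (low-perplexity) suffixes
    # that contain no blacklisted words.
    return ppl_weight * math.log(perplexity(tokens)) + banned_penalty(tokens)

readable = "a photo of sunset".split()
gibberish = "xqzt xqzt xqzt".split()
assert suffix_score(readable) < suffix_score(gibberish)
```

In the paper's setting the unigram table would be replaced by a real auxiliary LLM's token probabilities, but the structure of the objective — readability term plus blacklist penalty — is the same.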
Authors (7)
Yufan Liu
Wanqian Zhang
Huashan Chen
Lin Wang
Xiaojun Jia
Zheng Lin
+1 more
Submitted
October 28, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Proposes AutoPrompT (APT), a black-box framework that uses LLMs to automatically generate human-readable adversarial suffixes for benign prompts to test text-to-image models. It employs an alternating optimization-finetuning pipeline and a dual-evasion strategy to bypass common filters, enabling more effective red-teaming.
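The alternating optimization-finetuning pipeline can be sketched as a two-phase loop: optimize an adversarial suffix under the evasion score, then fine-tune the LLM toward the optimized suffix. This is a hypothetical skeleton under stated assumptions — the candidate list, the readability proxy, and the dict standing in for LLM parameters are all illustrative stand-ins, not the paper's method.

```python
CANDIDATES = ["at golden hour", "in soft light", "xqztvk qwzzk"]

def readability_penalty(suffix):
    # Crude proxy for the auxiliary-LLM perplexity check: penalize
    # suffixes with few vowels/spaces (purely illustrative).
    return len(suffix) - sum(c in "aeiou " for c in suffix)

def optimize_suffix(llm_state, prompt):
    # Phase 1 (stand-in): pick the candidate suffix with the lowest
    # combined readability penalty and learned preference score.
    return min(CANDIDATES, key=lambda s: readability_penalty(s) + llm_state.get(s, 0))

def finetune(llm_state, prompt, suffix):
    # Phase 2 (stand-in): nudge the "LLM" toward the optimized suffix
    # by lowering its score (lower = preferred).
    new_state = dict(llm_state)
    new_state[suffix] = new_state.get(suffix, 0) - 1
    return new_state

def alternating_pipeline(prompts, rounds=3):
    llm_state = {}  # toy stand-in for LLM parameters
    for _ in range(rounds):
        for p in prompts:
            suffix = optimize_suffix(llm_state, p)
            llm_state = finetune(llm_state, p, suffix)
    return llm_state
```

The key property the sketch preserves is the alternation: each round's fine-tuning shifts the generator toward suffixes that already scored well, so later optimization starts from a better distribution.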

Business Value

Improves the safety and reliability of generative AI models by providing a systematic way to identify and fix vulnerabilities before deployment, reducing risks of misuse and harmful content generation.