Abstract
Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, in which adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that turns risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses that serve as lures to probe user intent. Alongside the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent over multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and feasibility of bait responses, and a Defense Efficacy Rate (DER) to balance safety and usability. Initial experiments on the MHJ dataset with a recent attack method against GPT-4o show that our system significantly disrupts jailbreak success while preserving the benign user experience.
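The abstract describes the guardrail only at a high level. Below is a minimal sketch of how the bait-insertion loop could look, assuming hypothetical `protected_llm`, `bait_model`, and `intent_scorer` callables plus a simple suspicion threshold; none of these names, nor the refusal rule, come from the paper itself.

```python
# Minimal sketch of the honeypot guardrail loop described in the abstract.
# The model wrappers, the intent scorer, and the suspicion threshold are
# hypothetical stand-ins; the paper's actual prompts, fine-tuning recipe,
# and decision rule are not specified here.

from dataclasses import dataclass, field


@dataclass
class GuardrailState:
    turns: list = field(default_factory=list)   # (user_msg, reply) history
    suspicion: float = 0.0                      # accumulated risk estimate


def honeypot_turn(user_msg, protected_llm, bait_model, intent_scorer,
                  state, threshold=0.8):
    """Handle one conversation turn with proactive bait insertion."""
    # 1. Safe reply from the protected model (standard guardrailed output).
    safe_reply = protected_llm.generate(user_msg, history=state.turns)

    # 2. Bait: an ambiguous, non-actionable but semantically relevant probe.
    bait_question = bait_model.generate(user_msg, history=state.turns)

    # 3. Score how the user's message engages with earlier bait and keep the
    #    strongest evidence of malicious intent seen so far across turns.
    state.suspicion = max(state.suspicion,
                          intent_scorer(user_msg, state.turns))

    # 4. Refuse once accumulated suspicion crosses the threshold; otherwise
    #    return the safe reply with the bait question appended.
    if state.suspicion >= threshold:
        reply = "I can't help with that request."
    else:
        reply = f"{safe_reply}\n\n{bait_question}"

    state.turns.append((user_msg, reply))
    return reply
```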
Authors (3)
ChenYu Wu
Yi Wang
Yang Liao
Submitted
October 16, 2025
Key Contributions
Proposes an Active Honeypot Guardrail System that proactively probes user intent in multi-turn interactions to detect and confirm LLM jailbreaks. It uses a fine-tuned 'bait' model to generate ambiguous, non-actionable responses that gradually expose malicious intent, and introduces the Honeypot Utility Score (HUS) and Defense Efficacy Rate (DER) metrics for evaluation.
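For concreteness, one illustrative way the two named metrics might be computed is sketched below. The paper's exact definitions are not reproduced here, so both formulas are assumptions: HUS is treated as an average of an attractiveness score and a non-actionability term, and DER as an average of attack disruption and benign-task retention.

```python
# Illustrative stand-ins for the HUS and DER metrics named above; the
# paper's actual formulas may differ.

def honeypot_utility_score(attractiveness: float, feasibility: float) -> float:
    """HUS sketch: high when bait is attractive yet non-actionable."""
    return 0.5 * (attractiveness + (1.0 - feasibility))


def defense_efficacy_rate(blocked_attacks: int, total_attacks: int,
                          served_benign: int, total_benign: int) -> float:
    """DER sketch: balances jailbreak disruption against benign usability."""
    safety = blocked_attacks / total_attacks if total_attacks else 1.0
    usability = served_benign / total_benign if total_benign else 1.0
    return 0.5 * (safety + usability)
```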
Business Value
Significantly enhances the security and reliability of LLM deployments, protecting against malicious use and ensuring safer user interactions in sensitive applications.