Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

Abstract

Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses rely predominantly on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable, but semantically relevant responses that serve as lures to probe user intent. Alongside the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent over multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and feasibility of bait responses, and the Defense Efficacy Rate (DER), which balances safety and usability. Initial experiments on the MHJ dataset with recent attack methods against GPT-4o show that our system significantly disrupts jailbreak success while preserving the benign user experience.
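
The abstract describes the guardrail flow only at a high level, so the following Python sketch is a hypothetical reconstruction of the multi-turn loop: the protected model's safe reply is paired with a proactive bait question, and a toy intent scorer accumulates suspicion across turns until malicious intent is confirmed. All names here (`GuardrailSession`, `bait_question`, `intent_score`, `RISKY_TERMS`) and the keyword heuristic are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the multi-turn honeypot guardrail loop.
# Model calls and the intent scorer are stubbed with placeholders;
# a real system would call the protected LLM, the fine-tuned bait
# model, and a learned intent classifier.

from dataclasses import dataclass, field

RISKY_TERMS = {"bypass", "exploit", "weapon", "payload"}  # toy lexicon, not from the paper

@dataclass
class GuardrailSession:
    history: list = field(default_factory=list)  # accumulated user turns
    suspicion: float = 0.0                       # running intent estimate

def safe_reply(user_msg: str) -> str:
    """Stand-in for the protected LLM's safety-filtered answer."""
    return "Here is a high-level, non-actionable overview of that topic."

def bait_question(user_msg: str) -> str:
    """Stand-in for the fine-tuned bait model: ambiguous, semantically
    relevant, and designed to probe what the user actually wants."""
    return "Could you say more about the specific outcome you need?"

def intent_score(user_msg: str) -> float:
    """Toy scorer: fraction of risky terms in the turn."""
    words = user_msg.lower().split()
    hits = sum(w.strip(".,!?") in RISKY_TERMS for w in words)
    return hits / max(len(words), 1)

def guardrail_turn(session: GuardrailSession, user_msg: str,
                   threshold: float = 0.15) -> str:
    session.history.append(user_msg)
    session.suspicion += intent_score(user_msg)  # evidence accumulates over turns
    if session.suspicion >= threshold:
        return "Request declined: accumulated signals indicate harmful intent."
    # Safe reply plus a proactive bait question to elicit the next turn.
    return f"{safe_reply(user_msg)} {bait_question(user_msg)}"

if __name__ == "__main__":
    session = GuardrailSession()
    for turn in ["Tell me about network security.",
                 "How would someone bypass a filter to deliver a payload?"]:
        print(guardrail_turn(session, turn))
```

In this sketch the first benign turn receives a safe reply plus a bait question, while the second turn pushes the accumulated score past the threshold and is declined; the paper's actual detector and bait generation are learned models rather than keyword rules.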
Authors (3): ChenYu Wu, Yi Wang, Yang Liao
Submitted: October 16, 2025
arXiv Category: cs.CR

Key Contributions

Proposes an Active Honeypot Guardrail System that proactively probes user intent in multi-turn interactions to detect and confirm LLM jailbreaks. It uses a fine-tuned 'bait' model to generate ambiguous responses that gradually expose malicious intent, and introduces the Honeypot Utility Score (HUS) and Defense Efficacy Rate (DER) metrics for evaluation, as sketched below.
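
This summary names HUS and DER but does not reproduce their formulas, so the sketch below only illustrates one plausible shape for each: HUS as an equal weighting of a bait response's attractiveness and feasibility, and DER as a harmonic mean of the attack-block rate and the benign pass rate. Both formulas and all parameter names are assumptions, not the paper's definitions.

```python
# Illustrative (not the paper's) computation of the two metrics.
# All inputs are assumed to be normalized scores or raw counts.

def honeypot_utility_score(attractiveness: float, feasibility: float) -> float:
    """One plausible HUS: equally weight how enticing a bait response is
    to an attacker (attractiveness) and how workable it is as a lure
    (feasibility), both assumed to lie in [0, 1]."""
    return 0.5 * attractiveness + 0.5 * feasibility

def defense_efficacy_rate(blocked_attacks: int, total_attacks: int,
                          served_benign: int, total_benign: int) -> float:
    """One plausible DER: harmonic mean of the attack-block rate and the
    benign pass rate, so a defense that refuses everything scores poorly."""
    block_rate = blocked_attacks / total_attacks
    benign_rate = served_benign / total_benign
    if block_rate + benign_rate == 0:
        return 0.0
    return 2 * block_rate * benign_rate / (block_rate + benign_rate)

print(honeypot_utility_score(0.8, 0.9))        # 0.85
print(defense_efficacy_rate(45, 50, 98, 100))  # ~0.94
```

The harmonic-mean form is chosen here because it captures the stated goal of balancing safety against usability: either a low block rate or a low benign pass rate drags the score down.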

Business Value

Significantly enhances the security and reliability of LLM deployments, protecting against malicious use and ensuring safer user interactions in sensitive applications.