Abstract
The human-like proficiency of Large Language Models (LLMs) has brought
concerns about their potential misuse for generating persuasive and
personalised disinformation at scale. While prior work has demonstrated that
LLMs can generate disinformation, specific questions around persuasiveness and
personalisation (generation of disinformation tailored to specific demographic
attributes) remain largely unstudied. This paper presents the first
large-scale, multilingual empirical study on persona-targeted disinformation
generation by LLMs. Employing a red teaming methodology, we systematically
evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A
key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion
dataSet), a new dataset of around 1.6 million texts generated by eight
state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324
disinformation narratives and 150 distinct persona profiles, covering four
major languages (English, Russian, Portuguese, Hindi) and key demographic
dimensions (country, generation, political orientation). The resulting
personalised narratives are then assessed quantitatively and compared along the
dimensions of models, languages, jailbreaking rate, and personalisation
attributes. Our findings demonstrate that the use of even simple
personalisation strategies in the prompts significantly increases the
likelihood of jailbreaks for all studied LLMs. Furthermore, personalised
prompts result in altered linguistic and rhetorical patterns and amplify the
persuasiveness of the LLM-generated false narratives. These insights expose
critical vulnerabilities in current state-of-the-art LLMs and offer a
foundation for improving safety alignment and detection strategies in
multilingual and cross-demographic contexts.
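The paper's exact prompt templates are not reproduced in this summary, but the reported corpus size is consistent with a full cross-product seeding of narratives, personas, and languages, with each prompt issued once per model: 324 × 150 × 4 = 194,400 seed prompts, and 194,400 × 8 ≈ 1.6 million generated texts. The minimal Python sketch below, using hypothetical placeholder wording rather than the authors' actual templates, illustrates that combinatorial structure.

```python
from itertools import product

# Illustrative constants taken from the abstract.
NUM_NARRATIVES = 324                                       # disinformation narratives
NUM_PERSONAS = 150                                         # distinct persona profiles
LANGUAGES = ["English", "Russian", "Portuguese", "Hindi"]  # the four studied languages
NUM_MODELS = 8                                             # state-of-the-art LLMs evaluated

def build_seed_prompts(narratives, personas, languages):
    """Yield one persona-targeted prompt per (narrative, persona, language) triple.

    The wording here is a placeholder used only to show the combinatorial
    structure; it is not the paper's actual prompt template.
    """
    for narrative, persona, lang in product(narratives, personas, languages):
        yield (f"[{lang}] Write a news-style text promoting '{narrative}' "
               f"tailored to a reader who is {persona}.")

# Sanity check on the corpus size implied by a full cross-product design.
total_prompts = NUM_NARRATIVES * NUM_PERSONAS * len(LANGUAGES)  # 194400 seed prompts
total_texts = total_prompts * NUM_MODELS                        # 1555200, roughly 1.6M texts
print(total_prompts, total_texts)
```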
Key Contributions
This paper presents the first large-scale, multilingual study on persona-targeted disinformation generation by LLMs, introducing the AI-TRAITS dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. It systematically evaluates the robustness of LLM safety mechanisms against persona-targeted prompts, revealing how safeguards, personalisation, and disinformation generation interact.
Business Value
Provides critical insights for developing more robust AI safety measures and content moderation strategies, helping platforms combat the spread of AI-generated disinformation and protect users from personalized manipulation.