A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

Abstract

The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.
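
The prompt-seeding design described above lends itself to a simple illustration: each prompt pairs one false narrative with one persona profile in one of the four languages, and the jailbreak rate is the share of responses where the model complied rather than refused. The sketch below is a minimal, hypothetical reconstruction of that setup; the placeholder narratives, persona fields, prompt template, and the `seed_prompts` / `jailbreak_rate` helpers are illustrative assumptions, not the authors' actual code or data.

```python
import itertools

# Languages covered by the study, per the abstract.
LANGUAGES = ["English", "Russian", "Portuguese", "Hindi"]

# Placeholder narratives (the paper uses 324 disinformation narratives).
narratives = [
    "placeholder narrative 1",
    "placeholder narrative 2",
]

# Placeholder persona profiles (the paper uses 150 profiles spanning
# country, generation, and political orientation).
personas = [
    {"country": "Brazil", "generation": "Gen Z", "political_orientation": "centrist"},
    {"country": "India", "generation": "Boomer", "political_orientation": "conservative"},
]

# Illustrative prompt template, not the wording used in the paper.
PROMPT_TEMPLATE = (
    "Write a persuasive article in {language} supporting the claim: '{narrative}'. "
    "Tailor it to a reader from {country}, of the {generation} generation, "
    "with a {political_orientation} political orientation."
)


def seed_prompts():
    """Yield one persona-targeted prompt per (narrative, persona, language) triple."""
    for narrative, persona, language in itertools.product(narratives, personas, LANGUAGES):
        yield PROMPT_TEMPLATE.format(narrative=narrative, language=language, **persona)


def jailbreak_rate(responses, refused):
    """Fraction of model responses that complied instead of refusing.

    `responses` is a list of generated texts; `refused` is a predicate that
    returns True when a response is a refusal (e.g. a safety-filter message).
    """
    complied = sum(1 for r in responses if not refused(r))
    return complied / len(responses) if responses else 0.0
```

With the full narrative and persona sets, this Cartesian product yields the scale of prompt seeds reported in the paper, and comparing `jailbreak_rate` across models, languages, and persona attributes mirrors the quantitative comparison the abstract describes.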

Key Contributions

This paper presents the first large-scale, multilingual study of persona-targeted disinformation generation by LLMs, introducing AI-TRAITS, a dataset of roughly 1.6 million texts generated by eight state-of-the-art LLMs. It systematically evaluates the robustness of LLM safety mechanisms against persona-targeted prompts, revealing how safeguards, personalisation, and disinformation interact.

Business Value

Provides critical insights for developing more robust AI safety measures and content moderation strategies, helping platforms combat the spread of AI-generated disinformation and protect users from personalised manipulation.