📄 Abstract
Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that, in addition to testing LLMs on CTI tasks under zero-shot learning, few-shot learning, and fine-tuning, also quantifies their consistency and their confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports, and present new evidence of potential security risks in relying on LLMs for CTI. We show that LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, raising doubts about the feasibility of using LLMs in CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
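The abstract does not specify how consistency and confidence are quantified. As a rough illustration only, one common approach is to query a model repeatedly on the same report and measure how often its answers agree, and to compare self-reported confidence against empirical accuracy. The sketch below is a minimal, hypothetical Python example of such metrics; the function names, the ATT&CK-style labels, and all numbers are illustrative assumptions, not the paper's actual metrics or data.

```python
import statistics
from collections import Counter


def consistency_score(labels: list[str]) -> float:
    """Fraction of repeated runs that agree with the majority answer.

    `labels` holds the labels an LLM assigned to the same report across
    repeated queries (a hypothetical setup, not the paper's exact metric).
    """
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


def overconfidence_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean self-reported confidence minus empirical accuracy.

    A positive gap suggests overconfidence: the model reports more
    confidence than its accuracy justifies.
    """
    mean_conf = statistics.mean(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


# Made-up numbers for illustration only.
runs = ["T1059", "T1059", "T1566", "T1059", "T1059"]
print(f"consistency: {consistency_score(runs):.2f}")                  # 0.80

confs = [0.95, 0.90, 0.92, 0.88]
hits = [True, False, False, True]
print(f"overconfidence gap: {overconfidence_gap(confs, hits):+.2f}")  # +0.41
```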
Key Contributions
This paper presents a novel evaluation methodology for LLMs in Cyber Threat Intelligence (CTI) that quantifies consistency and confidence. It provides new evidence of security risks, showing that state-of-the-art LLMs deliver insufficient performance on real-world CTI tasks and are inconsistent and overconfident, even with few-shot learning or fine-tuning.
Business Value
Highlights critical risks in deploying LLMs for cybersecurity, helping organizations avoid potential security breaches and misinformed decision-making. Guides organizations on the limitations and appropriate use cases for LLMs in CTI.