📄 Abstract
Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that, in addition to testing LLMs on CTI tasks under zero-shot learning, few-shot learning, and fine-tuning, also quantifies their consistency and their confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports, and present new evidence of potential security risks in relying on LLMs for CTI. We show that LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, raising doubts about the feasibility of using LLMs in CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
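The abstract does not specify how consistency and confidence are quantified. As a rough illustration only, one common approach is to query a model repeatedly on the same report and measure how often its answers agree, and to compare self-reported confidence against empirical accuracy. The sketch below is a minimal, hypothetical Python example of such metrics; the function names, the ATT&CK-style labels, and all numbers are illustrative assumptions, not the paper's actual metrics or data.

```python
import statistics
from collections import Counter


def consistency_score(labels: list[str]) -> float:
    """Fraction of repeated runs that agree with the majority answer.

    `labels` holds the labels an LLM assigned to the same report across
    repeated queries (a hypothetical setup, not the paper's exact metric).
    """
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


def overconfidence_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean self-reported confidence minus empirical accuracy.

    A positive gap suggests overconfidence: the model reports more
    confidence than its accuracy justifies.
    """
    mean_conf = statistics.mean(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


# Made-up numbers for illustration only.
runs = ["T1059", "T1059", "T1566", "T1059", "T1059"]
print(f"consistency: {consistency_score(runs):.2f}")                  # 0.80

confs = [0.95, 0.90, 0.92, 0.88]
hits = [True, False, False, True]
print(f"overconfidence gap: {overconfidence_gap(confs, hits):+.2f}")  # +0.41
```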
Key Contributions
This paper presents a novel evaluation methodology for LLMs in Cyber Threat Intelligence (CTI) that quantifies consistency and confidence. It provides new evidence of security risks, showing that state-of-the-art LLMs deliver insufficient performance on real-world CTI tasks and are inconsistent and overconfident, even with few-shot learning or fine-tuning.
Business Value
Highlights critical risks in deploying LLMs for cybersecurity, helping organizations avoid potential security breaches and misinformed decision-making. Guides organizations on the limitations and appropriate use cases for LLMs in CTI.