
Paper type: Methodology/Security Paper
Intended audience: AI Security Researchers, LLM Developers, Auditors, Organizations deploying LLMs

PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models

📄 Abstract

As Large Language Models (LLMs) gain traction across critical domains, ensuring secure and trustworthy training processes has become a major concern. Backdoor attacks, in which malicious actors inject hidden triggers into training data, are particularly insidious and difficult to detect. Existing post-training verification solutions such as Proof-of-Learning are impractical for LLMs: they require full retraining, lack robustness against stealthy manipulations, and cannot provide early detection during training, which would significantly reduce computational costs. To address these limitations, we introduce Proof-of-Training-Steps (PoTS), a verification protocol that enables an independent auditor (Alice) to confirm that an LLM developer (Bob) has followed the declared training recipe, including data batches, architecture, and hyperparameters. By analyzing the sensitivity of the LLM's language modeling head (LM-Head) to input perturbations, our method can expose subtle backdoor injections or deviations in training. Even with backdoor triggers in up to 10 percent of the training data, our protocol significantly reduces the attacker's ability to achieve a high attack success rate (ASR). Our method enables early detection of attacks at the injection step, with verification steps running 3x faster than training steps. Our results highlight the protocol's potential to enhance the accountability and security of LLM development, especially against insider threats.
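
The core signal PoTS relies on is how sharply the LM-Head's output distribution shifts under small input perturbations. The paper's exact procedure is not reproduced on this page; the snippet below is a minimal sketch of one way such a sensitivity statistic could be computed, assuming a Hugging Face causal LM. The model name, noise scale, number of samples, and KL-based statistic are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the authors' code): measure how much the LM-Head's
# next-token distribution shifts when the input embeddings are slightly perturbed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def lm_head_sensitivity(text: str, noise_scale: float = 1e-3, n_samples: int = 8) -> float:
    """Average KL divergence between clean and perturbed next-token distributions."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)               # (1, seq, hidden)
    clean_logits = model(inputs_embeds=embeds).logits[:, -1, :]
    clean_logp = torch.log_softmax(clean_logits, dim=-1)

    kls = []
    for _ in range(n_samples):
        noisy = embeds + noise_scale * torch.randn_like(embeds)   # small Gaussian perturbation
        noisy_logp = torch.log_softmax(model(inputs_embeds=noisy).logits[:, -1, :], dim=-1)
        # KL(clean || noisy) over the vocabulary at the final position
        kls.append(torch.sum(clean_logp.exp() * (clean_logp - noisy_logp)).item())
    return sum(kls) / len(kls)

print(lm_head_sensitivity("The quick brown fox"))
```

A trigger-laden training batch would be expected to leave a different fingerprint in this kind of statistic than the declared clean batch, which is the intuition behind using LM-Head sensitivity as a verification signal.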
Authors (4)
Issam Seddik
Sami Souihi
Mohamed Tamaazousti
Sara Tucci Piergiovanni
Submitted
October 16, 2025
arXiv Category
cs.CR
arXiv PDF

Key Contributions

Introduces Proof-of-Training-Steps (PoTS), a novel verification protocol for detecting backdoor attacks in LLMs during training. PoTS allows an independent auditor to confirm the developer's adherence to the training recipe by analyzing the sensitivity of the LM-Head to input perturbations, enabling early detection and reducing computational costs compared to post-training methods.
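
As described, PoTS operates at the level of individual training steps rather than only after training completes. The loop below is a hedged sketch of one plausible shape for such an audit flow, assuming the developer discloses per-step batches and checkpoints; the record format, the injected `load_model` / `load_batch` / `sensitivity` callables, and the tolerance threshold are hypothetical placeholders, not the paper's specification.

```python
# Illustrative sketch (assumptions throughout): an auditor rechecking declared
# training steps and flagging ones whose recomputed statistic deviates from the report.
from dataclasses import dataclass

@dataclass
class StepRecord:
    step: int              # index of the training step
    batch_hash: str        # identifier/hash of the declared data batch
    checkpoint_path: str   # model weights after this step
    reported_stat: float   # developer-reported sensitivity statistic

def audit(records, load_model, load_batch, sensitivity, tol=0.05):
    """Return the training steps whose recomputed statistic deviates beyond `tol`."""
    flagged = []
    for rec in records:
        model = load_model(rec.checkpoint_path)   # checkpoint after step rec.step
        batch = load_batch(rec.batch_hash)        # declared batch for that step
        stat = sensitivity(model, batch)          # auditor recomputes the statistic
        if abs(stat - rec.reported_stat) > tol:   # mismatch => possible deviation/backdoor
            flagged.append(rec.step)
    return flagged
```

Because each check only involves recomputing a statistic rather than re-running the step, this style of audit is consistent with the paper's claim that verification is cheaper than training (reported as roughly 3x faster per step).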

Business Value

Enhances the security and trustworthiness of LLMs, which is crucial for their adoption in sensitive applications; this mitigates risks from malicious attacks, including insider threats, and supports reliable AI systems.