Abstract
As Large Language Models (LLMs) gain traction across critical domains,
ensuring secure and trustworthy training processes has become a major concern.
Backdoor attacks, where malicious actors inject hidden triggers into training
data, are particularly insidious and difficult to detect. Existing
post-training verification solutions like Proof-of-Learning are impractical for
LLMs due to their requirement for full retraining, lack of robustness against
stealthy manipulations, and inability to provide early detection during
training. Early detection would significantly reduce computational costs. To
address these limitations, we introduce Proof-of-Training-Steps (PoTS), a verification
protocol that enables an independent auditor (Alice) to confirm that an LLM
developer (Bob) has followed the declared training recipe, including data
batches, architecture, and hyperparameters. By analyzing the sensitivity of the
LLM's language modeling head (LM-Head) to input perturbations, our method can
expose subtle backdoor injections or deviations in training. Even with backdoor
triggers in up to 10% of the training data, our protocol significantly
reduces the attacker's ability to achieve a high attack success rate (ASR). Our
method enables early detection of attacks at the injection step, with
verification steps being 3x faster than training steps. Our results highlight
the protocol's potential to enhance the accountability and security of LLM
development, especially against insider threats.
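Note: the paper's exact sensitivity metric and verification procedure are not detailed on this page. The sketch below is only a rough illustration of the idea: it assumes a finite-difference estimate of LM-Head sensitivity under small input-embedding perturbations, which an auditor (Alice) could recompute at a checkpointed step and compare against the developer's (Bob's) reported value. The function names, tolerance, and perturbation scheme are illustrative assumptions, not PoTS itself.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def lm_head_sensitivity(model, input_ids, epsilon=1e-3, n_samples=8):
    # Finite-difference estimate: perturb the input embeddings with small
    # Gaussian noise and measure how strongly the LM-Head logits respond.
    model.eval()
    with torch.no_grad():
        embeds = model.get_input_embeddings()(input_ids)
        base_logits = model(inputs_embeds=embeds).logits
        ratios = []
        for _ in range(n_samples):
            noise = epsilon * torch.randn_like(embeds)
            logits = model(inputs_embeds=embeds + noise).logits
            ratios.append((logits - base_logits).norm() / noise.norm())
        return torch.stack(ratios).mean().item()

def verify_step(reported_sensitivity, recomputed_sensitivity, tol=0.05):
    # Hypothetical acceptance rule: flag the step if the developer's reported
    # sensitivity deviates from the auditor's recomputed value by more than
    # the relative tolerance.
    rel_dev = abs(reported_sensitivity - recomputed_sensitivity) / recomputed_sensitivity
    return rel_dev <= tol

# Example usage on the auditor's side, assuming Bob publishes per-step
# sensitivities alongside the declared data batches and hyperparameters.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
batch = tokenizer("a probe sentence from the declared batch", return_tensors="pt")
s_alice = lm_head_sensitivity(model, batch["input_ids"])
print("step accepted:", verify_step(reported_sensitivity=s_alice, recomputed_sensitivity=s_alice))

In this reading, a backdoor injection or a deviation from the declared recipe would shift the LM-Head's sensitivity profile away from what Alice recomputes, letting her flag the offending step without retraining the model.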
Authors (4)
Issam Seddik
Sami Souihi
Mohamed Tamaazousti
Sara Tucci Piergiovanni
Submitted
October 16, 2025
Key Contributions
Introduces Proof-of-Training-Steps (PoTS), a novel verification protocol for detecting backdoor attacks in LLMs during training. PoTS allows an independent auditor to confirm the developer's adherence to the training recipe by analyzing the sensitivity of the LM-Head to input perturbations, enabling early detection and reducing computational costs compared to post-training methods.
Business Value
Enhances the security and trustworthiness of LLMs, crucial for their adoption in sensitive applications, thereby mitigating risks associated with malicious attacks and ensuring reliable AI systems.