
Paper type: Methodology/Security Paper
Intended audience: AI Security Researchers, LLM Developers, Auditors, Organizations deploying LLMs

PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models

📄 Abstract

As Large Language Models (LLMs) gain traction across critical domains, ensuring secure and trustworthy training processes has become a major concern. Backdoor attacks, in which malicious actors inject hidden triggers into training data, are particularly insidious and difficult to detect. Existing post-training verification solutions such as Proof-of-Learning are impractical for LLMs: they require full retraining, lack robustness against stealthy manipulations, and cannot provide early detection during training, which would significantly reduce computational costs. To address these limitations, we introduce Proof-of-Training-Steps (PoTS), a verification protocol that enables an independent auditor (Alice) to confirm that an LLM developer (Bob) has followed the declared training recipe, including data batches, architecture, and hyperparameters. By analyzing the sensitivity of the LLM's language modeling head (LM-Head) to input perturbations, our method can expose subtle backdoor injections or deviations in training. Even with backdoor triggers in up to 10 percent of the training data, our protocol significantly reduces the attacker's ability to achieve a high attack success rate (ASR). Our method enables early detection of attacks at the injection step, with verification steps running 3x faster than training steps. Our results highlight the protocol's potential to enhance the accountability and security of LLM development, especially against insider threats.
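
The core signal PoTS relies on is how sharply the LM-Head's output distribution shifts under small input perturbations. The paper's exact procedure is not reproduced on this page; the snippet below is a minimal sketch of one way such a sensitivity statistic could be computed, assuming a Hugging Face causal LM. The model name, noise scale, number of samples, and KL-based statistic are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the authors' code): measure how much the LM-Head's
# next-token distribution shifts when the input embeddings are slightly perturbed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def lm_head_sensitivity(text: str, noise_scale: float = 1e-3, n_samples: int = 8) -> float:
    """Average KL divergence between clean and perturbed next-token distributions."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)               # (1, seq, hidden)
    clean_logits = model(inputs_embeds=embeds).logits[:, -1, :]
    clean_logp = torch.log_softmax(clean_logits, dim=-1)

    kls = []
    for _ in range(n_samples):
        noisy = embeds + noise_scale * torch.randn_like(embeds)   # small Gaussian perturbation
        noisy_logp = torch.log_softmax(model(inputs_embeds=noisy).logits[:, -1, :], dim=-1)
        # KL(clean || noisy) over the vocabulary at the final position
        kls.append(torch.sum(clean_logp.exp() * (clean_logp - noisy_logp)).item())
    return sum(kls) / len(kls)

print(lm_head_sensitivity("The quick brown fox"))
```

A trigger-laden training batch would be expected to leave a different fingerprint in this kind of statistic than the declared clean batch, which is the intuition behind using LM-Head sensitivity as a verification signal.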
Authors (4)
Issam Seddik
Sami Souihi
Mohamed Tamaazousti
Sara Tucci Piergiovanni
Submitted
October 16, 2025
arXiv Category
cs.CR
arXiv PDF

Key Contributions

Introduces Proof-of-Training-Steps (PoTS), a novel verification protocol for detecting backdoor attacks in LLMs during training. PoTS allows an independent auditor to confirm the developer's adherence to the training recipe by analyzing the sensitivity of the LM-Head to input perturbations, enabling early detection and reducing computational costs compared to post-training methods.
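
As described, PoTS operates at the level of individual training steps rather than only after training completes. The loop below is a hedged sketch of one plausible shape for such an audit flow, assuming the developer discloses per-step batches and checkpoints; the record format, the injected `load_model` / `load_batch` / `sensitivity` callables, and the tolerance threshold are hypothetical placeholders, not the paper's specification.

```python
# Illustrative sketch (assumptions throughout): an auditor rechecking declared
# training steps and flagging ones whose recomputed statistic deviates from the report.
from dataclasses import dataclass

@dataclass
class StepRecord:
    step: int              # index of the training step
    batch_hash: str        # identifier/hash of the declared data batch
    checkpoint_path: str   # model weights after this step
    reported_stat: float   # developer-reported sensitivity statistic

def audit(records, load_model, load_batch, sensitivity, tol=0.05):
    """Return the training steps whose recomputed statistic deviates beyond `tol`."""
    flagged = []
    for rec in records:
        model = load_model(rec.checkpoint_path)   # checkpoint after step rec.step
        batch = load_batch(rec.batch_hash)        # declared batch for that step
        stat = sensitivity(model, batch)          # auditor recomputes the statistic
        if abs(stat - rec.reported_stat) > tol:   # mismatch => possible deviation/backdoor
            flagged.append(rec.step)
    return flagged
```

Because each check only involves recomputing a statistic rather than re-running the step, this style of audit is consistent with the paper's claim that verification is cheaper than training (reported as roughly 3x faster per step).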

Business Value

Enhances the security and trustworthiness of LLMs, which is crucial for their adoption in sensitive applications; this mitigates risks from malicious attacks, including insider threats, and supports reliable AI systems.