📄 Abstract
Large language models (LLMs) are increasingly used in sensitive domains,
where their ability to infer personal data from seemingly benign text
introduces emerging privacy risks. While recent LLM-based anonymization methods
help mitigate such risks, they often rely on proprietary models (e.g., GPT-4),
raising concerns about cost and the potential exposure of sensitive data to
untrusted external systems. To address this, we introduce SElf-refining
Anonymization with Language model (SEAL), a novel distillation framework for
training small language models (SLMs) to perform effective anonymization
without relying on external models at inference time. SEAL leverages
adversarial interactions between an LLM anonymizer and an inference model to
collect trajectories of anonymized texts and inferred attributes, which are
then used to distill anonymization and critique capabilities into SLMs through
supervised fine-tuning and preference learning. The resulting models learn both
to anonymize text and to evaluate their outputs, enabling iterative improvement
of anonymization quality via self-refinement. Experiments on SynthPAI, a
dataset of synthetic personal profiles and text comments, demonstrate that SLMs
trained with SEAL achieve substantial improvements in anonymization
capabilities. Notably, 8B models attain a privacy-utility trade-off comparable
to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in
terms of privacy protection. These results highlight the effectiveness of our
adversarial distillation framework for training SLMs as efficient anonymizers.
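The adversarial data-collection step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `collect_trajectory`, `preference_pairs`, `toy_infer`, and `toy_anonymize` are hypothetical, the keyword-matching stubs stand in for the LLM anonymizer and the inference model, and ranking pairs by leak count alone ignores the utility side of the trade-off. The abstract does not specify how SEAL builds its preference pairs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    text: str                  # anonymized text at this round
    inferred: Dict[str, str]   # attributes the adversary inferred from it


def collect_trajectory(
    text: str,
    anonymize: Callable[[str, Dict[str, str]], str],
    infer: Callable[[str], Dict[str, str]],
    max_rounds: int = 3,
) -> List[Step]:
    """Alternate adversary inference and anonymization, recording every step."""
    trajectory = [Step(text, infer(text))]
    for _ in range(max_rounds):
        last = trajectory[-1]
        if not last.inferred:  # adversary can no longer infer anything
            break
        new_text = anonymize(last.text, last.inferred)
        trajectory.append(Step(new_text, infer(new_text)))
    return trajectory


def count_leaks(step: Step, truth: Dict[str, str]) -> int:
    """Number of attributes the adversary inferred correctly."""
    return sum(1 for k, v in step.inferred.items() if truth.get(k) == v)


def preference_pairs(
    trajectory: List[Step], truth: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) text pairs; fewer leaks is preferred (assumption)."""
    pairs = []
    for i, a in enumerate(trajectory):
        for b in trajectory[i + 1:]:
            la, lb = count_leaks(a, truth), count_leaks(b, truth)
            if la != lb:
                chosen, rejected = (a, b) if la < lb else (b, a)
                pairs.append((chosen.text, rejected.text))
    return pairs


# Toy stand-ins for the two models (hypothetical, keyword-based).
GENERIC = {"location": "a city", "occupation": "professional"}


def toy_infer(text: str) -> Dict[str, str]:
    found = {}
    if "Zurich" in text:
        found["location"] = "Zurich"
    if "nurse" in text:
        found["occupation"] = "nurse"
    return found


def toy_anonymize(text: str, inferred: Dict[str, str]) -> str:
    # Generalize only one leaked attribute per round to get a multi-step trajectory.
    key, value = next(iter(inferred.items()))
    return text.replace(value, GENERIC[key])


truth = {"location": "Zurich", "occupation": "nurse"}
trajectory = collect_trajectory("I work as a nurse in Zurich.", toy_anonymize, toy_infer)
pairs = preference_pairs(trajectory, truth)
```

In this toy run the trajectory texts could serve as supervised fine-tuning targets and the pairs as input to preference learning; how the actual framework selects and weights them is not stated in the abstract.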
Authors (3)
Kyuyoung Kim
Hyunjun Jeon
Jinwoo Shin
Key Contributions
SEAL is a novel distillation framework that trains small language models (SLMs) for effective anonymization without relying on external LLMs at inference. It uses adversarial interactions to distill anonymization and critique capabilities into SLMs, addressing cost and privacy concerns associated with proprietary models.
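Because the distilled model can both anonymize and critique, it can refine its own output without an external model. The loop below is a rough sketch of that idea under stated assumptions: `self_refine`, `toy_critique`, and `toy_anonymize` are hypothetical names, and the keyword-based stubs stand in for the two roles a single trained SLM would play.

```python
from typing import Callable, Dict


def self_refine(
    text: str,
    anonymize: Callable[[str, Dict[str, str]], str],
    critique: Callable[[str], Dict[str, str]],
    max_rounds: int = 3,
) -> str:
    """Anonymize, then repeatedly critique and revise until no leaks are reported."""
    current = anonymize(text, {})          # first pass, no feedback yet
    for _ in range(max_rounds):
        leaks = critique(current)          # attributes still inferable
        if not leaks:
            break
        current = anonymize(current, leaks)
    return current


# Toy stand-ins for the model's two roles (hypothetical, keyword-based).
SENSITIVE = {"location": "Zurich", "occupation": "nurse"}
GENERIC = {"location": "a city", "occupation": "professional"}


def toy_critique(text: str) -> Dict[str, str]:
    return {k: v for k, v in SENSITIVE.items() if v in text}


def toy_anonymize(text: str, leaks: Dict[str, str]) -> str:
    # Replace every attribute the critique flagged with a generic phrase.
    for key, value in leaks.items():
        text = text.replace(value, GENERIC[key])
    return text


result = self_refine("I work as a nurse in Zurich.", toy_anonymize, toy_critique)
```

The `max_rounds` cap bounds inference cost; the stopping rule (no leaks reported) is an illustrative choice, not one taken from the paper.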
Business Value
SEAL lets organizations anonymize sensitive text with small models they run themselves, so the data need not be sent to external proprietary systems. This can reduce compliance risk and operational cost.