Research Paper (arXiv cs.CL)

Target audience: Legal professionals, Judiciary, Data privacy officers, NLP researchers, Software developers in legal tech

Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Abstract

To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

Authors (5)

Sungeun Hahm

Heejin Kim

Gyuseong Lee

Hyunji Park

Jaejin Lee

Submitted

June 18, 2025

arXiv Category

cs.CL

Key Contributions

The paper proposes Thunder-DeID, an end-to-end de-identification framework for Korean court judgments that aligns with legal requirements. It contributes the first Korean legal dataset of annotated judgments with corresponding lists of entity mentions, a systematic categorization of PII, and a DNN-based pipeline for de-identification at scale.
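To make the final masking step of such a pipeline concrete, the sketch below replaces detected entity spans with category placeholders. It is a minimal illustration and not the authors' implementation: the `deidentify` function, the `(start, end, label)` span format, and the `NAME` and `LOCATION` labels are all assumptions made for this example, and the spans are hard-coded where a real system would take them from an NER model.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def deidentify(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed category placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already replaced.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Illustrative English input; the labels are assumptions, not the paper's categories.
text = "Defendant Kim Minsu resides in Seoul."
spans = [(10, 19, "NAME"), (31, 36, "LOCATION")]
masked = deidentify(text, spans)
print(masked)
```

Processing spans in sorted order and tracking a cursor keeps the replacement to a single pass and avoids offset drift when placeholders differ in length from the text they replace.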

Business Value

Enables public disclosure of court judgments while protecting personal data, fostering transparency in the judiciary and compliance with privacy regulations.

Paper Metadata

Innovation Type

Framework and Dataset Creation

Deployment Feasibility

Moderate to High. Requires domain expertise for legal text and PII definition, but the DNN pipeline is a standard ML deployment.

Limitations Addressed

Inadequacy of current de-identification processes for court judgments at scale; vague legal definitions of PII not suited for technical solutions.

Technical Tags

de-identification, PII detection, Korean legal text, deep neural networks, NLP pipeline, named entity recognition, legal tech, data privacy

Research Topics

Data De-identification, Privacy Preservation, Legal NLP, Named Entity Recognition, Deep Learning Applications

Methods & Architectures

Deep Neural Network (DNN)-based pipeline, Named Entity Recognition (NER), Data annotation, Systematic categorization, Deep Neural Networks (DNNs)

Applications & Tasks

Legal, Judiciary, Data Privacy, South Korea

De-identifying court judgments at scale, Handling vague legal definitions of PII, Balancing open access to justice with data protection

Personally Identifiable Information (PII) detection, De-identification of legal documents, Named Entity Recognition (NER)

Datasets & Benchmarks

Datasets

Korean legal dataset (annotated judgments)

De-identification accuracy, Recall, Precision, F1-score
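For reference, precision, recall, and F1 for span detection can be computed as in the sketch below. It assumes exact-match scoring over `(start, end, label)` tuples, which is a common convention for NER but is not confirmed here as the paper's evaluation protocol. The `prf1` function and the sample spans are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one of two predictions matches one of two gold spans.
gold = {(10, 19, "NAME"), (31, 36, "LOCATION")}
pred = {(10, 19, "NAME"), (20, 27, "LOCATION")}
p, r, f = prf1(gold, pred)
print(p, r, f)
```

In a de-identification setting, recall is usually the more critical figure, since a missed identifier leaks personal data while a false positive only over-masks.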

Related Fields

Natural Language Processing, Legal Technology, Data Privacy, Machine Learning, Information Security

Keywords

de-identification, PII, Korean, legal, court judgments, NLP, DNN, NER, data privacy, justice, framework, dataset

Academic Context

#Data De-identification, #Privacy Preservation, #Legal NLP, #Named Entity Recognition, #Deep Learning Applications

Commercial Potential

Potential Products

Automated legal document de-identification software, Data anonymization services for legal firms

Target Industries

Legal Services, Government, Technology (Legal Tech)

Use Case Examples

Publicly releasing anonymized court rulings, Ensuring compliance with GDPR-like regulations for legal data

Competitive Edge

Specifically tailored for Korean legal judgments, addressing unique linguistic and legal challenges not covered by general de-identification tools.

Market Opportunity

Growing market for legal tech and data privacy solutions.

Revenue Models

SaaS for de-identification services, licensing of the framework.

Resource Requirements

Compute Needs

Moderate for training DNNs, potentially high for processing large volumes of judgments.

Data Requirements

Annotated Korean legal judgments.

Deployment Constraints

Requires careful legal review and validation to ensure compliance and accuracy. The framework is specific to the Korean legal context.

Scalability

Designed to handle court judgments at scale.

Regulatory Considerations

Strict adherence to South Korean privacy laws and judicial disclosure mandates.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years for a production-ready system.
