
Modular Linear Tokenization (MLT)

Abstract

This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (https://pypi.org/project/light-mlt/) and GitHub (https://github.com/tcharliesschmitz/light-mlt).
Authors: Tcharlies Schmitz
Submitted: October 29, 2025
arXiv Category: cs.LG

Key Contributions

Introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. MLT uses modular arithmetic and invertible linear transformations, offering explicit control over dimensionality and scalability while maintaining full reversibility.
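To make the idea concrete, the sketch below is a minimal illustration of this kind of scheme, not the actual API of the light-mlt package: an integer identifier is decomposed into fixed-length base-p digits over a finite field and then mixed with an invertible linear map modulo p, so the token can be decoded back to the original identifier exactly. The prime modulus, token length, and mixing matrix here are arbitrary choices made for the example.

```python
import numpy as np

P = 251   # prime modulus (illustrative choice); digit values live in GF(P)
DIM = 4   # token length; can represent up to P**DIM ≈ 3.97e9 distinct ids

# An invertible matrix over GF(P) (upper-triangular with unit diagonal, so it is
# invertible modulo any prime). The actual method may construct this differently.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=np.int64)

def mat_inv_mod(M, p):
    """Invert a square matrix over GF(p) by Gauss-Jordan elimination."""
    n = M.shape[0]
    aug = np.concatenate([M % p, np.eye(n, dtype=np.int64)], axis=1)
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r, col] % p != 0)
        aug[[col, pivot]] = aug[[pivot, col]]                    # move pivot row up
        aug[col] = (aug[col] * pow(int(aug[col, col]), -1, p)) % p
        for r in range(n):
            if r != col:
                aug[r] = (aug[r] - aug[r, col] * aug[col]) % p
    return aug[:, n:]

A_INV = mat_inv_mod(A, P)

def encode(identifier: int) -> np.ndarray:
    """Map an integer id to a DIM-length token: base-P digits, linearly mixed mod P."""
    digits = np.array([(identifier // P**i) % P for i in range(DIM)], dtype=np.int64)
    return (A @ digits) % P

def decode(token: np.ndarray) -> int:
    """Invert encode(): unmix with A's modular inverse, then recompose the digits."""
    digits = (A_INV @ token) % P
    return sum(int(d) * P**i for i, d in enumerate(digits))

movie_id = 20_000_263                     # a MovieLens-20M-scale identifier
assert decode(encode(movie_id)) == movie_id
```

Because both directions are pure modular arithmetic, encoding and decoding cost only a handful of integer operations per identifier and require no lookup table, which is where the parameter and memory savings over learned embeddings come from.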

Business Value

Enables more efficient processing and modeling of large-scale categorical data, reducing memory footprint and computational costs in applications like recommendation systems and feature engineering.
