ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data

Abstract

The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.

Authors (12): Yifeng Jiao, Yuchen Liu, Yu Zhang, Xin Guo, Yushuai Wu, Chen Jiang, +6 more
Submitted: May 19, 2025
arXiv Category: q-bio.GN

Key Contributions

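ChromFound is presented as a foundation model for scATAC-seq data, a modality for which, according to the authors, no existing foundation model supports both high-quality zero-shot cell identification and comprehensive multi-omics analysis. It combines a hybrid architecture with genome-aware tokenization to capture genome-wide long contexts and regulatory signals, addressing the high dimensionality and sparsity of scATAC-seq data and the lack of a standardized schema for open chromatin regions (OCRs). Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, it is evaluated on 6 tasks, including zero-shot cell representation, cell type annotation, and cross-omics prediction.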

Business Value

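Could accelerate biological research and drug discovery by supporting cell type annotation, cross-omics prediction, and enhancer-gene link discovery from single-cell chromatin accessibility data. The authors position these capabilities as a framework for understanding disease risk variants in the noncoding genome, which may in turn help identify new therapeutic targets.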