Abstract
Uncovering hidden graph structures underlying real-world data is a critical
challenge with broad applications across scientific domains. Recently,
transformer-based models leveraging the attention mechanism have demonstrated
strong empirical success in capturing complex dependencies within graphs.
However, the theoretical understanding of their training dynamics has been
limited to tree-like graphs, where each node depends on a single parent.
Extending provable guarantees to more general directed acyclic graphs (DAGs),
in which each node may have multiple parents, remains challenging, primarily
due to the difficulty of designing training objectives that enable different
attention heads to separately learn distinct parent relationships.
In this work, we address this problem by introducing a novel
information-theoretic metric, the kernel-guided mutual information (KG-MI),
based on the $f$-divergence. Our objective combines KG-MI with a multi-head
attention framework in which each head is associated with a distinct marginal
transition kernel, allowing diverse parent-child dependencies to be modeled
effectively. We prove that, given sequences generated by a $K$-parent DAG,
training a single-layer, multi-head transformer via gradient ascent converges
to the global optimum in polynomial time. Furthermore, we characterize the
attention score patterns at convergence. In addition, when the $f$-divergence
is specialized to the KL divergence, the learned attention scores accurately
reflect the ground-truth adjacency matrix, thereby provably recovering the
underlying graph structure. Experimental results validate our theoretical
findings.
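To make the setup concrete, below is a minimal sketch of the kind of single-layer, multi-head attention module the abstract describes, with each head's attention scores read off as candidate parent relations. All names, dimensions, the toy data, and the thresholding rule are illustrative assumptions; the paper's KG-MI objective and its gradient-ascent training loop are not reproduced here.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' implementation): a single-layer,
# multi-head attention whose per-head softmax scores serve as candidate
# parent relations. Dimensions, data, and the threshold below are
# illustrative assumptions; the KG-MI objective and the gradient-ascent
# training from the paper are omitted.

n_nodes, d_model, n_heads = 8, 16, 2   # K = 2 parents, one head per marginal kernel
d_head = d_model // n_heads

X = torch.randn(n_nodes, d_model)      # toy node embeddings for one sequence
Wq = torch.randn(n_heads, d_model, d_head, requires_grad=True)
Wk = torch.randn(n_heads, d_model, d_head, requires_grad=True)

def attention_scores(X: torch.Tensor) -> torch.Tensor:
    """Per-head softmax attention over all nodes; shape (n_heads, n, n)."""
    Q = torch.einsum('nd,hde->hne', X, Wq)
    K = torch.einsum('nd,hde->hne', X, Wk)
    return F.softmax(Q @ K.transpose(-1, -2) / d_head ** 0.5, dim=-1)

A = attention_scores(X)                # (n_heads, n_nodes, n_nodes)

# Per the paper's KL-divergence result, the trained scores concentrate on
# each node's true parents, so a simple threshold (a hypothetical choice
# here) would read off an estimate of the adjacency matrix:
adjacency_estimate = (A.max(dim=0).values > 1.0 / n_nodes).int()
```

With untrained weights the scores above are near-uniform; the paper's result is that after gradient ascent on the KG-MI objective (with the KL instantiation of the $f$-divergence), this kind of read-off provably matches the ground-truth adjacency matrix.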
Authors (5)
Yuan Cheng
Yu Huang
Zhe Xiong
Yingbin Liang
Vincent Y. F. Tan
Submitted
October 29, 2025
Key Contributions
This paper provides the first provable guarantees for transformers learning directed acyclic graphs (DAGs) by introducing Kernel-Guided Mutual Information (KG-MI). This novel information-theoretic metric, combined with a multi-head attention framework, enables transformers to learn complex DAG structures with multiple parents per node.
Business Value
Enabling more accurate discovery of causal relationships and complex dependencies in data can lead to breakthroughs in scientific research, improved decision-making in business, and more robust AI systems.