Executive Summary: Today's Top AI Research
- Serve Programs, Not Prompts: Proposes a new LLM serving architecture that executes programs instead of processing static prompts. This allows for dynamic, runtime customization of inference, achieving up to 2x throughput improvements on complex, multi-tool agent applications.
- Scaling Latent Reasoning via Looped Language Models: Introduces Ouro, a family of pre-trained Looped Language Models that perform iterative reasoning in latent space. This approach allows smaller models (1.4B) to match the reasoning performance of much larger models (12B) on complex benchmarks.
- Parallel Loop Transformer for Efficient Test-Time Computation Scaling: Presents a novel transformer architecture where looped computations (reusing weights) run in parallel instead of sequentially. This design overcomes the latency bottleneck of previous looped models, enabling efficient scaling of computation at inference time.
- Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale: Demonstrates that language models, regardless of architecture (Transformer, Mamba) or scale (14M to 12B parameters), exhibit highly consistent and predictable behavioral phases during pre-training, revealing fundamental patterns in how they acquire capabilities.
- RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness: Introduces RLAIF-V, a framework for reducing multimodal LLM hallucination using feedback from open-source AI models instead of humans. This method creates a highly effective preference dataset and trains models that surpass GPT-4V in trustworthiness evaluations.
- Precise In-Parameter Concept Erasure in Large Language Models: Proposes a method for precisely erasing entire concepts directly from a model's parameters. This technique surgically modifies model behavior without requiring fine-tuning, offering a more robust approach to model safety, unlearning, and control.
- VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning: Presents a generative AI framework that creates dynamic visual effects (VFX) by learning from in-context examples, rather than relying on per-effect fine-tuning. This allows the model to generalize and generate novel VFX for unseen concepts.
- SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens: Introduces a method to accelerate Chain-of-Thought (CoT) reasoning by encoding reasoning steps into implicit, non-textual tokens. This reduces the number of generated tokens, significantly speeding up inference for complex reasoning tasks while maintaining performance.
- Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning: Proposes a unified training pipeline that improves both Program-of-Thought (P-CoT) and Natural Language Chain-of-Thought (N-CoT) reasoning. The method uses each paradigm to iteratively generate and refine data for the other, enhancing overall mathematical reasoning capabilities.
- CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories: Develops an LLM-based agent for complex business tasks within a Customer Relationship Management (CRM) system. The agent uses reinforcement learning and a shared memory module to improve its tool-calling and task-completion abilities in real-world scenarios.
- EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis: Presents a foundational LLM for Electronic Health Record (EHR) analysis, pre-trained on a massive clinical dataset. The model is fine-tuned with a reasoning-focused objective, demonstrating superior performance on complex clinical question-answering and analysis tasks.
- PairUni: Pairwise Training for Unified Multimodal Language Models: Proposes PairUni, a unified framework for training multimodal models to perform both understanding and generation tasks. It uses pairwise ranking objectives during reinforcement learning to effectively balance the heterogeneous data and supervision signals required for these distinct capabilities.
- OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs: Introduces an open-source framework for building and evaluating automated fact-checking systems. The work provides a comprehensive benchmark that measures the ability of LLMs and dedicated systems to verify the factual accuracy of claims against evidence.
- Balanced conic rectified flow: Introduces a new generative model based on rectified flow, an ODE-based approach that learns smooth transport between distributions. This method offers an alternative to diffusion, enabling high-quality image generation with fewer sampling steps by learning a more efficient path.
- LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation: Proposes the first Multimodal Large Language Model (MLLM) framework for open-vocabulary, hierarchical part segmentation. The model can jointly detect and segment objects and their constituent parts from an image based on open-ended text descriptions.
- Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation: Introduces MiRAGE, a new evaluation framework and benchmark for Retrieval-Augmented Generation (RAG) systems that use multimodal sources like video and audio. It tests the ability of models to integrate and reason over information from diverse media formats.
- Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models: Proposes a new training method that improves the reliability of post-hoc attribution for long-document question answering. By training the model to decompose answers into components, it enhances the accuracy of source attribution for complex, abstractive queries.
- NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging: Presents a novel debugging framework where the model first translates buggy code into a natural language description of its logic. It then identifies and corrects flaws in the natural language representation before translating it back into fixed code. A minimal sketch of this loop appears after this list.
- EA3D: Online Open-World 3D Object Extraction from Streaming Videos: Introduces ExtractAnything3D (EA3D), a unified online framework that performs simultaneous geometric reconstruction and open-world 3D object extraction from a single, streaming video. The system can identify and segment novel objects in 3D without prior knowledge.
- StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA: Presents a new Video Question Answering dataset to evaluate a model's ability to understand temporal dynamics and perform complex reasoning over streaming video. The dataset includes questions requiring multi-step, chain-of-thought style reasoning about evolving events.
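To make the NL-Debugging idea above concrete, here is a minimal sketch of a code → natural-language → code repair loop in Python. The `call_llm` function, the prompts, and the three-stage split are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of a code -> natural language -> code debugging loop in the
# spirit of NL-Debugging. `call_llm` is a hypothetical stand-in for any chat
# completion API; prompts and staging are illustrative, not the paper's.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

def nl_debug(buggy_code: str, failing_test: str) -> str:
    # 1. Lift the buggy program into a natural-language description of its logic.
    spec = call_llm(
        "Describe, step by step in plain English, what this code does:\n" + buggy_code
    )
    # 2. Critique and repair the logic at the natural-language level.
    fixed_spec = call_llm(
        "This description corresponds to code that fails the test below. "
        "Identify the logical flaw and rewrite the description so it is correct.\n"
        f"Description:\n{spec}\nFailing test:\n{failing_test}"
    )
    # 3. Translate the repaired description back into code.
    return call_llm("Implement this description as Python code:\n" + fixed_spec)
```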
Research Deep Dives by Category
Large Language Models (10 papers)
- Scaling Latent Reasoning via Looped Language Models: Introduces Ouro, a family of Looped Language Models that perform iterative reasoning in latent space during pre-training. This approach allows smaller models to match the reasoning performance of much larger models, suggesting a more efficient scaling path for reasoning capabilities. A toy weight-sharing sketch appears after this list.
- Differential Mamba: Proposes Differential Mamba, a new state-space model architecture designed to improve context filtering by focusing on changes in the input sequence. This mechanism helps reduce noise in the model's hidden state, enhancing performance on tasks requiring long-range dependencies and robustness.
- S'MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning: Presents S'MoRE, a parameter-efficient fine-tuning method that integrates Mixture-of-Experts (MoE) principles into residual connections. It structurally adds expert modules alongside Low-Rank Adaptation (LoRA), enhancing model capacity during fine-tuning while maintaining high parameter efficiency for diverse tasks.
- SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning: Introduces SATURN, a framework that uses Boolean satisfiability (SAT) problems to generate scalable and verifiable tasks for training LLMs via reinforcement learning. By leveraging SAT solvers for reward generation, it effectively enhances the logical reasoning capabilities of language models.
- GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning: Proposes GAP, a Graph-based Agent Planning framework that enables parallel tool use. It constructs a task dependency graph, allowing the LLM agent to execute independent sub-tasks concurrently. This approach significantly accelerates task completion time compared to sequential methods like ReAct.
- Sequences of Logits Reveal the Low Rank Structure of Language Models: Demonstrates empirically that sequences of output logits from language models exhibit a low-rank structure. This fundamental finding suggests an inherent low-dimensional manifold in model predictions, offering new insights for model compression, interpretability, and understanding the geometry of language models.
- Model-Document Protocol for AI Search: Introduces a protocol to transform unstructured documents like web pages and PDFs into "LLM-ready" formats. This involves creating structured summaries and knowledge graphs directly within the document, significantly improving retrieval accuracy and synthesis quality for AI search and RAG systems.
- Automating Benchmark Design: Proposes a framework for automatically designing and evolving evaluation benchmarks for LLMs. By using an LLM-based "designer" to generate new problems and a "solver" to test them, this system combats benchmark saturation and creates dynamic evaluations that adapt to model improvements.
- Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning: Presents Parrot, a training pipeline that synergistically enhances both Program chain-of-thought (P-CoT) and Natural language chain-of-thought (N-CoT) for mathematical reasoning. It uses a bidirectional training strategy where each paradigm generates data to fine-tune the other, improving overall reasoning performance.
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution: Introduces The Tool Decathlon, a benchmark for evaluating language agents on long-horizon tasks requiring diverse tool use. It features complex, multi-step workflows across various applications, providing a more realistic and challenging assessment of agent capabilities than existing single-tool benchmarks.
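As a toy illustration of the looped-computation idea behind Ouro, the sketch below applies one weight-tied transformer block K times, so test-time compute can be scaled by looping more without adding parameters. The block choice, dimensions, and loop counts are assumptions for illustration, not the paper's architecture.

```python
# Toy looped (weight-tied) latent computation: one shared block applied K
# times, so depth and "reasoning" compute grow without adding parameters.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_loops = num_loops

    def forward(self, x, num_loops=None):
        # Reuse the same weights for each latent iteration; more loops at
        # inference time means more iterative refinement at fixed model size.
        for _ in range(num_loops or self.num_loops):
            x = self.block(x)
        return x

h = torch.randn(2, 16, 256)          # (batch, tokens, hidden)
out = LoopedBlock()(h, num_loops=8)  # scale test-time compute by looping more
```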
Computer Vision (10 papers)
- EA3D: Online Open-World 3D Object Extraction from Streaming Videos: Proposes a unified online framework, ExtractAnything3D, for open-world 3D object extraction from streaming videos. It enables simultaneous geometric reconstruction and open-vocabulary segmentation without requiring pre-constructed geometry or offline processing, advancing real-time scene understanding.
- HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene: Introduces a method for reconstructing dynamic 3D scenes from monocular videos using hierarchical and induced flow-guided Gaussian Splatting. This approach effectively learns structured and temporally consistent representations, tackling a key challenge in 4D vision.
- RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models: Advances real-time object detection by integrating vision foundation models into the DETR architecture. It enhances the feature representation of lightweight networks, improving detection performance on standard benchmarks without sacrificing high-speed inference capabilities.
- DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model: Presents an all-in-one image restoration model using a State Space Model (Mamba) architecture. The model handles multiple, diverse degradation types within a single framework by conditioning the Mamba backbone on degradation-specific prompts for versatile restoration.
- DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes: Develops an online, feed-forward framework using 3D Gaussian Splatting for real-time, high-fidelity reconstruction of dynamic driving scenes. It processes only two consecutive camera frames to handle complex dynamics and sparse views, crucial for autonomous systems.
- Test-Time Adaptive Object Detection with Foundation Model: Proposes a method for online domain adaptation in object detection. It leverages a vision foundation model to generate high-quality pseudo-labels and adapts the detector's parameters at test-time to handle domain shifts without requiring source data.
- DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications: Introduces a hybrid architecture combining a YOLO detector with features from a DINO self-supervised vision transformer. This approach improves data-efficient object detection by leveraging rich, pre-trained features to overcome annotation scarcity in specialized domains.
- Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models: Adapts single-image diffusion models for zero-shot video restoration by introducing a flow-guided propagation mechanism at inference time. This method enhances temporal consistency and fidelity across frames without requiring any video-specific training or model modification.
- MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction: Proposes a Mesh-In-the-Loop approach for Gaussian Splatting that integrates surface mesh extraction directly into the reconstruction optimization process. This enables efficient generation of high-quality, detailed meshes, overcoming a key limitation of standard GS methods.
- $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction: Presents a method for urban scene reconstruction using Gaussian Splatting that does not require LiDAR data. It introduces a dense depth regularization technique derived from multi-view stereo to provide robust geometric priors from images alone.
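A generic sketch of the depth-regularization idea in $D^2GS$: alongside the usual photometric term, an L1 penalty ties the rendered depth to a dense depth prior (e.g., from multi-view stereo). The rendering hook `render_rgb_and_depth` and the weighting are hypothetical; the paper's exact formulation may differ.

```python
# Depth-regularized splatting objective: photometric term plus an L1 penalty
# tying rendered depth to a dense depth prior, as in LiDAR-free reconstruction.
# `render_rgb_and_depth` is a hypothetical hook into a differentiable
# Gaussian Splatting rasterizer.
import torch

def depth_regularized_loss(render_rgb_and_depth, gaussians, camera,
                           gt_image, mvs_depth, lam=0.1):
    rgb, depth = render_rgb_and_depth(gaussians, camera)    # differentiable render
    photometric = torch.nn.functional.l1_loss(rgb, gt_image)
    valid = mvs_depth > 0                                    # mask invalid MVS pixels
    depth_prior = torch.nn.functional.l1_loss(depth[valid], mvs_depth[valid])
    return photometric + lam * depth_prior                   # geometry-aware objective
```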
Reinforcement Learning (8 papers)
- HyperMARL: Adaptive Hypernetworks for Multi-Agent RL: Proposes an adaptive hypernetwork architecture for multi-agent reinforcement learning. It enables policies to express diverse behaviors, from specialized to homogeneous, by dynamically generating agent-specific parameters, addressing a key challenge in achieving effective cooperation in MARL settings.
- LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies: Introduces a risk-aware sampling rule for diffusion policies in offline reinforcement learning. It treats each denoising step as a sequential hypothesis test, providing a statistically grounded method for guiding policy generation and avoiding out-of-distribution actions in safety-critical tasks.
- Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems: Presents a new policy optimization objective that rewards sets of diverse, high-quality solution attempts (pass@k) instead of just the single best one (pass@1). This approach encourages policy diversity and is shown to solve harder RL problems where exploration is crucial. A sketch of the underlying pass@k estimator appears after this list.
- Zero Reinforcement Learning Towards General Domains: Proposes a method to enhance LLM reasoning by applying reinforcement learning with verifiable rewards directly on pretrained models. This "Zero-RL" approach bypasses the need for a supervised fine-tuning phase, offering a more direct path to improving model capabilities on complex tasks.
- CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories: Develops a business agent using a framework of agentic reinforcement learning and shared memories. The agent interacts with databases and knowledge bases to solve complex customer relationship management tasks, demonstrating a practical application of LLM agents in real-world business environments.
- Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning: Addresses the credit assignment problem in cooperative multi-agent reinforcement learning. The proposed method redistributes shared rewards across both time and agents to better disentangle each agent's individual contribution to the team's success while preserving the optimal environmental policy.
- Multi-party Agent Relation Sampling for Multi-party Ad Hoc Teamwork: Introduces a framework for multi-agent ad hoc teamwork where agents collaborate with unknown partners. It proposes a relation sampling technique to handle the uncertainty of teammate behaviors, extending multi-agent reinforcement learning to more realistic, open-world cooperative scenarios without shared conventions.
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey: Provides a comprehensive survey on Agentic Reinforcement Learning (Agentic RL), a paradigm that treats LLMs as autonomous, decision-making agents. It reframes LLM-based RL, categorizes existing methods, and outlines the key challenges and future directions in developing intelligent agents.
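The pass@k objective above builds on the standard unbiased pass@k estimator from code-generation evaluation, sketched below with n samples of which c pass. The exact reward transformation used for policy optimization in the paper may differ from this plain estimator.

```python
# Standard unbiased pass@k estimator: the probability that at least one of a
# random size-k subset of n attempts (c of which pass) is correct is
# 1 - C(n-c, k) / C(n, k). Optimizing a set-level signal like this, rather
# than pass@1, is the core idea behind pass@k policy optimization.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:        # every size-k subset contains at least one success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=2, k=4))  # ~0.45: diverse partial success still earns credit
```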
Generative AI (10 papers)
- Symplectic Generative Networks (SGNs): A Hamiltonian Framework for Invertible Deep Generative Modeling: Introduces Symplectic Generative Networks, a novel class of invertible models based on Hamiltonian mechanics. This framework constructs a volume-preserving map between latent and data spaces, enabling efficient sampling and density estimation through a unique, physics-inspired architecture.
- CANDI: Hybrid Discrete-Continuous Diffusion Models: Proposes a hybrid diffusion model that combines a continuous process in a latent space with a discrete data-space process. This method is designed to effectively model discrete data like text, outperforming purely discrete or continuous diffusion approaches on challenging language modeling benchmarks.
- Balanced conic rectified flow: Presents a generative model learning smooth transport mappings via an Ordinary Differential Equation (ODE). This rectified flow variant uses a 'balanced conic' interpolation to achieve state-of-the-art generation quality with significantly fewer function evaluations compared to traditional diffusion models. The vanilla rectified-flow objective it builds on is sketched after this list.
- Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions: Introduces a solver-free method for modeling solutions to Stochastic Differential Equations (SDEs), which underpin diffusion models. By directly parameterizing the flow map between time points, this approach bypasses costly numerical integration, enabling efficient sampling and inference for time-series data.
- BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training: Introduces BOLT-GAN, a modification to the WGAN framework using a Bayes Optimal Learning Threshold loss. This simple yet effective technique implicitly minimizes a different metric distance, leading to significantly more stable training and improved performance for Generative Adversarial Networks.
- Non-Markovian Discrete Diffusion with Causal Language Models: Proposes a discrete diffusion model that overcomes the restrictive Markovian assumption by integrating a causal language model. This allows each generation step to condition on the entire sequence history, enhancing expressive power and performance on structured sequence generation tasks.
- FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion: Presents a training-free method for generating articulated 3D objects from text prompts using a pretrained 3D diffusion model. It works by optimizing a neural field representation of the object's parts at inference time, enabling the creation of diverse and animatable 3D assets.
- Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation: Develops a generative framework for creating parametric Computer-Aided Design (CAD) sequences that meet specific quantitative constraints. It utilizes Bayesian Flow Networks, guided by a target objective, to generate complex, structured designs that adhere to predefined performance metrics.
- VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning: Introduces a method for generating dynamic visual effects (VFX) that generalizes to unseen effects without retraining. By using in-context learning from visual or text prompts, it moves beyond the 'one-LoRA-per-effect' paradigm, enabling more flexible and scalable VFX creation.
- 4-Doodle: Text to 3D Sketches that Move!: Introduces the novel task of text-to-3D sketch animation, generating dynamic and view-consistent 3D vector sketches from text. This work targets a lightweight, stylized, and interpretable medium, expanding generative capabilities beyond photorealistic content to animated line art.
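For reference, the vanilla rectified-flow training objective that the balanced conic variant builds on is sketched below: interpolate linearly between noise and data, and regress a velocity field onto the constant displacement. The 'balanced conic' interpolation itself is the paper's contribution and is not reproduced here.

```python
# Vanilla rectified-flow objective: straight-line interpolation between noise
# x0 and data x1, with the network regressing the constant velocity x1 - x0.
# `v_theta` is any model taking (x_t, t).
import torch

def rectified_flow_loss(v_theta, x1):
    x0 = torch.randn_like(x1)                            # source (noise) samples
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1))   # per-sample time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                          # straight-line interpolation
    target = x1 - x0                                     # velocity of the straight path
    return torch.mean((v_theta(x_t, t) - target) ** 2)

# Sampling then integrates dx/dt = v_theta(x, t) from t=0 to t=1 with a few
# Euler steps, which is why rectified flows need fewer evaluations than diffusion.
```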
AI Safety & Ethics (8 papers)
- Doubly Robust Alignment for Large Language Models: Proposes a reinforcement learning from human feedback (RLHF) algorithm that is 'doubly robust,' meaning it remains effective even with misspecifications in the reward or preference model. This enhances the reliability of LLM alignment against common modeling errors. The classical estimator behind the idea is sketched after this list.
- OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents: Introduces OS-Harm, a benchmark for evaluating the safety of LLM-based agents that interact with graphical user interfaces. It assesses agents' potential to cause harm across various categories, providing a standardized framework for measuring and mitigating agent risks.
- Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?: Argues that instrumental goals like power-seeking in advanced AI are emergent features to be managed, not bugs to be eliminated. The paper reframes the alignment problem as one of continuous management and control of these inherent tendencies in complex systems.
- Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization: Presents MUDMAN, a method for robustly unlearning dangerous knowledge from LLMs. It uses meta-unlearning with specific masking and normalization techniques to prevent the unlearning process from being easily reversed, addressing a key failure mode of previous methods.
- Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models: Introduces Agentic Moderation, a framework where specialized AI agents collaborate to moderate vision-language model outputs. This model-agnostic design enhances safety by autonomously analyzing and red-teaming content to identify and mitigate potential harms before they reach the user.
- CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models: Addresses corrupted and incomplete preference datasets used for LLM alignment. It proposes CURATRON, a method to robustly recalibrate values within these datasets, enabling more reliable alignment by creating a complete and consistent preference graph from flawed data.
- Secure Retrieval-Augmented Generation against Poisoning Attacks: Investigates the vulnerability of Retrieval-Augmented Generation (RAG) systems to data poisoning attacks. The paper proposes a defense mechanism that secures the retrieval process, ensuring the LLM relies on verified knowledge sources and is robust against malicious documents.
- Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models: Proposes a method to train LLMs to generate verifiable citations without needing an external retriever at inference time. By integrating attribution into pretraining, it aims to create models that can provide correct and reliably sourced answers with lower latency.
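The 'doubly robust' idea above comes from classical off-policy estimation. The sketch below shows the textbook doubly robust value estimator, which stays consistent if either the outcome model or the importance weights are well specified; the paper's RLHF algorithm is more involved, and the variable names here are generic placeholders.

```python
# Classical doubly robust (DR) off-policy value estimate: combine an outcome
# model q(x, a) with importance weights w = pi_target(a|x) / pi_logging(a|x).
# The estimate remains consistent if *either* component is well specified.
import numpy as np

def doubly_robust_value(rewards, weights, q_logged, q_target_expected):
    # rewards:            observed rewards under the logging policy
    # weights:            importance ratios pi_target / pi_logging for logged actions
    # q_logged:           model predictions q(x, a) for the logged actions
    # q_target_expected:  E_{a ~ pi_target}[q(x, a)] for each context
    correction = weights * (rewards - q_logged)   # model-error correction term
    return np.mean(q_target_expected + correction)

rng = np.random.default_rng(0)
n = 1000
print(doubly_robust_value(rng.normal(1.0, 1.0, n), rng.uniform(0.5, 1.5, n),
                          rng.normal(1.0, 0.2, n), rng.normal(1.0, 0.2, n)))
```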
Graph Neural Networks (8 papers)
- Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training: Proposes a 3D parallel training system for GNNs on massive graphs. It overcomes GPU memory limitations to handle billion-edge graphs, enabling full-graph training on previously intractable datasets and demonstrating a new paradigm for GNN scalability on real-world networks.
- The Underappreciated Power of Vision Models for Graph Structural Understanding: Challenges the dominant message-passing paradigm by investigating the use of vision models for graph understanding. The work finds that vision models can achieve competitive performance by capturing global structures first, suggesting an alternative approach to learning on graphs.
- Bridging the Divide: End-to-End Sequence-Graph Learning: Introduces a method for end-to-end learning on datasets that are both sequential and relational. It unifies sequence and graph modeling to capture interactions where each node carries an event sequence, addressing a common and complex real-world data structure.
- Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information: Proves that transformer-based models can effectively learn hidden directed acyclic graph structures from data. This is achieved using a kernel-guided mutual information objective, providing a theoretical foundation for using attention mechanisms for graph structure learning.
- Learning Fair Graph Representations with Multi-view Information Bottleneck: Presents a method for learning fair graph representations by addressing biases from both node features and graph structure. It uses a multi-view information bottleneck to disentangle sensitive attributes, mitigating the propagation of discriminatory information within GNNs.
- Subgraph Federated Learning via Spectral Methods: Develops a federated learning framework for graph-structured data distributed across multiple clients. It specifically handles interconnected subgraphs using spectral methods, enabling collaborative GNN training on decentralized graphs while preserving data privacy and accounting for inter-client dependencies.
- Certainty in Uncertainty: Reasoning over Uncertain Knowledge Graphs with Statistical Guarantees: Proposes a method for reasoning over uncertain knowledge graphs that provides statistical guarantees for predictions. Unlike methods that yield only point estimates, this approach quantifies predictive uncertainty, enabling more reliable link prediction in incomplete or noisy knowledge graphs.
- A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring: Introduces a systematic method for generating benchmarks to evaluate graph explainable AI (XAI) techniques. It uses Weisfeiler-Leman coloring to create graphs with known, objective ground-truth explanations, enabling more rigorous and standardized assessment of GNN explainability.
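Since the benchmark-generation method above is built on Weisfeiler-Leman coloring, here is a compact 1-WL color-refinement sketch; the resulting color signatures are the kind of objective structural ground truth such benchmarks rely on. The paper's full generation pipeline is more elaborate than this.

```python
# Compact 1-Weisfeiler-Leman color refinement: iteratively re-color each node
# by its current color plus the multiset of neighbor colors, relabeled
# deterministically each round.
def wl_colors(adj: dict[int, list[int]], rounds: int = 3) -> dict[int, int]:
    colors = {v: 0 for v in adj}                           # start with uniform colors
    for _ in range(rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}  # canonical relabeling
    return colors

# Path graph 0-1-2: the endpoints share a color, the middle node gets its own.
print(wl_colors({0: [1], 1: [0, 2], 2: [1]}))
```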
Robotics & Embodied AI (8 papers)
- Learning to Plan & Schedule with Reinforcement-Learned Bimanual Robot Skills: Proposes a hierarchical framework for long-horizon bimanual manipulation. It combines a high-level planner that schedules tasks with low-level reinforcement-learned skills for contact-rich actions, enabling complex coordination between two robot arms for sequential and parallel execution. A generic plan-and-schedule sketch appears after this list.
- RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation: Introduces RoboCerebra, a large-scale benchmark for evaluating long-horizon robotic manipulation. It focuses on assessing a system's ability for semantic reasoning and planning, moving beyond reactive policies to test more deliberative, multi-step task completion in complex scenarios.
- Scalable predictive processing framework for multitask caregiving robots: Presents a scalable predictive processing framework for multitask caregiving robots. Inspired by cognitive neuroscience, the system aims to generalize across diverse caregiving scenarios by continuously predicting future states, enabling more adaptive and versatile robot assistance.
- One-shot Humanoid Whole-body Motion Learning: Develops a method for one-shot learning of whole-body humanoid motion. This approach enables a humanoid robot to learn and replicate complex, dynamic behaviors like balancing and coordination from a single demonstration, significantly reducing data requirements for motion synthesis.
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving: Introduces FutureSightDrive, a Vision-Language-Action model for autonomous driving that uses a visual spatio-temporal Chain-of-Thought. Instead of text, it reasons over future visual predictions, improving planning and decision-making by maintaining critical spatial information for complex driving scenarios.
- SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation: Presents SynHLMA, a method for synthesizing hand manipulation of articulated objects from language instructions. It uses a discrete human-object interaction representation to generate plausible, long-term manipulation sequences that respect both object functionality and articulation constraints.
- Online Adaptation for Flying Quadrotors in Tight Formations: Details an online adaptation method for quadrotors flying in tight formations. The system learns to compensate for complex, nonlinear aerodynamic wake interactions in real-time, enabling individual drones and the team to maintain stability in challenging, close-proximity flight maneuvers.
- Navigation in a Three-Dimensional Urban Flow using Deep Reinforcement Learning: Develops a deep reinforcement learning strategy for UAV navigation in complex 3D urban airflow. The agent learns an optimal path that leverages wind currents for energy efficiency while avoiding obstacles, demonstrated in a high-fidelity urban environment simulation.
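To illustrate the plan-and-schedule separation described for bimanual manipulation, the sketch below walks a task dependency graph and dispatches skills to whichever arm is free, so independent skills run in parallel. Task names, durations, and the skill interface are hypothetical; the paper's planner and reinforcement-learned skills are far richer.

```python
# Generic dependency-aware scheduler: dispatch learned skills to two arms,
# running independent skills in parallel and respecting task dependencies.
import heapq

def schedule(tasks, deps, duration, arms=("left", "right")):
    done, free, busy, t = set(), list(arms), [], 0.0
    plan = []
    while len(done) < len(tasks):
        ready = [x for x in tasks if x not in done
                 and all(d in done for d in deps.get(x, []))
                 and all(x != b[2] for b in busy)]
        while free and ready:                      # dispatch independent skills in parallel
            arm, task = free.pop(), ready.pop()
            heapq.heappush(busy, (t + duration[task], arm, task))
            plan.append((t, arm, task))
        t, arm, task = heapq.heappop(busy)         # advance time to next skill completion
        done.add(task); free.append(arm)
    return plan

deps = {"insert_peg": ["grasp_peg", "hold_base"]}
duration = {"grasp_peg": 2.0, "hold_base": 1.0, "insert_peg": 3.0}
print(schedule(["grasp_peg", "hold_base", "insert_peg"], deps, duration))
```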
Speech & Audio (6 papers)
- POWSM: A Phonetic Open Whisper-Style Speech Foundation Model: Proposes a unified phonetic foundation model, POWSM, trained on 1 million hours of multilingual speech. It jointly handles ASR, phone recognition, and conversion between graphemes and phonemes, establishing a new state of the art across these diverse phonetic tasks with a single architecture.
- SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution: Introduces a script-first multilingual text-to-speech system for intra-sentence code-switching. By using adaptive locale resolution and a novel script-based tokenizer, the model generates natural-sounding speech for mixed-language text, outperforming conventional multilingual baselines on a challenging task. A toy script-detection sketch appears after this list.
- Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR: Develops a method to disentangle linguistic content from noise within discrete speech representations derived from Whisper. This approach significantly improves ASR performance on noisy datasets by making the representations more robust, without degrading performance on clean speech.
- Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models: Investigates the capability of speech foundation models to perceive voice quality variations like breathiness or strain. The paper demonstrates that current models largely ignore these rich paralinguistic cues, proposing new evaluation benchmarks to address this critical gap in speech understanding.
- More than a Moment: Towards Coherent Sequences of Audio Descriptions: Addresses the generation of coherent, sequential audio descriptions for videos, moving beyond describing isolated moments. The proposed system incorporates a coherence model, resulting in narratives that are more natural, contextually aware, and easier for listeners to follow.
- Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech: Analyzes the performance of spoken language models on emotionally incongruent speech, where vocal tone contradicts lexical content. The study reveals that current models are heavily biased towards lexical cues, failing to accurately interpret emotional prosody in such complex scenarios.
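A toy version of the 'script-first' step for code-switched text: segment a sentence into runs of the same Unicode script so each run can be routed to a locale-specific voice or frontend. The adaptive locale resolution and tokenizer are the paper's contribution; this sketch only shows crude script detection via `unicodedata`.

```python
# Toy "script-first" segmentation: split a sentence into runs of the same
# Unicode script, a crude stand-in for routing code-switched text to
# locale-specific TTS frontends.
import unicodedata

def script_of(ch: str) -> str:
    if not ch.isalpha():
        return "COMMON"
    name = unicodedata.name(ch, "UNKNOWN")   # e.g. "CYRILLIC SMALL LETTER A"
    return name.split()[0]                   # heuristic: first word of the char name

def script_runs(text: str):
    runs, cur_script, cur = [], None, ""
    for ch in text:
        s = script_of(ch)
        if s in ("COMMON", cur_script) or cur_script is None:
            cur += ch
            cur_script = cur_script if s == "COMMON" else s
        else:
            runs.append((cur_script, cur)); cur_script, cur = s, ch
    if cur:
        runs.append((cur_script, cur))
    return runs

print(script_runs("Play песня again"))
# [('LATIN', 'Play '), ('CYRILLIC', 'песня '), ('LATIN', 'again')]
```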
Multimodal Learning (8 papers)
- Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation: Proposes a sparse Mixture-of-Experts (MoE) unified architecture for multimodal tasks. The 100B parameter model uses only 6.1B active parameters per token, enabling highly efficient scaling for both perception and generation within a single, powerful framework. A standard top-k routing sketch appears after this list.
- PairUni: Pairwise Training for Unified Multimodal Language Models: Introduces a unified training framework to balance understanding and generation tasks in multimodal models. It uses pairwise preference data and reinforcement learning to effectively merge heterogeneous data sources and supervision signals within a single versatile architecture.
- Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning: Proposes Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic optimization framework. It harmonizes learning by dynamically modulating gradients to prevent dominant modalities from overshadowing others, improving model generalization across various fusion scenarios and modalities.
- Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models: Presents a modular framework to improve physical reasoning in Vision-Language Models, a key weakness. It translates visual inputs into simplified physical contexts, allowing VLMs to make more accurate predictions about physical object behavior without expensive full-model fine-tuning.
- LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation: Introduces the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary part segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary categories, enabling more fine-grained visual understanding.
- Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization: Addresses out-of-distribution generalization in Vision-Language-Action (VLA) models for robotics. It proposes aligning the visual representations of the VLA's policy with its pretrained Vision-Language Model backbone, improving the transfer of world knowledge and generalization to new environments.
- StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA: Introduces a new Video Question Answering dataset to enhance temporal dynamics understanding and complex reasoning. It provides annotations for streaming video clips and includes multimodal chain-of-thought rationales to benchmark and develop more advanced reasoning models.
- ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents: Proposes a reinforcement learning agent for actively navigating and analyzing long, multi-page documents. The model learns to gather evidence by moving between pages and focusing on relevant sections, enabling information integration from documents that exceed VLM context windows.
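The '100B total, roughly 6B active' figure above reflects standard sparse top-k mixture-of-experts routing, sketched below: a gate scores experts per token and only the top-k expert MLPs run. Sizes, k, and the expert design are illustrative assumptions, not Ming-Flash-Omni's actual router.

```python
# Standard top-k MoE routing: per token, score all experts but execute only
# the top-k, so active parameters per token stay a small fraction of the total.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # run only the chosen experts per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 512)
print(TopKMoE()(tokens).shape)                   # torch.Size([8, 512])
```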
AI Theory & Foundations (6 papers)
- A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory: This paper develops a theoretical framework for learning multiple operators, which are mappings between function spaces. It introduces new architectures and provides approximation theory results, establishing a mathematical foundation for applying deep learning to complex scientific problems beyond finite-dimensional data.
- From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning: This work provides a theoretical explanation for weak-to-strong generalization, where a strong model supervised by a weak one surpasses it. It moves beyond linear models to show provably how feature learning in two-layer neural networks enables this phenomenon, offering key insights into model scaling.
- A Framework for Bounding Deterministic Risk with PAC-Bayes: Applications to Majority Votes: This paper extends the PAC-Bayes framework to provide generalization guarantees on the risk of a single, deterministic hypothesis rather than a randomized one. This is achieved by introducing a new framework and demonstrating its application to bound the risk of majority vote classifiers. The classical randomized-predictor bound it extends is stated after this list.
- The Neural Differential Manifold: An Architecture with Explicit Geometric Structure: This paper introduces the Neural Differential Manifold (NDM), a novel architecture that conceptualizes a neural network as a differentiable manifold. This design explicitly incorporates geometric structure, providing a new foundational approach for building geometrically-aware models instead of using standard Euclidean parameter spaces.
- Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU: This work studies the complex nonlinear training dynamics of shallow neural networks with leaky ReLU activation. It establishes a theoretical framework based on the equivariant gradient degree to analyze the optimization landscape, identifying conditions under which networks converge to specific solution types.
- Towards Scaling Deep Neural Networks with Predictive Coding: Theory and Practice: This work investigates predictive coding as a more energy-efficient and biologically plausible alternative to backpropagation for training deep neural networks. It provides theoretical analysis and practical implementations, exploring the potential for scaling this alternative learning algorithm to large models.
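For context on the PAC-Bayes result above, one common form of the classical McAllester-style bound for the randomized (Gibbs) predictor, the starting point that deterministic-risk frameworks extend, is:

```latex
% One common PAC-Bayes bound (McAllester-style): with probability at least
% 1 - \delta over an i.i.d. sample of size n, simultaneously for all
% posteriors \rho over hypotheses (prior \pi fixed in advance):
\mathbb{E}_{h\sim\rho}\big[L(h)\big]
\;\le\;
\mathbb{E}_{h\sim\rho}\big[\hat{L}_n(h)\big]
\;+\;
\sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi)+\ln\!\big(2\sqrt{n}/\delta\big)}{2n}}
```

A deterministic majority-vote guarantee is usually derived from such randomized bounds (for instance via the classical factor-of-two argument relating majority-vote risk to Gibbs risk); the paper proposes a new framework for bounding the deterministic risk directly.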
Efficient AI (6 papers)
- INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats: Presents a comprehensive study comparing low-precision floating-point (FP) and integer (INT) quantization for Large Language Models. This analysis, relevant for modern hardware like Nvidia's Blackwell, clarifies the trade-offs for efficiently handling activation outliers in LLMs. A minimal INT8 quantization sketch appears after this list.
- Serve Programs, Not Prompts: Proposes a new LLM serving architecture that executes programs instead of static prompts, enabling dynamic control over the inference process. This allows for flexible resource management and adapts computation to specific requests, improving system efficiency for complex applications.
- Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling: Introduces LrcSSM, a non-linear recurrent state-space model for efficient sequence modeling. By enforcing a diagonal Jacobian matrix, the model allows for fully parallel computation, processing long sequences with linear time and memory complexity, rivaling the speed of linear SSMs.
- SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens: Introduces SemCoT, a method to accelerate Chain-of-Thought (CoT) reasoning by encoding reasoning steps into implicit tokens rather than explicit text. This reduces the verbosity and computational cost of CoT, enabling faster and more efficient inference for complex LLM tasks.
- MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding: Presents MMEdge, a framework for accelerating multimodal inference on edge devices by creating a pipeline for the sensing and encoding stages. It dynamically adjusts model execution based on sensor data availability, minimizing latency for real-time applications like autonomous driving.
- Sub-microsecond Transformers for Jet Tagging on FPGAs: Details the first implementation of a transformer model on an FPGA that achieves sub-microsecond inference latency. This work demonstrates highly efficient hardware acceleration for a state-of-the-art physics benchmark, showcasing FPGA potential for ultra-low-latency, real-time AI applications.
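To make the activation-outlier trade-off concrete, the sketch below applies symmetric per-tensor INT8 quantization with and without a single outlier: one large value stretches the shared scale and inflates the error on ordinary activations, which is what motivates finer-grained INT and FP formats. The data and numbers are synthetic and illustrative.

```python
# Symmetric per-tensor INT8 quantization: one scale for the whole tensor means
# a single activation outlier stretches the range and wastes integer levels on
# everything else; this outlier issue motivates fine-grained INT and FP formats.
import numpy as np

def int8_quant_dequant(x: np.ndarray):
    scale = np.abs(x).max() / 127.0                       # symmetric range [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale                   # dequantize back to float

rng = np.random.default_rng(0)
acts = rng.normal(0, 1, 4096).astype(np.float32)
acts_outlier = acts.copy()
acts_outlier[0] = 60.0                                    # a single large outlier

for name, a in [("normal", acts), ("with outlier", acts_outlier)]:
    err = np.mean((a - int8_quant_dequant(a)) ** 2)
    print(f"{name:>12s}: MSE = {err:.2e}")
# The outlier stretches the scale, so quantization error on ordinary
# activations grows by orders of magnitude.
```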
AI for Science (6 papers)
- Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space: Proposes MDtrajNet, a neural network that directly predicts molecular dynamics trajectories, bypassing sequential numerical integration. The pre-trained MDtrajNet-1 model demonstrates a new, more efficient paradigm for exploring the behavior of atomistic systems across a vast chemical space. The sequential integration it bypasses is sketched after this list.
- Hierarchical Physics-Embedded Learning for Spatiotemporal Dynamical Systems: Presents a framework for modeling complex spatiotemporal systems where governing partial differential equations are unknown. The method embeds hierarchical physical principles into the learning process, enabling the discovery of intractable equations from data for far-from-equilibrium systems.
- EnzyControl: Adding Functional and Substrate-Specific Control for Enzyme Backbone Generation: Introduces EnzyControl, a generative model for designing enzyme backbones with specific functions and substrate targets. It addresses limitations in binding data and substrate-specific control, enabling more flexible and targeted de novo enzyme backbone generation for computational protein engineering.
- Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations: Develops tensor decomposition networks to accelerate the computationally expensive Clebsch-Gordan tensor product in equivariant networks for machine learning interatomic potentials. This approach significantly speeds up a key operation, enabling faster and more efficient materials simulations.
- Flow matching for reaction pathway generation: Applies flow matching, a generative modeling technique, to the challenge of elucidating chemical reaction mechanisms. The model efficiently generates transition states, products, and complete reaction networks, offering a faster alternative to traditional quantum chemistry methods for discovering reaction pathways.
- Spectral functions in Minkowski quantum electrodynamics from neural reconstruction: Benchmarking against dispersive Dyson-Schwinger integral equations: Formulates a Minkowski physics-informed neural network (M-PINN) to solve the Dyson-Schwinger integral equations of quantum electrodynamics directly in Minkowski spacetime. This novel strategy provides a new computational tool for studying fundamental quantum field theories without Euclidean continuation.
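For contrast with direct trajectory prediction, the sketch below shows the sequential integration loop it bypasses: a standard velocity-Verlet step where every state depends on the previous one and on a fresh force evaluation. The harmonic force, unit mass, and step sizes are toy choices for illustration.

```python
# The sequential bottleneck that direct trajectory prediction sidesteps: a
# velocity-Verlet MD loop takes many small steps, each depending on the
# previous state and a new force evaluation.
import numpy as np

def velocity_verlet(x, v, force, dt=1e-3, steps=1000, mass=1.0):
    traj = [x.copy()]
    a = force(x) / mass
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt**2         # position update
        a_new = force(x) / mass                  # force evaluation each step
        v = v + 0.5 * (a + a_new) * dt           # velocity update
        a = a_new
        traj.append(x.copy())
    return np.array(traj)

harmonic = lambda x: -x                          # toy potential U = x^2 / 2
traj = velocity_verlet(np.array([1.0]), np.array([0.0]), harmonic)
print(traj.shape)                                # (1001, 1): one state per tiny step
# A trajectory-prediction model instead maps (x0, v0, t) directly to x(t).
```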
Natural Language Processing (8 papers)
- BambooKG: A Neurobiologically-inspired Frequency-Weight Knowledge Graph: Proposes BambooKG, a neurobiologically-inspired knowledge graph that uses frequency-weights to improve Retrieval-Augmented Generation (RAG). It enhances multi-hop and relational reasoning for LLMs by structuring retrieved information more effectively than independent text chunks, reducing hallucinations.
- Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation: Examines various pretraining strategies for low-resource machine translation, specifically for languages like Afrikaans, Swahili, and Zulu. The study compares the effectiveness of different approaches using monolingual and parallel data to improve translation model performance in data-scarce scenarios.
- TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors: Introduces TOPol, a semi-unsupervised framework to model semantic polarity beyond a single dimension. It reconstructs and interprets multidimensional topic-orientation polarity fields, offering a more nuanced approach to sentiment and semantic analysis compared to traditional unidimensional scales.
- Topic Analysis with Side Information: A Neural-Augmented LDA Approach: Presents a neural-augmented Latent Dirichlet Allocation (LDA) model for topic analysis that integrates side information like metadata or document labels. This approach enhances traditional topic modeling by leveraging auxiliary data to uncover more expressive and coherent latent structures in text.
- Reliable Evaluation and Benchmarks for Statement Autoformalization: Develops a comprehensive evaluation approach for statement autoformalization, the task of translating natural language mathematics into formal languages. It introduces new metrics, datasets, and standards to robustly measure progress and provide a reliable benchmark for this complex generation task.
- Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish: Investigates the grammatical capabilities of Large Language Models using a grammar-book-guided probing methodology for Luxembourgish. The paper systematically evaluates whether LLMs understand and apply complex grammatical rules, providing evidence on their linguistic competence beyond surface-level fluency.
- Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation: Introduces Falcon, a cross-domain Chinese text-to-SQL benchmark for enterprise-grade evaluation. The dataset contains 600 questions over 28 databases, with a high percentage requiring multi-table reasoning, providing a challenging new resource for semantic parsing research in a non-English language.
- Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction: Assesses the cross-lingual text comprehension of Large Language Models using a Next Sentence Prediction task in low-resource language settings. By testing in a data-scarce environment, the study investigates whether LLM performance stems from genuine linguistic ability or data abundance advantages.
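For reference, the classical next-sentence-prediction formulation is sketched below using BERT's pretrained NSP head for concreteness; the paper instead probes LLMs in low-resource languages, so treat this as a baseline reference point rather than the paper's setup.

```python
# Classical next-sentence prediction (NSP) with BERT's pretrained NSP head:
# score whether sentence B plausibly follows sentence A.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-multilingual-cased")

def is_next_sentence(sent_a: str, sent_b: str) -> float:
    enc = tokenizer(sent_a, sent_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits             # index 0 = "B follows A", 1 = random
    return torch.softmax(logits, dim=-1)[0, 0].item()

print(is_next_sentence("The storm knocked out the power.",
                       "Everyone lit candles and waited."))
print(is_next_sentence("The storm knocked out the power.",
                       "Photosynthesis converts light into chemical energy."))
```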
Key Research Trends & Takeaways
Four key trends and takeaways stand out across today's papers:
- Architectural Innovations for Enhanced Efficiency and Scalability: A prominent trend involves architectural and algorithmic innovations to significantly enhance LLM inference efficiency and computational scaling. Papers introduce new serving paradigms executing programs (Serve Programs, Not Prompts), novel looped transformer designs (Ouro, Parallel Loop Transformer) for iterative latent reasoning, and implicit token-based acceleration (SemCoT), collectively enabling smaller models to achieve high performance and accelerating complex tasks.
- Advancing Iterative Reasoning and Agentic Capabilities: The research strongly emphasizes advancing LLM reasoning and agentic capabilities for complex, multi-step tasks. Innovations include iterative latent reasoning in a compressed latent space (Ouro), unified training for diverse Chain-of-Thought methods (Parrot), and sophisticated agentic systems leveraging reinforcement learning and shared memory for real-world applications (CRMWeaver), pushing towards more adaptive and tool-augmented intelligence.
- Finer-Grained Model Control, Safety, and Trustworthiness: Significant progress is being made in enhancing model control, safety, and trustworthiness through novel feedback and unlearning mechanisms. RLAIF-V demonstrates the efficacy of AI-driven feedback for reducing multimodal hallucination and surpassing human-feedback benchmarks, while "Precise In-Parameter Concept Erasure" introduces surgical techniques for targeted concept unlearning directly within model parameters.
- Shifting Towards Dynamic, Programmatic, and Adaptive LLM Interactions: The field is actively moving beyond static prompt-response models to dynamic, adaptive, and programmatic LLM interactions. This is exemplified by new serving architectures that execute programs for runtime customization (Serve Programs, Not Prompts) and generative models that learn dynamic effects from in-context examples (VFXMaster), positioning LLMs as flexible, programmable computation engines rather than just static knowledge retrievers.