Executive Summary: Today's Top AI Research
- ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation: Introduces the first zero-shot method for grounding 3D orientation in text-to-image models. It allows users to specify the 3D orientation of multiple objects across diverse categories without requiring explicit 3D training data, enabling more precise and controllable scene generation.
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning: Proposes a Vision-Language-Action model for end-to-end autonomous driving. The model leverages world knowledge and reasoning to make driving decisions, using reinforcement fine-tuning and adaptive reasoning to generate physically feasible actions directly from visual inputs and language instructions.
- CoMo: Compositional Motion Customization for Text-to-Video Generation: Presents a method for compositional motion customization in text-to-video generation. It enables precise control over complex, multi-subject motions by decomposing motion descriptions and applying them to specific subjects, overcoming a key limitation of existing video synthesis models.
- VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation: Proposes VOLD, a method to transfer reasoning from text-only LLMs to Vision-Language Models using on-policy distillation. This technique leverages abundant text-based reasoning data to improve VLM performance on complex visual reasoning tasks without requiring extensive image-text reasoning annotations.
- More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models: Presents a method to unify image generation and depth estimation within a single text-to-image diffusion model. It overcomes the catastrophic degradation of generative capabilities during fine-tuning, allowing the model to perform both tasks effectively without compromising its original generation quality.
- Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling: Proposes a method to accelerate diffusion model sampling by adaptively combining ODE and SDE solvers. The technique introduces adaptive stochastic coefficients to leverage the complementary strengths of both solver types, reducing error accumulation and improving sample quality at faster speeds.
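The ODE/SDE blending idea can be sketched in a toy 1-D setting. This is a generic illustration, not the paper's actual coefficients: a single step scales the score-based drift by (1 + gamma) and injects noise of variance 2*gamma*dt, so gamma = 0 is a deterministic ODE step and gamma = 1 a fully stochastic one; the "adaptive" schedule here is just a hand-written ramp standing in for the learned coefficients.

```python
import math
import random

def blended_step(x, score, dt, gamma, rng=None):
    """One sampler step interpolating between a deterministic ODE update
    (gamma = 0) and a stochastic SDE update (gamma = 1): the drift is scaled
    by (1 + gamma) and noise of variance 2*gamma*dt is injected."""
    rng = rng or random.Random(0)
    drift = (1.0 + gamma) * score(x) * dt
    noise = math.sqrt(2.0 * gamma * dt) * rng.gauss(0.0, 1.0)
    return x + drift + noise

def adaptive_gamma(step, total_steps):
    """Hand-written schedule: stochastic early (to wash out accumulated
    error), purely deterministic in the second half (to preserve detail)."""
    return max(0.0, 1.0 - 2.0 * step / total_steps)

# Toy target: a standard Gaussian, whose score function is simply -x.
score = lambda v: -v
rng = random.Random(42)
x, total = 3.0, 20
for step in range(total):
    x = blended_step(x, score, dt=0.1, gamma=adaptive_gamma(step, total), rng=rng)
print(round(x, 3))
```

With gamma pinned to zero the step reduces to plain Euler integration of the probability-flow direction, which is what makes the blend easy to reason about.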
- Towards Generalisable Foundation Models for 3D Brain MRI: Introduces BrainFound, a self-supervised foundation model for 3D brain MRI analysis built by extending DINO-v2. It learns general-purpose features from large-scale unlabeled MRI datasets, demonstrating strong performance on various downstream clinical tasks and improving generalizability across different hospitals.
- Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting: Introduces a unified framework for 3D open-vocabulary segmentation by integrating it with Gaussian Splatting. The method first reconstructs a 3D scene and then performs segmentation, ensuring multi-view consistency and enabling accurate querying of 3D objects using natural language descriptions.
- MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans: Proposes a system for converting 3D scans into parametric, constrained Computer-Aided Design (CAD) models. It reconstructs fine-grained geometric primitives and infers the underlying design intent, such as constraints, enabling automated reverse engineering for manufacturing and product development.
- EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT: Introduces a framework for egocentric video reasoning that infers the hidden intentions and actions of the camera-wearer. It uses a Spatio-Temporal Chain-of-Thought (CoT) approach, enabling multimodal large language models to reason about fine-grained interactions and dynamic environmental changes.
- VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting: Presents an end-to-end autonomous driving model that is robust to variations in camera viewpoint. It uses a feed-forward 3D Gaussian Splatting module to create an explicit 3D representation of the scene, enabling consistent driving decisions across different vehicle configurations.
- FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time: Introduces a training-free method for multi-subject text-to-image generation by automatically fusing multiple subject-specific LoRAs at test time. It uses an auto-masking technique to apply different LoRAs to distinct regions of the image, enabling seamless composition without complex training.
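The fusion mechanism can be pictured as a per-pixel blend. In the paper the masks are produced automatically at test time; in this sketch they are supplied by hand, and the "feature map" is a flat list of scalars:

```python
def fuse_loras(base_out, lora_deltas, masks):
    """Combine a base model's output with several subject-specific LoRA
    deltas, each gated by a soft spatial mask (toy, per-pixel scalars).

    base_out:    list of floats (flattened feature map)
    lora_deltas: {name: list of floats}, each LoRA's additive contribution
    masks:       {name: list of floats in [0, 1]}, where each LoRA applies
    """
    out = list(base_out)
    for name, delta in lora_deltas.items():
        m = masks[name]
        for i in range(len(out)):
            out[i] += m[i] * delta[i]
    return out

# Two subject LoRAs applied to disjoint halves of a 4-"pixel" canvas.
base = [0.0, 0.0, 0.0, 0.0]
deltas = {"dog": [1.0] * 4, "cat": [2.0] * 4}
masks = {"dog": [1, 1, 0, 0], "cat": [0, 0, 1, 1]}
print(fuse_loras(base, deltas, masks))  # [1.0, 1.0, 2.0, 2.0]
```

Because the masks are (near-)disjoint, each LoRA only edits its own region, which is what prevents the subject-blending artifacts naive LoRA merging produces.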
- EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction: Proposes a 4D Gaussian Splatting method for reconstructing surgical scenes from endoscopic video. It uses a rational-wavelet representation to model non-rigid tissue motion and handles photometric inconsistencies, enabling accurate and dynamic 3D visualization for robot-assisted surgery.
- VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding: Introduces a benchmark for evaluating and mitigating hallucinations in Vision-Language Models for video understanding. It uses synthetic videos to test physical and common-sense reasoning, revealing model tendencies to rely on shallow correlations rather than true visual comprehension.
- LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation: Presents a lightweight framework for building unified multimodal models for both understanding and generation. It uses a double fusion approach to efficiently combine pre-trained vision encoders and LLMs, achieving competitive performance without the need for resource-intensive training from scratch.
- 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks: Introduces a large-scale 3D radiology dataset for Medical Visual Question Answering (Med-VQA) using CT scans. It supports diverse diagnostic tasks and multi-temporal analysis, providing a comprehensive benchmark to advance the development of AI for clinical decision support in 3D imaging.
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method: Presents a dataset and method for large-scale, occupancy-centric driving scene generation. The framework allows for the creation of diverse and consistent driving scenarios conditioned on occupancy grids, providing a critical tool for evaluating the perception and planning systems of autonomous vehicles.
- AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays: Proposes an adversarial fair contrastive pre-training method for chest X-ray models to mitigate demographic biases. The AdFair-CLIP framework learns representations that are invariant to sensitive attributes like sex and race, improving model fairness without sacrificing diagnostic accuracy.
- Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration: Introduces Kernel Density Steering (KDS), a novel inference-time framework for diffusion-based image restoration. It guides the sampling process toward high-density regions of the data manifold, promoting high-fidelity outputs and reducing artifacts without requiring additional training or model changes.
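The mode-seeking idea behind KDS is essentially a mean-shift update: nudge a sample toward the kernel-weighted mean of nearby samples, which is an ascent step on a kernel density estimate. A minimal 1-D sketch (the actual method operates on batches of diffusion samples in high dimensions):

```python
import math

def mean_shift_step(x, particles, bandwidth):
    """Move x toward the Gaussian-kernel-weighted mean of `particles`,
    i.e. one ascent step on a kernel density estimate of the data."""
    weights = [math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2)) for p in particles]
    return sum(w * p for w, p in zip(weights, particles)) / sum(weights)

# Particles cluster around 0; a point starting at 2.0 is pulled to the mode.
particles = [-0.2, -0.1, 0.0, 0.1, 0.2]
x = 2.0
for _ in range(10):
    x = mean_shift_step(x, particles, bandwidth=0.5)
print(round(x, 4))
```

The bandwidth plays the same role as in any kernel density estimate: it controls how local the "high-density region" is that the sample gets steered into.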
- Navigating the Accuracy-Size Trade-Off with Flexible Model Merging: Proposes a flexible model merging technique that allows for navigating the trade-off between model accuracy and size. It can combine multiple single-task fine-tuned models into a multi-task model of a specified size, providing a practical way to create efficient models.
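The accuracy-size knob can be illustrated with task-vector merging: layers marked as shared are collapsed by averaging each model's delta from the base checkpoint, while the rest stay per-task. The paper's actual merging procedure is not detailed above; this is a generic sketch with scalar "layers":

```python
def merge_models(base, finetuned_models, shared_layers):
    """Merge single-task models into one multi-task model.

    Layers in `shared_layers` are merged by averaging task vectors
    (finetuned - base); the rest stay per-task, trading extra size for
    accuracy. Weights are toy {layer_name: float} dicts.
    """
    merged = dict(base)
    for layer in shared_layers:
        deltas = [m[layer] - base[layer] for m in finetuned_models]
        merged[layer] = base[layer] + sum(deltas) / len(deltas)
    kept_per_task = {
        layer: [m[layer] for m in finetuned_models]
        for layer in base if layer not in shared_layers
    }
    return merged, kept_per_task

base = {"w1": 1.0, "w2": 2.0}
taskA = {"w1": 1.4, "w2": 2.6}
taskB = {"w1": 0.6, "w2": 2.2}
merged, per_task = merge_models(base, [taskA, taskB], shared_layers={"w1"})
print(merged["w1"], per_task["w2"])
```

Growing `shared_layers` shrinks the merged model (toward a single multi-task network) at some accuracy cost; shrinking it recovers the original per-task accuracy at full size.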
Research Deep Dives by Category
Large Language Models (10 papers)
- Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation: Introduces Ling 2.0, a series of open reasoning-oriented Mixture-of-Experts models scaling up to one trillion parameters. The models are designed with a unified paradigm emphasizing high activation rates for all experts to enhance reasoning capabilities across different model sizes.
- Knocking-Heads Attention: Proposes a new attention mechanism where heads can exchange information within the attention block before interacting with the value matrix. This method enhances individual head capacity and shows performance gains over standard multi-head and multi-query attention mechanisms in language models.
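One plausible instantiation of the head-exchange idea (an assumption for illustration, not necessarily the paper's exact formulation) is to mix per-head attention scores across heads with a heads-by-heads matrix before the softmax and value lookup; the identity matrix then recovers standard multi-head attention:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def knocking_heads_attention(queries, keys, values, mix):
    """Toy multi-head attention with head_dim = 1: per-head attention
    scores are linearly mixed across heads (via `mix`, a heads x heads
    matrix) before the value lookup, so heads 'knock' on each other.

    queries/keys/values: [head][position] scalars; one query position.
    """
    n_heads, n_pos = len(queries), len(keys[0])
    scores = [[queries[h][0] * keys[h][t] for t in range(n_pos)]
              for h in range(n_heads)]
    mixed = [[sum(mix[h][g] * scores[g][t] for g in range(n_heads))
              for t in range(n_pos)] for h in range(n_heads)]
    return [sum(w * v for w, v in zip(softmax(mixed[h]), values[h]))
            for h in range(n_heads)]

# With an identity mix, this reduces to standard multi-head attention.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = knocking_heads_attention([[1.0], [2.0]],
                               [[0.5, 1.0], [1.0, 0.0]],
                               [[1.0, 2.0], [3.0, 4.0]], identity)
print([round(o, 3) for o in out])  # [1.622, 3.119]
```

Any off-diagonal mass in `mix` lets one head's evidence reshape another head's attention pattern, which is the extra capacity the summary describes.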
- ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality: Presents the largest study on multilingual scaling laws, with 774 experiments from 10M to 8B parameters. It establishes adaptive transfer scaling laws that predict optimal data mixture and transfer performance for pre-training and fine-tuning across a wide range of languages.
- Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training: Analyzes schedule-free pre-training methods like warmup-stable-decay (WSD) and weight averaging for large-scale training. It demonstrates that these approaches offer better stability and performance by avoiding the inherent limitations of conventional, fixed-budget cosine learning rate schedules as models and datasets scale.
- Once Upon an Input: Reasoning via Per-Instance Program Synthesis: Introduces a reasoning method where an LLM synthesizes a unique, instance-specific program to solve a problem instead of using a generic chain of thought. This per-instance program synthesis approach improves performance on complex, multi-step reasoning tasks by creating tailored solution logic.
- Generalization or Memorization: Dynamic Decoding for Mode Steering: Proposes a framework for dynamic decoding that can steer an LLM's output towards either generalization or memorization at inference time. The method identifies and controls specific 'memorization heads' within the model to improve reliability and predictability in high-stakes applications.
- Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models: Addresses the problem of catastrophic forgetting when fine-tuning LLMs for reasoning using Reinforcement Learning with Verifiable Rewards (RLVR). It introduces methods to mitigate the degradation of general capabilities while still achieving significant performance gains on specialized mathematical and multimodal reasoning benchmarks.
- Lost in Transmission: When and Why LLMs Fail to Reason Globally: Argues that LLMs fail at global reasoning tasks due to fundamental capacity limits on information flow within the Transformer architecture. It formalizes this limitation, showing that even with long context windows, information can be lost or corrupted during transmission across many layers.
- The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination: Identifies a critical trade-off where enhancing an LLM's reasoning capabilities paradoxically increases its tendency to hallucinate tools and their usage. The study systematically demonstrates that stronger reasoning models are more prone to tool-related failures, posing a significant challenge for developing reliable LLM agents.
- Offline Preference Optimization via Maximum Marginal Likelihood Estimation: Proposes a new, simpler alignment method that recasts preference optimization as Maximum Marginal Likelihood Estimation (MMLE). This approach provides a stable, offline alternative to complex methods like Reinforcement Learning from Human Feedback (RLHF) by directly optimizing for preference probabilities in a dataset.
Computer Vision (10 papers)
- Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting: Proposes a unified framework for 3D open-vocabulary segmentation by integrating 2D segmentation features into Gaussian Splatting. This method allows for querying and segmenting arbitrary objects in a 3D scene without retraining, achieving consistent multi-view semantic understanding and state-of-the-art results.
- H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows: Introduces a method for learning human-object affordances by using 3D generative models to create diverse interaction data. It grounds these interactions with dense diffused flows, enabling models to reason about how humans can functionally interact with objects in a given 3D scene.
- ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation: Presents a novel approach for multi-view 3D object reconstruction that leverages generative priors from diffusion models. The method completes missing geometry from sparse or occluded views by generating plausible shapes, significantly improving reconstruction accuracy and completeness where traditional methods fail.
- LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration Rendering: Develops a Gaussian Splatting-based SLAM system specifically for large-scale dynamic scenes. It uses a hierarchical representation to separately model static backgrounds and dynamic objects, enabling high-fidelity mapping and robust camera tracking in complex, real-world environments with moving elements.
- Unbiased Scene Graph Generation from Biased Training: Addresses the critical issue of training bias in Scene Graph Generation (SGG). The proposed method learns to generate unbiased scene graphs even from long-tail, biased training data, preventing the model from collapsing diverse relationships into overly common and simplified predicates.
- DiffusionLane: Diffusion Model for Lane Detection: Re-frames the task of lane detection as a generative process using a diffusion model. Instead of direct regression or classification, it learns to reverse a diffusion process that adds noise to lane parameters, demonstrating a novel and effective paradigm for structured prediction in autonomous driving.
- Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations: Proposes Concerto, a self-supervised learning framework that jointly learns from 2D images and 3D point clouds. By enforcing consistency between these modalities without explicit labels, the model learns emergent spatial representations that improve performance on various downstream 3D perception tasks.
- Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration: Introduces Kernel Density Steering (KDS), a novel inference-time framework for diffusion models in image restoration. KDS guides the denoising process towards high-density regions of the data manifold, improving output fidelity and reducing artifacts without requiring any model retraining.
- STG-Avatar: Animatable Human Avatars via Spacetime Gaussian: Presents a method for creating realistic, animatable human avatars from monocular video using Spacetime Gaussians. This approach explicitly models both the spatial structure and temporal motion of a person, capturing fine details and complex movements for high-fidelity, controllable avatar rendering.
- USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding: Introduces a self-supervised foundation model for ultrasound imaging using the Masked Autoencoding framework. Pre-trained on a large dataset of unlabeled ultrasound videos, it learns generalizable representations that significantly boost performance across a variety of downstream diagnostic and clinical tasks.
Reinforcement Learning (8 papers)
- Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection: Introduces a population-based method for efficient exploration by adaptively selecting diverse exploratory policies. This approach achieves new state-of-the-art results, breaking human world records on the Atari 2600 benchmark in a sample-efficient manner.
- Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics: Presents a deep reinforcement learning framework to solve mean field games (MFGs) in continuous spaces with non-stationary dynamics. The method overcomes limitations of prior work, enabling the modeling of large-scale multi-agent systems previously intractable in continuous settings.
- Agentic Reinforcement Learning for Real-World Code Repair: Proposes an agentic reinforcement learning framework to train code-fixing agents directly in real-world software repositories. It uses a verifiable pipeline to handle complex builds and dynamic dependencies, successfully demonstrating reproducible fixes for thousands of real-world code issues.
- Online POMDP Planning with Anytime Deterministic Optimality Guarantees: Introduces an online planning algorithm for Partially Observable Markov Decision Processes (POMDPs) to handle decision-making under uncertainty. The method provides anytime deterministic optimality guarantees, offering a principled and reliable framework for planning in practical autonomous systems.
- Online Optimization for Offline Safe Reinforcement Learning: Proposes a novel approach for Offline Safe Reinforcement Learning (OSRL) by framing it as a minimax problem. The method combines offline policy learning with online optimization to find a reward-maximizing policy that satisfies cumulative cost constraints from a fixed dataset.
- Faster Reinforcement Learning by Freezing Slow States: Introduces a method to accelerate reinforcement learning in environments with both 'fast' and 'slow' state variables. By strategically freezing the slow states during updates, the algorithm improves learning speed and efficiency for high-frequency decision-making tasks.
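A rough sketch of the freezing idea in tabular Q-learning: states are (fast, slow) pairs, and the slow component is refreshed only once per block of updates, so within a block the learner effectively plans over the fast variables alone. The block length, state encoding, and update rule here are all illustrative assumptions:

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update on a dict q[(state, action)]."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * best_next - q.get((s, a), 0.0))

def rollout_with_frozen_slow(q, transitions, freeze_every=4):
    """Apply Q-updates to transitions whose states are (fast, slow) pairs,
    holding the slow component fixed within each block of `freeze_every`
    updates to shrink the effective state space per block."""
    frozen_slow = None
    for i, (fast, slow, a, r, fast_next) in enumerate(transitions):
        if i % freeze_every == 0:
            frozen_slow = slow                        # refresh the slow state
        s, s_next = (fast, frozen_slow), (fast_next, frozen_slow)
        q_update(q, s, a, r, s_next, actions=(0, 1))
    return q

q = rollout_with_frozen_slow({}, [(0, "s0", 0, 1.0, 1), (1, "s1", 0, 1.0, 0)])
print(q)
```

Note that the second transition's slow state ("s1") is ignored inside the block; both updates are indexed by the frozen "s0".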
- Lyapunov Function-guided Reinforcement Learning for Flight Control: Presents a cascaded online reinforcement learning system for flight control that is guided by a Lyapunov function. This approach improves the controller's convergence performance and action smoothness, providing a derived metric to ensure stability in a safety-critical application.
- Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization: Introduces the Agent-GSPO framework to create communication-efficient multi-agent systems. It uses sequence-level reinforcement learning to directly optimize for token economy, effectively reducing prohibitive communication costs common in large-scale multi-agent coordination tasks.
Generative AI (10 papers)
- LongCat-Video Technical Report: Introduces LongCat-Video, a 13.6B parameter foundational model for video generation. It is designed for efficient long video inference and demonstrates strong performance across multiple video generation tasks, representing a significant step towards scalable world models and long-form video synthesis.
- Improving Video Generation with Human Feedback: Develops a systematic pipeline to refine video generation models using human feedback. By collecting preference data and training a reward model, this work fine-tunes a rectified flow model to improve motion smoothness and alignment between generated videos and textual prompts.
- FARMER: Flow AutoRegressive Transformer over Pixels: Proposes FARMER, a novel generative model class that directly models the likelihood of raw pixel data using a flow-based autoregressive transformer. This architecture achieves competitive performance and offers a new paradigm for visual data modeling analogous to the scaling of language models.
- Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration: Introduces a multi-agent system for long video generation. It uses a hierarchical, graph-based framework with specialized agents for scriptwriting, storyboarding, and scene generation, coordinated by an orchestrator to ensure consistency and coherence in the final long-form video output.
- Flow-GRPO: Training Flow Matching Models via Online RL: Proposes Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) directly into the training of flow matching models. This approach reframes generation as a sequential decision-making problem, enabling optimization via RL rewards for improved sample quality.
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation: Presents a method for creating an image tokenizer by building upon a frozen vision foundation model (VFM). This approach avoids training a tokenizer from scratch and leverages the rich representations of VFMs to achieve state-of-the-art results in autoregressive image generation.
- Open Multimodal Retrieval-Augmented Factual Image Generation: Introduces a retrieval-augmented generation framework to improve the factual accuracy of text-to-image models. The system retrieves relevant multimodal knowledge from external sources to ground the generation process, reducing factual errors for prompts involving fine-grained attributes or specific events.
- CoMo: Compositional Motion Customization for Text-to-Video Generation: Presents CoMo, a framework for compositional motion customization in text-to-video generation. It allows for the precise control and combination of multiple distinct motions for different subjects within a single video, addressing a key limitation of existing text-to-video models.
- It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps: Proposes a novel sampling technique for diffusion models that uses two parallel samplers. By having the samplers interact and exchange information during the denoising process, the method significantly improves image quality, especially when the number of sampling steps is limited.
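The two-sampler interaction can be caricatured in 1-D: run two noisy samplers from different initializations and periodically average their states. The averaging is a stand-in assumption — the paper's actual exchange mechanism is not described above:

```python
import random

def denoise_step(x, score, step_size, rng):
    """One noisy denoising step up the (toy, 1-D) score direction."""
    return x + step_size * score(x) + 0.1 * rng.gauss(0.0, 1.0)

def tango_sample(score, steps, exchange_every=2, seed=0):
    """Run two samplers in parallel and let them exchange information by
    averaging their states every few steps."""
    rng = random.Random(seed)
    xa, xb = 3.0, -3.0
    for t in range(steps):
        xa = denoise_step(xa, score, 0.2, rng)
        xb = denoise_step(xb, score, 0.2, rng)
        if (t + 1) % exchange_every == 0:
            xa = xb = (xa + xb) / 2.0                 # information exchange
    return (xa + xb) / 2.0

x = tango_sample(lambda v: -v, steps=12)
print(round(x, 3))
```

Averaging cancels part of each sampler's independent noise, which hints at why interaction can recover quality that a short step budget would otherwise cost.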
- Variational Masked Diffusion Models: Introduces Variational Masked Diffusion Models (VMDM), an extension of masked diffusion that better captures dependencies among concurrently predicted tokens. By incorporating a variational inference framework, VMDM improves the generation quality and likelihood scores for discrete data modeling tasks.
AI Safety & Ethics (8 papers)
- Scaling Laws For Scalable Oversight: Proposes a framework to quantify how scalable oversight—the process of weaker AI supervising stronger AI—scales with model size and task complexity. This provides a theoretical foundation for evaluating a key strategy proposed for controlling future superintelligent systems.
- A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1: Demonstrates a simple, transfer-based targeted attack that achieves over 90% success against leading commercial large vision-language models. The attack's high effectiveness on state-of-the-art systems highlights critical, unaddressed vulnerabilities in current safety alignment protocols.
- Mapping Faithful Reasoning in Language Models: Addresses the unfaithfulness of chain-of-thought reasoning by introducing a method to map a model's internal computations to human-understandable concepts. This allows for verifying whether a model's stated reasoning genuinely reflects its decision-making process, enhancing transparency and trust.
- SAGE: A Generic Framework for LLM Safety Evaluation: Introduces SAGE, a generic framework for LLM safety evaluation that moves beyond static, single-turn benchmarks. It assesses conversational dynamics and policy adherence in realistic scenarios, providing a more robust and comprehensive methodology for measuring the safety of deployed language models.
- Epistemic Deep Learning: Enabling Machine Learning Models to Know When They Do Not Know: Presents a framework for epistemic deep learning to manage model uncertainty, enabling models to identify out-of-distribution data or adversarial inputs. This capability is critical for preventing overconfident incorrect predictions and improving reliability in safety-critical applications.
- Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models: Presents a novel primal-dual framework for machine unlearning that explicitly separates the objectives of forgetting specific information and retaining general model utility. This principled approach offers a more robust method for removing sensitive or harmful data from LLMs.
- Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models: Conducts the first systematic audit of social biases across four major multilingual vision-language models and ten languages. The study reveals significant gender and racial disparities that persist and sometimes amplify across different linguistic contexts, highlighting key challenges for global AI fairness.
- Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds: Investigates methods for mitigating gender stereotypes in LLMs, finding that simple supervision is more effective than preference learning techniques like RLHF for controlling compositional biases. This provides direct, actionable guidance for building fairer language models in practice.
Graph Neural Networks (8 papers)
- On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning: Analyzes the core GNN problems of over-smoothing and over-squashing by connecting them to vanishing gradients in recurrent networks. This work proposes a theoretical bridge between GNNs and RNNs, offering a unified view and potential solutions to these fundamental limitations in deep graph models.
- GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability: Introduces GraphInstruct, a dynamic benchmark and methodology to teach Large Language Models graph-related tasks. It generates a dataset of graph-instruction pairs, enabling LLMs to develop and be evaluated on graph understanding and reasoning capabilities without requiring specialized graph-based architectures.
- Revisiting Transformation Invariant Geometric Deep Learning: An Initial Representation Perspective: Revisits transformation invariance in geometric deep learning from an initial representation perspective. The work argues that focusing on the design of initial node features is crucial for achieving invariance, demonstrating how proper representations can simplify model design and improve performance on geometric data.
- Does Homophily Help in Robust Test-time Node Classification?: Investigates the role of homophily in the context of robust test-time node classification. The study reveals that high homophily does not always guarantee robustness against distribution shifts and proposes a new framework to analyze and improve GNN performance under such challenging conditions.
- GraphTOP: Graph Topology-Oriented Prompting for Graph Neural Networks: Proposes GraphTOP, a novel Graph Topology-Oriented Prompting framework for adapting pre-trained GNNs. It designs topology-aware prompts that are added to frozen GNNs, enabling efficient and effective fine-tuning for downstream tasks by explicitly guiding the model with structural information.
- Graph Neural Architecture Search with GPT-4: Leverages GPT-4 for Graph Neural Architecture Search (GNAS), automating the design of GNN architectures. The method uses the LLM to generate, evaluate, and refine GNN architectures from natural language descriptions, significantly reducing the manual effort and deep domain expertise that architecture search typically requires.

- Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning: Introduces a self-supervised learning approach that moves beyond instance-level augmentations by explicitly incorporating inter-instance relationships. The method constructs a graph of instances and applies graph theory to guide representation learning, capturing richer semantic relationships between different data points for improved performance.
- Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion: Proposes a novel model for few-shot knowledge graph completion by explicitly modeling conjugate relations. This approach addresses data sparsity and complex relational patterns by creating a dual graph representation, which improves the model's ability to infer missing links given very few training examples.
Robotics & Embodied AI (8 papers)
- SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents: Proposes a framework for self-evolving embodied agents using tree-structured reinforcement fine-tuning. This method allows agents to autonomously explore and refine their behaviors for long-horizon, real-world tasks, moving beyond static policies to enable continuous, autonomous improvement and reasoning.
- Toward Humanoid Brain-Body Co-design: Joint Optimization of Control and Morphology for Fall Recovery: Presents a brain-body co-design framework for humanoid robots that jointly optimizes control policies and physical morphology. By optimizing for a dynamic task like fall recovery, it demonstrates a path toward creating more physically capable and robust anthropomorphic systems.
- RobotArena∞: Scalable Robot Benchmarking via Real-to-Sim Translation: Introduces a scalable benchmark for generalist robot agents, leveraging real-to-sim translation to enable diverse and rigorous evaluation. The framework supports testing instructable policies across a wide range of tasks and environments, addressing the key bottleneck of real-world robot testing.
- VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting: Develops an embodied agent model capable of concurrent, multimodal interaction. Unlike static Vision-Language-Action models, VITA-E can see, hear, speak, and act simultaneously and dynamically handle user interruptions, enabling more natural and seamless human-robot collaboration.
- Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System: Introduces a system using hierarchical language models to command a heterogeneous aerial-ground robot team. The framework enables semantic navigation and manipulation from high-level instructions, demonstrating generalizable, coordinated behavior across diverse tasks without task-specific models.
- Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2): Presents an end-to-end autonomous driving model trained with reinforcement learning and aligned world models. This approach overcomes the causal confusion and distribution shift issues of imitation learning, demonstrating a more robust method for training agents directly from raw sensor data.
- KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills: Details a physics-based control framework for humanoid robots that enables the learning of highly-dynamic skills from human motion. The system successfully imitates complex 'Kung Fu' movements, showcasing a significant advance over methods limited to tracking slow, simple motions.
- Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence: Proposes a benchmark, Butter-Bench, for evaluating the practical intelligence of LLM-controlled robots. The benchmark assesses an agent's ability to handle the complexities and messiness of real-world physical environments, pushing beyond simplified, abstract task evaluations.
Speech & Audio (6 papers)
- UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models: Introduces UltraVoice, a spoken dialogue model for fine-grained speech style control. By leveraging a large-scale dataset with style annotations, the model generates conversational speech that aligns with specified emotional tones and speaking rates, enhancing human-like interaction in dialogue systems.
- OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model: Presents OpenS2S, a fully open-source, end-to-end Large Speech Language Model for empathetic human-machine communication. The model is designed to understand paralinguistic cues in speech and generate emotionally expressive responses, making this advanced technology accessible for broader research and development.
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks: Proposes FocalCodec, a neural audio codec for high-quality, low-bitrate speech coding using Focal Modulation Networks. It discretizes speech into tokens for language model-based applications, achieving superior performance over existing codecs at low bitrates like 1.5 kbps for efficient speech representation.
- LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization: Introduces LibriConvo, a new large-scale simulated conversational dataset for training and evaluating speaker diarization and ASR systems. It uses a speaker-aware simulation method to generate realistic multi-speaker conversations from read audiobooks, addressing the scarcity of such public data.
- Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion: Proposes a method to increase the naturalness of LLM-generated speech by automatically inserting disfluencies like fillers ('um', 'uh') and repetitions. This technique makes synthesized speech from conversational agents sound more spontaneous and human-like, improving the perceived quality of interaction.
- GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer: Introduces GuitarFlow, a model for realistic and controllable electric guitar synthesis from symbolic tablatures. It uses flow matching and style transfer to generate expressive audio that captures nuances of guitar playing, including various effects and performance styles, advancing controllable music generation.
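To make the disfluency-insertion idea concrete, here is a toy sketch: it randomly interleaves fillers and word repetitions between tokens. The function name, filler list, and placement strategy are all illustrative assumptions, not the paper's actual model, which learns where disfluencies sound natural.

```python
import random

FILLERS = ["um", "uh"]

def insert_disfluencies(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Insert filler words and occasional repetitions between tokens.

    Toy illustration only: real systems learn *where* a disfluency
    sounds natural rather than sampling positions uniformly.
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        out.append(word)
        r = rng.random()
        if r < rate / 2:
            out.append(rng.choice(FILLERS) + ",")  # insert a filler
        elif r < rate:
            out.append(word)  # insert a repetition of the word
    return " ".join(out)
```

With `rate=0.0` the text passes through unchanged; with `rate=1.0` every word is followed by exactly one inserted token.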
Multimodal Learning (8 papers)
- Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences: Proposes a generalist reward model for aligning AI with human preferences across multiple modalities, including video and audio, not just text and images. This aims to overcome modality imbalance in existing reward models, enabling broader and more nuanced AI alignment.
- LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation: Introduces a computationally efficient framework for unified multimodal models that handles both understanding and generation. It uses a novel double fusion approach to achieve competitive performance without the need for training massive models from scratch, addressing a key resource bottleneck.
- PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding: Addresses a key weakness in large multimodal models by focusing on part-level visual understanding. The work enables models to perform fine-grained, compositional reasoning by identifying the distinctive, object-specific parts from which real-world objects are composed.
- Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought: Presents a unified framework for Multimodal Chain-of-Thought (MCoT), a key technique for enhancing reasoning. It categorizes existing methods and proposes a novel approach that generates both textual rationales and visual grounding to improve performance and interpretability in LVLMs.
- Robust Multimodal Learning via Cross-Modal Proxy Tokens: Introduces a simple and effective method using cross-modal proxy tokens to make multimodal models robust to missing modalities at inference time. This approach mitigates the significant performance drop typically seen when one or more input modalities are unavailable.
- VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation: Tackles the data scarcity problem for complex multimodal reasoning by transferring reasoning skills from text-only LLMs to Vision-Language Models. It uses an on-policy distillation method to leverage abundant text-based reasoning data, improving VLM reasoning capabilities without requiring new annotated data.
- Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?: Identifies and documents a systematic failure in unified multimodal models called 'modal aphasia,' where a model can visually recognize a concept but fails to articulate it in text. This diagnostic work highlights a fundamental dissociation and challenge in simultaneous vision-language training.
- Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning: Proposes a novel method to quantitatively measure the degree of imbalance between modalities in audio-visual learning. It introduces a Gaussian Mixture Model (GMM)-guided adaptive loss function to dynamically correct for this imbalance, improving upon existing architectural or optimization-based solutions.
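A minimal sketch of the loss-reweighting idea behind adaptive modality balancing: give the currently harder (higher-loss) modality a larger weight. This softmax weighting is a simplified stand-in of my own, not the paper's GMM-guided loss, and the function name and temperature parameter are assumptions.

```python
import math

def adaptive_weights(loss_audio: float, loss_visual: float,
                     temperature: float = 1.0) -> tuple:
    """Return (audio_weight, visual_weight) summing to 1.

    The harder (higher-loss) modality receives the larger weight,
    so its gradient signal is amplified during joint training.
    """
    ea = math.exp(loss_audio / temperature)
    ev = math.exp(loss_visual / temperature)
    total = ea + ev
    return ea / total, ev / total
```

Equal losses yield equal weights; a lagging modality is automatically up-weighted each step.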
AI Theory & Foundations (6 papers)
- Are Agents Just Automata? On the Formal Equivalence Between Agentic AI and the Chomsky Hierarchy: Establishes a formal equivalence between modern agentic AI architectures and the Chomsky hierarchy of abstract machines. It posits that an agent's memory architecture directly determines its computational power, providing a theoretical framework to classify and understand different agent designs and their limitations.
- On the Hardness of Approximating Distributions with Tractable Probabilistic Models: Investigates the computational hardness of approximating probability distributions using tractable probabilistic models (TPMs). The work establishes complexity-theoretic barriers, showing that even for simple target distributions, finding an accurate approximation within a tractable model class is computationally hard under standard assumptions, clarifying fundamental modeling trade-offs.
- Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms: Proves a tight Lipschitz constant of 1/2 for the softmax function, a fundamental operator in machine learning. This universal bound holds across all $\ell_p$ norms and provides a core mathematical tool for analyzing the robustness and optimization convergence of many models.
- On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling: Bridges the gap between infinite-width scaling limits and practical deep learning by analyzing training with large learning rates. It shows how large learning rates can lead to behavior that diverges from classical theories but better explains the empirical performance of practically trained networks.
- Representer Theorems for Metric and Preference Learning: Geometric Insights and Algorithms: Develops a novel representer theorem for a broad class of metric and preference learning problems within a Hilbert space. The framework shows that the optimal solution lies in a finite-dimensional subspace spanned by the data, providing new geometric insights and algorithmic foundations.
- A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning: Introduces a theoretical framework to quantify the benefits of pre-training and in-context examples for in-context learning (ICL). The analysis decomposes the ICL risk, isolating contributions from the pre-training distribution and the context to formally explain when and why ICL is effective.
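The softmax Lipschitz bound above is easy to probe numerically: for random pairs of inputs, the ratio ||softmax(x) - softmax(y)||_2 / ||x - y||_2 should never exceed 1/2. This checks only the l2 case over random samples; it is a sanity experiment, not a proof.

```python
import math
import random

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def l2(x):
    return math.sqrt(sum(v * v for v in x))

def lipschitz_ratio(x, y):
    """||softmax(x) - softmax(y)||_2 / ||x - y||_2 for x != y."""
    num = l2([a - b for a, b in zip(softmax(x), softmax(y))])
    den = l2([a - b for a, b in zip(x, y)])
    return num / den

rng = random.Random(0)
ratios = [
    lipschitz_ratio([rng.gauss(0, 3) for _ in range(5)],
                    [rng.gauss(0, 3) for _ in range(5)])
    for _ in range(1000)
]
max_ratio = max(ratios)
```

The bound 1/2 is tight: it is approached by nearby inputs around a two-way tie in the logits, so random sampling gets close to, but never exceeds, 0.5.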
Efficient AI (6 papers)
- Unified Sparse Mixture of Experts: Proposes a flexible Sparse Mixture of Experts framework that generalizes prior designs by allowing a variable number of experts per token or tokens per expert. This unified approach enables more adaptive and efficient scaling of model capacity while maintaining constant computational overhead.
- Batch Speculative Decoding Done Right: Addresses the 'ragged tensor problem' in batch speculative decoding by proposing a padding-free algorithm. This method correctly handles variable acceptance lengths across sequences in a batch, improving throughput and making speculative decoding practical for production-level LLM serving systems.
- TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination: Introduces an inference-time algorithm that prunes entire transformer layers from large language models by directly optimizing for task-specific validation performance. The method efficiently identifies and removes redundant layers for specific tasks, reducing computational costs while maintaining high accuracy on modern LLMs.
- AttentionPredictor: Temporal Patterns Matter for KV Cache Compression: Proposes a method to compress the Key-Value (KV) cache for long-context LLM inference by modeling temporal patterns in attention scores. It predicts and retains only critical KV tokens based on their evolving importance, achieving significant memory reduction and speedup with minimal performance degradation.
- TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge: Presents a compression framework for CLIP-like vision-language models that converts both vision and language encoder weights to a ternary (-1, 0, 1) representation. It combines this aggressive quantization with knowledge distillation to maintain performance, significantly reducing model size and computational cost.
- Neural-HAR: A Dimension-Gated CNN Accelerator for Real-Time Radar Human Activity Recognition: Introduces a dimension-gated CNN accelerator for real-time radar-based human activity recognition on edge devices. The architecture dynamically gates computational paths based on input complexity, achieving significant reductions in energy consumption and latency for efficient edge deployment compared to standard CNNs and ViTs.
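To illustrate the ternary-weight idea behind TernaryCLIP, here is a threshold-based ternarization sketch in the spirit of classic ternary weight networks: small weights are zeroed, the rest become +/-1 times a shared scale. The threshold rule and scale estimate are assumptions; the paper's exact quantization and distillation scheme may differ.

```python
def ternarize(weights, threshold_ratio: float = 0.7):
    """Quantize a weight vector to codes in {-1, 0, +1} plus a scale.

    Weights with magnitude below a fraction of the mean absolute
    weight are zeroed; survivors share one positive scale factor.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    t = threshold_ratio * mean_abs
    codes = [0 if abs(w) <= t else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale
```

Each weight then costs under two bits plus one shared float per tensor, which is where the memory savings come from.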
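The KV-cache compression idea can likewise be sketched as a top-k retention rule: keep only the cache positions that have accumulated the most attention mass. This is a simplified illustration; AttentionPredictor additionally *predicts* future attention from temporal patterns rather than ranking by past totals alone, and all names here are assumptions.

```python
def compress_kv(attn_history, keep: int):
    """Select which KV-cache positions to retain.

    attn_history: list of per-step attention score lists, one entry
    per decoding step, each covering all cached positions.
    Returns the indices (sorted) of the `keep` positions with the
    highest total attention mass across the recorded steps.
    """
    n = len(attn_history[0])
    totals = [sum(step[i] for step in attn_history) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:keep])
```

Evicting the remaining positions shrinks cache memory roughly in proportion to `keep / n`, at the cost of losing rarely-attended context.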
AI for Science (6 papers)
- A Physics-Guided AI Cascaded Corrector Model Significantly Extends Madden-Julian Oscillation Prediction Skill: Introduces a physics-guided deep learning framework that acts as a cascaded corrector for dynamical models. This novel approach significantly extends the skillful prediction horizon for the Madden-Julian Oscillation, a key driver of global weather, beyond the typical 3-4 week operational limit.
- ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data: Introduces ChromFound, a universal foundation model trained on vast single-cell chromatin accessibility data (scATAC-seq). By pre-training on a massive repository, the model learns generalizable representations of regulatory mechanisms, enabling high performance on various downstream tasks without task-specific fine-tuning.
- Accelerating Materials Design via LLM-Guided Evolutionary Search: Presents LLEMA, a framework that accelerates materials discovery by coupling large language models with evolutionary search. The LLM leverages its embedded scientific knowledge to guide the evolutionary algorithm, efficiently navigating vast chemical spaces to find novel materials that satisfy multiple design objectives.
- Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs: Identifies the 'tokenization dilemma' where standard LLM tokenizers fail to capture the functional semantics of biomolecular sequences. The paper argues for context-aware processing, demonstrating that moving beyond simple tokenization is critical for unlocking the potential of LLMs in biological discovery.
- Deep Learning Atmospheric Models Reliably Simulate Out-of-Sample Land Heat and Cold Wave Frequencies: Evaluates deep learning-based general circulation models (GCMs) on their ability to simulate extreme weather events outside their training data. The study shows these AI models can reliably replicate out-of-sample heat and cold wave frequencies, providing critical validation for their use as fast climate simulators.
- SemiETPicker: Fast and Label-Efficient Particle Picking for CryoET Tomography Using Semi-Supervised Learning: Proposes SemiETPicker, a fast and label-efficient semi-supervised learning method for particle picking in Cryogenic Electron Tomography (CryoET). The framework addresses a major bottleneck in structural biology by accurately localizing proteins in 3D tomograms while requiring significantly fewer manual annotations for training.
Natural Language Processing (8 papers)
- $\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker: Proposes a new method where text embedding models also serve as efficient listwise rerankers. This unifies two stages of the search pipeline into a single model, improving ranking fidelity while maintaining the high efficiency of dense retrieval systems in real-world applications.
- ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs: Introduces a framework for constructing dynamic, temporal knowledge graphs from unstructured text using LLMs. It focuses on adaptive knowledge extraction to handle time-sensitive information, enabling real-time analytics and temporal inference beyond the capabilities of traditional static knowledge graphs.
- CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora: Presents an unsupervised framework for fine-tuning dense embedding models on specialized corpora. This addresses the critical domain adaptation problem for information retrieval and RAG pipelines without needing labeled data, improving performance on out-of-distribution documents for known corpora.
- DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model: Designs an automated data-centric pipeline and multi-model collaboration training strategy for Text-to-SQL. The work systematically explores the impact of data-centric methods, demonstrating improved model performance by focusing on data quality and diversity over model architecture alone.
- Flexing in 73 Languages: A Single Small Model for Multilingual Inflection: Presents a single, compact model for morphological inflection trained jointly on 73 languages. The lightweight model is robust to unseen words and outperforms specialized monolingual models, demonstrating a highly effective and scalable approach for a fundamental multilingual linguistic task.
- Explaining and Mitigating Crosslingual Tokenizer Inequities: Investigates and explains tokenization disparities (token premiums) across different languages, a fundamental issue impacting cost and performance of multilingual models. The paper demonstrates that these inequities persist even after controlling for script and morphology, and it proposes mitigation strategies.
- Quality-Aware Translation Tagging in Multilingual RAG system: Addresses performance degradation in multilingual RAG systems caused by poor translation quality. It introduces a method to tag retrieved documents with translation quality scores, allowing the generation model to selectively use or ignore low-quality translated context for more reliable outputs.
- Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics: Reveals through a systematic study that top-performing regression-based Quality Estimation (QE) metrics for machine translation exhibit a strong, systematic bias against longer translations. This finding challenges the reliability of current reference-free evaluation methods and their use as reward signals.
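The embedding-as-reranker idea can be sketched in a few lines: score candidate documents by cosine similarity to the query embedding and return them in ranked order. This is only the inference-time skeleton under the assumption of precomputed embeddings; E2Rank's listwise training objective is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rerank(query_emb, doc_embs):
    """Return document indices ordered by similarity to the query.

    A single embedding model thus serves both first-stage retrieval
    and listwise reranking, with no separate cross-encoder pass.
    """
    scored = sorted(enumerate(doc_embs),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [i for i, _ in scored]
```

Because scoring is just dot products over cached vectors, reranking adds almost no latency on top of dense retrieval.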
Key Research Trends & Takeaways
Here are four key trends and takeaways from the latest AI research papers:
- Advanced 3D Control & Unified Understanding: A significant leap in AI's ability to generate, understand, and reconstruct 3D environments and objects is evident. Innovations like zero-shot 3D orientation grounding (ORIGEN) and unifying generation with depth estimation (More Than Generation) enable precise control over 3D content creation and richer scene understanding, crucial for AR/VR, digital twins, and industrial design. This accelerates the development of immersive experiences and automated design workflows.
- Integrating Higher-Level Reasoning and Intent into AI: AI models are moving beyond pattern recognition to incorporate sophisticated reasoning and intent inference. Techniques like transferring LLM reasoning to vision models (VOLD) and understanding egocentric human actions (EgoThinker) allow AI to make more informed decisions in complex, dynamic environments. This is vital for robust autonomous systems, intelligent robotics, and next-generation human-AI interaction.
- Specialized Foundation Models for Complex Domains: The foundation model paradigm is expanding into highly specialized and multi-modal areas, exemplified by self-supervised models for 3D brain MRI (BrainFound) and unified 3D segmentation with Gaussian Splatting (Segment then Splat). These generalizable models learn from vast unlabeled data, significantly reducing annotation burdens and accelerating breakthroughs in fields like medical diagnostics, scientific discovery, and industrial automation.
- Enhanced Generative AI with Precision and Efficiency: Generative models are becoming more controllable, allowing for fine-grained manipulation of outputs, such as compositional motion in video (CoMo). Concurrently, methods like adaptive stochastic coefficients are accelerating sampling and improving quality, making high-fidelity content creation faster and more resource-efficient. This directly impacts content production pipelines, design iterations, and the scalability of AI-driven creative tools.