# Academic Research Intelligence
Deep dive into AI research papers for researchers and academics
---
## Executive Summary
- 1. Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Introduces the Predictive-Corrective (PC) paradigm and PCMambaN network for anatomy-informed brain MRI segmentation. Achieves accelerated learning and improved efficiency in data-scarce medical imaging domains by decoupling modeling tasks.
- 2. Bolt3D: Generating 3D Scenes in Seconds
Presents Bolt3D, a latent diffusion model for feed-forward 3D scene generation from images. It directly samples a 3D scene representation in under seven seconds on a single GPU, achieving a significant speed breakthrough compared to optimization-based methods.
- 3. PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
Introduces PFGS, a pose-aware 3D Gaussian Splatting framework that reconstructs complete objects from multi-pose image captures. Addresses limitations of single-pose methods by integrating pose information for comprehensive reconstructions.
- 4. YOLOE: Real-Time Seeing Anything
Introduces YOLOE, a model extending the YOLO series for real-time open-vocabulary object detection and segmentation. It leverages visual and text prompts to detect and segment any object without being limited by predefined categories, enabling broad real-world applicability.
- 5. Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
Proposes a diffusion bridge network to synthesize clinical-grade FDG-PET scans from standard MRI images for dementia diagnosis. This approach makes a critical diagnostic tool more accessible by simulating it from routinely available, lower-cost imaging data.
- 6. SHARE: Scene-Human Aligned Reconstruction
Introduces SHARE, a technique that leverages scene geometry to accurately ground human motion reconstruction from monocular RGB video. Addresses challenges in placing humans in 3D space for realistic character interactions.
- 17. UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis
Introduces UniMedVL, a unified medical vision-language model for both understanding and generation tasks. It processes diverse multimodal inputs to generate textual reports, visual annotations, and segmentation masks within a single framework, advancing towards a generalist medical AI.
- 8. FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Proposes FreqPDE, rethinking positional depth embedding for multi-view 3D object detection transformers. Addresses depth prediction quality issues in autonomous driving by improving spatial information recovery.
- 9. Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Presents Skyfall-GS, a method to synthesize large-scale, explorable, and geometrically accurate 3D urban scenes from satellite imagery. It addresses the lack of real-world 3D scans for training generative models, enabling immersive applications and simulations.
- 10. Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Introduces the Hierarchical Mixing Architecture (HiMA) for efficient low-light RAW image enhancement. Leverages complementary strengths of Transformer and Mamba for improved enhancement quality and high efficiency.
- 11. AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Introduces AutoGraph-R1, an end-to-end reinforcement learning framework for building knowledge graphs for RAG systems. It directly optimizes the KG construction process to improve performance on downstream question-answering tasks, bridging a critical gap in traditional pipelines.
- 12. Exploring Conditions for Diffusion models in Robotic Control
Explores leveraging pre-trained text-to-image diffusion models for task-adaptive visual representations in robotic control without fine-tuning. Investigates optimal conditions for applying textual prompts to diffusion models.
- 13. VISTA: A Test-Time Self-Improving Video Generation Agent
Proposes VISTA, a test-time self-improving agent for text-to-video generation. Instead of relying on a perfect user prompt, VISTA iteratively refines the generated video based on user-defined scoring functions, improving quality without retraining the base model.
- 14. Proto-Former: Unified Facial Landmark Detection by Prototype Transformer
Proposes Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework. Addresses limitations in single-dataset training by explicitly unifying landmark detection across different datasets.
- 15. BLIP3o-NEXT: Next Frontier of Native Image Generation
Presents BLIP3o-NEXT, a fully open-source vision-language foundation model that unifies text-to-image generation and image editing within a single architecture. The model demonstrates strong performance in both tasks, advancing the capabilities of open-source multimodal systems.
- 16. Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training
Presents a systematic investigation of custom CNN architectures for satellite land use classification, achieving 97.23% accuracy on EuroSAT without pre-training. Introduces a novel balanced multi-task attention mechanism.
- 17. Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Introduces Ditto, a framework to address data scarcity in instruction-based video editing. It features a pipeline to automatically generate a large-scale, high-quality synthetic dataset of video editing examples, enabling the training of more capable models.
- 18. Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
Introduces SiM2P, a 3D diffusion bridge-based framework simulating clinical-grade PET from MRI for dementia diagnostics. Learns a probabilistic mapping from MRI to PET images, addressing accessibility and cost issues of PET scans.
- 19. MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
Presents MAVR-Net, a multi-view learning framework for MAV action recognition using cross-view attention. Addresses limitations of RGB-only models by capturing complex spatial-temporal characteristics of MAV motion.
- 20. V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception
Presents V2X-Radar, a new large-scale, multi-modal dataset for cooperative perception in autonomous driving. It uniquely features 4D radar data alongside LiDAR and camera streams, enabling research on overcoming occlusions and extending perception range through vehicle-to-everything communication.
## AI for Science
- 1. Constrained Diffusion for Protein Design with Hard Structural Constraints
Introduces a diffusion model for protein design that enforces hard structural constraints during generation. This approach overcomes a key failure mode of existing methods, enabling the design of proteins that satisfy precise geometric requirements for functional purposes.
- 2. Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Introduces the Predictive-Corrective (PC) paradigm and a novel network for accelerated learning in data-scarce domains like medical imaging. Achieves faster convergence and improved performance for anatomy-informed brain MRI segmentation, enabling more efficient medical analysis.
- 3. Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields
Presents a generalizable benchmark framework to evaluate specialist versus generalist machine-learned force fields (MLFFs). It uses molecular migration barriers as a probe to assess the transferability and accuracy of pre-trained foundation models for atomistic simulations in materials science.
- 4. Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation
Proposes phase-aware temporal augmentation to address data scarcity in micro-expression recognition. Enhances feature representation by integrating both onset-to-apex and apex-to-offset phases, improving model generalization and recognition performance.
- 5. Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation
Proposes a multi-agent framework to automate the scientific discovery process. The system enables continual and interactive research by dynamically adapting its workflow based on intermediate findings, creating a personalized and automated virtual research group.
- 6. Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions
Presents a pipeline for high-quality 3D plant reconstructions from UAV images, addressing challenges of windy conditions with iterative motion compensation. Enables autonomous, accurate 3D phenotyping for agricultural analysis and research.
- 7. Retro3D: A 3D-aware Template-free Method for Enhancing Retrosynthesis via Molecular Conformer Information
Introduces Retro3D, a template-free method for chemical retrosynthesis that incorporates 3D molecular conformer information. By using geometric deep learning, it improves the prediction of reactants by considering the stereochemistry and spatial arrangements of molecules, moving beyond 2D graph representations.
- 8. Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
Introduces SiM2P, a 3D diffusion bridge-based framework to simulate FDG-PET from MRI for dementia diagnostics. Addresses the accessibility and cost limitations of PET scans, enabling more accessible and affordable diagnostics.
- 9. LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
Introduces LeMat-Traj, a large-scale, unified dataset of quantum mechanical materials trajectories derived from Density Functional Theory. This work addresses data fragmentation and inconsistency to accelerate the development and standardized benchmarking of machine learning interatomic potentials.
- 10. Neural Posterior Estimation for Cataloging Astronomical Images from the Legacy Survey of Space and Time
Proposes neural posterior estimation for cataloging astronomical images from LSST data. Addresses the ill-posed cataloging problem with probabilistic methods, enabling more statistically coherent astronomical catalogs; a toy sketch of the general approach follows this list.
- 11. Clarifying the Ti-V Phase Diagram Using First-Principles Calculations and Bayesian Learning
Resolves a long-standing experimental conflict regarding the titanium-vanadium (Ti-V) phase diagram. The study combines first-principles calculations with Bayesian learning to demonstrate that the observed miscibility gap, a point of scientific contention, is attributable to oxygen contamination during alloy preparation.
- 12. X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
Introduces X$^2$-Gaussian, a continuous-time 4D-CT reconstruction framework integrating dynamic radiative Gaussian splatting. Addresses limitations of fixed phase-binning, enabling more precise dynamic anatomical change capture.
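
Item 10 above describes neural posterior estimation, which trains a network on simulated (parameter, observation) pairs so that inference amounts to a single forward pass. The toy sketch below illustrates that general recipe only; the simulator, the Gaussian posterior head, and the network size are illustrative assumptions, not the LSST cataloging pipeline.

```python
# Toy neural posterior estimation: fit q(theta | x) on simulated pairs.
# Everything here (simulator, Gaussian head) is an illustrative assumption.
import torch
import torch.nn as nn

def simulate(n):
    theta = torch.randn(n, 1)               # draw parameters from the prior
    x = theta + 0.5 * torch.randn(n, 1)     # toy observation model
    return theta, x

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))  # outputs (mu, log_std)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    theta, x = simulate(256)
    mu, log_std = net(x).chunk(2, dim=-1)
    # Negative Gaussian log-likelihood of the true parameter under q(theta | x)
    loss = (0.5 * ((theta - mu) / log_std.exp()) ** 2 + log_std).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Amortized inference: the approximate posterior for a new observation is one forward pass.
mu, log_std = net(torch.tensor([[1.2]])).chunk(2, dim=-1)
print(f"posterior approx N({mu.item():.2f}, {log_std.exp().item():.2f}^2)")
```
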
## AI Safety & Ethics
- 1. Corrigibility Transformation: Constructing Goals That Accept Updates
Proposes a method to construct AI goals that remain open to updates, addressing the core alignment problem of instrumental convergence where an AI might resist changes to its objectives. The transformation aims to prevent goal lock-in during the training process.
- 2. Unfair Learning: GenAI Exceptionalism and Copyright Law
Challenges claims that GenAI should enjoy immunity from copyright law, arguing that fair use considerations apply equally to humans and GenAI. Contends that granting GenAI exceptional privileges is legally unsound, promoting a balanced legal framework for AI-generated content.
- 3. Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Proposes a honeypot-based proactive guardrail system that fine-tunes a bait model to generate ambiguous prompts. This system aims to detect and confirm multi-turn LLM jailbreaks, transforming risk avoidance into risk utilization for enhanced AI safety.
- 4. DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
Introduces a comprehensive benchmark to evaluate deceptive behaviors in large language models across real-world scenarios. The benchmark characterizes different types of deception, such as sycophancy and strategic manipulation, providing a standardized tool for assessing and mitigating these critical risks.
- 5. Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Addresses judgment inconsistencies in LLM-based feedback for reinforcement learning alignment. It introduces a method to deconflict contradictory preferences from AI judges, leading to more stable and effective model training by ensuring a coherent reward signal for the model to learn from.
- 6. GuardReasoner: Towards Reasoning-based LLM Safeguards
Introduces GuardReasoner, a reasoning-based LLM safeguard built by training a guard model on detailed reasoning steps. Enhances controllability and transparency by having the guard model reason explicitly about safety decisions, reducing the risk of hallucinated judgments.
- 7. SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Provides a taxonomy and evaluation framework for prompt security in LLMs, addressing fragmented research on jailbreak attacks and defenses. Standardizes definitions, threat models, and criteria to facilitate systematic progress in LLM safety.
- 8. Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
Introduces a novel method using sequential comic-style narratives to circumvent MLLM safety alignments. Decomposes malicious queries into visually innocuous elements, generating image sequences to bypass safety filters.
- 9. PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models
Introduces a method for detecting backdoor attacks in LLMs by verifying the training process. It requires the model provider to commit to training data checkpoints, allowing a verifier to audit for unauthorized data injections without needing full training replication or data access.
- 10. ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
Demonstrates that large reasoning models often fail to adhere to user instructions within their intermediate reasoning steps, even when the final answer is correct. It introduces a benchmark, ReasonIF, to evaluate this internal instruction-following capability, revealing a critical reliability gap.
- 11. Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Proposes Learning to Detect (LoD), a general framework for accurately detecting unknown jailbreak attacks in LVLMs. Overcomes limitations of attack-specific or heuristic methods by learning generalizable detection parameters.
- 12. NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation
Introduces a noise-driven detection and mitigation framework (NDM) against implicit sexual intentions in T2I models. Addresses subtle prompt cues that trigger inappropriate content due to model biases, enhancing ethical AI.
- 13. Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model
Proposes a more holistic and controllable Concept Bottleneck Model for interpretability. Its lightweight, disentangled design addresses input-to-concept mapping bias, improving transparency by allowing for more accurate and independent manipulation of intermediate human-understandable concepts.
- 14. VaultGemma: A Differentially Private Gemma Model
Introduces VaultGemma 1B, a large language model from the Gemma family fully trained with differential privacy. The work demonstrates the feasibility of training capable, billion-parameter models from scratch while providing formal privacy guarantees for the underlying training data.
- 15. Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models
Introduces Latent Feature Alignment (LFA), an attribute-label-free algorithm using latent directions to identify biased subpopulations in face recognition. Enhances interpretability and fairness by uncovering systematic model biases; a minimal sketch of the latent-direction idea follows this list.
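
Item 15 above hinges on latent directions that separate well-recognized from poorly-recognized samples. The sketch below shows the simplest such direction, the difference of group means in embedding space, on synthetic data; it is a generic illustration of the idea, not the LFA algorithm, and the embeddings and failure labels are assumptions.

```python
# Sketch: rank embeddings along a candidate "error direction" to surface a
# possibly biased subpopulation. Synthetic data; not the LFA algorithm itself.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))          # stand-in for face-recognition embeddings
failed = rng.random(1000) < 0.1             # stand-in for per-sample recognition failures

direction = emb[failed].mean(axis=0) - emb[~failed].mean(axis=0)
direction /= np.linalg.norm(direction)

scores = emb @ direction                    # projection onto the candidate direction
suspect = np.argsort(scores)[-50:]          # samples most aligned with failures
print("failure rate in suspect group:", failed[suspect].mean(), "overall:", failed.mean())
```
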
## AI Theory & Foundations
- 1. From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons
Revisits the Universal Approximation Theorem using tropical geometry. It provides a constructive, geometry-aware initialization for sigmoidal MLPs, showing they can approximate functions by mapping inputs to vertices of a zonotope and then using a linear readout layer for interpolation.
- 2. Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Introduces the Predictive-Corrective (PC) paradigm to decouple deep learning tasks, accelerating learning and improving efficiency. Demonstrates this with a novel network for anatomy-informed brain MRI segmentation, addressing slow convergence and data scarcity.
- 3. The Coverage Principle: How Pre-training Enables Post-Training
Proposes the 'coverage principle' to explain how pre-training enables successful fine-tuning. It posits that pre-training succeeds by learning a representation that covers the feature space of downstream tasks, demonstrating that this coverage, rather than low cross-entropy, predicts fine-tuning performance.
- 4. How Sparse Can We Prune A Deep Network: A Fundamental Limit Perspective
Investigates the fundamental limit of network pruning by imposing sparsity constraints directly on the loss function. Characterizes a sharp phase transition point, providing theoretical insights into network pruning's capabilities.
- 5. Euclidean Distance Matrix Completion via Asymmetric Projected Gradient Descent
Proposes and analyzes the Asymmetric Projected Gradient Descent (APGD) algorithm for Euclidean Distance Matrix Completion. Establishes global convergence guarantees with exact recovery, paralleling incoherence-based matrix completion frameworks; a toy completion sketch follows this list.
- 6. When Does Closeness in Distribution Imply Representational Similarity? An Identifiability Perspective
Addresses when distributional closeness implies representational similarity using identifiability theory. It shows that for identifiable models, similarity is guaranteed, but for unidentifiable models like MLPs, representations can be arbitrarily dissimilar despite identical output distributions, proposing canonicalization-based solutions.
- 7. Which exceptional low-dimensional projections of a Gaussian point cloud can be found in polynomial time?
Studies low-dimensional projections of Gaussian point clouds in the proportional asymptotic regime. Investigates which exceptional projections can be found in polynomial time, contributing to the understanding of dimensionality reduction theory.
- 8. Language Models are Injective and Hence Invertible
Proves that despite non-injective components like ReLU and LayerNorm, common language model architectures are injective with high probability for sufficiently wide hidden dimensions. This implies that their internal representations are invertible, allowing for the exact recovery of inputs from hidden states.
- 9. Uncertainty Quantification for Physics-Informed Neural Networks with Extended Fiducial Inference
Develops a novel framework for uncertainty quantification in Physics-Informed Neural Networks (PINNs) using Extended Fiducial Inference. Addresses limitations of Bayesian and dropout methods by providing more honest uncertainty estimates.
- 10. A simple mean field model of feature learning
Derives a tractable, self-consistent mean-field theory for feature learning in two-layer non-linear networks using methods from statistical physics. The model captures the dynamics of training, including the emergence of features from an initial random state, providing analytical insights into the learning process.
- 11. On the Neural Feature Ansatz for Deep Neural Networks
Investigates feature learning by formalizing the Neural Feature Ansatz (NFA) and its relationship to Gram matrices and AGOP. Proves the NFA holds under gradient flow dynamics, offering mathematical foundations for deep networks.
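
Item 5 above concerns recovering geometry from a partially observed distance matrix. The sketch below solves a toy instance with plain gradient descent on point coordinates; it is a generic Euclidean distance matrix completion baseline on synthetic data, not the APGD algorithm or its convergence analysis.

```python
# Sketch: fit point coordinates to the observed entries of a squared-distance matrix.
# Generic gradient-descent baseline on synthetic data, not APGD.
import numpy as np

rng = np.random.default_rng(1)
P_true = rng.normal(size=(50, 3))                          # ground-truth points
D = ((P_true[:, None] - P_true[None, :]) ** 2).sum(-1)     # full squared-distance matrix
mask = np.triu(rng.random((50, 50)) < 0.3, 1)
mask = mask | mask.T                                       # symmetric set of observed entries

P = rng.normal(size=(50, 3))                               # initial coordinate estimate
for _ in range(20000):
    diff = P[:, None] - P[None, :]
    R = mask * ((diff ** 2).sum(-1) - D)                   # residual on observed entries
    grad = (R[:, :, None] * diff).sum(axis=1) / mask.sum() # scaled gradient of the loss
    P -= 0.1 * grad

final = np.abs(mask * (((P[:, None] - P[None, :]) ** 2).sum(-1) - D)).sum() / mask.sum()
print("mean abs residual on observed squared distances:", round(final, 4))
```
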
## Computer Vision
- 1. PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
Introduces PFGS, a pose-aware 3D Gaussian Splatting framework that reconstructs complete objects from multi-pose captures. Achieves high-quality, real-time novel-view synthesis for objects with occluded regions, enabling better 3D reconstructions from varied viewpoints.
- 2. YOLOE: Real-Time Seeing Anything
Proposes YOLOE, a model that integrates the efficiency of YOLO architectures with open-vocabulary capabilities. It enables real-time detection and segmentation of objects described by arbitrary text prompts, removing the limitation of predefined categories found in traditional object detectors; a minimal sketch of prompt-based scoring follows this list.
- 3. SHARE: Scene-Human Aligned Reconstruction
Introduces SHARE, a technique leveraging scene geometry to accurately ground human motion reconstruction from monocular RGB videos. Enables realistic character interactions by placing humans precisely in 3D space for gaming, AR/VR, and robotics.
- 4. UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
Introduces UniMamba, an architecture for LiDAR 3D detection that replaces Transformer blocks with the Mamba state-space model. This novel approach efficiently captures global dependencies in point clouds, demonstrating a powerful and promising alternative to self-attention for 3D perception tasks.
- 5. Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Proposes HiMA, a Hierarchical Mixing Architecture, rethinking low-light RAW image enhancement by combining Transformer and Mamba strengths. Achieves strong enhancement quality and high efficiency, addressing limitations of existing deep learning approaches.
- 6. Neuro-Symbolic Spatial Reasoning in Segmentation
Presents a neuro-symbolic framework for Open-Vocabulary Semantic Segmentation that enhances vision-language models with spatial reasoning. This allows the model to generalize to unseen object categories by understanding relationships between objects, improving segmentation beyond simple patch-to-text correlations.
- 7. FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Introduces FreqPDE, rethinking positional depth embedding for multi-view 3D object detection transformers. Addresses limitations of explicit depth supervision, improving accuracy and detail in autonomous driving perception.
- 8. Diffusion Models are Efficient Data Generators for Human Mesh Recovery
Demonstrates that diffusion models can effectively generate large-scale synthetic datasets for 3D human pose and shape estimation. This approach overcomes the limitations of scarce real-world motion capture data, leading to improved performance and robustness for human mesh recovery models.
- 9. Proto-Former: Unified Facial Landmark Detection by Prototype Transformer
Proposes Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework using a prototype transformer. Addresses limitations of single-dataset training, improving model generalization across different facial landmark datasets.
- 10. Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning
Introduces a staged adaptive fine-tuning approach for surgical tool presence detection in laparoscopic videos. Leverages gradual freezing fine-tuning to overcome limited annotated data challenges in surgical settings for robust deep learning models.
- 11. MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
Proposes MSAM, a multi-semantic adaptive mining framework for cross-modal drone video-text retrieval. Addresses challenges of drone videos' unique characteristics for efficient semantic retrieval in overhead perspectives.
- 12. LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Presents LightsOut, a diffusion-based model for removing lens flare, especially from off-frame light sources. By using an outpainting technique to synthesize the missing light source, the model achieves more physically plausible and effective flare removal compared to previous methods.
- 13. SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images
Introduces a method for generalizable 3D reconstruction from sequential, unposed images using 3D Gaussian Splatting. The model builds a compact, structure-aware scene representation over time, enabling on-the-fly reconstruction that is more efficient and consistent than combining per-pixel predictions.
- 14. MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
Presents MAVR-Net, a multi-view learning framework with cross-view attention for Micro Aerial Vehicle action recognition. Overcomes limitations of RGB-only data to capture complex spatio-temporal characteristics for robust action distinction.
- 15. CuSfM: CUDA-Accelerated Structure-from-Motion
Introduces cuSfM, a CUDA-accelerated offline Structure-from-Motion system leveraging GPU parallelization. Achieves efficient and accurate camera pose estimation and dense reconstruction for autonomous navigation and robotic perception.
- 16. CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
Proposes CHROME, a model for reconstructing 3D clothed humans from a single image with enhanced occlusion resilience. The method improves the generation of a complete human shape by ensuring multiview consistency, resulting in more robust and accurate reconstructions even with partial views.
- 17. Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Introduces the Predictive-Corrective (PC) paradigm to accelerate deep learning, decoupling modeling tasks. Proposes PCMambaN for anatomy-informed brain MRI segmentation, addressing slow convergence and data-scarce domain limitations.
- 18. Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification
Leverages pre-trained diffusion models to improve domain-generalizable person re-identification. By using these powerful generative models, the method enhances feature representation and robustness to unseen domains, addressing a critical challenge for deploying Re-ID systems in real-world scenarios.
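
Item 2 above replaces a fixed classifier head with scoring against prompt embeddings, so the label set is chosen at inference time. The sketch below shows only that scoring step, with random vectors standing in for the outputs of trained visual and text encoders; the dimensions and prompt list are assumptions.

```python
# Sketch: assign each candidate region the prompt with the highest cosine similarity.
# Random vectors stand in for real visual/text encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
prompts = ["a dog", "a bicycle", "a traffic cone"]       # vocabulary chosen at runtime
text_emb = rng.normal(size=(len(prompts), 256))          # stand-in for a text encoder
region_emb = rng.normal(size=(10, 256))                  # stand-in for per-region features

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

scores = unit(region_emb) @ unit(text_emb).T             # cosine similarities, shape (10, 3)
for i, (lab, sc) in enumerate(zip(scores.argmax(axis=1), scores.max(axis=1))):
    print(f"region {i}: {prompts[lab]} (score {sc:.2f})")
```
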
## Efficient AI
- 1. Cross-layer Attention Sharing for Pre-trained Large Language Models
Proposes sharing attention mechanisms across different layers in Large Language Models to reduce inter-layer redundancy. This approach achieves significant efficiency gains by reusing computation and parameters, demonstrating its effectiveness on various LLM architectures without notable performance degradation; a minimal parameter-sharing sketch follows this list.
- 2. CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
Introduces CAIT, a compression framework for Vision Transformers (ViTs) that balances accuracy, inference speed, and transferability. Achieves state-of-the-art compression with minimal performance loss, enabling efficient deployment on resource-constrained devices.
- 3. TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Introduces a dynamic alignment method for speculative decoding that enables acceleration even when the draft and target models have different tokenizers. The technique broadens the applicability of this key inference optimization, improving LLM efficiency across diverse model pairs.
- 4. Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection
Proposes Quantized FCA, an efficient zero-shot texture anomaly detection method. Achieves high accuracy with significantly reduced running time, making it practical for real-world deployment in industrial monitoring scenarios.
- 5. Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs
Presents a GPU-native bilinear operator as a drop-in alternative to matrix multiplication (MatMul) in neural networks. This fundamental change offers a direct trade-off between speed, accuracy, and parameter count, enabling faster and more efficient models.
- 6. Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration
Presents an ultra-lightweight, data-free denoising model for biomedical image restoration. Achieves fast denoising and high-quality restoration, addressing computational and memory constraints of current self-supervised techniques.
- 7. Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Introduces Hierarchical Mixing Architecture (HiMA) for efficient low-light image enhancement. Achieves strong enhancement quality and high efficiency by combining Transformer and Mamba strengths, addressing limitations of current deep learning approaches.
- 8. PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Introduces an adaptive method to manage the KV cache in vision-language models by identifying and caching a shared visual "prefix" from image tokens. This reduces computational and memory overhead during text generation, significantly accelerating inference for multimodal tasks.
- 9. Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination
Details a system for on-device LLMs that uses hybrid context management and hardware coordination. By intelligently caching context and leveraging specialized neural processors, it enables efficient, personalized generation with large context windows on mobile devices.
- 10. Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Introduces the Predictive-Corrective (PC) paradigm for accelerated learning in deep learning. Decouples modeling tasks to improve efficiency and applicability, particularly in data-scarce domains like medical imaging.
- 11. msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML
Introduces msf-CNN, a patch-based fusion technique for Convolutional Neural Networks on TinyML. Achieves extreme memory efficiency and small inference latency, fitting within MCU memory budgets for real-time constraints.
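
Item 1 above reduces redundancy by letting several transformer layers reuse a single attention module. The PyTorch sketch below shows that parameter-sharing pattern in isolation; the block layout and the group size of four are illustrative assumptions rather than the paper's exact scheme.

```python
# Sketch: a small transformer stack in which groups of layers share one attention module,
# shrinking attention parameters roughly by the group size.
import torch
import torch.nn as nn

class SharedAttnBlock(nn.Module):
    def __init__(self, d_model, attn):
        super().__init__()
        self.attn = attn                      # shared nn.MultiheadAttention instance
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ff(self.norm2(x))

d_model, n_layers, group = 256, 8, 4
shared = [nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
          for _ in range(n_layers // group)]
stack = nn.Sequential(*[SharedAttnBlock(d_model, shared[i // group]) for i in range(n_layers)])

x = torch.randn(2, 16, d_model)
print(stack(x).shape)                                  # torch.Size([2, 16, 256])
print(sum(p.numel() for p in stack.parameters()))      # shared weights are counted once
```
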
## Generative AI
- 1. DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion
Presents DriveGen3D, a framework for generating controllable dynamic 3D driving scenes using efficient video diffusion, addressing computational demands and 3D representation limitations.
- 2. PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
Introduces PFGS, a pose-aware 3D Gaussian Splatting framework addressing incomplete reconstructions from single-pose captures by enabling complete multi-pose object reconstruction.
- 3. Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Introduces Ditto, a framework for instruction-based video editing, featuring a novel data generation pipeline fusing an image editor with an in-context video generator.
- 4. VISTA: A Test-Time Self-Improving Video Generation Agent
Introduces VISTA, a multi-agent system that autonomously improves video generation by refining prompts iteratively, addressing the prompt-dependency of current text-to-video synthesis.
- 5. 3DPR: Single Image 3D Portrait Relight using Generative Priors
Proposes 3DPR, an image-based relighting model leveraging generative priors for single image 3D portrait relighting, offering an alternative to traditional differentiable rendering constraints.
- 6. Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
Introduces Skyfall-GS, synergizing satellite imagery and open-domain diffusion models to synthesize large-scale, explorable, geometrically accurate 3D urban scenes for immersive applications.
- 7. Bolt3D: Generating 3D Scenes in Seconds
Presents Bolt3D, a latent diffusion model for fast feed-forward 3D scene generation, sampling a 3D scene representation in under seven seconds using 2D diffusion architectures.
- 8. Latent Diffusion Model without Variational Autoencoder
Proposes a latent diffusion model architecture that eliminates the need for a Variational Autoencoder (VAE). This VAE-free approach aims to improve training efficiency, inference speed, and the transferability of the model to new domains by simplifying the generation pipeline and its dependencies.
- 9. Exploring Conditions for Diffusion models in Robotic Control
Investigates using pre-trained text-to-image diffusion models for task-adaptive visual representations in robotic control without fine-tuning. Shows that naive application of textual conditions yields minimal or negative gains, guiding future research.
- 10. Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
Presents SiM2P, a 3D diffusion bridge-based framework that learns a probabilistic mapping from MRI and auxiliary patient information to simulate FDG-PET images, addressing accessibility and cost challenges.
- 11. LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Proposes LightsOut, a diffusion-based outpainting framework to enhance single image flare removal by reconstructing off-frame light sources, improving realism for computer vision tasks.
- 12. BLIP3o-NEXT: Next Frontier of Native Image Generation
Presents BLIP3o-NEXT, an open-source foundation model that unifies text-to-image generation and image editing within a single, cohesive architecture. The model demonstrates strong capabilities in both tasks, offering a versatile solution for creating and modifying visual content natively without separate specialized models.
- 13. LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
Presents LayerCraft, a modular framework using Large Language Models (LLMs) with Chain-of-Thought reasoning to control text-to-image generation. The LLM acts as an agent, decomposing prompts into layers and composing objects for improved spatial control, object consistency, and multi-step editing.
- 14. NFIG: Autoregressive Image Generation with Next-Frequency Prediction
Proposes NFIG, an autoregressive image generation model that predicts image information in the frequency domain rather than the standard spatial (pixel) domain. By generating spectral components sequentially, it better leverages the hierarchical structure of image data to improve generation quality and coherence; a toy frequency-band decomposition follows this list.
- 15. TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
Presents Text-Grounded Trajectories (TGT), a method for achieving local control in video generation. By allowing users to specify object paths and link them to text descriptions, it enables precise control over subject composition and movement, addressing a key limitation in text-to-video models.
- 16. AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
Proposes AlignFlow, a method to improve flow-based generative models by incorporating semi-discrete Optimal Transport (OT). This technique straightens the flow trajectories between noise and data distributions during training, leading to more efficient inference and improved generation performance for this model class.
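
Item 14 above generates spectral components in order, coarse frequencies before fine ones. The toy decomposition below makes that ordering concrete by splitting an image into frequency rings with an FFT and re-adding them low-to-high; NFIG itself predicts these components with a learned autoregressive model, so this is background illustration only.

```python
# Sketch: split an image into low/mid/high frequency rings and rebuild it coarse-to-fine,
# mirroring a "next-frequency" generation order.
import numpy as np

img = np.outer(np.hanning(64), np.hanning(64))              # toy 64x64 "image"
F = np.fft.fftshift(np.fft.fft2(img))

yy, xx = np.mgrid[-32:32, -32:32]
radius = np.sqrt(xx ** 2 + yy ** 2)
bands = [(0, 4), (4, 12), (12, 46)]                          # low, mid, high frequency rings

recon = np.zeros_like(img)
for lo, hi in bands:
    ring = (radius >= lo) & (radius < hi)
    recon += np.fft.ifft2(np.fft.ifftshift(F * ring)).real   # add this band's detail
    print(f"after band [{lo},{hi}): mean abs error {np.abs(recon - img).mean():.4f}")
```
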
## Graph Neural Networks
- 1. Understanding Generalization in Node and Link Prediction
Investigates generalization in message-passing graph neural networks for node and link prediction. Analyzes how diverse MPNN architectures perform beyond training sets, highlighting limited understanding and attention for these specific prediction tasks.
- 2. Attn-JGNN: Attention Enhanced Join-Graph Neural Networks
Proposes Attn-JGNN, an attention-enhanced join-graph neural network for #SAT problems. Uses tree decomposition to encode CNF formulas, performs iterative message passing, and approximates model counts, significantly improving solving accuracy.
- 3. Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
Proposes a retrieval-augmented generation method to reduce hallucinations in Large Language Models (LLMs). It enhances trustworthy reasoning by more effectively exploiting the prior knowledge and relational structures embedded within knowledge graphs, thereby improving the factuality of generated text.
- 4. Backdoor or Manipulation? Graph Mixture of Experts Can Defend Against Various Graph Adversarial Attacks
Leverages Mixture of Experts (MoE) for a scalable, unified framework defending against multiple graph adversarial attacks. Designs an MoE architecture to simultaneously defend against manipulation, node injection, and backdoor attacks.
- 5. Landmark-Based Node Representations for Shortest Path Distance Approximations in Random Graphs
Introduces a novel node embedding technique using landmarks to approximate shortest path distances within graphs. This method is designed to capture global graph structure, a common weakness in existing embeddings that primarily preserve local similarities, thus improving performance on distance-aware tasks; a minimal landmark-distance sketch follows this list.
- 6. Hypergraph Contrastive Sensor Fusion for Multimodal Fault Diagnosis in Induction Motors
Develops a hypergraph contrastive learning framework for multimodal sensor fusion in industrial fault diagnosis. The model captures complex, high-order correlations between different sensor signals to improve the accuracy and reliability of detecting faults in induction motors under various operating conditions.
- 7. AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Introduces AutoGraph-R1, the first framework optimizing knowledge graph construction for Retrieval-Augmented Generation (RAG) performance using Reinforcement Learning. Directly optimizes KG construction for downstream QA system effectiveness.
- 8. Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
Introduces Neural Mean-Field Games by combining Mean-Field Game theory with Neural Stochastic Differential Equations. Reduces dependency on model-free approaches, addressing challenges in solving intractable games.
- 9. CQD-SHAP: Explainable Complex Query Answering via Shapley Values
Proposes CQD-SHAP for explainable complex query answering over knowledge graphs. Integrates Shapley values to provide interpretability for neurosymbolic CQA methods, addressing user trust concerns.
- 10. Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation
Introduces ParallaxRAG, a framework for multi-hop reasoning using Knowledge-Graph-Based Retrieval-Augmented Generation. Decouples queries and triples into multi-view spaces for robust retrieval and constrained weak supervision.
- 11. KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Proposes KGAlign for multimodal fake news detection by jointly encoding semantic and structural knowledge. Incorporates external knowledge and entity relationships, addressing limitations in local object-level details and global context.
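
Item 5 above rests on a classical trick: store each node's shortest-path distances to a few landmarks and bound any pairwise distance by routing through the best landmark. The sketch below runs that baseline with networkx on a random graph; uniform random landmark selection is an assumption here, not necessarily the paper's strategy.

```python
# Sketch: landmark embeddings approximate shortest-path distances via the triangle
# inequality: d(u, v) <= min over landmarks l of d(u, l) + d(l, v).
import random
import networkx as nx

random.seed(0)
G = nx.erdos_renyi_graph(300, 0.03, seed=0)
landmarks = random.sample(list(G.nodes), 8)

# Each node's embedding is its vector of distances to the landmarks (one BFS per landmark).
dist_to = {l: nx.single_source_shortest_path_length(G, l) for l in landmarks}

def approx_distance(u, v):
    return min(dist_to[l].get(u, float("inf")) + dist_to[l].get(v, float("inf"))
               for l in landmarks)

u, v = 5, 123
print("true:", nx.shortest_path_length(G, u, v), "| landmark upper bound:", approx_distance(u, v))
```
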
## Large Language Models
- 1. Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
Proposes a microscopic signature, the "Soundness-Aware Level," that can predict the reasoning potential of a pre-trained LLM. This metric helps identify models that will benefit most from reinforcement learning with verifiable rewards (RLVR) before undergoing expensive fine-tuning, improving model selection.
- 2. When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
Investigates MLLMs' ability to actively acquire missing evidence under incomplete information, revealing limits in real-world scenarios. Proposes a framework to test active evidence acquisition for MLLMs.
- 3. Continual Learning via Sparse Memory Finetuning
Introduces a method for continual learning that mitigates catastrophic forgetting by identifying and freezing important weights. This approach, Sparse Memory Finetuning, allows models to learn new information over time while preserving previously acquired capabilities without requiring access to old data.
- 4. Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination
Proposes a hybrid context and hardware coordination approach to accelerate on-device LLM generation. Achieves faster token-by-token generation with improved hardware utilization for mobile applications.
- 5. Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
Presents a novel hybrid architecture where a discrete diffusion model acts as a "Planner" to generate a reasoning outline, and an autoregressive model acts as an "Executor" to fill in details. This collaboration combines parallel generation with high-accuracy token-by-token processing.
- 6. Latent Reasoning in LLMs as a Vocabulary-Space Superposition
Explores latent reasoning in LLMs by restricting it to a structured vocabulary-space superposition. Addresses performance degradation in unstructured latent spaces and improves reasoning efficiency.
- 7. Attention Sinks in Diffusion Language Models
Identifies the "Attention Sink" phenomenon in Masked Diffusion Language Models, analogous to that in autoregressive models. The study shows initial tokens act as sinks to which other tokens attend, and that adding a dedicated sink token can improve model performance.
- 8. AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Introduces AutoGraph-R1, the first framework to optimize KG construction for RAG/QA using Reinforcement Learning. Bridges the disconnect between KG construction and downstream task performance for improved QA systems.
- 9. Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Surveys multimodal retrieval-augmented generation for document understanding, highlighting limitations of OCR-based and native MLLM approaches. Argues for RAG as a way to ground models in external data for improved context modeling.
- 10. From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
Proposes Hierarchical Byte Pair Encoding (HBPE), a dynamic tokenization method that groups characters into subwords based on frequency. This approach creates a more efficient vocabulary representation, particularly for rare words, and reduces the required size of the embedding matrix; a minimal BPE merge-loop sketch follows this list.
- 11. Adaptive Minds: Empowering Agents with LoRA-as-Tools
Introduces "Adaptive Minds," an agentic system that treats LoRA adapters as specialized, callable tools. The base LLM acts as a semantic router, dynamically selecting and applying the most appropriate LoRA for a given query, enabling adaptive, domain-specific expertise without full fine-tuning.
- 12. VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
Introduces VocalBench-DF, a benchmark for evaluating the robustness of speech LLMs to disfluency, probing how Speech-LLMs perform for users with speech impairments.
- 13. ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
Presents ProofOptimizer, a technique for training language models to simplify complex formal proofs without human demonstrations. The model learns a simplification policy via reinforcement learning, using the length of the new, verified proof as a reward signal, making outputs more human-readable.
- 14. Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering
Presents a controllable abstract summary generation method for LLMs using prompt engineering. Designs a multi-stage prompt framework to generate summaries with varying abstraction levels.
- 15. The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling
Addresses limited exploration and entropy collapse in reinforcement learning for LLMs. Proposes a sequential sampling method that encourages the model to explore a wider variety of reasoning paths, leading to improved sampling diversity and better performance on complex reasoning tasks.
- 16. Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry
Evaluates LLM performance in classical Chinese poetry generation using a three-step framework combining computational metrics, LLM-as-a-judge, and human validation. Analyzes poetic quality dimensions and identifies biases.
- 17. CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning
Introduces Calibrated Best-of-N (CarBoN) sampling, a method that improves test-time reasoning by better selecting from multiple generated candidates. It calibrates model confidence scores to more accurately reflect correctness, addressing the diminishing returns of standard Best-of-N sampling as N increases.
- 18. Finetuning LLMs for EvaCun 2025 token prediction shared task
Presents fine-tuned LLMs (Command-R, Mistral, Aya Expanse) for the EvaCun 2025 token prediction task. Compares three prompt-based approaches for obtaining predictions on task data.
- 19. Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Introduces Infinity Parser, a layout-aware RL framework for scanned document parsing. Addresses generalization issues of supervised methods on diverse document types and limited training data.
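
Item 10 above builds on byte-pair encoding, whose core loop repeatedly merges the most frequent adjacent symbol pair. The sketch below shows that plain base loop on a toy corpus; HBPE's dynamic, hierarchical grouping goes beyond it, so treat this as background rather than the paper's method.

```python
# Sketch: plain byte-pair encoding — repeatedly merge the most frequent adjacent pair.
from collections import Counter

words = [list(w) + ["</w>"] for w in ["lower", "lowest", "newer", "newest"]]

for step in range(6):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    (a, b), _ = pairs.most_common(1)[0]      # most frequent adjacent pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b); i += 2    # apply the merge
            else:
                out.append(w[i]); i += 1
        merged.append(out)
    words = merged
    print(f"merge {step + 1}: {a}+{b} ->", words[0])
```
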
## Multimodal Learning
- 1. End-to-End Multi-Modal Diffusion Mamba
Proposes MDM, a unified architecture that uses a shared Mamba-based encoder-decoder and diffusion model to process multiple modalities. This end-to-end framework learns a joint representation for text, images, and audio, eliminating the need for separate modality-specific components for unified processing.
- 2. FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Introduces FlexiReID, a framework supporting seven retrieval modes across four modalities (RGB, infrared, sketches, text). Utilizes an adaptive mixture-of-experts mechanism to dynamically integrate experts, enabling flexible cross-modal person re-identification.
- 3. OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Introduces OmniVinci, an open-source omni-modal large language model designed to perceive across modalities like a human. The work details architectural enhancements and a careful data curation strategy to create a strong, unified foundation model for general-purpose multimodal understanding.
- 4. MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
Presents the first systematic study of drone video-text retrieval (DVTR) and proposes MSAM. Addresses characteristics unique to drone videos, such as overhead perspectives and structural homogeneity, to improve cross-modal retrieval effectiveness.
- 5. MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
Presents MAVR-Net, a multi-view learning framework for Micro Aerial Vehicle action recognition using cross-view attention. Overcomes limitations of RGB-only models by capturing complex spatio-temporal characteristics for better action distinction.
- 6. Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Addresses the tendency of MLLMs to neglect visual details by introducing a visual embedding distillation method. This technique transfers rich perceptual signals from the vision encoder's output embeddings directly to the MLLM, enhancing the model's visual perception capabilities for grounded understanding.
- 7. Directional Reasoning Injection for Fine-Tuning MLLMs
Proposes a method to improve MLLM reasoning by injecting capabilities from powerful text-only LLMs. The approach fine-tunes the MLLM using 'reasoning trajectories' generated by a teacher model, effectively bridging the reasoning gap without needing large-scale annotated multimodal reasoning datasets.
- 8. Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
Introduces Imaginarium, a vision-guided framework for generating coherent 3D scene layouts. Addresses challenges of existing methods by capturing complex spatial relationships and producing rich, diverse content with improved robustness.
- 9. Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images
Proposes a multi-view architecture for robust DeepFake detection by analyzing facial features at multiple levels. Integrates specialized encoders and global analysis to enhance detection accuracy in natural image conditions.
- 10. Scope: Selective Cross-modal Orchestration of Visual Perception Experts
Introduces SCOPE, a Mixture-of-Encoders framework that enhances vision-language models by dynamically selecting the most suitable vision encoder for each image-text pair. This approach improves performance and efficiency by orchestrating a set of specialized encoders, avoiding the high costs of using them all; a minimal routing sketch follows this list.
- 11. Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
Proposes a theoretical enhancement to CLIP's similarity computation by demonstrating that optimal similarity metrics possess a linear structure. Based on this insight, a new similarity mechanism is introduced that improves performance and training stability over the standard cosine similarity used in contrastive pre-training.
- 12. Exploring Conditions for Diffusion models in Robotic Control
Explores using text-to-image diffusion models for task-adaptive visual representations in robotic control without model fine-tuning. Investigates conditions for effectively applying textual conditions to improve robotic control performance.
- 13. Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Introduces Spatial457, a diagnostic benchmark designed to evaluate the 3D spatial reasoning capabilities of Large Multimodal Models. The benchmark consists of questions focused on complex 6D poses (position and orientation), revealing current model limitations and guiding future development in precise spatial understanding.
- 14. Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Rethinks architecture for efficient low-light image signal processing with the Hierarchical Mixing Architecture (HiMA). Leverages Transformer and Mamba strengths to simultaneously achieve strong enhancement quality and high efficiency.
- 15. SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling
Presents an end-to-end SpeechLLM that integrates speech recognition and language understanding for zero-shot slot filling. By directly processing raw audio, the model performs contextualized spoken language understanding tasks, outperforming traditional cascaded systems that suffer from error propagation between separate components.
- 16. FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Rethinks positional depth embedding for multi-view 3D object detection Transformers with FreqPDE. Addresses limitations of explicit depth supervision and predicted depth quality issues for improved autonomous driving perception.
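
Item 10 above routes each input to a single vision encoder rather than running every encoder. The PyTorch sketch below shows hard top-1 routing over stand-in "encoders" (plain linear layers); the router input, expert count, and gating rule are illustrative assumptions, not SCOPE's design.

```python
# Sketch: a gating network picks one "expert" encoder per input, so only that
# expert runs, instead of computing and fusing all of them.
import torch
import torch.nn as nn

class MixtureOfEncoders(nn.Module):
    def __init__(self, in_dim=512, out_dim=256, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(n_experts))
        self.router = nn.Linear(in_dim, n_experts)

    def forward(self, x):                       # x: (batch, in_dim) stand-in image features
        choice = self.router(x).argmax(dim=-1)  # hard top-1 routing
        out = torch.zeros(x.size(0), self.experts[0].out_features)
        for k, expert in enumerate(self.experts):
            idx = (choice == k).nonzero(as_tuple=True)[0]
            if idx.numel():
                out[idx] = expert(x[idx])       # only the selected expert runs for these rows
        return out, choice

with torch.no_grad():
    feats, chosen = MixtureOfEncoders()(torch.randn(8, 512))
print(feats.shape, chosen.tolist())
```
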
## Natural Language Processing
- 1. Text2Schema: Filling the Gap in Designing Database Table Structures based on Natural Language
Introduces Text2Schema, a framework for designing database table structures from natural language. Addresses the gap in text-to-SQL by enabling users without database expertise to manage data effectively.
- 2. AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Introduces AutoGraph-R1, the first framework to optimize KG construction for task performance using Reinforcement Learning. Bridges the disconnect between KG construction and downstream QA applications for improved RAG.
- 3. Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Introduces Infinity Parser, a layout-aware reinforcement learning framework for scanned document parsing. Addresses generalization issues of supervised methods on diverse document types and limited training data.
- 4. Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering
Presents a controllable abstract summary generation method for LLMs using prompt engineering. Designs a multi-stage prompt framework to generate summaries with varying abstraction levels via semantic analysis and topic modeling.
- 5. From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages
Explores polysemy in Urdu poetry by analyzing the nuanced differences between 'pyaar', 'muhabbat', and 'ishq'. Exposes a spectrum of emotions and experiences unique to the Urdu language through a case study approach.
- 6. The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works
Introduces a new annotated corpus of three full-length French novels for coreference resolution. Addresses challenges of long, complex literary works to enable evaluation of coreference models in literary contexts.
- 7. Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection
Introduces a novel approach to hate speech detection (HSD) using LLMs as dynamic knowledge bases. Examines context generation strategies and compares methods for incorporating context into HSD classifiers.
- 8. FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Proposes a framework for assessing model robustness through systematic, minimal linguistic variations. It introduces controlled changes from orthography to dialect levels, allowing for a task-agnostic and fine-grained evaluation of how models handle diverse linguistic phenomena and perturbations.
- 9. Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Surveys multimodal RAG for document understanding, identifying limitations of OCR-based and native MLLM approaches. Proposes leveraging multimodal RAG to overcome context modeling challenges and structural detail loss in document analysis.
- 10. FIRE: Fact-checking with Iterative Retrieval and Verification
Introduces an iterative fact-checking system that decomposes claims, retrieves evidence, and verifies them in a loop. Unlike single-pass methods, it dynamically gathers more evidence if initial information is insufficient, improving accuracy on complex, long-form text verification tasks; a skeleton of this control flow follows this list.
- 11. JudgeSQL: Reasoning over SQL Candidates with Weighted Consensus Tournament
Presents a text-to-SQL method where multiple SQL candidates are generated and then evaluated by a judge LLM. The judge uses a weighted consensus tournament to reason over the candidates' execution results and semantic correctness, improving complex query generation accuracy.
- 12. Mixture of Experts Approaches in Dense Retrieval Tasks
Investigates applying Mixture of Experts (MoE) models to dense retrieval to improve generalization across different tasks and domains. The approach uses specialized experts for different data types, demonstrating superior performance over standard dense retrieval models on zero-shot benchmarks.
- 13. Rethinking Cross-lingual Gaps from a Statistical Viewpoint
Provides a statistical analysis of knowledge transfer in multilingual large language models. The paper investigates how information learned in a source language becomes accessible in a target language, offering a new perspective on the mechanisms underlying cross-lingual capabilities and performance gaps.
- 14. To Err Is Human; To Annotate, SILICON? Reducing Measurement Error in LLM Annotation
Proposes a methodology to reduce measurement error when using Large Language Models for text data annotation. It provides a framework for researchers to validate and improve the reliability of LLM-generated annotations, making them a more viable alternative to human labor.
- 15. TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding
Introduces a Threshold-Adaptive Curriculum Learning strategy to improve model training on complex texts. The method dynamically adjusts the difficulty of training examples based on model performance, leading to better understanding of nuanced medical language and improving performance on downstream tasks.
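
Item 10 above is, at its core, a control loop: retrieve evidence, verify, and retrieve again only while the verdict remains uncertain and budget remains. The skeleton below captures that flow; `retrieve` and `verify` are purely hypothetical placeholders standing in for the system's retriever and LLM verifier.

```python
# Sketch: iterative fact-checking control flow — gather evidence only until the
# verifier is confident or the round budget runs out. The two helpers are
# hypothetical placeholders, not the paper's components.
from typing import List, Tuple

def retrieve(claim: str, round_no: int) -> List[str]:
    return [f"evidence snippet {round_no} for: {claim}"]      # placeholder retriever

def verify(claim: str, evidence: List[str]) -> Tuple[str, float]:
    confidence = min(0.4 + 0.25 * len(evidence), 1.0)         # placeholder verifier
    return ("supported" if confidence >= 0.9 else "uncertain"), confidence

def fact_check(claim: str, max_rounds: int = 4, threshold: float = 0.9):
    evidence: List[str] = []
    for r in range(1, max_rounds + 1):
        evidence += retrieve(claim, r)                         # gather more evidence
        verdict, conf = verify(claim, evidence)
        if conf >= threshold:                                  # stop early once confident
            break
    return verdict, conf, evidence

print(fact_check("Example claim to check."))
```
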
## Reinforcement Learning
- 1. Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models
Investigates Reinforcement Learning for safety, sample-efficiency, and robustness. Develops theory and algorithms for safe deployment in ranking systems and explores counterfactual risk bounds for diffusion models.
- 2. Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions
Proposes a multi-level RL framework where agents possess model-changing actions to alter environmental dynamics. This allows agents to go beyond passive adaptation, finding policies that are optimal in an environment the agent itself has helped shape, demonstrating a new paradigm for agent interaction.
- 3. AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Introduces AutoGraph-R1, the first framework to directly optimize Knowledge Graph construction for Retrieval-Augmented Generation tasks using Reinforcement Learning. It bridges the disconnect between KG construction and application, yielding graph structures better suited to downstream QA systems.
- 4. Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
Extends mean-field game theory using neural stochastic differential equations to create a model-free framework. This approach enables solving games with vast populations of players without relying on partial differential equations, making large-scale multi-agent systems more tractable.
- 5. VLMLight: Safety-Critical Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning Architecture
Introduces VLMLight, a novel traffic signal control framework integrating vision-language meta-control with dual-branch reasoning for safety-critical scenarios. It enables robust generalization beyond traditional RL methods.
- 6. MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
Introduces a model-based method for off-dynamics offline RL. It learns a residual dynamics model from limited target data to adapt source data (see the sketch after this list), enabling effective policy learning when training and deployment dynamics differ and outperforming methods that discard or penalize source transitions.
- 7. Onboard Mission Replanning for Adaptive Cooperative Multi-Robot Systems
Proposes onboard mission replanning algorithms for cooperative autonomous robotic systems operating in dynamic environments. Uses Reinforcement Learning to enhance resilience and efficiency without centralized control.
- 8. Internalizing World Models via Self-Play Finetuning for Agentic RL
Proposes a self-play finetuning method for LLM agents to internalize world models of dynamic environments. By generating synthetic trajectories and learning from them, the agent improves its ability to ground its knowledge and adapt to out-of-distribution scenarios without constant environment interaction.
- 9. Exploring Conditions for Diffusion models in Robotic Control
Explores leveraging pre-trained text-to-image diffusion models for task-adaptive visual representations in robotic control without model fine-tuning. Investigates the impact of textual conditions for improved policy learning.
- 10. MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games
Introduces a framework to enhance the multi-agent reasoning of LLMs through self-play in strategic games. By iteratively generating reasoning trajectories and updating policies via RL, LLMs learn to better cooperate and compete, significantly improving performance on games like Diplomacy and Werewolf.
- 11. Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
Introduces a theoretical result that brings the Neyman-Rubin potential outcomes framework into Deep Reinforcement Learning. It establishes a causal bound bridging on-policy and off-policy learning, enabling data recycling.
- 12. Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles
Presents a scalable MARL system for controlling a fleet of autonomous underwater vehicles for acoustic tracking. The method uses a centralized training with decentralized execution framework, demonstrating successful real-world deployment and cost-effective operation for scientific missions in complex marine environments.
- 13. Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization
Investigates policy transfer in Reinforcement Learning, using pre-trained policies to initialize learning in target tasks. Demonstrates faster learning for continuous-time LQR with entropy regularization, enhancing RL efficiency.
- 14. Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning
Develops a method for fine-tuning flow-based policies using online reinforcement learning. It addresses distributional shift and iterative inference challenges by proposing a policy iteration scheme in probability space, enabling effective online adaptation of complex, pre-trained generative policies.
- 15. FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement
Addresses the Fidelity Maximization in Routing Stage (FMRS) problem in quantum computing using Reinforcement Learning. Introduces FIDDLE to improve the reliability of quantum circuits during transpilation.
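The residual-dynamics idea summarized for MOBODY (item 6 above) can be sketched as follows: fit a small correction network on the limited target-domain transitions so that source-domain data can be adapted rather than discarded. The model class, loss, and names (`ResidualDynamics`, `source_model`, `fit_residual`) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not MOBODY's actual architecture): learn g_theta so that
# f_target(s, a) ~= f_source(s, a) + g_theta(s, a), using only target-domain transitions.
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        # Predicted correction to the source-domain dynamics model.
        return self.net(torch.cat([s, a], dim=-1))

def fit_residual(residual, source_model, target_batch, epochs=200, lr=1e-3):
    """Train the residual on (s, a, s') tuples collected in the target environment."""
    opt = torch.optim.Adam(residual.parameters(), lr=lr)
    s, a, s_next = target_batch
    for _ in range(epochs):
        pred_next = source_model(s, a) + residual(s, a)  # residual-corrected prediction
        loss = ((pred_next - s_next) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return residual
```

Once fitted, the residual can be applied to source transitions so that the offline dataset better reflects the deployment dynamics.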
Robotics & Embodied AI
- 1. GaussGym: An open-source real-to-sim framework for learning locomotion from pixels
Presents a novel simulation framework integrating 3D Gaussian Splatting with vectorized physics simulators such as IsaacGym. This enables photorealistic rendering at speeds exceeding 100,000 steps per second, significantly advancing sim-to-real transfer for learning vision-based locomotion.
- 2. Exploring Conditions for Diffusion models in Robotic Control
Explores using pre-trained text-to-image diffusion models for task-adaptive visual representations in robotic control. Finds that naive application of textual conditions yields minimal or negative gains, suggesting new approaches are needed for effective integration.
- 3. Generalized Dynamics Generation towards Scannable Physical World Model
Introduces GDGen, a framework for creating interactive digital twins with realistic dynamics from real-world scans. It aims to generate generalized physical behaviors for scannable environments, enabling the development of generalist embodied agents that can train in complex, physically accurate simulations.
- 4. A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry
Proposes a plug-and-play learning-based IMU bias factor for robust Visual-Inertial Odometry (VIO). Addresses bias estimation deviations in challenging visual tracking scenarios to improve localization accuracy and system stability.
- 5. General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting
Proposes a new paradigm for navigation where a Large Vision-Language Model (LVLM) orchestrates perception, reasoning, and action modules. This approach allows for zero-shot generalization to unknown environments and complex, language-based commands without requiring task-specific fine-tuning.
- 6. Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions
Presents a pipeline for high-quality 3D plant reconstruction using UAVs, incorporating iterative motion compensation. Demonstrates autonomous image acquisition and processing for accurate 3D phenotyping in challenging windy conditions.
- 7. DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation
Introduces a large-scale hybrid dataset with 7,000 hours of dexterous hand-object interactions. Seeded from 70 hours of real human demonstrations, this resource provides the scale and diversity needed to train sophisticated robot learning policies for complex manipulation tasks.
- 8. SHARE: Scene-Human Aligned Reconstruction
Introduces Scene-Human Aligned Reconstruction (SHARE) to accurately ground human motion in 3D space using scene geometry cues. Leverages monocular RGB video for animating realistic character interactions in gaming, AR/VR, and robotics.
- 9. UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
Presents a method to automatically generate diverse, high-fidelity urban environments for simulation by processing city-tour videos. This tackles the major bottleneck of scalable content creation, enabling the training of embodied agents like delivery robots in realistic and varied cityscapes.
- 10. MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes
Proposes a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework for driving scenes. Addresses challenges in pose estimation, outlier reduction, and reconstruction efficiency for multi-camera systems.
- 11. Bolt3D: Generating 3D Scenes in Seconds
Presents Bolt3D, a latent diffusion model for fast feed-forward 3D scene generation in under seven seconds. Leverages 2D diffusion architectures to produce high-fidelity 3D scene representations from images.
- 12. FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Rethinks positional depth embedding for multi-view 3D object detection transformers. Addresses limitations in predicted depth quality by proposing new embedding strategies for improved spatial information integration.
- 13. TAS: A Transit-Aware Strategy for Embodied Navigation with Non-Stationary Targets
Proposes a Transit-Aware Strategy (TAS) for embodied navigation in dynamic scenarios with moving targets. The algorithm enriches navigation policies with predictions of object pathways (see the sketch after this list), enabling agents to successfully follow or intercept non-stationary targets in complex environments.
- 14. ASBI: Leveraging Informative Real-World Data for Active Black-Box Simulator Tuning
Introduces an active learning framework for Simulation-Based Inference (SBI) to efficiently tune black-box simulators. By actively selecting the most informative real-world data, the method improves the accuracy of estimating simulator parameters, helping to close the sim-to-real gap.
- 15. CLOVER: Context-aware Long-term Object Viewpoint- and Environment-Invariant Representation Learning
Proposes CLOVER for context-aware, long-term object representation learning invariant to viewpoint and environment. Aims to improve object re-identification for mobile robots across varying conditions.
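For TAS (item 13 above), the summary describes enriching the navigation policy with a prediction of where the moving target will be. The sketch below uses a simple constant-velocity extrapolation and a hypothetical `plan_to` planner purely for illustration; the paper's actual pathway predictor is not specified here.

```python
# Illustrative sketch of the idea behind TAS (item 13): predict a transit point for a
# moving target and plan toward it, rather than toward its last observed position.
# The constant-velocity predictor and `plan_to` interface are assumptions, not the paper's method.
import numpy as np

def predict_transit_point(track, horizon_s):
    """Extrapolate the target's position `horizon_s` seconds ahead from its recent track.

    `track` is a list of (t, x, y) observations; a constant-velocity model is assumed.
    """
    (t0, x0, y0), (t1, x1, y1) = track[-2], track[-1]
    dt = max(t1 - t0, 1e-6)
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return np.array([x1 + vx * horizon_s, y1 + vy * horizon_s])

def navigate_step(agent_pos, track, plan_to, horizon_s=2.0):
    """One decision step: aim for the predicted transit point of the non-stationary target."""
    goal = predict_transit_point(track, horizon_s)
    return plan_to(agent_pos, goal)   # hypothetical local planner returning the next action
```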
Speech & Audio
- 1. SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models
Proposes an energy-efficient speech therapy framework using spike-driven generative language models. It aims to provide accessible, low-cost solutions for patients with speech disorders by leveraging novel, brain-inspired computing for real-time generative feedback and personalized analysis on edge devices.
- 2. VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
Introduces VocalBench-DF, a benchmark for evaluating the robustness of speech LLMs to disfluent speech. It investigates whether current Speech-LLMs maintain performance for users with speech impairments, supporting systematic study of speech interaction quality.
- 3. DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
Proposes DiEmo-TTS, a self-supervised distillation method for cross-speaker emotion transfer in TTS. It minimizes emotional information loss and preserves speaker identity by disentangling emotion representations, improving synthesis quality.
- 4. EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification
Introduces EmoSphere-SER, a joint model that integrates spherical valence-arousal-dominance (VAD) region classification to guide VAD regression for improved emotion prediction. It transforms VAD values into spherical coordinates to enhance speech emotion recognition.
- 5. SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling
Proposes using speech-based large language models (speechLLMs) for unified slot filling. It integrates speech and textual foundation models for generative, instruction-following speech understanding, promising data and compute efficiency.
- 6. DroneAudioset: An Audio Dataset for Drone-based Search and Rescue
Introduces DroneAudioset, an audio dataset for drone-based search and rescue. It addresses the extreme ego-noise that masks sounds of human presence and provides recordings of real acoustic interactions for developing robust audio perception systems.
- 7. Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Improves an audio style transfer method by introducing a Gaussian prior during inference-time optimization (see the sketch at the end of this list). This addition regularizes the optimization of effect parameters, leading to more stable convergence and higher-quality transfer of vocal effects like reverb and distortion between audio tracks.
- 8. Summarizing Speech: A Comprehensive Survey
Provides a comprehensive survey of speech summarization, examining existing datasets and evaluation protocols. It clarifies the field's intersection with speech recognition and text summarization for efficient spoken content management.
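The Gaussian-prior regularization described for item 7 amounts to a MAP-style objective: the style-matching loss plus a penalty pulling the effect parameters toward a prior mean. The sketch below illustrates that general form; `style_loss`, the parameterisation, and the weight `lam` are placeholders, not the paper's actual components.

```python
# Minimal sketch of inference-time optimisation with a Gaussian prior (item 7).
# `style_loss` stands in for the audio-matching loss; `mu` and `sigma` are the prior
# mean and standard deviation over the effect parameters.
import torch

def optimise_effect_params(style_loss, mu, sigma, steps=500, lr=1e-2, lam=1.0):
    """Minimise style_loss(theta) + lam * ||theta - mu||^2 / (2 * sigma^2)."""
    theta = mu.clone().requires_grad_(True)   # start from the prior mean
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        prior_penalty = ((theta - mu) ** 2).sum() / (2 * sigma ** 2)
        loss = style_loss(theta) + lam * prior_penalty
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()
```

The prior term keeps the optimised parameters from drifting to extreme values, which is what the summary credits for the more stable convergence.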