# Academic Research Intelligence
Deep dive into AI research papers for researchers and academics
---
## Executive Summary
- 1. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a multimodal model that learns interleaved chain-of-thought reasoning by treating text and images as complementary modalities. Fine-tuned on 24K reasoning traces, it demonstrates emergent properties that improve multimodal reasoning across diverse tasks.
- 2. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Proposes NAUTILUS, a large multimodal model for underwater scene understanding, addressing the lack of large-scale datasets. It enables multi-task perception from multiple granularities, advancing automated underwater exploration and analysis with improved scene comprehension capabilities.
- 3. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Introduces DUST, a dual-stream diffusion framework for world-model augmented Vision-Language-Action (VLA) models. It addresses modality conflicts between state and action prediction, enhancing VLA performance across diverse robotic tasks by jointly modeling future observations and actions.
- 4. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Presents FRIDA, a lightweight framework using diffusion features for fake image detection and source attribution. It addresses generalization challenges of supervised detectors across unseen generators, enabling robust authenticity verification and misinformation detection in synthetic image generation.
- 5. Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
Proposes a robust deep neural watermarking framework for copyright protection in 3D point clouds. It addresses challenges posed by geometric and non-geometric attacks, offering enhanced resilience compared to conventional methods for safeguarding intellectual property in 3D digital media.
- 6. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Introduces ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning for robust representation learning. It enhances model resilience against adversarial attacks by learning discriminant patterns that are less susceptible to imperceptible perturbations.
- 7. Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
Proposes a Data-Free Quantization (DFQ) method for Vision Transformers (ViTs) that addresses semantic distortion and inadequacy using semantic alignment and reinforcement. It enables model quantization without real data, enhancing privacy and security for ViT deployment.
- 8. PROFIT: A Specialized Optimizer for Deep Fine Tuning
Introduces PROFIT, an optimizer specifically designed for deep fine-tuning of converged models on new tasks or datasets. It aims to improve fine-tuning efficiency and model performance, addressing a gap in scholarship concerning performance-focused fine-tuning strategies.
- 9. SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction
Proposes SAGS, a self-adaptive alias-free Gaussian Splatting method for dynamic surgical endoscopic reconstruction. It addresses aliasing and artifacts in deformable tissue reconstruction from endoscopic videos, improving visualization quality.
- 10. Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Proposes an Audio-Visual Speech Enhancement (AVSE) system that jointly models separation and dereverberation for complex acoustic scenarios. It leverages visual auxiliary information to extract target speech effectively, improving perceptual quality in challenging real-world conditions.
- 11. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Introduces NegoCollab, a common representation negotiation approach for heterogeneous collaborative perception. It addresses domain gaps in intermediate features shared among agents with fixed perception models, improving collaborative performance by aligning features to a unified representation.
- 12. Gaussian Combined Distance: A Generic Metric for Object Detection
Proposes Gaussian Combined Distance (GCD) as a generic similarity metric for object detection, addressing limitations of IoU-based metrics, especially for small objects. GCD enhances model performance by offering better sensitivity to positional deviations than traditional metrics.
- 13. Deep learning denoising unlocks quantitative insights in operando materials microscopy
Presents a deep learning-based denoising framework for quantitative operando microscopy. It preserves physical fidelity and enhances resolution, enabling deeper insights into dynamic chemical and physical processes in functional materials across various modalities and scales.
- 14. Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Proposes Sh-ViT, a lightweight Vision Transformer for robust occluded person re-identification in complex surveillance scenes. It enhances robustness to occlusion through a shuffle module in the final transformer layer, outperforming existing methods on challenging ReID tasks.
- 15. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Introduces Phased DMD, a few-step distribution matching distillation method using score matching within subintervals. It addresses limitations of one-step distillation in complex generative tasks by extending DMD to multi-step distillation more efficiently, improving synthesis quality.
- 16. LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar
Proposes LifWavNet, a lifting wavelet network for non-contact ECG reconstruction from radar signals. It employs learnable lifting wavelets for adaptive feature capture and synthesis, offering an unobtrusive approach to cardiac monitoring by improving radar-to-ECG reconstruction.
- 17. Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Presents generative diffusion modeling protocols to enhance Kikuchi pattern indexing in electron back-scatter diffraction (EBSD). It addresses limitations of traditional methods at high scanning speeds by improving signal-to-noise ratio and pattern interpretation for crystallographic orientation extraction.
- 18. WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
Introduces WildfireX-SLAM, a large-scale low-altitude RGB-D dataset for wildfire SLAM. It aims to facilitate research in 3D Gaussian splatting-based SLAM for challenging forest environments, supporting wildfire response and management.
- 19. From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Introduces a multi-agent framework for editable scientific illustrations that outputs vector graphics with semantic structure. It addresses rasterization limitations and cumbersome code-based methods, enabling post-editability and rearrangement of visual components.
- 20. A fragile zero-watermarking method based on dual quaternion matrix decomposition
Proposes a fragile zero-watermarking method using dual quaternion matrix decomposition for medical image copyright protection. It extracts stable features without modifying the original image, providing a means to detect tampering and verify content integrity during transmission.
## AI for Science
- 1. Deep learning denoising unlocks quantitative insights in operando materials microscopy
Introduces a framework for unsupervised deep learning-based denoising in microscopy. Demonstrates preservation of physical fidelity and reduction of noise in quantitative analysis of functional materials, enabling deeper insights.
- 2. Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Proposes generative diffusion modeling protocols to enhance Kikuchi pattern indexing in EBSD. Addresses limitations of traditional methods in high-speed scanning by improving signal-to-noise ratio for accurate crystallographic orientation extraction.
- 3. Accelerating Radiative Transfer for Planetary Atmospheres by Orders of Magnitude with a Transformer-Based Machine Learning Model
Develops a transformer-based machine learning model to accelerate radiative transfer calculations for planetary atmospheres. Achieves orders-of-magnitude speedup, enabling more accurate simulations by reducing reliance on numerical simplifications.
- 4. Traceable Drug Recommendation over Medical Knowledge Graphs
Proposes TraceDR, a drug recommendation system operating over a medical knowledge graph. Ensures traceability of recommendations for high-stakes applications, overcoming the lack of interpretability in current deep learning approaches.
- 5. LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar
Presents LifWavNet, a lifting wavelet network for non-contact ECG reconstruction from radar signals. Employs learnable lifting wavelets for adaptive feature capture, enabling unobtrusive cardiac monitoring.
- 6. Querying functional and structural niches on spatial transcriptomics data
Introduces a query-based analytical paradigm for spatial transcriptomics data. Enables identification of functional and structural niches, revealing conserved niche patterns and universal tissue organization principles.
- 7. From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Presents a multi-agent framework for editable scientific illustrations. Outputs semantic structure rather than rasterized images, enabling post-editability and manipulation of visual components for improved information density and control.
- 8. LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
Introduces a multi-modal toolbox using LLMs and VLMs to automatically extract and organize synthesis procedures from scientific literature. Facilitates systematic analysis and accelerates materials discovery by curating procedural knowledge.
- 9. Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle
Presents a hierarchical Bayesian model for deconvolving bulk RNA-seq data into cell-type profiles. Analyzes human endometrial tissue across the menstrual cycle, revealing cell-type-specific dynamics.
- 10. A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning
Introduces a multi-tiered human-in-the-loop framework for interactive school mapping. Combines machine learning analysis of geospatial data with human input to improve accuracy and completeness of educational facility records, especially in developing regions.
- 11. MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series
Adapts SAM2 for automatic segmentation of historical map images and time series. Addresses challenges of stylistic variability and limited data in historical maps, enabling more efficient spatio-temporal dataset construction.
- 12. SUSTAINABLE Platform: Seamless Smart Farming Integration Towards Agronomy Automation
Introduces SUSTAINABLE, a smart farming platform integrating IoT, AI, and satellite imaging for sustainable agriculture. Focuses on viticulture, enabling efficient, traceable agronomy automation with role-based task orchestration.
- 13. MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design
Proposes MolChord, integrating structure-sequence alignment for protein-guided drug design. Effectively aligns protein and molecular representations, guiding drug discovery by generating candidate molecules with desired pharmacological properties.
- 14. AI Agents in Drug Discovery
Explores AI agents for drug discovery, leveraging LLMs with perception, computation, and memory tools. Enables autonomous reasoning, task execution, and iterative hypothesis refinement for accelerated research workflows.
- 15. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Introduces NAUTILUS, a large multimodal model for underwater scene understanding. Addresses the absence of large-scale datasets and enables multi-task perception at multiple granularities for automated underwater exploration.
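The lifting scheme that LifWavNet (item 5 above) builds on can be illustrated with a fixed Haar wavelet. This is a minimal sketch rather than the paper's method: LifWavNet is described as learning its lifting operators, whereas the predict and update steps here are hard-coded, and the names `haar_lift_forward` and `haar_lift_inverse` are hypothetical.

```python
def haar_lift_forward(signal):
    """One level of Haar lifting: split, predict, update (fixed, not learned)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    even = signal[0::2]
    odd = signal[1::2]
    # Predict: estimate each odd sample from its even neighbour, keep the residual.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: correct the even samples so they carry the pairwise mean.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Undo the update and predict steps in reverse order, then interleave."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal


samples = [2.0, 4.0, 6.0, 6.0, 1.0, 5.0]
approx, detail = haar_lift_forward(samples)
restored = haar_lift_inverse(approx, detail)
```

Because each lifting step is inverted by reversing its sign, reconstruction is exact whatever predict and update operators are used, which is one reason the structure lends itself to learnable variants.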
## AI Safety & Ethics
- 1. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Introduces ANCHOR, integrating adversarial training with contrastive learning for robust representation learning. Achieves state-of-the-art robustness against adversarial attacks, demonstrating improved generalization on downstream tasks. Enables more reliable AI systems in adversarial environments.
- 2. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Proposes FRIDA, a framework for fake image recognition and source identification using diffusion features. Achieves robust generalization across unseen generators, outperforming existing methods. Enables improved detection of synthetic media and attribution of its origin.
- 3. A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection
Introduces a hybrid deep learning and forensic approach for robust deepfake detection. Combines the generalization of deep learning with the interpretability of forensic analysis, overcoming the limitations of either method alone. Enhances detection reliability against evolving manipulation techniques.
- 4. Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation
Proposes Trans-defense, a transformer-based denoiser for adversarial defense using spatial-frequency domain representation. Achieves robust protection against adversarial attacks by effectively removing perturbations. Enhances security of deep learning models in critical applications.
- 5. C-LEAD: Contrastive Learning for Enhanced Adversarial Defense
Introduces C-LEAD, utilizing contrastive learning for enhanced adversarial defense. Achieves improved robustness against adversarial attacks by learning discriminative representations. Demonstrates superior performance in preventing adversarial misclassifications, enhancing AI safety.
- 6. A fragile zero-watermarking method based on dual quaternion matrix decomposition
Proposes a fragile zero-watermarking method using dual quaternion matrix decomposition for medical image copyright protection. Constructs watermarks without modifying original images, safeguarding against tampering during transmission.
- 7. Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Investigates and quantifies robust adversarial concept erasure in diffusion models. Proposes methods to mitigate sensitive content generation by addressing specificity of adversarial training. Enhances safety and control over generative AI outputs.
- 8. Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Introduces BEAT, a framework for injecting visual backdoors into multimodal LLM agents via contrastive trigger learning. Enables persistent execution of attacker-specified policies upon visual trigger detection. Highlights new attack surfaces and security vulnerabilities in embodied AI.
- 9. Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models
Presents a fine-tuning method combining contrastive distillation and noise-robust training for secure LLM alignment. It improves semantic consistency and robustness by freezing the backbone and transferring knowledge boundaries, enhancing safety.
- 10. Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services
Introduces Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) to enhance privacy in conversational AI. It pseudonymizes Personally Identifiable Information (PII) while preserving semantic integrity, reducing security risks.
- 11. Characterizing Selective Refusal Bias in Large Language Models
Explores selective refusal bias in LLM safety guardrails, analyzing refusal rates across demographic groups. It identifies and characterizes how LLMs may refuse harmful content targeting specific groups differently, highlighting ethical concerns in AI deployment.
- 12. Referee: Reference-aware Audiovisual Deepfake Detection
Introduces Referee, a reference-aware audiovisual deepfake detection method. Leverages speaker-specific cues from one-shot examples to detect manipulations beyond spatiotemporal artifacts, improving generalization to unseen forgeries.
- 13. Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
Proposes a robust deep neural watermarking framework for copyright protection in 3D point clouds. Achieves resilience against geometric and non-geometric attacks, addressing unique challenges of 3D data. Enables secure intellectual property management for 3D content.
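The zero-watermarking idea in item 6 above, where the image is never modified and a separate ownership share is stored instead, can be sketched generically. This is an illustration under stated assumptions, not the dual quaternion method: the block-mean feature and the function names `block_feature_bits`, `make_ownership_share`, and `recover_watermark` are hypothetical stand-ins.

```python
def block_feature_bits(image, block=2):
    """One bit per block: 1 if the block mean exceeds the global mean."""
    rows, cols = len(image), len(image[0])
    global_mean = sum(sum(row) for row in image) / (rows * cols)
    bits = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            vals = [image[r + i][c + j] for i in range(block) for j in range(block)]
            bits.append(1 if sum(vals) / len(vals) > global_mean else 0)
    return bits


def make_ownership_share(image, watermark_bits):
    """XOR image features with the watermark; the image itself stays untouched."""
    features = block_feature_bits(image)
    if len(features) != len(watermark_bits):
        raise ValueError("watermark length must match the number of blocks")
    return [f ^ w for f, w in zip(features, watermark_bits)]


def recover_watermark(image, share):
    """Recompute features from a received image and XOR with the stored share."""
    return [f ^ s for f, s in zip(block_feature_bits(image), share)]


original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
watermark = [1, 0, 1, 1]
share = make_ownership_share(original, watermark)

# Simulate tampering: overwrite the top-left block of a copy.
tampered = [row[:] for row in original]
for i in range(2):
    for j in range(2):
        tampered[i][j] = 255

intact_ok = recover_watermark(original, share) == watermark
tamper_detected = recover_watermark(tampered, share) != watermark
```

The block-mean feature used here only reacts to large changes; a genuinely fragile scheme would rely on features that shift under much smaller modifications.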
## AI Theory & Foundations
- 1. Manifold Learning for Hyperspectral Images
Proposes a manifold learning method that applies Uniform Manifold Approximation and Projection (UMAP) to X-Ray Transmission Multi-Energy images. Captures nonlinear correlations to better represent the data topology, improving neural network performance in decision-making tasks.
- 2. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Introduces ANCHOR, a framework combining adversarial training and supervised contrastive learning to learn robust representations. It addresses gradient-based attacks by improving model generalization and robustness against imperceptible input perturbations.
- 3. Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning
Introduces a novel Kernel Mean Embedding Topology for stochastic kernels. Provides weak and strong formulations, highlighting utility for relaxed policy spaces and implications for model learning. Offers a new theoretical framework for analyzing learning systems.
- 4. PROFIT: A Specialized Optimizer for Deep Fine Tuning
Proposes PROFIT, an optimizer specifically designed for fine-tuning converged models on new tasks. It aims to improve model performance during fine-tuning, addressing a gap in optimization research, which has focused on fine-tuning efficiency rather than performance.
- 5. An In-depth Study of LLM Contributions to the Bin Packing Problem
Investigates LLM-generated heuristics for the online bin packing problem, reassessing their behavior and interpretability. It examines whether LLMs contribute novel insights to combinatorial optimization heuristics.
- 6. Gaussian Combined Distance: A Generic Metric for Object Detection
Introduces Gaussian Combined Distance (GCD) as a generic metric for object detection. It addresses limitations of IoU, particularly for small objects, by proposing a novel similarity metric to enhance detection performance.
- 7. A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees
Proposes a regularized Newton method for nonconvex optimization, aiming to achieve optimal global complexity and quadratic local convergence simultaneously. It addresses a long-standing trade-off in convergence properties.
- 8. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Proposes ThinkMorph, a model for multimodal reasoning using interleaved text and image thoughts. It learns from 24K reasoning traces, demonstrating emergent properties in coordinating modalities for advanced reasoning.
- 9. Supervised Quadratic Feature Analysis: Information Geometry Approach for Dimensionality Reduction
Introduces a supervised dimensionality reduction method based on information geometry. It aims to maximize class discriminability in a low-dimensional feature space.
- 10. DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions
Introduces Dynamics-Aware Offline Inverse Q-Learning (DO-IQS) for optimal stopping problems with unknown gain functions. Recovers optimal stopping regions from expert trajectories, enabling safe real-world applications. Addresses limitations of current IRL methods in dynamic settings.
- 11. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Introduces Phased Distribution Matching Distillation (Phased DMD) for distilling score-based generative models. It enables few-step distillation by using score matching within subintervals, improving efficiency for complex generative tasks.
- 12. Quantitative Bounds for Length Generalization in Transformers
Provides the first quantitative bounds on required training sequence lengths for transformers to achieve length generalization. It addresses the ability of models to perform on longer, unseen inputs.
- 13. Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Investigates Reinforcement Learning with Verifiable Rewards (RLVR) for mathematical reasoning. Analyzes its generalization limits on combinatorial problems, questioning its ability to foster genuine reasoning beyond superficial correctness. Highlights challenges in achieving true generalization.
- 14. Deep learning denoising unlocks quantitative insights in operando materials microscopy
Presents a framework integrating unsupervised deep learning denoising for operando microscopy. It preserves physical fidelity and enhances quantitative analysis across modalities, unlocking new insights in materials science.
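To make the regularized Newton idea in item 7 above concrete, here is a generic gradient-regularized Newton iteration on a one-dimensional nonconvex function. It is a sketch under assumptions, not the paper's algorithm: the regularization rule (shift negative curvature to zero, then add the square root of the gradient magnitude), the test function, and the name `regularized_newton_1d` are chosen for illustration, and no complexity guarantee is claimed for this version.

```python
import math


def regularized_newton_1d(grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton iteration with a curvature shift that fades as the gradient shrinks."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        h = hess(x)
        # Cancel any negative curvature, then add a gradient-dependent term so
        # the denominator is strictly positive whenever g is nonzero.
        lam = max(0.0, -h) + math.sqrt(abs(g))
        x -= g / (h + lam)
    return x


# f(x) = x**4 / 4 - x**2 / 2 is nonconvex, with minima at x = -1 and x = 1.
def grad_f(x):
    return x ** 3 - x


def hess_f(x):
    return 3 * x ** 2 - 1


x_star = regularized_newton_1d(grad_f, hess_f, x0=0.1)
```

Starting at `x0=0.1`, where the Hessian is negative, a plain Newton step would move toward the local maximum at zero; the regularized step moves downhill toward the minimum at 1 instead.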
## Computer Vision
- 1. MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series
Adapts SAM2 for automatic segmentation of historical map images and time series. It addresses challenges of stylistic variability and limited data for historical maps, enabling automated analysis and construction of spatio-temporal datasets from historical map archives.
- 2. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a multimodal model learning interleaved chain-of-thought reasoning by treating text and image as complementary modalities. Achieves improved reasoning across diverse tasks, demonstrating emergent properties in multimodal reasoning.
- 3. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Proposes FRIDA, a lightweight framework for fake image recognition and source identification using diffusion features. Achieves robust generalization across unseen generators, addressing concerns about authenticity and misinformation.
- 4. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Introduces DUST, a dual-stream diffusion framework for world-model augmented Vision-Language-Action models. Addresses modality conflicts for improved robotic policy learning and enhances VLA performance across diverse tasks.
- 5. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Proposes NegoCollab, a negotiation approach for heterogeneous collaborative perception by aligning agent features to a common representation. Eliminates domain gaps, improving collaborative performance in multi-agent systems.
- 6. Gaussian Combined Distance: A Generic Metric for Object Detection
Introduces Gaussian Combined Distance (GCD) as a generic metric for object detection. Addresses limitations of IoU, particularly for small objects, by offering improved similarity measurement and detection performance.
- 7. Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Proposes Sh-ViT, a lightweight Vision Transformer for robust occluded person re-identification. Introduces a Shuffle module to enhance robustness against occlusion and improve performance in complex surveillance scenes.
- 8. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Introduces NAUTILUS, a large multimodal model for underwater scene understanding. It addresses the need for multi-task perception at multiple granularities, aiming to automate underwater exploration despite the lack of large-scale datasets.
- 9. Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Presents an effective Audio-Visual Speech Enhancement (AVSE) system that jointly models separation and dereverberation. Achieves improved speech quality in complex acoustic environments with interfering sounds and reverberation.
- 10. SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer
Introduces SRAGAN, a Saliency Regularized and Attended Generative Adversarial Network for Chinese ink-wash painting style transfer. Effectively learns and transfers target-domain style patterns onto source-domain content images.
- 11. From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Introduces a multi-agent framework for editable scientific illustrations, outputting semantic vector graphics instead of rasterized images. This enables post-editing and rearrangement of visual components for improved usability.
- 12. EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
Proposes EF-3DGS, an Event-Aided Free-Trajectory 3D Gaussian Splatting method. Achieves robust scene reconstruction from casually captured videos, overcoming limitations of traditional methods in high-speed or low-frame-rate scenarios.
- 13. Deep learning denoising unlocks quantitative insights in operando materials microscopy
Presents a framework integrating unsupervised deep learning denoising for quantitative microscopy. It enhances effective resolution and quantitative analysis in operando microscopy workflows across various modalities and scales.
- 14. GASP: Gaussian Splatting for Physic-Based Simulations
Integrates Gaussian Splatting with physics simulation for 3D scenes, avoiding meshing mechanisms. It modifies physics-grounded Newtonian dynamics to align with 3D Gaussian components for improved simulation.
- 15. Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Introduces a method for panoramic Out-of-Distribution (OOD) segmentation for autonomous driving. Addresses challenges of background clutter and pixel distortions in panoramic images, enabling better outlier identification.
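The IoU limitation for small objects that motivates Gaussian Combined Distance (item 6 above) can be seen with a short calculation. The sketch below shows only the problem with plain IoU; it does not implement GCD, and the box coordinates are made-up examples.

```python
def iou(box_a, box_b):
    """Intersection over union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The same 2-pixel horizontal shift applied to a small and a large box.
small_iou = iou((0, 0, 4, 4), (2, 0, 6, 4))
large_iou = iou((0, 0, 40, 40), (2, 0, 42, 40))
```

The 4x4 box drops to an IoU of about 0.33 while the 40x40 box stays above 0.9, so an identical localization error is penalized far more heavily for the small object.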
## Efficient AI
- 1. Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
Proposes a data-free quantization method for Vision Transformers that improves semantic alignment and adequacy of synthetic data. Achieves competitive quantization performance without real data, enhancing privacy and security for ViT deployment.
- 2. Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Introduces Sh-ViT, a lightweight Vision Transformer for robust person re-identification in occluded surveillance scenes. Enhances robustness to occlusion and viewpoint distortion with a novel Shuffle module, offering improved performance in challenging real-world scenarios.
- 3. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Presents ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning for robust representation learning. Improves robustness against adversarial attacks by leveraging contrastive learning on challenging examples, enhancing model reliability.
- 4. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Introduces FRIDA, a lightweight framework for fake image recognition and source identification using diffusion features. Effectively detects synthetic images and attributes their source, addressing concerns about authenticity and misinformation in generative AI.
- 5. SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping
Proposes SpecAware, a spectral-content aware foundation model for unifying multi-sensor learning in hyperspectral remote sensing. It addresses challenges in joint training across heterogeneous HSI data by leveraging sensor meta-attributes.
- 6. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Proposes NegoCollab, a common representation negotiation approach for heterogeneous collaborative perception. Addresses domain gaps in intermediate features from different perception models by aligning them to a unified representation, improving collaborative performance.
- 7. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Proposes Phased DMD for few-step distribution matching distillation of score-based generative models. Achieves multi-step distillation without increased memory or depth, improving on one-step distilled models for complex tasks such as text-to-video generation.
- 8. AMD-Hummingbird: Towards an Efficient Text-to-Video Model
Introduces AMD-Hummingbird, an efficient text-to-video generation model. It balances computational efficiency with high visual quality, targeting resource-limited devices and enabling real-world deployment of T2V models.
- 9. DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting
Presents DC4GS, a 3D Gaussian Splatting method with Directional Consistency-driven Adaptive Density Control. It incorporates gradient direction coherence into density control for better local structural complexity capture, improving 3D reconstruction.
- 10. PROFIT: A Specialized Optimizer for Deep Fine Tuning
Introduces PROFIT, an optimizer specifically designed for incrementally fine-tuning converged models on new tasks. Addresses the gap in optimizing fine-tuning for performance, offering an alternative to traditional optimizers for efficient adaptation.
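For readers unfamiliar with the quantization step that item 1 above targets, here is a generic symmetric int8 quantizer. It is a minimal sketch, not the paper's method: the data-free part (synthesizing calibration data without real samples) is not shown, and the per-tensor scale and the names `quantize_int8` and `dequantize` are assumptions for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, so the quality of a quantized model depends heavily on how well the scale is calibrated, which is where calibration data normally comes in.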
## Generative AI
- 1. Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Presents an audio-visual speech enhancement system that jointly models separation and dereverberation in complex scenarios. Addresses the limitations of previous methods in handling interfering sounds and reverberation, aiming for improved perceptual quality of extracted speech in real-world conditions.
- 2. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Introduces FRIDA, a lightweight framework using diffusion features for fake image detection and source attribution. It generalizes across unseen generators without extensive retraining, addressing concerns about authenticity and misinformation.
- 3. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Proposes Phased DMD, a few-step distillation technique for score-based generative models. It yields efficient few-step generators, improving text-to-video generation by synthesizing intricate motions without increased memory or computational depth.
- 4. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Introduces DUST, a dual-stream diffusion framework for world-model augmented VLAs. It jointly predicts next-state observations and actions, overcoming modality conflicts and enhancing robotic policy learning across diverse tasks.
- 5. DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model
Proposes DANCER, a diffusion model-based framework for dance animation generation. Integrates condition enhancement and rendering to synthesize realistic human dancing videos, addressing challenges in visual quality and temporal continuity.
- 6. DeblurSDI: Blind Image Deblurring Using Self-diffusion
Proposes DeblurSDI, a zero-shot, self-supervised framework for blind image deblurring using self-diffusion. It requires no prior training and effectively addresses ill-posed inverse problems without extensive external datasets.
- 7. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Introduces NAUTILUS, a large multimodal model for underwater scene understanding. It supports multi-task perception at multiple granularities, aiming to automate underwater exploration despite data scarcity.
- 8. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Presents ThinkMorph, a unified model for multimodal reasoning using interleaved text and image thoughts. It learns complementary reasoning processes, advancing multimodal chain-of-thought capabilities on diverse tasks.
- 9. Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Applies generative diffusion models to improve Kikuchi pattern indexing in EBSD. It addresses limitations in high-speed scanning scenarios where noise degrades pattern quality, enhancing crystallographic orientation extraction.
- 10. A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection
Proposes a hybrid deep learning and forensic framework for robust deepfake detection. It combines deep learning's generalization with forensic analysis' interpretability to counter evolving manipulation techniques.
- 11. Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Proposes CIELR, a method for complex image editing using LLM reasoning without joint fine-tuning. It converts complex instructions into simpler ones, avoiding the high computational complexity and training costs of existing approaches that integrate LLMs with diffusion models.
- 12. SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer
Introduces SRAGAN, a GAN-based model for Chinese ink-wash painting style transfer. It uses saliency regularization and attention to effectively learn and transfer target-domain style patterns onto content images.
- 13. E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
Introduces E-MMDiT, an efficient and lightweight multimodal diffusion model for fast image synthesis. Achieves low training resource requirements and fast synthesis with only 304M parameters, addressing limitations of large-scale diffusion models.
- 14. From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Presents a multi-agent framework for editable scientific illustrations. It generates semantically structured vector graphics rather than rasterized images, enabling post-editability and rearrangement of visual components.
- 15. Referee: Reference-aware Audiovisual Deepfake Detection
Introduces Referee, a reference-aware audiovisual deepfake detection method. It leverages speaker-specific cues from one-shot examples to detect manipulations beyond spatiotemporal artifacts, improving generalization.
Graph Neural Networks
- 1. Spectral Neural Graph Sparsification
Proposes the Spectral Preservation Network for graph representation learning, which generates reduced graphs that preserve spectral properties. The reduced graphs serve as faithful proxies for the originals, enabling efficient graph learning with GNNs and showing potential for improved network analysis.
- 2. MDAS-GNN: Multi-Dimensional Spatiotemporal GNN with Spatial Diffusion for Urban Traffic Risk Forecasting
Introduces MDAS-GNN, a Multi-Dimensional Attention-based Spatial-diffusion Graph Neural Network for urban traffic risk forecasting. Integrates traffic safety, infrastructure, and environmental dimensions to capture complex spatial and temporal dependencies, improving prediction accuracy.
- 3. Geometry-Aware Edge Pooling for Graph Neural Networks
Presents Geometry-Aware Edge Pooling for GNNs, optimizing pooling operations to preserve fundamental graph structures while reducing graph size. Aims for improved interpretability and generalization in graph learning tasks.
- 4. Graph Neural Networks for Molecular Property Prediction
Introduces Graph Neural Networks (GNNs) for molecular property prediction, demonstrating superior performance over traditional methods. Enables more accurate and efficient prediction of molecular characteristics for drug discovery and materials science.
- 5. A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification
Provides a comprehensive systematic literature review of spatio-temporal Graph Neural Network models for time series analysis. Summarizes modeling approaches and application domains, offering an overview of GNNs in forecasting and classification.
- 6. Graph Semi-Supervised Learning for Point Classification on Data Manifolds
Proposes a graph semi-supervised learning framework for point classification on data manifolds using VAEs and geometric graphs. Captures nonlinear correlations for improved classification accuracy in low-dimensional embedding spaces.
- 7. Community Detection on Model Explanation Graphs for Explainable AI
Introduces Modules of Influence (MoI) framework using community detection on model explanation graphs. Identifies feature modules jointly affecting predictions, aiding in understanding bias, redundancy, and causality patterns in Explainable AI.
- 8. Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs
Introduces a learning-based approach using GNNs to learn sparse approximate inverse preconditioners for Conjugate Gradient solvers. Aims to improve performance and speed up construction for efficient linear system solving on GPUs.
- 9. Graph Diffusion that can Insert and Delete
Introduces a discrete Denoising Diffusion Probabilistic Model (DDPM) for graph generation that supports dynamic graph size adjustment. Enables conditional generation tasks by allowing insertion and deletion of atoms/bonds.
- 10. FairAD: Computationally Efficient Fair Graph Clustering via Algebraic Distance
Introduces FairAD, a computationally efficient method for fair graph clustering using algebraic distance. Aims to partition graph nodes into clusters while ensuring proportional representation of protected groups, addressing fairness concerns.
- 11. MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design
Introduces MolChord, a framework for protein-guided drug design that aligns protein structural representations with molecular representations. Integrates structure-sequence alignment and property prediction for accelerating drug discovery.
- 12. Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks
Proposes a hierarchical GNN framework for analyzing traditional village spatial morphology using multi-modal feature fusion. Addresses limitations of single-disciplinary approaches by integrating diverse data for comprehensive analysis.
Large Language Models
- 1. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a unified model for multimodal reasoning, by treating text and image as complementary modalities. It learns from 24K interleaved reasoning traces, demonstrating emergent properties in multimodal chain-of-thought reasoning across various tasks with varying visual engagement.
- 2. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Proposes DUST, a world-model augmented VLA framework using dual-stream diffusion to jointly predict next-state observations and actions. It addresses modality conflicts, enhancing VLA performance across diverse robotic tasks by better handling inherent differences between visual and action modalities.
- 3. PROFIT: A Specialized Optimizer for Deep Fine Tuning
Introduces PROFIT, an optimizer specifically designed for incremental fine-tuning of converged models on new tasks or datasets. It aims to improve model performance during fine-tuning, addressing a gap in scholarship focused on efficiency rather than performance enhancement in this critical stage.
- 4. Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Proposes CIELR, a method for complex image editing that leverages LLM reasoning to understand implicit user intentions. It converts complex instructions into simpler ones, avoiding the high computational cost and training complexity of jointly fine-tuning LLMs and diffusion models.
- 5. RzenEmbed: Towards Comprehensive Multimodal Retrieval
Introduces RzenEmbed, a unified framework for learning multimodal embeddings across text, images, videos, and visual documents. It extends CLIP-based frameworks to offer comprehensive retrieval support beyond natural images, bridging a gap in current MLLM capabilities.
- 6. NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Introduces NoisyRollout, a data augmentation method for vision-language models to enhance reinforcement learning and visual reasoning. It addresses imperfect visual perception and improves policy exploration by mixing training trajectories, leading to better reasoning capabilities.
- 7. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Presents NAUTILUS, a large multimodal model for underwater scene understanding, a task that demands multi-task perception at multiple granularities. It addresses the lack of large-scale underwater multi-task instruction-tuning datasets, which are crucial for automated underwater exploration.
- 8. DeblurSDI: Blind Image Deblurring Using Self-diffusion
Proposes DeblurSDI, a zero-shot, self-supervised framework for blind image deblurring using self-diffusion. It requires no prior training and addresses limitations of traditional and deep learning methods in adapting to real-world scenarios. Enables effective deblurring without extensive training.
- 9. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Introduces ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning to achieve robust representation learning. It addresses the vulnerability of neural networks to adversarial attacks by enhancing their decision-making resilience.
- 10. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Introduces FRIDA, a lightweight framework for fake image recognition and source attribution using diffusion features. It addresses the generalization challenges of existing detectors across unseen generators, enabling better detection of synthetic images.
- 11. Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Introduces Sh-ViT, a lightweight Vision Transformer for robust occluded person re-identification. It incorporates a Shuffle module to enhance robustness against occlusion and other challenges in complex surveillance scenes.
- 12. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Proposes Phased DMD, a few-step distribution matching distillation method using score matching within subintervals. It addresses limitations of one-step distilled models on complex tasks by extending DMD to multi-step distillation efficiently, balancing model capacity and computational depth.
Multimodal Learning
- 1. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a unified multimodal model for interleaved chain-of-thought reasoning. It learns to use text and image thoughts complementarily to advance reasoning, demonstrating emergent properties on diverse tasks with varying visual engagement.
- 2. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Proposes NAUTILUS, a large multimodal model for underwater scene understanding that supports multi-task perception at multiple granularities. It aims to overcome the lack of large-scale underwater multi-task instruction-tuning datasets, which has hindered progress.
- 3. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Introduces DUST (Dual-Stream Diffusion), a world-model augmented VLA framework that jointly predicts next-state observations and actions. It addresses modality conflicts and enhances VLA performance across diverse robotic tasks.
- 4. RzenEmbed: Towards Comprehensive Multimodal Retrieval
Proposes RzenEmbed, a unified framework for learning embeddings across text, images, videos, and visual documents. It extends CLIP-based models for comprehensive multimodal retrieval beyond natural images.
- 5. Referee: Reference-aware Audiovisual Deepfake Detection
Introduces Referee, a reference-aware audiovisual deepfake detection method that leverages speaker-specific cues from one-shot examples. It detects manipulations beyond spatiotemporal artifacts, improving generalization to unseen forgeries.
- 6. Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Proposes an effective audio-visual speech enhancement (AVSE) system that jointly models separation and dereverberation. It addresses complex acoustic environments with interfering sounds and reverberation for improved extracted speech quality.
- 7. NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Introduces NoisyRollout, a data augmentation method for vision-language models that enhances policy exploration and robustness to imperfect visual perception. It mixes training trajectories to improve reasoning capabilities.
- 8. GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language
Proposes GeoFM, a method to enhance multimodal LLM geometric reasoning by generating synthetic data using formal language. It addresses the scarcity of high-quality geometric data and current methods' limitations in synthetic data generation.
- 9. Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Proposes Alignment across Trees, a method for modality alignment that constructs and aligns tree-like hierarchical features for both image and text. It uses a semantic-aware visual encoder for improved multimodal integration.
- 10. Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Proposes CIELR, a method for complex image editing via LLM reasoning. It converts complex instructions into simpler steps, avoiding high computational cost and training complexity associated with jointly fine-tuning LLMs and diffusion models.
- 11. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Proposes NegoCollab, a common representation negotiation approach for heterogeneous collaborative perception. Addresses domain gaps in intermediate features from different perception models by aligning them to a common representation.
Natural Language Processing
- 1. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a multimodal reasoning model that treats text and image thoughts as complementary modalities. It learns from 24K interleaved reasoning traces to advance reasoning capabilities in visually-engaged tasks, demonstrating emergent properties in multimodal CoT reasoning.
- 2. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Proposes DUST, a dual-stream diffusion framework for world-model augmented VLA models. It addresses modality conflicts between next-state and action predictions, enhancing VLA performance across diverse robotic tasks by jointly modeling these sequences.
- 3. Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Proposes CIELR, a method that converts complex image editing instructions into a sequence of simple ones using LLM reasoning. It avoids joint fine-tuning of LLMs and diffusion models, reducing computational complexity and training costs for complex editing tasks.
- 4. RzenEmbed: Towards Comprehensive Multimodal Retrieval
Introduces RzenEmbed, a unified framework for learning multimodal embeddings across text, images, videos, and visual documents. It extends CLIP-based frameworks for comprehensive retrieval, addressing limitations in supporting diverse visual modalities beyond natural images.
- 5. Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Proposes an approach to enhance spatio-temporal zero-shot action recognition by using language-driven description attributes alongside action classes. This reduces ambiguity from multi-semantic words, improving understanding of intended action concepts in complex video data.
- 6. Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Proposes Phased DMD, a method for distilling score-based generative models into efficient few-step generators. It addresses the underperformance of one-step distilled models on complex tasks by using score matching within subintervals, improving generation of intricate motions.
- 7. RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents
Introduces RegionRAG, a retrieval-augmented generation method that uses region-level retrieval for visually-rich documents. It filters out irrelevant content by focusing on salient regions, improving retrieval relevance and generation quality for complex documents.
- 8. Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Proposes an effective audio-visual speech enhancement (AVSE) system that jointly models separation and dereverberation. It addresses complex acoustic environments with interfering sounds and reverberation, improving perceptual quality of extracted speech in real-world scenarios.
- 9. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Proposes FRIDA, a lightweight framework for fake image recognition and source identification using diffusion features. Addresses generalization issues of existing detectors across unseen generators.
- 10. NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Introduces NoisyRollout, a data augmentation method for enhancing visual reasoning in VLMs. It addresses policy exploration and imperfect visual perception by mixing training trajectories, improving reinforcement learning capabilities for vision-language models.
- 11. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Presents NAUTILUS, a large multimodal model for underwater scene understanding, addressing the lack of large-scale instruction-tuning datasets. It enables multi-task perception at multiple granularities for automated underwater exploration.
- 12. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Proposes NegoCollab, a common representation negotiation approach for heterogeneous collaborative perception. It addresses domain gaps in intermediate features from fixed perception models by aligning them to a common representation, improving collaborative performance.
Reinforcement Learning
- 1. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Proposes DUST, a world-model augmented VLA framework using dual-stream diffusion. It jointly predicts next-state observations and actions while resolving conflicts between the two modalities, enhancing VLA performance across diverse tasks.
- 2. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Introduces NegoCollab, a negotiation approach for heterogeneous collaborative perception. It aligns features from different agents to a common representation, eliminating domain gaps and improving collaborative performance.
- 3. NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Presents NoisyRollout, a data augmentation method for RL in vision-language models. It enhances policy exploration and scales test-time compute by mixing training trajectories, addressing imperfect visual perception.
- 4. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Introduces ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning. This framework aims to learn robust representations by mitigating adversarial attacks and enhancing discrimination.
- 5. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Presents FRIDA, a lightweight framework for fake image recognition and source identification using diffusion features. It addresses generalization issues of supervised detectors across unseen generators, improving fake detection and attribution.
- 6. StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
Introduces StateSpaceDiffuser, a diffusion-based world model framework that incorporates state space models. It addresses the lack of long-term context, improving temporal coherence in generated scenes and enhancing action-conditioned visual prediction.
- 7. DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions
Proposes DO-IQS, an offline inverse Q-learning method for optimal stopping problems with unknown gain functions. It recovers optimal stopping regions by approximating continuation and stopping gain functions.
- 8. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Introduces NAUTILUS, a large multimodal model for underwater scene understanding. It addresses the lack of large-scale underwater multi-task instruction-tuning datasets, enabling multi-task perception at multiple granularities.
- 9. Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines
Introduces a reinforcement learning approach for long-horizon unordered tasks using coupled reward machines. Addresses limitations of standard RMs for tasks with subtasks executable in any order, enabling more efficient learning in complex scenarios.
- 10. Diabetes Lifestyle Medicine Treatment Assistance Using Reinforcement Learning
Presents an offline contextual bandit approach for personalized lifestyle prescriptions in Type 2 diabetes. It learns individualized prescriptions from aggregated NHANES data, optimizing a reward based on glucose-related risk.
- 11. Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning
Introduces Kernel Mean Embedding Topology for stochastic kernels, offering weak and strong formulations. This construction is useful for relaxed policy spaces and has implications for model learning in reinforcement learning contexts.
- 12. Game Theoretic Resilience Recommendation Framework for CyberPhysical Microgrids Using Hypergraph MetaLearning
Proposes a physics-aware cyberphysical resilience framework for microgrids using hypergraph meta-learning and game theory. Models attacker and defender strategies to recommend optimal resilience measures against coordinated cyberattacks.
- 13. Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints
Develops a multi-agent RL model for realistic pedestrian-driver interaction. It incorporates human perceptual-motor constraints to simulate perception and action mechanisms, improving safety modeling for autonomous vehicles.
- 14. Reinforcement Learning for Accelerator Beamline Control: a simulation-based approach
Introduces RLABC, a Python library reframing accelerator beamline optimization as a reinforcement learning problem. It automates beamline configuration optimization using the Elegant simulation framework.
- 15. Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
Proposes a framework for semantic interpretability in RL using vision-language models. It aims to construct human-understandable feature spaces and verifiable policies, moving beyond manual specification.
- 16. PROFIT: A Specialized Optimizer for Deep Fine Tuning
Introduces PROFIT, an optimizer specifically designed for deep fine-tuning. It targets incremental fine-tuning of converged models on new tasks or datasets, aiming to improve model performance rather than just efficiency and addressing a gap in fine-tuning scholarship.
Robotics & Embodied AI
- 1. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Proposes DUST, a dual-stream diffusion framework for world-model augmented Vision-Language-Action models. Addresses modality conflicts between next-state observations and actions, enhancing VLA performance across diverse tasks.
- 2. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Introduces NegoCollab, a negotiation approach for heterogeneous collaborative perception. Learns a common representation to align features from different agents, mitigating domain gaps and improving collaborative performance.
- 3. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Presents NAUTILUS, a large multimodal model for underwater scene understanding. Addresses the lack of large-scale underwater multi-task instruction-tuning datasets to achieve multi-task perception at multiple granularities.
- 4. SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction
Proposes SAGS, a self-adaptive alias-free Gaussian Splatting method for dynamic surgical endoscopic reconstruction. It addresses aliasing and artifacts from tissue movement, enhancing visualization quality in robot-assisted surgery.
- 5. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a unified model for multimodal reasoning. Posits text and image thoughts are complementary, learns from interleaved reasoning traces, and demonstrates emergent properties in multimodal chain-of-thought reasoning.
- 6. PROFIT: A Specialized Optimizer for Deep Fine Tuning
Introduces PROFIT, an optimizer specifically designed for incremental fine-tuning of converged models on new tasks or datasets. It aims to improve model performance beyond traditional optimizers like SGD or Adam.
- 7. Leveraging Foundation Models for Enhancing Robot Perception and Action
Systematically investigates how foundation models can enhance robotic capabilities, focusing on improving localization, interaction, and manipulation in unstructured environments through semantics-aware intelligence.
- 8. NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Proposes NoisyRollout, a data augmentation method for reinforcement learning. Enhances policy exploration and addresses imperfect visual perception in vision-language models by mixing training trajectories.
- 9. Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints
Models realistic pedestrian-driver interactions using multi-agent RL incorporating human perceptual-motor constraints. Captures underlying mechanisms of perception and action for safer autonomous vehicle system development.
- 10. Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting
Presents Object-IR, a self-supervised architecture for image retargeting using mesh warping. It leverages object consistency and geometric-preserving constraints to eliminate distortion in semantically important regions.
- 11. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Introduces ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning. Aims to improve representation learning robustness by addressing vulnerabilities to adversarial perturbations.
- 12. WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
Presents WildfireX-SLAM, a large-scale RGB-D dataset for wildfire SLAM. Addresses the absence of comprehensive datasets for 3D Gaussian Splatting-based SLAM in large-scale forest scenes.
- 13. GASP: Gaussian Splatting for Physic-Based Simulations
Introduces GASP, integrating Gaussian Splatting with physics-based simulations. It modifies Newtonian dynamics to align with 3D Gaussian components, bypassing traditional meshing mechanisms for scene simulation.
- 14. Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Introduces BEAT, the first framework for injecting visual backdoors into MLLM embodied agents. Uses contrastive trigger learning to enable persistent, attacker-specified multi-step policies triggered by visual cues.
- 15. RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Presents ROMAN, a Hybrid Hierarchical Learning framework for robotic manipulation. Addresses complex, long sequential tasks and enables versatile, robust manipulation skills in embodied AI.
- 16. EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
Introduces EF-3DGS, an event-camera-aided method for free-trajectory 3D Gaussian Splatting. It addresses challenges in high-speed or low-frame-rate reconstruction scenarios where traditional methods fail.
Speech & Audio
- 1. Referee: Reference-aware Audiovisual Deepfake Detection
Proposes Referee, a reference-aware audiovisual deepfake detection method. Leverages speaker-specific cues from one-shot examples to detect manipulations beyond spatiotemporal artifacts, improving generalization to unseen forgeries.
- 2. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Introduces ThinkMorph, a unified multimodal model trained on interleaved reasoning traces. Learns to treat text and image thoughts as complementary modalities, advancing interleaved chain-of-thought reasoning capabilities for multimodal tasks.
- 3. Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
Proposes a robust deep neural watermarking framework to address copyright protection challenges in 3D point clouds. The method is designed to be resilient against geometric and non-geometric attacks, enhancing intellectual property security.
- 4. ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Proposes ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning for robust representation learning. Enhances model resilience against adversarial attacks by learning discriminative patterns more effectively.
- 5. Who Made This? Fake Detection and Source Attribution with Diffusion Features
Introduces FRIDA, a lightweight framework for fake image detection and source attribution using diffusion features. Addresses generalization issues of supervised detectors across unseen generators, enabling more robust authenticity verification.
- 6. Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Introduces DUST, a world-model augmented Vision-Language-Action framework using dual-stream diffusion. Addresses modality conflicts in jointly predicting next-state observations and action sequences, enhancing VLA performance across diverse tasks.
- 7. LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar
Presents LifWavNet, a lifting wavelet network for non-contact ECG reconstruction from radar signals. Employs learnable lifting wavelets for adaptive feature capture and synthesis, offering a promising approach for unobtrusive cardiac monitoring.
- 8. Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
Proposes an effective audio-visual speech enhancement system for complex acoustic environments by jointly modeling separation and dereverberation. Performs well in complex scenarios, improving perceptual quality of extracted speech.
- 9. NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Presents NAUTILUS, a large multimodal model for underwater scene understanding, addressing multi-task perception demands. Aims to achieve automated underwater exploration by overcoming the lack of large-scale underwater instruction-tuning datasets.
- 10. NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception
Proposes NegoCollab, a negotiation approach for heterogeneous collaborative perception. Addresses domain gaps in intermediate features from fixed perception models, enabling robust collaboration by aligning features to a common representation.