AI Research Archive

Recording AI Revolution One Day At A Time


Wednesday, November 5, 2025


Executive Briefing Bullets (20) JSON

Investigates subtraction accuracy in eight LLMs, finding it lags behind addition. Errors in (a-b) are consistently related to errors in (b-a), suggesting models struggle with non-commutativity. This h...
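The paired comparison described above can be sketched as a small evaluation harness. This is a minimal illustration, not the paper's protocol: `paired_subtraction_report` and `toy_model` are hypothetical names, and the toy model is a stand-in that deliberately drops the sign so the link between (a-b) and (b-a) errors is visible.

```python
def paired_subtraction_report(model, pairs):
    """For each (a, b), query both orderings and record how their errors relate.

    `model` is any callable (a, b) -> int standing in for an LLM answer parser
    (an assumption made for this sketch).
    """
    rows = []
    for a, b in pairs:
        fwd, rev = model(a, b), model(b, a)
        rows.append({
            "pair": (a, b),
            "fwd_wrong": fwd != a - b,     # error on a - b
            "rev_wrong": rev != b - a,     # error on b - a
            "antisymmetric": fwd == -rev,  # a correct subtractor always satisfies this
        })
    return rows


def toy_model(a, b):
    # Hypothetical failure mode: returns the magnitude and loses the sign.
    return abs(a - b)


report = paired_subtraction_report(toy_model, [(7, 3), (3, 7), (5, 5)])
for row in report:
    print(row)
```

With a sign-dropping model, exactly one ordering of each unequal pair is wrong and the antisymmetry check fails for both, which is one simple way the two error sets can be coupled.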

Fast, Private, and Protected: Safeguarding Data Privacy and Defending Against Model Poisoning Attacks in Federated Learning

Introduces Fast, Private, and Protected (FPP), a novel approach for federated learning that safeguards data privacy and defends against model poisoning attacks. It aims to ensure secure and robust dis...

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Introduces LTD-Bench, a benchmark for evaluating LLMs' spatial reasoning capabilities through drawing. It addresses the limitations of opaque numerical metrics by providing an intuitive understanding ...

SEAL - A Symmetry EncourAging Loss for High Energy Physics

Introduces SEAL, a symmetry-encouraging loss function for high energy physics. It improves robustness and data efficiency of machine learning models by explicitly respecting physical symmetries, even ...

Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs

Introduces the 'Three Taxes' framework to analyze performance inefficiencies in distributed LLMs. Proposes moving beyond BSP to achieve efficient multi-GPU inference by addressing bulk synchronous, lo...

An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

Reveals a jailbreak strategy that evades defenses by extracting information from failed attacks and evolving itself. It provides an automated framework for discovering, retrieving, and evolving strate...

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Formalizes AI research agents as search policies navigating solution spaces using operators. Focuses on improving agent performance in MLE-bench by enhancing search, exploration, and generalization fo...

Rethinking LLM Human Simulation: When a Graph is What You Need

Identifies a class of simulation problems where Graph Neural Networks (GNNs) outperform LLMs. Introduces Graph-based Models (GEMs) that match or surpass LLM baselines for human simulation despite bein...

Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs

Introduces path-consistency, leveraging confidence of earlier answers to guide generation and enhance LLM inference efficiency. It identifies promising prefixes to reduce computational cost and time c...

Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning

Fine-tunes LLMs for classification by attaching explanations to labels, systematically improving naturalness, comprehensiveness, and adherence. This explanation-enhanced approach yields better convers...

Enhancing Federated Learning Privacy with QUBO

Proposes a QUBO formulation to enhance privacy in federated learning by bounding the risk of membership inference attacks. This method aims to improve data protection while maintaining model utility i...

IG-Pruning: Input-Guided Block Pruning for Large Language Models

Proposes IG-Pruning, a novel input-aware method for pruning transformer layers in LLMs. It dynamically removes layers based on input, reducing computational costs for efficient inference without signi...

In Good GRACEs: Principled Teacher Selection for Knowledge Distillation

Proposes GRACE, a lightweight score to quantify teacher model effectiveness for student model distillation. It measures distributional properties of student gradients without a verifier, enabling prin...

Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

Addresses GPU NUMA effects in large-scale attention workloads by proposing Swizzle, a novel kernel scheduling strategy. It exploits NUMA-aware locality to optimize attention performance, mitigating me...

PrivGNN: High-Performance Secure Inference for Cryptographic Graph Neural Networks

Proposes PrivGNN, a high-performance secure inference protocol for graph neural networks. It addresses the challenge of securing GNNs and graph data in privacy-critical cloud environments, enabling se...

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Presents AutoAdv, a training-free framework for automated multi-turn jailbreaking of LLMs. It achieves high attack success rates by combining adaptive adversarial prompting and prompt refinement, impr...

Multi-Personality Generation of LLMs at Decoding-time

Proposes a novel Multi-Personality Generation (MPG) framework for LLMs at decoding time. It flexibly controls multiple personalities without retraining, enhancing adaptability and robustness for user-...

The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute

Compares sequential and parallel self-consistency for LLM reasoning, finding sequential voting with inverse entropy outperforms parallel methods at equal compute. This demonstrates a more efficient sc...
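A rough sketch of inverse-entropy voting follows. The summary does not say how entropy is computed, so this example assumes the mean Shannon entropy of each chain's per-token probability distributions; the function names and the `eps` smoothing term are illustrative choices, not the paper's.

```python
import math
from collections import defaultdict


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over a chain's per-token distributions."""
    total = 0.0
    for dist in token_dists:
        total -= sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def inverse_entropy_vote(chains, eps=1e-6):
    """chains: list of (answer, token_dists) pairs.

    Each chain votes for its answer with weight 1 / (entropy + eps),
    so low-entropy (more confident) chains count for more.
    """
    scores = defaultdict(float)
    for answer, token_dists in chains:
        scores[answer] += 1.0 / (mean_token_entropy(token_dists) + eps)
    return max(scores, key=scores.get)


chains = [
    ("42", [[0.99, 0.01], [0.98, 0.02]]),  # one confident chain
    ("41", [[0.5, 0.5], [0.6, 0.4]]),      # two uncertain chains
    ("41", [[0.5, 0.5], [0.5, 0.5]]),
]
winner = inverse_entropy_vote(chains)
print(winner)
```

In this toy input a plain majority vote would pick "41", while the entropy-weighted vote picks "42" because the single confident chain outweighs the two uncertain ones.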

On Extending Direct Preference Optimization to Accommodate Ties

Derives and investigates two DPO variants that explicitly model ties in pairwise comparisons. Experiments show explicit tie handling can be added without performance degradation, improving DPO's robus...
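One classical way to add ties to a Bradley-Terry style preference model is the Rao-Kupper extension. The summary does not name the two variants, so the sketch below is an assumption about what tie-aware DPO could look like, not a statement of the paper's method. Here `margin` is the usual DPO quantity, beta times the difference of policy-versus-reference log-ratios for the two responses, and `alpha` is a hypothetical tie-width parameter.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rao_kupper_probs(margin, alpha):
    """Win / tie / loss probabilities under the Rao-Kupper model.

    margin: beta-scaled implicit reward difference (first minus second response).
    alpha:  log of the Rao-Kupper tie parameter, alpha >= 0; alpha = 0 recovers
            Bradley-Terry, where ties have probability zero.
    """
    p_win = sigmoid(margin - alpha)
    p_loss = sigmoid(-margin - alpha)
    p_tie = 1.0 - p_win - p_loss
    return p_win, p_tie, p_loss


def tie_aware_loss(margin, label, alpha=0.5):
    """Negative log-likelihood of the observed label ("win", "tie", or "loss").

    Requires alpha > 0 when label == "tie", otherwise the tie probability is zero.
    """
    p_win, p_tie, p_loss = rao_kupper_probs(margin, alpha)
    return -math.log({"win": p_win, "tie": p_tie, "loss": p_loss}[label])


print(rao_kupper_probs(0.0, 0.5))
print(tie_aware_loss(0.0, "tie"), tie_aware_loss(3.0, "tie"))
```

For pairs labelled as ties, this loss is smallest when the margin is near zero, so tied examples pull the two implicit rewards together instead of being discarded.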

ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks

Proposes ExplicitLM, a novel architecture with a million-scale external memory bank storing human-readable knowledge. This decouples knowledge from parameters, enabling direct inspection and modificat...


Tuesday, November 4, 2025


Erasing 'Ugly' from the Internet: Propagation of the Beauty Myth in Text-Image Models

Investigates how generative AI models encode 'beauty' norms and erase 'ugliness'. Studies the propagation of Western beauty myths in text-image models and discusses societal implications, particularly...

Complex QA and language models hybrid architectures, Survey

Surveys complex question-answering strategies using hybrid LLM architectures. Reviews methods for addressing specific, complex questions beyond chatbot capabilities, exploring power-generation and cli...

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Proposes dictionary learning for adversarial training to defend LLMs against jailbreak attacks. Aims to improve generalization to unseen attacks by creating more robust safety guardrails, addressing a...

PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks

Introduces PADBen, a benchmark for evaluating AI text detectors against paraphrase attacks. Reveals that iterative paraphrasing evades current detectors by creating an intermediate laundering region, ...

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

Introduces MARS-SQL, a multi-agent RL framework for complex Text-to-SQL tasks. It decomposes the problem into specialized agents for grounding, generation, and validation, improving accuracy and handl...

A note on large deviations for interacting particle dynamics for finding mixed Nash equilibria with applications to GANs

Considers a method for finding mixed Nash equilibria in two-layer zero-sum games using entropic regularization. Applies interacting particle dynamics and large deviations theory to problems in GAN tra...

Assessing LLM Reasoning Steps via Principal Knowledge Grounding

Introduces a framework to assess LLM reasoning's knowledge grounding by collecting principal knowledge and evaluating intermediate reasoning steps. It comprises knowledge collection, grounding assessm...

Low-Rank Adaptation for Foundation Models: A Comprehensive Review

Provides a comprehensive review of Low-Rank Adaptation (LoRA) for foundation models. It analyzes LoRA's effectiveness in adapting large models to downstream tasks, addressing parameter efficiency chal...

Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs

Investigates safety and fairness risks in parameter-efficient fine-tuning (PEFT) of LLMs. Compares four PEFT methods (LoRA, DoRA, ICL, Prompt Tuning) to assess trade-offs between efficiency and alignm...

DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching

Introduces DTS, a framework for enhancing large reasoning models by pruning over-long chain-of-thought traces. It uses decoding tree sketching to identify short, accurate reasoning paths, reducing inf...

Reevaluating Self-Consistency Scaling in Multi-Agent Systems

Reevaluates self-consistency scaling in multi-agent systems using Gemini 2.5 models. Examines trade-offs of increasing sampled reasoning paths, comparing pooled outputs to single chain-of-thought, and...

ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models

Introduces ToM, a framework leveraging Tree-oriented MapReduce for long-context reasoning in LLMs. It improves logical coherence over RAG and divide-and-conquer methods by optimizing graph traversal f...

Spatial Knowledge Graph-Guided Multimodal Synthesis

Proposes a framework for generating spatially coherent multimodal data by integrating spatial knowledge graphs with MLLMs. It addresses spatial perception limitations in MLLMs, enabling the creation o...

When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Introduces ReSpec, a retrieval-enhanced speculative decoding framework for LLM acceleration. It optimizes cache scheduling as a graph problem using Lexicographic Minimax Path Optimization to minimize ...

Diversity-Aware Policy Optimization for Large Language Model Reasoning

Presents a systematic investigation into diversity's impact on LLM reasoning via RL. Proposes a diversity-aware policy optimization framework to enhance reasoning capabilities and stability, addressin...

SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment

Proposes SEPS, a semantic-enhanced patch slimming framework for fine-grained cross-modal alignment. It addresses patch redundancy and ambiguity in MLLMs by optimizing patch selection for improved visi...

On the Variance, Admissibility, and Stability of Empirical Risk Minimization

Proves that suboptimality of Empirical Risk Minimization (ERM) is due to large bias, with variance bounded by the minimax rate. Provides an elementary proof in the fixed design setting and extends it ...

Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus

Introduces the Bhili-Hindi-English Parallel Corpus (BHEPC), the largest of its kind. Leverages cross-domain and cross-linguistic data to address low-resource Neural Machine Translation challenges for ...

Contextual Tokenization for Graph Inverted Indices

Introduces CORGII, a graph indexing framework for efficient subgraph isomorphism retrieval. Uses contextual graph representations and inverted indices to overcome limitations of exhaustive scoring in ...

Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials

Proposes a Bayesian tensor regression model for phenotype prediction across multiple factors. Incorporates spike-and-slab structures to identify relevant interactions and uses prior distributions to r...


Monday, November 3, 2025


ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Introduces ThinkMorph, a multimodal model learning interleaved chain-of-thought reasoning by treating text and image as complementary. Fine-tuned on 24K reasoning traces, it demonstrates emergent prop...

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Introduces DUST, a dual-stream diffusion framework for world-model augmented Vision-Language-Action (VLA) models. It addresses modality conflicts between state and action prediction, enhancing VLA per...

Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds

Proposes a robust deep neural watermarking framework for copyright protection in 3D point clouds. It addresses challenges posed by geometric and non-geometric attacks, offering enhanced resilience com...

Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers

Proposes a Data-Free Quantization (DFQ) method for Vision Transformers (ViTs) that addresses semantic distortion and inadequacy using semantic alignment and reinforcement. It enables model quantizatio...

SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction

Proposes SAGS, a self-adaptive alias-free Gaussian Splatting method for dynamic surgical endoscopic reconstruction. It addresses aliasing and artifacts in deformable tissue reconstruction from endosco...

NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

Introduces NegoCollab, a common representation negotiation approach for heterogeneous collaborative perception. It addresses domain gaps in intermediate features shared among agents with fixed percept...

Deep learning denoising unlocks quantitative insights in operando materials microscopy

Presents a deep learning-based denoising framework for quantitative operando microscopy. It preserves physical fidelity and enhances resolution, enabling deeper insights into dynamic chemical and phys...

Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Introduces Phased DMD, a few-step distribution matching distillation method using score matching within subintervals. It addresses limitations of one-step distillation in complex generative tasks by e...

Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction

Presents generative diffusion modeling protocols to enhance Kikuchi pattern indexing in electron back-scatter diffraction (EBSD). It addresses limitations of traditional methods at high scanning speed...

From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

Introduces a multi-agent framework for editable scientific illustrations that outputs vector graphics with semantic structure. It addresses rasterization limitations and cumbersome code-based methods,...

NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Proposes NAUTILUS, a large multimodal model for underwater scene understanding, addressing the lack of large-scale datasets. It enables multi-task perception from multiple granularities, advancing aut...

Who Made This? Fake Detection and Source Attribution with Diffusion Features

Presents FRIDA, a lightweight framework using diffusion features for fake image detection and source attribution. It addresses generalization challenges of supervised detectors across unseen generator...

ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning

Introduces ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning for robust representation learning. It enhances model resilience against adversarial attacks by lear...

PROFIT: A Specialized Optimizer for Deep Fine Tuning

Introduces PROFIT, an optimizer specifically designed for deep fine-tuning of converged models on new tasks or datasets. It aims to improve fine-tuning efficiency and model performance, addressing a g...

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Proposes an Audio-Visual Speech Enhancement (AVSE) system that jointly models separation and dereverberation for complex acoustic scenarios. It leverages visual auxiliary information to extract target...

Gaussian Combined Distance: A Generic Metric for Object Detection

Proposes Gaussian Combined Distance (GCD) as a generic similarity metric for object detection, addressing limitations of IoU-based metrics, especially for small objects. GCD enhances model performance...

Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

Proposes Sh-ViT, a lightweight Vision Transformer for robust occluded person re-identification in complex surveillance scenes. It enhances robustness to occlusion through a shuffle module in the final...

LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar

Proposes LifWavNet, a lifting wavelet network for non-contact ECG reconstruction from radar signals. It employs learnable lifting wavelets for adaptive feature capture and synthesis, offering an unobt...

WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond

Introduces WildfireX-SLAM, a large-scale low-altitude RGB-D dataset for wildfire SLAM. It aims to facilitate research in 3D Gaussian splatting-based SLAM for challenging forest environments, supportin...

A fragile zero-watermarking method based on dual quaternion matrix decomposition

Proposes a fragile zero-watermarking method using dual quaternion matrix decomposition for medical image copyright protection. It extracts stable features without modifying the original image, providi...


Friday, October 31, 2025


Emu3.5: Native Multimodal Models are World Learners

Introduces Emu3.5, a large-scale multimodal world model pre-trained end-to-end with a unified next-token prediction objective. Trained on over 10 trillion vision-language tokens, it natively predicts ...

Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving

Proposes a new planning method for end-to-end autonomous driving using constraint-aware flow matching. This generative approach overcomes the mode collapse issue of imitation learning by producing div...

NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods

Introduces a framework for consistent and reproducible evaluation of novel view synthesis methods like NeRFs and 3D Gaussian Splatting. It provides standardized implementations and evaluation protocol...

SAMRI: Segment Anything Model for MRI

Adapts the Segment Anything Model (SAM) for medical magnetic resonance imaging (MRI) segmentation. This work demonstrates how a large-scale vision foundation model can be effectively fine-tuned for a ...

CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Introduces CronusVLA, a vision-language-action model for robotic manipulation that leverages temporal information from multiple frames. By moving beyond the single-frame paradigm, this approach enhanc...

MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Introduces MoralCLIP, a method to imbue vision-language models with the ability to reason about moral dimensions of content. It aligns image-text representations with principles from Moral Foundations...

Masked Diffusion Captioning for Visual Feature Learning

Proposes Masked Diffusion Captioning (MDC), a novel self-supervised method for learning visual features. The approach trains a model to caption images using an image-conditioned masked diffusion langu...

DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Presents DOVE, a diffusion model for real-world video super-resolution that achieves high performance in a single sampling step. This overcomes the significant latency of traditional iterative diffusi...

JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting

Proposes JOGS, a unified framework that jointly optimizes 3D Gaussian points and camera poses for novel view synthesis. This approach eliminates the dependency on external pose estimation tools like C...

A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

Provides a comprehensive survey on efficient post-training for Large Language Models (LLMs) from a data-centric viewpoint. The paper reviews methods and challenges related to data annotation costs and...

Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS

Presents Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel method for dynamic scene rendering. By disentangling static and dynamic components, it achieves high-resolution, real-time rende...

The Impact and Outlook of 3D Gaussian Splatting

Provides a comprehensive survey of 3D Gaussian Splatting (3DGS), a transformative technique for 3D scene representation. The paper analyzes follow-up research that enhances efficiency, scalability, an...

Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Addresses the challenge of selecting effective pre-training data for long-context LLMs. The paper proposes a method to quantify long-range dependencies in text, enabling the filtering of documents tha...

ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection

Presents ProstNFound+, a prospective clinical study validating the use of medical foundation models for prostate cancer detection from micro-ultrasound images. This work demonstrates the real-world ap...

SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Proposes SplitFlow, a method for inversion-free image editing with rectified flow models. By decomposing the flow into content and structure components, it allows for high-fidelity, text-guided edits ...

LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

Presents LODGE, a level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. It creates a hierarchical representation,...

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location

Introduces HyGen, a system for efficient LLM serving that co-locates latency-sensitive online requests and throughput-oriented offline requests. By dynamically managing resources and batching strategi...

DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World Scenarios

Introduces DDL, a large-scale dataset for deepfake detection and localization designed to cover diverse real-world scenarios. By including a wide range of AI-generated content and manipulation types...

Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras

Introduces Spiking Patches, a novel tokenization method specifically designed for asynchronous and sparse data from event cameras. This approach creates an event representation that preserves the inhe...

CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Presents CRAG-MM, a new benchmark for evaluating Multi-Modal Retrieval-Augmented Generation (MM-RAG) systems. It focuses on multi-turn conversational scenarios, such as those encountered with wearable...


Thursday, October 30, 2025


Serve Programs, Not Prompts

Proposes a new LLM serving architecture that executes programs instead of processing static prompts. This allows for dynamic, runtime customization of inference, achieving up to 2x throughput improvem...

Parallel Loop Transformer for Efficient Test-Time Computation Scaling

Presents a novel transformer architecture where looped computations (reusing weights) run in parallel instead of sequentially. This design overcomes the latency bottleneck of previous looped models, e...

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Introduces RLAIF-V, a framework for reducing multimodal LLM hallucination using feedback from open-source AI models instead of humans. This method creates a highly effective preference dataset and tra...

VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Presents a generative AI framework that creates dynamic visual effects (VFX) by learning from in-context examples, rather than relying on per-effect fine-tuning. This allows the model to generalize an...

Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

Proposes a unified training pipeline that improves both Program-of-Thought (P-CoT) and Natural Language Chain-of-Thought (N-CoT) reasoning. The method uses each paradigm to iteratively generate and re...

EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Presents a foundational LLM for Electronic Health Record (EHR) analysis, pre-trained on a massive clinical dataset. The model is fine-tuned with a reasoning-focused objective, demonstrating superior p...

OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs

Introduces an open-source framework for building and evaluating automated fact-checking systems. The work provides a comprehensive benchmark that measures the ability of LLMs and dedicated systems to ...

LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Proposes the first Multimodal Large Language Model (MLLM) framework for open-vocabulary, hierarchical part segmentation. The model can jointly detect and segment objects and their constituent parts fr...

Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

Proposes a new training method that improves the reliability of post-hoc attribution for long-document question answering. By training the model to decompose answers into components, it enhances the a...

EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Introduces ExtractAnything3D (EA3D), a unified online framework that performs simultaneous geometric reconstruction and open-world 3D object extraction from a single, streaming video. The system can i...

Scaling Latent Reasoning via Looped Language Models

Introduces Ouro, a family of pre-trained Looped Language Models that perform iterative reasoning in latent space. This approach allows smaller models (1.4B) to match the reasoning performance of much ...

Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

Demonstrates that language models, regardless of architecture (Transformer, Mamba) or scale (14M to 12B parameters), exhibit highly consistent and predictable behavioral phases during pre-training, re...

Precise In-Parameter Concept Erasure in Large Language Models

Proposes a method for precisely erasing entire concepts directly from a model's parameters. This technique surgically modifies model behavior without requiring fine-tuning, offering a more robust appr...

SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

Introduces a method to accelerate Chain-of-Thought (CoT) reasoning by encoding reasoning steps into implicit, non-textual tokens. This reduces the number of generated tokens, significantly speeding up...

CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

Develops an LLM-based agent for complex business tasks within a Customer Relationship Management (CRM) system. The agent uses reinforcement learning and a shared memory module to improve its tool-call...

PairUni: Pairwise Training for Unified Multimodal Language Models

Proposes PairUni, a unified framework for training multimodal models to perform both understanding and generation tasks. It uses pairwise ranking objectives during reinforcement learning to effectivel...

Balanced conic rectified flow

Introduces a new generative model based on rectified flow, an ODE-based approach that learns smooth transport between distributions. This method offers an alternative to diffusion, enabling high-quali...

Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

Introduces MiRAGE, a new evaluation framework and benchmark for Retrieval-Augmented Generation (RAG) systems that use multimodal sources like video and audio. It tests the ability of models to integra...

NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging

Presents a novel debugging framework where the model first translates buggy code into a natural language description of its logic. It then identifies and corrects flaws in the natural language represe...

StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

Presents a new Video Question Answering dataset to evaluate a model's ability to understand temporal dynamics and perform complex reasoning over streaming video. The dataset includes questions requiri...


Wednesday, October 29, 2025


SPICE: Self-Play In Corpus Environments Improves Reasoning

Introduces a reinforcement learning framework where a single model acts as both a Challenger and a Reasoner. The model self-improves by generating reasoning problems from a large text corpus, demonstr...

Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Proposes a method for Multimodal Large Language Models to improve complex visual reasoning by generating intermediate 'visual thoughts.' The model learns to sketch in a latent space, mimicking human c...

Pie: A Programmable Serving System for Emerging LLM Applications

Presents Pie, a programmable serving system designed for complex LLM applications involving agentic workflows. It replaces the monolithic token generation loop with a flexible system that can execute ...

AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

Introduces a data synthesis method inspired by the Zone of Proximal Development (ZPD). It generates training tasks at the edge of an LLM's capabilities, enabling the model to effectively expand its re...

SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

Creates a benchmark to disentangle reasoning from factual recall in language models. It generates controlled, synthetic 'worlds' with alternate physics or facts, allowing for precise evaluation of a m...

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Introduces a large multi-modal model capable of processing contexts up to 1 million tokens, including images, video, and text. It achieves state-of-the-art performance on long-context visual understan...

ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

Introduces a benchmark to evaluate if AI agents can replicate research from astrophysics papers. It tests an agent's ability to perform a complex workflow, including understanding the paper, writing c...

Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Demonstrates that Reinforcement Learning can significantly improve the performance of LLM-based search agents on long-horizon tasks. By learning from experience, the RL-trained agents outperform promp...

GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving

Develops a multi-sensor fusion method for autonomous driving based on 3D Gaussian representations. The approach effectively combines information from various sensors like cameras and LiDAR into a unif...

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Introduces a large-scale commonsense reasoning benchmark covering over 100 languages and cultures. Constructed through participatory methods, it evaluates the ability of LLMs to handle culturally-spec...

Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

Provides a comprehensive survey on general world models, a key concept for AGI. It analyzes OpenAI's Sora within this framework, discussing its capabilities, limitations, and the future trajectory for...

Zero-Shot Tokenizer Transfer

Introduces a method to transfer a language model to a new tokenizer without retraining. This technique allows for adapting models to new languages or domains efficiently, improving performance and red...

emg2speech: synthesizing speech from electromyography using self-supervised speech models

Develops a neuromuscular speech interface that synthesizes audible speech directly from electromyographic (EMG) signals of orofacial muscles. The system leverages self-supervised speech representation...

Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception

Identifies 'temporal blindness' in LLM agents, where they fail to account for real-world time progression during multi-turn interactions. The paper diagnoses this issue and demonstrates its negative i...

Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way

Proposes a diffusion-based large language model that natively supports variable-length text generation. By treating the [EOS] token as a special signal, the model overcomes a key limitation of previou...

ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring

Proposes a 'Zero-Imitation' framework for end-to-end autonomous driving. Instead of relying on expert demonstrations, the model learns by generating and scoring its own trajectories based on safety an...

OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

Presents a method for learning reward models for complex, long-form agentic tasks. The system uses reinforcement learning and web-grounded feedback to train reward models that can evaluate the correct...

CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic

Presents an agent-based foundation model for analyzing high-resolution pathology images. The model mimics the diagnostic logic of human pathologists by sequentially selecting and analyzing regions of ...

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

Proposes a framework for proactive robotic manipulation using omni-modal context from vision, language, and audio. The robot can infer human intent and proactively assist in tasks without explicit ins...

NVSim: Novel View Synthesis Simulator for Large Scale Indoor Navigation

Presents a framework to automatically create large-scale, navigable simulators for indoor environments from simple image sequences. It adapts 3D Gaussian Splatting to build photorealistic scenes, enab...

📅

Tuesday, October 28, 2025

Executive Briefing Bullets (20) JSON

ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

Introduces the first zero-shot method for grounding 3D orientation in text-to-image models. It allows users to specify the viewpoint of multiple objects across diverse categories without requiring exp...

CoMo: Compositional Motion Customization for Text-to-Video Generation

Presents a method for compositional motion customization in text-to-video generation. It enables precise control over complex, multi-subject motions by decomposing motion descriptions and applying the...

More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Presents a method to unify image generation and depth estimation within a single text-to-image diffusion model. It overcomes the catastrophic degradation of generative capabilities during fine-tuning,...

Towards Generalisable Foundation Models for 3D Brain MRI

Introduces BrainFound, a self-supervised foundation model for 3D brain MRI analysis built by extending DINO-v2. It learns general-purpose features from large-scale unlabeled MRI datasets, demonstratin...

MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans

Proposes a system for converting 3D scans into parametric, constrained Computer-Aided Design (CAD) models. It reconstructs fine-grained geometric primitives and infers the underlying design intent, su...

VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting

Presents an end-to-end autonomous driving model that is robust to variations in camera viewpoint. It uses a feed-forward 3D Gaussian Splatting module to create an explicit 3D representation of the sce...

EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction

Proposes a 4D Gaussian Splatting method for reconstructing surgical scenes from endoscopic video. It uses a rational-wavelet representation to model non-rigid tissue motion and handles photometric inc...

LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Presents a lightweight framework for building unified multimodal models for both understanding and generation. It uses a double fusion approach to efficiently combine pre-trained vision encoders and L...

Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

Presents a dataset and method for large-scale, occupancy-centric driving scene generation. The framework allows for the creation of diverse and consistent driving scenarios conditioned on occupancy gr...

Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration

Introduces Kernel Density Steering (KDS), a novel inference-time framework for diffusion-based image restoration. It guides the sampling process toward high-density regions of the data manifold, promo...

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Proposes a Vision-Language-Action model for end-to-end autonomous driving. The model leverages world knowledge and reasoning to make driving decisions, using reinforcement fine-tuning and adaptive rea...

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Proposes VOLD, a method to transfer reasoning from text-only LLMs to Vision-Language Models using on-policy distillation. This technique leverages abundant text-based reasoning data to improve VLM per...

Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

Proposes a method to accelerate diffusion model sampling by adaptively combining ODE and SDE solvers. The technique introduces adaptive stochastic coefficients to leverage the complementary strengths ...

Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting

Introduces a unified framework for 3D open-vocabulary segmentation by integrating it with Gaussian Splatting. The method first reconstructs a 3D scene and then performs segmentation, ensuring multi-vi...

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Introduces a framework for egocentric video reasoning that infers the hidden intentions and actions of the camera-wearer. It uses a Spatio-Temporal Chain-of-Thought (CoT) approach, enabling multimodal...

FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time

Introduces a training-free method for multi-subject text-to-image generation by automatically fusing multiple subject-specific LoRAs at test time. It uses an auto-masking technique to apply different ...

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Introduces a benchmark for evaluating and mitigating hallucinations in Vision-Language Models for video understanding. It uses synthetic videos to test physical and common-sense reasoning, revealing m...

3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

Introduces a large-scale 3D radiology dataset for Medical Visual Question Answering (Med-VQA) using CT scans. It supports diverse diagnostic tasks and multi-temporal analysis, providing a comprehensiv...

AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays

Proposes an adversarial fair contrastive pre-training method for chest X-ray models to mitigate demographic biases. The AdFair-CLIP framework learns representations that are invariant to sensitive att...

Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

Proposes a flexible model merging technique that allows for navigating the trade-off between model accuracy and size. It can combine multiple single-task fine-tuned models into a multi-task model of a...

📅

Monday, October 27, 2025

Executive Briefing Bullets (20) JSON

WorldGrow: Generating Infinite 3D World

Presents WorldGrow, a framework for generating infinitely extendable 3D worlds. It addresses the challenges of creating large, continuous environments with coherent geometry and realistic appearance, ...

RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets

Proposes RigAnything, a template-free, autoregressive transformer model for 3D asset rigging. It probabilistically generates joints, skeleton topologies, and skinning weights, making diverse 3D assets...

Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining

Presents a method to overcome the batch size dependency in contrastive learning. The proposed Smart Batch Mining technique allows models to learn effective representations without requiring large batc...

Epipolar Geometry Improves Video Generation Models

Improves video generation models by incorporating epipolar geometry constraints into large latent diffusion transformers. This approach enhances geometric consistency, stabilizes motion, and reduces v...

zip2zip: Inference-Time Adaptive Tokenization via Online Compression

Proposes zip2zip, an inference-time adaptive tokenization method for large language models. It uses online compression to dynamically adjust the tokenizer's vocabulary to domain-specific inputs, impro...

Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Investigates the operational mechanisms of Classifier-Free Guidance (CFG) in text-to-image diffusion models. The paper proposes a new interpretation based on foresight fixed point iterations, aiming t...

SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

Presents SAMA, a Video Large Multimodal Model designed for fine-grained spatio-temporal understanding. It enables multi-turn, referential grounded video chat by mastering both video referring understa...

CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

Introduces CLIPGaussian, a universal and multimodal style transfer method for representations based on Gaussian Splatting (GS). It extends style transfer beyond simple color changes for GS-based image...

RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting

Presents RiverMamba, a State Space Model for global-scale river discharge and flood forecasting. This approach aims to improve the accuracy and efficiency of early warning systems by moving beyond loc...

Self-Refining Language Model Anonymizers via Adversarial Distillation

Proposes a self-refining framework for training language model-based anonymizers using adversarial distillation. This approach enhances privacy in LLM applications by creating open-source anonymizers ...

Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets

Introduces Seed3D, a system that converts images into high-fidelity, simulation-ready 3D assets. It aims to bridge the gap between content diversity and physics accuracy in world simulators, providing...

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Introduces VITA-1.5, a Multimodal Large Language Model focused on achieving GPT-4o level real-time interaction. It integrates vision and speech modalities to enhance dialogue systems, addressing the n...

InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Proposes InfiniPot-V, a key-value (KV) cache compression method for multimodal large language models processing streaming video. It allows for hour-long video reasoning on memory-constrained devices b...

InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Introduces InfiniDreamer, a novel framework for generating arbitrarily long human motion sequences. It overcomes the lack of long motion training data by using a segment score distillation approach, e...

Grasp2Grasp: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges

Presents Grasp2Grasp, a vision-based approach for dexterous grasp translation using Schrödinger Bridges. Given a visual observation of a source hand, the method synthesizes a functionally equivalent g...

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Introduces RTV-Bench, a new benchmark for evaluating Multimodal Large Language Models on continuous perception, understanding, and reasoning in dynamic environments. It uses real-time video to assess ...

ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents

Proposes ArtiLatent, a generative framework for synthesizing articulated 3D objects with fine-grained geometry and realistic appearance. It jointly models part geometry and articulation by embedding s...

Lorentz Local Canonicalization: How to Make Any Network Lorentz-Equivariant

Introduces Lorentz Local Canonicalization (LLoCa), a general framework that renders any standard neural network architecture Lorentz-equivariant. This method removes the need for specialized layers, b...

Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation

Introduces Hierarchical Soft Mixture-of-Experts (HoME) with a Mamba-based architecture for 3D medical image segmentation. The model is designed to efficiently process diverse 3D medical modalities and...

Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Presents Frame In-N-Out, a method for unbounded and controllable image-to-video generation. It leverages cinematic techniques to address key challenges in controllability, temporal coherence, and deta...

📅

Friday, October 24, 2025

Executive Briefing Bullets (20) JSON

Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency

Introduces Attentive Convolution, a layer unifying the global receptive field of self-attention with the efficiency of convolutions. The resulting AC-Net architecture achieves competitive performance ...

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Presents Sherlock, a framework for Vision-Language Models that performs self-correction on its own reasoning steps without external verifiers. By generating and refining hypotheses internally, it impr...

Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

Introduces a video generation framework that improves physical plausibility by regularizing the model with 3D point trajectories. By augmenting 2D videos with this 3D-aware data, the fine-tuned latent...

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Introduces OpenWorldSAM, a framework that extends the Segment Anything Model (SAM) to perform universal image segmentation from open-ended language prompts. By integrating a vision-language model, it ...

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

Demonstrates emergent properties in biological vision models by scaling hierarchical contrastive learning on a large-scale, taxonomy-curated dataset. The resulting BioCLIP 2 model shows improved zero-...

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Develops a method where an LLM iteratively fine-tunes itself to improve its ability to generate adversarial suffixes that jailbreak other models. This automated self-improvement loop discovers more ef...

Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

Presents Spatial-DISE, a unified benchmark for evaluating the spatial reasoning capabilities of Vision-Language Models across four key dimensions: Direction, Intersection, Scale, and Existence. It pro...

Statistical Inference for Generative Model Comparison

Develops a method for statistically comparing generative models by providing confidence intervals on the distance between a model's generated distribution and the true data distribution. This allows f...

AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models

Introduces AccuQuant, a post-training quantization method for diffusion models that mitigates the accumulation of quantization errors over multiple denoising steps. By simulating a few sampling steps ...

JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles

Proposes a framework that bridges smoothed molecular dynamics (MD) with score-based generative models to efficiently sample protein conformational ensembles. The model learns from smoothed MD trajecto...

Positional Encoding Field

Proposes Positional Encoding Field (PEF), a continuous function that generates positional encodings for Diffusion Transformers based on patch coordinates. This method improves generation quality and a...

Sampling from multi-modal distributions with polynomial query complexity in fixed dimension via reverse diffusion

Provides the first algorithm for sampling from multi-modal distributions, including Gaussian mixtures, with query complexity that is polynomial in the multi-modality parameters. The method is based on...

Generative diffusion model surrogates for mechanistic agent-based biological models

Proposes using generative diffusion models as computationally efficient surrogates for mechanistic, agent-based biological models like the Cellular-Potts Model (CPM). The surrogate model learns to emu...

MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation

Proposes a 'model MoE-ization' strategy that converts a pretrained model's weight matrices into Mixture-of-Experts (MoE) layers for multi-task adaptation. This SVD-based method mitigates task conflict...

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Presents a training-free method for subject-driven text-to-image generation that grafts cross-image features at inference time. It preserves subject identity from reference images by manipulating atte...

AnyPcc: Compressing Any Point Cloud with a Single Universal Model

Introduces AnyPcc, a universal point cloud geometry compression model designed to generalize across diverse data distributions. It uses a robust context model and efficient handling of out-of-distribu...

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

Challenges the dominant two-stage paradigm in computational pathology by demonstrating that a properly regularized, end-to-end trained model can outperform methods relying on pre-trained, frozen encod...

Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

Proposes a new evaluation framework to assess large-scale video generation models as simulators of multi-person pedestrian dynamics. The study finds that while models produce visually realistic scenes...

REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Introduces REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models against real-world perturbations. It assesses model performance under variou...

PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Introduces Online Audio-Visual Event Parsing (On-AVEP) and a Predictive Future Modeling (PreFM) framework to enable real-time event parsing in videos. The model processes video streams incrementally a...

📅

Thursday, October 23, 2025

Executive Briefing Bullets (20) JSON

PixelWorld: How Far Are We from Perceiving Everything as Pixels?

Proposes a unified "perceive everything as pixels" approach for agentic models, encoding both text and images into a shared pixel-space representation. This framework aims to eliminate separate text t...

SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets

Presents a neuro-symbolic agent designed for complex reasoning over large spreadsheets. It combines a neural model for understanding natural language queries with a symbolic engine for executing opera...

CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation

Integrates causal graphs into Retrieval-Augmented Generation (RAG) to enhance reasoning and reduce context disruption. By retrieving and reasoning over causal relationships instead of just semantic si...

ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers

Proposes a novel method for improving tool retrieval by 'instilling' LLM reasoning capabilities into the retriever itself. This is achieved by having the LLM generate synthetic queries and tool usage ...

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

Introduces a difficulty-adaptive reasoning framework for token-efficient LLM inference. The system dynamically adjusts the complexity of its 'thinking traces' based on a problem's perceived difficulty...

VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction

Introduces Visual Geometry Gaussian Splatting (VGD), a feed-forward method for surround-view autonomous driving scene reconstruction. It uses a visual geometry-aware transformer to explicitly model 3D...

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

Introduces a balanced, long-context benchmark for evaluating LLMs with context lengths up to 256K. The benchmark features five distinct length levels and is designed to mitigate knowledge leakage and ...

MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models

Introduces MoAlign, a motion-centric representation alignment method for text-to-video diffusion models. It explicitly aligns motion representations within the model's U-Net architecture, improving th...

JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation

Improves factual hallucination detection by jointly generating claims from an LLM's response and verification queries for those claims. This joint process creates a stronger signal for identifying uns...

MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting

Introduces a Mixture of Experts (MoE) architecture for dynamic 3D Gaussian Splatting. This approach uses different 'expert' networks to model various types of motion and scene dynamics, enabling high-...

metaTextGrad: Automatically optimizing language model optimizers

Introduces a method where Large Language Models automatically optimize the update rules of learning algorithms. By representing optimizer logic as text, LLMs can meta-learn and propose superior optimi...

Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search

Investigates and addresses context limitations in long-horizon agentic search tasks. The work identifies how agents 'get lost' during long explorations and proposes a framework to improve information ...

Machine Text Detectors are Membership Inference Attacks

Reframes the problem of detecting machine-generated text as a form of Membership Inference Attack (MIA). This conceptual link reveals that text detectors inherently expose information about a model's ...

Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking

Presents Ninja Codes, neurally generated fiducial markers for 6-DoF tracking that blend into real-world environments. An encoder network subtly alters arbitrary images to embed tracking information, c...

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Utilizes reinforcement learning (RL) to enhance the advanced reasoning capabilities of LLMs over long contexts. The method trains models to discover and apply complex thinking patterns required for hi...

CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation

Presents an edge-first framework for processing continuous multimodal sensor streams into compact semantic tokens. It enables cost- and uncertainty-aware cooperation between edge devices and cloud-bas...

D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Proposes a Detector-to-Differentiable Critic (D2D) framework to improve the numeracy of text-to-image diffusion models. By incorporating a differentiable object counting module as a critic during trai...

WikiVideo: Article Generation from Multiple Videos

Introduces the task of grounded article generation from multiple, diverse videos about a real-world event. The goal is to create a Wikipedia-style article where all information is explicitly supported...

OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform

Presents an open-source platform for creating context-aware safety guardrails for LLM applications. The system allows developers to define and enforce complex safety policies, enabling more robust pro...

📅

Wednesday, October 22, 2025

Executive Briefing Bullets (20) JSON

Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Proposes Compressed Latent Reasoning (CoLaR), a framework that dynamically compresses token-level Chain-of-Thought into a latent space. This approach accelerates inference and reduces computational co...

From Volume Rendering to 3D Gaussian Splatting: Theory and Applications

Provides a comprehensive theoretical overview and survey of 3D Gaussian Splatting (3DGS), tracing its evolution from classical volume rendering. The paper details the underlying principles, mathematic...

World-in-World: World Models in a Closed-Loop World

Introduces "World-in-World," a framework for evaluating generative world models in a closed-loop setting for decision-making tasks. This work bridges the gap between visual simulation and agent contro...

DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

Introduces DCAD-2000, a large-scale multilingual corpus covering over 2000 languages, constructed from web-crawled data. It proposes a novel "Data Cleaning as Anomaly Detection" method to ensure high ...

UltraGen: High-Resolution Video Generation with Hierarchical Attention

Introduces UltraGen, a high-resolution video generation model based on a diffusion transformer. It employs a novel Hierarchical Attention mechanism to efficiently model both local and global dependenc...

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Provides a comprehensive survey and meta-analysis of methods integrating Large Language Models with 3D spatial data (3D-LLMs). The paper categorizes methodologies, summarizes key tasks and datasets, a...

OmniNWM: Omniscient Driving Navigation World Models

Introduces OmniNWM, an omniscient driving navigation world model designed to predict future states across multiple modalities (video, LiDAR, maps). The model handles long sequences, incorporates preci...

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Proposes Janus-Pro-R1, a Multimodal Large Language Model that uses reinforcement learning to create a synergistic link between visual comprehension and generation. This allows the model's understandin...

3D Audio-Visual Segmentation

Introduces and defines the task of 3D Audio-Visual Segmentation. This work extends 2D audio-visual segmentation into 3D space, aiming to identify and segment sounding objects within a 3D scene represe...

Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction

Demonstrates the use of a fine-tuned geospatial foundation model for detecting, simulating, and predicting urban heat island effects. The model leverages diverse data sources to generate high-resoluti...

DeepSeek-OCR: Contexts Optical Compression

Introduces DeepSeek-OCR, a novel method for extreme long-context compression by mapping text into an optical 2D representation. This approach leverages an encoder-decoder architecture to potentially b...

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Presents RAD, a closed-loop Reinforcement Learning framework for end-to-end autonomous driving. It trains a driving policy directly in a large-scale, 3D Gaussian Splatting-based simulated environment,...

Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

Proposes a corpus-free pipeline for training dense retrieval models by using a Large Language Model to generate synthetic queries and hard negative passages. This "generate, don't retrieve" approach e...

The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

Proposes the "Translation Barrier Hypothesis," arguing that poor multilingual generation in LLMs for low-resource languages stems from an implicit task-solving-then-translation pipeline failure. This ...

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Proposes Re-ttention, an ultra-sparse attention mechanism for Diffusion Transformers that statistically reshapes attention maps to focus computation on important query-key pairs. This method significa...

SAM 2++: Tracking Anything at Any Granularity

Presents SAM 2++, a unified framework for video tracking that can handle targets of any granularity, from points and boxes to masks. It extends the Segment Anything Model (SAM) with a novel design to ...

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Introduces Visionary-R1, a method that uses reinforcement learning to mitigate shortcut learning in visual reasoning models. By rewarding generalizable reasoning paths over simple correlations, it imp...

Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

Introduces Robobench, a comprehensive benchmark for evaluating Multimodal Large Language Models as the high-level reasoning "brain" for embodied agents. The benchmark assesses capabilities in percepti...

MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Presents MSR-Align, a framework for improving safety-aware reasoning in Vision-Language Models. It uses a policy-grounded multimodal alignment technique to steer the model's chain-of-thought process a...

Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving

Introduces Occluded nuScenes, a new multi-sensor dataset for evaluating perception model robustness in automated driving. The dataset systematically introduces synthetic occlusions to sensors, providi...

📅

Tuesday, October 21, 2025

Executive Briefing Bullets (20) JSON

A Comprehensive Survey on World Models for Embodied AI

Provides a comprehensive survey on world models for embodied AI agents. It organizes the field by defining world models as internal simulators that capture environment dynamics, enabling agents to sup...

Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Demonstrates that visual autoregressive models can outperform diffusion models in inference-time scaling through search strategies. While search offers limited benefits for diffusion models, it signif...

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Presents a method to reduce visual hallucinations in Vision-Language Models (VLMs) by incorporating a verification step. It uses retrospective resampling, where the model verifies its own generated te...

A Synthetic Data-Driven Radiology Foundation Model for Pan-tumor Clinical Diagnosis

Develops a radiology foundation model for pan-tumor clinical diagnosis using synthetic data to overcome the scarcity of annotated medical images. The model is trained on a large-scale synthetic datase...

Attention (as Discrete-Time Markov) Chains

Introduces a novel theoretical interpretation of the attention matrix in Transformers as a discrete-time Markov chain. This framework unifies common attention operations like selection and averaging a...
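
The Markov-chain reading rests on one standard property: each row of a softmax attention matrix is non-negative and sums to 1, so the matrix is row-stochastic and can serve as a transition matrix. The following is a minimal sketch of that property only, not the paper's method; the score values and the helper names `softmax_rows` and `matmul` are hypothetical.

```python
import math


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row of a score matrix."""
    rows = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def matmul(a, b):
    """Plain-Python matrix product, enough for a small illustration."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical raw attention scores for three tokens (illustrative values).
scores = [
    [2.0, 0.5, -1.0],
    [0.0, 1.0, 0.0],
    [-0.5, 0.5, 3.0],
]

# P[i][j] can be read as the probability of moving from token i to token j.
P = softmax_rows(scores)

# Two steps of the chain: the product of row-stochastic matrices
# is again row-stochastic.
P2 = matmul(P, P)
```

Under this reading, `P2[i][j]` is the probability of reaching token `j` from token `i` in two hops.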

Scaling Laws for Deepfake Detection

Presents a systematic study of scaling laws for deepfake detection, analyzing model performance against the number of real image domains, generation methods, and training images. The work provides fou...

Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

Proposes a method to scale Multi-modal Large Language Models (MLLMs) by decoupling their perception and reasoning modules. This allows for upgrading the internal language model without expensive joint...

REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Presents REALM, an MLLM-Agent framework for open-world 3D reasoning and editing on Gaussian Splatting representations. The agent interprets complex human instructions to perform precise 3D segmentatio...

FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions

Proposes FairGen, a method to enhance fairness in text-to-image diffusion models by self-discovering latent directions associated with biases. The approach allows for mitigating these biases during th...

SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology

Presents SSL4Eco, a global, seasonal, and multi-spectral dataset for self-supervised learning in ecology. It provides a large-scale resource of remote sensing imagery to train geospatial foundation mo...

Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Proposes an industry-level omni-modal large language model pipeline integrating auditory, visual, and linguistic modalities. The system overcomes challenges like limited tri-modal datasets and high co...

Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

Introduces Morpheus, a benchmark for evaluating the physical reasoning of video generative models using real-world physical experiments. It provides a dataset and evaluation suite to test a model's ab...

Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models

Introduces a vision-centric model for autonomous driving that performs 4D occupancy forecasting and planning. It uses an implicit residual world model to predict changes in the scene rather than recon...

Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset

Introduces Embody 3D, a large-scale multimodal dataset featuring 500 hours of 3D motion data from 439 participants. The dataset includes diverse single-person and two-person interactions, providing a ...

Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

Proposes a method to add pixel-level segmentation capabilities to frozen, pre-trained Multimodal Large Language Models (MLLMs) without fine-tuning the base model. It trains a lightweight segmentation ...

Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention

Presents Scale-DiT, a diffusion transformer model for ultra-high-resolution text-to-image generation. It introduces a hierarchical local attention mechanism to overcome the quadratic complexity of sta...

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Introduces VisionSelector, an end-to-end learnable module for compressing visual tokens in Multimodal LLMs. It adaptively selects the most informative tokens from high-resolution or multi-image inputs...

Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

Systematically investigates the cross-task generalization capabilities of vision-language-action (VLA) models for robotic manipulation. The study analyzes how VLA models perform on unseen tasks, provi...

StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales

Introduces StretchySnake, a flexible training strategy for State Space Models (SSMs) in action recognition. By training the model on clips of varying spatio-temporal scales, it improves generalization...

PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Introduces PRISMM-Bench, a benchmark for evaluating the ability of Large Multimodal Models (LMMs) to detect multimodal inconsistencies in scientific papers. It tests whether models can reason across t...

📅

Monday, October 20, 2025

Executive Briefing Bullets (20) JSON

Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation

Introduces the Predictive-Corrective (PC) paradigm and PCMambaN network for anatomy-informed brain MRI segmentation. Achieves accelerated learning and improved efficiency in data-scarce medical imagin...

PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction

Introduces PFGS, a pose-aware 3D Gaussian Splatting framework that reconstructs complete objects from multi-pose image captures. Addresses limitations of single-pose methods by integrating pose inform...

Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics

Proposes a diffusion bridge network to synthesize clinical-grade FDG-PET scans from standard MRI images for dementia diagnosis. This approach makes a critical diagnostic tool more accessible by simula...

Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis

Introduces Unimedvl, a unified medical vision-language model for both understanding and generation tasks. It processes diverse multimodal inputs to generate textual reports, visual annotations, and se...

Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

Presents Skyfall-GS, a method to synthesize large-scale, explorable, and geometrically accurate 3D urban scenes from satellite imagery. It addresses the lack of real-world 3D scans for training genera...

AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

Introduces AutoGraph-R1, an end-to-end reinforcement learning framework for building knowledge graphs for RAG systems. It directly optimizes the KG construction process to improve performance on downs...

VISTA: A Test-Time Self-Improving Video Generation Agent

Proposes VISTA, a test-time self-improving agent for text-to-video generation. Instead of relying on a perfect user prompt, VISTA iteratively refines the generated video based on user-defined scoring ...

BLIP3o-NEXT: Next Frontier of Native Image Generation

Presents BLIP3o-NEXT, a fully open-source vision-language foundation model that unifies text-to-image generation and image editing within a single architecture. The model demonstrates strong performan...

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Introduces Ditto, a framework to address data scarcity in instruction-based video editing. It features a pipeline to automatically generate a large-scale, high-quality synthetic dataset of video editi...

MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention

Presents MAVR-Net, a multi-view learning framework for MAV action recognition using cross-view attention. Addresses limitations of RGB-only models by capturing complex spatial-temporal characteristics...

Bolt3D: Generating 3D Scenes in Seconds

Presents Bolt3D, a latent diffusion model for feed-forward 3D scene generation from images. It directly samples a 3D scene representation in under seven seconds on a single GPU, achieving a significan...

YOLOE: Real-Time Seeing Anything

Introduces YOLOE, a model extending the YOLO series for real-time open-vocabulary object detection and segmentation. It leverages visual and text prompts to detect and segment any object without being...

SHARE: Scene-Human Aligned Reconstruction

Introduces SHARE, a technique that leverages scene geometry to accurately ground human motion reconstruction from monocular RGB video. Addresses challenges in placing humans in 3D space for realistic ...

FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

Proposes FreqPDE, rethinking positional depth embedding for multi-view 3D object detection transformers. Addresses depth prediction quality issues in autonomous driving by improving spatial informatio...

Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement

Introduces the Hierarchical Mixing Architecture (HiMA) for efficient low-light RAW image enhancement. Leverages complementary strengths of Transformer and Mamba for improved enhancement quality and hi...

Exploring Conditions for Diffusion models in Robotic Control

Explores leveraging pre-trained text-to-image diffusion models for task-adaptive visual representations in robotic control without fine-tuning. Investigates optimal conditions for applying textual pro...

Proto-Former: Unified Facial Landmark Detection by Prototype Transformer

Proposes Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework. Addresses limitations in single-dataset training by explicitly unifying landmark detection across different ...

Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training

Presents a systematic investigation of custom CNN architectures for satellite land use classification, achieving 97.23% accuracy on EuroSAT without pre-training. Introduces a novel balanced multi-task...

Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics

Introduces SiM2P, a 3D diffusion bridge-based framework simulating clinical-grade PET from MRI for dementia diagnostics. Learns a probabilistic mapping from MRI to PET images, addressing accessibility...

V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

Presents V2X-Radar, a new large-scale, multi-modal dataset for cooperative perception in autonomous driving. It uniquely features 4D radar data alongside LiDAR and camera streams, enabling research on...

📅

Friday, October 17, 2025

Executive Briefing Bullets (20) JSON

ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Introduces ScholarBench, a benchmark for evaluating LLMs on complex academic problem-solving. It targets specialized contexts to assess academic reasoning ability, addressing limitations of prior benc...

EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and Dictionary-based Named Entity Recognition from Medical and Life Science Text

Develops EasyNER, an easy-to-use pipeline for Named Entity Recognition in medical and life science text. It provides automated text mining to help researchers utilize information from large bodies of ...

Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Proposes Vgent, a graph-based Retrieval-Augmented Generation framework for long video understanding. It addresses challenges in processing extended video tokens and retaining long-term sequential info...

PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Proposes PIA, a deepfake detection method using phoneme-temporal and identity-dynamic analysis. It aims to improve the identification of modern deepfakes generated by advanced generative models, overc...

Vision-Centric Activation and Coordination for Multimodal Large Language Models

Introduces VaCo, a framework optimizing MLLM representations through vision-centric activation and coordination. It enhances analytical abilities by leveraging multiple vision foundation models, addre...

Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Applies pruning to overparameterized multi-task networks for degraded web image restoration. It addresses the quality of web images affected by lossy operations, aiming to recover clean, high-quality ...

Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Introduces DentVFM, the first family of vision foundation models for oral and maxillofacial radiology. It addresses limitations of single-modality, task-specific dental AI systems, aiming for generali...

Consistent text-to-image generation via scene de-contextualization

Proposes scene de-contextualization for consistent text-to-image generation. It addresses identity shift by decoupling subject and scene context, enabling identity-preserving images across diverse sce...

Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Introduces Efficient Video Sampling (EVS), a method for pruning temporally redundant tokens in videos. It addresses scalability limitations of VLMs processing dense frame sequences, reducing token red...

SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation

Proposes SteeringTTA, an inference-only framework guiding diffusion-based input adaptation for test-time adaptation. It steers diffusion trajectories to improve robustness across distortion types, add...

Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?

Investigates LLM stability in translating natural language to formal logic for reasoning. Identifies inconsistencies in symbolic representations across linguistic forms, highlighting a need for more r...

CAP: Evaluation of Persuasive and Creative Image Generation

Introduces three evaluation metrics: Creativity, prompt Alignment, and Persuasiveness (CAP) for advertisement image generation. Addresses the challenge of evaluating Text-to-Image models beyond simple...

Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Presents a zero-shot pipeline for creating hyperrealistic 3D avatars from phone images. Introduces a generative canonicalization approach to address geometric inconsistencies and improve identity pres...

CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Introduces CLEAR, a causal-inference-based framework for robust histopathology tumor detection. It leverages semantic features while mitigating OOD shifts by modeling causal relationships, improving g...

DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Proposes DOS, a method for directional object separation in text embeddings for multi-object image generation. It addresses challenges in T2I models with multiple objects, mitigating object neglect an...

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Introduces PaddleOCR-VL, a compact Vision-Language Model for multilingual document parsing. It efficiently supports 109 languages and excels at recognizing complex elements like text, tables, and char...

Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval

Proposes a framework for brain MR image harmonization that acquires interpretable domain information. It disentangles domain-invariant and domain-specific features to improve machine learning performa...

Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Introduces an ego-proactive Video-LLM for streaming video that actively understands and anticipates events. It focuses on proactive coherence and just-in-time perception and reasoning for dynamic, evo...

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Introduces RepTok, a generative modeling framework using single continuous latent tokens from self-supervised ViTs. It adapts semantic tokens with low-level details for faithful image reconstruction, ...

WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging

Introduces WeCKD, a weakly-supervised chained distillation network for efficient multimodal medical imaging. It addresses knowledge degradation and inefficient supervision in traditional KD by using a...

📅

Thursday, October 16, 2025

Executive Briefing Bullets (20) JSON

Taming the Fragility of KV Cache Eviction in LLM Inference

Introduces a new KV cache eviction strategy that dynamically adapts eviction thresholds based on predicted future importance. Achieves significant memory reduction and speedup in LLM inference by pres...

D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree

Introduces D-SMART, a dynamic structured memory and reasoning tree framework to enhance LLM dialogue consistency. Addresses factual inconsistencies and logical decay in multi-turn dialogues by adaptiv...

Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

Introduces Breadcrumbs Reasoning, using learned compression beacons to periodically compress the KV cache. Achieves memory-efficient long-context reasoning by reducing KV cache costs, enabling LLMs to...

Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

Introduces a multi-pair, multi-perspective preference optimization method for machine translation that addresses flawed reward signals and inefficient data utilization. Improves LLM alignment to human prefer...

Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Explores Bayesian Persuasion (BP) in natural language for single-turn dialogues to enhance LLM strategic persuasion. Incorporates information asymmetry and avoids pre-commitment assumptions, improving...

Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models

Presents a framework for enhancing LLM capabilities in underrepresented languages by fine-tuning language-specific subnetworks. Identifies language-specific neurons and tunes associated weights, impro...

Assessing Web Search Credibility and Response Groundedness in Chat Assistants

Introduces a novel methodology for evaluating chat assistants' web search behavior, focusing on source credibility and response groundedness. Assesses how assistants integrate web search, highlighting...

ICA-RAG: Information Completeness Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis

Proposes ICA-RAG, an adaptive retrieval-augmented generation framework guided by information completeness for disease diagnosis. Tailors retrieval strategies to diagnostic difficulty and sample inform...

FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

Introduces FreshTab, a method for generating table-to-text benchmarks on the fly from Wikipedia. Combats LLM data contamination and enables domain-sensitive evaluation, addressing precision needs in table-to-tex...

Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses

Introduces DualHyp, an audio-visual speech error correction framework using an LLM to compose N-best hypotheses from ASR and VSR models. Enhances error correction by reasoning over modality-specific e...

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Positions attention heads as a mechanistic blueprint for LLM reasoning, distinguishing between local and global attention for fine-grained policy optimization. Enables legible internal logic and impro...

MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning

Proposes MemoTime, a memory-augmented temporal knowledge graph to enhance LLM temporal reasoning. Addresses challenges in understanding evolving event sequences and compound operators, enabling more a...

BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

Presents BRIEF-Pro, a universal, lightweight compressor for distillation of relevant evidence in retrieval-augmented generation. Enables fast and accurate multi-hop reasoning by summarizing retrieved ...

Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

Assesses whether the Concordia framework can effectively model Theory of Mind (ToM) in simulated environments using GPT-4. Explores if LLMs can perform tasks requiring genuine understanding of others'...

The Mechanistic Emergence of Symbol Grounding in Language Models

Introduces a controlled evaluation framework to investigate the mechanisms and loci of symbol grounding emergence in (vision-)language models. Explores how symbols acquire meaning by connecting to rea...

How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study

Systematically examines how decoding strategies affect the detectability of machine-written texts. Demonstrates the robustness of text detection systems to changes in generation settings, highlighting...

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Surveys Arabic LLM evaluation benchmarks, analyzing 40+ resources across NLP tasks, knowledge, and culture. Proposes a taxonomy and identifies critical gaps, revealing progress and areas needing devel...

Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

Proposes a confidence estimation method for RAG systems using feed-forward network activations to align with output correctness. Enables response abstinence based on uncertainty, improving LLM trustwo...

A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

Presents the first large-scale, multilingual study on personalized disinformation generation by LLMs. Investigates the interplay between safeguards, personalization, and disinformation, revealing LLM ...

Towards Region-aware Bias Evaluation Metrics

Identifies topical differences in gender bias across regions and proposes region-aware bias evaluation metrics. Addresses limitations of existing benchmarks by considering context-specific biases, lea...

📅

Wednesday, October 15, 2025

Executive Briefing Bullets (19) JSON

Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos

Proposes a hierarchical reasoning framework for incident report generation from dashcam videos. Aims to improve hazard understanding in out-of-distribution scenarios for autonomous driving models.

SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis

Introduces SpineBench, a Visual Question Answering benchmark for fine-grained spinal pathology analysis. Evaluates multimodal LLMs, addressing limitations of existing general medical benchmarks.

State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

Applies pre-trained state space models to video classification using prompt learning. Gathers and spreads spatio-temporal information for efficient adaptation to downstream tasks.

UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering

Proposes UniGS, a unified representation and framework for multimodal 3D reconstruction using Gaussian Splatting. Renders RGB, depth, normals, and semantic logits simultaneously with high fidelity.

BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation

Introduces BEEP3D for 3D instance segmentation using box-level supervision. Generates pseudo-masks end-to-end, addressing annotation costs and ambiguity in overlapping regions.

BIGFix: Bidirectional Image Generation with Token Fixing

Proposes BIGFix for bidirectional image generation using token fixing. Aims to improve inference efficiency by combining auto-regressive modeling with multi-token prediction.

Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images

Proposes Multiplicative Loss and Confidence-Adaptive Multiplicative Loss for semantic segmentation. Enhances performance in medical and cellular images, especially with limited data.

CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Introduces CurriFlow, a semantic occupancy prediction framework for 3D Semantic Scene Completion. Integrates optical flow for temporal alignment, addressing motion reasoning and occlusion challenges.

Scene Coordinate Reconstruction Priors

Presents a probabilistic reinterpretation of training Scene Coordinate Regression (SCR) models. Infuses high-level reconstruction priors to improve implicit scene representations for 3D vision.

VideoLucy: Deep Memory Backtracking for Long Video Understanding

Introduces VideoLucy, a framework with deep memory backtracking for long video understanding. Addresses challenges in temporal context capture and sparse frame sampling for agent-based systems.

HoneyBee: Data Recipes for Vision-Language Reasoners

Introduces data curation approaches to study their impact on Vision-Language reasoning capabilities. Analyzes effects of context sources and implements targeted data interventions.

Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding

Proposes spatio-temporally consistent proxy nodes to represent dynamic objects for vectorized video representation. Enables easy editing by overcoming vulnerabilities of pixel-level matching.

PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes

Introduces Priority-Adaptive Gaussian Splatting (PAGS) for reconstructing dynamic 3D urban scenes. Injects task-aware semantic priorities into 3D representations to address fidelity vs. cost trade-off...

AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion

Proposes an angle-based perception approach for spatial-sensitive multi-modality image fusion. Integrates visible-infrared information to produce enhanced images for downstream tasks.

The Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data

Conducts a comparative analysis of synthetic versus real-world data for object detection fine-tuning. Investigates opportunities to optimize workflows in industries like manufacturing.

Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection

Proposes Ivan-ISTD, a wavelet-guided framework for Infrared Small Target Detection. Addresses cross-domain shift and heteroscedastic noise perturbations using invariance learning.

Local Background Features Matter in Out-of-Distribution Detection

Investigates the role of local background features in out-of-distribution detection. Addresses overconfident predictions on OOD data, improving reliability in real-world deployments.

Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling

Proposes Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition. Enhances robustness and accuracy in recognizing completed steps from egocentric assembly videos.

Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda

Reviews, evaluates, and proposes a research agenda for using Vision-Language Models in general urban monitoring. Addresses challenges in object diversity, environmental conditions, and contextual unde...

📅

Tuesday, October 14, 2025

No research highlights available for this date

📅

Monday, October 13, 2025

Executive Briefing Bullets (20) JSON

SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

Introduces SpatialSplat for efficient semantic 3D reconstruction from sparse unposed images. Associates primitives with compressed semantic features, addressing limitations of prior methods in incorpo...

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

Introduces ProbRes, a probabilistic residual search framework based on jump-diffusion for open-world egocentric activity recognition. Balances prior-guided exploration and likelihood-driven exploitati...

Solving Inverse Problems with FLAIR

Proposes FLAIR to solve inverse imaging problems using flow-based latent generative models. Addresses intractable data likelihood and direct generative model integration challenges for improved fideli...

HoliTom: Holistic Token Merging for Fast Video Large Language Models

Proposes HoliTom for holistic token merging to accelerate video LLMs. Addresses computational inefficiency caused by redundant video tokens with an efficient token pruning strategy.

SMF: Template-free and Rig-free Animation Transfer using Kinetic Codes

Introduces Self-supervised Motion Fields (SMF) for template-free, rig-free animation transfer. Addresses limitations of existing methods like motion jitter and limited generalization to unseen motions...

SQ-GAN: Semantic Image Communications Using Masked Vector Quantization

Introduces SQ-GAN, which integrates semantic image coding and vector quantization for optimized image compression. Focuses on source coding, compliant with legacy systems, using semantic segmentation maps f...

Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption

Introduces online Video Depth Anything (oVDA) for temporally-consistent depth prediction with low memory. Adapts LLM techniques like latent feature caching for efficient online processing.

BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

Introduces BLINK-Twice, a vision-centric reasoning benchmark for MLLMs. Focuses on challenging perceptual tasks requiring reasoning from visual context rather than external knowledge.

MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Introduces MomentSeg for moment-centric sampling in referring video object segmentation. Jointly learns sampling strategies to improve temporal reasoning and fine-grained visual comprehension.

Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

Investigates using discrete semantic entropy (DSE) to filter questions likely to generate hallucinations in radiology VLMs. Aims to improve accuracy in medical image-based visual question answering.
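
As a rough illustration of entropy-based filtering: sample several answers to the same question, group them into clusters of equivalent meaning, and compute the Shannon entropy of the cluster frequencies. High entropy means the samples disagree, so the question is flagged. This is a sketch under stated assumptions, not the paper's implementation: the clustering step is taken as given, and the function names and the `threshold` default of 0.6 are hypothetical.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over answer clusters.

    cluster_ids holds one cluster label per sampled answer; how answers are
    clustered by meaning is assumed to happen elsewhere.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def should_answer(cluster_ids, threshold=0.6):
    """Keep a question only if its sampled answers agree closely enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return discrete_semantic_entropy(cluster_ids) <= threshold
```

When every sample falls into one cluster the entropy is zero and the question is kept; when the samples spread evenly over many clusters the entropy approaches the logarithm of the cluster count and the question is filtered out.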

The Role of Video Generation in Enhancing Data-Limited Action Understanding

Proposes a text-to-video diffusion transformer to generate annotated data for training, addressing data scarcity in video action understanding. Enables scalable, realistic data generation without huma...

CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

Proposes CQ-DINO to mitigate gradient dilution in vast vocabulary object detection. Addresses positive and hard negative gradient dilution by introducing category queries, improving learning signals f...

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Introduces DenseDPO for fine-grained temporal preference optimization in video diffusion models. Addresses limitations of pairwise video comparisons by enabling detailed temporal preference learning.

DiffMark: Diffusion-based Robust Watermark Against Deepfakes

Proposes DiffMark, a diffusion-based robust watermarking framework against deepfakes. Enables seamless watermark fusion during image generation, offering improved robustness against deepfake manipulat...

Differentially Private 2D Human Pose Estimation

Develops a differentially private framework for 2D human pose estimation. Provides formal privacy guarantees while addressing the data utility degradation typically associated with differential privac...

TARO: Toward Semantically Rich Open-World Object Detection

Proposes TARO for semantically rich open-world object detection, moving beyond closed-world assumptions. Aims to assign subcategories to novel objects for enhanced decision-making in safety-critical c...

Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation

Proposes TEMA-LLM for cross-domain sequential recommendation. Integrates tag-enriched multi-attention and LLMs to capture both domain-specific and cross-domain user behaviors effectively.

Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark

Proposes a progressive prompt fusion network for infrared image enhancement, addressing coupled degradations. Revisits imaging models to improve effectiveness on infrared sensors due to significant ima...

Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Proposes dynamic Chain-of-Thought for boosting multi-modal keyphrase prediction in vision-language models. Addresses limitations in handling absence and unseen scenarios, and overestimation in existin...

Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition

Presents Cattle-CLIP, a multimodal framework for cattle behavior recognition using semantic cues. Improves video-based visual feature recognition performance by leveraging semantic information.

📅

Friday, October 10, 2025

Executive Briefing Bullets (20) JSON

Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

Introduces an uncertainty-aware diffusion guided refinement framework for 3D scene reconstruction from a single image. It addresses limitations in existing methods that render incoherent and blurry no...

D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction

Proposes D$^2$GS, a depth-and-density guided Gaussian Splatting method that addresses instability and performance degradation in sparse-view reconstruction. It improves accuracy by identifying and fixing...

X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering

Introduces X2Video, a diffusion model for photorealistic video rendering guided by intrinsic channels and multimodal controls. It allows manipulation of color, material, geometry, and lighting with re...

ReSplat: Learning Recurrent Gaussian Splats

Proposes ReSplat, a recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicit gradients. It leverages rendering error as a feedback signal for improved performance, esp...

Less is More: Compact Clue Selection for Efficient Retrieval-Augmented Generation Reasoning

Proposes compact clue selection for efficient Retrieval-Augmented Generation (RAG) reasoning, optimizing input for LLMs. It extracts and organizes answer-relevant clues from documents to enhance reaso...

FlowNIB: An Information Bottleneck Analysis of Bidirectional vs. Unidirectional Language Models

Proposes FlowNIB to analyze bidirectional vs. unidirectional language models using the Information Bottleneck principle. It investigates the theoretical reasons behind bidirectional models' better con...

Argument Summarization and its Evaluation in the Era of Large Language Models

Investigates integrating LLMs into Argument Summarization (ArgSum) systems and proposes a novel prompt-based evaluation scheme. It validates this scheme through a new human benchmark dataset for ArgSu...

FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Introduces FlashDLM to accelerate Diffusion Language Model (DLM) inference using efficient KV caching and guided diffusion. It addresses slow inference in DLMs by optimizing token generation processes...

XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method

Presents XYZCylinder, a feedforward reconstruction method for driving scenes using a unified cylinder lifting approach. It improves generalization by learning a fixed view transformation for single-re...

Dual-Stream Alignment for Action Segmentation

Introduces the Dual-Stream Alignment Network (DSA Net) for action segmentation, proposing a novel dual-stream approach. It learns action-wise features to enhance performance by modeling spatio-tempora...

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Introduces NaViL, a native training approach for Multimodal Large Language Models (MLLMs) that systematically studies its design space and scaling properties under data constraints. It aims to improve...

Spectral Prefiltering of Neural Fields

Presents a method to optimize neural fields for spectral prefiltering in a single forward pass by analytically scaling Fourier feature embeddings. This enables efficient, resolution-independent neural...

SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

Presents SViM3D, a framework predicting multi-view consistent physically based rendering (PBR) materials from a single image. It extends latent video diffusion models to efficiently generate 3D object...

SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion

Introduces SatFusion, a unified framework for enhancing satellite IoT images by fusing multi-temporal and multi-source data. It exploits complementary information across temporal and source dimensions...

R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

Introduces R2RGEN for real-to-real 3D data generation to achieve spatially generalized robotic manipulation. It aims to train visuomotor policies robust to variations in object distribution, environme...

I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization

Introduces I&S-ViT, an inclusive and stable method for post-training quantization of Vision Transformers (ViTs). It addresses cost issues by enabling low-bit operation while mitigating performance dro...

Redundant Semantic Environment Filling via Misleading-Learning for Fair Deepfake Detection

Proposes a misleading-learning approach for fair deepfake detection that addresses dual-overfitting issues. It fills redundant semantic environments to improve fairness and reduce demographic bias in ...

Language Surgery in Multilingual Large Language Models

Investigates representation alignment in multilingual LLMs, particularly in middle layers, to disentangle language-specific and language-agnostic information. It confirms alignment and analyzes its be...

Targetless LiDAR-Camera Calibration with Neural Gaussian Splatting

Proposes Targetless LiDAR-Camera Calibration (TLC-Calib) using Neural Gaussian Splatting. It jointly optimizes sensor poses and a neural Gaussian-based representation, eliminating the need for physica...

DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream

Presents DEGS, a deformable event-based 3D Gaussian Splatting method for dynamic scenes using RGB and event streams. It addresses challenges in reconstructing dynamic scenes from low-framerate RGB vid...

📅

Thursday, October 9, 2025

Executive Briefing Bullets (20) JSON

SanDRA: Safe Large-Language-Model-Based Decision Making for Automated Vehicles Using Reachability Analysis

Proposes SanDRA, the first safe LLM-based decision-making framework for automated vehicles using reachability analysis. Addresses LLM hallucinations and integrates vehicle dynamics for safer autonomou...

Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models

Investigates the robustness of Vision-Language-Action (VLA) models in embodied AI to linguistic perturbations, specifically irrelevant context in commands. Presents a novel systematic study evaluating...

Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation

Introduces a Diffusion Trajectory-guided policy for long-horizon robot manipulation, leveraging diffusion models to mitigate compounding errors in imitation learning. Addresses challenges in out-of-di...

RAISE: A self-driving laboratory for interfacial property formulation discovery

Introduces RAISE (Robotic Autonomous Imaging Surface Evaluator), a closed-loop, self-driving laboratory. Links liquid formulation optimization with surface wettability assessment for interfacial prop...

Sampling Strategies for Robust Universal Quadrupedal Locomotion Policies

Investigates sampling strategies for configuration variations to generate robust universal locomotion policies for quadrupedal robots. Compares joint gain sampling strategies to enable single reinforc...

Distributed 3D Source Seeking via SO(3) Geometric Control of Robot Swarms

Presents a geometric control framework on the Lie group SO(3) for 3D source-seeking by robot swarms. Avoids Euler-angle singularities and quaternion ambiguities, ensuring intrinsic orientation represe...

DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction

Introduces DPL, a depth-only perceptive humanoid locomotion framework using realistic depth synthesis and cross-attention terrain reconstruction. Addresses limitations of current depth-image and eleva...

A Narwhal-Inspired Sensing-to-Control Framework for Small Fixed-Wing Aircraft

Presents an end-to-end sensing-to-control pipeline for small fixed-wing aircraft, combining bio-inspired hardware, physics-informed dynamics learning, and convex control allocation. Inspired by narwha...

EffiTune: Diagnosing and Mitigating Training Inefficiency for Parameter Tuner in Robot Navigation System

Introduces EffiTune to diagnose and mitigate training inefficiency for parameter tuners in robot navigation systems. Balances classical and learning-based methods to improve adaptability and stability...

Artists' Views on Robotics Involvement in Painting Productions

Explores professional abstract artists' perceptions of co-creative interactions with an autonomous painting robotic arm. Analyzes their experiences through semi-structured interviews to understand hum...

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Introduces RLinf-VLA, a unified and efficient framework for Vision-Language-Action (VLA) and Reinforcement Learning (RL) training. Addresses error accumulation in VLA models trained with supervised fi...

Assist-As-Needed: Adaptive Multimodal Robotic Assistance for Medication Management in Dementia Care

Presents Assist-As-Needed, an adaptive multimodal robotic assistance system for medication management in dementia care. Addresses limitations of one-size-fits-all assistive technologies by adapting as...

Diffusing Trajectory Optimization Problems for Recovery During Multi-Finger Manipulation

Proposes a framework using diffusion models to autonomously identify recovery needs and optimize contact-rich trajectories for multi-fingered robotic manipulation. Enables recovery behaviors to resume...

Tailoring materials into kirigami robots

Explores the potential of kirigami in robotics by tailoring materials for multifunctional, lightweight, and adaptable solutions. Details how kirigami components can be optimized for specific robotic applicat...

Safe Obstacle-Free Guidance of Space Manipulators in Debris Removal Missions via Deep Reinforcement Learning

Develops a model-free workspace trajectory planner for space manipulators using a TD3 agent for safe debris removal. Employs local control strategies for singularity avoidance and manipulability enhan...

Temporal-Prior-Guided View Planning for Periodic 3D Plant Reconstruction

Proposes temporal-prior-guided view planning for periodic 3D plant reconstruction. Aligns previous models with new observations and uses inflation to accommodate plant growth for improved reconstructi...

M^3RS: Multi-robot, Multi-objective, and Multi-mode Routing and Scheduling

Introduces the M^3RS problem for multi-robot missions, considering quality of service as a variable. Addresses time-constrained missions with multiple execution modes, varying resource needs, durations...

Generating and Optimizing Topologically Distinct Guesses for Mobile Manipulator Path Planning with Path Constraints

Proposes a pipeline for mobile manipulator path planning that generates and optimizes topologically distinct paths with end-effector constraints. Circumvents local optima convergence by discovering mu...

P2 Explore: Efficient Exploration in Unknown Cluttered Environment with Floor Plan Prediction

Proposes P2 Explore for efficient robot exploration in unknown cluttered environments by predicting floor plans. Improves exploration efficiency by overcoming limitations of traditional frontier-based...

COMPAct: Computational Optimization and Automated Modular design of Planetary Actuators

Introduces COMPAct, a framework for computational optimization and automated modular design of planetary actuators. Systematically identifies optimal gearbox parameters for a given motor across four g...

📅

Wednesday, October 8, 2025

Executive Briefing Bullets (20) JSON

SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning

Proposes SAFER, a framework for Safety Alignment via Efficient Ex-Ante Reasoning, enhancing LLM safety by instantiating structured reasoning to address harmful content generation. Demonstrates improve...

Lang-PINN: From Language to Physics-Informed Neural Networks via a Multi-Agent Framework

Introduces Lang-PINN, a multi-agent framework enabling LLMs to generate physics-informed neural networks (PINNs) from language descriptions. Simplifies PINN construction by automating PDE formulation,...

Measuring LLM Novelty As The Frontier Of Original And High-Quality Output

Introduces a new novelty metric for LLM generations, addressing limitations of prior work evaluating originality and quality. Aims to reliably measure an LLM's ability to generate novel, high-quality out...

ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Introduces ExpertLongBench, an expert-level benchmark with 11 tasks across 9 domains for long-form generation. Utilizes structured checklists validated by domain experts to evaluate LLM adherence to s...

Language Models Surface the Unwritten Code of Science and Society

Explores leveraging LLM biases to reveal society's "unwritten code", such as implicit stereotypes. Proposes a framework using a case study in science to uncover hidden rules in peer review, making biases ...

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Proposes a holistic evaluation for RAG systems and web agents on deep search tasks using hint-free questions and factorized metrics. Addresses limitations of current benchmarks that leak reasoning pat...

Applications of Large Models in Medicine

Explores advancements and applications of large models (LLMs, Vision, 3D, Multimodal) in medicine, revolutionizing disease prediction, diagnosis, and drug discovery. Integrates GNNs for medical knowle...

SocialNLI: A Dialogue-Centric Social Inference Dataset

Introduces SocialNLI (SoNLI), the first social dialogue inference dataset, assessing models' social abilities via theory-of-mind inferences from human dialogue. Addresses LLMs' struggles with sophisti...

Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Introduces Self-Filtered Distillation for patent classification, using LLM-generated rationales as trust signals. Addresses logical errors and misalignments in rationales by filtering noise for stable...

Submodular Context Partitioning and Compression for In-Context Learning-short paper

Addresses ICL's quadratic input complexity by proposing submodular context partitioning and compression. Mitigates information redundancy from partitions to improve performance, enabling efficient few...

Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

Adapts decoder-only LLMs to solve partial differential equations (PDEs) by proposing cross-modal adaptation, addressing limitations of encoder-only models. Shows potential for LLMs in scientific machi...

What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

Analyzes prompt underspecification in LLMs, showing fragile inference and instability across model/prompt changes. Proposes methods to manage underspecification, enabling more reliable LLM application...

FAID: Fine-Grained AI-Generated Text Detection Using Multi-Task Auxiliary and Multi-Level Contrastive Learning

Proposes FAID, a fine-grained detection framework using multi-task auxiliary and contrastive learning to classify human, LLM, and hybrid texts. Introduces FAIDSet, a multilingual dataset for improved ...

Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

Proposes Self-Routing RAG (SR-RAG), a framework binding selective retrieval with knowledge verbalization. Enables LLMs to improve RAG accuracy and efficiency by making better retrieval decisions, brid...

Tracing Multilingual Factual Knowledge Acquisition in Pretraining

Traces factual knowledge acquisition and cross-lingual consistency in LLM pretraining. Focuses on OLMo-7B, finding improvements in accuracy and consistency over time, providing insights into how factu...

To model human linguistic prediction, make LLMs less superhuman

Investigates using LLMs as cognitive models of human linguistic prediction by making them less superhuman. Suggests that improving LLM performance on prediction tasks requires making them more human-l...

WildIFEval: Instruction Following in the Wild

Introduces WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Evaluates LLMs' ability to handle complex instructions spanning broad lexical and t...

The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

Surveys recent efforts to overcome the quadratic complexity bottleneck of transformer attention. Critically analyzes sub-quadratic attention variants, RNNs, state space models, and hybrid architecture...

SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?

Proposes SimulatorArena to systematically study the reliability of LLM-simulated users for AI assistant evaluation. Addresses the lack of benchmarks for automatic evaluation, aiming to determine if si...

Cross-Lingual Mental Health Ontologies for Indian Languages: Bridging Patient Expression and Clinical Understanding through Explainable AI and Human-in-the-Loop Validation

Proposes CL-PDE, a framework for cross-lingual mental health ontologies using graphs for Indian languages. Bridges patient expression and clinical understanding via explainable AI and human-in-the-loo...

📅

Tuesday, October 7, 2025

Executive Briefing Bullets (18) JSON

Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation

Proposes a joint learning framework for 6D pose estimation using denoising diffusion and score scaling sampling. This method improves training convergence and reduces the need for additional pose vali...

Training Optimal Large Diffusion Language Models

Introduces Quokka, the first systematic scaling law for diffusion language models. It covers compute- and data-constrained regimes, offering practical guidance for DLM training and future AI research.

LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Proposes LaDiR, unifying latent diffusion with LLMs for improved text reasoning. This framework enables iterative refinement of reasoning paths, addressing autoregressive decoding limitations and enha...

SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs

Introduces SwiReasoning, enabling LLMs to switch between latent and explicit reasoning for Pareto-superior performance. This framework enhances token efficiency and robustness, particularly for challe...

PatentMind: A Multi-Aspect Reasoning Graph for Patent Similarity Evaluation

Introduces PatentMind, a framework for patent similarity evaluation using a Multi-Aspect Reasoning Graph. It decomposes patents into technical, application, and claim dimensions for comprehensive anal...

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Introduces Reason-RFT, a reinforcement fine-tuning framework for VLMs to improve visual reasoning. This approach mitigates overfitting from supervised fine-tuning, enhancing generalization and real-wo...

AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

Introduces AutoMiSeg, a zero-shot pipeline for automatic medical image segmentation using foundation models. This approach combines VLMs and segmentation models for direct segmentation without expert ...

Single-Core Superscalar Optimization of Clifford Neural Layers

Optimizes Clifford neural layers for inference speed using superscalar techniques. This approach addresses computational bottlenecks in equivariant networks, enabling faster execution without sacrific...

SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

Introduces SyMerge for synergistic model merging via single-layer adaptation. This framework moves beyond task non-interference to actively enhance cross-task performance, offering improved model comb...

StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory

Introduces StructPrune for structured global pruning of LLMs with reduced GPU memory. This method balances efficiency and robustness by leveraging asymptotic analysis and layer-independent pruning.

Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Disentangles recall and reasoning in transformers using layer-wise analysis. This method reveals distinct internal mechanisms for these abilities, aiding in understanding model behavior and targeted i...

Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba

Introduces MAVE, a cross-attentive Mamba framework for high-fidelity voice editing and zero-shot TTS. This model achieves state-of-the-art speech editing and competitive TTS results, outperforming exi...

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Proposes Test-Time Token-Level Cross-Validation for dLLMs to address early termination issues. This method allows revision of tokens across iterations, improving final output quality and mitigating er...

Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Introduces a framework to ensure contextual integrity in LLMs by training them to reason about information disclosure. This approach uses reinforcement learning to align LLM behavior with human prefer...

From Compression to Expression: A Layerwise Analysis of In-Context Learning

Analyzes in-context learning representations across transformer layers, revealing a layerwise compression-to-expression phenomenon. This insight helps understand how LLMs capture task-specific informa...

How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve

Introduces LLM-Sieve, a framework for task-specific pruning of LLMs to minimal parameter subsets. This method achieves efficient and faithful task performance by using output-aligned projections.

Wave-PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention

Proposes Wave-PDE Nets, a novel architecture using trainable wave-equation layers as an alternative to attention. This approach models global dependencies efficiently, offering a powerful mechanism fo...

Deep Learning without Weight Symmetry

Proposes a new framework that avoids backpropagation's weight symmetry requirement by using a biologically plausible mechanism. This approach addresses the weight transport problem, enabling more biol...

📅

Monday, October 6, 2025

Executive Briefing Bullets (19) JSON

A Survey of Defenses against AI-generated Visual Media: Detection, Disruption, and Authentication

Surveys research on defenses against AI-generated visual media, covering detection, disruption, and authentication methods. Provides a systematic and timely review essential for understanding and miti...

Toward a Holistic Evaluation of Robustness in CLIP Models

Provides a comprehensive assessment of CLIP model robustness by investigating specific visual factors and safety objectives like confidence uncertainty. Aims to offer new perspectives beyond overall c...

Filter-Guided Diffusion for Controllable Image Generation

Proposes filter-guided diffusion for controllable image generation, enhancing zero-shot image-to-image translation and editing. Addresses runtime and memory costs of existing feature injection methods...

WaveNet-SF: A Hybrid Network for Retinal Disease Detection Based on Wavelet Transform in Spatial-Frequency Domain

Proposes WaveNet-SF, a hybrid network using wavelet transform for enhanced retinal disease detection from OCT images. Addresses challenges like speckle noise and varying lesion sizes for critical time...

So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection

Introduces So-Fake, a benchmark and explanation framework for social media image forgery detection. Addresses limitations in current datasets and detection methods for realistic, diverse social media ...

Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Proposes fine-grained abnormality prompt learning for zero-shot anomaly detection. Addresses limitations of current methods focusing on coarse-grained semantics by enabling recognition of finer-graine...

Neural Posterior Estimation with Autoregressive Tiling for Detecting Objects in Astronomical Images

Proposes neural posterior estimation with autoregressive tiling for detecting faint, overlapping objects in astronomical images. Introduces an amortized variational inference procedure for small-objec...

InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition

Presents InsideOut, an EfficientNetV2-S based framework for robust multi-class facial emotion recognition. Addresses challenges like occlusions and illumination variations for improved FER performance...

SoccerSynth-Detection: A Synthetic Dataset for Soccer Player Detection

Presents SoccerSynth-Detection, a synthetic dataset addressing diversity limitations for soccer player detection. Aims to improve algorithm adaptation to varied soccer video contexts with frequent occ...

Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations

Proposes latent diffusion unlearning using trajectory shifted perturbations to protect against unauthorized personalization. Addresses concerns regarding data privacy and intellectual property protect...

Ranked from Within: Ranking Large Multimodal Models Without Labels

Proposes a method to rank large multimodal models without labels, exploring alternative signals beyond standard performance evaluation. Aims to provide efficient ways to choose between models when fac...

RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Introduces RichControl for training-free spatial control in text-to-image generation. Addresses limitations of feature injection methods by improving structural alignment and reducing visual artifacts...

GCVAMD: A Modified CausalVAE Model for Causal Age-related Macular Degeneration Risk Factor Detection and Prediction

Introduces GCVAMD, a modified CausalVAE model for detecting and predicting Age-related Macular Degeneration risk factors. Aims to improve early-stage detection to reduce the risk of vision loss.

PyRadiomics-cuda: a GPU-accelerated 3D features extraction from medical images within PyRadiomics

Presents PyRadiomics-cuda, a GPU-accelerated extension for extracting 3D features from medical images. Dramatically reduces processing times for volumetric datasets while maintaining API compatibility...

Training-Free Out-Of-Distribution Segmentation With Foundation Models

Explores training-free out-of-distribution segmentation using foundation models. Investigates the capability of these models to detect unknown regions in semantic segmentation, a capability previously...

Revisiting Reweighted Risk for Calibration: AURC, Focal, and Inverse Focal Loss

Revisits reweighted risk functionals for model calibration, establishing a connection between calibration error and selective classification. Clarifies theoretical links for common deep learning losse...

LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

Introduces LEAML, a label-efficient adaptation framework for multimodal LLMs on OOD visual tasks. Leverages scarce labeled VQA samples and unlabeled images to generate pseudo QA pairs for adaptation.

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Introduces a unified zero-shot captioning framework shifting from image-centric to patch-centric paradigms. Enables captioning at a finer granularity, moving beyond global representations for more det...

Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information

Introduces Gate-Shift-Pose, enhancing action recognition by integrating skeleton pose data with RGB frames. Evaluates early and late fusion strategies for athlete fall classification in figure skating...

📅

Friday, October 3, 2025

Executive Briefing Bullets (20) JSON

Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs

Introduces dynamic bundling with LLMs for zero-shot inference on text-attributed graphs. Addresses limited graph information and unreliable responses by proposing a novel framework, enabling better ge...

Scaling Laws for Optimal Data Mixtures

Proposes a systematic method to determine optimal data mixtures for foundation models using scaling laws. Accurately predicts model loss based on size and mixture proportions, enabling efficient large...

A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection

Introduces a lightweight, plug-in framework for adversarial example detection leveraging internal layer-wise inconsistencies. Addresses limitations of external models and complex architectures, improv...

Transformers Discover Molecular Structure Without Graph Priors

Demonstrates that Transformers can discover molecular structure without graph priors, challenging GNN dominance. Avoids fixed graph limitations and improves expressivity and inference speed for molecular ma...

Accelerating Attention with Basis Decomposition

Presents BD Attention (BDA), a lossless reformulation of attention using Basis Decomposition. Achieves mathematically guaranteed acceleration by restructuring multi-head projections, improving efficie...

A Methodology for Transparent Logic-Based Classification Using a Multi-Task Convolutional Tsetlin Machine

Explores the applicability of the Convolutional Tsetlin Machine for large-scale machine learning. Offers transparent, logic-based classification with comparable performance to neural networks, enhanci...

PepCompass: Navigating peptide embedding spaces using Riemannian Geometry

Introduces PepCompass, navigating peptide embedding spaces using Riemannian geometry. Addresses distorted exploration and inefficient optimization from flat Euclidean metrics, improving antimicrobial ...

PENEX: AdaBoost-Inspired Neural Network Regularization

Introduces Penalized Exponential Loss (PENEX), a multi-class exponential loss formulation amenable to optimization. Offers AdaBoost-inspired regularization for neural networks, grounding generalizatio...

High-Fidelity Speech Enhancement via Discrete Audio Tokens

Introduces DAC-SE1, a simplified language model-based speech enhancement framework using discrete high-resolution audio representations. Achieves high-fidelity enhancement with a simplified pipeline, ...

Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning

Analyzes a policy execution framework sampling actions from stochastic policies at discrete time points. Proves accuracy bounds as sampling mesh size tends to zero, addressing challenges in continuous...

Learning Equivariant Models by Discovering Symmetries with Learnable Augmentations

Proposes learning equivariant models by discovering symmetries with learnable augmentations. Addresses limitations of fixed equivariant architectures and implicit learning, enabling more flexible and ...

Enhancing Electricity-System Resilience with Adaptive Robust Optimization and Conformal Uncertainty Characterization

Proposes a tri-level optimization model integrating proactive actions, disruptions, and reactive responses for electricity system resilience. Uses conformal prediction for uncertainty, enhancing syste...

xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

Investigates scaling behavior of xLSTM and Transformers, showing competitive performance with linear time-complexity. Enables prediction of model performance relative to compute budgets, offering effi...

Randomized Gradient Subspaces for Efficient Large Language Model Training

Analyzes gradient space dynamics to find efficient training methods for LLMs. Proposes using randomized gradient subspaces to capture most gradient energy, reducing memory bottlenecks and improving tr...

Explicit Discovery of Nonlinear Symmetries from Dynamic Data

Proposes LieNLSD, a method for explicit discovery of nonlinear symmetries from dynamic data. Determines the number of infinitesimal generators, advancing symmetry discovery beyond linear methods and i...

StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

Proposes StelLA, a geometry-aware extension of LoRA using a three-factor decomposition on the Stiefel manifold. Improves LoRA performance by exploiting geometric structure, offering better parameter-e...

FairContrast: Enhancing Fairness through Contrastive learning and Customized Augmenting Methods on Tabular Data

Proposes FairContrast for enhancing fairness through contrastive learning and customized data augmentation on tabular data. Offers a powerful approach to debiasing algorithms and improving fairness wh...

Bias beyond Borders: Global Inequalities in AI-Generated Music

Introduces GlobalDISCO, a large-scale dataset to analyze biases in AI-generated music across countries, languages, cultures, and genres. Addresses underexplored research on global diversity and bias i...

Multidata Causal Discovery for Statistical Hurricane Intensity Forecasting

Leverages a multidata causal discovery framework for hurricane intensity forecasting. Addresses limitations of correlation-based methods by incorporating causal discovery, improving generalizability a...

How Well Can Preference Optimization Generalize Under Noisy Feedback?

Addresses the impact of noisy human feedback on preference optimization for LLM alignment. Analyzes generalization capabilities under realistic noise conditions, crucial for reliable LLM alignment.

📅

Thursday, October 2, 2025

Executive Briefing Bullets (20) JSON

DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision

Introduces DreamCS, a geometry-aware text-to-3D generation method using unpaired 3D reward supervision. It mitigates 2D bias artifacts common in prior methods, enabling better human preference alignme...

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Proposes using diffusion models as noise-aware latent reward models for preference optimization in diffusion models. Shows pre-trained diffusion models are naturally suited for step-level preference a...

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

Proposes ATAS, a self-distillation framework for enhanced open-vocabulary dense prediction. Addresses CLIP's struggle with fine-grained understanding by focusing on semantic coherence and vision-langu...

Learning Frequency and Memory-Aware Prompts for Multi-Modal Object Tracking

Presents a dual-adapter framework learning frequency and memory-aware prompts for multi-modal object tracking. Addresses underutilization of modality-specific frequency structure and long-range tempor...

NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution

Introduces NSARM, an autoregressive modeling approach for robust real-world image super-resolution. Addresses limitations of diffusion models in Real-ISR by improving output quality and efficiency wit...

Visual Self-Refinement for Autoregressive Models

Proposes a plug-and-play refinement module for autoregressive models to enhance spatial correspondence modeling. Operates as a post-pretraining step to jointly refine generated visual tokens, improvin...

ProtoMask: Segmentation-Guided Prototype Learning

Studies the use of image segmentation foundation models to improve the truthfulness of learned prototypes in explainable AI. Aims to enhance explainability beyond post-hoc saliency techniques.

PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset

Introduces PhraseStereo, the first dataset for phrase-region segmentation in stereo image pairs. Addresses limitations in phrase grounding by leveraging stereo vision's rich geometric cues.

JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

Proposes JEPA-T, a unified multimodal framework for image generation using joint-embedding predictive architecture with text fusion. Enhances fusion by incorporating cross-attention after the feature ...

Graph Integrated Multimodal Concept Bottleneck Model

Presents MoE-SGT, a reasoning-driven framework augmenting Concept Bottleneck Models (CBMs) with a Graph Transformer and MoE module. Addresses limitations of single-modal CBMs by incorporating structur...

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

Investigates why Vision-Language Models underutilize spatial cues, identifying an imbalance between vision and text token norms. Proposes interpretability tools to expose this mechanism and improve sp...

Rectified Diffusion Guidance for Conditional Generation

Revisits Classifier-Free Guidance (CFG) theory for diffusion models, rigorously confirming that improper coefficient configurations shift the expectation of the generative distribution. Proposes rectified guidance to ensure proper combination...

STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence

Introduces STORK, a method for faster diffusion and flow matching sampling by addressing ODE stiffness and structure-dependence. Enables quality-preserving sampling with fewer function evaluations for...

SoftCFG: Uncertainty-guided Stable Guidance for Visual autoregressive Model

Proposes SoftCFG to address guidance diminishing and over-guidance issues in autoregressive models. Uses uncertainty guidance for stable generation, improving visual coherence by managing conditional ...

Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs

Proposes a training-free framework using MLLM uncertainty for guidance in complex visual tasks. Leverages intrinsic uncertainty to improve fine-grained perception without task-specific fine-tuning or ...

ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

Introduces ImageDoctor, a unified multi-aspect evaluation model for text-to-image generation. Uses grounded image reasoning to provide comprehensive and interpretable feedback on image quality, moving...

Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel KAN Adapter for Enhanced Feature Adaptation

Presents FLORA, the first comprehensive dataset for fashion language-to-outfit translation, containing industry-specific terminology. Introduces a KAN adapter for enhanced feature adaptation in AI-dri...

SEE: See Everything Every Time -- Adaptive Brightness Adjustment for Broad Light Range Images via Events

Proposes SEE, an adaptive brightness adjustment method for event cameras across broad light ranges. Addresses the research gap of utilizing event data beyond low-light enhancement.

Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation

Introduces a cascaded diffusion framework for probabilistic coarse-to-fine hand pose estimation. Addresses pose ambiguities and uncertainties by refining predictions in a cascaded manner, improving ac...

A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

Proposes a feed-forward camera localization method from image features, aiming for faster mapping time compared to state-of-the-art approaches. Raises the question of achieving competitive accuracy mu...

📅

Wednesday, October 1, 2025

Executive Briefing Bullets (20) JSON

AICrypto: A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models

Introduces AICrypto, the first comprehensive benchmark to evaluate LLM cryptography capabilities. Comprising 135 multiple-choice questions, 150 CTF challenges, and 18 proof problems, it covers a broad...

Linearly Homomorphic Ring Signature Scheme over Lattices

Proposes the first lattice-based linearly homomorphic ring signature scheme. This scheme combines anonymity with verifiable homomorphic computation, demonstrating potential for confidential blockchain...

Mutual Information Minimization for Side-Channel Attack Resistance via Optimal Noise Injection

Introduces mutual information minimization via optimal noise injection as a countermeasure against side-channel attacks. This approach aims to be more efficient for resource-constrained systems like I...

Thunderdome: Timelock-Free Rationally-Secure Virtual Channels

Introduces Thunderdome, the first timelock-free payment channel network (PCN). It leverages virtual channels to extend a timelock-free primitive, addressing vulnerabilities to timelock and censoring a...

Zero Trust-based Decentralized Identity Management System for Autonomous Vehicles

Presents a novel Zero Trust-based Decentralized Identity Management (D-IM) protocol for autonomous vehicles. This system enhances cybersecurity in dynamic, untrusted environments by integrating Zero T...

Finding Phones Fast: Low-Latency and Scalable Monitoring of Cellular Communications in Sensitive Areas

Introduces low-latency systems for high-quality, instantaneous monitoring of cellular communications to detect unauthorized devices in sensitive areas. Addresses a critical gap in current security sys...

TRUE: A Reproducible Framework for LLM-Driven Relevance Judgment in Information Retrieval

Introduces TRUE, a reproducible framework for LLM-driven relevance judgment in Information Retrieval. It addresses the lack of standardized workflows in existing methods, aiming for reliable label ...

Palace: A Library for Interactive GPU-Accelerated Large Tensor Processing and Visualization

Presents Palace, an open-source library for interactive, GPU-accelerated out-of-core tensor processing and visualization. It enables efficient handling of large tensor datasets for scientific fields.

Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework

Proposes Aristotle, a logic-complete framework for LLM logical reasoning that decomposes, searches, and resolves problems. It aims to improve both the efficacy and efficiency of LLM reasoning by lever...

ActorDB: A Unified Database Model Integrating Single-Writer Actors, Incremental View Maintenance, and Zero-Trust Messaging

Presents ActorDB, a novel database architecture unifying single-writer actors, incremental view maintenance, and zero-trust security. This system aims to reduce architectural complexity for modern dat...

AntiFLipper: A Secure and Efficient Defense Against Label-Flipping Attacks in Federated Learning

Proposes AntiFLipper, a novel and computationally efficient defense against multi-class label-flipping attacks in Federated Learning. It aims to protect the global model against performance degradation caused by...

Authenticated Private Set Intersection: A Merkle Tree-Based Approach for Enhancing Data Integrity

Proposes authenticated Private Set Intersection (PSI) schemes by integrating Merkle Trees with existing PSI protocols. This enhances data integrity in PSI, addressing vulnerabilities to attacks that m...

Chypnosis: Undervolting-based Static Side-channel Attacks

Presents Chypnosis, an undervolting attack technique that indirectly stops a target circuit's clock to enable static side-channel attacks. Crucially, it also blocks detection mechanisms while preservi...

Logic Solver Guided Directed Fuzzing for Hardware Designs

Proposes logic solver guided directed fuzzing for hardware designs to improve early bug detection in complex IC designs. This approach extends verification efforts for incremental updates in hardware ...

Managing Differentiated Secure Connectivity using Intents

Proposes the concept of differentiated secure connectivity using intents for 5G/6G mobile networks. This approach aims to express and enforce complex, goal-driven security requirements beyond current ...

Optimal Threshold Signatures in Bitcoin

Formulates threshold signature schemes for cryptocurrencies like Bitcoin as an optimization problem. It determines the optimal threshold to balance security against user lockout risks.
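
The trade-off named above can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's formulation: each of `n` key shares is assumed to be independently lost with probability `p_loss` and independently compromised with probability `p_theft`, and the risk of a `t`-of-`n` wallet is taken as the sum of the lockout and theft probabilities (a union-bound style upper estimate). All function and parameter names are hypothetical.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lockout_prob(n, t, p_loss):
    # Lockout: fewer than t shares survive (survivors ~ Binomial(n, 1 - p_loss)).
    return sum(binom_pmf(n, k, 1 - p_loss) for k in range(t))


def theft_prob(n, t, p_theft):
    # Theft: an attacker holds at least t shares (compromised ~ Binomial(n, p_theft)).
    return sum(binom_pmf(n, k, p_theft) for k in range(t, n + 1))


def best_threshold(n, p_loss, p_theft):
    """Return the threshold t minimizing lockout + theft risk, plus all risks."""
    risks = {
        t: lockout_prob(n, t, p_loss) + theft_prob(n, t, p_theft)
        for t in range(1, n + 1)
    }
    t_star = min(risks, key=risks.get)
    return t_star, risks


if __name__ == "__main__":
    t_star, risks = best_threshold(n=3, p_loss=0.1, p_theft=0.1)
    for t, risk in risks.items():
        print(f"t={t}: risk={risk:.4f}")
    print("best threshold:", t_star)
```

Under these assumed probabilities, a low threshold makes theft easy and a high threshold makes lockout likely, so the minimum sits in between.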

On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search

Presents a journalist-centered approach to LLM-powered document search for newsrooms, prioritizing transparency and editorial control. Evaluates small language models for investigative document search...

How Diffusion Models Memorize

Analyzes latent space dynamics to explain how diffusion models memorize training data. Shows memorization is driven by specific aspects of the diffusion and denoising process, raising privacy concerns...

Fingerprinting LLMs via Prompt Injection

Proposes fingerprinting LLMs via prompt injection to detect model derivations without altering the base model. This method aims to robustly identify model provenance even after post-processing.

IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Introduces IMProofBench, a benchmark for evaluating AI on research-level mathematical proof generation. It consists of 39 peer-reviewed problems designed by expert mathematicians to assess advanced re...

📅

Tuesday, September 30, 2025

Executive Briefing Bullets (20) JSON

3D-LATTE: Latent Space 3D Editing from Textual Instructions

Introduces 3D-LATTE, a training-free method for instruction-based 3D asset editing operating in the latent space of native 3D diffusion models. Addresses view-inconsistent editing signals common in 2D...

ART-DECO: Arbitrary Text Guidance for 3D Detailizer Construction

Introduces ART-DECO, a neural model that generates high-quality 3D assets with detailed geometry and texture from coarse proxies guided by text prompts. Achieves instantaneous detailization in under 1...

Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

Proposes Representation Entanglement for Generation (REG) to simplify training diffusion transformers. Integrates external visual representations from pretrained models through alignment.

SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Presents SimpleGVR, a baseline for latent-cascaded video super-resolution. Decouples semantic content generation from detail synthesis for efficient video upscaling.

Implicit-ARAP: Efficient Handle-Guided Neural Field Deformation via Local Patch Meshing

Introduces Implicit-ARAP for efficient handle-guided neural field deformation. Leverages local patch meshing to balance surface quality, robustness, and efficiency in neural field manipulation.

Freqformer: Frequency-Domain Transformer for 3-D Reconstruction and Quantification of Human Retinal Vasculature

Introduces Freqformer, a Transformer-based model with a frequency-domain module for 3D retinal vasculature reconstruction and quantification from single OCTA scans.

Towards agile multi-robot systems in the real world: Fast onboard tracking of active blinking markers for relative localization

Introduces a novel onboard tracking approach for vision-based relative localization using active blinking markers in multi-robot systems. Improves robustness for aerial vehicles.

ReDDiT: Rehashing Noise for Discrete Visual Generation

Proposes ReDDiT, a rehashing noise approach for discrete diffusion transformers to improve expressive capacity. Addresses design of noise and sampling heuristics in discrete diffusion models.

Score Replacement with Bounded Deviation for Rare Prompt Generation

Proposes a score replacement method with bounded deviation for rare prompt generation in diffusion models. Addresses the models' struggle with rare concepts by improving prompt switching robustness.

FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

Proposes Forced Prompt Learning (FA) for Vision-Language Models to improve OOD detection. Makes full use of VLMs' inherent capabilities without relying on external datasets.

Reconstruct Anything Model: a lightweight foundation model for computational imaging

Proposes the Reconstruct Anything Model (RAM), a lightweight foundation model for computational imaging. Addresses limitations of iterative and unrolled architectures for imaging inverse problems.

Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Proposes LINO UniPS with Light Register Tokens to unify photometric stereo under arbitrary lighting. Enforces decoupling of illumination and normal information for universal application.

Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model

Introduces a generative video semantic communication framework using multimodal semantic fusion with large models. Addresses limitations of traditional syntactic communication for 6G immersive scenari...

Controllable Reference Guided Diffusion with Local Global Fusion for Real World Remote Sensing Image Super Resolution

Proposes a controllable reference-guided diffusion method with local-global fusion for remote sensing image super-resolution. Integrates complementary information from auxiliary data.

Learning Smooth State-Dependent Traversability from Dense Point Clouds

Presents SPARTA for estimating state-dependent traversability from point clouds without needing approach angle as input. Addresses computational inefficiency during planning.

Chronic Obstructive Pulmonary Disease Prediction Using Deep Convolutional Network

Proposes a deep convolutional network for COPD prediction using lung sound auscultation. Addresses the demand for automated tools in early disease detection.

RAM-W1K: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

Introduces RAM-W1K, a multi-task wrist dataset and benchmark for Rheumatoid Arthritis research. Addresses limitations in CAD research due to annotation challenges.

ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing

Introduces ZeroScene, a zero-shot framework for 3D scene generation from a single image with controllable texture editing. Leverages large vision models for asset quality and scene coherence.

Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Introduces Mod-Adapter for tuning-free, versatile multi-concept personalization. Enables customization of abstract concepts like pose and lighting without test-time fine-tuning.

Counterfactual Visual Explanation via Causally-Guided Adversarial Steering

Introduces Causally-Guided Adversarial Steering for counterfactual visual explanations. Addresses unintended alterations caused by spurious correlations by incorporating causal relationships.

📅

Monday, September 29, 2025

Executive Briefing Bullets (19) JSON

Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

Proposes a novel neural operator combining differential and integral forms to address error accumulation in long-term turbulence forecasting. Achieves improved physical fidelity and accuracy compared ...

Metric-Agnostic Conformal Bounds for Probabilistic Image Reconstruction

Proposes a framework for computing provably valid prediction bounds for probabilistic image reconstruction algorithms. Enables statistically guaranteed claims about reconstructed subjects from sparse ...

FERD: Fairness-Enhanced Data-Free Robustness Distillation

Introduces FERD to address fairness issues in data-free robustness distillation. Identifies and tackles key problems leading to robustness disparity across categories, improving fairness in model tran...

SeamCrafter: Enhancing Mesh Seam Generation for Artist UV Unwrapping via Reinforcement Learning

Introduces SeamCrafter, a GPT-style seam generator using reinforcement learning for UV unwrapping. Enhances mesh seam generation, addressing distortion and fragmentation issues in 3D texturing workflo...

Surgical Vision World Model

Proposes a Surgical Vision World Model to facilitate realistic and interactive surgical simulation. Enables action-controlled data generation for training autonomous surgical agents when real data acq...

Diverse Subset Selection via Norm-Based Sampling and Orthogonality

Proposes a simple and effective method combining feature norms, randomization, and orthogonality for diverse subset selection. Selects informative samples from large unlabeled pools for annotation, ad...

NeuVAS: Neural Implicit Surfaces for Variational Shape Modeling

Introduces NeuVAS, a framework for variational shape modeling using neural implicit surfaces. Addresses challenges in modeling shapes with sparse geometric control, like 3D curve sketches.

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Investigates Vision-Language Models' performance in fine-grained tasks like font recognition. Highlights challenges VLMs face in distinguishing texture from semantics, impacting aesthetic and design-r...

Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

Proposes Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. Enables safe and feasible trajectory planning inspired by large language models.

iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos

Focuses on motion analysis and part-level segmentation from casually captured RGBD videos for articulated objects. Enables interactable digital twins from practical, scalable acquisition, useful for e...

APTx Neuron: A Unified Trainable Neuron Architecture Integrating Activation and Computation

Proposes the APTx Neuron, a novel, unified neural computation unit integrating activation and computation into a single trainable expression. Eliminates separate activation layers for computational ef...

Can Diffusion Models Disentangle? A Theoretical Perspective

Presents a novel theoretical framework for understanding disentangled representation learning in diffusion models. Establishes identifiability conditions and derives sample complexity bounds for disen...

Diffence: Fencing Membership Privacy With Diffusion Models

Introduces Diffence, a novel defense against membership inference attacks using diffusion models. It removes distinguishing features between member and non-member data by regenerating inputs, enhancin...

Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)

Presents a hierarchical multimodal recurrent ensemble that maps video, audio, and language embeddings to fMRI responses. Integrates information over time to predict distributed cortical responses to m...

Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

Enhances diffusion models' compositional generation power on rare concepts using LLM guidance. Demonstrates improved generation of rare compositions by exposing frequent concepts relevant to targets d...

Mobi-π: Mobilizing Your Robot Learning Policy

Formulates the policy mobilization problem to improve generalization of visuomotor policies to novel robot positions. Addresses poor generalization from limited robot positions and camera viewpoints i...

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Introduces VLN-PE, a physically realistic platform to bridge the embodied gap in Vision-and-Language Navigation. Systematically evaluates VLN methods in physical robotic settings across different pipe...

Geometry aware inference of steady state PDEs using Equivariant Neural Fields representations

Introduces enf2enf, a neural field approach for predicting steady-state PDEs with geometric variability. Encodes geometries into latent features anchored at spatial locations, preserving locality for ...

STQE: Spatial-Temporal Attribute Quality Enhancement for G-PCC Compressed Dynamic Point Clouds

Proposes an STQE network exploiting spatial-temporal correlations to enhance quality of G-PCC compressed dynamic point clouds. Addresses the unexplored area of quality enhancement for compressed dynam...

📅

Friday, September 26, 2025

Executive Briefing Bullets (20) JSON

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Introduces TempSamp-R1, a reinforcement fine-tuning framework for video LLMs, addressing inefficient on-policy sampling in large temporal spaces. Achieves improved effectiveness for video temporal gro...

AnyPlace: Learning Generalized Object Placement for Robot Manipulation

Introduces AnyPlace, a two-stage method trained on synthetic data for generalized object placement in robot manipulation. Leverages VLMs for rough placement location identification, focusing on releva...

Estimating Deep Learning energy consumption based on model architecture and training environment

Investigates how model architecture and training environment affect deep learning energy consumption. Analyzes trade-offs by training various computer vision models and collecting energy and accuracy ...

FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection

Presents FoMo-0D, a pre-trained foundation model for zero/few-shot outlier detection on tabular data. Addresses the bottleneck of unsupervised algorithm and hyperparameter selection for effective OD u...

Expressiveness of Multi-Neuron Convex Relaxations in Neural Network Certification

Investigates whether multi-neuron convex relaxations overcome the single-neuron convex barrier in neural network certification. Addresses questions about their expressiveness and limitations for robus...

Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset

Presents the first comprehensive benchmark for evaluating supervised and self-supervised learning for few-shot time-series crop-type classification. Assesses algorithm efficacy in challenging, real-wo...

A Decision Theoretic Framework for Measuring AI Reliance

Argues current definitions of appropriate AI reliance lack formal statistical grounding. Proposes a decision-theoretic framework for measuring AI reliance, focusing on human-AI decision-making and com...

Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks

Proposes a semantic edge-cloud communication framework for real-time urban traffic surveillance using ViT and LLMs. Addresses understanding dynamic traffic scenarios and responsive user interaction ov...

Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training

Introduces data-centric elastic pipeline parallelism for efficient long-context LLM training. Addresses communication overhead issues in existing schemes like sequence parallelism by optimizing partit...

The Use of the Simplex Architecture to Enhance Safety in Deep-Learning-Powered Autonomous Systems

Explores using the Simplex architecture to enhance safety in deep-learning autonomous systems. Addresses trustworthiness issues related to anomalous samples, distribution shifts, and adversarial attac...

TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

Proposes a new task and benchmark for temporal reasoning in multi-session dialogues, under-studied previously. Introduces TReMu, a neuro-symbolic framework to enhance LLM-agent temporal reasoning capa...

Reinforcement Learning in Categorical Cybernetics

Shows that major reinforcement learning algorithms fit into categorical cybernetics' framework of parameterized bidirectional processes. Extends Bellman operators to parameterized optics for action-va...

The Value of Information in Human-AI Decision-making

Contributes a decision-theoretic framework for characterizing the value of information in human-AI pairings. Focuses on improving performance of collaborating agents by understanding their information...

RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training

Introduces tail batching, a novel rollout strategy to mitigate long-tail rollout issues in synchronous RL post-training for LLMs. Aims to reduce GPU underutilization without compromising training accu...

MPC-based Deep Reinforcement Learning Method for Space Robotic Control with Fuel Sloshing Mitigation

Presents an integrated RL and MPC framework for autonomous satellite docking, mitigating fuel sloshing effects. Integrates PPO and SAC RL algorithms with MPC, leveraging MPC's predictive capabilities ...

Physics Informed Neural Networks for design optimisation of diamond particle detectors for charged particle fast-tracking at high luminosity hadron colliders

Models conductive electrodes in diamond particle detectors using physics-informed neural networks. Extends the classical Ramo-Shockley formalism to optimize design for fast-tracking at high luminosity...

Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation

Proposes using text-augmented multimodal LLMs for chemical reaction condition recommendation. Aims to reliably discover effective conditions during reaction exploration, addressing labor-intensive tri...

The Asymptotic Behavior of Attention in Transformers

Investigates the theoretical properties of transformer attention mechanisms in large language models. Analyzes how increasing model size and depth affects performance and identifies potential diminish...

Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach

Proposes a maximum entropy regulated long chain-of-thought approach for fine-tuning LLMs to analyze code review dimensions. Enhances LLM context understanding and reasoning compared to human reviewers...

Dual-Path Phishing Detection: Integrating Transformer-Based NLP with Structural URL Analysis

Proposes a dual-path phishing detection framework integrating transformer-based NLP and structural URL analysis. Addresses limitations of traditional methods by comprehensively analyzing both semantic...

📅

Thursday, September 25, 2025

Executive Briefing Bullets (20) JSON

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Introduces SoFar, a language-grounded orientation representation that bridges spatial reasoning and object manipulation. Defines object orientations using natural language in a reference-frame-free ma...

Generating 360° Video is What You Need For a 3D Scene

Proposes WorldPrompter, a generative pipeline using 360° video as an intermediate representation for synthesizing traversable 3D scenes. Captures full-scene context and ensures visual consistency, off...

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

Proposes using Vision Language Models (VLMs) to interpret human demonstration videos and generate robot action plans. Integrates keyframe selection, visual perception, and VLM reasoning into a pipelin...

NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance

Introduces NERO, a framework for explainable out-of-distribution (OOD) detection using neuron-level relevance. Enhances reliability in deep learning, particularly medical imaging, by flagging potentia...

Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Introduces CBM-HNMU, a Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding. Leverages concept bottleneck models for effective interventions and mutual understanding, addre...

Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

Proposes Localized LoRA, a generalized framework for parameter-efficient fine-tuning that models weight updates using low-rank matrices applied to structured blocks. Overcomes limitations of global lo...
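
The idea of applying low-rank updates to structured blocks, rather than one global low-rank update, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the k x k grid partition, the function name `blockwise_low_rank_update`, and the random initialization. It only shows how a fixed parameter budget can be spread across independent per-block factors.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def blockwise_low_rank_update(d, k, r, rng):
    """Build a d x d update from a k x k grid of independent rank-r blocks.

    Hypothetical illustration: assumes k divides d and that every block
    gets its own factors B (s x r) and A (r x s).
    """
    s = d // k  # side length of each block
    delta = [[0.0] * d for _ in range(d)]
    n_params = 0
    for bi in range(k):
        for bj in range(k):
            B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(s)]
            A = [[rng.gauss(0, 1) for _ in range(s)] for _ in range(r)]
            block = matmul(B, A)
            # Write the low-rank block into its position in the full update.
            for i in range(s):
                for j in range(s):
                    delta[bi * s + i][bj * s + j] = block[i][j]
            n_params += 2 * s * r
    return delta, n_params

rng = random.Random(0)
delta, n_params = blockwise_low_rank_update(d=8, k=2, r=1, rng=rng)

# Parameter count of a single global rank-2 update: B (8 x 2) and A (2 x 8).
global_params = 2 * 8 * 2
```

In this toy setting the four rank-1 blocks use the same number of trainable parameters as one global rank-2 update, but each block is fitted independently.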

Urania: Differentially Private Insights into AI Use

Introduces Urania, a framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. Employs private clustering and keyword extraction, providing e...

Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis

Presents Latent Wavelet Diffusion (LWD), a lightweight framework for ultra-high-resolution image synthesis. Introduces a frequency-aware masking strategy derived from wavelet energy maps to focus diff...

Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization

Presents a unified framework for automatic multitrack music arrangement handling diverse scenarios via track-aware reconstruction and structured tokenization. Enables flexible any-to-any instrumentati...

GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Introduces GraphEQA, utilizing 3D semantic scene graphs for real-time Embodied Question Answering. Addresses challenges in acquiring semantic representations and leveraging prior knowledge for efficie...

Deciphering Functions of Neurons in Vision-Language Models

Investigates the functions of individual neurons in Vision-Language Models (VLMs) by observing activations with visual and text tokens. Reveals insights into VLM internals, crucial for fostering trans...

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

Introduces SEM, a diffusion-based policy framework that explicitly enhances spatial understanding for robust robot manipulation. Addresses limitations of 3D point cloud and 2D image encoders by improv...

Probabilistic Online Event Downsampling

Proposes POLED, a probabilistic framework for event downsampling that models event importance. Addresses high bandwidth and computational demands of event cameras by adaptively downsampling events bas...

SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Proposes SurgVidLM for multi-grained surgical video understanding using Large Language Models (LLMs). Helps surgeons understand surgical scenes and procedures by enabling fine-grained vide...

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum

Introduces AAPO, an RL-based method enhancing LLM reasoning by leveraging Advantage Momentum. Eliminates dependency on value models in group relative advantage estimation, simplifying training and imp...
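
Group relative advantage estimation, as mentioned above, is commonly computed by normalizing each sampled response's reward against the other responses for the same prompt, which is why no separate value model is needed. The sketch below shows only that generic normalization. It does not implement AAPO's Advantage Momentum, which the summary does not specify, and the function name, the `eps` parameter, and the reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    The group statistics act as the baseline, so no learned value
    model is required. `eps` guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one. When all rewards in a group are equal, every advantage is zero.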

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Introduces OmniSpatial, a comprehensive benchmark to evaluate and improve spatial reasoning in Vision-Language Models (VLMs). Addresses the limitations of existing tasks by covering more elementary la...

CLOSP: A Unified Semantic Space for SAR, MSI, and Text in Remote Sensing

Introduces CLOSP, a unified semantic space for Synthetic Aperture Radar (SAR), Multispectral Imagery (MSI), and Text in remote sensing. Bridges the gap for text-to-image retrieval systems by exploitin...

Macroeconomic Forecasting with Large Language Models

Compares the accuracy of Large Language Models (LLMs) against traditional methods for macroeconomic forecasting. Investigates LLM effectiveness in capturing intricate patterns in macroeconomic time se...

A GEN AI Framework for Medical Note Generation

Proposes MediNotes, a generative AI framework automating SOAP note creation from medical conversations using LLMs and Retrieval-Augmented Generation. Addresses administrative burden and physician burn...

HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

Introduces HawkBench, a human-labeled benchmark to rigorously assess the resilience of Retrieval-Augmented Generation (RAG) methods. Stratifies tasks based on information-seeking behaviors to evaluate...

📅

Wednesday, September 24, 2025

Executive Briefing Bullets (20) JSON

EventVL: Understand Event Streams via Multimodal Large Language Model

Proposes EventVL, the first generative event-based MLLM framework for explicit semantic understanding of event streams. Bridges event streams and multimodal LLMs for enhanced semantic understanding, e...

Latent Beam Diffusion Models for Generating Visual Sequences

Introduces a novel beam search strategy for latent space exploration in diffusion models. Enables conditional generation of full image sequences with improved visual consistency, addressing challenges...

Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

Presents a system using Multimodal LLMs to analyze millions of images for temporal change patterns. Answers open-ended queries about city trends without predetermined subjects, enabling large-scale vi...

Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

Proposes dual data alignment to improve the generalizability of AI-generated image detectors. Addresses overfitting on non-causal attributes by matching semantic content between real and synthetic ima...

JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework

Introduces JL1-CD, a large-scale remote sensing change detection dataset, and a robust multi-teacher knowledge distillation framework. Addresses scarcity of high-resolution datasets and improves perfo...

Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

Proposes an optimal transport perspective for 3D Gaussian Splatting (3DGS) compaction. Casts compaction as global Gaussian mixture reduction, addressing memory and rendering budgets by reducing redund...

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

Proposes Token Preference Optimization (TPO) with self-calibrated rewards for hallucination mitigation in LVLMs. Addresses lack of scalable token-level rewards and visual-anchored tokens for improved ...

REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation

Introduces REACT, a framework for real-time Scene Graph Generation (SGG). Addresses the trade-off between performance and inference speed, enabling SGG for downstream tasks like reasoning for embodied...

Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

Proposes Lavida-O, a unified Masked Diffusion Model for multimodal understanding and generation. Enables image-level understanding, object grounding, image editing, and high-resolution text-to-image s...

xAI-CV: An Overview of Explainable Artificial Intelligence in Computer Vision

Surveys representative methods in Explainable AI (xAI) for computer vision. Addresses the challenge of "black-box" models by providing insights into decision-making processes for improved reliability.

SparseDiT: Token Sparsification for Efficient Diffusion Transformer

Introduces SparseDiT, a novel framework implementing token sparsification in Diffusion Transformers. Addresses computational costs by reducing self-attention complexity, enabling more efficient genera...

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Proposes ICEdit, a framework for precise instruction-based image editing using Diffusion Transformers. Achieves a precision-efficiency tradeoff by leveraging inherent comprehension and generation abil...

AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

Proposes AvatarShield, a visual reinforcement learning framework for detecting human-centric synthetic videos. Addresses threats from realistic synthetic human body generation with controllable moveme...

Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Improves semantic correspondence estimation through 3D-aware pseudo-labeling. Trains an adapter to refine off-the-shelf models, addressing ambiguities in symmetric objects or repeated parts for better...

HDM: Hybrid Diffusion Model for Unified Image Anomaly Detection

Proposes HDM, a hybrid diffusion model for unified image anomaly detection. Addresses challenges of complex anomaly patterns by improving coordination between anomaly sample generation and detection.

DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting

Proposes DWTGS, a framework rethinking frequency regularization for sparse-view 3D Gaussian Splatting. Leverages wavelet transforms to address overfitting to high-frequency details and improve novel v...

Injecting Explainability and Lightweight Design into Weakly Supervised Video Anomaly Detection Systems

Presents TCVADS, a system for weakly supervised video anomaly detection with explainability and lightweight design. Leverages knowledge distillation and cross-modal contrastive learning for efficient,...

Zero-Shot Visual Deepfake Detection: Can AI Predict and Prevent Fake Content Before It's Created?

Explores zero-shot deepfake detection, enabling detection without prior exposure to specific variations. Studies self-supervised learning, transformer classifiers, generative model fingerprinting, and...

VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation

Presents VLN-Zero, a framework for zero-shot vision-language navigation using neurosymbolic planning. Leverages VLMs to construct symbolic scene graphs for efficient exploration and adaptation in unse...

Prompt-DAS: Annotation-Efficient Prompt Learning for Domain Adaptive Semantic Segmentation of Electron Microscopy Images

Proposes Prompt-DAS, a promptable multitask framework for annotation-efficient domain adaptive segmentation of EM images. Utilizes point prompts for unsupervised domain adaptation and weakly supervise...

📅

Tuesday, September 23, 2025

Executive Briefing Bullets (20) JSON

Visual Instruction Pretraining for Domain-Specific Foundation Models

Proposes Visual Instruction Pretraining (ViTP) to improve foundation models in downstream domains by leveraging top-down reasoning influence on low-level perceptual features. Enhances perception-reaso...

COLA: Context-aware Language-driven Test-time Adaptation

Proposes COLA, a context-aware language-driven test-time adaptation framework for domain adaptation without shared labels. Enables adaptation to multiple target domains by leveraging language to guide...

Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

Proposes Self-Distilled RoI Predictors to improve fine-grained perception in Multimodal LLMs by focusing on salient regions. Addresses trade-offs between training data needs and computational cost for...

Accurate and Efficient Low-Rank Model Merging in Core Space

Proposes the Core Space merging framework for efficient merging of Low-Rank Adaptation (LoRA) models. Avoids merging full weight matrices, maintaining efficiency while enabling adaptation of large neu...

Deep Learning as the Disciplined Construction of Tame Objects

Provides an overview of tame geometry's role in deep learning, focusing on convergence guarantees for stochastic gradient descent in nonsmooth nonconvex settings. Illustrates how deep learning models ...

Stencil: Subject-Driven Generation with Context Guidance

Introduces Stencil, a framework for subject-driven generation with context guidance, addressing subject consistency issues in diffusion models. Balances fidelity and efficiency by improving prompt-bas...

From Restoration to Reconstruction: Rethinking 3D Gaussian Splatting for Underwater Scenes

Presents R-Splatting, a unified framework bridging underwater image restoration and 3D Gaussian Splatting. Improves rendering quality and geometric fidelity for 3D reconstruction in challenging underw...

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Introduces ContextFlow, a training-free framework for video object editing via adaptive context enrichment. Addresses fidelity and temporal consistency challenges in diffusion-based video manipulation...

OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

Introduces OnePiece, a framework integrating context engineering and reasoning into industrial cascade ranking systems. Addresses limitations of solely architectural transplanting by leveraging LLM br...

Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology

Presents a latent diffusion model for heterogeneous histopathology image generation using semantic segmentation and visual crops. Overcomes challenges in tissue heterogeneity and morphological feature...

MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception

Introduces MAESTRO, a framework for multi-task 3D perception that adaptively enhances and suppresses features to mitigate task conflicts. Improves learning efficiency and perception accuracy by managi...

CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception

Introduces CoBEVMoE, a collaborative perception framework using dynamic Mixture-of-Experts for heterogeneity-aware feature fusion. Mitigates perceptual diversity issues by dynamically adapting experts...

Unsupervised Structural-Counterfactual Generation under Domain Shift

Presents a novel unsupervised generative modeling challenge for counterfactual sample generation across domains without parallel data. Relies on causal graphs to address challenges beyond conventional...

Multimodal Medical Image Classification via Synergistic Learning Pre-training

Introduces a synergistic learning pre-training framework for multimodal semi-supervised medical image classification. Addresses modality fusion and label scarcity challenges by consistently learning a...

SmaRT: Style-Modulated Robust Test-Time Adaptation for Cross-Domain Brain Tumor Segmentation in MRI

Proposes SmaRT, a style-modulated robust test-time adaptation method for cross-domain brain tumor segmentation. Addresses instability and inconsistency in adaptation strategies for medical imaging dom...

LLaSA: A Sensor-Aware LLM for Natural Language Reasoning of Human Activity from IMU Data

Proposes LLaSA, a sensor-aware LLM for natural language reasoning of human activity from IMU data. Introduces SensorCap and OpenSQA resources for causal and explanatory reasoning in wearable systems.

Tight PAC-Bayesian Risk Certificates for Contrastive Learning

Develops PAC-Bayesian risk certificates for contrastive representation learning. Provides statistical theory for contrastive learning, bounding generalization error for foundation models trained via a...

DT-NeRF: A Diffusion and Transformer-Based Optimization Approach for Neural Radiance Fields in 3D Reconstruction

Proposes DT-NeRF, a diffusion and transformer-based optimization method for Neural Radiance Fields. Enhances detail recovery and multi-view consistency in 3D scene reconstruction, outperforming tradit...

Interpreting vision transformers via residual replacement model

Interprets vision transformers via a residual replacement model and analysis of 6.6K features. Reveals feature evolution and encoding of curves, providing insights into ViT processing from low-level p...

Validation-Free Sparse Learning: A Phase Transition Approach to Feature Selection

Proposes validation-free sparse learning using a phase transition approach for feature selection. Addresses AI's environmental footprint by promoting frugal and interpretable models with reduced compl...

📅

Monday, September 22, 2025

Executive Briefing Bullets (20) JSON

Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval

Introduces Grounding via View Retrieval (GVR), a zero-shot method for 3D visual grounding in 3D Gaussian Splatting. It overcomes per-scene training limitations by using view retrieval, enabling effici...

SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Introduces SeCodePLT, a unified platform for evaluating code GenAI security. It addresses limitations of existing benchmarks by offering dynamic analysis and scalable evaluation, improving precision o...

Negotiative Alignment: Embracing Disagreement to Achieve Fairer Outcomes -- Insights from Urban Studies

Proposes Negotiative Alignment to achieve fairer outcomes by embracing disagreement. A community-centered study with diverse groups reveals systematic disagreement patterns, enhancing urban assessment...

World Modelling Improves Language Model Agents

Proposes Dynamics Modeling (DyMo) to augment LLMs with state prediction for tool use in stateful environments. This enables LLMs to predict future states via an internal environment model, improving a...

Beyond the Average: Distributional Causal Inference under Imperfect Compliance

Proposes a regression-adjusted estimator for distributional treatment effects with imperfect compliance. It leverages treatment assignment as an instrumental variable to identify distributional effect...

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Proposes Perception-R1 to enhance multimodal reasoning in MLLMs using visual perception rewards. This approach addresses overlooked perception capabilities, a prerequisite for advanced multimodal reas...

Transfer learning under latent space model

Proposes a transfer learning method for latent space models to improve network analysis and link prediction. It leverages information from similar networks to enhance estimation accuracy, especially f...

AttentionDrop: A Novel Regularization Method for Transformer Models

Proposes AttentionDrop, a family of stochastic regularization techniques operating on self-attention distributions. This method combats overfitting in transformer models, particularly with limited or ...

DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy

Introduces DSDNet for raw domain demoiréing using dual color-space synergy. This addresses severe visual degradation from moiré artifacts in smartphone-captured screen images, overcoming limitations...

CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

Introduces CLIPTTA for robust contrastive vision-language test-time adaptation. It addresses misalignment in standard test-time adaptation objectives for VLMs, improving performance and mitigating fai...

Algorithmic Fairness: Not a Purely Technical but Socio-Technical Property

Argues that algorithmic fairness is a socio-technical property, not purely technical. It highlights misconceptions limiting metric effectiveness and calls for a broader understanding beyond mathematic...

Who is Responsible When AI Fails? Mapping Causes, Entities, and Consequences of AI Privacy and Ethical Incidents

Analyzes 202 AI incidents to develop a taxonomy of causes, entities, and consequences. It classifies incidents across the AI lifecycle, addressing limitations in existing taxonomies for prevention and...

Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-Making Framework

Introduces an LLM-driven decision-making framework for cooperative driving automation. It aims to enhance interaction and continuous learning for connected autonomous vehicles in complex scenarios.

NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training

Develops NeuroRAD-FM, a neuro-oncology foundation model with distributionally robust training. It improves generalization across cohorts and predicts molecular markers, addressing challenges in hetero...

Noise-Robustness Through Noise: Asymmetric LoRA Adaption with Poisoning Expert

Introduces Asymmetric LoRA Adaptation with Poisoning Experts (LoPE) for noise-robust parameter-efficient fine-tuning. This framework enhances model adaptation by leveraging noise rather than relying s...

Training More Robust Classification Model via Discriminative Loss and Gaussian Noise Injection

Introduces a novel training framework with a discriminative loss and Gaussian noise injection for robust classification. It enhances intra-class compactness and decision boundary margins without degra...

Space Group Equivariant Crystal Diffusion

Introduces SGEquiDiff, a crystal generative model handling space group constraints with equivariant likelihoods. This accelerates inverse design of crystalline materials by naturally incorporating sym...

Training A Neural Network For Partially Occluded Road Sign Identification In The Context Of Autonomous Vehicles

Proposes a deep learning model for recognizing partially occluded road signs for autonomous vehicles. It addresses the complexity introduced by occlusions, aiming for improved accuracy in challenging ...

Boosting Active Learning with Knowledge Transfer

Proposes a knowledge transfer method to boost uncertainty estimation in Active Learning, particularly for domain tasks like cryo-ET classification. It addresses challenges in training complex auxiliar...

Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Proposes Multi-Prototype Supervision for robust visual continual learning using language-guided supervision. It addresses semantic ambiguity and intra-class diversity limitations of single-target appr...

📅

Friday, September 19, 2025

Executive Briefing Bullets (20) JSON

Consistent causal discovery with equal error variances: a least-squares perspective

Proposes a least-squares perspective for consistent causal discovery in linear acyclic SEMs with equal error variances. Establishes theoretical guarantees for unique DAG identification, demonstrating ...

Hamiltonian Descent Algorithms for Optimization: Accelerated Rates via Randomized Integration Time

Introduces Hamiltonian Descent Algorithms for optimization, leveraging randomized integration time. Achieves accelerated convergence rates similar to gradient descent for convex functions, offering a ...

Semiparametric Learning from Open-Set Label Shift Data

Addresses open-set label shift by proposing a semiparametric density ratio model framework. Handles novel classes absent from training without restrictive assumptions, offering improved theoretical gu...

Rate doubly robust estimation for weighted average treatment effects

Develops rate doubly robust estimation for weighted average treatment effects (WATE), a versatile class of causal estimands. Addresses robustness limitations in existing literature, enabling more reli...

Efficient Dual-domain Image Dehazing with Haze Prior Perception

Proposes an efficient dual-domain image dehazing method using haze prior perception. Combines spatial and frequency domain features to overcome limitations of existing transformer-based models, enabli...

Gradient Distance Function

Proposes Gradient Distance Functions (GDFs) to represent non-watertight surfaces in deep learning. GDFs are differentiable at the surface, remedying brittleness of UDFs and enabling representation of ...

GCDance: Genre-Controlled 3D Full Body Dance Generation Driven By Music

Introduces GCDance, a diffusion-based framework for genre-specific 3D full-body dance generation driven by music. Achieves physically realistic and synchronized dance sequences while adhering to genre...

A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation

Proposes a framework quantifying the contribution of latent variables in Multiple Latent Variable Generative Models (MLVGMs) using mutual information. Offers a systematic understanding of generative d...

AutoEdit: Automatic Hyperparameter Tuning for Image Editing

Introduces AutoEdit for automatic hyperparameter tuning in text-guided image editing. Addresses the challenge of manual tuning by automating the process, reducing computational costs and improving edi...

Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

Proposes semantically consistent style transfer using diffusion models for synthetic-to-real domain adaptation. Improves performance of vision models trained on synthetic data, especially in adverse c...

Gap-Dependent Bounds for Federated Q-learning

Presents the first gap-dependent analysis of regret and communication cost for on-policy federated Q-learning in tabular MDPs. Achieves improved bounds compared to worst-case analyses, offering a more...

Sharp Matrix Empirical Bernstein Inequalities

Presents two sharp, closed-form empirical Bernstein inequalities for symmetric random matrices with bounded eigenvalues. Achieves tight adaptation to unknown variance, matching matrix Bernstein inequa...

Variational Gaussian Approximation in Replica Analysis of Parametric Models

Revisits the replica method for parametric models by employing a variational Gaussian approximation. Enables deferred and empirical data averages, leading to stationarity conditions for intractable in...

MedFuncta: A Unified Framework for Learning Efficient Medical Neural Fields

Introduces MedFuncta, a unified framework for learning efficient medical neural fields. Addresses challenges in scaling Neural Fields to large medical datasets, offering a powerful alternative to disc...

HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement

Introduces HPGN, a hybrid priors-guided network for enhancing compressed low-light images. Integrates compression and illumination priors in a unified framework, addressing joint enhancement challenge...

Morph: A Motion-free Physics Optimization Framework for Human Motion Generation

Presents Morph, a motion-free physics optimization framework for human motion generation. Addresses physically implausible motions by incorporating physics constraints, offering a new approach to real...

Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Proposes DM-Calib, a diffusion-based approach for monocular camera intrinsic parameter estimation. Leverages diffusion models trained on massive data for improved generalization across diverse real-wo...

Physics-Informed Representation Alignment for Sparse Radio-Map Reconstruction

Introduces PhyRMDM, a physics-informed framework for sparse radio-map reconstruction. Aligns physical constraints with data-driven features, establishing a novel approach for accurate reconstruction u...

Erased or Dormant? Rethinking Concept Erasure Through Reversibility

Rethinks concept erasure in diffusion models by evaluating robustness and reversibility. Investigates whether erasure truly eliminates generative capacity or achieves only superficial suppression, pro...

Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model

Introduces a confidence-aware diffusion model for lightweight and accurate multi-view stereo reconstruction. Achieves 3D geometry reconstruction from calibrated images efficiently, demonstrating the p...

📅

Thursday, September 18, 2025

Executive Briefing Bullets (20) JSON

Large Language Models for Information Retrieval: A Survey

Surveys the integration of Large Language Models (LLMs) into Information Retrieval (IR) systems. It details how LLMs capture complex signals and semantic nuances, evolving IR from term-based methods t...

Synthesis and Perceptual Scaling of High Resolution Naturalistic Images Using Stable Diffusion

Introduces a method for synthesizing and perceptually scaling high-resolution naturalistic images using Stable Diffusion. It focuses on generating perceptually continuous variations of naturalistic st...

Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach

Presents a Mixed-Integer Linear Programming approach for effort-optimized, accuracy-driven labeling and validation of test inputs for Deep Learning (DL) systems. It aims to build highly accurate datas...

Adaptive Off-Policy Inference for M-Estimators Under Model Misspecification

Proposes a method for valid inference for M-estimators using adaptively collected bandit data under model misspecification. It provides robust statistical approaches for data collected adaptively, lik...

Brain age identification from diffusion MRI synergistically predicts neurodegenerative disease

Demonstrates that brain age identification from diffusion MRI (dMRI) synergistically predicts neurodegenerative disease. It leverages dMRI's sensitivity to microstructural changes to build an earlier ...

UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision

Presents UniPLV, a framework for label-efficient open-world 3D scene understanding using regional visual language supervision. It unifies point clouds and images for robust recognition without manual ...

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Proposes a self-supervised method for Embodied Image Captioning, enabling agents to describe objects while exploring environments. It fine-tunes captioning models via a three-phase framework for enhan...

GenExam: A Multidisciplinary Text-to-Image Exam

Introduces GenExam, the first benchmark for multidisciplinary text-to-image exams. It features 1,000 samples across 10 subjects, evaluating integrated understanding, reasoning, and generation capabili...

Object Pose Estimation through Dexterous Touch

Presents an object pose estimation approach using sensorimotor exploration and Reinforcement Learning (RL). It enables robots to actively control hand interactions for pose estimation, especially when...

InterKey: Cross-modal Intersection Keypoints for Global Localization on OpenStreetMap

Proposes InterKey, a cross-modal approach for global localization on OpenStreetMap. It enables robust localization for autonomous vehicles by matching sensor data with OSM, addressing scalability limi...

Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

Proposes a novel framework for identity-preserving text-to-video generation using spatial-temporal decoupled representations. It addresses the trade-off between spatial coherence and temporal smoothne...

GROOD: GRadient-Aware Out-of-Distribution Detection

Proposes GROOD, a gradient-aware approach for Out-of-Distribution (OOD) detection in deep learning. It improves reliability in real-world applications by better distinguishing near-OOD samples compare...

Imputation-Powered Inference

Introduces Imputation-Powered Inference to address blockwise missingness in multi-modal and multi-site data. It offers a solution for complex missingness patterns that challenge standard inference met...

Physics-informed, boundary-constrained Gaussian process regression for the reconstruction of fluid flow fields

Presents a general method for physics-informed, boundary-constrained Gaussian process regression for reconstructing fluid flow fields. It uses adapted covariance functions to obtain estimates and cons...

Stereo Anything: Unifying Zero-shot Stereo Matching with Large-Scale Mixed Data

Introduces StereoAnything, a data-centric framework unifying zero-shot stereo matching with large-scale mixed data. It enhances generalization capabilities for stereo matching models in unseen domains...

Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

Introduces a lightweight gradient-aware upscaling technique for 3D Gaussian Splatting (3DGS) on GPUs. It achieves higher rendering speeds and reduces artifacts by leveraging analytical image gradients...

CROP: Contextual Region-Oriented Visual Token Pruning

Introduces CROP (Contextual Region-Oriented Visual Token Pruning), a framework to compress visual tokens in VLM-based VQA. It localizes and prunes redundant visual tokens, reducing memory and computat...

Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction for Sparse-View CT

Proposes CDPIR, a Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction framework for Sparse-View CT. It addresses out-of-distribution problems and enhances reconstruction quality with r...

Rest2Visual: Predicting Visually Evoked fMRI from Resting-State Scans

Introduces Rest2Visual, a method to predict visually evoked fMRI from resting-state scans. It bridges spontaneous brain activity with stimulus-driven responses, offering a way to interpret rs-fMRI.

MetricNet: Recovering Metric Scale in Generative Navigation Policies

Introduces MetricNet, a framework for recovering metric scale in generative navigation policies. It addresses issues of unscaled abstract spaces and short-sighted actions in learned navigation, enabli...

📅

Wednesday, September 17, 2025

No research highlights available for this date

📅

Tuesday, September 16, 2025

Executive Briefing Bullets (20) JSON

Predictable Compression Failures: Why Language Models Actually Hallucinate

Proposes that transformers minimize expected conditional description length over orderings, not permutation-invariant length, which explains hallucinations. Shows that transformers are Bayesian in expectation, not ...

Kernel Embeddings and the Separation of Measure Phenomenon

Proves kernel covariance embeddings achieve information-theoretically perfect separation of probability distributions. Establishes equivalence between testing measure equality and singularity between ...

Contractive kinetic Langevin samplers beyond global Lipschitz continuity

Proposes novel discretizations of kinetic Langevin SDEs for sampling from log-concave distributions with superlinear gradient growth. Shows contractivity and log-Sobolev inequality, establishing non-a...

Generalized Dirichlet Energy and Graph Laplacians for Clustering Directed and Undirected Graphs

Introduces generalized Dirichlet energy (GDE) to cluster directed and undirected graphs. GDE handles asymmetry in directed graphs, extending classical spectral methods and preserving directional infor...

Deep learning joint extremes of metocean variables using the SPAR model

Presents a deep learning framework using the SPAR model for multivariate joint extremes of metocean variables. Transforms multivariate extremes to angular density modeling, enabling improved tail anal...

Social Perception of Faces in a Vision-Language Model

Explores social perception of faces in CLIP by comparing embedding similarities between prompts and face images. Systematically varies dimensions like age, gender, and race to analyze social perceptio...

Next-Generation Reservoir Computing for Dynamical Inference

Presents a scalable reservoir computing implementation for dynamical systems using pseudorandom nonlinear projection. Offers a flexible alternative to polynomial projections for time series data analy...

The Morgan-Pitman Test of Equality of Variances and its Application to Machine Learning Model Evaluation and Selection

Proposes using the Morgan-Pitman test for equality of variances in forecasting errors. Enhances robustness against heavy-tailed distributions and outliers, aiding machine learning model evaluation and...

Solving ill-conditioned polynomial equations using score-based priors with application to multi-target detection

Proposes a framework integrating score-based diffusion priors with moment-based estimators to solve ill-conditioned polynomial equations. Stabilizes polynomial recovery from noisy statistical features...

Adapting Projection-Based Reduced-Order Models using Projected Gaussian Process

Adapts projection-based reduced-order models using Projected Gaussian Process. Addresses challenges in updating parametric ROMs by utilizing snapshot data and POD basis modes for improved representati...

Preconditioned subgradient method for composite optimization: overparameterization and fast convergence

Introduces a preconditioned subgradient method for composite optimization problems. Demonstrates fast convergence even with ill-conditioned or overparameterized smooth maps, applicable to data science...

High Effort, Low Gain: Fundamental Limits of Active Learning for Linear Dynamical Systems

Analyzes fundamental limits of active learning for linear dynamical systems, focusing on excitation input's effect on sample complexity. Presents lower bounds and system-theoretic conditions for poten...

Eigen-convergence of Gaussian kernelized graph Laplacian by manifold heat interpolation

Studies spectral convergence of graph Laplacians to Laplace-Beltrami operators using manifold heat interpolation. Proves convergence with Gaussian kernels by setting bandwidth parameter $\epsilon \sim...

A comparison between geostatistical and machine learning models for spatio-temporal prediction of PM2.5 data

Compares geostatistical and machine learning models for PM2.5 spatio-temporal prediction. Highlights the impact of low-cost sensors on data granularity and enables real-time, high-resolution air quali...

Contrastive Network Representation Learning

Proposes a contrastive learning framework for network representation learning, specifically for subject-specific, high-dimensional, sparse brain connectivity data. Preserves structural and semantic pr...

Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning

Develops a stochastic approximation framework for learning nonlinear operators using Mercer operator-valued kernels. Encompasses compact and diagonal kernels, inducing expressive vector-valued reprodu...

Some Robustness Properties of Label Cleaning

Demonstrates that learning procedures using aggregated labels enjoy robustness properties that are impossible without data cleaning. This robustness appears in risk consistency and improved generalization.

The Honest Truth About Causal Trees: Accuracy Limits for Heterogeneous Treatment Effect Estimation

Analyzes accuracy limits of causal trees for heterogeneous treatment effect estimation. Discusses how fitting procedures based on CART or its variants are believed to be adaptive, but reveals limitations in their accuracy.

Piecewise Deterministic Markov Processes for Bayesian Neural Networks

Introduces Piecewise Deterministic Markov Process (PDMP) samplers for Bayesian Neural Networks. Permits subsampling of likelihoods, overcoming the computational limitations of traditional MCMC.

A Permutation-free Kernel Two-Sample Test

Introduces a permutation-free kernel two-sample test for MMD statistics. Designs a level-$\alpha$ test by overcoming intractable limiting distributions, offering finite-sample validity without permuta...

📅

Monday, September 15, 2025

Executive Briefing Bullets (20) JSON

GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT

Introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for retinal OCT image denoising. Effectively balances noise reduction with preservation of crucial an...

Efficient Learned Image Compression Through Knowledge Distillation

Presents an efficient learned image compression method through knowledge distillation. Maps images to a low-dimensional latent space for entropy coding, reconstructing approximations at the receiver, ...

Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards

Provides a systematic classification and benchmarking of compressed video quality enhancement (CVQE) methods across standards. Addresses limitations in linking methods to artifacts and comparative ana...

HHI-Assist: A Dataset and Benchmark of Human-Human Interaction in Physical Assistance Scenario

Introduces HHI-Assist, a dataset and benchmark for human-human interaction in physical assistance. Addresses challenges in accurate human motion prediction for assistive robots in complex physical int...

GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation

Proposes GC-VLN, a training-free framework for vision-and-language navigation. Formulates navigation guidance as graph constraint optimization, enabling deployment in continuous environments without e...

Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving

Proposes Talk2PC, enhancing 3D visual grounding for autonomous driving through LiDAR and Radar point cloud fusion. Moves beyond 2D VLMs to leverage rich 3D representations from point clouds for improv...

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Introduces GROVE, a generalized reward framework for learning open-vocabulary physical skills for simulated agents. Enables skill learning without manual reward engineering or task-specific demonstrat...

Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

Proposes a novel prompt optimization framework for text-to-image generation using self-rewarding large vision-language models. Alleviates dependence on large-scale manual data and biases from trained ...

Uncovering Neuroimaging Biomarkers of Brain Tumor Surgery with AI-Driven Methods

Develops AI-driven methods to uncover neuroimaging biomarkers for brain tumor surgery outcome prediction. Addresses limitations of curated datasets by using AI to analyze complex imaging data for impr...

Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image

Proposes a novel approach using Geometry and Perception Guided Gaussians for multiview-consistent 3D generation from a single image. Addresses poor multiview consistency and lack of geometric detail i...

Immunizing Images from Text to Image Editing via Adversarial Cross-Attention

Proposes Attention Attack, a novel adversarial attack disrupting cross-attention for text-based image editing. Immunizes images from text-to-image editing by targeting the visual component, enhancing ...

Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition

Introduces a novel training-free approach for intrinsic image decomposition using visible and thermal image pairs. Leverages ordinality of intensities to decompose images into shading and reflectance ...

Chord: Chain of Rendering Decomposition for PBR Material Estimation from Generated Texture Images

Proposes Chord, a two-stage framework for PBR material generation. Synthesizes shaded, tileable texture images using a fine-tuned diffusion model and then decomposes them to estimate PBR materials, im...

Polarization Denoising and Demosaicking: Dataset and Baseline Method

Presents a dataset and baseline method for polarization denoising and demosaicking of DoFP polarimeter images. Addresses the scarcity of research on the joint task, crucial for applications using pola...

Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

Investigates if generative geospatial diffusion models can excel as discriminative geospatial foundation models. Explores their potential to capture multi-grained semantic information for improved rep...

Dynamic Motion Blending for Versatile Motion Editing

Introduces MotionCutMix, an online data augmentation technique for text-guided motion editing. Dynamically generates training triplets by blending body part motions, significantly expanding training d...

MedM-VL: What Makes a Good Medical LVLM?

Systematically explores model architectures and training strategies for Medical Large Vision-Language Models (LVLMs) based on LLaVA. Aims to define what makes a good medical LVLM for complex multimoda...

Integrative Variational Autoencoders for Generative Modeling of an Image Outcome with Multiple Input Images

Introduces the Integrative Variational Autoencoder (InVA) for image-on-image regression in multimodal neuroimaging. Models outcome images as functions of shared and modality-specific features, offerin...

Efficient and Effective Adaptation of Multimodal Foundation Models in Sequential Recommendation

Introduces the IISAN framework for parameter-efficient fine-tuning of multimodal foundation models in sequential recommendation. Significantly enhances efficiency in GPU memory and training speed comp...

Earth Observation Foundation Model PhilEO: Pretraining on the MajorTOM and FastTOM Datasets

Introduces PhilEO, an Earth Observation Foundation Model pretrained on massive datasets. Demonstrates competitive performance against specialized models, enabling efficient fine-tuning for downstream ...

📅

Friday, September 12, 2025

Executive Briefing Bullets (20) JSON

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Introduces FLUX-Reason-6M, a 6 million image dataset, and PRISM-Bench, a benchmark for text-to-image reasoning. Addresses performance gaps in open-source models and enables comprehensive evaluation of...

Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems

Proposes MOAT, a multi-agent joint alignment tuning framework to harmonize LLM-based multi-agent systems. Addresses capability gaps and poor coordination issues arising from independent agent fine-tun...

Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings

Investigates energy efficiency and performance trade-offs in LLM inference across tasks and DVFS settings. Identifies and optimizes factors influencing runtime efficiency without compromising performa...

Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Proposes a Gradient-Attention Guided Dual-Masking framework for robust text-based person retrieval. Addresses scarcity of person-centric data and limitations of global contrastive learning for fine-gr...

UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

Proposes UnsafeBench, a framework to evaluate image safety classifiers. Benchmarks effectiveness and robustness on both real-world and AI-generated images, addressing concerns about misuse of text-to-...

Adaptive kernel predictors from feature-learning infinite limits of neural networks

Derives adaptive kernel predictors from feature-learning infinite-width neural network limits. Provides explicit expressions for kernel predictors and numerical calculation methods, advancing understa...

Visual Grounding from Event Cameras

Introduces Talk2Event, the first large-scale benchmark for language-driven object grounding using event camera data. Addresses the gap in multimodal perception for event cameras, leveraging their adva...

Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Proposes Medverse, a universal model for full-resolution 3D medical image analysis including segmentation, transformation, and enhancement. Enables high-fidelity predictions and global anatomical unde...

GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving

Introduces GEMINUS, a Mixture-of-Experts framework for end-to-end autonomous driving. Features a Global Expert and Scene-Adaptive Experts Group with a Dual-aware Router to handle diverse traffic envir...

Diffusion-Based Action Recognition Generalizes to Untrained Domains

Proposes using Vision Diffusion Model features aggregated by a transformer for action recognition. Achieves human-like generalization across context and viewpoint variations in untrained domains, over...

VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Summarizes the VQualA 2025 Challenge on visual quality comparison for Large Multimodal Models (LMMs). Introduces a novel benchmark for evaluating LMMs' reasoning about visual quality differences acros...

Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025)

Presents MetaGraph, a methodology for extracting knowledge graphs from financial NLP literature. Analyzes research trends in GenAI for finance NLP, defining an ontology and structuring research insigh...

ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain

Introduces ALL-PET, a low-resource, low-shot PET foundation model operating in the projection domain. Leverages a latent diffusion model and innovative augmentation strategies to overcome data scarcit...

MOLLM: Multi-Objective Large Language Model for Molecular Design -- Optimizing with Experts

Introduces MOLLM, a Multi-Objective Large Language Model for molecular design. Combines domain knowledge with LLMs and in-context learning for multi-objective optimization of molecular properties.

MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

Introduces MetaRAG, a metamorphic testing framework for hallucination detection in RAG systems. Addresses challenges specific to RAG where responses must align with retrieved evidence, unlike standalo...

ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation

Proposes ABS-Mamba, a novel network for medical image translation integrating SAM2 for semantic representation and Mamba for structure preservation. Harmonizes global semantics and local fidelity, add...

The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation, Reconstruction and Radiance Field Methods

Introduces the Oxford Spires Dataset, a large-scale multi-modal dataset for benchmarking LiDAR-visual tasks. Establishes benchmarks for localization, reconstruction, and novel-view synthesis using syn...

MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Proposes MR-UIE, a framework using Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction. Enhances LLM performance in structured output scenarios requiring compl...

Integrating Anatomical Priors into a Causal Diffusion Model

Integrates anatomical priors into a causal diffusion model for 3D brain MRI synthesis. Addresses the lack of inductive biases in counterfactual models, preserving fine-grained anatomical details for p...

Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research

Connects cognitive science theory of analogical reasoning with NLP research. Shows how these notions are relevant for major NLP challenges, offering a cognitive lens to understand and advance analogic...

📅

Thursday, September 11, 2025

Executive Briefing Bullets (20) JSON

Alternating Minimization Schemes for Computing Rate-Distortion-Perception Functions with $f$-Divergence Perception Constraints

Proposes an alternating minimization scheme (OAM) to compute the rate-distortion-perception function with $f$-divergence constraints. Characterizes optimal parametric solutions, enabling efficient com...

A transport approach to the cutoff phenomenon

Introduces a new transport approach using a W-TV transport inequality and parabolic regularization to study the cutoff phenomenon for Markov processes. This bypasses the use of varentropy, offering an...

Identification and Estimation of Simultaneous Equation Models Using Higher-Order Cumulant Restrictions

Addresses identification challenges in linear simultaneous-equation models by exploiting higher-order moments of non-Gaussian data. Relaxes the typical assumption of uncorrelated structural errors, en...

Foundation Models for Autonomous Driving Perception: A Survey Through Core Capabilities

Surveys foundation models for autonomous driving perception, analyzing their impact on generalization, scalability, and robustness. Introduces a taxonomy based on four core capabilities, examining how...

Event Camera Meets Resource-Aware Mobile Computing: Abstraction, Algorithm, Acceleration, Application

Explores event-based vision for high-agility mobile devices, focusing on abstraction, algorithms, acceleration, and applications. Addresses challenges of noisy events and stable perception for low-lat...

Good Deep Features to Track: Self-Supervised Feature Extraction and Tracking in Visual Odometry

Introduces self-supervised feature extraction and tracking for visual odometry to improve robustness in challenging settings. Addresses issues like lighting changes and dynamic scenes that degrade per...

CNN-ViT Hybrid for Pneumonia Detection: Theory and Empiric on Limited Data without Pretraining

Explores a CNN-ViT hybrid model for pneumonia detection trained from scratch on limited data. Demonstrates the architectural strengths of the hybrid model, achieving competitive performance on balance...

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Proposes Sigma, a Siamese Mamba network for multi-modal semantic segmentation. Leverages additional modalities (X-modality) alongside RGB to enhance perception and scene understanding, particularly in...

RewardDance: Reward Scaling in Visual Generation

Investigates reward scaling in visual generation using Reinforcement Learning. Addresses limitations of CLIP-based RMs and Bradley-Terry losses, proposing a method for effective scaling in Vision-Lang...

SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video

Proposes SAFT, a system for reconstructing 3D shape and appearance of fabrics from monocular video using differentiable physical simulations. Achieves realistic deformations and rendering for dynamic ...

Bias in the Loop: How Humans Evaluate AI-Generated Suggestions

Investigates how task design and individual differences affect human evaluation of AI suggestions through a randomized experiment. Reveals psychological factors influencing the success and failure of ...

On the Sample Complexity of Set Membership Estimation for Linear Systems with Disturbances Bounded by Convex Sets

Establishes convergence rates for set membership identification in linear systems under relaxed assumptions on persistent excitation and disturbances. Uses a block-martingale small-ball condition enab...

X-Part: high fidelity and structure coherent shape decomposition

Introduces X-Part, a controllable generative model for decomposing 3D objects into semantically meaningful, structurally coherent parts with high geometric fidelity. Addresses limitations in controlla...

A Survey of World Models for Autonomous Driving

Systematically reviews recent advances in world models for autonomous driving, highlighting their role in robust scene interpretation and safe decision-making. Discusses how these models integrate mul...

Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse

Proposes Déjà Vu, an efficient video-language query engine that reuses computations between video frames using learning-based inter-frame techniques. Addresses the computational burden of Vision Transfo...

Physics-Guided Rectified Flow for Low-light RAW Image Enhancement

Presents a physics-guided rectified flow method for low-light RAW image enhancement. Addresses limitations of synthetic datasets by physically modeling sensor noise more comprehensively, improving enh...

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

Introduces SocialNav-SUB, a benchmark for evaluating Vision-Language Models (VLMs) in social robot navigation scene understanding. Assesses VLM capabilities in inferring social navigation contexts cru...

Vision Transformer with Sparse Scan Prior

Introduces a Sparse Scan Self-Attention mechanism ($\rm{S}^3\rm{A}$) for Vision Transformers, inspired by human eye scanning. Predefines anchors of interest for tokens to reduce computational overhead...

Learning Robust Representations via Bidirectional Transition for Visual Reinforcement Learning

Introduces a Bidirectional Transition approach for learning robust representations in visual reinforcement learning. Aims to create reliable representations by predicting future states and tracing his...

GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

Introduces GeneVA, a dataset of human annotations for generative text-to-video artifacts. Addresses the need for systematic benchmarks to study and mitigate unpredictable artifacts like impossible phy...

📅

Wednesday, September 10, 2025

Executive Briefing Bullets (20) JSON

BEAM: Bridging Physically-based Rendering and Gaussian Modeling for Relightable Volumetric Video

Introduces BEAM, a novel pipeline bridging 4D Gaussian representations with physically-based rendering to produce relightable volumetric video. Achieves efficient, high-quality rendering of dynamic 3D...

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Introduces GraspCoT, integrating physical property reasoning into LLMs for flexible language-instruction-guided 6-DoF robotic grasping. Enables robots to comprehend and execute grasping tasks by lever...

Semi-SMD: Semi-Supervised Metric Depth Estimation via Surrounding Cameras for Autonomous Driving

Introduces Semi-SMD, a semi-supervised metric depth estimation framework for autonomous driving using surrounding cameras. Proposes a unified fusion module and cross-attention for scale information re...

Don't Splat your Gaussians: Volumetric Ray-Traced Primitives for Modeling and Rendering Scattering and Emissive Media

Generalizes 3D Gaussian modeling to volumetric primitives for scattering and emissive media, introducing closed-form solutions for modeling and rendering. Enables unified representation of surfaces an...

Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Surveys recent methods leveraging LLMs for crash detection from video data, presenting a structured taxonomy, datasets, architectures, and performance benchmarks. Provides a comprehensive overview of ...

HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer

Proposes HieraRS, a hierarchical segmentation paradigm for remote sensing enabling multi-granularity interpretation and cross-domain transfer. Addresses limitations of flat classification by generatin...

SplatFill: 3D Scene Inpainting via Depth-Guided Gaussian Splatting

Presents SplatFill, a novel depth-guided approach for 3D Gaussian Splatting scene inpainting. Achieves state-of-the-art perceptual quality and improved efficiency for filling missing regions in 3D sce...

RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

Introduces RayGaussX, accelerating Gaussian-based ray marching for real-time, high-quality novel view synthesis. Achieves significant speedups in training and inference by building on RayGauss with ke...

Interpretable Text-Guided Image Clustering via Iterative Search

Introduces an interpretable text-guided image clustering method via iterative search. Addresses ambiguity in clustering by allowing users to define criteria, enabling flexible and accurate partitionin...

Geospatial Foundational Embedder: Top-1 Winning Solution on EarthVision Embed2Scale Challenge (CVPR 2025)

Introduces a foundational geospatial model for embedding hyperspectral geospatial data cubes into vectors. Achieved Top-1 in the EarthVision Embed2Scale Challenge, demonstrating effectiveness for down...

PINGS: Gaussian Splatting Meets Distance Fields within a Point-Based Implicit Neural Map

Proposes PINGS, a novel map representation unifying distance fields and radiance fields for robots, enabling high-fidelity, geometrically accurate, and photorealistic environmental reconstructions. Ac...

VMGNet: A Low Computational Complexity Robotic Grasping Network Based on VMamba with Multi-Scale Feature Fusion

Proposes VMGNet, a low computational complexity, high-accuracy network for robotic grasping using VMamba and multi-scale feature fusion. Achieves linear computational complexity, significantly reducin...

IntuiTF: MLLM-Guided Transfer Function Optimization for Direct Volume Rendering

Proposes IntuiTF, an MLLM-guided framework for transfer function optimization in direct volume rendering. Addresses vast exploration space and limited generalizability by enabling intuitive, semantic ...

Missing Fine Details in Images: Last Seen in High Frequencies

Addresses the lack of fine details in latent generative models by focusing on high frequencies. Proposes methods to improve latent representations and generation quality, particularly for textured reg...

Atomizer: Generalizing to new modalities by breaking satellite images down to a set of scalars

Introduces Atomizer, a flexible architecture representing remote sensing images as sets of scalars to generalize across diverse satellite modalities. Enables models to adapt to new configurations with...

DiGS: Accurate and Complete Surface Reconstruction from 3D Gaussians via Direct SDF Learning

Introduces DiGS, a unified framework embedding Signed Distance Field learning within 3D Gaussians for accurate and complete surface reconstruction. Achieves state-of-the-art rendering quality while en...

HairGS: Hair Strand Reconstruction based on 3D Gaussian Splatting

Extends 3D Gaussian Splatting for strand-level hair geometry reconstruction from multi-view images. Achieves efficient and explicit scene representation for hair, enabling applications in virtual real...

Beyond Motion Cues and Structural Sparsity: Revisiting Small Moving Target Detection

Proposes a novel deep learning framework for small moving target detection that moves beyond traditional motion cues and structural sparsity. Achieves robust detection in complex environments by focus...

TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

Presents TextlessRAG, the first end-to-end framework for speech-based question answering over visual document images, eliminating ASR, TTS, and OCR. Directly interprets speech queries to extract knowl...

XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning

Proposes XOCT for enhancing OCT to OCTA translation using cross-dimensional supervised multi-scale feature learning. Addresses challenges in acquiring high-quality OCTA images and improves deep learni...

📅

Tuesday, September 9, 2025

Executive Briefing Bullets (19) JSON

Flow-based generative models as iterative algorithms in probability space

Proposes flow-based generative models as iterative algorithms operating in probability space. Demonstrates their power for high-dimensional data synthesis, exact likelihood estimation, efficient sampl...

Precise Bayesian Neural Networks

Revisits Bayesian Neural Networks (BNNs) through normalization, modeling uncertainty only in weight directions. Aims to address misalignment with network geometry and improve uncertainty quantificatio...

Robust Generative Learning with Lipschitz-Regularized $\alpha$-Divergences Allows Minimal Assumptions on Target Distributions

Demonstrates robustness of Lipschitz-regularized $\alpha$-divergences in generative modeling, enabling stable learning with minimal assumptions on target distributions. Establishes finiteness under mi...

Randomized Quasi-Monte Carlo Features for Kernel Approximation

Investigates Randomized Quasi-Monte Carlo (RQMC) methods for kernel approximation, improving deterministic error bounds over classical Monte Carlo. Establishes theoretical guarantees for RQMC in rando...

The Ground Cost for Optimal Transport of Angular Velocity

Revisits optimal transport for angular velocity dynamics via the controlled Euler equation. Enables stochastic guidance of spin states for rigid bodies under deadline constraints by transferring state...

Beyond ATE: Multi-Criteria Design for A/B Testing

Proposes multi-criteria design for A/B testing beyond Average Treatment Effect (ATE). Addresses additional objectives like welfare or revenue loss, critical for practical applications beyond simple es...

LLaDA-VLA: Vision Language Diffusion Action Models

Introduces LLaDA-VLA, a vision-language diffusion action model for robotic manipulation. Leverages diffusion models for policy learning, extending their application beyond text generation and multimod...

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Introduces F1, a pretrained Vision-Language-Action (VLA) framework integrating visual foresight generation into decision-making. Adopts a Mixture-of-Transformer architecture for language-conditioned t...

Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers

Proposes Barlow-Swin, a novel Siamese-based segmentation architecture using Swin-Transformers. Addresses limitations of CNNs in global context modeling for medical image segmentation, aiming for light...

H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

Introduces H$_{2}$OT, a hierarchical plug-and-play pruning-and-recovering framework for efficient transformer-based 3D human pose estimation from videos. Addresses high computational costs of video po...

GenAI-Powered Inference

Introduces GenAI-Powered Inference (GPI), a statistical framework for causal and predictive inference using unstructured data. Leverages GenAI models to generate data at scale and extract low-dimensio...

Predicting Market Troughs: A Machine Learning Approach with Causal Interpretation

Provides robust evidence on causal drivers of market troughs using a flexible causal machine learning framework. Identifies volatility of risk appetite and market liquidity as key drivers, offering im...

Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

Analyzes trade-offs in variational inference for uncertainty quantification, showing that mean-field approximations lead to an impossibility theorem when the target distribution does not factorize. Hi...

The feasibility of multi-graph alignment: a Bayesian approach

Establishes feasibility thresholds for random multi-graph alignment in Gaussian and Erdős-Rényi models. Demonstrates an 'all-or-nothing' phenomenon in the Gaussian model and rigorously identifies thre...

Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

Presents non-asymptotic convergence analysis for Q-learning and actor-critic algorithms in robust average-reward MDPs. Shows optimal robust Q operator is a strict contraction, enabling stochastic appr...

MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis

Adapts foundation models like DINOv2 for multi-modal medical image analysis, addressing limitations of uni-modal designs. Aims to improve effectiveness for multi-modal tasks common in medical fields.

Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data

Scales transformer-based novel view synthesis models using token disentanglement and synthetic data. Incorporates synthetic data from diffusion models to improve generalization to real-world scenes.

ADIR: Adaptive Diffusion for Image Reconstruction

Introduces ADIR, an adaptive diffusion framework for image reconstruction. Leverages diffusion model priors while enforcing consistency with measurements, adapting pre-trained models for improved reco...

FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data

Presents FoMo4Wheat, a crop-domain vision foundation model pretrained with self-supervision on ImAg4Wheat. Aims for reliable crop vision models by using the largest wheat image dataset for self-superv...

📅

Monday, September 8, 2025

Executive Briefing Bullets (20) JSON

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Introduces Instruction-oriented Preference Alignment (IPA) to enhance the comprehension of Multimodal Large Language Models (MLLMs). IPA focuses on crucial multi-modal comprehension factors, improving perfor...

Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach

Presents a world model-driven code execution approach for smarter mobile device control, addressing limitations of reactive policies. It enables foresighted planning by considering sequential steps an...

GeoSplat: A Deep Dive into Geometry-Constrained Gaussian Splatting

Presents GeoSplat, a general geometry-constrained optimization framework for Gaussian splatting. It leverages higher-order geometric priors beyond normal vectors, addressing limitations of prior noisy...

Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization

Introduces Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity using a connectivity-preserving tokenization scheme. It automates connectivity relations...

LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Introduces LUIVITON, an end-to-end system for automated virtual try-on of complex clothing on diverse characters. It addresses garment-body alignment by separating draping into clothing-to-SMPL and bo...

SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Presents SGS-3D for high-fidelity 3D instance segmentation, addressing errors from 2D-to-3D lifting. It employs splitting and growing reliable semantic masks, overcoming ambiguous semantic guidance an...

FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

Introduces FlowSeek, a novel framework for optical flow requiring minimal hardware resources. It combines optical flow networks with single-image depth foundation models and motion parametrization, ac...

Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

Proposes Histo-Miner, a deep learning pipeline for tissue feature extraction from Whole Slide Images (WSIs) of skin cancer. It generates datasets with labeled nuclei and tumor regions, providing an op...

DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models

Presents DisPatch, a defense mechanism against adversarial patch attacks in object detection using diffusion models. It aims to disarm these attacks by leveraging diffusion model capabilities, providi...

A biologically inspired separable learning vision model for real-time traffic object perception in Dark

Introduces a biologically inspired separable learning vision model for real-time traffic object perception in low-light conditions. It addresses severe illumination degradation and lack of visual cues...

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Proposes a diagnostic framework using the Linear Separability Ceiling (LSC) to analyze Visual-Language Models (VLMs). It reveals pervasive alignment issues in VLM representations, disentangling percep...

High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model

Introduces LatentCSI, a novel method for generating high-resolution images from WiFi CSI measurements using a pretrained latent diffusion model. It employs a lightweight network for direct mapping to ...

YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception

Introduces YOLOv13, enhancing real-time object detection with hypergraph-enhanced adaptive visual perception. It overcomes limitations of pairwise correlations by capturing global multi-to-multi high-...

Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation

Presents a novel framework for real-time virtual makeup try-on that achieves high-fidelity, identity-preserving transfer with temporal consistency. It effectively decouples semitransparent cosmetics f...

Multimodal LLM Guided Exploration and Active Mapping using Fisher Information

Proposes an active mapping system using a 3D Gaussian Splatting representation guided by multimodal LLMs for long-horizon exploration. It integrates detailed motion planning with LLM guidance, address...

PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Introduces PromptEnhancer, a prompt rewriting framework that enhances text-to-image models via Chain-of-Thought. It addresses challenges in rendering complex prompts, improving attribute binding and c...

STADI: Fine-Grained Step-Patch Diffusion Parallelism for Heterogeneous GPUs

Presents STADI (Spatio-Temporal Adaptive Diffusion Inference), a novel framework for efficient diffusion model inference on heterogeneous GPUs. It addresses workload imbalance, optimizing resource uti...

Semi-supervised Deep Transfer for Regression without Domain Alignment

Introduces a semi-supervised deep transfer learning approach for regression without domain alignment, addressing generalization challenges in domain-shifted target data. It offers a solution for scena...

Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

Introduces techniques for improved 3D scene stylization via text-guided generative editing, addressing challenges in high-quality stylization and view consistency. It enables consistent style applicat...

Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition

Explores transfer learning with mobile-enabled CNNs (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). It evaluates TL strategies and lightweight MbNets to address computational requ...

📅

Friday, September 5, 2025

Executive Briefing Bullets (20) JSON

A Generative Foundation Model for Chest Radiography

Introduces ChexGen, a generative vision-language foundation model for synthesizing chest radiographs guided by text, masks, and bounding boxes. Pretrained on a large dataset, it offers a unified frame...

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Proposes TRUST-VL, an explainable news assistant for general multimodal misinformation detection. It jointly trains across distortion types, facilitating knowledge sharing and enabling generalization ...

From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification

Evaluates embeddings from foundation models for radiographic classification using lightweight adapters. Compares various models and algorithms on a large dataset, providing insights into embedding eff...

From Editor to Dense Geometry Estimator

Analyzes fine-tuning behaviors of image editing models versus text-to-image generative models for dense geometry estimation. Finds editing models are more suitable foundations, enabling improved dense...

DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval

Proposes DUDE, a diffusion-based unsupervised cross-domain image retrieval method using feature disentanglement. Leverages diffusion models to address domain gaps by separating object features from do...

Improved sampling algorithms and Poincaré inequalities for non-log-concave distributions

Studies sampling from non-log-concave distributions using improved algorithms and Poincaré inequalities. Focuses on query complexity for potentials with L-smoothness and bounded second moments, advanc...

Prob-GParareal: A Probabilistic Numerical Parallel-in-Time Solver for Differential Equations

Introduces Prob-GParareal, a probabilistic extension of GParareal for uncertainty quantification in parallel-in-time solvers. Employs Gaussian processes to model the correction function, enabling prob...

Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions

Reviews 147 recent studies on deep learning for vision-based traffic accident anticipation. Categorizes methodologies and datasets, focusing on supervised, unsupervised, and hybrid models for accident...

ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving

Introduces ConServe, a fine-grained GPU harvesting method for co-serving LLM online and offline requests. Achieves high GPU utilization by managing resources at a finer granularity than existing syste...

Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models

Presents Plot'n Polish for zero-shot story visualization and disentangled editing using diffusion models. Addresses the need for enhanced control and post-generation modification, enabling consistent ...

Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability

Introduces MatterVial, a hybrid framework integrating GNNs and symbolic regression for materials science. It expands feature space by combining latent representations from GNNs with descriptors and no...

Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis

Introduces a dual-stream diffusion model for coordinated piano hand motion synthesis from audio. It models hand independence and coordination, generating synchronized gestures while preserving distinc...

Spatial-aware Transformer-GRU Framework for Enhanced Glaucoma Diagnosis from 3D OCT Imaging

Presents a spatial-aware Transformer-GRU framework for enhanced glaucoma diagnosis using 3D OCT imaging. Integrates Vision Transformer for feature extraction and Bi-GRU for temporal modeling, improvin...

FastPart: Over-Parameterized Stochastic Gradient Descent for Sparse optimisation on Measures

Presents FastPart, an algorithm leveraging SGD and Random Features for sparse optimization on measures. It provides rigorous mathematical proofs for its variational framework, demonstrating improved s...

Towards understanding Accelerated Stein Variational Gradient Flow -- Analysis of Generalized Bilinear Kernels for Gaussian target distributions

Analyzes accelerated Stein Variational Gradient Flow using generalized bilinear kernels for Gaussian targets. Investigates methods to improve speed and efficiency compared to standard SVGD, aiming for...

Sharp Convergence Rates of Empirical Unbalanced Optimal Transport for Spatio-Temporal Point Processes

Statistically analyzes empirical plug-in estimators for unbalanced optimal transport with Kantorovich-Rubinstein distance. Establishes sharp convergence rates for spatio-temporal point processes, adva...

A Framework for Supervised and Unsupervised Segmentation and Classification of Materials Microstructure Images

Proposes an automated framework integrating unsupervised and supervised learning for segmenting and classifying materials microstructure images. Aims to classify micrographs by phase and segment multi...

Bootstrapping the Cross-Validation Estimate

Proposes bootstrapping the cross-validation estimate to accurately quantify uncertainty. Addresses optimism bias in error estimates for prediction models, essential for complex statistical learning al...

POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation

Introduces POET, a framework for automated expansion of text-to-image generation. Supports prompting creativity and personalization by generating novel visuals that adhere to user specifications, enha...

An Empirical Study of Vulnerabilities in Python Packages and Their Detection

Empirically studies vulnerabilities in Python packages, considering their interaction with other languages. Investigates detection methods for inherent vulnerabilities and those arising from interoper...

📅

Thursday, September 4, 2025

Executive Briefing Bullets (20) JSON

OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

Introduces OneCAT, a pure decoder-only transformer for unified multimodal understanding and generation. Eliminates external vision components for efficiency, achieving significant gains especially for...

Mitigating Hallucination in Large Vision-Language Models through Aligning Attention Distribution to Information Flow

Introduces a method to mitigate hallucination in Large Vision-Language Models by aligning attention distribution to information flow. Analyzes LVLM attention mechanisms to emphasize visual information...

Real-Time Per-Garment Virtual Try-On with Temporal Consistency for Loose-Fitting Garments

Develops a real-time virtual try-on method for loose-fitting garments that maintains temporal consistency. Addresses limitations of body semantic maps for obscured contours and trains garment synthesi...

GS-TG: 3D Gaussian Splatting Accelerator with Tile Grouping for Reducing Redundant Sorting while Preserving Rasterization Efficiency

Introduces GS-TG, a tile-grouping-based accelerator for 3D Gaussian Splatting. Enhances rendering speed by reducing redundant sorting operations and preserving rasterization efficiency, addressing the...

Faster and Better: Reinforced Collaborative Distillation and Self-Learning for Infrared-Visible Image Fusion

Proposes a reinforced collaborative distillation and self-learning framework for infrared-visible image fusion. Achieves high-quality fusion with lightweight models by integrating reinforcement learni...

Proposes a novel framework for license plate super-resolution guided by embedding similarity. Combines pixel-based loss with embedding similarity learning (PECL) to address unique challenges and enhan...

Repurposing SAM for User-Defined Semantics Aware Segmentation

Proposes U-SAM to imbue the Segment Anything Model (SAM) with semantic awareness for user-defined segmentation. Enables targeted mask generation for specified object categories using only class names ...

Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering

Combines monocular depth estimation with multi-view data using differentiable rendering. Frames refinement as an analysis-by-synthesis optimization problem to lift and refine relative depth maps, impr...

Enhancing Diffusion Model Stability for Image Restoration via Gradient Management

Enhances diffusion model stability for image restoration through gradient management. Analyzes underlying gradient dynamics of denoising and likelihood guidance components to identify and address sign...

Comparing Next-Day Wildfire Predictability of MODIS and VIIRS Satellite Data

Evaluates the next-day wildfire predictability of MODIS and VIIRS satellite data. Compares their suitability for fire prediction by assessing how well their data forecasts wildfire spread, addressing ...

AstroClearNet: Deep image prior for multi-frame astronomical image restoration

Proposes AstroClearNet, a self-supervised multi-frame method using deep image priors for astronomical image restoration. Achieves denoising, deblurring, and co-adding from blurred observations, overco...

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Proposes a Universal Network for Identifying synthetic video content, addressing limitations of face-centric detectors. Detects manipulations from face-swapping to fully AI-generated videos, enabling ...

Grid-Reg: Detector-Free Gridized Feature Learning and Matching for Large-Scale SAR-Optical Image Registration

Proposes Grid-Reg, a detector-free framework for large-scale SAR-Optical image registration. Uses grid-based multimodal registration with a domain-robust descriptor network and a grid-based solver to ...

Point Cloud Recombination: Systematic Real Data Augmentation Using Robotic Targets for LiDAR Perception Validation

Presents Point Cloud Recombination for systematic real data augmentation using robotic targets. Addresses LiDAR perception validation challenges by combining physical sensor realism with controlled sc...

Beyond Feature Mapping GAP: Integrating Real HDRTV Priors for Superior SDRTV-to-HDRTV Conversion

Introduces a superior SDRTV-to-HDRTV conversion method by integrating real HDRTV priors. Addresses ill-posedness and generalization constraints of single-style mapping by leveraging generative approac...

Planning with Reasoning using Vision Language World Model

Introduces the Vision Language World Model (VLWM) for language-based world modeling on natural videos. Infers goal achievements and predicts action trajectories, enabling effective planning with seman...

Bridging the Domain Gap for Flight-Ready Spaceborne Vision

Presents Spacecraft Pose Network v3 (SPNv3) for monocular pose estimation of spacecraft. Designed for computational efficiency and robustness to spaceborne images, essential for deployment on space-gr...

Learning a Neural Association Network for Self-supervised Multi-Object Tracking

Introduces a self-supervised framework to learn data association for multi-object tracking. Uses an EM algorithm to train a neural network, overcoming the need for tedious identity-level annotations.

ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality

Proposes ViDDAR, a VLM-based framework for detecting task-detrimental virtual content in AR. Identifies obstruction and information manipulation attacks that impair user task performance and real-worl...

Empowering Lightweight MLLMs with Reasoning via Long CoT SFT

Investigates using long Chain-of-Thought (CoT) data for Supervised Fine-Tuning (SFT) to enhance reasoning in lightweight Multimodal Large Language Models (MLLMs). Demonstrates significant improvement in MLL...

📅

Wednesday, September 3, 2025

Executive Briefing Bullets (20) JSON

Beyond the Kolmogorov Barrier: A Learnable Weighted Hybrid Autoencoder for Model Order Reduction

Proposes a learnable weighted hybrid autoencoder to address poor convergence in high-rank latent spaces for model order reduction. Demonstrates improved performance in learning low-dimensional intrins...

Variance-reduced first-order methods for deterministically constrained stochastic nonconvex optimization with strong convergence guarantees

Studies deterministically constrained stochastic optimization problems, proposing variance-reduced first-order methods. Aims to satisfy constraints with certainty, addressing limitations of existing m...

Gradient-free stochastic optimization for additive models

Addresses zero-order optimization for additive models with noisy observations, assuming Polyak-Lojasiewicz or strong convexity and higher-order smoothness. Proposes gradient-free methods for nonparame...

Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models

Revisits the view that overparameterized diffusion models memorize training data, showing generalization in natural domains is possible with early stopping. Challenges the notion that larger models in...

Combining Evidence Across Filtrations

Proposes a method for combining e-processes constructed in different filtrations for anytime-valid inference. Addresses the challenge of combining e-processes across different filtrations, which is no...

Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data

Introduces an architecture-specific randomized training algorithm to bridge the gap between theoretical approximation theorems and practical training with noisy data. Constructs uniform approximations...

The Nondecreasing Rank

Introduces the notion of nondecreasing (ND) rank for tensors, representing them as sums of outer products with monotonicity constraints. Shows equivalence to nonnegative rank factorization for certain...

Probabilities of Causation and Root Cause Analysis with Quasi-Markovian Models

Introduces algorithmic simplifications to reduce computational complexity for probabilities of causation and latent confounding. Proposes a novel framework for Root Cause Analysis using these causal m...

WeSpeR: Computing non-linear shrinkage formulas for the weighted sample covariance

Introduces the WeSpeR algorithm to compute non-linear shrinkage formulas for weighted sample covariance in high dimensions. Significantly speeds up non-linear shrinkage for dimensions over 1000, with ...

A Generalization Theory for Zero-Shot Prediction

Presents a theoretical framework for zero-shot prediction, analyzing foundation models trained with self-supervised and multimodal contrastive learning. Identifies target quantities for zero-shot pred...

Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster

Shows Armijo line-search can make (stochastic) gradient descent provably faster by adapting to local smoothness without needing the global constant. Strengthens existing results and demonstrates const...

The Complexity of Learning Sparse Superposed Features with Feedback

Investigates efficient retrieval of latent features from deep networks using triplet comparisons as feedback. Explores whether learned features, like dictionaries or covariance matrices, can be effici...

Leveraging Offline Data in Linear Latent Contextual Bandits

Designs end-to-end latent bandit algorithms capable of handling uncountably many latent states for offline data leverage. Focuses on linear latent contextual bandits for accelerated online sequential ...

Memory Capacity of Nonlinear Recurrent Networks: Is it Informative?

Shows that the memory capacity (MC) of random nonlinear RNNs can yield arbitrary values, questioning its informativeness. Contrasts this with linear RNNs where MC equals the Kalman controllability mat...

Learning in complex action spaces without policy gradients

Investigates learning in complex action spaces without policy gradients, hypothesizing reasons for policy gradients' apparent superiority. Explores why computational applicability and performance dive...

Two-Sided Nearest Neighbors: An adaptive and minimax optimal procedure for matrix completion

Analyzes Nearest Neighbor algorithms for matrix completion with non-smooth nonlinear functions and high missingness. Proposes an adaptive and minimax optimal procedure, 'Two-Sided Nearest Neighbors', ...

Feature Augmentations for High-Dimensional Learning

Proposes a simple technique to enhance supervised learning by augmenting features with factors extracted from design matrices and their transformations. Addresses over-parametrization and need for fas...

Identifying Causal Direction via Dense Functional Classes

Addresses causal direction identification between two variables assuming no hidden confounders. Proposes a bivariate causal score based on MDL principle using functions with density property on a comp...

📅

Monday, September 1, 2025

Executive Briefing Bullets (20) JSON

Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling

Proposes a unified framework for solving inverse problems using diffusion posterior sampling, demonstrating that existing approximations are insufficient or inefficient. Addresses limitations by offer...

Visual Imitation Enables Contextual Humanoid Control

Introduces VIDEOMIMIC, a real-to-sim-to-real pipeline that reconstructs humans and environments from videos to produce whole-body control policies for humanoid robots, enabling them to perform skills ...

JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model

Presents JambaTalk, a hybrid Transformer-Mamba model for speech-driven 3D talking head generation, aiming to achieve equivalence across lip-sync, facial expressions, and head pose generation metrics.

PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation

Introduces PicoPose, a framework for RGB-based novel object pose estimation using a three-stage pixel-to-pixel correspondence learning process to tackle zero-shot generalization challenges in robotic ...

Scale-GS: Efficient Scalable Gaussian Splatting via Redundancy-filtering Training on Streaming Content

Presents Scale-GS, a scalable Gaussian Splatting framework for efficient training in streaming tasks. Organizes Gaussian spheres hierarchically by scale to improve efficiency for dynamic scenes.

Video-LLMs with Temporal Visual Screening

Proposes Temporal Visual Screening (TVS) for Video Large Language Models, inspired by human screening behavior. Aims to improve fine-grained temporal semantics capture by pre-processing videos univers...

BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models

Introduces BASE-Q, a quantization technique for LLMs that enhances rotational quantization with bias and asymmetric scaling. Addresses limitations of existing methods regarding training overhead and m...

Federated Fine-tuning of SAM-Med3D for MRI-based Dementia Classification

Systematically evaluates design choices for federated fine-tuning of foundation models for MRI-based dementia classification. Assesses impact on performance and efficiency using brain MRI data across ...

Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging

Introduces Temporal Flow Matching for learning spatio-temporal trajectories in 4D medical imaging. It enables fine-grained spatial predictions and understanding of temporal dynamics, advancing applica...

Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

Proposes a counterfactual evaluation framework to assess the ability of Automatic Reviewer Generators (ARGs) to detect faulty research logic. Demonstrates that ARGs fail to detect faulty reasoning in research pap...

TrueGL: A Truthful, Reliable, and Unified Engine for Grounded Learning in Full-Stack Search

Introduces TrueGL, a model for trustworthy search results with clear reliability indicators. It addresses the need for AI systems to evaluate information credibility and justify assessments, aiming to...

Granite Embedding R2 Models

Introduces the Granite Embedding R2 models, English encoder-based embedding models for enterprise-scale dense retrieval. Features 16x expanded context length and state-of-the-art performance across di...

Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

Introduces Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes from diverse LiDAR configurations, aiming to address limitations in existing datasets for c...

Rethinking Layer-wise Model Merging through Chain of Merges

Proposes a novel approach for merging fine-tuned models by considering inter-layer dependencies through a chain of merges, addressing limitations of existing layer-wise merging techniques.

BrainGPT: Unleashing the Potential of EEG Generalist Foundation Model by Autoregressive Pre-training

Introduces BrainGPT, the first generalist EEG foundation model using autoregressive pre-training. Aims to address limitations in versatile EEG model exploration due to diverse data formats and outdated p...

Maybe you don't need a U-Net: convolutional feature upsampling for materials micrograph segmentation

Proposes a convolutional neural network to upsample features for materials micrograph segmentation. It offers an alternative to U-Nets, aiming to improve the representation of fine features and handle...

ECHO: Ego-Centric modeling of Human-Object interactions

Introduces ECHO, a unified framework for ego-centric modeling of human-object interactions from head and wrist tracking. It recovers human pose, object motion, and interaction semantics, important for...

How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

Conducts a comparative study of off-the-shelf VLMs (BLIP-2, InstructBLIP, LLaVA-1.5) on spatial reasoning in urban scenes. It evaluates zero-shot performance and fine-tuning effects with a synthetic V...

From Drone Imagery to Livability Mapping: AI-powered Environment Perception in Rural China

Develops an AI-powered approach for rural livability mapping using drone imagery. Addresses limitations of questionnaire-based and urban-oriented methods by adapting visual perception for rural contex...

Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer

Evaluates multimodal recurrence prediction in ccRCC by integrating CT and histopathology whole-slide images. A modular deep learning framework is proposed to improve personalized risk estimation beyon...

📅

Friday, August 29, 2025

Executive Briefing Bullets (20) JSON

Pixel Motion as Universal Representation for Robot Control

Introduces LangToMo, a vision-language-action framework using pixel motion forecasts as intermediate representations. A diffusion model generates text-conditioned motion sequences for robot control, e...

TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

Proposes TAG-WM for tamper-aware generative image watermarking using diffusion inversion sensitivity. Addresses copyright and authenticity risks of AI-generated content by enhancing watermark robustne...

First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge

Presents the winning solution to the NeurIPS 2024 Invisible Watermark Removal challenge, stress-testing watermark robustness under varying adversary knowledge. Addresses black-box and beige-box tracks...

ZIM: Zero-Shot Image Matting for Anything

Proposes ZIM, a zero-shot image matting model that addresses limitations of segmentation models in generating fine-grained masks. Develops a label converter and constructs a new dataset for matte labe...

Language-to-Space Programming for Training-Free 3D Visual Grounding

Introduces a training-free approach for 3D visual grounding using Language-to-Space programming. Addresses challenges of scarce data and high annotation costs in 3D vision-language datasets.

Prediction of Distant Metastasis for Head and Neck Cancer Patients Using Multi-Modal Tumor and Peritumoral Feature Fusion Network

Develops a deep learning multimodal framework integrating CT images, radiomics, and clinical data to predict metastasis risk in HNSCC patients. Aims to optimize treatment strategies and prognosis.

Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification in MIDOG 2025

Fine-tunes DINOv3 using low-rank adaptation for atypical mitotic figure classification in medical imaging. Achieves efficient training by adapting only ~1.3M parameters, focusing on the MIDOG 2025 cha...
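
As a rough illustration of why low-rank adaptation keeps the trainable parameter count small, the sketch below counts the parameters a rank-r adapter adds to a single weight matrix. The layer dimensions and rank are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by a rank-`rank` adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative numbers only (assumed, not from the paper): a 768 x 768 projection, rank 8.
dense = full_param_count(768, 768)
adapter = lora_param_count(768, 768, 8)
print(f"dense={dense}, adapter={adapter}, ratio={adapter / dense:.4f}")
```

Summing the adapter count over every adapted layer gives the total number of trainable parameters; which layers are adapted is a design choice this sketch does not cover.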

A multimodal dataset for understanding the impact of mobile phones on remote online virtual education

Presents the IMPROVE dataset, a multimodal resource with behavioral, biometric, and academic data to evaluate mobile phone impact on online education. Includes data from 120 learners across three phon...

T-Stars-Poster: A Framework for Product-Centric Advertising Image Design

Proposes T-Stars-Poster, a product-centric framework for automated advertising image design. Uses product information like foreground images and taglines to generate advertising visuals in four sequen...

GeoTexBuild: 3D Building Model Generation from Map Footprints

Introduces GeoTexBuild, a modular framework for generating 3D building models from map footprints. Employs height map generation, geometry reconstruction, and appearance stylization for detailed model...

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

Presents LoTUS, a machine unlearning method that smooths prediction probabilities to eliminate training sample influence without retraining. Evaluated on Transformer and ResNet models, it mitigates da...

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Introduces OneReward, a unified reinforcement learning framework for multi-task image generation using a single reward model. Enhances generative capabilities across tasks under different evaluation c...

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Introduces VLMEvalKit, an open-source toolkit for evaluating large multi-modality models. Implements over 200 models and 80 benchmarks, providing a user-friendly framework for reproducible evaluation ...

A Machine Learning Approach to Volumetric Computations of Solid Pulmonary Nodules

Proposes an advanced framework combining a multi-scale 3D CNN with subtype-specific bias correction for precise pulmonary nodule volume estimation. Addresses limitations of traditional methods in CT s...

Efficient and Privacy-Protecting Background Removal for 2D Video Streaming using iPhone 15 Pro Max LiDAR

Integrates iPhone 15 Pro Max LiDAR and cameras for efficient, privacy-preserving background removal in 2D video streaming. Leverages depth information independent of lighting, outperforming traditiona...

GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction

Proposes GENRE-CMR, a GAN-based architecture using residual deep unrolled reconstruction for enhanced fidelity and generalization in accelerated Cardiac MRI. Addresses trade-offs between scan time and...

SMARTe-VR: Student Monitoring and Adaptive Response Technology for e-Learning in Virtual Reality

Introduces SMARTe-VR, a platform for student monitoring in VR e-learning using facial biometrics and learning metadata. Enables adaptive learning sessions with features like AutoQA and interaction too...

HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation

Introduces Hierarchical Scene Motifs (HSM), a framework for indoor 3D scene generation that synthesizes dense object arrangements across multiple scales. Addresses limitations of existing methods in p...

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

Develops WikiAutoGen for multi-modal Wikipedia-style article generation. Integrates multimodal content retrieval and synthesis, addressing limitations of text-only generation methods for enhanced info...

Leadership Assessment in Pediatric Intensive Care Unit Team Training

Develops an automated analysis framework using egocentric vision for leadership assessment in PICU team training. Identifies cues like fixation object, eye contact, and conversation patterns from Aria...

📅

Thursday, August 28, 2025

Executive Briefing Bullets (20) JSON

Eigenvalue distribution of the Neural Tangent Kernel in the quadratic scaling

Computes the asymptotic eigenvalue distribution of the Neural Tangent Kernel for two-layer neural networks under quadratic scaling. Analyzes the behavior of NTK matrices with specific dimension scalin...

DVM-SLAM: Decentralized Visual Monocular Simultaneous Localization and Mapping for Multi-Agent Systems

Presents DVM-SLAM, the first open-source decentralized monocular C-SLAM system for multi-agent cooperative mapping. Enhances robustness, scalability, and accuracy by sharing information between agents...

MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome

Proposes MTS-Net, an end-to-end 3D deep learning framework for May-Thurner Syndrome diagnosis using CT volumes. Employs dual-enhanced positional multi-head self-attention to capture spatial-temporal p...

Reduced-Order Modeling of Cyclo-Stationary Time Series Using Score-Based Generative Methods

Presents a data-driven method using score-based generative modeling for reduced-order models of cyclo-stationary time series. Accurately reproduces statistical properties and temporal correlations, en...

Segmentation Assisted Incremental Test Time Adaptation in an Open World

Addresses Incremental Test Time Adaptation for Vision-Language Models in open worlds with unseen classes and domains. Uses segmentation assistance to improve generalization capabilities when encounter...

PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence

Proposes PAUL, an uncertainty-guided framework for robust cross-view geo-localization under noisy correspondence. Uses partitioning and augmentation to handle real-world alignment imperfections, impro...

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Proposes AudioStory, a unified framework integrating LLMs with Text-to-Audio systems for structured, long-form audio narratives. Addresses temporal coherence and compositional reasoning challenges in ...

Variational Bayes image restoration with compressive autoencoders

Exploits neural networks for data-driven regularizers in inverse problems via Variational Bayes image restoration. Uses compressive autoencoders for regularization, offering an alternative Bayesian ap...

REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

Presents REPARO, a novel approach for compositional 3D asset generation from single images. Optimizes 3D mesh layout using differentiable rendering to address challenges in scenes with multiple object...

DiffArtist: Towards Structure and Appearance Controllable Image Stylization

Introduces DiffArtist, the first 2D stylization method offering simultaneous control over structure and appearance style strength. Addresses the gap in neural stylization by focusing on both structura...

TAGS: 3D Tumor-Adaptive Guidance for SAM

Proposes TAGS, a 3D tumor-adaptive guidance framework for SAM to address the domain gap in 3D medical imaging. Adapts foundation models to capture 3D anatomical context, improving tumor segmentation u...

Analysis and Synthesis Denoisers for Forward-Backward Plug-and-Play Algorithms

Studies the forward-backward algorithm with sub-iterative denoisers in a Plug-and-Play fashion. Analyzes analysis and synthesis Gaussian denoisers within a dictionary framework, examining minimization...

Neural Conditional Simulation for Complex Spatial Processes

Introduces Neural Conditional Simulation (NCS), a general method for spatial conditional simulation. Enables spatial prediction and uncertainty quantification by simulating from predictive distributio...

Deep Learning in Mild Cognitive Impairment Diagnosis using Eye Movements and Image Content in Visual Memory Tasks

Utilizes digital cognitive tasks with eye-tracking data and deep learning (VTNet) to distinguish Mild Cognitive Impairment from healthy controls. Correlates eye movements and image content in visual m...

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

Introduces OpenM3D, an open-vocabulary multi-view 3D object detector trained without human annotations. Adapts 2D-induced voxel features and uses a class-agnostic 3D localization loss for OV detection...

Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images

Introduces Seam360GS, a novel calibration framework incorporating a dual-fisheye camera model into 3D Gaussian splatting. Achieves seamless 360-degree visual content generation from real-world omnidir...

Bridging Domain Gaps for Fine-Grained Moth Classification Through Expert-Informed Adaptation and Foundation Model Priors

Proposes a lightweight classification approach for fine-grained moth identification by combining expert-labeled field data with knowledge distillation from a foundation model. Bridges domain gaps for ...

Latent space configuration for improved generalization in supervised autoencoder neural networks

Proposes two methods for latent space configuration to obtain desired topology in autoencoders. Improves generalization in supervised autoencoder neural networks by controlling latent space properties...

TraceNet: Segment one thing efficiently

Proposes TraceNet for efficient single instance segmentation on mobile devices. Addresses computational constraints by optimizing instance segmentation for mobile imaging applications, enabling captur...

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Proposes ReCLIP++, a method to rectify unexpected bias in CLIP for unsupervised semantic segmentation. Explicitly models and rectifies class-preference and space-preference biases to enhance segmentat...

📅

Wednesday, August 27, 2025

Executive Briefing Bullets (20) JSON

How many samples are needed to train a deep neural network?

Investigates sample size requirements for training ReLU feed-forward neural networks. Theoretically and empirically shows generalization error scales at $1/\sqrt{n}$, not $1/n$, underpinning practical...
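
To make the practical difference between the two rates concrete, the following minimal sketch computes how many samples are needed to shrink an error bound by a given factor under each rate. It assumes the bound has the simple form C/sqrt(n) or C/n with a fixed constant C; the function and argument names are hypothetical.

```python
import math


def samples_needed(n_current: int, error_reduction: float, rate: str) -> int:
    """Samples required to shrink the error bound by `error_reduction` (2.0 halves it).

    rate="inv_sqrt": error ~ C / sqrt(n), so n must grow by error_reduction ** 2.
    rate="inv":      error ~ C / n,       so n must grow by error_reduction.
    """
    if rate == "inv_sqrt":
        factor = error_reduction ** 2
    elif rate == "inv":
        factor = error_reduction
    else:
        raise ValueError(f"unknown rate: {rate}")
    return math.ceil(n_current * factor)


print(samples_needed(10_000, 2.0, "inv_sqrt"))  # 4x the data to halve the error
print(samples_needed(10_000, 2.0, "inv"))       # 2x the data to halve the error
```

Under the $1/\sqrt{n}$ rate, halving the error requires four times as much data, compared with twice as much under $1/n$.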

Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal

Develops Deshadow-Anything by fine-tuning Segment Anything Model (SAM) for zero-shot shadow removal. Addresses SAM's challenges with shadows, leveraging diffusion models for improved image shadow remo...

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Introduces ZoomEye to enhance Multimodal LLMs with human-like zooming via tree-based image exploration. Enables LLMs to perform visual reasoning by dynamically scaling visual inputs during analysis.

ForgetMe: Evaluating Selective Forgetting in Generative Models

Proposes an Automatic Dataset Creation Framework for selective forgetting in diffusion models. Evaluates methods to remove sensitive information while preserving non-sensitive regions' consistency.

FUSELOC: Fusing Global and Local Descriptors to Disambiguate 2D-3D Matching in Visual Localization

Proposes FUSELOC, fusing global and local descriptors for visual localization. Uses a weighted average operator to disambiguate 2D-3D matching, improving accuracy while maintaining low memory requirem...
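
A minimal sketch of what a weighted-average fusion of two descriptors could look like is given below. It assumes both descriptors already share the same dimensionality and that the result is L2-normalized; the weight `w` and the function names are hypothetical, and the paper's actual operator may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_descriptors(local_desc, global_desc, w):
    """Weighted average of a local and a global descriptor, then L2-normalized.

    w is the weight on the global descriptor (assumed to lie in [0, 1]).
    """
    if len(local_desc) != len(global_desc):
        raise ValueError("descriptors must have the same dimensionality")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    fused = [w * g + (1.0 - w) * l for l, g in zip(local_desc, global_desc)]
    return l2_normalize(fused)


fused = fuse_descriptors([1.0, 0.0], [0.0, 1.0], w=0.5)
print(fused)
```

Because the fused vector has the same size as each input, the memory footprint per descriptor does not grow, which is consistent with the low-memory claim in the summary.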

StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

Introduces StreetCrafter, a controllable video diffusion model for street view synthesis. Utilizes LiDAR point clouds as conditioning to achieve photorealistic view synthesis from vehicle sensor data.

PromptGAR: Flexible Promptive Group Activity Recognition

Presents PromptGAR for flexible group activity recognition with high accuracy. Bridges the gap in real-world applicability by offering input flexibility across prompts, frames, and instances without a...

LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Proposes LATex to leverage attribute-based text knowledge for Aerial-Ground Person Re-ID. Integrates semantic information from person attributes, improving feature extraction for cross-view person ret...

PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition

Introduces PhysioSync for EEG-based emotion recognition, inspired by physiological synchronization. Employs temporal and cross-modal contrastive learning, addressing noise and individual variability i...

Generative Data Augmentation for Object Point Cloud Segmentation

Explores generative data augmentation using denoising diffusion models for 3D point cloud segmentation. Generates realistic novel point clouds to enrich data diversity and improve model performance be...

Lightweight posterior construction for gravitational-wave catalogs with the Kolmogorov-Arnold network

Applies Kolmogorov-Arnold Networks (KANs) for neural density estimation in gravitational-wave data analysis. Proposes KANs for efficient and interpretable posterior construction in GW catalogs, enhanc...

Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Comprehensively evaluates SAM and SAM 2 using diverse prompts for context-dependent concepts. Analyzes their performance across various scenes, providing insights for future Segment Anything Model dev...

Meta-Learned Modality-Weighted Knowledge Distillation for Robust Multi-Modal Learning with Missing Data

Proposes Meta-learned Modality-weighted Knowledge Distillation (MetaKD) for robust multi-modal learning with missing data. Adaptively weights modalities via meta-learning, maintaining accuracy even wh...

RAFT: Robust Augmentation of FeaTures for Image Segmentation

Introduces RAFT for robust augmentation of features in image segmentation. Addresses the Syn2Real gap by generating synthetic data that improves model performance on real-world deployments.

MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields

Introduces MCGS to enhance multiview consistency for sparse-view 3D Gaussian Radiance Fields. Addresses suboptimal performance with sparse views by incorporating inherent multiview consistency.

Single-Domain Generalized Object Detection by Balancing Domain Diversity and Invariance

Proposes balancing domain diversity and invariance for single-domain generalized object detection. Addresses loss of domain-specific information in invariance-driven strategies, improving cross-domain...

Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness

Introduces Project-Probe-Aggregate (PPA) for parameter-efficient fine-tuning of foundation models. Enhances group robustness without relying on group annotations by improving failure-based debiasing c...

WMKA-Net: A Weighted Multi-Kernel Attention Network for Retinal Vessel Segmentation

Proposes WMKA-Net with a Reversible Multi-Scale Fusion Module for retinal vessel segmentation. Addresses feature fusion, contextual continuity, and noise interference using adaptive convolution and at...

MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection

Proposes MonoCoP, a Chain-of-Prediction framework for monocular 3D object detection. Improves depth prediction by conditioning on other inter-correlated 3D attributes, addressing inherent depth estima...

Egocentric Human-Object Interaction Detection: A New Benchmark and Method

Introduces Ego-HOIBench and a new method for egocentric human-object interaction detection. Addresses challenges like hand-object occlusion from a first-person perspective in real-world scenarios.

📅

Tuesday, August 26, 2025

Executive Briefing Bullets (19) JSON

FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion

Introduces FaceCrafter, an identity-conditional diffusion model with disentangled control over facial pose, expression, and emotion. Achieves high-fidelity face synthesis while allowing fine-grained m...

AnimateAnywhere: Rouse the Background in Human Image Animation

Introduces AnimateAnywhere, a human image animation method that animates both foreground characters and backgrounds. Addresses static or inharmonious background generation, enabling more realistic and...

Boosting Temporal Sentence Grounding via Causal Inference

Proposes boosting Temporal Sentence Grounding (TSG) via causal inference to address spurious correlations. Achieves improved accuracy in identifying relevant video moments by mitigating biases from te...

PainFormer: a Vision Foundation Model for Automatic Pain Assessment

Introduces PainFormer, a vision foundation model for automatic pain assessment. Utilizes multi-task learning to provide continuous monitoring and support decision-making in pain management, aiming to ...

Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks

Examines style transfer's impact on semantic segmentation, showing it reduces texture bias and improves robustness. Demonstrates that applying style transfer techniques can enhance generalization capa...

On the attainment of the Wasserstein-Cramér-Rao lower bound

Investigates conditions for achieving the Wasserstein-Cramér-Rao lower bound, defining Wasserstein efficiency. Shows a condition under which estimators attain this bound, providing theoretical insig...

LEL: A Novel Lipschitz Continuity-constrained Ensemble Learning Model for EEG-based Emotion Recognition

Introduces LEL, a Lipschitz continuity-constrained ensemble learning model for EEG-based emotion recognition. Enhances model stability, accuracy in high-dimensional signals, and robustness against var...

BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

Proposes BoxFusion, a reconstruction-free framework for open-vocabulary 3D object detection. Achieves real-time performance via multi-view box fusion, addressing computational overhead and memory cons...

VIN-NBV: A View Introspection Network for Next-Best-View Selection

Introduces VIN-NBV, a view introspection network for Next-Best-View (NBV) selection. Trains an acquisition policy to directly optimize reconstruction quality rather than coverage, improving scene acqu...

VFOG: Variance-Reduced Fast Optimistic Gradient Methods for a Class of Nonmonotone Generalized Equations

Develops VFOG, variance-reduced optimistic gradient methods for nonmonotone generalized equations. Combines Nesterov acceleration and variance reduction, achieving $O(1/k^2)$ convergence rates for data-...

ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

Proposes ANT, an adaptive neural temporal-aware text-to-motion model that addresses temporal-frequency demands in diffusion models. Achieves improved motion foundations and text alignment by adapting ...

Learning Heterogeneous Mixture of Scene Experts for Large-scale Neural Radiance Fields

Presents Switch-NeRF++, a heterogeneous mixture of hash experts for large-scale NeRFs. Addresses learnable decomposition, scene heterogeneity, and modeling efficiency, enabling highly scalable and rob...

AffordanceSAM: Segment Anything Once More in Affordance Grounding

Proposes AffordanceSAM, leveraging Segment Anything Model for affordance grounding. Enables generalized affordance recognition by segmenting actionable regions, addressing limitations in supervised le...

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Introduces GigaTok, the first approach to scale visual tokenizers to 3 billion parameters for autoregressive image generation. Simultaneously improves image reconstruction and generation quality, addr...

Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition

Proposes language-guided action anatomy for few-shot action recognition, exploiting text to enhance understanding of subtle variations. Achieves improved recognition with limited data by incorporating...

Mesh-Learner: Texturing Mesh with Spherical Harmonics

Presents Mesh-Learner, a 3D reconstruction and rendering framework texturing meshes with Spherical Harmonics. Learns view-dependent radiance end-to-end within rasterization pipelines, enabling native ...

From Partial Exchangeability to Predictive Probability: A Bayesian Perspective on Classification

Proposes a Bayesian nonparametric classification model combining Gaussian and Dirichlet process priors. Extends de Finetti representation and Ferguson's construction, allowing flexible uncertainty mod...

Using Visual Anomaly Detection for Task Execution Monitoring

Learns to predict robot motions during nominal task execution to detect visual anomalies for execution monitoring. Uses a probabilistic U-Net architecture to predict optical flow, enabling robots to i...

WetCat: Enabling Automated Skill Assessment in Wet-Lab Cataract Surgery Videos

Enables automated skill assessment in wet-lab cataract surgery videos using computer vision. Enhances efficiency and objectivity of surgical education by moving beyond manual performance evaluations, ...

📅

Monday, August 25, 2025

Executive Briefing Bullets (20) JSON

Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Proposes dual visual-text alignment for zero-shot skeleton-based action recognition, enabling models to adapt to new, unseen actions dynamically by aligning visual features with semantic text represen...

Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Introduces the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset for autonomous driving research, capturing accurate 3D trajectory data.

NeuroKoop: Neural Koopman Fusion of Structural-Functional Connectomes for Identifying Prenatal Drug Exposure in Adolescents

Proposes NeuroKoop, a neural Koopman fusion of structural-functional connectomes, to better capture complementary features in neuroimaging data for identifying prenatal drug exposure effects.

MambaIC: State Space Models for High-Performance Learned Image Compression

Leverages state space models (SSMs) for high-performance learned image compression, addressing computational inefficiency and improving redundancy modeling by capturing long-range dependencies.

Mean-Field Generalisation Bounds for Learning Controls in Stochastic Environments

Exploits mean-field interpretation and dynamic programming to formulate stochastic control problems as infinite-dimensional minimizations, providing generalization bounds.

Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables

Proposes an image enhancement method decomposing spatial-aware lookup tables to achieve lightweight and fast real-time performance while retaining spatial information.

EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception

Proposes EHGCN, a hierarchical Euclidean-Hyperbolic fusion via motion-aware GCN, to capture long-range dependencies and hierarchical structures in event stream perception.

Explicit Correspondence Matching for Generalizable Neural Radiance Fields

Presents a generalizable NeRF method using explicit correspondence matching to provide geometry prior for novel view synthesis with as few as two source views.

A Novel Dataset for Video-Based Neurodivergent Classification Leveraging Extra-Stimulatory Behavior

Introduces a novel dataset for video-based neurodivergent classification, leveraging extra-stimulatory behavior to improve productivity and understanding of these behaviors.

Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Proposes cascaded multi-scale attention (CMSA) for CNN-ViT hybrid architectures to effectively extract and interact with multi-scale features from low-resolution images.

VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR

Introduces VIBE for evaluating video-to-text summarization, addressing verbose outputs from current models by focusing on information bottleneck evaluation for concise TL;DR generation.

Towards Diagnostic Quality Flat-Panel Detector CT Imaging Using Diffusion Models

Explores using diffusion models to enhance flat-panel detector CT imaging quality, aiming for diagnostic quality comparable to multi-detector CT for improved patient management.

Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance

Proposes a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLE) incorporating geometric information and depth guidance to improve low-light image and video enhancement.

Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing

Proposes an adaptive multi-order graph regularized NMF method (MOGNMF) for hyperspectral unmixing, capturing intrinsic data structures and requiring less manual parameter tuning.

Self-Validated Learning for Particle Separation: A Correctness-Based Self-Training Framework Without Human Labels

Introduces Self-Validated Learning, a correctness-based self-training framework without human labels, for accurate particle instance segmentation in tomographic data.

Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation

Investigates using rear cameras for egocentric 3D human pose estimation, addressing self-occlusion and limited field-of-view coverage issues with frontal cameras.

Improving U-Net Confidence on TEM Image Data with L2-Regularization, Transfer Learning, and Deep Fine-Tuning

Improves U-Net confidence on TEM image data for nanoscale defect identification using L2-regularization, transfer learning, and deep fine-tuning to handle data variations.

Localized Gaussian Splatting Editing with Contextual Awareness

Introduces an illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting (3DGS) that considers background illumination mismatches for object insertion/replacement.

Efficient Density Control for 3D Gaussian Splatting

Proposes efficient density control for 3D Gaussian Splatting by improving clone and split operations to enhance optimization speed and detail recovery.

Review of Demographic Fairness in Face Recognition

Reviews demographic fairness in face recognition, discussing disparities across groups, ethical concerns, and the impact on system credibility and reliability.

📅

Saturday, August 23, 2025

Executive Briefing Bullets (2) JSON

Machine learning enhanced expert system for detecting heart failure decompensation using patient reported vitals and electronic health records

Introduces a machine learning enhanced expert system for detecting heart failure decompensation. It utilizes patient-reported vitals and electronic health records to provide early detection, aiming to...

SpaIM: single-cell spatial transcriptomics imputation via style transfer

Proposes SpaIM, a novel method for single-cell spatial transcriptomics imputation using style transfer techniques. This approach aims to fill in missing gene expression data, thereby improving the acc...

📅

Friday, August 22, 2025

Executive Briefing Bullets (20) JSON

One-shot Entropy Minimization

Proposes one-shot entropy minimization for LLMs, requiring only one unlabeled data point and 10 optimization steps. Achieves performance comparable to or exceeding methods using thousands of data poin...
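
The objective being minimized can be illustrated with a small sketch, assuming it is the Shannon entropy of the model's softmax output distribution. This shows only the quantity, not the paper's training procedure, and the logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = entropy(softmax([0.0, 0.0, 0.0, 0.0]))  # maximal entropy, log(4)
peaked = entropy(softmax([4.0, 0.0, 0.0, 0.0]))   # more confident, lower entropy
print(uniform, peaked)
```

Minimizing this quantity pushes the output distribution toward confident, peaked predictions, which is why no labels are needed to compute it.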

Vulnerabilities in AI-generated Image Detection: The Challenge of Adversarial Attacks

Examines the vulnerability of AI-generated image detectors to adversarial attacks. Investigates systematic understanding of robustness and proposes methods to address identified weaknesses, crucial fo...

Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer

Presents the Hadamard Attention Recurrent Stereo Transformer (HART) to overcome attention mechanism bottlenecks. Introduces a Dense Attention Kernel for improved nonlinear expressivity and robustness ...

TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos

Introduces TrackID3x3, a dataset and algorithm for multi-player tracking, identification, and pose estimation in basketball videos. Addresses limitations of existing sports analytics datasets for fixe...

Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model

Proposes a molecular-empowered All-in-SAM model for fine-grained multi-class nuclei segmentation in computational pathology. Addresses challenges faced by general foundation models in capturing fine-g...

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Introduces 3DGS-LM, accelerating 3D Gaussian Splatting reconstruction by replacing ADAM with a tailored Levenberg-Marquardt optimizer. Reduces optimization time from hours to minutes, enabling faster ...

An Empirical Study on How Video-LLMs Answer Video Questions

Conducts an empirical study on how Video-LLMs answer video questions using attention knockouts. Analyzes internal mechanisms and designs variants to interpret existing VideoLLMs' question-answering st...

ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors

Introduces ExtraGS, a framework for trajectory extrapolation integrating geometric and generative priors. Addresses poor geometric consistency and over-smoothed renderings by unifying priors for drivi...

High-Frequency First: A Two-Stage Approach for Improving Image INR

Presents a two-stage approach, High-Frequency First, to improve Implicit Neural Representations (INRs). Addresses spectral bias by capturing high-frequency details like edges and textures, enhancing i...

When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Introduces Grounded VideoLLM, a diffusion-grounded VideoLLM with entity-aware segmentation for long video understanding. Improves temporal perception, frame continuity, and language-vision alignment w...

Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Introduces PKR-QA, a benchmark for procedural knowledge reasoning question answering, built using a procedural knowledge graph. Enriches commonsense knowledge and structured reasoning for video unders...

TripleMixer: A 3D Point Cloud Denoising Model for Adverse Weather

Proposes TripleMixer, a robust 3D point cloud denoising network for adverse weather using spatial, frequency, and channel-wise processing. Effectively suppresses noise while preserving geometric struc...

Learning Motion Blur Robust Vision Transformers for Real-Time UAV Tracking

Develops motion blur robust Vision Transformers for real-time UAV tracking, addressing challenges of high-speed movement and blur. Improves performance of trackers in demanding aerial surveillance sce...

Understanding Co-speech Gestures in-the-wild

Proposes a new framework for understanding co-speech gestures in the wild, introducing three tasks and benchmarks for gesture-speech-text association. Learns a tri-modal representation for improved no...

Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework

Presents an SfM-free 3D Gaussian Splatting framework to enhance novel view synthesis from extremely sparse views. Addresses degraded rendering quality when Structure-from-Motion fails due to sparse in...

Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production

Proposes a hybrid autoregressive-diffusion model for real-time streaming sign language production. Addresses limitations of autoregressive methods regarding error accumulation and diffusion models' st...

Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion

Proposes Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion. Addresses modality misalignment, detail destruction, and task-specific limitations to enhance image quality and do...

D3FNet: A Differential Attention Fusion Network for Fine-Grained Road Structure Extraction in Remote Perception Systems

Proposes D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network for fine-grained road structure extraction. Addresses challenges of narrow roads, fragmentation, and occlusions in remote s...

Fast globally optimal Truncated Least Squares point cloud registration with fixed rotation axis

Proposes a novel linear-time convex relaxation and contractor for fast, globally optimal truncated least squares point cloud registration. Addresses scalability challenges of previous provably optimal...

MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction

Introduces MapKD, unlocking prior knowledge with cross-modal distillation for efficient online HD map construction. Addresses reliance on stale offline maps and sensor suites, reducing inference overh...

📅

Thursday, August 21, 2025

Executive Briefing Bullets (20) JSON

Non-asymptotic bounds for forward processes in denoising diffusions: Ornstein-Uhlenbeck is hard to beat

Develops non-asymptotic bounds for denoising diffusion probabilistic models, making minimal assumptions on the data distribution. Establishes theoretical understanding of error bounds, crucial for compari...

Endo-FASt3r: Endoscopic Foundation model Adaptation for Structure from motion

Introduces Endo-FASt3r, the first method to use self-supervised learning with foundation models for pose estimation in endoscopic scenes. Explores adaptation for structure from motion, crucial for 3D ...

Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

Introduces TransDiff, the first image generation model combining Autoregressive Transformers and diffusion models. Achieves state-of-the-art performance on ImageNet by effectively encoding labels and ...

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Presents VBench-2.0, advancing a benchmark suite for video generation models. Focuses on intrinsic faithfulness beyond superficial aspects, measuring factors like temporal consistency and prompt adher...

UnZipLoRA: Separating Content and Style from a Single Image

Introduces UnZipLoRA, a method to decompose an image into subject and style using two distinct LoRAs trained simultaneously. Achieves disentanglement from a single image, ensuring LoRA compatibility f...

CoMatcher: Multi-View Collaborative Feature Matching

Proposes a multi-view collaborative matching strategy for reliable track construction in complex scenarios. Addresses ambiguity in pairwise matching by considering collaborative information, improving...

Deep Skin Lesion Segmentation with Transformer-CNN Fusion: Toward Intelligent Skin Cancer Analysis

Proposes a Transformer-CNN fusion method for high-precision skin lesion segmentation. Integrates transformers for global semantics and CNNs for local features, enhancing analysis of complex lesion str...

From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound

Presents an unsupervised framework for 3D anatomical structure reconstruction from freehand transvaginal ultrasound sweeps. Achieves volumetric reconstruction without external tracking or learned pose...

RNDiff: Rainfall nowcasting with Condition Diffusion Model

Introduces RNDiff, a conditional diffusion model for short-term precipitation nowcasting. Leverages diffusion models for high-quality sample generation, contrasting with GANs and VAEs for...

MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection

Proposes MoE-FFD, a Mixture of Experts approach for generalized and parameter-efficient face forgery detection. Addresses limitations of ViT-based methods in computational resources and capturing loca...

Dynamic watermarks in images generated by diffusion models

Proposes a novel multi-stage watermarking framework for diffusion models to establish copyright and trace generated images. Addresses ethical concerns including intellectual property and misuse of syn...

Reconstruction-Free Anomaly Detection with Diffusion Models

Proposes a novel inversion-based anomaly detection approach using diffusion models that circumvents explicit reconstruction. Addresses the tension between fidelity and efficiency in anomaly detection.

DuCos: Duality Constrained Depth Super-Resolution via Foundation Model

Introduces DuCos, a depth super-resolution framework using Lagrangian duality theory and foundation models. Improves generalization across diverse scenarios with a novel prompt design for enhanced geo...

Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model

Proposes a marker-wise conditioned diffusion model for virtual multiplex staining of histological images. Addresses limitations of multiplex data acquisition and enables multimodal analysis on existin...

MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds

Presents MeshCoder, a framework reconstructing complex 3D objects from point clouds into editable Blender Python scripts. Leverages LLMs for structured mesh code generation, overcoming limitations of ...

Improving Token-based Object Detection with Video

Extends the Pix2Seq object detector for videos, introducing an end-to-end approach for video object detection. Represents objects as discrete tokens, improving succinctness and handling varying number...

3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds

Introduces 3D-Generalist, a self-improving vision-language-action model for crafting 3D worlds. Addresses challenges in spatial reasoning by grounding models in the 3D world, enabling scalable generat...

GeMS: Efficient Gaussian Splatting for Extreme Motion Blur

Introduces GeMS, a framework for 3D Gaussian Splatting designed to handle severely motion-blurred images. Addresses limitations of existing deblurring and Gaussian Splatting methods by not assuming ac...

Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration

Presents NCLR, a self-supervised learning framework for 3D perception using 2D-3D neural calibration. Estimates rigid pose aligning camera and LiDAR systems, bridging domain gaps for effective percept...

What Makes for Good Image Captions?

Establishes an information-theoretic framework for image captioning, balancing sufficiency, redundancy, and comprehensibility. Provides quantitative measures for evaluating caption quality and a flexi...

📅

Wednesday, August 20, 2025

Executive Briefing Bullets (20) JSON

Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology

Benchmarks GPT-5's zero-shot multimodal reasoning in radiology and radiation oncology, comparing its performance against GPT-4o across key medical tasks. Assesses the practical gains of large multimod...

UNICON: UNIfied CONtinual Learning for Medical Foundational Models

Proposes UNICON, a unified continual learning framework for medical foundational models. Addresses data scarcity by enabling sequential fine-tuning on diverse domains and tasks without requiring large...

State of Abdominal CT Datasets: A Critical Review of Bias, Clinical Relevance, and Real-world Applicability

Critically reviews 46 abdominal CT datasets, finding substantial redundancy and a Western-centric geographic bias. Assesses suitability for AI applications, highlighting limitations in clinical relevance and re...

Towards Understanding and Harnessing the Transferability of Prognostic Knowledge in Computational Pathology

Investigates transferability of prognostic knowledge in computational pathology for Whole-Slide Images. Addresses scaling limitations for rare tumors and knowledge utilization from other cancers, prop...

InnerGS: Internal Scenes Rendering via Factorized 3D Gaussian Splatting

Targets internal scene reconstruction using factorized 3D Gaussian Splatting. Models continuous volumetric density via inner 3D Gaussians for applications requiring deep interior understanding.

PediDemi -- A Pediatric Demyelinating Lesion Segmentation Dataset

Introduces PediDemi, a dataset for pediatric demyelinating lesion segmentation. Addresses the need for specialized datasets to improve AI models for diagnosing central nervous system disorders.

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Enhances Vision Transformers for medical image segmentation by integrating pre-trained LLM transformer blocks. Achieves substantial improvements by incorporating frozen LLM blocks into the ViT encoder...

Exploration of Deep Learning Based Recognition for Urdu Text

Explores deep learning for Urdu text recognition, addressing challenges of its cursive script and complex structure. Proposes a component-based classification approach to improve recognition accuracy.

Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery

Introduces a computer-vision framework for quantifying aesthetic outcomes in facial plastic surgery. Leverages automated landmark detection, symmetry computation, and deep learning on a large dataset.

EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis

Introduces EDTalk++, which achieves full disentanglement for controllable talking head synthesis. Disentangled control over facial motions and support for diverse input modalities broaden its practical and entertainment uses.

Multimodal Data Storage and Retrieval for Embodied AI: A Survey

Surveys storage architectures for Embodied AI data, evaluating graph, multi-model, data lake, vector, and time-series databases. Focuses on suitability for physical grounding, low-latency access, and ...

Learning to See Through Flare

Introduces NeuSee, a framework for sensor protection against laser flare. Jointly learns a diffractive optical element representation and a Mamba-GAN network for image restoration, enabling high-fidel...

Real-Time, Population-Based Reconstruction of 3D Bone Models via Very-Low-Dose Protocols

Introduces SSR-KD, a fast, accurate AI framework for real-time 3D bone model reconstruction from very-low-dose protocols. Enables patient-specific surgical guides and preoperative planning without hig...

Susceptibility Distortion Correction of Diffusion MRI with a single Phase-Encoding Direction

Addresses susceptibility distortion correction for diffusion MRI acquired with a single phase-encoding direction.

Colon Polyps Detection from Colonoscopy Images Using Deep Learning

Applies deep learning object detection for early colon polyp identification using the Kvasir-SEG dataset. Utilizes data augmentation and specific training/validation/testing splits to improve detectio...

TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain

Introduces TracSum, a benchmark for traceable, aspect-based summarization in the medical domain. Pairs summaries with sentence-level citations to enable users to assess factual accuracy and alleviate ...

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Enhances OCR capabilities using a reasoning-and-tool interleaved vision-language model. Addresses LVLM hallucinations and improves effectiveness on OCR tasks compared to general-purpose models.

Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Proposes Prune2Drive, a plug-and-play framework to accelerate Vision-Language Models in autonomous driving. Addresses computational overhead from high-resolution, multi-view images via pruning.

Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies

Investigates Small Language Models (SLMs) for medical imaging classification, comparing models and prompt designs. Addresses computational cost and data privacy concerns hindering LLM adoption in heal...

Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Re-examines MLLM token technology through classical visual coding principles. Establishes a unified formulation bridging token technology and visual coding to minimize computational cost while maximiz...

📅

Tuesday, August 19, 2025

Executive Briefing Bullets (20) JSON

On computing and the complexity of computing higher-order $U$-statistics, exactly

Derives a decomposition of m-th order U-statistics into linear terms, aiming to fill the gap in comprehensive studies of the computational complexity of these statistics, which are known to be time-consuming to compute in practice.

Does the Barron space really defy the curse of dimensionality?

Provides evidence that the Barron space, while defying the curse of dimensionality from the viewpoint of classical smoothness, does not defy it under a nonclassical notion of smoothness related to 'infinitely wide' shallow neural networks.

Communicate Less, Synthesize the Rest: Latency-aware Intent-based Generative Semantic Multicasting with Diffusion Models

Develops an intent-aware generative semantic multicasting framework utilizing pre-trained diffusion models that decomposes source signals into semantic classes based on multi-user intent for efficient...

WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction

Presents WIR3D, a technique for abstracting 3D shapes using sparse Bezier curves that represent geometry and visual features, guided by CLIP model activations.

EgoTwin: Dreaming Body and View in First Person

Introduces a novel task of joint egocentric video and human motion generation, addressing viewpoint alignment and camera motion challenges for first-person view content.

IntelliCap: Intelligent Guidance for Consistent View Sampling

Addresses the under-explored problem of assisting humans in collecting input images for novel view synthesis, focusing on uniform and dense view sampling.

ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset

Proposes a method for mimicking bona fide ID card images by generating synthetic versions, aiming to address the lack of images for training robust Presentation Attack Detection systems.

DMS: Diffusion-Based Multi-Baseline Stereo Generation for Improving Self-Supervised Depth Estimation

Proposes DMS, a diffusion-based multi-baseline stereo generation method to address ambiguity in photometric reconstruction, improving self-supervised depth estimation.

Precise Action-to-Video Generation Through Visual Action Prompts

Presents visual action prompts, a unified action representation for action-to-video generation of complex interactions, balancing action precision and cross-domain transferability.

IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion

Introduces IGFuse, a method for reconstructing 3D scenes by fusing multi-scans with Gaussian representations, addressing object occlusions and limited sensor coverage.

Unified Conformalized Multiple Testing with Full Data Efficiency

Proposes a unified framework for conformalized multiple testing that uses all available data (null, alternative, unlabeled) to construct scores and calibrate p-values via a full permutation strategy.

Transforming Blood Cell Detection and Classification with Advanced Deep Learning Models: A Comparative Study

Compares YOLOv10 against other models for blood cell detection, showing that increasing the number of training epochs significantly enhances accuracy, precision, and recall for real-time detection and classification.

A polynomial formula for the perspective four points problem

Presents a fast and accurate solution to the perspective n-points problem for n=4 by separating variables and finding 3D points on rays connecting the camera to canvas points.

Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model

Introduces Matrix-Game 2.0, an open-source, real-time world model using diffusion models for interactive video generation, addressing latency issues of previous models.

HierAdaptMR: Cross-Center Cardiac MRI Reconstruction with Hierarchical Feature Adapters

Proposes HierAdaptMR, a hierarchical feature adaptation framework using parameter-efficient adapters to address multi-level domain variations in cross-center cardiac MRI reconstruction.

Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping

Enables controllable human shape editing while preserving pose, identity, clothing, and background by using depth-guided diffusion, addressing limitations of current approaches.

Checkmate: interpretable and explainable RSVQA is the endgame

Introduces a novel RSVQA dataset, Chessboard, designed to minimize biases and improve interpretability and explainability in Remote Sensing Visual Question Answering models.

Real-Time Beach Litter Detection and Counting: A Comparative Analysis of RT-DETR Model Variants

Conducts a comparative analysis of RT-DETR model variants for automated beach litter detection and counting, investigating the efficacy of state-of-the-art object detection models.

Motion2Motion: Cross-topology Motion Transfer with Sparse Correspondence

Tackles animation transfer between characters with different skeletal topologies, proposing a method that addresses topological inconsistency and establishes bone correspondences.

4DNeX: Feed-Forward 4D Generative Modeling Made Easy

Presents 4DNeX, the first feed-forward framework for generating dynamic 3D scene representations from a single image by fine-tuning a pre-trained video diffusion model.

Archive contains 57 days of AI research intelligence