AI Research Archive
Recording AI Revolution One Day At A Time
Wednesday, November 5, 2025
Investigates subtraction accuracy in eight LLMs, finding it lags behind addition. Errors in (a-b) are consistently related to errors in (b-a), suggesting models struggle with non-commutativity. This h...
Introduces Fast, Private, and Protected (FPP), a novel approach for federated learning that safeguards data privacy and defends against model poisoning attacks. It aims to ensure secure and robust dis...
Introduces LTD-Bench, a benchmark for evaluating LLMs' spatial reasoning capabilities through drawing. It addresses the limitations of opaque numerical metrics by providing an intuitive understanding ...
Introduces SEAL, a symmetry-encouraging loss function for high energy physics. It improves robustness and data efficiency of machine learning models by explicitly respecting physical symmetries, even ...
Introduces the 'Three Taxes' framework to analyze performance inefficiencies in distributed LLMs. Proposes moving beyond the Bulk Synchronous Parallel (BSP) model to achieve efficient multi-GPU inference by addressing bulk synchronous, lo...
Reveals a jailbreak strategy that evades defenses by extracting information from failed attacks and evolving itself. It provides an automated framework for discovering, retrieving, and evolving strate...
Formalizes AI research agents as search policies navigating solution spaces using operators. Focuses on improving agent performance in MLE-bench by enhancing search, exploration, and generalization fo...
Identifies a class of simulation problems where Graph Neural Networks (GNNs) outperform LLMs. Introduces Graph-based Models (GEMs) that match or surpass LLM baselines for human simulation despite bein...
Introduces path-consistency, leveraging confidence of earlier answers to guide generation and enhance LLM inference efficiency. It identifies promising prefixes to reduce computational cost and time c...
Fine-tunes LLMs for classification by attaching explanations to labels, systematically improving naturalness, comprehensiveness, and adherence. This explanation-enhanced approach yields better convers...
Proposes a QUBO formulation to enhance privacy in federated learning by bounding the risk of membership inference attacks. This method aims to improve data protection while maintaining model utility i...
Proposes IG-Pruning, a novel input-aware method for pruning transformer layers in LLMs. It dynamically removes layers based on input, reducing computational costs for efficient inference without signi...
Proposes GRACE, a lightweight score to quantify teacher model effectiveness for student model distillation. It measures distributional properties of student gradients without a verifier, enabling prin...
Addresses GPU NUMA effects in large-scale attention workloads by proposing Swizzle, a novel kernel scheduling strategy. It exploits NUMA-aware locality to optimize attention performance, mitigating me...
Proposes PrivGNN, a high-performance secure inference protocol for graph neural networks. It addresses the challenge of securing GNNs and graph data in privacy-critical cloud environments, enabling se...
Presents AutoAdv, a training-free framework for automated multi-turn jailbreaking of LLMs. It achieves high attack success rates by combining adaptive adversarial prompting and prompt refinement, impr...
Proposes a novel Multi-Personality Generation (MPG) framework for LLMs at decoding time. It flexibly controls multiple personalities without retraining, enhancing adaptability and robustness for user-...
Compares sequential and parallel self-consistency for LLM reasoning, finding sequential voting with inverse entropy outperforms parallel methods at equal compute. This demonstrates a more efficient sc...
Derives and investigates two DPO variants that explicitly model ties in pairwise comparisons. Experiments show explicit tie handling can be added without performance degradation, improving DPO's robus...
Proposes ExplicitLM, a novel architecture with a million-scale external memory bank storing human-readable knowledge. This decouples knowledge from parameters, enabling direct inspection and modificat...
Tuesday, November 4, 2025
Investigates how generative AI models encode 'beauty' norms and erase 'ugliness'. Studies the propagation of Western beauty myths in text-image models and discusses societal implications, particularly...
Surveys complex question-answering strategies using hybrid LLM architectures. Reviews methods for addressing specific, complex questions beyond chatbot capabilities, exploring power-generation and cli...
Proposes dictionary learning for adversarial training to defend LLMs against jailbreak attacks. Aims to improve generalization to unseen attacks by creating more robust safety guardrails, addressing a...
Introduces PADBen, a benchmark for evaluating AI text detectors against paraphrase attacks. Reveals that iterative paraphrasing evades current detectors by creating an intermediate laundering region, ...
Introduces MARS-SQL, a multi-agent RL framework for complex Text-to-SQL tasks. It decomposes the problem into specialized agents for grounding, generation, and validation, improving accuracy and handl...
Considers a method for finding mixed Nash equilibria in two-layer zero-sum games using entropic regularization. Applies interacting particle dynamics and large deviations theory to problems in GAN tra...
Introduces a framework to assess LLM reasoning's knowledge grounding by collecting principal knowledge and evaluating intermediate reasoning steps. It comprises knowledge collection, grounding assessm...
Provides a comprehensive review of Low-Rank Adaptation (LoRA) for foundation models. It analyzes LoRA's effectiveness in adapting large models to downstream tasks, addressing parameter efficiency chal...
Investigates safety and fairness risks in parameter-efficient fine-tuning (PEFT) of LLMs. Compares four PEFT methods (LoRA, DoRA, ICL, Prompt Tuning) to assess trade-offs between efficiency and alignm...
Introduces DTS, a framework for enhancing large reasoning models by pruning over-long chain-of-thought traces. It uses decoding tree sketching to identify short, accurate reasoning paths, reducing inf...
Reevaluates self-consistency scaling in multi-agent systems using Gemini 2.5 models. Examines trade-offs of increasing sampled reasoning paths, comparing pooled outputs to single chain-of-thought, and...
Introduces ToM, a framework leveraging Tree-oriented MapReduce for long-context reasoning in LLMs. It improves logical coherence over RAG and divide-and-conquer methods by optimizing graph traversal f...
Proposes a framework for generating spatially coherent multimodal data by integrating spatial knowledge graphs with MLLMs. It addresses spatial perception limitations in MLLMs, enabling the creation o...
Introduces ReSpec, a retrieval-enhanced speculative decoding framework for LLM acceleration. It optimizes cache scheduling as a graph problem using Lexicographic Minimax Path Optimization to minimize ...
Presents a systematic investigation into diversity's impact on LLM reasoning via RL. Proposes a diversity-aware policy optimization framework to enhance reasoning capabilities and stability, addressin...
Proposes SEPS, a semantic-enhanced patch slimming framework for fine-grained cross-modal alignment. It addresses patch redundancy and ambiguity in MLLMs by optimizing patch selection for improved visi...
Proves that suboptimality of Empirical Risk Minimization (ERM) is due to large bias, with variance bounded by the minimax rate. Provides an elementary proof in the fixed design setting and extends it ...
Introduces the Bhili-Hindi-English Parallel Corpus (BHEPC), the largest of its kind. Leverages cross-domain and cross-linguistic data to address low-resource Neural Machine Translation challenges for ...
Introduces CORGII, a graph indexing framework for efficient subgraph isomorphism retrieval. Uses contextual graph representations and inverted indices to overcome limitations of exhaustive scoring in ...
Proposes a Bayesian tensor regression model for phenotype prediction across multiple factors. Incorporates spike-and-slab structures to identify relevant interactions and uses prior distributions to r...
Monday, November 3, 2025
Introduces ThinkMorph, a multimodal model learning interleaved chain-of-thought reasoning by treating text and image as complementary. Fine-tuned on 24K reasoning traces, it demonstrates emergent prop...
Introduces DUST, a dual-stream diffusion framework for world-model augmented Vision-Language-Action (VLA) models. It addresses modality conflicts between state and action prediction, enhancing VLA per...
Proposes a robust deep neural watermarking framework for copyright protection in 3D point clouds. It addresses challenges posed by geometric and non-geometric attacks, offering enhanced resilience com...
Proposes a Data-Free Quantization (DFQ) method for Vision Transformers (ViTs) that addresses semantic distortion and inadequacy using semantic alignment and reinforcement. It enables model quantizatio...
Proposes SAGS, a self-adaptive alias-free Gaussian Splatting method for dynamic surgical endoscopic reconstruction. It addresses aliasing and artifacts in deformable tissue reconstruction from endosco...
Introduces NegoCollab, a common representation negotiation approach for heterogeneous collaborative perception. It addresses domain gaps in intermediate features shared among agents with fixed percept...
Presents a deep learning-based denoising framework for quantitative operando microscopy. It preserves physical fidelity and enhances resolution, enabling deeper insights into dynamic chemical and phys...
Introduces Phased DMD, a few-step distribution matching distillation method using score matching within subintervals. It addresses limitations of one-step distillation in complex generative tasks by e...
Presents generative diffusion modeling protocols to enhance Kikuchi pattern indexing in electron backscatter diffraction (EBSD). It addresses limitations of traditional methods at high scanning speed...
Introduces a multi-agent framework for editable scientific illustrations that outputs vector graphics with semantic structure. It addresses rasterization limitations and cumbersome code-based methods,...
Proposes NAUTILUS, a large multimodal model for underwater scene understanding, addressing the lack of large-scale datasets. It enables multi-task perception from multiple granularities, advancing aut...
Presents FRIDA, a lightweight framework using diffusion features for fake image detection and source attribution. It addresses generalization challenges of supervised detectors across unseen generator...
Introduces ANCHOR, integrating adversarial training with hard-mined supervised contrastive learning for robust representation learning. It enhances model resilience against adversarial attacks by lear...
Introduces PROFIT, an optimizer specifically designed for deep fine-tuning of converged models on new tasks or datasets. It aims to improve fine-tuning efficiency and model performance, addressing a g...
Proposes an Audio-Visual Speech Enhancement (AVSE) system that jointly models separation and dereverberation for complex acoustic scenarios. It leverages visual auxiliary information to extract target...
Proposes Gaussian Combined Distance (GCD) as a generic similarity metric for object detection, addressing limitations of IoU-based metrics, especially for small objects. GCD enhances model performance...
Proposes Sh-ViT, a lightweight Vision Transformer for robust occluded person re-identification in complex surveillance scenes. It enhances robustness to occlusion through a shuffle module in the final...
Proposes LifWavNet, a lifting wavelet network for non-contact ECG reconstruction from radar signals. It employs learnable lifting wavelets for adaptive feature capture and synthesis, offering an unobt...
Introduces WildfireX-SLAM, a large-scale low-altitude RGB-D dataset for wildfire SLAM. It aims to facilitate research in 3D Gaussian splatting-based SLAM for challenging forest environments, supportin...
Proposes a fragile zero-watermarking method using dual quaternion matrix decomposition for medical image copyright protection. It extracts stable features without modifying the original image, providi...
Friday, October 31, 2025
Introduces Emu3.5, a large-scale multimodal world model pre-trained end-to-end with a unified next-token prediction objective. Trained on over 10 trillion vision-language tokens, it natively predicts ...
Proposes a new planning method for end-to-end autonomous driving using constraint-aware flow matching. This generative approach overcomes the mode collapse issue of imitation learning by producing div...
Introduces a framework for consistent and reproducible evaluation of novel view synthesis methods like NeRFs and 3D Gaussian Splatting. It provides standardized implementations and evaluation protocol...
Adapts the Segment Anything Model (SAM) for medical magnetic resonance imaging (MRI) segmentation. This work demonstrates how a large-scale vision foundation model can be effectively fine-tuned for a ...
Introduces CronusVLA, a vision-language-action model for robotic manipulation that leverages temporal information from multiple frames. By moving beyond the single-frame paradigm, this approach enhanc...
Introduces MoralCLIP, a method to imbue vision-language models with the ability to reason about moral dimensions of content. It aligns image-text representations with principles from Moral Foundations...
Proposes Masked Diffusion Captioning (MDC), a novel self-supervised method for learning visual features. The approach trains a model to caption images using an image-conditioned masked diffusion langu...
Presents DOVE, a diffusion model for real-world video super-resolution that achieves high performance in a single sampling step. This overcomes the significant latency of traditional iterative diffusi...
Proposes JOGS, a unified framework that jointly optimizes 3D Gaussian points and camera poses for novel view synthesis. This approach eliminates the dependency on external pose estimation tools like C...
Provides a comprehensive survey on efficient post-training for Large Language Models (LLMs) from a data-centric viewpoint. The paper reviews methods and challenges related to data annotation costs and...
Presents Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel method for dynamic scene rendering. By disentangling static and dynamic components, it achieves high-resolution, real-time rende...
Provides a comprehensive survey of 3D Gaussian Splatting (3DGS), a transformative technique for 3D scene representation. The paper analyzes follow-up research that enhances efficiency, scalability, an...
Addresses the challenge of selecting effective pre-training data for long-context LLMs. The paper proposes a method to quantify long-range dependencies in text, enabling the filtering of documents tha...
Presents ProstNFound+, a prospective clinical study validating the use of medical foundation models for prostate cancer detection from micro-ultrasound images. This work demonstrates the real-world ap...
Proposes SplitFlow, a method for inversion-free image editing with rectified flow models. By decomposing the flow into content and structure components, it allows for high-fidelity, text-guided edits ...
Presents LODGE, a level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. It creates a hierarchical representation,...
Introduces HyGen, a system for efficient LLM serving that co-locates latency-sensitive online requests and throughput-oriented offline requests. By dynamically managing resources and batching strategi...
Introduces DDL, a large-scale dataset for deepfake detection and localization designed to cover diverse real-world scenarios. By including a wide range of AI-generated content (AIGC) and manipulation types...
Introduces Spiking Patches, a novel tokenization method specifically designed for asynchronous and sparse data from event cameras. This approach creates an event representation that preserves the inhe...
Presents CRAG-MM, a new benchmark for evaluating Multi-Modal Retrieval-Augmented Generation (MM-RAG) systems. It focuses on multi-turn conversational scenarios, such as those encountered with wearable...
Thursday, October 30, 2025
Proposes a new LLM serving architecture that executes programs instead of processing static prompts. This allows for dynamic, runtime customization of inference, achieving up to 2x throughput improvem...
Presents a novel transformer architecture where looped computations (reusing weights) run in parallel instead of sequentially. This design overcomes the latency bottleneck of previous looped models, e...
Introduces RLAIF-V, a framework for reducing multimodal LLM hallucination using feedback from open-source AI models instead of humans. This method creates a highly effective preference dataset and tra...
Presents a generative AI framework that creates dynamic visual effects (VFX) by learning from in-context examples, rather than relying on per-effect fine-tuning. This allows the model to generalize an...
Proposes a unified training pipeline that improves both Program Chain-of-Thought (P-CoT) and Natural Language Chain-of-Thought (N-CoT) reasoning. The method uses each paradigm to iteratively generate and re...
Presents a foundational LLM for Electronic Health Record (EHR) analysis, pre-trained on a massive clinical dataset. The model is fine-tuned with a reasoning-focused objective, demonstrating superior p...
Introduces an open-source framework for building and evaluating automated fact-checking systems. The work provides a comprehensive benchmark that measures the ability of LLMs and dedicated systems to ...
Proposes the first Multimodal Large Language Model (MLLM) framework for open-vocabulary, hierarchical part segmentation. The model can jointly detect and segment objects and their constituent parts fr...
Proposes a new training method that improves the reliability of post-hoc attribution for long-document question answering. By training the model to decompose answers into components, it enhances the a...
Introduces ExtractAnything3D (EA3D), a unified online framework that performs simultaneous geometric reconstruction and open-world 3D object extraction from a single, streaming video. The system can i...
Introduces Ouro, a family of pre-trained Looped Language Models that perform iterative reasoning in latent space. This approach allows smaller models (1.4B) to match the reasoning performance of much ...
Demonstrates that language models, regardless of architecture (Transformer, Mamba) or scale (14M to 12B parameters), exhibit highly consistent and predictable behavioral phases during pre-training, re...
Proposes a method for precisely erasing entire concepts directly from a model's parameters. This technique surgically modifies model behavior without requiring fine-tuning, offering a more robust appr...
Introduces a method to accelerate Chain-of-Thought (CoT) reasoning by encoding reasoning steps into implicit, non-textual tokens. This reduces the number of generated tokens, significantly speeding up...
Develops an LLM-based agent for complex business tasks within a Customer Relationship Management (CRM) system. The agent uses reinforcement learning and a shared memory module to improve its tool-call...
Proposes PairUni, a unified framework for training multimodal models to perform both understanding and generation tasks. It uses pairwise ranking objectives during reinforcement learning to effectivel...
Introduces a new generative model based on rectified flow, an ODE-based approach that learns smooth transport between distributions. This method offers an alternative to diffusion, enabling high-quali...
Introduces MiRAGE, a new evaluation framework and benchmark for Retrieval-Augmented Generation (RAG) systems that use multimodal sources like video and audio. It tests the ability of models to integra...
Presents a novel debugging framework where the model first translates buggy code into a natural language description of its logic. It then identifies and corrects flaws in the natural language represe...
Presents a new Video Question Answering dataset to evaluate a model's ability to understand temporal dynamics and perform complex reasoning over streaming video. The dataset includes questions requiri...
Wednesday, October 29, 2025
Introduces a reinforcement learning framework where a single model acts as both a Challenger and a Reasoner. The model self-improves by generating reasoning problems from a large text corpus, demonstr...
Proposes a method for Multimodal Large Language Models to improve complex visual reasoning by generating intermediate 'visual thoughts.' The model learns to sketch in a latent space, mimicking human c...
Presents Pie, a programmable serving system designed for complex LLM applications involving agentic workflows. It replaces the monolithic token generation loop with a flexible system that can execute ...
Introduces a data synthesis method inspired by the Zone of Proximal Development (ZPD). It generates training tasks at the edge of an LLM's capabilities, enabling the model to effectively expand its re...
Creates a benchmark to disentangle reasoning from factual recall in language models. It generates controlled, synthetic 'worlds' with alternate physics or facts, allowing for precise evaluation of a m...
Introduces a large multimodal model capable of processing contexts up to 1 million tokens, including images, video, and text. It achieves state-of-the-art performance on long-context visual understan...
Introduces a benchmark to evaluate if AI agents can replicate research from astrophysics papers. It tests an agent's ability to perform a complex workflow, including understanding the paper, writing c...
Demonstrates that Reinforcement Learning can significantly improve the performance of LLM-based search agents on long-horizon tasks. By learning from experience, the RL-trained agents outperform promp...
Develops a multi-sensor fusion method for autonomous driving based on 3D Gaussian representations. The approach effectively combines information from various sensors like cameras and LiDAR into a unif...
Introduces a large-scale commonsense reasoning benchmark covering over 100 languages and cultures. Constructed through participatory methods, it evaluates the ability of LLMs to handle culturally-spec...
Provides a comprehensive survey on general world models, a key concept for AGI. It analyzes OpenAI's Sora within this framework, discussing its capabilities, limitations, and the future trajectory for...
Introduces a method to transfer a language model to a new tokenizer without retraining. This technique allows for adapting models to new languages or domains efficiently, improving performance and red...
Develops a neuromuscular speech interface that synthesizes audible speech directly from electromyographic (EMG) signals of orofacial muscles. The system leverages self-supervised speech representation...
Identifies 'temporal blindness' in LLM agents, where they fail to account for real-world time progression during multi-turn interactions. The paper diagnoses this issue and demonstrates its negative i...
Proposes a diffusion-based large language model that natively supports variable-length text generation. By treating the [EOS] token as a special signal, the model overcomes a key limitation of previou...
Proposes a 'Zero-Imitation' framework for end-to-end autonomous driving. Instead of relying on expert demonstrations, the model learns by generating and scoring its own trajectories based on safety an...
Presents a method for learning reward models for complex, long-form agentic tasks. The system uses reinforcement learning and web-grounded feedback to train reward models that can evaluate the correct...
Presents an agent-based foundation model for analyzing high-resolution pathology images. The model mimics the diagnostic logic of human pathologists by sequentially selecting and analyzing regions of ...
Proposes a framework for proactive robotic manipulation using omni-modal context from vision, language, and audio. The robot can infer human intent and proactively assist in tasks without explicit ins...
Presents a framework to automatically create large-scale, navigable simulators for indoor environments from simple image sequences. It adapts 3D Gaussian Splatting to build photorealistic scenes, enab...
Tuesday, October 28, 2025
Introduces the first zero-shot method for grounding 3D orientation in text-to-image models. It allows users to specify the viewpoint of multiple objects across diverse categories without requiring exp...
Presents a method for compositional motion customization in text-to-video generation. It enables precise control over complex, multi-subject motions by decomposing motion descriptions and applying the...
Presents a method to unify image generation and depth estimation within a single text-to-image diffusion model. It overcomes the catastrophic degradation of generative capabilities during fine-tuning,...
Introduces BrainFound, a self-supervised foundation model for 3D brain MRI analysis built by extending DINO-v2. It learns general-purpose features from large-scale unlabeled MRI datasets, demonstratin...
Proposes a system for converting 3D scans into parametric, constrained Computer-Aided Design (CAD) models. It reconstructs fine-grained geometric primitives and infers the underlying design intent, su...
Presents an end-to-end autonomous driving model that is robust to variations in camera viewpoint. It uses a feed-forward 3D Gaussian Splatting module to create an explicit 3D representation of the sce...
Proposes a 4D Gaussian Splatting method for reconstructing surgical scenes from endoscopic video. It uses a rational-wavelet representation to model non-rigid tissue motion and handles photometric inc...
Presents a lightweight framework for building unified multimodal models for both understanding and generation. It uses a double fusion approach to efficiently combine pre-trained vision encoders and L...
Presents a dataset and method for large-scale, occupancy-centric driving scene generation. The framework allows for the creation of diverse and consistent driving scenarios conditioned on occupancy gr...
Introduces Kernel Density Steering (KDS), a novel inference-time framework for diffusion-based image restoration. It guides the sampling process toward high-density regions of the data manifold, promo...
Proposes a Vision-Language-Action model for end-to-end autonomous driving. The model leverages world knowledge and reasoning to make driving decisions, using reinforcement fine-tuning and adaptive rea...
Proposes VOLD, a method to transfer reasoning from text-only LLMs to Vision-Language Models using on-policy distillation. This technique leverages abundant text-based reasoning data to improve VLM per...
Proposes a method to accelerate diffusion model sampling by adaptively combining ODE and SDE solvers. The technique introduces adaptive stochastic coefficients to leverage the complementary strengths ...
Introduces a unified framework for 3D open-vocabulary segmentation by integrating it with Gaussian Splatting. The method first reconstructs a 3D scene and then performs segmentation, ensuring multi-vi...
Introduces a framework for egocentric video reasoning that infers the hidden intentions and actions of the camera-wearer. It uses a Spatio-Temporal Chain-of-Thought (CoT) approach, enabling multimodal...
Introduces a training-free method for multi-subject text-to-image generation by automatically fusing multiple subject-specific LoRAs at test time. It uses an auto-masking technique to apply different ...
Introduces a benchmark for evaluating and mitigating hallucinations in Vision-Language Models for video understanding. It uses synthetic videos to test physical and common-sense reasoning, revealing m...
Introduces a large-scale 3D radiology dataset for Medical Visual Question Answering (Med-VQA) using CT scans. It supports diverse diagnostic tasks and multi-temporal analysis, providing a comprehensiv...
Proposes an adversarial fair contrastive pre-training method for chest X-ray models to mitigate demographic biases. The AdFair-CLIP framework learns representations that are invariant to sensitive att...
Proposes a flexible model merging technique that allows for navigating the trade-off between model accuracy and size. It can combine multiple single-task fine-tuned models into a multi-task model of a...
Monday, October 27, 2025
Presents WorldGrow, a framework for generating infinitely extendable 3D worlds. It addresses the challenges of creating large, continuous environments with coherent geometry and realistic appearance, ...
Proposes RigAnything, a template-free, autoregressive transformer model for 3D asset rigging. It probabilistically generates joints, skeleton topologies, and skinning weights, making diverse 3D assets...
Presents a method to overcome the batch size dependency in contrastive learning. The proposed Smart Batch Mining technique allows models to learn effective representations without requiring large batc...
Improves video generation models by incorporating epipolar geometry constraints into large latent diffusion transformers. This approach enhances geometric consistency, stabilizes motion, and reduces v...
Proposes zip2zip, an inference-time adaptive tokenization method for large language models. It uses online compression to dynamically adjust the tokenizer's vocabulary to domain-specific inputs, impro...
Investigates the operational mechanisms of Classifier-Free Guidance (CFG) in text-to-image diffusion models. The paper proposes a new interpretation based on foresight fixed point iterations, aiming t...
Presents SAMA, a Video Large Multimodal Model designed for fine-grained spatio-temporal understanding. It enables multi-turn, referential grounded video chat by mastering both video referring understa...
Introduces CLIPGaussian, a universal and multimodal style transfer method for representations based on Gaussian Splatting (GS). It extends style transfer beyond simple color changes for GS-based image...
Presents RiverMamba, a State Space Model for global-scale river discharge and flood forecasting. This approach aims to improve the accuracy and efficiency of early warning systems by moving beyond loc...
Proposes a self-refining framework for training language model-based anonymizers using adversarial distillation. This approach enhances privacy in LLM applications by creating open-source anonymizers ...
Introduces Seed3D, a system that converts images into high-fidelity, simulation-ready 3D assets. It aims to bridge the gap between content diversity and physics accuracy in world simulators, providing...
Introduces VITA-1.5, a Multimodal Large Language Model focused on achieving GPT-4o-level real-time interaction. It integrates vision and speech modalities to enhance dialogue systems, addressing the n...
Proposes InfiniPot-V, a key-value (KV) cache compression method for multimodal large language models processing streaming video. It allows for hour-long video reasoning on memory-constrained devices b...
Introduces InfiniDreamer, a novel framework for generating arbitrarily long human motion sequences. It overcomes the lack of long motion training data by using a segment score distillation approach, e...
Presents Grasp2Grasp, a vision-based approach for dexterous grasp translation using Schrödinger Bridges. Given a visual observation of a source hand, the method synthesizes a functionally equivalent g...
Introduces RTV-Bench, a new benchmark for evaluating Multimodal Large Language Models on continuous perception, understanding, and reasoning in dynamic environments. It uses real-time video to assess ...
Proposes ArtiLatent, a generative framework for synthesizing articulated 3D objects with fine-grained geometry and realistic appearance. It jointly models part geometry and articulation by embedding s...
Introduces Lorentz Local Canonicalization (LLoCa), a general framework that renders any standard neural network architecture Lorentz-equivariant. This method removes the need for specialized layers, b...
Introduces Hierarchical Soft Mixture-of-Experts (HoME) with a Mamba-based architecture for 3D medical image segmentation. The model is designed to efficiently process diverse 3D medical modalities and...
Presents Frame In-N-Out, a method for unbounded and controllable image-to-video generation. It leverages cinematic techniques to address key challenges in controllability, temporal coherence, and deta...
Friday, October 24, 2025
Introduces Attentive Convolution, a layer unifying the global receptive field of self-attention with the efficiency of convolutions. The resulting AC-Net architecture achieves competitive performance ...
Presents Sherlock, a framework for Vision-Language Models that performs self-correction on its own reasoning steps without external verifiers. By generating and refining hypotheses internally, it impr...
Introduces a video generation framework that improves physical plausibility by regularizing the model with 3D point trajectories. By augmenting 2D videos with this 3D-aware data, the fine-tuned latent...
Introduces OpenWorldSAM, a framework that extends the Segment Anything Model (SAM) to perform universal image segmentation from open-ended language prompts. By integrating a vision-language model, it ...
Demonstrates emergent properties in biological vision models by scaling hierarchical contrastive learning on a large-scale, taxonomy-curated dataset. The resulting BioCLIP 2 model shows improved zero-...
Develops a method where an LLM iteratively fine-tunes itself to improve its ability to generate adversarial suffixes that jailbreak other models. This automated self-improvement loop discovers more ef...
Presents Spatial-DISE, a unified benchmark for evaluating the spatial reasoning capabilities of Vision-Language Models across four key dimensions: Direction, Intersection, Scale, and Existence. It pro...
Develops a method for statistically comparing generative models by providing confidence intervals on the distance between a model's generated distribution and the true data distribution. This allows f...
Introduces AccuQuant, a post-training quantization method for diffusion models that mitigates the accumulation of quantization errors over multiple denoising steps. By simulating a few sampling steps ...
Proposes a framework that bridges smoothed molecular dynamics (MD) with score-based generative models to efficiently sample protein conformational ensembles. The model learns from smoothed MD trajecto...
Proposes Positional Encoding Field (PEF), a continuous function that generates positional encodings for Diffusion Transformers based on patch coordinates. This method improves generation quality and a...
Provides the first algorithm for sampling from multi-modal distributions, including Gaussian mixtures, with query complexity that is polynomial in the multi-modality parameters. The method is based on...
Proposes using generative diffusion models as computationally efficient surrogates for mechanistic, agent-based biological models like the Cellular-Potts Model (CPM). The surrogate model learns to emu...
Proposes a 'model MoE-ization' strategy that converts a pretrained model's weight matrices into Mixture-of-Experts (MoE) layers for multi-task adaptation. This SVD-based method mitigates task conflict...
Presents a training-free method for subject-driven text-to-image generation that grafts cross-image features at inference time. It preserves subject identity from reference images by manipulating atte...
Introduces AnyPcc, a universal point cloud geometry compression model designed to generalize across diverse data distributions. It uses a robust context model and efficient handling of out-of-distribu...
Challenges the dominant two-stage paradigm in computational pathology by demonstrating that a properly regularized, end-to-end trained model can outperform methods relying on pre-trained, frozen encod...
Proposes a new evaluation framework to assess large-scale video generation models as simulators of multi-person pedestrian dynamics. The study finds that while models produce visually realistic scenes...
Introduces REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models against real-world perturbations. It assesses model performance under variou...
Introduces Online Audio-Visual Event Parsing (On-AVEP) and a Predictive Future Modeling (PreFM) framework to enable real-time event parsing in videos. The model processes video streams incrementally a...
Thursday, October 23, 2025
Proposes a unified "perceive everything as pixels" approach for agentic models, encoding both text and images into a shared pixel-space representation. This framework aims to eliminate separate text t...
Presents a neuro-symbolic agent designed for complex reasoning over large spreadsheets. It combines a neural model for understanding natural language queries with a symbolic engine for executing opera...
Integrates causal graphs into Retrieval-Augmented Generation (RAG) to enhance reasoning and reduce context disruption. By retrieving and reasoning over causal relationships instead of just semantic si...
Proposes a novel method for improving tool retrieval by 'instilling' LLM reasoning capabilities into the retriever itself. This is achieved by having the LLM generate synthetic queries and tool usage ...
Introduces a difficulty-adaptive reasoning framework for token-efficient LLM inference. The system dynamically adjusts the complexity of its 'thinking traces' based on a problem's perceived difficulty...
Introduces Visual Geometry Gaussian Splatting (VGD), a feed-forward method for surround-view autonomous driving scene reconstruction. It uses a visual geometry-aware transformer to explicitly model 3D...
Introduces a balanced, long-context benchmark for evaluating LLMs with context lengths up to 256K. The benchmark features five distinct length levels and is designed to mitigate knowledge leakage and ...
Introduces MoAlign, a motion-centric representation alignment method for text-to-video diffusion models. It explicitly aligns motion representations within the model's U-Net architecture, improving th...
Improves factual hallucination detection by jointly generating claims from an LLM's response and verification queries for those claims. This joint process creates a stronger signal for identifying uns...
Introduces a Mixture of Experts (MoE) architecture for dynamic 3D Gaussian Splatting. This approach uses different 'expert' networks to model various types of motion and scene dynamics, enabling high-...
Introduces a method where Large Language Models automatically optimize the update rules of learning algorithms. By representing optimizer logic as text, LLMs can meta-learn and propose superior optimi...
Investigates and addresses context limitations in long-horizon agentic search tasks. The work identifies how agents 'get lost' during long explorations and proposes a framework to improve information ...
Reframes the problem of detecting machine-generated text as a form of Membership Inference Attack (MIA). This conceptual link reveals that text detectors inherently expose information about a model's ...
Presents Ninja Codes, neurally-generated fiducial markers for 6-DoF tracking that blend into real-world environments. An encoder network subtly alters arbitrary images to embed tracking information, c...
Presents neurally-generated fiducial markers that blend stealthily into environments for 6-DoF tracking. An encoder network subtly alters images to embed trackable codes, creating markers that are bot...
Utilizes reinforcement learning (RL) to enhance the advanced reasoning capabilities of LLMs over long contexts. The method trains models to discover and apply complex thinking patterns required for hi...
Presents an edge-first framework for processing continuous multimodal sensor streams into compact semantic tokens. It enables cost- and uncertainty-aware cooperation between edge devices and cloud-bas...
Proposes a Detector-to-Differentiable Critic (D2D) framework to improve the numeracy of text-to-image diffusion models. By incorporating a differentiable object counting module as a critic during trai...
Introduces the task of grounded article generation from multiple, diverse videos about a real-world event. The goal is to create a Wikipedia-style article where all information is explicitly supported...
Presents an open-source platform for creating context-aware safety guardrails for LLM applications. The system allows developers to define and enforce complex safety policies, enabling more robust pro...
Wednesday, October 22, 2025
Proposes Compressed Latent Reasoning (CoLaR), a framework that dynamically compresses token-level Chain-of-Thought into a latent space. This approach accelerates inference and reduces computational co...
Provides a comprehensive theoretical overview and survey of 3D Gaussian Splatting (3DGS), tracing its evolution from classical volume rendering. The paper details the underlying principles, mathematic...
Introduces "World-in-World," a framework for evaluating generative world models in a closed-loop setting for decision-making tasks. This work bridges the gap between visual simulation and agent contro...
Introduces DCAD-2000, a large-scale multilingual corpus covering over 2000 languages, constructed from web-crawled data. It proposes a novel "Data Cleaning as Anomaly Detection" method to ensure high ...
Introduces UltraGen, a high-resolution video generation model based on a diffusion transformer. It employs a novel Hierarchical Attention mechanism to efficiently model both local and global dependenc...
Provides a comprehensive survey and meta-analysis of methods integrating Large Language Models with 3D spatial data (3D-LLMs). The paper categorizes methodologies, summarizes key tasks and datasets, a...
Introduces OmniNWM, an omniscient driving navigation world model designed to predict future states across multiple modalities (video, LiDAR, maps). The model handles long sequences, incorporates preci...
Proposes Janus-Pro-R1, a Multimodal Large Language Model that uses reinforcement learning to create a synergistic link between visual comprehension and generation. This allows the model's understandin...
Introduces and defines the task of 3D Audio-Visual Segmentation. This work extends 2D audio-visual segmentation into 3D space, aiming to identify and segment sounding objects within a 3D scene represe...
Demonstrates the use of a fine-tuned geospatial foundation model for detecting, simulating, and predicting urban heat island effects. The model leverages diverse data sources to generate high-resoluti...
Introduces DeepSeek-OCR, a novel method for extreme long-context compression by mapping text into an optical 2D representation. This approach leverages an encoder-decoder architecture to potentially b...
Presents RAD, a closed-loop Reinforcement Learning framework for end-to-end autonomous driving. It trains a driving policy directly in a large-scale, 3D Gaussian Splatting-based simulated environment,...
Proposes a corpus-free pipeline for training dense retrieval models by using a Large Language Model to generate synthetic queries and hard negative passages. This "generate, don't retrieve" approach e...
Proposes the "Translation Barrier Hypothesis," arguing that poor multilingual generation in LLMs for low-resource languages stems from an implicit task-solving-then-translation pipeline failure. This ...
Proposes Re-ttention, an ultra-sparse attention mechanism for Diffusion Transformers that statistically reshapes attention maps to focus computation on important query-key pairs. This method significa...
Presents SAM 2++, a unified framework for video tracking that can handle targets of any granularity, from points and boxes to masks. It extends the Segment Anything Model (SAM) with a novel design to ...
Introduces Visionary-R1, a method that uses reinforcement learning to mitigate shortcut learning in visual reasoning models. By rewarding generalizable reasoning paths over simple correlations, it imp...
Introduces Robobench, a comprehensive benchmark for evaluating Multimodal Large Language Models as the high-level reasoning "brain" for embodied agents. The benchmark assesses capabilities in percepti...
Presents MSR-Align, a framework for improving safety-aware reasoning in Vision-Language Models. It uses a policy-grounded multimodal alignment technique to steer the model's chain-of-thought process a...
Introduces Occluded nuScenes, a new multi-sensor dataset for evaluating perception model robustness in automated driving. The dataset systematically introduces synthetic occlusions to sensors, providi...
Tuesday, October 21, 2025
Provides a comprehensive survey on world models for embodied AI agents. It organizes the field by defining world models as internal simulators that capture environment dynamics, enabling agents to sup...
Demonstrates that visual autoregressive models can outperform diffusion models in inference-time scaling through search strategies. While search offers limited benefits for diffusion models, it signif...
Presents a method to reduce visual hallucinations in Vision-Language Models (VLMs) by incorporating a verification step. It uses retrospective resampling, where the model verifies its own generated te...
Develops a radiology foundation model for pan-tumor clinical diagnosis using synthetic data to overcome the scarcity of annotated medical images. The model is trained on a large-scale synthetic datase...
Introduces a novel theoretical interpretation of the attention matrix in Transformers as a discrete-time Markov chain. This framework unifies common attention operations like selection and averaging a...
Presents a systematic study of scaling laws for deepfake detection, analyzing model performance against the number of real image domains, generation methods, and training images. The work provides fou...
Proposes a method to scale Multi-modal Large Language Models (MLLMs) by decoupling their perception and reasoning modules. This allows for upgrading the internal language model without expensive joint...
Presents REALM, an MLLM-Agent framework for open-world 3D reasoning and editing on Gaussian Splatting representations. The agent interprets complex human instructions to perform precise 3D segmentatio...
Proposes FairGen, a method to enhance fairness in text-to-image diffusion models by self-discovering latent directions associated with biases. The approach allows for mitigating these biases during th...
Presents SSL4Eco, a global, seasonal, and multi-spectral dataset for self-supervised learning in ecology. It provides a large-scale resource of remote sensing imagery to train geospatial foundation mo...
Proposes an industry-level omni-modal large language model pipeline integrating auditory, visual, and linguistic modalities. The system overcomes challenges like limited tri-modal datasets and high co...
Introduces Morpheus, a benchmark for evaluating the physical reasoning of video generative models using real-world physical experiments. It provides a dataset and evaluation suite to test a model's ab...
Introduces a vision-centric model for autonomous driving that performs 4D occupancy forecasting and planning. It uses an implicit residual world model to predict changes in the scene rather than recon...
Introduces Embody 3D, a large-scale multimodal dataset featuring 500 hours of 3D motion data from 439 participants. The dataset includes diverse single-person and two-person interactions, providing a ...
Proposes a method to add pixel-level segmentation capabilities to frozen, pre-trained Multimodal Large Language Models (MLLMs) without fine-tuning the base model. It trains a lightweight segmentation ...
Presents Scale-DiT, a diffusion transformer model for ultra-high-resolution text-to-image generation. It introduces a hierarchical local attention mechanism to overcome the quadratic complexity of sta...
Introduces VisionSelector, an end-to-end learnable module for compressing visual tokens in Multimodal LLMs. It adaptively selects the most informative tokens from high-resolution or multi-image inputs...
Systematically investigates the cross-task generalization capabilities of vision-language-action (VLA) models for robotic manipulation. The study analyzes how VLA models perform on unseen tasks, provi...
Introduces StretchySnake, a flexible training strategy for State Space Models (SSMs) in action recognition. By training the model on clips of varying spatio-temporal scales, it improves generalization...
Introduces PRISMM-Bench, a benchmark for evaluating the ability of Large Multimodal Models (LMMs) to detect multimodal inconsistencies in scientific papers. It tests whether models can reason across t...
Monday, October 20, 2025
Introduces the Predictive-Corrective (PC) paradigm and PCMambaN network for anatomy-informed brain MRI segmentation. Achieves accelerated learning and improved efficiency in data-scarce medical imagin...
Introduces PFGS, a pose-aware 3D Gaussian Splatting framework that reconstructs complete objects from multi-pose image captures. Addresses limitations of single-pose methods by integrating pose inform...
Proposes a diffusion bridge network to synthesize clinical-grade FDG-PET scans from standard MRI images for dementia diagnosis. This approach makes a critical diagnostic tool more accessible by simula...
Introduces Unimedvl, a unified medical vision-language model for both understanding and generation tasks. It processes diverse multimodal inputs to generate textual reports, visual annotations, and se...
Presents Skyfall-GS, a method to synthesize large-scale, explorable, and geometrically accurate 3D urban scenes from satellite imagery. It addresses the lack of real-world 3D scans for training genera...
Introduces AutoGraph-R1, an end-to-end reinforcement learning framework for building knowledge graphs for RAG systems. It directly optimizes the KG construction process to improve performance on downs...
Proposes VISTA, a test-time self-improving agent for text-to-video generation. Instead of relying on a perfect user prompt, VISTA iteratively refines the generated video based on user-defined scoring ...
Presents BLIP3o-NEXT, a fully open-source vision-language foundation model that unifies text-to-image generation and image editing within a single architecture. The model demonstrates strong performan...
Introduces Ditto, a framework to address data scarcity in instruction-based video editing. It features a pipeline to automatically generate a large-scale, high-quality synthetic dataset of video editi...
Presents MAVR-Net, a multi-view learning framework for MAV action recognition using cross-view attention. Addresses limitations of RGB-only models by capturing complex spatial-temporal characteristics...
Presents Bolt3D, a latent diffusion model for feed-forward 3D scene generation from images. It directly samples a 3D scene representation in under seven seconds on a single GPU, achieving a significan...
Introduces YOLOE, a model extending the YOLO series for real-time open-vocabulary object detection and segmentation. It leverages visual and text prompts to detect and segment any object without being...
Introduces SHARE, a technique that leverages scene geometry to accurately ground human motion reconstruction from monocular RGB video. Addresses challenges in placing humans in 3D space for realistic ...
Proposes FreqPDE, rethinking positional depth embedding for multi-view 3D object detection transformers. Addresses depth prediction quality issues in autonomous driving by improving spatial informatio...
Introduces the Hierarchical Mixing Architecture (HiMA) for efficient low-light RAW image enhancement. Leverages complementary strengths of Transformer and Mamba for improved enhancement quality and hi...
Explores leveraging pre-trained text-to-image diffusion models for task-adaptive visual representations in robotic control without fine-tuning. Investigates optimal conditions for applying textual pro...
Proposes Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework. Addresses limitations in single-dataset training by explicitly unifying landmark detection across different ...
Presents a systematic investigation of custom CNN architectures for satellite land use classification, achieving 97.23% accuracy on EuroSAT without pre-training. Introduces a novel balanced multi-task...
Introduces SiM2P, a 3D diffusion bridge-based framework simulating clinical-grade PET from MRI for dementia diagnostics. Learns a probabilistic mapping from MRI to PET images, addressing accessibility...
Presents V2X-Radar, a new large-scale, multi-modal dataset for cooperative perception in autonomous driving. It uniquely features 4D radar data alongside LiDAR and camera streams, enabling research on...
Friday, October 17, 2025
Introduces ScholarBench, a benchmark for evaluating LLMs on complex academic problem-solving. It targets specialized contexts to assess academic reasoning ability, addressing limitations of prior benc...
Develops EasyNER, an easy-to-use pipeline for Named Entity Recognition in medical and life science text. It provides automated text mining to help researchers utilize information from large bodies of ...
Proposes Vgent, a graph-based Retrieval-Augmented Generation framework for long video understanding. It addresses challenges in processing extended video tokens and retaining long-term sequential info...
Proposes PIA, a deepfake detection method using phoneme-temporal and identity-dynamic analysis. It aims to improve the identification of modern deepfakes generated by advanced generative models, overc...
Introduces VaCo, a framework optimizing MLLM representations through vision-centric activation and coordination. It enhances analytical abilities by leveraging multiple vision foundation models, addre...
Applies pruning to overparameterized multi-task networks for degraded web image restoration. It addresses the quality of web images affected by lossy operations, aiming to recover clean, high-quality ...
Introduces DentVFM, the first family of vision foundation models for oral and maxillofacial radiology. It addresses limitations of single-modality, task-specific dental AI systems, aiming for generali...
Proposes scene de-contextualization for consistent text-to-image generation. It addresses identity shift by decoupling subject and scene context, enabling identity-preserving images across diverse sce...
Introduces Efficient Video Sampling (EVS), a method for pruning temporally redundant tokens in videos. It addresses scalability limitations of VLMs processing dense frame sequences, reducing token red...
Proposes SteeringTTA, an inference-only framework guiding diffusion-based input adaptation for test-time adaptation. It steers diffusion trajectories to improve robustness across distortion types, add...
Investigates LLM stability in translating natural language to formal logic for reasoning. Identifies inconsistencies in symbolic representations across linguistic forms, highlighting a need for more r...
Introduces three evaluation metrics for advertisement image generation: Creativity, prompt Alignment, and Persuasiveness (CAP). Addresses the challenge of evaluating Text-to-Image models beyond simple...
Presents a zero-shot pipeline for creating hyperrealistic 3D avatars from phone images. Introduces a generative canonicalization approach to address geometric inconsistencies and improve identity pres...
Introduces CLEAR, a causal-inference-based framework for robust histopathology tumor detection. It leverages semantic features while mitigating OOD shifts by modeling causal relationships, improving g...
Proposes DOS, a method for directional object separation in text embeddings for multi-object image generation. It addresses challenges in T2I models with multiple objects, mitigating object neglect an...
Introduces PaddleOCR-VL, a compact Vision-Language Model for multilingual document parsing. It efficiently supports 109 languages and excels at recognizing complex elements like text, tables, and char...
Proposes a framework for brain MR image harmonization that acquires interpretable domain information. It disentangles domain-invariant and domain-specific features to improve machine learning performa...
Introduces an ego-proactive Video-LLM for streaming video that actively understands and anticipates events. It focuses on proactive coherence and just-in-time perception and reasoning for dynamic, evo...
Introduces RepTok, a generative modeling framework using single continuous latent tokens from self-supervised ViTs. It adapts semantic tokens with low-level details for faithful image reconstruction, ...
Introduces WeCKD, a weakly-supervised chained distillation network for efficient multimodal medical imaging. It addresses knowledge degradation and inefficient supervision in traditional KD by using a...
Thursday, October 16, 2025
Introduces a new KV cache eviction strategy that dynamically adapts eviction thresholds based on predicted future importance. Achieves significant memory reduction and speedup in LLM inference by pres...
Introduces D-SMART, a dynamic structured memory and reasoning tree framework to enhance LLM dialogue consistency. Addresses factual inconsistencies and logical decay in multi-turn dialogues by adaptiv...
Introduces Breadcrumbs Reasoning, using learned compression beacons to periodically compress the KV cache. Achieves memory-efficient long-context reasoning by reducing KV cache costs, enabling LLMs to...
Introduces a multi-pair, multi-perspective preference optimization method for machine translation that addresses flawed reward signals and inefficient data utilization. Improves LLM alignment to human prefer...
Explores Bayesian Persuasion (BP) in natural language for single-turn dialogues to enhance LLM strategic persuasion. Incorporates information asymmetry and avoids pre-commitment assumptions, improving...
Presents a framework for enhancing LLM capabilities in underrepresented languages by fine-tuning language-specific subnetworks. Identifies language-specific neurons and tunes associated weights, impro...
Introduces a novel methodology for evaluating chat assistants' web search behavior, focusing on source credibility and response groundedness. Assesses how assistants integrate web search, highlighting...
Proposes ICA-RAG, an adaptive retrieval-augmented generation framework guided by information completeness for disease diagnosis. Tailors retrieval strategies to diagnostic difficulty and sample inform...
Introduces FreshTab, a method for on-the-fly table-to-text benchmark generation from Wikipedia. Combats LLM data contamination and enables domain-sensitive evaluation, addressing precision needs in table-to-tex...
Introduces DualHyp, an audio-visual speech error correction framework using an LLM to compose N-best hypotheses from ASR and VSR models. Enhances error correction by reasoning over modality-specific e...
Positions attention heads as a mechanistic blueprint for LLM reasoning, distinguishing between local and global attention for fine-grained policy optimization. Enables legible internal logic and impro...
Proposes MemoTime, a memory-augmented temporal knowledge graph to enhance LLM temporal reasoning. Addresses challenges in understanding evolving event sequences and compound operators, enabling more a...
Presents BRIEF-Pro, a universal, lightweight compressor for distillation of relevant evidence in retrieval-augmented generation. Enables fast and accurate multi-hop reasoning by summarizing retrieved ...
Assesses whether the Concordia framework can effectively model Theory of Mind (ToM) in simulated environments using GPT-4. Explores if LLMs can perform tasks requiring genuine understanding of others'...
Introduces a controlled evaluation framework to investigate the mechanisms and loci of symbol grounding emergence in (vision-)language models. Explores how symbols acquire meaning by connecting to rea...
Systematically examines how decoding strategies affect the detectability of machine-written texts. Demonstrates the robustness of text detection systems to changes in generation settings, highlighting...
Surveys Arabic LLM evaluation benchmarks, analyzing 40+ resources across NLP tasks, knowledge, and culture. Proposes a taxonomy and identifies critical gaps, revealing progress and areas needing devel...
Proposes a confidence estimation method for RAG systems using feed-forward network activations to align with output correctness. Enables response abstention based on uncertainty, improving LLM trustwo...
Presents the first large-scale, multilingual study on personalized disinformation generation by LLMs. Investigates the interplay between safeguards, personalization, and disinformation, revealing LLM ...
Identifies topical differences in gender bias across regions and proposes region-aware bias evaluation metrics. Addresses limitations of existing benchmarks by considering context-specific biases, lea...
Wednesday, October 15, 2025
Proposes a hierarchical reasoning framework for incident report generation from dashcam videos. Aims to improve hazard understanding in out-of-distribution scenarios for autonomous driving models.
Introduces SpineBench, a Visual Question Answering benchmark for fine-grained spinal pathology analysis. Evaluates multimodal LLMs, addressing limitations of existing general medical benchmarks.
Applies pre-trained state space models for video classification using prompt learning. Gathers and spreads spatio-temporal information for efficient adaptation to downstream tasks.
Proposes UniGS, a unified representation and framework for multimodal 3D reconstruction using Gaussian Splatting. Renders RGB, depth, normals, and semantic logits simultaneously with high fidelity.
Introduces BEEP3D for 3D instance segmentation using box-level supervision. Generates pseudo-masks end-to-end, addressing annotation costs and ambiguity in overlapping regions.
Proposes BIGFix for bidirectional image generation using token fixing. Aims to improve inference efficiency by combining auto-regressive modeling with multi-token prediction.
Proposes Multiplicative Loss and Confidence-Adaptive Multiplicative Loss for semantic segmentation. Enhances performance in medical and cellular images, especially with limited data.
Introduces CurriFlow, a semantic occupancy prediction framework for 3D Semantic Scene Completion. Integrates optical flow for temporal alignment, addressing motion reasoning and occlusion challenges.
Presents a probabilistic reinterpretation of training Scene Coordinate Regression (SCR) models. Infuses high-level reconstruction priors to improve implicit scene representations for 3D vision.
Introduces VideoLucy, a framework with deep memory backtracking for long video understanding. Addresses challenges in temporal context capture and sparse frame sampling for agent-based systems.
Introduces data curation approaches to study their impact on Vision-Language reasoning capabilities. Analyzes effects of context sources and implements targeted data interventions.
Proposes spatio-temporally consistent proxy nodes to represent dynamic objects for vectorized video representation. Enables easy editing by overcoming vulnerabilities of pixel-level matching.
Introduces Priority-Adaptive Gaussian Splatting (PAGS) for reconstructing dynamic 3D urban scenes. Injects task-aware semantic priorities into 3D representations to address fidelity vs. cost trade-off...
Proposes an angle-based perception approach for spatial-sensitive multi-modality image fusion. Integrates visible-infrared information to produce enhanced images for downstream tasks.
Conducts a comparative analysis of synthetic versus real-world data for object detection fine-tuning. Investigates opportunities to optimize workflows in industries like manufacturing.
Proposes Ivan-ISTD, a wavelet-guided framework for Infrared Small Target Detection. Addresses cross-domain shift and heteroscedastic noise perturbations using invariance learning.
Investigates the role of local background features in out-of-distribution detection. Addresses overconfident predictions on OOD data, improving reliability in real-world deployments.
Proposes Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition. Enhances robustness and accuracy in recognizing completed steps from egocentric assembly videos.
Reviews, evaluates, and proposes a research agenda for using Vision-Language Models in general urban monitoring. Addresses challenges in object diversity, environmental conditions, and contextual unde...
Tuesday, October 14, 2025
No research highlights available for this date
Monday, October 13, 2025
Introduces SpatialSplat for efficient semantic 3D reconstruction from sparse unposed images. Associates primitives with compressed semantic features, addressing limitations of prior methods in incorpo...
Introduces ProbRes, a probabilistic residual search framework based on jump-diffusion for open-world egocentric activity recognition. Balances prior-guided exploration and likelihood-driven exploitati...
Proposes FLAIR to solve inverse imaging problems using flow-based latent generative models. Addresses intractable data likelihood and direct generative model integration challenges for improved fideli...
Proposes HoliTom for holistic token merging to accelerate video LLMs. Addresses computational inefficiency caused by redundant video tokens with an efficient token pruning strategy.
Introduces Self-supervised Motion Fields (SMF) for template-free, rig-free animation transfer. Addresses limitations of existing methods like motion jitter and limited generalization to unseen motions...
Introduces SQ-GAN integrating semantic image coding and vector quantization for optimized image compression. Focuses on source coding, compliant with legacy systems, using semantic segmentation maps f...
Introduces online Video Depth Anything (oVDA) for temporally-consistent depth prediction with low memory. Adapts LLM techniques like latent feature caching for efficient online processing.
Introduces BLINK-Twice, a vision-centric reasoning benchmark for MLLMs. Focuses on challenging perceptual tasks requiring reasoning from visual context rather than external knowledge.
Introduces MomentSeg for moment-centric sampling in referring video object segmentation. Jointly learns sampling strategies to improve temporal reasoning and fine-grained visual comprehension.
Investigates using discrete semantic entropy (DSE) to filter questions likely to generate hallucinations in radiology VLMs. Aims to improve accuracy in medical image-based visual question answering.
Proposes a text-to-video diffusion transformer to generate annotated data for training, addressing data scarcity in video action understanding. Enables scalable, realistic data generation without huma...
Proposes CQ-DINO to mitigate gradient dilution in vast vocabulary object detection. Addresses positive and hard negative gradient dilution by introducing category queries, improving learning signals f...
Introduces DenseDPO for fine-grained temporal preference optimization in video diffusion models. Addresses limitations of pairwise video comparisons by enabling detailed temporal preference learning.
Proposes DiffMark, a diffusion-based robust watermarking framework against deepfakes. Enables seamless watermark fusion during image generation, offering improved robustness against deepfake manipulat...
Develops a differentially private framework for 2D human pose estimation. Provides formal privacy guarantees while addressing the data utility degradation typically associated with differential privac...
Proposes TARO for semantically rich open-world object detection, moving beyond closed-world assumptions. Aims to assign subcategories to novel objects for enhanced decision-making in safety-critical c...
Proposes TEMA-LLM for cross-domain sequential recommendation. Integrates tag-enriched multi-attention and LLMs to capture both domain-specific and cross-domain user behaviors effectively.
Proposes a progressive prompt fusion network for infrared image enhancement, addressing coupled degradations. Revisits imaging models to improve effectiveness on infrared sensors due to significant ima...
Proposes dynamic Chain-of-Thought for boosting multi-modal keyphrase prediction in vision-language models. Addresses limitations in handling absence and unseen scenarios, and overestimation in existin...
Presents Cattle-CLIP, a multimodal framework for cattle behavior recognition using semantic cues. Improves video-based visual feature recognition performance by leveraging semantic information.
Friday, October 10, 2025
Introduces an uncertainty-aware diffusion guided refinement framework for 3D scene reconstruction from a single image. It addresses limitations in existing methods that render incoherent and blurry no...
Proposes D2GS, a depth-and-density guided Gaussian Splatting method that addresses instability and performance degradation in sparse-view reconstruction. It improves accuracy by identifying and fixing...
Introduces X2Video, a diffusion model for photorealistic video rendering guided by intrinsic channels and multimodal controls. It allows manipulation of color, material, geometry, and lighting with re...
Proposes ReSplat, a recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicit gradients. It leverages rendering error as a feedback signal for improved performance, esp...
Proposes compact clue selection for efficient Retrieval-Augmented Generation (RAG) reasoning, optimizing input for LLMs. It extracts and organizes answer-relevant clues from documents to enhance reaso...
Proposes FlowNIB to analyze bidirectional vs. unidirectional language models using the Information Bottleneck principle. It investigates the theoretical reasons behind bidirectional models' better con...
Investigates integrating LLMs into Argument Summarization (ArgSum) systems and proposes a novel prompt-based evaluation scheme. It validates this scheme through a new human benchmark dataset for ArgSu...
Introduces FlashDLM to accelerate Diffusion Language Model (DLM) inference using efficient KV caching and guided diffusion. It addresses slow inference in DLMs by optimizing token generation processes...
Presents XYZCylinder, a feedforward reconstruction method for driving scenes using a unified cylinder lifting approach. It improves generalization by learning a fixed view transformation for single-re...
Introduces the Dual-Stream Alignment Network (DSA Net) for action segmentation, proposing a novel dual-stream approach. It learns action-wise features to enhance performance by modeling spatio-tempora...
Introduces NaViL, a native training approach for Multimodal Large Language Models (MLLMs) that systematically studies its design space and scaling properties under data constraints. It aims to improve...
Presents a method to optimize neural fields for spectral prefiltering in a single forward pass by analytically scaling Fourier feature embeddings. This enables efficient, resolution-independent neural...
Presents SViM3D, a framework predicting multi-view consistent physically based rendering (PBR) materials from a single image. It extends latent video diffusion models to efficiently generate 3D object...
Introduces SatFusion, a unified framework for enhancing satellite IoT images by fusing multi-temporal and multi-source data. It exploits complementary information across temporal and source dimensions...
Introduces R2RGEN for real-to-real 3D data generation to achieve spatially generalized robotic manipulation. It aims to train visuomotor policies robust to variations in object distribution, environme...
Introduces I&S-ViT, an inclusive and stable method for post-training quantization of Vision Transformers (ViTs). It addresses cost issues by enabling low-bit operation while mitigating performance dro...
Proposes a misleading-learning approach for fair deepfake detection that addresses dual-overfitting issues. It fills redundant semantic environments to improve fairness and reduce demographic bias in ...
Investigates representation alignment in multilingual LLMs, particularly in middle layers, to disentangle language-specific and language-agnostic information. It confirms alignment and analyzes its be...
Proposes Targetless LiDAR-Camera Calibration (TLC-Calib) using Neural Gaussian Splatting. It jointly optimizes sensor poses and a neural Gaussian-based representation, eliminating the need for physica...
Presents DEGS, a deformable event-based 3D Gaussian Splatting method for dynamic scenes using RGB and event streams. It addresses challenges in reconstructing dynamic scenes from low-framerate RGB vid...
Thursday, October 9, 2025
Proposes SanDRA, the first safe LLM-based decision-making framework for automated vehicles using reachability analysis. Addresses LLM hallucinations and integrates vehicle dynamics for safer autonomou...
Investigates the robustness of Vision-Language-Action (VLA) models in embodied AI to linguistic perturbations, specifically irrelevant context in commands. Presents a novel systematic study evaluating...
Introduces a Diffusion Trajectory-guided policy for long-horizon robot manipulation, leveraging diffusion models to mitigate compounding errors in imitation learning. Addresses challenges in out-of-di...
Introduces RAISE (Robotic Autonomous Imaging Surface Evaluator), a closed-loop, self-driving laboratory. Links liquid formulation optimization with surface wettability assessment for interfacial prop...
Investigates sampling strategies for configuration variations to generate robust universal locomotion policies for quadrupedal robots. Compares joint gain sampling strategies to enable single reinforc...
Presents a geometric control framework on the Lie group SO(3) for 3D source-seeking by robot swarms. Avoids Euler-angle singularities and quaternion ambiguities, ensuring intrinsic orientation represe...
Introduces DPL, a depth-only perceptive humanoid locomotion framework using realistic depth synthesis and cross-attention terrain reconstruction. Addresses limitations of current depth-image and eleva...
Presents an end-to-end sensing-to-control pipeline for small fixed-wing aircraft, combining bio-inspired hardware, physics-informed dynamics learning, and convex control allocation. Inspired by narwha...
Introduces EffiTune to diagnose and mitigate training inefficiency for parameter tuners in robot navigation systems. Balances classical and learning-based methods to improve adaptability and stability...
Explores professional abstract artists' perceptions of co-creative interactions with an autonomous painting robotic arm. Analyzes their experiences through semi-structured interviews to understand hum...
Introduces RLinf-VLA, a unified and efficient framework for Vision-Language-Action (VLA) and Reinforcement Learning (RL) training. Addresses error accumulation in VLA models trained with supervised fi...
Presents Assist-As-Needed, an adaptive multimodal robotic assistance system for medication management in dementia care. Addresses limitations of one-size-fits-all assistive technologies by adapting as...
Proposes a framework using diffusion models to autonomously identify recovery needs and optimize contact-rich trajectories for multi-fingered robotic manipulation. Enables recovery behaviors to resume...
Explores kirigami potential in robotics by tailoring materials for multifunctional, lightweight, and adaptable solutions. Details how kirigami components can be optimized for specific robotic applicat...
Develops a model-free workspace trajectory planner for space manipulators using a TD3 agent for safe debris removal. Employs local control strategies for singularity avoidance and manipulability enhan...
Proposes temporal-prior-guided view planning for periodic 3D plant reconstruction. Aligns previous models with new observations and uses inflation to accommodate plant growth for improved reconstructi...
Introduces the M3RS problem for multi-robot missions, considering quality of service as a variable. Addresses time-constrained missions with multiple execution modes, varying resource needs, durations...
Proposes a pipeline for mobile manipulator path planning that generates and optimizes topologically distinct paths with end-effector constraints. Circumvents local optima convergence by discovering mu...
Proposes P2 Explore for efficient robot exploration in unknown cluttered environments by predicting floor plans. Improves exploration efficiency by overcoming limitations of traditional frontier-based...
Introduces COMPAct, a framework for computational optimization and automated modular design of planetary actuators. Systematically identifies optimal gearbox parameters for a given motor across four g...
Wednesday, October 8, 2025
Proposes SAFER, a framework for Safety Alignment via Efficient Ex-Ante Reasoning, enhancing LLM safety by instantiating structured reasoning to address harmful content generation. Demonstrates improve...
Introduces Lang-PINN, a multi-agent framework enabling LLMs to generate physics-informed neural networks (PINNs) from language descriptions. Simplifies PINN construction by automating PDE formulation,...
Introduces a new novelty metric for LLM generations, addressing limitations of prior work evaluating originality and quality. Aims to reliably measure LLM's ability to generate novel, high-quality out...
Introduces ExpertLongBench, an expert-level benchmark with 11 tasks across 9 domains for long-form generation. Utilizes structured checklists validated by domain experts to evaluate LLM adherence to s...
Explores leveraging LLM biases to reveal society's "unwritten code", such as implicit stereotypes. Proposes a framework using a case study in science to uncover hidden rules in peer review, making biases ...
Proposes a holistic evaluation for RAG systems and web agents on deep search tasks using hint-free questions and factorized metrics. Addresses limitations of current benchmarks that leak reasoning pat...
Explores advancements and applications of large models (LLMs, Vision, 3D, Multimodal) in medicine, revolutionizing disease prediction, diagnosis, and drug discovery. Integrates GNNs for medical knowle...
Introduces SocialNLI (SoNLI), the first social dialogue inference dataset, assessing models' social abilities via theory-of-mind inferences from human dialogue. Addresses LLMs' struggles with sophisti...
Introduces Self-Filtered Distillation for patent classification, using LLM-generated rationales as trust signals. Addresses logical errors and misalignments in rationales by filtering noise for stable...
Addresses ICL's quadratic input complexity by proposing submodular context partitioning and compression. Mitigates information redundancy from partitions to improve performance, enabling efficient few...
Adapts decoder-only LLMs to solve partial differential equations (PDEs) by proposing cross-modal adaptation, addressing limitations of encoder-only models. Shows potential for LLMs in scientific machi...
Analyzes prompt underspecification in LLMs, showing fragile inference and instability across model/prompt changes. Proposes methods to manage underspecification, enabling more reliable LLM application...
Proposes FAID, a fine-grained detection framework using multi-task auxiliary and contrastive learning to classify human, LLM, and hybrid texts. Introduces FAIDSet, a multilingual dataset for improved ...
Proposes Self-Routing RAG (SR-RAG), a framework binding selective retrieval with knowledge verbalization. Enables LLMs to improve RAG accuracy and efficiency by making better retrieval decisions, brid...
Traces factual knowledge acquisition and cross-lingual consistency in LLM pretraining. Focuses on OLMo-7B, finding improvements in accuracy and consistency over time, providing insights into how factu...
Investigates using LLMs as cognitive models of human linguistic prediction by making them less superhuman. Suggests that improving LLM performance on prediction tasks requires making them more human-l...
Introduces WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Evaluates LLMs' ability to handle complex instructions spanning broad lexical and t...
Surveys recent efforts to overcome the quadratic complexity bottleneck of transformer attention. Critically analyzes sub-quadratic attention variants, RNNs, state space models, and hybrid architecture...
Proposes SimulatorArena to systematically study the reliability of LLM-simulated users for AI assistant evaluation. Addresses the lack of benchmarks for automatic evaluation, aiming to determine if si...
Proposes CL-PDE, a framework for cross-lingual mental health ontologies using graphs for Indian languages. Bridges patient expression and clinical understanding via explainable AI and human-in-the-loo...
Tuesday, October 7, 2025
Proposes a joint learning framework for 6D pose estimation using denoising diffusion and score scaling sampling. This method improves training convergence and reduces the need for additional pose vali...
Introduces Quokka, the first systematic scaling law for diffusion language models. It covers compute and data-constrained regimes, offering practical guidance for DLM training and future AI research.
Proposes LaDiR, unifying latent diffusion with LLMs for improved text reasoning. This framework enables iterative refinement of reasoning paths, addressing autoregressive decoding limitations and enha...
Introduces SwiReasoning, enabling LLMs to switch between latent and explicit reasoning for Pareto-superior performance. This framework enhances token efficiency and robustness, particularly for challe...
Introduces PatentMind, a framework for patent similarity evaluation using a Multi-Aspect Reasoning Graph. It decomposes patents into technical, application, and claim dimensions for comprehensive anal...
Introduces Reason-RFT, a reinforcement fine-tuning framework for VLMs to improve visual reasoning. This approach mitigates overfitting from supervised fine-tuning, enhancing generalization and real-wo...
Introduces AutoMiSeg, a zero-shot pipeline for automatic medical image segmentation using foundation models. This approach combines VLMs and segmentation models for direct segmentation without expert ...
Optimizes Clifford neural layers for inference speed using superscalar techniques. This approach addresses computational bottlenecks in equivariant networks, enabling faster execution without sacrific...
Introduces SyMerge for synergistic model merging via single-layer adaptation. This framework moves beyond task non-interference to actively enhance cross-task performance, offering improved model comb...
Introduces StructPrune for structured global pruning of LLMs with reduced GPU memory. This method balances efficiency and robustness by leveraging asymptotic analysis and layer-independent pruning.
Disentangles recall and reasoning in transformers using layer-wise analysis. This method reveals distinct internal mechanisms for these abilities, aiding in understanding model behavior and targeted i...
Introduces MAVE, a cross-attentive Mamba framework for high-fidelity voice editing and zero-shot TTS. This model achieves state-of-the-art speech editing and competitive TTS results, outperforming exi...
Proposes Test-Time Token-Level Cross-Validation for dLLMs to address early termination issues. This method allows revision of tokens across iterations, improving final output quality and mitigating er...
Introduces a framework to ensure contextual integrity in LLMs by training them to reason about information disclosure. This approach uses reinforcement learning to align LLM behavior with human prefer...
Analyzes in-context learning representations across transformer layers, revealing a layerwise compression-to-expression phenomenon. This insight helps understand how LLMs capture task-specific informa...
Introduces LLM-Sieve, a framework for task-specific pruning of LLMs to minimal parameter subsets. This method achieves efficient and faithful task performance by using output-aligned projections.
Proposes Wave-PDE Nets, a novel architecture using trainable wave-equation layers as an alternative to attention. This approach models global dependencies efficiently, offering a powerful mechanism fo...
Proposes a new framework that avoids backpropagation's weight symmetry requirement by using a biologically plausible mechanism. This approach addresses the weight transport problem, enabling more biol...
Monday, October 6, 2025
Surveys research on defenses against AI-generated visual media, covering detection, disruption, and authentication methods. Provides a systematic and timely review essential for understanding and miti...
Provides a comprehensive assessment of CLIP model robustness by investigating specific visual factors and safety objectives like confidence uncertainty. Aims to offer new perspectives beyond overall c...
Proposes filter-guided diffusion for controllable image generation, enhancing zero-shot image-to-image translation and editing. Addresses runtime and memory costs of existing feature injection methods...
Proposes WaveNet-SF, a hybrid network using wavelet transform for enhanced retinal disease detection from OCT images. Addresses challenges like speckle noise and varying lesion sizes for critical time...
Introduces So-Fake, a benchmark and explanation framework for social media image forgery detection. Addresses limitations in current datasets and detection methods for realistic, diverse social media ...
Proposes fine-grained abnormality prompt learning for zero-shot anomaly detection. Addresses limitations of current methods focusing on coarse-grained semantics by enabling recognition of finer-graine...
Proposes neural posterior estimation with autoregressive tiling for detecting faint, overlapping objects in astronomical images. Introduces an amortized variational inference procedure for small-objec...
Presents InsideOut, an EfficientNetV2-S based framework for robust multi-class facial emotion recognition. Addresses challenges like occlusions and illumination variations for improved FER performance...
Presents SoccerSynth-Detection, a synthetic dataset addressing diversity limitations for soccer player detection. Aims to improve algorithm adaptation to varied soccer video contexts with frequent occ...
Proposes latent diffusion unlearning using trajectory shifted perturbations to protect against unauthorized personalization. Addresses concerns regarding data privacy and intellectual property protect...
Proposes a method to rank large multimodal models without labels, exploring alternative signals beyond standard performance evaluation. Aims to provide efficient ways to choose between models when fac...
Introduces RichControl for training-free spatial control in text-to-image generation. Addresses limitations of feature injection methods by improving structural alignment and reducing visual artifacts...
Introduces GCVAMD, a modified CausalVAE model for detecting and predicting Age-related Macular Degeneration risk factors. Aims to improve early-stage detection and so reduce the risk of vision loss.
Presents PyRadiomics-cuda, a GPU-accelerated extension for extracting 3D features from medical images. Dramatically reduces processing times for volumetric datasets while maintaining API compatibility...
Explores training-free out-of-distribution segmentation using foundation models. Investigates the capability of these models to detect unknown regions in semantic segmentation, a capability previously...
Revisits reweighted risk functionals for model calibration, establishing a connection between calibration error and selective classification. Clarifies theoretical links for common deep learning losse...
Introduces LEAML, a label-efficient adaptation framework for multimodal LLMs on OOD visual tasks. Leverages scarce labeled VQA samples and unlabeled images to generate pseudo QA pairs for adaptation.
Introduces a unified zero-shot captioning framework shifting from image-centric to patch-centric paradigms. Enables captioning at a finer granularity, moving beyond global representations for more det...
Introduces Gate-Shift-Pose, enhancing action recognition by integrating skeleton pose data with RGB frames. Evaluates early and late fusion strategies for athlete fall classification in figure skating...
Friday, October 3, 2025
Introduces dynamic bundling with LLMs for zero-shot inference on text-attributed graphs. Addresses limited graph information and unreliable responses by proposing a novel framework, enabling better ge...
Proposes a systematic method to determine optimal data mixtures for foundation models using scaling laws. Accurately predicts model loss based on size and mixture proportions, enabling efficient large...
Introduces a lightweight, plug-in framework for adversarial example detection leveraging internal layer-wise inconsistencies. Addresses limitations of external models and complex architectures, improv...
Demonstrates Transformers discovering molecular structure without graph priors, challenging GNN dominance. Avoids fixed graph limitations and improves expressivity and inference speed for molecular ma...
Presents BD Attention (BDA), a lossless reformulation of attention using Basis Decomposition. Achieves mathematically guaranteed acceleration by restructuring multi-head projections, improving efficie...
Explores the applicability of the Convolutional Tsetlin Machine for large-scale machine learning. Offers transparent, logic-based classification with comparable performance to neural networks, enhanci...
Introduces PepCompass, navigating peptide embedding spaces using Riemannian geometry. Addresses distorted exploration and inefficient optimization from flat Euclidean metrics, improving antimicrobial ...
Introduces Penalized Exponential Loss (PENEX), a multi-class exponential loss formulation amenable to optimization. Offers AdaBoost-inspired regularization for neural networks, grounding generalizatio...
Introduces DAC-SE1, a simplified language model-based speech enhancement framework using discrete high-resolution audio representations. Achieves high-fidelity enhancement with a simplified pipeline, ...
Analyzes a policy execution framework that samples actions from stochastic policies at discrete time points. Proves accuracy bounds as sampling mesh size tends to zero, addressing challenges in continuous...
Proposes learning equivariant models by discovering symmetries with learnable augmentations. Addresses limitations of fixed equivariant architectures and implicit learning, enabling more flexible and ...
Proposes a tri-level optimization model integrating proactive actions, disruptions, and reactive responses for electricity system resilience. Uses conformal prediction for uncertainty, enhancing syste...
Investigates the scaling behavior of xLSTM and Transformers, showing that xLSTM delivers competitive performance with linear time complexity. Enables prediction of model performance relative to compute budgets, offering effi...
Analyzes gradient space dynamics to find efficient training methods for LLMs. Proposes using randomized gradient subspaces to capture most gradient energy, reducing memory bottlenecks and improving tr...
Proposes LieNLSD, a method for explicit discovery of nonlinear symmetries from dynamic data. Determines the number of infinitesimal generators, advancing symmetry discovery beyond linear methods and i...
Proposes StelLA, a geometry-aware extension of LoRA using a three-factor decomposition on the Stiefel manifold. Improves LoRA performance by exploiting geometric structure, offering better parameter-e...
Proposes FairContrast for enhancing fairness through contrastive learning and customized data augmentation on tabular data. Offers a powerful approach to debiasing algorithms and improving fairness wh...
Introduces GlobalDISCO, a large-scale dataset to analyze biases in AI-generated music across countries, languages, cultures, and genres. Addresses underexplored research on global diversity and bias i...
Leverages a multidata causal discovery framework for hurricane intensity forecasting. Addresses limitations of correlation-based methods by incorporating causal discovery, improving generalizability a...
Addresses the impact of noisy human feedback on preference optimization for LLM alignment. Analyzes generalization capabilities under unrealistic noise conditions, crucial for reliable LLM alignment.
Thursday, October 2, 2025
Introduces DreamCS, a geometry-aware text-to-3D generation method using unpaired 3D reward supervision. It mitigates 2D bias artifacts common in prior methods, enabling better human preference alignme...
Proposes using diffusion models as noise-aware latent reward models for preference optimization in diffusion models. Shows pre-trained diffusion models are naturally suited for step-level preference a...
Proposes ATAS, a self-distillation framework for enhanced open-vocabulary dense prediction. Addresses CLIP's struggle with fine-grained understanding by focusing on semantic coherence and vision-langu...
Presents a dual-adapter framework learning frequency and memory-aware prompts for multi-modal object tracking. Addresses underutilization of modality-specific frequency structure and long-range tempor...
Introduces NSARM, an autoregressive modeling approach for robust real-world image super-resolution. Addresses limitations of diffusion models in Real-ISR by improving output quality and efficiency wit...
Proposes a plug-and-play refinement module for autoregressive models to enhance spatial correspondence modeling. Operates as a post-pretraining step to jointly refine generated visual tokens, improvin...
Studies the use of image segmentation foundation models to improve the truthfulness of learned prototypes in explainable AI. Aims to enhance explainability beyond post-hoc saliency techniques.
Introduces PhraseStereo, the first dataset for phrase-region segmentation in stereo image pairs. Addresses limitations in phrase grounding by leveraging stereo vision's rich geometric cues.
Proposes JEPA-T, a unified multimodal framework for image generation using joint-embedding predictive architecture with text fusion. Enhances fusion by incorporating cross-attention after the feature ...
Presents MoE-SGT, a reasoning-driven framework augmenting Concept Bottleneck Models (CBMs) with a Graph Transformer and MoE module. Addresses limitations of single-modal CBMs by incorporating structur...
Investigates why Vision-Language Models underutilize spatial cues, identifying an imbalance between vision and text token norms. Proposes interpretability tools to expose this mechanism and improve sp...
Revisits Classifier-Free Guidance (CFG) theory for diffusion models, rigorously confirming improper coefficient configurations can risk misuse. Proposes rectified guidance to ensure proper combination...
Introduces STORK, a method for faster diffusion and flow matching sampling by addressing ODE stiffness and structure-dependence. Enables quality-preserving sampling with fewer function evaluations for...
Proposes SoftCFG to address guidance diminishing and over-guidance issues in autoregressive models. Uses uncertainty guidance for stable generation, improving visual coherence by managing conditional ...
Proposes a training-free framework using MLLM uncertainty for guidance in complex visual tasks. Leverages intrinsic uncertainty to improve fine-grained perception without task-specific fine-tuning or ...
Introduces ImageDoctor, a unified multi-aspect evaluation model for text-to-image generation. Uses grounded image reasoning to provide comprehensive and interpretable feedback on image quality, moving...
Presents FLORA, the first comprehensive dataset for fashion language-to-outfit translation, containing industry-specific terminology. Introduces a KAN adapter for enhanced feature adaptation in AI-dri...
Proposes SEE, an adaptive brightness adjustment method for event cameras across broad light ranges. Addresses the research gap of utilizing event data beyond low-light enhancement.
Introduces a cascaded diffusion framework for probabilistic coarse-to-fine hand pose estimation. Addresses pose ambiguities and uncertainties by refining predictions in a cascaded manner, improving ac...
Proposes a feed-forward camera localization method from image features, aiming for faster mapping time compared to state-of-the-art approaches. Raises the question of achieving competitive accuracy mu...
Wednesday, October 1, 2025
Introduces AICrypto, the first comprehensive benchmark to evaluate LLM cryptography capabilities. Comprising 135 multiple-choice questions, 150 CTF challenges, and 18 proof problems, it covers a broad...
Proposes the first lattice-based linearly homomorphic ring signature scheme. This scheme combines anonymity with verifiable homomorphic computation, demonstrating potential for confidential blockchain...
Introduces mutual information minimization via optimal noise injection as a countermeasure against side-channel attacks. This approach aims to be more efficient for resource-constrained systems like I...
Introduces Thunderdome, the first timelock-free payment channel network (PCN). It leverages virtual channels to extend a timelock-free primitive, addressing vulnerabilities to timelock and censoring a...
Presents a novel Zero Trust-based Decentralized Identity Management (D-IM) protocol for autonomous vehicles. This system enhances cybersecurity in dynamic, untrusted environments by integrating Zero T...
Introduces low-latency systems for high-quality, instantaneous monitoring of cellular communications to detect unauthorized devices in sensitive areas. Addresses a critical gap in current security sys...
Introduces TRUE, a reproducible framework for LLM-driven relevance judgment in Information Retrieval. It addresses the lack of standardized workflows in existing methods, aiming for reliable label ...
Presents Palace, an open-source library for interactive, GPU-accelerated out-of-core tensor processing and visualization. It enables efficient handling of large tensor datasets for scientific fields.
Proposes Aristotle, a logic-complete framework for LLM logical reasoning that decomposes, searches, and resolves problems. It aims to improve both the efficacy and efficiency of LLM reasoning by lever...
Presents ActorDB, a novel database architecture unifying single-writer actors, incremental view maintenance, and zero-trust security. This system aims to reduce architectural complexity for modern dat...
Proposes AntiFLipper, a novel and computationally efficient defense against multi-class label-flipping attacks in Federated Learning. It aims to prevent the global model's performance degradation caus...
Proposes authenticated Private Set Intersection (PSI) schemes by integrating Merkle Trees with existing PSI protocols. This enhances data integrity in PSI, addressing vulnerabilities to attacks that m...
Presents Chypnosis, an undervolting attack technique that indirectly stops a target circuit's clock to enable static side-channel attacks. Crucially, it also blocks detection mechanisms while preservi...
Proposes logic solver guided directed fuzzing for hardware designs to improve early bug detection in complex IC designs. This approach extends verification efforts for incremental updates in hardware ...
Proposes the concept of differentiated secure connectivity using intents for 5G/6G mobile networks. This approach aims to express and enforce complex, goal-driven security requirements beyond current ...
Formulates threshold signature schemes for cryptocurrencies like Bitcoin as an optimization problem. It determines the optimal threshold to balance security against user lockout risks.
Presents a journalist-centered approach to LLM-powered document search for newsrooms, prioritizing transparency and editorial control. Evaluates small language models for investigative document search...
Analyzes latent space dynamics to explain how diffusion models memorize training data. Shows memorization is driven by specific aspects of the diffusion and denoising process, raising privacy concerns...
Proposes fingerprinting LLMs via prompt injection to detect model derivations without altering the base model. This method aims to robustly identify model provenance even after post-processing.
Introduces IMProofBench, a benchmark for evaluating AI on research-level mathematical proof generation. It consists of 39 peer-reviewed problems designed by expert mathematicians to assess advanced re...
Tuesday, September 30, 2025
Introduces 3D-LATTE, a training-free method for instruction-based 3D asset editing operating in the latent space of native 3D diffusion models. Addresses view-inconsistent editing signals common in 2D...
Introduces ART-DECO, a neural model that generates high-quality 3D assets with detailed geometry and texture from coarse proxies guided by text prompts. Achieves instantaneous detailization in under 1...
Proposes Representation Entanglement for Generation (REG) to simplify training diffusion transformers. Integrates external visual representations from pretrained models through alignment.
Presents SimpleGVR, a baseline for latent-cascaded video super-resolution. Decouples semantic content generation from detail synthesis for efficient video upscaling.
Introduces Implicit-ARAP for efficient handle-guided neural field deformation. Leverages local patch meshing to balance surface quality, robustness, and efficiency in neural field manipulation.
Introduces Freqformer, a Transformer-based model with a frequency-domain module for 3D retinal vasculature reconstruction and quantification from single OCTA scans.
Introduces a novel onboard tracking approach for vision-based relative localization using active blinking markers in multi-robot systems. Improves robustness for aerial vehicles.
Proposes ReDDiT, a rehashing noise approach for discrete diffusion transformers to improve expressive capacity. Addresses design of noise and sampling heuristics in discrete diffusion models.
Proposes a score replacement method with bounded deviation for rare prompt generation in diffusion models. Addresses the struggle of diffusion models with rare concepts by improving prompt switching robustness.
Proposes Forced Prompt Learning (FA) for Vision-Language Models to improve OOD detection. Makes full use of VLMs' inherent capabilities without relying on external datasets.
Proposes the Reconstruct Anything Model (RAM), a lightweight foundation model for computational imaging. Addresses limitations of iterative and unrolled architectures for imaging inverse problems.
Proposes LINO UniPS with Light Register Tokens to unify photometric stereo under arbitrary lighting. Enforces decoupling of illumination and normal information for universal application.
Introduces a generative video semantic communication framework using multimodal semantic fusion with large models. Addresses limitations of traditional syntactic communication for 6G immersive scenari...
Proposes a controllable reference-guided diffusion method with local-global fusion for remote sensing image super-resolution. Integrates complementary information from auxiliary data.
Presents SPARTA for estimating state-dependent traversability from point clouds without needing approach angle as input. Addresses computational inefficiency during planning.
Proposes a deep convolutional network for COPD prediction using lung sound auscultation. Addresses the demand for automated tools in early disease detection.
Introduces RAM-W1K, a multi-task wrist dataset and benchmark for Rheumatoid Arthritis research. Addresses limitations in computer-aided diagnosis (CAD) research due to annotation challenges.
Introduces ZeroScene, a zero-shot framework for 3D scene generation from a single image with controllable texture editing. Leverages large vision models for asset quality and scene coherence.
Introduces Mod-Adapter for tuning-free, versatile multi-concept personalization. Enables customization of abstract concepts like pose and lighting without test-time fine-tuning.
Introduces Causal-Guided Adversarial Steering for counterfactual visual explanations. Addresses view-inconsistent editing signals by incorporating causal relationships.
Monday, September 29, 2025
Proposes a novel neural operator combining differential and integral forms to address error accumulation in long-term turbulence forecasting. Achieves improved physical fidelity and accuracy compared ...
Proposes a framework for computing provably valid prediction bounds for probabilistic image reconstruction algorithms. Enables statistically guaranteed claims about reconstructed subjects from sparse ...
Introduces FERD to address fairness issues in data-free robustness distillation. Identifies and tackles key problems leading to robustness disparity across categories, improving fairness in model tran...
Introduces SeamCrafter, a GPT-style seam generator using reinforcement learning for UV unwrapping. Enhances mesh seam generation, addressing distortion and fragmentation issues in 3D texturing workflo...
Proposes a Surgical Vision World Model to facilitate realistic and interactive surgical simulation. Enables action-controlled data generation for training autonomous surgical agents when real data acq...
Proposes a simple and effective method combining feature norms, randomization, and orthogonality for diverse subset selection. Selects informative samples from large unlabeled pools for annotation, ad...
Introduces NeuVAS, a framework for variational shape modeling using neural implicit surfaces. Addresses challenges in modeling shapes with sparse geometric control, like 3D curve sketches.
Investigates Vision-Language Models' performance in fine-grained tasks like font recognition. Highlights challenges VLMs face in distinguishing texture from semantics, impacting aesthetic and design-r...
Proposes Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. Enables safe and feasible trajectory planning inspired by large language models.
Focuses on motion analysis and part-level segmentation from casually captured RGBD videos for articulated objects. Enables interactable digital twins from practical, scalable acquisition, useful for e...
Proposes the APTx Neuron, a novel, unified neural computation unit integrating activation and computation into a single trainable expression. Eliminates separate activation layers for computational ef...
Presents a novel theoretical framework for understanding disentangled representation learning in diffusion models. Establishes identifiability conditions and derives sample complexity bounds for disen...
Introduces Diffence, a novel defense against membership inference attacks using diffusion models. It removes distinguishing features between member and non-member data by regenerating inputs, enhancin...
Presents a hierarchical multimodal recurrent ensemble that maps video, audio, and language embeddings to fMRI responses. Integrates information over time to predict distributed cortical responses to m...
Enhances diffusion models' compositional generation power on rare concepts using LLM guidance. Demonstrates improved generation of rare compositions by exposing frequent concepts relevant to targets d...
Formulates the policy mobilization problem to improve generalization of visuomotor policies to novel robot positions. Addresses poor generalization from limited robot positions and camera viewpoints i...
Introduces VLN-PE, a physically realistic platform to bridge the embodied gap in Vision-and-Language Navigation. Systematically evaluates VLN methods in physical robotic settings across different pipe...
Introduces enf2enf, a neural field approach for predicting steady-state PDEs with geometric variability. Encodes geometries into latent features anchored at spatial locations, preserving locality for ...
Proposes an STQE network exploiting spatial-temporal correlations to enhance quality of G-PCC compressed dynamic point clouds. Addresses the unexplored area of quality enhancement for compressed dynam...
Friday, September 26, 2025
Introduces TempSamp-R1, a reinforcement fine-tuning framework for video LLMs, addressing inefficient on-policy sampling in large temporal spaces. Achieves improved effectiveness for video temporal gro...
Introduces AnyPlace, a two-stage method trained on synthetic data for generalized object placement in robot manipulation. Leverages VLMs for rough placement location identification, focusing on releva...
Investigates how model architecture and training environment affect deep learning energy consumption. Analyzes trade-offs by training various computer vision models and collecting energy and accuracy ...
Presents FoMo-0D, a pre-trained foundation model for zero/few-shot outlier detection on tabular data. Addresses the bottleneck of unsupervised algorithm and hyperparameter selection for effective OD u...
Investigates whether multi-neuron convex relaxations overcome the single-neuron convex barrier in neural network certification. Addresses questions about their expressiveness and limitations for robus...
Presents the first comprehensive benchmark for evaluating supervised and self-supervised learning for few-shot time-series crop-type classification. Assesses algorithm efficacy in challenging, real-wo...
Argues current definitions of appropriate AI reliance lack formal statistical grounding. Proposes a decision-theoretic framework for measuring AI reliance, focusing on human-AI decision-making and com...
Proposes a semantic edge-cloud communication framework for real-time urban traffic surveillance using ViT and LLMs. Addresses understanding dynamic traffic scenarios and responsive user interaction ov...
Introduces data-centric elastic pipeline parallelism for efficient long-context LLM training. Addresses communication overhead issues in existing schemes like sequence parallelism by optimizing partit...
Explores using the Simplex architecture to enhance safety in deep-learning autonomous systems. Addresses trustworthiness issues related to anomalous samples, distribution shifts, and adversarial attac...
Proposes a new task and benchmark for temporal reasoning in multi-session dialogues, under-studied previously. Introduces TReMu, a neuro-symbolic framework to enhance LLM-agent temporal reasoning capa...
Shows that major reinforcement learning algorithms fit into categorical cybernetics' framework of parameterized bidirectional processes. Extends Bellman operators to parameterized optics for action-va...
Contributes a decision-theoretic framework for characterizing the value of information in human-AI pairings. Focuses on improving performance of collaborating agents by understanding their information...
Introduces tail batching, a novel rollout strategy to mitigate long-tail rollout issues in synchronous RL post-training for LLMs. Aims to reduce GPU underutilization without compromising training accu...
Presents an integrated RL and MPC framework for autonomous satellite docking, mitigating fuel sloshing effects. Integrates PPO and SAC RL algorithms with MPC, leveraging MPC's predictive capabilities ...
Models conductive electrodes in diamond particle detectors using physics-informed neural networks. Extends the classical Ramo-Shockley formalism to optimize design for fast-tracking at high luminosity...
Proposes using text-augmented multimodal LLMs for chemical reaction condition recommendation. Aims to reliably discover effective conditions during reaction exploration, addressing labor-intensive tri...
Investigates the theoretical properties of transformer attention mechanisms in large language models. Analyzes how increasing model size and depth affects performance and identifies potential diminish...
Proposes a maximum entropy regulated long chain-of-thought approach for fine-tuning LLMs to analyze code review dimensions. Enhances LLM context understanding and reasoning compared to human reviewers...
Proposes a dual-path phishing detection framework integrating transformer-based NLP and structural URL analysis. Addresses limitations of traditional methods by comprehensively analyzing both semantic...
Thursday, September 25, 2025
Introduces SoFar, a language-grounded orientation representation that bridges spatial reasoning and object manipulation. Defines object orientations using natural language in a reference-frame-free ma...
Proposes WorldPrompter, a generative pipeline using 360° video as an intermediate representation for synthesizing traversable 3D scenes. Captures full-scene context and ensures visual consistency, off...
Proposes using Vision Language Models (VLMs) to interpret human demonstration videos and generate robot action plans. Integrates keyframe selection, visual perception, and VLM reasoning into a pipelin...
Introduces NERO, a framework for explainable out-of-distribution (OOD) detection using neuron-level relevance. Enhances reliability in deep learning, particularly medical imaging, by flagging potentia...
Introduces CBM-HNMU, a Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding. Leverages concept bottleneck models for effective interventions and mutual understanding, addre...
Proposes Localized LoRA, a generalized framework for parameter-efficient fine-tuning that models weight updates using low-rank matrices applied to structured blocks. Overcomes limitations of global lo...
Introduces Urania, a framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. Employs private clustering and keyword extraction, providing e...
Presents Latent Wavelet Diffusion (LWD), a lightweight framework for ultra-high-resolution image synthesis. Introduces a frequency-aware masking strategy derived from wavelet energy maps to focus diff...
Presents a unified framework for automatic multitrack music arrangement handling diverse scenarios via track-aware reconstruction and structured tokenization. Enables flexible any-to-any instrumentati...
Introduces GraphEQA, utilizing 3D semantic scene graphs for real-time Embodied Question Answering. Addresses challenges in acquiring semantic representations and leveraging prior knowledge for efficie...
Investigates the functions of individual neurons in Vision-Language Models (VLMs) by observing activations with visual and text tokens. Reveals insights into VLM internals, crucial for fostering trans...
Introduces SEM, a diffusion-based policy framework that explicitly enhances spatial understanding for robust robot manipulation. Addresses limitations of 3D point cloud and 2D image encoders by improv...
Proposes POLED, a probabilistic framework for event downsampling that models event importance. Addresses high bandwidth and computational demands of event cameras by adaptively downsampling events bas...
Proposes SurgVidLM for multi-grained surgical video understanding using Large Language Models (LLMs). Helps surgeons understand surgical scenes and procedures by enabling fine-grained vide...
Introduces AAPO, an RL-based method enhancing LLM reasoning by leveraging Advantage Momentum. Eliminates dependency on value models in group relative advantage estimation, simplifying training and imp...
Introduces OmniSpatial, a comprehensive benchmark to evaluate and improve spatial reasoning in Vision-Language Models (VLMs). Addresses the limitations of existing tasks by covering more elementary la...
Introduces CLOSP, a unified semantic space for Synthetic Aperture Radar (SAR), Multispectral Imagery (MSI), and Text in remote sensing. Bridges the gap for text-to-image retrieval systems by exploitin...
Compares the accuracy of Large Language Models (LLMs) against traditional methods for macroeconomic forecasting. Investigates LLM effectiveness in capturing intricate patterns in macroeconomic time se...
Proposes MediNotes, a generative AI framework automating SOAP note creation from medical conversations using LLMs and Retrieval-Augmented Generation. Addresses administrative burden and physician burn...
Introduces HawkBench, a human-labeled benchmark to rigorously assess the resilience of Retrieval-Augmented Generation (RAG) methods. Stratifies tasks based on information-seeking behaviors to evaluate...
Wednesday, September 24, 2025
Proposes EventVL, the first generative event-based MLLM framework for explicit semantic understanding of event streams. Bridges event streams and multimodal LLMs for enhanced semantic understanding, e...
Introduces a novel beam search strategy for latent space exploration in diffusion models. Enables conditional generation of full image sequences with improved visual consistency, addressing challenges...
Presents a system using Multimodal LLMs to analyze millions of images for temporal change patterns. Answers open-ended queries about city trends without predetermined subjects, enabling large-scale vi...
Proposes dual data alignment to improve the generalizability of AI-generated image detectors. Addresses overfitting on non-causal attributes by matching semantic content between real and synthetic ima...
Introduces JL1-CD, a large-scale remote sensing change detection dataset, and a robust multi-teacher knowledge distillation framework. Addresses scarcity of high-resolution datasets and improves perfo...
Proposes an optimal transport perspective for 3D Gaussian Splatting (3DGS) compaction. Casts compaction as global Gaussian mixture reduction, addressing memory and rendering budgets by reducing redund...
Proposes Token Preference Optimization (TPO) with self-calibrated rewards for hallucination mitigation in LVLMs. Addresses lack of scalable token-level rewards and visual-anchored tokens for improved ...
Introduces REACT, a framework for real-time Scene Graph Generation (SGG). Addresses the trade-off between performance and inference speed, enabling SGG for downstream tasks like reasoning for embodied...
Proposes Lavida-O, a unified Masked Diffusion Model for multimodal understanding and generation. Enables image-level understanding, object grounding, image editing, and high-resolution text-to-image s...
Surveys representative methods in Explainable AI (xAI) for computer vision. Addresses the challenge of "black-box" models by providing insights into decision-making processes for improved reliability.
Introduces SparseDiT, a novel framework implementing token sparsification in Diffusion Transformers. Addresses computational costs by reducing self-attention complexity, enabling more efficient genera...
Proposes ICEdit, a framework for precise instruction-based image editing using Diffusion Transformers. Balances precision and efficiency by leveraging inherent comprehension and generation abil...
Proposes AvatarShield, a visual reinforcement learning framework for detecting human-centric synthetic videos. Addresses threats from realistic synthetic human body generation with controllable moveme...
Improves semantic correspondence estimation through 3D-aware pseudo-labeling. Trains an adapter to refine off-the-shelf models, addressing ambiguities in symmetric objects or repeated parts for better...
Proposes HDM, a hybrid diffusion model for unified image anomaly detection. Addresses challenges of complex anomaly patterns by improving coordination between anomaly sample generation and detection.
Proposes DWTGS, a framework rethinking frequency regularization for sparse-view 3D Gaussian Splatting. Leverages wavelet transforms to address overfitting to high-frequency details and improve novel v...
Presents TCVADS, a system for weakly supervised video anomaly detection with explainability and lightweight design. Leverages knowledge distillation and cross-modal contrastive learning for efficient,...
Explores zero-shot deepfake detection, enabling detection without prior exposure to specific variations. Studies self-supervised learning, transformer classifiers, generative model fingerprinting, and...
Presents VLN-Zero, a framework for zero-shot vision-language navigation using neurosymbolic planning. Leverages VLMs to construct symbolic scene graphs for efficient exploration and adaptation in unse...
Proposes Prompt-DAS, a promptable multitask framework for annotation-efficient domain adaptive segmentation of EM images. Utilizes point prompts for unsupervised domain adaptation and weakly supervise...
Tuesday, September 23, 2025
Proposes Visual Instruction Pretraining (ViTP) to improve foundation models in downstream domains by leveraging top-down reasoning influence on low-level perceptual features. Enhances perception-reaso...
Proposes COLA, a context-aware language-driven test-time adaptation framework for domain adaptation without shared labels. Enables adaptation to multiple target domains by leveraging language to guide...
Proposes Self-Distilled RoI Predictors to improve fine-grained perception in Multimodal LLMs by focusing on salient regions. Addresses trade-offs between training data needs and computational cost for...
Proposes the Core Space merging framework for efficient merging of Low-Rank Adaptation (LoRA) models. Avoids merging full weight matrices, maintaining efficiency while enabling adaptation of large neu...
Provides an overview of tame geometry's role in deep learning, focusing on convergence guarantees for stochastic gradient descent in nonsmooth nonconvex settings. Illustrates how deep learning models ...
Introduces Stencil, a framework for subject-driven generation with context guidance, addressing subject consistency issues in diffusion models. Balances fidelity and efficiency by improving prompt-bas...
Presents R-Splatting, a unified framework bridging underwater image restoration and 3D Gaussian Splatting. Improves rendering quality and geometric fidelity for 3D reconstruction in challenging underw...
Introduces ContextFlow, a training-free framework for video object editing via adaptive context enrichment. Addresses fidelity and temporal consistency challenges in diffusion-based video manipulation...
Introduces OnePiece, a framework integrating context engineering and reasoning into industrial cascade ranking systems. Addresses limitations of solely architectural transplanting by leveraging LLM br...
Presents a latent diffusion model for heterogeneous histopathology image generation using semantic segmentation and visual crops. Overcomes challenges in tissue heterogeneity and morphological feature...
Introduces MAESTRO, a framework for multi-task 3D perception that adaptively enhances and suppresses features to mitigate task conflicts. Improves learning efficiency and perception accuracy by managi...
Introduces CoBEVMoE, a collaborative perception framework using dynamic Mixture-of-Experts for heterogeneity-aware feature fusion. Mitigates perceptual diversity issues by dynamically adapting experts...
Presents a novel unsupervised generative modeling challenge for counterfactual sample generation across domains without parallel data. Relies on causal graphs to address challenges beyond conventional...
Introduces a synergistic learning pre-training framework for multimodal semi-supervised medical image classification. Addresses modality fusion and label scarcity challenges by consistently learning a...
Proposes SmaRT, a style-modulated robust test-time adaptation method for cross-domain brain tumor segmentation. Addresses instability and inconsistency in adaptation strategies for medical imaging dom...
Proposes LLaSA, a sensor-aware LLM for natural language reasoning of human activity from IMU data. Introduces SensorCap and OpenSQA resources for causal and explanatory reasoning in wearable systems.
Develops PAC-Bayesian risk certificates for contrastive representation learning. Provides statistical theory for contrastive learning, bounding generalization error for foundation models trained via a...
Proposes DT-NeRF, a diffusion and transformer-based optimization method for Neural Radiance Fields. Enhances detail recovery and multi-view consistency in 3D scene reconstruction, outperforming tradit...
Interprets vision transformers via a residual replacement model and analysis of 6.6K features. Reveals feature evolution and encoding of curves, providing insights into ViT processing from low-level p...
Proposes validation-free sparse learning using a phase transition approach for feature selection. Addresses AI's environmental footprint by promoting frugal and interpretable models with reduced compl...
Monday, September 22, 2025
Introduces Grounding via View Retrieval (GVR), a zero-shot method for 3D visual grounding in 3D Gaussian Splatting. It overcomes per-scene training limitations by using view retrieval, enabling effici...
Introduces SeCodePLT, a unified platform for evaluating code GenAI security. It addresses limitations of existing benchmarks by offering dynamic analysis and scalable evaluation, improving precision o...
Proposes Negotiative Alignment to achieve fairer outcomes by embracing disagreement. A community-centered study with diverse groups reveals systematic disagreement patterns, enhancing urban assessment...
Proposes Dynamics Modeling (DyMo) to augment LLMs with state prediction for tool use in stateful environments. This enables LLMs to predict future states via an internal environment model, improving a...
Proposes a regression-adjusted estimator for distributional treatment effects with imperfect compliance. It leverages treatment assignment as an instrumental variable to identify distributional effect...
Proposes Perception-R1 to enhance multimodal reasoning in MLLMs using visual perception rewards. This approach addresses overlooked perception capabilities, a prerequisite for advanced multimodal reas...
Proposes a transfer learning method for latent space models to improve network analysis and link prediction. It leverages information from similar networks to enhance estimation accuracy, especially f...
Proposes AttentionDrop, a family of stochastic regularization techniques operating on self-attention distributions. This method combats overfitting in transformer models, particularly with limited or ...
Introduces DSDNet for raw domain demoiréing using dual color-space synergy. This addresses severe visual degradation from moiré artifacts in smartphone-captured screen images, overcoming limitations...
Introduces CLIPTTA for robust contrastive vision-language test-time adaptation. It addresses misalignment in standard test-time adaptation objectives for VLMs, improving performance and mitigating fai...
Argues that algorithmic fairness is a socio-technical property, not purely technical. It highlights misconceptions limiting metric effectiveness and calls for a broader understanding beyond mathematic...
Analyzes 202 AI incidents to develop a taxonomy of causes, entities, and consequences. It classifies incidents across the AI lifecycle, addressing limitations in existing taxonomies for prevention and...
Introduces an LLM-driven decision-making framework for cooperative driving automation. It aims to enhance interaction and continuous learning for connected autonomous vehicles in complex scenarios.
Develops NeuroRAD-FM, a neuro-oncology foundation model with distributionally robust training. It improves generalization across cohorts and predicts molecular markers, addressing challenges in hetero...
Introduces Asymmetric LoRA Adaptation with Poisoning Experts (LoPE) for noise-robust parameter-efficient fine-tuning. This framework enhances model adaptation by leveraging noise rather than relying s...
Introduces a novel training framework with a discriminative loss and Gaussian noise injection for robust classification. It enhances intra-class compactness and decision boundary margins without degra...
Introduces SGEquiDiff, a crystal generative model handling space group constraints with equivariant likelihoods. This accelerates inverse design of crystalline materials by naturally incorporating sym...
Proposes a deep learning model for recognizing partially occluded road signs for autonomous vehicles. It addresses the complexity introduced by occlusions, aiming for improved accuracy in challenging ...
Proposes a knowledge transfer method to boost uncertainty estimation in Active Learning, particularly for domain tasks like cryo-ET classification. It addresses challenges in training complex auxiliar...
Proposes Multi-Prototype Supervision for robust visual continual learning using language-guided supervision. It addresses semantic ambiguity and intra-class diversity limitations of single-target appr...
Friday, September 19, 2025
Proposes a least-squares perspective for consistent causal discovery in linear acyclic SEMs with equal error variances. Establishes theoretical guarantees for unique DAG identification, demonstrating ...
Introduces Hamiltonian Descent Algorithms for optimization, leveraging randomized integration time. Achieves accelerated convergence rates similar to gradient descent for convex functions, offering a ...
Addresses open-set label shift by proposing a semiparametric density ratio model framework. Handles novel classes absent from training without restrictive assumptions, offering improved theoretical gu...
Develops rate doubly robust estimation for weighted average treatment effects (WATE), a versatile class of causal estimands. Addresses robustness limitations in existing literature, enabling more reli...
Proposes an efficient dual-domain image dehazing method using haze prior perception. Combines spatial and frequency domain features to overcome limitations of existing transformer-based models, enabli...
Proposes Gradient Distance Functions (GDFs) to represent non-watertight surfaces in deep learning. GDFs are differentiable at the surface, remedying brittleness of UDFs and enabling representation of ...
Introduces GCDance, a diffusion-based framework for genre-specific 3D full-body dance generation driven by music. Achieves physically realistic and synchronized dance sequences while adhering to genre...
Proposes a framework quantifying the contribution of latent variables in Multiple Latent Variable Generative Models (MLVGMs) using mutual information. Offers a systematic understanding of generative d...
Introduces AutoEdit for automatic hyperparameter tuning in text-guided image editing. Addresses the challenge of manual tuning by automating the process, reducing computational costs and improving edi...
Proposes semantically consistent style transfer using diffusion models for synthetic-to-real domain adaptation. Improves performance of vision models trained on synthetic data, especially in adverse c...
Presents the first gap-dependent analysis of regret and communication cost for on-policy federated Q-learning in tabular MDPs. Achieves improved bounds compared to worst-case analyses, offering a more...
Presents two sharp, closed-form empirical Bernstein inequalities for symmetric random matrices with bounded eigenvalues. Achieves tight adaptation to unknown variance, matching matrix Bernstein inequa...
Revisits the replica method for parametric models by employing a variational Gaussian approximation. Enables deferred and empirical data averages, leading to stationarity conditions for intractable in...
Introduces MedFuncta, a unified framework for learning efficient medical neural fields. Addresses challenges in scaling Neural Fields to large medical datasets, offering a powerful alternative to disc...
Introduces HPGN, a hybrid priors-guided network for enhancing compressed low-light images. Integrates compression and illumination priors in a unified framework, addressing joint enhancement challenge...
Presents Morph, a motion-free physics optimization framework for human motion generation. Addresses physically implausible motions by incorporating physics constraints, offering a new approach to real...
Proposes DM-Calib, a diffusion-based approach for monocular camera intrinsic parameter estimation. Leverages diffusion models trained on massive data for improved generalization across diverse real-wo...
Introduces PhyRMDM, a physics-informed framework for sparse radio-map reconstruction. Aligns physical constraints with data-driven features, establishing a novel approach for accurate reconstruction u...
Rethinks concept erasure in diffusion models by evaluating robustness and reversibility. Investigates whether erasure truly eliminates generative capacity or achieves only superficial suppression, pro...
Introduces a confidence-aware diffusion model for lightweight and accurate multi-view stereo reconstruction. Achieves 3D geometry reconstruction from calibrated images efficiently, demonstrating the p...
Thursday, September 18, 2025
Surveys the integration of Large Language Models (LLMs) into Information Retrieval (IR) systems. It details how LLMs capture complex signals and semantic nuances, evolving IR from term-based methods t...
Introduces a method for synthesizing and perceptually scaling high-resolution naturalistic images using Stable Diffusion. It focuses on generating perceptually continuous variations of naturalistic st...
Presents a Mixed-Integer Linear Programming approach for effort-optimized, accuracy-driven labeling and validation of test inputs for Deep Learning (DL) systems. It aims to build highly accurate datas...
Proposes a method for valid inference for M-estimators using adaptively collected bandit data under model misspecification. It provides robust statistical approaches for data collected adaptively, lik...
Demonstrates that brain age identification from diffusion MRI (dMRI) synergistically predicts neurodegenerative disease. It leverages dMRI's sensitivity to microstructural changes to build an earlier ...
Presents UniPLV, a framework for label-efficient open-world 3D scene understanding using regional visual language supervision. It unifies point clouds and images for robust recognition without manual ...
Proposes a self-supervised method for Embodied Image Captioning, enabling agents to describe objects while exploring environments. It fine-tunes captioning models via a three-phase framework for enhan...
Introduces GenExam, the first benchmark for multidisciplinary text-to-image exams. It features 1,000 samples across 10 subjects, evaluating integrated understanding, reasoning, and generation capabili...
Presents an object pose estimation approach using sensorimotor exploration and Reinforcement Learning (RL). It enables robots to actively control hand interactions for pose estimation, especially when...
Proposes InterKey, a cross-modal approach for global localization on OpenStreetMap. It enables robust localization for autonomous vehicles by matching sensor data with OSM, addressing scalability limi...
Proposes a novel framework for identity-preserving text-to-video generation using spatial-temporal decoupled representations. It addresses the trade-off between spatial coherence and temporal smoothne...
Proposes GROOD, a gradient-aware approach for Out-of-Distribution (OOD) detection in deep learning. It improves reliability in real-world applications by better distinguishing near-OOD samples compare...
Introduces Imputation-Powered Inference to address blockwise missingness in multi-modal and multi-site data. It offers a solution for complex missingness patterns that challenge standard inference met...
Presents a general method for physics-informed, boundary-constrained Gaussian process regression for reconstructing fluid flow fields. It uses adapted covariance functions to obtain estimates and cons...
Introduces StereoAnything, a data-centric framework unifying zero-shot stereo matching with large-scale mixed data. It enhances generalization capabilities for stereo matching models in unseen domains...
Introduces a lightweight gradient-aware upscaling technique for 3D Gaussian Splatting (3DGS) on GPUs. It achieves higher rendering speeds and reduces artifacts by leveraging analytical image gradients...
Introduces CROP (Contextual Region-Oriented Visual Token Pruning), a framework to compress visual tokens in VLM-based VQA. It localizes and prunes redundant visual tokens, reducing memory and computat...
Proposes CDPIR, a Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction framework for Sparse-View CT. It addresses out-of-distribution problems and enhances reconstruction quality with r...
Introduces Rest2Visual, a method to predict visually evoked fMRI from resting-state scans. It bridges spontaneous brain activity with stimulus-driven responses, offering a way to interpret rs-fMRI.
Introduces MetricNet, a framework for recovering metric scale in generative navigation policies. It addresses issues of unscaled abstract spaces and short-sighted actions in learned navigation, enabli...
Wednesday, September 17, 2025
No research highlights available for this date
Tuesday, September 16, 2025
Proposes that transformers minimize expected conditional description length over orderings, not permutation-invariant length, explaining hallucinations. Shows that transformers are Bayesian in expectation, not ...
Proves kernel covariance embeddings achieve information-theoretically perfect separation of probability distributions. Establishes equivalence between testing measure equality and singularity between ...
Proposes novel discretizations of kinetic Langevin SDEs for sampling from log-concave distributions with superlinear gradient growth. Shows contractivity and log-Sobolev inequality, establishing non-a...
Introduces generalized Dirichlet energy (GDE) to cluster directed and undirected graphs. GDE handles asymmetry in directed graphs, extending classical spectral methods and preserving directional infor...
Presents a deep learning framework using the SPAR model for multivariate joint extremes of metocean variables. Transforms multivariate extremes to angular density modeling, enabling improved tail anal...
Explores social perception of faces in CLIP by comparing embedding similarities between prompts and face images. Systematically varies dimensions like age, gender, and race to analyze social perceptio...
Presents a scalable reservoir computing implementation for dynamical systems using pseudorandom nonlinear projection. Offers a flexible alternative to polynomial projections for time series data analy...
Proposes using the Morgan-Pitman test for equality of variances in forecasting errors. Enhances robustness against heavy-tailed distributions and outliers, aiding machine learning model evaluation and...
Proposes a framework integrating score-based diffusion priors with moment-based estimators to solve ill-conditioned polynomial equations. Stabilizes polynomial recovery from noisy statistical features...
Adapts projection-based reduced-order models using Projected Gaussian Process. Addresses challenges in updating parametric ROMs by utilizing snapshot data and POD basis modes for improved representati...
Introduces a preconditioned subgradient method for composite optimization problems. Demonstrates fast convergence even with ill-conditioned or overparameterized smooth maps, applicable to data science...
Analyzes fundamental limits of active learning for linear dynamical systems, focusing on excitation input's effect on sample complexity. Presents lower bounds and system-theoretic conditions for poten...
Studies spectral convergence of graph Laplacians to Laplace-Beltrami operators using manifold heat interpolation. Proves convergence with Gaussian kernels by setting bandwidth parameter $\epsilon \sim...
Compares geostatistical and machine learning models for PM2.5 spatio-temporal prediction. Highlights the impact of low-cost sensors on data granularity and enables real-time, high-resolution air quali...
Proposes a contrastive learning framework for network representation learning, specifically for subject-specific, high-dimensional, sparse brain connectivity data. Preserves structural and semantic pr...
Develops a stochastic approximation framework for learning nonlinear operators using Mercer operator-valued kernels. Encompasses compact and diagonal kernels, inducing expressive vector-valued reprodu...
Demonstrates that learning procedures using aggregated labels achieve a robustness that is impossible without data cleaning. This robustness appears in risk consistency and improved generalization.
Analyzes accuracy limits of causal trees for heterogeneous treatment effect estimation. Discusses how fitting procedures using CART or variants are believed to be adaptive, but reveals limitations.
Introduces Piecewise Deterministic Markov Process (PDMP) samplers for Bayesian Neural Networks. Permits subsampling of likelihoods, overcoming limitations of traditional MCMC in computation.
Introduces a permutation-free kernel two-sample test for MMD statistics. Designs a level-$\alpha$ test by overcoming intractable limiting distributions, offering finite-sample validity without permuta...
Monday, September 15, 2025
Introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for retinal OCT image denoising. Effectively balances noise reduction with preservation of crucial an...
Presents an efficient learned image compression method through knowledge distillation. Maps images to a low-dimensional latent space for entropy coding, reconstructing approximations at the receiver, ...
Provides a systematic classification and benchmarking of compressed video quality enhancement (CVQE) methods across standards. Addresses limitations in linking methods to artifacts and comparative ana...
Introduces HHI-Assist, a dataset and benchmark for human-human interaction in physical assistance. Addresses challenges in accurate human motion prediction for assistive robots in complex physical int...
Proposes GC-VLN, a training-free framework for vision-and-language navigation. Formulates navigation guidance as graph constraint optimization, enabling deployment in continuous environments without e...
Proposes Talk2PC, enhancing 3D visual grounding for autonomous driving through LiDAR and Radar point cloud fusion. Moves beyond 2D VLMs to leverage rich 3D representations from point clouds for improv...
Introduces GROVE, a generalized reward framework for learning open-vocabulary physical skills for simulated agents. Enables skill learning without manual reward engineering or task-specific demonstrat...
Proposes a novel prompt optimization framework for text-to-image generation using self-rewarding large vision-language models. Alleviates dependence on large-scale manual data and biases from trained ...
Develops AI-driven methods to uncover neuroimaging biomarkers for brain tumor surgery outcome prediction. Addresses limitations of curated datasets by using AI to analyze complex imaging data for impr...
Proposes a novel approach using Geometry and Perception Guided Gaussians for multiview-consistent 3D generation from a single image. Addresses poor multiview consistency and lack of geometric detail i...
Proposes Attention Attack, a novel adversarial attack disrupting cross-attention for text-based image editing. Immunizes images from text-to-image editing by targeting the visual component, enhancing ...
Introduces a novel training-free approach for intrinsic image decomposition using visible and thermal image pairs. Leverages ordinality of intensities to decompose images into shading and reflectance ...
Proposes Chord, a two-stage framework for PBR material generation. Synthesizes shaded, tileable texture images using a fine-tuned diffusion model and then decomposes them to estimate PBR materials, im...
Presents a dataset and baseline method for polarization denoising and demosaicking of DoFP polarimeter images. Addresses the scarcity of research on the joint task, crucial for applications using pola...
Investigates whether generative geospatial diffusion models can excel as discriminative geospatial foundation models. Explores their potential to capture multi-grained semantic information for improved rep...
Introduces MotionCutMix, an online data augmentation technique for text-guided motion editing. Dynamically generates training triplets by blending body part motions, significantly expanding training d...
Systematically explores model architectures and training strategies for Medical Large Vision-Language Models (LVLMs) based on LLaVA. Aims to define what makes a good medical LVLM for complex multimoda...
Introduces the Integrative Variational Autoencoder (InVA) for image-on-image regression in multimodal neuroimaging. Models outcome images as functions of shared and modality-specific features, offerin...
Introduces the IISAN framework for parameter-efficient fine-tuning of multimodal foundation models in sequential recommendation. Significantly enhances efficiency in GPU memory and training speed comp...
Introduces PhilEO, an Earth Observation Foundation Model pretrained on massive datasets. Demonstrates competitive performance against specialized models, enabling efficient fine-tuning for downstream ...
Friday, September 12, 2025
Introduces FLUX-Reason-6M, a 6 million image dataset, and PRISM-Bench, a benchmark for text-to-image reasoning. Addresses performance gaps in open-source models and enables comprehensive evaluation of...
Proposes MOAT, a multi-agent joint alignment tuning framework to harmonize LLM-based multi-agent systems. Addresses capability gaps and poor coordination issues arising from independent agent fine-tun...
Investigates energy efficiency and performance trade-offs in LLM inference across tasks and DVFS settings. Identifies and optimizes factors influencing runtime efficiency without compromising performa...
Proposes a Gradient-Attention Guided Dual-Masking framework for robust text-based person retrieval. Addresses scarcity of person-centric data and limitations of global contrastive learning for fine-gr...
Proposes UnsafeBench, a framework to evaluate image safety classifiers. Benchmarks effectiveness and robustness on both real-world and AI-generated images, addressing concerns about misuse of text-to-...
Derives adaptive kernel predictors from feature-learning infinite-width neural network limits. Provides explicit expressions for kernel predictors and numerical calculation methods, advancing understa...
Introduces Talk2Event, the first large-scale benchmark for language-driven object grounding using event camera data. Addresses the gap in multimodal perception for event cameras, leveraging their adva...
Proposes Medverse, a universal model for full-resolution 3D medical image analysis including segmentation, transformation, and enhancement. Enables high-fidelity predictions and global anatomical unde...
Introduces GEMINUS, a Mixture-of-Experts framework for end-to-end autonomous driving. Features a Global Expert and Scene-Adaptive Experts Group with a Dual-aware Router to handle diverse traffic envir...
Proposes using Vision Diffusion Model features aggregated by a transformer for action recognition. Achieves human-like generalization across context and viewpoint variations in untrained domains, over...
Summarizes the VQualA 2025 Challenge on visual quality comparison for Large Multimodal Models (LMMs). Introduces a novel benchmark for evaluating LMMs' reasoning about visual quality differences acros...
Presents MetaGraph, a methodology for extracting knowledge graphs from financial NLP literature. Analyzes research trends in GenAI for finance NLP, defining an ontology and structuring research insigh...
Introduces ALL-PET, a low-resource, low-shot PET foundation model operating in the projection domain. Leverages a latent diffusion model and innovative augmentation strategies to overcome data scarcit...
Introduces MOLLM, a Multi-Objective Large Language Model for molecular design. Combines domain knowledge with LLMs and in-context learning for multi-objective optimization of molecular properties.
Introduces MetaRAG, a metamorphic testing framework for hallucination detection in RAG systems. Addresses challenges specific to RAG where responses must align with retrieved evidence, unlike standalo...
Proposes ABS-Mamba, a novel network for medical image translation integrating SAM2 for semantic representation and Mamba for structure preservation. Harmonizes global semantics and local fidelity, add...
Introduces the Oxford Spires Dataset, a large-scale multi-modal dataset for benchmarking LiDAR-visual tasks. Establishes benchmarks for localization, reconstruction, and novel-view synthesis using syn...
Proposes MR-UIE, a framework using Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction. Enhances LLM performance in structured output scenarios requiring compl...
Integrates anatomical priors into a causal diffusion model for 3D brain MRI synthesis. Addresses the lack of inductive biases in counterfactual models, preserving fine-grained anatomical details for p...
Connects cognitive science theory of analogical reasoning with NLP research. Shows how these notions are relevant for major NLP challenges, offering a cognitive lens to understand and advance analogic...
Thursday, September 11, 2025
Proposes an alternating minimization scheme (OAM) to compute the rate-distortion-perception function with $f$-divergence constraints. Characterizes optimal parametric solutions, enabling efficient com...
Introduces a new transport approach using a W-TV transport inequality and parabolic regularization to study the cutoff phenomenon for Markov processes. This bypasses the use of varentropy, offering an...
Addresses identification challenges in linear simultaneous-equation models by exploiting higher-order moments of non-Gaussian data. Relaxes the typical assumption of uncorrelated structural errors, en...
Surveys foundation models for autonomous driving perception, analyzing their impact on generalization, scalability, and robustness. Introduces a taxonomy based on four core capabilities, examining how...
Explores event-based vision for high-agility mobile devices, focusing on abstraction, algorithms, acceleration, and applications. Addresses challenges of noisy events and stable perception for low-lat...
Introduces self-supervised feature extraction and tracking for visual odometry to improve robustness in challenging settings. Addresses issues like lighting changes and dynamic scenes that degrade per...
Explores a CNN-ViT hybrid model for pneumonia detection trained from scratch on limited data. Demonstrates the architectural strengths of the hybrid model, achieving competitive performance on balance...
Proposes Sigma, a Siamese Mamba network for multi-modal semantic segmentation. Leverages additional modalities (X-modality) alongside RGB to enhance perception and scene understanding, particularly in...
Investigates reward scaling in visual generation using Reinforcement Learning. Addresses limitations of CLIP-based RMs and Bradley-Terry losses, proposing a method for effective scaling in Vision-Lang...
Proposes SAFT, a system for reconstructing 3D shape and appearance of fabrics from monocular video using differentiable physical simulations. Achieves realistic deformations and rendering for dynamic ...
Investigates how task design and individual differences affect human evaluation of AI suggestions through a randomized experiment. Reveals psychological factors influencing the success and failure of ...
Establishes convergence rates for set membership identification in linear systems under relaxed assumptions on persistent excitation and disturbances. Uses a block-martingale small-ball condition enab...
Introduces X-Part, a controllable generative model for decomposing 3D objects into semantically meaningful, structurally coherent parts with high geometric fidelity. Addresses limitations in controlla...
Systematically reviews recent advances in world models for autonomous driving, highlighting their role in robust scene interpretation and safe decision-making. Discusses how these models integrate mul...
Proposes Déjà Vu, an efficient video-language query engine that reuses computations between video frames using learning-based inter-frame techniques. Addresses the computational burden of Vision Transfo...
Presents a physics-guided rectified flow method for low-light RAW image enhancement. Addresses limitations of synthetic datasets by physically modeling sensor noise more comprehensively, improving enh...
Introduces SocialNav-SUB, a benchmark for evaluating Vision-Language Models (VLMs) in social robot navigation scene understanding. Assesses VLM capabilities in inferring social navigation contexts cru...
Introduces a Sparse Scan Self-Attention mechanism ($\rm{S}^3\rm{A}$) for Vision Transformers, inspired by human eye scanning. Predefines anchors of interest for tokens to reduce computational overhead...
Introduces a Bidirectional Transition approach for learning robust representations in visual reinforcement learning. Aims to create reliable representations by predicting future states and tracing his...
Introduces GeneVA, a dataset of human annotations for generative text-to-video artifacts. Addresses the need for systematic benchmarks to study and mitigate unpredictable artifacts like impossible phy...
Wednesday, September 10, 2025
Introduces BEAM, a novel pipeline bridging 4D Gaussian representations with physically-based rendering to produce relightable volumetric video. Achieves efficient, high-quality rendering of dynamic 3D...
Introduces GraspCoT, integrating physical property reasoning into LLMs for flexible language-instruction-guided 6-DoF robotic grasping. Enables robots to comprehend and execute grasping tasks by lever...
Introduces Semi-SMD, a semi-supervised metric depth estimation framework for autonomous driving using surrounding cameras. Proposes a unified fusion module and cross-attention for scale information re...
Generalizes 3D Gaussian modeling to volumetric primitives for scattering and emissive media, introducing closed-form solutions for modeling and rendering. Enables unified representation of surfaces an...
Surveys recent methods leveraging LLMs for crash detection from video data, presenting a structured taxonomy, datasets, architectures, and performance benchmarks. Provides a comprehensive overview of ...
Proposes HieraRS, a hierarchical segmentation paradigm for remote sensing enabling multi-granularity interpretation and cross-domain transfer. Addresses limitations of flat classification by generatin...
Presents SplatFill, a novel depth-guided approach for 3D Gaussian Splatting scene inpainting. Achieves state-of-the-art perceptual quality and improved efficiency for filling missing regions in 3D sce...
Introduces RayGaussX, accelerating Gaussian-based ray marching for real-time, high-quality novel view synthesis. Achieves significant speedups in training and inference by building on RayGauss with ke...
Introduces an interpretable text-guided image clustering method via iterative search. Addresses ambiguity in clustering by allowing users to define criteria, enabling flexible and accurate partitionin...
Introduces a foundational geospatial model for embedding hyperspectral geospatial data cubes into vectors. Achieved Top-1 in the EarthVision Embed2Scale Challenge, demonstrating effectiveness for down...
Proposes PINGS, a novel map representation unifying distance fields and radiance fields for robots, enabling high-fidelity, geometrically accurate, and photorealistic environmental reconstructions. Ac...
Proposes VMGNet, a low computational complexity, high-accuracy network for robotic grasping using VMamba and multi-scale feature fusion. Achieves linear computational complexity, significantly reducin...
Proposes IntuiTF, an MLLM-guided framework for transfer function optimization in direct volume rendering. Addresses vast exploration space and limited generalizability by enabling intuitive, semantic ...
Addresses the lack of fine details in latent generative models by focusing on high frequencies. Proposes methods to improve latent representations and generation quality, particularly for textured reg...
Introduces Atomizer, a flexible architecture representing remote sensing images as sets of scalars to generalize across diverse satellite modalities. Enables models to adapt to new configurations with...
Introduces DiGS, a unified framework embedding Signed Distance Field learning within 3D Gaussians for accurate and complete surface reconstruction. Achieves state-of-the-art rendering quality while en...
Extends 3D Gaussian Splatting for strand-level hair geometry reconstruction from multi-view images. Achieves efficient and explicit scene representation for hair, enabling applications in virtual real...
Proposes a novel deep learning framework for small moving target detection that moves beyond traditional motion cues and structural sparsity. Achieves robust detection in complex environments by focus...
Presents TextlessRAG, the first end-to-end framework for speech-based question answering over visual document images, eliminating ASR, TTS, and OCR. Directly interprets speech queries to extract knowl...
Proposes XOCT for enhancing OCT to OCTA translation using cross-dimensional supervised multi-scale feature learning. Addresses challenges in acquiring high-quality OCTA images and improves deep learni...
Tuesday, September 9, 2025
Proposes flow-based generative models as iterative algorithms operating in probability space. Demonstrates their power for high-dimensional data synthesis, exact likelihood estimation, efficient sampl...
Revisits Bayesian Neural Networks (BNNs) through normalization, modeling uncertainty only in weight directions. Aims to address misalignment with network geometry and improve uncertainty quantificatio...
Demonstrates robustness of Lipschitz-regularized $\alpha$-divergences in generative modeling, enabling stable learning with minimal assumptions on target distributions. Establishes finiteness under mi...
Investigates Randomized Quasi-Monte Carlo (RQMC) methods for kernel approximation, improving deterministic error bounds over classical Monte Carlo. Establishes theoretical guarantees for RQMC in rando...
Revisits optimal transport for angular velocity dynamics via the controlled Euler equation. Enables stochastic guidance of spin states for rigid bodies under deadline constraints by transferring state...
Proposes multi-criteria design for A/B testing beyond Average Treatment Effect (ATE). Addresses additional objectives like welfare or revenue loss, critical for practical applications beyond simple es...
Introduces LLaDA-VLA, a vision-language diffusion action model for robotic manipulation. Leverages diffusion models for policy learning, extending their application beyond text generation and multimod...
Introduces F1, a pretrained Vision-Language-Action (VLA) framework integrating visual foresight generation into decision-making. Adopts a Mixture-of-Transformer architecture for language-conditioned t...
Proposes Barlow-Swin, a novel Siamese-based segmentation architecture using Swin-Transformers. Addresses limitations of CNNs in global context modeling for medical image segmentation, aiming for light...
Introduces H$_{2}$OT, a hierarchical plug-and-play pruning-and-recovering framework for efficient transformer-based 3D human pose estimation from videos. Addresses high computational costs of video po...
Introduces GenAI-Powered Inference (GPI), a statistical framework for causal and predictive inference using unstructured data. Leverages GenAI models to generate data at scale and extract low-dimensio...
Provides robust evidence on causal drivers of market troughs using a flexible causal machine learning framework. Identifies volatility of risk appetite and market liquidity as key drivers, offering im...
Analyzes trade-offs in variational inference for uncertainty quantification, showing that mean-field approximations lead to an impossibility theorem when the target distribution does not factorize. Hi...
Establishes feasibility thresholds for random multi-graph alignment in Gaussian and Erdős-Rényi models. Demonstrates an 'all-or-nothing' phenomenon in the Gaussian model and rigorously identifies thre...
Presents a non-asymptotic convergence analysis for Q-learning and actor-critic algorithms in robust average-reward MDPs. Shows that the optimal robust Q operator is a strict contraction, enabling stochastic appr...
Adapts foundation models like DINOv2 for multi-modal medical image analysis, addressing limitations of uni-modal designs. Aims to improve effectiveness for multi-modal tasks common in medical fields.
Scales transformer-based novel view synthesis models using token disentanglement and synthetic data. Incorporates synthetic data from diffusion models to improve generalization to real-world scenes.
Introduces ADIR, an adaptive diffusion framework for image reconstruction. Leverages diffusion model priors while enforcing consistency with measurements, adapting pre-trained models for improved reco...
Presents FoMo4Wheat, a crop-domain vision foundation model pretrained with self-supervision on ImAg4Wheat. Aims for reliable crop vision models by using the largest wheat image dataset for self-superv...
Monday, September 8, 2025
Introduces Instruction-oriented Preference Alignment (IPA) to enhance the comprehension of Multimodal Large Language Models (MLLMs). IPA focuses on crucial multi-modal comprehension factors, improving perfor...
Presents a world model-driven code execution approach for smarter mobile device control, addressing limitations of reactive policies. It enables foresighted planning by considering sequential steps an...
Presents GeoSplat, a general geometry-constrained optimization framework for Gaussian splatting. It leverages higher-order geometric priors beyond normal vectors, addressing limitations of prior noisy...
Introduces Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity using a connectivity-preserving tokenization scheme. It automates connectivity relations...
Introduces LUIVITON, an end-to-end system for automated virtual try-on of complex clothing on diverse characters. It addresses garment-body alignment by separating draping into clothing-to-SMPL and bo...
Presents SGS-3D for high-fidelity 3D instance segmentation, addressing errors from 2D-to-3D lifting. It employs splitting and growing reliable semantic masks, overcoming ambiguous semantic guidance an...
Introduces FlowSeek, a novel framework for optical flow requiring minimal hardware resources. It combines optical flow networks with single-image depth foundation models and motion parametrization, ac...
Proposes Histo-Miner, a deep learning pipeline for tissue feature extraction from Whole Slide Images (WSIs) of skin cancer. It generates datasets with labeled nuclei and tumor regions, providing an op...
Presents DisPatch, a defense mechanism against adversarial patch attacks in object detection using diffusion models. It aims to disarm these attacks by leveraging diffusion model capabilities, providi...
Introduces a biologically inspired separable learning vision model for real-time traffic object perception in low-light conditions. It addresses severe illumination degradation and lack of visual cues...
Proposes a diagnostic framework using the Linear Separability Ceiling (LSC) to analyze Visual-Language Models (VLMs). It reveals pervasive alignment issues in VLM representations, disentangling percep...
Introduces LatentCSI, a novel method for generating high-resolution images from WiFi CSI measurements using a pretrained latent diffusion model. It employs a lightweight network for direct mapping to ...
Introduces YOLOv13, enhancing real-time object detection with hypergraph-enhanced adaptive visual perception. It overcomes limitations of pairwise correlations by capturing global multi-to-multi high-...
Presents a novel framework for real-time virtual makeup try-on that achieves high-fidelity, identity-preserving transfer with temporal consistency. It effectively decouples semitransparent cosmetics f...
Proposes an active mapping system using a 3D Gaussian Splatting representation guided by multimodal LLMs for long-horizon exploration. It integrates detailed motion planning with LLM guidance, address...
Introduces PromptEnhancer, a prompt rewriting framework that enhances text-to-image models via Chain-of-Thought. It addresses challenges in rendering complex prompts, improving attribute binding and c...
Presents STADI (Spatio-Temporal Adaptive Diffusion Inference), a novel framework for efficient diffusion model inference on heterogeneous GPUs. It addresses workload imbalance, optimizing resource uti...
Introduces a semi-supervised deep transfer learning approach for regression without domain alignment, addressing generalization challenges in domain-shifted target data. It offers a solution for scena...
Introduces techniques for improved 3D scene stylization via text-guided generative editing, addressing challenges in high-quality stylization and view consistency. It enables consistent style applicat...
Explores transfer learning with mobile-enabled CNNs (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). It evaluates TL strategies and lightweight MbNets to address computational requ...
Friday, September 5, 2025
Introduces ChexGen, a generative vision-language foundation model for synthesizing chest radiographs guided by text, masks, and bounding boxes. Pretrained on a large dataset, it offers a unified frame...
Proposes TRUST-VL, an explainable news assistant for general multimodal misinformation detection. It jointly trains across distortion types, facilitating knowledge sharing and enabling generalization ...
Evaluates embeddings from foundation models for radiographic classification using lightweight adapters. Compares various models and algorithms on a large dataset, providing insights into embedding eff...
Analyzes fine-tuning behaviors of image editing models versus text-to-image generative models for dense geometry estimation. Finds editing models are more suitable foundations, enabling improved dense...
Proposes DUDE, a diffusion-based unsupervised cross-domain image retrieval method using feature disentanglement. Leverages diffusion models to address domain gaps by separating object features from do...
Studies sampling from non-log-concave distributions using improved algorithms and Poincaré inequalities. Focuses on query complexity for potentials with L-smoothness and bounded second moments, advanc...
Introduces Prob-GParareal, a probabilistic extension of GParareal for uncertainty quantification in parallel-in-time solvers. Employs Gaussian processes to model the correction function, enabling prob...
Reviews 147 recent studies on deep learning for vision-based traffic accident anticipation. Categorizes methodologies and datasets, focusing on supervised, unsupervised, and hybrid models for accident...
Introduces ConServe, a fine-grained GPU harvesting method for co-serving LLM online and offline requests. Achieves high GPU utilization by managing resources at a finer granularity than existing syste...
Presents Plot'n Polish for zero-shot story visualization and disentangled editing using diffusion models. Addresses the need for enhanced control and post-generation modification, enabling consistent ...
Introduces MatterVial, a hybrid framework integrating GNNs and symbolic regression for materials science. It expands feature space by combining latent representations from GNNs with descriptors and no...
Introduces a dual-stream diffusion model for coordinated piano hand motion synthesis from audio. It models hand independence and coordination, generating synchronized gestures while preserving distinc...
Presents a spatial-aware Transformer-GRU framework for enhanced glaucoma diagnosis using 3D OCT imaging. Integrates Vision Transformer for feature extraction and Bi-GRU for temporal modeling, improvin...
Presents FastPart, an algorithm leveraging SGD and Random Features for sparse optimization on measures. It provides rigorous mathematical proofs for its variational framework, demonstrating improved s...
Analyzes accelerated Stein Variational Gradient Flow using generalized bilinear kernels for Gaussian targets. Investigates methods to improve speed and efficiency compared to standard SVGD, aiming for...
Statistically analyzes empirical plug-in estimators for unbalanced optimal transport with Kantorovich-Rubinstein distance. Establishes sharp convergence rates for spatio-temporal point processes, adva...
Proposes an automated framework integrating unsupervised and supervised learning for segmenting and classifying materials microstructure images. Aims to classify micrographs by phase and segment multi...
Proposes bootstrapping the cross-validation estimate to accurately quantify uncertainty. Addresses optimism bias in error estimates for prediction models, essential for complex statistical learning al...
Introduces POET, a framework for automated expansion of text-to-image generation. Supports prompting creativity and personalization by generating novel visuals that adhere to user specifications, enha...
Empirically studies vulnerabilities in Python packages, considering their interaction with other languages. Investigates detection methods for inherent vulnerabilities and those arising from interoper...
Thursday, September 4, 2025
Introduces OneCAT, a pure decoder-only transformer for unified multimodal understanding and generation. Eliminates external vision components for efficiency, achieving significant gains especially for...
Introduces a method to mitigate hallucination in Large Vision-Language Models by aligning attention distribution to information flow. Analyzes LVLM attention mechanisms to emphasize visual information...
Develops a real-time virtual try-on method for loose-fitting garments that maintains temporal consistency. Addresses limitations of body semantic maps for obscured contours and trains garment synthesi...
Introduces GS-TG, a tile-grouping-based accelerator for 3D Gaussian Splatting. Enhances rendering speed by reducing redundant sorting operations and preserving rasterization efficiency, addressing the...
Proposes a reinforced collaborative distillation and self-learning framework for infrared-visible image fusion. Achieves high-quality fusion with lightweight models by integrating reinforcement learni...
Proposes a novel framework for license plate super-resolution guided by embedding similarity. Combines pixel-based loss with embedding similarity learning (PECL) to address unique challenges and enhan...
Proposes U-SAM to imbue the Segment Anything Model (SAM) with semantic awareness for user-defined segmentation. Enables targeted mask generation for specified object categories using only class names ...
Combines monocular depth estimation with multi-view data using differentiable rendering. Frames refinement as an analysis-by-synthesis optimization problem to lift and refine relative depth maps, impr...
Enhances diffusion model stability for image restoration through gradient management. Analyzes underlying gradient dynamics of denoising and likelihood guidance components to identify and address sign...
Evaluates the next-day wildfire predictability of MODIS and VIIRS satellite data. Compares their suitability for fire prediction by assessing how well their data forecasts wildfire spread, addressing ...
Proposes AstroClearNet, a self-supervised multi-frame method using deep image priors for astronomical image restoration. Achieves denoising, deblurring, and co-adding from blurred observations, overco...
Proposes a Universal Network for Identifying synthetic video content, addressing limitations of face-centric detectors. Detects manipulations from face-swapping to fully AI-generated videos, enabling ...
Proposes Grid-Reg, a detector-free framework for large-scale SAR-Optical image registration. Uses grid-based multimodal registration with a domain-robust descriptor network and a grid-based solver to ...
Presents Point Cloud Recombination for systematic real data augmentation using robotic targets. Addresses LiDAR perception validation challenges by combining physical sensor realism with controlled sc...
Introduces an SDRTV-to-HDRTV conversion method that integrates real HDRTV priors. Addresses ill-posedness and generalization constraints of single-style mapping by leveraging generative approac...
Introduces the Vision Language World Model (VLWM) for language-based world modeling on natural videos. Infers goal achievements and predicts action trajectories, enabling effective planning with seman...
Presents Spacecraft Pose Network v3 (SPNv3) for monocular pose estimation of spacecraft. Designed for computational efficiency and robustness to spaceborne images, essential for deployment on space-gr...
Introduces a self-supervised framework to learn data association for multi-object tracking. Uses an EM algorithm to train a neural network, overcoming the need for tedious identity-level annotations.
Proposes ViDDAR, a VLM-based framework for detecting task-detrimental virtual content in AR. Identifies obstruction and information manipulation attacks that impair user task performance and real-worl...
Investigates using long Chain-of-Thought (CoT) data for Supervised Fine-Tuning (SFT) to enhance reasoning in lightweight Multimodal Large Language Models (MLLMs). Demonstrates significant improvement in MLL...
Wednesday, September 3, 2025
Proposes a learnable weighted hybrid autoencoder to address poor convergence in high-rank latent spaces for model order reduction. Demonstrates improved performance in learning low-dimensional intrins...
Studies deterministically constrained stochastic optimization problems, proposing variance-reduced first-order methods. Aims to satisfy constraints with certainty, addressing limitations of existing m...
Addresses zero-order optimization for additive models with noisy observations, assuming Polyak-Lojasiewicz or strong convexity and higher-order smoothness. Proposes gradient-free methods for nonparame...
Revisits the view that overparameterized diffusion models memorize training data, showing generalization in natural domains is possible with early stopping. Challenges the notion that larger models in...
Proposes a method for combining e-processes constructed in different filtrations for anytime-valid inference. Addresses the challenge of combining e-processes across different filtrations, which is no...
Introduces an architecture-specific randomized training algorithm to bridge the gap between theoretical approximation theorems and practical training with noisy data. Constructs uniform approximations...
Introduces the notion of nondecreasing (ND) rank for tensors, representing them as sums of outer products with monotonicity constraints. Shows equivalence to nonnegative rank factorization for certain...
Introduces algorithmic simplifications to reduce computational complexity for probabilities of causation and latent confounding. Proposes a novel framework for Root Cause Analysis using these causal m...
Introduces the WeSpeR algorithm to compute non-linear shrinkage formulas for weighted sample covariance in high dimensions. Significantly speeds up non-linear shrinkage for dimensions over 1000, with ...
Presents a theoretical framework for zero-shot prediction, analyzing foundation models trained with self-supervised and multimodal contrastive learning. Identifies target quantities for zero-shot pred...
Shows Armijo line-search can make (stochastic) gradient descent provably faster by adapting to local smoothness without needing the global constant. Strengthens existing results and demonstrates const...
Investigates efficient retrieval of latent features from deep networks using triplet comparisons as feedback. Explores whether learned features, like dictionaries or covariance matrices, can be effici...
Designs end-to-end latent bandit algorithms that leverage offline data and can handle uncountably many latent states. Focuses on linear latent contextual bandits for accelerated online sequential ...
Shows that the memory capacity (MC) of random nonlinear RNNs can yield arbitrary values, questioning its informativeness. Contrasts this with linear RNNs where MC equals the Kalman controllability mat...
Investigates learning in complex action spaces without policy gradients, hypothesizing reasons for policy gradients' apparent superiority. Explores why computational applicability and performance dive...
Analyzes Nearest Neighbor algorithms for matrix completion with non-smooth nonlinear functions and high missingness. Proposes an adaptive and minimax optimal procedure, 'Two-Sided Nearest Neighbors', ...
Proposes a simple technique to enhance supervised learning by augmenting features with factors extracted from design matrices and their transformations. Addresses over-parametrization and need for fas...
Addresses causal direction identification between two variables assuming no hidden confounders. Proposes a bivariate causal score based on MDL principle using functions with density property on a comp...
Monday, September 1, 2025
Proposes a unified framework for solving inverse problems using diffusion posterior sampling, demonstrating that existing approximations are insufficient or inefficient. Addresses limitations by offer...
Introduces VIDEOMIMIC, a real-to-sim-to-real pipeline that reconstructs humans and environments from videos to produce whole-body control policies for humanoid robots, enabling them to perform skills ...
Presents JambaTalk, a hybrid Transformer-Mamba model for speech-driven 3D talking head generation, aiming to achieve equivalence across lip-sync, facial expressions, and head pose generation metrics.
Introduces PicoPose, a framework for RGB-based novel object pose estimation using a three-stage pixel-to-pixel correspondence learning process to tackle zero-shot generalization challenges in robotic ...
Presents Scale-GS, a scalable Gaussian Splatting framework for efficient training in streaming tasks. Organizes Gaussian spheres hierarchically by scale to improve efficiency for dynamic scenes.
Proposes Temporal Visual Screening (TVS) for Video Large Language Models, inspired by human screening behavior. Aims to improve fine-grained temporal semantics capture by pre-processing videos univers...
Introduces BASE-Q, a quantization technique for LLMs that enhances rotational quantization with bias and asymmetric scaling. Addresses limitations of existing methods regarding training overhead and m...
Systematically evaluates design choices for federated fine-tuning of foundation models for MRI-based dementia classification. Assesses impact on performance and efficiency using brain MRI data across ...
Introduces Temporal Flow Matching for learning spatio-temporal trajectories in 4D medical imaging. It enables fine-grained spatial predictions and understanding of temporal dynamics, advancing applica...
Proposes a counterfactual evaluation framework to assess the ability of Automatic Reviewer Generators (ARGs) to detect faulty research logic. Demonstrates that ARGs fail to detect faulty reasoning in research pap...
Introduces TrueGL, a model for trustworthy search results with clear reliability indicators. It addresses the need for AI systems to evaluate information credibility and justify assessments, aiming to...
Introduces the Granite Embedding R2 models, English encoder-based embedding models for enterprise-scale dense retrieval. Features 16x expanded context length and state-of-the-art performance across di...
Introduces Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes from diverse LiDAR configurations, aiming to address limitations in existing datasets for c...
Proposes a novel approach for merging fine-tuned models by considering inter-layer dependencies through a chain of merges, addressing limitations of existing layer-wise merging techniques.
Introduces EEGPT, the first generalist EEG foundation model using autoregressive pre-training. Aims to address limitations in versatile EEG model exploration due to diverse data formats and outdated p...
Proposes a convolutional neural network to upsample features for materials micrograph segmentation. It offers an alternative to U-Nets, aiming to improve the representation of fine features and handle...
Introduces ECHO, a unified framework for ego-centric modeling of human-object interactions from head and wrist tracking. It recovers human pose, object motion, and interaction semantics, important for...
Conducts a comparative study of off-the-shelf VLMs (BLIP-2, InstructBLIP, LLaVA-1.5) on spatial reasoning in urban scenes. It evaluates zero-shot performance and fine-tuning effects with a synthetic V...
Develops an AI-powered approach for rural livability mapping using drone imagery. Addresses limitations of questionnaire-based and urban-oriented methods by adapting visual perception for rural contex...
Evaluates multimodal recurrence prediction in ccRCC by integrating CT and histopathology whole-slide images. A modular deep learning framework is proposed to improve personalized risk estimation beyon...
Friday, August 29, 2025
Introduces LangToMo, a vision-language-action framework using pixel motion forecasts as intermediate representations. A diffusion model generates text-conditioned motion sequences for robot control, e...
Proposes TAG-WM for tamper-aware generative image watermarking using diffusion inversion sensitivity. Addresses copyright and authenticity risks of AI-generated content by enhancing watermark robustne...
Presents the winning solution to the NeurIPS 2024 Invisible Watermark Removal challenge, stress-testing watermark robustness under varying adversary knowledge. Addresses black-box and beige-box tracks...
Proposes ZIM, a zero-shot image matting model that addresses limitations of segmentation models in generating fine-grained masks. Develops a label converter and constructs a new dataset for matte labe...
Introduces a training-free approach for 3D visual grounding using Language-to-Space programming. Addresses challenges of scarce data and high annotation costs in 3D vision-language datasets.
Develops a deep learning multimodal framework integrating CT images, radiomics, and clinical data to predict metastasis risk in HNSCC patients. Aims to optimize treatment strategies and prognosis.
Fine-tunes DINOv3 using low-rank adaptation for atypical mitotic figure classification in medical imaging. Achieves efficient training by adapting only ~1.3M parameters, focusing on the MIDOG 2025 cha...
Presents the IMPROVE dataset, a multimodal resource with behavioral, biometric, and academic data to evaluate mobile phone impact on online education. Includes data from 120 learners across three phon...
Proposes T-Stars-Poster, a product-centric framework for automated advertising image design. Uses product information like foreground images and taglines to generate advertising visuals in four sequen...
Introduces GeoTexBuild, a modular framework for generating 3D building models from map footprints. Employs height map generation, geometry reconstruction, and appearance stylization for detailed model...
Presents LoTUS, a machine unlearning method that smooths prediction probabilities to eliminate training sample influence without retraining. Evaluated on Transformer and ResNet models, it mitigates da...
Introduces OneReward, a unified reinforcement learning framework for multi-task image generation using a single reward model. Enhances generative capabilities across tasks under different evaluation c...
Introduces VLMEvalKit, an open-source toolkit for evaluating large multi-modality models. Implements over 200 models and 80 benchmarks, providing a user-friendly framework for reproducible evaluation ...
Proposes an advanced framework combining a multi-scale 3D CNN with subtype-specific bias correction for precise pulmonary nodule volume estimation. Addresses limitations of traditional methods in CT s...
Integrates iPhone 15 Pro Max LiDAR and cameras for efficient, privacy-preserving background removal in 2D video streaming. Leverages depth information independent of lighting, outperforming traditiona...
Proposes GENRE-CMR, a GAN-based architecture using residual deep unrolled reconstruction for enhanced fidelity and generalization in accelerated Cardiac MRI. Addresses trade-offs between scan time and...
Introduces SMARTe-VR, a platform for student monitoring in VR e-learning using facial biometrics and learning metadata. Enables adaptive learning sessions with features like AutoQA and interaction too...
Introduces Hierarchical Scene Motifs (HSM), a framework for indoor 3D scene generation that synthesizes dense object arrangements across multiple scales. Addresses limitations of existing methods in p...
Develops WikiAutoGen for multi-modal Wikipedia-style article generation. Integrates multimodal content retrieval and synthesis, addressing limitations of text-only generation methods for enhanced info...
Develops an automated analysis framework using egocentric vision for leadership assessment in PICU team training. Identifies cues like fixation object, eye contact, and conversation patterns from Aria...
Thursday, August 28, 2025
Computes the asymptotic eigenvalue distribution of the Neural Tangent Kernel for two-layer neural networks under quadratic scaling. Analyzes the behavior of NTK matrices with specific dimension scalin...
Presents DVM-SLAM, the first open-source decentralized monocular C-SLAM system for multi-agent cooperative mapping. Enhances robustness, scalability, and accuracy by sharing information between agents...
Proposes MTS-Net, an end-to-end 3D deep learning framework for May-Thurner Syndrome diagnosis using CT volumes. Employs dual-enhanced positional multi-head self-attention to capture spatial-temporal p...
Presents a data-driven method using score-based generative modeling for reduced-order models of cyclo-stationary time series. Accurately reproduces statistical properties and temporal correlations, en...
Addresses Incremental Test Time Adaptation for Vision-Language Models in open worlds with unseen classes and domains. Uses segmentation assistance to improve generalization capabilities when encounter...
Proposes PAUL, an uncertainty-guided framework for robust cross-view geo-localization under noisy correspondence. Uses partitioning and augmentation to handle real-world alignment imperfections, impro...
Proposes AudioStory, a unified framework integrating LLMs with Text-to-Audio systems for structured, long-form audio narratives. Addresses temporal coherence and compositional reasoning challenges in ...
Exploits neural networks for data-driven regularizers in inverse problems via Variational Bayes image restoration. Uses compressive autoencoders for regularization, offering an alternative Bayesian ap...
Presents REPARO, a novel approach for compositional 3D asset generation from single images. Optimizes 3D mesh layout using differentiable rendering to address challenges in scenes with multiple object...
Introduces DiffArtist, the first 2D stylization method offering simultaneous control over structure and appearance style strength. Addresses the gap in neural stylization by focusing on both structura...
Proposes TAGS, a 3D tumor-adaptive guidance framework for SAM to address the domain gap in 3D medical imaging. Adapts foundation models to capture 3D anatomical context, improving tumor segmentation u...
Studies the forward-backward algorithm with sub-iterative denoisers in a Plug-and-Play fashion. Considers analysis and synthesis Gaussian denoisers within a dictionary framework, examining minimization...
Introduces Neural Conditional Simulation (NCS), a general method for spatial conditional simulation. Enables spatial prediction and uncertainty quantification by simulating from predictive distributio...
Utilizes digital cognitive tasks with eye-tracking data and deep learning (VTNet) to distinguish Mild Cognitive Impairment from healthy controls. Correlates eye movements and image content in visual m...
Introduces OpenM3D, an open-vocabulary multi-view 3D object detector trained without human annotations. Adapts 2D-induced voxel features and uses a class-agnostic 3D localization loss for OV detection...
Introduces Seam360GS, a novel calibration framework incorporating a dual-fisheye camera model into 3D Gaussian splatting. Achieves seamless 360-degree visual content generation from real-world omnidir...
Proposes a lightweight classification approach for fine-grained moth identification by combining expert-labeled field data with knowledge distillation from a foundation model. Bridges domain gaps for ...
Proposes two methods for latent space configuration to obtain desired topology in autoencoders. Improves generalization in supervised autoencoder neural networks by controlling latent space properties...
Proposes TraceNet for efficient single instance segmentation on mobile devices. Addresses computational constraints by optimizing instance segmentation for mobile imaging applications, enabling captur...
Proposes ReCLIP++, a method to rectify unexpected bias in CLIP for unsupervised semantic segmentation. Explicitly models and rectifies class-preference and space-preference biases to enhance segmentat...
Wednesday, August 27, 2025
Investigates sample size requirements for training ReLU feed-forward neural networks. Shows theoretically and empirically that generalization error scales as $1/\sqrt{n}$, not $1/n$, underpinning practical...
Develops Deshadow-Anything by fine-tuning Segment Anything Model (SAM) for zero-shot shadow removal. Addresses SAM's challenges with shadows, leveraging diffusion models for improved image shadow remo...
Introduces ZoomEye to enhance Multimodal LLMs with human-like zooming via tree-based image exploration. Enables LLMs to perform visual reasoning by dynamically scaling visual inputs during analysis.
Proposes an Automatic Dataset Creation Framework for selective forgetting in diffusion models. Evaluates methods to remove sensitive information while preserving non-sensitive regions' consistency.
Proposes FUSELOC, fusing global and local descriptors for visual localization. Uses a weighted average operator to disambiguate 2D-3D matching, improving accuracy while maintaining low memory requirem...
Introduces StreetCrafter, a controllable video diffusion model for street view synthesis. Utilizes LiDAR point clouds as conditioning to achieve photorealistic view synthesis from vehicle sensor data.
Presents PromptGAR for flexible group activity recognition with high accuracy. Bridges the gap in real-world applicability by offering input flexibility across prompts, frames, and instances without a...
Proposes LATex to leverage attribute-based text knowledge for Aerial-Ground Person Re-ID. Integrates semantic information from person attributes, improving feature extraction for cross-view person ret...
Introduces PhysioSync for EEG-based emotion recognition, inspired by physiological synchronization. Employs temporal and cross-modal contrastive learning, addressing noise and individual variability i...
Explores generative data augmentation using denoising diffusion models for 3D point cloud segmentation. Generates realistic novel point clouds to enrich data diversity and improve model performance be...
Applies Kolmogorov-Arnold Networks (KANs) for neural density estimation in gravitational-wave data analysis. Proposes KANs for efficient and interpretable posterior construction in GW catalogs, enhanc...
Comprehensively evaluates SAM and SAM 2 using diverse prompts for context-dependent concepts. Analyzes their performance across various scenes, providing insights for future Segment Anything Model dev...
Proposes Meta-learned Modality-weighted Knowledge Distillation (MetaKD) for robust multi-modal learning with missing data. Adaptively weights modalities via meta-learning, maintaining accuracy even wh...
Introduces RAFT for robust augmentation of features in image segmentation. Addresses the Syn2Real gap by generating synthetic data that improves model performance on real-world deployments.
Introduces MCGS to enhance multiview consistency for sparse-view 3D Gaussian Radiance Fields. Addresses suboptimal performance with sparse views by incorporating inherent multiview consistency.
Proposes balancing domain diversity and invariance for single-domain generalized object detection. Addresses loss of domain-specific information in invariance-driven strategies, improving cross-domain...
Introduces Project-Probe-Aggregate (PPA) for parameter-efficient fine-tuning of foundation models. Enhances group robustness without relying on group annotations by improving failure-based debiasing c...
Proposes WMKA-Net with a Reversible Multi-Scale Fusion Module for retinal vessel segmentation. Addresses feature fusion, contextual continuity, and noise interference using adaptive convolution and at...
Proposes MonoCoP, a Chain-of-Prediction framework for monocular 3D object detection. Improves depth prediction by conditioning on other inter-correlated 3D attributes, addressing inherent depth estima...
Introduces Ego-HOIBench and a new method for egocentric human-object interaction detection. Addresses challenges like hand-object occlusion from a first-person perspective in real-world scenarios.
Tuesday, August 26, 2025
Introduces FaceCrafter, an identity-conditional diffusion model with disentangled control over facial pose, expression, and emotion. Achieves high-fidelity face synthesis while allowing fine-grained m...
Introduces AnimateAnywhere, a human image animation method that animates both foreground characters and backgrounds. Addresses static or inharmonious background generation, enabling more realistic and...
Proposes boosting Temporal Sentence Grounding (TSG) via causal inference to address spurious correlations. Achieves improved accuracy in identifying relevant video moments by mitigating biases from te...
Introduces PainFormer, a vision foundation model for automatic pain assessment. Utilizes multi-task learning to provide continuous monitoring and support decision-making in pain management, aiming to ...
Examines style transfer's impact on semantic segmentation, showing it reduces texture bias and improves robustness. Demonstrates that applying style transfer techniques can enhance generalization capa...
Investigates conditions for achieving the Wasserstein-Cramér-Rao lower bound, defining Wasserstein efficiency. Shows a condition under which estimators attain this bound, providing theoretical insig...
Introduces LEL, a Lipschitz continuity-constrained ensemble learning model for EEG-based emotion recognition. Enhances model stability, accuracy in high-dimensional signals, and robustness against var...
Proposes BoxFusion, a reconstruction-free framework for open-vocabulary 3D object detection. Achieves real-time performance via multi-view box fusion, addressing computational overhead and memory cons...
Introduces VIN-NBV, a view introspection network for Next-Best-View (NBV) selection. Trains an acquisition policy to directly optimize reconstruction quality rather than coverage, improving scene acqu...
Develops VFOG, variance-reduced optimistic gradient methods for nonmonotone generalized equations. Combines Nesterov acceleration and variance reduction, achieving $O(1/k^2)$ convergence rates for data-...
Proposes ANT, an adaptive neural temporal-aware text-to-motion model that addresses temporal-frequency demands in diffusion models. Achieves improved motion foundations and text alignment by adapting ...
Presents Switch-NeRF++, a heterogeneous mixture of hash experts for large-scale NeRFs. Addresses learnable decomposition, scene heterogeneity, and modeling efficiency, enabling highly scalable and rob...
Proposes AffordanceSAM, leveraging Segment Anything Model for affordance grounding. Enables generalized affordance recognition by segmenting actionable regions, addressing limitations in supervised le...
Introduces GigaTok, the first approach to scale visual tokenizers to 3 billion parameters for autoregressive image generation. Simultaneously improves image reconstruction and generation quality, addr...
Proposes language-guided action anatomy for few-shot action recognition, exploiting text to enhance understanding of subtle variations. Achieves improved recognition with limited data by incorporating...
Presents Mesh-Learner, a 3D reconstruction and rendering framework texturing meshes with Spherical Harmonics. Learns view-dependent radiance end-to-end within rasterization pipelines, enabling native ...
Proposes a Bayesian nonparametric classification model combining Gaussian and Dirichlet process priors. Extends de Finetti representation and Ferguson's construction, allowing flexible uncertainty mod...
Learns to predict robot motions during nominal task execution to detect visual anomalies for execution monitoring. Uses a probabilistic U-Net architecture to predict optical flow, enabling robots to i...
Enables automated skill assessment in wet-lab cataract surgery videos using computer vision. Enhances efficiency and objectivity of surgical education by moving beyond manual performance evaluations, ...
Monday, August 25, 2025
Proposes dual visual-text alignment for zero-shot skeleton-based action recognition, enabling models to adapt to new, unseen actions dynamically by aligning visual features with semantic text represen...
Introduces the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset for autonomous driving research, capturing accurate 3D trajectory data.
Proposes NeuroKoop, a neural Koopman fusion of structural-functional connectomes, to better capture complementary features in neuroimaging data for identifying prenatal drug exposure effects.
Leverages state space models (SSMs) for high-performance learned image compression, addressing computational inefficiency and improving redundancy modeling by capturing long-range dependencies.
Exploits mean-field interpretation and dynamic programming to formulate stochastic control problems as infinite-dimensional minimizations, providing generalization bounds.
Proposes an image enhancement method decomposing spatial-aware lookup tables to achieve lightweight and fast real-time performance while retaining spatial information.
Proposes EHGCN, a hierarchical Euclidean-Hyperbolic fusion via motion-aware GCN, to capture long-range dependencies and hierarchical structures in event stream perception.
Presents a generalizable NeRF method using explicit correspondence matching to provide geometry prior for novel view synthesis with as few as two source views.
Introduces a novel dataset for video-based neurodivergent classification, leveraging extra-stimulatory behavior to improve productivity and understanding of these behaviors.
Proposes cascaded multi-scale attention (CMSA) for CNN-ViT hybrid architectures to effectively extract and interact with multi-scale features from low-resolution images.
Introduces VIBE for evaluating video-to-text summarization, addressing verbose outputs from current models by focusing on information bottleneck evaluation for concise TL;DR generation.
Explores using diffusion models to enhance flat-panel detector CT imaging quality, aiming for diagnostic quality comparable to multi-detector CT for improved patient management.
Proposes a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLE) incorporating geometric information and depth guidance to improve low-light image and video enhancement.
Proposes an adaptive multi-order graph regularized NMF method (MOGNMF) for hyperspectral unmixing, capturing intrinsic data structures and requiring less manual parameter tuning.
Introduces Self-Validated Learning, a correctness-based self-training framework without human labels, for accurate particle instance segmentation in tomographic data.
Investigates using rear cameras for egocentric 3D human pose estimation, addressing self-occlusion and limited field-of-view coverage issues with frontal cameras.
Improves U-Net confidence on TEM image data for nanoscale defect identification using L2-regularization, transfer learning, and deep fine-tuning to handle data variations.
Introduces an illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting (3DGS) that considers background illumination mismatches for object insertion/replacement.
Proposes efficient density control for 3D Gaussian Splatting by improving clone and split operations to enhance optimization speed and detail recovery.
Reviews demographic fairness in face recognition, discussing disparities across groups, ethical concerns, and the impact on system credibility and reliability.
Saturday, August 23, 2025
Introduces a machine learning enhanced expert system for detecting heart failure decompensation. It utilizes patient-reported vitals and electronic health records to provide early detection, aiming to...
Proposes SpaIM, a novel method for single-cell spatial transcriptomics imputation using style transfer techniques. This approach aims to fill in missing gene expression data, thereby improving the acc...
Friday, August 22, 2025
Proposes one-shot entropy minimization for LLMs, requiring only one unlabeled data point and 10 optimization steps. Achieves performance comparable to or exceeding methods using thousands of data poin...
Examines the vulnerability of AI-generated image detectors to adversarial attacks. Investigates systematic understanding of robustness and proposes methods to address identified weaknesses, crucial fo...
Presents the Hadamard Attention Recurrent Stereo Transformer (HART) to overcome attention mechanism bottlenecks. Introduces a Dense Attention Kernel for improved nonlinear expressivity and robustness ...
Introduces TrackID3x3, a dataset and algorithm for multi-player tracking, identification, and pose estimation in basketball videos. Addresses limitations of existing sports analytics datasets for fixe...
Proposes a molecular-empowered All-in-SAM model for fine-grained multi-class nuclei segmentation in computational pathology. Addresses challenges faced by general foundation models in capturing fine-g...
Introduces 3DGS-LM, accelerating 3D Gaussian Splatting reconstruction by replacing the Adam optimizer with a tailored Levenberg-Marquardt optimizer. Reduces optimization time from hours to minutes, enabling faster ...
Conducts an empirical study on how Video-LLMs answer video questions using attention knockouts. Analyzes internal mechanisms and designs variants to interpret existing VideoLLMs' question-answering st...
Introduces ExtraGS, a framework for trajectory extrapolation integrating geometric and generative priors. Addresses poor geometric consistency and over-smoothed renderings by unifying priors for drivi...
Presents a two-stage approach, High-Frequency First, to improve Implicit Neural Representations (INRs). Addresses spectral bias by capturing high-frequency details like edges and textures, enhancing i...
Introduces Grounded VideoLLM, a diffusion-grounded VideoLLM with entity-aware segmentation for long video understanding. Improves temporal perception, frame continuity, and language-vision alignment w...
Introduces PKR-QA, a benchmark for procedural knowledge reasoning question answering, built using a procedural knowledge graph. Enriches commonsense knowledge and structured reasoning for video unders...
Proposes TripleMixer, a robust 3D point cloud denoising network for adverse weather using spatial, frequency, and channel-wise processing. Effectively suppresses noise while preserving geometric struc...
Develops motion blur robust Vision Transformers for real-time UAV tracking, addressing challenges of high-speed movement and blur. Improves performance of trackers in demanding aerial surveillance sce...
Proposes a new framework for understanding co-speech gestures in the wild, introducing three tasks and benchmarks for gesture-speech-text association. Learns a tri-modal representation for improved no...
Presents an SfM-free 3D Gaussian Splatting framework to enhance novel view synthesis from extremely sparse views. Addresses degraded rendering quality when Structure-from-Motion fails due to sparse in...
Proposes a hybrid autoregressive-diffusion model for real-time streaming sign language production. Addresses limitations of autoregressive methods regarding error accumulation and diffusion models' st...
Proposes Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion. Addresses modality misalignment, detail destruction, and task-specific limitations to enhance image quality and do...
Proposes D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network for fine-grained road structure extraction. Addresses challenges of narrow roads, fragmentation, and occlusions in remote s...
Proposes a novel linear-time convex relaxation and contractor for fast, globally optimal truncated least squares point cloud registration. Addresses scalability challenges of previous provably optimal...
Introduces MapKD, unlocking prior knowledge with cross-modal distillation for efficient online HD map construction. Addresses reliance on stale offline maps and sensor suites, reducing inference overh...
Thursday, August 21, 2025
Develops non-asymptotic bounds for denoising diffusion probabilistic models, making minimal assumptions on data distribution. Establishes theoretical understanding of error bounds, crucial for compari...
Introduces Endo-FASt3r, the first method to use self-supervised learning with foundation models for pose estimation in endoscopic scenes. Explores adaptation for structure from motion, crucial for 3D ...
Introduces TransDiff, the first image generation model combining Autoregressive Transformers and diffusion models. Achieves state-of-the-art performance on ImageNet by effectively encoding labels and ...
Presents VBench-2.0, advancing a benchmark suite for video generation models. Focuses on intrinsic faithfulness beyond superficial aspects, measuring factors like temporal consistency and prompt adher...
Introduces UnZipLoRA, a method to decompose an image into subject and style using two distinct LoRAs trained simultaneously. Achieves disentanglement from a single image, ensuring LoRA compatibility f...
Proposes a multi-view collaborative matching strategy for reliable track construction in complex scenarios. Addresses ambiguity in pairwise matching by considering collaborative information, improving...
Proposes a Transformer-CNN fusion method for high-precision skin lesion segmentation. Integrates transformers for global semantics and CNNs for local features, enhancing analysis of complex lesion str...
Presents an unsupervised framework for 3D anatomical structure reconstruction from freehand transvaginal ultrasound sweeps. Achieves volumetric reconstruction without external tracking or learned pose...
Introduces RNDiff, a conditional diffusion model for short-term precipitation nowcasting. Leverages diffusion models for high-quality sample generation, contrasting with GANs and VAEs for...
Proposes MoE-FFD, a Mixture of Experts approach for generalized and parameter-efficient face forgery detection. Addresses limitations of ViT-based methods in computational resources and capturing loca...
Proposes a novel multi-stage watermarking framework for diffusion models to establish copyright and trace generated images. Addresses ethical concerns including intellectual property and misuse of syn...
Proposes a novel inversion-based anomaly detection approach using diffusion models that circumvents explicit reconstruction. Addresses tension between fidelity and efficiency in anomaly detection.
Introduces DuCos, a depth super-resolution framework using Lagrangian duality theory and foundation models. Improves generalization across diverse scenarios with a novel prompt design for enhanced geo...
Proposes a marker-wise conditioned diffusion model for virtual multiplex staining of histological images. Addresses limitations of multiplex data acquisition and enables multimodal analysis on existin...
Presents MeshCoder, a framework reconstructing complex 3D objects from point clouds into editable Blender Python scripts. Leverages LLMs for structured mesh code generation, overcoming limitations of ...
Extends the Pix2Seq object detector for videos, introducing an end-to-end approach for video object detection. Represents objects as discrete tokens, improving succinctness and handling varying number...
Introduces 3D-Generalist, a self-improving vision-language-action model for crafting 3D worlds. Addresses challenges in spatial reasoning by grounding models in the 3D world, enabling scalable generat...
Introduces GeMS, a framework for 3D Gaussian Splatting designed to handle severely motion-blurred images. Addresses limitations of existing deblurring and Gaussian Splatting methods by not assuming ac...
Presents NCLR, a self-supervised learning framework for 3D perception using 2D-3D neural calibration. Estimates rigid pose aligning camera and LiDAR systems, bridging domain gaps for effective percept...
Establishes an information-theoretic framework for image captioning, balancing sufficiency, redundancy, and comprehensibility. Provides quantitative measures for evaluating caption quality and a flexi...
Wednesday, August 20, 2025
Benchmarks GPT-5's zero-shot multimodal reasoning in radiology and radiation oncology, comparing its performance against GPT-4o across key medical tasks. Assesses the practical gains of large multimod...
Proposes UNICON, a unified continual learning framework for medical foundation models. Addresses data scarcity by enabling sequential fine-tuning on diverse domains and tasks without requiring large...
Critically reviews 46 abdominal CT datasets, finding substantial redundancy and Western/geographic bias. Assesses suitability for AI applications, highlighting limitations in clinical relevance and re...
Investigates transferability of prognostic knowledge in computational pathology for Whole-Slide Images. Addresses scaling limitations for rare tumors and knowledge utilization from other cancers, prop...
Targets internal scene reconstruction using factorized 3D Gaussian Splatting. Models continuous volumetric density via inner 3D Gaussians for applications requiring deep interior understanding.
Introduces PediDemi, a dataset for pediatric demyelinating lesion segmentation. Addresses the need for specialized datasets to improve AI models for diagnosing central nervous system disorders.
Enhances Vision Transformers for medical image segmentation by integrating pre-trained LLM transformer blocks. Achieves substantial improvements by incorporating frozen LLM blocks into the ViT encoder...
Explores deep learning for Urdu text recognition, addressing challenges of its cursive script and complex structure. Proposes a component-based classification approach to improve recognition accuracy.
Introduces a computer-vision framework for quantifying aesthetic outcomes in facial plastic surgery. Leverages automated landmark detection, symmetry computation, and deep learning on a large dataset.
Achieves full disentanglement for controllable talking head synthesis with EDTalk++. Enhances application and entertainment by controlling facial motions and accommodating diverse input modalities.
Surveys storage architectures for Embodied AI data, evaluating graph, multi-model, data lake, vector, and time-series databases. Focuses on suitability for physical grounding, low-latency access, and ...
Introduces NeuSee, a framework for sensor protection against laser flare. Jointly learns a diffractive optical element representation and a Mamba-GAN network for image restoration, enabling high-fidel...
Introduces SSR-KD, a fast, accurate AI framework for real-time 3D bone model reconstruction from very-low-dose protocols. Enables patient-specific surgical guides and preoperative planning without hig...
Proposes a resample-aggregate framework using diffusion models for stable variable selection in high-dimensional, correlated data. Generates high-fidelity synthetic data to improve model stability and...
Applies deep learning object detection for early colon polyp identification using the Kvasir-SEG dataset. Utilizes data augmentation and specific training/validation/testing splits to improve detectio...
Introduces TracSum, a benchmark for traceable, aspect-based summarization in the medical domain. Pairs summaries with sentence-level citations to enable users to assess factual accuracy and alleviate ...
Enhances OCR capabilities using a reasoning-and-tool interleaved vision-language model. Addresses LVLM hallucinations and improves effectiveness on OCR tasks compared to general-purpose models.
Proposes Prune2Drive, a plug-and-play framework to accelerate Vision-Language Models in autonomous driving. Addresses computational overhead from high-resolution, multi-view images via pruning.
Investigates Small Language Models (SLMs) for medical imaging classification, comparing models and prompt designs. Addresses computational cost and data privacy concerns hindering LLM adoption in heal...
Re-examines MLLM token technology through classical visual coding principles. Establishes a unified formulation bridging token technology and visual coding to minimize computational cost while maximiz...
Tuesday, August 19, 2025
Derives a decomposition of m-th order U-statistics into linear terms, addressing the lack of comprehensive studies of their computational complexity; U-statistics are known to be time-consuming to compute in practice.
Provides evidence that the Barron space, while defying the curse of dimensionality in classical smoothness, does not defy it with a nonclassical notion of smoothness related to 'infinite'.
Develops an intent-aware generative semantic multicasting framework utilizing pre-trained diffusion models that decomposes source signals into semantic classes based on multi-user intent for efficient...
Presents WIR3D, a technique for abstracting 3D shapes using sparse Bezier curves that represent geometry and visual features, guided by CLIP model activations.
Introduces a novel task of joint egocentric video and human motion generation, addressing viewpoint alignment and camera motion challenges for first-person view content.
Addresses the under-explored problem of assisting humans in collecting input images for novel view synthesis, focusing on uniform and dense view sampling.
Proposes a method for mimicking bona fide ID card images by generating synthetic versions, aiming to address the lack of images for training robust Presentation Attack Detection systems.
Proposes DMS, a diffusion-based multi-baseline stereo generation method to address ambiguity in photometric reconstruction, improving self-supervised depth estimation.
Presents visual action prompts, a unified action representation for action-to-video generation of complex interactions, balancing action precision and cross-domain transferability.
Introduces IGFuse, a method for reconstructing 3D scenes by fusing multi-scans with Gaussian representations, addressing object occlusions and limited sensor coverage.
Proposes a unified framework for conformalized multiple testing that uses all available data (null, alternative, unlabeled) to construct scores and calibrate p-values via a full permutation strategy.
Compares YOLOv10 against other models for blood cell detection, showing increased training epochs significantly enhance accuracy, precision, and recall for real-time detection and classification.
Presents a fast and accurate solution to the Perspective-n-Point (PnP) problem for n=4 by separating variables and finding 3D points on rays connecting the camera to canvas points.
Introduces Matrix-Game 2.0, an open-source, real-time world model using diffusion models for interactive video generation, addressing latency issues of previous models.
Proposes HierAdaptMR, a hierarchical feature adaptation framework using parameter-efficient adapters to address multi-level domain variations in cross-center cardiac MRI reconstruction.
Enables controllable human shape editing while preserving pose, identity, clothing, and background by using depth-guided diffusion, addressing limitations of current approaches.
Introduces a novel RSVQA dataset, Chessboard, designed to minimize biases and improve interpretability and explainability in Remote Sensing Visual Question Answering models.
Conducts a comparative analysis of RT-DETR model variants for automated beach litter detection and counting, investigating the efficacy of state-of-the-art object detection models.
Studies the challenge of transferring animations between characters with different skeletal topologies by proposing a method to address topological inconsistency and establish bone correspondences.
Presents 4DNeX, the first feed-forward framework for generating dynamic 3D scene representations from a single image by fine-tuning a pre-trained video diffusion model.
Archive contains 57 days of AI research intelligence