# Academic Research Intelligence
Deep dive into AI research papers for researchers and academics
---
Executive Summary
1. ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Introduces ScholarBench, a bilingual benchmark for evaluating LLMs on complex academic problem-solving. It targets specialized contexts to assess academic reasoning ability, addressing the limited scalability of prior benchmarks in covering deep expert knowledge.
---
2. Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?
Investigates LLM stability in translating natural language to formal logic for reasoning. Identifies inconsistencies in symbolic representations across linguistic forms, highlighting a need for more robust translation mechanisms to ensure logical coherence.
---
3. EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and Dictionary-based Named Entity Recognition from Medical and Life Science Text
Develops EasyNER, an easy-to-use pipeline for Named Entity Recognition in medical and life science text. It provides automated text mining to help researchers utilize information from large bodies of literature, overcoming accessibility challenges.
---
4. CAP: Evaluation of Persuasive and Creative Image Generation
Introduces three evaluation metrics: Creativity, prompt Alignment, and Persuasiveness (CAP) for advertisement image generation. Addresses the challenge of evaluating Text-to-Image models beyond simple alignment with explicit descriptions.
---
5. Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Proposes Vgent, a graph-based Retrieval-Augmented Generation framework for long video understanding. It addresses challenges in processing extended video tokens and retaining long-term sequential information for effective reasoning.
---
6. Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
Presents a zero-shot pipeline for creating hyperrealistic 3D avatars from phone images. Introduces a generative canonicalization approach to address geometric inconsistencies and improve identity preservation and realism.
---
7. PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis
Proposes PIA, a deepfake detection method using phoneme-temporal and identity-dynamic analysis. It aims to improve the identification of modern deepfakes generated by advanced generative models, overcoming limitations of conventional methods.
---
8. CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts
Introduces CLEAR, a causal-inference-based framework for robust histopathology tumor detection. It leverages semantic features while mitigating OOD shifts by modeling causal relationships, improving generalization beyond statistical correlations.
---
9. Vision-Centric Activation and Coordination for Multimodal Large Language Models
Introduces VaCo, a framework optimizing MLLM representations through vision-centric activation and coordination. It enhances analytical abilities by leveraging multiple vision foundation models, addressing the neglect of vision-centric information.
---
10. DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
Proposes DOS, a method for directional object separation in text embeddings for multi-object image generation. It addresses challenges in T2I models with multiple objects, mitigating object neglect and mixing through improved text representation.
---
11. Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration
Applies pruning to overparameterized multi-task networks for degraded web image restoration. It targets web images degraded by lossy operations, aiming to recover clean, high-quality images through efficient network optimization.
---
12. PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Introduces PaddleOCR-VL, a compact Vision-Language Model for multilingual document parsing. It efficiently supports 109 languages and excels at recognizing complex elements like text, tables, and charts, boosting document analysis capabilities.
---
13. Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology
Introduces DentVFM, the first family of vision foundation models for oral and maxillofacial radiology. It addresses limitations of single-modality, task-specific dental AI systems, aiming for generalization across diverse clinical scenarios.
---
14. Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval
Proposes a framework for brain MR image harmonization that acquires interpretable domain information. It disentangles domain-invariant and domain-specific features to improve machine learning performance and content-based retrieval.
---
15. Consistent text-to-image generation via scene de-contextualization
Proposes scene de-contextualization for consistent text-to-image generation. It addresses identity shift by decoupling subject and scene context, enabling identity-preserving images across diverse scenes without prior scene knowledge.
---
16. Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video
Introduces an ego-proactive Video-LLM for streaming video that actively understands and anticipates events. It focuses on proactive coherence and just-in-time perception and reasoning for dynamic, evolving questions.
---
17. Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Introduces Efficient Video Sampling (EVS), a method for pruning temporally redundant tokens in videos. It addresses scalability limitations of VLMs processing dense frame sequences, reducing token redundancy for faster inference.
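
As a rough illustration of temporal token pruning, the sketch below drops a patch token when it barely differs from the token at the same position in the previous frame. This is a simplified stand-in, not the EVS algorithm itself: the function name `prune_static_tokens`, the mean-absolute-difference test, and the `threshold` value are all assumptions made for the example.

```python
from typing import List, Tuple

Token = List[float]   # one patch embedding
Frame = List[Token]   # tokens for one video frame, in fixed patch order


def prune_static_tokens(frames: List[Frame], threshold: float = 0.05) -> List[Tuple[int, int]]:
    """Return (frame_index, patch_index) pairs for the tokens that are kept.

    Hypothetical rule: keep every token of the first frame; afterwards keep a
    token only if its mean absolute difference from the same patch in the
    previous frame exceeds `threshold`.
    """
    kept = []
    for t, frame in enumerate(frames):
        for p, token in enumerate(frame):
            if t == 0:
                kept.append((t, p))
                continue
            prev = frames[t - 1][p]
            diff = sum(abs(a - b) for a, b in zip(token, prev)) / len(token)
            if diff > threshold:
                kept.append((t, p))
    return kept


# Two patches per frame: patch 0 is static, patch 1 changes in the second frame.
video = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.2, 0.4]],
]
kept_tokens = prune_static_tokens(video)
```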
---
18. Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Introduces RepTok, a generative modeling framework using single continuous latent tokens from self-supervised ViTs. It adapts semantic tokens with low-level details for faithful image reconstruction, enabling efficient generation.
---
19. SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation
Proposes SteeringTTA, an inference-only framework guiding diffusion-based input adaptation for test-time adaptation. It steers diffusion trajectories to improve robustness across distortion types, addressing limitations of gradient-guided methods.
---
20. WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
Introduces WeCKD, a weakly-supervised chained distillation network for efficient multimodal medical imaging. It addresses knowledge degradation and inefficient supervision in traditional KD by using a chained approach and minimal data.
---
AI for Science
1. CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts
Introduces CLEAR, a causal-inference-based framework for histopathology tumor detection. It leverages semantic features while mitigating out-of-distribution shifts, improving model generalization in challenging medical imaging scenarios.
---
2. EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and Dictionary-based Named Entity Recognition from Medical and Life Science Text
Develops EasyNER, an end-to-end pipeline for named entity recognition in medical and life science text. It provides an accessible tool for automating information extraction from large scientific literature bodies.
---
3. Element2Vec: Build Chemical Element Representation from Text for Property Prediction
Proposes Element2Vec, a method to build chemical element representations from text for property prediction. It models complex relationships, enabling more accurate predictions for materials design and manufacturing.
---
4. Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability
Extends biology-informed neural networks (BINNs) for genomic prediction by integrating omics data and biological knowledge. It improves prediction accuracy and interpretability, offering an alternative to traditional models.
---
5. Improving Intrusion Detection with Domain-Invariant Representation Learning in Latent Space
Introduces a multi-task representation learning technique that fuses information for domain generalization in intrusion detection. It improves zero-day anomaly detection by leveraging knowledge from multiple domains.
---
6. Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits
Explores using GANs for geological inference, specifically training on fluvial deposits to better reproduce geological structures. It aims to improve resource prediction and quantify uncertainty in subsurface variations.
---
AI Safety & Ethics
1. NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
Introduces NAPPure, an adversarial purification framework to combat non-additive perturbations like blur and distortion in image classification. Achieves improved robustness against common real-world corruptions, enabling more reliable image recognition systems.
---
2. Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?
Investigates LLM stability in translating natural language to formal logic for reasoning tasks. Identifies inconsistencies in LLM-generated symbolic representations across linguistic variations, highlighting a challenge for reliable logical deduction.
---
3. PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis
Proposes PIA, a deepfake detection framework analyzing phoneme-temporal and identity-dynamic features. Addresses limitations of conventional methods in identifying modern deepfakes generated by advanced models, aiming for more accurate detection.
---
4. Vision-Centric Activation and Coordination for Multimodal Large Language Models
Introduces VaCo, a framework optimizing multimodal LLM representations through vision-centric activation and coordination. Enhances analytical abilities by focusing on essential vision-centric information beyond text-only supervision.
---
5. CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts
Proposes CLEAR, a causal-inference-based framework for robust histopathology tumor detection under distribution shifts. Leverages semantic features and mitigates impacts of acquisition process differences for better generalization.
---
6. LOTA: Bit-Planes Guided AI-Generated Image Detection
Introduces LOTA, an AI-generated image detection method using bit-planes for feature extraction. Addresses high computational cost and captures the intrinsic noise features of raw images, improving detection efficiency and accuracy.
---
7. Structured Universal Adversarial Attacks on Object Detection for Video Sequences
Proposes a universal adversarial attack tailored for video object detection using structured perturbations concentrated in the background. Leverages nuclear norm regularization to promote minimally distorted attacks for enhanced robustness testing.
---
8. RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Presents RAID, a framework for probing LLM jailbreaking vulnerabilities by crafting adversarial suffixes that induce restricted content. Optimizes embeddings with a joint objective encouraging refusal awareness and integrated decoding for better safety analysis.
---
AI Theory & Foundations
1. Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?
Introduces an analysis framework to evaluate LLM consistency in translating natural language to formal logic across linguistic variations. Identifies inconsistencies that break logical coherence, highlighting a challenge for symbolic solver applications.
---
2. Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks
Investigates zero-shot transfer in sense-aware NLP tasks across 28 languages. Demonstrates that multilinguality is not essential, identifying other factors like pretraining and fine-tuning data as more critical for effective cross-lingual transfer.
---
3. TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Introduces a benchmark for evaluating LLM probabilistic reasoning using text-only multi-armed bandit environments. Assesses LLMs' ability to infer latent reward structures and make sequential decisions without numerical cues or explicit probabilities.
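
To make the setup concrete, here is a minimal sketch of a language-only bandit, assuming Bernoulli arms whose outcomes are reported as sentences rather than numbers. The class `TextBanditEnv`, the feedback wording, and the keyword-matching baseline `greedy_text_agent` are hypothetical; in the benchmark the agent would be an LLM reading the feedback, not a string match.

```python
import random
from typing import List


class TextBanditEnv:
    """Toy language-only bandit: outcomes are reported as sentences, never as numbers."""

    def __init__(self, success_probs: List[float], seed: int = 0):
        self._probs = success_probs        # hidden from the agent
        self._rng = random.Random(seed)

    def pull(self, arm: int) -> str:
        won = self._rng.random() < self._probs[arm]
        outcome = "paid out" if won else "paid nothing"
        return f"You chose slot machine {arm} and it {outcome}."


def greedy_text_agent(env: TextBanditEnv, n_arms: int, steps: int) -> int:
    """Tally wins from the text feedback alone and return the preferred arm."""
    wins = [0] * n_arms
    pulls = [0] * n_arms

    def win_rate(a: int) -> float:
        return wins[a] / max(pulls[a], 1)

    for step in range(steps):
        # Round-robin over the arms for the first 10 * n_arms steps,
        # then exploit the best observed win rate.
        arm = step % n_arms if step < n_arms * 10 else max(range(n_arms), key=win_rate)
        feedback = env.pull(arm)
        pulls[arm] += 1
        if "paid out" in feedback:
            wins[arm] += 1
    return max(range(n_arms), key=win_rate)
```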
---
4. Interpreting the Latent Structure of Operator Precedence in Language Models
Investigates whether LLMs encode operator precedence in internal representations using arithmetic tasks. Analyzes the LLaMA 3.2-3B model to understand how it performs computations, offering insights into LLM reasoning mechanisms.
---
5. LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
Proposes the Prompt Duel Optimizer (PDO), a label-free framework for efficient prompt optimization. Formulates prompt optimization as a pairwise comparison problem, reducing reliance on costly labeled validation data.
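
The pairwise framing can be sketched as a round-robin of duels in which the prompt with the most wins is selected. This is only an illustration under stated assumptions: `pick_prompt_by_duels` and `judge` are hypothetical names, the exhaustive round-robin is a placeholder for whatever sampling strategy PDO actually uses, and the length-based judge exists purely so the example runs without a model.

```python
from itertools import combinations
from typing import Callable, Dict, List


def pick_prompt_by_duels(prompts: List[str], judge: Callable[[str, str], str]) -> str:
    """Return the prompt that wins the most pairwise duels (a Copeland-style winner).

    `judge(a, b)` is a hypothetical label-free preference function that returns
    whichever of the two prompts it prefers, e.g. an LLM comparing their outputs.
    """
    wins: Dict[str, int] = {p: 0 for p in prompts}
    for a, b in combinations(prompts, 2):
        wins[judge(a, b)] += 1
    # Ties are broken by the original order of `prompts`.
    return max(prompts, key=lambda p: wins[p])


# Stand-in judge for demonstration only: prefers the longer prompt.
def length_judge(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b


candidates = ["Answer.", "Answer step by step.", "Answer briefly."]
best = pick_prompt_by_duels(candidates, length_judge)
```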
---
6. RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Presents RAID, a framework for crafting adversarial suffixes to induce restricted content from LLMs. Optimizes continuous embeddings with a joint objective to encourage restricted responses while preserving fluency.
---
Computer Vision
1. CAP: Evaluation of Persuasive and Creative Image Generation
Introduces three evaluation metrics (Creativity, prompt Alignment, Persuasiveness) for advertisement image generation. Addresses challenges in evaluating Text-to-Image models beyond explicit description alignment, enabling better assessment of generated advertisement quality.
---
2. Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Proposes Vgent, a graph-based retrieval-reasoning-augmented generation framework for long video understanding. Addresses challenges of processing intensive video tokens and retaining long-term information, improving large video language model capabilities.
---
3. Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
Presents a zero-shot pipeline for creating hyperrealistic 3D avatars from phone images. Introduces generative canonicalization and Gaussian splatting to capture high-frequency details and improve realism, addressing limitations of existing single-view and synthetic data methods.
---
4. MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching
Introduces MatchAttention, a novel attention mechanism for high-resolution cross-view matching. Dynamically matches relative positions, addressing quadratic complexity and lack of explicit constraints in existing cross-attention methods for improved matching accuracy.
---
5. GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Proposes GauSSmart, a hybrid method for enhanced 3D reconstruction using 2D foundation models and geometric filtering. Addresses Gaussian Splatting's limitations in fine detail capture and realism in sparse coverage regions, improving overall 3D reconstruction quality.
---
6. A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
Introduces a multi-domain image translative diffusion StyleGAN for iris presentation attack detection. Addresses scarcity of iris PAD datasets by generating diverse attack samples, enabling more robust detection against sophisticated presentation attacks.
---
7. Vision-Centric Activation and Coordination for Multimodal Large Language Models
Introduces VaCo, a framework optimizing multimodal LLM representations via vision-centric activation and coordination. Addresses neglect of vision-centric information in current MLLMs, improving analytical abilities by optimizing from multiple vision foundation models.
---
8. DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
Proposes DOS, a method for directional object separation in text embeddings for multi-object image generation. Addresses object neglect and mixing in T2I models for prompts with multiple objects by refining inter-object relationship modeling.
---
9. NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
Proposes NAPPure, an adversarial purification framework for robust image classification under non-additive perturbations. Extends existing methods to handle real-world perturbations like blur and distortion, improving robustness beyond additive attacks.
---
10. TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving
Introduces TopoStreamer, a framework for temporal lane segment topology reasoning in autonomous driving. Addresses limitations in positional embedding and attribute learning for accurate road network reconstruction, enabling better road-dependent maneuvers.
---
Efficient AI
1. Real-Time Neural Video Compression with Unified Intra and Inter Coding
Proposes a neural video compression scheme with unified intra and inter coding to address limitations in handling disocclusion, new content, and error propagation. Enables efficient real-time encoding/decoding with superior compression efficiency compared to H.266/VVC.
---
2. Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
Presents a low-power Vision Transformer accelerator optimized via algorithm-hardware co-design, using hardware-friendly dynamic token pruning. Reduces model complexity and the dominant feed-forward network (FFN) bottleneck for efficient vision transformer inference, especially on edge devices.
---
3. BitNet Distillation
Introduces BitNet Distillation (BitDistill), a lightweight pipeline to fine-tune full-precision LLMs into 1.58-bit precision for downstream tasks. Achieves strong performance with minimal computational cost using SubLN, multi-head attention distillation, and continual pre-training.
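
For context, 1.58 bits corresponds to log2(3): each weight takes one of three values. Below is a minimal sketch of such ternary quantization, assuming the absmean scaling associated with the BitNet b1.58 line of work; the summary above does not state whether BitDistill applies exactly this quantizer, and the function name `ternary_quantize` is hypothetical.

```python
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, +1} using an absmean scale.

    Returns the ternary codes and the scale, so that code * scale
    approximates the original weight.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


codes, scale = ternary_quantize([0.9, -0.05, -1.4, 0.3])
```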
---
4. ELASTIC: Efficient Once For All Iterative Search for Object Detection on Microcontrollers
Proposes ELASTIC, an efficient iterative search method for object detection on TinyML platforms. Optimizes individual modules and their synergies within hardware constraints, enabling high-performance deployment on resource-limited microcontrollers.
---
5. Efficient Dynamic Structured Sparse Training with Learned Shuffles
Introduces learned shuffles to enable efficient dynamic structured sparse training, closing the expressivity gap with unstructured sparsity. Achieves higher accuracy than fixed layouts by learning permutations jointly with structured sparsity.
---
6. Enhancing Time-Series Anomaly Detection by Integrating Spectral-Residual Bottom-Up Attention with Reservoir Computing
Proposes a time-series anomaly detection framework integrating spectral-residual attention with reservoir computing for edge AI. Achieves real-time detection with low memory overhead and computational simplicity, crucial for preventing incidents.
---
Generative AI
1. DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
Introduces DOS to improve multi-object image generation by addressing object neglect and mixing. Demonstrates better handling of inter-object relationships, enabling more accurate and aligned image synthesis for complex prompts.
---
2. On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?
Investigates LLM resilience to character-level perturbations using UCC-Inj. Shows LLMs maintain performance despite significant obfuscation, implying robustness against tokenization fragmentation and reduced signal-to-noise ratio.
---
3. Visual Stereotypes of Autism Spectrum in Janus-Pro-7B, DALL-E, Stable Diffusion, SDXL, FLUX, and Midjourney
Evaluates six text-to-image models for autism stereotypes by comparing generated images to controls. Analyzes prompt sensitivity and model evolution, highlighting potential biases in AI's portrayal of neurodiversity.
---
4. Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation
Introduces ConDA to organize diffusion model latents using contrastive learning. Aligns latent geometry with system dynamics, enabling structured traversals that reflect controllable generation and disentangled representations.
---
5. A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
Proposes a multi-domain diffusion StyleGAN for iris presentation attack detection. Addresses scarcity of training data by translating images across domains, enabling more robust detection of spoofing attempts.
---
6. CAP: Evaluation of Persuasive and Creative Image Generation
Introduces three metrics (CAP) to evaluate advertisement image generation quality. Assesses Creativity, prompt Alignment, and Persuasiveness, addressing limitations of existing methods that focus solely on explicit descriptions.
---
7. Real-Time Adaptive Motion Planning via Point Cloud-Guided, Energy-Based Diffusion and Potential Fields
Presents a motion planning framework combining energy-based diffusion with potential fields for real-time trajectory generation. Processes point clouds directly, enabling efficient planning without full geometric representations.
---
8. Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits
Explores using GANs for geological inference, specifically training on fluvial deposits. Addresses challenges in reproducing geological structures by leveraging deep learning for continuous representations.
---
9. Generating High Dimensional User-Specific Wireless Channels using Diffusion Models
Introduces a novel method for generating synthetic wireless channel data using diffusion models. Produces user-specific channels to train DNN-based algorithms without expensive, high-dimensional measurements.
---
10. GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Proposes GauSSmart, a hybrid method enhancing 3D reconstruction by integrating 2D foundation models and geometric filtering. Improves detail capture and realism in regions with sparse coverage.
---
Graph Neural Networks
1. Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Introduces Vgent, a graph-based retrieval-reasoning-augmented generation framework for long video understanding. Addresses challenges of processing intensive video tokens and retaining long-term information, enabling better video comprehension.
---
2. Boosting Graph Foundation Model from Structural Perspective
Proposes BooG, a framework to boost graph foundation models by unifying structural characteristics across domains. Constructs virtual super nodes to improve generalizability and performance on graph learning tasks.
---
3. Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
Introduces MAHA, a Modality-Aware Hybrid retrieval Architecture for multimodal RAG on unstructured data. Leverages modality-aware knowledge graphs with hybrid retrieval for enhanced question answering and reasoning.
---
4. Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
Proposes a decoupled GNN for predicting throughput in heterogeneous multi-hop p-CSMA networks. Learns wireless interference patterns to overcome limitations of simplified models and exponential-scaling Markov-chain analyses.
---
5. DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis
Introduces DARTS-GT, a differentiable architecture search method for Graph Transformers. Enables quantifiable instance-specific interpretability analysis and addresses rigid designs by allowing depth-specific component selection.
---
6. PoissonNet: A Local-Global Approach for Learning on Surfaces
Introduces PoissonNet, a novel neural architecture for learning on meshes using a local-global scheme. Formulates learning via Poisson's equation to overcome issues with high-frequency features and receptive fields.
---
7. Stealthy Dual-Trigger Backdoors: Attacking Prompt Tuning in LM-Empowered Graph Foundation Models
Investigates stealthy dual-trigger backdoor attacks on prompt tuning in LM-empowered graph foundation models. Reveals significant performance degradation and security vulnerabilities unique to these models.
---
8. Leveraging Code Cohesion Analysis to Identify Source Code Supply Chain Attacks
Proposes an unsupervised approach to highlight spurious code injections by quantifying cohesion disruptions in source code. Leverages code cohesion analysis to identify supply chain attacks.
---
Large Language Models
1. ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Introduces ScholarBench, a bilingual benchmark for evaluating LLM academic reasoning. It targets complex, expert-derived contexts to assess deep knowledge and problem-solving, addressing scalability limitations of prior benchmarks.
---
2. Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?
Investigates LLM stability in translating natural language to formal logic for reasoning. It highlights inconsistencies in symbolic representations from varied linguistic forms, which break logical coherence and cause solver errors.
---
3. Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Proposes Vgent, a graph-based Retrieval-Reasoning-Augmented Generation framework for long video understanding. It addresses challenges of processing intensive video tokens beyond context windows and retaining sequential information.
---
4. Vision-Centric Activation and Coordination for Multimodal Large Language Models
Introduces VaCo, a framework optimizing MLLM representations through vision-centric activation and coordination. It addresses the neglect of critical vision-centric information in mainstream MLLMs, enhancing analytical abilities.
---
5. Spatial Preference Rewarding for MLLMs Spatial Understanding
Proposes Spatial Preference Rewarding to enhance MLLMs' spatial understanding capabilities. It addresses limitations in fine-grained spatial perception and accurate object localization, improving response to user needs.
---
6. TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Introduces TextBandit, a benchmark evaluating LLM probabilistic reasoning via language-only decision tasks. It assesses LLMs' ability to infer latent reward structures from purely textual feedback in multi-armed bandit environments.
---
7. Interpreting the Latent Structure of Operator Precedence in Language Models
Investigates whether LLMs encode operator precedence in their internal representations. It uses a dataset of arithmetic expressions to probe LLaMA 3.2-3B's internal structure for mathematical computation.
---
8. RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems
Introduces RAGCap-Bench to benchmark LLMs in agentic Retrieval Augmented Generation systems. It evaluates capabilities in planning, retrieving, and reasoning over complex, multi-hop queries.
---
9. MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
Introduces MathMist, a parallel multilingual benchmark dataset for assessing LLM mathematical problem-solving and reasoning. It addresses gaps in evaluating competence across diverse languages beyond English.
---
10. Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks
Investigates zero-shot transfer in sense-aware tasks, finding multilinguality is not necessary for effective transfer. A large-scale analysis across 28 languages reveals other factors are more critical.
---
Multimodal Learning
1. Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Introduces Vgent, a graph-based retrieval-reasoning-augmented generation framework for long video understanding. It addresses challenges of processing intensive video tokens beyond context windows and retaining long-term sequential information, enabling more comprehensive video analysis.
---
2. MultiFoodChat: A potential new paradigm for intelligent food quality inspection
Proposes MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. It integrates vision-language models and large language models to overcome the limitations of supervised models, which rely on large labeled datasets and generalize poorly.
---
3. Vision-Centric Activation and Coordination for Multimodal Large Language Models
Introduces VaCo, optimizing MLLM representations through vision-centric activation and coordination. It addresses the neglect of critical vision-centric information in mainstream MLLMs solely supervised by next-token prediction, enhancing analytical abilities.
---
4. CAP: Evaluation of Persuasive and Creative Image Generation
Addresses advertisement image generation by introducing three evaluation metrics: Creativity, prompt Alignment, and Persuasiveness (CAP). It challenges existing evaluation methods that focus largely on alignment with explicit descriptions.
---
5. MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching
Proposes MatchAttention, an attention mechanism that dynamically matches relative positions for high-resolution cross-view matching. It addresses challenges of quadratic complexity and lack of explicit matching constraints in existing cross-attention.
---
6. PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Proposes PaddleOCR-VL, an ultra-compact vision-language model for multilingual document parsing. It integrates a dynamic resolution visual encoder with a language model to efficiently support 109 languages and excel in recognizing complex elements.
---
7. Seeing Through Green: Text-Based Classification and the Firm's Returns from Green Patents
Applies natural language processing to identify "true" green patents from supporting documents. It trains a neural network to enlarge a baseline dictionary through vector representations of environmental technologies.
---
Natural Language Processing
1. Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?
Investigates LLM stability in translating natural language to formal logic for reasoning. Finds inconsistencies in LLM-generated symbolic representations across linguistic variations, impacting logical coherence and solver accuracy. Proposes a benchmark to evaluate this stability.
---
2. EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and Dictionary-based Named Entity Recognition from Medical and Life Science Text
Develops an easy-to-use, customizable pipeline for Named Entity Recognition (NER) from medical and life science text. Integrates deep learning and dictionary-based methods to aid researchers in extracting information from large literature bodies.
---
3. Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks
Investigates zero-shot transfer in sense-aware NLP tasks across 28 languages. Finds multilinguality is not necessary for effective transfer, highlighting other factors like pretraining and fine-tuning data. Challenges common assumptions about cross-lingual transfer.
---
4. ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Introduces ScholarBench, a bilingual benchmark for evaluating LLMs on complex academic tasks. Focuses on deep expert knowledge and problem-solving, addressing limitations of prior benchmarks in scalability and complexity for academic reasoning.
---
5. Seeing Through Green: Text-Based Classification and the Firm's Returns from Green Patents
Introduces NLP methods for identifying "true" green patents using text classification. Trains a neural network on patent documents to enlarge a baseline dictionary through vector representations, and finds that "true" green patents make up about 20% of the total.
---
6. Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
Proposes Thunder-DeID, an accurate and efficient framework for de-identifying Korean court judgments. Addresses two limitations of the current process: it cannot meet legal requirements at scale, and the definitions of personal identifiers are too vague to support technical solutions.
---
7. TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Introduces TextBandit, a benchmark evaluating LLMs' probabilistic reasoning in language-only decision tasks. LLMs interact with multi-armed bandits using textual feedback, inferring latent reward structures without numerical cues or explicit probabilities.
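The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual protocol: the class name `TextBandit`, the two feedback strings, and the epsilon-greedy counting agent are all assumptions made for this example. In the benchmark setting, an LLM would take the place of the counting agent and would see only the text messages.

```python
import random

REWARD_TEXT = "You received a reward."    # hypothetical feedback wording
NO_REWARD_TEXT = "You received nothing."  # hypothetical feedback wording


class TextBandit:
    """Bernoulli multi-armed bandit that reports outcomes only as text."""

    def __init__(self, success_probs, seed=0):
        self.success_probs = list(success_probs)  # latent, never shown to the agent
        self.rng = random.Random(seed)

    def pull(self, arm):
        rewarded = self.rng.random() < self.success_probs[arm]
        return REWARD_TEXT if rewarded else NO_REWARD_TEXT


def run_counting_baseline(bandit, n_arms, rounds, epsilon=0.1, seed=1):
    """Epsilon-greedy agent that estimates arm quality from text feedback alone."""
    rng = random.Random(seed)
    wins = [0] * n_arms
    pulls = [0] * n_arms
    for t in range(rounds):
        if t < n_arms:
            arm = t  # try every arm once before exploiting
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # occasional random exploration
        else:
            arm = max(range(n_arms), key=lambda a: wins[a] / pulls[a])
        message = bandit.pull(arm)
        pulls[arm] += 1
        if message == REWARD_TEXT:  # the only signal is the wording of the message
            wins[arm] += 1
    return wins, pulls


if __name__ == "__main__":
    wins, pulls = run_counting_baseline(TextBandit([0.2, 0.5, 0.8]), 3, 200)
    print("pulls per arm:", pulls)
```

A counting agent like this gives a simple reference point: it infers the latent reward structure purely from which message it receives, which is the same information an LLM would have in a language-only setting.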
---
8. Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures
Investigates semantic prosody in English-Chinese machine translation, focusing on passive structures. Proposes an approach for cases where a literal translation carries a different semantic prosody, which current MT models handle poorly, with the aim of improving translation accuracy.
---
Reinforcement Learning
1. Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
Proposes Identity-GRPO, a reinforcement learning pipeline for multi-human identity-preserving video generation. It optimizes video generation by refining character consistency, enabling more realistic and controllable human-centric video synthesis.
---
2. TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Introduces TextBandit, a benchmark for evaluating LLMs' probabilistic reasoning in decision tasks using only textual feedback. It tests LLMs' ability to infer latent reward structures without numerical cues, advancing understanding of LLM decision-making.
---
3. Strategyproof Reinforcement Learning from Human Feedback
Studies Reinforcement Learning from Human Feedback (RLHF) under strategic labelers, proving that existing RLHF algorithms are not strategyproof. Shows that any strategyproof RLHF algorithm must perform k-times worse in the worst case.
---
4. Offline Reinforcement Learning via Inverse Optimization
Proposes a novel offline Reinforcement Learning (ORL) algorithm using Inverse Optimization's sub-optimality loss for continuous spaces. Mitigates distribution shift with a robust MPC expert steering a dynamics model, enhancing ORL sample efficiency.
---
5. RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Introduces RL-100, a real-world reinforcement learning framework for robotic manipulation using diffusion visuomotor policies. It combines imitation learning, iterative offline RL, and Offline Policy Evaluation for reliable and efficient robot control.
---
6. Agentic Entropy-Balanced Policy Optimization
Proposes Agentic Entropy-Balanced Policy Optimization to address training collapse in agentic RL. It balances exploration and exploitation by mitigating excessive reliance on entropy signals, improving multi-turn tool-use capabilities.
---
7. Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
Investigates emergent exploration in unsupervised reinforcement learning, specifically Single-Goal Contrastive Reinforcement Learning (SGCRL). It combines theoretical analysis and experiments to understand the drivers of exploration in long-horizon tasks.
---
8. Active Measuring in Reinforcement Learning With Delayed Negative Effects
Introduces the Actively Observable Markov Decision Process (AOMDP), where agents select both control actions and measurement actions. Measuring reveals the state but has delayed negative effects; the approach improves sample efficiency and provably reduces uncertainty.
---
Robotics & Embodied AI
1. Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration
Introduces cycle-consistent keypoints for self-supervised RGB-D registration, improving correspondence accuracy. The novel pose blending enhances spatial coherence, enabling better geometric reasoning from unlabeled data.
---
2. ChangingGrounding: 3D Visual Grounding in Changing Scenes
Proposes ChangingGrounding, a benchmark measuring agent ability to exploit past observations for 3D visual grounding in dynamic scenes. Formulates 3DVG as an active, memory-driven problem, enabling robots to localize objects without constant re-scans.
---
3. MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control
Presents MimicKit, an open-source RL framework for training motion controllers using imitation and RL. Provides modular implementations for graphics and robotics research, enabling configurable training structures.
---
4. SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
Introduces SimULi, a real-time LiDAR and camera simulation framework using unscented transforms. Achieves high-fidelity rendering with neural methods suitable for self-driving vehicles, overcoming speed limitations.
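For readers unfamiliar with the unscented transform named in the title, here is a minimal one-dimensional sketch of the general technique. It is not SimULi's rendering pipeline: the function name `unscented_transform_1d` and the choice `kappa=2.0` (so that n + kappa = 3, a common heuristic for Gaussian inputs) are assumptions for illustration. The idea is to push a few deterministically chosen sigma points through a nonlinear function and recover the output mean and variance from their weighted statistics.

```python
import math


def unscented_transform_1d(mean, var, f, kappa=2.0):
    """Approximate the mean and variance of f(x) for x ~ N(mean, var).

    Uses the classic three sigma points for a one-dimensional input (n = 1).
    """
    n = 1
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]

    outputs = [f(p) for p in points]  # propagate each sigma point through f
    out_mean = sum(w * y for w, y in zip(weights, outputs))
    out_var = sum(w * (y - out_mean) ** 2 for w, y in zip(weights, outputs))
    return out_mean, out_var


if __name__ == "__main__":
    # A linear map is recovered exactly: the mean maps to 3*2+1 and the variance scales by 3**2.
    print(unscented_transform_1d(2.0, 4.0, lambda x: 3 * x + 1))
    # A nonlinear map, where linearizing around the mean would give zero variance.
    print(unscented_transform_1d(0.0, 1.0, lambda x: x * x))
```

The appeal for simulation is that `f` can be any function, such as a nonlinear sensor projection, without requiring its derivatives.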
---
5. CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification
Proposes CALM-Net, a curvature-aware LiDAR point cloud network for vehicle re-identification. Integrates edge convolution, point attention, and curvature embedding to learn discriminative features for 3D point cloud analysis.
---
6. Efficient Dynamic Structured Sparse Training with Learned Shuffles
Introduces learned shuffles to enable efficient dynamic structured sparse training. Closes the expressivity gap between structured and unstructured sparse training by learning permutation matrices for greater accuracy.
---
7. RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Presents RL-100, a real-world RL training framework using diffusion visuomotor policies. Employs imitation learning and iterative offline RL with robust evaluation to achieve reliable robotic manipulation.
---
8. The Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents
Introduces INDAGO-Nexus, a multi-objective search approach for DRL testing. Jointly optimizes for failure likelihood and test scenario diversity, ensuring discovered failures are distinct and informative.
---
Speech & Audio
1. Quechua Speech Datasets in Common Voice: The Case of Puno Quechua
Details the integration of Quechua languages into Common Voice, focusing on Puno Quechua. Presents a case study on language onboarding and corpus creation to address data scarcity in speech technology for under-resourced languages.
---
2. Interpreting the Latent Structure of Operator Precedence in Language Models
Investigates whether LLMs encode operator precedence in their internal representations, using LLaMA 3.2-3B. Constructs a dataset of arithmetic expressions to analyze how models handle operator precedence, aiming to improve arithmetic reasoning.
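A dataset of this kind can be sketched as follows. This is an illustrative construction, not the paper's dataset: the three-operand template, the operator set, and the field names are assumptions. Each expression is paired with its correct value and with the value a strict left-to-right reading would give, so that cases where precedence actually matters can be separated out.

```python
import itertools

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def left_to_right(a, op1, b, op2, c):
    """Evaluate 'a op1 b op2 c' while ignoring precedence."""
    return OPS[op2](OPS[op1](a, b), c)


def with_precedence(a, op1, b, op2, c):
    """Evaluate 'a op1 b op2 c' with '*' binding tighter than '+' and '-'."""
    if op2 == "*" and op1 != "*":
        return OPS[op1](a, OPS[op2](b, c))
    return OPS[op2](OPS[op1](a, b), c)


def build_dataset(operands):
    """Enumerate every three-operand expression over the given integers."""
    rows = []
    for a, b, c in itertools.product(operands, repeat=3):
        for op1, op2 in itertools.product(OPS, repeat=2):
            correct = with_precedence(a, op1, b, op2, c)
            naive = left_to_right(a, op1, b, op2, c)
            rows.append({
                "expr": f"{a} {op1} {b} {op2} {c}",
                "correct": correct,
                "left_to_right": naive,
                "precedence_matters": correct != naive,
            })
    return rows


if __name__ == "__main__":
    rows = build_dataset([2, 3, 4])
    hard = [r for r in rows if r["precedence_matters"]]
    print(len(rows), "expressions,", len(hard), "where precedence changes the answer")
```

Splitting on `precedence_matters` gives a natural contrast set: a model that only reads left to right would score well on one half and fail on the other.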
---
3. Quantifying Phonosemantic Iconicity Distributionally in 6 Languages
Undertakes a distributional approach to quantify phonosemantic iconicity across 6 diverse languages. Investigates systematic relationships between phonetics and semantics at scale, exploring language's largely arbitrary nature.
---
4. SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Analyzes adversarial attacks on Speech Language Models (SLMs), finding them more vulnerable to jailbreaks than text-based models. Proposes SPIRIT, a patching method to improve SLM robustness against imperceptible noise-injected speech.
---
5. Beat Detection as Object Detection
Reframes beat and downbeat tracking as object detection in audio. Adapts the FCOS detector to 1D audio, using WaveBeat's feature extractor and a Feature Pyramid Network to capture multi-scale temporal patterns.
---
6. Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks
Creates a benchmark for multi-modal conference talk transcription, integrating slides with audio. Investigates the impact of visual context, specifically presentation slides, on Automatic Speech Recognition (ASR) performance.
---