
MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

📄 Abstract

We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive Learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements of up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
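
To make the joint objective concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how an MLM loss and an in-batch contrastive (InfoNCE) loss might be combined over a shared encoder; the backbone name, masking rate, pooling strategy, and loss weighting are all illustrative assumptions.

```python
# Sketch only: jointly optimizing MLM and in-batch contrastive (InfoNCE) losses
# over a shared transformer encoder. Backbone, weights, and pooling are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder backbone
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def joint_loss(query_batch, positive_batch, mlm_weight=0.5, temperature=0.05):
    """Combine MLM supervision on masked domain text with a contrastive objective
    over (query, positive) sentence pairs; the 0.5 weight is illustrative."""
    # --- MLM term: mask 15% of tokens in the domain text and predict them ---
    enc = tokenizer(query_batch, padding=True, truncation=True, return_tensors="pt")
    labels = enc["input_ids"].clone()
    mask = torch.rand(labels.shape) < 0.15
    mask &= labels != tokenizer.pad_token_id
    enc["input_ids"][mask] = tokenizer.mask_token_id
    labels[~mask] = -100                                  # ignore unmasked positions
    mlm_out = model(**enc, labels=labels)

    # --- Contrastive term: mean-pooled embeddings, in-batch negatives ---
    def embed(texts):
        t = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**t, output_hidden_states=True).hidden_states[-1]
        attn = t["attention_mask"].unsqueeze(-1)
        return (hidden * attn).sum(1) / attn.sum(1)       # mean pooling

    q = F.normalize(embed(query_batch), dim=-1)
    p = F.normalize(embed(positive_batch), dim=-1)
    logits = q @ p.T / temperature                        # cosine-similarity matrix
    targets = torch.arange(len(q))                        # diagonal pairs are positives
    contrastive = F.cross_entropy(logits, targets)

    return mlm_weight * mlm_out.loss + (1 - mlm_weight) * contrastive
```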
Authors (2)
Vera Pavlova
Mohammed Makhlouf
Submitted
October 19, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Introduces MOSAIC, a multi-stage framework for domain adaptation of sentence embedding models that combines domain-specific masked language modeling (MLM) with contrastive objectives in a joint training pipeline. It effectively adapts large general-domain models to specialized domains, improving retrieval performance.
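
As a rough illustration of the multi-stage aspect, the sketch below lays out a two-stage schedule that shifts weight from MLM supervision toward the contrastive objective; the stage names, corpora, weights, and epoch counts are assumptions, not the paper's exact recipe.

```python
# Illustrative staged-adaptation schedule; all values here are assumptions.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str          # which corpus the stage trains on
    mlm_weight: float  # share of the joint loss given to MLM supervision
    epochs: int

# Stage 1 adapts the encoder to domain text with a strong MLM signal;
# Stage 2 shifts the balance toward the contrastive retrieval objective.
schedule = [
    Stage("domain-adaptive pretraining", "unlabeled in-domain corpus", 0.7, 3),
    Stage("contrastive fine-tuning", "in-domain query-passage pairs", 0.2, 2),
]

for stage in schedule:
    print(f"{stage.name}: {stage.epochs} epochs on {stage.data} "
          f"(MLM weight {stage.mlm_weight})")
```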

Business Value

Enables more accurate and relevant search results and semantic understanding within specialized domains (e.g., legal, medical, scientific), improving information access and decision-making.