TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

📄 Abstract

The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework, OPAL (Observe-Plan-Analyze LLM), which includes an Observer component that interacts with the database, a Planner component that generates a code-based plan with calls to IE models, and an Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
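The Observer, Planner, and Analyzer roles described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`observe`, `plan`, `analyze`, `extract_pair`, `run`), the `people` table, the hard-coded plan, and the allow-list check are all assumptions made for illustration. In particular, a real Planner would prompt an LLM rather than return a fixed string.

```python
import ast
import sqlite3


def observe(conn):
    """Observer stand-in: read table and column names from the database."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = [c[1] for c in cols]  # c[1] is the column name
    return schema


def plan(instruction, schema):
    """Planner stand-in: a real planner would prompt an LLM with the
    instruction and schema. The plan is hard-coded here (assumption)."""
    return (
        "for doc in docs:\n"
        "    name, city = extract_pair(doc)\n"
        "    conn.execute('UPDATE people SET city = ? WHERE name = ?',"
        " (city, name))\n"
    )


def analyze(code, allowed_calls):
    """Analyzer stand-in: reject plans that fail to parse or that call
    bare functions outside an allow-list, before anything is executed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in allowed_calls:
                problems.append(f"unknown call: {node.func.id}")
    return problems


def extract_pair(doc):
    """Toy IE model (hypothetical): expects '<name> lives in <city>.'"""
    name, city = doc.rstrip(".").split(" lives in ")
    return name, city


def run(conn, docs, instruction):
    """Observe the schema, plan, analyze, and execute only a clean plan."""
    schema = observe(conn)
    code = plan(instruction, schema)
    problems = analyze(code, allowed_calls={"extract_pair"})
    if problems:
        return problems
    exec(code, {"conn": conn, "docs": docs, "extract_pair": extract_pair})
    return []


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")
conn.execute("INSERT INTO people (name) VALUES ('Ada')")
issues = run(conn, ["Ada lives in London."], "Fill in missing cities.")
rows = conn.execute("SELECT name, city FROM people").fetchall()
```

The point of the sketch is the ordering: the plan is inspected before execution, so a plan that calls an unavailable IE model is rejected without touching the database.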

Authors (5)

Yizhu Jiao

Sha Li

Sizhe Zhou

Heng Ji

Jiawei Han

Submitted

October 28, 2025

arXiv Category

cs.CL

Key Contributions

This paper introduces TEXT2DB, a new formulation for Information Extraction that emphasizes integration with target databases. It also proposes the OPAL LLM agent framework to perform this task, along with a new benchmark covering data infilling, row population, and column addition.

Business Value

Streamlines the process of populating and updating databases with information extracted from unstructured text, enabling more dynamic and data-rich business intelligence and knowledge management systems.

Paper Metadata

Innovation Type

Task Formulation & Framework Development

Deployment Feasibility

Moderate; requires integration with existing database systems and LLM infrastructure.

Limitations Addressed

The mismatch between traditional IE output ontologies and downstream application needs (databases/KBs), making IE output difficult to utilize.

Technical Tags

Information Extraction (IE), Large Language Models (LLMs), LLM Agents, Database Update, Knowledge Base Integration, Integration-Aware IE, Benchmark, Data Infilling, Row Population, Column Addition, OPAL framework

Research Topics

Information Extraction, Knowledge Representation, Database Management, LLM Agents, Structured Data Generation

Methods & Architectures

TEXT2DB formulation, OPAL (Observe-Plan-Analyze LLM) agent framework, Database interaction, Schema adaptation, Large Language Models (LLMs), LLM Agents

Applications & Tasks

Data Management, Knowledge Management, Business Intelligence, Information Systems, Extracting structured knowledge, Integrating IE output with databases, On-the-fly schema adaptation, Data completion, Updating databases from text, Populating tables, Adding columns to databases, Data infilling

Datasets & Benchmarks

Benchmarks

New benchmark for integration-aware IE (data infilling, row population, column addition)
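The three demands named above can be read as three kinds of database update. The sketch below is one interpretation using Python's built-in `sqlite3` module, not the benchmark's actual data format: the `papers` table, its columns, and every value are placeholders standing in for what an IE model would extract from documents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT PRIMARY KEY, venue TEXT)")
conn.execute("INSERT INTO papers VALUES ('Paper A', NULL)")

# Data infilling: fill a missing cell in an existing row.
conn.execute(
    "UPDATE papers SET venue = ? WHERE title = ? AND venue IS NULL",
    ("Venue X", "Paper A"),
)

# Row population: add a new row for an entity found in the documents.
conn.execute(
    "INSERT INTO papers (title, venue) VALUES (?, ?)",
    ("Paper B", "Venue Y"),
)

# Column addition: extend the schema, then fill the new column per row.
conn.execute("ALTER TABLE papers ADD COLUMN year INTEGER")
conn.executemany(
    "UPDATE papers SET year = ? WHERE title = ?",
    [(2024, "Paper A"), (2025, "Paper B")],
)

rows = conn.execute(
    "SELECT title, venue, year FROM papers ORDER BY title"
).fetchall()
```

Under this reading, infilling and row population leave the schema untouched, while column addition changes the schema first and therefore requires a value for every existing row.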

Related Fields

Database Systems, Knowledge Graphs, Natural Language Processing, Artificial Intelligence, Data Engineering

Keywords

Information Extraction, Large Language Models, LLM Agents, Database Integration, Knowledge Base, Structured Data, Data Management, OPAL, Benchmark, Data Infilling

Academic Context

#Information Extraction, #Knowledge Representation, #Database Management, #LLM Agents, #Structured Data Generation

Commercial Potential

Potential Products

Automated database population tools, Knowledge graph construction services, Data integration platforms

Target Industries

Technology, Finance, E-commerce, Media, Any industry relying on structured data

Use Case Examples

Automatically updating a product catalog database from product descriptions. Populating a CRM system with contact information extracted from emails. Building a knowledge graph from news articles.

Competitive Edge

Addresses a critical gap in information extraction by focusing on direct database integration, offering a more practical solution than traditional IE methods that require post-processing.

Market Opportunity

Large market for data integration, data management, and business intelligence solutions.

Revenue Models

SaaS for data integration, licensing of the agent framework.

Resource Requirements

Compute Needs

Moderate to high, depending on the complexity of the database and the size of the documents.

Data Requirements

Requires documents, target database schemas, and user instructions.

Deployment Constraints

Requires robust database access and security protocols. LLM agent performance can vary.

Scalability

Scalable to different database schemas and document types. Agent performance may need tuning for large-scale operations.

Regulatory Considerations

Data privacy and security when accessing and updating databases.

Production Readiness

Maturity Level

Research/Development

Time to Market

1-3 years for a production-ready tool.

Patent Potential

Moderate, for the OPAL agent framework and the TEXT2DB task formulation.
