Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: The task of information extraction (IE) is to extract structured knowledge
from text. However, it is often not straightforward to utilize IE output due to
the mismatch between the IE ontology and the downstream application needs. We
propose a new formulation of IE TEXT2DB that emphasizes the integration of IE
output and the target database (or knowledge base). Given a user instruction, a
document set, and a database, our task requires the model to update the
database with values from the document set to satisfy the user instruction.
This task requires understanding user instructions for what to extract and
adapting to the given DB/KB schema for how to extract on the fly. To evaluate
this new task, we introduce a new benchmark featuring common demands such as
data infilling, row population, and column addition. In addition, we propose an
LLM agent framework OPAL (Observe-PlanAnalyze LLM) which includes an Observer
component that interacts with the database, the Planner component that
generates a code-based plan with calls to IE models, and the Analyzer component
that provides feedback regarding code quality before execution. Experiments
show that OPAL can successfully adapt to diverse database schemas by generating
different code plans and calling the required IE models. We also highlight
difficult cases such as dealing with large databases with complex dependencies
and extraction hallucination, which we believe deserve further investigation.
Source code: https://github.com/yzjiao/Text2DB
Authors (5)
Yizhu Jiao
Sha Li
Sizhe Zhou
Heng Ji
Jiawei Han
Submitted
October 28, 2025
Key Contributions
This paper introduces TEXT2DB, a new formulation for Information Extraction that emphasizes integration with target databases. It also proposes the OPAL LLM agent framework to perform this task, along with a new benchmark covering data infilling, row population, and column addition.
Business Value
Streamlines the process of populating and updating databases with information extracted from unstructured text, enabling more dynamic and data-rich business intelligence and knowledge management systems.