
FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

Abstract

Table-to-text generation (generating insights from tables) is a challenging task that requires precision in analyzing the data. In addition, evaluation on existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as by domain imbalance. We introduce FreshTab, an on-the-fly method for generating table-to-text benchmarks from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method score clearly worse on automatic metrics, but this does not carry over to LLM-based and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.

Key Contributions

Introduces FreshTab, an on-the-fly method for generating table-to-text benchmarks from Wikipedia, designed to combat LLM data contamination and enable domain-sensitive evaluation. It addresses the limitations of existing benchmarks by providing fresh data and supporting multilingual benchmark collection, revealing domain effects in LLM performance.
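
The summary does not include the paper's code, but the core idea of on-the-fly collection can be illustrated with a minimal sketch: query the public MediaWiki API for articles created after a chosen LLM training cutoff (so the tables cannot appear in the model's training data), then extract their wikitables. All function names, parameters, and the cutoff date below are illustrative assumptions, not the authors' pipeline.

```python
"""Minimal sketch of FreshTab-style fresh-table collection (not the authors' code).

Assumption: "freshness" is approximated by restricting to Wikipedia articles
created after a chosen LLM training cutoff, fetched via the MediaWiki API.
"""
import io

import pandas as pd
import requests

HEADERS = {"User-Agent": "freshtab-sketch/0.1 (example script)"}

def fetch_new_page_titles(lang="en", created_after="2024-06-01T00:00:00Z", limit=50):
    """List article titles created after the cutoff (newest first)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "new",          # newly created pages only
        "rcnamespace": 0,         # main/article namespace
        "rcend": created_after,   # results run newest -> oldest; stop at cutoff
        "rclimit": limit,
        "format": "json",
    }
    url = f"https://{lang}.wikipedia.org/w/api.php"
    resp = requests.get(url, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [rc["title"] for rc in resp.json()["query"]["recentchanges"]]

def extract_wikitables(title, lang="en"):
    """Parse all <table class="wikitable"> elements from a rendered article."""
    url = f"https://{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    try:
        return pd.read_html(io.StringIO(resp.text), attrs={"class": "wikitable"})
    except ValueError:  # pandas raises ValueError when no matching tables exist
        return []

if __name__ == "__main__":
    # Non-English collection on demand, e.g. a German split:
    for title in fetch_new_page_titles(lang="de", limit=20):
        for table in extract_wikitables(title, lang="de"):
            print(title, table.shape)
```

Filtering to newly created pages is only one simple contamination guard; the paper's actual pipeline also performs domain-sensitive selection (e.g., balancing across topic domains), which this sketch does not implement.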

Business Value

Provides a more reliable and representative evaluation framework for table-to-text generation models, crucial for applications requiring accurate data interpretation and insight generation. This leads to better-performing AI systems in data analysis and reporting.