📄 Abstract
As model context lengths continue to grow, concerns about whether models
effectively use the full context length have persisted. While several carefully
designed long-context evaluations have recently been released, these
evaluations tend to rely on retrieval from one or more sections of the context,
which allows nearly all of the context tokens to be disregarded as noise. This
represents only one type of task that might be performed with long context. We
introduce Oolong, a benchmark of long-context reasoning tasks that require
analyzing individual chunks of text on an atomic level, and then aggregating
these analyses to answer distributional questions. Oolong is separated into two
task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can
easily ablate components of the reasoning problem; and Oolong-real, a
downstream setting which requires reasoning over real-world conversational
data. Oolong requires models to reason over large quantities of examples, to
perform both classification and counting in-context, and to reason over
temporal and user relations. Even frontier models struggle on Oolong, with
GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy
on both splits at a 128K-token context length. We release the data and evaluation harness for Oolong
to enable further development of models that can reason over large quantities
of text.
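To make the task structure concrete, the sketch below illustrates the classify-then-aggregate pattern behind distributional questions of the kind the abstract describes: each chunk gets an atomic label, and the labels must then be counted to produce the answer. This is not code from the Oolong harness; the label set and the `classify_chunk` helper are hypothetical stand-ins for the per-chunk judgments a model would have to make in-context.

```python
from collections import Counter

# Hypothetical label set; Oolong's actual task sets and released
# evaluation harness may define different labels and questions.
LABELS = ["question", "request", "chitchat"]

def classify_chunk(chunk: str) -> str:
    """Stand-in for the atomic, per-chunk classification a model performs in-context."""
    # A real evaluation would have the LLM assign the label; simple cues suffice here.
    if chunk.rstrip().endswith("?"):
        return "question"
    if "please" in chunk.lower():
        return "request"
    return "chitchat"

def most_frequent_label(chunks: list[str]) -> str:
    """Aggregate the atomic labels to answer a distributional question:
    which label occurs most often across the whole context?"""
    counts = Counter(classify_chunk(c) for c in chunks)
    label, _ = counts.most_common(1)[0]
    return label

if __name__ == "__main__":
    chunks = [
        "Can you reset my password?",
        "Please send the invoice.",
        "Thanks, have a good day.",
    ]
    print(most_frequent_label(chunks))  # -> "question"
```

The point of the sketch is that no single chunk can be skipped as noise: the final answer depends on every atomic classification, which is what separates this aggregation setting from retrieval-style long-context tasks.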
Key Contributions
Introduces Oolong, a benchmark designed to evaluate the long-context reasoning and aggregation capabilities of LLMs. Unlike benchmarks that rely primarily on retrieval, Oolong requires atomic-level analysis of individual text chunks followed by aggregation of those analyses, and it includes both a synthetic task set (Oolong-synth) and a real-world conversational task set (Oolong-real).
Business Value
Enables the development of more capable LLMs that can process and reason over extensive documents, leading to better summarization, analysis, and question-answering systems for complex information.