Abstract
Recent LLMs have shown remarkable success in following user instructions, yet
handling instructions with multiple constraints remains a significant
challenge. In this work, we introduce WildIFEval - a large-scale dataset of 7K
real user instructions with diverse, multi-constraint conditions. Unlike prior
datasets, our collection spans a broad lexical and topical spectrum of
constraints, extracted from natural user instructions. We categorize these
constraints into eight high-level classes to capture their distribution and
dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive
experiments to benchmark the instruction-following capabilities of leading
LLMs. WildIFEval clearly differentiates between small and large models, and
demonstrates that all models have substantial room for improvement on such tasks.
We analyze the effects of the number and type of constraints on performance,
revealing interesting patterns of model constraint-following behavior. We
release our dataset to promote further research on instruction-following under
complex, realistic conditions.
Key Contributions
Introduces WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions, spanning a broad lexical and topical spectrum. It benchmarks leading LLMs, revealing significant room for improvement and analyzing the effects of constraint number and type on performance.
Business Value
Enables the development of more capable and reliable AI assistants and applications that can understand and execute complex user requests accurately, improving user experience and task completion rates.