Abstract
Emerging large language model (LLM) applications involve diverse reasoning
strategies and agentic workflows, straining the capabilities of existing
serving systems built on a monolithic token generation loop. This paper
introduces Pie, a programmable LLM serving system designed for flexibility and
efficiency. Pie decomposes the traditional generation loop into fine-grained
service handlers exposed via an API and delegates control of the generation
process to user-provided programs, called inferlets. This enables applications
to implement new KV cache strategies, bespoke generation logic, and seamlessly
integrate computation and I/O, entirely within the application, without
requiring modifications to the serving system. Pie executes inferlets using
WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows
Pie matches state-of-the-art performance on standard tasks (3-12% latency
overhead) while significantly improving latency and throughput (1.3x-3.4x)
on agentic workflows by enabling application-specific optimizations.
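To make the inferlet idea concrete, here is a minimal sketch of what an application-owned generation loop could look like when the serving system exposes fine-grained handlers. The handler names (`kv_allocate`, `kv_free`, `forward`, `sample_greedy`) are illustrative assumptions stubbed out so the sketch is self-contained; the paper does not specify this API. The point it illustrates is the shift of control: stopping rules, KV-cache management, and any interleaved I/O (e.g., a tool call between decode steps) live in application code rather than inside a monolithic serving loop.

```rust
// Sketch of an inferlet-style generation loop against hypothetical
// fine-grained service handlers. None of these names come from the
// paper; they are stand-ins stubbed out for self-containment.

/// Opaque handle to a block of KV-cache memory (assumed abstraction).
struct KvPages(u32);

// Hypothetical handlers the serving system might expose to inferlets.
fn kv_allocate(num_pages: u32) -> KvPages { KvPages(num_pages) } // stub
fn kv_free(_pages: KvPages) {}                                   // stub
fn forward(_pages: &KvPages, _tokens: &[u32]) -> Vec<f32> {
    vec![0.0; 8] // stub: pretend these are next-token logits
}

/// Pick the argmax token id from the logits (greedy decoding).
fn sample_greedy(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i as u32)
        .unwrap()
}

const EOS: u32 = 2; // assumed end-of-sequence token id

/// The inferlet owns the loop: it decides when to stop, how to manage
/// the KV cache, and could interleave I/O between decode steps.
fn generate(prompt: &[u32], max_new: usize) -> Vec<u32> {
    let pages = kv_allocate(4);
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = forward(&pages, &tokens);
        let next = sample_greedy(&logits);
        if next == EOS {
            break; // application-defined stopping rule
        }
        tokens.push(next);
    }
    kv_free(pages); // the application decides when cache memory is released
    tokens
}

fn main() {
    let out = generate(&[1, 15, 42], 16);
    println!("{:?}", out);
}
```

Under this framing, a custom KV-cache strategy or bespoke decoding scheme is just a different loop body compiled to WebAssembly, rather than a modification to the serving system itself.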
Authors (4)
In Gim
Zhiyao Ma
Seung-seob Lee
Lin Zhong
Submitted
October 28, 2025
Key Contributions
This paper introduces Pie, a programmable LLM serving system that decomposes the generation loop into fine-grained service handlers and executes user-defined programs (inferlets) via WebAssembly. This offers significant flexibility and efficiency improvements for emerging LLM applications.
Business Value
Enables faster, more cost-effective deployment and scaling of advanced LLM applications, supporting innovation in areas like AI agents and complex reasoning systems.