
Pie: A Programmable Serving System for Emerging LLM Applications

Abstract

Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to integrate computation and I/O seamlessly, entirely within the application and without modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows that Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
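
To make the decomposition concrete, here is a minimal, self-contained Python sketch of the idea: the application, not the server, owns the generation loop and drives fine-grained handlers. The engine class and every handler name (alloc_kv, forward, sample, free_kv) are hypothetical stand-ins for illustration and do not reproduce Pie's actual API.

```python
import random

class MockEngine:
    """Stand-in for a serving engine exposing fine-grained service handlers."""
    def __init__(self):
        self.kv_pages = {}
        self.next_id = 0

    def alloc_kv(self, n_pages: int) -> int:
        # A real system would reserve GPU KV-cache pages here.
        self.next_id += 1
        self.kv_pages[self.next_id] = n_pages
        return self.next_id

    def forward(self, kv_id: int, token: int) -> list[float]:
        # Mock forward pass: deterministic fake logits over a 4-token vocab.
        # (kv_id is unused in this mock; a real handler would read the cache.)
        random.seed(token)
        return [random.random() for _ in range(4)]

    def sample(self, logits: list[float]) -> int:
        # Greedy sampling; an inferlet could substitute any strategy here.
        return max(range(len(logits)), key=logits.__getitem__)

    def free_kv(self, kv_id: int) -> None:
        self.kv_pages.pop(kv_id, None)

def inferlet(engine: MockEngine, prompt: list[int], max_new: int) -> list[int]:
    """Application-owned generation loop: the application decides when to
    stop, how to sample, and how KV state is allocated and released."""
    kv = engine.alloc_kv(n_pages=1)
    out = list(prompt)
    tok = prompt[-1]
    for _ in range(max_new):
        logits = engine.forward(kv, tok)
        tok = engine.sample(logits)
        out.append(tok)
        if tok == 0:  # application-specific stop condition
            break
    engine.free_kv(kv)
    return out

print(inferlet(MockEngine(), prompt=[2, 3], max_new=5))
```

Because the loop lives in the application, custom KV-cache policies or interleaved I/O become ordinary application code rather than serving-system modifications, which is the flexibility the abstract claims.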
Authors: In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
Submitted: October 28, 2025
arXiv Category: cs.CL

Key Contributions

This paper introduces Pie, a programmable LLM serving system that decomposes the generation loop into fine-grained service handlers and delegates control of generation to user-defined programs (inferlets), which it executes in a WebAssembly sandbox. This design yields significant flexibility and efficiency improvements for emerging LLM applications.
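
Because inferlets run inside a WebAssembly sandbox, a host can instantiate them with an off-the-shelf runtime. The sketch below uses the wasmtime Python bindings to run a toy guest module that calls back into a host "handler" once per step. The WAT module, the export name run, and the host function are illustrative assumptions, not Pie's actual ABI.

```python
# pip install wasmtime
from wasmtime import Engine, Store, Module, Instance, Func, FuncType, ValType

engine = Engine()
store = Store(engine)

# Toy guest "inferlet": loops n times, invoking the imported host handler on
# each iteration, then returns how many steps it ran.
wat = """
(module
  (import "host" "step" (func $step (param i32) (result i32)))
  (func (export "run") (param $n i32) (result i32)
    (local $i i32)
    (block $done
      (loop $loop
        (br_if $done (i32.ge_s (local.get $i) (local.get $n)))
        (drop (call $step (local.get $i)))
        (local.set $i (i32.add (local.get $i) (i32.const 1)))
        (br $loop)))
    (local.get $i)))
"""

def host_step(i: int) -> int:
    # Stand-in for a fine-grained service handler (e.g., one decode step).
    print(f"handler invoked for step {i}")
    return 0

step = Func(store, FuncType([ValType.i32()], [ValType.i32()]), host_step)
module = Module(engine, wat)
instance = Instance(store, module, [step])
run = instance.exports(store)["run"]
print("steps executed:", run(store, 3))
```

The sandbox boundary is what lets untrusted, application-supplied generation logic run next to the serving engine with lightweight isolation, per the paper's stated motivation for choosing WebAssembly.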

Business Value

Enables faster, more cost-effective deployment and scaling of advanced LLM applications, supporting innovation in areas like AI agents and complex reasoning systems.