Abstract
Emerging large language model (LLM) applications involve diverse reasoning
strategies and agentic workflows, straining the capabilities of existing
serving systems built on a monolithic token generation loop. This paper
introduces Pie, a programmable LLM serving system designed for flexibility and
efficiency. Pie decomposes the traditional generation loop into fine-grained
service handlers exposed via an API and delegates control of the generation
process to user-provided programs, called inferlets. This enables applications
to implement new KV cache strategies, bespoke generation logic, and seamlessly
integrate computation and I/O, entirely within the application, without
requiring modifications to the serving system. Pie executes inferlets using
WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows
Pie matches state-of-the-art performance on standard tasks (3-12% latency
overhead) while significantly improving latency and throughput (1.3x-3.4x)
on agentic workflows by enabling application-specific optimizations.
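To make the inferlet idea concrete, here is a minimal sketch of what an application-owned generation loop could look like when the serving system exposes fine-grained handlers. The handler names (`kv_allocate`, `kv_free`, `forward`, `sample_greedy`) are illustrative assumptions stubbed out so the sketch is self-contained; the paper does not specify this API. The point it illustrates is the shift of control: stopping rules, KV-cache management, and any interleaved I/O (e.g., a tool call between decode steps) live in application code rather than inside a monolithic serving loop.

```rust
// Sketch of an inferlet-style generation loop against hypothetical
// fine-grained service handlers. None of these names come from the
// paper; they are stand-ins stubbed out for self-containment.

/// Opaque handle to a block of KV-cache memory (assumed abstraction).
struct KvPages(u32);

// Hypothetical handlers the serving system might expose to inferlets.
fn kv_allocate(num_pages: u32) -> KvPages { KvPages(num_pages) } // stub
fn kv_free(_pages: KvPages) {}                                   // stub
fn forward(_pages: &KvPages, _tokens: &[u32]) -> Vec<f32> {
    vec![0.0; 8] // stub: pretend these are next-token logits
}

/// Pick the argmax token id from the logits (greedy decoding).
fn sample_greedy(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i as u32)
        .unwrap()
}

const EOS: u32 = 2; // assumed end-of-sequence token id

/// The inferlet owns the loop: it decides when to stop, how to manage
/// the KV cache, and could interleave I/O between decode steps.
fn generate(prompt: &[u32], max_new: usize) -> Vec<u32> {
    let pages = kv_allocate(4);
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = forward(&pages, &tokens);
        let next = sample_greedy(&logits);
        if next == EOS {
            break; // application-defined stopping rule
        }
        tokens.push(next);
    }
    kv_free(pages); // the application decides when cache memory is released
    tokens
}

fn main() {
    let out = generate(&[1, 15, 42], 16);
    println!("{:?}", out);
}
```

Under this framing, a custom KV-cache strategy or bespoke decoding scheme is just a different loop body compiled to WebAssembly, rather than a modification to the serving system itself.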
Authors (4)
In Gim
Zhiyao Ma
Seung-seob Lee
Lin Zhong
Submitted
October 28, 2025
Key Contributions
This paper introduces Pie, a programmable LLM serving system that decomposes the generation loop into fine-grained service handlers and executes user-defined programs (inferlets) via WebAssembly. This offers significant flexibility and efficiency improvements for emerging LLM applications.
Business Value
Enables faster, more cost-effective deployment and scaling of advanced LLM applications, supporting innovation in areas like AI agents and complex reasoning systems.