Similar Items:
- LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
- Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
- PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
- Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
- Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
- RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching