Similar Items:
- The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
- FATE: Future-State-Aware Scheduling for Heterogeneous LLM Workflows
- Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
- Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving