Similar Items:
- The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
- FATE: Future-State-Aware Scheduling for Heterogeneous LLM Workflows
- Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
- Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving