Similar Items:
- ResiHP: Taming LLM Training Failures with Dynamic Hybrid
- Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
- ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training
- AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
- MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
- Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
- LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling