Similar Items:
- Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
- Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
- MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
- Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
- A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
- CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training