Similar Items:
- ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
- Accelerating Compound LLM Training Workloads with Maestro
- ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training
- Lifting to tensors when compiling scientific computing workloads for AI Engines
- MERBIT: A GPU-Based SpMV Method for Iterative Workloads
- ResiHP: Taming LLM Training Failures with Dynamic Hybrid
- LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling