Similar Items: FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
- Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
- Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3
- MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
- MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
- HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
- Efficient Training on Multiple Consumer GPUs with RoundPipe