Similar Items: FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies
- Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
- Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
- VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU
- VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU
- KEET: Explaining Performance of GPU Kernels Using LLM Agents
- PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers