Channels - AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference :: FRELIP Discovery

Similar Items: AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

Stochastic Sparse Attention for Memory-Bound Inference
Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3
SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
The Distributed Complexity Landscape on Trees Depends on the Knowledge About the Network Size
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework
LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters
Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study