Mainstream Hacker News • 14 hours ago

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Researchers have introduced CODA, a novel GPU kernel abstraction designed to optimize Transformer model training by rewriting Transformer blocks as GEMM-plus-epilogue programs. This approach addresses a significant bottleneck in Transformer training systems, where non-arithmetic, memory-bound operations such as normalization, activations, and residual updates consume substantial time due to repeated data movement of large intermediate tensors. CODA leverages the insight that many Transformer computations can be algebraically restructured to execute while GEMM (General Matrix Multiply) output tiles remain on-chip, reducing costly global memory transfers.

CODA’s design fixes the GEMM mainloop and exposes a limited set of composable epilogue primitives for operations like scaling, reductions, pairwise transformations, and accumulation. This constrained interface maintains the high-performance characteristics of expert-tuned GEMM kernels while providing sufficient flexibility to cover nearly all non-attention computations in both the forward and backward passes of standard Transformer blocks. The framework demonstrates that combining GEMM with epilogue programming can unify hardware efficiency with framework-level productivity, a crucial advancement for accelerating large-scale Transformer training.

Performance evaluations across representative Transformer workloads show that CODA kernels, whether human- or large language model-authored, achieve competitive speeds, validating the practicality of this approach. By minimizing memory bandwidth bottlenecks and improving data locality, CODA offers a promising path to enhance the efficiency of deep learning training pipelines, which are increasingly constrained by data movement rather than raw computation.

This development is significant given the widespread use of Transformer architectures in natural language processing and other AI fields, where training costs and time remain major challenges. CODA’s method could influence future hardware-software co-design strategies and inspire new compiler and kernel optimizations tailored to the unique demands of modern deep learning workloads.
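
To make the GEMM-plus-epilogue idea concrete, here is a minimal pure-Python sketch. It is not CODA's actual API, which the summary does not show: the names `gemm_with_epilogue`, `scale_rows`, `pairwise`, `accumulate`, and `compose` are hypothetical, the "tiles" are ordinary Python loops rather than on-chip registers, and the per-row normalization factors are precomputed instead of being produced by a reduction primitive. The example relies on one algebraic fact: scaling the rows of the input commutes with the matrix multiply, so a gain-free RMSNorm can be applied to the GEMM output instead of the input.

```python
import math

def gemm_with_epilogue(A, B, epilogue, tile=2):
    """Tiled C = A @ B. The mainloop is fixed; `epilogue(i, j, acc)` transforms
    each accumulator before it is stored, so the raw product is never written out."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for i in range(i0, min(i0 + tile, M)):
                for j in range(j0, min(j0 + tile, N)):
                    acc = sum(A[i][k] * B[k][j] for k in range(K))  # mainloop
                    out[i][j] = epilogue(i, j, acc)                 # epilogue
    return out

# Hypothetical epilogue primitives; each maps (row, col, value) -> value.
def scale_rows(s):
    return lambda i, j, v: v * s[i]

def pairwise(f):
    return lambda i, j, v: f(v)

def accumulate(R):
    return lambda i, j, v: v + R[i][j]

def compose(*steps):
    def run(i, j, v):
        for step in steps:
            v = step(i, j, v)
        return v
    return run

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

X = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [2.0, 0.0, -1.0]]  # activations
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]                 # weights
R = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]                   # residual stream

# Per-row 1/rms(x), precomputed here for simplicity (no gain vector).
inv_rms = [1.0 / math.sqrt(sum(x * x for x in row) / len(row) + 1e-6) for row in X]

# Fused: R + gelu(rmsnorm(X) @ W), with the normalization moved into the epilogue.
fused = gemm_with_epilogue(
    X, W, compose(scale_rows(inv_rms), pairwise(gelu), accumulate(R))
)

# Unfused reference: every intermediate is materialized as a full matrix.
normed = [[x * s for x in row] for row, s in zip(X, inv_rms)]
H = [[sum(normed[i][k] * W[k][j] for k in range(3)) for j in range(2)] for i in range(3)]
reference = [[r + gelu(h) for h, r in zip(hrow, rrow)] for hrow, rrow in zip(H, R)]
```

The fused and unfused results agree up to floating-point rounding. The unfused path builds two extra full-size intermediates (`normed` and `H`), which is the kind of data movement the summary says CODA avoids by running such steps while the output tile is still on-chip.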

Original story by Hacker News

