+abstract="This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) operations in Julia for GPUs, which form the backbone of many other linear algebra algorithms. This work is based on an existing recursive implementation for TRMM and TRSM, which restructures the operations to include general matrix-matrix multiplication (GEMM) calls, facilitating better utilization of the GPU memory hierarchy, and reducing latency overhead. The unified implementation in Julia harnesses the language's multiple-dispatch and metaprogramming capabilities through the existing GPUArrays and KernelAbstractions frameworks, enabling performant hardware-agnostic execution across different GPU architectures. By supporting a consistent API, this implementation allows users to seamlessly switch between different GPU backends. The recursive hardware-agnostic implementation we present achieves performance comparable to vendor-optimized (cuBLAS/rocBLAS) libraries for larger matrix sizes and provides such methods for the first time to Apple Silicion hardware with only a few hundred lines of code, demonstrating the power of unified implementations.",