Talks - Paul Springer
- Design of a High-Performance GEMM-like Tensor-Tensor MultiplicationBLIS Retreat 2016.
September 2016.abstractwebPDFhideWe present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
- TTC: A Tensor Transposition Compiler for Multiple ArchitecturesARRAY ACM SIGPLAN 3rd International Workshop on Libraries, Languages and Compilers for Programming.
June 2016.abstractwebPDFhideWe consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system’s peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel’s Knights Corner as well as different CUDA-based GPUs such as NVIDIA’s Kepler and Maxwell architectures. We report speedups of TTC over a meaningful base- line implementation generated by external C++ compilers; the re- sults suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel’s latest C++ compiler on the Haswell and Knights Cor- ner architecture, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC’s support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.
- TTC: A Compiler for Tensor TranspositionsSIAM Conference on Parallel Processing for Scientific Computing..
Université Pierre et Marie Curie, Paris, 14 April 2016.abstractwebPDFhideWe present TTC, an open-source compiler for multidimensional tensor transpositions. Thanks to a range of optimizations (software prefetching, blocking, loop-reordering, explicit vectorization), TCC generates high-performance parallel C/C++ code. We use generic heuristics and auto-tuning to manage the huge search space. Performance results show that TTC achieves close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, yielding performance gains of up to 15× over modern compilers.
- Parallel Python - TutorialJanuary 2013.
- A Study of Productivity and Performance of Modern Vector Processors2012.
- Berkeley's Dwarfs on CUDA2011.