Talks - Paul Springer

A set of building blocks for tensor operations: transposition, summation, and contraction
Paolo Bientinesi and Paul Springer
SIAM Conference on Parallel Processing for Scientific Computing.
Waseda University, Tokyo, Japan, March 2018.
Tensors naturally appear in a variety of disciplines and applications, including computational chemistry, computational physics, mathematics, and even machine learning. While a range of high-performance software tools exist for computations involving one- and two-dimensional arrays, i.e. vectors and matrices, the availability of such tools for tensors is much more limited. Moreover, until recently the contrast between the efficiency attained by matrix and tensor operations was staggering. In this talk we give an overview of a set of high-performance kernels for three common tensor operations: transposition, summation, and contraction. Specifically, we present 1) TTC and HPTT, respectively a compiler and a library for the efficient transposition of tensors of arbitrary size, 2) a code generator for tensor summations, and 3) TCCG, a compiler for tensor contractions. In all cases, the tools exhibit significant speedups over state-of-the-art solutions. All tools are available for download and use.
HPTT: A High-Performance Tensor Transposition C++ Library
Paul Springer, Tong Su and Paolo Bientinesi
4th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.
Barcelona, June 2017.
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Although the performance of the generated code is nearly optimal, TTC works offline and therefore cannot be used in application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures, since only the hand-vectorized micro-kernel (e.g., a 4x4 transpose) has to be replaced. HPTT also offers an optional autotuning framework, guided by a performance model, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
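The idea of "multiple loops around a micro-kernel" can be illustrated with a minimal pure-Python sketch for the two-dimensional case. This is not HPTT's code or API: the function names `transpose_micro_kernel` and `blocked_transpose`, the flat row-major storage, and the default tile size of 4 are assumptions chosen for illustration, and a real micro-kernel would be hand-vectorized rather than written as scalar loops.

```python
def transpose_micro_kernel(src, dst, rows, cols, i0, j0, bi, bj):
    """Transpose one bi x bj tile of a row-major rows x cols matrix.

    Hypothetical stand-in for a hand-vectorized kernel (e.g., a 4x4 transpose).
    """
    for i in range(i0, i0 + bi):
        for j in range(j0, j0 + bj):
            # Element (i, j) of the source becomes element (j, i) of the result.
            dst[j * rows + i] = src[i * cols + j]


def blocked_transpose(src, rows, cols, block=4):
    """Return the transpose of a row-major rows x cols matrix (flat list)."""
    dst = [0] * (rows * cols)
    for i0 in range(0, rows, block):        # outer loop over tile rows
        bi = min(block, rows - i0)          # remainder tile at the bottom edge
        for j0 in range(0, cols, block):    # outer loop over tile columns
            bj = min(block, cols - j0)      # remainder tile at the right edge
            transpose_micro_kernel(src, dst, rows, cols, i0, j0, bi, bj)
    return dst


if __name__ == "__main__":
    # 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored row by row.
    print(blocked_transpose([1, 2, 3, 4, 5, 6], rows=2, cols=3))
```

In this sketch, porting to another architecture would only mean swapping `transpose_micro_kernel`; the surrounding blocking loops stay unchanged.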
The Landscape of High-Performance Tensor Contractions
Paul Springer and Paolo Bientinesi
Workshop on Batched, Reproducible, and Reduced Precision BLAS.
Atlanta, Georgia, February 2017.
Design of a High-Performance GEMM-like Tensor-Tensor Multiplication
Paul Springer and Paolo Bientinesi
- SIAM Conference on Computational Science and Engineering.
  Atlanta, February 2017.
- BLIS Retreat 2016.
  September 2016.
TTC: A Tensor Transposition Compiler for Multiple Architectures
Paul Springer and Paolo Bientinesi
3rd ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.
June 2016.
We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open-source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system’s peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel’s Knights Corner as well as different CUDA-based GPUs such as NVIDIA’s Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general-purpose counterpart significantly. For instance, comparing with Intel’s latest C++ compiler on the Haswell and Knights Corner architectures, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC’s support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.
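To make the operation concrete, the following is a minimal reference sketch of a tensor transposition of arbitrary dimension, including the case where the source is a sub-tensor of a larger array (i.e., it has leading dimensions). It is a plain, unoptimized loop nest, not code generated by TTC. The names `strides_for` and `permute`, the `outer_sizes` parameter, and the row-major (last index fastest) layout are assumptions made for this example.

```python
from itertools import product
from math import prod


def strides_for(sizes):
    """Strides of a row-major tensor (last index fastest) with these extents."""
    strides = [1] * len(sizes)
    for d in range(len(sizes) - 2, -1, -1):
        strides[d] = strides[d + 1] * sizes[d + 1]
    return strides


def permute(src, sizes, perm, outer_sizes=None):
    """Return B (flat list) whose d-th index is the perm[d]-th index of A.

    sizes       -- extents of the tensor A that is transposed
    perm        -- output dimension d takes its index from input dimension perm[d]
    outer_sizes -- hypothetical parameter: extents of the larger array in which
                   A is embedded (its leading dimensions); defaults to sizes
    """
    src_strides = strides_for(outer_sizes if outer_sizes else sizes)
    out_sizes = [sizes[p] for p in perm]
    out_strides = strides_for(out_sizes)
    dst = [0] * prod(sizes)
    for idx in product(*(range(s) for s in sizes)):
        src_off = sum(i * s for i, s in zip(idx, src_strides))
        dst_off = sum(idx[p] * s for p, s in zip(perm, out_strides))
        dst[dst_off] = src[src_off]
    return dst


if __name__ == "__main__":
    # 2 x 3 sub-matrix stored inside a 2 x 4 array (the 9s are padding).
    padded = [1, 2, 3, 9,
              4, 5, 6, 9]
    print(permute(padded, sizes=[2, 3], perm=[1, 0], outer_sizes=[2, 4]))
```

Reading from a strided sub-tensor and writing a contiguous result is the access pattern of a packing routine, which is why support for leading dimensions matters for BLAS 3 style kernels.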
TTC: A Compiler for Tensor Transpositions
Paul Springer and Paolo Bientinesi
SIAM Conference on Parallel Processing for Scientific Computing.
Université Pierre et Marie Curie, Paris, 14 April 2016.
We present TTC, an open-source compiler for multidimensional tensor transpositions. Thanks to a range of optimizations (software prefetching, blocking, loop-reordering, explicit vectorization), TTC generates high-performance parallel C/C++ code. We use generic heuristics and auto-tuning to manage the huge search space. Performance results show that TTC achieves close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, yielding performance gains of up to 15× over modern compilers.
Parallel Python - Tutorial
Paul Springer
January 2013.
This 30-minute tutorial introduces the reader to parallel Python and shows some parallel Python approaches in action (live demos are included in the .tar file).
A Study of Productivity and Performance of Modern Vector Processors
Paul Springer
2012.
Berkeley's Dwarfs on CUDA
Paul Springer
2011.