# Publications - Paul Springer

### Submitted Paper

**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), May 2017.abstractwebPDFBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.

### Journal Articles

**TTC: A high-performance Compiler for Tensor Transpositions**ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 15:1-15:21, August 2017.@article{Springer2017:910, author = "Paul Springer and {Jeff R.} Hammond and Paolo Bientinesi", title = "TTC: A high-performance Compiler for Tensor Transpositions", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, volume = 44, number = 2, pages = "15:1--15:21", month = aug, publisher = "ACM", url = "http://arxiv.org/pdf/1603.02297v1" }

abstractwebPDFbibtexWe present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.**Design of a high-performance GEMM-like Tensor-Tensor Multiplication**ACM Transactions on Mathematical Software (TOMS), 2017.

Accepted.@article{Springer2017:554, author = "Paul Springer and Paolo Bientinesi", title = "Design of a high-performance GEMM-like Tensor-Tensor Multiplication", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, note = "Accepted", url = "https://arxiv.org/abs/1607.00145" }

abstractPDFbibtexWe present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.**Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials**Journal of Chemical Physics, Volume 140, pp. 024105, January 2014.@article{Tameling2014:590, author = "Daniel Tameling and Paul Springer and Paolo Bientinesi and {Ahmed E.} Ismail", title = "Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials", journal = "Journal of Chemical Physics", year = 2014, volume = 140, pages = 24105, month = jan, url = "http://arxiv.org/abs/1308.4005" }

abstractwebPDFbibtexThe multilevel summation (MLS) method was developed to evaluate long-range interactions in molecular dynamics (MD) simulations. MLS was initially introduced for Coulombic potentials; we have extended this method to dispersion interactions. While formally short-ranged, for an accurate calculation of forces and energies in cases such as in interfacial systems, dispersion potentials require long-range methods. Since long-range solvers tend to dominate the time needed to perform MD calculations, increasing their performance is of vital importance. The MLS method offers some significant advantages when compared to mesh-based Ewald methods like the particle-particle particle-mesh and particle mesh Ewald methods. Unlike mesh-based Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared to the O(N log N) complexity of the mesh-based Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions. In addition, we have derived error bounds, similar to those obtained by Hardy for the electrostatic MLS. Using a prototype implementation, we can already demonstrate the linear scaling of the MLS method for dispersion, and present results establishing the accuracy and efficiency of the method.

### Peer Reviewed Conference Publications

**HPTT: A High-Performance Tensor Transposition C++ Library**Proceedings of the ARRAY 2017, June 2017.@inproceedings{Springer2017:558, author = "Paul Springer and Tong Su and Paolo Bientinesi", title = "HPTT: A High-Performance Tensor Transposition C++ Library", year = 2017, month = jun, url = "http://arxiv.org/abs/1704.04374" }

abstractwebPDFbibtexRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**TTC: A Tensor Transposition Compiler for Multiple Architectures**Proceedings of the 3rd International Workshop on Libraries, Languages and Compilers for Programming (ARRAY 2016), June 2016.@inproceedings{Springer2016:940, author = "Paul Springer and Aravind Sankaran and Paolo Bientinesi", title = "TTC: A Tensor Transposition Compiler for Multiple Architectures", year = 2016, month = jun, url = "https://arxiv.org/pdf/1607.01249.pdf" }

abstractPDFbibtexWe consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to 8x and 32x, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.**A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics**High Performance Computing, Lecture Notes in Computer Science, Volume 9137, pp. 155-170, Springer International Publishing, July 2015.@inproceedings{Springer2015:450, author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi", title = "A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics", booktitle = "High Performance Computing ", year = 2015, volume = 9137, series = "Lecture Notes in Computer Science", pages = "155-170", month = jul, publisher = "Springer International Publishing", url = "https://arxiv.org/pdf/1701.05242.pdf" }

abstractwebPDFbibtexRecent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial sys- tems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is cho- sen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algo- rithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and mas- sively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458, 752 cores), and show nearly perfect strong and weak scaling.**Packet-Oriented Streamline Tracing on Modern SIMD Architectures**2015.@inproceedings{Hentschel2015:388, author = "Bernd Hentschel and {Jens Henrik} Göbbert and Michael Klemm and Paul Springer and Andrea Schnorr and {Torsten W.} Kuhlen", title = "Packet-Oriented Streamline Tracing on Modern SIMD Architectures", year = 2015, journal = "Eurographics Symposium on Parallel Graphics and Visualization", url = "https://diglib.eg.org/handle/10.2312/pgv.20151154.043-052" }

abstractPDFbibtexThe advection of integral lines is an important computational kernel in vector field visualization. We investigate how this kernel can profit from vector (SIMD) extensions in modern CPUs. As a baseline, we formulate a streamline tracing algorithm that facilitates auto-vectorization by an optimizing compiler. We analyze this algorithm and propose two different optimizations. Our results show that particle tracing does not per se benefit from SIMD computation. Based on a careful analysis of the auto-vectorized code, we propose an optimized data access routine and a re-packing scheme which increases average SIMD efficiency. We evaluate our approach on three different, turbulent flow fields. Our optimized approaches increase integration performance up to 5.6x over our baseline measurement. We conclude with a discussion of current limitations and aspects for future work.**OpenACC - First Experiences with Real-World Applications**Euro-Par 2012 Parallel Processing, August 2012.@inproceedings{Wienke2012:54, author = "Sandra Wienke and Paul Springer and Christian Terboven and Dieter {An Mey}", title = "OpenACC - First Experiences with Real-World Applications", booktitle = "Euro-Par 2012 Parallel Processing", year = 2012, address = "Rhodes Island, Greece", month = aug, url = "http://hpac.rwth-aachen.de/people/springer/OpenACC_first_experiences.pdf" }

abstractPDFbibtexToday's trend to use accelerators like GPGPUs in heterogeneous computer systems has entailed several low-level APIs for accelerator programming. However, programming these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level pro- gramming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives to offload loops and re- gions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to program accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.

### Thesis

**A scalable, linear-time dynamic cutoff algorithm for molecular simu- lations of interfacial systems**RWTH Aachen University, 2013.@mastersthesis{Springer2013:970, author = "Paul Springer", title = "A scalable, linear-time dynamic cutoff algorithm for molecular simu- lations of interfacial systems", school = "RWTH Aachen University", year = 2013, institution = "RWTH Aachen University", url = "http://arxiv.org/pdf/1502.03234v1" }

abstractPDFbibtexThis master thesis introduces the idea of dynamic cutoffs in molecular dynamics simulations, based on the distance between particles and the interface, and presents a solution for detecting interfaces in real-time. Our dynamic cutoff method (DCM) exhibits a linear-time complexity as well as nearly ideal weak and strong scaling. The DCM is tailored for massively parallel architectures and for large interfacial systems with millions of particles. We implemented the DCM as part of the LAMMPS open-source molecular dynamics package and demonstrate the nearly ideal weak- and strong-scaling behavior of this method on an IBM BlueGene/Q supercomputer. Our results for a liquid/vapor system consisting of Lennard-Jones particles show that the accuracy of DCM is comparable to that of the traditional particle-particle particle- mesh (PPPM) algorithm. The performance comparison indicates that DCM is preferable for large systems due to the limited scaling of FFTs within the PPPM algorithm. Moreover, the DCM requires the interface to be identified every other MD timestep. As a consequence, this thesis also presents an interface detection method which is (1) applicable in real time; (2) parallelizable; and (3) scales linearly with respect to the number of particles.

### Technical Reports

**OpenACC - A Step Towards Heterogeneous Computing**RWTH Aachen University, 2013.

Seminar project.@techreport{Springer2013:808, author = "Paul Springer", title = "OpenACC - A Step Towards Heterogeneous Computing", institution = "RWTH Aachen University", year = 2013, note = "Seminar project", url = "http://hpac.rwth-aachen.de/people/springer/openacc_seminar.pdf" }

abstractPDFbibtexWith the fast growing number of heterogeneous supercomputers, consisting of massively parallel coprocessors attached to multi-core processors, it becomes increasingly important to program these heterogeneous systems in a productive manner. Programming these coprocessors through low-level APIs such as CUDA or OpenCL is often a tedious task and may result in poor productivity. OpenACC tries to overcome this drawback by allowing programmers to annotate C/C++ or Fortran code with directives which are then translated into accelerator-specific code by the compiler. This paper summarizes OpenACC’s features, its limitations and possible future directions. Moreover, I will present two case studies where I evaluate OpenACC’s performance and productivity in comparison to CUDA and OpenMP.**A Study of Productivity and Performance of Modern Vector Processors**RWTH Aachen University, March 2012.

Bachelor Thesis.@techreport{Springer2012:818, author = "Paul Springer", title = "A Study of Productivity and Performance of Modern Vector Processors", institution = "RWTH Aachen University", year = 2012, month = mar, note = "Bachelor Thesis", url = "http://hpac.rwth-aachen.de/people/springer/bachelor_thesis.pdf" }

abstractPDFbibtexThis bachelor thesis carries out a case study describing the performance and productivity of modern vector processors such as graphics processing units (GPUs) and central processing units (CPUs) based on three different computational routines arising from a magnetoencephalography application. I apply different programming paradigms to these routines targeting either the CPU or the GPU. Furthermore, I investigate the performance and productivity of programming paradigms such as OpenMP with respect to its auto-vectorization capabilities, Intel intrinsic AVX and Intel OpenCL for the CPU. Moreover, I examine NVIDIA's CUDA and OpenCL APIs for GPU-sided applications. The results of the performed case study yield roughly the same performances for the CPU and GPU implementations, but favour the OpenMP paradigm (i.e. the CPU) with respect to productivity.**Berkeley's Dwarfs on CUDA**RWTH Aachen University, 2011.

Seminar Project.