# Recent Publications

### Submitted Paper

**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), May 2017.abstractwebPDFBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.

### Journal Articles

**Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection**ACM Trans. Math. Softw., Volume 44(2), pp. 16:1-16:19, September 2017.@article{Peise2017:728, author = "Elmar Peise and Paolo Bientinesi", title = "Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection", journal = "ACM Trans. Math. Softw.", year = 2017, volume = 44, number = 2, pages = "16:1--16:19", month = sep, publisher = "ACM", address = "New York, NY, USA", url = "http://arxiv.org/pdf/1602.06763v1" }

abstractwebPDFbibtexTo exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.**TTC: A high-performance Compiler for Tensor Transpositions**ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 15:1-15:21, August 2017.@article{Springer2017:910, author = "Paul Springer and {Jeff R.} Hammond and Paolo Bientinesi", title = "TTC: A high-performance Compiler for Tensor Transpositions", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, volume = 44, number = 2, pages = "15:1--15:21", month = aug, publisher = "ACM", url = "http://arxiv.org/pdf/1603.02297v1" }

abstractwebPDFbibtexWe present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.**The ELAPS Framework: Experimental Linear Algebra Performance Studies**International Journal of High Performance Computing, August 2017.

Accepted.@article{Peise2017:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", journal = "International Journal of High Performance Computing", year = 2017, month = aug, note = "Accepted", url = "http://arxiv.org/pdf/1504.08035v1" }

abstractPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.

### Peer Reviewed Conference Publications

**Efficient Pattern Matching in Python**Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing, November 2017.

In conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis.@inproceedings{Krebber2017:404, author = "{Manuel } Krebber and Henrik Barthels and Paolo Bientinesi", title = "Efficient Pattern Matching in Python", booktitle = "Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing", year = 2017, month = nov, note = "In conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis", url = "https://arxiv.org/pdf/1710.00077.pdf" }

abstractwebPDFbibtexPattern matching is a powerful tool for symbolic computations. Applications include term rewriting systems, as well as the manipulation of symbolic expressions, abstract syntax trees, and XML and JSON data. It also allows for an intuitive description of algorithms in the form of rewrite rules. We present the open source Python module MatchPy, which offers functionality and expressiveness similar to the pattern matching in Mathematica. In particular, it includes syntactic pattern matching, as well as matching for commutative and/or associative functions, sequence variables, and matching with constraints. MatchPy uses new and improved algorithms to efficiently find matches for large pattern sets by exploiting similarities between patterns. The performance of MatchPy is investigated on several real-world problems.**Linnea: Compiling Linear Algebra Expressions to High-Performance Code**Proceedings of the 8th International Workshop on Parallel Symbolic Computation, July 2017.@inproceedings{Barthels2017:254, author = "Henrik Barthels and Paolo Bientinesi", title = "Linnea: Compiling Linear Algebra Expressions to High-Performance Code", booktitle = "Proceedings of the 8th International Workshop on Parallel Symbolic Computation", year = 2017, month = jul, url = "http://hpac.rwth-aachen.de/~barthels/publications/PASCO_2017.pdf" }

abstractwebPDFbibtexLinear algebra expressions appear in fields as diverse as computational biology, signal processing, communication technology, finite element methods, and control theory. Libraries such as BLAS and LAPACK provide highly optimized building blocks for just about any linear algebra computation; thus, a linear algebra expression can be evaluated efficiently by breaking it down into those building blocks. However, this is a challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this paper we give an overview of existing solutions, and introduce Linnea, a compiler that solves this problem. As shown through a set of test cases, Linnea’s results are comparable with those obtained by human experts.**MatchPy: A Pattern Matching Library**Proceedings of the 15th Python in Science Conference, July 2017.@inproceedings{Krebber2017:550, author = "{Manuel } Krebber and Henrik Barthels and Paolo Bientinesi", title = "MatchPy: A Pattern Matching Library", booktitle = "Proceedings of the 15th Python in Science Conference", year = 2017, month = jul, url = "http://conference.scipy.org/proceedings/scipy2017/pdfs/manuel_krebber.pdf" }

abstractwebPDFbibtexPattern matching is a powerful tool for symbolic computations, based on the well-defined theory of term rewriting systems. Application domains include algebraic expressions, abstract syntax trees, and XML and JSON data. Unfortunately, no lightweight implementation of pattern matching as general and flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt. Therefore, we created the open source module MatchPy which offers similar pattern matching functionality in Python using a novel algorithm which finds matches for large pattern sets more efficiently by exploiting similarities between patterns.**HPTT: A High-Performance Tensor Transposition C++ Library**Proceedings of the ARRAY 2017, June 2017.@inproceedings{Springer2017:558, author = "Paul Springer and Tong Su and Paolo Bientinesi", title = "HPTT: A High-Performance Tensor Transposition C++ Library", year = 2017, month = jun, url = "http://arxiv.org/abs/1704.04374" }

abstractwebPDFbibtexRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**Proceedings of the 32nd International Conference, ISC High Performance 2017, Volume 10266, pp. 61-78, Springer, June 2017.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", year = 2017, volume = 10266, pages = "61--78", address = "Frankfurt", month = jun, publisher = "Springer", url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.

### Thesis

**Performance Modeling and Prediction for Dense Linear Algebra**HPAC, RWTH Aachen, May 2017.

1st version (under evaluation).@phdthesis{Peise2017:620, author = "Elmar Peise", title = "Performance Modeling and Prediction for Dense Linear Algebra", school = "HPAC, RWTH Aachen", year = 2017, month = may, note = "1st version (under evaluation)", url = "https://arxiv.org/pdf/1706.01341" }

abstractwebPDFbibtexThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.