# Recent Publications

### Submitted Papers

**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), May 2017.abstractwebPDFBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.**Accelerating scientific codes by performance and accuracy modeling**Journal of Computational Science, May 2017.abstractPDFScientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particle-particle particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the time-to-solution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of core-hours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.**Perfect spike detection via time reversal**2017.

Submitted to Frontiers in Neuroscience.abstractwebPDFSpiking neuronal networks are usually simulated with three main simulation schemes: the classical time-driven and event-driven schemes, and the more recent hybrid scheme. All three schemes evolve the state of a neuron through a series of checkpoints: equally spaced in the first scheme and determined neuron-wise by spike events in the latter two. The time-driven and the hybrid scheme determine whether the membrane potential of a neuron crosses a threshold at the end of of the time interval between consecutive checkpoints. Threshold crossing can, however, occur within the interval even if this test is negative. Spikes can therefore be missed. The present work derives, implements, and benchmarks a method for perfect retrospective spike detection. This method can be applied to neuron models with affine or linear subthreshold dynamics. The idea behind the method is to propagate the threshold with a time-inverted dynamics, testing whether the threshold crosses the neuron state to be evolved, rather than vice versa. Algebraically this translates into a set of inequalities necessary and sufficient for threshold crossing. This test is slower than the imperfect one, but faster than an alternative perfect tests based on bisection or root-finding methods. Comparison confirms earlier results that the imperfect test rarely misses spikes (less than a fraction 1/108 of missed spikes) in biologically relevant settings. This study offers an alternative geometric point of view on neuronal dynamics.

### Journal Articles

**High-performance functional renormalization group calculations for interacting fermions**Computer Physics Communications, Volume 213, pp. 100-110, April 2017.@article{Lichtenstein2017:788, author = "Julian Lichtenstein and David {Sanchez de la Pena} and Daniel Rohe and Edoardo {Di Napoli} and Carsten Honerkamp and {Stefan A.} Maier", title = "High-performance functional renormalization group calculations for interacting fermions", journal = "Computer Physics Communications", year = 2017, volume = 213, pages = "100-110", month = apr, url = "https://arxiv.org/pdf/1604.06296v2.pdf" }

abstractwebPDFbibtexWe derive a novel computational scheme for functional Renormalization Group (fRG) calculations for interacting fermions on 2D lattices. The scheme is based on the exchange parametrization fRG for the two-fermion interaction, with additional insertions of truncated partitions of unity. These insertions decouple the fermionic propagators from the exchange propagators and lead to a separation of the underlying equations. We demonstrate that this separation is numerically advantageous and may pave the way for refined, large-scale computational investigations even in the case of complex multiband systems. Furthermore, on the basis of speedup data gained from our implementation, it is shown that this new variant facilitates efficient calculations on a large number of multi-core CPUs. We apply the scheme to the t,t′ Hubbard model on a square lattice to analyze the convergence of the results with the bond length of the truncation of the partition of unity. In most parameter areas, a fast convergence can be observed. Finally, we compare to previous results in order to relate our approach to other fRG studies.**TTC: A high-performance Compiler for Tensor Transpositions**ACM Transactions on Mathematical Software (TOMS), April 2017.

Accepted.@article{Springer2017:910, author = "Paul Springer and {Jeff R.} Hammond and Paolo Bientinesi", title = "TTC: A high-performance Compiler for Tensor Transpositions", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, month = apr, note = "Accepted", url = "http://arxiv.org/pdf/1603.02297v1" }

abstractPDFbibtexWe present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.**Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection**ACM Transactions on Mathematical Software (TOMS), March 2017.

Accepted.@article{Peise2017:728, author = "Elmar Peise and Paolo Bientinesi", title = "Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, month = mar, note = "Accepted", institution = "AICES, RWTH Aachen University", url = "http://arxiv.org/pdf/1602.06763v1" }

abstractwebPDFbibtexTo exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**Computer Physics Communications, Volume 211, pp. 61 - 72, February 2017.

High Performance Computing for Advanced Modeling and Simulation of Materials.@article{Di_Napoli2017:318, author = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi", title = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods", journal = "Computer Physics Communications", year = 2017, volume = 211, pages = "61 - 72", month = feb, note = "High Performance Computing for Advanced Modeling and Simulation of Materials", url = "http://arxiv.org/pdf/1602.06589v2" }

abstractwebPDFbibtexOne of the greatest effort of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffers from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.

### Peer Reviewed Conference Publications

**HPTT: A High-Performance Tensor Transposition C++ Library**Proceedings of the ARRAY 2017, June 2017.

Accepted.@inproceedings{Springer2017:558, author = "Paul Springer and Tong Su and Paolo Bientinesi", title = "HPTT: A High-Performance Tensor Transposition C++ Library", year = 2017, month = jun, note = "Accepted", url = "http://arxiv.org/abs/1704.04374" }

abstractwebPDFbibtexRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**Proceedings of the ISC HPC 2017, June 2017.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", year = 2017, month = jun, url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.

### Thesis

**Performance Modeling and Prediction for Dense Linear Algebra**HPAC, RWTH Aachen, May 2017.

1st version (under evaluation).@phdthesis{Peise2017:620, author = "Elmar Peise", title = "Performance Modeling and Prediction for Dense Linear Algebra", school = "HPAC, RWTH Aachen", year = 2017, month = may, note = "1st version (under evaluation)", url = "https://arxiv.org/pdf/1706.01341" }

abstractwebPDFbibtexThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.