# Recent Publications

### Submitted Papers

**HPTT: A High-Performance Tensor Transposition C++ Library**Submitted to the ARRAY 2017, April 2017.abstractwebPDFRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**Perfect spike detection via time reversal**2017.

Submitted to Frontiers in Neuroscience.abstractwebPDFSpiking neuronal networks are usually simulated with three main simulation schemes: the classical time-driven and event-driven schemes, and the more recent hybrid scheme. All three schemes evolve the state of a neuron through a series of checkpoints: equally spaced in the first scheme and determined neuron-wise by spike events in the latter two. The time-driven and the hybrid scheme determine whether the membrane potential of a neuron crosses a threshold at the end of of the time interval between consecutive checkpoints. Threshold crossing can, however, occur within the interval even if this test is negative. Spikes can therefore be missed. The present work derives, implements, and benchmarks a method for perfect retrospective spike detection. This method can be applied to neuron models with affine or linear subthreshold dynamics. The idea behind the method is to propagate the threshold with a time-inverted dynamics, testing whether the threshold crosses the neuron state to be evolved, rather than vice versa. Algebraically this translates into a set of inequalities necessary and sufficient for threshold crossing. This test is slower than the imperfect one, but faster than an alternative perfect tests based on bisection or root-finding methods. Comparison confirms earlier results that the imperfect test rarely misses spikes (less than a fraction 1/108 of missed spikes) in biologically relevant settings. This study offers an alternative geometric point of view on neuronal dynamics.**Non-linear Least-Squares optimization of rational filters for the solution of interior eigenvalue problems**2017.

Submitted to SIAM SIMAX.abstractwebPDFRational filter functions can be used to improve convergence of contour-based eigensolvers, a popular family of algorithms for the solution of the interior eigenvalue problem. We present a framework for the optimization of rational filters based on a non-convex weighted Least-Squares scheme. When used in combination with the FEAST library, our filters out-perform existing ones on a large and representative set of benchmark problems. This work provides a detailed description of: (1) a set up of the optimization process that exploits symmetries of the filter function for Hermitian eigenproblems, (2) a formulation of the gradient descent and Levenberg-Marquardt algorithms that exploits the symmetries, (3) a method to select the starting position for the optimization algorithms that reliably produces effective filters, (4) a constrained optimization scheme that produces filter functions with specific properties that may be beneficial to the performance of the eigensolver that employs them.

### Journal Articles

**High-performance functional renormalization group calculations for interacting fermions**Computer Physics Communications, Volume 213, pp. 100-110, April 2017.@article{Lichtenstein2017:788, author = "Julian Lichtenstein and David {Sanchez de la Pena} and Daniel Rohe and Edoardo {Di Napoli} and Carsten Honerkamp and {Stefan A.} Maier", title = "High-performance functional renormalization group calculations for interacting fermions", journal = "Computer Physics Communications", year = 2017, volume = 213, pages = "100-110", month = apr, url = "https://arxiv.org/pdf/1604.06296v2.pdf" }

abstractwebPDFbibtexWe derive a novel computational scheme for functional Renormalization Group (fRG) calculations for interacting fermions on 2D lattices. The scheme is based on the exchange parametrization fRG for the two-fermion interaction, with additional insertions of truncated partitions of unity. These insertions decouple the fermionic propagators from the exchange propagators and lead to a separation of the underlying equations. We demonstrate that this separation is numerically advantageous and may pave the way for refined, large-scale computational investigations even in the case of complex multiband systems. Furthermore, on the basis of speedup data gained from our implementation, it is shown that this new variant facilitates efficient calculations on a large number of multi-core CPUs. We apply the scheme to the t,t′ Hubbard model on a square lattice to analyze the convergence of the results with the bond length of the truncation of the partition of unity. In most parameter areas, a fast convergence can be observed. Finally, we compare to previous results in order to relate our approach to other fRG studies.**Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection**ACM Transactions on Mathematical Software (TOMS), March 2017.

Accepted.@article{Peise2017:728, author = "Elmar Peise and Paolo Bientinesi", title = "Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, month = mar, note = "Accepted", institution = "AICES, RWTH Aachen University", url = "http://arxiv.org/pdf/1602.06763v1" }

abstractwebPDFbibtexTo exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**Computer Physics Communications, Volume 211, pp. 61 - 72, February 2017.

High Performance Computing for Advanced Modeling and Simulation of Materials.@article{Di_Napoli2017:318, author = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi", title = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods", journal = "Computer Physics Communications", year = 2017, volume = 211, pages = "61 - 72", month = feb, note = "High Performance Computing for Advanced Modeling and Simulation of Materials", url = "http://arxiv.org/pdf/1602.06589v2" }

abstractwebPDFbibtexOne of the greatest effort of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffers from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.

### Peer Reviewed Conference Publications

**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**Proceedings of the ISC HPC 2017, June 2017.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", year = 2017, month = jun, url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.**Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods**Proceedings of the JARA-HPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200-211, Springer, 2017.@inproceedings{Fabregat-Traver2017:4, author = "Diego Fabregat-Traver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}", title = " Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods", year = 2017, volume = 10164, series = "Lecture Notes in Computer Science", pages = "200--211", publisher = "Springer", url = "https://arxiv.org/pdf/1611.00606v1" }

abstractwebPDFbibtexIn this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum J\"ulich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study in two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16-cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.**Parallel adaptive integration in high-performance functional Renormalization Group computations**Jülich Aachen Research Alliance High-Performance Computing Symposium 2016, Lecture Notes in Computer Science, Springer-Verlag, 2017.@inproceedings{Lichtenstein2017:360, author = "Julian Lichtenstein and Jan Winkelmann and David {Sanchez de la Pena} and Toni Vidovic and Edoardo {Di Napoli}", title = "Parallel adaptive integration in high-performance functional Renormalization Group computations", booktitle = "Jülich Aachen Research Alliance High-Performance Computing Symposium 2016", year = 2017, editor = "E. Di Napoli et. al.", series = "Lecture Notes in Computer Science", publisher = "Springer-Verlag", url = "https://arxiv.org/pdf/1610.09991v1.pdf" }

abstractwebPDFbibtexThe conceptual framework provided by the functional Renormalization Group (fRG) has become a formidable tool to study correlated electron systems on lattices which, in turn, provided great insights to our understanding of complex many-body phenomena, such as high- temperature superconductivity or topological states of matter. In this work we present one of the latest realizations of fRG which makes use of an adaptive numerical quadrature scheme specifically tailored to the described fRG scheme. The final result is an increase in performance thanks to improved parallelism and scalability.

### Thesis

**Performance Modeling and Prediction for Dense Linear Algebra**HPAC, RWTH Aachen, May 2017.

1st version (under evaluation).@phdthesis{Peise2017:620, author = "Elmar Peise", title = "Performance Modeling and Prediction for Dense Linear Algebra", school = "HPAC, RWTH Aachen", year = 2017, month = may, note = "1st version (under evaluation)", url = "https://arxiv.org/pdf/1706.01341" }

abstractwebPDFbibtexThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.