# Recent Publications

### Submitted Papers

**HPTT: A High-Performance Tensor Transposition C++ Library**Submitted to the ARRAY 2017, April 2017.abstractwebPDFRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**Non-linear Least-Squares optimization of rational filters for the solution of interior eigenvalue problems**2017.

Submitted to SIAM SISC.abstractwebPDFRational filter functions can be used to improve convergence of contour-based eigensolvers, a popular family of algorithms for the solution of the interior eigenvalue problem. We present a framework for the optimization of rational filters based on a non-convex weighted Least-Squares scheme. When used in combination with the FEAST library, our filters out-perform existing ones on a large and representative set of benchmark problems. This work provides a detailed description of: (1) a set up of the optimization process that exploits symmetries of the filter function for Hermitian eigenproblems, (2) a formulation of the gradient descent and Levenberg-Marquardt algorithms that exploits the symmetries, (3) a method to select the starting position for the optimization algorithms that reliably produces effective filters, (4) a constrained optimization scheme that produces filter functions with specific properties that may be beneficial to the performance of the eigensolver that employs them.

### Journal Articles

**High-performance functional renormalization group calculations for interacting fermions**Computer Physics Communications, Volume 213, pp. 100-110, April 2017.@article{Lichtenstein2017:788, author = "Julian Lichtenstein and David {Sanchez de la Pena} and Daniel Rohe and Edoardo {Di Napoli} and Carsten Honerkamp and {Stefan A.} Maier", title = "High-performance functional renormalization group calculations for interacting fermions", journal = "Computer Physics Communications", year = 2017, volume = 213, pages = "100-110", month = apr, url = "https://arxiv.org/pdf/1604.06296v2.pdf" }

abstractwebPDFbibtexWe derive a novel computational scheme for functional Renormalization Group (fRG) calculations for interacting fermions on 2D lattices. The scheme is based on the exchange parametrization fRG for the two-fermion interaction, with additional insertions of truncated partitions of unity. These insertions decouple the fermionic propagators from the exchange propagators and lead to a separation of the underlying equations. We demonstrate that this separation is numerically advantageous and may pave the way for refined, large-scale computational investigations even in the case of complex multiband systems. Furthermore, on the basis of speedup data gained from our implementation, it is shown that this new variant facilitates efficient calculations on a large number of multi-core CPUs. We apply the scheme to the t,t′ Hubbard model on a square lattice to analyze the convergence of the results with the bond length of the truncation of the partition of unity. In most parameter areas, a fast convergence can be observed. Finally, we compare to previous results in order to relate our approach to other fRG studies.**Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection**ACM Transactions on Mathematical Software (TOMS), March 2017.

Accepted.@article{Peise2017:728, author = "Elmar Peise and Paolo Bientinesi", title = "Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, month = mar, note = "Accepted", institution = "AICES, RWTH Aachen University", url = "http://arxiv.org/pdf/1602.06763v1" }

abstractwebPDFbibtexTo exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**Computer Physics Communications, Volume 211, pp. 61 - 72, February 2017.

High Performance Computing for Advanced Modeling and Simulation of Materials.@article{Di_Napoli2017:318, author = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi", title = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods", journal = "Computer Physics Communications", year = 2017, volume = 211, pages = "61 - 72", month = feb, note = "High Performance Computing for Advanced Modeling and Simulation of Materials", url = "http://arxiv.org/pdf/1602.06589v2" }

abstractwebPDFbibtexOne of the greatest effort of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffers from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.

### Peer Reviewed Conference Publications

**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**Proceedings of the ISC HPC 2017, June 2017.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", year = 2017, month = jun, url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.**Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods**Proceedings of the JARA-HPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200-211, Springer, 2017.@inproceedings{Fabregat-Traver2017:4, author = "Diego Fabregat-Traver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}", title = " Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods", year = 2017, volume = 10164, series = "Lecture Notes in Computer Science", pages = "200--211", publisher = "Springer", url = "https://arxiv.org/pdf/1611.00606v1" }

abstractwebPDFbibtexIn this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum J\"ulich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study in two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16-cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.**Parallel adaptive integration in high-performance functional Renormalization Group computations**Jülich Aachen Research Alliance High-Performance Computing Symposium 2016, Lecture Notes in Computer Science, Springer-Verlag, 2017.@inproceedings{Lichtenstein2017:360, author = "Julian Lichtenstein and Jan Winkelmann and David {Sanchez de la Pena} and Toni Vidovic and Edoardo {Di Napoli}", title = "Parallel adaptive integration in high-performance functional Renormalization Group computations", booktitle = "Jülich Aachen Research Alliance High-Performance Computing Symposium 2016", year = 2017, editor = "E. Di Napoli et. al.", series = "Lecture Notes in Computer Science", publisher = "Springer-Verlag", url = "https://arxiv.org/pdf/1610.09991v1.pdf" }

abstractwebPDFbibtexThe conceptual framework provided by the functional Renormalization Group (fRG) has become a formidable tool to study correlated electron systems on lattices which, in turn, provided great insights to our understanding of complex many-body phenomena, such as high- temperature superconductivity or topological states of matter. In this work we present one of the latest realizations of fRG which makes use of an adaptive numerical quadrature scheme specifically tailored to the described fRG scheme. The final result is an increase in performance thanks to improved parallelism and scalability.

### Other

**A Compiler for Linear Algebra Operations**Poster, 2 November 2016.

ACM Student Research Competition at SPLASH 2016.

### Technical Report

**The ELAPS Framework: Experimental Linear Algebra Performance Studies**November 2016.@techreport{Peise2016:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", year = 2016, month = nov, url = "http://arxiv.org/pdf/1504.08035v1" }

abstractwebPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.