# Recent Publications

### Submitted Paper

**Large Scale Parallel Computations in R through Elemental**Journal of Statistical Software, October 2016.abstractwebPDFEven though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large scale statistical computations, we introduce RElem, an open source package that integrates the distributed dense linear algebra library Elemental into R. While on the one hand, RElem provides direct wrappers of Elemental's routines, on the other hand, it overloads various operators and functions to provide an entirely native R experience for distributed computations. We showcase how simple it is to port existing R programs to Relem and demonstrate that Relem indeed allows to scale beyond the single-node limitation of R with the full performance of Elemental without any overhead.

### Journal Articles

**High-performance functional renormalization group calculations for interacting fermions**Computer Physics Communications, Volume 213, pp. 100-110, April 2017.@article{Lichtenstein2017:788, author = "Julian Lichtenstein and David {Sanchez de la Pena} and Daniel Rohe and Edoardo {Di Napoli} and Carsten Honerkamp and {Stefan A.} Maier", title = "High-performance functional renormalization group calculations for interacting fermions", journal = "Computer Physics Communications", year = 2017, volume = 213, pages = "100-110", month = apr, url = "https://arxiv.org/pdf/1604.06296v2.pdf" }

abstractwebPDFbibtexWe derive a novel computational scheme for functional Renormalization Group (fRG) calculations for interacting fermions on 2D lattices. The scheme is based on the exchange parametrization fRG for the two-fermion interaction, with additional insertions of truncated partitions of unity. These insertions decouple the fermionic propagators from the exchange propagators and lead to a separation of the underlying equations. We demonstrate that this separation is numerically advantageous and may pave the way for refined, large-scale computational investigations even in the case of complex multiband systems. Furthermore, on the basis of speedup data gained from our implementation, it is shown that this new variant facilitates efficient calculations on a large number of multi-core CPUs. We apply the scheme to the t,t′ Hubbard model on a square lattice to analyze the convergence of the results with the bond length of the truncation of the partition of unity. In most parameter areas, a fast convergence can be observed. Finally, we compare to previous results in order to relate our approach to other fRG studies.**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**Computer Physics Communications, Volume 211, pp. 61 - 72, February 2017.

High Performance Computing for Advanced Modeling and Simulation of Materials.@article{Di_Napoli2017:318, author = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi", title = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods", journal = "Computer Physics Communications", year = 2017, volume = 211, pages = "61 - 72", month = feb, note = "High Performance Computing for Advanced Modeling and Simulation of Materials", url = "http://arxiv.org/pdf/1602.06589v2" }

abstractwebPDFbibtexOne of the greatest effort of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffers from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.

### Peer Reviewed Conference Publications

**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**Proceedings of the ISC HPC 2017, June 2017.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", year = 2017, month = jun, url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.**Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods**Proceedings of the JARA-HPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200-211, Springer, 2017.@inproceedings{Fabregat-Traver2017:4, author = "Diego Fabregat-Traver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}", title = " Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods", year = 2017, volume = 10164, series = "Lecture Notes in Computer Science", pages = "200--211", publisher = "Springer", url = "https://arxiv.org/pdf/1611.00606v1" }

abstractwebPDFbibtexIn this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum J\"ulich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study in two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16-cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.**Parallel adaptive integration in high-performance functional Renormalization Group computations**Jülich Aachen Research Alliance High-Performance Computing Symposium 2016, Lecture Notes in Computer Science, Springer-Verlag, 2017.@inproceedings{Lichtenstein2017:360, author = "Julian Lichtenstein and Jan Winkelmann and David {Sanchez de la Pena} and Toni Vidovic and Edoardo {Di Napoli}", title = "Parallel adaptive integration in high-performance functional Renormalization Group computations", booktitle = "Jülich Aachen Research Alliance High-Performance Computing Symposium 2016", year = 2017, editor = "E. Di Napoli et. al.", series = "Lecture Notes in Computer Science", publisher = "Springer-Verlag", url = "https://arxiv.org/pdf/1610.09991v1.pdf" }

abstractwebPDFbibtexThe conceptual framework provided by the functional Renormalization Group (fRG) has become a formidable tool to study correlated electron systems on lattices which, in turn, provided great insights to our understanding of complex many-body phenomena, such as high- temperature superconductivity or topological states of matter. In this work we present one of the latest realizations of fRG which makes use of an adaptive numerical quadrature scheme specifically tailored to the described fRG scheme. The final result is an increase in performance thanks to improved parallelism and scalability.**A Compiler for Linear Algebra Operations**SPLASH '16 Companion, ACM, October 2016.

Student Research Competition.@inproceedings{Barthels2016:280, author = "Henrik Barthels", title = "A Compiler for Linear Algebra Operations", booktitle = "SPLASH '16 Companion", year = 2016, month = oct, publisher = "ACM", note = "Student Research Competition", url = "http://hpac.rwth-aachen.de/~barthels/publications/ACM_SRC_2016_abstract.pdf" }

abstractwebPDFbibtexIn this paper we present a compiler that translates arithmetic expressions containing matrices to efficient sequences of calls to basic linear algebra kernels.

### Others

**A Compiler for Linear Algebra Operations**Poster, 2 November 2016.

ACM Student Research Competition at SPLASH 2016.**The Matrix Chain Algorithm to Compile Linear Algebra Expressions**Two-page abstract, 31 October 2016.

DSLDI 2016.webPDFbibtex@misc{Barthels2016:600, author = "Henrik Barthels and Paolo Bientinesi", title = "The Matrix Chain Algorithm to Compile Linear Algebra Expressions", month = oct, year = 2016, note = "DSLDI 2016", type = "Two-page abstract", institution = "RWTH Aachen", url = "https://arxiv.org/pdf/1611.05660" }

### Technical Report

**The ELAPS Framework: Experimental Linear Algebra Performance Studies**November 2016.@techreport{Peise2016:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", year = 2016, month = nov, url = "http://arxiv.org/pdf/1504.08035v1" }

abstractwebPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.