# Publications by year

## 2017

### Submitted Papers

**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), May 2017.abstractwebPDFBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.**Accelerating scientific codes by performance and accuracy modeling**Journal of Computational Science, May 2017.abstractPDFScientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particle-particle particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the time-to-solution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of core-hours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.**Perfect spike detection via time reversal**2017.

Submitted to Frontiers in Neuroscience.abstractwebPDFSpiking neuronal networks are usually simulated with three main simulation schemes: the classical time-driven and event-driven schemes, and the more recent hybrid scheme. All three schemes evolve the state of a neuron through a series of checkpoints: equally spaced in the first scheme and determined neuron-wise by spike events in the latter two. The time-driven and the hybrid scheme determine whether the membrane potential of a neuron crosses a threshold at the end of of the time interval between consecutive checkpoints. Threshold crossing can, however, occur within the interval even if this test is negative. Spikes can therefore be missed. The present work derives, implements, and benchmarks a method for perfect retrospective spike detection. This method can be applied to neuron models with affine or linear subthreshold dynamics. The idea behind the method is to propagate the threshold with a time-inverted dynamics, testing whether the threshold crosses the neuron state to be evolved, rather than vice versa. Algebraically this translates into a set of inequalities necessary and sufficient for threshold crossing. This test is slower than the imperfect one, but faster than an alternative perfect tests based on bisection or root-finding methods. Comparison confirms earlier results that the imperfect test rarely misses spikes (less than a fraction 1/108 of missed spikes) in biologically relevant settings. This study offers an alternative geometric point of view on neuronal dynamics.**Non-linear Least-Squares optimization of rational filters for the solution of interior eigenvalue problems**2017.

Submitted to SIAM SIMAX.abstractwebPDFRational filter functions can be used to improve convergence of contour-based eigensolvers, a popular family of algorithms for the solution of the interior eigenvalue problem. We present a framework for the optimization of rational filters based on a non-convex weighted Least-Squares scheme. When used in combination with the FEAST library, our filters out-perform existing ones on a large and representative set of benchmark problems. This work provides a detailed description of: (1) a set up of the optimization process that exploits symmetries of the filter function for Hermitian eigenproblems, (2) a formulation of the gradient descent and Levenberg-Marquardt algorithms that exploits the symmetries, (3) a method to select the starting position for the optimization algorithms that reliably produces effective filters, (4) a constrained optimization scheme that produces filter functions with specific properties that may be beneficial to the performance of the eigensolver that employs them.

### Journal Articles

**TTC: A high-performance Compiler for Tensor Transpositions**ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 15:1-15:21, August 2017.@article{Springer2017:910, author = "Paul Springer and {Jeff R.} Hammond and Paolo Bientinesi", title = "TTC: A high-performance Compiler for Tensor Transpositions", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, volume = 44, number = 2, pages = "15:1--15:21", month = aug, publisher = "ACM", url = "http://arxiv.org/pdf/1603.02297v1" }

abstractwebPDFbibtexWe present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.**The ELAPS Framework: Experimental Linear Algebra Performance Studies**International Journal of High Performance Computing, August 2017.

Accepted.@article{Peise2017:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", journal = "International Journal of High Performance Computing", year = 2017, month = aug, note = "Accepted", url = "http://arxiv.org/pdf/1504.08035v1" }

abstractPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.**High-performance functional renormalization group calculations for interacting fermions**Computer Physics Communications, Volume 213, pp. 100-110, April 2017.@article{Lichtenstein2017:788, author = "Julian Lichtenstein and David {Sanchez de la Pena} and Daniel Rohe and Edoardo {Di Napoli} and Carsten Honerkamp and {Stefan A.} Maier", title = "High-performance functional renormalization group calculations for interacting fermions", journal = "Computer Physics Communications", year = 2017, volume = 213, pages = "100-110", month = apr, url = "https://arxiv.org/pdf/1604.06296v2.pdf" }

abstractwebPDFbibtexWe derive a novel computational scheme for functional Renormalization Group (fRG) calculations for interacting fermions on 2D lattices. The scheme is based on the exchange parametrization fRG for the two-fermion interaction, with additional insertions of truncated partitions of unity. These insertions decouple the fermionic propagators from the exchange propagators and lead to a separation of the underlying equations. We demonstrate that this separation is numerically advantageous and may pave the way for refined, large-scale computational investigations even in the case of complex multiband systems. Furthermore, on the basis of speedup data gained from our implementation, it is shown that this new variant facilitates efficient calculations on a large number of multi-core CPUs. We apply the scheme to the t,t′ Hubbard model on a square lattice to analyze the convergence of the results with the bond length of the truncation of the partition of unity. In most parameter areas, a fast convergence can be observed. Finally, we compare to previous results in order to relate our approach to other fRG studies.**Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection**ACM Transactions on Mathematical Software (TOMS), Volume 44(2), March 2017.@article{Peise2017:728, author = "Elmar Peise and Paolo Bientinesi", title = "Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, volume = 44, number = 2, month = mar, publisher = "ACM", institution = "AICES, RWTH Aachen University", url = "http://arxiv.org/pdf/1602.06763v1" }

abstractwebPDFbibtexTo exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**Computer Physics Communications, Volume 211, pp. 61 - 72, February 2017.

High Performance Computing for Advanced Modeling and Simulation of Materials.@article{Di_Napoli2017:318, author = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi", title = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods", journal = "Computer Physics Communications", year = 2017, volume = 211, pages = "61 - 72", month = feb, note = "High Performance Computing for Advanced Modeling and Simulation of Materials", url = "http://arxiv.org/pdf/1602.06589v2" }

abstractwebPDFbibtexOne of the greatest effort of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffers from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.

### Peer Reviewed Conference Publications

**Linnea: Compiling Linear Algebra Expressions to High-Performance Code**Proceedings of the 8th International Workshop on Parallel Symbolic Computation, July 2017.@inproceedings{Barthels2017:254, author = "Henrik Barthels and Paolo Bientinesi", title = "Linnea: Compiling Linear Algebra Expressions to High-Performance Code", booktitle = "Proceedings of the 8th International Workshop on Parallel Symbolic Computation", year = 2017, month = jul }

abstractbibtexLinear algebra expressions appear in fields as diverse as computational biology, signal processing, communication technology, finite element methods, and control theory. Libraries such as BLAS and LAPACK provide highly optimized building blocks for just about any linear algebra computation; thus, a linear algebra expression can be evaluated efficiently by breaking it down into those building blocks. However, this is a challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this paper we give an overview of existing solutions, and introduce Linnea, a compiler that solves this problem. As shown through a set of test cases, Linnea’s results are comparable with those obtained by human experts.**MatchPy: A Pattern Matching Library**Proceedings of the 15th Python in Science Conference, July 2017.@inproceedings{Krebber2017:550, author = "{Manuel } Krebber and Henrik Barthels and Paolo Bientinesi", title = "MatchPy: A Pattern Matching Library", booktitle = "Proceedings of the 15th Python in Science Conference", year = 2017, month = jul, url = "http://conference.scipy.org/proceedings/scipy2017/pdfs/manuel_krebber.pdf" }

abstractwebPDFbibtexPattern matching is a powerful tool for symbolic computations, based on the well-defined theory of term rewriting systems. Application domains include algebraic expressions, abstract syntax trees, and XML and JSON data. Unfortunately, no lightweight implementation of pattern matching as general and flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt. Therefore, we created the open source module MatchPy which offers similar pattern matching functionality in Python using a novel algorithm which finds matches for large pattern sets more efficiently by exploiting similarities between patterns.**HPTT: A High-Performance Tensor Transposition C++ Library**Proceedings of the ARRAY 2017, June 2017.@inproceedings{Springer2017:558, author = "Paul Springer and Tong Su and Paolo Bientinesi", title = "HPTT: A High-Performance Tensor Transposition C++ Library", year = 2017, month = jun, url = "http://arxiv.org/abs/1704.04374" }

abstractwebPDFbibtexRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**Proceedings of the 32nd International Conference, ISC High Performance 2017, Volume 10266, pp. 61-78, Springer, June 2017.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", year = 2017, volume = 10266, pages = "61--78", address = "Frankfurt", month = jun, publisher = "Springer", url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.**Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods**Proceedings of the JARA-HPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200-211, Springer, 2017.@inproceedings{Fabregat-Traver2017:4, author = "Diego Fabregat-Traver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}", title = " Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods", year = 2017, volume = 10164, series = "Lecture Notes in Computer Science", pages = "200--211", publisher = "Springer", url = "https://arxiv.org/pdf/1611.00606v1" }

abstractwebPDFbibtexIn this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum J\"ulich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study in two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16-cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.**Parallel adaptive integration in high-performance functional Renormalization Group computations**Jülich Aachen Research Alliance High-Performance Computing Symposium 2016, Lecture Notes in Computer Science, Springer-Verlag, 2017.@inproceedings{Lichtenstein2017:360, author = "Julian Lichtenstein and Jan Winkelmann and David {Sanchez de la Pena} and Toni Vidovic and Edoardo {Di Napoli}", title = "Parallel adaptive integration in high-performance functional Renormalization Group computations", booktitle = "Jülich Aachen Research Alliance High-Performance Computing Symposium 2016", year = 2017, editor = "E. Di Napoli et. al.", series = "Lecture Notes in Computer Science", publisher = "Springer-Verlag", url = "https://arxiv.org/pdf/1610.09991v1.pdf" }

abstractwebPDFbibtexThe conceptual framework provided by the functional Renormalization Group (fRG) has become a formidable tool to study correlated electron systems on lattices which, in turn, provided great insights to our understanding of complex many-body phenomena, such as high- temperature superconductivity or topological states of matter. In this work we present one of the latest realizations of fRG which makes use of an adaptive numerical quadrature scheme specifically tailored to the described fRG scheme. The final result is an increase in performance thanks to improved parallelism and scalability.

### Thesis

**Performance Modeling and Prediction for Dense Linear Algebra**HPAC, RWTH Aachen, May 2017.

1st version (under evaluation).@phdthesis{Peise2017:620, author = "Elmar Peise", title = "Performance Modeling and Prediction for Dense Linear Algebra", school = "HPAC, RWTH Aachen", year = 2017, month = may, note = "1st version (under evaluation)", url = "https://arxiv.org/pdf/1706.01341" }

abstractwebPDFbibtexThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.

## 2016

### Submitted Paper

**Design of a high-performance GEMM-like Tensor-Tensor Multiplication**July 2016.

ACM Transactions on Mathematical Software (TOMS).abstractPDFWe present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.

### Journal Articles

**New methodology for determining the electronic thermal conductivity of metals via direct non-equilibrium ab initio molecular dynamics**Physical Review B, pp. 075149, 25 August 2016.@article{Yue2016:504, author = "Sheng-Ying Yue and Xiaoliang Zhang and Stephen Stackhouse and Guangzhao Qin and Edoardo {Di Napoli} and Ming Hu", title = "New methodology for determining the electronic thermal conductivity of metals via direct non-equilibrium ab initio molecular dynamics", journal = "Physical Review B", year = 2016, pages = 75149, month = aug, url = "http://arxiv.org/pdf/1603.07755v2.pdf" }

abstractwebPDFbibtexMany physical properties of metals can be understood in terms of the free electron model, as proven by the Wiedemann-Franz law. According to this model, electronic thermal conductivity (κ_el) can be inferred from the Boltzmann transport equation (BTE). However, the BTE does not perform well for some complex metals, such as Cu. Moreover, the BTE cannot clearly describe the origin of the thermal energy carried by electrons or how this energy is transported in metals. The charge distribution of conduction electrons in metals is known to reflect the electrostatic potential (EP) of the ion cores. Based on this premise, we develop a new methodology for evaluating κel by combining the free electron model and non-equilibrium ab initio molecular dynamics (NEAIMD) simulations. We demonstrate that the kinetic energy of thermally excited electrons originates from the energy of the spatial electrostatic potential oscillation (EPO), which is induced by the thermal motion of ion cores. This method directly predicts the κ_el of pure metals with a high degree of accuracy.**Efficient estimation of eigenvalue counts in an interval**Numerical Linear Algebra with Applications, Volume 23(4), pp. 674-692, July 2016.@article{Di_Napoli2016:838, author = "Edoardo {Di Napoli} and Eric Polizzi and Yousef Saad", title = "Efficient estimation of eigenvalue counts in an interval", journal = "Numerical Linear Algebra with Applications", year = 2016, volume = 23, number = 4, pages = "674-692", month = jul }

abstractwebbibtexEstimating the number of eigenvalues located in a given interval of a large sparse Hermitian matrix is an important problem in certain applications and it is a prerequisite of eigensolvers based on a divide-and-conquer paradigm. Often an exact count is not necessary and methods based on stochastic estimates can be utilized to yield rough approximations. This paper examines a number of techniques tailored to this specific task. It reviews standard approaches and explores new ones based on polynomial and rational approximation filtering combined with stochastic procedure.**Large-Scale Linear Regression: Development of High-Performance Routines**Applied Mathematics and Computation, Volume 275, pp. 411-421, 15 February 2016.@article{Frank2016:90, author = "Alvaro Frank and Diego Fabregat-Traver and Paolo Bientinesi", title = "Large-Scale Linear Regression: Development of High-Performance Routines", journal = "Applied Mathematics and Computation", year = 2016, volume = 275, pages = "411-421", month = feb, url = "http://arxiv.org/abs/1504.07890" }

abstractwebPDFbibtexIn statistics, series of ordinary least squares problems (OLS) are used to study the linear correlation among sets of variables of interest; in many studies, the number of such variables is at least in the millions, and the corresponding datasets occupy terabytes of disk space. As the availability of large-scale datasets increases regularly, so does the challenge in dealing with them. Indeed, traditional solvers---which rely on the use of ``black-box'' routines optimized for one single OLS---are highly inefficient and fail to provide a viable solution for big-data analyses. As a case study, in this paper we consider a linear regression consisting of two-dimensional grids of related OLS problems that arise in the context of genome-wide association analyses, and give a careful walkthrough for the development of {\sc ols-grid}, a high-performance routine for shared-memory architectures; analogous steps are relevant for tailoring OLS solvers to other applications. In particular, we first illustrate the design of efficient algorithms that exploit the structure of the OLS problems and eliminate redundant computations; then, we show how to effectively deal with datasets that do not fit in main memory; finally, we discuss how to cast the computation in terms of efficient kernels and how to achieve scalability. Importantly, each design decision along the way is justified by simple performance models. {\sc ols-grid} enables the solution of $10^{11}$ correlated OLS problems operating on terabytes of data in a matter of hours.

### Peer Reviewed Conference Publications

**A Compiler for Linear Algebra Operations**SPLASH '16 Companion, ACM, October 2016.

Student Research Competition.@inproceedings{Barthels2016:280, author = "Henrik Barthels", title = "A Compiler for Linear Algebra Operations", booktitle = "SPLASH '16 Companion", year = 2016, month = oct, publisher = "ACM", note = "Student Research Competition" }

abstractwebbibtexIn this paper we present a compiler that translates arithmetic expressions containing matrices to efficient sequences of calls to basic linear algebra kernels.**TTC: A Tensor Transposition Compiler for Multiple Architectures**Proceedings of the 3rd International Workshop on Libraries, Languages and Compilers for Programming (ARRAY 2016), June 2016.@inproceedings{Springer2016:940, author = "Paul Springer and Aravind Sankaran and Paolo Bientinesi", title = "TTC: A Tensor Transposition Compiler for Multiple Architectures", year = 2016, month = jun, url = "https://arxiv.org/pdf/1607.01249.pdf" }

abstractPDFbibtexWe consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to 8x and 32x, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.**The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability**Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16, Number 7, pp. 7:1-7:13, IEEE Press, 2016.

Selected for Reproducibility Initiative at SC17.@inproceedings{Höhnerbach2016:78, author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi", title = "The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability", booktitle = "Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis", year = 2016, number = 7, series = "SC'16", pages = "7:1--7:13", publisher = "IEEE Press", note = "Selected for Reproducibility Initiative at SC17", url = "https://arxiv.org/pdf/1607.02904v1" }

abstractwebPDFbibtexMolecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multibody potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.**Dynamic SIMD Vector Lane Scheduling**High Performance Computing : ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P^3MA, VHPC, WOPSSS, Springer International Publishing, 2016.@inproceedings{Höhnerbach2016:440, author = "Markus Höhnerbach and Florian Wende and Olaf Krizikalla", title = "Dynamic SIMD Vector Lane Scheduling", booktitle = "High Performance Computing : ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P^3MA, VHPC, WOPSSS", year = 2016, editor = "Taufer, Michela", publisher = "Springer International Publishing" }

abstractwebbibtexA classical technique to vectorize code that contains control flow is a control-flow to data-flow conversion. In that approach statements are augmented with masks that denote whether a given vector lane participates in the statement’s execution or idles. If the scheduling of work to vector lanes is performed statically, then some of the vector lanes will run idle in case of control flow divergences or varying work intensities across the loop iterations. With an increasing number of vector lanes, the likelihood of divergences or heavily unbalanced work assignments increases and static scheduling leads to a poor resource utilization. In this paper, we investigate different approaches to dynamic SIMD vector lane scheduling using the Mandelbrot set algorithm as a test case. To overcome the limitations of static scheduling, idle vector lanes are assigned work items dynamically, thereby minimizing per-lane idle cycles. Our evaluation on the Knights Corner and Knights Landing platform shows, that our approaches can lead to considerable performance gains over a static work assignment. By using the AVX-512 vector compress and expand instruction, we are able to further improve the scheduling.

### Others

**A Compiler for Linear Algebra Operations**Poster, 2 November 2016.

ACM Student Research Competition at SPLASH 2016.**The Matrix Chain Algorithm to Compile Linear Algebra Expressions**Proceedings of the DSLDI 2016, Extended abstract, 31 October 2016.**Program Generation for Linear Algebra Using Multiple Layers of DSLs**Proceedings of the DSLDI 2016, Extended abstract, October 2016.bibtex@misc{Spampinato2016:960, author = "{Daniele } Spampinato and Diego Fabregat-Traver and Markus Pueschel and Paolo Bientinesi", title = "Program Generation for Linear Algebra Using Multiple Layers of DSLs", month = oct, year = 2016, type = "Extended abstract" }

### Technical Reports

**Large Scale Parallel Computations in R through Elemental**October 2016.@techreport{Canales2016:718, author = "Rodrigo Canales and Elmar Peise and Paolo Bientinesi", title = "Large Scale Parallel Computations in R through Elemental", year = 2016, month = oct, url = "https://arxiv.org/pdf/1610.07310v1" }

abstractwebPDFbibtexEven though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large scale statistical computations, we introduce RElem, an open source package that integrates the distributed dense linear algebra library Elemental into R. While on the one hand, RElem provides direct wrappers of Elemental's routines, on the other hand, it overloads various operators and functions to provide an entirely native R experience for distributed computations. We showcase how simple it is to port existing R programs to Relem and demonstrate that Relem indeed allows to scale beyond the single-node limitation of R with the full performance of Elemental without any overhead.**A Note on Time Measurements in LAMMPS**Aachen Institute for Computational Engineering Science, RWTH Aachen, February 2016.

Technical Report AICES-2016/02-1.@techreport{Tameling2016:140, author = "Daniel Tameling and Paolo Bientinesi and {Ahmed E.} Ismail", title = "A Note on Time Measurements in LAMMPS", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2016, month = feb, note = "Technical Report AICES-2016/02-1", url = "http://arxiv.org/abs/1602.05566" }

abstractPDFbibtexWe examine the issue of assessing the efficiency of components of a parallel program at the example of the MD package LAMMPS. In particular, we look at how LAMMPS deals with the issue and explain why the approach adopted might lead to inaccurate conclusions. The misleading nature of this approach is subsequently verified experimentally with a case study. Afterwards, we demonstrate how one should correctly determine the efficiency of the components and show what changes to the code base of LAMMPS are necessary in order to get the correct behavior.

## 2015

### Journal Articles

**An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems**Concurrency and Computation: Practice and Experience, Volume 27(4), pp. 905, 25 March 2015.@article{Berljafa2015:768, author = "Mario Berljafa and Daniel Wortmann and Edoardo {Di Napoli}", title = "An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems", journal = "Concurrency and Computation: Practice and Experience", year = 2015, volume = 27, number = 4, pages = 905, month = mar, url = "http://arxiv.org/pdf/1404.4161" }

abstractwebPDFbibtexIn many scientific applications the solution of non-linear differential equations are obtained through the set-up and solution of a number of successive eigenproblems. These eigenproblems can be regarded as a sequence whenever the solution of one problem fosters the initialization of the next. In addition, some eigenproblem sequences show a connection between the solutions of adjacent eigenproblems. Whenever is possible to unravel the existence of such a connection, the eigenproblem sequence is said to be a correlated. When facing with a sequence of correlated eigenproblems the current strategy amounts to solving each eigenproblem in isolation. We propose a novel approach which exploits such correlation through the use of an eigensolver based on subspace iteration and accelerated with Chebyshev polynomials (ChFSI). The resulting eigensolver, is optimized by minimizing the number of matvec multiplications and parallelized using the Elemental library framework. Numerical results shows that ChFSI achieves excellent scalability and is competitive with current dense linear algebra parallel eigensolvers.**High Performance Solutions for Big-data GWAS**Parallel Computing, Volume 42, pp. 75 - 87, February 2015.

Special issue on Parallelism in Bioinformatics.@article{Peise2015:754, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "High Performance Solutions for Big-data GWAS", journal = "Parallel Computing", year = 2015, volume = 42, pages = "75 - 87", month = feb, note = "Special issue on Parallelism in Bioinformatics", url = "http://arxiv.org/pdf/1403.6426v1" }

abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high-performance and allow the analysis of enormous datasets.

### Peer Reviewed Conference Publications

**The Tersoff many-body potential: Sustainable performance through vectorization**Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation, November 2015.@inproceedings{Höhnerbach2015:718, author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi", title = "The Tersoff many-body potential: Sustainable performance through vectorization", year = 2015, month = nov, url = "http://hpac.rwth-aachen.de/ipcc/sc15_paper.pdf" }

abstractPDFbibtexThis contribution discusses the effectiveness and the sustainability of vectorization in molecular dynamics. As a case study, we present results for multi-body potentials on a range of vector instruction sets, targeting both CPUs and accelerators such as GPUs and the Intel Xeon Phi. The shared-memory and distributed-memory parallelization of MD simulations are problems that have been extensively treated; by contrast, vectorization is a relatively unexplored dimension. However, given both the increasing number of vector units available in the computing architectures, and the increasing width of such units, it is imperative to write algorithms and code for taking advantage of vector instructions. We chose to investigate multi-body potentials, as they are not as easily vectorizable as the pair potentials. As a matter of fact, their optimization pushes the boundaries of current compiler technology: our experience suggests that for such problems, compilers alone are not able to deliver appreciable speedups. By constrast, by resorting to the explicit use of instrinsics, we demonstrate significant improvements both on CPUs and the Intel Xeon Phi. Nonetheless, it is clear that the direct use of intrinsics is not a sustainable solution, not only for the effort and expertise it requires, but also because it hinders readability, extensibility and maintainability of code. In our case study, we wrote a C++ implementation of the Tersoff potential for the LAMMPS molecular dynamics simulation package. To alleviate the difficulty of using intrinsics, we used a template-based approach and abstracted from the instruction set, the vector length and the floating point precision. Instead of having to write 18 different kernels, this approach allowed us to only have one source code, from which all implementations (scalar, SSE, AVX, AVX2, AVX512, and IMCI, in single, double, and mixed precision) are automatically derived. Such an infrastructure makes it possible to compare kernels and identify which algorithms are suitable for which architecture. Without optimizations, we observed that a Intel Xeon E5-2650 CPU (Sandy Bridge) is considerably faster than a Xeon Phi. With our optimizations, the performance of both CPU and accelerator improved, and thanks to its wide vector units, the Xeon Phi takes the lead. For the future, besides monitoring the compilers' improvements, one should evaluate OpenMP 4.1 and its successors, which will include support for directive-based vectorization.**A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics**High Performance Computing, Lecture Notes in Computer Science, Volume 9137, pp. 155-170, Springer International Publishing, July 2015.@inproceedings{Springer2015:450, author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi", title = "A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics", booktitle = "High Performance Computing ", year = 2015, volume = 9137, series = "Lecture Notes in Computer Science", pages = "155-170", month = jul, publisher = "Springer International Publishing", url = "http://hpac.rwth-aachen.de/~springer/dcm_isc15.pdf" }

abstractwebPDFbibtexRecent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial sys- tems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is cho- sen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algo- rithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and mas- sively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458, 752 cores), and show nearly perfect strong and weak scaling.**On the Performance Prediction of BLAS-based Tensor Contractions**High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.@inproceedings{Peise2015:380, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "On the Performance Prediction of BLAS-based Tensor Contractions", booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation", year = 2015, editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.", volume = 8966, series = "Lecture Notes in Computer Science", pages = "193-212", month = apr, publisher = "Springer International Publishing", url = "http://arxiv.org/pdf/1409.8608v1" }

abstractwebPDFbibtexTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**High Performance Computing for Computational Science -- VECPAR 2014, Lecture Notes in Computer Science, Volume 8969, pp. 245-258, Springer International Publishing, April 2015.@inproceedings{Peise2015:250, author = "Elmar Peise and Paolo Bientinesi", title = "A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels", booktitle = "High Performance Computing for Computational Science -- VECPAR 2014", year = 2015, editor = "Daydé, Michel and Marques, Osni and Nakajima, Kengo", volume = 8969, series = "Lecture Notes in Computer Science", pages = "245-258", month = apr, publisher = "Springer International Publishing", url = "https://arxiv.org/pdf/1402.5897v1" }

abstractwebPDFbibtexIt is universally known that caching is critical to attain high- performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.**Scalable and Efficient Linear Algebra Kernel Mapping for Low Energy Consumption on the Layers CGRA**Applied Reconfigurable Computing, Lecture Notes in Computer Science, Volume 9040, pp. 301-310, January 2015.@inproceedings{Rakossy2015:898, author = "{Zoltan Endre } Rakossy and Dominik Stengele and {Axel } Acosta-Aponte and Saumitra Chafekar and Paolo Bientinesi and {Anupam } Chattopadhyay", title = "Scalable and Efficient Linear Algebra Kernel Mapping for Low Energy Consumption on the Layers CGRA", booktitle = "Applied Reconfigurable Computing", year = 2015, volume = 9040, series = "Lecture Notes in Computer Science", pages = "301-310", month = jan, organization = "Ruhr-Universität Bochum", url = "http://arc2015.esit.rub.de" }

abstractwebPDFbibtexA scalable mapping is proposed for 3 important kernels from the Numerical Linear Algebra domain, to exploit architectural features to reach asymptotically optimal efficiency and a low energy consumption. Performance and power evaluations were done with input data set matrix sizes ranging from 64×64 to 16384×16384. 12 architectural variants with up to 10×10 processing elements were used to explore scalability of the mapping and the architecture, achieving < 10% energy increase for architectures up to 8×8 PEs coupled with performance speed-ups of more than an order of magnitude. This enables a clean area-performance trade-off on the Layers architecture while keeping energy constant over the variants.**Packet-Oriented Streamline Tracing on Modern SIMD Architectures**2015.@inproceedings{Hentschel2015:388, author = "Bernd Hentschel and {Jens Henrik} Göbbert and Michael Klemm and Paul Springer and Andrea Schnorr and {Torsten W.} Kuhlen", title = "Packet-Oriented Streamline Tracing on Modern SIMD Architectures", year = 2015, journal = "Eurographics Symposium on Parallel Graphics and Visualization", url = "https://diglib.eg.org/handle/10.2312/pgv.20151154.043-052" }

abstractPDFbibtexThe advection of integral lines is an important computational kernel in vector field visualization. We investigate how this kernel can profit from vector (SIMD) extensions in modern CPUs. As a baseline, we formulate a streamline tracing algorithm that facilitates auto-vectorization by an optimizing compiler. We analyze this algorithm and propose two different optimizations. Our results show that particle tracing does not per se benefit from SIMD computation. Based on a careful analysis of the auto-vectorized code, we propose an optimized data access routine and a re-packing scheme which increases average SIMD efficiency. We evaluate our approach on three different, turbulent flow fields. Our optimized approaches increase integration performance up to 5.6x over our baseline measurement. We conclude with a discussion of current limitations and aspects for future work.

### Thesis

**Systematic Generation of Algorithms for Iterative Methods**Master's Thesis, RWTH Aachen, March 2015.

## 2014

### Journal Articles

**Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis**F1000Research, Volume 3(200), August 2014.

Open peer reviews.@article{Fabregat-Traver2014:430, author = "Diego Fabregat-Traver and Sodbo Sharapov and {Caroline } Hayward and {Igor } Rudan and {Harry } Campbell and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis", journal = "F1000Research", year = 2014, volume = 3, number = 200, month = aug, note = "Open peer reviews", howpublished = "F1000Research" }

abstractwebbibtexTo raise the power of genome-wide association studies (GWAS) and avoid false-positive results, one can rely on mixed model based tests. When large samples are used, and especially when multiple traits are to be studied in the omics context, this approach becomes computationally unmanageable. Here, we develop new methods for analysis of arbitrary number of traits, and demonstrate that for the analysis of single-trait and multiple-trait GWAS, different methods are optimal. We implement these methods in a high-performance computing framework that uses state-of-the-art linear algebra kernels, incorporates optimizations, and avoids redundant computations, thus increasing throughput while reducing memory usage and energy consumption. We show that compared to existing libraries, the OmicABEL software---which implements these methods---achieves speed-ups of up to three orders of magnitude. As a consequence, samples of tens of thousands of individuals as well as samples phenotyped for many thousands of ''omics'' measurements can be analyzed for association with millions of omics features without the need for super-computers.**Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies**ACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. 27:1-27:22, June 2014.@article{Fabregat-Traver2014:798, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2014, volume = 40, number = 4, pages = "27:1--27:22", month = jun, publisher = "ACM", url = "http://arxiv.org/pdf/1210.7683v1" }

abstractwebPDFbibtexIn many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the Genome-Wide Association Studies (GWAS), a widely spread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floating-point operations) over terabytes of data residing on disk. In this paper, we design an efficient algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.**Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions**Applied Mathematics and Computation, Volume 235, pp. 454-468, May 2014.@article{Di_Napoli2014:210, author = "Edoardo {Di Napoli} and Diego Fabregat-Traver and Gregorio Quintana-Orti and Paolo Bientinesi", title = "Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions", journal = "Applied Mathematics and Computation", year = 2014, volume = 235, pages = "454--468", month = may, publisher = "Elsevier", url = "http://arxiv.org/pdf/1307.2100" }

abstractwebPDFbibtexMathematical operators whose transformation rules constitute the building blocks of a multi-linear algebra are widely used in physics and engineering applications where they are very often represented as tensors. In the last century, thanks to the advances in tensor calculus, it was possible to uncover new research fields and make remarkable progress in the existing ones, from electromagnetism to the dynamics of fluids and from the mechanics of rigid bodies to quantum mechanics of many atoms. By now, the formal mathematical and geometrical properties of tensors are well defined and understood; conversely, in the context of scientific and high-performance computing, many tensor-related problems are still open. In this paper, we address the problem of efficiently computing contractions among two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library. In particular, we establish precise conditions to determine if and when GEMM, the kernel for matrix products, can be used. Such conditions take into consideration both the nature of the operation and the storage scheme of the tensors, and induce a classification of the contractions into three groups. For each group, we provide a recipe to guide the users towards the most effective use of BLAS.**Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures**Applied Mathematics and Computation, Volume 234, pp. 606-617, May 2014.@article{Fabregat-Traver2014:430, author = "Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures", journal = "Applied Mathematics and Computation", year = 2014, volume = 234, pages = "606--617", month = may, url = "http://arxiv.org/pdf/1210.7325v1" }

abstractwebPDFbibtexGeneralized linear mixed-effects models in the context of genome-wide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized least-squares problems, and the processing of terabytes of data. We present high performance in-core and out-of-core shared-memory algorithms for GWAS: By taking advantage of domain-specific knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable genome studies of unprecedented size.**Improved Accuracy and Parallelism for MRRR-based Eigensolvers**SIAM Journal of Scientific Computing, Volume 36(2), pp. 240-263, April 2014.@article{Petschow2014:60, author = "Matthias Petschow and {Enrique S.} Quintana-Orti and Paolo Bientinesi", title = "Improved Accuracy and Parallelism for MRRR-based Eigensolvers", journal = "SIAM Journal of Scientific Computing", year = 2014, volume = 36, number = 2, pages = "240--263", month = apr, url = "http://arxiv.org/pdf/1304.1864v3" }

abstractwebPDFbibtexThe real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to real tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR) is among the fastest methods. Although fast, the solvers based on MRRR do not deliver the same accuracy as competing methods like Divide & Conquer or the QR algorithm. In this paper, we demonstrate that the use of mixed precisions leads to improved accuracy of MRRR-based eigensolvers with limited or no performance penalty. As a result, we obtain eigensolvers that are not only equally or more accurate than the best available methods, but also - under most circumstances - faster and more scalable than the competition.**Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials**Journal of Chemical Physics, Volume 140, pp. 024105, January 2014.@article{Tameling2014:590, author = "Daniel Tameling and Paul Springer and Paolo Bientinesi and {Ahmed E.} Ismail", title = "Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials", journal = "Journal of Chemical Physics", year = 2014, volume = 140, pages = 24105, month = jan, url = "http://arxiv.org/abs/1308.4005" }

abstractwebPDFbibtexThe multilevel summation (MLS) method was developed to evaluate long-range interactions in molecular dynamics (MD) simulations. MLS was initially introduced for Coulombic potentials; we have extended this method to dispersion interactions. While formally short-ranged, for an accurate calculation of forces and energies in cases such as in interfacial systems, dispersion potentials require long-range methods. Since long-range solvers tend to dominate the time needed to perform MD calculations, increasing their performance is of vital importance. The MLS method offers some significant advantages when compared to mesh-based Ewald methods like the particle-particle particle-mesh and particle mesh Ewald methods. Unlike mesh-based Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared to the O(N log N) complexity of the mesh-based Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions. In addition, we have derived error bounds, similar to those obtained by Hardy for the electrostatic MLS. Using a prototype implementation, we can already demonstrate the linear scaling of the MLS method for dispersion, and present results establishing the accuracy and efficiency of the method.

### Book Chapter

**Numerical methods for the eigenvalue problem in electronic structure computations**Computing Solids: Models, ab-initio methods and supercomputing, Key Technologies, Volume 74, pp. D3.1-28, Forschungszentrum Juelich GmbH, March 2014.

Lecture Notes of the 45th IFF Spring School.bibtex@inbook{Di_Napoli2014:348, author = "Edoardo {Di Napoli}", title = "Numerical methods for the eigenvalue problem in electronic structure computations", pages = "D3.1-28", publisher = "Forschungszentrum Juelich GmbH", year = 2014, volume = 74, series = "Key Technologies", month = mar, note = "Lecture Notes of the 45th IFF Spring School", booktitle = "Computing Solids: Models, ab-initio methods and supercomputing", institution = "Juelich Supercomputing Centre" }

### Theses

**Knowledge-Based Automatic Generation of Linear Algebra Algorithms and Code**RWTH Aachen, April 2014.@phdthesis{Fabregat-Traver2014:278, author = "Diego Fabregat-Traver", title = " Knowledge-Based Automatic Generation of Linear Algebra Algorithms and Code", school = "RWTH Aachen", year = 2014, month = apr, url = "http://arxiv.org/abs/1404.3406" }

abstractPDFbibtexThis dissertation focuses on the design and the implementation of domain-specific compilers for linear algebra matrix equations. The development of efficient libraries for such equations, which lie at the heart of most software for scientific computing, is a complex process that requires expertise in a variety of areas, including the application domain, algorithms, numerical analysis and high-performance computing. Moreover, the process involves the collaboration of several people for a considerable amount of time. With our compilers, we aim to relieve the developers from both designing algorithms and writing code, and to generate routines that match or even surpass the performance of those written by human experts.**MRRR-based Eigensolvers for Multi-Core Processors and Supercomputers**RWTH Aachen, January 2014.

PhD thesis.@phdthesis{Petschow2014:748, author = "Matthias Petschow", title = "MRRR-based Eigensolvers for Multi-Core Processors and Supercomputers", school = "RWTH Aachen", year = 2014, month = jan, note = "PhD thesis", url = "http://arxiv.org/pdf/1401.4950v1.pdf" }

abstractwebPDFbibtexThe real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR or MR3 in short) - introduced in the late 1990s - is among the fastest methods. To compute k eigenpairs of a real n-by-n tridiagonal T, MRRR only requires O(kn) arithmetic operations; in contrast, all the other practical methods require O(k^2 n) or O(n^3) operations in the worst case. This thesis centers around the performance and accuracy of MRRR.

### Technical Report

**Cache-aware Performance Modeling and Prediction for Dense Linear Algebra**November 2014.@techreport{Peise2014:904, author = "Elmar Peise and Paolo Bientinesi", title = "Cache-aware Performance Modeling and Prediction for Dense Linear Algebra", year = 2014, month = nov, url = "http://arxiv.org/pdf/1409.8602v1" }

abstractPDFbibtexCountless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation---without executing it---is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, normally one has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms within few percents from the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.

## 2013

### Journal Articles

**Application-tailored Linear Algebra Algorithms: A Search-Based Approach**International Journal of High Performance Computing Applications (IJHPCA), Volume 27(4), pp. 425 - 438, November 2013.@article{Fabregat-Traver2013:4, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Application-tailored Linear Algebra Algorithms: A Search-Based Approach", journal = "International Journal of High Performance Computing Applications (IJHPCA)", year = 2013, volume = 27, number = 4, pages = "425 - 438", month = nov, publisher = "Sage Publications, Inc.", url = "http://arxiv.org/abs/1211.5904" }

abstractwebPDFbibtexIn this paper, we tackle the problem of automatically generating algorithms for linear algebra operations by taking advantage of problem-specific knowledge. In most situations, users possess much more information about the problem at hand than what current libraries and computing environments accept; evidence shows that if properly exploited, such information leads to uncommon/unexpected speedups. We introduce a knowledge-aware linear algebra compiler that allows users to input matrix equations together with properties about the operands and the problem itself; for instance, they can specify that the equation is part of a sequence, and how successive instances are related to one another. The compiler exploits all this information to guide the generation of algorithms, to limit the size of the search space, and to avoid redundant computations. We applied the compiler to equations arising as part of sensitivity and genome studies; the algorithms produced exhibit, respectively, 100- and 1000-fold speedups.**Block Iterative Eigensolvers for Sequences of Correlated Eigenvalue Problems**Computer Physics Communications, Volume 184, pp. 2478 - 2488, November 2013.@article{Di_Napoli2013:904, author = "Edoardo {Di Napoli} and Mario Berljafa", title = "Block Iterative Eigensolvers for Sequences of Correlated Eigenvalue Problems", journal = "Computer Physics Communications", year = 2013, volume = 184, pages = "2478 - 2488", month = nov, url = "http://arxiv.org/pdf/1206.3768v2" }

abstractwebPDFbibtexIn Density Functional Theory simulations based on the LAPW method, each self-consistent cycle comprises dozens of large dense generalized eigenproblems. In contrast to real-space methods, eigenpairs solving for problems at distinct cycles have either been believed to be independent or at most very loosely connected. In a recent study [7], it was proposed to revert this point of view and consider simulations as made of dozens of sequences of eigenvalue problems; each sequence groups together eigenproblems with equal k-vectors and an increasing outer-iteration cycle index l. From this different standpoint it was possible to demonstrate that, contrary to belief, successive eigenproblems in a sequence are strongly correlated with one another. In particular, by tracking the evolution of subspace angles between eigenvectors of successive eigenproblems, it was shown that these angles decrease noticeably after the first few iterations and become close to collinear: the closer to convergence the stronger the correlation becomes. This last result suggests that we can manipulate the eigenvectors, solving for a specific eigenproblem in a sequence, as an approximate solution for the following eigenproblem. In this work we present results that are in line with this intuition. First, we provide numer- ical examples where opportunely selected block iterative solvers benefit from the reuse of eigenvectors by achieving a substantial speed-up. We then develop a C language version of one of these algorithms and run a series of tests specifically focused on perfor- mance and scalability. All the numerical tests are carried out employing sequences of eigenproblems extracted from simulations of solid-state physics crystals. The results presented here could eventually open the way to a widespread use of block iterative solvers in ab initio electronic structure codes based on the LAPW approach.**Dissecting the FEAST Algorithm for Generalized Eigenproblems**Journal of Computational and Applied Mathematics, Volume 244, pp. 1-9, May 2013.@article{Kraemer2013:188, author = "Lukas Kraemer and Edoardo {Di Napoli} and {Martin } Galgon and Bruno Lang and Paolo Bientinesi", title = "Dissecting the FEAST Algorithm for Generalized Eigenproblems", journal = "Journal of Computational and Applied Mathematics", year = 2013, volume = 244, pages = "1--9", month = may, url = "http://arxiv.org/abs/1204.1726" }

abstractwebPDFbibtexWe analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the well-known Rayleigh-Ritz method, we identify several critical issues that influence convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, how the inner linear systems impact the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples, and hint at possible improvements to overcome the existing problems.**High-Performance Solvers for Dense Hermitian Eigenproblems**SIAM Journal on Scientific Computing, Volume 35(1), pp. C1-C22, January 2013.@article{Petschow2013:208, author = "Matthias Petschow and Elmar Peise and Paolo Bientinesi", title = "High-Performance Solvers for Dense Hermitian Eigenproblems", journal = "SIAM Journal on Scientific Computing", year = 2013, volume = 35, number = 1, pages = "C1-C22", month = jan, publisher = "Society for Industrial and Applied Mathematics", url = "http://arxiv.org/pdf/1205.2107.pdf" }

abstractwebPDFbibtexWe introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and tridiagonal eigenproblems. Among these, the last is of particular importance as it is a solver on its own right, as well as the computational kernel for the first two; we present a fast and scalable tridiagonal solver based on the Algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library, and is designed to fully support both message-passing (MPI) and multithreading parallelism (SMP). As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers on two supercomputers. Such a study, performed with up to 8,192 cores, provides precise guidelines to assemble the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR outperforms even the fastest solvers built from ScaLAPACK's components.

### Peer Reviewed Conference Publications

**Streaming Data from HDD to GPUs for Sustained Peak Performance**Proceedings of the Euro-Par 2013, 19th International European Conference on Parallel and Distributed Computing, Lecture Notes in Computer Science, Volume 8097, pp. 788-799, Springer Berlin Heidelberg, May 2013.@inproceedings{Beyer2013:618, author = "Lucas Beyer and Paolo Bientinesi", title = "Streaming Data from HDD to GPUs for Sustained Peak Performance", year = 2013, volume = 8097, series = "Lecture Notes in Computer Science", pages = "788-799", month = may, publisher = "Springer Berlin Heidelberg", url = "http://arxiv.org/pdf/1302.4332v1.pdf" }

abstractwebPDFbibtexIn the context of the genome-wide association studies (GWAS), one has to solve long sequences of generalized least-squares problems; such a task has two limiting factors: execution time --often in the range of days or weeks-- and data management --data sets in the order of Terabytes. We present an algorithm that obviates both issues. By pipelining the computation, and thanks to a sophisticated transfer strategy, we stream data from hard disk to main memory to GPUs and achieve sustained peak performance; with respect to a highly-optimized CPU implementation, our algorithm shows a speedup of 2.6x. Moreover, the approach lends itself to multiple GPUs and attains almost perfect scalability. When using 4 GPUs, we observe speedups of 9x over the aforementioned implementation, and 488x over a widespread biology library.**Algorithms for Large-scale Whole Genome Association Analysis**Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp. 229-234, ACM, 2013.@inproceedings{Peise2013:504, author = "Elmar Peise and Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Algorithms for Large-scale Whole Genome Association Analysis ", booktitle = "Proceedings of the 20th European MPI Users' Group Meeting", year = 2013, series = "EuroMPI '13", pages = "229--234 ", address = "New York, NY, USA", publisher = "ACM", url = "http://arxiv.org/pdf/1304.2272v1" }

abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that these methods sustain high-performance and allow the analysis of enormous datasets.**A Domain-Specific Compiler for Linear Algebra Operations**High Performance Computing for Computational Science - VECPAR 2012, Lecture Notes in Computer Science, Volume 7851, pp. 346-361, Springer, 2013.@inproceedings{Fabregat-Traver2013:340, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "A Domain-Specific Compiler for Linear Algebra Operations", booktitle = "High Performance Computing for Computational Science - VECPAR 2012", year = 2013, editor = "M. Dayde, O. Marques, and K. Nakajima", volume = 7851, series = "Lecture Notes in Computer Science", pages = "346--361", address = "Heidelberg", publisher = "Springer", url = "http://arxiv.org/pdf/1205.5975v1" }

abstractwebPDFbibtexWe present a prototypical linear algebra compiler that automatically exploits domain-specific knowledge to generate high-performance algorithms. The input to the compiler is a target equation together with knowledge of both the structure of the problem and the properties of the operands. The output is a variety of high-performance algorithms to solve the target equation. Our approach consists in the decomposition of the input equation into a sequence of library-supported kernels. Since in general such a decomposition is not unique, our compiler returns not one but a number of algorithms. The potential of the compiler is shown by means of its application to a challenging equation arising within the*genome-wide association study*. As a result, the compiler produces multiple "best" algorithms that outperform the best existing libraries.**A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW**Lecture Notes in Computer Science, 2013.

Accepted.@inproceedings{Berljafa2013:300, author = "Mario Berljafa and Edoardo {Di Napoli}", title = "A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW ", booktitle = "Lecture Notes in Computer Science", year = 2013, note = "Accepted" }

abstractbibtexIn one of the most important methods in Density Functional Theory -- the Full- Potential Linearized Augmented Plane Wave (FLAPW) method -- dense generalized eigenproblems are organized in long sequences. Moreover each eigenproblem is strongly correlated to the next one in the sequence. We propose a novel approach which ex- ploits such correlation through the use of an eigensolver based on subspace iteration and accelerated with Chebyshev polynomials. The resulting solver, parallelized using the Elemental library framework, achieves excellent scalability and is competitive with current dense parallel eigensolvers.

### Thesis

**A scalable, linear-time dynamic cutoff algorithm for molecular simu- lations of interfacial systems**RWTH Aachen University, 2013.@mastersthesis{Springer2013:970, author = "Paul Springer", title = "A scalable, linear-time dynamic cutoff algorithm for molecular simu- lations of interfacial systems", school = "RWTH Aachen University", year = 2013, institution = "RWTH Aachen University", url = "http://arxiv.org/pdf/1502.03234v1" }

abstractPDFbibtexThis master thesis introduces the idea of dynamic cutoffs in molecular dynamics simulations, based on the distance between particles and the interface, and presents a solution for detecting interfaces in real-time. Our dynamic cutoff method (DCM) exhibits a linear-time complexity as well as nearly ideal weak and strong scaling. The DCM is tailored for massively parallel architectures and for large interfacial systems with millions of particles. We implemented the DCM as part of the LAMMPS open-source molecular dynamics package and demonstrate the nearly ideal weak- and strong-scaling behavior of this method on an IBM BlueGene/Q supercomputer. Our results for a liquid/vapor system consisting of Lennard-Jones particles show that the accuracy of DCM is comparable to that of the traditional particle-particle particle- mesh (PPPM) algorithm. The performance comparison indicates that DCM is preferable for large systems due to the limited scaling of FFTs within the PPPM algorithm. Moreover, the DCM requires the interface to be identified every other MD timestep. As a consequence, this thesis also presents an interface detection method which is (1) applicable in real time; (2) parallelizable; and (3) scales linearly with respect to the number of particles.

### Technical Reports

**FLAME Derivation of the General Minimal Residual Method**Texas Advanced Computing Center, The University of Texas at Austin, November 2013.@techreport{Bientinesi2013:760, author = "Paolo Bientinesi and Victor Eijkhout", title = "FLAME Derivation of the General Minimal Residual Method", institution = "Texas Advanced Computing Center, The University of Texas at Austin", year = 2013, month = nov, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/gmres.pdf" }

abstractPDFbibtexThe Conjugate Gradient method is found by constructing a sequence of residuals R that spans the Krylov space K and that is orthogonal. The GMRES method constructs another basis of the Krylov space, but the residuals S now satisfy 1) the residual is of minimum L2 length in each iteration, or equivalent, 2) the residuals are A-orthogonal. While it would be easy to construct the A-orthogonal basis of K, and in fact several methods do so, this construction can break down. For this reason, Saad and Schultz [2] proposed using a breakdown-free process, such as Arnoldi iteration, and constructing the basis by appropriate linear combinations.**Streaming Data from HDD to GPUs for Sustained Peak Performance**Aachen Institute for Computational Engineering Science, RWTH Aachen, February 2013.

Technical Report AICES-2013/02-1.@techreport{Beyer2013:398, author = "Lucas Beyer and Paolo Bientinesi", title = "Streaming Data from HDD to GPUs for Sustained Peak Performance", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2013, month = feb, note = "Technical Report AICES-2013/02-1" }

abstractwebbibtex**OpenACC - A Step Towards Heterogeneous Computing**RWTH Aachen University, 2013.

Seminar project.@techreport{Springer2013:808, author = "Paul Springer", title = "OpenACC - A Step Towards Heterogeneous Computing", institution = "RWTH Aachen University", year = 2013, note = "Seminar project", url = "http://hpac.rwth-aachen.de/people/springer/openacc_seminar.pdf" }

abstractPDFbibtexWith the fast growing number of heterogeneous supercomputers, consisting of massively parallel coprocessors attached to multi-core processors, it becomes increasingly important to program these heterogeneous systems in a productive manner. Programming these coprocessors through low-level APIs such as CUDA or OpenCL is often a tedious task and may result in poor productivity. OpenACC tries to overcome this drawback by allowing programmers to annotate C/C++ or Fortran code with directives which are then translated into accelerator-specific code by the compiler. This paper summarizes OpenACC’s features, its limitations and possible future directions. Moreover, I will present two case studies where I evaluate OpenACC’s performance and productivity in comparison to CUDA and OpenMP.

## 2012

### Journal Articles

**Modeling Performance through Memory-Stalls**ACM SIGMETRICS Performance Evaluation Review, Volume 40(2), pp. 86-91, October 2012.@article{Iakymchuk2012:120, author = "Roman Iakymchuk and Paolo Bientinesi", title = "Modeling Performance through Memory-Stalls", journal = "ACM SIGMETRICS Performance Evaluation Review", year = 2012, volume = 40, number = 2, pages = "86--91", month = oct, publisher = "ACM", address = "New York, NY, USA", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/pmbs.pdf" }

abstractwebPDFbibtexWe aim at modeling the performance of linear algebra algorithms without executing either the algorithms or any parts of them. The performance of an algorithm can be expressed in terms of the time spent on CPU execution and memory-stalls. The main concern of this paper is to build analytical models to accurately predict memory-stalls. The scenario in which data resides in the L2 cache is considered; with this assumption, only L1 cache misses occur. We construct an analytical formula for modeling the L1 cache misses of fundamental linear algebra operations such as those included in the Basic Linear Algebra Subprograms (BLAS) library. The number of cache misses occurring in higher-level algorithms--like a matrix factorization--is then predicted by combining the models for the appropriate BLAS subroutines. As case studies, we consider the LU factorization and GER--a BLAS operation and a building block for the LU factorization. We validate the models on both Intel and AMD processors, attaining remarkably accurate performance predictions.**Correlations in Sequences of Generalized Eigenproblems Arising in Density Functional Theory**Computer Physics Communications (CPC), Volume 183(8), pp. 1674-1682, August 2012.@article{Di_Napoli2012:160, author = "Edoardo {Di Napoli} and Stefan Bluegel and Paolo Bientinesi", title = "Correlations in Sequences of Generalized Eigenproblems Arising in Density Functional Theory", journal = "Computer Physics Communications (CPC)", year = 2012, volume = 183, number = 8, pages = "1674-1682", month = aug, url = "http://arxiv.org/pdf/1108.2594v1" }

abstractwebPDFbibtexDensity Functional Theory (DFT) is one of the most used {\em ab initio} theoretical frameworks in materials science. It derives the ground state properties of multi-atomic ensembles directly from the computation of their one-particle density \nr . In DFT-based simulations the solution is calculated through a chain of successive self-consistent cycles; in each cycle a series of coupled equations (Kohn-Sham) translates to a large number of generalized eigenvalue problems whose eigenpairs are the principal means for expressing \nr. A simulation ends when \nr\ has converged to the solution within the required numerical accuracy. This usually happens after several cycles, resulting in a process calling for the solution of many sequences of eigenproblems. In this paper, the authors report evidence showing unexpected correlations between adjacent eigenproblems within each sequence and suggest the investigation of an alternative computational approach: information extracted from the simulation at one step of the sequence is used to compute the solution at the next step. The implications are multiple: from increasing the performance of material simulations, to the development of a mixed direct-iterative solver, to modifying the mathematical foundations of the DFT computational paradigm in use, thus opening the way to the investigation of new materials.**Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures**Applied Mathematics and Computation, Volume 218(22), pp. 11279-11289, July 2012.@article{Aliaga2012:420, author = "Jose' Aliaga and Paolo Bientinesi and Davor Davidovic and Edoardo {Di Napoli} and Francisco Igual and {Enrique S.} Quintana-Orti", title = "Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures", journal = "Applied Mathematics and Computation", year = 2012, volume = 218, number = 22, pages = "11279-11289", month = jul, url = "http://arxiv.org/pdf/1111.6374v1" }

abstractwebPDFbibtexWe compare two approaches to compute a portion of the spectrum of dense symmetric definite generalized eigenproblems: one is based on the reduction to tridiagonal form, and the other on the Krylov-subspace iteration. Two large-scale applications, arising in molecular dynamics and materials science, are employed to investigate the contributions of the application, architecture, and parallelism of the method to the performance of the solvers. The experimental results on a state-of-the-art 8-core platform, equipped with a graphics processing unit (GPU), reveal that in real applications, iterative Krylov-subspace methods can be a competitive approach also for the solution of dense problems.**Deriving Dense Linear Algebra Libraries**Formal Aspects of Computing, pp. 1-13, January 2012.@article{Bientinesi2012:978, author = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Tyler Rhodes and Robert {van de Geijn} and Field {Van Zee}", title = "Deriving Dense Linear Algebra Libraries", journal = "Formal Aspects of Computing", year = 2012, pages = "1--13", month = jan, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/FormAspComp.pdf" }

abstractwebPDFbibtexStarting in the late 1960s computer scientists including Dijkstra and Hoare advocated goal-oriented program- ming and the formal derivation of algorithms. The chief impediment to realizing this for loop-based programs was that a priori determination of loop-invariants, a prerequisite for developing loops, was a task too complex for any but the simplest of operations. Around 2000, these techniques were for the first time successfully applied to the domain of high-performance dense linear algebra libraries. This has led to a multitude of papers, mostly published in the ACM Transactions for Mathematical Software, as well as a system for the mechanical derivation of algorithms and a high-performance linear algebra library, libflame, that includes more than a thousand vari- ants of algorithms for more than a hundred linear algebra operations. To our knowledge, this success story has unfolded with limited awareness on the part the formal methods community. This paper reports on ten years of experience and is meant to raise that awareness.

### Peer Reviewed Conference Publications

**Performance Modeling for Dense Linear Algebra**Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12), SCC '12, pp. 406-416, IEEE Computer Society, November 2012.@inproceedings{Peise2012:50, author = "Elmar Peise and Paolo Bientinesi", title = "Performance Modeling for Dense Linear Algebra", booktitle = "Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12)", year = 2012, series = "SCC '12", pages = "406--416", address = "Washington, DC, USA", month = nov, publisher = "IEEE Computer Society", url = "http://arxiv.org/pdf/1209.2364" }

abstractwebPDFbibtexIt is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors like target architecture, underlying libraries and even problem size; because of this, the accurate prediction of their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for BLAS and LAPACK libraries. This allows us to obtain predictions through evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in the ranking of algorithms but also in tuning their parameters.**OpenACC - First Experiences with Real-World Applications**Euro-Par 2012 Parallel Processing, August 2012.@inproceedings{Wienke2012:54, author = "Sandra Wienke and Paul Springer and Christian Terboven and Dieter {An Mey}", title = "OpenACC - First Experiences with Real-World Applications", booktitle = "Euro-Par 2012 Parallel Processing", year = 2012, address = "Rhodes Island, Greece", month = aug, url = "http://hpac.rwth-aachen.de/people/springer/OpenACC_first_experiences.pdf" }

abstractPDFbibtexToday's trend to use accelerators like GPGPUs in heterogeneous computer systems has entailed several low-level APIs for accelerator programming. However, programming these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level pro- gramming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives to offload loops and re- gions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to program accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.**An Example of Symmetry Exploitation for Energy-related Eigencomputations**INTERNATIONAL CONFERENCE OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING 2009: (ICCMSE 2009), AIP Conference Proceedings, Volume 1504, pp. 1134-1137, 2012.@inproceedings{Petschow2012:860, author = "Matthias Petschow and Edoardo {Di Napoli} and Paolo Bientinesi", title = "An Example of Symmetry Exploitation for Energy-related Eigencomputations", booktitle = "INTERNATIONAL CONFERENCE OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING 2009: (ICCMSE 2009)", year = 2012, volume = 1504, series = "AIP Conference Proceedings", pages = "1134--1137", url = "http://link.aip.org/link/doi/10.1063/1.4772126" }

abstractwebPDFbibtexOne of the most used approaches in simulating materials is the tight-binding approximation. When using this method in a material simulation, it is necessary to compute the eigenvalues and eigenvectors of the Hamiltonian describing the system. In general, the system possesses few explicit symmetries. Due to them, the problem has many degenerate eigenvalues. The ambiguity in choosing a orthonormal basis of the invariant subspaces, associated with degenerate eigenvalues, will result in eigenvectors which are not invariant under the action of the symmetry operators in matrix form. A meaningful computation of the eigenvectors needs to take those symmetries into account. A natural choice is a set of eigenvectors, which simultaneously diagonalizes the Hamiltonian and the symmetry matrices. This is possible because all the matrices commute with each other. The simultaneous eigenvectors and the corresponding eigenvalues will be in a parametrized form in terms of the lattice momentum components. This functional dependence of the eigenvalues is the dispersion relation and describes the band structure of a material. Therefore it is important to find this functional dependence in any numerical computation related to material properties.

### Theses

**Exploiting Graphics Accelerators for Computational Biology**Aachen Institute for Computational Engineering Science, RWTH Aachen, June 2012.webPDFbibtex@mastersthesis{Beyer2012:400, author = "Lucas Beyer", title = "Exploiting Graphics Accelerators for Computational Biology", school = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = jun, url = "http://www.aices.rwth-aachen.de:8080/aices/preprint/documents/AICES-2012-06-01.pdf" }

**Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms**AICES, RWTH Aachen, May 2012.@mastersthesis{Peise2012:168, author = "Elmar Peise", title = "Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms", school = "AICES, RWTH Aachen", year = 2012, month = may, url = "http://arxiv.org/pdf/1207.5217v3" }

abstractwebPDFbibtexA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on computing architecture, libraries and problem size, each variant attains a different performances. We propose methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, we identify the routines upon which the algorithms are built. A first tool - the Sampler - measures the performance of these routines. Using the Sampler, a second tool models their performance. The generated models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow us to correctly rank the algorithms according to their performance without executing them. With the help of the same tools, algorithmic parameters such as block-size can be optimally tuned.

### Technical Reports

**Improved Orthogonality for Dense Hermitian Eigensolvers based on the MRRR algorithm**Aachen Institute for Computational Engineering Science, RWTH Aachen, September 2012.

Technical Report AICES-2012/09-1.@techreport{Petschow2012:754, author = "Matthias Petschow and {Enrique S.} Quintana-Orti and Paolo Bientinesi", title = "Improved Orthogonality for Dense Hermitian Eigensolvers based on the MRRR algorithm", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = sep, note = "Technical Report AICES-2012/09-1", url = "http://www.aices.rwth-aachen.de:8080/%7Epetschow/pubs/mixed.pdf" }

abstractPDFbibtexThe dense Hermitian eigenproblem is of outstanding importance in numerical computations and a number of excellent algorithms for this problem exist. One of the fastest method is the MRRR algorithm, which is based on a reduction to real tridiagonal form. This approach, although fast, does not deliver the same accuracy (orthogonality) as competing methods like the Divide-and-Conquer or the QR algorithm. In this paper, we demonstrate how the use of mixed precisions in MRRR-based eigensolvers leads to improved orthogonality. At the same time, when compared to the classical use of the MRRR algorithm, our approach comes with no or only limited performance penalty, increases the robustness, and improves scalability.**High-Throughput Genome-Wide Association Analysis for Single and Multiple Phenotypes**Aachen Institute for Computational Engineering Science, RWTH Aachen, July 2012.

Technical Report AICES-2012/07-1.@techreport{Fabregat-Traver2012:20, author = "Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "High-Throughput Genome-Wide Association Analysis for Single and Multiple Phenotypes", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = jul, note = "Technical Report AICES-2012/07-1", url = "http://arxiv.org/abs/1207.2169" }

abstractPDFbibtexThe variance component tests used in genomewide association studies of thousands of individuals become computationally exhaustive when multiple traits are analysed in the context of omics studies. We introduce two high-throughput algorithms -- CLAK-CHOL and CLAK-EIG -- for single and multiple phenotype genome-wide association studies (GWAS). The algorithms, generated with the help of an expert system, reduce the computational complexity to the point that thousands of traits can be analyzed for association with millions of polymorphisms in a course of days on a standard workstation. By taking advantage of problem specific knowledge, CLAK-CHOL and CLAK-EIG significantly outperform the current state-of-the-art tools in both single and multiple trait analysis.**A Study of Productivity and Performance of Modern Vector Processors**RWTH Aachen University, March 2012.

Bachelor Thesis.@techreport{Springer2012:818, author = "Paul Springer", title = "A Study of Productivity and Performance of Modern Vector Processors", institution = "RWTH Aachen University", year = 2012, month = mar, note = "Bachelor Thesis", url = "http://hpac.rwth-aachen.de/people/springer/bachelor_thesis.pdf" }

abstractPDFbibtexThis bachelor thesis carries out a case study describing the performance and productivity of modern vector processors such as graphics processing units (GPUs) and central processing units (CPUs) based on three different computational routines arising from a magnetoencephalography application. I apply different programming paradigms to these routines targeting either the CPU or the GPU. Furthermore, I investigate the performance and productivity of programming paradigms such as OpenMP with respect to its auto-vectorization capabilities, Intel intrinsic AVX and Intel OpenCL for the CPU. Moreover, I examine NVIDIA's CUDA and OpenCL APIs for GPU-sided applications. The results of the performed case study yield roughly the same performances for the CPU and GPU implementations, but favour the OpenMP paradigm (i.e. the CPU) with respect to productivity.

## 2011

### Journal Articles

**MR3-SMP: A Symmetric Tridiagonal Eigensolver for Multi-Core Architectures**Parallel Computing, Volume 37(12), December 2011.@article{Petschow2011:254, author = "Matthias Petschow and Paolo Bientinesi", title = "MR3-SMP: A Symmetric Tridiagonal Eigensolver for Multi-Core Architectures", journal = "Parallel Computing", year = 2011, volume = 37, number = 12, month = dec, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/MR3-SMP-TR.pdf" }

abstractwebPDFbibtexThe computation of eigenvalues and eigenvectors of symmetric tridiagonal matrices arises frequently in applications; often as one of the steps in the solution of Hermitian and symmetric eigenproblems. While several accurate and efficient methods for the tridiagonal eigenproblem exist, their corresponding implementations usually target uni-processors or large distributed memory systems. Our new eigensolver MR3-SMP is instead specifically designed for multi-core and many-core general purpose processors, which today have effectively replaced uni-processors. We show that in most cases MR3-SMP is faster and achieves better speedups than state-of-the-art eigensolvers for uni-processors and distributed-memory systems.**Condensed Forms for the Symmetric Eigenvalue Problem on Multi-threaded Architectures**Concurrency and Computation: Practice and Experience, Volume 23(7), May 2011.@article{Bientinesi2011:754, author = "Paolo Bientinesi and Francisco Igual and Daniel Kressner and Matthias Petschow and {Enrique S.} Quintana-Orti", title = "Condensed Forms for the Symmetric Eigenvalue Problem on Multi-threaded Architectures", journal = "Concurrency and Computation: Practice and Experience", year = 2011, volume = 23, number = 7, month = may, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/red-multi-TR.pdf" }

abstractwebPDFbibtexWe investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multi-core processors. In response to the advances of hardware accelerators, we also modify the code in the SBR toolbox to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). The performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core and many-core architectures.**Goal-Oriented and Modular Stability Analysis**SIAM Journal on Matrix Analysis and Applications, Volume 32(1), pp. 286 - 308, March 2011.@article{Bientinesi2011:810, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Goal-Oriented and Modular Stability Analysis", journal = "SIAM Journal on Matrix Analysis and Applications", year = 2011, volume = 32, number = 1, pages = "286 -- 308", month = mar, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/Stability-74105.pdf" }

abstractwebPDFbibtexWe introduce a methodology for obtaining inventories of error results for families of numerical dense linear algebra algorithms. The approach for deriving the analyses is goal-oriented, systematic, and layered. The presentation places the analysis side-by-side with the algorithm so that it is obvious where roundoff error is introduced. The approach supports the analysis of more complex algorithms, such as the blocked LU factorization. For this operation we derive a tighter error bound than has been previously reported in the literature.

### Peer Reviewed Conference Publications

**Execution-Less Performance Modeling**Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11), November 2011.webbibtex@inproceedings{Iakymchuk2011:258, author = "Roman Iakymchuk and Paolo Bientinesi", title = "Execution-Less Performance Modeling", booktitle = "Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11)", year = 2011, address = "Seattle, USA", month = nov, institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University" }

**Juggrnaut – An Abstract JVM**International Conference on Formal Verification of Object-Oriented Software, pp. 142-159, 2011.@inproceedings{Heinen2011:370, author = "Jonathan Heinen and Henrik Barthels and Christina Jansen", title = "Juggrnaut – An Abstract JVM", booktitle = "International Conference on Formal Verification of Object-Oriented Software", year = 2011, pages = "142--159", organization = "Springer", url = "http://hpac.rwth-aachen.de/~barthels/publications/juggrnaut.pdf" }

abstractPDFbibtexWe introduce a new kind of hypergraphs and hyperedge replacement grammars, where nodes are associated types. We use them to adapt the abstraction framework Juggrnaut presented by us in [7, 8] – for the verification of Java Bytecode programs. The framework is extended to handle additional concepts needed for the analysis of Java Bytecode like null pointers and method stacks as well as local and static variables. We define the abstract transition rules for a significant subset of opcodes and show how to compute the abstract state space. Finally we complete the paper with some experimental results.**High-Performance Parallel Computations using Python as High-Level Language**Euro-Par 2010, Parallel Processing Workshops, Lecture Notes in Computer Science, Volume 6586, pp. 541-548, Springer, 2011.@inproceedings{Masini2011:8, author = "Stefano Masini and Paolo Bientinesi", title = "High-Performance Parallel Computations using Python as High-Level Language", booktitle = "Euro-Par 2010, Parallel Processing Workshops", year = 2011, editor = "Guarracino, M.R.; Vivien, F.; Traff, J.L.; Cannataro, M.; Danelutto, M.; Hast, A.; Perla, F.; KnÃ�ï¿½Ã¯Â¿Â½Ã�ï¿½Ã�Â¼pfer, A.; Di Martino, B.; Alexander, M. ", volume = 6586, series = "Lecture Notes in Computer Science", pages = "541--548", publisher = "Springer", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/Python-TR.pdf" }

abstractwebPDFbibtexHigh-performance and parallel computations have always represented a challenge in terms of code optimization and memory usage, and have typically been tackled with languages that allow a low-level management of resources, like Fortran, C and C++. Nowadays, most of the implementation effort goes into constructing the bookkeeping logic that binds together functionalities taken from standard libraries. Because of the increasing complexity of this kind of codes, it becomes more and more necessary to keep it well organized through proper software engineering practices. Indeed, in the presence of chaotic implementations, reasoning about correctness is difficult, even when limited to specific aspects like concurrency; moreover, due to the lack in flexibility of the code, making substantial changes for experimentation becomes a grand challenge. Since the bookkeeping logic only accounts for a tiny fraction of the total execution time, we believe that for such a task it can be afforded to introduce an overhead due to a high-level language. We consider Python as a preliminary candidate with the intent of improving code readability, flexibility and, in turn, the level of confidence with respect to correctness. In this study, the bookkeeping logic of SMP-MRRR, a C & Fortran highly optimized multi-core eigensolver, is ported to Python. We report here on the porting process and on the pros and cons of using a high-level language in a high-performance parallel library.**Improving High-Performance Computations on Clouds Through Resource Underutilization**Proceedings of ACM 26th Symposium on Applied Computing, pp. 119-126, ACM, 2011.@inproceedings{Iakymchuk2011:404, author = "Roman Iakymchuk and Jeff Napper and Paolo Bientinesi", title = "Improving High-Performance Computations on Clouds Through Resource Underutilization", booktitle = "Proceedings of ACM 26th Symposium on Applied Computing", year = 2011, pages = "119--126 ", address = "Taichung, Taiwan", publisher = "ACM" }

abstractwebbibtexWe investigate the effects of shared resources for high-performance computing in a commercial cloud environment where multiple virtual machines share a single hardware node. Although good performance is occasionally obtained, contention degrades the expected performance and introduces significant variance. Using the DGEMM kernel and the HPL benchmark, we show that the underutilization of resources considerably improves expected performance by reducing contention for the CPU and cache space. For instance, for some cluster configurations, the solution is reached almost an order of magnitude earlier on average when the available resources are underutilized. The performance benefits for single node computations are even more impressive: Underutilization improves the expected execution time by two orders of magnitude. Finally, in contrast to unshared clusters, extending underutilized clusters by adding more nodes often improves the execution time due to an increased parallelism even with a slow interconnect. In the best case, by underutilizing the nodes performance was improved enough to entirely offset the cost of an extra node in the cluster.**Performance Prediction through Time Measurements**The Proceedings of the First International Conference on High-Performance Computing (HPC-UA 2011), 2011.@inproceedings{Iakymchuk2011:330, author = "Roman Iakymchuk", title = "Performance Prediction through Time Measurements", booktitle = "The Proceedings of the First International Conference on High-Performance Computing (HPC-UA 2011)", year = 2011, institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University", url = "http://hpac.rwth-aachen.de/~iakymchuk/pub/hpc_ua.pdf" }

abstractPDFbibtexIn this article we address the problem of predicting performance of linear algebra algorithms for small matrices. This approach is based on reducing the performance prediction to modeling the execution time of algorithms. The execution time of higher level algorithms like the LU factorization is predicted through modeling the computational time of the kernel linear algebra operations such as the BLAS subroutines. As the time measurements confirmed, the execution time of the BLAS subroutines has a piecewise-polynomial behavior. Therefore, the subroutines time is modeled by conducting only few samples and then applying polynomial interpolation. The validation of the approach is established by comparing the predicted execution time of the unblocked LU factorization, which is built on top of two BLAS subroutines, with the separately measured one. The applicability of the approach is illustrated through performance experiments on Intel and AMD processors.**Branch and Bound for Optimal Jacobian Accumulation**Proceedings of the Fifth SIAM Workshop on Combinatorial Scientific Computing (CSC11), pp. 73-76, 2011.bibtex@inproceedings{Mosenkis2011:868, author = "Viktor Mosenkis and Elmar Peise and Uwe Naumann", title = "Branch and Bound for Optimal Jacobian Accumulation", booktitle = "Proceedings of the Fifth SIAM Workshop on Combinatorial Scientific Computing (CSC11)", year = 2011, pages = "73-76" }

**Automatic Generation of Loop-Invariants for Matrix Operations**Computational Science and its Applications, International Conference, pp. 82-92, IEEE Computer Society, 2011.@inproceedings{Fabregat-Traver2011:238, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Automatic Generation of Loop-Invariants for Matrix Operations", booktitle = "Computational Science and its Applications, International Conference", year = 2011, pages = "82-92", address = "Los Alamitos, CA, USA", publisher = "IEEE Computer Society", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2011-02-01.pdf" }

abstractwebPDFbibtexIn recent years it has been shown that for many linear algebra operations it is possible to create families of algorithms following a very systematic procedure. We do not refer to the fine tuning of a known algorithm, but to a methodology for the actual generation of both algorithms and routines to solve a given target matrix equation. Although systematic, the methodology relies on complex algebraic manipulations and non-obvious pattern matching, making the procedure challenging to be performed by hand, our goal is the development of a fully automated system that from the sole description of a target equation creates multiple algorithms and routines. We present CL1ck, a symbolic system written in Mathematica, that starts with an equation, decomposes it into multiple equations, and returns a set of loop-invariants for the algorithms -- yet to be generated -- that will solve the equation. In a successive step each loop-invariant is then mapped to its corresponding algorithm and routine. For a large class of equations, the methodology generates known algorithms as well as many previously unknown ones. Most interestingly, the methodology unifies algorithms traditionally developed in isolation. As an example, the five well known algorithms for the LU factorization are for the first time unified under a common root.**Knowledge-Based Automatic Generation of Partitioned Matrix Expressions**Computer Algebra in Scientific Computing, Lecture Notes in Computer Science, Volume 6885, pp. 144-157, Springer, 2011.@inproceedings{Fabregat-Traver2011:54, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Knowledge-Based Automatic Generation of Partitioned Matrix Expressions", booktitle = "Computer Algebra in Scientific Computing", year = 2011, editor = "Gerdt, Vladimir and Koepf, Wolfram and Mayr, Ernst and Vorozhtsov, Evgenii", volume = 6885, series = "Lecture Notes in Computer Science", pages = "144-157", address = "Heidelberg", publisher = "Springer", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2011-01-03.pdf" }

abstractwebPDFbibtexIn a series of papers it has been shown that for many linear algebra operations it is possible to generate families of algorithms by following a systematic procedure. Although powerful, such a methodology involves complex algebraic manipulation, symbolic computations and pattern matching, making the generation a process challenging to be performed by hand. We aim for a fully automated system that from the sole description of a target operation creates multiple algorithms without any human intervention. Our approach consists of three main stages. The first stage yields the core object for the entire process, the Partitioned Matrix Expression (PME), which establishes how the target problem may be decomposed in terms of simpler sub-problems. In the second stage the PME is inspected to identify predicates, the Loop-Invariants, to be used to set up the skeleton of a family of proofs of correctness. In the third and last stage the actual algorithms are constructed so that each of them satisfies its corresponding proof of correctness. In this paper we focus on the first stage of the process, the automatic generation of Partitioned Matrix Expressions. In particular, we discuss the steps leading to a PME and the knowledge necessary for a symbolic system to perform such steps. We also introduce Cl1ck, a prototype system written in Mathematica that generates PMEs automatically.

### Other

**Automata-Based Detection of Hypergraph Embeddings**Bachelor's Thesis, RWTH Aachen, September 2011.@misc{Barthels2011:804, author = "Henrik Barthels", title = "Automata-Based Detection of Hypergraph Embeddings", month = sep, year = 2011, type = "Bachelor's Thesis, RWTH Aachen", url = "http://hpac.rwth-aachen.de/~barthels/publications/bachelor_thesis.pdf" }

abstractPDFbibtexThe verification of programs using pointers and dynamic data structures requires to deal with potentially infinite state spaces. Because of that, it is reasonable to use abstraction techniques capable of dealing with those potentially infinite structures. The Juggrnaut framework applies hyperedge replacement grammars to dynamically abstract and concretize parts of a heap. Abstraction is done by the backwards application of grammar rules, which is related to subgraph isomorphism and therefore NP-complete. In this thesis, after giving a formal definition to hypergraphs, hyperedge replacement and heap representation, an automata model is introduced which is able to detect embeddings of grammar rules with certain properties efficiently. We provide an algorithm to construct an automaton that is able to detect a given set of embeddings of grammar rules. Finally, proofs of the NP-completeness of subgraph isomorphism on hypergraphs and embedding detection in general are presented.

### Technical Reports

**Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures**Aachen Institute for Computational Engineering Science, RWTH Aachen, November 2011.

Technical Report AICES-2011/11-3.@techreport{Aliaga2011:348, author = "Jose' Aliaga and Paolo Bientinesi and Davor Davidovic and Edoardo {Di Napoli} and Francisco Igual and {Enrique S.} Quintana-Orti", title = "Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2011, month = nov, note = "Technical Report AICES-2011/11-3" }

abstractbibtex**Berkeley's Dwarfs on CUDA**RWTH Aachen University, 2011.

Seminar Project.

## 2010

### Journal Article

**Sparse Direct Factorizations through Unassembled Hyper-Matrices**Computer Methods in Applied Mechanics and Engineering, Volume 199(9), pp. 430-438, January 2010.@article{Bientinesi2010:700, author = "Paolo Bientinesi and Victor Eijkhout and Kyungjoo Kim and Jason Kurtz and Robert {van de Geijn}", title = "Sparse Direct Factorizations through Unassembled Hyper-Matrices", journal = "Computer Methods in Applied Mechanics and Engineering", year = 2010, volume = 199, number = 9, pages = "430--438", month = jan, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/UHM-TR-09.pdf" }

abstractwebPDFbibtexWe set out to efficiently compute the solution of a sequence of linear systems Aixi = bi, where the matrix Ai is tightly related to matrix Ai-1. In the setting of an hp-adaptive Finite Element Method, the sequence of matrices Ai results from successive local refinements of the problem domain. At any step i > 1, a factorization already exists and it is the updated linear system relative to the refined mesh for which a factorization must be computed in the least amount of time. This observation holds the promise of a tremendous reduction in the cost of an individual refinement step. We argue that traditional matrix storage schemes, whether dense or sparse, are a bottleneck, limiting the potential efficiency of the solvers. We propose a new hierarchical data structure, the Unassembled Hyper-Matrix (UHM), which allows the matrix to be stored as a tree of unassembled element matrices, hierarchically ordered to mirror the refinement history of the physical domain. The factorization of such an UHM proceeds in terms of element matrices, only assembling nodes when they need to be eliminated. Efficiency comes in terms of both performance and space requirements.

### Book Chapter

**HPC on Competitive Cloud Resources**In Handbook of Cloud Computing, pp. 493-516, Springer, 2010.PDFbibtex@inbook{Bientinesi2010:54, author = "Paolo Bientinesi and Roman Iakymchuk and Jeff Napper", title = "HPC on Competitive Cloud Resources", pages = "493--516", publisher = "Springer", year = 2010, editor = "Borko Furht and Armando Escalante", booktitle = "In Handbook of Cloud Computing", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University", url = "http://hpac.rwth-aachen.de/~iakymchuk/pub/AICES-2010-06-01.pdf" }

### Peer Reviewed Conference Publications

**Matrix Structure Exploitation in Generalized Eigenproblems Arising in Density Functional Theory**ICNAAM 2010: International Conference of Numerical Analysis and Applied Mathematics 2010, AIP Conference Proceedings, Volume 1281, pp. 937-940, American Institute of Physics, 2010.@inproceedings{Di_Napoli2010:830, author = "Edoardo {Di Napoli} and Paolo Bientinesi", title = "Matrix Structure Exploitation in Generalized Eigenproblems Arising in Density Functional Theory", booktitle = "ICNAAM 2010: International Conference of Numerical Analysis and Applied Mathematics 2010", year = 2010, volume = 1281, series = "AIP Conference Proceedings", pages = "937--940", publisher = "American Institute of Physics", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/ICNAAM-Edo-TR.pdf" }

abstractwebPDFbibtexIn this short paper, the authors report a new computational approach in the context of Density Functional Theory (DFT). It is shown how it is possible to speed up the self-consistent cycle (iteration) characterizing one of the most well-known DFT implementations: FLAPW. Generating the Hamiltonian and overlap matrices and solving the associated generalized eigenproblems A x = l B x constitute the two most time-consuming fractions of each iteration. Two promising directions, implementing the new methodology, are presented that will ultimately improve the performance of the generalized eigensolver and save computational time.**Benchmarking Different Direct Solution Methods For Large Power System Simulation**Grand Challenges in Modeling & Simulation, SummerSIM'10, pp. 280-284, 2010.bibtex@inproceedings{Benigni2010:654, author = "Andrea Benigni and Paolo Bientinesi and Antonello Monti", title = "Benchmarking Different Direct Solution Methods For Large Power System Simulation", booktitle = "Grand Challenges in Modeling & Simulation, SummerSIM'10", year = 2010, pages = "280--284 " }

**Towards Mechanical Derivation of Krylov Solver Libraries**Proceedings of the ICCS2010, Procedia Computer Science, Volume 1(1), pp. 1805-1813, 2010.@inproceedings{Eijkhout2010:480, author = "Victor Eijkhout and Paolo Bientinesi and Robert {van de Geijn}", title = "Towards Mechanical Derivation of Krylov Solver Libraries", year = 2010, volume = 1, number = 1, series = "Procedia Computer Science", pages = "1805--1813", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/Krylov-TR.pdf" }

abstractwebPDFbibtexIn a series of papers, it has been shown that algorithms for dense linear algebra operations can be systematically and even mechanically derived from the mathematical specification of the operation. A frequent question has been whether the methodology can be broadened to iterative methods. In this paper, we show preliminary evidence that this is indeed the case for so-called Krylov subspace methods. Our aims with this are two-fold: first of all, we show how the FLAME paradigm can simplify the derivation of subspace iteration methods. In view of this, the fact that we only derive the classical conjugate gradient method should be viewed as a promise for the future, rather than as a limitation of this approach. Secondly, and more importantly, our use of FLAME shows how mechanical reasoning can derive full algorithm specifications from the constraints (for instance, orthogonality conditions) on the generated results. If we tie this to ongoing research in automatic optimized code generated in FLAME, we see that this research is a necessary building block towards automatic generation of optimized library software.**On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors**Proceedings of the PPAM 2009, Eighth International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Volume 6067, Springer, 2010.@inproceedings{Lessig2010:138, author = "Christian Lessig and Paolo Bientinesi", title = "On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors", year = 2010, volume = 6067, series = "Lecture Notes in Computer Science", publisher = "Springer", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/MRRR-GPU.pdf" }

abstractwebPDFbibtexThe eigenvalues and eigenvectors of a symmetric matrix are needed in a myriad of applications in computational engineering and computational science. One of the fastest and most accurate eigensolvers is the Algorithm of Multiple Relatively Robust Representations (MRRR). This is the first stable algorithm that computes k eigenvalues and eigenvectors of a tridiagonal symmetric matrix in O(nk) time. We present a parallelization of the MRRR algorithm for data parallel coprocessors using the CUDA programming environment. The results clearly demonstrate the potential of data-parallel coprocessors for scientific computations: when comparing against routine sstemr, LAPACKâ��s implementation of MRRR, our parallel algorithm provides 10-fold speedups.**Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures**Proceedings of the PPAM 2009, Eighth International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Volume 6067, Springer, 2010.@inproceedings{Bientinesi2010:970, author = "Paolo Bientinesi and Francisco Igual and Daniel Kressner and {Enrique S.} Quintana-Orti", title = "Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures", year = 2010, volume = 6067, series = "Lecture Notes in Computer Science", publisher = "Springer" }

abstractwebbibtexWe investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multi-core processors. In response to the advances of hardware accelerators, we also modify the code in SBR to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). Performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core architectures.**The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors**Proceedings of the PARA 2010, Reykjavik, Iceland, June 6-9, 2010, Lecture Notes in Computer Science, Volume 7133, pp. 152-161, Springer, 2010.@inproceedings{Petschow2010:754, author = "Matthias Petschow and Paolo Bientinesi", title = "The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors", year = 2010, editor = "Jonasson, Kristjan", volume = 7133, series = "Lecture Notes in Computer Science", pages = "152--161", publisher = "Springer", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2010-09-04.pdf" }

abstractwebPDFbibtexThe algorithm of Multiple Relatively Robust Representations (MRRR or MR3) computes k eigenvalues and eigenvectors of a symmetric tridiagonal matrix in O(nk) arithmetic operations. Large problems can be effectively tackled with existing distributed-memory parallel implementations of MRRR; small and medium size problems can instead make use of LAPACK's routine xSTEMR. However, xSTEMR is optimized for single-core CPUs, and does not take advantage of today's multi-core and future many-core architectures. In this paper we discuss some of the issues and trade-offs arising in the design of MR3-SMP, an algorithm for multi-core CPUs and SMP systems. Experiments on application matrices indicate that MR3-SMP is both faster and obtains better speedups than all the tridiagonal eigensolvers included in LAPACK and Intel's Math Kernel Library (MKL).**Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.**Proceedings of the 8th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2010), AIP Conference Proceedings, Volume 1281, pp. 774-777, 2010.@inproceedings{Fabregat-Traver2010:304, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.", year = 2010, volume = 1281, series = "AIP Conference Proceedings", pages = "774-777", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2010-10-01.pdf" }

abstractwebPDFbibtexWe target the automatic generation of formally correct algorithms and routines for linear algebra operations. Given the broad variety of architectures and configurations with which scientists deal, there does not exist one algorithmic variant that is suitable for all scenarios. Therefore, we aim to generate a family of algorithmic variants to attain high-performance for a broad set of scenarios. One of the authors has previously demonstrated that automatic derivation of a family of algorithms is possible when the Partitioned Matrix Expression (PME) of the target operation is available. The PME is a recursive definition that states the relations between submatrices in the input and the output operands. In this paper we describe all the steps involved in the automatic derivation of PMEs, thus making progress towards a fully automated system.

### Other

**HPC Sharing in the Cloud**August 2010.@misc{Napper2010:958, author = "Jeff Napper and Roman Iakymchuk and Paolo Bientinesi", title = "HPC Sharing in the Cloud", howpublished = "HPCWire.com", month = aug, year = 2010 }

abstractwebbibtexMany scientific applications can require significant coordination between large numbers of nodes. Massive effort is spent in the HPC arena to hide much of the latency from coordination with expensive low-latency networks and fine-tuned communication libraries. Such efforts have not yet been translated to the commercial cloud computing arena, which still typically provide systems with varying amounts of installed memory but rarely various quality interconnects.

### Technical Report

**Proof-Driven Derivation of Krylov Solver Libraries**Aachen Institute for Computational Engineering Science, RWTH Aachen, June 2010.

Technical Report AICES-2010/06-3.PDFbibtex@techreport{Eijkhout2010:0, author = "Victor Eijkhout and Paolo Bientinesi and Robert {van de Geijn}", title = "Proof-Driven Derivation of Krylov Solver Libraries ", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2010, month = jun, note = "Technical Report AICES-2010/06-3", url = "http://www.aices.rwth-aachen.de:8080/~pauldj/pubs/Krylov-TR.pdf" }

## 2009

### Journal Articles

**Effects of Pupil Discretization and Littrow Illumination in the Simulation of Bright-Field Defect Detection**Optics Letters, Volume 34(12), pp. 1840-1842, 2009.@article{Rafler2009:968, author = "Stephan Rafler and Matthias Petschow and Uwe Seifert and Karsten Frenner and Jens Goeckeritz and Wolfgang Osten", title = "Effects of Pupil Discretization and Littrow Illumination in the Simulation of Bright-Field Defect Detection", journal = "Optics Letters", year = 2009, volume = 34, number = 12, pages = "1840--1842" }

abstractwebbibtexThe simulation of optical defect detection on wafers by microscopy requires several approximations. For the simulation with Fourier modal methods, the number of modes is limited by memory and time consumption. Furthermore, the illumination pupil has to be divided into discrete incidence angles for plane waves. We present a way to save the computation of most of these incidence angles. It works only for certain configurations of grating period and illumination wavelength but is an accurate and robust approximation in these cases. We present sample calculations for one- and two-dimensional periodic structures with defects.**On an One-Step Modification of Gauss-Newton Method under Generalized Lipschitz Conditions for Solving the Nonlinear Least Squares Problem**PAMM. Vol. 9. Special issue: 80th Annual Meeting of the International Association of Applied Mathematics and Mechanics (GAMM), Gdansk, Volume 9(1), pp. 565-566, 2009.bibtex@article{Shakhno2009:638, author = "Stepan Shakhno and Roman Iakymchuk", title = "On an One-Step Modification of Gauss-Newton Method under Generalized Lipschitz Conditions for Solving the Nonlinear Least Squares Problem", journal = "PAMM. Vol. 9. Special issue: 80th Annual Meeting of the International Association of Applied Mathematics and Mechanics (GAMM), Gdansk", year = 2009, volume = 9, number = 1, pages = "565-566" }

**On a Secant Type Method for Nonlinear Least Squares Problems**Journal of Numerical and Applied Mathematics, Volume 97, pp. 112-121, 2009.bibtex@article{Shakhno2009:558, author = "Stepan Shakhno and Oleksandra Gnatyshyn and Roman Iakymchuk", title = "On a Secant Type Method for Nonlinear Least Squares Problems", journal = "Journal of Numerical and Applied Mathematics", year = 2009, volume = 97, pages = "112--121" }

### Peer Reviewed Conference Publications

**Can Cloud Computing Reach the TOP500?**UCHPC-MAW '09 Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pp. 17-20, 2009.

In conjunction with the 2009 ACM International Conference on Computing Frontiers.@inproceedings{Napper2009:804, author = "Jeff Napper and Paolo Bientinesi", title = "Can Cloud Computing Reach the TOP500?", booktitle = "UCHPC-MAW '09 Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop ", year = 2009, pages = "17--20", note = "In conjunction with the 2009 ACM International Conference on Computing Frontiers", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/EC2-Cloud.pdf" }

abstractwebPDFbibtexComputing as a utility has reached the mainstream. Scientists can now rent time on large commercial clusters through several vendors. The cloud computing model provides flexible support for "pay as you go" systems. In addition to no upfront investment in large clusters or supercomputers, such systems incur no maintenance costs. Furthermore, they can be expanded and reduced on-demand in real-time. Current cloud computing performance falls short of systems specifically designed for scientific applications. Scientific computing needs are quite different from those of web applications--composed primarily of database queries--that have been the focus of cloud computing vendors. In this paper we investigate the use of cloud computing for high-performance numerical applications. In particular, we assume unlimited monetary resources to answer the question, "How high can a cloud computing service get in the TOP500 list?" We show results for the Linpack benchmark on different allocations on Amazon EC2.**Low-Memory Tour Reversal in Directed Graphs**Combinatorial Scientific Computing, Dagstuhl Seminar Proceedings, Number 9061, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2009.@inproceedings{Mosenkis2009:928, author = "Viktor Mosenkis and Elmar Peise and Uwe Naumann", title = "Low-Memory Tour Reversal in Directed Graphs", booktitle = "Combinatorial Scientific Computing", year = 2009, number = 9061, series = "Dagstuhl Seminar Proceedings", address = "Dagstuhl, Germany", publisher = "Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany" }

abstractwebbibtexWe consider the problem of reversing a*tour*$(i_1, i_2, \ldots, i_l)$ in a directed graph $G = (V, E)$ with positive integer vertices $V$ and edges $E \subseteq V \times V$, where $i_j \in V$ and $(i_j, i_{j + 1}) \in E$ for all $j = 1, \ldots, l - 1$. The tour can be processed in last-in-first-out order as long as the size of the corresponding stack does not exceed the available memory. This constraint is violated in most cases when considering control-flow graphs of large-scale numerical simulation programs. The tour reversal problem also arises in adjoint programs used, for example, in the context of derivative-based nonlinear optimization, sensitivity analysis, or other, often inverse, problems. The intention is to compress the tour in order not to run out of memory. As the general optimal compression problem was proven to be NP-hard and big control-flow graphs results from loops in programs we do not want to use general purpose algorithms to compress the tour. We want rather to compress the tour by finding loops and replace the redundant information by proper representation of the loops.

## 2008

### Journal Articles

**Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix**ACM Transactions on Mathematical Software, Volume 35(1), July 2008.@article{Bientinesi2008:648, author = "Paolo Bientinesi and Brian Gunter and Robert {van de Geijn}", title = "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix", journal = "ACM Transactions on Mathematical Software", year = 2008, volume = 35, number = 1, month = jul, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/SPDInv.pdf" }

abstractwebPDFbibtexWe study the high-performance implementation of the inversion of a Symmetric Positive Definite (SPD) matrix on architectures ranging from sequential processors to Symmetric MultiProcessors to distributed memory parallel computers. This inversion is traditionally accomplished in three Ã¢??sweepsÃ¢??: a Cholesky factorization of the SPD matrix, the inversion of the resulting triangular matrix, and finally the multiplication of the inverted triangular matrix by its own transpose. We state different algorithms for each of these sweeps as well as algorithms that compute the result in a single sweep. One algorithm outperforms the current ScaLAPACK implementation by 20-30 percent due to improved load-balance on a distributed memory architecture.**Scalable Parallelization of FLAME Code via the Workqueuing Model**ACM Transactions on Mathematical Software, Volume 34(2), March 2008.@article{Van_Zee2008:598, author = "Field {Van Zee} and Paolo Bientinesi and Tze {Meng Low} and Robert {van de Geijn}", title = "Scalable Parallelization of FLAME Code via the Workqueuing Model", journal = "ACM Transactions on Mathematical Software", year = 2008, volume = 34, number = 2, month = mar }

abstractwebbibtexWe discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.

### Peer Reviewed Conference Publications

**Multi-dimensional Array Operations for Signal Processing Algorithms**PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, 2008.

To be published.@inproceedings{Bientinesi2008:640, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Multi-dimensional Array Operations for Signal Processing Algorithms", booktitle = "PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing", year = 2008, note = "To be published", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/para-2008.pdf" }

abstractPDFbibtexGame and graphics processors are increasingly utilized for scientific computing applications and for signal processing in particular. This paper addresses the issue of efficiently mapping high-dimensional array operations for signal processing algorithms onto such computing platforms. Algorithms for fast Fourier transforms and convolutions are essential to many signal processing applications. Such algorithms entail fast high-dimensional data array operations. Game and graphics processors typically differ from general-purpose processors in memory hierarchy as well as in memory capacity and access patterns. We characterize the memory structures from the perspective of high-dimensional array operations, identify the mismatch between algorithmic dimension and architectural dimension and then describe the consequent penalty in memory latency. An architecture supports d-dimensional accesses if a d-dimensional data array can be accessed along one dimension as fast as along any other dimension. We introduce an approach to reduce latency and provide experimental results on the STI Cell Broadband Engine as supporting evidence.**Fast Computation of Local Correlation Coefficients**Proc. SPIE 7074, Advanced Signal Processing Algorithms, Architectures, and Implementations XVIII, Number 707405, 2008.@inproceedings{Bientinesi2008:478, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Fast Computation of Local Correlation Coefficients", year = 2008, number = 707405, series = "Proc. SPIE 7074, Advanced Signal Processing Algorithms, Architectures, and Implementations XVIII" }

abstractwebbibtexThis paper presents an acceleration method, using both algorithmic and architectural means, for fast calculation of local correlation coefficients, which is a basic image-based information processing step for template or pattern matching, image registration, motion or change detection and estimation, compensation of changes, or compression of representations, among other information processing objectives. For real-time applications, the complexity in arithmetic operations as well as in programming and memory access latency had been a divisive issue between the so-called correction-based methods and the Fourier domain methods. In the presented method, the complexity in calculating local correlation coefficients is reduced via equivalent reformulation that leads to efficient array operations or enables the use of multi-dimensional fast Fourier transforms, without losing or sacrificing local and non-linear changes or characteristics. The computation time is further reduced by utilizing modern multi-core architectures, such as the Sony-Toshiba-IBM Cell processor, with high processing speed and low power consumption.**SuperMatrix: a Multithreaded Runtime Scheduling System for Algorithms-by-Blocks**Proceedings of ACM SIGPLAN 2008 Symposium on Principles and Practice of Parallel Programming (PPoPP'08), 2008.@inproceedings{Bientinesi2008:400, author = "Paolo Bientinesi and Ernie Chan and {Enrique S.} Quintana-Orti and Gregorio Quintana-Orti and Robert {van de Geijn}", title = "SuperMatrix: a Multithreaded Runtime Scheduling System for Algorithms-by-Blocks", booktitle = "Proceedings of ACM SIGPLAN 2008 Symposium on Principles and Practice of Parallel Programming (PPoPP'08)", year = 2008 }

abstractwebbibtexThis paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.**Investigation of Methods to Set Up the Normal Vector Field for the Differential Method**Proc. SPIE, Volume 6995, 2008.@inproceedings{Rafler2008:190, author = "Stephan Rafler and Peter Goetz and Matthias Petschow and Thomas Schuster and Wolfgang Osten", title = "Investigation of Methods to Set Up the Normal Vector Field for the Differential Method", booktitle = "Proc. SPIE", year = 2008, volume = 6995 }

abstractwebbibtexIn Fourier modal methods like the RCWA and the Differential Method the Li-rules for products in truncated Fourier space have to be obeyed in order to achieve good convergence of the results with respect to the mode number. The Li- rules have to be applied differently for parts of the field that are tangential and orthogonal to material boundaries. This is achieved in the Differential Method by including a field of vectors in the calculation that are normal to the material boundaries. The same can be done laterally in each layer of an RCWA calculation of a 2-D periodic structure. It turns out that discontinuities in the normal vector field can disturb the computation especially when metallic materials are dominant in the structure which would make the usefulness of the normal vector method questionable. So it is of great importance to investigate how normal vector fields can be established with as few discontinuities as possible. We present various methods for the 2-D RCWA and the 1-D and 2-D Differential Method and compare the respective convergence behaviors. Especially we emphasize methods that are automatic and require as few user input as possible.

### Thesis

**Differential Method for Arbitrary Anisotropic Periodic Media**University Stuttgart, 2008.

Diploma thesis (in German).

### Technical Report

**Automation in Dense Linear Algebra**Aachen Institute for Computational Engineering Science, RWTH Aachen, October 2008.

Technical Report AICES-2008-2.PDFbibtex@techreport{Bientinesi2008:748, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Automation in Dense Linear Algebra", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2008, month = oct, note = "Technical Report AICES-2008-2", url = "http://www.aices.rwth-aachen.de:8080/~pauldj/pubs/CC-2009.pdf" }

## 2007

### Journal Articles

**Quantum Deconstruction of 5D SQCD**Journal of High Energy Physics, Volume 2007(3), pp. 092, 2007.

ArXiv:hep-th/0611085.@article{Di_Napoli2007:504, author = "Edoardo {Di Napoli} and Vadim Kaplunovsky", title = "Quantum Deconstruction of 5D SQCD ", journal = "Journal of High Energy Physics", year = 2007, volume = 2007, number = 3, pages = 92, note = "ArXiv:hep-th/0611085" }

abstractwebbibtexWe deconstruct the fifth dimension of 5D SCQD with general numbers of colors and flavors and general 5D Chern-Simons level; the latter is adjusted by adding extra quarks to the 4D quiver. We use deconstruction as a non-stringy UV completion of the quantum 5D theory; to prove its usefulness, we compute quantum corrections to the SQCD_5 prepotential. We also explore the moduli/parameter space of the deconstructed SQCD_5 and show that for |K_CS| < N_F/2 it continues to negative values of 1/(g_5)^2. In many cases there are flop transitions connecting SQCD_5 to exotic 5D theories such as E0, and we present several examples of such transitions. We compare deconstruction to brane-web engineering of the same SQCD_5 and show that the phase diagram is the same in both cases; indeed, the two UV completions are in the same universality class, although they are not dual to each other. Hence, the phase structure of an SQCD_5 (and presumably any other 5D gauge theory) is inherently five-dimensional and does not depends on a UV completion.**Multi-parametric Quantum Algebras and the Cosmological Constant.**Advances in High Energy Physics, Volume 2007, pp. 13458, 2007.

ArXiv: hep-th/0511147.@article{Krishnan2007:300, author = "Chethan Krishnan and Edoardo {Di Napoli}", title = "Multi-parametric Quantum Algebras and the Cosmological Constant. ", journal = "Advances in High Energy Physics", year = 2007, volume = 2007, pages = 13458, note = "ArXiv: hep-th/0511147" }

abstractwebbibtexWith a view towards applications for de Sitter, we construct the multi-parametric $q$-deformation of the $so(5,\IC)$ algebra using the Faddeev-Reshetikhin-Takhtadzhyan (FRT) formalism.

### Peer Reviewed Conference Publication

**Sparse Direct Factorizations through Unassembled Hyper-Matrices**Proceedings of the International Congress on Industrial and Applied Mathematics (ICIAM'07), Proc. Appl. Math. Mech., Volume 7(1), pp. 1010901-1010902, December 2007.webbibtex@inproceedings{Bientinesi2007:254, author = "Paolo Bientinesi and Victor Eijkhout and Jason Kurtz and Robert {van de Geijn}", title = "Sparse Direct Factorizations through Unassembled Hyper-Matrices", booktitle = "Proceedings of the International Congress on Industrial and Applied Mathematics (ICIAM'07)", year = 2007, volume = 7, number = 1, series = "Proc. Appl. Math. Mech.", pages = "1010901--1010902", month = dec }

### Technical Reports

**Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures.**Department of Computer Science, Duke University, December 2007.

Technical Report CS-2007-10.bibtex@techreport{Bientinesi2007:250, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures. ", institution = "Department of Computer Science, Duke University", year = 2007, month = dec, note = "Technical Report CS-2007-10" }

**Sparse Direct Factorizations through Unassembled Hyper-Matrices**Texas Advanced Computing Center (TACC), The University of Texas at Austin, October 2007.

Technical Report TR-07-02.PDFbibtex@techreport{Bientinesi2007:178, author = "Paolo Bientinesi and Victor Eijkhout and Kyungjoo Kim and Jason Kurtz and Robert {van de Geijn}", title = "Sparse Direct Factorizations through Unassembled Hyper-Matrices", institution = "Texas Advanced Computing Center (TACC), The University of Texas at Austin", year = 2007, month = oct, note = "Technical Report TR-07-02", url = "http://www.aices.rwth-aachen.de:8080/~pauldj/pubs/uhmreport.pdf" }

**Parallel 2D FFTs on the Cell Broadband Engine**Department of Computer Science, Duke University, April 2007.

Technical Report CS-2007-03.bibtex@techreport{Bientinesi2007:4, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Parallel 2D FFTs on the Cell Broadband Engine", institution = "Department of Computer Science, Duke University", year = 2007, month = apr, note = "Technical Report CS-2007-03" }

## 2006

### Journal Articles

**Anomaly Cancellation and Conformality in Quiver Gauge Theories**Physics Letters B, Volume 638(4), pp. 374, 2006.

ArXiv:hep-th/0603065.@article{Di_Napoli2006:710, author = "Edoardo {Di Napoli} and Paul Frampton", title = "Anomaly Cancellation and Conformality in Quiver Gauge Theories ", journal = "Physics Letters B", year = 2006, volume = 638, number = 4, pages = 374, note = "ArXiv:hep-th/0603065" }

abstractwebbibtexAbelian quiver gauge theories provide nonsupersymmetric candidates for the conformality approach to physics beyond the standard model. Written as ${\cal N}=0$, $U(N)^n$ gauge theories, however, they have mixed $U(1)_p U(1)_q^2$ and $U(1)_p SU(N)_q^2$ triangle anomalies. It is shown how to construct explicitly a compensatory term $\Delta{\cal L}_{comp}$ which restores gauge invariance of ${\cal L}_{eff} = {\cal L} + \Delta {\cal L}_{comp}$ under $U(N)^n$. It can lead to a negative contribution to the U(1) $\beta$-function and hence to one-loop conformality at high energy for all dimensionless couplings.**Can Quantum de Sitter Space Have Finite Entropy?**Classical and Quantum Gravity, Volume 24(13), pp. 3457, 2006.

ArXiv:hep-th/0602002.@article{Krishnan2006:98, author = "Chethan Krishnan and Edoardo {Di Napoli}", title = "Can Quantum de Sitter Space Have Finite Entropy? ", journal = "Classical and Quantum Gravity", year = 2006, volume = 24, number = 13, pages = 3457, note = "ArXiv:hep-th/0602002" }

abstractwebbibtexIf one tries to view de Sitter as a true (as opposed to a meta-stable) vacuum, there is a tension between the finiteness of its entropy and the infinite-dimensionality of its Hilbert space. We invetsigate the viability of one proposal to reconcile this tension using $q$-deformation. After defining a differential geometry on the quantum de Sitter space, we try to constrain the value of the deformation parameter by imposing the condition that in the undeformed limit, we want the real form of the (inherently complex) quantum group to reduce to the usual SO(4,1) of de Sitter. We find that this forces $q$ to be a real number. Since it is known that quantum groups have finite-dimensional representations only for $q=$ root of unity, this suggests that standard $q$-deformations cannot give rise to finite dimensional Hilbert spaces, ruling out finite entropy for q-deformed de Sitter.

### Thesis

**Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms**Ph.D. Thesis, Department of Computer Sciences, University of Texas at Austin, July 2006.@phdthesis{Bientinesi2006:720, author = "Paolo Bientinesi", title = "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms", school = "Department of Computer Sciences, University of Texas at Austin", year = 2006, type = "Ph.D. Thesis", month = jul, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/thesis.pdf" }

abstractPDFbibtexWe consider the problem of developing formally correct dense linear algebra libraries. The problem would be solved convincingly if, starting from the mathe- matical specification of a target operation, it were possible to generate, implement and analyze a family of correct algorithms that compute the operation. This thesis presents evidence that for a class of dense linear operations, systematic and me- chanical development of algorithms is within reach. It describes and demonstrates an approach for deriving and implementing, systematically and even mechanically, proven correct algorithms. It also introduces a systematic procedure to analyze, in a modular fashion, numerical properties of the generated algorithms.

### Technical Reports

**Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms**The University of Texas at Austin, Department of Computer Sciences, July 2006.

Technical Report TR-06-46.@techreport{Bientinesi2006:4, author = "Paolo Bientinesi", title = "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2006, month = jul, note = "Technical Report TR-06-46", url = "http://www.aices.rwth-aachen.de:8080/~pauldj/pubs/thesis.pdf" }

abstractPDFbibtex**Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix**The University of Texas at Austin, Department of Computer Sciences, April 2006.

FLAME Working Note #19. Technical Report TR-06-20.bibtex@techreport{Bientinesi2006:610, author = "Paolo Bientinesi and Brian Gunter and Robert {van de Geijn}", title = "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2006, month = apr, note = "FLAME Working Note #19. Technical Report TR-06-20" }

**Representing Dense Linear Algebra Algorithms: A Farewell to Indices**The University of Texas at Austin, Department of Computer Sciences, February 2006.

FLAME Working Note #17. Technical Report TR-06-10.@techreport{Bientinesi2006:304, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Representing Dense Linear Algebra Algorithms: A Farewell to Indices", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2006, month = feb, note = "FLAME Working Note #17. Technical Report TR-06-10", url = "http://hpac.rwth-aachen.de/~pauldj/pubs/farewell.pdf" }

abstractPDFbibtexWe present a notation that allows a dense linear algebra algorithm to be represented in a way that is visually recognizable. The primary value of the notation is that it exposes subvectors and submatrices allowing the details of the algorithm to be the focus while hiding the intricate indices related to the arrays in which the vectors and matrices are stored. The applicability of the notation is illustrated through a succession of progressively complex case studies ranging from matrix-vector operations to the chasing of the bulge of the symmetric QR iteration. The notation facilitates comparing and contrasting different algorithms for the same operation as well as similar algorithms for different operations. Finally, we point out how algorithms represented with this notation can be directly translated into high-performance code.

## 2005

### Journal Articles

**Representing Linear Algebra Algorithms in Code: The FLAME APIs**ACM Transactions on Mathematical Software, Volume 31(1), March 2005.@article{Bientinesi2005:504, author = "Paolo Bientinesi and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "Representing Linear Algebra Algorithms in Code: The FLAME APIs", journal = "ACM Transactions on Mathematical Software", year = 2005, volume = 31, number = 1, month = mar, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/FLAME-API.pdf" }

abstractwebPDFbibtexIn this article, we present a number of Application Program Interfaces (APIs) for coding linear algebra algorithms. On the surface, these APIs for the MATLAB M-script and C programming languages appear to be simple, almost trivial, extensions of those languages. Yet with them, the task of programming and maintaining families of algorithms for a broad spectrum of linear algebra operations is greatly simplified. In combination with our Formal Linear Algebra Methods Environment (FLAME) approach to deriving such families of algorithms, dozens of algorithms for a single linear algebra operation can be derived, verified to be correct, implemented, and tested, often in a matter of minutes per algorithm. Since the algorithms are expressed in code much like they are explained in a classroom setting, these APIs become not just a tool for implementing libraries, but also a valuable tool for teaching the algorithms that are incorporated in the libraries. In combination with an extension of the Parallel Linear Algebra Package (PLAPACK) API, the approach presents a migratory path from algorithm to MATLAB implementation to high-performance sequential implementation to parallel implementation. Finally, the APIs are being used to create a repository of algorithms and implementations for linear algebra operations, the FLAME Interface REpository (FIRE), which already features hundreds of algorithms for dozens of commonly encountered linear algebra operations.**The Science of Deriving Dense Linear Algebra Algorithms**ACM Transactions on Mathematical Software, Volume 31(1), March 2005.@article{Bientinesi2005:460, author = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "The Science of Deriving Dense Linear Algebra Algorithms", journal = "ACM Transactions on Mathematical Software", year = 2005, volume = 31, number = 1, month = mar, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/Recipe.ps" }

abstractwebPDFbibtexIn this article we present a systematic approach to the derivation of families of high-performance algorithms for a large set of frequently encountered dense linear algebra operations. As part of the derivation a constructive proof of the correctness of the algorithm is generated. The article is structured so that it can be used as a tutorial for novices. However, the method has been shown to yield new high-performance algorithms for well-studied linear algebra operations and should also be of interest to those who wish to produce best-in-class high-performance codes.**Unitary matrix model of a chiral [SU(N)]^K gauge theory**Journal of High Energy Physics, Volume 2005(10), pp. 074, 2005.

ArXiv:hep-th/0508192.@article{Di_Napoli2005:918, author = "Edoardo {Di Napoli} and Vadim Kaplunovsky", title = "Unitary matrix model of a chiral [SU(N)]^K gauge theory ", journal = "Journal of High Energy Physics", year = 2005, volume = 2005, number = 10, pages = 74, note = "ArXiv:hep-th/0508192" }

abstractbibtexWe build a matrix model of a chiral [SU(N)]^K gauge theory (SQCD5 deconstructed down to 4D) using random unitary matrices to model chiral bifundamental fields (N,bar N) (without (bar N,N)). We verify the duality by matching the loop equation of the matrix model to the anomaly equations of the gauge theory. Then we evaluate the matrix model's free energy and use it to derive the effective superpotential for the gaugino condensates.**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**SIAM Journal on Scientific Computing, Volume 27(1), 2005.@article{Bientinesi2005:550, author = "Paolo Bientinesi and Inderjit Dhillon and Robert {van de Geijn}", title = "A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations", journal = "SIAM Journal on Scientific Computing", year = 2005, volume = 27, number = 1, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/PMR3.pdf" }

abstractwebPDFbibtexWe present a new parallel algorithm for the dense symmetric eigenvalue/eigenvector problem that is based upon the tridiagonal eigensolver, Algorithm MR3, recently developed by Dhillon and Parlett. Algorithm MR3 has a complexity of O(n2) operations for computing all eigenvalues and eigenvectors of a symmetric tridiagonal problem. Moreover the algorithm requires only O(n) extra workspace and can be adapted to compute any subset of k eigenpairs in O(nk) time. In contrast, all earlier stable parallel algorithms for the tridiagonal eigenproblem require O(n3) operations in the worst case, while some implementations, such as divide and conquer, have an extra O(n2) memory requirement. The proposed parallel algorithm balances the workload equally among the processors by traversing a matrix-dependent representation tree which captures the sequence of computations performed by Algorithm MR3. The resulting implementation allows problems of very large size to be solved efficiently---the largest dense eigenproblem solved in-core on a 256 processor machine with 2 GBytes of memory per processor is for a matrix of size 128,000 x 128,000, which required about 8 hours of CPU time. We present comparisons with other eigensolvers and results on matrices that arise in the applications of computational quantum chemistry and finite element modeling of automobile bodies.

### Peer Reviewed Conference Publication

**Formal Correctness and Stability of Dense Linear Algebra Algorithms**Proceedings of 17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation, 2005.@inproceedings{Bientinesi2005:110, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Formal Correctness and Stability of Dense Linear Algebra Algorithms", booktitle = "Proceedings of 17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation", year = 2005, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/IMACS.pdf" }

abstractPDFbibtexThe Formal Linear Algebra Methods Environment (FLAME) project at UT-Austin pursues the mechanical derivation of algorithms for linear algebra operations. Rather than proving loop based algorithms correct a posteriori, a systematic methodology is employed that determines loop invariants from a mathematical specification of a given linear algebra operation. Algorithms and their correctness proofs are then constructively derived from this loop invariant. The process has been made mechanical via a prototype system implemented with Mathematica. Once an algorithm has been determined, a similarly systematic a posteriori process is used to determine the correctness in the presence of roundoff error (stability properties) of the algorithm. In this paper, we report progress towards the ultimate goal of full automation of the system.

### Thesis

**Quiver gauge theories, chiral rings and random matrix models**Ph.D. Dissertation, The University of Texas at Austin, 2005.@phdthesis{Di_Napoli2005:340, author = "Edoardo {Di Napoli}", title = "Quiver gauge theories, chiral rings and random matrix models ", school = "The University of Texas at Austin", year = 2005, type = "Ph.D. Dissertation", address = "1, University Station, 78712 Austin, Texas, USA" }

abstractbibtexDimensional deconstruction of 5D SQCD with general nc, nf and kCS gives rise to 4D N = 1 gauge theories with large quivers of SU(nc) gauge factors. We first describe the spectrum of the model in the deconstructive limit and show its properties. We then construct the chiral rings of such theories, off-shell and on-shell. Anomaly equations for the various resolvents allowed by the model permit us to calculate all the relevant chiral operators. The results are broadly similar to the chiral rings of single U(nc) theories with both adjoint and fundamental matter, but there are also some noteworthy differences such as nonlocal meson-like operators where the quark and antiquark fields belong to different nodes of the quiver. And because the analyzed gauge groups are SU(nc) rather than U(nc), our chiral rings also contain a whole collection of baryonic and antibaryonic operators. We then investigate the random matrix model corresponding to such chiral ring. We find that bifundamental chiral operators correspond to unitary matrices. We derive the loop equations and show that they are in perfect agreement with the anomaly equations of the gauge model. An exact expression for the free energy is found in the large NÃ�ï¿½ (rank of the matrix) limit. A formula for the effective superpotential is derived and some examples are illustrated.

### Technical Report

**FLAME 2005 Prospectus: Towards the Final Generation of Dense Linear Algebra Libraries**The University of Texas at Austin, Department of Computer Sciences, April 2005.

FLAME Working Note #16. Technical Report TR-05-15.bibtex@techreport{Bientinesi2005:218, author = "Paolo Bientinesi and Kazushige Goto and Tze {Meng Low} and {Enrique S.} Quintana-Orti and Robert {van de Geijn} and Field {Van Zee}", title = "FLAME 2005 Prospectus: Towards the Final Generation of Dense Linear Algebra Libraries", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2005, month = apr, note = "FLAME Working Note #16. Technical Report TR-05-15" }

## 2004

### Journal Article

**Chiral rings of deconstructive [SU(n(c))]^N quivers**Journal of High Energy Physics, Volume 2004(6), pp. 060, 2004.

ArXiv:hep-th/0406122.@article{Di_Napoli2004:128, author = "Edoardo {Di Napoli} and Vadim Kaplunovsky and Jacob Sonnenschein", title = "Chiral rings of deconstructive [SU(n(c))]^N quivers ", journal = "Journal of High Energy Physics", year = 2004, volume = 2004, number = 6, pages = 60, note = "ArXiv:hep-th/0406122" }

abstractbibtexDimensional deconstruction of 5D SQCD with general n_c, n_f and k_CS gives rise to 4D N=1 gauge theories with large quivers of SU(n_c) gauge factors. We construct the chiral rings of such [SU(n_c)]^N theories, off-shell and on-shell. Our results are broadly similar to the chiral rings of single U(n_c) theories with both adjoint and fundamental matter, but there are also some noteworthy differences such as nonlocal meson-like operators where the quark and antiquark fields belong to different nodes of the quiver. And because our gauge groups are SU(n_c) rather than U(n_c), our chiral rings also contain a whole zoo of baryonic and antibaryonic operators.

### Peer Reviewed Conference Publications

**Automatic Derivation of Linear Algebra Algorithms with Application to Control Theory**PARA'04 State-of-the-Art in Scientific Computing, June 2004.@inproceedings{Bientinesi2004:690, author = "Paolo Bientinesi and Sergey Kolos and Robert {van de Geijn}", title = "Automatic Derivation of Linear Algebra Algorithms with Application to Control Theory", booktitle = "PARA'04 State-of-the-Art in Scientific Computing", year = 2004, month = jun, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/PARA04-automsyst.pdf" }

abstractPDFbibtexIt is our belief that the ultimate automatic system for deriving linear algebra libraries should be able to generate a set of algorithms starting from the mathematical specification of the target operation only. Indeed, one should be able to visit a website, fill in a form with information about the operation to be performed and about the target architectures, click the SUBMIT button, and receive an optimized library routine for that operation even if the operation has not previously been implemented. In this paper we relate recent advances towards what is probably regarded as an unreachable dream. We discuss the steps necessary to automatically obtain an algorithm starting from the mathematical abstract description of the operation to be solved. We illustrate how these steps have been incorporated into two prototype systems and we show the application of one the two systems to a problem from Control Theory: The Sylvester Equation. The output of this system is a description of an algorithm that can then be directly translated into code via API’s that we have developed. The description can also be passed as input to a parallel performance analyzer to obtain optimized parallel routines.**Rapid Development of High-Performance Linear Algebra Libraries**PARA'04 State-of-the-Art in Scientific Computing, 2004.@inproceedings{Bientinesi2004:554, author = "Paolo Bientinesi and {John A.} Gunnels and {Fred G.} Gustavson and {Greg M.} Henry and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "Rapid Development of High-Performance Linear Algebra Libraries", booktitle = "PARA'04 State-of-the-Art in Scientific Computing", year = 2004, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/PARA04-worksheet.pdf" }

abstractPDFbibtexWe present a systematic methodology for deriving and implementing linear algebra libraries. It is quite common that an application requires a library of routines for the computation of linear algebra operations that are not (exactly) supported by commonly used libraries like LAPACK. In this situation, the application developer has the option of casting the operation into one supported by an existing library, often at the expense of performance, or implementing a custom library, often requiring considerable effort. Our recent discovery of a methodology based on formal derivation of algorithm allows such a user to quickly derive proven correct algorithms. Furthermore it provides an API that allows the so-derived algorithms to be quickly translated into high-performance implementations.

## 2003

### Technical Report

**FLAME@Lab: A Farewell to Indices**The University of Texas at Austin, Department of Computer Sciences, April 2003.

FLAME Working Note #11. Technical Report TR-03-11.bibtex@techreport{Bientinesi2003:828, author = "Paolo Bientinesi and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "FLAME@Lab: A Farewell to Indices", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2003, month = apr, note = "FLAME Working Note #11. Technical Report TR-03-11" }

## 2002

### Peer Reviewed Conference Publication

**The Science of Programming High-Performance Linear Algebra Libraries**Proceedings of the Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL-02), 2002.@inproceedings{Bientinesi2002:448, author = "Paolo Bientinesi and {John A.} Gunnels and {Fred G.} Gustavson and {Greg M.} Henry and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "The Science of Programming High-Performance Linear Algebra Libraries", booktitle = "Proceedings of the Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL-02)", year = 2002, editor = "G. Baumgartner, J. Ramanujam, and P. Sadayappan " }

abstractbibtexWhen considering the unmanageable complexity of computer systems, Dijkstra recently made the following observations: 1. When exhaustive testing is impossible --i.e., almost always-- our trust can only be based on proof (be it mechanized or not). 2. A program for which it is not clear why we should trust it, is of dubious value. 3. A program should be structured in such a way that the argument for its correctness is feasible and not unnecessarily laborious. 4. Given the proof, deriving a program justified by it, is much easier than, given the program, constructing a proof justifying it. Over the last decade, our research in the area of the implementation of high performance sequential and parallel linear algebra libraries has made the wisdom of the above observations obvious. In this paper, we show how to apply formal derivation methods to linear algebra operations. By combining this approach with a high-level abstraction for coding such algorithms, a linear algebra library that is more desirable, as measured by the metric provided by the above observations, can be achieved. Moreover, such a library can include a richer array of algorithms and is, at the same time, more easily maintained and extended. Most surprisingly, the methodology results in more efficient implementations. The approach is sufficiently systematic that much of it can be, and has been, automated.

### Technical Reports

**The Science of Deriving Dense Linear Algebra Algorithms**The University of Texas at Austin, Department of Computer Sciences, September 2002.

FLAME Working Note #8. Technical Report TR-02-53.bibtex@techreport{Bientinesi2002:440, author = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "The Science of Deriving Dense Linear Algebra Algorithms", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2002, month = sep, note = "FLAME Working Note #8. Technical Report TR-02-53" }

**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**The University of Texas at Austin, Department of Computer Sciences, September 2002.

Technical Report TR-03-26.bibtex@techreport{Bientinesi2002:54, author = "Paolo Bientinesi and Inderjit Dhillon and Robert {van de Geijn}", title = "A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2002, month = sep, note = "Technical Report TR-03-26" }

## 1998

### Journal Article

**On Numerical Approximation of Electrostatic Energy in 3D**Journal of Computational Physics, Volume 146(2), pp. 707-725, November 1998.@article{Finocchiaro1998:418, author = "Daniele Finocchiaro and Marco Pellegrini and Paolo Bientinesi", title = "On Numerical Approximation of Electrostatic Energy in 3D", journal = "Journal of Computational Physics", year = 1998, volume = 146, number = 2, pages = "707-725", month = nov, url = "http://hpac.rwth-aachen.de/~pauldj/pubs/jcp-final.pdf" }

abstractwebPDFbibtexApproximating the Coulomb self-energy of a charge distribution within a three-dimensional domain and the mutual Coulomb energy of two charge distributions often constitutes a computational bottleneck in the simulation of physical systems. The present article reports on a recently developed computational technique aimed at the numerical evaluation of the six-dimensional integrals arising from Coulomb interactions. Techniques from integral geometry are used to show a reduction of the domain from six-dimensional to two-dimensional. In the process analytic singularities due to Coulomb's law are eliminated. Experimental results on the self-energy of a charged cube show that the proposed method converges rapidly and is competitive with methods proposed in the literature for similar integration problems.

### Thesis

**Computational Geometry Techniques for Approximation of Electrostatic Forces**Master's Thesis, University of Pisa, 1998.

## 1997

### Technical Report

**Electrostatic Fields Without Singularities: Implementation and Experiments**Institute for Computational Mathematics, CNR, Pisa, 1997.

Technical Report TR-B4-16-97.PDFbibtex@techreport{Bientinesi1997:760, author = "Paolo Bientinesi and Marco Pellegrini", title = "Electrostatic Fields Without Singularities: Implementation and Experiments", institution = "Institute for Computational Mathematics, CNR, Pisa", year = 1997, note = "Technical Report TR-B4-16-97", url = "http://www.aices.rwth-aachen.de:8080/~pauldj/pubs/report.pdf" }

## 1996

### Thesis

**Su una proposta di Modello Standard alternativo**Laurea Thesis, I Universita` di Roma "La Sapienza", 1996.@mastersthesis{Di_Napoli1996:554, author = "Edoardo {Di Napoli}", title = "Su una proposta di Modello Standard alternativo ", school = "I Universita` di Roma "La Sapienza"", year = 1996, type = "Laurea Thesis", address = "Piazzale Aldo Moro, Roma, Italy" }

abstractbibtexQuesto modello si propone di sostituire al tradizionale settore elettrodebole del Modello Standard, in cui le masse sono ''generate'' dalla rottura spontanea di simmetria di un doppietto scalare inserito esplicitamente nella lagrangiana, un modello nel quale la rottura di simmetria è dinamica e le particelle scalari di tipo Higgs sono il risultato di uno stato composito di fermioni. L'analogia con la teoria BCS è stretta anche se la trattazione matematica è decisamente differente facendo ampio uso di metodologie funzionali largamente diffuse nell'ambito delle teorie di campo relativistiche. Il punto di partenza è il fondamentale lavoro di Nambu-Jona Lasinio che ha dato origine all'omonimo modello. In questo modello affondano le origini del meccanismo di rottura dinamica di simmetria poi ampiamente sviluppato in lavori successivi. NJL basandosi sulla stretta analogia con la teoria BCS, applicano il metodo delle quasi particelle, ideato da Bogoliubov, alla fisica relativistica dei campi. Attraverso una semplice interazione a 4-F (quattro fermioni) chiral-invariante ricavano un eq. di gap, nell'approssimazione di Hartree-Fock, per un parametro d'ordine che poi non è nient'altro che la massa dinamica dei fermioni. Ma il modello non si ferma quì\ indicando esplicitamente la ricca struttura degli infiniti stati fondamentali e l'esistenza di stati dinamici legati massivi. L'unico problema che NJL trovano è legato alla inevitabile dipendenza logaritmica dal cut-off dei risultati analitici, che paradossalmente arricchisce strutturalmente la teoria ma non può essere assorbito nell'eventuale ridefinizione della costante d'accoppiamento essendo l'intero apparato non rinormalizzabile. Nella prima parte del presente lavoro si è fatto ampio uso delle idee fondamentali di questo modello unitamente alle tecniche sviluppate da Gross e Neveu per lo sviluppo non perturbativo 1/N ed al formalismo per gli operatori compositi inventato da Jackiw, Cornwall e Tomboulis. I risultati sono stati confrontati con il recente articolo di Bardeen, Hill e Lindner. Allo scopo di ottenere uno schema il più generale possibile, è stato seguito un metodo autoconsistene che fà ampio uso di tecniche funzionali applicate all'azione efficace.