# Talks by year

## 2018

### Talks

**Parallelism in Linnea**20th Workshop on Compilers for Parallel Computing.

Dublin, Ireland, 18 April 2018.abstractPDFLinnea is an experimental tool for the automatic translation of linear algebra expressions to efficient programs consisting of a sequence of calls to BLAS and LAPACK kernels. Linnea generates programs by constructing a search graph, where each path in the graph represents one program. We introduce two problems related to parallelism that arise in Linnea. Those problems consist in 1) parallelizing the construction of the search graph and 2) generating parallel programs.**A set of building blocks for tensor operations: transposition, summation, and contraction**SIAM Conference on Parallel Processing for Scientific Computing.

Waseda University, Tokyo, Japan, March 2018.abstractwebPDFTensors naturally appear in a variety of disciplines and applications, including computational chemistry, computational physics, mathematics, and even machine learning. While a range of high-performance software tools exist for computations involving one- and two-dimensional arrays, i.e. vectors and matrices, the availability for tensors is much more limited. Moreover, until recently the contrast between the efficiency attained by matrix and tensor operations was staggering. With this talk we give an overview of a set of high-performance kernels for three common tensor operations: transposition, summation, and contraction. Specifically, we present 1) TTC and HPTT, respectively a compiler and a library for the efficient transposition of tensors of arbitrary size, 2) a code generator for tensor summations, and 3) TCCG, a compiler for tensor transpositions. In all cases, the tools exhibit significant speedups over state-of-the-art solutions. All tools are available for download and use.**The Generalized Matrix Chain Algorithm**2018 IEEE/ACM International Symposium on Code Generation and Optimization.

Vienna, Austria, 27 February 2018.abstractPDFIn this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**Umeå University, 12 January 2018.abstractPDFThe evaluation of linear algebra expressions is a central part of both languages for scientific computing such as Julia and Matlab, and libraries such as Eigen, Blaze, and NumPy. However, the existing strategies are still rather primitive. At present, the only way to achieve high performance is by handcoding algorithms using libraries such as BLAS and LAPACK, a task that requires extensive knowledge in linear algebra, numerical linear algebra and high-performance computing. We present Linnea, the prototype of a compiler that automates the translation of the mathematical description of a linear algebra problem to an efficient sequence of calls to BLAS and LAPACK kernels. The main idea of Linnea is to construct a search graph that represents a large number of programs, taking into account knowledge about linear algebra. The algebraic nature of the domain is used to reduce the size of the search graph, without reducing the size of the search space that is explored. Experiments show that 1) the code generated by Linnea outperforms standard linear algebra languages and libraries, and 2) in contrast to the development time of human experts, the generation takes only few seconds.**Teaching Computers Linear Algebra**Friedrich-Schiller-Universitaet Jena, Jena, Germany, January 2018.abstractwebPDFIn the mid 1950s, the computing world was revolutionized by the advent of "The IBM Mathematical Formula Translating System" (FORTRAN), a program--nowadays universally recognized as the first complete compiler--that allowed scientists to express calculations in a "high-level", portable language. Both FORTRAN and C were, and still are, much better solutions than computer-specific code, but they still require users to reduce their mathematical formulas to scalar computations. Indeed, computers only operate on scalars and small arrays, while scientists operate with vectors, matrices and higher-dimensional objects. In the past 60 years there has been tremendous progress in the world of programming languages and compilers, and many languages and libraries (Matlab, Julia, Armadillo, Eigen, ...) now make it possible to code directly in terms of matrices; however in terms of efficiency, these solutions are still far from what human experts achieve. In a nutshell, none of these tools know linear algebra well enough to compete with humans. In this talk I present the Linear Algebra Mapping Problem (LAMP), that is, how to efficiently compute linear algebra expressions from a set of available building blocks, and the compiler Linnea, our initial solution to the problem.**Automatic Seamless Mixing of Computer Generated Playlists**Umea Universitet, Umea, Sweden, January 2018.abstractContinuous (mixed) dance music has animated disco clubs since the end of the 1970s, when DJs began to create unbroken sequences of songs. Thereafter, continuous mixes progressively conquered other public spaces such as private parties, shops, gyms and radio broadcasts. On the other hand, continuous (but unmixed) music has been widely available in the form of playlists since when portable storage devices and online repositories for digital audio became commercially available. For non-professional DJs with audio databases in the order of hundreds of tracks, both the selection of songs for a dance music playlist, and the smooth mixing into a continuous streaming represent non-trivial operations. Indeed, whoever is entrusted with such a task would benefit greatly if both operations were automated. AutoMix, short for Automatic Seamless Mixing of Computer Generated Playlists, aims to solve both problems simultaneously: Starting from an existing repository of dance tracks, it will automatically create a sequence of songs, and mix them together seamlessly, exactly as a human DJ would do.

## 2017

### Talks

**Efficient Pattern Matching in Python**7th Workshop on Python for High-Performance and Scientific Computing.

Denver, Colorado, 12 November 2017.**Performance Modeling and Prediction for Dense Linear Algebra**RWTH Aachen, November 2017.

PhD Defense.abstractPDFThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.**A tale of efficiency and productivity. From scalar to tensor computations.**Umea Universitat, Umea, Sweden, October 2017.abstractPDFThe scientific computing community has to deal with the disconnect between the language spoken by practitioners (for the most part non-computer scientists), and the language with which computers operate. While scientists and engineers (possibly after a discretization process) speak the language of vectors, matrices, and higher-dimensional objects, computers only operate on scalars and small arrays. This gap is partly bridged thanks to the enormous effort that is put into the identification, development, and optimization of libraries for well defined tasks ("building blocks"), but this is far from a complete solution. Users still have to paintakingly map their formulas onto the available building blocks---sadly, an extremely time consuming task, especially when efficiency is of importance; alternatively, users can rely on high-level languages and libraries which perform the mapping automatically, although in terms of efficiency the results are severaly suboptimal, certainly far from what human experts can achieve by hand. The High-Performance and Automatic Computing group tackles this tradeoff between computer efficiency and human productivity. In this talk we give an overview of our contributions, including interdisciplinary research, compilers, numerical algorithms, and library development.**A journey from scalar to tensor computations**Tensor Computation Workshop.

Flatiron Institute, New York City, September 2017.**Optimizing the ChASE eigensolver for Bethe-Salpeter computations**7th Workshop of the Joint Laboratory for Extreme Scale Computing.

17 July 2017.abstractwebPDFThe Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) is an iterative eigensolver developed at the JSC by the SimLab Quantum Materials. The solver mainly targets sequences of dense eigenvalue problems as they arise in Density Functional Theory, but can also work on the single eigenproblem. ChASE leverages on the predominant use of BLAS 3 subroutines to achieve close-to-peak performance and potentially achieve scalability over hundreds if not thousands of computing nodes. We have recently succeeded to integrate a version of the ChASE library within the Jena BSE code. Preliminary comparison between ChASE and the conjugate gradient eigensolver (KSCG), previously used by the Jena BSE code, shows that ChASE can outperform KSCG with speedups up to 5X. In this talk we illustrate our latest results and give an outlook of the scientific problems that can be tackled once the integration is successfully completed.**Compiling Linear Algebra Expressions to High-Performance Code**8th International Workshop on Parallel Symbolic Computation (PASCO).

Kaiserslautern, July 2017.abstractwebPDFVectors, matrices and tensors are the mathematical objects universally used to describe scientific phenomena, engineering processes, and numerical algorithms. By contrast, processors only operate with scalars and small arrays, and do not understand the language and the rules of linear algebra. Because of this mismatch, any linear algebra expression has to be translated in terms of the instructions supported by the specific target processor. Over the course of many years, the linear algebra community has put tremendous effort in the identification, standardization, and optimization of a rich set of relatively simple computational kernels--such as those included in the BLAS and LAPACK libraries--that provide the necessary building blocks for just about any linear algebra computation. The initial--daunting--task has thus been reduced to the decomposition of a target linear algebra expression in terms of said building blocks; we refer to this task as the "Linear Algebra Mapping Problem" (LAMP). However, LAMP is itself an especially challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this talk we present the problem, we give an overview of the solutions provided by several programming languages and computing environments (such as Julia, Matlab, R, ...), and introduce Linnea, a compiler to solve the general form of LAMP. As shown through a set of test cases, Linnea's results are comparable with those obtained by a human expert.**Linear algebra tasks in Materials Science: optimization and portability**Accelerated Data and Computing Workshop.

July 2017.**The Linear Algebra Mapping Problem (LAMP)**Householder Symposium XX on Numerical Linear Algebra.

Blacksburg, Virginia, June 2017.**LAMP: the Linear Algebra Mapping Problem**Platform for Advanced Scientific Computing (PASC).

Lugano, June 2017.abstractwebPDFMatrix expressions appear in just about every computational discipline; their fast and accurate evaluation is a time-consuming task that requires expertise in both numerical linear algebra and high-performance computing. On one hand, the numerical linear algebra community made tremendous progress in the identification, layering, and optimization of computational kernels. On the other hand, the translation of a target expression into a sequence of kernels - the task we named "Linear Algebra Mapping Problem" (LAMP) - remains a duty for domain experts. Indeed, while compilers excel at producing fast code for scalar expressions, they are still in their infancy when it comes to dealing with matrices. Nowadays, many tools and languages (Matlab, Julia, Armadillo, Eigen, ...) provide solutions to LAMP, but, as we point out, there is a lot of room for improvement. We give a formal description of LAMP, and discuss the design and the results of a prototype compiler.**LAMMPS’ PPPM Long-Range Solver for the Second Generation Xeon Phi**International Supercomputing Conference (ISC 17).

Frankfurt, June 2017.**Non-linear Associative-Commutative Many-to-One Pattern Matching with Sequence Variables**June 2017.

Master's Thesis Colloquium.**HPTT: A High-Performance Tensor Transposition C++ Library**4th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.

Barcelona, June 2017.abstractwebPDFRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**- Massachusetts Institute of Technology, 24 May 2017.
- University of Nevada, Las Vegas, 17 May 2017.

**Distributed parallel non-equilibrium Green’s function approach to inelastic charge transport**GAMM 2017.

7 March 2017.**When 1+1 > 2: The Power of Interdisciplinary Research**Opening Workshop of the SimLab Quantum Materials.

Juelich, March 2017.**Particle-Particle Particle-Mesh (P3M) on Knights Landing Processors**SIAM Conference on Computational Science and Engineering.

Atlanta, Georgia, February 2017.abstractwebPDFParticle-particle particle-mesh methods are often used in molecular dynamics codes to approximate the effects of long-range forces between atoms where it would not be feasible to compute all pair-wise interactions. While short-range interactions are computed in a pair-wise fashion, the forces produced by long-range interactions are obtained by mapping particle charge to a grid, solving Poisson's equation in the frequency domain for the electrical potential, and then mapping the local potential back to the particles. Using the popular molecular dynamics code LAMMPS, we present vectorization and new implementations of the two mapping algorithms. We also discuss how using larger stencil sizes when mapping charges and forces better takes advantage of the Xeon Phi architecture, both by making use of its large vector registers and because a larger stencil allows a coarser grid to be used. This shifts work from the poorly-scaling FFTs used to solve Poisson's equation and to the newly-accelerated and highly parallel mapping functions. The acceleration of the PPPM method as a whole also affects the optimal input parameters in a similar fashion; using a smaller cutoff to shift work from the pair-wise short-range computation to the long-range PPPM computation saves time even while using a finer charge grid to preserve accuracy.**Vectorization of Multi-Body Potentials: Performance and Portability**SIAM Conference on Computational Science and Engineering.

Atlanta, Georgia, February 2017.abstractwebPDFAs today's supercomputers become more and more powerful, simulations can cover bigger length-scales and time-scales using more accurate, but also more expensive force fields. In the materials science community, many-body potentials are widely used for their predictive power with respect to certain material properties, at the expense of higher computational cost. The challenge lies in mapping the complex calculations necessary to evaluate such potentials onto the available computing devices. Since modern architectures concentrate the computational power in wide SIMD units, and compilers commonly have trouble generating efficient code for them, a dedicated optimization effort is necessary. Special care is needed to minimize the effort required to implement a potential on a new architecture, and to allow for portability at the algorithmic level. Our research provided insights in the vectorization of the Tersoff, REBO and AIREBO potentials, widely used for semiconductor, carbon material, and carbohydrate simuations. We target a diverse set of hardware ranging from x86 CPUs (Westmere to Skylake), to Xeon Phi accelerators of both generations, and even GPUs. The improvements typically double the simulation throughput in large-scale, parallel runs, and higher speedups are possible when deploying accelerators.**The Landscape of High-Performance Tensor Contractions**Workshop on Batched, Reproducible, and Reduced Prevision BLAS.

Atlanta, Georgia, February 2017.**Design of a High-Performance GEMM-like Tensor-Tensor Multiplication**SIAM Conference on Computational Science and Engineering.

Atlanta, February 2017.

## 2016

### Talks

**Towards Automated Load Balancing via Spectrum Slicing for FEAST-like Solvers**6th Workshop of the Joint Laboratory for Extreme Scale Computing.

30 November 2016.**A Compiler for Linear Algebra Operations**ACM Student Research Competition at SPLASH 2016.

Amsterdam, Netherlands, 3 November 2016.**IPCC @ RWTH Aachen University: Optimization of multibody and long-range solvers in LAMMPS**IPCC Showcase November 2016.

November 2016.**The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability**SC 2016.

November 2016.**The Matrix Chain Algorithm to Compile Linear Algebra Expressions**4th Workshop on Domain Specific Language Design and Implementation (DSLDI).

Amsterdam, Netherlands, 31 October 2016.**Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods**JHPCS'16.

4 October 2016.**Accelerating Particle-Particle Particle-Mesh Methods for Molecular Dynamics**IPCC Toulouse.

October 2016.**Cl1ck + LGen: FLAME for small scale linear algebra**BLIS Retreat 2016.

University of Texas at Austin, 19 September 2016.**Cl1ck: A code generator for linear algebra kernels**Programming Languages Lunch Colloquium.

University of Texas at Austin, 12 September 2016.abstractwebPDFWe present Cl1ck, a code generator for specialized linear algebra kernels. Cl1ck adopts the FLAME methodology for the derivation of formally correct loop-based algorithms, and takes a three-stage approach: First, the input operation is transformed into one or more Partitioned Matrix Expressions (PMEs), i.e., a recursive definition of the operation; then, the PMEs are decomposed to identify a family of loop invariants; finally, loop-based algorithms are built around these loop invariants using formal methods techniques. Different back-ends enable then the translation of the algorithms into Matlab and optimized C code.**Design of a High-Performance GEMM-like Tensor-Tensor Multiplication**BLIS Retreat 2016.

September 2016.abstractwebPDFWe present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.**Optimizing Least-Squares Rational Filters for Solving Interior Eigenvalue Problems**International Workshop on Parallel Matrix Algorithms and Applications.

Bordeaux, France, 6 July 2016.**TTC: A Tensor Transposition Compiler for Multiple Architectures**ARRAY ACM SIGPLAN 3rd International Workshop on Libraries, Languages and Compilers for Programming.

June 2016.abstractwebPDFWe consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system’s peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel’s Knights Corner as well as different CUDA-based GPUs such as NVIDIA’s Kepler and Maxwell architectures. We report speedups of TTC over a meaningful base- line implementation generated by external C++ compilers; the re- sults suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel’s latest C++ compiler on the Haswell and Knights Cor- ner architecture, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC’s support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.**TTC: A Compiler for Tensor Transpositions**SIAM Conference on Parallel Processing for Scientific Computing..

Université Pierre et Marie Curie, Paris, 14 April 2016.abstractwebPDFWe present TTC, an open-source compiler for multidimensional tensor transpositions. Thanks to a range of optimizations (software prefetching, blocking, loop-reordering, explicit vectorization), TCC generates high-performance parallel C/C++ code. We use generic heuristics and auto-tuning to manage the huge search space. Performance results show that TTC achieves close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, yielding performance gains of up to 15× over modern compilers.**Exploring OpenMP Task Priorities on the MR3 Eigensolver**SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, 12 April 2016.abstractwebPDFAs part of the OpenMP 4.1 draft, the runtime incorporates task priorities. We use the Method of Multiple Relatively Robust Representations (MR3), for which a pthreads-based task parallel version already exists (MR3SMP), to analyze and compare the performance of MR3SMP with three different OpenMP runtimes, with and without the support of priorities. From a dataset consisting of application matrices, it appears that OpenMP is always on par or better than the pthreads implementation**The ELAPS Framework: Experimental Linear Algebra Performance Studies**SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, April 2016.abstractwebPDFThe multi-platform open source framework ELAPS facilitates easy and fast, yet powerful performance experimentation and prototyping of dense linear algebra algorithms. In contrast to most existing performance analysis tools, it targets the initial stages of the development process and assists developers in both algorithmic and optimization decisions. Users construct experiments to investigate how performance and efficiency vary from one algorithm to another, depending on factors such as caching, algorithmic parameters, problem size, and parallelism.**Optimization of multibody and long-range solvers in LAMMPS**Intel PCC EMEA Meeting.

Ostrava, March 2016.

## 2015

### Talks

**The Tersoff many-body potential: Sustainable performance through vectorization**SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation.

November 2015.**Knowledge-Based Linear Algebra Compiler**AICES Annual Retreat 2015.

15 October 2015.**A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics**International Supercomputing Conference (ISC 15).

Frankfurt, Germany, July 2015.abstractwebPDFRecent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is chosen on a particle by particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algorithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and massively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, and for large numbers of particles the performance is considerably superior. For JUQUEEN, we provide timings for simulations running on the full system (458,752 cores), and show nearly perfect strong and weak scaling.**Systematic Generation of Algorithms for Iterative Methods**RWTH Aachen University, April 2015.

Master Thesis Presentation.**ELAPS: Experimental Linear Algebra Performance Studies**University of Texas at Austin, March 2015.

Live demo.**Enabling Large Scale LAPW DFT Calculations by a Scalable Iterative Eigensolver**The 2015 SIAM Conference on Computational Science & Engineering.

February 2015.abstractwebPDFIn LAPW-based methods a sequence of dense generalized eigenvalue problems appears. Traditionally these problems were solved using direct eigensolvers from standard libraries like ScaLAPACK. We developed a subspace iteration method pre-conditioned with Chebyshev polynomials of optimal degree (ChASE). This algorithm is consistently competitive with direct eigensolvers and greatly enhance performance and scalability. ChASE is included in the FLEUR software and improves its scaling behaviour for calculations of large physical systems on modern supercomputers.

## 2014

### Talks

**On the Performance Prediction of BLAS-based Tensor Contractions**5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS14).

SC14, New Orleans, LA, USA, 16 November 2014.abstractwebPDFTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.**Estimating the Efficiency of BLAS-based Tensor Contractions**Annual Report 2.

AICES, RWTH Aachen, 6 November 2014.**Bringing knowledge into HPC**Symposium on High Performance Computing.

Universitaet Basel, Switzerland, October 2014.**A knowledge-based approach to high-performance computing in ab initio simulations**AICES Advisory Board.

14 July 2014.**An optimized subspace iteration eigensolver applied to sequences of dense eigenproblems in ab initio simulations**8th International Workshop on Parallel Matrix Algorithms and Applications (PMAA14).

Lugano, Switzerland, 4 July 2014.abstractPDFSequences of eigenvalue problems consistently appear in a large class of ap- plications based on the iterative solution of a non-linear eigenvalue problem. A typical example is given by the chemistry and materials science ab initio sim- ulations relying on computational methods developed within the framework of Density Functional Theory (DFT). DFT provides the means to solve a high- dimensional quantum mechanical problem by representing it as a non-linear gen- eralized eigenvalue problem which is solved self-consistently through a series of successive outer-iteration cycles. As a consequence each self-consistent simulation is made of several sequences of generalized eigenproblems P : Ax = λBx. Each sequence, P (1) , . . . P (l) . . . P (N ) , groups together eigenproblems with increasing outer-iteration index l. In more general terms a set of eigenvalue problems {P(1),...P(N)} is said to be a “sequence” if the solution of the l-th eigenproblem determines, in an application-specific manner, the initialization of the (l + 1)-th eigenproblem. For instance at the beginning of each DFT cycle an initial function ρ(l)(r) determines the initialization of the l-th eigenproblem. A large fraction of P (l) eigenpairs are then use to compute a new ρ(l+1)(r) which, in turn, leads to the initialization of a new eigenvalue problem P(l+1). In addition to be part of a sequence, successive eigenproblems might possess a certain degree of correlation connecting their re- spective eigenpairs. In DFT sequences, correlation becomes manifest in the way eigenvectors of successive eigenproblems become progressively more collinear to each other as the l-index increases. We developed a subspace iteration method (ChFSI) specifically tailored for sequences of eigenproblems whose correlation appears in the form of increasingly collinear eigenvectors. Our strategy is to take the maximal advantage possible from the information that the solution of the P(l) eigenproblem is providing when solving for the successive P(l+1) problem. As a consequence the subspace iteration was augmented with a Chebyshev polynomial filter whose degree gets dynamically optimized so as to minimize the number of matrix-vector multipli- cations. The effectiveness of the Chebyshev filter is substantially increased when (l) (l) inputed the approximate eigenvectors {x1 , . . . xnev }, as well as very reliable esti- (l) (l) mates, namely [λ1 , λnev], for the limits of the eigenspectrum interval [a, b] to be filtered in. In addition the degree of the polynomial filter is adjusted so as to be minimal with respect to the required tolerance for the eigenpairs residual. This result is achieved by exploiting the dependence each eigenpair residual have with respect to its convergence ratio as determined by the rescaled Chebyshev poly- nomial and its degree. The solver is complemented with an efficient mechanism which locks and deflates the converged eigenpairs. The resulting eigensolver was implemented in C language and parallelized for both shared and distributed architectures. Numerical tests show that, when the eigensolver is inputed approximate solutions instead of random vectors, it achieves up to a 5X speedup. Moreover ChFSI takes great advantage of compu- tational resources by scaling over a large range of cores commensurate with the size of the eigenproblems. Specifically, by making better use of massively parallel architectures, the distributed memory version will allow DFT users to simulate physical systems quite larger than are currently accessible.**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**The Ninth International Workshop on Automatic Performance Tuning (iWAPT2014), VECPAR 2014.

University of Oregon and Hilton Conference Center, Eugene, Oregon, USA, July 2014.abstractwebPDFIt is universally known that caching is critical to attain high-performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.**Can Numerical Linear Algebra make it in Nature?**Householder Symposium XIX, Spa, Belgium, June 2014.**OmicABEL: Story of a successful interdisciplinary collaboration**PASC Conference 14.

ETH Zürich, Zürich, Switzerland, June 2014.**Performance Prediction for Tensor Contractions**PASC 14.

ETH Zürich, Zürich, Switzerland, June 2014.**An Optimized and Scalable Iterative Eigensolver for Sequences of Dense Eigenvalue Problems**13th Copper Mountain Conference on Iterative Methods.

Copper Mountain, Colorado, USA., 7 April 2014.abstractPDFSequences of eigenvalue problems consistently appear in a large class of applications based on the iterative solution of a non-linear eigenvalue problem. A typical example is given by the chemistry and materials science ab initio simulations relying on computational methods developed within the framework of Density Functional Theory (DFT). DFT provides the means to solve a high- dimensional quantum mechanical problem by representing it as a non-linear generalized eigenvalue problem which is solved self-consistently through a series of successive outer-iteration cycles. As a consequence each self-consistent simulation is made of several sequences of generalized eigenproblems P : A x = \lambda B x .....**Annual Report**AICES Graduate School.

Aachen, Germany, April 2014.**Numerical methods for the eigenvalue problem in electronic structure computations.**45th IFF Spring School. Computing Solids: Models, ab initio methods and supercomputing.

Forschungszentrum Juelich, 14 March 2014.

## 2013

### Talks

**MRRR-Based Eigensolvers for Multi-Core Processors and Supercomputers**RWTH Aachen, Schinkelstr. 2, 52062 Aachen, December 2013.

Doctoral defense.abstractwebThe real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for dense standard and generalized Hermitian eigenproblems that are based on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR or MR^3 in short) - introduced in the late 1990s - is among the fastest methods. To compute k eigenpairs of an n-by-n tridiagonal T, MRRR only requires O(kn) arithmetic operations; in contrast, all the other practical methods require O(k^2 n) or O(n^3) operations in the worst case. This talk centers around the performance and accuracy of MRRR-based eigensolvers on modern parallel architectures.**Knowledge-Based Automatic Generation of Algorithms and Code**Ph.D. Defense, Aachen, Germany, December 2013.**Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials**2013 AIChE Annual Meeting.

San Francisco, USA, November 2013.abstractPDFThe multilevel summation method (MLS) was developed to evaluate long-range interactions in molecular dynamics (MD) simulations. Previously MLS was investigated in detail for the electrostatic potential by Hardy et al., and we have applied this new method to dispersion interactions. While dispersion interactions are formally short-ranged, long-range methods have to be used to calculate accurately dispersion forces in certain situations, such as in interfacial systems. Because long-range solvers tend to dominate the time needed to perform a step in MD calculations, increasing their performance is of vital importance. The multilevel summation method offers some significant advantages when compared to mesh-based Ewald methods like the particle-particle particle-mesh and particle mesh Ewald methods. Because, unlike mesh-based Ewald methods, the multilevel summation method does not use fast Fourier transforms, they are not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared to the O(N log N) complexity of the mesh-based methods. While the structure of the Multilevel Summation is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions. In addition, we have derived strict error bounds, similar to those obtained by Hardy for the electrostatic multilevel summation method. Using an unoptimized implementation in C++, we can already demonstrate the linear scaling of the multilevel summation method for dispersion, and present results that suggest it will be competitive in terms of accuracy and performance with mesh-based methods.**High-performance and automatic computing for simulation science**Jülich Supercomputing Centre, Kickoff workshop, Simulation Lab "ab initio Methods in Chemistry and Physics", Jülich, Germany, November 2013.**High-Performance and Automatic Computing**Goethe Universität Frankfurt am Main, Big Data Lab, October 2013.**Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials**International Conference on Scientific Computation and Differential Equations (SciCADE).

Valladolid, Spain, September 2013.abstractPDFThe multilevel summation (MLS) method was developed to evaluate long-range interactions in molecular dynamics (MD) simulations. In MD the MLS was initially introduced for Coulombic potentials by Skeel et al., based on the multilevel matrix summation; we have extended the MLS to dispersion interactions. While formally short-ranged, dispersion potentials require long-range methods for accurate calculation of forces and energies in certain situations, such as in interfacial systems. Because long-range solvers tend to dominate the time needed to perform a step in MD calculations, increasing their performance is of vital importance. Because of its properties, the MLS is particularly attractive compared to other long-range solvers, e.g. the mesh-based Ewald and the fast Multipole methods. While the structure of the MLS is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions.**Algorithms for Large-scale Whole Genome Association Analysis**PBio 2013: International Workshop on Parallelism in Bioinformatics.

EuroMPI 2013, Madrid, September 2013.abstractPDFIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that these methods sustain high-performance and allow the analysis of enormous datasets.**A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW**10th INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2013).

September 2013.abstractPDFn one of the most important methods in Density Functional Theory - the Full-Potential Linearized Augmented Plane Wave (FLAPW) method – dense generalized eigenproblems are organized in long sequences. Moreover each eigenproblem is strongly correlated to the next one in the sequence. We propose a novel approach which exploits such correlation through the use of an eigensolver based on subspace iteration and accelerated with Chebyshev polynomials. The resulting solver, parallelized using the Elemental library framework, achieves excellent scalability and is competitive with current dense parallel eigensolvers.**Performance Modeling for DLA Kernels**BLIS Retreat.

University of Texas at Austin, September 2013.**High Performance Computational Biology: Asynchronous IO and Elemental for GWAS**Annual Report 1.

AICES, RWTH Aachen, August 2013.**Preconditioning Chebyshev subspace iteration applied to sequences of dense eigenproblems in ab initio simulations**Numerical Analysis and Scientific Computation with Applications (NASCA13) conference in Calais, France.

June 2013.abstractwebPDFIn many material science applications simulations are made of dozens of se- quences, where each sequence groups together eigenproblems with increasing self- consistent cycle outer-iteration index. Successive eigenproblems in a sequence possess a high degree of correlation. In particular it has been demonstrated that eigenvectors of adjacent eigenproblems become progressively more collinear to each other as the outer-iteration index increases. This result suggests one could use eigenvectors, computed at a certain outer-iteration, as approximate solutions to improve the performance of the eigensolver at the next one. In order to opti- mally exploit the approximate solution, we developed a block iterative eigensolver augmented with a Chebyshev polynomial accelerator (BChFSI). Numerical tests show that, when the sequential version of the solver is fed approximate solutions instead of random vectors, it achieves up to a 5X speedup. Moreover the parallel shared memory implementation of the algorithm obtains a high level of efficiency up to 80 % of the theoretical peak performance. Despite the eigenproblems in the sequence being relatively large and dense, the parallel BChFSI fed with ap- proximate solutions performs substantially better than the corresponding direct eigensolver, even for a significant portion of the sought-after spectrum.**Recent Trends in Dense Linear Algebra**ComplexHPC Spring School 2013.

Uppsala University, Uppsala, Sweden, June 2013.

Invited lecturer.**Improving the performance of applied science numerical simulations: an application to Density Functional Theory.**March 2013.

Invited Talk at Columbia University, New York, USA..abstractPDFIn the early days of numerical simulations, advances were based on the ingenuity of pioneer scientists writing codes for relatively simple machines. Nowadays the investigation of large physical systems requires scaling simulations up to massively parallel computers whose optimal usage can often be challenging. On the one hand the algorithmic structure of many legacy codes can be a limiting factor to their portability on large supercomputers. More importantly in many cases algorithmic libraries are used as black boxes and no information coming from the physics of the specific application is exploited to improve the overall performance of the simulation. What is needed is a more interdisciplinary approach where the tools of scientific computing and knowledge extracted from the specific application are merged together in a new computational paradigm. One of the most promising new paradigms borrows from the "inverse problem" concept and, by reversing the logical arrow going from mathematical modeling to numerical simulations, extracts from the latter specific information that can be used to modify the algorithm. The resulting methodology, named "reverse simulation", produces an algorithm variant specifically tailored to the scientific application. Additionally such a variant can be optimally implemented for multiple parallel computing architectures. To demonstrate its applicability I will exemplify the workings of reverse simulation on a computational method widely used in the framework of Density Functional Theory (DFT): the Full-potential Linearized Augmented Plane Wave (FLAPW) method. FLAPW provides the means to solve a high-dimensional quantum mechanical problem by representing it as a non-linear generalized eigenvalue problem which is solved self-consistently through a series of successive outer-iteration cycles. By applying the principles of reverse simulation it can be shown that eigenvectors of successive eigenproblems become progressively more collinear to each other as the outer-iteration index increases. This result suggests that one could use eigenvectors, computed at a certain outer-iteration, as approximate solutions to improve the performance of the eigensolver at the next iteration. In order to maximally exploit the approximate solution, we developed a subspace iteration method augmented with an optimized Chebyshev polynomial accelerator together with an efficient locking mechanism (ChFSI). The resulting eigensolver was implemented in C language and can be parallelized for both shared and distributed architectures. Numerical tests show that, when the eigensolver is preconditioned with approximate solutions instead of random vectors, it achieves up to a 5X speedup. Moreover ChFSI takes great advantage of computational resources by obtaining levels of efficiency up to 80% of the theoretical peak performance. In particular, by making better use of massively parallel architectures, the distributed memory version will allow users of the FLAPW method to simulate larger physical systems than are currently accessible.**Improved Accuracy for MR3-based Eigensolvers**SIAM Conference on Computational Science and Engineering (SIAM CSE13).

February 2013.abstractwebPDFA number of algorithms exist for the dense Hermitian eigenproblem. In many cases, MRRR is the fastest one, although it does not deliver the same accuracy as Divide&Conquer or the QR algorithm. We demonstrate how the use of mixed precisions in MRRR-based eigensolvers leads to an improved orthogonality, even surpassing the accuracy of DC and QR. Our approach comes with limited performance penalty, and increases both robustness and scalability.**Parallel Python - Tutorial**January 2013.**Genome-Wide Association Studies: Computing Petaflops over Terabytes of Data**Blue Gene Active Storage Workshop.

Jülich Supercomputing Centre, Jülich, Germany, January 2013.

Invited speaker.

## 2012

### Talks

**First Steps Towards a Linear Algebra Compiler**ETH Zürich, Zürich, Switzerland, December 2012.

Host: Markus Pueschel.**Parallel block Chebyshev subspace iteration algorithm optimized for sequences of correlated dense eigenproblems**5th International Conference of the ERCIM Working Group.

December 2012.abstractPDF**Performance Modeling for Dense Linear Algebra**3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS12).

SC12, Salt Lake City, Utah, USA, November 2012.abstractPDFIt is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors like target architecture, underlying libraries and even problem size; because of this, the accurate prediction of their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for BLAS and LAPACK libraries. This allows us to obtain predictions through evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in the ranking of algorithms but also in tuning their parameters.**Automating the Generation of Algorithms for Generalized Least-Squares Problems**The 6th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2012).

Vienna, Austria, September 2012.**A Compiler for Linear Algebra Operations**Workshop on Libraries and Autotuning for Extreme-Scale Systems (CScADS '12).

Snowbird, Utah, August 2012.

Invited speaker.**A Domain-Specific Compiler for Linear Algebra Operations**The 7th Intl Workshop on Automatic Performance Tuning (iWAPT), 10th Intl Meeting on High-Performance Computing for Computational Science (VECPAR 2012).

Kobe, Japan, July 2012.**Automatic Modeling and Ranking of Algorithms**The Seventh International Workshop on Automatic Performance Tuning (iWAPT 2012).

Kobe, Japan, July 2012.

Invited speaker.abstractwebPDFIn the context of automatic generation of linear algebra algorithms, it is not uncommon to find dozens of algorithmic variants, all equivalent mathematically, but different in terms of accuracy and performance. In this talk I discuss how to rank the variants automatically, without executing them. In principle, one can attempt a fully analytical approach, creating performance models that take into account both the structure of the algorithm and the features of the processor; unfortunately, due to the intricacies of the memory system, currently this solution is not at all systematic. By contrast, I present an approach based on the automatic modeling of the routines that represent the building blocks for linear algebra algorithms. Performance predictions are then made by composing evaluations of such models. Our experiments show how this approach can be applied to both algorithm ranking and parameter tuning, yielding remarkably accurate results.**High-throughput Algorithms for Genome-Wide Association Studies**The 8th International Conference on Bioinformatics of Genome Regulation and Structure\Systems Biology (BGRS\SB 2012).

Novosibirsk, Russia, June 2012.**Orthogonality in the Hermitian Eigenproblem and the MRRR Algorithm**International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2012).

London, England, June 2012.**Block Iterative Eigensolvers for Sequences of Dense Correlated Eigenvalue Problems**Parallel Matrix Algorithms and Applications 2012.

June 2012.abstractPDFSimulations in Density Functional Theory are made of dozens of sequences, where each sequence groups together eigenproblems with increasing self-consistent cycle iteration index. In a recent study, it has been shown a high degree of cor- relation between successive eigenproblems of each sequence. In particular, by tracking the evolution over iterations of the angles between eigenvectors of suc- cessive eigenproblems, it was shown that eigenvectors are almost collinear after the first few iterations. This result suggests we could use eigenvectors, com- puted at a certain iteration, as approximate solutions for the problem at the successive one. The key element is to exploit the collinearity between vectors of adjacent problems in order to improve the performance of the eigensolver. In this study we provide numerical examples where opportunely selected block iterative eigensolvers benefit from the re-use of eigenvectors when applied to sequences of eigenproblems extracted from simulations of real-world materials. In our inves- tigation we vary several parameters in order to address how the solvers behave under different conditions. In most cases our study shows that, when the solvers are fed approximated eigenvectors as opposed to random vectors, they obtain a substantial speed-up and could become a valid alternative to direct methods.**High Performace Genome Studies**2012 SIAM Conference on Applied Linear Algebra.

SIAM Conference on Applied Linear Algebra, Universitat Politecnica de Valencia, Spain, June 2012.abstractPDFIn the context of the genome-wide association study (GWAS), one has to solve long sequences of generalized least-squares problems; such a problem presents two limiting factors: execution time (often in the range of days or weeks) and data management (in the order of Terabytes). We present algorithms that obviate both issues. By taking advantage of domain-specific knowledge, exploiting parallelism provided by multicores and GPU, and handling data effciently, our algorithms attain unequalled high performance. When compared to GenABEL, one of the most widely used libraries for GWAS, we obtain speedups up to a factor of 50.**Performance Modeling for Ranking Blocked Algorithms**Parallel Matrix Algorithms and Applications (PMAA) 2012.

Birkbeck University of London, June 2012.abstractPDFA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on computing architecture, libraries and problem size, each variant attains a different performances. We propose methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, we identify the routines upon which the algorithms are built. A first tool ,the Sampler, measures the performance of these routines. Using the Sampler, a second tool models their performance. The generated models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow us to correctly rank the algorithms according to their performance without executing them. With the help of the same tools, algorithmic parameters such as block-size can be optimally tuned.**Eigenproblems and Eigensolvers in Density Functional Theory**May 2012.

Invited talk at the Argonne Leadership Computing Facility.abstractPDFIn DFT based simulations each SCF cycle comprises dozens of large generalized eigenproblems. In a recent study, it has been proposed to consider simulations as made of dozens of sequences of eigenvalue problems, where each sequence groups together eigenproblems with equal k-vector and increasing iteration index i. It was then demonstrated that successive eigenproblems in a sequence are strongly correlated to one another. In particular, by tracking the evolution over iterations of the angle between eigenvectors of adjacent iterations, it was shown the angles decrease noticeably after the first few iterations and become close to collinear. This last result suggests we could use the eigenvectors solution of a problem in a sequence as an educated guess for the eigenvectors of the successive problem. In this talk we present preliminary results that would eventually open the way to a widespread use of iterative solvers in ab initio electronic structure codes. We provide numerical examples where opportunely selected iterative solvers benefit from the reuse of eigenvectors when applied to sequences of eigenproblems extracted from simulations of real-world materials.**Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms**Master's Thesis colloquium.

AICES, RWTH Aachen, April 2012.abstractPDFA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on architecture, libraries and problem size, each variant attains a different performances. In my project, I developed methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, I analyze the routines upon which the algorithms are built and introduce a tool that, based on measurements, models their performance. The generated performance models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow me not only to rank the variants but to determine the optimal algorithm configuration.**Block Iterative Solvers and Sequences of Eigenproblems arising in Electronic Structure Calculations**Twelve Copper Mountain Conference on Iterative Methods, March 25-30, Copper Mountain, Colorado, USA.

April 2012.

Twelve Copper Mountain Conference on Iterative Methods, March 25-30, Copper Mountain, Colorado, USA.abstractwebPDFResearch in several branches of chemistry and materials science relies on large ab initio numerical simulations. The majority of these simulations are based on computational methods developed within the framework of Density Functional Theory (DFT). DFT provides the means to solve a high-dimensional quantum mechanical problem by transforming it into a large set of coupled one-dimensional equations, which is ultimately represented as a non-linear generalized eigenvalue problem. The latter is solved self-consistently through a series of successive iteration cycles: the solution computed at the end of one cycle is used to generate the input in the next until the distance between two successive solutions is negligible. Typically a simulations requires tens of cycles before reaching convergence. After the discretization -- intended in the general sense of reducing a continuous problem to one with a finite number of unknowns -- each cycle comprises dozens of large generalized eigenproblems $P^{(i)}_{\bf k}: A^{(i)}_{\bf k} x - \lambda B^{(i)}_{\bf k} x$ where $A$ is hermitian and $B$ hermitian positive definite. Within every cycle the eigenproblems are parametrized by the reciprocal lattice vector ${\bf k}$ while the iteration index $i$ denotes the iteration cycle. The size of each problem ranges from 10 to 40 thousand and the interest lays in the eigenpairs corresponding to the lower 10-20\% part of the spectrum. Due to the dense nature of the eigenproblems and the large portion of the spectrum requested, iterative solvers are generally not competitive; as a consequence, current simulation codes uniquely use direct methods. Generally, the eigenproblem $P^{(i+1)}_{\bf k}$ is solved in complete isolation from all $P^{(i)}$s. This is surprising in light of the fact that each $P^{(i+1)}_{\bf k}$ is constructed by manipulating the solutions of all the problems at iteration $i$. In a recent study, the author harnessed this last observation by considering every simulation as made of dozens of sequences $\{P_{\bf k}\}$, where each sequence groups together eigenproblems with equal {\bf k}-vector and increasing iteration index $i$. It was then demonstrated that successive eigenproblems in a sequence are strongly correlated to one another. In particular, by tracking the evolution over iterations of the angle between eigenvectors $x^{(i)}$ and $x^{(i+1)}$, it was shown the angles decrease noticeably after the first few iterations. Even though the overall simulation has not yet converged, the eigenvectors of $P^{(i)}_{\bf k}$ and $P^{(i+1)}_{\bf k}$ are close to collinear. This result suggests we could use the eigenvectors of $P^{(i)}$ as an educated guess for the eigenvectors of the successive problem $P^{(i+1)}$. The key element is to exploit the collinearity between vectors of adjacent problems in order to improve the performance of the solver. While implementing this strategy naturally leads to the use of iterative solvers, not all the methods are well suited to the task at hand. In light of these considerations, exploiting the evolution of eigenvectors depends on two distinct key properties of the iterative method of choice: 1) the ability to receive as input a sizable set of approximate eigenvectors and exploit them to efficiently compute the required eigenpairs, and 2) the capacity to solve simultaneously for a substantial portion of eigenpairs. The first property implies the solver achieves a moderate-to-substantial reduction in the number of matrix-vector operations as well as an improved convergence. The second characteristic results in a solver capable of filtering away as efficiently as possible the unwanted components of the approximate eigenvectors. In accord to these requirements, the class of block iterative methods constitutes the natural choice in order to satisfy property 1); by accepting a variable set of multiple starting vectors, these methods have a faster convergence rate and avoid stalling when facing small clusters of eigenvalues. When block methods are augmented with polynomial accelerators they also satisfy property 2) and their performance is further improved. In this study we present preliminary results that would eventually open the way to a widespread use of iterative solvers in ab initio electronic structure codes. We provide numerical examples where opportunely selected iterative solvers benefit from the re-use of eigenvectors when applied to sequences of eigenproblems extracted from simulations of real-world materials. At first we experimented with a block method (block Krylov-Schur) satisfying only property 1). At a later stage we selected a solver satisfying both properties (block Chebyshev-Davidson) and performed similar experiments. In our investigation we vary several parameters in order to address how the solvers behave under different conditions. We selected different size for the sequence of eigenproblems as well as an increasing number of eigenpairs to be computed. We also vary the block size in relation with the fraction of eigenpairs sought after and analyze its influence on the performance of the solver. In most cases our study shows that, when the solvers are fed approximated eigenvectors as opposed to random vectors, they obtain a substantial speed-up. The increase in performance changes with the variation of the parameters but strongly indicates that iterative solvers could become a valid alternative to direct methods. The pivotal element allowing the achievement of such a result resides in the transposition of the ``matrix'' of eigenproblems emerging from electronic structure calculations; while the computational physicist solves many rows of {\bf k}-eigenproblems (one for each cycle) we look at the entire computation as made of {\bf k}-sequences. This shift in focus enables one to harness the inherent essence of the iterative cycles and translate it to an efficient use of opportunely tailored iterative solvers. In the long term such a result will positively impact a large community of computational scientists working on DFT-based methods.**Sequences of Generalized Least-Squares for Genome-Wide Association Studies**The 83rd Annual Meeting of the International Association of Applied Mathematics and Mechanics (GAMM 2012).

Darmstadt, Germany, March 2012.**Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures**40th SPEEDUP Workshop on High-Performance Computing.

Basel, Switzerland, February 2012.

Plenary speaker.abstractwebPDFEigenproblems are at the core of a myriad of engineering and scientific applications. In many cases, the size of such problems is not determined by the physics of the application but is limited by the eigensolver's time and memory requirements. While on the one hand the demand is for larger problems, on the other hand the available numerical libraries are not exploiting the parallelism provided in the modern computing environments. In this talk I compare two approaches to parallelism --one that relies on fast multithreaded libraries (BLAS), and another that uses blocking and careful scheduling-- and show that the right choice depends, among other factors, on the specific operation performed. I will then introduce two eigensolvers specifically designed for high-performance architectures. MR3-SMP targets multicore architectures, while EleMRRR is well suited for both small clusters and massively parallel computers which may or may not use multithreading. Experiments on application matrices indicate that our algorithms are both faster and obtain better speedups than all the eigensolvers from LAPACK, ScaLAPACK and Intel's Math Kernel Library (MKL). Finally, I will discuss the use of graphics processing units.**A Study of Productivity and Performance of Modern Vector Processors**2012.

## 2011

### Talks

**Quantum Theory of Materials - Introduction to Density Functional Theory and its Computational Challenges**December 2011.

Class series given during the 11th EU Regional school, AICES, Aachen, Germany.abstractPDFDensity Functional Theory (DFT) is one of the most used ab initio theoretical frameworks in materials science. DFT-based methods are growing as the standard tools for simulating new materials. Simulations aim at recovering and predicting physical properties (electronic structure, total energy differences, magnetic properties, etc.) of large molecules as well as systems made of many hundreds of atoms. DFT reaches this result by solving self-consistently a rather complex set of quantum mechanical equations leading to the computation of the one-particle density n(r), from which physical properties are derived. In order to preserve self-consistency, numerical implementations of DFT methods consist of a series of iterative cycles; at the end of each cycle a new density is computed and compared to the one calculated in the previous cycle. The end result is a series of successive densities converging to a n(r) approximating the exact density within the desired level of accuracy. The course is divided in two parts. The first part is concerned with theoretical and conceptual foundations of DFT: we will introduce basic concepts of many-body quantum mechanics, proceed to illustrate the fundamental building blocks of DFT, and finally present a broad overview of the three most used ab initio methods. In the second part we will focus on one specific method, FLAPW, and analyze its computational aspects in details; the material will be presented paying special attention on the interrelation between the physics and the numerics of the problem. In order to facilitate the exposition, numerous examples will be presented and discussed in class. A basic knowledge of quantum mechanics concepts is assumed.**Automata-Based Detection of Hypergraph Embeddings**RWTH Aachen University, October 2011.

Bachelor's Thesis Presentation.**Knowledge-Based Automatic Generation of Partitioned Matrix Expressions**The 13th International Workshop on Computer Algebra in Scientific Computing (CASC 2011).

Kassel, Germany, September 2011.**Automation in Computational Biology**9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011).

Torun, Poland, September 2011.

Keynote speaker.abstractwebPDFIn the past 30 years the development of linear algebra libraries has been tremendously successful, resulting in a variety of reliable and efficient computational kernels. Unfortunately these kernels--meant to become the building blocks for scientific and engineering applications--are not designed to exploit knowledge relative to the specific target application. If opportunely used, this extra knowledge may lead to domain-specific algorithms that attain higher performance than any traditional library. As a case study, we look at a common operation in computational biology, the computation of mixed-effects models; in particular, we consider the use of mixed models in the context of genome analysis. At the core of this operation lays a generalized least square problem (GLS); GLS may be directly solved with Matlab, or may be reduced to a form accepted by LAPACK. Either way, none of these solutions can exploit the special structure of GLS within genome analysis. Specifically, as part of this application it has to be solved not one, but a large two-dimensional parametric sequence of GLS'. Since the performance of an algorithm is directly affected by the choice of parameters, a family of algorithms is needed. In this talk we show how automation comes to help. We introduce a symbolic system, written in Mathematica, that takes as input a matrix equation and automatically returns a family of algorithms to solve the equation. The system has knowledge of matrix properties, matrix factorizations, and rules of linear algebra; it decomposes the input equation into a sequence of building blocks and maps them onto available high-performance kernels. Automation is achieved through extensive use of pattern matching and rewrite rules. When applied to GLS in the context of genome analysis, it generates algorithms that outperform LAPACK by a factor of six.**The Symmetric Tridiagonal Eigenproblem on Massively-Parallel Supercomputers**International Linear Algebra Society Conference (ILAS 2011).

Braunschweig, Germany, August 2011.**Estimating the number of eigenvalues in a interval using the eigenproblem resolvent**July 2011.

25th Umbrella Symposium, Aachen, Germany.abstractwebSymmetric generalized eigenvalue problems arise in many applications in chemistry physics and engineering. In several cases only a small fraction of the eigenpairs, usually located in a interval at one of the extremities of the spectrum, is required. Obtaining previous knowledge of the number of eigenvalues within the interval boudaries is often beneficial. For instance, for those iterative methods where a projector is used in conjunction with a Rayleigh-Ritz quotient this is an essential ingredient for reducing the number of iterations and increasing the accuracy. We present a potentially inexpensive technique to estimate the number of eigenvalues in a generic interval based on numerical manipulations of the eigenproblem resolvent.**Automatic Generation of Loop-Invariants for Matrix Operations**Workshop on Computer Algebra Systems and Their Applications (CASA), 11th International Conference on Computational Science and Its Applications (ICCSA 2011).

Santander, Spain, June 2011.**Sequences of Generalized Eigenproblems in DFT**June 2011.

10th IMACS International Symposium on Iterative Methods in Scientific Computing, Marrakech, Morocco.abstractPDFResearch in several branches of chemistry and material science relies on large numerical simulations. Many of these simulations are based on Density Functional Theory (DFT) models that lead to sequences of generalized eigenproblems \seq. Every simulation normally requires the solution of hundreds of sequences, each comprising dozens of large and dense eigenproblems; in addition, the problems at iteration $i+1$ of \seq are constructed manipulating the solution of the problems at iteration $i$. The size of each problem ranges from 10 to 40 thousand and the interest lays in the eigenpairs corresponding to the lower 10-30\% part of the spectrum. Due to the dense nature of the eigenproblems and the large portion of the spectrum requested, iterative solvers are not competitive; as a consequence, current simulation codes uniquely use direct methods. In this talk we present a study that highlights how eigenproblems in successive iterations are strongly correlated to one another. In order to understand this result, we need to stress the importance of the basis wave functions, which constitute the building blocks of any DFT scheme. Indeed, the matrix entries of each problem in \seq are calculated through superposition integrals of a set of basis wave functions. Moreover, the state wave functions---describing the quantum states of the material---are linear combinations of basis wave functions with coefficients given by the eigenvectors of the problem. Since a new set of basis wave functions is determined at each iteration of the simulation, the eigenvectors between adjacent iterations are only loosely linked with one another. In light of these considerations it is surprising to find such a deep correlation between the eigenvectors of successive problems. We set up a mechanism to track the evolution over iterations $i=1,\dots,n$ of the angle between eigenvectors $x^{(i)}$ and $x^{(i+1)}$ corresponding to the $j^{th}$ eigenvalue. In all cases the angles decrease noticeably after the first few iterations and become almost negligible, even though the overall simulation is not close to convergence. Even the state of the art direct eigensolvers cannot exploit this behavior in the solutions. In contrast, we propose a 2-step approach in which the use of direct methods is limited to the first few iterations, while iterative methods are employed for the rest of the sequence. The choice of the iterative solver is dictated by the large number of eigenpairs required in the simulation. For this reason we envision the Subspace Iterations Method---despite its slow convergence rate---to be the method of choice. Nested at the core of the method lays an inner loop $V \leftarrow A V$; due to the observed correlation between eigenvectors, the convergence is reached in a limited number of steps. In summary, we propose evidence in favor of a mixed solver in which direct and iterative methods are combined together.**A Modular and Systematic Approach to Stability Analysis**Householder Symposium XVIII.

Tahoe City, CA, June 2011.

Poster.**MR3-SMP and PMRRR: Fast and Scalable Eigensolvers**25th Umbrella Symposium.

Aachen, Germany, June 2011.**Solvers and Eigensolvers for Multicore Processors**Max-Plank-Institute fuer biologische Kybernetik, Tuebingen, Germany, March 2011.

Invited speaker.**Berkeley's Dwarfs on CUDA**2011.

## 2010

### Talks

**Matrix Structure Exploitation in Generalized Eigenproblems Arising in Density Functional Theory**October 2010.

8th International Conference on Numerical Analysis and Applied Mathematics (ICNAAM 2010), Rhodos Island, Greece.abstractIn this short paper, the authors report a new computational approach in the context of Density Functional Theory (DFT). It is shown how it is possible to speed up the self-consistent cycle (iteration) characterizing one of the most well-known DFT implementations: FLAPW. Generating the Hamiltonian and overlap matrices and solving the associated generalized eigenproblems $Ax = \lambda Bx$ constitute the two most time-consuming fractions of each iteration. Two promising directions, implementing the new methodology, are presented that will ultimately improve the performance of the generalized eigensolver and save computational time.**Automatic Generation of Partitioned Matrix Expressions for Matrix Operations**Workshop on Automated computing, 8th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2010)..

Rhodes, Greece, September 2010.**Improving the performance of Density Functional Theory self-consistent cycle**August 2010.

Invited colloquium given at the RWTH Physics Department, Aachen, Germany.abstractDensity Functional Theory (DFT) is one of the most effective frameworks for studying large quantum mechanical systems. DFT-based methods are growing as the standard tools for designing and testing new materials. The core of DFT relies on a self-consistent scheme formed by a sequence of outer iterations (cycles), each one containing multiple large dense generalized eigenpencils. At each step, the configuration of the system is determined by the solution of all the eigenproblems from the previous iteration. In this talk we will show a preliminary study aimed at improving the performance of the overall DFT scheme. Contrary to the current trend we plan to decrease the accuracy of each outer-iteration in order to speed up its execution. The direct consequence will be to have a higher number of faster iterations with an overall gain in execution time towards convergence. Success is guaranteed by acting on the most expensive operations of the self-consistent cycle: namely matrix generation and eigenproblem solution.**Goal-oriented and Modular Stability Analysis**Conference on Numerical Linear Algebra: Perturbation, Performance and Portability.

A Conference in Celebration of G.W. (Pete) Stewart's 70th Birthday, Austin, TX, July 2010.

Invited speaker.**The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors**International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010).

Basel, Switzerland, June 2010.**Generation of Linear Algebra Algorithms for Automatic Differentiation**Workshop on Autotuning, 6th Parallel Matrix Algorithms and Applications (PMAA 2010).

Basel, Switzerland, June 2010.**The Algorithm of Multiple Relatively Robust Representations for Multicore Processors**PARA 2010: State of the Art in Scientific and Parallel Computing.

Reykjavik, Iceland, June 2010.abstractwebPDFThe algorithm of Multiple Relatively Robust Representations, in short MRRR or $\mbox{MR}^3$, computes $k$ eigenvalues and eigenvectors of a symmetric tridiagonal matrix in $O(nk)$ arithmetic operations. While for the largest matrices arising in applications parallel implementations especially suited for distributed-memory computer systems exist, small to medium size problems can make use of LAPACK's implementation xSTEMR. However, xSTEMR does not take advantage of today's multi-core and future many-core architectures, as it is optimized for single-core CPUs. In this paper we discuss some of the issues and trade-offs arising in an efficient implementation especially suited for multi-core CPUs and SMP systems. From a set of experiments on application matrices it results that our algorithm is both faster and obtains better speedups than all tridiagonal eigensolvers from LAPACK and Intel's Math Kernel Library (MKL).**At the Heart of the Automation of Linear Algebra Algorithms**Workshop on Program Composition and Optimization.

Dagstuhl, Germany, May 2010.abstractwebPDFIt is well understood that in order to attain high performance for linear algebra operations over multiple architectures and settings, not just one, but a family of loop-based algorithms have to be generated and optimized. In the past we have demonstrated that algorithms and routines can be derived automatically, using a procedure based on symbolic computations and classical formal derivations techniques. At the heart of such a procedure lie the Partitioned Matrix Expressions (PMEs) of the target operation; these expressions describe how parts of the output operands can be represented in terms of parts of the input operands. The PMEs are the unifying element for all the algorithms in the family, as they encapsulate the necessary knowledge for generating each one of them. Until now, the PMEs were considered inputs to the derivation procedure, i.e., the users had to provide them. In this talk we discuss how from a high-level formal description of the operation it is possible to generate automatically even the PMEs. We conclude demonstrating how automation becomes critical in complex, high-dimensional, scenarios.

## 2009

### Talks

**An Example of Symmetry Exploitation for Energy-related Eigencomputations**International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2009).

Rethymno, Crete, Greece, September 2009.**Linear Algebra on Multicore Architectures**- School on High-performance Computing in Geophysics, Novosobirsk State University, Novosibirsk, Russia, September 2009.

Invited lecturer. - MIT-AICES Workshop.

Aachen, Germany, March 2009.

- School on High-performance Computing in Geophysics, Novosobirsk State University, Novosibirsk, Russia, September 2009.
**Numerical Methods for Large Linear Systems**3rd LHC Detector Alignment Workshop.

CERN, Geneva, Switzerland, June 2009.

Invited speaker.**Automatic Computing**North Rhine-Westphalian Academy of Sciences and Humanities, Duesseldorf, Germany, April 2009.

2009 Karl Arnold Prize acceptance speech.**Computational Mathematics (in Italian, "Matematica Computazionale")**Woche zu Italien, RWTH Aachen.

Aachen, Germany, March 2009.

Invited speaker.

## 2008

### Talks

**Algorithm & Code Generation for High-Performance Computing**Tag der Informatik, RWTH Aachen, Aachen, Germany, December 2008.**Scientific Computing: Applications, Algorithms, Architectures**- AICES Workshop, Monschau, Germany, November 2008.
- Colorado State University, Fort Collins, CO, March 2008.
- RWTH Aachen, Aachen, Germany, January 2008.

Hosts: Marek Behr and Chris Bischof.

**Multicore Processors: What Kind of Parallelism?**AICES, RWTH Aachen, Aachen, Germany, June 2008.

CCES Seminar Series.**Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures**PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing.

Trondheim, Norway, June 2008.**Generation of Dense Linear Algebra Software for Shared Memory and Multicore Architectures**- Workshop on Automating the Development of Scientific Computing Software.

Baton Rouge, LA, March 2008.

Invited speaker. - Microsoft Corporation, Redmond, WA, March 2008.

Host: Laurent Visconti.

- Workshop on Automating the Development of Scientific Computing Software.

## 2007

### Talks

**Streaming 2D FFTs on the Cell Broadband Engine**DESA Workshop.

Washington, DC, December 2007.**Dense Linear Algebra on Multicore Architectures: What Kind of Parallelism?**- CScADS. Workshop on Automatic Tuning for Petascale Systems.

Snowbird, Utah, July 2007. - ICIAM07: 6th International Congress on Industrial and Applied Mathematics.

Zürich, Switzerland, July 2007.

- CScADS. Workshop on Automatic Tuning for Petascale Systems.
**Sparse Direct Factorizations Based on Unassembled Hyper-Matrices**ICIAM07: 6th International Congress on Industrial and Applied Mathematics.

Zürich, Switzerland, July 2007.**Can Computers Develop Libraries? A Different Perspective on Scientific Computing**The University of Chicago, Chicago, IL, February 2007.

Host: Ridgway Scott.

## 2006

### Talks

**Mechanical Generation of Correct Linear Algebra Algorithms**- Duke University, Durham, NC, October 2006.

Host: Xiaobai Sun. - Rice University, Houston, TX, September 2006.

Hosts: John Mellor-Crummey, Ken Kennedy. - University of Manchester, Manchester, UK, June 2006.

Host: Chris Taylor. - Georgia Institute of Technology, Atlanta, GA, March 2006.

Host: Richard Fujimoto. - Carnegie Mellon University, Pittsburgh, PA, February 2006.

Host: Markus Pueschel. - Oxford University, Oxford, UK, January 2006.

Host: Richard Bird.

- Duke University, Durham, NC, October 2006.
**Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms**University of Texas, Austin, TX, July 2006.

Dissertation Defense (7 July).**Mechanical Generation of Linear Algebra Libraries with Multiple Variants**SIAM Conference on Parallel Processing for Scientific Computing.

San Francisco, CA, February 2006.

## 2005

### Talks

**Mechanical Generation of Linear Algebra Libraries with Multiple Variants**- University of Washington, Seattle, WA, December 2005.

Host: Larry Snyder. - Caltech Center for Advanced Computing Research (CACR), California Institute of Technology, Pasadena, CA, June 2005.

Host: Mark Stalzer. - Argonne National Laboratory, Argonne, IL, March 2005.

Host: Jorge More'. - IBM T.J. Watson Research Center, Yorktown Heights, January 2005.

Host: John Gunnels. - New York University, New York, NY, January 2005.

Host: Michael Overton.

- University of Washington, Seattle, WA, December 2005.
**Formal Correctness and Stability of Dense Linear Algebra Algorithms**17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation.

Paris, France, July 2005.**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**Householder XVI Symposium on Numerical Linear Algebra.

Silver Springs Mountain Resort, PA, May 2005.

## 2004

### Talks

**The Science of Deriving Dense Linear Algebra Algorithms**- University of Manchester, Manchester, UK, August 2004.

Host: Nicholas Higham. - PARA'04, Workshop on State-of-the-Art in Scientific Computing.

Lyngby, Denmark, June 2004.

- University of Manchester, Manchester, UK, August 2004.
**Automatic Derivation and Implementation of Parallel Libraries**PARA'04, Workshop on State-of-the-Art in Scientific Computing.

Lyngby, Denmark, June 2004.

## 2003

### Talk

**The Science of Deriving Dense Linear Algebra Algorithms**Lawrence Berkeley National Lab, Berkeley, CA, June 2003.

Host: Esmond Ng.

## 2002

### Talks

**The Science of Deriving Dense Linear Algebra Algorithms**Institute for Informatics and Telematics, Pisa, Italy, July 2002.

Host: Marco Pellegrini.**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**IV International Workshop on Accurate Solution of Eigenvalue Problems (IWASEP4).

Split, Croatia, June 2002.

Poster.