# Recent Talks

**The Generalized Matrix Chain Algorithm**2018 IEEE/ACM International Symposium on Code Generation and Optimization.

Vienna, Austria, 27 February 2018.abstractPDFIn this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**Umeå University, 12 January 2018.abstractPDFThe evaluation of linear algebra expressions is a central part of both languages for scientific computing such as Julia and Matlab, and libraries such as Eigen, Blaze, and NumPy. However, the existing strategies are still rather primitive. At present, the only way to achieve high performance is by handcoding algorithms using libraries such as BLAS and LAPACK, a task that requires extensive knowledge in linear algebra, numerical linear algebra and high-performance computing. We present Linnea, the prototype of a compiler that automates the translation of the mathematical description of a linear algebra problem to an efficient sequence of calls to BLAS and LAPACK kernels. The main idea of Linnea is to construct a search graph that represents a large number of programs, taking into account knowledge about linear algebra. The algebraic nature of the domain is used to reduce the size of the search graph, without reducing the size of the search space that is explored. Experiments show that 1) the code generated by Linnea outperforms standard linear algebra languages and libraries, and 2) in contrast to the development time of human experts, the generation takes only few seconds.**Teaching Computers Linear Algebra**Friedrich-Schiller-Universitaet Jena, Jena, Germany, January 2018.abstractwebPDFIn the mid 1950s, the computing world was revolutionized by the advent of "The IBM Mathematical Formula Translating System" (FORTRAN), a program--nowadays universally recognized as the first complete compiler--that allowed scientists to express calculations in a "high-level", portable language. Both FORTRAN and C were, and still are, much better solutions than computer-specific code, but they still require users to reduce their mathematical formulas to scalar computations. Indeed, computers only operate on scalars and small arrays, while scientists operate with vectors, matrices and higher-dimensional objects. In the past 60 years there has been tremendous progress in the world of programming languages and compilers, and many languages and libraries (Matlab, Julia, Armadillo, Eigen, ...) now make it possible to code directly in terms of matrices; however in terms of efficiency, these solutions are still far from what human experts achieve. In a nutshell, none of these tools know linear algebra well enough to compete with humans. In this talk I present the Linear Algebra Mapping Problem (LAMP), that is, how to efficiently compute linear algebra expressions from a set of available building blocks, and the compiler Linnea, our initial solution to the problem.**Automatic Seamless Mixing of Computer Generated Playlists**Umea Universitet, Umea, Sweden, January 2018.abstractContinuous (mixed) dance music has animated disco clubs since the end of the 1970s, when DJs began to create unbroken sequences of songs. Thereafter, continuous mixes progressively conquered other public spaces such as private parties, shops, gyms and radio broadcasts. On the other hand, continuous (but unmixed) music has been widely available in the form of playlists since when portable storage devices and online repositories for digital audio became commercially available. For non-professional DJs with audio databases in the order of hundreds of tracks, both the selection of songs for a dance music playlist, and the smooth mixing into a continuous streaming represent non-trivial operations. Indeed, whoever is entrusted with such a task would benefit greatly if both operations were automated. AutoMix, short for Automatic Seamless Mixing of Computer Generated Playlists, aims to solve both problems simultaneously: Starting from an existing repository of dance tracks, it will automatically create a sequence of songs, and mix them together seamlessly, exactly as a human DJ would do.**Efficient Pattern Matching in Python**7th Workshop on Python for High-Performance and Scientific Computing.

Denver, Colorado, 12 November 2017.**Performance Modeling and Prediction for Dense Linear Algebra**RWTH Aachen, November 2017.

PhD Defense.abstractPDFThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.**A tale of efficiency and productivity. From scalar to tensor computations.**Umea Universitat, Umea, Sweden, October 2017.abstractPDFThe scientific computing community has to deal with the disconnect between the language spoken by practitioners (for the most part non-computer scientists), and the language with which computers operate. While scientists and engineers (possibly after a discretization process) speak the language of vectors, matrices, and higher-dimensional objects, computers only operate on scalars and small arrays. This gap is partly bridged thanks to the enormous effort that is put into the identification, development, and optimization of libraries for well defined tasks ("building blocks"), but this is far from a complete solution. Users still have to paintakingly map their formulas onto the available building blocks---sadly, an extremely time consuming task, especially when efficiency is of importance; alternatively, users can rely on high-level languages and libraries which perform the mapping automatically, although in terms of efficiency the results are severaly suboptimal, certainly far from what human experts can achieve by hand. The High-Performance and Automatic Computing group tackles this tradeoff between computer efficiency and human productivity. In this talk we give an overview of our contributions, including interdisciplinary research, compilers, numerical algorithms, and library development.**A journey from scalar to tensor computations**Tensor Computation Workshop.

Flatiron Institute, New York City, September 2017.**Optimizing the ChASE eigensolver for Bethe-Salpeter computations**7th Workshop of the Joint Laboratory for Extreme Scale Computing.

17 July 2017.abstractwebPDFThe Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) is an iterative eigensolver developed at the JSC by the SimLab Quantum Materials. The solver mainly targets sequences of dense eigenvalue problems as they arise in Density Functional Theory, but can also work on the single eigenproblem. ChASE leverages on the predominant use of BLAS 3 subroutines to achieve close-to-peak performance and potentially achieve scalability over hundreds if not thousands of computing nodes. We have recently succeeded to integrate a version of the ChASE library within the Jena BSE code. Preliminary comparison between ChASE and the conjugate gradient eigensolver (KSCG), previously used by the Jena BSE code, shows that ChASE can outperform KSCG with speedups up to 5X. In this talk we illustrate our latest results and give an outlook of the scientific problems that can be tackled once the integration is successfully completed.**Compiling Linear Algebra Expressions to High-Performance Code**8th International Workshop on Parallel Symbolic Computation (PASCO).

Kaiserslautern, July 2017.abstractwebPDFVectors, matrices and tensors are the mathematical objects universally used to describe scientific phenomena, engineering processes, and numerical algorithms. By contrast, processors only operate with scalars and small arrays, and do not understand the language and the rules of linear algebra. Because of this mismatch, any linear algebra expression has to be translated in terms of the instructions supported by the specific target processor. Over the course of many years, the linear algebra community has put tremendous effort in the identification, standardization, and optimization of a rich set of relatively simple computational kernels--such as those included in the BLAS and LAPACK libraries--that provide the necessary building blocks for just about any linear algebra computation. The initial--daunting--task has thus been reduced to the decomposition of a target linear algebra expression in terms of said building blocks; we refer to this task as the "Linear Algebra Mapping Problem" (LAMP). However, LAMP is itself an especially challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this talk we present the problem, we give an overview of the solutions provided by several programming languages and computing environments (such as Julia, Matlab, R, ...), and introduce Linnea, a compiler to solve the general form of LAMP. As shown through a set of test cases, Linnea's results are comparable with those obtained by a human expert.