# Recent Talks

**Linnea: Automatic Generation of Efficient Linear Algebra Programs**- University of Edinburgh, 14 February 2019.
- Heriot-Watt University, Edinburgh, 13 February 2019.

abstractPDFThe evaluation of linear algebra expressions is a central part of both languages for scientific computing such as Julia and Matlab, and libraries such as Eigen, Blaze, and NumPy. However, the existing strategies are still rather primitive. At present, the only way to achieve high performance is by handcoding algorithms using libraries such as BLAS and LAPACK, a task that requires extensive knowledge in linear algebra, numerical linear algebra and high-performance computing. We present Linnea, a synthesis tool that automates the translation of the mathematical description of a linear algebra problem to an efficient sequence of calls to BLAS and LAPACK kernels. The main idea of Linnea is to construct a search graph that represents a large number of programs, taking into account knowledge about linear algebra. The algebraic nature of the domain is used to reduce the size of the search graph, without reducing the size of the search space that is explored. Experiments show that 1) the code generated by Linnea outperforms standard linear algebra languages and libraries, and 2) in contrast to the development time of human experts, the generation takes only few minutes. We present Linnea, the prototype of a compiler that automates the translation of the mathematical description of a linear algebra problem to an efficient sequence of calls to BLAS and LAPACK kernels. The main idea of Linnea is to construct a search graph that represents a large number of programs, taking into account knowledge about linear algebra. The algebraic nature of the domain is used to reduce the size of the search graph, without reducing the size of the search space that is explored. Experiments show that 1) the code generated by Linnea outperforms standard linear algebra languages and libraries, and 2) in contrast to the development time of human experts, the generation takes only few seconds.**Tensor Computations: Efficiency Or Productivity?**SIAM Conference on Computational Science and Engineering.

Spokane, WS, February 2019.abstractPDFWhile in mathematics and physics tensors have been first-class citizens for more than a century, from the perspective of high-performance computing, they have been mostly neglected until very recently. In scientific applications, the computation of tensor operations often was, and still is, crudely translated into n-fold nested loops in combination with scalar operations. This approach has the advantage of being straightforward, and allows to incorporate simplifications due to the physics of the problem; however, due to the lack of locality, it comes at the cost of severe performance hits. For this reason, we are witnessing a rapid increase in interest for the development of high-performance libraries and tools. In this talk, we present tensor operations that arise in disciplines such as materials science and computational biology, and provide a summary classification based on thedifferent approaches used to achieve high-performance implementations. On opposite ends of the spectrum we plac eoperations that can be efficiently cast (possibly with significant effort) as high-performance matrix operations, and operations that instead require their own algorithms and libraries.**Development of a Fully Automated Dj-mixing Algorithm for Electronic Dance Music**Umeå Universitet, December 2018.abstractPDFIt might come as a surprise that behind an artistic task such as djing lie many computational challenges. In practical terms, the input to the problem is given by a set of "songs" (tracks), and the output consists in an uninterrupted stream of music obtained by suitably adjusting and overlapping a subset of the input tracks. In order to solve this problem both automatically and with results that are qualitatively comparable those of a professional dj, one has to solve tasks such as beat tracking, tempo estimation, music structure analysis, and mix quality evaluation, just to name a few. The solutions for all of these tasks build upon techniques from digital signal processing (in particular FFTs) and machine learning (neural networks). In this talk, we first establish a set of rules that a satisfactory solution --a Dj mix-- has to satisfy, and then give an overview of the algorithmic techniques used to solve the aforementioned problems. Finally, we present our results after one year of work.**High-Performance & Automatic Computing**Umeå Universitet, December 2018.abstractPDFWith the increase in complexity and heterogeneity of computing architectures, devising algorithms that take full advantage of the available computational power is an especially challenging and time consuming task. Typically, scientific computing practitioners either invest in months-long cycles of software development --favoring computer efficiency at the expense of productivity and portability-- or resort to portable, high-level languages --in favor of productivity, at the expense of performance--. The group "High-Performance & Automatic Computing" (HPAC) develops methodologies and tools to bridge the disconnect between application experts and computing architectures, aiming for both productivity and performance. In this talk, we give an overview of the activities of the group, with focus on numerical linear algebra, mixed-precision computations, tensor computations, molecular dynamics, and simulation science.**High-Performance & Automatic Computing: Fast & portable code for complex molecular dynamics simulations**Institute for Computational Engineering and Sciences, University of Texas at Austin, Babuska forum, November 2018.abstractPDFWith the increase in complexity and heterogeneity of computing architectures, it is especially challenging and time consuming to devise algorithms that exploit the available computational power. Typically, scientific computing practitioners either invest in months-long cycles of software development --favoring computer efficiency at the expense of productivity and portability-- or resort to high-level languages --in favor of productivity, but entirely disregarding computer performance--. The High-Performance and Automatic Computing group aims to bridge the disconnect between application experts and computing architectures by providing tools that enable both productivity and performance. In this talk, we first give a short overview of our activities in the domanins of linear algebra and tensor operations, and then dive into the specific case of molecular dynamics (MD), an extremely popular technique to simulate the evolution of large systems of particles. We present a domain specific compiler that takes as input the mathematical description of the law that regulates how the particles attract each other, and returns optimized code. While code generation from an abstract representation of the target problem is a common technique for the solution of PDEs (e.g., the Fenics project), it is still largely unexplored in MD. We discuss different optimizations, both with respect to performance and portability, demonstrating efficiency on par to what achieved by human experts.**Parallelism in Linnea**20th Workshop on Compilers for Parallel Computing.

Dublin, Ireland, 18 April 2018.abstractPDFLinnea is an experimental tool for the automatic translation of linear algebra expressions to efficient programs consisting of a sequence of calls to BLAS and LAPACK kernels. Linnea generates programs by constructing a search graph, where each path in the graph represents one program. We introduce two problems related to parallelism that arise in Linnea. Those problems consist in 1) parallelizing the construction of the search graph and 2) generating parallel programs.**A set of building blocks for tensor operations: transposition, summation, and contraction**SIAM Conference on Parallel Processing for Scientific Computing.

Waseda University, Tokyo, Japan, March 2018.abstractwebPDFTensors naturally appear in a variety of disciplines and applications, including computational chemistry, computational physics, mathematics, and even machine learning. While a range of high-performance software tools exist for computations involving one- and two-dimensional arrays, i.e. vectors and matrices, the availability for tensors is much more limited. Moreover, until recently the contrast between the efficiency attained by matrix and tensor operations was staggering. With this talk we give an overview of a set of high-performance kernels for three common tensor operations: transposition, summation, and contraction. Specifically, we present 1) TTC and HPTT, respectively a compiler and a library for the efficient transposition of tensors of arbitrary size, 2) a code generator for tensor summations, and 3) TCCG, a compiler for tensor transpositions. In all cases, the tools exhibit significant speedups over state-of-the-art solutions. All tools are available for download and use.**The Generalized Matrix Chain Algorithm**2018 IEEE/ACM International Symposium on Code Generation and Optimization.

Vienna, Austria, 27 February 2018.abstractPDFIn this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**Umeå University, 12 January 2018.abstractPDFThe evaluation of linear algebra expressions is a central part of both languages for scientific computing such as Julia and Matlab, and libraries such as Eigen, Blaze, and NumPy. However, the existing strategies are still rather primitive. At present, the only way to achieve high performance is by handcoding algorithms using libraries such as BLAS and LAPACK, a task that requires extensive knowledge in linear algebra, numerical linear algebra and high-performance computing. We present Linnea, the prototype of a compiler that automates the translation of the mathematical description of a linear algebra problem to an efficient sequence of calls to BLAS and LAPACK kernels. The main idea of Linnea is to construct a search graph that represents a large number of programs, taking into account knowledge about linear algebra. The algebraic nature of the domain is used to reduce the size of the search graph, without reducing the size of the search space that is explored. Experiments show that 1) the code generated by Linnea outperforms standard linear algebra languages and libraries, and 2) in contrast to the development time of human experts, the generation takes only few seconds.