# Talks - Paolo Bientinesi

**Tensor Computations: Efficiency Or Productivity?** SIAM Conference on Computational Science and Engineering.

Spokane, WA, February 2019.

While in mathematics and physics tensors have been first-class citizens for more than a century, from the perspective of high-performance computing they were mostly neglected until very recently. In scientific applications, the computation of tensor operations often was, and still is, crudely translated into n-fold nested loops in combination with scalar operations. This approach has the advantage of being straightforward, and makes it possible to incorporate simplifications due to the physics of the problem; however, due to the lack of locality, it comes at the cost of severe performance hits. For this reason, we are witnessing a rapid increase in interest in the development of high-performance libraries and tools. In this talk, we present tensor operations that arise in disciplines such as materials science and computational biology, and provide a summary classification based on the different approaches used to achieve high-performance implementations. On opposite ends of the spectrum we place operations that can be efficiently cast (possibly with significant effort) as high-performance matrix operations, and operations that instead require their own algorithms and libraries.

**Development of a Fully Automated Dj-mixing Algorithm for Electronic Dance Music** Umeå Universitet, December 2018.

It might come as a surprise that behind an artistic task such as djing lie many computational challenges. In practical terms, the input to the problem is given by a set of "songs" (tracks), and the output consists of an uninterrupted stream of music obtained by suitably adjusting and overlapping a subset of the input tracks. In order to solve this problem both automatically and with results that are qualitatively comparable to those of a professional dj, one has to solve tasks such as beat tracking, tempo estimation, music structure analysis, and mix quality evaluation, just to name a few.
The solutions for all of these tasks build upon techniques from digital signal processing (in particular FFTs) and machine learning (neural networks). In this talk, we first establish a set of rules that a satisfactory solution --a Dj mix-- has to satisfy, and then give an overview of the algorithmic techniques used to solve the aforementioned problems. Finally, we present our results after one year of work.

**High-Performance & Automatic Computing** Umeå Universitet, December 2018.

With the increase in complexity and heterogeneity of computing architectures, devising algorithms that take full advantage of the available computational power is an especially challenging and time-consuming task. Typically, scientific computing practitioners either invest in months-long cycles of software development --favoring computer efficiency at the expense of productivity and portability-- or resort to portable, high-level languages --in favor of productivity, at the expense of performance--. The group "High-Performance & Automatic Computing" (HPAC) develops methodologies and tools to bridge the disconnect between application experts and computing architectures, aiming for both productivity and performance. In this talk, we give an overview of the activities of the group, with focus on numerical linear algebra, mixed-precision computations, tensor computations, molecular dynamics, and simulation science.

**High-Performance & Automatic Computing: Fast & portable code for complex molecular dynamics simulations** Institute for Computational Engineering and Sciences, University of Texas at Austin, Babuska forum, November 2018.

With the increase in complexity and heterogeneity of computing architectures, it is especially challenging and time consuming to devise algorithms that exploit the available computational power.
Typically, scientific computing practitioners either invest in months-long cycles of software development --favoring computer efficiency at the expense of productivity and portability-- or resort to high-level languages --in favor of productivity, but entirely disregarding computer performance--. The High-Performance and Automatic Computing group aims to bridge the disconnect between application experts and computing architectures by providing tools that enable both productivity and performance. In this talk, we first give a short overview of our activities in the domains of linear algebra and tensor operations, and then dive into the specific case of molecular dynamics (MD), an extremely popular technique to simulate the evolution of large systems of particles. We present a domain-specific compiler that takes as input the mathematical description of the law that regulates how the particles attract each other, and returns optimized code. While code generation from an abstract representation of the target problem is a common technique for the solution of PDEs (e.g., the FEniCS project), it is still largely unexplored in MD. We discuss different optimizations, both with respect to performance and portability, demonstrating efficiency on par with that achieved by human experts.

**Parallelism in Linnea** 20th Workshop on Compilers for Parallel Computing.

Dublin, Ireland, 18 April 2018.

Linnea is an experimental tool for the automatic translation of linear algebra expressions to efficient programs consisting of a sequence of calls to BLAS and LAPACK kernels. Linnea generates programs by constructing a search graph, where each path in the graph represents one program. We introduce two problems related to parallelism that arise in Linnea. Those problems consist of 1) parallelizing the construction of the search graph and 2) generating parallel programs.

**A set of building blocks for tensor operations: transposition, summation, and contraction** SIAM Conference on Parallel Processing for Scientific Computing.

Waseda University, Tokyo, Japan, March 2018.

Tensors naturally appear in a variety of disciplines and applications, including computational chemistry, computational physics, mathematics, and even machine learning. While a range of high-performance software tools exist for computations involving one- and two-dimensional arrays, i.e. vectors and matrices, the availability for tensors is much more limited. Moreover, until recently the contrast between the efficiency attained by matrix and tensor operations was staggering. With this talk we give an overview of a set of high-performance kernels for three common tensor operations: transposition, summation, and contraction. Specifically, we present 1) TTC and HPTT, respectively a compiler and a library for the efficient transposition of tensors of arbitrary size, 2) a code generator for tensor summations, and 3) TCCG, a compiler for tensor contractions. In all cases, the tools exhibit significant speedups over state-of-the-art solutions. All tools are available for download and use.

**The Generalized Matrix Chain Algorithm** 2018 IEEE/ACM International Symposium on Code Generation and Optimization.

Vienna, Austria, 27 February 2018.

In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of work. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that --despite great advances in the development of compilers-- the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.

**Teaching Computers Linear Algebra** Friedrich-Schiller-Universitaet Jena, Jena, Germany, January 2018.

In the mid 1950s, the computing world was revolutionized by the advent of "The IBM Mathematical Formula Translating System" (FORTRAN), a program --nowadays universally recognized as the first complete compiler-- that allowed scientists to express calculations in a "high-level", portable language.
Both FORTRAN and C were, and still are, much better solutions than computer-specific code, but they still require users to reduce their mathematical formulas to scalar computations. Indeed, computers only operate on scalars and small arrays, while scientists operate with vectors, matrices and higher-dimensional objects. In the past 60 years there has been tremendous progress in the world of programming languages and compilers, and many languages and libraries (Matlab, Julia, Armadillo, Eigen, ...) now make it possible to code directly in terms of matrices; however, in terms of efficiency, these solutions are still far from what human experts achieve. In a nutshell, none of these tools know linear algebra well enough to compete with humans. In this talk I present the Linear Algebra Mapping Problem (LAMP), that is, how to efficiently compute linear algebra expressions from a set of available building blocks, and the compiler Linnea, our initial solution to the problem.

**Automatic Seamless Mixing of Computer Generated Playlists** Umea Universitet, Umea, Sweden, January 2018.

Continuous (mixed) dance music has animated disco clubs since the end of the 1970s, when DJs began to create unbroken sequences of songs. Thereafter, continuous mixes progressively conquered other public spaces such as private parties, shops, gyms and radio broadcasts. On the other hand, continuous (but unmixed) music has been widely available in the form of playlists ever since portable storage devices and online repositories for digital audio became commercially available. For non-professional DJs with audio databases on the order of hundreds of tracks, both the selection of songs for a dance music playlist and the smooth mixing into a continuous stream represent non-trivial operations. Indeed, whoever is entrusted with such a task would benefit greatly if both operations were automated.
AutoMix, short for Automatic Seamless Mixing of Computer Generated Playlists, aims to solve both problems simultaneously: Starting from an existing repository of dance tracks, it will automatically create a sequence of songs, and mix them together seamlessly, exactly as a human DJ would do.

**Efficient Pattern Matching in Python** 7th Workshop on Python for High-Performance and Scientific Computing.

Denver, Colorado, 12 November 2017.

**A tale of efficiency and productivity. From scalar to tensor computations.** Umea Universitet, Umea, Sweden, October 2017.

The scientific computing community has to deal with the disconnect between the language spoken by practitioners (for the most part non-computer scientists), and the language with which computers operate. While scientists and engineers (possibly after a discretization process) speak the language of vectors, matrices, and higher-dimensional objects, computers only operate on scalars and small arrays. This gap is partly bridged thanks to the enormous effort that is put into the identification, development, and optimization of libraries for well defined tasks ("building blocks"), but this is far from a complete solution. Users still have to painstakingly map their formulas onto the available building blocks --sadly, an extremely time-consuming task, especially when efficiency is of importance; alternatively, users can rely on high-level languages and libraries which perform the mapping automatically, although in terms of efficiency the results are severely suboptimal, certainly far from what human experts can achieve by hand. The High-Performance and Automatic Computing group tackles this tradeoff between computer efficiency and human productivity. In this talk we give an overview of our contributions, including interdisciplinary research, compilers, numerical algorithms, and library development.

**A journey from scalar to tensor computations** Tensor Computation Workshop.

Flatiron Institute, New York City, September 2017.

**Compiling Linear Algebra Expressions to High-Performance Code** 8th International Workshop on Parallel Symbolic Computation (PASCO).

Kaiserslautern, July 2017.

Vectors, matrices and tensors are the mathematical objects universally used to describe scientific phenomena, engineering processes, and numerical algorithms. By contrast, processors only operate with scalars and small arrays, and do not understand the language and the rules of linear algebra. Because of this mismatch, any linear algebra expression has to be translated in terms of the instructions supported by the specific target processor. Over the course of many years, the linear algebra community has put tremendous effort into the identification, standardization, and optimization of a rich set of relatively simple computational kernels --such as those included in the BLAS and LAPACK libraries-- that provide the necessary building blocks for just about any linear algebra computation. The initial, daunting task has thus been reduced to the decomposition of a target linear algebra expression in terms of said building blocks; we refer to this task as the "Linear Algebra Mapping Problem" (LAMP). However, LAMP is itself an especially challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this talk we present the problem, we give an overview of the solutions provided by several programming languages and computing environments (such as Julia, Matlab, R, ...), and introduce Linnea, a compiler to solve the general form of LAMP. As shown through a set of test cases, Linnea's results are comparable with those obtained by a human expert.

**The Linear Algebra Mapping Problem (LAMP)** Householder Symposium XX on Numerical Linear Algebra.

Blacksburg, Virginia, June 2017.

**LAMP: the Linear Algebra Mapping Problem** Platform for Advanced Scientific Computing (PASC).

Lugano, June 2017.

Matrix expressions appear in just about every computational discipline; their fast and accurate evaluation is a time-consuming task that requires expertise in both numerical linear algebra and high-performance computing. On one hand, the numerical linear algebra community made tremendous progress in the identification, layering, and optimization of computational kernels. On the other hand, the translation of a target expression into a sequence of kernels --the task we named "Linear Algebra Mapping Problem" (LAMP)-- remains a duty for domain experts. Indeed, while compilers excel at producing fast code for scalar expressions, they are still in their infancy when it comes to dealing with matrices. Nowadays, many tools and languages (Matlab, Julia, Armadillo, Eigen, ...) provide solutions to LAMP, but, as we point out, there is a lot of room for improvement. We give a formal description of LAMP, and discuss the design and the results of a prototype compiler.

**LAMMPS’ PPPM Long-Range Solver for the Second Generation Xeon Phi** International Supercomputing Conference (ISC 17).

Frankfurt, June 2017.

**HPTT: A High-Performance Tensor Transposition C++ Library** 4th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.

Barcelona, June 2017.

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework, guided by a performance model, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.

**When 1+1 > 2: The Power of Interdisciplinary Research** Opening Workshop of the SimLab Quantum Materials.

Juelich, March 2017.

**Particle-Particle Particle-Mesh (P3M) on Knights Landing Processors** SIAM Conference on Computational Science and Engineering.

Atlanta, Georgia, February 2017.

Particle-particle particle-mesh methods are often used in molecular dynamics codes to approximate the effects of long-range forces between atoms where it would not be feasible to compute all pair-wise interactions. While short-range interactions are computed in a pair-wise fashion, the forces produced by long-range interactions are obtained by mapping particle charge to a grid, solving Poisson's equation in the frequency domain for the electrical potential, and then mapping the local potential back to the particles. Using the popular molecular dynamics code LAMMPS, we present vectorization and new implementations of the two mapping algorithms. We also discuss how using larger stencil sizes when mapping charges and forces better takes advantage of the Xeon Phi architecture, both by making use of its large vector registers and because a larger stencil allows a coarser grid to be used. This shifts work from the poorly-scaling FFTs used to solve Poisson's equation to the newly-accelerated and highly parallel mapping functions. The acceleration of the PPPM method as a whole also affects the optimal input parameters in a similar fashion; using a smaller cutoff to shift work from the pair-wise short-range computation to the long-range PPPM computation saves time even while using a finer charge grid to preserve accuracy.

**Vectorization of Multi-Body Potentials: Performance and Portability** SIAM Conference on Computational Science and Engineering.

Atlanta, Georgia, February 2017.

As today's supercomputers become more and more powerful, simulations can cover bigger length-scales and time-scales using more accurate, but also more expensive force fields. In the materials science community, many-body potentials are widely used for their predictive power with respect to certain material properties, at the expense of higher computational cost. The challenge lies in mapping the complex calculations necessary to evaluate such potentials onto the available computing devices. Since modern architectures concentrate the computational power in wide SIMD units, and compilers commonly have trouble generating efficient code for them, a dedicated optimization effort is necessary. Special care is needed to minimize the effort required to implement a potential on a new architecture, and to allow for portability at the algorithmic level. Our research provided insights into the vectorization of the Tersoff, REBO and AIREBO potentials, widely used for semiconductor, carbon material, and hydrocarbon simulations. We target a diverse set of hardware ranging from x86 CPUs (Westmere to Skylake), to Xeon Phi accelerators of both generations, and even GPUs. The improvements typically double the simulation throughput in large-scale, parallel runs, and higher speedups are possible when deploying accelerators.

**The Landscape of High-Performance Tensor Contractions** Workshop on Batched, Reproducible, and Reduced Precision BLAS.

Atlanta, Georgia, February 2017.

**Design of a High-Performance GEMM-like Tensor-Tensor Multiplication**

- SIAM Conference on Computational Science and Engineering. Atlanta, February 2017.
- BLIS Retreat 2016. September 2016.

**IPCC @ RWTH Aachen University: Optimization of multibody and long-range solvers in LAMMPS** IPCC Showcase November 2016.

November 2016.

**The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability** SC 2016.

November 2016.

**The Matrix Chain Algorithm to Compile Linear Algebra Expressions** 4th Workshop on Domain Specific Language Design and Implementation (DSLDI).

Amsterdam, Netherlands, 31 October 2016.

**Accelerating Particle-Particle Particle-Mesh Methods for Molecular Dynamics** IPCC Toulouse.

October 2016.

**TTC: A Tensor Transposition Compiler for Multiple Architectures** ARRAY, 3rd ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.

June 2016.

We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system’s peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel’s Knights Corner as well as different CUDA-based GPUs such as NVIDIA’s Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel’s latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC’s support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

**TTC: A Compiler for Tensor Transpositions** SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, 14 April 2016.

We present TTC, an open-source compiler for multidimensional tensor transpositions. Thanks to a range of optimizations (software prefetching, blocking, loop-reordering, explicit vectorization), TTC generates high-performance parallel C/C++ code. We use generic heuristics and auto-tuning to manage the huge search space. Performance results show that TTC achieves close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, yielding performance gains of up to 15× over modern compilers.

**Exploring OpenMP Task Priorities on the MR3 Eigensolver** SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, 12 April 2016.

As part of the OpenMP 4.1 draft, the runtime incorporates task priorities. We use the Method of Multiple Relatively Robust Representations (MR3), for which a pthreads-based task parallel version already exists (MR3SMP), to analyze and compare the performance of MR3SMP with three different OpenMP runtimes, with and without the support of priorities. From a dataset consisting of application matrices, it appears that OpenMP is always on par with or better than the pthreads implementation.

**The ELAPS Framework: Experimental Linear Algebra Performance Studies** SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, April 2016.

The multi-platform open source framework ELAPS facilitates easy and fast, yet powerful performance experimentation and prototyping of dense linear algebra algorithms. In contrast to most existing performance analysis tools, it targets the initial stages of the development process and assists developers in both algorithmic and optimization decisions. Users construct experiments to investigate how performance and efficiency vary from one algorithm to another, depending on factors such as caching, algorithmic parameters, problem size, and parallelism.

**Optimization of multibody and long-range solvers in LAMMPS** Intel PCC EMEA Meeting.

Ostrava, March 2016.

**The Tersoff many-body potential: Sustainable performance through vectorization** SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation.

November 2015.

**A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics** International Supercomputing Conference (ISC 15).

Frankfurt, Germany, July 2015.

Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is chosen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algorithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and massively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMUC and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, and for large numbers of particles the performance is considerably superior. For JUQUEEN, we provide timings for simulations running on the full system (458,752 cores), and show nearly perfect strong and weak scaling.

**ELAPS: Experimental Linear Algebra Performance Studies** University of Texas at Austin, March 2015.

Live demo.

**Bringing knowledge into HPC** Symposium on High Performance Computing.

Universitaet Basel, Switzerland, October 2014.

**Can Numerical Linear Algebra make it in Nature?** Householder Symposium XIX, Spa, Belgium, June 2014.

**Performance Prediction for Tensor Contractions** PASC 14.

ETH Zürich, Zürich, Switzerland, June 2014.

**High-performance and automatic computing for simulation science** Jülich Supercomputing Centre, Kickoff workshop, Simulation Lab "ab initio Methods in Chemistry and Physics", Jülich, Germany, November 2013.

**High-Performance and Automatic Computing** Goethe Universität Frankfurt am Main, Big Data Lab, October 2013.

**Recent Trends in Dense Linear Algebra** ComplexHPC Spring School 2013.

Uppsala University, Uppsala, Sweden, June 2013.

Invited lecturer.

**Improved Accuracy for MR3-based Eigensolvers** SIAM Conference on Computational Science and Engineering (SIAM CSE13).

February 2013.

A number of algorithms exist for the dense Hermitian eigenproblem. In many cases, MRRR is the fastest one, although it does not deliver the same accuracy as Divide&Conquer (DC) or the QR algorithm. We demonstrate how the use of mixed precisions in MRRR-based eigensolvers leads to an improved orthogonality, even surpassing the accuracy of DC and QR. Our approach comes with limited performance penalty, and increases both robustness and scalability.

**Genome-Wide Association Studies: Computing Petaflops over Terabytes of Data** Blue Gene Active Storage Workshop.

Jülich Supercomputing Centre, Jülich, Germany, January 2013.

Invited speaker.

**First Steps Towards a Linear Algebra Compiler** ETH Zürich, Zürich, Switzerland, December 2012.

Host: Markus Pueschel.

**A Compiler for Linear Algebra Operations** Workshop on Libraries and Autotuning for Extreme-Scale Systems (CScADS '12).

Snowbird, Utah, August 2012.

Invited speaker.

**Automatic Modeling and Ranking of Algorithms** The Seventh International Workshop on Automatic Performance Tuning (iWAPT 2012).

Kobe, Japan, July 2012.

Invited speaker.

In the context of automatic generation of linear algebra algorithms, it is not uncommon to find dozens of algorithmic variants, all equivalent mathematically, but different in terms of accuracy and performance. In this talk I discuss how to rank the variants automatically, without executing them. In principle, one can attempt a fully analytical approach, creating performance models that take into account both the structure of the algorithm and the features of the processor; unfortunately, due to the intricacies of the memory system, currently this solution is not at all systematic. By contrast, I present an approach based on the automatic modeling of the routines that represent the building blocks for linear algebra algorithms. Performance predictions are then made by composing evaluations of such models. Our experiments show how this approach can be applied to both algorithm ranking and parameter tuning, yielding remarkably accurate results.

**Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures** 40th SPEEDUP Workshop on High-Performance Computing.

Basel, Switzerland, February 2012.

Plenary speaker.

Eigenproblems are at the core of a myriad of engineering and scientific applications. In many cases, the size of such problems is not determined by the physics of the application but is limited by the eigensolver's time and memory requirements. While on the one hand the demand is for larger problems, on the other hand the available numerical libraries are not exploiting the parallelism provided in the modern computing environments. In this talk I compare two approaches to parallelism --one that relies on fast multithreaded libraries (BLAS), and another that uses blocking and careful scheduling-- and show that the right choice depends, among other factors, on the specific operation performed. I will then introduce two eigensolvers specifically designed for high-performance architectures. MR3-SMP targets multicore architectures, while EleMRRR is well suited for both small clusters and massively parallel computers which may or may not use multithreading. Experiments on application matrices indicate that our algorithms are both faster and obtain better speedups than all the eigensolvers from LAPACK, ScaLAPACK and Intel's Math Kernel Library (MKL). Finally, I will discuss the use of graphics processing units.

**Automation in Computational Biology** 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011).

Torun, Poland, September 2011.

Keynote speaker.

In the past 30 years the development of linear algebra libraries has been tremendously successful, resulting in a variety of reliable and efficient computational kernels. Unfortunately these kernels --meant to become the building blocks for scientific and engineering applications-- are not designed to exploit knowledge relative to the specific target application. If used appropriately, this extra knowledge may lead to domain-specific algorithms that attain higher performance than any traditional library. As a case study, we look at a common operation in computational biology, the computation of mixed-effects models; in particular, we consider the use of mixed models in the context of genome analysis. At the core of this operation lies a generalized least squares problem (GLS); GLS may be directly solved with Matlab, or may be reduced to a form accepted by LAPACK. Either way, none of these solutions can exploit the special structure of GLS within genome analysis. Specifically, as part of this application one has to solve not one GLS, but a large two-dimensional parametric sequence of them. Since the performance of an algorithm is directly affected by the choice of parameters, a family of algorithms is needed. In this talk we show how automation comes to help. We introduce a symbolic system, written in Mathematica, that takes as input a matrix equation and automatically returns a family of algorithms to solve the equation. The system has knowledge of matrix properties, matrix factorizations, and rules of linear algebra; it decomposes the input equation into a sequence of building blocks and maps them onto available high-performance kernels. Automation is achieved through extensive use of pattern matching and rewrite rules. When applied to GLS in the context of genome analysis, it generates algorithms that outperform LAPACK by a factor of six.

**A Modular and Systematic Approach to Stability Analysis** Householder Symposium XVIII.

Tahoe City, CA, June 2011.

Poster.

**MR3-SMP and PMRRR: Fast and Scalable Eigensolvers** 25th Umbrella Symposium.

Aachen, Germany, June 2011.

**Solvers and Eigensolvers for Multicore Processors** Max-Planck-Institut fuer biologische Kybernetik, Tuebingen, Germany, March 2011.

Invited speaker.

**Goal-oriented and Modular Stability Analysis** Conference on Numerical Linear Algebra: Perturbation, Performance and Portability.

A Conference in Celebration of G.W. (Pete) Stewart's 70th Birthday, Austin, TX, July 2010.

Invited speaker.

**The Algorithm of Multiple Relatively Robust Representations for Multicore Processors** PARA 2010: State of the Art in Scientific and Parallel Computing.

Reykjavik, Iceland, June 2010.

Abstract: The algorithm of Multiple Relatively Robust Representations, in short MRRR or $\mbox{MR}^3$, computes $k$ eigenvalues and eigenvectors of a symmetric tridiagonal matrix in $O(nk)$ arithmetic operations. While parallel implementations suited to distributed-memory computer systems exist for the largest matrices arising in applications, small to medium size problems can make use of LAPACK's implementation xSTEMR. However, xSTEMR does not take advantage of today's multi-core and future many-core architectures, as it is optimized for single-core CPUs. In this paper we discuss some of the issues and trade-offs arising in an efficient implementation especially suited for multi-core CPUs and SMP systems. A set of experiments on application matrices shows that our algorithm is both faster and obtains better speedups than all tridiagonal eigensolvers from LAPACK and Intel's Math Kernel Library (MKL).

**At the Heart of the Automation of Linear Algebra Algorithms** Workshop on Program Composition and Optimization.

Dagstuhl, Germany, May 2010.

Abstract: It is well understood that in order to attain high performance for linear algebra operations over multiple architectures and settings, not just one, but a family of loop-based algorithms has to be generated and optimized. In the past we have demonstrated that algorithms and routines can be derived automatically, using a procedure based on symbolic computations and classical formal derivation techniques. At the heart of such a procedure lie the Partitioned Matrix Expressions (PMEs) of the target operation; these expressions describe how parts of the output operands can be represented in terms of parts of the input operands. The PMEs are the unifying element for all the algorithms in the family, as they encapsulate the necessary knowledge for generating each one of them. Until now, the PMEs were considered inputs to the derivation procedure, i.e., the users had to provide them. In this talk we discuss how, from a high-level formal description of the operation, it is possible to automatically generate even the PMEs. We conclude by demonstrating how automation becomes critical in complex, high-dimensional scenarios.

**Linear Algebra on Multicore Architectures**

- School on High-performance Computing in Geophysics, Novosibirsk State University, Novosibirsk, Russia, September 2009.

Invited lecturer.

- MIT-AICES Workshop.

Aachen, Germany, March 2009.

**Numerical Methods for Large Linear Systems** 3rd LHC Detector Alignment Workshop.

CERN, Geneva, Switzerland, June 2009.

Invited speaker.

**Automatic Computing** North Rhine-Westphalian Academy of Sciences and Humanities, Duesseldorf, Germany, April 2009.

2009 Karl Arnold Prize acceptance speech.

**Computational Mathematics (in Italian, "Matematica Computazionale")** Woche zu Italien, RWTH Aachen.

Aachen, Germany, March 2009.

Invited speaker.

**Algorithm & Code Generation for High-Performance Computing** Tag der Informatik, RWTH Aachen, Aachen, Germany, December 2008.

**Scientific Computing: Applications, Algorithms, Architectures**

- AICES Workshop, Monschau, Germany, November 2008.
- Colorado State University, Fort Collins, CO, March 2008.
- RWTH Aachen, Aachen, Germany, January 2008.

Hosts: Marek Behr and Chris Bischof.

**Multicore Processors: What Kind of Parallelism?** AICES, RWTH Aachen, Aachen, Germany, June 2008.

CCES Seminar Series.

**Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures** PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing.

Trondheim, Norway, June 2008.

**Generation of Dense Linear Algebra Software for Shared Memory and Multicore Architectures**

- Workshop on Automating the Development of Scientific Computing Software.

Baton Rouge, LA, March 2008.

Invited speaker.

- Microsoft Corporation, Redmond, WA, March 2008.

Host: Laurent Visconti.

**Streaming 2D FFTs on the Cell Broadband Engine** DESA Workshop.

Washington, DC, December 2007.

**Dense Linear Algebra on Multicore Architectures: What Kind of Parallelism?**

- CScADS. Workshop on Automatic Tuning for Petascale Systems.

Snowbird, Utah, July 2007.

- ICIAM07: 6th International Congress on Industrial and Applied Mathematics.

Zürich, Switzerland, July 2007.

**Sparse Direct Factorizations Based on Unassembled Hyper-Matrices** ICIAM07: 6th International Congress on Industrial and Applied Mathematics.

Zürich, Switzerland, July 2007.

**Can Computers Develop Libraries? A Different Perspective on Scientific Computing** The University of Chicago, Chicago, IL, February 2007.

Host: Ridgway Scott.

**Mechanical Generation of Correct Linear Algebra Algorithms**

- Duke University, Durham, NC, October 2006.

Host: Xiaobai Sun.

- Rice University, Houston, TX, September 2006.

Hosts: John Mellor-Crummey, Ken Kennedy.

- University of Manchester, Manchester, UK, June 2006.

Host: Chris Taylor.

- Georgia Institute of Technology, Atlanta, GA, March 2006.

Host: Richard Fujimoto.

- Carnegie Mellon University, Pittsburgh, PA, February 2006.

Host: Markus Pueschel.

- Oxford University, Oxford, UK, January 2006.

Host: Richard Bird.

**Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms** University of Texas, Austin, TX, July 2006.

Dissertation Defense (7 July).

**Mechanical Generation of Linear Algebra Libraries with Multiple Variants**

- SIAM Conference on Parallel Processing for Scientific Computing.

San Francisco, CA, February 2006.

- University of Washington, Seattle, WA, December 2005.

Host: Larry Snyder.

- Caltech Center for Advanced Computing Research (CACR), California Institute of Technology, Pasadena, CA, June 2005.

Host: Mark Stalzer.

- Argonne National Laboratory, Argonne, IL, March 2005.

Host: Jorge Moré.

- IBM T.J. Watson Research Center, Yorktown Heights, January 2005.

Host: John Gunnels.

- New York University, New York, NY, January 2005.

Host: Michael Overton.

**Formal Correctness and Stability of Dense Linear Algebra Algorithms** 17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation.

Paris, France, July 2005.

**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**

- Householder XVI Symposium on Numerical Linear Algebra.

Silver Springs Mountain Resort, PA, May 2005.

- IV International Workshop on Accurate Solution of Eigenvalue Problems (IWASEP4).

Split, Croatia, June 2002.

Poster.

**The Science of Deriving Dense Linear Algebra Algorithms**

- University of Manchester, Manchester, UK, August 2004.

Host: Nicholas Higham.

- PARA'04, Workshop on State-of-the-Art in Scientific Computing.

Lyngby, Denmark, June 2004.

- Lawrence Berkeley National Lab, Berkeley, CA, June 2003.

Host: Esmond Ng.

- Institute for Informatics and Telematics, Pisa, Italy, July 2002.

Host: Marco Pellegrini.

**Automatic Derivation and Implementation of Parallel Libraries** PARA'04, Workshop on State-of-the-Art in Scientific Computing.

Lyngby, Denmark, June 2004.