Talks - Paolo Bientinesi

  1. HPTT: A High-Performance Tensor Transposition C++ Library
    4th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.
    June 2017.
    Recently we presented TTC, a domain-specific compiler for tensor transpositions. Although the performance of the generated code is nearly optimal, TTC's offline nature means it cannot be used in application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures by replacing only the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework, guided by a performance model, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
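    The idea of "loops around a micro-kernel" can be illustrated with a minimal sketch. It is restricted to a 2D, row-major matrix and written in plain Python, so it shows only the loop structure, not HPTT's vectorized implementation or its handling of general tensors. The names micro_kernel_transpose, blocked_transpose and BLOCK are hypothetical and are not part of HPTT's API.

```python
# Sketch only: a blocked 2D transpose built from loops around a fixed-size
# micro-kernel. All names here are hypothetical, not HPTT's API.

BLOCK = 4  # tile edge, mirroring the "4x4 transpose" example in the abstract


def micro_kernel_transpose(src, dst, rows, cols, i0, j0, bi, bj):
    """Transpose the bi x bj tile of src that starts at (i0, j0) into dst.

    src is a row-major rows x cols matrix stored as a flat list;
    dst is the row-major cols x rows result.
    """
    for i in range(i0, i0 + bi):
        for j in range(j0, j0 + bj):
            dst[j * rows + i] = src[i * cols + j]


def blocked_transpose(src, rows, cols, block=BLOCK):
    """Return the transpose of a row-major rows x cols matrix."""
    dst = [0] * (rows * cols)
    # The outer loops walk over tiles; only the micro-kernel touches elements.
    for i0 in range(0, rows, block):
        bi = min(block, rows - i0)  # partial tile at the bottom edge
        for j0 in range(0, cols, block):
            bj = min(block, cols - j0)  # partial tile at the right edge
            micro_kernel_transpose(src, dst, rows, cols, i0, j0, bi, bj)
    return dst
```

    In this sketch, porting to another architecture would mean swapping out only micro_kernel_transpose while the tile loops stay unchanged, which is the property the abstract attributes to the modular design.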
  2. Particle-Particle Particle-Mesh (P3M) on Knights Landing Processors
    SIAM Conference on Computational Science and Engineering.
    February 2017.
  3. The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
    • SIAM Conference on Computational Science and Engineering.
      February 2017.
    • SC 2016.
      November 2016.
  4. The Landscape of High-Performance Tensor Contractions
    Workshop on Batched, Reproducible, and Reduced Precision BLAS.
    February 2017.
  5. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication
    SIAM Conference on Computational Science and Engineering.
    February 2017.
  6. IPCC @ RWTH Aachen University: Optimization of multibody and long-range solvers in LAMMPS
    IPCC Showcase November 2016.
    November 2016.
  7. The Matrix Chain Algorithm to Compile Linear Algebra Expressions
    DSLDI 2016.
    31 October 2016.
  8. Accelerating Particle-Particle Particle-Mesh Methods for Molecular Dynamics
    IPCC Toulouse.
    October 2016.
  9. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication
    BLIS Retreat 2016.
    September 2016.
    We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so-called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open-source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: while GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
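    The abstract does not name the three index sets. A plausible reading, assumed here, is the usual GEMM analogy: indices shared by A and C (free indices of A), indices shared by B and C (free indices of B), and indices shared by A and B (contracted indices). The sketch below classifies the indices of a contraction C = A * B under that assumption and reports the dimensions of the "equivalent-sized GEMM". The function names are hypothetical and are not taken from TCCG.

```python
from math import prod


def classify_indices(idx_a, idx_b, idx_c):
    """Split the indices of C = A * B into three sets (assumed grouping).

    Each argument is a sequence of index labels, e.g. "akc".
    Returns (free_a, free_b, contracted) as lists of labels.
    """
    set_a, set_b, set_c = set(idx_a), set(idx_b), set(idx_c)
    # Free indices appear in the output and in exactly one input.
    free_a = [i for i in idx_c if i in set_a and i not in set_b]
    free_b = [i for i in idx_c if i in set_b and i not in set_a]
    # Contracted indices appear in both inputs but not in the output.
    contracted = [i for i in idx_a if i in set_b and i not in set_c]
    return free_a, free_b, contracted


def equivalent_gemm_shape(sizes, idx_a, idx_b, idx_c):
    """Return (m, n, k) of a GEMM with the same amount of work.

    sizes maps each index label to its extent.
    """
    free_a, free_b, contracted = classify_indices(idx_a, idx_b, idx_c)
    m = prod(sizes[i] for i in free_a)
    n = prod(sizes[i] for i in free_b)
    k = prod(sizes[i] for i in contracted)
    return m, n, k
```

    For example, for C[a,b,c] = A[a,k,c] * B[k,b] the call classify_indices("akc", "kb", "abc") groups a and c as free indices of A, b as the free index of B, and k as the contracted index.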
  10. TTC: A Tensor Transposition Compiler for Multiple Architectures
    3rd ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY).
    June 2016.
    We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open-source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system’s peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel’s Knights Corner as well as different CUDA-based GPUs such as NVIDIA’s Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general-purpose counterpart significantly: for instance, comparing with Intel’s latest C++ compiler on the Haswell and Knights Corner architectures, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC’s support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.
  11. TTC: A Compiler for Tensor Transpositions
    SIAM Conference on Parallel Processing for Scientific Computing.
    Université Pierre et Marie Curie, Paris, 14 April 2016.
    We present TTC, an open-source compiler for multidimensional tensor transpositions. Thanks to a range of optimizations (software prefetching, blocking, loop-reordering, explicit vectorization), TTC generates high-performance parallel C/C++ code. We use generic heuristics and auto-tuning to manage the huge search space. Performance results show that TTC achieves close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, yielding performance gains of up to 15× over modern compilers.
  12. Exploring OpenMP Task Priorities on the MR3 Eigensolver
    SIAM Conference on Parallel Processing for Scientific Computing.
    Université Pierre et Marie Curie, Paris, 12 April 2016.
    As part of the OpenMP 4.1 draft, the runtime incorporates task priorities. We use the Method of Multiple Relatively Robust Representations (MR3), for which a pthreads-based task parallel version already exists (MR3SMP), to analyze and compare the performance of MR3SMP with three different OpenMP runtimes, with and without the support of priorities. From a dataset consisting of application matrices, it appears that OpenMP is always on par with or better than the pthreads implementation.
  13. The ELAPS Framework: Experimental Linear Algebra Performance Studies
    SIAM Conference on Parallel Processing for Scientific Computing.
    Université Pierre et Marie Curie, Paris, April 2016.
    The multi-platform open-source framework ELAPS facilitates easy and fast, yet powerful performance experimentation and prototyping of dense linear algebra algorithms. In contrast to most existing performance analysis tools, it targets the initial stages of the development process and assists developers in both algorithmic and optimization decisions. Users construct experiments to investigate how performance and efficiency vary from one algorithm to another, depending on factors such as caching, algorithmic parameters, problem size, and parallelism.
  14. Optimization of multibody and long-range solvers in LAMMPS
    Intel PCC EMEA Meeting.
    Ostrava, March 2016.
  15. The Tersoff many-body potential: Sustainable performance through vectorization
    SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation.
    November 2015.
  16. A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics
    International Supercomputing Conference (ISC 15).
    Frankfurt, Germany, July 2015.
    Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea is to adopt a cutoff-based method in which the cutoff is chosen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algorithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and massively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMUC and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, and for large numbers of particles the performance is considerably superior. For JUQUEEN, we provide timings for simulations running on the full system (458,752 cores), and show nearly perfect strong and weak scaling.
  17. ELAPS: Experimental Linear Algebra Performance Studies
    University of Texas at Austin, March 2015.
    Live demo.
  18. Bringing knowledge into HPC
    Symposium on High Performance Computing.
    Universitaet Basel, Switzerland, October 2014.
  19. Can Numerical Linear Algebra make it in Nature?
    Householder Symposium XIX, Spa, Belgium, June 2014.
  20. Performance Prediction for Tensor Contractions
    PASC 14.
    ETH Zürich, Zürich, Switzerland, June 2014.
  21. High-performance and automatic computing for simulation science
    Jülich Supercomputing Centre, Kickoff workshop, Simulation Lab "ab initio Methods in Chemistry and Physics", Jülich, Germany, November 2013.
  22. High-Performance and Automatic Computing
    Goethe Universität Frankfurt am Main, Big Data Lab, October 2013.
  23. Recent Trends in Dense Linear Algebra
    ComplexHPC Spring School 2013.
    Uppsala University, Uppsala, Sweden, June 2013.
    Invited lecturer.
  24. Improved Accuracy for MR3-based Eigensolvers
    SIAM Conference on Computational Science and Engineering (SIAM CSE13).
    February 2013.
    A number of algorithms exist for the dense Hermitian eigenproblem. In many cases, MRRR is the fastest one, although it does not deliver the same accuracy as Divide&Conquer or the QR algorithm. We demonstrate how the use of mixed precisions in MRRR-based eigensolvers leads to an improved orthogonality, even surpassing the accuracy of DC and QR. Our approach comes with limited performance penalty, and increases both robustness and scalability.
  25. Genome-Wide Association Studies: Computing Petaflops over Terabytes of Data
    Blue Gene Active Storage Workshop.
    Jülich Supercomputing Centre, Jülich, Germany, January 2013.
    Invited speaker.
  26. First Steps Towards a Linear Algebra Compiler
    ETH Zürich, Zürich, Switzerland, December 2012.
    Host: Markus Pueschel.
  27. A Compiler for Linear Algebra Operations
    Workshop on Libraries and Autotuning for Extreme-Scale Systems (CScADS '12).
    Snowbird, Utah, August 2012.
    Invited speaker.
  28. Automatic Modeling and Ranking of Algorithms
    The Seventh International Workshop on Automatic Performance Tuning (iWAPT 2012).
    Kobe, Japan, July 2012.
    Invited speaker.
    In the context of automatic generation of linear algebra algorithms, it is not uncommon to find dozens of algorithmic variants, all equivalent mathematically, but different in terms of accuracy and performance. In this talk I discuss how to rank the variants automatically, without executing them. In principle, one can attempt a fully analytical approach, creating performance models that take into account both the structure of the algorithm and the features of the processor; unfortunately, due to the intricacies of the memory system, currently this solution is not at all systematic. By contrast, I present an approach based on the automatic modeling of the routines that represent the building blocks for linear algebra algorithms. Performance predictions are then made by composing evaluations of such models. Our experiments show how this approach can be applied to both algorithm ranking and parameter tuning, yielding remarkably accurate results.
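    The step "performance predictions are then made by composing evaluations of such models" can be sketched as follows. Each building-block routine gets a model that maps operand sizes to an estimated time, and a variant is scored by summing the model evaluations for its sequence of kernel calls, without running it. Everything in this sketch is an assumption made for illustration: the model functions, their coefficients (placeholders, not measurements), and the names KERNEL_MODELS, predict_time and rank_variants. In the approach described in the talk the models are generated automatically, not written by hand as here.

```python
# Sketch only: rank algorithmic variants by composing per-kernel models.
# The coefficients below are placeholders, not measured data.


def gemm_model(m, n, k):
    """Hypothetical time estimate (seconds) for an m x n x k matrix product."""
    return 2.0 * m * n * k * 1.0e-10


def trsm_model(m, n):
    """Hypothetical time estimate (seconds) for a triangular solve."""
    return 1.0 * m * m * n * 2.0e-10


KERNEL_MODELS = {"gemm": gemm_model, "trsm": trsm_model}


def predict_time(calls):
    """Sum the model evaluations for a sequence of (kernel, sizes) calls."""
    return sum(KERNEL_MODELS[name](*sizes) for name, sizes in calls)


def rank_variants(variants):
    """Return variant names sorted from fastest to slowest predicted time.

    variants maps a name to its list of (kernel, sizes) calls.
    """
    return sorted(variants, key=lambda name: predict_time(variants[name]))
```

    A variant is described purely by the kernel calls it would issue, so ranking requires no execution of the variants themselves.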
  29. Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures
    40th SPEEDUP Workshop on High-Performance Computing.
    Basel, Switzerland, February 2012.
    Plenary speaker.
    Eigenproblems are at the core of a myriad of engineering and scientific applications. In many cases, the size of such problems is not determined by the physics of the application but is limited by the eigensolver's time and memory requirements. While on the one hand the demand is for larger problems, on the other hand the available numerical libraries are not exploiting the parallelism provided in the modern computing environments. In this talk I compare two approaches to parallelism --one that relies on fast multithreaded libraries (BLAS), and another that uses blocking and careful scheduling-- and show that the right choice depends, among other factors, on the specific operation performed. I will then introduce two eigensolvers specifically designed for high-performance architectures. MR3-SMP targets multicore architectures, while EleMRRR is well suited for both small clusters and massively parallel computers which may or may not use multithreading. Experiments on application matrices indicate that our algorithms are both faster and obtain better speedups than all the eigensolvers from LAPACK, ScaLAPACK and Intel's Math Kernel Library (MKL). Finally, I will discuss the use of graphics processing units.
  30. Automation in Computational Biology
    9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011).
    Torun, Poland, September 2011.
    Keynote speaker.
    In the past 30 years the development of linear algebra libraries has been tremendously successful, resulting in a variety of reliable and efficient computational kernels. Unfortunately these kernels, meant to become the building blocks for scientific and engineering applications, are not designed to exploit knowledge specific to the target application. If properly used, this extra knowledge may lead to domain-specific algorithms that attain higher performance than any traditional library. As a case study, we look at a common operation in computational biology, the computation of mixed-effects models; in particular, we consider the use of mixed models in the context of genome analysis. At the core of this operation lies a generalized least squares (GLS) problem; GLS may be solved directly with Matlab, or may be reduced to a form accepted by LAPACK. Neither solution can exploit the special structure of GLS within genome analysis. Specifically, this application requires solving not one GLS problem, but a large two-dimensional parametric sequence of them. Since the performance of an algorithm is directly affected by the choice of parameters, a family of algorithms is needed. In this talk we show how automation helps. We introduce a symbolic system, written in Mathematica, that takes as input a matrix equation and automatically returns a family of algorithms to solve the equation. The system has knowledge of matrix properties, matrix factorizations, and rules of linear algebra; it decomposes the input equation into a sequence of building blocks and maps them onto available high-performance kernels. Automation is achieved through extensive use of pattern matching and rewrite rules. When applied to GLS in the context of genome analysis, it generates algorithms that outperform LAPACK by a factor of six.
  31. A Modular and Systematic Approach to Stability Analysis
    Householder Symposium XVIII.
    Tahoe City, CA, June 2011.
  32. MR3-SMP and PMRRR: Fast and Scalable Eigensolvers
    25th Umbrella Symposium.
    Aachen, Germany, June 2011.
  33. Solvers and Eigensolvers for Multicore Processors
    Max-Planck-Institut fuer biologische Kybernetik, Tuebingen, Germany, March 2011.
    Invited speaker.
  34. Goal-oriented and Modular Stability Analysis
    Conference on Numerical Linear Algebra: Perturbation, Performance and Portability.
    A Conference in Celebration of G.W. (Pete) Stewart's 70th Birthday, Austin, TX, July 2010.
    Invited speaker.
  35. The Algorithm of Multiple Relatively Robust Representations for Multicore Processors
    PARA 2010: State of the Art in Scientific and Parallel Computing.
    Reykjavik, Iceland, June 2010.
    The algorithm of Multiple Relatively Robust Representations, in short MRRR or $\mbox{MR}^3$, computes $k$ eigenvalues and eigenvectors of a symmetric tridiagonal matrix in $O(nk)$ arithmetic operations. While parallel implementations suited for distributed-memory computer systems exist for the largest matrices arising in applications, small to medium size problems can make use of LAPACK's implementation xSTEMR. However, xSTEMR does not take advantage of today's multi-core and future many-core architectures, as it is optimized for single-core CPUs. In this paper we discuss some of the issues and trade-offs arising in an efficient implementation especially suited for multi-core CPUs and SMP systems. A set of experiments on application matrices shows that our algorithm is both faster and obtains better speedups than all tridiagonal eigensolvers from LAPACK and Intel's Math Kernel Library (MKL).
  36. At the Heart of the Automation of Linear Algebra Algorithms
    Workshop on Program Composition and Optimization.
    Dagstuhl, Germany, May 2010.
    It is well understood that in order to attain high performance for linear algebra operations over multiple architectures and settings, not just one, but a family of loop-based algorithms has to be generated and optimized. In the past we have demonstrated that algorithms and routines can be derived automatically, using a procedure based on symbolic computations and classical formal derivation techniques. At the heart of such a procedure lie the Partitioned Matrix Expressions (PMEs) of the target operation; these expressions describe how parts of the output operands can be represented in terms of parts of the input operands. The PMEs are the unifying element for all the algorithms in the family, as they encapsulate the necessary knowledge for generating each one of them. Until now, the PMEs were considered inputs to the derivation procedure, i.e., the users had to provide them. In this talk we discuss how even the PMEs can be generated automatically from a high-level formal description of the operation. We conclude by demonstrating how automation becomes critical in complex, high-dimensional scenarios.
  37. Linear Algebra on Multicore Architectures
    • School on High-performance Computing in Geophysics, Novosibirsk State University, Novosibirsk, Russia, September 2009.
      Invited lecturer.
    • MIT-AICES Workshop.
      Aachen, Germany, March 2009.
  38. Numerical Methods for Large Linear Systems
    3rd LHC Detector Alignment Workshop.
    CERN, Geneva, Switzerland, June 2009.
    Invited speaker.
  39. Automatic Computing
    North Rhine-Westphalian Academy of Sciences and Humanities, Duesseldorf, Germany, April 2009.
    2009 Karl Arnold Prize acceptance speech.
  40. Computational Mathematics (in Italian, "Matematica Computazionale")
    Woche zu Italien, RWTH Aachen.
    Aachen, Germany, March 2009.
    Invited speaker.
  41. Algorithm & Code Generation for High-Performance Computing
    Tag der Informatik, RWTH Aachen, Aachen, Germany, December 2008.
  42. Scientific Computing: Applications, Algorithms, Architectures
    • AICES Workshop, Monschau, Germany, November 2008.
    • Colorado State University, Fort Collins, CO, March 2008.
    • RWTH Aachen, Aachen, Germany, January 2008.
      Hosts: Marek Behr and Chris Bischof.
  43. Multicore Processors: What Kind of Parallelism?
    AICES, RWTH Aachen, Aachen, Germany, June 2008.
    CCES Seminar Series.
  44. Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures
    PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing.
    Trondheim, Norway, June 2008.
  45. Generation of Dense Linear Algebra Software for Shared Memory and Multicore Architectures
    • Workshop on Automating the Development of Scientific Computing Software.
      Baton Rouge, LA, March 2008.
      Invited speaker.
    • Microsoft Corporation, Redmond, WA, March 2008.
      Host: Laurent Visconti.
  46. Streaming 2D FFTs on the Cell Broadband Engine
    DESA Workshop.
    Washington, DC, December 2007.
  47. Dense Linear Algebra on Multicore Architectures: What Kind of Parallelism?
    • CScADS. Workshop on Automatic Tuning for Petascale Systems.
      Snowbird, Utah, July 2007.
    • ICIAM07: 6th International Congress on Industrial and Applied Mathematics.
      Zürich, Switzerland, July 2007.
  48. Sparse Direct Factorizations Based on Unassembled Hyper-Matrices
    ICIAM07: 6th International Congress on Industrial and Applied Mathematics.
    Zürich, Switzerland, July 2007.
  49. Can Computers Develop Libraries? A Different Perspective on Scientific Computing
    The University of Chicago, Chicago, IL, February 2007.
    Host: Ridgway Scott.
  50. Mechanical Generation of Correct Linear Algebra Algorithms
    • Duke University, Durham, NC, October 2006.
      Host: Xiaobai Sun.
    • Rice University, Houston, TX, September 2006.
      Hosts: John Mellor-Crummey, Ken Kennedy.
    • University of Manchester, Manchester, UK, June 2006.
      Host: Chris Taylor.
    • Georgia Institute of Technology, Atlanta, GA, March 2006.
      Host: Richard Fujimoto.
    • Carnegie Mellon University, Pittsburgh, PA, February 2006.
      Host: Markus Pueschel.
    • Oxford University, Oxford, UK, January 2006.
      Host: Richard Bird.
  51. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms
    University of Texas, Austin, TX, July 2006.
    Dissertation Defense (7 July).
  52. Mechanical Generation of Linear Algebra Libraries with Multiple Variants
    • SIAM Conference on Parallel Processing for Scientific Computing.
      San Francisco, CA, February 2006.
    • University of Washington, Seattle, WA, December 2005.
      Host: Larry Snyder.
    • Caltech Center for Advanced Computing Research (CACR), California Institute of Technology, Pasadena, CA, June 2005.
      Host: Mark Stalzer.
    • Argonne National Laboratory, Argonne, IL, March 2005.
      Host: Jorge Moré.
    • IBM T.J. Watson Research Center, Yorktown Heights, January 2005.
      Host: John Gunnels.
    • New York University, New York, NY, January 2005.
      Host: Michael Overton.
  53. Formal Correctness and Stability of Dense Linear Algebra Algorithms
    17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation.
    Paris, France, July 2005.
  54. A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations
    • Householder XVI Symposium on Numerical Linear Algebra.
      Seven Springs Mountain Resort, PA, May 2005.
    • IV International Workshop on Accurate Solution of Eigenvalue Problems (IWASEP4).
      Split, Croatia, June 2002.
  55. The Science of Deriving Dense Linear Algebra Algorithms
    • University of Manchester, Manchester, UK, August 2004.
      Host: Nicholas Higham.
    • PARA'04, Workshop on State-of-the-Art in Scientific Computing.
      Lyngby, Denmark, June 2004.
    • Lawrence Berkeley National Lab, Berkeley, CA, June 2003.
      Host: Esmond Ng.
    • Institute for Informatics and Telematics, Pisa, Italy, July 2002.
      Host: Marco Pellegrini.
  56. Automatic Derivation and Implementation of Parallel Libraries
    PARA'04, Workshop on State-of-the-Art in Scientific Computing.
    Lyngby, Denmark, June 2004.