Recent Talks

  1. Optimizing the ChASE eigensolver for Bethe-Salpeter computations
    7th Workshop of the Joint Laboratory for Extreme Scale Computing.
    17 July 2017.
    The Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) is an iterative eigensolver developed at the JSC by the SimLab Quantum Materials. The solver mainly targets sequences of dense eigenvalue problems as they arise in Density Functional Theory, but can also work on the single eigenproblem. ChASE leverages on the predominant use of BLAS 3 subroutines to achieve close-to-peak performance and potentially achieve scalability over hundreds if not thousands of computing nodes. We have recently succeeded to integrate a version of the ChASE library within the Jena BSE code. Preliminary comparison between ChASE and the conjugate gradient eigensolver (KSCG), previously used by the Jena BSE code, shows that ChASE can outperform KSCG with speedups up to 5X. In this talk we illustrate our latest results and give an outlook of the scientific problems that can be tackled once the integration is successfully completed.
    abstractwebPDFhide
  2. Linear algebra tasks in Materials Science: optimization and portability
    Accelerated Data and Computing Workshop.
    July 2017.
  3. HPTT: A High-Performance Tensor Transposition C++ Library
    4th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming.
    June 2017.
    Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
    abstractwebhide
  4. Linnea: Automatic Generation of Efficient Linear Algebra Programs
    May 2017.
    May 17, University of Nevada, Las Vegas. May 24, Massachusetts Institute of Technology.
  5. Distributed parallel non-equilibrium Green’s function approach to inelastic charge transport
    GAMM 2017.
    7 March 2017.
  6. Particle-Particle Particle-Mesh (P3M) on Knights Landing Processors
    SIAM Conference on Computational Science and Engineering.
    February 2017.
  7. The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
    SIAM Conference on Computational Science and Engineering.
    February 2017.
  8. The Landscape of High-Performance Tensor Contractions
    Workshop on Batched, Reproducible, and Reduced Prevision BLAS.
    February 2017.
  9. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication
    SIAM Conference on Computational Science and Engineering.
    February 2017.
  10. A Compiler for Linear Algebra Operations
    ACM Student Research Competition at SPLASH 2016.
    3 November 2016.