Automatic track mixing

Using state-of-the-art algorithms for beat tracking and structure analysis, we aim to create a continuous flow of music with seamless transitions between tracks.

The objective is a fully automated mixing system with a focus on electronic dance music.

Contact: Mickaël Zehren

Experimental Linear Algebra Performance Studies (ELAPS)

The ELAPS Framework is a multi-platform open source environment for fast yet powerful experimentation with dense linear algebra kernels, algorithms, and libraries.

Get ELAPS now!
Intel Xeon Phi Coprocessor

Intel® Parallel Computing Center at RWTH

The IPCC @ RWTH aims to optimize the most important computational kernels in the LAMMPS molecular dynamics package for Intel® architectures.

Contact: Markus Höhnerbach, Rodrigo Canales

Accelerated Carbon Nanotube Calculation

Linear Algebra

The Generalized Matrix Chain (GMC) algorithm, which is part of Linnea, generates code that substantially outperforms high-level languages for linear algebra, as well as C++ expression template libraries.

The Generalized Matrix Chain Algorithm
Henrik Barthels, Marcin Copik and Paolo Bientinesi
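
For orientation, the classic matrix chain problem that GMC generalizes can be sketched in a few lines. This is the textbook dynamic program, not the GMC algorithm or Linnea's code generator; the function name `matrix_chain_order`, the labels `M0`, `M1`, ... and the cost model of 2*p*q*r flops per product are assumptions made for illustration.

```python
from functools import lru_cache


def matrix_chain_order(dims):
    """Return (min_flops, parenthesization) for a chain of matrix products.

    dims has length n + 1: matrix i has shape dims[i] x dims[i + 1].
    Assumed cost model: a (p x q) times (q x r) product costs 2*p*q*r flops.
    """
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to compute the product of matrices i..j (inclusive).
        if i == j:
            return 0, f"M{i}"
        candidates = []
        for k in range(i, j):
            left_cost, left_expr = best(i, k)
            right_cost, right_expr = best(k + 1, j)
            product_cost = 2 * dims[i] * dims[k + 1] * dims[j + 1]
            candidates.append(
                (left_cost + right_cost + product_cost,
                 f"({left_expr} {right_expr})")
            )
        return min(candidates)

    return best(0, n - 1)


if __name__ == "__main__":
    # Shapes: M0 is 10x100, M1 is 100x5, M2 is 5x50.
    print(matrix_chain_order([10, 100, 5, 50]))
```

For these shapes, evaluating (M0 M1) first costs 15000 flops under the assumed model, while evaluating (M1 M2) first costs 150000, which is why the order of evaluation matters.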
NEGF simulation

Non-Equilibrium Green's Function (NEGF)

A highly efficient and optimized implementation of quantum transport in mesoscopic systems was introduced for simulating novel nano-transistors and quantum photovoltaic devices. These simulations are based on software developed within the Non-Equilibrium Green's Functions (NEGF) framework, which is an advanced approach that allows for treatment of out-of-equilibrium transport phenomena.

Contact: Sebastian Achilles

Supercomputing 2017

At the SC17 conference, HPACers won the second and third prizes in the ACM Student Research Competition.

We also received an ACM SIGHPC Certificate of Appreciation for our support of the Student Cluster Competition and the reproducibility initiative.
  1. A01: GEMM-Like Tensor-Tensor Contraction (GETT)
    Paul Springer and Paolo Bientinesi
  2. A04: Optimization of the AIREBO Many-Body Potential for KNL
    Markus Höhnerbach and Paolo Bientinesi

Algorithm Generation

yields hundreds of implementations for tensor contractions.
  1. On the Performance Prediction of BLAS-based Tensor Contractions
    Elmar Peise, Diego Fabregat-Traver and Paolo Bientinesi
    High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation (eds. Stephen A. Jarvis, Steven A. Wright and Simon D. Hammond), Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.
    Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
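
    As a minimal sketch of what "breaking the contraction down into operations that only involve matrices" can look like, the example below computes one contraction in three ways: directly, as one matrix product per slice, and as a single matrix product on a flattened operand. The contraction C[a][b][c] = sum_i A[a][i] * B[i][b][c], all function names, and the pure-Python `matmul` standing in for a BLAS kernel are assumptions chosen for illustration; they are not taken from the paper.

```python
def matmul(X, Y):
    """Plain matrix-matrix product on nested lists (stand-in for a BLAS gemm)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def contract_direct(A, B):
    """Reference: C[a][b][c] = sum_i A[a][i] * B[i][b][c] with nested loops."""
    na, ni = len(A), len(B)
    nb, nc = len(B[0]), len(B[0][0])
    return [[[sum(A[a][i] * B[i][b][c] for i in range(ni))
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


def contract_by_slices(A, B):
    """Same contraction as one matrix product per slice B[:, :, c]."""
    na, ni = len(A), len(B)
    nb, nc = len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for c in range(nc):
        B_slice = [[B[i][b][c] for b in range(nb)] for i in range(ni)]  # ni x nb
        C_slice = matmul(A, B_slice)                                    # na x nb
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = C_slice[a][b]
    return C


def contract_flattened(A, B):
    """Same contraction as a single matrix product with (b, c) flattened."""
    ni = len(B)
    nb, nc = len(B[0]), len(B[0][0])
    # Flatten B into an ni x (nb * nc) matrix, multiply once, then unflatten.
    B_flat = [[B[i][b][c] for b in range(nb) for c in range(nc)]
              for i in range(ni)]
    C_flat = matmul(A, B_flat)
    return [[row[b * nc:(b + 1) * nc] for b in range(nb)] for row in C_flat]


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]                   # 3 x 2
    B = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]       # 2 x 2 x 2
    reference = contract_direct(A, B)
    print(contract_by_slices(A, B) == reference)
    print(contract_flattened(A, B) == reference)
```

    The variants produce the same tensor but issue different kernel calls, so their run times generally differ; the methodology described above ranks such alternatives from kernel micro-benchmarks instead of executing each one.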

Tersoff Potential Optimization

  1. The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
    Markus Höhnerbach, Ahmed E. Ismail and Paolo Bientinesi
    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16, Number 7, pp. 7:1-7:13, IEEE Press, 2016.
    Selected for Reproducibility Initiative at SC17.
    Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multibody potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.
  2. The Tersoff many-body potential: Sustainable performance through vectorization
    Markus Höhnerbach, Ahmed E. Ismail and Paolo Bientinesi
    Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation, November 2015.
    This contribution discusses the effectiveness and the sustainability of vectorization in molecular dynamics. As a case study, we present results for multi-body potentials on a range of vector instruction sets, targeting both CPUs and accelerators such as GPUs and the Intel Xeon Phi. The shared-memory and distributed-memory parallelization of MD simulations are problems that have been extensively treated; by contrast, vectorization is a relatively unexplored dimension. However, given both the increasing number of vector units available in the computing architectures, and the increasing width of such units, it is imperative to write algorithms and code for taking advantage of vector instructions. We chose to investigate multi-body potentials, as they are not as easily vectorizable as the pair potentials. As a matter of fact, their optimization pushes the boundaries of current compiler technology: our experience suggests that for such problems, compilers alone are not able to deliver appreciable speedups. By contrast, by resorting to the explicit use of intrinsics, we demonstrate significant improvements both on CPUs and the Intel Xeon Phi. Nonetheless, it is clear that the direct use of intrinsics is not a sustainable solution, not only for the effort and expertise it requires, but also because it hinders readability, extensibility and maintainability of code. In our case study, we wrote a C++ implementation of the Tersoff potential for the LAMMPS molecular dynamics simulation package. To alleviate the difficulty of using intrinsics, we used a template-based approach and abstracted from the instruction set, the vector length and the floating point precision. Instead of having to write 18 different kernels, this approach allowed us to only have one source code, from which all implementations (scalar, SSE, AVX, AVX2, AVX512, and IMCI, in single, double, and mixed precision) are automatically derived.
    Such an infrastructure makes it possible to compare kernels and identify which algorithms are suitable for which architecture. Without optimizations, we observed that an Intel Xeon E5-2650 CPU (Sandy Bridge) is considerably faster than a Xeon Phi. With our optimizations, the performance of both CPU and accelerator improved, and thanks to its wide vector units, the Xeon Phi takes the lead. For the future, besides monitoring the compilers' improvements, one should evaluate OpenMP 4.1 and its successors, which will include support for directive-based vectorization.

DSMC Vectorization Schemes

In direct simulation Monte Carlo, molecules collide at random.

We dynamically move cells in and out of the vector register and correlate collisions throughout the domain to reduce branching and improve vectorization efficiency.

Contact: William McDoniel

Open Positions

About HPAC

The High-Performance and Automatic Computing group is concerned with the development and analysis of accurate and efficient numerical algorithms, with a focus on numerical linear algebra. We target applications from materials science, molecular dynamics and computational biology, and the whole range of high-performance architectures.


  • Numerical linear algebra
    • Sequences of problems
    • Small scale operations
    • Tensor operations
    • Error analysis
    • Parallel eigensolvers
  • Parallelism
    • Vectorization
    • Multicore
    • Distributed-memory
    • Coprocessors: GPU, Xeon Phi, ...
  • Automation
    • Algorithm and code generation
    • Performance modeling and prediction
    • Algorithm ranking
  • Applications
    • Genome analysis
    • Molecular dynamics simulations
    • Symbolic algorithmic differentiation for matrix operations
    • Electronic structure calculations



HPAC is part of the Aachen Institute for Advanced Study in Computational Engineering Science (AICES) at RWTH Aachen. AICES is a graduate school established in 2006 within the framework of the German Excellence Initiative. It conducts interdisciplinary research at the interface between mathematics, computer science and engineering, as reflected in a collaborative effort of more than 25 institutes from 8 academic departments.




Our open source projects are available on GitHub.