
Automatic track mixing
Using state-of-the-art algorithms for beat tracking and structure analysis, we aim to create a continuous flow of music with seamless transitions between tracks.
The objective is a fully automated mixing system with a focus on electronic dance music.
Contact: Mickaël Zehren

Experimental Linear Algebra Performance Studies (ELAPS)
The ELAPS Framework is a multi-platform, open-source environment for
fast yet powerful experimentation with dense linear algebra kernels,
algorithms, and libraries.
Get ELAPS now!
http://github.com/HPAC/ELAPS/

Intel® Parallel Computing Center at RWTH
The IPCC @ RWTH aims to optimize the most important computational kernels in the LAMMPS molecular dynamics package for Intel® architectures.
Contact: Markus Höhnerbach,
Rodrigo Canales

Linear Algebra
The Generalized Matrix Chain (GMC) algorithm, which is part of Linnea, generates code that substantially outperforms high-level languages for linear algebra, as well as C++ expression template libraries.
The Generalized Matrix Chain Algorithm
Henrik Barthels, Marcin Copik and Paolo Bientinesi

Non-Equilibrium Green's Function (NEGF)
A highly efficient and optimized implementation of quantum transport
in mesoscopic systems was introduced for simulating novel
nano-transistors and quantum photovoltaic devices. These simulations are
based on software developed within the Non-Equilibrium Green's
Functions (NEGF) framework, which is an advanced approach that allows
for treatment of out-of-equilibrium transport phenomena.
Contact: Sebastian Achilles
 
Algorithm Generation
yields hundreds of implementations for tensor contractions.
On the Performance Prediction of BLAS-based Tensor Contractions
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.
@inproceedings{Peise2015:380,
author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
title = "On the Performance Prediction of BLAS-based Tensor Contractions",
booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
year = 2015,
editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
volume = 8966,
series = "Lecture Notes in Computer Science",
pages = "193--212",
month = apr,
publisher = "Springer International Publishing",
url = "http://arxiv.org/pdf/1409.8608v1"
}
Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
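As a rough illustration of the decomposition idea in the abstract above, the following sketch computes one small contraction, C[a][b][c] = sum over i of A[a][i] * B[i][b][c], in two different ways: as a sequence of matrix-matrix products and as a sequence of matrix-vector products. The contraction, the data, and the helper names (gemm, gemv, contract_via_gemm, contract_via_gemv) are assumptions chosen for illustration; the helpers are pure-Python stand-ins for BLAS kernels, not a BLAS binding and not the code from the paper.

```python
# Pure-Python stand-ins for BLAS kernels (hypothetical helpers, not a real BLAS binding).
def gemv(A, x):
    """Matrix-vector product y = A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def gemm(A, B):
    """Matrix-matrix product C = A B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def contract_reference(A, B):
    """C[a][b][c] = sum_i A[a][i] * B[i][b][c], written as plain nested loops."""
    na, ni = len(A), len(B)
    nb, nc = len(B[0]), len(B[0][0])
    return [[[sum(A[a][i] * B[i][b][c] for i in range(ni))
              for c in range(nc)] for b in range(nb)] for a in range(na)]


def contract_via_gemm(A, B):
    """Decomposition 1: one gemm per index c, C[:, :, c] = A @ B[:, :, c]."""
    na, ni = len(A), len(B)
    nb, nc = len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for c in range(nc):
        B_slice = [[B[i][b][c] for b in range(nb)] for i in range(ni)]
        M = gemm(A, B_slice)
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = M[a][b]
    return C


def contract_via_gemv(A, B):
    """Decomposition 2: one gemv per index pair (b, c), C[:, b, c] = A @ B[:, b, c]."""
    na, ni = len(A), len(B)
    nb, nc = len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for b in range(nb):
        for c in range(nc):
            x = [B[i][b][c] for i in range(ni)]
            y = gemv(A, x)
            for a in range(na):
                C[a][b][c] = y[a]
    return C


# Small integer example so that all variants can be compared exactly.
A = [[1, 2], [3, 4], [5, 6]]                      # shape 3 x 2
B = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]          # shape 2 x 2 x 2

print(contract_via_gemm(A, B) == contract_reference(A, B))
print(contract_via_gemv(A, B) == contract_reference(A, B))
```

Both decompositions produce the same tensor but issue different kernel sequences (a few large gemm calls versus many small gemv calls). Choosing among such alternatives without running them is the prediction problem the paper addresses; this sketch does not attempt that prediction.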

Tersoff Potential Optimization
The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16, Number 7, pp. 7:1-7:13, IEEE Press, 2016. Selected for Reproducibility Initiative at SC17.
@inproceedings{Höhnerbach2016:78,
author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi",
title = "The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability",
booktitle = "Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis",
year = 2016,
number = 7,
series = "SC'16",
pages = "7:1--7:13",
publisher = "IEEE Press",
note = "Selected for Reproducibility Initiative at SC17",
url = "https://arxiv.org/pdf/1607.02904v1"
}
Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multi-body potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.
The Tersoff many-body potential: Sustainable performance through vectorization
Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation, November 2015.
@inproceedings{Höhnerbach2015:718,
author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi",
title = "The Tersoff many-body potential: Sustainable performance through vectorization",
year = 2015,
month = nov,
url = "http://hpac.rwth-aachen.de/ipcc/sc15_paper.pdf"
}
This contribution discusses the effectiveness and the sustainability of
vectorization in molecular dynamics. As a case study, we present results for
multi-body potentials on a range of vector instruction sets, targeting both
CPUs and accelerators such as GPUs and the Intel Xeon Phi.
The shared-memory and distributed-memory parallelization of MD simulations are
problems that have been extensively treated; by contrast, vectorization is a
relatively unexplored dimension. However, given both the increasing number of
vector units available in the computing architectures, and the increasing
width of such units, it is imperative to write algorithms and code for
taking advantage of vector instructions.
We chose to investigate multi-body potentials, as they are not as easily
vectorizable as the pair potentials. As a matter of fact, their optimization
pushes the boundaries of current compiler technology: our experience suggests
that for such problems, compilers alone are not able to deliver appreciable
speedups. By contrast, by resorting to the explicit use of intrinsics, we
demonstrate significant improvements both on CPUs and the Intel Xeon Phi.
Nonetheless, it is clear that the direct use of intrinsics is not a
sustainable solution, not only for the effort and expertise it requires, but also
because it hinders readability, extensibility and maintainability of code.
In our case study, we wrote a C++ implementation of the Tersoff potential for the LAMMPS
molecular dynamics simulation package.
To alleviate the difficulty of using intrinsics, we used a template-based
approach and abstracted from the instruction set, the vector length and the floating-point
precision.
Instead of having to write 18 different kernels,
this approach allowed us to only have one source code,
from which all implementations (scalar, SSE, AVX, AVX2, AVX512, and IMCI, in
single, double, and mixed precision) are automatically derived.
Such an infrastructure makes it possible to compare kernels and identify which
algorithms are suitable for which architecture.
Without optimizations, we observed that an Intel Xeon E5-2650 CPU (Sandy Bridge)
is considerably faster than a Xeon Phi.
With our optimizations, the performance of both CPU and accelerator improved,
and thanks to its wide vector units, the Xeon Phi takes the lead.
For the future, besides monitoring the compilers' improvements, one should
evaluate OpenMP 4.1 and its successors, which will include support for directive-based vectorization.
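The figure of 18 kernels mentioned in the abstract above follows from combining the six listed targets with the three listed precisions. The short sketch below only enumerates those combinations; the variable names are illustrative assumptions and the snippet is unrelated to the actual C++ template code.

```python
from itertools import product

# Targets and precisions exactly as listed in the abstract.
INSTRUCTION_SETS = ["scalar", "SSE", "AVX", "AVX2", "AVX512", "IMCI"]
PRECISIONS = ["single", "double", "mixed"]

# Every (target, precision) pair corresponds to one kernel variant
# that would otherwise have to be written by hand.
variants = list(product(INSTRUCTION_SETS, PRECISIONS))

print(len(variants))  # 6 targets x 3 precisions = 18
for isa, precision in variants:
    print(f"{isa}/{precision}")
```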
