|
Automatic track mixing
By using state-of-the-art algorithms for beat tracking and structure analysis, our goal is to create a continuous flow of music with seamless transitions between tracks.
The objective is a fully automated mixing system with focus on electronic dance music.
Contact: Mickaël Zehren
|
Experimental Linear Algebra Performance Studies (ELAPS)
The ELAPS Framework is a multi-platform open source environment for
fast yet powerful experimentation with dense linear algebra kernels,
algorithms, and libraries.
Get ELAPS now!
http://github.com/HPAC/ELAPS/
|
Intel® Parallel Computing Center at RWTH
The IPCC @ RWTH aims to optimize the most important computational kernels in the LAMMPS molecular dynamics package for Intel® architectures.
Contact: Markus Höhnerbach,
Rodrigo Canales
|
Linear Algebra
The Generalized Matrix Chain (GMC) algorithm, which is part of Linnea, generates code that substantially outperforms high-level languages for linear algebra, as well as C++ expression template libraries.
The Generalized Matrix Chain Algorithm
Henrik Barthels, Marcin Copik and Paolo Bientinesi
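For orientation, the classic matrix chain problem that the GMC algorithm's name refers to can be sketched as below. This is the textbook dynamic program for choosing a parenthesization of a product of plain matrices, not the GMC algorithm itself and not code from Linnea; the function name `matrix_chain_order` and the cost model (number of scalar multiplications) are assumptions made for illustration.

```python
def matrix_chain_order(dims):
    """Textbook matrix chain dynamic program (illustrative, not GMC).

    Matrix i has shape dims[i] x dims[i + 1].
    Returns (minimal number of scalar multiplications, parenthesization).
    """
    n = len(dims) - 1  # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]

    # Solve sub-chains of increasing length.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):
                # Cost of (M_i..M_k) times (M_k+1..M_j).
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    def paren(i, j):
        # Rebuild the parenthesization from the recorded split points.
        if i == j:
            return f"M{i}"
        k = split[i][j]
        return f"({paren(i, k)} {paren(k + 1, j)})"

    return cost[0][n - 1], paren(0, n - 1)
```

For shapes 10x100, 100x5 and 5x50, `matrix_chain_order([10, 100, 5, 50])` selects `((M0 M1) M2)` at 7500 scalar multiplications, whereas the other ordering would need 75000.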
|
Non-Equilibrium Green's Function (NEGF)
A highly efficient and optimized implementation of quantum transport
in mesoscopic systems was introduced for simulating novel
nano-transistors and quantum photovoltaic devices. These simulations are
based on software developed within the Non-Equilibrium Green's
Functions (NEGF) framework, which is an advanced approach that allows
for treatment of out-of-equilibrium transport phenomena.
Contact: Sebastian Achilles
| |
Algorithm Generation
Algorithm generation yields hundreds of implementations for tensor contractions.
- On the Performance Prediction of BLAS-based Tensor Contractions
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.
@inproceedings{Peise2015:380,
author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
title = "On the Performance Prediction of BLAS-based Tensor Contractions",
booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
year = 2015,
editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
volume = 8966,
series = "Lecture Notes in Computer Science",
pages = "193-212",
month = apr,
publisher = "Springer International Publishing",
url = "http://arxiv.org/pdf/1409.8608v1"
}
Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
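To illustrate how a single contraction can be broken down into matrix operations in more than one way, here is a minimal sketch for the contraction C[a][b][c] = sum over i of A[a][i] * B[i][b][c]. The choice of this contraction, the nested-list storage, and the names `matmul`, `contract_slicing_c` and `contract_slicing_b` are assumptions made for illustration; `matmul` merely stands in for a BLAS matrix-matrix kernel and is not how the paper's implementations are written.

```python
def matmul(X, Y):
    """Plain matrix-matrix product; a stand-in for a BLAS gemm call."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[i] * Y[i][j] for i in range(inner))
             for j in range(cols)] for row in X]


def contract_slicing_c(A, B):
    """Compute C[a][b][c] with one matrix product per value of c."""
    n_a, n_i, n_b, n_c = len(A), len(B), len(B[0]), len(B[0][0])
    C = [[[0] * n_c for _ in range(n_b)] for _ in range(n_a)]
    for c in range(n_c):
        # Slice B[:, :, c] is an n_i x n_b matrix.
        B_slice = [[B[i][b][c] for b in range(n_b)] for i in range(n_i)]
        C_slice = matmul(A, B_slice)  # n_a x n_b
        for a in range(n_a):
            for b in range(n_b):
                C[a][b][c] = C_slice[a][b]
    return C


def contract_slicing_b(A, B):
    """Compute the same C with one matrix product per value of b."""
    n_a, n_i, n_b = len(A), len(B), len(B[0])
    C = [[None] * n_b for _ in range(n_a)]
    for b in range(n_b):
        # Slice B[:, b, :] is an n_i x n_c matrix.
        B_slice = [B[i][b] for i in range(n_i)]
        C_slice = matmul(A, B_slice)  # n_a x n_c
        for a in range(n_a):
            C[a][b] = C_slice[a]
    return C
```

Both functions return the same tensor but traverse memory differently and call the matrix kernel with different shapes, which is why alternative decompositions of one contraction can differ in performance.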
|
Tersoff Potential Optimization
- The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16, Number 7, pp. 7:1-7:13, IEEE Press, 2016. Selected for Reproducibility Initiative at SC17.
@inproceedings{Höhnerbach2016:78,
author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi",
title = "The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability",
booktitle = "Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis",
year = 2016,
number = 7,
series = "SC'16",
pages = "7:1--7:13",
publisher = "IEEE Press",
note = "Selected for Reproducibility Initiative at SC17",
url = "https://arxiv.org/pdf/1607.02904v1"
}
Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multibody potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.
- The Tersoff many-body potential: Sustainable performance through vectorization
Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation, November 2015.
@inproceedings{Höhnerbach2015:718,
author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi",
title = "The Tersoff many-body potential: Sustainable performance through vectorization",
year = 2015,
month = nov,
url = "http://hpac.rwth-aachen.de/ipcc/sc15_paper.pdf"
}
This contribution discusses the effectiveness and the sustainability of
vectorization in molecular dynamics. As a case study, we present results for
multi-body potentials on a range of vector instruction sets, targeting both
CPUs and accelerators such as GPUs and the Intel Xeon Phi.
The shared-memory and distributed-memory parallelization of MD simulations are
problems that have been extensively treated; by contrast, vectorization is a
relatively unexplored dimension. However, given both the increasing number of
vector units available in the computing architectures, and the increasing
width of such units, it is imperative to write algorithms and code for
taking advantage of vector instructions.
We chose to investigate multi-body potentials, as they are not as easily
vectorizable as the pair potentials. As a matter of fact, their optimization
pushes the boundaries of current compiler technology: our experience suggests
that for such problems, compilers alone are not able to deliver appreciable
speedups. By contrast, by resorting to the explicit use of intrinsics, we
demonstrate significant improvements both on CPUs and the Intel Xeon Phi.
Nonetheless, it is clear that the direct use of intrinsics is not a
sustainable solution, not only for the effort and expertise it requires, but also
because it hinders readability, extensibility and maintainability of code.
In our case study, we wrote a C++ implementation of the Tersoff potential for the LAMMPS
molecular dynamics simulation package.
To alleviate the difficulty of using intrinsics, we used a template-based
approach and abstracted from the instruction set, the vector length and the floating point
precision.
Instead of having to write 18 different kernels,
this approach allowed us to only have one source code,
from which all implementations (scalar, SSE, AVX, AVX2, AVX512, and IMCI, in
single, double, and mixed precision) are automatically derived.
Such an infrastructure makes it possible to compare kernels and identify which
algorithms are suitable for which architecture.
Without optimizations, we observed that an Intel Xeon E5-2650 CPU (Sandy Bridge)
is considerably faster than a Xeon Phi.
With our optimizations, the performance of both CPU and accelerator improved,
and thanks to its wide vector units, the Xeon Phi takes the lead.
For the future, besides monitoring the compilers' improvements, one should
evaluate OpenMP 4.1 and its successors, which will include support for directive-based vectorization.
|
DSMC Vectorization Schemes
In direct simulation Monte Carlo, molecules collide at random.
We dynamically move cells in and out of the vector register and correlate collisions throughout the domain to reduce branching and improve vectorization efficiency.
Contact: William McDoniel
|