
Interfacial system using a (a) static cutoff and a (b) dynamic cutoff. Gray area denotes
the cutoff. The arrows represent the force acting on a particle. The yellow arrow depicts
the additional force contribution due to a larger cutoff.
Molecular Dynamics
We introduce a dynamic cutoff method (DCM) for interfacial systems. The idea consists in adopting a cutoffbased method in which the cutoff is chosen
on a particlebyparticle basis, according to the distance from the interface. We present a lineartime algorithm that does not rely on global communication
patterns. We provide timings for simulations running on the full JUQUEEN supercomputer
(458, 752 cores), and show nearly perfect strong and weak scaling.
 A Scalable, LinearTime Dynamic Cutoff Algorithm for Molecular Dynamics
Proceedings of the International Supercomputing Conference (ISC), July 2015. Accepted. @inproceedings{Springer2015:450,
author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi",
title = "A Scalable, LinearTime Dynamic Cutoff Algorithm for Molecular Dynamics",
year = 2015,
month = jul,
note = "Accepted"
} Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial sys tems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoffbased method in which the cutoff is cho sen on a particlebyparticle basis, according to the distance from the interface. Computationally, the challenge is shifted from the longrange solvers to the detection of the interfaces and to the computation of the particleinterface distances. For these tasks, we present lineartime algo rithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and mas sively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS opensource molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particleparticle particlemesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458, 752 cores), and show nearly perfect strong and weak scaling. abstractwebPDFbibtexhide

Experimental Linear Algebra Performance Studies (ELAPS)
The ELAPS Framework is a multiplatform open source environment for
fast yet powerful experimentation with dense linear algebra kernels,
algorithms, and libraries.
Get ELAPS now!
http://github.com/elmarpeise/ELAPS/

Multitrait GWAS as a twodimensional grid of generalized leastsquares problems.
GWAS with multiple traits requires the solution of m x t correlated
GLS problems, originating a threedimensional object B of size
m x t x p.
GenomeWide Association Studies
Multitrait GWAS based on linear mixed models pose a challenge in terms
of both the computational requirements (solution of up to trillions of correlated
GLS problems) and the size of the datasets (up to terabytes) to be processed.
Considering the collection of GLS problems as a whole, and exploiting
applicationspecific knowledge is key to the design efficient algorithms.
 Computing Petaflops over Terabytes of Data: The Case of GenomeWide Association Studies
ACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. Article 27, June 2014. @article{FabregatTraver2014:798,
author = "Diego FabregatTraver and Paolo Bientinesi",
title = "Computing Petaflops over Terabytes of Data: The Case of GenomeWide Association Studies",
journal = "ACM Transactions on Mathematical Software (TOMS)",
year = 2014,
volume = 40,
number = 4,
pages = "Article 27",
month = jun
} In many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by
the GenomeWide Association Studies (GWAS),
a widely spread tool in computational biology.
GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized leastsquares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floatingpoint operations) over terabytes of data residing on disk.
In this paper, we design an efficient algorithm for performing GWAS on multicore architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems,
thus reducing the overall computational complexity.
Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be
distributed among the available cores, to attain high performance and scalability.
We believe the paper contributes valuable guidelines
of general applicability for computational scientists on how to develop and optimize numerical algorithms. abstractwebPDFbibtexhide

Distributed Algorithm
enables large scale genome wide association studies.
 High Performance Solutions for Bigdata GWAS
Parallel Computing, Volume 42, pp. 75  87, February 2015. Special issue on Parallelism in Bioinformatics. @article{Peise2015:754,
author = "Elmar Peise and Diego FabregatTraver and Paolo Bientinesi",
title = "High Performance Solutions for Bigdata GWAS",
journal = "Parallel Computing",
year = 2015,
volume = 42,
pages = "75  87",
month = feb,
note = "Special issue on Parallelism in Bioinformatics"
} In order to associate complex traits with genetic polymorphisms, genomewide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain highperformance and allow the analysis of enormous datasets. abstractwebPDFbibtexhide

 Multilevel Summation for Dispersion: A LinearTime Algorithm for 1/r^6 Potentials
Journal of Chemical Physics, Volume 140, pp. 024105, January 2014. @article{Tameling2014:590,
author = "Daniel Tameling and Paul Springer and Paolo Bientinesi and {Ahmed E.} Ismail",
title = "Multilevel Summation for Dispersion: A LinearTime Algorithm for 1/r^6 Potentials",
journal = "Journal of Chemical Physics",
year = 2014,
volume = 140,
pages = 24105,
month = jan
} The multilevel summation (MLS) method was developed to evaluate longrange interactions in molecular dynamics (MD) simulations. MLS was initially introduced for Coulombic potentials; we have extended this method to dispersion interactions. While formally shortranged, for an accurate calculation of forces and energies in cases such as in interfacial systems, dispersion potentials require longrange methods. Since longrange solvers tend to dominate the time needed to perform MD calculations, increasing their performance is of vital importance. The MLS method offers some significant advantages when compared to meshbased Ewald methods like the particleparticle particlemesh and particle mesh Ewald methods. Unlike meshbased Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared to the O(N log N) complexity of the meshbased Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions. In addition, we have derived error bounds, similar to those obtained by Hardy for the electrostatic MLS. Using a prototype implementation, we can already demonstrate the linear scaling of the MLS method for dispersion, and present results establishing the accuracy and efficiency of the method. abstractwebPDFbibtexhide
 A Scalable, LinearTime Dynamic Cutoff Algorithm for Molecular Dynamics
Proceedings of the International Supercomputing Conference (ISC), July 2015. Accepted. @inproceedings{Springer2015:450,
author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi",
title = "A Scalable, LinearTime Dynamic Cutoff Algorithm for Molecular Dynamics",
year = 2015,
month = jul,
note = "Accepted"
} Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial sys tems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoffbased method in which the cutoff is cho sen on a particlebyparticle basis, according to the distance from the interface. Computationally, the challenge is shifted from the longrange solvers to the detection of the interfaces and to the computation of the particleinterface distances. For these tasks, we present lineartime algo rithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and mas sively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS opensource molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particleparticle particlemesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458, 752 cores), and show nearly perfect strong and weak scaling. abstractwebPDFbibtexhide
Molecular Dynamics
An important tool to study phenomena at the atomic scale are
molecular dynamics simulations. These simulations
offer a great number of computational challenges. In particular,
a critical aspect is the performance. Our group is actively
involved in several projects that aim at improving the performance of
molecular dynamics simulations.
More
Contact: Daniel Tameling

Graphical representation of the multilevel summation method
Molecular Dynamics
A typical bottleneck in molecular dynamics simulations is the
performance of the longrange solver. Because of increasing
simulation sizes, the limitations of current longrange methods
become more and more problematic. Therefore, we are interested
in investigating alternatives such as the multilevel summation
method.
More
Contact: Daniel Tameling

Algorithm Generation
yields hundreds of implementations for tensor contractions.
 On the Performance Prediction of BLASbased Tensor Contractions
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193212, Springer International Publishing, 2015. @inproceedings{Peise2015:380,
author = "Elmar Peise and Diego FabregatTraver and Paolo Bientinesi",
title = "On the Performance Prediction of BLASbased Tensor Contractions",
booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
year = 2015,
editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
volume = 8966,
series = "Lecture Notes in Computer Science",
pages = "193212",
publisher = "Springer International Publishing"
} Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of highperformance kernels for such operations is known to be a challenging task. While for operations on one and twodimensional tensors there exist standardized interfaces and highlyoptimized libraries (BLAS), for higher dimensional tensors neither standards nor highlytuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating highperformance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cacheaware microbenchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the bestperforming algorithms in a tiny fraction of the time taken by the direct execution of the algorithms. abstractwebPDFbibtexhide
