Interfacial system using (a) a static cutoff and (b) a dynamic cutoff. The gray area denotes the cutoff; the arrows represent the forces acting on a particle. The yellow arrow depicts the additional force contribution due to the larger cutoff.

Molecular Dynamics


We introduce a dynamic cutoff method (DCM) for interfacial systems. The idea is to adopt a cutoff-based method in which the cutoff is chosen on a particle-by-particle basis, according to each particle's distance from the interface. We present linear-time algorithms that do not rely on global communication patterns. We provide timings for simulations running on the full JUQUEEN supercomputer (458,752 cores), and show nearly perfect strong and weak scaling.
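As a rough illustration of the per-particle idea, the following sketch (plain NumPy; the parameters rc_min, rc_max, and decay, and the exponential interpolation itself, are made up for illustration, not the error-controlled criterion derived in the paper below) assigns large cutoffs near the interface, where truncation errors do not cancel, and short ones deep in the bulk:

    import numpy as np

    def assign_cutoffs(dist_to_interface, rc_min=3.0, rc_max=10.0, decay=0.5):
        """Choose a cutoff radius per particle from its distance to the interface.

        Near the interface a particle's environment is inhomogeneous and
        truncation errors do not cancel, so the cutoff approaches rc_max;
        deep in the bulk a short cutoff rc_min suffices.
        """
        w = np.exp(-decay * np.asarray(dist_to_interface))  # 1 at interface, -> 0 in bulk
        return rc_min + (rc_max - rc_min) * w

    # Example: particles at increasing distance from the interface.
    print(assign_cutoffs([0.0, 2.0, 10.0]))  # approx. [10.0, 5.6, 3.0]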
  1. A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics
    Proceedings of the International Supercomputing Conference (ISC), July 2015.
    Accepted.
    @inproceedings{Springer2015:450,
        author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi",
        title  = "A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics",
        year   = 2015,
        month  = jul,
        note   = "Accepted"
    }
Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is chosen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algorithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and massively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMUC and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458,752 cores), and show nearly perfect strong and weak scaling.

Experimental Linear Algebra Performance Studies (ELAPS)

The ELAPS Framework is a multi-platform, open-source environment for fast yet powerful experimentation with dense linear algebra kernels, algorithms, and libraries.




Get ELAPS now!

http://github.com/elmar-peise/ELAPS/

Multi-trait GWAS as a two-dimensional grid of generalized least-squares problems. GWAS with multiple traits requires the solution of m x t correlated GLS problems, giving rise to a three-dimensional object B of size m x t x p.

Genome-Wide Association Studies


Multi-trait GWAS based on linear mixed models pose a challenge in terms of both the computational requirements (solution of up to trillions of correlated GLS problems) and the size of the datasets to be processed (up to terabytes). Considering the collection of GLS problems as a whole, and exploiting application-specific knowledge, is key to designing efficient algorithms.
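The key reuse, in a minimal NumPy/SciPy sketch (sizes and data are made up, and the out-of-core and parallel aspects addressed in the paper below are ignored): the covariance matrix is factored once, all operands are whitened with that factor, each GLS in the grid collapses to an ordinary least-squares problem, and one QR factorization per SNP is shared across all t traits.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    # Illustrative sizes: n individuals, p predictors, m SNPs, t traits.
    n, p, m, t = 500, 4, 100, 3
    rng = np.random.default_rng(0)

    A = rng.standard_normal((n, n))
    Sigma = A @ A.T + n * np.eye(n)          # SPD covariance (relatedness) matrix
    X_fix = rng.standard_normal((n, p - 1))  # covariates shared by all problems
    snps  = rng.standard_normal((n, m))      # one extra predictor column per SNP
    Y     = rng.standard_normal((n, t))      # one phenotype vector per trait

    # GLS: b = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y. Factor Sigma = L L^T
    # once and whiten all operands; each GLS becomes ordinary least squares.
    L = cholesky(Sigma, lower=True)
    Xw_fix = solve_triangular(L, X_fix, lower=True)
    snps_w = solve_triangular(L, snps, lower=True)
    Yw     = solve_triangular(L, Y, lower=True)

    B = np.empty((m, t, p))                  # the 3D object B of size m x t x p
    for i in range(m):
        Xw = np.column_stack([Xw_fix, snps_w[:, i]])
        Q, R = np.linalg.qr(Xw)              # one QR, reused for all t traits
        for j in range(t):
            B[i, j] = solve_triangular(R, Q.T @ Yw[:, j], lower=False)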
  1. Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies
ACM Transactions on Mathematical Software (TOMS), Volume 40(4), Article 27, June 2014.
    @article{Fabregat-Traver2014:798,
        author  = "Diego Fabregat-Traver and Paolo Bientinesi",
        title   = "Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies",
        journal = "ACM Transactions on Mathematical Software (TOMS)",
        year    = 2014,
        volume  = 40,
        number  = 4,
        pages   = "Article 27",
        month   = jun
    }
In many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Oftentimes, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the Genome-Wide Association Studies (GWAS), a widespread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floating-point operations) over terabytes of data residing on disk. In this paper, we design an efficient algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.

Distributed Algorithm

enables large-scale genome-wide association studies.
  1. High Performance Solutions for Big-data GWAS
Parallel Computing, Volume 42, pp. 75-87, February 2015.
    Special issue on Parallelism in Bioinformatics.
    @article{Peise2015:754,
        author  = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
        title   = "High Performance Solutions for Big-data GWAS",
        journal = "Parallel Computing",
        year    = 2015,
        volume  = 42,
        pages   = "75 - 87",
        month   = feb,
        note    = "Special issue on Parallelism in Bioinformatics"
    }
In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high performance and allow the analysis of enormous datasets.
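The double-buffering technique mentioned in the abstract above fits in a few lines: while one block of genotype data is being processed, the next block is prefetched on a helper thread. A toy Python version, assuming a user-supplied block loader and per-block worker (the paper's implementation additionally distributes the relationship matrix):

    import threading
    import numpy as np

    def stream_blocks(load_block, process_block, num_blocks):
        """Overlap I/O with computation: while block k is processed,
        block k+1 is loaded from storage on a helper thread."""
        buf = {}

        def prefetch(k):
            buf['data'] = load_block(k)

        current = load_block(0)
        for k in range(num_blocks):
            worker = None
            if k + 1 < num_blocks:
                worker = threading.Thread(target=prefetch, args=(k + 1,))
                worker.start()           # prefetch overlaps the compute below
            process_block(current)
            if worker is not None:
                worker.join()
                current = buf['data']

    # Hypothetical usage: in-memory blocks stand in for reads from disk.
    blocks = [np.random.rand(1000, 256) for _ in range(4)]
    stream_blocks(lambda k: blocks[k], lambda X: X.T @ X, len(blocks))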


  1. Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials
    Journal of Chemical Physics, Volume 140, pp. 024105, January 2014.
    @article{Tameling2014:590,
        author  = "Daniel Tameling and Paul Springer and Paolo Bientinesi and {Ahmed E.} Ismail",
        title   = "Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials",
        journal = "Journal of Chemical Physics",
        year    = 2014,
        volume  = 140,
pages   = "024105",
        month   = jan
    }
    The multilevel summation (MLS) method was developed to evaluate long-range interactions in molecular dynamics (MD) simulations. MLS was initially introduced for Coulombic potentials; we have extended this method to dispersion interactions. While formally short-ranged, for an accurate calculation of forces and energies in cases such as in interfacial systems, dispersion potentials require long-range methods. Since long-range solvers tend to dominate the time needed to perform MD calculations, increasing their performance is of vital importance. The MLS method offers some significant advantages when compared to mesh-based Ewald methods like the particle-particle particle-mesh and particle mesh Ewald methods. Unlike mesh-based Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared to the O(N log N) complexity of the mesh-based Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions. In addition, we have derived error bounds, similar to those obtained by Hardy for the electrostatic MLS. Using a prototype implementation, we can already demonstrate the linear scaling of the MLS method for dispersion, and present results establishing the accuracy and efficiency of the method.
  2. A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics
    Proceedings of the International Supercomputing Conference (ISC), July 2015.
    Accepted.
    @inproceedings{Springer2015:450,
        author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi",
        title  = "A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics",
        year   = 2015,
        month  = jul,
        note   = "Accepted"
    }
Recent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial systems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is chosen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algorithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and massively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMUC and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458,752 cores), and show nearly perfect strong and weak scaling.

Molecular Dynamics

Molecular dynamics simulations are an important tool for studying phenomena at the atomic scale. These simulations pose a great number of computational challenges; in particular, performance is a critical aspect. Our group is actively involved in several projects aimed at improving the performance of molecular dynamics simulations. More

Contact: Daniel Tameling


Graphical representation of the multilevel summation method

Molecular Dynamics

A typical bottleneck in molecular dynamics simulations is the performance of the long-range solver. As simulation sizes increase, the limitations of current long-range methods become more and more problematic. We are therefore interested in investigating alternatives such as the multilevel summation method.
More

Contact: Daniel Tameling
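To give a flavor of the method: MLS splits the 1/r^6 kernel into a short-range part that vanishes beyond a cutoff and a smooth remainder that can be interpolated on a hierarchy of coarser grids. The sketch below shows such a two-level splitting with one common polynomial smoothing (a Taylor expansion in (r/a)^2); the published method derives its own smoothing functions and error bounds.

    import numpy as np

    def g(r):
        """Dispersion kernel 1/r^6."""
        return 1.0 / r**6

    def g_smooth(r, a):
        """Smoothed kernel: equals 1/r^6 for r >= a, smooth and bounded for
        r < a. With s = (r/a)^2, so that 1/r^6 = s^{-3}/a^6, the inner branch
        is the second-order Taylor expansion of s^{-3} about s = 1."""
        s = (r / a) ** 2
        inside = (1 - 3 * (s - 1) + 6 * (s - 1) ** 2) / a**6
        return np.where(r < a, inside, g(r))

    # Two-level splitting: g = (g - g_a) + g_a. The difference vanishes for
    # r >= a and is summed directly (short range); the smooth remainder g_a
    # varies slowly and is interpolated from coarser grids -- the "levels".
    a = 2.5
    r = np.linspace(0.5, 5.0, 10)
    short_range = g(r) - g_smooth(r, a)   # exactly zero for r >= a
    assert np.all(short_range[r >= a] == 0)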

Algorithm Generation

yields hundreds of implementations for tensor contractions.
  1. On the Performance Prediction of BLAS-based Tensor Contractions
    High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, 2015.
    @inproceedings{Peise2015:380,
        author    = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
        title     = "On the Performance Prediction of BLAS-based Tensor Contractions",
        booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
        year      = 2015,
        editor    = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
        volume    = 8966,
        series    = "Lecture Notes in Computer Science",
        pages     = "193-212",
        publisher = "Springer International Publishing"
    }
    Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
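To illustrate the kind of alternatives being ranked: a contraction such as C[a,b,c] = sum_k A[a,k] B[k,b,c] can be cast onto BLAS in several ways, for instance as one large GEMM after flattening the free indices of B, or as a loop of smaller GEMMs over slices. A NumPy sketch with illustrative sizes (the paper's contribution is predicting, from cache-aware micro-benchmarks, which of the many variants is fastest without executing them):

    import numpy as np

    # Contraction C[a,b,c] = sum_k A[a,k] * B[k,b,c]; sizes are illustrative.
    na, nk, nb, nc = 64, 32, 48, 16
    rng = np.random.default_rng(0)
    A = rng.standard_normal((na, nk))
    B = rng.standard_normal((nk, nb, nc))

    # Variant 1: flatten the free indices (b, c) of B and call one large GEMM.
    C1 = (A @ B.reshape(nk, nb * nc)).reshape(na, nb, nc)

    # Variant 2: loop over c and perform nc smaller GEMMs on matrix slices.
    C2 = np.empty((na, nb, nc))
    for c in range(nc):
        C2[:, :, c] = A @ B[:, :, c]

    assert np.allclose(C1, C2)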

About HPAC

The High-Performance and Automatic Computing group is concerned with the development and analysis of accurate and efficient numerical algorithms, with a focus on numerical linear algebra. We target applications from materials science, molecular dynamics, and computational biology, and the whole range of high-performance architectures.

Topics

  • Numerical linear algebra
    • Sequences of problems
    • Small scale operations
    • Tensor operations
    • Error analysis
    • Parallel eigensolvers
  • Parallelism
    • Vectorization
    • Multicore
    • Distributed-memory
    • Coprocessors: GPU, Xeon Phi, ...
    • Hybrid
  • Automation
    • Algorithm and code generation
    • Performance modeling and prediction
    • Algorithm ranking
  • Applications
    • Genome analysis
    • Molecular dynamics simulations
    • Symbolic algorithmic differentiation for matrix operations
    • Electronic structure calculations

Projects for Bachelor and Master theses

Check the list of available projects.