Publications - Markus Höhnerbach

Peer Reviewed Conference Publications

  1. LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi
    Proceedings of the 32nd International Conference, ISC High Performance 2017, Volume 10266, pp. 61-78, Springer, June 2017.
    @inproceedings{McDoniel2017:890,
        author    = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi",
        title     = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi",
        booktitle = "Proceedings of the 32nd International Conference, ISC High Performance 2017",
        year      = 2017,
        volume    = 10266,
        pages     = "61--78",
        address   = "Frankfurt",
        month     = jun,
        publisher = "Springer",
        url       = "https://arxiv.org/pdf/1702.04250.pdf"
    }
    Molecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.
  2. Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods
    Proceedings of the JARA-HPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200-211, Springer, 2017.
    @inproceedings{Fabregat-Traver2017:4,
        author    = "Diego Fabregat-Traver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}",
        title     = "Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods",
        year      = 2017,
        volume    = 10164,
        booktitle = "Proceedings of the JARA-HPC Symposium",
        series    = "Lecture Notes in Computer Science",
        pages     = "200--211",
        publisher = "Springer",
        url       = "https://arxiv.org/pdf/1611.00606v1"
    }
    In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed at the Forschungszentrum Jülich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study into two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16 cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.
  3. The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16, Number 7, pp. 7:1-7:13, IEEE Press, 2016.
    Selected for Reproducibility Initiative at SC17.
    @inproceedings{Höhnerbach2016:78,
        author    = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi",
        title     = "The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability",
        booktitle = "Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis",
        year      = 2016,
        number    = 7,
        series    = "SC'16",
        pages     = "7:1--7:13",
        publisher = "IEEE Press",
        note      = "Selected for Reproducibility Initiative at SC17",
        url       = "https://arxiv.org/pdf/1607.02904v1"
    }
    Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multi-body potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phis, our optimized solver is between 3 and 5 times faster than the pure MPI reference.
  4. Dynamic SIMD Vector Lane Scheduling
    Markus Höhnerbach, Florian Wende and Olaf Krizikalla
    High Performance Computing: ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P^3MA, VHPC, WOPSSS, Springer International Publishing, 2016.
    @inproceedings{Höhnerbach2016:440,
        author    = "Markus Höhnerbach and Florian Wende and Olaf Krizikalla",
        title     = "Dynamic SIMD Vector Lane Scheduling",
        booktitle = "High Performance Computing: ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P^3MA, VHPC, WOPSSS",
        year      = 2016,
        editor    = "Taufer, Michela",
        publisher = "Springer International Publishing"
    }
    A classical technique to vectorize code that contains control flow is a control-flow to data-flow conversion. In that approach, statements are augmented with masks that denote whether a given vector lane participates in the statement’s execution or idles. If the scheduling of work to vector lanes is performed statically, then some of the vector lanes will run idle in case of control flow divergences or varying work intensities across the loop iterations. With an increasing number of vector lanes, the likelihood of divergences or heavily unbalanced work assignments increases and static scheduling leads to poor resource utilization. In this paper, we investigate different approaches to dynamic SIMD vector lane scheduling using the Mandelbrot set algorithm as a test case. To overcome the limitations of static scheduling, idle vector lanes are assigned work items dynamically, thereby minimizing per-lane idle cycles. Our evaluation on the Knights Corner and Knights Landing platforms shows that our approaches can lead to considerable performance gains over a static work assignment. By using the AVX-512 vector compress and expand instructions, we are able to further improve the scheduling.
  5. The Tersoff many-body potential: Sustainable performance through vectorization
    Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation, November 2015.
    @inproceedings{Höhnerbach2015:718,
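        author    = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi",
        title     = "The Tersoff many-body potential: Sustainable performance through vectorization",
        booktitle = "Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation",
        year      = 2015,
        month     = nov,
        url       = "http://hpac.rwth-aachen.de/ipcc/sc15_paper.pdf"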
    }
    This contribution discusses the effectiveness and the sustainability of vectorization in molecular dynamics. As a case study, we present results for multi-body potentials on a range of vector instruction sets, targeting both CPUs and accelerators such as GPUs and the Intel Xeon Phi. The shared-memory and distributed-memory parallelization of MD simulations are problems that have been extensively treated; by contrast, vectorization is a relatively unexplored dimension. However, given both the increasing number of vector units available in the computing architectures, and the increasing width of such units, it is imperative to write algorithms and code for taking advantage of vector instructions. We chose to investigate multi-body potentials, as they are not as easily vectorizable as the pair potentials. As a matter of fact, their optimization pushes the boundaries of current compiler technology: our experience suggests that for such problems, compilers alone are not able to deliver appreciable speedups. By contrast, by resorting to the explicit use of intrinsics, we demonstrate significant improvements both on CPUs and the Intel Xeon Phi. Nonetheless, it is clear that the direct use of intrinsics is not a sustainable solution, not only for the effort and expertise it requires, but also because it hinders readability, extensibility and maintainability of code. In our case study, we wrote a C++ implementation of the Tersoff potential for the LAMMPS molecular dynamics simulation package. To alleviate the difficulty of using intrinsics, we used a template-based approach and abstracted from the instruction set, the vector length and the floating point precision. Instead of having to write 18 different kernels, this approach allowed us to only have one source code, from which all implementations (scalar, SSE, AVX, AVX2, AVX512, and IMCI, in single, double, and mixed precision) are automatically derived.
    Such an infrastructure makes it possible to compare kernels and identify which algorithms are suitable for which architecture. Without optimizations, we observed that an Intel Xeon E5-2650 CPU (Sandy Bridge) is considerably faster than a Xeon Phi. With our optimizations, the performance of both CPU and accelerator improved, and thanks to its wide vector units, the Xeon Phi takes the lead. For the future, besides monitoring the compilers' improvements, one should evaluate OpenMP 4.1 and its successors, which will include support for directive-based vectorization.