Recent Publications

Submitted Paper

  1. ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems
    May 2018.
    submitted to ACM TOMS.
    Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures.
    abstractwebPDFhide

Journal Articles

  1. Optimizing AIREBO: Navigating the Journey from Complex Legacy Code to High Performance
    Journal of Computational Chemistry, 2019.
    Accepted.
    @article{Höhnerbach2019:708,
        author  = "Markus Höhnerbach and Paolo Bientinesi",
        title   = "Optimizing AIREBO: Navigating the Journey from Complex Legacy Code to High Performance",
        journal = "Journal of Computational Chemistry",
        year    = 2019,
        note    = "Accepted",
        url     = "https://arxiv.org/pdf/1810.07026.pdf"
    }
    Despite initiatives to improve the quality of scientific codes, there still is a large presence of legacy code. Such code often needs to implement a lot of functionality under time constrains, sacrificing quality. Additionally, quality is rarely improved by optimizations for new architectures. This development model leads to code that is increasingly difficult to work with. Our suggested solution includes complexity-reducing refactoring and hardware abstraction. We focus on the AIREBO potential from LAMMPS, where the challenge is that any potential kernel is rather large and complex, hindering systematic optimization. This issue is common to codes that model multiple physical phenomena. We present our journey from the C++ port of a previous Fortran code to performance-portable, KNC-hybrid, vectorized, scalable, optimized code supporting full and reduced precision. The journey includes extensive testing that fixed bugs in the original code. Large-scale, full-precision runs sustain speedups of more than 4x (KNL) and 3x (Skylake).
    abstractPDFbibtexhide
  2. Assessment of sound spatialisation algorithms for sonic rendering with headsets
    Ali Tarzan, Paolo Bientinesi and Marco Alunno
    Journal of New Music Research, November 2018.
    Accepted.
    @article{Tarzan2018:528,
        author  = "Ali Tarzan and Paolo Bientinesi and Marco Alunno",
        title   = "Assessment of sound spatialisation algorithms for sonic rendering with headsets",
        journal = "Journal of New Music Research ",
        year    = 2018,
        month   = nov,
        note    = "Accepted"
    }
    Given an input sound signal and a target virtual sound source, sound spatialisation algorithms manipulate the signal so that a listener perceives it as though it were emitted from the target source. There exist several established spatialisation approaches that deliver satisfactory results when loudspeakers are used to playback the manipulated signal. As headphones have a number of desirable characteristics over loudspeakers, such as portability, isolation from the surrounding environment, cost and ease of use, it is interesting to explore how a sense of acoustic space can be conveyed through them. This article first surveys traditional spatialisation approaches intended for loudspeakers, and then reviews them with regard to their adaptability to headphones.
    abstractwebbibtexhide
  3. MatchPy: Pattern Matching in Python
    Manuel Krebber and Henrik Barthels
    Journal of Open Source Software, Volume 3(26), pp. 2, June 2018.
    @article{Krebber2018:838,
        author  = "Manuel Krebber and Henrik Barthels",
        title   = "MatchPy: Pattern Matching in Python",
        journal = "Journal of Open Source Software",
        year    = 2018,
        volume  = 3,
        number  = 26,
        pages   = 2,
        month   = jun
    }
    webbibtexhide
  4. Accelerating scientific codes by performance and accuracy modeling
    Journal of Computational Science, April 2018.
    Accepted.
    @article{Fabregat-Traver2018:408,
        author  = "Diego Fabregat-Traver and {Ahmed E.} Ismail and Paolo Bientinesi",
        title   = "Accelerating scientific codes by performance and accuracy modeling ",
        journal = "Journal of Computational Science",
        year    = 2018,
        month   = apr,
        note    = "Accepted",
        url     = "http://arxiv.org/abs/1608.04694"
    }
    Scientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particle-particle particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the time-to-solution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of core-hours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.
    abstractPDFbibtexhide
  5. The ELAPS Framework: Experimental Linear Algebra Performance Studies
    International Journal of High Performance Computing, pp. 1-13, March 2018.
    @article{Peise2018:560,
        author    = "Elmar Peise and Paolo Bientinesi",
        title     = "The ELAPS Framework: Experimental Linear Algebra Performance Studies",
        journal   = "International Journal of High Performance Computing",
        year      = 2018,
        pages     = "1-13",
        month     = mar,
        publisher = "Sage",
        url       = "http://arxiv.org/pdf/1504.08035v1"
    }
    In scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.
    abstractwebPDFbibtexhide

Peer Reviewed Conference Publications

  1. A Timer-Augmented Cost Function for Load Balanced DSMC
    13th International Meeting on High Performance Computing for Computational Science (VECPAR 18), Lecture Notes in Computer Science, Volume 11333, September 2019.
    @inproceedings{McDoniel2019:488,
        author    = "William McDoniel and Paolo Bientinesi",
        title     = "A Timer-Augmented Cost Function for Load Balanced DSMC",
        booktitle = "13th International Meeting on High Performance Computing for Computational Science (VECPAR 18)",
        year      = 2019,
        volume    = 11333,
        series    = "Lecture Notes in Computer Science",
        month     = sep
    }
    Due to a hard dependency between time steps, large-scale simulations of gas using the Direct Simulation Monte Carlo (DSMC) method proceed at the pace of the slowest processor. Scalability is therefore achievable only by ensuring that the work done each time step is as evenly apportioned among the processors as possible. Furthermore, as the simulated system evolves, the load shifts, and thus this load-balancing typically needs to be performed multiple times over the course of a simulation. Common methods generally use either crude performance models or processor-level timers. We combine both to create a timer-augmented cost function which both converges quickly and yields well-balanced processor decompositions. When compared to a particle-based performance model alone, our method achieves 2x speedup at steady-state on up to 1024 processors for a test case consisting of a Mach 9 argon jet impacting a solid wall.
    abstractbibtexhide
  2. Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition
    Tina Raissi, Alessandro Tibo and Paolo Bientinesi
    Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE Press, April 2018.
    @inproceedings{Raissi2018:320,
        author    = "{Tina } Raissi and Alessandro Tibo and Paolo Bientinesi",
        title     = "Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition",
        year      = 2018,
        month     = apr,
        publisher = "IEEE Press",
        url       = "https://arxiv.org/pdf/1805.05324.pdf"
    }
    We present a feature engineering pipeline for the construction of musical signal characteristics, to be used for the design of a supervised model for musical genre identification. The key idea is to extend the traditional two-step process of extraction and classification with additive stand-alone phases which are no longer organized in a waterfall scheme. The whole system is realized by traversing backtrack arrows and cycles between various stages. In order to give a compact and effective repre- sentation of the features, the standard early temporal integra- tion is combined with other selection and extraction phases: on the one hand, the selection of the most meaningful char- acteristics based on information gain, on the other hand, the inclusion of the nonlinear correlation between this subset of features, determined by an autoencoder. The results of the experiments conducted on GTZAN dataset reveal a noticeable contribution of this methodology towards the model’s perfor- mance in classification task.
    abstractwebPDFbibtexhide
  3. The Generalized Matrix Chain Algorithm
    Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 11, 24 February 2018.
    @inproceedings{Barthels2018:130,
        author    = "Henrik Barthels and Marcin Copik and Paolo Bientinesi",
        title     = "The Generalized Matrix Chain Algorithm",
        booktitle = "Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization",
        year      = 2018,
        pages     = 11,
        address   = "Vienna, Austria",
        month     = feb,
        url       = "http://arxiv.org/abs/1804.04021"
    }
    In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$​ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.
    abstractwebPDFbibtexhide
  4. Program Generation for Small-Scale Linear Algebra Applications
    Daniele Spampinato, Diego Fabregat-Traver, Paolo Bientinesi and Markus Pueschel
    Proceedings of the International Symposium on Code Generation and Optimization, February 2018.
    @inproceedings{Spampinato2018:858,
        author  = "{Daniele } Spampinato and Diego Fabregat-Traver and Paolo Bientinesi and Markus Pueschel",
        title   = "Program Generation for Small-Scale Linear Algebra Applications",
        year    = 2018,
        address = "Vienna, Austria",
        month   = feb,
        url     = "https://arxiv.org/pdf/1805.04775.pdf"
    }
    We present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications, higher level operations including linear systems solvers, Cholesky and LU factorizations, as well as loops. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with icc/clang or a polyhedral optimizer, as well as library-based and template-based implementations.
    abstractwebPDFbibtexhide