Recent Publications

Submitted Paper

  1. Assessment of sound spatialisation algorithms for sonic rendering with headsets
    Ali Tarzan, Paolo Bientinesi and Marco Alunno
    Journal of New Music Research, November 2017.
    Given an input sound signal and a target virtual sound source, sound spatialisation algorithms manipulate the signal so that a listener perceives it as though it were emitted from the target source. There exist several established spatialisation approaches that deliver satisfactory results when loudspeakers are used to playback the manipulated signal. As headphones have a number of desirable characteristics over loudspeakers, such as portability, isolation from the surrounding environment, cost and ease of use, it is interesting to explore how a sense of acoustic space can be conveyed through them. This article first surveys traditional spatialisation approaches intended for loudspeakers, and then reviews them with regard to their adaptability to headphones.

Journal Articles

  1. Design of a high-performance GEMM-like Tensor-Tensor Multiplication
    ACM Transactions on Mathematical Software (TOMS), Volume 44(3), pp. 28:1-28:29, January 2018.
        author  = "Paul Springer and Paolo Bientinesi",
        title   = "Design of a high-performance GEMM-like Tensor-Tensor Multiplication",
        journal = "ACM Transactions on Mathematical Software (TOMS)",
        year    = 2018,
        volume  = 44,
        number  = 3,
        pages   = "28:1--28:29",
        month   = jan,
        url     = ""
    We present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
  2. Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection
    ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 16:1-16:19, September 2017.
        author    = "Elmar Peise and Paolo Bientinesi",
        title     = "Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection",
        journal   = "ACM Transactions on Mathematical Software (TOMS)",
        year      = 2017,
        volume    = 44,
        number    = 2,
        pages     = "16:1--16:19",
        month     = sep,
        publisher = "ACM",
        address   = "New York, NY, USA",
        url       = ""
    To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.
  3. TTC: A high-performance Compiler for Tensor Transpositions
    Paul Springer, Jeff R. Hammond and Paolo Bientinesi
    ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 15:1-15:21, August 2017.
        author    = "Paul Springer and {Jeff R.} Hammond and Paolo Bientinesi",
        title     = "TTC: A high-performance Compiler for Tensor Transpositions",
        journal   = "ACM Transactions on Mathematical Software (TOMS)",
        year      = 2017,
        volume    = 44,
        number    = 2,
        pages     = "15:1--15:21",
        month     = aug,
        publisher = "ACM",
        url       = ""
    We present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.
  4. The ELAPS Framework: Experimental Linear Algebra Performance Studies
    International Journal of High Performance Computing, August 2017.
        author  = "Elmar Peise and Paolo Bientinesi",
        title   = "The ELAPS Framework: Experimental Linear Algebra Performance Studies",
        journal = "International Journal of High Performance Computing",
        year    = 2017,
        month   = aug,
        note    = "Accepted",
        url     = ""
    In scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.

Peer Reviewed Conference Publications

  1. Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition
    Tina Raissi and Paolo Bientinesi
    Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE Press, April 2018.
        author    = "{Tina } Raissi and Paolo Bientinesi",
        title     = "Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition",
        year      = 2018,
        month     = apr,
        publisher = "IEEE Press"
    We present a feature engineering pipeline for the construction of musical signal characteristics, to be used for the design of a supervised model for musical genre identification. The key idea is to extend the traditional two-step process of extraction and classification with additive stand-alone phases which are no longer organized in a waterfall scheme. The whole system is realized by traversing backtrack arrows and cycles between various stages. In order to give a compact and effective repre- sentation of the features, the standard early temporal integra- tion is combined with other selection and extraction phases: on the one hand, the selection of the most meaningful char- acteristics based on information gain, on the other hand, the inclusion of the nonlinear correlation between this subset of features, determined by an autoencoder. The results of the experiments conducted on GTZAN dataset reveal a noticeable contribution of this methodology towards the model’s perfor- mance in classification task.
  2. The Generalized Matrix Chain Algorithm
    Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 11, 24 February 2018.
        author    = "Henrik Barthels and Marcin Copik and Paolo Bientinesi",
        title     = "The Generalized Matrix Chain Algorithm",
        booktitle = "Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization",
        year      = 2018,
        pages     = 11,
        address   = "Vienna, Austria",
        month     = feb,
        url       = ""
    In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$‚Äč that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.
  3. Program Generation for Small-Scale Linear Algebra Applications
    Daniele Spampinato, Diego Fabregat-Traver, Paolo Bientinesi and Markus Pueschel
    Proceedings of the International Symposium on Code Generation and Optimization, February 2018.
        author  = "{Daniele } Spampinato and Diego Fabregat-Traver and Paolo Bientinesi and Markus Pueschel",
        title   = "Program Generation for Small-Scale Linear Algebra Applications",
        year    = 2018,
        address = "Vienna, Austria",
        month   = feb
    We present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications, higher level operations including linear systems solvers, Cholesky and LU factorizations, as well as loops. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with icc/clang or a polyhedral optimizer, as well as library-based and template-based implementations.
  4. Efficient Pattern Matching in Python
    Manuel Krebber, Henrik Barthels and Paolo Bientinesi
    Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing, November 2017.
    In conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis.
        author    = "Manuel Krebber and Henrik Barthels and Paolo Bientinesi",
        title     = "Efficient Pattern Matching in Python",
        booktitle = "Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing",
        year      = 2017,
        address   = "Denver, Colorado",
        month     = nov,
        note      = "In conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis",
        url       = ""
    Pattern matching is a powerful tool for symbolic computations. Applications include term rewriting systems, as well as the manipulation of symbolic expressions, abstract syntax trees, and XML and JSON data. It also allows for an intuitive description of algorithms in the form of rewrite rules. We present the open source Python module MatchPy, which offers functionality and expressiveness similar to the pattern matching in Mathematica. In particular, it includes syntactic pattern matching, as well as matching for commutative and/or associative functions, sequence variables, and matching with constraints. MatchPy uses new and improved algorithms to efficiently find matches for large pattern sets by exploiting similarities between patterns. The performance of MatchPy is investigated on several real-world problems.
  5. Linnea: Compiling Linear Algebra Expressions to High-Performance Code
    Proceedings of the 8th International Workshop on Parallel Symbolic Computation, July 2017.
        author    = "Henrik Barthels and Paolo Bientinesi",
        title     = "Linnea: Compiling Linear Algebra Expressions to High-Performance Code",
        booktitle = "Proceedings of the 8th International Workshop on Parallel Symbolic Computation",
        year      = 2017,
        address   = "Kaiserslautern, Germany",
        month     = jul,
        url       = ""
    Linear algebra expressions appear in fields as diverse as computational biology, signal processing, communication technology, finite element methods, and control theory. Libraries such as BLAS and LAPACK provide highly optimized building blocks for just about any linear algebra computation; thus, a linear algebra expression can be evaluated efficiently by breaking it down into those building blocks. However, this is a challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this paper we give an overview of existing solutions, and introduce Linnea, a compiler that solves this problem. As shown through a set of test cases, Linnea’s results are comparable with those obtained by human experts.