ACM Publications

Journal Articles

  1. Modeling Performance through Memory-Stalls
    Roman Iakymchuk and Paolo Bientinesi
    ACM SIGMETRICS Performance Evaluation Review, Volume 40(2), pp. 86-91, October 2012.
    @article{Iakymchuk2012:120,
        author    = "Roman Iakymchuk and Paolo Bientinesi",
        title     = "Modeling Performance through Memory-Stalls",
        journal   = "ACM SIGMETRICS Performance Evaluation Review",
        year      = 2012,
        volume    = 40,
        number    = 2,
        pages     = "86--91",
        month     = oct,
        publisher = "ACM",
        address   = "New York, NY, USA",
        url       = "http://hpac.rwth-aachen.de/~pauldj/pubs/pmbs.pdf"
    }
    We aim at modeling the performance of linear algebra algorithms without executing either the algorithms or any parts of them. The performance of an algorithm can be expressed in terms of the time spent on CPU execution and memory-stalls. The main concern of this paper is to build analytical models to accurately predict memory-stalls. The scenario in which data resides in the L2 cache is considered; with this assumption, only L1 cache misses occur. We construct an analytical formula for modeling the L1 cache misses of fundamental linear algebra operations such as those included in the Basic Linear Algebra Subprograms (BLAS) library. The number of cache misses occurring in higher-level algorithms--like a matrix factorization--is then predicted by combining the models for the appropriate BLAS subroutines. As case studies, we consider the LU factorization and GER--a BLAS operation and a building block for the LU factorization. We validate the models on both Intel and AMD processors, attaining remarkably accurate performance predictions.
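    The model's additive structure (CPU cycles plus stall cycles, with stalls driven by a predicted L1 miss count) can be illustrated with a minimal Python sketch. All constants and the GER miss formula below are illustrative placeholders, not the paper's calibrated models.

    # Minimal sketch of the execution-less performance model described above:
    # predicted time = CPU execution time + memory-stall time, where stalls
    # are estimated from a predicted number of L1 cache misses. All constants
    # here are illustrative placeholders, not the paper's calibrated values.

    def predicted_time(flops, flops_per_cycle, l1_misses, miss_penalty_cycles, freq_hz):
        """Predict runtime (seconds) from CPU work and modeled L1 misses."""
        cpu_cycles = flops / flops_per_cycle
        stall_cycles = l1_misses * miss_penalty_cycles
        return (cpu_cycles + stall_cycles) / freq_hz

    def ger_l1_misses(m, n, line_doubles=8):
        """Illustrative miss model for GER (A += x * y^T) with data in L2:
        every element of A is read and written once, so misses scale with
        the m*n matrix traffic divided by the cache-line length."""
        return (m * n) / line_doubles

    # Example: 1000x1000 GER (2*m*n flops) on a hypothetical 2.0 GHz core.
    m = n = 1000
    t = predicted_time(flops=2 * m * n,
                       flops_per_cycle=2,
                       l1_misses=ger_l1_misses(m, n),
                       miss_penalty_cycles=10,
                       freq_hz=2.0e9)
    print(f"predicted time: {t * 1e3:.3f} ms")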
  2. Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix
    Paolo Bientinesi, Brian Gunter and Robert van de Geijn
    ACM Transactions on Mathematical Software, Volume 35(1), July 2008.
    @article{Bientinesi2008:648,
        author  = "Paolo Bientinesi and Brian Gunter and Robert {van de Geijn}",
        title   = "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix",
        journal = "ACM Transactions on Mathematical Software",
        year    = 2008,
        volume  = 35,
        number  = 1,
        month   = jul,
        url     = "http://hpac.rwth-aachen.de/~pauldj/pubs/SPDInv.pdf"
    }
    We study the high-performance implementation of the inversion of a Symmetric Positive Definite (SPD) matrix on architectures ranging from sequential processors to Symmetric MultiProcessors to distributed memory parallel computers. This inversion is traditionally accomplished in three "sweeps": a Cholesky factorization of the SPD matrix, the inversion of the resulting triangular matrix, and finally the multiplication of the inverted triangular matrix by its own transpose. We state different algorithms for each of these sweeps as well as algorithms that compute the result in a single sweep. One algorithm outperforms the current ScaLAPACK implementation by 20-30 percent due to improved load-balance on a distributed memory architecture.
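    The three traditional sweeps map directly onto standard library calls; a minimal NumPy/SciPy sketch follows (the unblocked textbook sequence, not the blocked or one-sweep algorithms the paper derives).

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def spd_inverse_three_sweeps(A):
        """Invert an SPD matrix via the three traditional sweeps."""
        L = cholesky(A, lower=True)        # sweep 1: A = L L^T
        Linv = solve_triangular(L, np.eye(L.shape[0]), lower=True)  # sweep 2: L^{-1}
        return Linv.T @ Linv               # sweep 3: A^{-1} = L^{-T} L^{-1}

    # Quick check on a random SPD matrix.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = B @ B.T + 4 * np.eye(4)
    print(np.allclose(spd_inverse_three_sweeps(A) @ A, np.eye(4)))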
  3. Scalable Parallelization of FLAME Code via the Workqueuing Model
    Field Van Zee, Paolo Bientinesi, Tze Meng Low and Robert van de Geijn
    ACM Transactions on Mathematical Software, Volume 34(2), March 2008.
    @article{Van_Zee2008:598,
        author  = "Field {Van Zee} and Paolo Bientinesi and Tze {Meng Low} and Robert {van de Geijn}",
        title   = "Scalable Parallelization of FLAME Code via the Workqueuing Model",
        journal = "ACM Transactions on Mathematical Software",
        year    = 2008,
        volume  = 34,
        number  = 2,
        month   = mar
    }
    We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
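    The cost-sorting idea can be sketched in a few lines of Python, with threads standing in for OpenMP workqueuing; task counts and costs below are illustrative, not the paper's workloads.

    # Sketch of the task-sorting idea: enqueue coarse-grain tasks (e.g. block
    # updates of a symmetric rank-k update) largest-first, so the pool does
    # not finish with one long task running alone.
    from concurrent.futures import ThreadPoolExecutor
    import time

    def block_update(block_id, cost):
        time.sleep(cost)          # stand-in for a DGEMM on one block
        return block_id

    tasks = [(i, 0.01 * (i % 7 + 1)) for i in range(20)]   # (block, cost) pairs
    tasks.sort(key=lambda t: t[1], reverse=True)           # most expensive first

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(block_update, b, c) for b, c in tasks]
        done = [f.result() for f in futures]
    print(f"completed {len(done)} block updates")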
  4. Representing Linear Algebra Algorithms in Code: The FLAME APIs
    Paolo Bientinesi, Enrique S. Quintana-Orti and Robert van de Geijn
    ACM Transactions on Mathematical Software, Volume 31(1), March 2005.
    @article{Bientinesi2005:504,
        author  = "Paolo Bientinesi and {Enrique S.} Quintana-Orti and Robert {van de Geijn}",
        title   = "Representing Linear Algebra Algorithms in Code: The FLAME APIs",
        journal = "ACM Transactions on Mathematical Software",
        year    = 2005,
        volume  = 31,
        number  = 1,
        month   = mar,
        url     = "http://hpac.rwth-aachen.de/~pauldj/pubs/FLAME-API.pdf"
    }
    In this article, we present a number of Application Program Interfaces (APIs) for coding linear algebra algorithms. On the surface, these APIs for the MATLAB M-script and C programming languages appear to be simple, almost trivial, extensions of those languages. Yet with them, the task of programming and maintaining families of algorithms for a broad spectrum of linear algebra operations is greatly simplified. In combination with our Formal Linear Algebra Methods Environment (FLAME) approach to deriving such families of algorithms, dozens of algorithms for a single linear algebra operation can be derived, verified to be correct, implemented, and tested, often in a matter of minutes per algorithm. Since the algorithms are expressed in code much like they are explained in a classroom setting, these APIs become not just a tool for implementing libraries, but also a valuable tool for teaching the algorithms that are incorporated in the libraries. In combination with an extension of the Parallel Linear Algebra Package (PLAPACK) API, the approach presents a migratory path from algorithm to MATLAB implementation to high-performance sequential implementation to parallel implementation. Finally, the APIs are being used to create a repository of algorithms and implementations for linear algebra operations, the FLAME Interface REpository (FIRE), which already features hundreds of algorithms for dozens of commonly encountered linear algebra operations.
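    The spirit of the APIs (advancing through a matrix by repartitioning rather than via loop and array indices) can be imitated in a short Python sketch; part_2x1 below is a hypothetical stand-in for the FLA_Part_2x1-style routines, and the real FLAME@lab and FLAME/C APIs differ in detail.

    import numpy as np

    def part_2x1(A, mb):
        """Hypothetical FLA_Part_2x1-like helper: split A into top/bottom."""
        return A[:mb, :], A[mb:, :]

    def scale_flame_style(A, alpha, nb=2):
        """Scale A one block of rows at a time, FLAME-style: the loop
        advances by growing the 'done' part AT rather than by tracking a
        raw loop index in the algorithm text."""
        AT, AB = part_2x1(A, 0)
        while AT.shape[0] < A.shape[0]:
            b = min(nb, AB.shape[0])
            A1 = AB[:b, :]           # repartition: expose the next row block
            A1 *= alpha              # update step (views write through to A)
            AT, AB = part_2x1(A, AT.shape[0] + b)   # move boundary past A1
        return A

    A = np.arange(16.0).reshape(4, 4)
    print(scale_flame_style(A, 2.0))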
  5. The Science of Deriving Dense Linear Algebra Algorithms
    Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Orti and Robert van de Geijn
    ACM Transactions on Mathematical Software, Volume 31(1), March 2005.
    @article{Bientinesi2005:460,
        author  = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}",
        title   = "The Science of Deriving Dense Linear Algebra Algorithms",
        journal = "ACM Transactions on Mathematical Software",
        year    = 2005,
        volume  = 31,
        number  = 1,
        month   = mar,
        url     = "http://hpac.rwth-aachen.de/~pauldj/pubs/Recipe.ps"
    }
    In this article we present a systematic approach to the derivation of families of high-performance algorithms for a large set of frequently encountered dense linear algebra operations. As part of the derivation a constructive proof of the correctness of the algorithm is generated. The article is structured so that it can be used as a tutorial for novices. However, the method has been shown to yield new high-performance algorithms for well-studied linear algebra operations and should also be of interest to those who wish to produce best-in-class high-performance codes.
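    The pivotal choice in each derivation is the loop invariant, read off from the partitioned form of the target operation. As a representative example in the style of the FLAME worksheet (an assumption for illustration, not a formula quoted from this article), one invariant for the Cholesky factorization A = L L^T is:

    % One loop invariant for the Cholesky factorization A = L L^T, with A
    % partitioned into quadrants and hats denoting original contents:
    \[
      \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
      =
      \begin{pmatrix}
        L_{TL} & \star \\
        \hat{A}_{BL} L_{TL}^{-T} & \hat{A}_{BR} - L_{BL} L_{BL}^{T}
      \end{pmatrix}
    \]
    % The derivation then lets the proof of correctness dictate the update
    % steps that maintain this state from one iteration to the next.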

Peer Reviewed Conference Publications

  1. Execution-Less Performance Modeling
    Roman Iakymchuk and Paolo Bientinesi
    Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11), November 2011.
    @inproceedings{Iakymchuk2011:258,
        author      = "Roman Iakymchuk and Paolo Bientinesi",
        title       = "Execution-Less Performance Modeling",
        booktitle   = "Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11)",
        year        = 2011,
        address     = "Seattle, USA",
        month       = nov,
        institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University"
    }
  2. Improving High-Performance Computations on Clouds Through Resource Underutilization
    Roman Iakymchuk, Jeff Napper and Paolo Bientinesi
    Proceedings of ACM 26th Symposium on Applied Computing, pp. 119-126, ACM, 2011.
    @inproceedings{Iakymchuk2011:404,
        author    = "Roman Iakymchuk and Jeff Napper and Paolo Bientinesi",
        title     = "Improving High-Performance Computations on Clouds Through Resource Underutilization",
        booktitle = "Proceedings of ACM 26th Symposium on Applied Computing",
        year      = 2011,
        pages     = "119--126 ",
        address   = "Taichung, Taiwan",
        publisher = "ACM"
    }
    We investigate the effects of shared resources for high-performance computing in a commercial cloud environment where multiple virtual machines share a single hardware node. Although good performance is occasionally obtained, contention degrades the expected performance and introduces significant variance. Using the DGEMM kernel and the HPL benchmark, we show that the underutilization of resources considerably improves expected performance by reducing contention for the CPU and cache space. For instance, for some cluster configurations, the solution is reached almost an order of magnitude earlier on average when the available resources are underutilized. The performance benefits for single node computations are even more impressive: Underutilization improves the expected execution time by two orders of magnitude. Finally, in contrast to unshared clusters, extending underutilized clusters by adding more nodes often improves the execution time due to an increased parallelism even with a slow interconnect. In the best case, by underutilizing the nodes performance was improved enough to entirely offset the cost of an extra node in the cluster.
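    The single-node effect is easy to probe: cap the BLAS thread count below the core count and time a fixed-size DGEMM. A minimal sketch follows; the thread count and matrix size are illustrative, not the paper's cloud configurations, and the environment variable must be set before NumPy loads its BLAS.

    # Minimal probe of the underutilization effect on one node.
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "2")   # underutilize: fewer threads than cores

    import time
    import numpy as np

    n = 2000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    start = time.perf_counter()
    C = A @ B                                       # DGEMM via the underlying BLAS
    print(f"{os.environ['OMP_NUM_THREADS']} threads: {time.perf_counter() - start:.2f} s")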
  3. Can Cloud Computing Reach the TOP500?
    Jeff Napper and Paolo Bientinesi
    UCHPC-MAW '09 Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pp. 17-20, 2009.
    In conjunction with the 2009 ACM International Conference on Computing Frontiers.
    @inproceedings{Napper2009:804,
        author    = "Jeff Napper and Paolo Bientinesi",
        title     = "Can Cloud Computing Reach the TOP500?",
        booktitle = "UCHPC-MAW '09 Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop ",
        year      = 2009,
        pages     = "17--20",
        note      = "In conjunction with the 2009 ACM International Conference on Computing Frontiers",
        url       = "http://hpac.rwth-aachen.de/~pauldj/pubs/EC2-Cloud.pdf"
    }
    Computing as a utility has reached the mainstream. Scientists can now rent time on large commercial clusters through several vendors. The cloud computing model provides flexible support for "pay as you go" systems. In addition to no upfront investment in large clusters or supercomputers, such systems incur no maintenance costs. Furthermore, they can be expanded and reduced on-demand in real-time. Current cloud computing performance falls short of systems specifically designed for scientific applications. Scientific computing needs are quite different from those of web applications--composed primarily of database queries--that have been the focus of cloud computing vendors. In this paper we investigate the use of cloud computing for high-performance numerical applications. In particular, we assume unlimited monetary resources to answer the question, "How high can a cloud computing service get in the TOP500 list?" We show results for the Linpack benchmark on different allocations on Amazon EC2.
  4. SuperMatrix: a Multithreaded Runtime Scheduling System for Algorithms-by-Blocks
    Paolo Bientinesi, Ernie Chan, Enrique S. Quintana-Orti, Gregorio Quintana-Orti and Robert van de Geijn
    Proceedings of ACM SIGPLAN 2008 Symposium on Principles and Practice of Parallel Programming (PPoPP'08), 2008.
    @inproceedings{Bientinesi2008:400,
        author    = "Paolo Bientinesi and Ernie Chan and {Enrique S.} Quintana-Orti and Gregorio Quintana-Orti and Robert {van de Geijn}",
        title     = "SuperMatrix: a Multithreaded Runtime Scheduling System for Algorithms-by-Blocks",
        booktitle = "Proceedings of ACM SIGPLAN 2008 Symposium on Principles and Practice of Parallel Programming (PPoPP'08)",
        year      = 2008
    }
    This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.
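    The enqueue-then-execute pattern can be sketched in a few lines of Python: operations on blocks are recorded together with the blocks they read and write, dependencies are inferred from block usage, and any task whose predecessors have completed may run. This is a sequential toy in the spirit of the description above, not the SuperMatrix implementation.

    class Runtime:
        def __init__(self):
            self.tasks = []            # list of (fn, deps)
            self.last_writer = {}      # block -> index of last task writing it

        def enqueue(self, fn, reads, writes):
            """Record a task; infer dependencies from prior writers."""
            deps = {self.last_writer[b] for b in reads + writes
                    if b in self.last_writer}
            self.tasks.append((fn, deps))
            for b in writes:
                self.last_writer[b] = len(self.tasks) - 1

        def execute(self):
            """Run any task whose dependencies are satisfied (out of order)."""
            done = set()
            while len(done) < len(self.tasks):
                for i, (fn, deps) in enumerate(self.tasks):
                    if i not in done and deps <= done:
                        fn()
                        done.add(i)

    rt = Runtime()
    rt.enqueue(lambda: print("chol(A[0][0])"), reads=[], writes=["A00"])
    rt.enqueue(lambda: print("trsm(A[0][0], A[1][0])"), reads=["A00"], writes=["A10"])
    rt.enqueue(lambda: print("syrk(A[1][0], A[1][1])"), reads=["A10"], writes=["A11"])
    rt.execute()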