Performance portability

Performance portability refers to the ability of computer programs and applications to operate effectively across different platforms. Developers of performance portable applications seek to support multiple platforms without impeding performance, and ideally while minimizing platform-specific code.[1]

It is sought after within the high-performance computing (HPC) community, but there is no universally agreed way to measure it. There is also some contention as to whether portability refers to the portability of an application or the portability of its source code.

Performance can be measured in two ways: by comparing an optimized version of an application with its portable version, or by estimating the application's theoretical peak performance from the number of FLOPs (floating-point operations) it performs relative to the data it moves from main memory to the processor.

The diversity of hardware makes developing software that works across a wide variety of machines increasingly important for the longevity of the application.

Contentions

The term performance portability is frequently used in industry and generally refers to: "(1) the ability to run one application across multiple hardware platforms; and (2) achieving some notional level of performance on these platforms."[1] For example, at the 2016 DOE (United States Department of Energy) Centers of Excellence Performance Portability Meeting, John Pennycook (Intel) stated, "An application is performance portable if it achieves a consistent level of performance [e.g. defined by execution time or other figure of merit, not percentage of peak FLOPS (floating point operations per second)] across platforms relative to the best known implementation on each platform."[2] Jeff Larkin (NVIDIA) put it more directly, describing performance portability as when "The same source code will run productively on a variety of different architectures."[2]

Performance portability is a key topic of discussion within the HPC community. Collaborators from industry, academia, and DOE national laboratories meet annually at the Performance, Portability, and Productivity in HPC (P3HPC) forum,[3] launched in 2016, to discuss ideas and progress toward performance portability goals on current and future HPC platforms.

Relevance

Performance portability remains relevant to developers[4][5][6][7] because constantly evolving computing architectures threaten to make applications designed for current hardware obsolete.[8] It rests on the expectation that a developer's single codebase will continue to perform within acceptable limits on newer architectures and on current architectures on which the code has not yet been tested.[8][9][10][11] The increasing diversity of hardware makes developing software that works across a wide variety of machines necessary for longevity and continued relevance.[12]

One prominent proponent of performance portability is the United States Department of Energy's (DOE) Exascale Computing Project (ECP). The ECP's mission of creating an exascale computing ecosystem involves a diverse array of hardware architectures, which makes performance portability an ongoing concern that must be addressed in order to use exascale supercomputers effectively.[2] At the 2016 DOE Centers of Excellence Performance Portability Meeting, Bert Still (Lawrence Livermore National Laboratory) stated that performance portability was "a critical ongoing issue" for the ECP due to its continuing use of diverse platforms.[2] Since 2016 the DOE has hosted workshops exploring the continued importance of performance portability.[13] Organizations attending the 2017 meeting included the National Energy Research Scientific Computing Center (NERSC), Lawrence Livermore National Laboratory (LLNL), Sandia National Laboratories (SNL), Oak Ridge National Laboratory (ORNL), International Business Machines (IBM), Argonne National Laboratory (ANL), Los Alamos National Laboratory (LANL), Intel, and NVIDIA.[13]

Measuring Performance Portability

Quantifying when a program reaches performance portability depends on two factors. The first factor, portability, can be measured by comparing the total lines of code that are used across multiple architectures with the total lines of code that are intended for a single architecture.[14][1] There is some contention as to whether portability refers to the portability of an application (i.e. does it run everywhere or not) or the portability of source code (i.e. how much code is specialized). The second factor, performance, can be measured in a few ways. One method is to compare the performance of a platform-optimized version of an application with the performance of a portable version of the same application.[14][1] Another method is to construct a roofline performance model, which provides the theoretical peak performance of an application based on the number of FLOPs performed relative to the data moved from main memory to the processor over the course of program execution.[15]
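The measurements described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names, the units, and the simple two-ceiling roofline (a single compute peak and a single memory bandwidth) are assumptions made for the example.

```python
def shared_code_fraction(shared_lines: int, specialized_lines: int) -> float:
    """Fraction of source lines that are used on every target architecture."""
    total = shared_lines + specialized_lines
    if total == 0:
        raise ValueError("no source lines counted")
    return shared_lines / total


def application_efficiency(portable_perf: float, best_known_perf: float) -> float:
    """Performance of the portable version relative to the platform-optimized one.

    Both arguments are rates where higher is better (e.g. GFLOP/s).
    """
    if best_known_perf <= 0:
        raise ValueError("best known performance must be positive")
    return portable_perf / best_known_perf


def roofline_bound(peak_gflops: float, peak_gbytes_per_s: float,
                   flops: float, bytes_moved: float) -> float:
    """Upper bound on attainable GFLOP/s under a simple roofline model."""
    if bytes_moved <= 0:
        raise ValueError("bytes moved must be positive")
    arithmetic_intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    # Performance is limited by either the compute peak or the memory bandwidth.
    return min(peak_gflops, arithmetic_intensity * peak_gbytes_per_s)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(shared_code_fraction(shared_lines=9000, specialized_lines=1000))
    print(application_efficiency(portable_perf=80.0, best_known_perf=100.0))
    print(roofline_bound(peak_gflops=1000.0, peak_gbytes_per_s=100.0,
                         flops=2e9, bytes_moved=8e9))
```

In the last call the kernel performs 0.25 FLOPs per byte, so on the assumed machine with 100 GB/s of bandwidth it is memory bound and the model caps it at 25 GFLOP/s, well below the 1000 GFLOP/s compute peak.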

There are currently no universal standards for what truly makes code or an application performance portable, and no agreement about whether proposed measurement methods accurately capture the concerns that are relevant to code teams. During the 2016 DOE Centers of Excellence Performance Portability Meeting, speaker David Richards of Lawrence Livermore National Laboratory stated, "A code is performance portable when the application team says its performance portable!"[2]
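One proposed measure, from the paper cited as [1], defines performance portability as the harmonic mean of an application's performance efficiency over a chosen set of platforms, and as zero if the application does not run on every platform in the set. A minimal sketch follows, assuming efficiencies are already expressed as fractions between 0 and 1 and that an unsupported platform is represented by `None` (a convention of this example, not of the paper):

```python
from typing import Mapping, Optional


def performance_portability(efficiencies: Mapping[str, Optional[float]]) -> float:
    """Harmonic mean of per-platform efficiencies.

    Returns 0.0 if any platform is unsupported (None) or has zero efficiency,
    so a code that fails on one platform is not counted as portable.
    """
    if not efficiencies:
        raise ValueError("at least one platform is required")
    values = list(efficiencies.values())
    if any(e is None or e <= 0.0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


if __name__ == "__main__":
    # Hypothetical efficiencies; the platform names are placeholders.
    print(performance_portability({"cpu": 0.5, "gpu": 1.0}))
    print(performance_portability({"cpu": 0.9, "gpu": None}))
```

Because the harmonic mean is dominated by the smallest values, one poorly performing platform pulls the score down more than an arithmetic mean would. The result also depends on which efficiency is used (relative to the best known implementation or relative to a hardware peak) and on which platforms are included, so both should be reported alongside the number.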

A 2019 study titled "Performance Portability across Diverse Computer Architectures" analyzed multiple parallel programming models across a diverse set of architectures in order to determine the current state of performance portability. The study concluded that, when writing performance-portable code, it is important to use open (standard) programming models supported by multiple vendors across multiple hardware platforms, to expose maximal parallelism at all levels of the algorithm and application, and to develop and improve codes on multiple platforms simultaneously. It also found that multi-objective auto-tuning can help find suitable parameters in a flexible codebase to achieve good performance on all platforms.[16]
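The auto-tuning idea can be illustrated with a small sketch. It assumes that the runtime of each candidate configuration has already been measured on each platform, and it reduces the multi-objective problem to a single score by maximizing the worst-case efficiency across platforms. That scalarization is a simplification chosen for this example, not the method used in the study; the function name and data layout are likewise hypothetical.

```python
from typing import Dict, Hashable


def pick_portable_config(runtimes: Dict[Hashable, Dict[str, float]]) -> Hashable:
    """Choose the configuration with the best worst-case efficiency.

    runtimes[config][platform] is a measured runtime in seconds (lower is better).
    Efficiency on a platform is the best runtime seen there divided by the
    configuration's own runtime on that platform.
    """
    if not runtimes:
        raise ValueError("no configurations measured")
    # Only compare on platforms where every configuration was measured.
    platforms = set.intersection(*(set(r) for r in runtimes.values()))
    if not platforms:
        raise ValueError("no platform was measured for every configuration")
    best = {p: min(r[p] for r in runtimes.values()) for p in platforms}

    def worst_case_efficiency(config: Hashable) -> float:
        return min(best[p] / runtimes[config][p] for p in platforms)

    return max(runtimes, key=worst_case_efficiency)


if __name__ == "__main__":
    # Made-up runtimes for three block sizes on two placeholder platforms.
    measured = {
        32: {"cpu": 1.0, "gpu": 4.0},
        128: {"cpu": 1.25, "gpu": 2.5},
        512: {"cpu": 2.0, "gpu": 2.0},
    }
    print(pick_portable_config(measured))
```

With these made-up numbers, block size 32 is fastest on the CPU and 512 is fastest on the GPU, but 128 is selected because it reaches 80% of the best runtime on both platforms, whereas the other two drop to 50% on one of them.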

Studies from 2022 postulate that an adequate and inclusive definition of the performance portability of a parallel application is desirable but complex, and that it is doubtful whether most researchers and developers in the scientific community would accept such a definition. They also identify a trend in the development of parallel programming models over the past two decades, especially in the new portable performance abstractions added to current versions and those expected in the coming years: the performance portability that parallel programming models provide to applications will become more significant than the performance portability that applications can achieve on their own. In other words, parallel programming models are expected to become more descriptive than prescriptive, transferring a great deal of responsibility from the programmer to the programming model implementation and its underlying compiler, which ultimately determine the degree of performance portability of the application. This is a fundamental conceptual change in how applications will be developed in the foreseeable future, and it makes it necessary to raise the abstraction level of the definition of performance portability. These studies therefore propose a definition of performance portability that is parallel programming model-centric[17] rather than application-centric.[18]

Framework and Non-Framework Solutions

A number of programming frameworks and other tools help programmers make their applications performance portable. Frameworks that claim to support functional portability include OpenCL, SYCL, Kokkos, RAJA, Java, OpenMP, and OpenACC. These programming interfaces support multi-platform multiprocessing programming in particular programming languages.[19] Non-framework solutions include self-tuning and domain-specific languages.

References

  1. Pennycook, John; Sewall, Jason; Lee, Victor (8 August 2017). "Implications of a metric for performance portability". Future Generation Computer Systems 92: 947–958. doi:10.1016/j.future.2017.08.007.
  2. Neely, Rob J. (April 21, 2016). DOE Centers of Excellence Performance Portability Meeting (Report). doi:10.2172/1332474.
  3. "P3HPC: Performance, Portability & Productivity in HPC". https://p3hpc.org/.
  4. Matthias, Jacob; Randall, Keith (2002). "Cross-Architectural Performance Portability of a Java Virtual Machine Implementation". USENIX (The Advanced Computing Systems Association). https://www.usenix.org/conference/java-vm-02/cross-architectural-performance-portability-java-virtual-machine. Retrieved 2021-10-10.
  5. Edwards, H. Carter; Sunderland, Daniel; Porter, Vicki; Amsler, Chris; Mish, Sam (2012). "Manycore Performance-Portability: Kokkos Multidimensional Array Library". Scientific Programming 20 (2): 89–114. doi:10.3233/SPR-2012-0343. ISSN 1058-9244. https://www.hindawi.com/journals/sp/2012/917630/. Retrieved 2021-10-10.
  6. Bosilca, George; Bouteiller, Aurelien; Herault, Thomas; Lemariner, Pierre; Saengpatsa, Narapat; Tomov, Stanimire; Dongarra, Jack (2011). "Performance Portability of a GPU Enabled Factorization with the DAGuE Framework". IEEE Cluster: Workshop on Parallel Programming on Accelerator Clusters (PPAC): 1–8. https://www.icl.utk.edu/publications/performance-portability-gpu-enabled-factorization-dague-framework. Retrieved 2021-10-10.
  7. Nemire, Brad (2015-10-29). "Performance Portability for GPUs and CPUs with OpenACC". https://developer.nvidia.com/blog/pgi-15-10-delivers-performance-portability-from-gpus-to-cpus-for-openacc/.
  8. Howard, Micah; Bradley, Andrew Michael; Bova, Steven W.; Overfelt, James R.; Wagnild, Ross Martin; Dinzl, Derek John; Hoemmen, Mark Frederick; Klinvex, Alicia Marie (2017-06-01). "Towards a Performance Portable Compressible CFD Code". American Institute of Aeronautics and Astronautics. https://www.osti.gov/biblio/1458230.
  9. McCool, Michael D. (2012). Structured parallel programming : patterns for efficient computation. James Reinders, Arch Robison. Amsterdam: Elsevier/Morgan Kaufmann. ISBN 978-0-12-391443-9. OCLC 798575627. https://www.worldcat.org/oclc/798575627. Retrieved 2021-10-10.
  10. Hemsoth, Nicole (2020-11-19). "'Wombat' Puts Arm's SVE Instruction Set to the Test". http://www.nextplatform.com/2020/11/18/wombat-gauges-arm-environments-in-the-wild/.
  11. "NERSC, ALCF, Codeplay Partner on SYCL GPU Compiler". 2021-03-01. https://insidehpc.com/2021/03/nersc-alcf-codeplay-partner-on-sycl-gpu-compiler/.
  12. Marques, Osni (9 December 2020). "Software Design for Longevity with Performance Portability". https://www.exascaleproject.org/event/softwaredesign/.
  13. "DOE COE Performance Portability Meeting 2017". August 2017. https://www.lanl.gov/projects/advanced-simulation-computing/doe-coe-mtg-2017.php.
  14. "Measurement Techniques - Performance Portability". https://performanceportability.org/perfport/measurements/.
  15. "Quantitatively Assessing Performance Portability with Roofline". 23 January 2019. https://www.exascaleproject.org/event/perfport/.
  16. Deakin, Tom J.; McIntosh-Smith, Simon N.; Price, James; Poenaru, Andrei; Atkinson, Patrick R.; Popa, Codrin; Salmon, Justin (2020-01-02). "Performance Portability across Diverse Computer Architectures". 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). Institute of Electrical and Electronics Engineers (IEEE). pp. 1–13. doi:10.1109/P3HPC49587.2019.00006. ISBN 978-1-7281-6003-0. https://research-information.bris.ac.uk/en/publications/performance-portability-across-diverse-computer-architectures. Retrieved 2021-10-10.
  17. Marowka, Ami (12 January 2022). "On the Performance Portability of OpenACC, OpenMP, Kokkos and RAJA". International Conference on High Performance Computing in Asia-Pacific Region. HPCAsia2022. pp. 103–114. doi:10.1145/3492805.3492806. ISBN 9781450384988. https://dl.acm.org/doi/10.1145/3492805.3492806. Retrieved 2022-02-04.
  18. Marowka, Ami (January 2022). "Reformulation of the performance portability metric". Software: Practice and Experience 52 (1): 154–171. doi:10.1002/spe.3002. https://onlinelibrary.wiley.com/doi/10.1002/spe.3002. Retrieved 2022-02-04.
  19. "OpenMP About Us". https://www.openmp.org/about/about-us/.

Bibliography

  • Exascale Scientific Applications: Scalability and Performance Portability. United Kingdom, CRC Press, 2017.
  • Mazaheri A., Schulte J., Moskewicz M.W., Wolf F., Jannesari A. (2019) Enhancing the Programmability and Performance Portability of GPU Tensor Operations. In: Yahyapour R. (eds) Euro-Par 2019: Parallel Processing. Euro-Par 2019. Lecture Notes in Computer Science, vol 11725. Springer, Cham. https://doi.org/10.1007/978-3-030-29400-7_16