Automatic Adjoint Differentiation Compiler

From HandWiki
Short description: Run-time graph compiler for C++


The Automatic Adjoint Differentiation Compiler (AADC) is a just-in-time (JIT) compiler designed for automatic adjoint differentiation (AAD) that avoids inefficiencies introduced by traditional high-level programming languages[1] in repetitive calculations such as Monte Carlo simulation, value at risk (VaR), or stress testing. It was first presented in 2019 at the Intel HPC Workshop.

In addition to AAD, the library is used to speed up 'forward' repetitive calculations, such as the Monte Carlo simulations required in XVA calculations,[2] and for machine learning in time-series analysis, where it outperformed a traditional TensorFlow LSTM implementation in both prediction accuracy and training speed.[3]

Technology overview

Traditionally, algorithmic adjoint differentiation is implemented through the use of a memory tape (Wengert list). Expression templates or operator overloading are typically used to extract the sequence of elementary operations for each pass through an algorithm and record it on the tape. The tape is then used to evaluate the derivatives of interest by applying the chain rule in reverse for each elementary operation (see reverse accumulation). In practice, the use of a tape slows down the program because, on every iteration, the CPU must interpret the sequence of operations held on the tape rather than executing them directly.
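The tape-based technique described above can be sketched in a few lines of C++. This is a minimal illustration of the classical Wengert-list approach, not AADC's actual API: each overloaded operator records its local partial derivatives on a global tape, and a reverse sweep applies the chain rule backwards to accumulate adjoints.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal tape (Wengert list) for reverse-mode AD. Each node stores the
// local partial derivatives with respect to its (at most two) parents.
struct Tape {
    struct Node {
        double dlhs; int lhs;   // partial w.r.t. left parent, parent index
        double drhs; int rhs;   // partial w.r.t. right parent, parent index
    };
    std::vector<Node> nodes;
    int push(double dl, int l, double dr, int r) {
        nodes.push_back({dl, l, dr, r});
        return static_cast<int>(nodes.size()) - 1;
    }
};

static Tape tape;

// Active scalar type: carries its value and its position on the tape.
struct Var { double val; int idx; };

Var make_var(double v) { return {v, tape.push(0.0, 0, 0.0, 0)}; }

Var operator+(Var a, Var b) { return {a.val + b.val, tape.push(1.0, a.idx, 1.0, b.idx)}; }
Var operator*(Var a, Var b) { return {a.val * b.val, tape.push(b.val, a.idx, a.val, b.idx)}; }
Var sin(Var a) { return {std::sin(a.val), tape.push(std::cos(a.val), a.idx, 0.0, a.idx)}; }

// Reverse accumulation: seed the output adjoint with 1 and walk the tape
// backwards, adding each node's contribution into its parents' adjoints.
std::vector<double> grad(Var out) {
    std::vector<double> adj(tape.nodes.size(), 0.0);
    adj[out.idx] = 1.0;
    for (int i = out.idx; i >= 0; --i) {
        const Tape::Node& n = tape.nodes[i];
        adj[n.lhs] += n.dlhs * adj[i];
        adj[n.rhs] += n.drhs * adj[i];
    }
    return adj;
}
```

For f(x, y) = x·y + sin(x) at (2, 3), a single reverse sweep yields both df/dx = y + cos(x) and df/dy = x. Note that the reverse sweep interprets the recorded nodes one by one, which is exactly the per-iteration overhead the article describes.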

The AADC solution uses an operator-overloading approach to record the elementary operations during program execution, but instead of replaying a tape, AADC generates optimised machine code[4] at runtime. This machine code is applied to a large set of data points, with the compiler converting scalar operations into full SIMD[5] vector operations that process four (AVX2) or eight (AVX-512) data samples in parallel. Since the recording happens for only one input sample, the resulting recorded function is thread-safe, turning code that is not multithread-safe into code that can be safely executed on multicore systems.[6][7] Due to these properties, AADC can compute all the differentials faster than the primal program runs.[8]
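The record-once, replay-many property can be illustrated as follows. This is a hedged sketch, not AADC's API: `kernel` is a hypothetical stand-in for a recorded pricing function, and because it is a pure function of its inputs, the same code can be replayed over a large batch of samples from several threads at once, even if the original program was not thread-safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <thread>
#include <vector>

// Hypothetical stand-in for a kernel recorded from one input sample.
// A pure function: no shared mutable state, so replay is thread-safe.
double kernel(double spot, double vol) {
    return spot * std::exp(-0.5 * vol * vol);
}

// Replay the recorded kernel over a batch of samples using several threads.
std::vector<double> replay_parallel(const std::vector<double>& spots,
                                    double vol, unsigned nthreads) {
    std::vector<double> out(spots.size());
    std::vector<std::thread> pool;
    const std::size_t chunk = (spots.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(spots.size(), lo + chunk);
            // Scalar loop body for clarity; a vectorising JIT like the one
            // described above would instead emit AVX2/AVX-512 instructions
            // processing 4 or 8 samples at a time.
            for (std::size_t i = lo; i < hi; ++i)
                out[i] = kernel(spots[i], vol);
        });
    }
    for (auto& th : pool) th.join();
    return out;
}
```

Each thread writes to a disjoint slice of `out` and reads only immutable inputs, which is why no locking is needed; this is the multicore safety property the paragraph above attributes to recorded functions.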

References