Loop-level parallelism
Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in programs where data is stored in random-access data structures. Where a sequential program iterates over the data structure and operates on one index at a time, a program exploiting loop-level parallelism uses multiple threads or processes that operate on some or all of the indices at the same time. Such parallelism can reduce the overall execution time of the program, with the achievable speedup typically bounded by Amdahl's law.
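As a rough illustration of the bound that Amdahl's law places on such a speedup, the sketch below evaluates the usual formula 1 / ((1 - f) + f / p). The function name and the parameters f (fraction of the sequential run time spent in the parallelizable loop) and p (number of processors) are assumptions made for this example.

```c
#include <assert.h>

/* Amdahl's law: upper bound on the speedup when a fraction f of the
   sequential run time is spread perfectly over p processors.
   amdahl_speedup, f and p are illustrative names, not from the article. */
double amdahl_speedup(double f, int p) {
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return 1.0 / ((1.0 - f) + f / p);
}
```

For example, if the loop accounts for 90% of the run time (f = 0.9) and 10 processors are available, the bound is 1 / 0.19, or roughly 5.26, well short of 10.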
Description
For simple loops, where each iteration is independent of the others, loop-level parallelism can be embarrassingly parallel, as parallelizing only requires assigning a process to handle each iteration. However, many algorithms are designed to run sequentially, and fail when parallel processes race due to dependence within the code. Sequential algorithms are sometimes applicable to parallel contexts with slight modification. Usually, though, they require process synchronization. Synchronization can be either implicit, via message passing, or explicit, via synchronization primitives like semaphores.
Example
Consider the following code operating on a list L of length n.
for (int i = 0; i < n; ++i) {
    S1: L[i] += 10;
}
Each iteration of the loop takes the value from the current index of L and increments it by 10. If statement S1 takes time T to execute, then the loop takes time n * T to execute sequentially, ignoring time taken by loop constructs. Now, consider a system with p processors where p > n. If n threads run in parallel, the time to execute all n steps is reduced to T.
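The timing argument above can be generalized to fewer processors than iterations. The sketch below is an idealized model, not a measurement: it assumes that every iteration costs the same time t, that iterations are spread evenly, and that scheduling overhead is zero. The name doall_time is hypothetical.

```c
#include <assert.h>

/* Idealized time for n independent iterations of cost t on p processors:
   each processor runs at most ceil(n / p) iterations back to back. */
double doall_time(int n, int p, double t) {
    assert(n >= 0 && p > 0);
    int rounds = (n + p - 1) / p;   /* ceil(n / p) for non-negative n */
    return rounds * t;
}
```

With p = 1 this reduces to the sequential time n * T, and with p >= n it reduces to T, matching the two cases described in the text.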
Less simple cases produce inconsistent, i.e. non-serializable, outcomes. Consider the following loop operating on the same list L.
for (int i = 1; i < n; ++i) {
    S1: L[i] = L[i - 1] + 10;
}
Each iteration sets the current index to the value of the previous index plus ten. When run sequentially, each iteration is guaranteed that the previous iteration has already produced the correct value. With multiple threads, process scheduling and other considerations mean there is no guarantee that an iteration executes only after the iteration it depends on has finished. It may well run earlier, leading to unexpected results. Serializability can be restored by adding synchronization to preserve the dependence on previous iterations.
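The effect of execution order on this loop can be reproduced deterministically, without threads, by applying the iterations in an explicit order. The helper below is a hypothetical sketch written for this purpose: run_in_order and its parameters are not part of the original example.

```c
#include <assert.h>

/* Apply S1 (L[i] = L[i - 1] + 10) for the iterations listed in order[],
   in exactly that order. count is the number of entries in order[]. */
void run_in_order(int *L, int n, const int *order, int count) {
    for (int k = 0; k < count; ++k) {
        int i = order[k];
        assert(i >= 1 && i < n);   /* only valid loop iterations */
        L[i] = L[i - 1] + 10;
    }
}
```

Starting from a list of four zeros, the order 1, 2, 3 yields 0, 10, 20, 30, while the order 3, 2, 1 yields 0, 10, 10, 10. A parallel run without synchronization may produce either result, or others, depending on scheduling.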
Dependencies in code
There are several types of dependences that can be found within code.^{[1]}^{[2]}
| Type | Notation | Description |
|---|---|---|
| True (flow) dependence | S1 ->T S2 | A true dependence between S1 and S2 means that S1 writes to a location later read from by S2. |
| Anti-dependence | S1 ->A S2 | An anti-dependence between S1 and S2 means that S1 reads from a location later written to by S2. |
| Output dependence | S1 ->O S2 | An output dependence between S1 and S2 means that S1 and S2 write to the same location. |
| Input dependence | S1 ->I S2 | An input dependence between S1 and S2 means that S1 and S2 read from the same location. |
In order to preserve the sequential behaviour of a loop when run in parallel, true dependence must be preserved. Anti-dependence and output dependence can be dealt with by giving each process its own copy of variables (known as privatization).^{[1]}
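A minimal sketch of privatization, assuming an OpenMP-capable compiler, is shown below. The loop, the function name scale_shift and the scalar t are invented for illustration. In the sequential version the single variable t is rewritten in every iteration, which creates anti and output dependences between iterations. The private clause gives each thread its own copy and removes them.

```c
#include <assert.h>

/* Hypothetical example: t is a temporary that every iteration writes
   and then reads. */
void scale_shift(int n, int *a, const int *b) {
    assert(n >= 0);
    int t;                               /* one shared scalar in the sequential version */
    #pragma omp parallel for private(t)  /* each thread works on its own copy of t */
    for (int i = 0; i < n; ++i) {
        t = 2 * b[i];                    /* write t */
        a[i] = t + 1;                    /* read t: true dependence within the iteration */
    }
}
```

The true dependence from the write of t to the read of t stays inside one iteration, so it is preserved automatically. If the compiler does not support OpenMP, the pragma is ignored and the loop runs sequentially with the same result.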
Example of true dependence
S1: int a, b;
S2: a = 2;
S3: b = a + 40;
S2 ->T S3, meaning that there is a true dependence from S2 to S3, because S2 writes to the variable a, which S3 later reads from.
Example of anti-dependence
S1: int a, b = 40;
S2: a = b - 38;
S3: b = 1;
S2 ->A S3, meaning that there is an anti-dependence from S2 to S3, because S2 reads from the variable b before S3 writes to it.
Example of output dependence
S1: int a, b = 40;
S2: a = b - 38;
S3: a = 2;
S2 ->O S3, meaning that there is an output dependence from S2 to S3, because both write to the variable a.
Example of input dependence
S1: int a, b, c = 2;
S2: a = c - 1;
S3: b = c + 1;
S2 ->I S3, meaning that there is an input dependence from S2 to S3, because S2 and S3 both read from the variable c.
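The four cases above follow one rule: given two statements that touch the same location, the dependence type is determined by whether the earlier and the later statement read or write it. The sketch below encodes that rule. The type dep_kind and the functions classify and dep_name are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DEP_TRUE, DEP_ANTI, DEP_OUTPUT, DEP_INPUT } dep_kind;

/* S1 executes before S2 and both access the same location.
   Each flag says whether that statement writes (true) or reads (false). */
dep_kind classify(bool s1_writes, bool s2_writes) {
    if (s1_writes && !s2_writes) return DEP_TRUE;    /* write, then read  */
    if (!s1_writes && s2_writes) return DEP_ANTI;    /* read, then write  */
    if (s1_writes && s2_writes)  return DEP_OUTPUT;  /* write, then write */
    return DEP_INPUT;                                /* read, then read   */
}

const char *dep_name(dep_kind k) {
    switch (k) {
    case DEP_TRUE:   return "true (flow)";
    case DEP_ANTI:   return "anti";
    case DEP_OUTPUT: return "output";
    case DEP_INPUT:  return "input";
    }
    assert(!"unknown dep_kind");
    return "unknown";
}
```

In the true-dependence example, S2 writes a and S3 reads it, so classify(true, false) returns DEP_TRUE. In the anti-dependence example, S2 reads b and S3 writes it, so classify(false, true) returns DEP_ANTI.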
Dependence in loops
Loop-carried vs loop-independent dependence
Loops can have two types of dependence:
- Loop-carried dependence
- Loop-independent dependence
In loop-independent dependence, statements depend on each other within a single iteration, but there is no dependence between iterations. Each iteration may be treated as a block and performed in parallel without other synchronization efforts.
In the following example code, used for swapping the values of two arrays of length n, there is a loop-independent dependence S1 ->T S3.
for (int i = 0; i < n; ++i) {
    S1: tmp = a[i];
    S2: a[i] = b[i];
    S3: b[i] = tmp;
}
In loop-carried dependence, statements in an iteration of a loop depend on statements in another iteration of the loop. Loop-carried dependence uses a modified version of the dependence notation seen earlier.
The following is an example of loop-carried dependence where S1[i] ->T S1[i + 1], where i indicates the current iteration and i + 1 indicates the next iteration.
for (int i = 1; i < n; ++i) {
    S1: a[i] = a[i - 1] + 1;
}
Loop-carried dependence graph
A loop-carried dependence graph graphically shows the loop-carried dependences between iterations. Each iteration is listed as a node on the graph, and directed edges show the true, anti, and output dependences between iterations.
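As a small sketch of how such a graph might be represented in code, the function below lists the edges for a loop whose only loop-carried dependence has a fixed distance d, such as S1[i] ->T S1[i + 1] with d = 1. The edge type, the function name lcd_edges and the assumption of one constant dependence distance are all illustrative choices.

```c
#include <assert.h>

typedef struct { int from, to; } edge;

/* Fill out[] with the edges i -> i + d for the iterations first..last
   (inclusive) and return how many were written. out[] must have room
   for at least last - first + 1 entries. */
int lcd_edges(int first, int last, int d, edge *out) {
    assert(d > 0);
    int count = 0;
    for (int i = first; i + d <= last; ++i) {
        out[count].from = i;     /* iteration that produces the value */
        out[count].to = i + d;   /* iteration that consumes it */
        ++count;
    }
    return count;
}
```

For iterations 1 to 4 with d = 1, this produces the chain 1 -> 2, 2 -> 3, 3 -> 4. A chain of this kind is what forces the iterations to run in order.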
Types
There are a variety of methodologies for parallelizing loops.
- DISTRIBUTED loop
- DOALL parallelism
- DOACROSS parallelism
- HELIX ^{[3]}
- DOPIPE parallelism
Each implementation varies slightly in how threads synchronize, if at all. In addition, parallel tasks must somehow be mapped to a process. These tasks can be allocated either statically or dynamically. Research has shown that load balancing can be better achieved through some dynamic allocation algorithms than through static allocation.^{[4]}
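To make static allocation concrete, the sketch below computes which contiguous block of iterations a given worker would receive when n iterations are divided among p workers ahead of time. The function name block_range and the block-wise split are assumptions. Other static schemes, such as cyclic assignment, are equally valid.

```c
#include <assert.h>

/* Static block allocation: worker id (0 <= id < p) gets the half-open
   range [*lo, *hi) of the n iterations. The first n % p workers get
   one extra iteration so that block sizes differ by at most one. */
void block_range(int n, int p, int id, int *lo, int *hi) {
    assert(n >= 0 && p > 0 && id >= 0 && id < p);
    int base = n / p;
    int extra = n % p;
    *lo = id * base + (id < extra ? id : extra);
    *hi = *lo + base + (id < extra ? 1 : 0);
}
```

With n = 10 and p = 3 the workers receive [0, 4), [4, 7) and [7, 10). Because the split is fixed before the loop runs, it cannot react to iterations that take unequal time, which is where dynamic allocation can balance load better.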
The process of parallelizing a sequential program can be broken down into the following discrete steps.^{[1]} Each concrete loop parallelization below implicitly performs them.
| Step | Description |
|---|---|
| Decomposition | The program is broken down into tasks, the smallest exploitable unit of concurrency. |
| Assignment | Tasks are assigned to processes. |
| Orchestration | Data access, communication, and synchronization of processes. |
| Mapping | Processes are bound to processors. |
DISTRIBUTED loop
When a loop has a loop-carried dependence, one way to parallelize it is to distribute the loop into several different loops. Statements that are not dependent on each other are separated so that these distributed loops can be executed in parallel. For example, consider the following code.
for (int i = 1; i < n; ++i) {
    S1: a[i] = a[i - 1] + b[i];
    S2: c[i] += d[i];
}
The loop has a loop-carried dependence S1[i] ->T S1[i+1], but S1 and S2 do not depend on each other, so we can rewrite the code as follows.
loop1: for (int i = 1; i < n; ++i) {
    S1: a[i] = a[i - 1] + b[i];
}
loop2: for (int i = 1; i < n; ++i) {
    S2: c[i] += d[i];
}
Note that loop1 and loop2 can now be executed in parallel. Instead of a single instruction being performed in parallel on different data, as in data-level parallelism, here different loops perform different tasks on different data. This type of parallelism is called function parallelism or task parallelism. Let the execution times of S1 and S2 be [math]\displaystyle{ T_{S_1} }[/math] and [math]\displaystyle{ T_{S_2} }[/math]. The sequential form of the code above then takes [math]\displaystyle{ n*(T_{S_1}+T_{S_2}) }[/math]. With the two statements split into separate loops that run concurrently, each loop still executing its own iterations in order, the execution time becomes [math]\displaystyle{ \max(n*T_{S_1}, n*T_{S_2}) }[/math].
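One possible way to run loop1 and loop2 concurrently is with OpenMP sections, as in the hedged sketch below. OpenMP is a choice made for this example, not something the text prescribes, and the function name distributed is hypothetical. Each section is handed to one thread, and inside a section the iterations still run in order.

```c
#include <assert.h>

void distributed(int n, int *a, const int *b, int *c, const int *d) {
    assert(a != c);   /* partial check: the two loops must write different arrays */
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 1; i < n; ++i)   /* loop1: sequential, S1[i] ->T S1[i+1] */
            a[i] = a[i - 1] + b[i];

        #pragma omp section
        for (int i = 1; i < n; ++i)   /* loop2: touches neither a nor b */
            c[i] += d[i];
    }
}
```

Without OpenMP support the pragmas are ignored and the two loops simply run one after the other, which gives the same arrays as the original fused loop.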
DOALL parallelism
DOALL parallelism exists when statements within a loop can be executed independently (situations where there is no loop-carried dependence).^{[1]} For example, the following code does not read from the array a, and does not update the arrays b and c. No iteration has a dependence on any other iteration.
for (int i = 0; i < n; ++i) {
    S1: a[i] = b[i] + c[i];
}
Let the time of one execution of S1 be [math]\displaystyle{ T_{S_1} }[/math]. The sequential form of the code above then takes [math]\displaystyle{ n*T_{S_1} }[/math]. Because all iterations are independent, they can all be executed in parallel, which, given at least n processors, reduces the execution time to [math]\displaystyle{ T_{S_1} }[/math], the time taken by one iteration in sequential execution.
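The following example, using simplified pseudocode, shows how a loop might be parallelized to execute each iteration independently.
begin_parallelism();          // iterations that follow may run as parallel tasks
for (int i = 0; i < n; ++i) {
    S1: a[i] = b[i] + c[i];
    end_parallelism();        // end of this iteration's parallel task
}
block();                      // wait until every iteration has finished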
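For comparison with the pseudocode, here is a minimal runnable sketch of the same DOALL loop using an OpenMP parallel for directive. OpenMP is an assumption of this example and the function name vector_add is hypothetical. The directive plays the role of the fork, and the implicit barrier at the end of the loop plays the role of the final wait.

```c
#include <assert.h>

void vector_add(int n, int *a, const int *b, const int *c) {
    assert(n >= 0);
    #pragma omp parallel for     /* iterations are divided among the threads */
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];      /* S1: no iteration reads what another writes */
    }                            /* implicit barrier: all iterations are done here */
}
```

No synchronization is needed inside the loop body because the iterations share no written data. Compilers without OpenMP support ignore the pragma and run the loop sequentially.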
DOACROSS parallelism
DOACROSS parallelism exists where iterations of a loop are parallelized by extracting calculations that can be performed independently and running them simultaneously.^{[5]}
Synchronization exists to enforce loop-carried dependence.
Consider the following sequential loop with dependence S1[i] ->T S1[i+1].
for (int i = 1; i < n; ++i) {
    S1: a[i] = a[i - 1] + b[i] + 1;
}
Each loop iteration performs two actions:
- Calculate a[i - 1] + b[i] + 1
- Assign the value to a[i]
Calculating the value a[i - 1] + b[i] + 1 and then performing the assignment can be decomposed into two lines (statements S1 and S2):
S1: int tmp = b[i] + 1;
S2: a[i] = a[i - 1] + tmp;
The first line, int tmp = b[i] + 1;, has no loop-carried dependence. The loop can then be parallelized by computing the temp value in parallel, and then synchronizing the assignment to a[i].
post(0);
for (int i = 1; i < n; ++i) {
    S1: int tmp = b[i] + 1;
    wait(i - 1);
    S2: a[i] = a[i - 1] + tmp;
    post(i);
}
Let the execution times of S1 and S2 be [math]\displaystyle{ T_{S_1} }[/math] and [math]\displaystyle{ T_{S_2} }[/math]. The sequential form of the code above then takes [math]\displaystyle{ n*(T_{S_1}+T_{S_2}) }[/math]. With DOACROSS parallelism, the S1 computations of different iterations run in parallel while the S2 assignments execute one after another in a pipelined fashion, which gives an execution time of [math]\displaystyle{ T_{S_1} + n*T_{S_2} }[/math].
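The post and wait calls in the example are abstract. One way to express the same pattern, assuming a compiler that supports OpenMP 4.5 or later, is the ordered(1) clause together with depend(sink: ...) and depend(source), which stand in for wait(i - 1) and post(i). The sketch below is an illustration under that assumption, and the function name doacross is hypothetical.

```c
#include <assert.h>

void doacross(int n, int *a, const int *b) {
    assert(n >= 1);                    /* a[0] must exist as the starting value */
    #pragma omp parallel for ordered(1)
    for (int i = 1; i < n; ++i) {
        int tmp = b[i] + 1;            /* S1: independent, may overlap across iterations */
        #pragma omp ordered depend(sink: i - 1)   /* like wait(i - 1) */
        a[i] = a[i - 1] + tmp;         /* S2: must follow S2 of iteration i - 1 */
        #pragma omp ordered depend(source)        /* like post(i) */
    }
}
```

For the first iteration, i - 1 lies outside the iteration space, so the sink dependence is ignored, which corresponds to the initial post(0). Without OpenMP support the pragmas are ignored and the loop runs sequentially with the same result.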
DOPIPE parallelism
DOPIPE parallelism implements pipelined parallelism for loop-carried dependence, where a loop iteration is distributed over multiple, synchronized loops.^{[1]} The goal of DOPIPE is to act like an assembly line, where one stage is started as soon as there is sufficient data available for it from the previous stage.^{[6]}
Consider the following sequential code with dependence S1[i] ->T S1[i+1].
for (int i = 1; i < n; ++i) {
    S1: a[i] = a[i - 1] + b[i];
    S2: c[i] += a[i];
}
S1 must be executed sequentially, but S2 has no loopcarried dependence. S2 could be executed in parallel using DOALL Parallelism after performing all calculations needed by S1 in series. However, the speedup is limited if this is done. A better approach is to parallelize such that the S2 corresponding to each S1 executes when said S1 is finished.
Implementing pipelined parallelism results in the following set of loops, where the second loop may execute for an index as soon as the first loop has finished its corresponding index.
for (int i = 1; i < n; ++i) {
    S1: a[i] = a[i - 1] + b[i];
    post(i);
}
for (int i = 1; i < n; ++i) {
    wait(i);
    S2: c[i] += a[i];
}
Let the execution times of S1 and S2 be [math]\displaystyle{ T_{S_1} }[/math] and [math]\displaystyle{ T_{S_2} }[/math]. The sequential form of the code above then takes [math]\displaystyle{ n*(T_{S_1}+T_{S_2}) }[/math]. With DOPIPE parallelism, each loop runs on its own processor and the S2 of one iteration overlaps with the S1 of the next. Assuming [math]\displaystyle{ T_{S_2} \le T_{S_1} }[/math] and negligible synchronization cost, only the last S2 is not overlapped, which gives an execution time of [math]\displaystyle{ n*T_{S_1} + T_{S_2} }[/math].
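The assembly-line behaviour can be explored with a small timing model rather than real threads. The sketch below assumes one processor per loop, constant statement times t1 and t2, and zero cost for post and wait. The function name dopipe_time and these assumptions are made for this example only.

```c
#include <assert.h>

/* Finish time of a two-stage pipeline over n_iters iterations.
   Stage 1 runs S1 back to back. Stage 2 runs S2 for an iteration as soon
   as both its S1 and the previous S2 have finished. */
double dopipe_time(int n_iters, double t1, double t2) {
    assert(n_iters >= 0 && t1 >= 0.0 && t2 >= 0.0);
    double s1_done = 0.0;   /* time at which the latest S1 finished (post) */
    double s2_done = 0.0;   /* time at which the latest S2 finished */
    for (int k = 0; k < n_iters; ++k) {
        s1_done += t1;
        double start = s1_done > s2_done ? s1_done : s2_done;   /* wait */
        s2_done = start + t2;
    }
    return s2_done;
}
```

For four iterations with t1 = 2 and t2 = 1 the model returns 9, compared with 12 for the sequential loop. When S2 is the slower stage (t1 = 1, t2 = 2) it also returns 9, because the second loop then becomes the bottleneck.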
See also
- Data parallelism
- Task parallelism
- Parallelism using different memory models, such as shared memory, distributed memory, and message passing
References
1. Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 9781482211184.
2. Goff, Gina (1991). "Practical dependence testing". Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation - PLDI '91. pp. 15–29. doi:10.1145/113445.113448. ISBN 0897914287.
3. Murphy, Niall. "Discovering and exploiting parallelism in DOACROSS loops". https://www.cl.cam.ac.uk/techreports/UCAMCLTR882.pdf. Retrieved 10 September 2016.
4. Kavi, Krishna. "Parallelization of DOALL and DOACROSS Loops - a Survey". https://www.researchgate.net/publication/220662641_Parallelization_of_DOALL_and_DOACROSS_Loopsa_Survey.
5. Unnikrishnan, Priya (2012). "A Practical Approach to DOACROSS Parallelization". Euro-Par 2012 Parallel Processing. Lecture Notes in Computer Science. 7484. pp. 219–231. doi:10.1007/9783642328206_23. ISBN 9783642328190. https://semanticscholar.org/paper/0885cd07bc4affd8f433bd3b4ee56012101ae09a
6. "DoPipe: An Effective Approach to Parallelize Simulation". https://software.intel.com/sites/default/files/m/a/a/7/d/6/12758MC_Forum_Zangbinyu_dopipe.pdf. Retrieved 13 September 2016.
Original source: https://en.wikipedia.org/wiki/Looplevel parallelism.