Parallel external memory
In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.[1] It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.
Model
Definition
The PEM model[1] is a combination of the EM model and the PRAM model. It is a computation model which consists of [math]\displaystyle{ P }[/math] processors and a two-level memory hierarchy: a large external memory (main memory) of size [math]\displaystyle{ N }[/math] and [math]\displaystyle{ P }[/math] small internal memories (caches). The processors share the main memory, while each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size [math]\displaystyle{ M }[/math] and is partitioned into blocks of size [math]\displaystyle{ B }[/math]. The processors can only perform operations on data which are in their cache. Data is transferred between the main memory and the caches in blocks of size [math]\displaystyle{ B }[/math].
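For illustration, the model's parameters can be captured in a small sketch. The class and method names here are hypothetical helpers, not part of the model's formal definition:

```python
from dataclasses import dataclass

@dataclass
class PEMModel:
    """Parameters of the PEM model, following the notation above."""
    P: int  # number of processors
    N: int  # size of the shared main memory (in items)
    M: int  # size of each private cache (in items)
    B: int  # block size for memory transfers

    def blocks_per_cache(self) -> int:
        # Each cache of size M is partitioned into M/B blocks.
        return self.M // self.B
```

For example, a machine with [math]\displaystyle{ M = 64 }[/math] and [math]\displaystyle{ B = 16 }[/math] holds 4 blocks per cache.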
I/O complexity
The complexity measure of the PEM model is the I/O complexity,[1] which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer each processor can transfer one block. So if [math]\displaystyle{ P }[/math] processors each load a data block of size [math]\displaystyle{ B }[/math] from the main memory into their caches in parallel, this counts as an I/O complexity of [math]\displaystyle{ O(1) }[/math], not [math]\displaystyle{ O(P) }[/math]. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.
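To make the cost measure concrete, the following sketch (a hypothetical helper, not from the source) counts the parallel I/Os of a scan in which every processor reads its own N/P-sized segment:

```python
from math import ceil

def parallel_scan_io(N: int, P: int, B: int) -> int:
    """Parallel I/O complexity of scanning N items with P processors.

    In one parallel I/O round every processor may transfer one block,
    so the cost is the number of blocks the busiest processor moves:
    ceil(ceil(N / P) / B) = O(N / (P * B)).
    """
    return ceil(ceil(N / P) / B)

# 8 processors scanning 1024 items with block size 16: each reads
# 128 items = 8 blocks, i.e. 8 parallel I/Os, versus the 64 block
# transfers a single processor would need for the same input.
```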
Read/write conflicts
In the PEM model, there is no direct communication network between the P processors; they have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts[1] occur. As in the PRAM model, three variations of this problem are considered:
- Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
- Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
- Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
The following two algorithms[1] resolve CREW and EREW conflicts when [math]\displaystyle{ P \leq B }[/math] processors write to the same block simultaneously. The first approach serializes the write operations: the processors write to the block one after the other, which results in a total of [math]\displaystyle{ P }[/math] parallel block transfers. The second approach needs [math]\displaystyle{ O(\log(P)) }[/math] parallel block transfers and one additional block per processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block: in the first round, [math]\displaystyle{ P }[/math] processors combine their blocks into [math]\displaystyle{ P/2 }[/math] blocks; then [math]\displaystyle{ P/2 }[/math] processors combine the [math]\displaystyle{ P/2 }[/math] blocks into [math]\displaystyle{ P/4 }[/math]; this continues until all the data is combined in one block.
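The tree-style combining scheme can be sketched sequentially as follows (a simplified simulation; in the PEM model each round's merges happen in parallel, so every round costs O(1) parallel block transfers):

```python
def tree_combine(blocks):
    """Merge P per-processor blocks into one in ceil(log2 P) rounds.

    Each round pairs up the remaining blocks and combines every pair,
    roughly halving the block count, until one block remains.
    """
    rounds = 0
    while len(blocks) > 1:
        merged = []
        for i in range(0, len(blocks), 2):
            pair = blocks[i] + (blocks[i + 1] if i + 1 < len(blocks) else [])
            merged.append(pair)
        blocks = merged
        rounds += 1
    return blocks[0], rounds

# 5 blocks need ceil(log2(5)) = 3 combining rounds:
data, rounds = tree_combine([[1], [2], [3], [4], [5]])
```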
Comparison to other models
Model | Multi-core | Cache-aware |
---|---|---|
Random-access machine (RAM) | No | No |
Parallel random-access machine (PRAM) | Yes | No |
External memory (EM) | No | Yes |
Parallel external memory (PEM) | Yes | Yes |
Examples
Multiway partitioning
Let [math]\displaystyle{ M=\{m_1,...,m_{d-1}\} }[/math] be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition[1] of A is a set [math]\displaystyle{ \Pi=\{A_1,...,A_d\} }[/math], where [math]\displaystyle{ \cup_{i=1}^d A_i = A }[/math] and [math]\displaystyle{ A_i\cap A_j=\emptyset }[/math] for [math]\displaystyle{ 1\leq i\lt j\leq d }[/math]. [math]\displaystyle{ A_i }[/math] is called the i-th bucket. Every element in [math]\displaystyle{ A_i }[/math] is greater than [math]\displaystyle{ m_{i-1} }[/math] and smaller than [math]\displaystyle{ m_{i} }[/math]. In the following algorithm[1] the input is partitioned into N/P-sized contiguous segments [math]\displaystyle{ S_1,...,S_P }[/math] in main memory. Processor i primarily works on the segment [math]\displaystyle{ S_i }[/math]. The multiway partitioning algorithm (used as a subroutine in PEM_DIST_SORT[1]) uses a PEM prefix sum algorithm[1] to calculate the prefix sum with the optimal [math]\displaystyle{ O\left(\frac{N}{PB} + \log P\right) }[/math] I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.
// Compute parallelly a d-way partition on the data segments [math]\displaystyle{ S_i }[/math]
for each processor i in parallel do
    Read the vector of pivots M into the cache.
    Partition [math]\displaystyle{ S_i }[/math] into d buckets and let vector [math]\displaystyle{ M_i=\{j_1^i,...,j_d^i\} }[/math] be the number of items in each bucket.
end for
Run PEM prefix sum on the set of vectors [math]\displaystyle{ \{M_1,...,M_P\} }[/math] simultaneously.
// Use the prefix sum vector to compute the final partition
for each processor i in parallel do
    Write elements [math]\displaystyle{ S_i }[/math] into memory locations offset appropriately by [math]\displaystyle{ M_{i-1} }[/math] and [math]\displaystyle{ M_{i} }[/math].
end for
Using the prefix sums stored in [math]\displaystyle{ M_P }[/math] the last processor P calculates the vector B of bucket sizes and returns it.
If the vector of [math]\displaystyle{ d=O\left(\frac{M}{B}\right) }[/math] pivots M and the input set A are located in contiguous memory, the d-way partitioning problem can be solved in the PEM model with [math]\displaystyle{ O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil \log(P)+d\log(B)\right) }[/math] I/O complexity. The contents of the final buckets have to be located in contiguous memory.
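The partitioning idea can be sketched sequentially in Python (an illustrative simulation, with hypothetical names; `bisect` stands in for the in-cache bucket test, and ties with a pivot go into that pivot's bucket). Each "processor" first histograms its segment, prefix sums over the histograms yield private write offsets, and the elements are then scattered into contiguous buckets without coordination:

```python
import bisect
from itertools import accumulate

def multiway_partition(A, pivots, P):
    """Sequential sketch of a d-way partition with P 'processors'.

    counts[i][j] plays the role of the vector M_i: how many items of
    segment S_i fall into bucket j. Prefix sums over the counts give
    each segment a private write offset into every bucket.
    """
    d = len(pivots) + 1
    seg = -(-len(A) // P)  # ceil(N / P), the segment size
    segments = [A[i:i + seg] for i in range(0, len(A), seg)]
    counts = [[0] * d for _ in segments]
    for i, S in enumerate(segments):
        for x in S:
            counts[i][bisect.bisect_left(pivots, x)] += 1
    sizes = [sum(c[j] for c in counts) for j in range(d)]
    starts = [0] + list(accumulate(sizes))[:-1]  # bucket start offsets
    # offs[i][j]: where segment i writes its next bucket-j element
    offs = [[starts[j] + sum(counts[k][j] for k in range(i))
             for j in range(d)] for i in range(len(segments))]
    out = [None] * len(A)
    for i, S in enumerate(segments):
        for x in S:
            j = bisect.bisect_left(pivots, x)
            out[offs[i][j]] = x
            offs[i][j] += 1
    return out, sizes
```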
Selection
The selection problem is about finding the k-th smallest item in an unordered list A of size N.
The following code[1] makes use of PRAMSORT, a PRAM-optimal sorting algorithm which runs in [math]\displaystyle{ O(\log N) }[/math], and SELECT, a cache-optimal single-processor selection algorithm.
if [math]\displaystyle{ N \leq P }[/math] then
    [math]\displaystyle{ \texttt{PRAMSORT}(A,P) }[/math]
    return [math]\displaystyle{ A[k] }[/math]
end if
// Find median of each [math]\displaystyle{ S_i }[/math]
for each processor i in parallel do
    [math]\displaystyle{ m_i = \texttt{SELECT}(S_i, \frac{N}{2P}) }[/math]
end for
// Sort medians
[math]\displaystyle{ \texttt{PRAMSORT}(\lbrace m_1, \dots, m_P \rbrace, P) }[/math]
// Partition around median of medians
[math]\displaystyle{ t = \texttt{PEMPARTITION}(A, m_{P/2}, P) }[/math]
if [math]\displaystyle{ k \leq t }[/math] then
    return [math]\displaystyle{ \texttt{PEMSELECT}(A[1:t], P, k) }[/math]
else
    return [math]\displaystyle{ \texttt{PEMSELECT}(A[t+1:N], P, k-t) }[/math]
end if
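The recursion can be simulated on one processor as follows (an illustrative sketch mirroring the structure above; ties with the pivot are split off explicitly so the recursion always shrinks, a detail the pseudocode leaves to PEMPARTITION):

```python
def pem_select(A, P, k):
    """Sequential sketch of PEMSELECT: find the k-th smallest (1-indexed).

    Segment medians are computed, their median serves as the pivot,
    and the search recurses into the side that contains rank k.
    """
    if len(A) <= P:
        return sorted(A)[k - 1]
    seg = -(-len(A) // P)  # ceil(N / P)
    segments = [A[i:i + seg] for i in range(0, len(A), seg)]
    medians = sorted(sorted(s)[len(s) // 2] for s in segments)
    pivot = medians[len(medians) // 2]
    left = [x for x in A if x < pivot]
    mid = [x for x in A if x == pivot]
    if k <= len(left):
        return pem_select(left, P, k)
    if k <= len(left) + len(mid):
        return pivot
    right = [x for x in A if x > pivot]
    return pem_select(right, P, k - len(left) - len(mid))
```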
Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:
- [math]\displaystyle{ O\left(\frac{N}{PB} + \log (PB) \cdot \log(\frac{N}{P})\right) }[/math]
Distribution sort
Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.
If [math]\displaystyle{ P = 1 }[/math] the task is delegated to a cache-optimal single-processor sorting algorithm.
Otherwise the following algorithm[1] is used:
// Sample [math]\displaystyle{ \tfrac{4N}{\sqrt{d}} }[/math] elements from A
for each processor i in parallel do
    if [math]\displaystyle{ M \lt |S_i| }[/math] then
        [math]\displaystyle{ d = M/B }[/math]
        Load [math]\displaystyle{ S_i }[/math] in M-sized pages and sort pages individually
    else
        [math]\displaystyle{ d = |S_i| }[/math]
        Load and sort [math]\displaystyle{ S_i }[/math] as single page
    end if
    Pick every [math]\displaystyle{ \sqrt{d}/4 }[/math]'th element from each sorted memory page into contiguous vector [math]\displaystyle{ R^i }[/math] of samples
end for
in parallel do
    Combine vectors [math]\displaystyle{ R^1 \dots R^P }[/math] into a single contiguous vector [math]\displaystyle{ \mathcal{R} }[/math]
    Make [math]\displaystyle{ \sqrt{d} }[/math] copies of [math]\displaystyle{ \mathcal{R} }[/math]: [math]\displaystyle{ \mathcal{R}_1 \dots \mathcal{R}_{\sqrt{d}} }[/math]
end do
// Find [math]\displaystyle{ \sqrt{d} }[/math] pivots [math]\displaystyle{ \mathcal{M}[j] }[/math]
for [math]\displaystyle{ j = 1 }[/math] to [math]\displaystyle{ \sqrt{d} }[/math] in parallel do
    [math]\displaystyle{ \mathcal{M}[j] = \texttt{PEMSELECT}(\mathcal{R}_j, \tfrac{P}{\sqrt{d}}, \tfrac{j \cdot 4N}{d}) }[/math]
end for
Pack pivots in contiguous array [math]\displaystyle{ \mathcal{M} }[/math]
// Partition A around pivots into buckets [math]\displaystyle{ \mathcal{B} }[/math]
[math]\displaystyle{ \mathcal{B} = \texttt{PEMMULTIPARTITION}(A[1:N],\mathcal{M},\sqrt{d},P) }[/math]
// Recursively sort buckets
for [math]\displaystyle{ j = 1 }[/math] to [math]\displaystyle{ \sqrt{d} + 1 }[/math] in parallel do
    recursively call [math]\displaystyle{ \texttt{PEMDISTSORT} }[/math] on bucket j of size [math]\displaystyle{ \mathcal{B}[j] }[/math] using [math]\displaystyle{ O \left( \left \lceil \tfrac{\mathcal{B}[j]}{N / P} \right \rceil \right) }[/math] processors responsible for elements in bucket j
end for
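A much-simplified sequential sketch of the distribution sort's recursive structure (pivot sampling is reduced to regular sampling, the PEMSELECT pivot step and the parallel machinery are omitted, and all names are hypothetical):

```python
import bisect

def pem_dist_sort(A, P, d=4):
    """Sequential sketch of the distribution sort's recursion:
    sample pivots, multipartition into buckets, sort buckets
    recursively, and concatenate the results."""
    if len(A) <= max(P, d):
        return sorted(A)  # base case: delegate to a plain sort
    step = max(1, len(A) // d)
    sample = sorted(A[::step])
    pivots = sample[1:d]  # up to d - 1 pivots from the sample
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in A:
        buckets[bisect.bisect_left(pivots, x)].append(x)
    out = []
    for b in buckets:
        # Degenerate pivots (e.g. all elements equal) would not shrink
        # the bucket, so fall back to a plain sort in that case.
        out.extend(sorted(b) if len(b) == len(A) else pem_dist_sort(b, P, d))
    return out
```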
The I/O complexity of PEMDISTSORT is:
- [math]\displaystyle{ O \left( \left \lceil \frac{N}{PB} \right \rceil \left ( \log_d P + \log_{M/B} \frac{N}{PB} \right ) + f(N,P,d) \cdot \log_d P \right) }[/math]
where
- [math]\displaystyle{ f(N,P,d) = O \left ( \log \frac{PB}{\sqrt{d}} \log \frac{N}{P} + \left \lceil \frac{\sqrt{d}}{B} \log P + \sqrt{d} \log B \right \rceil \right ) }[/math]
If the number of processors is chosen such that [math]\displaystyle{ f(N,P,d) = O\left ( \left \lceil \tfrac{N}{PB} \right \rceil \right ) }[/math] and [math]\displaystyle{ M \lt B^{O(1)} }[/math], the I/O complexity is then:
[math]\displaystyle{ O \left ( \frac{N}{PB} \log_{M/B} \frac{N}{B} \right ) }[/math]
Other PEM algorithms
PEM Algorithm | I/O complexity | Constraints |
---|---|---|
Mergesort[1] | [math]\displaystyle{ O\left(\frac{N}{PB} \log_{\frac{M}{B}} \frac{N}{B}\right) = \textrm{sort}_P(N) }[/math] | [math]\displaystyle{ P \leq \frac{N}{B^2}, M = B^{O(1)} }[/math] |
List ranking[2] | [math]\displaystyle{ O \left ( \textrm{sort}_P(N) \right ) }[/math] | [math]\displaystyle{ P \leq \frac{N/B^2}{\log B \cdot \log^{O(1)} N}, M = B^{O(1)} }[/math] |
Euler tour[2] | [math]\displaystyle{ O \left ( \textrm{sort}_P(N) \right ) }[/math] | [math]\displaystyle{ P \leq \frac{N}{B^2}, M = B^{O(1)} }[/math] |
Expression tree evaluation[2] | [math]\displaystyle{ O \left ( \textrm{sort}_P(N) \right ) }[/math] | [math]\displaystyle{ P \leq \frac{N}{B^2 \log B \cdot \log^{O(1)}N}, M = B^{O(1)} }[/math] |
Finding a MST[2] | [math]\displaystyle{ O \left(\textrm{sort}_P(|V|) + \textrm{sort}_P(|E|) \log \tfrac{|V|}{pB} \right) }[/math] | [math]\displaystyle{ p \leq \frac{|V|+|E|}{B^2 \log B \cdot \log^{O(1)} N}, M = B^{O(1)} }[/math] |
Where [math]\displaystyle{ \textrm{sort}_P(N) }[/math] is the time it takes to sort N items with P processors in the PEM model.
See also
- Parallel random-access machine (PRAM)
- Random-access machine (RAM)
- External memory (EM)
References
- ↑ 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739.
- ↑ 2.0 2.1 2.2 2.3 Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425.
Original source: https://en.wikipedia.org/wiki/Parallel external memory.