DMelt:DataAnalysis/2 Data Collections

Data collections

DataMelt supports many data collections through the Java API, the Jython API, and third-party libraries.

Among many data collections, the most important is "array", which is a group of like-typed variables that are referred to by a common name.

Native Java data collections

Read Java Collection tutorial.

Generally, such collections can hold any objects, not only primitive types. Here is a simple example for "double" values:

from java.util import *
import time
a=ArrayList()
r=Random()
for i in range(100000):
      a.add(r.nextGaussian())
start = time.clock()
Collections.sort(a)
print ' CPU time (s)=',time.clock()-start

In this case, we use ArrayList from the Java API, fill it with Gaussian random numbers, and sort it. Keep in mind the penalty you pay because Java stores the values as "Double" objects rather than as the primitive type "double".

Jython/Python data collections

Read Python data structures tutorial.

Here is a simple example using Python list:

from java.util import Random
import time
 
a=[]
r=Random()
for i in range(100000):
      a.append(r.nextGaussian())
start = time.clock()
a.sort()
print ' CPU time (s)=',time.clock()-start

Remember the penalty you pay for using the Python list, since it is designed to store objects rather than primitive types.
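To make the difference between object storage and primitive storage concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration and not part of the DataMelt API: it assumes the standard array module is available and uses a typed array of doubles as a stand-in for a primitive collection.

```python
from array import array
import random

rng = random.Random(42)
boxed = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # list of float objects
packed = array('d', boxed)                            # typed array of C doubles

# Each element of the typed array occupies a fixed number of bytes.
print('bytes per element in array: %d' % packed.itemsize)

# Like-typed storage is enforced: a non-numeric value is rejected.
try:
    packed.append('not a number')
    rejected = False
except TypeError:
    rejected = True
print('string rejected: %s' % rejected)

# Both containers hold the same values and sort to the same order.
boxed.sort()
packed_sorted = array('d', sorted(packed))
```

The list keeps a reference to a separate float object for every value, while the typed array keeps the raw numbers side by side, which is the same idea the primitive collections discussed on this page rely on.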


Fast data collections in DMelt

Here we discuss high-speed data collections that are especially well suited for numerical analysis. They have a very small memory footprint and typically outperform Python and Java collections by a large factor. Such collections store primitive types and, as a result, require less space and yield significant performance gains.

The DataMelt high-performance collections for numerical computations are:

  • jhplot.P0D - (double) data in 1 dimension. High-performance collection
  • jhplot.P0I - (integer) data in 1 dimension. High-performance collection

The example below illustrates how to use DataMelt high-performance collections. In this example we benchmark the collections implemented using the Java ArrayList, the Python list, and jhplot.P0D from the DataMelt API:

collections_test.py

The result of this code is shown below:

CPU time for P0D (s)= 0.221985919
 CPU time for ArrayList (s)= 1.075895129
 CPU time for Python list (s)= 1.953228532

As you can see, jhplot.P0D is roughly a factor of 5 faster than ArrayList, and almost a factor of 9 faster than the Python list, when it comes to the sort() method. In fact, it is faster for almost any numerical operation.
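The speed-up factors follow directly from the three timings printed above. The short sketch below only redoes that arithmetic with the quoted numbers; the variable names are illustrative.

```python
# Timings (in seconds) copied from the benchmark output above.
timings = {
    "P0D": 0.221985919,
    "ArrayList": 1.075895129,
    "Python list": 1.953228532,
}

baseline = timings["P0D"]
speedup = {}
for name in ("ArrayList", "Python list"):
    # Ratio > 1 means P0D sorted faster than the other collection.
    speedup[name] = timings[name] / baseline
    print("%s / P0D = %.1f" % (name, speedup[name]))
```

A single run like this is only indicative, since the ratios will vary with the machine and the JVM state.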

Here is another comparison, which looks up a pair of values (X,Y) using jhplot.P1D and two Python lists.

from java.util import Random
import time
from jhplot import P1D

a=P1D("high-performance")
b=P1D("high-performance")
x=[]
y=[]

r=Random()
for i in range(1000000):
      rr=i
      a.add(rr,rr)
      b.add(rr,rr)
      x.append(rr)
      y.append(rr)

start = time.clock()
i=a.indexOf(0,300000,300000) # find X-Y index starting from 0
print ' CPU time to find a value in  P1D list (s)=',time.clock()-start

start = time.clock()
i=x.index(300000)
i=y.index(300000)
print ' CPU time to find a value in Python list (s)=',time.clock()-start

In this test, the lookup in P1D is about a factor of 10 faster than the lookup in the Python lists.
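Note that calling x.index() and y.index() one after the other finds the first match in each list independently. A true pair lookup has to test both coordinates at the same position. The sketch below shows one way to do this in plain Python. The helper name index_of_pair is hypothetical, and returning -1 for a missing pair is an assumption made for this illustration, not documented P1D behaviour.

```python
def index_of_pair(xs, ys, x_value, y_value, start=0):
    """Return the first index i >= start with xs[i] == x_value and
    ys[i] == y_value, or -1 if there is no such position."""
    for i in range(start, min(len(xs), len(ys))):
        if xs[i] == x_value and ys[i] == y_value:
            return i
    return -1

# Two parallel lists holding the X and Y coordinates of the points.
x = list(range(1000))
y = list(range(1000))

print(index_of_pair(x, y, 300, 300))   # both coordinates match at one index
print(index_of_pair(x, y, 300, 301))   # values exist, but not as a pair
```

Because the loop runs in the interpreter, it is expected to be slower than a lookup implemented inside a compiled collection.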

Let us show simple Jython code that illustrates how to use a collection of a primitive type ("double" values) which outperforms the Java java.util.ArrayList by a factor of 7 when sorting its elements (i.e. when using the sort() method). The code below prints the time needed for the calculations:

from cern.colt.list.tdouble import *
from java.util import *
import time

a=DoubleArrayList()
b=ArrayList()
r=Random()
for i in range(100000):
      x=r.nextGaussian()
      a.add(x)
      b.add(x)

start = time.clock()
Collections.sort(b)
print ' CPU time for ArrayList (s)=',time.clock()-start

start = time.clock()
a.sort()
print ' CPU time for DoubleArrayList (s)=',time.clock()-start

This example is based on the Colt lists package, which is included in DataMelt by default.
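The timing pattern used in these listings measures each operation once, which can be noisy. A possible refinement, sketched below in plain Python, repeats the measurement on a fresh copy of the data and keeps the smallest value. The helper best_time and its parameters are hypothetical, and it reports wall-clock time from time.time() rather than CPU time.

```python
import random
import time

def best_time(func, make_input, repeats=5):
    """Call func(make_input()) `repeats` times and return the smallest
    wall-clock duration in seconds."""
    best = None
    for _ in range(repeats):
        data = make_input()        # fresh, unsorted input for every run
        start = time.time()
        func(data)
        elapsed = time.time() - start
        if best is None or elapsed < best:
            best = elapsed
    return best

rng = random.Random(1)
values = [rng.gauss(0.0, 1.0) for _ in range(100000)]

# Sort a copy each time so that every run starts from unsorted data.
t = best_time(lambda data: data.sort(), lambda: list(values))
print(' best time (s)= %g' % t)
```

Copying the input before each run matters for sorting benchmarks, because sorting data that is already sorted is usually much cheaper than sorting random data.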

Let us give another example showing that numerical analysis can be done faster and more efficiently using the high-speed collections for primitive types. We create two lists with about 200k integer values each. For the first list, we use the Python/Jython list implementation. For the second list, we use a high-speed collection that keeps primitive values (integers). We insert the value 9999 into each list and then search for this value, printing "True" if it is found in the high-speed collection. As before, we benchmark the operation by printing the time (in seconds) needed to find the value 9999. According to the code shown below, the high-speed collection outperforms the Python list by a factor of 25.


from cern.colt.list.tint import IntArrayList  # integer lists are in the "tint" package
from java.util import Random
import time

a=IntArrayList()
b=[]

r=Random()
for i in range(100000):
      x=int(1000*r.nextGaussian())
      a.add(x)
      b.append(x)

a.add(9999)
b.append(9999)

for i in range(100000):
      x=int(1000*r.nextGaussian())
      a.add(x)
      b.append(x)

start = time.clock()
b.index(9999)
print ' CPU time for Python list (s)=',time.clock()-start

start = time.clock()
print a.contains(9999)
print ' CPU time for IntArrayList (s)=',time.clock()-start

Again, this example is based on the Colt list packages, which are included in DataMelt by default. See the Colt lists API documentation for details.

Data arrays from 3rd party Java packages

DMelt includes the following Java arrays and lists from 3rd party packages:


Typically, Colt and Trove lists are about a factor of 2 faster than native Java lists. Search for all supported arrays using this link.

Using data collections

Data collections can be used not only to store data, but also to perform various manipulations.

For example, here is how to calculate the derivative of X-Y values, which is given by the slope [math]\displaystyle{ \frac{Y(i+1)-Y(i)}{X(i+1)-X(i)} }[/math]:

from jhplot  import  *
from java.awt import Color

p1=P1D("X-Y data points")
f=F1D("10*cos(0.1*x)",0,100)
for i in range(100):
         p1.add(i,f.eval(i))

c1 = HPlot("Canvas")
c1.visible()
c1.setAutoRange()
p2=p1.derivative()
p2.setColor(Color.red)
p2.setStyle("l")
c1.draw(p1)
c1.draw(p2)
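To show what the slope formula computes, here is a minimal plain-Python sketch that applies it to the same function. It is an illustration only: the helper name slopes is hypothetical, and how P1D.derivative() treats the end points or assigns X values to the result is not confirmed here.

```python
import math

def slopes(xs, ys):
    """Return (Y(i+1)-Y(i)) / (X(i+1)-X(i)) for each pair of
    consecutive points, so the result has len(xs) - 1 entries."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    result = []
    for i in range(len(xs) - 1):
        dx = xs[i + 1] - xs[i]
        dy = ys[i + 1] - ys[i]
        result.append(dy / float(dx))   # float() avoids integer division
    return result

# Same data as above: 10*cos(0.1*x) sampled at x = 0, 1, ..., 99.
xs = list(range(100))
ys = [10 * math.cos(0.1 * x) for x in xs]
d = slopes(xs, ys)
print('number of slopes: %d' % len(d))
```

Two points with the same X value would make the denominator zero, so the sketch assumes that the X values are distinct.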