DMelt:AI/6 Classification problem
Classification problem
Data classification is a central data-mining technique used for sorting data, understanding data, and predicting outcomes. The data classification problem can be solved using statistical algorithms, discussed in the section Statistical classification, as well as using machine learning, which is discussed in this section.
Classification using neural networks
In this small tutorial we will use the "Smile" library (https://github.com/haifengl/smile), which includes many methods for supervised and unsupervised data classification. We will write a small Python-like code using Jython to build a Multilayer Perceptron Neural Network for data classification. It will have a large number of inputs and several outputs, and can easily be extended to cases with many hidden layers. Only a few lines of Jython code are needed; most of our coding will deal with preparing an interface for reading data, rather than with Neural Network programming.
First of all, let us copy the data samples. One sample is for training (https://datamelt.org/examples/data/usps/zip.train), the other is for testing (https://datamelt.org/examples/data/usps/zip.test). Copy these files to your local directory.
We will need to import the necessary classes to be used in this example:
from smile.data import AttributeDataset,NominalAttribute
from smile.data.parser import DelimitedTextParser,IOUtils
from smile.classification import NeuralNetwork
from smile.math import Math
from jarray import zeros,array
from jhplot import *
import java
We import the classes from several packages: "smile", "jhplot" and Java. The additional package "jarray" is used to work with Java arrays in Jython. The NeuralNetwork class is the most important for our example. It creates a multilayer perceptron neural network that consists of several layers of nodes, interconnected through weighted acyclic arcs from each preceding layer to the following one.
We will call it as follows:
nn=[Nr_input, Nr_hidden, Nr_output] # define structure of NN
net=NeuralNetwork(NeuralNetwork.ErrorFunction.LEAST_MEAN_SQUARES, NeuralNetwork.ActivationFunction.LOGISTIC_SIGMOID,nn)
net.learn(x, y) # train neural net on input double array x, with the outcome given by y
Here we use the logistic sigmoid function and the error function "LEAST_MEAN_SQUARES".
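To make these two choices concrete, here is a minimal pure-Python sketch of the textbook logistic sigmoid and a mean squared error. It does not call Smile; the function names and the 3-class sample values are illustrative assumptions, and the exact scaling Smile applies internally for "LEAST_MEAN_SQUARES" may differ.

```python
import math

def logistic_sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mean_squared_error(outputs, targets):
    """Average squared difference between network outputs and targets."""
    total = 0.0
    for o, t in zip(outputs, targets):
        total += (o - t) ** 2
    return total / len(outputs)

# Hypothetical 3-class case: the true class is 1, written as a one-hot target.
outputs = [logistic_sigmoid(z) for z in (-2.0, 3.0, 0.0)]
targets = [0.0, 1.0, 0.0]
err = mean_squared_error(outputs, targets)
print(err)
```

The error is zero only when every output unit matches its target exactly, so training pushes the unit of the true class toward 1 and all others toward 0.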
Before using this NN, we will read the data from the files and create the input arrays x and y. We will write a small function "getJavaArrays" that returns two arrays, double[][] and int[] (in Java notation). After the NN is trained, we will use the test sample and call the method "predict(x)" to verify that our predictions are close to the expected values from the test sample.
The full code looks like this:
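As an illustration of what the reading step produces, the following pure-Python sketch splits text rows into a feature array and a label array without using Smile. It assumes the class label sits in column 0 and that fields are separated by tabs or spaces, as configured for the parser in this tutorial. The helper names and the two short sample rows are hypothetical (real rows in the data files are much longer).

```python
import re

# Same delimiter pattern as the one passed to setDelimiter in the tutorial.
DELIMITER = re.compile(r"[\t ]+")

def parse_line(line):
    """Split one text row into (features, label); the label is in column 0."""
    fields = DELIMITER.split(line.strip())
    label = int(float(fields[0]))          # assumes labels are stored as numbers, e.g. "6.0000"
    features = [float(v) for v in fields[1:]]
    return features, label

def parse_rows(lines):
    """Build x (list of feature rows) and y (list of integer labels)."""
    x, y = [], []
    for line in lines:
        if not line.strip():               # skip empty lines
            continue
        features, label = parse_line(line)
        x.append(features)
        y.append(label)
    return x, y

# Hypothetical rows with only four features each.
sample = ["6.0000 -1.000 -0.631 0.862 1.000",
          "0.0000\t-1.000 0.118 0.334 -0.900 "]
x, y = parse_rows(sample)
print(y)
```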
from smile.data import AttributeDataset,NominalAttribute
from smile.data.parser import DelimitedTextParser,IOUtils
from smile.classification import NeuralNetwork
from smile.math import Math
from jarray import zeros,array
from jhplot import *
import java

def getJavaArrays(dataset):
    # this function create x[][] and y[] array from datasets
    rows=dataset.size()
    lst = [0.0]*rows
    twoDimArr = array([lst,[]], java.lang.Class.forName('[D'))
    x = dataset.toArray(twoDimArr)
    y = dataset.toArray(zeros(rows, "i"))
    return x,y

parser =DelimitedTextParser();
parser.setDelimiter("[\t ]+")
parser.setResponseIndex(NominalAttribute("class"), 0)
train=parser.parse("Train",java.io.File("zip.train"))
x,y=getJavaArrays(train) # get input and output for training NN

print "Rescale data range .. "
p = len(x[0])
mu = Math.colMeans(x); sd = Math.colSds(x);
for i in range(len(x)):
    for j in range(p):
        x[i][j] = (x[i][j] - mu[j]) / sd[j];

nin=len(x[0]); nout=Math.max(y)+1
print "Training: Nr for input layer=",nin," Nr for output layer=",nout
nn=[nin,50,nout] # 50 nodes in a hidden layer. Need another layer? Add integer after 50
print "NN layout=",nn
net = NeuralNetwork(NeuralNetwork.ErrorFunction.LEAST_MEAN_SQUARES, NeuralNetwork.ActivationFunction.LOGISTIC_SIGMOID,nn)

c1 = SPlot() # plot error vs epoch
c1.visible(); c1.setGTitle("Neural Network output error")
c1.setAutoRange(); c1.setMarksStyle('various')
c1.setConnected(1, 0); c1.setNameX('Epoch'); c1.setNameY('Error')

for j in range(70):
    net.learn(x, y)
    error=0.0
    for i in range(len(x)):
        if (net.predict(x[i]) != y[i]): error +=1
    error=error / len(x)
    c1.addPoint(0,j,error,1)
    c1.update()
    print "Epoch=",j," Error=",error

print "Testing using zip.test file.."
test=parser.parse("Test",java.io.File("zip.test"))
testx,testy=getJavaArrays(test)
for i in range(len(testx)): # rescale data
    for j in range(p):
        testx[i][j] = (testx[i][j] - mu[j]) / sd[j];

error=0.0
for i in range(len(testx)):
    if (net.predict(testx[i]) != testy[i]): error +=1
print "Error rate =", 100.0 * error / len(testx),"%"
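The rescaling step in the listing deserves a closer look: the test sample is rescaled with the means and standard deviations of the training sample, not with its own. The pure-Python sketch below reproduces this idea on a tiny made-up dataset. The function names are hypothetical, the use of the sample standard deviation (dividing by n-1) is an assumption about what Math.colSds computes, and the guard for zero-spread columns is an addition that the listing does not have.

```python
import math

def col_means(rows):
    """Mean of every column."""
    n = float(len(rows))
    return [sum(col) / n for col in zip(*rows)]

def col_sds(rows, means):
    """Sample standard deviation (n-1 in the denominator) of every column."""
    n = len(rows)
    sds = []
    for j, col in enumerate(zip(*rows)):
        var = sum((v - means[j]) ** 2 for v in col) / (n - 1)
        sds.append(math.sqrt(var))
    return sds

def standardize(rows, means, sds):
    """Return a rescaled copy; columns with zero spread are only centred."""
    scaled = []
    for row in rows:
        new_row = []
        for v, m, s in zip(row, means, sds):
            new_row.append((v - m) / s if s > 0 else v - m)
        scaled.append(new_row)
    return scaled

# Made-up data: three training rows and one test row with two features each.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

mu = col_means(train)
sd = col_sds(train, mu)
train_z = standardize(train, mu, sd)
test_z = standardize(test, mu, sd)   # training statistics are reused here
print(test_z)
```

Reusing the training statistics keeps both samples on the same scale, so the network sees test inputs in the range it was trained on.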
Note that we use 50 nodes in a single hidden layer. The numbers of inputs (256) and outputs (10) are given by the structure of the data (look at the data files to see their structure).
Now you can run it using the DataMelt program (https://datamelt.org/). Install DataMelt, open the above code in the editor and save these lines in a file with the extension ".py" (say, "test.py"). This is an important step, since the extension tells DataMelt how to run this code. Then run the code: click the icon with a green running man in the top menu, or press [F8]. You will see a plot which shows how the learning error shrinks as a function of the epoch number. The last line of the code prints the error rate of the predictions on the test sample.
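The error rate reported at the end is simply the percentage of test rows whose predicted label differs from the expected one. A minimal sketch of that calculation, with a hypothetical helper name and made-up labels, could look like this:

```python
def error_rate(predicted, expected):
    """Percentage of positions where the predicted label differs from the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return 100.0 * wrong / len(expected)

# Made-up labels: two of the five predictions are wrong.
predicted = [3, 5, 5, 0, 9]
expected = [3, 5, 6, 0, 4]
rate = error_rate(predicted, expected)
print(rate)
```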
If you are interested in deep neural networks, increase the number of hidden layers using the input list "nn". For example, if you need a second hidden layer with 25 nodes, extend the list in the above example as:
nn=[nin,50,25,nout]
You will see that the number of epochs needed to obtain a similar (or better) rate of successful predictions is drastically reduced (but the CPU time will increase, since more time is needed to adjust the weights).
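One way to get a feeling for the extra work is to count the adjustable parameters for each layout. The sketch below assumes a fully connected network with one bias per non-input node; this is the usual textbook layout, and the way Smile stores its weights internally may differ. The function name is hypothetical.

```python
def count_parameters(layout):
    """Count weights and biases of a fully connected network.

    layout is a list such as [256, 50, 10]: inputs, hidden layers, outputs.
    Assumes one bias per node in every layer except the input layer.
    """
    total = 0
    for n_in, n_out in zip(layout[:-1], layout[1:]):
        total += n_in * n_out + n_out
    return total

single = count_parameters([256, 50, 10])       # one hidden layer
double = count_parameters([256, 50, 25, 10])   # two hidden layers
print(single)
print(double)
```

Under these assumptions the second hidden layer adds about a thousand parameters that must be updated in every training step.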
If you are interested in more examples using other methods to classify data, such as K-Nearest Neighbor, Linear Discriminant Analysis, Fisher's Linear Discriminant, Quadratic Discriminant Analysis, Regularized Discriminant Analysis, Logistic Regression, Maximum Entropy Classifier, etc., see this link: https://datamelt.org/code/index.php?art=Data_mining/Classification which shows how to use such methods through the Python interface.