DMelt:Statistics/5 Statistical classification

From HandWiki

Statistical classification

Statistical classification is a technique used to predict group membership for data instances: a given input object is assigned to one of a given number of categories. DMelt includes several libraries for data classification. They support K-Nearest Neighbor, Linear Discriminant Analysis (LDA), Fisher's Linear Discriminant (FLD), Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), Logistic Regression (LR), Maximum Entropy Classifier, Multilayer Perceptron Neural Network, Radial Basis Function Networks and many other algorithms. Many of these algorithms are included via the Smile Java project. The examples on this page use a Python-like approach.
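To make the idea of assigning an input object to one of a fixed set of categories concrete, here is a minimal K-Nearest Neighbor sketch in plain Python. It does not use the DMelt or Smile APIs; the function name `knn_predict` and the toy data are illustrative assumptions.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-class data (hypothetical values, not the IRIS set)
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
```

Calling `knn_predict(points, labels, (1.0, 1.0))` assigns the query to the category of the cluster it lies in. The choice of `k` trades sensitivity to noise against smoothing across class boundaries.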

Bayesian classification

Naive Bayes is a technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
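The following plain-Python sketch illustrates the principle, assuming numeric features that follow a Gaussian distribution within each class and are treated as independent of each other. It is not the JSAT implementation; the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    model = {}
    total = float(len(samples))
    for label, rows in grouped.items():
        n = float(len(rows))
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / total, means, variances)
    return model

def predict_gaussian_nb(model, features):
    """Return the label with the highest log-posterior score."""
    best_label, best_score = None, -float("inf")
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            # Log of the Gaussian density, summed over independent features
            score += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data with two classes (hypothetical values)
samples = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (4.0, 5.0), (4.2, 5.1), (3.9, 4.8)]
labels = ["x", "x", "x", "y", "y", "y"]
model = fit_gaussian_nb(samples, labels)
```

Working with log-probabilities rather than raw products avoids numerical underflow when many features are multiplied together.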

This method of data classification is implemented in the third-party Java class jsat.classifiers.bayesian.NaiveBayes. Let us consider an example that classifies the IRIS data[1]. It reads the data and then attempts to predict the correct label of each data point:

from java.io import File
from jsat.classifiers import DataPoint,Classifier,CategoricalResults,ClassificationDataSet
from jsat.classifiers.bayesian import NaiveBayes
from jsat import ARFFLoader,DataSet
from jhplot import *

print "Download iris_org.arff"
print Web.get("")
# Open the downloaded file; the file name is assumed to match the message above
fi = File("iris_org.arff")
dataSet = ARFFLoader.loadArffFile(fi)
# We specify '0' as the class we would like to make the target class.
cDataSet = ClassificationDataSet(dataSet, 0)

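errors = 0
classifier = NaiveBayes()
# Train the classifier before asking it for predictions.
# trainC() is assumed from older JSAT releases; newer releases name it train().
classifier.trainC(cDataSet)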
for i in range(dataSet.getSampleSize()):
  # It is important not to mix these up, the class has been removed from data points in 'cDataSet'
  dataPoint = cDataSet.getDataPoint(i) 
  truth = cDataSet.getDataPointCategory(i) # We can grab the true category from the data set
  # Categorical Results contains the probability estimates for each possible target class value. 
  # Classifiers that do not support probability estimates will mark their predictions with total confidence.
  predictionResults = classifier.classify(dataPoint)
  predicted = predictionResults.mostLikely()
  if(predicted != truth): errors +=1  
  print i,"| True Class: ", truth, ", Predicted: ", predicted, ", Confidence: ", predictionResults.getProb(predicted) 
print errors, " errors were made, ", 100.0*errors/dataSet.getSampleSize(), "% error rate"

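When you run this code, you will see that the classifier predicts the correct category with an error rate of about 4%. Because the same data points are used both to build the model and to test it, this figure is an optimistic estimate.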
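The error rate above is computed over every sample in the data set. A common alternative is to hold out part of the data and measure errors only on that part. The sketch below shows the bookkeeping in plain Python; the helper names `holdout_split` and `error_rate` are hypothetical and not part of JSAT or DMelt.

```python
import random

def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def error_rate(truths, predictions):
    """Percentage of predictions that differ from the true labels."""
    errors = sum(1 for t, p in zip(truths, predictions) if t != p)
    return 100.0 * errors / len(truths)

# The IRIS data set has 150 samples
train_idx, test_idx = holdout_split(150)
```

The index lists can then select which data points are used to build the model and which are passed to the classifier for scoring. Fixing the seed keeps the split reproducible between runs.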

See also


  1. Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950)