Statistical classification

Statistical classification is a technique used to predict group membership for data instances: it assigns a given input object to one of a given number of categories. DMelt includes several libraries for data classification. It supports K-Nearest Neighbor, Linear Discriminant Analysis (LDA), Fisher's Linear Discriminant (FLD), Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), Logistic Regression (LR), Maximum Entropy Classifier, Multilayer Perceptron Neural Network, Radial Basis Function Networks, and many other algorithms. Many of these algorithms are provided by the Smile Java project (http://haifengl.github.io/). Below are examples that use a Python-like approach.

Bayesian classification

The naive Bayes classifier is a technique for constructing models that assign class labels to problem instances. The instances are represented as vectors of feature values, and the class labels are drawn from a finite set.
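The usual decision rule can be written compactly. The notation below is introduced here for illustration and is not taken from any particular library: x = (x_1, ..., x_n) is the feature vector and C_1, ..., C_K are the possible class labels. The "naive" part is the assumption that the features are conditionally independent given the class, which lets the likelihood factor into a product:

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

In practice the product is usually evaluated as a sum of logarithms to avoid numerical underflow.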

This method of classification is implemented in the third-party Java class jsat.classifiers.bayesian.NaiveBayes. Let us consider an example that classifies the IRIS data^[1]. It reads the data, trains the classifier, and then attempts to predict the correct label of each data point:

from java.io import File
from jsat import ARFFLoader
from jsat.classifiers import ClassificationDataSet
from jsat.classifiers.bayesian import NaiveBayes
from jhplot import Web

print "Download iris_org.arff"
print Web.get("https://datamelt.org/examples/data/iris_org.arff")
fi = File("iris_org.arff")
dataSet = ARFFLoader.loadArffFile(fi)
# Use categorical attribute '0' as the target class we would like to predict.
cDataSet = ClassificationDataSet(dataSet, 0)

errors = 0
classifier = NaiveBayes()
classifier.train(cDataSet)
for i in range(dataSet.getSampleSize()):
  # It is important not to mix these up: the class has been removed from the data points in 'cDataSet'
  dataPoint = cDataSet.getDataPoint(i)
  truth = cDataSet.getDataPointCategory(i) # The true category comes from the data set
  # CategoricalResults contains the probability estimates for each possible target class value.
  # Classifiers that do not support probability estimates mark their prediction with total confidence.
  predictionResults = classifier.classify(dataPoint)
  predicted = predictionResults.mostLikely()
  if predicted != truth:
    errors += 1
  print i, "| True Class: ", truth, ", Predicted: ", predicted, ", Confidence: ", predictionResults.getProb(predicted)

print errors, " errors were made, ", 100.0*errors/dataSet.getSampleSize(), "% error rate"

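When you run this code, you will see that it predicts the correct category with an error rate of about 4%. Note that the classifier is evaluated on the same data it was trained on, so this figure is a training error rather than an estimate of performance on unseen data.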
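To make the idea behind the classifier more concrete, here is a minimal pure-Python sketch of a Gaussian naive Bayes classifier. It is an illustration only and does not show how the JSAT class is implemented: the function names (train_gaussian_nb, log_posteriors, predict), the Gaussian model for each feature, the small variance floor, and the tiny made-up data set are all assumptions introduced for this example.

```python
import math
from collections import defaultdict

def train_gaussian_nb(rows, labels):
    """Estimate the prior and the per-feature mean/variance of each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    total = float(len(rows))
    model = {}
    for label, members in grouped.items():
        n = float(len(members))
        columns = list(zip(*members))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # The 1e-9 floor keeps variances strictly positive (an assumption
        # made here for numerical stability).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / total, means, variances)
    return model

def log_posteriors(model, row):
    """Return the unnormalized log posterior of every class for one row."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Naive assumption: features are independent given the class,
        # so the per-feature Gaussian log-likelihoods simply add up.
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * var) - (x - m) ** 2 / (2.0 * var)
        scores[label] = score
    return scores

def predict(model, row):
    """Pick the class with the largest log posterior."""
    scores = log_posteriors(model, row)
    return max(scores, key=scores.get)

# Tiny made-up data set (not the IRIS data): two features, two classes.
rows = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
        [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
labels = ["a", "a", "a", "b", "b", "b"]

model = train_gaussian_nb(rows, labels)
predicted = predict(model, [1.1, 2.0])
print("Predicted class: %s" % predicted)
```

Working with log probabilities instead of multiplying raw probabilities avoids numerical underflow when there are many features.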

References

[1] Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950)
