Jenks natural breaks optimization

From HandWiki
Short description: Data clustering algorithm

The Jenks optimization method, also called the Jenks natural breaks classification method, is a data clustering method designed to determine the best arrangement of values into different classes. This is done by seeking to minimize each class's average deviation from the class mean, while maximizing each class's deviation from the means of the other classes. In other words, the method seeks to reduce the variance within classes and maximize the variance between classes.[1][2]

The Jenks optimization method is directly related to Otsu's Method and Fisher's Discriminant Analysis.

History

George Frederick Jenks

George Frederick Jenks was a 20th-century American cartographer. After earning his Ph.D. in agricultural geography from Syracuse University in 1947, Jenks began his career under the tutelage of Richard Harrison, cartographer for Time and Fortune magazines.[3] He joined the faculty of the University of Kansas in 1949 and began to build the cartography program. During his 37-year tenure at KU, Jenks developed the cartography program into one of three programs renowned for their graduate education in the field, the others being those at the University of Wisconsin and the University of Washington. Much of his time was spent developing and promoting improved cartographic training techniques and programs. He also spent significant time investigating three-dimensional maps, eye-movement research, thematic map communication, and geostatistics.[2][3][4]

Background and development

Jenks was a cartographer by profession. His work with statistics grew out of a desire to make choropleth maps more visually accurate for the viewer. In his paper, The Data Model Concept in Statistical Mapping, he claims that by visualizing data in a three-dimensional model cartographers could devise a “systematic and rational method for preparing choroplethic maps”.[1] Jenks used the analogy of a “blanket of error” to describe the need to use elements other than the mean to generalize data. The three-dimensional models were created to help Jenks visualize the difference between data classes. His aim was to generalize the data using as few planes as possible while maintaining a constant “blanket of error”.

Description of method

The method requires an iterative process. That is, calculations must be repeated using different breaks in the dataset to determine which set of breaks has the smallest within-class variance. The process starts by dividing the ordered data into classes in some way, which may be arbitrary. Two steps are then repeated:

  1. Calculate the sum of squared deviations from the class means (SDCM).
  2. Choose a new way of dividing the data into classes, perhaps by moving one or more data points from one class to a different one.

New class deviations are then calculated, and the process is repeated until the sum of the within-class deviations (SDCM) reaches a minimum.[1][5]
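The two repeated steps can be illustrated with a minimal Python sketch. It is not Jenks' original procedure: it assumes a simple local search that starts from classes of roughly equal size and moves one class boundary by one data point at a time, keeping a move only when SDCM decreases. The function names `sdcm`, `split`, and `iterative_breaks` are hypothetical, and a search of this kind can stop at a local rather than a global minimum.

```python
def sdcm(classes):
    """Sum of squared deviations from the class means (SDCM)."""
    total = 0.0
    for c in classes:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total


def split(values, cuts):
    """Split a sorted list into classes at the given cut indices."""
    bounds = [0, *cuts, len(values)]
    return [values[a:b] for a, b in zip(bounds, bounds[1:])]


def iterative_breaks(data, n_classes):
    """Local search over class boundaries (illustrative assumption)."""
    values = sorted(data)
    n = len(values)
    # Arbitrary starting point: classes of roughly equal size.
    cuts = [i * n // n_classes for i in range(1, n_classes)]
    best = sdcm(split(values, cuts))
    improved = True
    while improved:
        improved = False
        for i in range(len(cuts)):
            for step in (-1, 1):
                # Step 2: move one data point across boundary i.
                trial = cuts.copy()
                trial[i] += step
                lo = trial[i - 1] if i > 0 else 0
                hi = trial[i + 1] if i + 1 < len(trial) else n
                if not lo < trial[i] < hi:
                    continue  # the move would leave a class empty
                # Step 1: recompute SDCM for the new division.
                score = sdcm(split(values, trial))
                if score < best:
                    cuts, best, improved = trial, score, True
    return split(values, cuts), best


classes, score = iterative_breaks([1, 2, 3, 4, 10, 11, 20, 21, 22], 3)
print(classes, score)
```

For this small data set the equal-size starting division places the value 4 in the middle class, and a single boundary move shifts it into the lowest class, after which no further move lowers SDCM.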

Alternatively, all break combinations may be examined, SDCM calculated for each combination, and the combination with the lowest SDCM selected. Since all break combinations are examined, this guarantees that the one with the lowest SDCM is found.
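A brute-force version of this exhaustive approach might look like the following Python sketch. The function names are hypothetical, and the code is practical only for small inputs: for n sorted values and k classes it evaluates every way of choosing k - 1 cut positions among the n - 1 gaps, a number that grows quickly with n and k.

```python
from itertools import combinations


def sdcm(classes):
    """Sum of squared deviations from the class means (SDCM)."""
    total = 0.0
    for c in classes:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total


def exhaustive_breaks(data, n_classes):
    """Evaluate every break combination and keep the lowest SDCM."""
    values = sorted(data)
    n = len(values)
    best_score, best_classes = float("inf"), None
    # Each combination of cut indices defines one division into classes.
    for cuts in combinations(range(1, n), n_classes - 1):
        bounds = (0, *cuts, n)
        classes = [values[a:b] for a, b in zip(bounds, bounds[1:])]
        score = sdcm(classes)
        if score < best_score:
            best_score, best_classes = score, classes
    return best_classes, best_score


classes, score = exhaustive_breaks([4, 1, 22, 10, 2, 21, 3, 11, 20], 3)
print(classes, score)
```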

Finally, the sum of squared deviations from the mean of the complete data set (SDAM) and the goodness of variance fit (GVF) may be calculated. GVF is defined as (SDAM - SDCM) / SDAM. GVF ranges from 0 (worst fit) to 1 (perfect fit).
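As a worked illustration of the GVF definition, the short Python sketch below computes SDAM, SDCM, and GVF for a given classification. The helper names `sum_sq_dev` and `gvf` are hypothetical, and the sketch assumes the data are not all identical, since SDAM would then be zero and the ratio undefined.

```python
def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def gvf(classes):
    """Goodness of variance fit: (SDAM - SDCM) / SDAM."""
    all_values = [x for c in classes for x in c]
    sdam = sum_sq_dev(all_values)               # deviations from the overall mean
    sdcm = sum(sum_sq_dev(c) for c in classes)  # deviations from the class means
    # Assumes sdam > 0, i.e. the data are not all identical.
    return (sdam - sdcm) / sdam


print(gvf([[1, 2, 3, 4], [10, 11], [20, 21, 22]]))  # tight classes, close to 1
print(gvf([[1, 2, 3]]))                             # a single class gives 0
```

The two extremes follow directly from the definition: with a single class SDCM equals SDAM, so GVF is 0, and when every value forms its own class SDCM is 0, so GVF is 1.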

Use in cartography

Main page: Choropleth map

Jenks’ goal in developing this method was to create a map that was absolutely accurate in its representation of the data's spatial attributes. By following this process, Jenks claims, the “blanket of error” can be uniformly distributed across the mapped surface. He developed the method with the intention of using relatively few data classes, fewer than seven, because that was the limit when using monochromatic shading on a choroplethic map.[1]

A choropleth map using the Jenks classification.

The Jenks classification method is commonly used in thematic maps, especially choropleth maps, as one of several available classification methods. When making choropleth maps, the method can be advantageous because it identifies clusters in the data values where they exist. In current versions of ArcGIS software from Esri, Jenks is the default classification method. However, the Jenks classification is not recommended for data that have a low variance. The "natural breaks" identified by the iterative process are used to provide a more meaningful visualization of map data.

Alternative methods

Main page: Cluster analysis

Other methods of data classification include Head/tail Breaks, Natural Breaks (without Jenks Optimization), Equal Interval, Quantile, and Standard Deviation.

Further reading

  • J. A. Hartigan: Clustering Algorithms, John Wiley & Sons, Inc., 1975

See also

  • k-means clustering, a generalization for multivariate data (Jenks natural breaks optimization seems to be one-dimensional k-means[6]).

References

  1. Jenks, George F. 1967. "The Data Model Concept in Statistical Mapping", International Yearbook of Cartography 7: 186–190.
  2. McMaster, Robert. "In Memoriam: George F. Jenks (1916–1996)", Cartography and Geographic Information Science 24(1): 56–59.
  3. McMaster, Robert and McMaster, Susanna. 2002. "A History of Twentieth-Century American Academic Cartography", Cartography and Geographic Information Science 29(3): 312–315.
  4. CSUN Cartography Specialty Group, Winter 1997 Newsletter.
  5. ESRI FAQ, What is the Jenks Optimization method.
  6. "Chapter 9". http://www.quantdec.com/SYSEN597/GTKAV/section1/chapter_9.htm.
