Computational statistics

From HandWiki
Short description: Interface between statistics and computer science
Students working in the Statistics Machine Room of the London School of Economics in 1964.

Computational statistics, or statistical computing, is the interface between statistics and computer science. It refers to statistical methods that are made possible by computational methods. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics. The area is developing rapidly, leading to calls for a broader concept of computing to be taught as part of general statistical education.[1]

As in traditional statistics, the goal is to transform raw data into knowledge,[2] but the focus lies on computer-intensive statistical methods, such as cases with very large sample sizes and non-homogeneous data sets.[2]

The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics", and 'computational statistics' as "aiming at the design of algorithm for implementing statistical methods on computers, including the ones unthinkable before the computer age (e.g. bootstrap, simulation), as well as to cope with analytically intractable problems" [sic].[3]

The term 'computational statistics' may also be used to refer to computationally intensive statistical methods, including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.

History

Though computational statistics is widely used today, it has a relatively short history of acceptance in the statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology.[4]

In the statistical field, the first use of the term “computer” appears in an article in the Journal of the American Statistical Association archives by Robert P. Porter in 1891. The article discusses the use of Herman Hollerith’s machine in the 11th Census of the United States.[5] Hollerith’s machine, also called the tabulating machine, was an electromechanical machine designed to assist in summarizing information stored on punched cards. It was invented by Herman Hollerith (February 29, 1860 – November 17, 1929), an American businessman, inventor, and statistician. His punched card tabulating machine was patented in 1884 and was later used in the 1890 Census of the United States. The advantages of the technology were immediately apparent: the 1880 Census, covering about 50 million people, took over 7 years to tabulate, whereas the 1890 Census, covering over 62 million people, took less than a year. This marks the beginning of the era of mechanized computational statistics and semiautomatic data processing systems.

In 1908, William Sealy Gosset performed his now well-known Monte Carlo simulation, which led to the discovery of the Student’s t-distribution.[6] With the help of computational methods, he also produced plots of the empirical distributions overlaid on the corresponding theoretical distributions. The computer has revolutionized simulation and has made the replication of Gosset’s experiment little more than an exercise.[7][8]
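As a rough illustration of how small such a replication has become, the sketch below simulates the t-statistic for repeated small samples and compares a tail frequency with the theoretical value. It is not a reconstruction of Gosset's actual procedure or data: the standard normal population, the sample size of 4, the number of replications and the function name `simulate_t_statistics` are all assumptions chosen for the example.

```python
import math
import random
import statistics


def simulate_t_statistics(n=4, replications=20000, seed=0):
    """Draw many samples of size n from N(0, 1) and return their t-statistics."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = statistics.fmean(sample)
        sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
        t_values.append(mean / (sd / math.sqrt(n)))
    return t_values


t_values = simulate_t_statistics()

# 3.182 is the two-sided 5% critical value of Student's t with 3 degrees of freedom,
# so roughly 5% of the simulated statistics should exceed it in absolute value.
tail_fraction = sum(abs(t) > 3.182 for t in t_values) / len(t_values)
print(f"fraction with |t| > 3.182: {tail_fraction:.4f}")
```

If the statistics followed a standard normal distribution instead, far fewer than 5% of them would exceed 3.182 in absolute value, which is the heavy-tail behaviour the t-distribution describes.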

Later, scientists put forward computational ways of generating pseudo-random deviates, developed methods to convert uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance-rejection methods, and developed state-space methodology for Markov chain Monte Carlo.[9]
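The two conversion techniques named here can be sketched in a few lines. The example below is only illustrative: the exponential target for inversion, the density 6x(1 - x) on [0, 1] for acceptance-rejection, and the function names are assumptions, not taken from the cited history.

```python
import math
import random


def exponential_by_inversion(rate, rng):
    """Inverse-CDF method: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate."""
    u = rng.random()  # uniform deviate in [0, 1)
    return -math.log(1.0 - u) / rate


def beta22_by_rejection(rng):
    """Acceptance-rejection for the density f(x) = 6 x (1 - x) on [0, 1].

    The proposal is Uniform(0, 1) and the envelope constant is 1.5,
    the maximum of f (attained at x = 0.5).
    """
    while True:
        x = rng.random()  # candidate drawn from the proposal
        u = rng.random()  # uniform deviate used for the accept/reject decision
        if u * 1.5 <= 6.0 * x * (1.0 - x):
            return x


rng = random.Random(42)
exp_draws = [exponential_by_inversion(2.0, rng) for _ in range(50000)]
beta_draws = [beta22_by_rejection(rng) for _ in range(50000)]

exp_mean = sum(exp_draws) / len(exp_draws)     # theoretical mean is 1 / rate = 0.5
beta_mean = sum(beta_draws) / len(beta_draws)  # theoretical mean is 0.5
print(exp_mean, beta_mean)
```

Inversion needs a closed-form inverse of the cumulative distribution function. Acceptance-rejection only needs the density to be evaluable and bounded by a multiple of the proposal density, at the cost of discarding some candidates.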

By the mid-1950s, a great deal of work was being done on testing the generators for randomness. By then, most computers could refer to random number tables. In 1958, John Tukey’s jackknife was developed as a method to reduce the bias of parameter estimates in samples under nonstandard conditions.[10] It requires computers for practical implementation. By this point, computers had made many tedious statistical studies feasible.[11]
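The bias-reduction idea behind the jackknife can be sketched as follows: recompute the estimator with each observation left out in turn, use the leave-one-out values to estimate the bias, and subtract it. The data values and the function name `jackknife_bias_corrected` are made up for illustration. The plug-in variance (dividing by n) is used as the biased estimator because its jackknife correction is known to reproduce the usual unbiased sample variance.

```python
import statistics


def jackknife_bias_corrected(data, estimator):
    """Return the jackknife bias-corrected version of estimator(data)."""
    n = len(data)
    full_estimate = estimator(data)
    # Re-evaluate the estimator n times, leaving out one observation each time.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = statistics.fmean(leave_one_out)
    bias_estimate = (n - 1) * (loo_mean - full_estimate)
    return full_estimate - bias_estimate


data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8]  # hypothetical observations

plug_in = statistics.pvariance(data)  # biased: divides by n
corrected = jackknife_bias_corrected(data, statistics.pvariance)
unbiased = statistics.variance(data)  # divides by n - 1

print(plug_in, corrected, unbiased)
```

Each corrected estimate costs n extra evaluations of the estimator, which is why the method is tedious by hand and natural on a computer.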

Methods

Maximum likelihood estimation

Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. It is achieved by maximizing a likelihood function so that the observed data is most probable under the assumed statistical model.
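A minimal sketch of the idea, assuming an exponential model with unknown rate and simulated data: the log-likelihood is maximized numerically with a golden-section search and compared with the closed-form estimate n / Σx. The model, the search interval and the helper names `exponential_log_likelihood` and `golden_section_maximize` are assumptions made for the example. In practice a general-purpose optimizer would usually be used instead.

```python
import math
import random


def exponential_log_likelihood(rate, data):
    """Log-likelihood of i.i.d. exponential data: n * ln(rate) - rate * sum(x)."""
    return len(data) * math.log(rate) - rate * sum(data)


def golden_section_maximize(f, lo, hi, tol=1e-7):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0


rng = random.Random(1)
true_rate = 1.5  # used only to simulate the data
data = [rng.expovariate(true_rate) for _ in range(5000)]

mle_numeric = golden_section_maximize(
    lambda rate: exponential_log_likelihood(rate, data), 1e-6, 20.0
)
mle_closed_form = len(data) / sum(data)
print(mle_numeric, mle_closed_form)
```

The search relies on the exponential log-likelihood being concave in the rate. For models with several parameters or multiple local maxima, a different optimizer and several starting points would be needed.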

Monte Carlo method

The Monte Carlo method is a statistical method that relies on repeated random sampling to obtain numerical results. The concept is to use randomness to solve problems that might be deterministic in principle. Monte Carlo methods are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. They are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
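As an example of the numerical integration use case, the sketch below estimates the integral of exp(-x²) over [0, 1], a deterministic quantity, by averaging the integrand at uniformly random points. The integrand, the sample size and the function name `monte_carlo_integral` are assumptions chosen for illustration. The exact value is available through the error function and is used only as a check.

```python
import math
import random


def monte_carlo_integral(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform random points."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


rng = random.Random(7)
estimate = monte_carlo_integral(lambda x: math.exp(-x * x), 0.0, 1.0, 100000, rng)

# Exact value for comparison: (sqrt(pi) / 2) * erf(1)
exact = math.sqrt(math.pi) / 2.0 * math.erf(1.0)
print(estimate, exact)
```

The error of such an estimate shrinks at a rate proportional to 1/√n regardless of the dimension of the integral, which is one reason the approach is attractive when other methods are hard to apply.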

Markov chain Monte Carlo

The Markov chain Monte Carlo method creates samples from a continuous random variable with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, such as its expected value or variance. The more steps are included, the more closely the distribution of the sample matches the actual desired distribution.
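One common way to build such a chain is the random-walk Metropolis algorithm, sketched below. It is one specific Markov chain Monte Carlo variant chosen for illustration. The target exp(-x²/2) (a standard normal density without its normalizing constant), the step size, the burn-in length and the function names are likewise assumptions.

```python
import math
import random
import statistics


def unnormalized_density(x):
    """Known function proportional to the target density (standard normal here)."""
    return math.exp(-0.5 * x * x)


def random_walk_metropolis(target, n_steps, step_size, x0, rng):
    """Generate a chain whose stationary distribution is proportional to target."""
    samples = []
    x, fx = x0, target(x0)
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # symmetric random-walk proposal
        f_proposal = target(proposal)
        # Accept with probability min(1, f_proposal / fx); otherwise stay at x.
        if rng.random() * fx < f_proposal:
            x, fx = proposal, f_proposal
        samples.append(x)
    return samples


rng = random.Random(3)
chain = random_walk_metropolis(unnormalized_density, 60000, 2.0, 0.0, rng)
kept = chain[5000:]  # discard early steps as burn-in

mean_estimate = statistics.fmean(kept)          # the target's expected value is 0
variance_estimate = statistics.pvariance(kept)  # the target's variance is 1
print(mean_estimate, variance_estimate)
```

Only ratios of the target function enter the acceptance step, so the normalizing constant of the density is never needed. Successive samples are correlated, so the estimates are less precise than the same number of independent draws would give.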

Applications

Computational statistics journals

Associations

  • International Association for Statistical Computing

See also

References

  1. Nolan, D. & Temple Lang, D. (2010). "Computing in the Statistics Curricula", The American Statistician 64 (2), pp. 97–107.
  2. Wegman, Edward J. (1988). "Computational Statistics: A New Agenda for Statistical Theory and Practice". Journal of the Washington Academy of Sciences 78 (4): 310–322. JSTOR
  3. Lauro, Carlo (1996), "Computational statistics or statistical computing, is that the question?", Computational Statistics & Data Analysis 23 (1): 191–193, doi:10.1016/0167-9473(96)88920-1 
  4. Watnik, Mitchell (2011). "Early Computational Statistics" (in en). Journal of Computational and Graphical Statistics 20 (4): 811–817. doi:10.1198/jcgs.2011.204b. ISSN 1061-8600. http://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.204b. 
  5. Hendrickson, W. A.; Ward, K. B. (1975-10-27). "Atomic models for the polypeptide backbones of myohemerythrin and hemerythrin". Biochemical and Biophysical Research Communications 66 (4): 1349–1356. doi:10.1016/0006-291x(75)90508-2. ISSN 1090-2104. PMID 5. https://pubmed.ncbi.nlm.nih.gov/5. 
  6. Los Alamos science, Number 14. 1986-01-01. http://dx.doi.org/10.2172/6935980. 
  7. Trahan, Travis John (2019-10-03). Recent Advances in Monte Carlo Methods at Los Alamos National Laboratory. http://dx.doi.org/10.2172/1569710. 
  8. Metropolis, Nicholas; Ulam, S. (1949). "The Monte Carlo Method". Journal of the American Statistical Association 44 (247): 335–341. doi:10.1080/01621459.1949.10483310. ISSN 0162-1459. http://dx.doi.org/10.1080/01621459.1949.10483310. 
  9. Robert, Christian; Casella, George (2011-02-01). "A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data". Statistical Science 26 (1). doi:10.1214/10-sts351. ISSN 0883-4237. http://dx.doi.org/10.1214/10-sts351. 
  10. QUENOUILLE, M. H. (1956). "NOTES ON BIAS IN ESTIMATION". Biometrika 43 (3-4): 353–360. doi:10.1093/biomet/43.3-4.353. ISSN 0006-3444. http://dx.doi.org/10.1093/biomet/43.3-4.353. 
  11. Teichroew, Daniel (1965). "A History of Distribution Sampling Prior to the Era of the Computer and its Relevance to Simulation". Journal of the American Statistical Association 60 (309): 27–49. doi:10.1080/01621459.1965.10480773. ISSN 0162-1459. http://dx.doi.org/10.1080/01621459.1965.10480773. 

Further reading

Articles

  • Albert, J.H.; Gentle, J.E. (2004), Albert, James H; Gentle, James E, eds., "Special Section: Teaching Computational Statistics", The American Statistician 58: 1, doi:10.1198/0003130042872 
  • Wilkinson, Leland (2008), "The Future of Statistical Computing (with discussion)", Technometrics 50 (4): 418–435, doi:10.1198/004017008000000460 
