Statistical data type

Short description: Taxonomy of statistical data elements

In statistics, data can be of several types. Statistical data types include categorical (e.g. country), directional (angles or directions, e.g. wind measurements), count (a whole number of events), and real intervals (e.g. measures of temperature).

The data type is a fundamental concept in statistics: it determines which probability distributions can logically describe a variable, which operations on the variable are permissible, which type of regression analysis can be used to predict it, and so on. The concept of data type is similar to that of level of measurement, but more specific. For example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data do, but both fall under the same level of measurement (a ratio scale).

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements have no meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements, but the zero value is arbitrary (as is the case with longitude and with temperature measurements in degrees Celsius or degrees Fahrenheit), and permit any linear transformation (of the form ax + b). Ratio measurements have both a meaningful zero value and meaningful distances between measurements, and permit any rescaling transformation (multiplication by a positive constant).
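The difference between interval and ratio scales can be checked numerically. The following is a minimal sketch, not part of Stevens's work: the helper names and sample values are assumptions chosen for illustration. It shows that a ratio of temperatures changes when converting from degrees Celsius to degrees Fahrenheit, while a ratio of differences does not, and that a ratio of lengths survives a pure rescaling.

```python
def celsius_to_fahrenheit(c):
    """Linear change of units (a*x + b): permitted on an interval scale."""
    return c * 9 / 5 + 32


def metres_to_centimetres(m):
    """Pure rescaling (a*x): permitted on a ratio scale."""
    return m * 100


def value_ratio(x, y):
    """Ratio of two measurements, y / x."""
    return y / x


def difference_ratio(x, y, z):
    """Ratio of consecutive differences, (z - y) / (y - x)."""
    return (z - y) / (y - x)


# Illustrative temperatures (interval scale: the zero point is arbitrary).
temps_c = [10.0, 20.0, 30.0]
temps_f = [celsius_to_fahrenheit(t) for t in temps_c]

ratio_c = value_ratio(temps_c[0], temps_c[1])  # "twice as hot" in Celsius
ratio_f = value_ratio(temps_f[0], temps_f[1])  # no longer 2 in Fahrenheit
diff_c = difference_ratio(*temps_c)            # ratio of differences
diff_f = difference_ratio(*temps_f)            # unchanged by the conversion

# Illustrative lengths (ratio scale: zero means "no length").
lengths_m = [1.5, 3.0]
lengths_cm = [metres_to_centimetres(v) for v in lengths_m]
len_ratio_m = value_ratio(*lengths_m)
len_ratio_cm = value_ratio(*lengths_cm)        # same ratio in both units
```

Here `ratio_c` and `ratio_f` disagree, which is why statements such as "twice as warm" are not meaningful on an interval scale, whereas `diff_c` equals `diff_f` and `len_ratio_m` equals `len_ratio_cm`.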

Because variables conforming only to nominal or ordinal measurements cannot reasonably be measured numerically, they are sometimes grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which, being numerical, can be either discrete or continuous. Such distinctions often correlate loosely with data types in computer science: dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in an integer data type, and continuous variables with a real data type involving floating-point computation. The mapping of computer science data types to statistical data types, however, depends on which categorization of the latter is being implemented.

Other categorizations have been proposed. For example, Mosteller and Tukey (1977)^[1] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990)^[2] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998)^[3] and van den Berg (1991).^[4]

Whether it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).^[5]
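Hand's point about truth values can be made concrete with a small sketch. The two groups below are made-up numbers, and `first_is_larger` is a hypothetical helper; neither comes from the cited source. The statement "group A has the larger mean" is true on the raw values but false after a logarithmic (order-preserving) transformation, whereas the corresponding statement about medians keeps the same truth value.

```python
import math
from statistics import mean, median

# Made-up illustrative data: group A has one very large value.
group_a = [1.0, 1.0, 100.0]
group_b = [20.0, 20.0, 20.0]


def first_is_larger(stat, xs, ys, transform=lambda v: v):
    """Return True if stat(transformed xs) > stat(transformed ys)."""
    return stat([transform(v) for v in xs]) > stat([transform(v) for v in ys])


# Comparison of means: the truth value depends on the transformation.
mean_raw = first_is_larger(mean, group_a, group_b)
mean_log = first_is_larger(mean, group_a, group_b, math.log)

# Comparison of medians: invariant under any order-preserving transformation.
median_raw = first_is_larger(median, group_a, group_b)
median_log = first_is_larger(median, group_a, group_b, math.log)
```

With these values `mean_raw` and `mean_log` differ, while `median_raw` and `median_log` agree. Which comparison is the sensible one depends, as the quotation says, on the question being asked.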

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logically possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.

Data type	Possible values	Example usage	Level of measurement	Common distributions	Scale of relative differences	Permissible statistics	Common model
binary	0, 1 (arbitrary labels)	binary outcome ("yes/no", "true/false", "success/failure", etc.)	nominal scale	Bernoulli	incomparable	mode, chi-squared	logistic, probit
categorical	"name1", "name2", "name3", ... "nameK" (arbitrary labels)	categorical outcome with names or places like "Rome", "Amsterdam", "Madrid", "London", "Washington" (specific blood type, political party, word, etc.)	nominal scale	categorical	incomparable	mode, chi-squared	multinomial logit, multinomial probit
ordinal	ordering categories or integer or real number (arbitrary scale)	ordered categories like "Small", "Medium", "Large"; relative score, significant only for creating a ranking	ordinal scale	categorical	relative comparison		ordinal regression (ordered logit, ordered probit)
binomial	0, 1, ..., N	number of successes (e.g. yes votes) out of N possible	interval scale	binomial, beta-binomial	additive	mean, median, mode, standard deviation, correlation	binomial regression (logistic, probit)
count	nonnegative integers (0, 1, ...)	number of items (telephone calls, people, molecules, births, deaths, etc.) in given interval/area/volume	ratio scale	Poisson, negative binomial	multiplicative	All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation	Poisson, negative binomial regression
real-valued additive	real number	temperature in degrees Celsius or degrees Fahrenheit, relative distance, location parameter, etc. (or approximately, anything not varying over a large scale)	interval scale	normal, etc. (usually symmetric about the mean)	additive	mean, median, mode, standard deviation, correlation	standard linear regression
real-valued multiplicative	positive real number	temperature in kelvin, price, income, size, scale parameter, etc. (especially when varying over a large scale)	ratio scale	log-normal, gamma, exponential, etc. (usually a skewed distribution)	multiplicative	All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of variation	generalized linear model with logarithmic link
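The "possible values" column suggests a rough way to guess a simple data type from raw observations. The sketch below is only a heuristic under stated assumptions: the function name `guess_simple_type` and its return labels are chosen for this illustration, and the true type depends on how a variable was measured, which the values alone cannot reveal (for instance, binomial and count data, or ordinal scores and counts, look identical here).

```python
from numbers import Real


def guess_simple_type(values):
    """Heuristically guess a simple statistical data type from raw values.

    Assumption: only the observed values are inspected; the measurement
    procedure, which really determines the type, is unknown to this code.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    # Non-numeric values (e.g. strings) are treated as arbitrary labels.
    if not all(isinstance(v, Real) for v in values):
        return "categorical"
    # Only the values 0 and 1 occur: treat as a binary outcome.
    if set(values) <= {0, 1}:
        return "binary"
    # Non-negative whole numbers: treat as a count.
    if all(v >= 0 and float(v).is_integer() for v in values):
        return "count"
    # Strictly positive reals: candidates for a multiplicative (ratio) scale.
    if all(v > 0 for v in values):
        return "real-valued multiplicative"
    # Anything else: reals on an additive (interval) scale.
    return "real-valued additive"
```

For example, `guess_simple_type(["Rome", "Madrid"])` returns `"categorical"` and `guess_simple_type([0, 3, 7])` returns `"count"`. A guess of this kind is at most a starting point; temperatures in degrees Celsius that happen to be positive would be labelled multiplicative even though they lie on an interval scale.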

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

Random vectors. The individual elements may or may not be correlated. Examples of distributions used to describe correlated random vectors are the multivariate normal distribution and multivariate t-distribution. In general, there may be arbitrary correlations between any elements and any others; however, this often becomes unmanageable above a certain size, requiring further restrictions on the correlated elements.
Random matrices. Random matrices can be laid out linearly and treated as random vectors; however, this may not be an efficient way of representing the correlations between different elements. Some probability distributions are specifically designed for random matrices, e.g. the matrix normal distribution and Wishart distribution.
Random sequences. These are sometimes considered to be the same as random vectors, but in other cases the term is applied specifically to cases where each random variable is only correlated with nearby variables (as in a Markov model). This is a particular case of a Bayes network and is often used for very long sequences, e.g. gene sequences or lengthy text documents. A number of models are specifically designed for such sequences, e.g. hidden Markov models.
Random processes. These are similar to random sequences, except that the length of the sequence is indefinite or infinite and the elements are processed one by one. They are often used for data that can be described as a time series, e.g. the price of a stock on successive days. Random processes are also used to model values that vary continuously (e.g. the temperature at successive moments in time), rather than at discrete intervals.
Bayes networks. These correspond to aggregates of random variables described using graphical models, where individual random variables are linked in a graph structure with conditional distributions relating variables to nearby variables.
- Multilevel models are subclasses of Bayes networks that can be thought of as having multiple levels of linear regression.
- Random trees. These are a subclass of Bayes networks in which the variables are linked in a tree structure. An example is the problem of parsing a sentence when statistical parsing techniques are used, such as probabilistic context-free grammars (PCFGs).
Random fields. These represent the extension of random processes to multiple dimensions, and are common in physics, where they are used in statistical mechanics to describe properties such as force or electric field that can vary continuously over three dimensions (or four dimensions, when time is included).

These concepts originate in various scientific fields and frequently overlap in usage. As a result, several of them can often be applied to the same problem.
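As an illustration of a random sequence in which each variable depends only on its predecessor, the following minimal sketch samples a first-order Markov chain. The function name, the dictionary layout of the transition probabilities, and the two-state weather example are assumptions made for this illustration.

```python
import random


def sample_markov_chain(transitions, start, length, rng=None):
    """Sample a first-order Markov chain of the given length.

    transitions maps each state to a dict {next_state: probability}.
    Each new element depends only on the element just before it.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    rng = rng or random.Random()
    state = start
    sequence = [state]
    for _ in range(length - 1):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        # Draw the next state conditional on the current state only.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        sequence.append(state)
    return sequence


# Hypothetical two-state weather model.
weather = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
week = sample_markov_chain(weather, "sunny", 7, rng=random.Random(0))
```

Laid out as a list, `week` could also be treated as a random vector of seven categorical variables; the Markov structure is what restricts the correlations to neighbouring elements.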

Comparison to programming data types

Most data types in statistics have comparable types in computer programming, and vice versa, as shown in the following table:

Statistics	Programming
real-valued (interval scale)	floating-point
real-valued (ratio scale)	floating-point
count data (usually non-negative)	integer
binary data	Boolean
categorical data	enumerated type
random vector	list or array
random matrix	two-dimensional array
random tree	tree
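The correspondence in the table can be sketched in code. The record below is a hypothetical example in Python; the class and field names are assumptions chosen for illustration. Note that a programming type is usually broader than the statistical type it stands for: an integer field accepts negative numbers, so the non-negativity of a count has to be checked separately.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class BloodType(Enum):
    """Categorical data: an enumerated type with arbitrary labels."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


@dataclass
class Observation:
    temperature_celsius: float  # real-valued (interval scale): floating-point
    income: float               # real-valued (ratio scale): floating-point
    phone_calls: int            # count data: integer
    voted_yes: bool             # binary data: Boolean
    blood_type: BloodType       # categorical data: enumerated type
    coordinates: List[float]    # random vector: list or array

    def __post_init__(self):
        # The int type alone does not enforce that counts are non-negative.
        if self.phone_calls < 0:
            raise ValueError("count data must be non-negative")


obs = Observation(21.5, 42000.0, 3, True, BloodType.O, [0.1, 0.2])
```

Dataclass annotations are not enforced at run time, so this sketch documents the intended mapping rather than guaranteeing it.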

References

[1] Mosteller, F.; Tukey, J.W. (1977). Data analysis and regression. Addison-Wesley. ISBN 978-0-201-04854-4.
[2] Nelder, J.A. (1990). "The knowledge needed to computerise the analysis and interpretation of statistical information". Expert systems and artificial intelligence: the need for information about data. London: Library Association. OCLC 27042489.
[3] Chrisman, Nicholas R. (1998). "Rethinking Levels of Measurement for Cartography". Cartography and Geographic Information Science 25 (4): 231–242. doi:10.1559/152304098782383043. Bibcode: 1998CGISy..25..231C.
[4] van den Berg, G. (1991). Choosing an analysis method. Leiden: DSWO Press. ISBN 978-90-6695-062-7.
[5] Hand, D.J. (2004). Measurement theory and practice: The world through quantification. Wiley. p. 82. ISBN 978-0-470-68567-9.

Original source: https://en.wikipedia.org/wiki/Statistical data type.
