Score (statistics)

From HandWiki
Short description: Gradient of the log-likelihood function

In statistics, the score (or informant[1]) is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular point of the parameter vector, the score indicates the steepness of the log-likelihood function and thereby its sensitivity to infinitesimal changes in the parameter values. If the log-likelihood function is continuously differentiable over the parameter space, the score vanishes at any interior local maximum or minimum; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function.

Since the score is a function of the observations, which are subject to sampling error, it lends itself to a test statistic known as the score test, in which the parameter is held at a particular value. Further, the ratio of two likelihood functions evaluated at two distinct parameter values can be understood as a definite integral of the score function.[2]

Definition

The score is the gradient (the vector of partial derivatives) of [math]\displaystyle{ \log \mathcal{L}(\theta) }[/math], the natural logarithm of the likelihood function, with respect to an m-dimensional parameter vector [math]\displaystyle{ \theta }[/math].

[math]\displaystyle{ s(\theta) \equiv \frac{\partial \log \mathcal{L}(\theta)}{\partial \theta} }[/math]

This differentiation yields a [math]\displaystyle{ (1 \times m) }[/math] row vector, and indicates the sensitivity of the likelihood (its derivative normalized by its value).
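
As an illustration that is not taken from the cited sources, assume an independent sample [math]\displaystyle{ x_{1}, \ldots, x_{T} }[/math] from a normal distribution with parameter vector [math]\displaystyle{ \theta = (\mu, \sigma^{2}) }[/math], so that m = 2. The log-likelihood is

[math]\displaystyle{ \log \mathcal{L}(\mu, \sigma^{2}) = -\frac{T}{2} \log(2 \pi \sigma^{2}) - \frac{1}{2 \sigma^{2}} \sum_{t=1}^{T} (x_{t} - \mu)^{2} }[/math]

and differentiating with respect to each parameter gives the score

[math]\displaystyle{ s(\theta) = \left( \frac{1}{\sigma^{2}} \sum_{t=1}^{T} (x_{t} - \mu), \;\; -\frac{T}{2 \sigma^{2}} + \frac{1}{2 \sigma^{4}} \sum_{t=1}^{T} (x_{t} - \mu)^{2} \right) }[/math]

Setting both components to zero recovers the sample mean and the (uncorrected) sample variance as the maximum likelihood estimates.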

In older literature,[citation needed] "linear score" may refer to the score with respect to infinitesimal translation of a given density. This convention arises from a time when the primary parameter of interest was the mean or median of a distribution. In this case, the likelihood of an observation is given by a density of the form [math]\displaystyle{ \mathcal L(\theta;X)=f(X+\theta) }[/math]. The "linear score" is then defined as

[math]\displaystyle{ s_{\rm linear} = \frac{\partial}{\partial X} \log f(X) }[/math]

Properties

Mean

While the score is a function of [math]\displaystyle{ \theta }[/math], it also depends on the observations [math]\displaystyle{ \mathbf{x} = (x_{1}, x_{2}, \ldots x_{T}) }[/math] at which the likelihood function is evaluated, and in view of the random character of sampling one may take its expected value over the sample space. Under certain regularity conditions on the density functions of the random variables,[3][4] the expected value of the score, evaluated at the true parameter value [math]\displaystyle{ \theta }[/math], is zero. To see this, rewrite the likelihood function [math]\displaystyle{ \mathcal L }[/math] as a probability density function [math]\displaystyle{ \mathcal L(\theta; x) = f(x; \theta) }[/math], and denote the sample space [math]\displaystyle{ \mathcal{X} }[/math]. Then:

[math]\displaystyle{ \begin{align} \operatorname{E}(s\mid\theta) & =\int_{\mathcal{X}} f(x; \theta) \frac{\partial}{\partial\theta} \log \mathcal L(\theta;x) \,dx \\[6pt] & = \int_{\mathcal{X}} f(x; \theta) \frac{1}{f(x; \theta)}\frac{\partial f(x; \theta)}{\partial \theta}\, dx =\int_{\mathcal{X}} \frac{\partial f(x; \theta)}{\partial \theta} \, dx \end{align} }[/math]

The assumed regularity conditions allow the interchange of derivative and integral (see Leibniz integral rule), hence the above expression may be rewritten as

[math]\displaystyle{ \frac{\partial}{\partial\theta} \int_{\mathcal{X}} f(x; \theta) \, dx = \frac{\partial}{\partial\theta}1 = 0. }[/math]

It is worth restating the above result in words: the expected value of the score, evaluated at the true parameter value [math]\displaystyle{ \theta }[/math], is zero. Thus, if one were to repeatedly sample from some distribution and repeatedly calculate the score at the true parameter value, the mean of the scores would tend to zero as the number of samples grows.
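
The zero-mean property can be checked by simulation. The following is a minimal sketch, assuming an exponential model with rate parameter 2 (a choice made here purely for illustration; the function name, seed and sample size are likewise assumptions). For that model the score of a single observation x is 1/rate − x.

```python
import random


def exponential_score(x, rate):
    """Score of one Exponential(rate) observation.

    log f(x; rate) = log(rate) - rate * x, so d/d(rate) = 1/rate - x.
    """
    return 1.0 / rate - x


rng = random.Random(0)   # fixed seed so the run is reproducible
true_rate = 2.0          # hypothetical true parameter value
draws = [rng.expovariate(true_rate) for _ in range(200_000)]

# Average score evaluated at the true parameter: should be near zero.
mean_score_at_truth = sum(exponential_score(x, true_rate) for x in draws) / len(draws)

# Average score evaluated at a wrong parameter value: near 1/3 - 1/2, not zero.
mean_score_elsewhere = sum(exponential_score(x, 3.0) for x in draws) / len(draws)

print(mean_score_at_truth)
print(mean_score_elsewhere)
```

The second average illustrates that the result holds only when the score is evaluated at the parameter value that generated the data.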

Variance

Main page: Fisher information

Because the score has mean zero at the true parameter value, its variance is [math]\displaystyle{ \operatorname{Var}(s(\theta)) = \operatorname{E}(s(\theta) s(\theta)^{\mathsf{T}}) }[/math], which can be derived by differentiating the above expression for the expected value.

[math]\displaystyle{ \begin{align} 0 & =\frac{\partial}{\partial \theta^{\mathsf{T}}} \operatorname{E}(s\mid\theta) \\[6pt] & =\frac{\partial}{\partial \theta^{\mathsf{T}}} \int_{\mathcal{X}} \frac{\partial \log \mathcal L(\theta;X)}{\partial\theta} f(x; \theta) \,dx \\[6pt] & = \int_{\mathcal{X}} \frac{\partial}{\partial \theta^{\mathsf{T}}} \left\{ \frac{\partial \log \mathcal L(\theta;X)}{\partial\theta} f(x; \theta) \right\} \,dx \\[6pt] & = \int_{\mathcal{X}} \left\{ \frac{\partial^{2} \log \mathcal{L}(\theta;X)}{\partial \theta \partial \theta^\mathsf{T}} f(x; \theta) + \frac{\partial \log \mathcal{L}(\theta;X)}{\partial \theta} \frac{\partial f(x; \theta)}{\partial \theta^\mathsf{T} } \right\} \,dx \\[6pt] & = \int_{\mathcal{X}} \frac{\partial^{2} \log \mathcal{L}(\theta;X)}{\partial \theta \partial \theta^\mathsf{T}} f(x; \theta) \,dx + \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta;X)}{\partial \theta} \frac{\partial f(x; \theta)}{\partial \theta^\mathsf{T} } \,dx \\[6pt] & = \int_{\mathcal{X}} \frac{\partial^{2} \log \mathcal{L}(\theta;X)}{\partial \theta \partial \theta^\mathsf{T}} f(x; \theta) \,dx + \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta;X)}{\partial \theta} \frac{\partial \log \mathcal{L}(\theta;X)}{\partial \theta^\mathsf{T} } f(x; \theta) \,dx \\[6pt] & = \operatorname{E}\left( \frac{\partial^{2} \log \mathcal{L}(\theta;X)}{\partial \theta \partial \theta^\mathsf{T}} \right) + \operatorname{E}\left( \frac{\partial \log \mathcal{L}(\theta;X)}{\partial \theta} \left[ \frac{\partial \log \mathcal{L}(\theta;X)}{\partial \theta} \right]^\mathsf{T} \right) \end{align} }[/math]

Hence the variance of the score is equal to the negative expected value of the Hessian matrix of the log-likelihood.[5]

[math]\displaystyle{ \operatorname{E}(s(\theta) s(\theta)^{\mathsf{T}}) = - \operatorname{E}\left( \frac{\partial^{2} \log \mathcal{L}}{\partial \theta \partial \theta^{\mathsf{T}} } \right) }[/math]

The latter is known as the Fisher information and is written [math]\displaystyle{ \mathcal{I}(\theta) }[/math]. Note that the Fisher information is not a function of any particular observation, as the random variable [math]\displaystyle{ X }[/math] has been averaged out. This concept of information is useful when comparing two methods of observation of some random process.
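
The identity between the variance of the score and the negative expected Hessian can also be checked numerically. The sketch below assumes a normal model with known mean and unknown variance v (an illustrative choice, as are the seed and sample size), for which the Fisher information of one observation has the closed form 1/(2v²).

```python
import random

mu, v = 0.0, 2.0   # hypothetical known mean and true variance; theta = v
rng = random.Random(1)
draws = [rng.gauss(mu, v ** 0.5) for _ in range(200_000)]


def score(x):
    # d/dv of log f(x; v) = -1/(2v) + (x - mu)^2 / (2 v^2)
    return -1.0 / (2 * v) + (x - mu) ** 2 / (2 * v ** 2)


def hessian(x):
    # d^2/dv^2 of log f(x; v) = 1/(2 v^2) - (x - mu)^2 / v^3
    return 1.0 / (2 * v ** 2) - (x - mu) ** 2 / v ** 3


n = len(draws)
mean_squared_score = sum(score(x) ** 2 for x in draws) / n   # estimates E(s^2)
neg_mean_hessian = -sum(hessian(x) for x in draws) / n       # estimates -E(Hessian)
fisher_information = 1.0 / (2 * v ** 2)                      # closed form for this model

print(mean_squared_score, neg_mean_hessian, fisher_information)
```

Both Monte Carlo averages should agree with the closed-form value up to sampling error.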

Examples

Bernoulli process

Consider observing the first n trials of a Bernoulli process, and seeing that A of them are successes and the remaining B are failures, where the probability of success is θ.

Then the likelihood [math]\displaystyle{ \mathcal L }[/math] is

[math]\displaystyle{ \mathcal L(\theta;A,B)=\frac{(A+B)!}{A!B!}\theta^A(1-\theta)^B, }[/math]

so the score s is

[math]\displaystyle{ s=\frac{1}{\mathcal L}\frac{\partial \mathcal L}{\partial\theta} = \frac{A}{\theta}-\frac{B}{1-\theta}. }[/math]

We can now verify that the expectation of the score is zero. Noting that the expectation of A is nθ and the expectation of B is n(1 − θ) [recall that A and B are random variables], we can see that the expectation of s is

[math]\displaystyle{ E(s) = \frac{n\theta}{\theta} - \frac{n(1-\theta)}{1-\theta} = n - n = 0. }[/math]

We can also check the variance of [math]\displaystyle{ s }[/math]. We know that A + B = n (so B = n − A) and that the variance of A is nθ(1 − θ), so the variance of s is

[math]\displaystyle{ \begin{align} \operatorname{var}(s) & =\operatorname{var}\left(\frac{A}{\theta}-\frac{n-A}{1-\theta}\right) =\operatorname{var}\left(A\left(\frac{1}{\theta}+\frac{1}{1-\theta}\right)\right) \\ & =\left(\frac{1}{\theta}+\frac{1}{1-\theta}\right)^2\operatorname{var}(A) =\frac{n}{\theta(1-\theta)}. \end{align} }[/math]
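
Both results can be confirmed exactly, without simulation, by summing over the binomial distribution of A. This is a small sketch; the values n = 12 and θ = 0.3 and the helper names are arbitrary choices made for illustration.

```python
from math import comb


def binomial_pmf(a, n, theta):
    """Probability of observing a successes in n Bernoulli(theta) trials."""
    return comb(n, a) * theta ** a * (1 - theta) ** (n - a)


def score(a, n, theta):
    """Score A/theta - B/(1 - theta) with B = n - A."""
    return a / theta - (n - a) / (1 - theta)


n, theta = 12, 0.3

# Exact expectation of the score over all possible values of A.
expected_score = sum(binomial_pmf(a, n, theta) * score(a, n, theta)
                     for a in range(n + 1))

# Exact variance of the score: E(s^2) - E(s)^2.
variance_score = sum(binomial_pmf(a, n, theta) * score(a, n, theta) ** 2
                     for a in range(n + 1)) - expected_score ** 2

print(expected_score)                         # zero up to rounding error
print(variance_score, n / (theta * (1 - theta)))
```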

Binary outcome model

For models with binary outcomes (Y = 1 or 0), the model can be scored with the logarithm of predictions

[math]\displaystyle{ S = Y \log( p ) + ( 1 - Y ) ( \log( 1 - p ) ) }[/math]

where p is the probability in the model to be estimated and S is the score.[6]
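
A minimal sketch of this formula follows; the function name is hypothetical, and the predicted probability p is assumed to lie strictly between 0 and 1 so that both logarithms are defined.

```python
from math import log


def log_score(y, p):
    """Logarithmic score S = Y log(p) + (1 - Y) log(1 - p) for one outcome."""
    return y * log(p) + (1 - y) * log(1 - p)


# A confident correct prediction scores close to zero ...
confident_right = log_score(1, 0.9)
# ... while a confident wrong prediction is penalized heavily.
confident_wrong = log_score(1, 0.1)

print(confident_right, confident_wrong)
```

Scores are never positive, and values closer to zero indicate predictions that assigned higher probability to the observed outcome.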

Applications

Scoring algorithm

Main page: Scoring algorithm

The scoring algorithm is an iterative method for numerically determining the maximum likelihood estimator.
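
The text above does not spell out the update rule. The sketch below assumes the usual Fisher scoring step, in which the current estimate is moved by the score divided by the Fisher information, and applies it to the Bernoulli example with the log-odds η = log(θ/(1 − θ)) as the parameter. The function name, starting value and tolerance are illustrative assumptions, and the number of successes is assumed to be strictly between 0 and n.

```python
from math import exp


def fisher_scoring_bernoulli(successes, n, eta=0.0, tol=1e-12, max_iter=50):
    """Estimate the success probability by Fisher scoring on the log-odds eta."""
    for _ in range(max_iter):
        theta = 1.0 / (1.0 + exp(-eta))
        score = successes - n * theta            # d log L / d eta
        information = n * theta * (1.0 - theta)  # Fisher information for eta
        step = score / information
        eta += step                              # scoring update
        if abs(step) < tol:
            break
    return 1.0 / (1.0 + exp(-eta))


theta_hat = fisher_scoring_bernoulli(successes=7, n=10)
print(theta_hat)   # approaches the closed-form estimate 7/10
```

In this simple model the maximum likelihood estimate is available in closed form, so the iteration serves only to show the mechanics; the method is of practical interest when no closed form exists.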

Score test

Main page: Score test

Note that [math]\displaystyle{ s }[/math] is a function of [math]\displaystyle{ \theta }[/math] and the observation [math]\displaystyle{ \mathbf{x} = (x_{1}, x_{2}, \ldots, x_{T}) }[/math], so that, in general, it is not a statistic. However, in certain applications, such as the score test, the score is evaluated at a specific value of [math]\displaystyle{ \theta }[/math] (such as a null-hypothesis value), in which case the result is a statistic. Intuitively, if the restricted estimator is near the maximum of the likelihood function, the score should not differ from zero by more than sampling error. In 1948, C. R. Rao first proved that the square of the score divided by the information matrix follows an asymptotic [math]\displaystyle{ \chi^{2} }[/math]-distribution under the null hypothesis.[7]
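
For a single parameter, the statistic described above reduces to the squared score divided by the Fisher information, both evaluated at the null value. The following sketch applies this to the Bernoulli example; the function name and the data (60 successes in 100 trials, null value 0.5) are assumptions made for illustration, and the p-value uses the chi-squared distribution with one degree of freedom.

```python
from math import erfc, sqrt


def bernoulli_score_test(successes, n, theta0):
    """Score test of H0: theta = theta0 for n Bernoulli trials."""
    failures = n - successes
    score = successes / theta0 - failures / (1 - theta0)
    information = n / (theta0 * (1 - theta0))
    statistic = score ** 2 / information
    # Upper tail of chi-squared(1): P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = bernoulli_score_test(successes=60, n=100, theta0=0.5)
print(statistic, p_value)
```

Note that only the null value enters the computation; the unrestricted maximum likelihood estimate is never needed.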

Further note that the likelihood-ratio test statistic is given by

[math]\displaystyle{ -2 \left[ \log \mathcal{L}(\theta_{0}) - \log \mathcal{L}(\hat{\theta}) \right] = 2 \int_{\theta_{0}}^{\hat{\theta}} \frac{ d \, \log \mathcal{L}(\theta) }{d \theta} \, d \theta = 2 \int_{\theta_{0}}^{\hat{\theta}} s(\theta) \, d \theta }[/math]

which means that the likelihood-ratio test statistic can be understood as twice the area under the score function between [math]\displaystyle{ \theta_{0} }[/math] and [math]\displaystyle{ \hat{\theta} }[/math].[8]
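
This relationship can be verified numerically for the Bernoulli example. The sketch below, with hypothetical counts of 7 successes and 3 failures and a null value of 0.5, integrates the score with a simple trapezoidal rule and compares the result with the statistic computed directly from the log-likelihood.

```python
from math import log


def log_likelihood(theta, a, b):
    # The binomial coefficient is omitted: it cancels in the likelihood ratio.
    return a * log(theta) + b * log(1 - theta)


def score(theta, a, b):
    return a / theta - b / (1 - theta)


def trapezoid(f, lo, hi, steps=10_000):
    """Composite trapezoidal rule for the integral of f from lo to hi."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


a, b = 7, 3
theta0, theta_hat = 0.5, a / (a + b)

lr_statistic = -2 * (log_likelihood(theta0, a, b) - log_likelihood(theta_hat, a, b))
twice_area = 2 * trapezoid(lambda t: score(t, a, b), theta0, theta_hat)

print(lr_statistic, twice_area)   # the two values agree up to integration error
```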

Score matching (machine learning)

In machine learning, the word score is also used for [math]\displaystyle{ \nabla_x \log p(x) }[/math], the gradient of the log-density with respect to the data. This usage may seem confusing, because the expression does not involve a likelihood function, nor is it a derivative with respect to the parameters. For more information about this definition, see the referenced paper.[9]
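
One property of this quantity can be shown with a short worked equation, offered here as an illustrative sketch rather than a summary of the referenced paper. Suppose a density is known only up to a normalizing constant, [math]\displaystyle{ p(x) = q(x)/Z }[/math], where [math]\displaystyle{ Z }[/math] does not depend on [math]\displaystyle{ x }[/math]. Then

[math]\displaystyle{ \nabla_x \log p(x) = \nabla_x \log q(x) - \nabla_x \log Z = \nabla_x \log q(x) }[/math]

so the gradient can be computed without knowing [math]\displaystyle{ Z }[/math]. For a univariate normal density with mean [math]\displaystyle{ \mu }[/math] and variance [math]\displaystyle{ \sigma^{2} }[/math] (assumed purely for illustration), it equals [math]\displaystyle{ -(x - \mu)/\sigma^{2} }[/math].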

Notes

  1. Informant in Encyclopaedia of Maths, https://encyclopediaofmath.org/wiki/Informant 
  2. Pickles, Andrew (1985), An Introduction to Likelihood Analysis, Norwich: W. H. Hutchins & Sons, pp. 24–29, ISBN 0-86094-190-6, https://archive.org/details/introductiontoli0000pick/page/24 
  3. Serfling, Robert J. (1980). Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons. p. 145. ISBN 0-471-02403-1. https://archive.org/details/approximationthe00serf. 
  4. Greenberg, Edward; Webster, Charles E. Jr. (1983). Advanced Econometrics : A Bridge to the Literature. New York: John Wiley & Sons. p. 25. ISBN 0-471-09077-8. https://books.google.com/books?id=TSK7AAAAIAAJ&pg=PA25. 
  5. Sargan, Denis (1988). Lectures on Advanced Econometrics. Oxford: Basil Blackwell. pp. 16–18. ISBN 0-631-14956-2. 
  6. Steyerberg, E. W.; Vickers, A. J.; Cook, N. R.; Gerds, T.; Gonen, M.; Obuchowski, N.; Pencina, M. J.; Kattan, M. W. (2010). "Assessing the performance of prediction models. A framework for traditional and novel measures". Epidemiology 21 (1): 128–138. doi:10.1097/EDE.0b013e3181c30fb2. PMID 20010215. 
  7. Rao, C. Radhakrishna (1948). "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation". Mathematical Proceedings of the Cambridge Philosophical Society 44 (1): 50–57. doi:10.1017/S0305004100023987. Bibcode: 1948PCPS...44...50R.
  8. Buse, A. (1982). "The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note". The American Statistician 36 (3a): 153–157. doi:10.1080/00031305.1982.10482817. 
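  9. Hyvärinen, Aapo (2005). "Estimation of Non-Normalized Statistical Models by Score Matching". Journal of Machine Learning Research 6. https://www.jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf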