Principle of transformation groups

The principle of transformation groups is a rule for assigning prior probabilities in a statistical inference problem. It was first suggested by E. T. Jaynes[1] and can be seen as a generalization of the principle of indifference.

Prior probabilities determined according to the principle are objective in the sense that they do not incorporate any information beyond the features of the problem itself, so that two people who apply the principle to the same problem will assign the same prior probabilities. Therefore, the principle forms part of the objective Bayesian interpretation of probability.

Motivation and description of the method

The rule is motivated by the following normative principle or desideratum:

In two problems with the same prior information, people should assign the same prior probabilities.

The rule is implemented by identifying the symmetries in a given problem that allow it to be transformed into an equivalent one, then exploiting the symmetries to solve for the prior probabilities. These symmetries are described by transformation groups.

The symmetries of problems with discrete variables (e.g. dice, cards, categorical data) are described by permutation groups. In these cases, the principle reduces to the principle of indifference. In problems with continuous variables, the symmetries involved may be described by other transformation groups, and determining the prior probabilities generally involves a differential equation which may not have a unique solution. Many common continuous problems do have prior probabilities which are uniquely determined by the principle of transformation groups. Jaynes called these problems "well-posed".[2]

Examples

Discrete case: coin flipping

Consider a coin with a head (H) and a tail (T). Denote this information by I. For a given coin flip, denote the probability of an outcome of heads by [math]\displaystyle{ P(H|I) }[/math] and the probability of an outcome of tails by [math]\displaystyle{ P(T|I) }[/math].

In applying the desideratum, consider the information contained in the event of the coin flip as framed. It describes no distinction between heads and tails. Given no other information, the elements "head" and "tail" are interchangeable. Application of the desideratum then demands that:

[math]\displaystyle{ P(H|I)=P(T|I) }[/math]

The probabilities must add to 1, thus:

[math]\displaystyle{ P(H|I)+P(T|I)=1 \rightarrow 2 P(H|I)=1 \rightarrow P(H|I)=0.5 }[/math].

This argument extends to N categories, to give the "flat" prior probability 1/N.

This provides a consistency-based argument for the principle of indifference: If someone is truly ignorant about a discrete/countable set of outcomes apart from their potential existence, but does not assign them equal prior probabilities, then they are assigning different probabilities when given the same information.

This can alternatively be phrased as: a person who does not use the principle of indifference to assign prior probabilities to discrete variables is either not ignorant about them or reasoning inconsistently.
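The permutation argument can be illustrated with a minimal Python sketch. It is not taken from Jaynes; it assumes that "equivalent problems" are obtained by relabelling the outcomes, and it uses group averaging (a technique introduced here for illustration) to show that the only assignment unchanged by every relabelling is the flat one. The function names and the starting distribution are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def symmetrize(prior):
    """Average a probability assignment over every relabelling of its outcomes."""
    relabelled = list(permutations(prior))
    return [sum(p[i] for p in relabelled) / len(relabelled)
            for i in range(len(prior))]


def is_permutation_invariant(prior):
    """True if every relabelling of the outcomes leaves the assignment unchanged."""
    return all(list(p) == list(prior) for p in permutations(prior))


# A hypothetical, deliberately lopsided assignment over N = 3 categories.
lopsided = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
uniform = symmetrize(lopsided)

print(uniform)
print(is_permutation_invariant(lopsided), is_permutation_invariant(uniform))
```

Averaging over all N! relabellings sends any starting assignment to 1/N per category, which is the assignment that treats all the equivalent problems alike.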

Continuous case: location parameter

This is the simplest example for continuous variables. It arises when one is "ignorant" of the location parameter in a given problem. A parameter [math]\displaystyle{ \mu }[/math] is a "location parameter" if the sampling distribution, or likelihood, of an observation X depends on [math]\displaystyle{ \mu }[/math] only through the difference [math]\displaystyle{ X-\mu }[/math]:

[math]\displaystyle{ p(X|\mu,I)=f(X-\mu) }[/math]

for some normalized probability distribution f(.).

Note that a uniform probability distribution can be normalized only over a finite domain. A normalized uniform prior therefore requires that the location parameter [math]\displaystyle{ \mu }[/math] not extend to infinity in any of its dimensions; otherwise the uniform prior is improper (not normalisable).

Examples of location parameters include the mean parameter of a normal distribution with known variance and the median parameter of a Cauchy distribution with a known interquartile range.

Given one's knowledge of the sampling distribution [math]\displaystyle{ p(X|\mu,I)=f(X-\mu) }[/math], but no other knowledge about [math]\displaystyle{ \mu }[/math], the two "equivalent problems" in this case are related by a "shift" of equal magnitude in X and [math]\displaystyle{ \mu }[/math]. This is because of the relation:

[math]\displaystyle{ f(X-\mu)=f([X+b]-[\mu+b])=f(X^{(1)}-\mu^{(1)}) }[/math]

"Shifting" all quantities up by some number b, solving in the "shifted space", and then "shifting" back to the original one should give exactly the same answer as working in the original space throughout. The transformation from [math]\displaystyle{ \mu }[/math] to [math]\displaystyle{ \mu^{(1)} }[/math] has a Jacobian of simply 1, so the prior probability [math]\displaystyle{ g(\mu) = p(\mu|I) }[/math] must satisfy the functional equation:

[math]\displaystyle{ g(\mu)=\left|{\partial \mu^{(1)} \over \partial \mu}\right| g(\mu^{(1)}) = g(\mu+b) }[/math]

The only function that satisfies this equation for every shift b is the "constant prior":

[math]\displaystyle{ p(\mu|I) \propto 1 }[/math]

Therefore, the uniform prior is justified for expressing complete ignorance of a continuous location parameter, and it is a normalized distribution only when the parameter's range is finite.
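The requirement that shifting, solving, and shifting back should change nothing can be checked numerically. The following Python sketch is an illustration only: it assumes a unit-variance normal sampling distribution, a grid approximation to the posterior, and made-up data; the informative prior is included purely as a contrast.

```python
import math


def posterior_mean(data, log_prior, grid):
    """Grid approximation to the posterior mean of mu (unit-variance normal likelihood)."""
    log_post = [log_prior(mu) - 0.5 * sum((x - mu) ** 2 for x in data)
                for mu in grid]
    peak = max(log_post)                      # subtract the peak to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    return sum(w * mu for w, mu in zip(weights, grid)) / sum(weights)


def flat(mu):
    """Log of the constant prior p(mu | I) proportional to 1."""
    return 0.0


def informative(mu):
    """Log of a standard normal prior, which singles out mu = 0."""
    return -0.5 * mu ** 2


grid = [-30.0 + 0.01 * i for i in range(6001)]   # mu from -30 to 30
data = [0.3, -1.2, 0.8]                          # hypothetical observations
b = 2.5                                          # size of the shift
shifted = [x + b for x in data]

flat_gap = posterior_mean(shifted, flat, grid) - posterior_mean(data, flat, grid)
informative_gap = (posterior_mean(shifted, informative, grid)
                   - posterior_mean(data, informative, grid))
print(flat_gap, informative_gap)
```

Under the flat prior the posterior mean moves by b when the data move by b, so the shifted problem gives the same answer after shifting back. Under the informative prior it moves by less than b, because that prior encodes knowledge about where the parameter lies.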

Continuous case: scale parameter

As in the above argument, a statement that [math]\displaystyle{ \sigma }[/math] is a scale parameter means that the sampling distribution has the functional form:

[math]\displaystyle{ p(X|\sigma,I)={1 \over \sigma}f\left({X \over \sigma}\right) }[/math]

where, as before, [math]\displaystyle{ f(\cdot) }[/math] is a normalized probability density function. The requirement that probabilities be finite and positive forces the condition [math]\displaystyle{ \sigma\gt 0 }[/math]. Examples include the standard deviation of a normal distribution with a known mean and the scale parameter of the gamma distribution. The "symmetry" in this problem is found by noting that:

[math]\displaystyle{ {X \over \sigma}={X a \over \sigma a} ; a\gt 0 }[/math]

and setting [math]\displaystyle{ X^{(1)} = Xa }[/math] and [math]\displaystyle{ \sigma^{(1)} = \sigma a. }[/math] But, unlike in the location parameter case, the Jacobian of this transformation in the sample space and the parameter space is a, not 1. So, the sampling probability changes to:

[math]\displaystyle{ p(X^{(1)}|\sigma,I)={1 \over a} \cdot {1 \over \sigma}f\left({X a \over \sigma a}\right)= {1 \over \sigma^{(1)}}f\left({X^{(1)} \over \sigma^{(1)}}\right) }[/math]

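which is invariant (i.e. has the same form before and after the transformation). The prior probability must then satisfy the functional equation:

[math]\displaystyle{ p(\sigma|I)=\left|{\partial \sigma^{(1)} \over \partial \sigma}\right| p(\sigma^{(1)}|I)=a\,p(a\sigma|I) }[/math]

Differentiating with respect to a and setting a = 1 gives [math]\displaystyle{ p(\sigma|I)+\sigma\,{\partial p(\sigma|I) \over \partial \sigma}=0 }[/math], which has a unique solution (up to a proportionality constant):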

[math]\displaystyle{ p(\sigma|I) \propto {1 \over \sigma} \rightarrow p(\log(\sigma)|I) \propto 1 }[/math]

This is the well-known Jeffreys prior for scale parameters, which is "flat" on the log scale. The Jeffreys prior is usually derived by a different argument, based on the Fisher information function. The fact that the two methods give the same result in this case does not imply that they agree in general.
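The scale-invariance condition can be spot-checked numerically. In the Python sketch below the condition is written as g(σ) = a·g(aσ) for every a > 0; the candidate priors, test points, and function names are chosen here for illustration and are not part of the original argument.

```python
import math


def satisfies_scale_equation(g, scales=(0.5, 2.0, 7.0), sigmas=(0.1, 1.0, 4.0)):
    """Check g(sigma) == a * g(a * sigma) at a few test points."""
    return all(math.isclose(g(s), a * g(a * s)) for a in scales for s in sigmas)


def reciprocal(sigma):
    """Candidate prior proportional to 1 / sigma."""
    return 1.0 / sigma


def flat(sigma):
    """Candidate prior proportional to 1."""
    return 1.0


def inverse_square(sigma):
    """Candidate prior proportional to 1 / sigma**2."""
    return 1.0 / sigma ** 2


def reciprocal_mass(lo, hi):
    """Unnormalized mass that 1 / sigma assigns to the interval [lo, hi]."""
    return math.log(hi / lo)


print(satisfies_scale_equation(reciprocal))
print(satisfies_scale_equation(flat), satisfies_scale_equation(inverse_square))
print(reciprocal_mass(1.0, 10.0), reciprocal_mass(100.0, 1000.0))
```

Only the 1/σ candidate passes. It also assigns equal mass to every decade, which is what "flat on the log scale" means in practice.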

Continuous case: Bertrand's paradox

Edwin Jaynes used this principle to provide a resolution to Bertrand's Paradox[2] by stating his ignorance about the exact position of the circle.

Discussion

This argument depends crucially on I; changing the information may result in a different probability assignment. This is just as crucial as changing axioms in deductive logic: small changes in the information can lead to large changes in the probability assignments allowed by "consistent reasoning."

To illustrate, suppose that the coin flipping example also states as part of the information that the coin has a side (S) (i.e. it is a real coin). Denote this new information by N. The same argument using "complete ignorance", or more precisely, the information actually described, gives:

[math]\displaystyle{ P(H|I,N)=P(T|I,N)=P(S|I,N)=1/3 }[/math]

Intuition tells us that P(S) should be very close to zero, because most people's intuition does not see "symmetry" between a coin landing on its side and landing on heads. Our intuition says that the particular "labels" actually carry some information about the problem. This can be made more formal by appealing to the physics of the problem, which makes it difficult for a flipped coin to land on its side: we distinguish between "thick" coins and "thin" coins (here thickness is measured relative to the coin's diameter). It could reasonably be assumed that:

[math]\displaystyle{ P(S|\text{thin coin}) \neq P(S|\text{thick coin}) }[/math]

Note that this new information probably wouldn't break the symmetry between "heads" and "tails", so that permutation would still apply in describing "equivalent problems", and we would require:

[math]\displaystyle{ P(T|\text{thin coin}) = P(H|\text{thin coin}) \neq P(H|\text{thick coin})=P(T|\text{thick coin}) }[/math]

This is a good example of how the principle of transformation groups can be used to "flesh out" personal opinions. All of the information used in the derivation is explicitly stated. If a prior probability assignment doesn't "seem right" according to what your intuition tells you, then there must be some "background information" that has not been put into the problem.[3] It is then the task to try and work out what that information is. In some sense, combining the method of transformation groups with one's intuition can be used to "weed out" the actual assumptions one has. This makes it a very powerful tool for prior elicitation.

Introducing the thickness of the coin as a variable is permissible because its existence was implied (by being a real coin) but its value was not specified in the problem. Introducing a "nuisance parameter" and then making the answer invariant to this parameter is a very useful technique for solving supposedly "ill-posed" problems like Bertrand's Paradox. This has been called "the well-posing strategy" by some.[4]
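The effect of the reduced symmetry in the coin example can be sketched in Python. The sketch assumes that the only remaining "equivalent problem" is the one with heads and tails swapped, and it averages a hypothetical starting guess over that two-element group; the numbers and names are illustrative only.

```python
from fractions import Fraction

OUTCOMES = ("H", "T", "S")


def symmetrize(prior, relabellings):
    """Average a prior over a set of relabellings (each maps outcome -> outcome)."""
    return {
        outcome: sum(prior[r[outcome]] for r in relabellings) / len(relabellings)
        for outcome in OUTCOMES
    }


identity = {"H": "H", "T": "T", "S": "S"}
swap_faces = {"H": "T", "T": "H", "S": "S"}   # the side is no longer interchangeable

# Hypothetical starting guess that already treats the side as unlikely.
guess = {"H": Fraction(6, 10), "T": Fraction(3, 10), "S": Fraction(1, 10)}
result = symmetrize(guess, [identity, swap_faces])
print(result)
```

The heads/tails symmetry forces P(H) = P(T) but leaves P(S) untouched, so the value of P(S) must come from additional information such as the coin's thickness.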

A strength of this principle lies in its application to continuous parameters, where the notion of "complete ignorance" is not as well defined as in the discrete case. However, if applied with infinite limits, it often gives improper prior distributions. The discrete case for a countably infinite set, such as {0, 1, 2, ...}, also produces an improper discrete prior. In most cases where the likelihood is sufficiently "steep", this does not present a problem. However, to be sure of avoiding incoherent results and paradoxes, the prior distribution should be approached via a well-defined and well-behaved limiting process. One such process is the use of a sequence of priors with increasing range, such as [math]\displaystyle{ f(M) = {I(M \in [-b,b]) \over 2b} }[/math], where the limit [math]\displaystyle{ b \rightarrow \infty }[/math] is taken at the end of the calculation, i.e. after the normalization of the posterior distribution. This ensures that one takes the limit of the ratio, and not the ratio of two limits. See Limit of a function for details on limits and why this order of operations is important.

If the limit of the ratio does not exist or diverges, the result is an improper posterior (i.e. a posterior that does not integrate to one). This indicates that the data are so uninformative about the parameters that the prior probability of arbitrarily large values still matters in the final answer. In some sense, an improper posterior means that the information contained in the data has not "ruled out" arbitrarily large values. Viewed this way, it makes some sense that "complete ignorance" priors should be improper, because the information used to derive them is so meagre that it cannot rule out absurd values on its own. From a state of complete ignorance, only the data or some other form of additional information can rule out such absurdities.
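The limiting process can be made concrete with a short Python sketch. It assumes a likelihood that is normal in the parameter, centred at a hypothetical value xbar with width s, combined with the uniform prior on [-b, b]; the closed-form integrals are the standard truncated-normal results, and the function name is illustrative.

```python
import math


def posterior_mean_terms(xbar, s, b):
    """Numerator, denominator and ratio for the posterior mean under Uniform(-b, b)."""
    alpha, beta = (-b - xbar) / s, (b - xbar) / s

    def pdf(z):
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    mass = cdf(beta) - cdf(alpha)                       # integral of the likelihood over [-b, b]
    moment = xbar * mass + s * (pdf(alpha) - pdf(beta)) # integral of mu * likelihood over [-b, b]
    numerator = moment / (2.0 * b)                      # includes the prior density 1 / (2b)
    denominator = mass / (2.0 * b)
    return numerator, denominator, numerator / denominator


for b in (1.0, 10.0, 100.0, 1000.0):
    print(b, posterior_mean_terms(xbar=1.5, s=1.0, b=b))
```

As b grows, the numerator and denominator each shrink towards zero, so the ratio of their limits is undefined, while the ratio itself settles at xbar. Taking the limit after normalization is what keeps the answer well defined.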

References

  1. Jaynes, Edwin T. (1968). "Prior Probabilities". IEEE Transactions on Systems Science and Cybernetics 4 (3): 227–241. doi:10.1109/TSSC.1968.300117. https://bayes.wustl.edu/etj/articles/prior.pdf. Retrieved 2023-06-30.
  2. Jaynes, Edwin T. (1973). "The Well-Posed Problem". Foundations of Physics 3 (4): 477–492. doi:10.1007/BF00709116. Bibcode: 1973FoPh....3..477J. https://bayes.wustl.edu/etj/articles/well.pdf. Retrieved 2023-06-30.
  3. Jaynes, E. T. (1984). "Monkeys, Kangaroos, and N". In Justice, James H. Fourth Annual Workshop on Bayesian/Maximum Entropy Methods. Cambridge University Press. https://bayes.wustl.edu/etj/articles/cmonkeys.pdf. Retrieved 2023-11-13.
  4. Shackel, Nicholas (2007). "Bertrand's Paradox and the Principle of Indifference". Philosophy of Science 74 (2): 150–175. doi:10.1086/519028. http://orca.cf.ac.uk/3803/1/Shackel%20Bertrand%27s%20paradox%205.pdf. Retrieved 2018-11-04.

Further reading

  • Edwin Thompson Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003. ISBN:0-521-59271-2.