Zipf–Mandelbrot law
| Parameters | [math]\displaystyle{ N \in \{1,2,3,\ldots\} }[/math] (integer), [math]\displaystyle{ q \in [0;\infty) }[/math] (real), [math]\displaystyle{ s\gt 0\, }[/math] (real) |
|---|---|
| Support | [math]\displaystyle{ k \in \{1,2,\ldots,N\} }[/math] |
| pmf | [math]\displaystyle{ \frac{1/(k+q)^s}{H_{N,q,s}} }[/math] |
| CDF | [math]\displaystyle{ \frac{H_{k,q,s}}{H_{N,q,s}} }[/math] |
| Mean | [math]\displaystyle{ \frac{H_{N,q,s-1}}{H_{N,q,s}}-q }[/math] |
| Mode | [math]\displaystyle{ 1\, }[/math] |
| Entropy | [math]\displaystyle{ \frac{s}{H_{N,q,s}}\sum_{k=1}^N\frac{\ln(k + q)}{(k + q)^s} +\ln(H_{N,q,s}) }[/math] |
In probability theory and statistics, the Zipf–Mandelbrot law is a discrete probability distribution. Also known as the Pareto–Zipf law, it is a power-law distribution on ranked data, named after the linguist George Kingsley Zipf, who suggested a simpler distribution called Zipf's law, and the mathematician Benoit Mandelbrot, who subsequently generalized it.
The probability mass function is given by:
- [math]\displaystyle{ f(k;N,q,s)=\frac{1/(k+q)^s}{H_{N,q,s}} }[/math]
where [math]\displaystyle{ H_{N,q,s} }[/math] is given by:
- [math]\displaystyle{ H_{N,q,s}=\sum_{i=1}^N \frac{1}{(i+q)^s} }[/math]
which may be thought of as a generalization of a harmonic number. In the formula, [math]\displaystyle{ k }[/math] is the rank of the data, and [math]\displaystyle{ q }[/math] and [math]\displaystyle{ s }[/math] are parameters of the distribution. In the limit as [math]\displaystyle{ N }[/math] approaches infinity, this sum becomes the Hurwitz zeta function [math]\displaystyle{ \zeta(s,q+1) }[/math]. For finite [math]\displaystyle{ N }[/math] and [math]\displaystyle{ q=0 }[/math] the Zipf–Mandelbrot law reduces to Zipf's law. For infinite [math]\displaystyle{ N }[/math] and [math]\displaystyle{ q=0 }[/math] it becomes a zeta distribution (which requires [math]\displaystyle{ s\gt 1 }[/math] for the normalizing sum to converge).
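To make the definitions concrete, the following is a minimal Python sketch (assuming NumPy) of the pmf and an inverse-CDF sampler. The function names and the parameter values are illustrative, not any standard library API.

```python
import numpy as np

def zipf_mandelbrot_pmf(N, q, s):
    """pmf f(k; N, q, s) for k = 1..N, normalized by H_{N,q,s}."""
    k = np.arange(1, N + 1)
    w = 1.0 / (k + q) ** s            # unnormalized weights 1/(k+q)^s
    return w / w.sum()                # w.sum() is H_{N,q,s}

def zipf_mandelbrot_sample(N, q, s, size, rng=None):
    """Draw ranks by inverting the CDF H_{k,q,s} / H_{N,q,s}."""
    rng = rng or np.random.default_rng()
    cdf = np.cumsum(zipf_mandelbrot_pmf(N, q, s))
    return np.searchsorted(cdf, rng.random(size)) + 1

# Quick check: the pmf sums to 1 and the mode is rank 1, matching the table.
p = zipf_mandelbrot_pmf(N=1000, q=2.7, s=1.1)
assert abs(p.sum() - 1.0) < 1e-9 and p.argmax() == 0
print(zipf_mandelbrot_sample(1000, 2.7, 1.1, size=5))
```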
Applications
The distribution of words ranked by their frequency in a random text corpus is approximated by a power-law distribution, known as Zipf's law.
If one plots the frequency rank of words in a moderately sized corpus against the number of occurrences or actual frequencies, one obtains a power-law distribution with exponent close to one (but see Powers, 1998 and Gelbukh & Sidorov, 2001). Zipf's law implicitly assumes a fixed vocabulary size, but the harmonic series with s = 1 does not converge, while the Zipf–Mandelbrot generalization with s > 1 does; a numerical check is sketched below. Furthermore, there is evidence that the closed class of functional words that define a language obeys a Zipf–Mandelbrot distribution with parameters different from those of the open classes of contentive words, which vary by topic, field and register.[1]
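As a small numerical check of the convergence claim (an illustrative sketch, not drawn from the cited sources), one can watch the normalizer [math]\displaystyle{ H_{N,q,s} }[/math] grow with [math]\displaystyle{ N }[/math]: for s = 1 it grows without bound, roughly like ln N, while for s > 1 it levels off toward a finite limit.

```python
import numpy as np

def H(N, q, s):
    """Generalized harmonic number H_{N,q,s} = sum over i=1..N of 1/(i+q)^s."""
    i = np.arange(1, N + 1, dtype=np.float64)
    return (1.0 / (i + q) ** s).sum()

for N in (10**3, 10**4, 10**5, 10**6):
    # s = 1.0: keeps growing (diverges); s = 1.2: approaches a finite limit.
    print(f"N={N:>8}  H(N,0,1.0)={H(N, 0.0, 1.0):8.3f}  H(N,0,1.2)={H(N, 0.0, 1.2):6.3f}")
```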
In ecological field studies, the relative abundance distribution (i.e. the graph of the number of species observed as a function of their abundance) is often found to conform to a Zipf–Mandelbrot law.[2]
Within music, many metrics for measuring how "pleasing" music is conform to Zipf–Mandelbrot distributions.[3]
Notes
- ↑ Powers, David M W (1998). "Applications and explanations of Zipf's law". Joint conference on new methods in language processing and computational natural language learning. Association for Computational Linguistics. pp. 151–160.
- ↑ Mouillot, D; Lepretre, A (2000). "Introduction of relative abundance distribution (RAD) indices, estimated from the rank-frequency diagrams (RFD), to assess changes in community diversity". Environmental Monitoring and Assessment (Springer) 63 (2): 279–295. doi:10.1023/A:1006297211561. http://cat.inist.fr/?aModele=afficheN&cpsidt=1411186. Retrieved 24 Dec 2008.
- ↑ Manaris, B; Vaughan, D; Wagner, CS; Romero, J; Davis, RB. "Evolutionary Music and the Zipf–Mandelbrot Law: Developing Fitness Functions for Pleasant Music". Proceedings of 1st European Workshop on Evolutionary Music and Art (EvoMUSART2003) 611. https://archive.today/wQYN.
References
- Mandelbrot, Benoit (1965). "Information Theory and Psycholinguistics". In B. B. Wolman and E. Nagel (eds.). Scientific Psychology. Basic Books. Reprinted as:
- Mandelbrot, Benoit (1968). "Information Theory and Psycholinguistics". In R. C. Oldfield and J. C. Marshall (eds.). Language. Penguin Books.
- Powers, David M W (1998). "Applications and explanations of Zipf's law". Joint conference on new methods in language processing and computational natural language learning. Association for Computational Linguistics. pp. 151–160.
- Zipf, George Kingsley (1932). Selected Studies of the Principle of Relative Frequency in Language. Cambridge, MA: Harvard University Press.
- Van Droogenbroeck, F. J. (2019). "An essential rephrasing of the Zipf–Mandelbrot law to solve authorship attribution applications by Gaussian statistics".
External links
- Z. K. Silagadze: Citations and the Zipf–Mandelbrot's law
- NIST: Zipf's law
- W. Li's References on Zipf's law
- Gelbukh & Sidorov, 2001: Zipf and Heaps Laws’ Coefficients Depend on Language
- C++ Library for generating random Zipf–Mandelbrot deviates.
Original source: https://en.wikipedia.org/wiki/Zipf–Mandelbrot_law