Brown Corpus

Short description: Data set of American English in 1961

The Brown University Standard Corpus of Present-Day American English, better known simply as the Brown Corpus, is an electronic collection of text samples of American English and the first major structured corpus of varied genres. It set the standard for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University in Rhode Island, it is a general language corpus of 500 samples of English prose, each of 2,000 or more words, drawn from works published in the United States in 1961 and covering a wide range of styles and varieties. It contains 1,014,312 words in total. Its construction cost the U.S. Office of Education about $23,000 in 1963-64.^[1]

History

Its original name was "A Standard Sample of Present-day Edited American English for use with digital computers", as described in a manual in 1964.^[2]

In 1967, Kučera and Francis published their classic work, entitled "Computational Analysis of Present-Day American English", which provided basic statistics on what is known today simply as the Brown Corpus.^[3]

The Brown Corpus was a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.^[4]

Shortly after publication of the first lexicostatistical analysis, the Boston publisher Houghton Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking dictionary, which first appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years, part-of-speech tags were applied. The Greene and Rubin tagging program helped considerably in this, but its high error rate meant that extensive manual proofreading was required.

The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena. It formed the model for many later corpora such as the Lancaster-Oslo-Bergen Corpus (British English from 1961) and the Freiburg-Brown Corpus of American English (FROWN) (American English from the early 1990s).^[5]^[6] Tagging the corpus enabled far more sophisticated statistical analysis, such as the work programmed by Andrew Mackie and documented in books on English grammar.^[7]

One interesting result is that, even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown Corpus, and "to" and "of" more than another 3% each, while about half of the total vocabulary of about 50,000 words are hapax legomena, that is, words that occur only once in the corpus.^[8] This simple rank-versus-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (see, for example, his The Psychobiology of Language) and is known as Zipf's law.
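The rank-versus-frequency relationship described above can be illustrated with a minimal Python sketch. The function names (`rank_frequency`, `hapax_legomena`, `zipf_prediction`) are hypothetical, tokens are assumed to be already split into words, and case folding is an assumption of this example rather than a documented step of the original analysis.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(t.lower() for t in tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def hapax_legomena(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(word for word, count in counts.items() if count == 1)


def zipf_prediction(top_count, rank):
    """Idealised Zipf estimate: the count at rank n is about top_count / n."""
    return top_count / rank


tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
singletons = hapax_legomena(tokens)
```

On a real corpus, comparing each observed count with `zipf_prediction(table[0][2], rank)` gives a rough sense of how closely the data follows the 1/n pattern.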

Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Sample distribution

The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined they were first published then, and were written by native speakers of American English.

Verse and drama were excluded because they present different problems for linguistic research than standard prose does, but short verse passages quoted in prose samples were kept.^[2]

Each sample began at a random sentence boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases, miscounts led to samples being just under 2,000 words.
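The sampling rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the compilers' actual procedure: the text is assumed to be already split into sentences (each a list of words), the function names `take_from` and `draw_sample` are hypothetical, and a boundary that falls exactly on the target count is treated as a valid stopping point.

```python
import random


def take_from(sentences, start, target_words=2000):
    """Collect whole sentences from `start` until the running word
    count reaches `target_words`, stopping at that sentence boundary.

    If the text runs out first, the shorter sample is returned.
    """
    sample, total = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        total += len(sentence)
        if total >= target_words:
            break
    return sample


def draw_sample(sentences, target_words=2000, rng=None):
    """Pick a random sentence boundary as the start, then apply take_from."""
    rng = rng or random.Random()
    start = rng.randrange(len(sentences))
    return take_from(sentences, start, target_words)


# Toy text: four sentences of 3, 4, 5 and 2 words.
toy = [["a"] * 3, ["b"] * 4, ["c"] * 5, ["d"] * 2]
picked = take_from(toy, start=1, target_words=6)
```

Because the sample always ends on a sentence boundary, it usually overshoots the target slightly, which is consistent with samples being described as 2,000 or more words long.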

The text was mostly sampled from the Brown University Library and the Providence Athenaeum. For the daily press, samples were drawn from the list of American newspapers of which the New York Public Library keeps microfilm files, together with The Providence Journal. Some periodical materials in the categories Skills and Hobbies and Popular Lore were somewhat arbitrarily chosen from "the contents of one of the largest second-hand magazine stores in New York City".^[2]

The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.
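The asterisk convention can be illustrated with a small decoder. This is a simplified sketch that assumes an asterisk capitalises only the single character that follows it and that all other letters are lower case; the special codes for formulae and other items are not modelled, and `decode_keypunch` is a hypothetical name.

```python
def decode_keypunch(text):
    """Convert upper-case-only keypunch text to mixed case.

    Assumption: "*" marks the next character as a capital; every
    other letter is lowered. The asterisk itself is dropped.
    """
    out = []
    capitalize_next = False
    for ch in text:
        if ch == "*":
            capitalize_next = True
            continue
        out.append(ch.upper() if capitalize_next else ch.lower())
        capitalize_next = False
    return "".join(out)


decoded = decode_keypunch("*THE *BROWN *CORPUS WAS COMPILED IN *RHODE *ISLAND.")
```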

The original corpus, built from texts published in 1961, contained 1,014,312 words sampled from 15 text categories:

A. PRESS: Reportage (44 texts)
- Political
- Sports
- Society
- Spot News
- Financial
- Cultural
B. PRESS: Editorial (27 texts)
- Institutional Daily
- Personal
- Letters to the Editor
C. PRESS: Reviews (17 texts)
- Theatre
- Books
- Music
- Dance
D. RELIGION (17 texts)
- Books
- Periodicals
- Tracts
E. SKILLS AND HOBBIES (36 texts)
- Books
- Periodicals
F. POPULAR LORE (48 texts)
- Books
- Periodicals
G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts)
- Books
- Periodicals
H. MISCELLANEOUS: US Government & House Organs (30 texts)
- Government Documents
- Foundation Reports
- Industry Reports
- College Catalog
- Industry House Organ
J. LEARNED (80 texts)
- Natural Sciences
- Medicine
- Mathematics
- Social and Behavioral Sciences
- Political Science, Law, Education
- Humanities
- Technology and Engineering
K. FICTION: General (29 texts)
- Novels
- Short Stories
L. FICTION: Mystery and Detective Fiction (24 texts)
- Novels
- Short Stories
M. FICTION: Science (6 texts)
- Novels
- Short Stories
N. FICTION: Adventure and Western (29 texts)
- Novels
- Short Stories
P. FICTION: Romance and Love Story (29 texts)
- Novels
- Short Stories
R. HUMOR (9 texts)
- Novels
- Essays, etc.
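As a quick arithmetic check on the list above, the per-category text counts can be summed and compared with the stated total of 500 samples. The dictionary name and helper below are illustrative only; the counts and the word total are taken from this article.

```python
# Number of texts per category letter, as listed above.
TEXTS_PER_CATEGORY = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

TOTAL_WORDS = 1_014_312  # word count stated for the corpus

total_texts = sum(TEXTS_PER_CATEGORY.values())
average_words = TOTAL_WORDS / total_texts


def share(category):
    """Fraction of all texts that belong to the given category letter."""
    return TEXTS_PER_CATEGORY[category] / total_texts
```

The counts add up to 500 across 15 categories, and the average sample length works out to about 2,029 words, in line with samples that run slightly past the 2,000-word mark.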

Part-of-speech tags used

Tag	Definition
CC	coordinating conjunction (and, or)
CD	cardinal numeral (one, two, 2, etc.)
CS	subordinating conjunction (if, although)
EX	existential there
IN	preposition (in, at, on)
JJ	adjective
JJA	adjective + auxiliary
JJC	adjective, comparative
JJCC	adjective + conjunction
JJS	semantically superlative adjective (chief, top)
JJF	adjective + female
JJM	adjective + male
NN	singular or mass noun
NNA	noun + auxiliary
NNC	noun + conjunction
NNS	plural noun
NNP	proper noun or part of name phrase
NNPC	proper noun + conjunction
PRP	personal pronoun, singular
PRPS	personal pronoun, plural
PRP$	possessive pronoun
RB	adverb
RBR	comparative adverb
RBS	superlative adverb
VB	verb, base form
VBA	verb + auxiliary, singular, present
VBD	verb, past tense
VBG	verb, present participle/gerund
VBN	verb, past participle
VBZ	verb, 3rd person singular present
FW	foreign words
SYM	symbols
PUN	all punctuation

References

[1] Francis, W. N. "Problems of Assembling, Describing, and Computerizing Corpora. Research Techniques and Prospects. Papers in Southwest English, No. 1." (1975).
[2] Francis, W. N., and H. Kučera. Manual of Information to Accompany a Standard Sample of Present-day Edited American English, for Use with Digital Computers. Original ed. 1964, revised 1971, revised and augmented 1979. Providence, R.I.: Department of Linguistics, Brown University.
[3] Francis, W. Nelson & Henry Kučera. 1967. Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.
[4] Francis, W. Nelson & Henry Kučera. 1979. BROWN CORPUS MANUAL: Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English for Use with Digital Computers. http://icame.uib.no/brown/bcm.html.
[5] Hundt, Marianne, Andrea Sand & Rainer Siemund. 1998. Manual of Information to Accompany the Freiburg-Brown Corpus of American English (FROWN). http://khnt.hit.uib.no/icame/manuals/frown/INDEX.HTM
[6] Leech, Geoffrey & Nicholas Smith. 2005. Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB. ICAME Journal 29. 83–98.
[7] Francis, W. Nelson & Henry Kučera. 1983. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin.
[8] Malmkjær, Kirsten. 2002. The Linguistics Encyclopedia, 2nd ed. Routledge. ISBN 0-415-22210-9, p. 87.

Original source: https://en.wikipedia.org/wiki/Brown Corpus.
