Stop words

In computing, stop words are words that are filtered out before or after the processing of natural language data (text).^[1] Although "stop words" usually refers to the most common words in a language, there is no single universal list used by all natural language processing tools, and not all tools use such a list at all. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. Removing them can cause problems when searching for phrases that include them, particularly names such as "The Who", "The The", or "Take That", so some tools deliberately keep stop words in order to support phrase search. Other search engines remove some of the most common words, including lexical words such as "want", from a query in order to improve performance.^[2]
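The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any particular tool: the stop list is just the five example words from the text, and the names `STOP_WORDS`, `tokenize`, and `remove_stop_words` are hypothetical.

```python
import re

# Hypothetical stop list for illustration only: the five example words
# from the text. Real tools each define their own list, or use none.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in the stop list."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


print(remove_stop_words("The cat is on the mat"))  # ['cat', 'mat']
print(remove_stop_words("The Who"))                # ['who']
print(remove_stop_words("The The"))                # []
```

With this list, the query "The Who" shrinks to the single token "who" and "The The" disappears entirely, which is the phrase-search problem noted above.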

Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept.^[3] The phrase "stop word" itself does not appear in Luhn's 1959 presentation; it and the associated terms "stop list" and "stoplist" appear in the literature shortly afterwards.^[4]

A predecessor concept was used in creating some concordances. For example, the first Hebrew concordance, Me’ir nativ, contained a one-page list of unindexed words, with nonsubstantive prepositions and conjunctions which are similar to modern stop words.^[5]

In SEO terminology, stop words are the most common words that most search engines skip during crawling or indexing, which saves processing time and space in their databases.
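To make the space saving concrete, here is a toy inverted index built with and without a stop list. It is a sketch under stated assumptions rather than a description of how any real search engine indexes pages: the three documents, the stop list, and the helper names `build_index` and `posting_count` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stop list and documents, chosen only for illustration.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})

DOCS = [
    "the cat is on the mat",
    "the dog is at the door",
    "which cat is at the door",
]


def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return dict(index)


def posting_count(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


full = build_index(DOCS)
filtered = build_index(DOCS, STOP_WORDS)

print(len(full), posting_count(full))          # 9 16
print(len(filtered), posting_count(filtered))  # 4 6
```

In this toy corpus the stop words account for 10 of the 16 stored entries, since they occur in almost every document; the proportion on real data depends on the corpus and on the list chosen.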

References

[1] Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452. http://i.stanford.edu/~ullman/mmds/ch1.pdf.
[2] Stack Overflow: "One of our major performance optimizations for the 'related questions' query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster".
[3] Luhn, H. P. (1959). Keyword-in-Context Index for Technical Literature (KWIC Index). Yorktown Heights, NY: International Business Machines Corp. doi:10.1002/asi.5090110403.
[4] Flood, Barbara J. (1999). "Historical note: The Start of a Stop List at Biological Abstracts". Journal of the American Society for Information Science 50 (12): 1066. doi:10.1002/(SICI)1097-4571(1999)50:12<1066::AID-ASI5>3.0.CO;2-A. https://dx.doi.org/10.1002/(SICI)1097-4571(1999)50:12<1066::AID-ASI5>3.0.CO;2-A. Retrieved 16 February 2016.
[5] Weinberg, Bella Hass (2004). "Predecessors of scientific indexing structures in the domain of religion". Second Conference on the History and Heritage of Scientific and Technical Information Systems: 126–134. https://www.asis.org/History/11-weinberg.pdf. Retrieved 17 February 2016.



