Spider trap

Short description: Set of web pages that can undermine web crawlers

A spider trap (or crawler trap) is a set of web pages that, intentionally or unintentionally, causes a web crawler or search bot to make an unbounded number of requests, or causes a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created deliberately to "catch" spambots or other crawlers that waste a website's bandwidth. They may also arise unintentionally, for example from calendars that use dynamic pages whose links always point to the next day or year.

Common techniques include:

Creation of indefinitely deep directory structures like http://example.com/bar/foo/bar/foo/bar/foo/bar/...
Dynamic pages that produce an unbounded number of documents for a web crawler to follow. Examples include calendars^[1] and algorithmically generated language poetry.^[2]
Documents filled with so many characters that they crash the lexical analyzer parsing the document.
Documents with session IDs based on required cookies.
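The calendar case can be illustrated with a minimal sketch. The `/calendar/<date>` URL scheme and the function names below are hypothetical, chosen only for illustration; the point is that every generated page links to a further page, so a crawler with no stopping rule of its own never runs out of new URLs.

```python
from datetime import date, timedelta
from typing import List


def calendar_page_links(day: date) -> List[str]:
    """Links found on a hypothetical dynamic calendar page for `day`."""
    prev_day = day - timedelta(days=1)
    next_day = day + timedelta(days=1)
    return [
        f"/calendar/{prev_day.isoformat()}",
        f"/calendar/{next_day.isoformat()}",
    ]


def naive_crawl(start: date, max_pages: int) -> List[str]:
    """Follow the 'next day' link repeatedly.

    Only the max_pages cap stops this loop: the site itself always
    offers one more page, which is what makes it a trap.
    """
    visited: List[str] = []
    current = start
    while len(visited) < max_pages:
        visited.append(f"/calendar/{current.isoformat()}")
        next_url = calendar_page_links(current)[1]
        current = date.fromisoformat(next_url.rsplit("/", 1)[1])
    return visited


print(naive_crawl(date(2024, 12, 30), 4))
```

Raising `max_pages` always yields more distinct URLs, which is why a crawler generally needs a page budget or depth limit that it enforces itself.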

There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
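As an example of one class that can be detected automatically, the following sketch flags URLs whose paths are very deep or repeat the same segment many times, as in the indefinitely deep directory structure mentioned above. The thresholds and the function name `looks_like_trap` are assumptions made for illustration, not values from any particular crawler, and a heuristic like this can both miss traps and flag legitimate URLs.

```python
from collections import Counter
from urllib.parse import urlsplit

# Illustrative thresholds (assumptions, tune for a real crawl).
MAX_URL_LENGTH = 2000      # characters in the whole URL
MAX_DEPTH = 10             # path segments
MAX_SEGMENT_REPEATS = 2    # occurrences of one segment within a path


def looks_like_trap(url: str) -> bool:
    """Heuristically flag URLs typical of deep-directory traps."""
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    # The same directory name recurring many times suggests a cycle.
    counts = Counter(segments)
    return any(n > MAX_SEGMENT_REPEATS for n in counts.values())


print(looks_like_trap("http://example.com/bar/foo/bar/foo/bar/foo/bar/"))  # True
print(looks_like_trap("http://example.com/docs/guide/intro.html"))         # False
```

A crawler might skip flagged URLs or merely lower their priority; either way, the check covers only URL-shaped traps and does nothing for traps that use short, varied URLs.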

Politeness

A spider trap causes a web crawler to enter something like an infinite loop,^[3] which wastes the spider's resources,^[4] lowers its productivity, and, in the case of a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts and do not request documents from the same server more than once every several seconds.^[5] Because a trap can therefore consume only a small share of a polite crawler's requests, a "polite" web crawler is affected to a much lesser degree than an "impolite" crawler.
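The per-host delay can be sketched as a small scheduler that remembers when each host may next be contacted. The class name `PoliteScheduler`, the five-second default, and the injectable `clock` parameter are assumptions for illustration; real crawlers typically combine such a delay with a queue that interleaves hosts.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class PoliteScheduler:
    """Track, per host, the earliest time the next request is allowed."""

    def __init__(self, min_delay: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_delay = min_delay   # seconds between requests to one host
        self.clock = clock           # injectable so the logic can be tested
        self._next_allowed: Dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Seconds to wait before fetching `url` (0.0 means fetch now)."""
        host = urlsplit(url).netloc
        now = self.clock()
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_fetch(self, url: str) -> None:
        """Note that `url` was just fetched, starting the host's delay."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = self.clock() + self.min_delay


scheduler = PoliteScheduler(min_delay=5.0)
scheduler.record_fetch("http://example.com/bar/foo/")
print(scheduler.wait_time("http://example.com/bar/foo/bar/") > 0)  # True
print(scheduler.wait_time("http://example.org/"))                  # 0.0
```

With this rule, a host that serves a trap receives at most one request per `min_delay` seconds, while the crawler remains free to fetch from other hosts in the meantime.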

In addition, sites with spider traps usually have a robots.txt telling bots not to go to the trap, so a legitimate "polite" bot would not fall into the trap, whereas an "impolite" bot which disregards the robots.txt settings would be affected by the trap.^[6]
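A polite crawler's robots.txt check can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `/trap/` path, and the user-agent name `ExampleBot` are hypothetical; the rules are parsed from a string here so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site with a trap under /trap/ might serve.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
print(parser.can_fetch("ExampleBot", "http://example.com/trap/page1"))  # False
print(parser.can_fetch("ExampleBot", "http://example.com/about.html"))  # True
```

A crawler that calls `can_fetch` before every request never enters the disallowed directory, whereas one that skips the check can follow links into it.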

References

[1] "What is a Spider Trap?". https://www.techopedia.com/definition/5197/spider-trap.
[2] Neil M Hennessy. "The Sweetest Poison, or The Discovery of L=A=N=G=U=A=G=E Poetry on the Web". Accessed 2013-09-26.
[3] "Portent" (in en-US). 2016-02-03. https://www.portent.com/blog/seo/field-guide-to-spider-traps-an-seo-companion.htm.
[4] "How to Set Up a robots.txt to Control Search Engine Spiders (thesitewizard.com)". https://www.thesitewizard.com/archive/robotstxt.shtml.
[5] "Building a Polite Web Crawler" (in en). https://dev.to/turnersoftware/building-a-polite-web-crawler-3b8h.
[6] Group, J. Media (2017-10-12). "Closing a spider trap: fix crawl inefficiencies" (in en-US). https://jmediagroup.net/closing-a-spider-trap-fix-crawl-inefficiencies/.


