2024 Google Search documentation leak
The 2024 Google Search documentation leak was the inadvertent publication of internal Google Search API documentation that occurred in March 2024 and became publicly known in May 2024. The leaked materials consisted of over 2,500 pages describing more than 14,000 attributes related to Google's search ranking systems.[1][2] Google confirmed the documents' authenticity on May 29, 2024.[3]
The leak was considered significant because some of its contents appeared to contradict earlier public statements by Google representatives about how the company's search algorithm operates, particularly the use of click data and Google Chrome browser data in search rankings.[4][5]
Background
Google's search algorithm determines the ranking of websites in Google Search results and is considered one of the most closely guarded trade secrets in the technology industry.[1] The algorithm's workings have been the subject of extensive speculation and analysis by the search engine optimization (SEO) industry, which relies on understanding ranking factors to optimize websites for higher placement in search results.[5]
Google has historically provided general guidance to website operators through official documentation and public statements by its employees, but has not disclosed the specific signals and weights used in its ranking system.[4] This secrecy has led to an ongoing tension between Google's public statements and the observations made by SEO practitioners.[1]
Discovery and disclosure
Origin of the leak
On or around March 13, 2024, an automated bot named yoshi-code-bot inadvertently committed internal documentation for Google's Content Warehouse API to a publicly accessible Google-owned repository on GitHub.[2] The commit was published under an Apache 2.0 open source license, as was standard for Google's public documentation.[2] The material remained publicly accessible until May 7, 2024, when a follow-up commit attempted to remove it. By then, the documentation had been captured by Hexdocs, an external automated documentation service that indexes public GitHub repositories.[2][6]
Public disclosure
Erfan Azimi, CEO of SEO firm EA Eagle Digital, discovered the leaked documents and shared them with Rand Fishkin, co-founder of SparkToro and former CEO of Moz.[2][5] Fishkin verified the documents by consulting former Google employees, two of whom confirmed the materials appeared genuine and matched the formatting and notation style of internal Google documentation.[1]
Fishkin then shared the materials with Michael King, CEO of iPullRank, for detailed technical analysis.[2] Fishkin published his account on the SparkToro blog on May 27, 2024, and King published his analysis on iPullRank on May 28, 2024.[1][6]
Google's response
Google initially declined to comment on the leak.[1] On May 29, 2024, Google spokesperson Davis Thompson confirmed the documents' authenticity in a statement to The Verge, while cautioning against drawing conclusions from them: "We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We've shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."[3]
Contents
The leaked documentation described protocol buffer field definitions used within Google's Content Warehouse API.[2] The materials contained 2,596 modules with 14,014 attributes, covering components related to YouTube, Google Assistant, Google Books, video search, links, web documents, and crawl infrastructure, among other systems.[6]
The documents did not contain source code or reveal the specific weights assigned to different ranking signals, making it difficult to determine the relative importance of individual attributes within the ranking algorithm.[2][4] The documentation also did not indicate which attributes were actively used in ranking and which may have been deprecated or used for other purposes.[3]
Key revelations
Click data and user behavior
The documents referenced a system called NavBoost, which uses click-driven metrics to adjust the ranking of search results.[4][6] The existence of NavBoost had previously been confirmed during the United States v. Google antitrust trial, where Google VP Pandu Nayak testified that the system had used click data since approximately 2005.[6] Google representatives had previously downplayed or denied the use of click-based signals in rankings.[4]
The documents included attributes for different types of clicks, including "badClicks," "goodClicks," "lastLongestClicks," and related metrics.[4]
Chrome browser data
The documentation contained attributes such as "ChromeInTotal" and "chrome_trans_clicks," which analysts interpreted as indicating that Google uses data from its Chrome browser to assess website quality and popularity.[2][4] Google employees had previously denied using Chrome data for ranking purposes.[4]
Site authority
The documents referenced a "siteAuthority" attribute, which appeared to represent a site-level authority score.[2] Google had publicly denied maintaining such a metric.[4]
Topical coherence metrics
The documentation revealed several attributes related to measuring a website's topical focus, including siteFocusScore, siteRadius, and site2vecEmbeddingEncoded.[6] These attributes were stored within the QualityNsrNsrData module, which is associated with a system called Normalized Site Rank (NSR).[6]
According to King's analysis, the siteFocusScore attribute measures the overall topical coherence of a website, while siteRadius measures how far individual pages deviate from the site's main topic using vector embeddings.[6] The site2vecEmbeddingEncoded attribute stores a compressed vector representation of the entire website's content, analogous to Word2vec at the site level.[6]
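The general idea of embedding-based topical coherence can be illustrated with a short Python sketch. This is not Google's implementation, which the leaked documents do not disclose. It assumes, purely for illustration, that a site's embedding is the mean of its page embeddings, that a page's "radius" is its cosine distance from that mean, and that a focus score is one minus the average radius. The function names and formulas are hypothetical.

```python
import math

# Hypothetical illustration only: the leaked documents do not describe how
# siteFocusScore or siteRadius are computed. All names and formulas below
# are assumptions.

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def site_centroid(page_vectors):
    """Average the page embeddings component by component."""
    n = len(page_vectors)
    return [sum(column) / n for column in zip(*page_vectors)]

def page_radii(page_vectors):
    """Distance of each page embedding from the site centroid."""
    centroid = site_centroid(page_vectors)
    return [cosine_distance(v, centroid) for v in page_vectors]

def focus_score(page_vectors):
    """One minus the mean radius: higher means pages cluster more tightly."""
    radii = page_radii(page_vectors)
    return 1.0 - sum(radii) / len(radii)

# Toy two-dimensional "embeddings" for two imaginary sites.
focused_site = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
scattered_site = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

print(focus_score(focused_site) > focus_score(scattered_site))
```

In this toy example the site whose pages point in nearly the same direction receives the higher focus score, which matches the intuition behind the attributes as analysts described them.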
Content sandboxing
The documents included an attribute called "hostAge" and references to a sandboxing mechanism for new content, suggesting Google may delay the ranking of content from new websites.[2] Google had previously denied the existence of a sandbox for newer websites.[1]
Author identification
The documentation suggested that Google can identify authors and treat them as entities within its system, storing author data associated with documents.[6]
Significance and reception
Contradictions with public statements
The leak was widely noted for appearing to contradict several public statements made by Google employees over the years.[4][5] Fishkin stated that the documents contradicted "the company's repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain's age is collected or considered, and more."[1]
In The Week, the leak was described as providing evidence "that Google hasn't been entirely truthful regarding how its search algorithm worked over the years."[5]
Limitations
Multiple analysts emphasized that the documents had significant limitations as a window into Google's actual ranking practices. The documents did not reveal how different attributes are weighted, whether all named attributes are currently in use, or the specific scoring functions that combine these signals into ranking decisions.[3][2]
The Verge noted that "there's no indication in the documents about how different attributes are weighted" and that "it's also possible that some of the attributes named in the documents [...] might have been deployed at some point" but subsequently removed.[4]
Industry impact
The leak was described as unprecedented in scope for Google's search division. The Register noted that "in the last quarter century, no leak of this magnitude or detail has ever been reported from Google's search division."[2] Tom's Guide described it as providing a rare "glimpse into how Search does its thing."[7]
Fishkin urged journalists covering SEO to adopt a more critical posture toward Google's public statements, writing: "Journalists and publishers of information about SEO and Google Search need to stop uncritically repeating Google's public statements, and take a much harsher, more adversarial view of the search giant's representatives."[1]
Context
The leak occurred during a period of heightened scrutiny of Google's search practices. The United States v. Google antitrust case, brought by the United States Department of Justice, had already revealed some internal details about Google's ranking systems through trial testimony.[6][4] The leak also followed growing public discourse about the perceived declining quality of Google Search results.[1]
See also
- Google Search
- Search engine optimization
- United States v. Google LLC (2020)
- PageRank
- Google Chrome
References
- [1] Sato, Mia (May 28, 2024). "Google won't comment on a potentially massive leak of its search algorithm documentation". The Verge. https://www.theverge.com/2024/5/28/24166177/google-search-ranking-algorithm-leak-documents-link-seo.
- [2] "Google's technical info about search ranking leaks online". The Register. May 29, 2024. https://www.theregister.com/2024/05/29/internal_google_search_documents/.
- [3] Sato, Mia (May 29, 2024). "Google confirms the leaked Search documents are real". The Verge. https://www.theverge.com/2024/5/29/24167407/google-search-algorithm-documents-leak-confirmation.
- [4] Sato, Mia (May 31, 2024). "The biggest findings in the Google Search leak". The Verge. https://www.theverge.com/2024/5/31/24167119/google-search-algorithm-documents-leak-seo-chrome-clicks.
- [5] Klawans, Justin (June 17, 2024). "Why is the tech industry up in arms about Google's search algorithm leak?". The Week. https://theweek.com/tech/google-seo-algorithm-leak.
- [6] King, Michael (May 28, 2024). "Unpacking Google's massive search documentation leak". Search Engine Land. https://searchengineland.com/unpacking-googles-massive-search-documentation-leak-442716.
- [7] "Google Search secrets potentially exposed in massive document leak — what you need to know". Tom's Guide. May 30, 2024. https://www.tomsguide.com/computing/search-engines/google-search-secrets-potentially-exposed-in-massive-document-leak-what-you-need-to-know.
