In late May 2024, the internal documentation for Google Search’s Content Warehouse API was leaked.
There hasn’t been a leak this big or detailed from Google’s search division since the company launched. The leaked documents reveal many details that Google has kept hidden, or even denied, for a long time.
This topic is very hot, so let’s dive right in without a lengthy introduction. We’ll go over how this happened and how SEO specialists can use this data.
Background
Back in early May, an anonymous user shared internal Google Search API documents with Rand Fishkin (co-founder of Moz and creator of their Domain Authority metric). Rand verified the source’s identity himself through email and a video call. After that, he asked Michael King (founder and CEO of iPullRank) to analyze the data. On May 27th, they published the information along with their analysis.
You can find all the leaked data through this link.
What’s in the docs?
The leak consists of more than 2,500 pages of API documentation containing 14,014 attributes (API features) that appear to come from Google’s internal Content Warehouse API. Many of these attributes appear to relate to Google’s ranking process.
However, this documentation doesn’t show the weight of particular elements in the search ranking algorithm. It also doesn’t indicate which elements are actually used in the ranking systems. But it does show incredible detail about the data Google collects.
Here’s an example of the document format:
It’s similar to guidelines for Google team members, outlining what variables are available, what their functions are, and how to work with them.
Note: The documentation was up to date as of last summer (it also references changes from 2023 and earlier years, dating back to 2005), and possibly even as of the March 2024 date of disclosure. But there is no guarantee that this is the most recent version of such ‘instructions.’ For example, there are no mentions of AI Overviews. There are also some deprecated features (although they are marked as no longer in use).
In any case, this documentation contains a lot of relevant and important data. Let’s take a look.
Google myths revealed
To minimize manipulation of search results, the Google team has closely guarded the details of how their algorithms work and what truly influences rankings.
Now, thanks to the leak, we can check what Google has said publicly against its own documentation. Many claims that Google representatives have made about various aspects of search engine optimization turn out to be untrue: much of the leaked data directly contradicts Google’s official public statements.
Let’s take a look at some of the most popular myths debunked by the leaked documentation.
Domain Authority
Google spokespeople have said numerous times that Google doesn’t use domain authority to rank pages. For example, John Mueller has repeatedly said this. Here is one of his comments on Reddit:
This was also mentioned in the Google Webmaster Central Office Hours.
Revealed
The leaked documentation says the opposite. Google uses the siteAuthority feature as one of the signals for page quality to be used in the Q* ranking system.
Sandbox
Google has repeatedly claimed that there is no such thing as a “sandbox” for new sites, meaning their age does not affect their ranking. John Mueller stated this in 2019.
He also said in 2017 that domain age does not influence rankings in Google’s search results.
Revealed
Leaked documentation mentions a hostAge attribute used “to sandbox fresh spam in serving time.” This fact fully contradicts Google’s denial of a sandbox for new websites.
Chrome data
Matt Cutts claimed previously that Google does not use Chrome data for search ranking or quality purposes.
Ten years later, John Mueller confirmed this again.
“I don’t think we use anything from Google Chrome for ranking. So the only thing that happens with Chrome is for the page experience report, we use the Chrome user experience report data, which is kind of that aggregated data of what users saw when they went to the website, with regards to the page experience specifically.”
Revealed
Leaked documentation shows that Chrome data is used by Google for ranking. For example, it is used to generate the Sitelinks SERP feature. Another module related to page quality scores includes a site-level measure of views from Chrome.
More SEO-worthy discoveries
This documentation is also valuable because it confirms many assumptions SEOs arrived at based on their practical experiences that were denied (or at least unconfirmed) by Google. Let’s look at some key discoveries from this document.
Links
Links remain important for Google, with metrics like sourceType indicating a loose relationship between the value of a page and its indexing location.
This means the higher the index tier a page sits in, the more valuable a link from that page. Pages considered “fresh” are also treated as high quality. In other words, getting links from highly ranking pages and from new pages yields better ranking performance. This could also be why websites that generate links from fresh, high-quality pages at scale see more benefit than those relying on traditional link earning, where links may come from outdated content.
In this context, it’s also worth mentioning PageRank, which remains relevant, as evidenced by the leaked documentation. The data shows that Google decides how to value a link based on how much they trust the homepage. Homepage PageRank is considered for all pages.
As always, in your link-building strategy, you should focus on the quality and relevance of your links and not just the volume.
Content
As for the content, there are several interesting points in this documentation. Let’s take a quick look at them.
- Short content is scored for originality.
Google evaluates the originality of short content and gives it an OriginalContentScore (from 0 to 512). Therefore, it is likely involved in the GSC functionality of defining thin content, which is not just a matter of content length.
- Google is focused on fresh content.
The documents show Google’s attempts to associate dates with pages. The following attributes prove this: bylineDate (the explicitly set date on the page), syntacticDate (an extracted date from the URL or in the title), and semanticDate (date derived from the content of the page).
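To make the idea of a URL-based date more concrete, here is a minimal Python sketch of how you might audit your own pages for date consistency. It is not Google’s implementation: the regular expression, the function names `date_from_url` and `dates_agree`, and the one-day tolerance are all assumptions made for illustration. The sketch only pulls a `YYYY/MM/DD`-style date out of a URL and compares it with the byline date you publish on the page.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: a four-digit year followed by month and day,
# separated by "/" or "-", for example /2024/05/27/ or 2024-05-27.
_URL_DATE = re.compile(r"(?<!\d)(19\d{2}|20\d{2})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url: str) -> Optional[date]:
    """Return the first valid calendar date found in the URL, or None."""
    for match in _URL_DATE.finditer(url):
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            continue  # not a real date (e.g. month 13), keep looking
    return None


def dates_agree(byline: Optional[date], url: str, tolerance_days: int = 1) -> bool:
    """Check whether the byline date and the URL date are close to each other.

    If either date is missing there is nothing to compare, so this returns True.
    """
    url_date = date_from_url(url)
    if byline is None or url_date is None:
        return True
    return abs((byline - url_date).days) <= tolerance_days


url = "https://example.com/blog/2024/05/27/google-leak"
print(date_from_url(url))
print(dates_agree(date(2024, 5, 27), url))
```

A check like this can flag pages where the visible byline and the URL tell different stories about when the content was published.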
Authors
Google places heavy emphasis on E-E-A-T. If you ever had any doubts about the importance of content authorship for ranking, this documentation dispels them. It clearly indicates that Google explicitly stores author information.
The search engine also verifies if an entity on the page is also the page’s author.
Panda algorithm
According to the documentation, to determine quality content, Google uses a scoring modifier based on user behavior and external links, applying it at various levels (domain, subdomain and subdirectory).
The document pays significant attention to NavBoost’s data (or click data), which focuses on relevancy and user intent. The documentation proves that the search engine uses it in ranking.
Google’s documentation clarifies that Panda is far simpler than we thought. You just need to create high-quality, relevant content that receives many user clicks. Focusing on getting more relevant traffic and improving user experience will show Google that your page should rank higher.
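If you want to look at your own click data at the same three levels mentioned above (domain, subdomain, and subdirectory), a rough Python sketch could look like the one below. It says nothing about how Google computes its modifier. The function names `url_levels` and `aggregate_clicks` are hypothetical, the input is assumed to be a simple mapping of URL to clicks exported from your analytics tool, and the domain is naively taken as the last two host labels, which is wrong for suffixes such as `co.uk`.

```python
from collections import Counter, defaultdict
from typing import Dict
from urllib.parse import urlsplit


def url_levels(url: str) -> Dict[str, str]:
    """Split a URL into domain, subdomain (full host), and first subdirectory."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    # Naive assumption: the registrable domain is the last two labels.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    segments = [segment for segment in parts.path.split("/") if segment]
    # Without a trailing slash, the last segment is treated as a page, not a directory.
    directories = segments if parts.path.endswith("/") else segments[:-1]
    subdirectory = f"{host}/{directories[0]}/" if directories else f"{host}/"
    return {"domain": domain, "subdomain": host, "subdirectory": subdirectory}


def aggregate_clicks(clicks_by_url: Dict[str, int]) -> Dict[str, Counter]:
    """Sum clicks per domain, per subdomain, and per first-level subdirectory."""
    totals: Dict[str, Counter] = defaultdict(Counter)
    for url, clicks in clicks_by_url.items():
        for level, key in url_levels(url).items():
            totals[level][key] += clicks
    return totals


sample = {
    "https://blog.example.com/seo/google-leak": 10,
    "https://blog.example.com/seo/panda": 5,
    "https://www.example.com/pricing": 7,
}
for level, counter in aggregate_clicks(sample).items():
    print(level, dict(counter))
```

Grouping clicks this way makes it easier to spot a weak subdirectory or subdomain that may be dragging down an otherwise healthy site.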
Demotions
The document also contains information about the reasons for ranking drops. Various demotions are applied for issues like:
- Anchor mismatch
- SERP dissatisfaction
- Exact match domains
- Spammy product reviews
- Adult content, etc.
This information isn’t groundbreaking, but it will help you confirm that you’re on the right track and remind you what to avoid.
Can this data be trusted?
These internal docs are most likely authentic, as stated by Michael King:
In addition, Rand Fishkin conducted his own verification of the data’s reliability. He asked ex-Googlers to check the documentation, and they confirmed that it looks like genuine internal documents.
So, it’s up to you whether or not to trust this leak, but there are many reasons to believe this documentation is genuine.
Summary
This leak dispelled (or confirmed) many doubts about Google’s internal workings.
Although Google aims to help and guide webmasters, it’s important to understand that they also take care to avoid giving spammers opportunities to manipulate search results.
The best way to gain valuable insights and a real understanding of SEO is through personal experience and practice. It’s crucial to evaluate all external opinions critically, even if they come from Google.