Term Frequency and Inverse Document Frequency at Google

Vernon October 14, 2021

Learning About SEO and How It Works With Language on Pages

Besides the inverted index at Google, two of the concepts you pick up when learning SEO are how often words appear on a page and how often they appear across Google’s index of the Web.

Term Frequency

Term Frequency is a measure of how often a term appears on a page. Some terms are common on most pages, for instance articles like “the,” which might be the most common word in the English language. Less common words can also appear frequently on a page, especially if they are the page’s main topic.

“The” is one of a group of words often treated as stop words because they are so common and don’t tell you very much about the page they appear on. I wrote about stop words in Google Stopwords and Stop-Phrases.

It’s not unusual for a search engine to know the frequency of words on a page. The idea of looking at term frequency in documents dates back to the 1950s.
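To make term frequency concrete, here is a minimal Python sketch that counts the words on a made-up page, with and without stop words. The `STOP_WORDS` set, the `term_frequency` function, and the sample text are all assumptions for illustration, not anything Google has published.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "on", "to", "of", "and"}

def term_frequency(text, drop_stop_words=False):
    """Count how often each lowercased word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return Counter(words)

# Made-up page text.
page = "The cat sat on the mat. The mat belongs to the cat."

all_counts = term_frequency(page)
topic_counts = term_frequency(page, drop_stop_words=True)

print(all_counts.most_common(1))    # the single most frequent word
print(topic_counts.most_common(2))  # most frequent words once stop words are dropped
```

With stop words left in, “the” dominates the counts. Once they are dropped, the words that hint at the page’s topic rise to the top.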

Inverse Document Frequency

Almost 20 years later, in the 1970s, a related concept started to appear. This concept is Inverse Document Frequency.

It can tell you whether a term is common or rare in a corpus of documents.

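You can get it by dividing the total number of documents in the corpus by the number of documents that contain the term. In practice, the logarithm of that ratio is usually taken, so a term that appears in every document scores zero.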
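As a rough sketch of that division, the Python below computes the ratio for a tiny made-up corpus. It also returns the natural logarithm of the ratio, which is the form most textbook definitions of IDF use. The function name and the corpus are assumptions for illustration.

```python
import math

def inverse_document_frequency(term, documents):
    """Return (raw ratio, log-scaled IDF) for a term across a corpus of word sets."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not appear in any document")
    ratio = len(documents) / containing
    return ratio, math.log(ratio)

# Made-up corpus: each document is reduced to the set of words it contains.
corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "indeterminacy", "principle"},
    {"the", "cat", "slept"},
]

common_ratio, common_idf = inverse_document_frequency("the", corpus)
rare_ratio, rare_idf = inverse_document_frequency("indeterminacy", corpus)

print(common_ratio, common_idf)  # "the" is in every document
print(rare_ratio, rare_idf)      # "indeterminacy" is in only one
```

A word that appears in every document gets a ratio of 1 and a log-scaled IDF of 0. A word that appears in only one of the four documents gets a ratio of 4.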

Term Frequency and Inverse Document Frequency

You can look at Term Frequency together with Inverse Document Frequency. Combined, they help you tell whether a page is likely about a certain term. Term Frequency tells you whether the term shows up a lot on that page, and Inverse Document Frequency tells you whether it is a common or a rare term in the index of the Web.
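One common way to combine the two is to multiply a term’s count on a page by the logarithm of its inverse document frequency. The sketch below does that for three made-up pages. It is one textbook weighting among several variants, and I am not claiming it is the formula Google uses. The `tf_idf` function and the sample pages are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document: raw count times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append(
            {term: count * math.log(n_docs / doc_freq[term]) for term, count in counts.items()}
        )
    return scores

# Made-up pages.
pages = [
    "the indeterminacy principle and the meaning of indeterminacy",
    "the cat sat on the mat",
    "the dog chased the cat",
]

scores = tf_idf(pages)
top_term = max(scores[0], key=scores[0].get)
print(top_term, scores[0][top_term])
print(scores[0]["the"])
```

On the first page, “the” and “indeterminacy” both appear twice. “The” is on every page, so its score drops to zero, while “indeterminacy” appears on only one page and comes out on top.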

This approach to term frequency fits in well with understanding where all the words are on the web in an inverted index. Both are very important to search engines and to SEO.
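To show how these counts sit naturally in an inverted index, here is a small Python sketch. Each term maps to the pages it appears on, along with its count on each page. The page names, the text, and the `build_inverted_index` function are made up for illustration, and a real search index is far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: number of times the term appears in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

# Made-up pages keyed by a hypothetical page id.
docs = {
    "page-a": "the cat sat on the mat",
    "page-b": "the dog chased the cat",
    "page-c": "indeterminacy and the cat",
}

index = build_inverted_index(docs)

# The entry for a term holds its term frequency on each page...
print(index["the"])
# ...and the number of pages in that entry is the term's document frequency.
print(len(index["the"]), len(index["indeterminacy"]))
```

Looking up a term gives you both pieces at once: the per-page counts are term frequencies, and the number of pages listed is the document frequency that IDF is built from.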

A page may be about a specific term because that term appears on the page frequently. The term itself may be common or rare in the Web corpus, depending on how many documents on the Web it appears in. A term such as “indeterminacy” has a specific meaning and appears in fewer documents in Google’s index of the Web. It is a rare word.

As an SEO, you can perform keyword research and create text for a page. You can decide what a page may be about. You are placing that page in the web corpus, and it becomes a document that contains that word. A rarer term may have less competition from that corpus. But it may also be searched for less often by someone who might become a customer of the site it is placed on.

Abbreviating Term Frequency-Inverse Document Frequency

Term Frequency – Inverse Document Frequency is often shortened to TF-IDF. Both are concepts search engines work with, and they often appear together because they are so closely related. When I search the USPTO.gov site for patents assigned to Google that mention either concept, I get a little over 350 for each of them. Often the same patent mentions both concepts.

TF-IDF has been part of many algorithms used at Google for a wide range of purposes. Consider that words are a large and important part of the Web index. I remember Term Frequency and Inverse Document Frequency being used to create the query refinements that appear at the bottom of search results pages at Google. It’s worth seeing where else they appear.

TF-IDF at the USPTO Last Week

Sometimes you will see statements about Term Frequency and Inverse Document Frequency in patents, in passages such as this one:

In some implementations, the statistical metric may represent an information content of the matching semantic criteria (e.g., based on a term frequency-inverse document frequency (“tf-idf”) where documents correspond to queries). In an illustrative implementation, if a new piece of information is true for 90% of queries, then the new piece of information may not be useful. The tf-idf may include a numerical statistic reflecting how important a word is to a query in a collection or corpus of queries. The tf-idf value may increase (e.g., proportionally) to the number of times a word appears in the corpus of queries but may be offset by the frequency of the word in the corpus.
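The 90% example in that passage can be illustrated with a few lines of Python. This is only my own sketch of a standard IDF-style weight, log(1 / share), applied to the share of queries something is true for. It is not the formula from the patent, and the function name is an assumption.

```python
import math

def idf_from_share(share_of_queries):
    """IDF-style weight for something true for a given fraction of queries: log(1 / share)."""
    if not 0 < share_of_queries <= 1:
        raise ValueError("share_of_queries must be in the range (0, 1]")
    return math.log(1 / share_of_queries)

common = idf_from_share(0.90)  # true for 90% of queries
rare = idf_from_share(0.01)    # true for 1% of queries

print(round(common, 3), round(rare, 3))
```

Something that is true for 90% of queries gets a weight close to zero, which matches the passage’s point that it may not be useful. Something true for only 1% of queries gets a much larger weight.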

Term Frequency and Inverse Document Frequency is Appearing in Patents About Entity Properties on the Web

That quote is from the following patent, granted July 6, 2021.

Selecting content using entity properties
Inventors: Henrik Jacobsson
Assignee: Google LLC
US Patent: 11,055,312
Granted: July 6, 2021
Filed: October 19, 2016

Abstract

Systems and methods of the disclosure relate to selecting content via a computer network. The system can receive a query to generate content selection criteria. The system can identify an entity of the query and a query graph based on the entity. The system can access a database to identify a template corresponding to the query graph. The template can include a topology and a named variable. The system can determine multiple semantic criteria corresponding to the named variable that matches the query graph. The system can use a statistical metric of each of the matching semantic criteria to select candidate content selection criteria.

Both information retrieval concepts are still in use today, even though SEO is changing to be more about entities than it was before. This patent focuses on finding the properties of entities.

So Term Frequency and Inverse Document Frequency have both been around for more than 50 years as part of information retrieval. Both are still part of modern algorithms at Google, as recently as last week. The Wikipedia page on TF-IDF tells us that “Term Frequency and Inverse Document Frequency is one of the most popular term-weighting schemes today.”

Term Frequency and Inverse Document Frequency Conclusion

Because TF-IDF can be used in many algorithms that work with the words in an index, it is an important tool to understand when it comes to search. When you search an inverted index for specific words, some will be more common and some will be rarer. This isn’t keyword density. It does not calculate the frequency of a word compared to all the words in a document. If you understand what term frequency and inverse document frequency both are, and how they could work together on an inverted index, you have an idea of how search and SEO both work.
