Tf is tf-idf?

Ever thought about how search queries brought the most relevant results to the top? Me neither. But learning about how it works is still very very cool.

Tf-idf is the method that uses algortihms and fun math to figure out which document/web page is most relevant to your search query. Tf-idf stands for: text frequency - inverse document frequency. It is used to balance out the frequency of a word in a certain area with it’s actual value. Words like “the” and “I” may have very high frequencies, but they aren’t very significant to the overall meaning of the work. Inverse document frequency helps balance the equation, words of very high frequency throughout the document are crossed off and words that are repeated just the right amount of times rise to the top in terms of value. In this way, we can both find relevant texts in a search query as well as find the most significant words in a text.

For example, if you search for “a class of monkeys”, the words “a” and “of” are quite common and so tf-idf would help find documents that have a higher frequency of “class” and “monkeys” and bring those to the top of the search result list.

Tf-idf can also help in analyzing or criticizing a literary work. For example, we can run a tf-idf test on a book we’ve never read and using the words with the highest weights/values, we can discover what the book’s main topic is. We can run this test on specific characters and learn about their roles in the work or what their main concerns are.

Tf-idf is a very good example of Digital Humanities. We can analyze literary works in the traditional sense by reading closely, but now we have another way of understanding the main idea of a text using digital tools. Of course, tf-idf aren’t always 100% accurate as it doesn’t account for implied meanings of certain phrases or texts, but it’s definitely a start to a field well worth exploring.

Written on September 9, 2016 by Saja Hamayel