Detecting Document Similarity With Doc2vec

A step-by-step, hands-on introduction in Python

Omar Sharaki
Towards Data Science
12 min read · Jul 10, 2020


“assorted berries” by William Felker on Unsplash

There is no shortage of ways to analyze and make sense of textual data. Most of these methods belong to an area of artificial intelligence called Natural Language Processing (NLP).

NLP allows us to perform a multitude of tasks where our data consists of text or speech. Sentiment analysis, machine translation, and information retrieval are just a few examples of NLP applications, many of which we use daily. Today, many of these tasks can be solved with a great degree of success using a combination of NLP and machine learning techniques.

In this post, I’d like to illustrate one such method, doc2vec, and hopefully provide some basic insight into how it works and how to implement it.

The task

Put simply, given a large number of text documents, we want to be able to:

  1. Measure how similar the documents are to each other semantically.
  2. Use this information to cluster the documents based on their similarities.
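To make the first goal concrete, here is a minimal sketch of what "measuring similarity" can look like once each document has been turned into a vector. It assumes cosine similarity as the measure, which is a common choice but not something stated above, and the three toy vectors and their names (`doc_a`, `doc_b`, `doc_c`) are made up purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def most_similar(name, vectors):
    """Return the name of the document whose vector is closest to `name`."""
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine_similarity(vectors[name], vectors[other]))


# Hypothetical 3-dimensional document vectors, invented for this sketch.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.3, 0.1],
    "doc_c": [0.1, 0.2, 0.9],
}

for name in doc_vectors:
    print(name, "is closest to", most_similar(name, doc_vectors))
```

Documents whose vectors point in similar directions score close to 1, and grouping documents by such scores is the basic idea behind the clustering goal.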

What we’ll cover

I realize this is a longer post. So before we get started, here’s an outline of everything we’ll cover:

  1. Introducing the dataset
  2. An overview of doc2vec and vector representations
  3. Training the doc2vec models
  4. Visualizing the generated document vectors
  5. Evaluating the models

Feel free to read it at whatever pace feels comfortable to you. I even encourage you to break it up into parts and read it over multiple sessions as you see fit to stay engaged.

The data

To train and test our models, we'll use the 20Newsgroups data set. It consists of about 18,000 newsgroup posts on 20 different topics:

Structure of the 20Newsgroups data set
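If you'd like to inspect the data yourself, one way to get it is through scikit-learn's `fetch_20newsgroups` loader. This is only a sketch under that assumption; nothing above says which loader is used. The helper names `load_newsgroup_posts` and `posts_per_topic` are hypothetical, and the first call downloads the data, so it needs a network connection.

```python
from collections import Counter


def load_newsgroup_posts(subset="all"):
    """Fetch the posts via scikit-learn (assumed loader; downloads on first use)."""
    # Imported here so the rest of the sketch works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(subset=subset)
    return bunch.data, bunch.target, bunch.target_names


def posts_per_topic(targets, target_names):
    """Count how many posts fall under each topic name."""
    counts = Counter(target_names[int(t)] for t in targets)
    return dict(counts)
```

Calling `posts, targets, names = load_newsgroup_posts()` followed by `posts_per_topic(targets, names)` should give a per-topic count that you can compare against the structure described above.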

To speed up training and to make our later evaluation clearer, we limit ourselves to four…

