Today we are announcing a new NLP dataset, hosted on Kaggle, based on self-posts from reddit.com. Our aim was to create a text corpus with a large number of distinct classes that still had many examples per class. The result is a dataset of roughly 1M text posts, with 1013 distinct classes (1000 examples per class). The classes are based on the assumed ‘topic’ of the text post, the topics being a manually curated taxonomy based on subreddits (see next section).
It is similar in size and label variety to datasets such as ImageNet in computer vision, though our labels are not individually checked by humans. We felt that there was a lack of interesting, publicly available datasets that fit this profile, even though we have seen private real-world datasets that do (for example, classifying companies into SIC codes based on their websites).
We also think that this type of problem is an interesting counterpoint to well-studied text classification problems with fewer classes, such as sentiment analysis. We find that state-of-the-art techniques for those problems, such as LSTMs, do not always translate seamlessly to the many-class domain. We hope that this dataset will be used to guide research in NLP and extreme classification, and will be of interest to the wider machine learning community.
Reddit is a popular link aggregating website. Users can submit posts, and other users vote them up or down, allowing the most highly rated posts to gain the most attention. Reddit is divided into various ‘subreddits’ based on the types of posts being submitted, for example r/politics or r/MachineLearning. Subreddits are generally created and moderated by the users themselves, rather than the admins of reddit.
There are two main types of post one can submit on reddit: simple URL link posts, and self-posts, which have a title and a body of markdown text written by the user. We found from ad-hoc analyses that the large majority of self-posts were talking about the topic that their subreddit implied, suggesting that this may be an interesting task from a machine learning perspective.
We downloaded all the self-posts in a two-year period (2016/06/01 to 2018/06/01), and applied a number of cleaning steps to find posts that were sufficiently detailed. This left us with about 3,000 subreddits which had 1,000 posts or more.
Classifying into subreddits is often not feasible on its own due to massive overlap between the topics of different subreddits. For example, consider the three subreddits r/buildapc, r/buildmeapc and r/buildapcforme, or the 26 popular subreddits dedicated to the video game League of Legends (each popular character has its own dedicated subreddit). For this reason, we decided to build a taxonomy of subreddits, classifying each subreddit into categories and subcategories, so that we could easily find major overlaps. This was a long and painful process; for the full gory details, see .
Here is a breakdown of top-level categories in our taxonomy in this dataset:
We found a few popular categories of subreddit with many, many subcategories that we were not aware of before this project:
Here are some of the more interesting subreddits that made it into our dataset:
We can map the contents of all the subreddits in our dataset by looking at the word frequencies in their titles/text and using a standard technique (t-SNE) to project these onto a 2D plot. This gives us the following plot.
Subreddits with similar content (in terms of word frequencies) will tend to be mapped closer together. We have also colored each point by the top-level category of its subreddit.
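The mapping described above can be sketched roughly as follows. This is a minimal illustration rather than the code used to produce our plot: the `subreddit_text` dictionary is a hypothetical stand-in for the concatenated titles/text of each subreddit, the TF-IDF weighting of word frequencies is an assumption, and the t-SNE settings are chosen only so that the toy example runs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical stand-in: one concatenated "document" per subreddit.
subreddit_text = {
    "buildapc": "gpu motherboard cpu build budget case fans power supply",
    "buildmeapc": "budget build gpu cpu monitor parts list gaming",
    "baking": "bread dough oven flour yeast sourdough recipe",
    "cooking": "recipe oven pan sauce onion garlic dinner",
    "running": "marathon training pace shoes knee long run",
    "cycling": "bike training ride saddle gears long climb",
}
names = list(subreddit_text)
documents = [subreddit_text[name] for name in names]

# Word-frequency features (TF-IDF weighting is an assumption here).
features = TfidfVectorizer().fit_transform(documents).toarray()

# Project to 2D. Perplexity must be smaller than the number of points,
# so it is set very low for this six-point toy example.
tsne = TSNE(n_components=2, perplexity=2.0, init="random", random_state=0)
coords = tsne.fit_transform(features)

for name, (x, y) in zip(names, coords):
    print(f"{name:12s} {x:9.2f} {y:9.2f}")
```

With only a handful of points the layout is not meaningful; the sketch is only meant to show the shape of the computation (one feature vector per subreddit in, one 2D coordinate per subreddit out).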
Sadly we can’t give too much away about our best performance on this dataset, since it builds upon proprietary research. However, we can give a couple of basic benchmarks based on bag-of-words models (models based on word frequencies). We give benchmarks for Naive Bayes (using unigrams/bigrams, TF-IDF, and chi2 feature selection) and FastText (using Facebook’s official implementation).
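For readers who want a starting point, a Naive Bayes baseline of this general shape can be put together with scikit-learn. This is a minimal sketch under stated assumptions, not our benchmark code: the toy `texts` and `labels` are hypothetical, and the choice of `MultinomialNB` and of `k=20` selected features are illustrative guesses rather than the settings we used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for (self-post text, topic) pairs.
texts = [
    "which gpu and motherboard should I buy for my first pc build",
    "my new pc build will not boot after I installed the gpu",
    "how long should I knead bread dough before it goes in the oven",
    "my sourdough bread came out of the oven dense and flat",
    "what training plan should I follow for my first marathon race",
    "knee pain after a long run while training for a marathon",
]
labels = ["pc_building", "pc_building", "baking", "baking", "running", "running"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams, TF-IDF weighted
    SelectKBest(chi2, k=20),              # keep 20 features by chi2 score (assumed value)
    MultinomialNB(),                      # assumed Naive Bayes variant
)
model.fit(texts, labels)

# Rank the classes for a new post, best guess first.
new_posts = ["the gpu in my pc build keeps overheating"]
probabilities = model.predict_proba(new_posts)
top_k = 3
ranked = np.argsort(-probabilities, axis=1)[:, :top_k]
top_labels = [[str(model.classes_[i]) for i in row] for row in ranked]
print(top_labels)
```

On the real dataset the number of selected features would need to be far larger and tuned on held-out data.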
The metrics we give are "Precision-at-K" (P@K): we give the model K guesses at the subreddit for each self-post, and measure the proportion of the time that one of these guesses is correct (for K=1,3,5).
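To make the metric concrete, here is a small sketch of how Precision-at-K as defined above could be computed. The function name `precision_at_k` and the toy predictions are our own illustrative choices, not part of the dataset or of any library.

```python
from typing import Sequence


def precision_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    true_labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of posts whose true label is among the model's top-k guesses."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need one ranked list per true label")
    if not true_labels:
        raise ValueError("need at least one example")
    hits = sum(
        truth in guesses[:k]
        for guesses, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Toy example: three posts, each with five ranked guesses (best first).
ranked = [
    ["buildapc", "hardware", "techsupport", "pcgaming", "nvidia"],
    ["cooking", "baking", "recipes", "food", "vegan"],
    ["nba", "nfl", "soccer", "hockey", "tennis"],
]
truth = ["buildapc", "recipes", "baseball"]

for k in (1, 3, 5):
    print(f"P@{k} = {precision_at_k(ranked, truth, k):.3f}")
```

In this toy case only the first post is right at K=1, the second becomes a hit at K=3, and the third is never recovered, so the score rises from one third to two thirds and then stays there.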
Interestingly, we found that popular sequential models such as LSTMs, as well as a transfer learning framework (OpenAI’s transformer model), were not competitive with these baselines (in fact, we struggled to get the transformer to train). It would be interesting to know whether this is due to a lack of effort on our part or indicates something more fundamental about their limitations.
The biggest issue with the data is noisy labels: while many subreddits have been omitted for being generally off topic, posts have not been curated individually. We did a cursory analysis to estimate what proportion of posts were ‘good’ enough to be potentially classified at top 5 precision; we believe that number is about 96%.
The taxonomy was also created manually, and due to its size this leaves ample room for human error. If you spot any problems, please send an email to us, or post a topic on the Kaggle discussion page.
At Evolution AI, we specialise in natural language processing, offering an annotation platform, as well as consultancy services for information extraction, classification and text-matching. Our interest in the problem of classifying with many classes is due to work we have done with SIC codes (classifying companies into industry sectors), SOC codes (classifying jobs into categories), as well as tagging large product ranges in e-commerce.
Our platform for annotation has been designed with huge multi-class problems in mind. We make it much quicker and more accurate for users to tag data, and the platform scales to tens or even hundreds of concurrent users (generally necessary to tag large datasets with thousands of classes in reasonable time). We have a great deal of experience working in this difficult domain.
Drop us an email at [email protected] if you want to hear more.