As part of my plan to build an open-source NLP tool for WordPress blogs, I have collected 750,000 articles from 100,000 WordPress blogs. I’m making the data available here for others to explore.
Download
The data is formatted as a gzipped CSV. The file size is 919 MB.
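If you work in Python, the file can be read directly with pandas. Here's a minimal sketch, assuming the download is saved locally as articles.csv.gz (a placeholder name; your filename may differ):

```python
import pandas as pd

# "articles.csv.gz" is a placeholder for whatever the downloaded file is named;
# pandas infers the gzip compression from the .gz extension.
df = pd.read_csv("articles.csv.gz")

print(len(df))              # should be on the order of 750,000 rows
print(df.columns.tolist())  # content, domain, extracted_categories, ...
```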
Sample
A sample view of the data is provided below. Each row is an article with some metadata.
| | content | domain | extracted_categories | extracted_tags | link | source |
|---|---|---|---|---|---|---|
| 14930 | <a href="http://adamgalambos.com/wp-content/up… | adamgalambos.com | [past projects] | [botany, dendrochronology, research, undergrad… | https://adamgalambos.com/2018/12/17/fossil-pol… | feed |
| 224009 | <p><span class="dropcap">M</span>orbi dapibus … | hesedorganics.com | [healthy] | [inspiration, brown, fun, guy, man, person, wi… | https://hesedorganics.com/2016/05/05/quote-of-… | feed |
| 584049 | \n<p>Lorem ipsum dolor sit amet, consectetur a… | fastitchembroidery.com | [style] | [] | https://www.fastitchembroidery.com/a-video-blo… | crawl |
| 540883 | \n<p>We love <em>Crazy Ex-Girlfriend!</em></p>… | betterwordspodcast.com | [show notes] | [danielle binks, author, blogger, project lect… | http://www.betterwordspodcast.com/shownotes-wr… | crawl |
| 24472 | \n<figure class="wp-block-image alignwide"><im… | alainschools.com | [uncategorized] | [depression free, post, student] | https://alainschools.com/blog-depression-free/ | feed |
Data Dictionary
- content – Raw text of the article body, including HTML tags
- domain – Domain name, including TLD
- extracted_categories – List of categories assigned by the author. Note that for articles with source='feed', only one category is included; other assigned categories are mixed in with extracted_tags. This is because WordPress feeds do not distinguish between categories and tags. For articles with source='crawl', all categories are included here.
- extracted_tags – List of tags assigned by the author. Note that for articles with source='feed', this field may include categories as well as tags.
- link – URL of the article
- source – If equal to feed, extracted_categories includes only one category and other categories are mixed in with extracted_tags. If equal to crawl, categories and tags are correctly separated.
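One practical note: depending on how your CSV reader handles the list columns, extracted_categories and extracted_tags may come through as plain strings rather than lists. Here's a rough sketch of one way to parse them back into lists, assuming the bracketed, comma-separated format shown in the sample above:

```python
import pandas as pd

def parse_list_field(value):
    """Turn a string like "[botany, dendrochronology, research]" into a list.

    Assumes the bracketed, comma-separated format shown in the sample table;
    adjust if the actual serialization differs (e.g. JSON, or repr() of a list).
    """
    if not isinstance(value, str):
        return []
    inner = value.strip().strip("[]").strip()
    if not inner:
        return []
    return [item.strip() for item in inner.split(",")]

# "articles.csv.gz" is the same placeholder filename used above.
df = pd.read_csv("articles.csv.gz")
df["extracted_categories"] = df["extracted_categories"].apply(parse_list_field)
df["extracted_tags"] = df["extracted_tags"].apply(parse_list_field)
```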
Data Collection
I collected this data in order to train a classifier that can predict an article’s category and tags. Here is how I ensured that the extracted categories and tags are (mostly) correct.
Site List
This data set contains over 100k unique domains. While I could have crawled the web myself to compile a large list of domains running WordPress, I opted to use an existing list of roughly 500k domains compiled by https://brandnewblogs.com/.
As a side note, Brand New Blogs was ostensibly born on May 1st, 2018, when they began publishing a list of every new WP blog born each day. It is a happy coincidence that my son was born on May 1st, 2018.
Here are a few more notes about the domains included in this data set.
- All started between May 1st, 2018 and September 30th, 2018.
- Most appear to be English-language blogs, but not all.
- Lots are dedicated to unsavory topics which I will not name here.
- Lots have only 1 post, the default “Hello World” post.
- Lots are full of template filler such as “lorem ipsum…”
- I’m not entirely sure of this, but I believe this is a complete cohort of new WP blogs launched during the time period mentioned above. There may be some interesting analysis of WP user behavior possible from that fact, including prevalence of various topics, survival analysis of blogs, and more.
WordPress Feeds
Almost all sites that run on WordPress have an XML feed located at http://<domain>/feed. This feed is consistently structured and contains the content, link, and categories for the 5 most recent articles published by the site.
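For reference, here's roughly what reading one of these feeds looks like in Python with the feedparser library (a sketch rather than the exact code I used; the domain below is just a placeholder):

```python
import feedparser

# Placeholder URL; any WordPress site with a standard feed should behave similarly.
feed = feedparser.parse("http://example-wordpress-blog.com/feed")

for entry in feed.entries:
    link = entry.link
    # Full article HTML when the feed includes it, otherwise the summary.
    content = entry.content[0].value if "content" in entry else entry.get("summary", "")
    # WordPress emits both categories and tags as <category> elements, which
    # feedparser exposes as entry.tags -- hence the ambiguity described below.
    terms = [t.term for t in entry.get("tags", [])]
    print(link, terms)
```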
These feeds are great, but there are two limitations that drove me to do some crawling as well.
- It’s very difficult to personalize predictions for a domain using only 5 articles. I wanted to get every article ever published within a domain.
- The feeds do not separate categories from tags. It was important for me to have this separation, since I expect the ML task to be different for each: categories will be better predicted with document classification, while tags will be better predicted with keyword extraction. Note that 500k of the 750k articles included in this data set come from feeds and thus lack proper separation of categories and tags. I've used a heuristic to separate them for this data set (the first category listed in the feed is always a category, not a tag), but I will likely only use the source='crawl' articles for my modeling purposes. A minimal version of that heuristic is sketched just after this list.
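Here's that heuristic as a minimal sketch, for illustration only rather than the exact pipeline code:

```python
def split_feed_terms(terms):
    """Apply the separation heuristic for source='feed' articles.

    The first term listed in the feed is always a category, so it is split
    off; the remaining terms may be either categories or tags and are left
    mixed together (illustration only, not the exact code used for the data set).
    """
    if not terms:
        return [], []
    return terms[:1], terms[1:]

# Hypothetical example: one category plus a mix of categories and tags.
category, mixed_terms = split_feed_terms(["healthy", "inspiration", "brown", "fun"])
```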
Crawling
While the feeds were insufficient as the sole source of data for my task, they were crucial for validating the accuracy of my extracted content, categories, and tags during crawling. Here’s how the process worked.
- For each domain, I first downloaded the feed and used the content, categories, and tags within it as ground truth.
- I then scraped each link within the feed and developed various rules and heuristics for extracting the known content, categories, and tags that generalized to other links and domains.
- I didn't use any ML here, for two reasons: the precision and recall of the rule-based approach were good enough for a first pass, and building a model would have required painstaking annotation of which HTML elements contained the relevant fields, followed by training a sequence model such as conditional random fields (CRFs) or hidden Markov models (HMMs), neither of which I have experience building. This is left as future work; let me know if you are interested in contributing to that component.
- I found that categories and tags can be extracted with a small number of rules for most domains. So I used global rules for category and tag extraction.
- The HTML element containing the article's content varied enormously across domains, but luckily not much within a domain. So for each domain I performed a search for the correct article container by measuring the similarity of text within each HTML element to the known article content (a rough sketch of this search appears after this list).
- Once the rules were developed, I applied them to every link within the feed. If the extracted content, categories, or tags were incorrect above some thresholds, I threw out the entire domain.
- For all domains that passed this quality control check, I crawled the entire domain.
- WordPress has some features that made this fairly easy. One of the purposes of categories and tags is to create archive pages that list every article assigned a given category or tag. And the URLs for those archive pages are very predictable (in fact one of the rules used to extract the categories and tags utilized this URL structure).
- So the crawl was seeded with the categories listed in the feed and the archive URLs discovered during the QC check. Each archive page was scraped to get all the article links for that category or tag.
- Another layer of QC would have been possible here, to vet that the articles with a given category or tag were actually included in the archive page. That is left as future work.
- Once all the article links were scraped, all the categories and tags discovered in those new articles were used to reseed the crawl, and the process started over again.
- The crawl ended when all known archive pages and article links had been scraped.
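To make the container search step a bit more concrete, here's a rough sketch of the idea using requests, BeautifulSoup, and difflib. The candidate element list, the scoring, and the return value are simplified illustrations, not the actual rules used to build the data set.

```python
from difflib import SequenceMatcher

import requests
from bs4 import BeautifulSoup


def find_content_container(article_url, known_text):
    """Find the HTML element whose text best matches the article content
    already known from the feed.

    Simplified illustration of the per-domain container search described
    above; the real pipeline applied more rules and quality thresholds.
    """
    html = requests.get(article_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    best_score, best_element = 0.0, None
    # Candidate containers: block-level elements likely to wrap article text.
    for element in soup.find_all(["article", "main", "section", "div"]):
        text = element.get_text(" ", strip=True)
        if not text:
            continue
        score = SequenceMatcher(None, text, known_text).ratio()
        if score > best_score:
            best_score, best_element = score, element

    if best_element is None:
        return None
    # Identify the winner by tag name and class so the same "selector" can be
    # reused for other articles on the domain.
    return best_element.name, tuple(best_element.get("class") or []), best_score
```

In practice you would cache the winning tag/class pair per domain and reuse it for every article on that domain, rather than re-running the search for each link.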
Next Steps
I hope this data can serve as a tool for you to build something cool and/or practice some analytics techniques. Personally, I plan to do both. I’ve already spent many hours cleaning and exploring the data, and have begun training models for one of my two goals, category prediction.
If you choose to download this data and explore it, I hope you will also choose to share your insights and ideas. I look forward to seeing what you discover.