What is a taxonomy?
A taxonomy is a system for classifying and organizing things. The canonical example is the classification system for organisms, which groups similar species together in a hierarchy.
WordPress is the popular blogging software that, as of 2020, powers about 35% of all websites, including this blog. It provides a framework for users to create their own classification system for their blog.
The WordPress taxonomy framework provides two types of classification: categories and tags.
Categories in WordPress can be hierarchical, much like the classification system for organisms. A category is required for every post, and a user can assign more than one category to a post.
The default category is “Uncategorized” (for American English, the default language). According to my own research (which I will share on this blog soon), about 9% of all posts have this default category, while a further 3.5% are categorized as simply “Blog.” And there are numerous variations of uninformative categories near the top of the most popular list. I would ballpark the percentage of posts classified with uninformative categories at around 20%, conservatively.
Tags are not hierarchical, and are not required. When a user opts to assign tags to a post, the user will generally assign multiple tags.
In fact, the average number of tags assigned per post is about 3.5, while the median is 2. I’ll share how I know this in a later post, but for now here is a little treat: the distribution of posts by number of tags.
Notice how almost 40% of posts have 0 tags.
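A mean of 3.5 sitting well above a median of 2 is what you would expect from a right-skewed distribution: many posts with no tags at all, plus a few posts with a lot of tags. Here is a small sketch of that effect. The tag counts are made up for illustration (they are not my scraped data), chosen only so that the summary statistics land on the same values quoted above.

```python
from statistics import mean, median

# Made-up tag counts for ten posts, for illustration only (not real data).
tags_per_post = [0, 0, 0, 0, 2, 2, 3, 5, 8, 15]

avg = mean(tags_per_post)          # pulled upward by the heavily tagged posts
mid = median(tags_per_post)        # barely affected by them
share_untagged = tags_per_post.count(0) / len(tags_per_post)

print(f"mean={avg:.1f} median={mid} untagged={share_untagged:.0%}")
```

The two posts with 8 and 15 tags are enough to drag the mean up, while the median stays close to what a typical post looks like.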
According to WordPress, the purpose of tags is as follows (this message appears when you try to publish a post without tags):
“Tags help users and search engines navigate your site and find your content. Add a few keywords to describe your post.” (WordPress ‘Add tags’ suggestion)
So ostensibly, classifying posts gives an SEO boost. Traffic is king, so that should be enough motivation for most bloggers.
The Case for Automation
Classifying posts gives an SEO boost; yet almost 40% of posts have no tags, and roughly 20% of posts have uninformative categories. What gives?
I would argue that the current WordPress taxonomy system places too much burden on the user to manually define, and accurately assign, categories and tags. There are no pre-defined labels for users to choose from. And there are no suggestions or hints when assigning labels. The end result is a user experience that leaves a large portion of posts inadequately classified, and drains the time and energy of those users who do make an earnest attempt.
But it doesn’t have to be this way. Document classification (e.g. assigning categories) and keyword extraction (e.g. assigning tags) are solved problems in the fields of machine learning and natural language processing. In other words, the tools needed to automate WordPress taxonomies already exist.
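To make the keyword extraction half concrete, here is a minimal sketch of one classic approach, TF-IDF scoring, written with only the Python standard library. This is an illustration under my own assumptions, not the model WP Auto Taxonomy will ship with: the function names (`tokenize`, `suggest_tags`), the tiny stopword list, and the three-post corpus are all hypothetical.

```python
import math
import re
from collections import Counter

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with", "my", "this"}

def tokenize(text):
    """Lowercase, keep runs of letters, drop stopwords and very short tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS and len(w) > 2]

def suggest_tags(post, corpus, k=3):
    """Return the k terms of `post` with the highest TF-IDF score against `corpus`."""
    docs = [set(tokenize(d)) for d in corpus]
    n_docs = len(docs)
    counts = Counter(tokenize(post))
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        df = sum(term in d for d in docs)            # number of corpus posts containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
        scores[term] = (count / total) * idf
    # Highest score first; ties broken alphabetically so output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical mini-corpus standing in for millions of scraped posts.
corpus = [
    "My favorite sourdough bread recipe",
    "A recipe for quick weeknight pasta",
    "Training a puppy to walk on a leash",
]
post = "Sourdough starter tips: feeding your sourdough starter for a better bread recipe"
print(suggest_tags(post, corpus, k=2))  # ['starter', 'sourdough']
```

Terms that are frequent in the post but rare across other posts float to the top, which is roughly what a good tag should be. With a corpus this small the lower-ranked suggestions are noisy, which is one reason the real thing needs a lot of data.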
WP Auto Taxonomy
As it happens, my day job for the last several years has been as a data scientist learning and applying exactly the ML and NLP techniques needed to automate taxonomies in WordPress. So I’ve begun a project to do just that, which I’m tentatively calling WP Auto Taxonomy.
Open Source It!
I intend WP Auto Taxonomy to be open source, but the servers needed won’t be cheap. I believe an ideal end state for this product is to be integrated within Jetpack as a premium feature, similar to Akismet. Here’s why.
- Akismet is a paid anti-spam tool that very few blogs can live without. Anti-spam can’t be free because, unlike most features, it can’t be delivered as a plugin that contains all the code it needs. Anti-spam models need lots of fresh data in order to keep up with evolving spammers. So Akismet has to collect millions of comments, continuously retrain models, and host armies of servers to apply its latest models to basically every comment in the WordPress universe.
- WP Auto Taxonomy will require basically the same technical design, and a very similar business model. It will need lots of fresh data so that auto-suggestions can be personalized and trend-aware. Just like Akismet, this will require collecting millions of posts, continuously retraining models, and hosting an army of servers to return predictions in real time.
- Unlike Akismet, WP Auto Taxonomy does not need to keep source code secret (if spammers could see Akismet’s code, it would be easier for them to dupe it). Privacy concerns may require certain data used by WP Auto Taxonomy to remain private, but for the most part the data needed is already publicly available (e.g. blog posts and corresponding tags).
- Also unlike Akismet, WP Auto Taxonomy is a nice-to-have rather than a must-have. So the value for users is lower. The marginal cost per user will likely be lower too, so I don’t have any doubts that it’s economically viable. But for reference, Akismet is currently just one of many features included with Jetpack’s Personal plan, which costs $3.50 per month. In the long run, WP Auto Taxonomy has to be bundled with something like Jetpack in order to scale up. I’m confident there will be enough users willing to pay $5 per month (or more) to get the product launched, but I don’t expect mass adoption at that price point.
To launch WP Auto Taxonomy as a minimum viable product, the following milestones need to be hit.
1. Collect seed training data by scraping WP blogs.
2. Explore the data and build some initial models for (a) document classification (categories) and (b) keyword extraction (tags).
3. Deploy models behind an API.
4. Build a WP plugin that will call the API and present category and tag suggestions to the user.
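To give a feel for how the last two milestones fit together, here is a rough sketch of the request/response contract between the plugin and the API. Everything in it is an assumption on my part for illustration: the JSON field names (`content`, `categories`, `tags`), the function names, and especially the two stub "models", which are placeholders standing in for trained classifiers.

```python
import json

def suggest(post_text, classify, extract_tags, max_tags=5):
    """Build the suggestion payload a plugin could show as one-click choices."""
    return {
        "categories": classify(post_text),
        "tags": extract_tags(post_text)[:max_tags],
    }

def handle_request(body, classify, extract_tags):
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    error = json.dumps({"error": "expected a JSON object with a string 'content' field"})
    try:
        text = json.loads(body)["content"]
    except (ValueError, KeyError, TypeError):
        return 400, error
    if not isinstance(text, str):
        return 400, error
    return 200, json.dumps(suggest(text, classify, extract_tags))

# Placeholder models (hypothetical): real ones would be trained on scraped posts.
def stub_classify(text):
    return ["Recipes"] if "recipe" in text.lower() else ["Uncategorized"]

def stub_extract_tags(text):
    return sorted({w for w in text.lower().split() if len(w) > 6})

status, response = handle_request(
    '{"content": "A sourdough recipe"}', stub_classify, stub_extract_tags
)
print(status, response)
```

Keeping the models behind plain callables means the plugin side only ever depends on the JSON shape, so the models can be retrained or swapped out without touching the plugin.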
As of April 2020, I’ve already completed milestone 1 and have dived deep enough into milestone 2 to feel very confident about the technical viability of this project. Milestone 3 is fairly easy compared to the rest, and milestone 4 can be done by a WP plugin developer without any ML knowledge.
I intend to publish my data and code in the coming weeks. If you are interested in contributing, and have experience with ML, NLP, or WP plugin development, please reach out to me here or on LinkedIn.
As a final note, I welcome the expansion of scope of this project beyond taxonomies. There’s tons of opportunity within WP for NLP-powered automation, for everything from automatic excerpts to search enhancements and beyond. If you have any ideas, please share in the comments below.