Wednesday, 24 April 2013

[Build Backlinks Online] Machine Learning and Link Spam: My Brush With Insanity



Posted by wrttnwrd
This post was originally in YouMoz, and was promoted to the main blog because it
provides great value and interest to our community. The author's views are
entirely his or her own and may not reflect the views of SEOmoz, Inc.



Know someone who thinks they're smart? Tell them to build a machine learning
tool. If they majored in, say, History in college, within 30 minutes they'll be
curled up in a ball, rocking back and forth while humming the opening bars of
Oklahoma.

Sometimes, though, the alternative is rooting through 250,000 web pages by
hand, checking them for compliance with Google's TOS. Doing that will skip you
right past the rocking-and-humming stage and launch you straight into the
writing-with-crayons-between-your-toes phase.

Those were my two choices six months ago. Several companies came to Portent
asking for help with Penguin/manual penalties. They all, for one reason or
another, had dirty link profiles.

Link analysis, the hard way. Back when I was a kid...

I did the first link profile review by hand, like this:


Download a list of all external linking pages from SEOmoz, MajesticSEO, and
Google Webmaster Tools.

Remove obviously bad links by analyzing URLs. Face it: if a linking page is on
a domain like FreeLinksDirectory.com or ArticleSuccess.com, it's gotta go.

Analyze the domain and page trustrank and trustflow. Throw out anything with a
zero, unless it's on a list of whitelisted domains.

Grab thumbnails of each remaining linking page, using Python, Selenium, and
PhantomJS. You don't have to do this step, but it helps if you're going to get
help from other folks.

Get some poor bugger (er, a faithful Portent team member) to review the
thumbnails, quickly checking off whether they're forums, blatant link spam, or
something else.
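
The two cheap filters in that list (bad-looking domains, then zero trust with a
whitelist exception) are easy to automate. Here is a minimal sketch, not the
script I actually used. The `url` and `trust_flow` keys, the hint words, and the
whitelist are all made-up placeholders you would swap for your own data.

```python
from urllib.parse import urlparse

# Hypothetical hint words and whitelist; build your own from real reviews.
BAD_DOMAIN_HINTS = ("freelinks", "directory", "articlesuccess")
WHITELIST = {"example.org"}


def domain_of(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def keep_link(link):
    """Drop bad-looking domains, then drop zero-trust pages unless whitelisted."""
    domain = domain_of(link["url"])
    if any(hint in domain for hint in BAD_DOMAIN_HINTS):
        return False
    return link["trust_flow"] > 0 or domain in WHITELIST


links = [
    {"url": "http://www.FreeLinksDirectory.com/page1", "trust_flow": 12},
    {"url": "http://example.org/resources", "trust_flow": 0},
    {"url": "http://someblog.example.com/post", "trust_flow": 0},
    {"url": "http://news.example.net/story", "trust_flow": 31},
]
survivors = [link["url"] for link in links if keep_link(link)]
print(survivors)
```

Only the whitelisted zero-trust page and the page with a real trust score
survive. Everything that survives still needs human eyes.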


After all of that prep work, my final review still took 10+ hours of
eye-rotting agony.

There had to be a better way. I knew just enough about machine learning to
realize it had possibilities, so I dove in. After all, how hard can it be?

Machine learning: the basic concept

The concept of machine learning isn't that hard to grasp:


Take a large dataset you need to classify. It could be book titles, people's
names, Facebook posts, or, for me, linking web pages.

Define the categories. In this case, I'm looking for "spam" and "good."

Get a collection of those items and classify them by hand. Or, if you're really
lucky, you find a collection that someone else classified for you. The Natural
Language Toolkit, for example, has a movie reviews corpus you can use for
sentiment analysis. This is your training set.


Pick the right machine learning tool (hah).

Configure it correctly (hahahahahahaha heee heeeeee sniff haa haaa sorry, I'm
ok ha ha haaaaaaauuuugh).

Feed in your training set, with the features (the item attributes used for
classification pre-selected. The tool will find patterns, if it can (giggle).

Use the tool to compare each item in your dataset to the training set.

The tool returns a classification of each item, plus its confidence in the
classification and, if it's really cool, the features that were most critical in
that classification.


If you ignore the hysterical laughter, the process seems pretty simple. Alas,
the laughter is a dead giveaway: these seven steps are easy the same way "Fly to
moon, land on moon, fly home" is three easy steps.

Note: At this point, you could go ahead and use a pre-built toolset like BigML,
Datameer, or Google's Prediction API. Or, you could decide to build it all by
hand. Which is what I did. You know, because I have so much spare time. If you're
unsure, keep reading. If this story doesn't make you run, screaming, to the
pre-built tools, start coding. You have my blessings.

The ingredients: Python, NLTK, scikit-learn

I sketched out the process for IIS (Is It Spam, not Internet Information
Server) like this:


Download a list of all external linking pages from SEOmoz, MajesticSEO, and
Google Webmaster Tools.

Use a little Python script to scrape the content of those pages.

Get the SEOmoz and MajesticSEO metrics for each linking page.

Build any additional features I wanted to use. I needed to calculate the
reading grade level and links per word, for example. I also needed to pull out
all meaningful words, and a count of those words.

Finally, compare each result to my training set.
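
To make the feature-building step concrete, here is a rough sketch of
calculating links per word and counting meaningful words for one scraped page.
It is an illustration only: it uses Python's standard-library `html.parser` and
a tiny made-up stop-word list as stand-ins for NLTK, and the `page_features`
function and its output keys are names I invented for this example.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# A tiny stand-in stop-word list; NLTK's stopwords corpus is far larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "our"}


class PageText(HTMLParser):
    """Collect visible text and count <a href> links."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.link_count = 0
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_features(html):
    """Return word count, links per word, and the most common meaningful words."""
    parser = PageText()
    parser.feed(html)
    words = re.findall(r"[a-z']+", " ".join(parser.chunks).lower())
    meaningful = [w for w in words if w not in STOP_WORDS]
    return {
        "word_count": len(words),
        "links_per_word": parser.link_count / len(words) if words else 0.0,
        "top_words": Counter(meaningful).most_common(3),
    }


sample = (
    "<html><body><p>Cheap wedding dresses and cheap wedding rings.</p>"
    '<a href="http://example.com/a">cheap dresses</a> '
    '<a href="http://example.com/b">wedding rings</a></body></html>'
)
features = page_features(sample)
print(features)
```

Two links across eleven words is a very high links-per-word ratio, which is
exactly the kind of signal a page-level feature like this is meant to surface.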


To do all of this, I needed a programming language, some kind of natural
language processing (to figure out meaningful words, clean up HTML, etc.) and a
machine learning algorithm that I could connect to the programming language.

I'm already a bit of a Python hacker (not a programmer; my code makes
programmers cry), so Python was the obvious choice of programming language.

I'd dabbled a little with the Natural Language Toolkit (NLTK). It's built for
Python, and would easily filter out stop words, clean up HTML, and do all the
other stuff I needed.

For my machine learning toolset, I picked a Python library called scikit-learn,
mostly because there were tutorials out there that I could actually read.

I smushed it all together using some really-not-pretty Python code, and
connected it to a MongoDB database for storage.

A word about the training set

The training set makes or breaks the model. A good training set means your
bouncing baby machine learning program has a good teacher. A bad training set
means it's got Edna Krabappel.

And accuracy alone isn't enough. A training set also has to cover the full range
of possible classification scenarios. One good and one spam page aren't enough.
You need hundreds or thousands to provide a nice range of possibilities.
Otherwise, the machine learning program will stagger around, unable to classify items
outside the narrow training set.

Luckily, our initial hand-review reinclusion method gave us a set of
carefully-selected spam and good pages. That was our initial training set. Later
on, we dug deeper and grew the training set by running Is It Spam and
hand-verifying good and bad page results.

That worked great on Is It Spam 2.0. It didn't work so well on 1.0.

First attempt: fail

For my first version of the tool, I used a Bayesian Filter as my machine
learning tool. I figured, hey, it works for e-mail spam, why not SEO spam?

Apparently, I was already delirious at that point. Bayesian filtering works for
e-mail spam about as well as fishing with a baseball bat. It does occasionally
catch spam. It also misses a lot of it, dumps legitimate e-mail into spam
folders, and generally amuses serious spammers the world over.

But, in my madness, I forgot all about these little problems. Is It Spam 1.0
seemed pretty great at first. Initial tests showed 75% accuracy. That may not
sound great, but with accurate confidence data, it could really streamline link
profile reviews. I was the proud papa of a baby machine learning tool.

But Bayesian filters can be poisoned. If you feed the filter a training set
where 90% of the spam pages talk about weddings, it's possible the tool will
begin seeing all wedding-related content as spam. That's exactly what happened in
my case: I fed in 10,000 or so pages of spammy wedding links (we do a lot of
work in the wedding industry). On the next test run, Is It Spam decided that
anything matrimonial was spam. Accuracy fell to 50%.
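
You can reproduce the poisoning effect on a toy scale. The sketch below is a
bare-bones multinomial naive Bayes classifier with add-one smoothing, written
from scratch for illustration. It is not the filter Is It Spam 1.0 used, and
every training sentence in it is invented.

```python
from collections import Counter
import math


def train(docs):
    """docs: list of (label, text). Returns per-label word counts and doc counts."""
    word_counts = {"spam": Counter(), "good": Counter()}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def spam_probability(text, model):
    """P(spam | text) under multinomial naive Bayes with add-one smoothing."""
    word_counts, doc_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["good"])
    log_score = {}
    for label in ("spam", "good"):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Convert the two log scores into a probability for "spam".
    return 1 / (1 + math.exp(log_score["good"] - log_score["spam"]))


balanced = [
    ("spam", "cheap pills buy links now"),
    ("spam", "buy cheap wedding links now"),
    ("good", "our wedding photography portfolio"),
    ("good", "planning a spring wedding reception"),
]
# Poison the set: pile on spam examples that all talk about weddings.
poisoned = balanced + [("spam", "wedding dresses wedding rings buy links")] * 20

honest_page = "wedding dresses and wedding rings"  # a legitimate bridal shop
before = spam_probability(honest_page, train(balanced))
after = spam_probability(honest_page, train(poisoned))
print(round(before, 2), round(after, 2))
```

On this toy data the bridal-shop page goes from "probably good" to "almost
certainly spam" once the wedding-heavy spam examples are added. The page never
changed. Only the training set did.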

Since we tend to use the tool to evaluate sites in specific verticals, this
would never work. Every test would likely poison the filter. We could build the
training set to millions of pages, but my pointy little head couldn't contemplate
the infrastructure required to handle that.

The real problem with a pure Bayesian approach is that there's really only one
feature: The content of the page. It ignores things like links, page trust and
authority.

Oops. Back to the drawing board. I sent my little AI in for counseling, and a
new brain.

Note: I wouldn't have figured this out without help from SEOmoz's Dr. Pete and
Matt Peters. A hat tip doesn't seem like enough, but for now, it'll have to do.

Second attempt: a qualified success

My second test used logistic regression. This machine learning model uses
numeric data, not text. So, I could feed it more features. After the first
exercise, this actually wasn't too horrific. A few hours of work got me a tool
that evaluates:


Page TrustFlow and CitationFlow (from MajesticSEO; I'm adding SEOmoz metrics
now)

Links per word

Page Flesch-Kincaid reading grade level

Page Flesch-Kincaid reading ease

Words per page

Syllables per page

Characters per page

A few other seemingly-random bits, like images per page, misspellings, and
grammar errors
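
For anyone curious about the readability features: both scores come from the
standard published Flesch formulas, which need only word, sentence, and
syllable counts. Here is a minimal sketch. The syllable counter is a crude
vowel-group heuristic I am assuming for illustration, so expect it to miscount
plenty of real words, and the function names are made up.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return basic counts plus Flesch reading ease and Flesch-Kincaid grade."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    return {
        "words": len(words),
        "syllables": syllables,
        "characters": sum(len(w) for w in words),
        "reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "grade_level": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


scores = readability("The cat sat on the mat.")
print(scores)
```

Very simple text scores a reading ease above 100 and a grade level below zero.
That is normal for these formulas, not a bug.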


This time, the tool worked a lot better. With vertical-specific training sets,
it ran with 85%+ accuracy.
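
If you have never seen logistic regression up close, here is the general shape
of it. This is a from-scratch sketch trained with plain gradient descent,
standing in for the scikit-learn model the real tool uses. The three features,
their scaling, and every number in the training rows are invented for
illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit weights and bias with batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def spam_confidence(x, model):
    """Return the model's estimated probability that a page is spam."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up, pre-scaled features: [trust_flow / 100, links_per_word, grade_level / 20]
training_rows = [
    [0.05, 0.30, 0.20],  # spam
    [0.02, 0.25, 0.15],  # spam
    [0.10, 0.40, 0.25],  # spam
    [0.45, 0.03, 0.50],  # good
    [0.60, 0.02, 0.55],  # good
    [0.35, 0.05, 0.45],  # good
]
training_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = good

model = train_logistic(training_rows, training_labels)
spammy = spam_confidence([0.03, 0.35, 0.20], model)
clean = spam_confidence([0.50, 0.02, 0.50], model)
print(round(spammy, 2), round(clean, 2))
```

The output is a confidence between 0 and 1, not a hard yes or no. That
confidence is what lets a human reviewer start with the borderline pages.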

In case you're wondering, this is what victory looks like:



When I tried to use the tool for more general tests, though, my coded kid
tripped over its big, adolescent feet. Some of the funnier results:


It saw itself as spam.

It thought Rand's blog was a swirling black hole of spammy despair.


False positives remain a big problem if we try to build a training set outside
a single vertical.

Disappointing. But the tool chugs along happily within verticals, so we
continue using it for that. We build a custom training set for each client, then
run the training set against the remaining links. The result is a relatively
clear report:



Results and next steps

With little IIS learning to walk, we've cut the brute-force portion of large
link profile evaluations from 30 hours to 3 hours. Not. Too. Shabby.

I tried to launch a public version of Is It Spam, but folks started using it to
do real link profile evaluations, without checking their results. That scared
the crap out of me, so I took the tool down until we cure the false positives
problem.

I think we can address the false positives issue by adding a few features to
the classification set:


Bayesian filtering: Instead of depending on a Bayesian classification as 100%
of the formula, we'll use the Bayesian score as one more feature.

Grammar scoring: Anyone know a decent grammar testing algorithm in Python? If
so, let me know. I'd love to add grammar quality as a feature.

Anchor text matters a lot. The next generation of the tool needs to score the
relevant link based on the anchor text. Is it a name (like in a byline)? Or is
it a phrase (like in a keyword-stuffed link)?

Link position may matter, too. This is another great feature that could help
with spam detection. It might lead to more false positives, though. If Is It
Spam sees a large number of spammy links in press release body copy, it may
start rating other links located in body copy as spam, too. We'll test to see if
the other features are enough to help with this.


If I'm lucky, one or more of these changes may yield a tool that can evaluate
pages across different verticals. If I'm lucky.

Insights

This is by far the most challenging development project I've ever tried. I
probably wore another 10 years' enamel off my teeth in just six weeks. But it's
been productive:


When you start digging into automated page analysis and machine learning, you
learn a lot about how computers evaluate language. That's awfully relevant if
you're a 21st Century marketer.

I uncovered an interesting pattern in Google's Penguin implementation. This is
based on my fumbling about with machine learning, so take it with a grain of
salt, but have a look here.

We learned that there is no such thing as a spammy page. There are only spammy
links. One link from a particular page may be totally fine: For example, a brand
link from a press release page. Another link from that same page may be spam:
For example, a keyword-stuffed link from the same press release.

We've reduced time required for an initial link profile evaluation by a factor
of ten.


It's also been a great humility-building exercise.






You may view the latest post at
http://feedproxy.google.com/~r/seomoz/~3/2S1MEON6pH8/machine-learning-and-link-spam-my-brush-with-insanity

