
Google algorithms: positive + negative selection

Technology — June 2011

Motivated by Google taking a dislike to my book reviews, I've done some thinking about the latest change in its search algorithm.

I learned a bit about Search Engine Optimisation (SEO) some years ago, but after a brief period of interest I basically realised I didn't need to worry about it. If I continued doing what I had always done, which was to design my web sites to be as useful and easy to use as possible, then I would effectively be doing SEO anyway.

Google's traditional search algorithm positively selected pages that were useful to searchers. So if I targeted the same thing, chances are I'd never be too far - in the abstract design structure space of web pages and sites - from wherever Google's algorithm was currently pointing. The metrics in the "quality" and "Google search" spaces would always be different, but I could rely on lots of smart engineers at Google working to make them as close as possible.

The alternative was to target Google directly, but while that might sometimes have produced better results, it would have meant chasing a moving target and joining an "arms race". If you know where someone is going, heading straight for their destination is a much better way to catch them than heading directly towards where they are now!

Now it's clear, however, that negative selection plays a significant role in Google's algorithm, and that changes the situation for people like me, and not for the better.

Google's search results were filling up with spam (variously known as "content farms", "Made for Adsense" pages, scrapers and so on, but I'll stick with "spam"). These are basically automatically generated pages designed to rank highly in Google's search results but with little or no actual content.

So Google has added a negative screen on top of its traditional algorithm. Using human input, both from users and from employees, it has built up a large corpus of spam: web pages that are junk but rank highly in its search results. And it has used these to train some kind of expert system which can be applied across its entire index.
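
To make that concrete, here's a minimal sketch of what such a screen could look like, purely as illustration: Google's real system isn't public, so the choice of a logistic-regression classifier (via scikit-learn), the toy feature vectors and the looks_like_spam helper are all my assumptions.

    # Illustrative only: a "negative screen" trained on human-labelled pages.
    # The classifier choice and feature values are assumptions, not Google's.
    from sklearn.linear_model import LogisticRegression

    # Each page is reduced to a vector of quantifiable features
    # (the next paragraph lists some plausible candidates).
    labelled_pages = [
        # (feature vector, label: 1 = junk, 0 = not junk), judged by humans
        ([0.82, 0.91, 2.0], 1),
        ([0.31, 0.05, 40.0], 0),
        # ... many thousands more ...
    ]

    X = [features for features, label in labelled_pages]
    y = [label for features, label in labelled_pages]

    screen = LogisticRegression().fit(X, y)

    def looks_like_spam(page_features):
        """Apply the trained screen to any page in the index."""
        return screen.predict([page_features])[0] == 1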

Now the problem is that the expert system can't measure "quality" or detect junk directly - that's still too human a concept, which is why Google will have used humans to build the training set - so it will have to use quantifiable features of web pages and sites to make its actual decision. These might include lexical features (Mark's suggestion that my vocabulary is too esoteric), regularity (my book review pages are generated from text files by scripts), link structure (a lot of index pages?), backlink types, and so forth.
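
As a sketch of the kind of quantifiable features I have in mind, again purely illustrative (these particular measures, and the common_words list they lean on, are my guesses, not anything Google has documented):

    import re
    from collections import Counter

    def page_features(text, outbound_links, backlinks, common_words):
        """Reduce a page to a few crude, quantifiable signals.
        Every measure here is an illustrative guess at a 'spam-like' feature."""
        words = re.findall(r"[a-z']+", text.lower())
        vocab = Counter(words)

        # Lexical: how much of the vocabulary falls outside a common-word list?
        rare_vocab_ratio = sum(1 for w in vocab if w not in common_words) / max(len(vocab), 1)

        # Regularity: link-heavy, text-light pages (index pages, scraped templates).
        links_per_word = len(outbound_links) / max(len(words), 1)

        # Backlinks: a crude proxy for whether independent sites link in at all.
        backlink_count = float(len(backlinks))

        return [rare_vocab_ratio, links_per_word, backlink_count]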

Why would my web site look like spam web sites on these measures? Bad luck is a distinct possibility. But it's not unlikely that spammers have looked at successful web sites like mine and deliberately copied their structure and features. Pretty much all the features of a web site - word frequencies, link structures, backlinks - can be emulated. And (rather than paying me to work part-time for twenty years) some plausible-looking content can be generated by scraping sentences from vaguely related sources and munging them around a bit.

Not all spammers are this clever, of course, but there are a lot of them and they have time on their side. (There may not be selective pressure in the Darwinian sense, because failed spam sites probably just stay around, but there'll be strong design/selective pressure helping exploration of the search space.)

So now I'm in a much less pleasant situation. There may well be lots of smart Google engineers still trying to tune their algorithm to select good sites. But there may be even more smart spammers attempting to make web sites that, on measures amenable to algorithmic analysis, are hard to distinguish from sites like mine.

Is introducing negative screening good for Google? Quite possibly. If three out of the top 10 results on a search are spam, then getting rid of two of those at the cost of removing one non-spam "false positive" result will probably be seen as an improvement by most users. So while Google does have an incentive to limit false positives, it's not clear that that incentive is going to outweigh its incentive to clear junk out of its search results.
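
The arithmetic behind that claim is trivial, but worth spelling out (the numbers are just the ones from the example above, not real data):

    # Toy version of the trade-off above: a results page of ten links.
    top_10 = 10
    spam_before = 3
    good_before = top_10 - spam_before   # 7 useful results

    spam_removed = 2      # junk the screen catches
    good_removed = 1      # a non-spam "false positive" removed with it

    spam_after = spam_before - spam_removed    # 1 junk result left
    good_after = good_before - good_removed    # 6 useful results left

    print(f"spam: {spam_before} -> {spam_after}, useful: {good_before} -> {good_after}")
    # Most searchers will notice two fewer junk results more than one
    # missing good one -- unless the missing one was theirs.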

What can I do? Nothing, basically. There are too many spammers, generating too many different web sites using too many different approaches, for trying to avoid them (in the space of site and page design and structure) to be feasible. Also, I have a lot less freedom to manoeuvre than the spammers do. One consolation is that the spammers currently designing sites won't be using mine as an example.

And, on this analysis, changing the appearance of the site is not going to help. I don't believe for a moment that any human looked at my book reviews and said "yes, that site's spam". And if people do prefer glossier-looking pages or social-networking-enabled sites (and feed that back to Google), that would just produce a downgrade in the positive ranking algorithm, not classification as spam. Changing the structure might help, but that depends on the spam filter being retrained, which might well clear me anyway.

