How Should Corporate Blog Monitoring Cope with Spam?


by: Mark Rogers

An excellent article in Red Herring reporting on work by Umbria in Boulder, Colorado, draws attention to the increasing problems posed by spam blogs or splogs. Apparently spam bloggers have targeted 44 out of the top 100 brands.

This problem has vastly increased in severity since September/October 2005, when the blog spammers seemed to change their game. It is worth pointing out here that blog spam differs from email spam. Email spam is chiefly aimed at getting the recipient to visit a website or engage in a transaction. Blog spam is partly about that, but to a greater extent it is about ensuring that the spam site appears in search results for a popular topic. Once it has succeeded in gaining search engine authority on a particular topic, a spam site might also be useful in lending that authority to a legitimate business which is trying to spoof its way to greater search engine prominence. We will write more about this topic in our forthcoming white paper, a sequel to our “Measuring blogging influence”.

But back to the spam bloggers: previously they had tried to boost the traffic to their sites by posting direct links to gambling and poker sites. From mid-Autumn onwards we noticed increasingly that product names (cars, mobile phones, broadband suppliers) and common search terms were being systematically targeted, often in the context of material which might itself be returned in response to common searches.

This strategy suggests a return to the bad old days before Google, when search prominence could be (and was) spoofed. The retrospective results in Google are still reliable, but live results (and this business is all about live results) are polluted by junk returns.

The immediate problem from our perspective and the perspective of anyone in the corporate intelligence business was simple: how did we maintain the integrity of our live blog search results in the face of this issue? If you are searching on “product name” + keyword, and that product name is suddenly the target of spammers, you are screwed.

In the article the research group Umbria announces that it is pinning its hopes on a linguistic approach, hoping that the spammers will betray themselves by using characteristic patterns. It is an approach that is definitely worth trying. However, Market Sentinel’s researchers find that spam blogs are cunning at reusing genuine content, and that oftentimes you cannot identify that a blog is a spam blog until you have clicked through from the search return to the blog posting itself. This kind of disguise means that any algorithmic filtering is likely to be hard to implement. If a human being cannot spot a fake blog, what chance does a machine have?

For our customers the key question is: is this result important? Is it relevant to me? We have established that the best approach is to filter all our results by the writer’s relevance to a particular issue. We use an algorithm developed by our partners at influence specialists Onalytica to assess the writer’s influence on that issue, and then highlight the most relevant returns in the results we provide to the customer. This approach allows spam-polluted results to be eliminated, saving time and server space. It also helps to create a more valuable monitoring service, since you are highlighting only those commentators who are authoritative.
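
To make the idea concrete, here is a minimal sketch of filtering search returns by the writer's influence. It is an illustration only, not Onalytica's algorithm, whose details are not described here. The `SearchResult` type, the `filter_by_influence` function, the author names, the scores and the 0.5 threshold are all hypothetical; the sketch assumes each writer already has an influence score for the issue in question, and that unknown writers (which would include most splogs) score zero.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    author: str
    text: str


def filter_by_influence(results, influence, threshold=0.5):
    """Keep results whose author meets the threshold, most influential first.

    `influence` maps author name to a score for one issue (assumed to be
    precomputed elsewhere). Authors missing from the map are scored 0.0.
    """
    def score(result):
        return influence.get(result.author, 0.0)

    kept = [r for r in results if score(r) >= threshold]
    return sorted(kept, key=score, reverse=True)


# Hypothetical influence scores for a single issue.
influence = {"alice": 0.9, "bob": 0.6, "carol": 0.2}

# Hypothetical live search returns, including one from an unknown author.
results = [
    SearchResult("http://example.com/a", "bob", "Review of the product"),
    SearchResult("http://example.com/b", "splog123", "Scraped product text"),
    SearchResult("http://example.com/c", "alice", "Analysis of the product"),
    SearchResult("http://example.com/d", "carol", "Passing mention"),
]

top = filter_by_influence(results, influence)
for r in top:
    print(r.author, r.url)
```

The point of the design is that the filter never has to decide whether a posting is spam. It only asks whether the writer has established influence on the issue, so a splog that reuses genuine content is dropped for the same reason as any other unknown source.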
