This blog has been hit by blog scrapers for the past three or four months now. Initially, I did not take it seriously beyond marking the pingbacks from scrapers as spam or just deleting them. But Donace’s comment on the last post here was a good wake-up call, and I thought of writing about scraping both to educate myself and to spread awareness.
What exactly is blog or feed scraping?
Blog scraping (or blog plagiarism) is the process of searching a large number of blogs and copying their content with automated tools. Basically, it is a form of content theft, except that it is not really copy-paste: the scrapers select content by tags and keywords, using tools that work on the RSS feeds of the target blogs. Usually, you can spot scraped content when you get a suspicious trackback/pingback on your WordPress console (and when you follow the trackback you see the same content as your original post there).
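If you would rather not compare the two pages by eye, the check can be roughly automated. The following is only a minimal sketch using Python's standard difflib module: the function names and the 0.8 threshold are my own assumptions, and it expects you to have already extracted the plain text of both posts yourself.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and lowercase so formatting changes do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(original, suspect):
    """Return a ratio between 0.0 and 1.0; 1.0 means the texts are identical."""
    matcher = difflib.SequenceMatcher(
        None, normalize(original), normalize(suspect), autojunk=False
    )
    return matcher.ratio()


def looks_scraped(original, suspect, threshold=0.8):
    """Flag the suspect text when it is at least `threshold` similar.

    The 0.8 default is an arbitrary assumption; tune it for your own posts.
    """
    return similarity(original, suspect) >= threshold
```

A high ratio only tells you the texts are close; it says nothing about who published first, so you still need to look at the suspect page yourself.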
How can scraping affect your blog?
Scraping may be relatively harmless compared to other forms of piracy and hacking. However, the scraper usually gains some advantages at your expense. The following are some of the benefits scrapers get from using your copied content.
- They can receive some traffic from your blog (or original post) using the trackbacks while traffic in the opposite direction is usually nothing
- The scraped blog post can gain an SEO advantage as it sits in a good neighbourhood
- Search engines may treat your original content as the duplicate if a smart splogger goes about it in an organized way
- Since scraping is usually associated with splogs (spam blogs), it can affect your blog’s brand value when your name and tags appear on such blogs
How to deal with scrapers?
Though most of the trackbacks from the scrapers end up in Akismet, it alone may not be sufficient to arrest scraping. The following are some of the techniques to efficiently deal with scraping.
#1 Anti-leech plugin
The anti-leech WordPress plugin basically works the same way as method #3 (described below). It serves substitute content to the sploggers while keeping the actual feed for everyone else. Sploggers are identified by the IP addresses or user-agent strings that you provide in the plugin settings.
#2 .htaccess ban
You can use the .htaccess entry deny from [IP ADDRESS] to block the spam blog bots from accessing your blog. However, please note that it will have a blanket effect if the splogger is working from a shared host that serves several hundred blogs (many good blogs as well) from one IP address.
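If you collect more than a couple of offending addresses, typing the entries by hand gets error-prone. Here is a small sketch in Python that turns a list of IP addresses into the lines to paste into .htaccess. The function name and the sample addresses (taken from the reserved documentation ranges) are assumptions for illustration, and the output uses the older Apache 2.2-style Order/Allow/Deny directives; on Apache 2.4 these only work if mod_access_compat is enabled.

```python
import ipaddress


def htaccess_deny_block(addresses):
    """Return .htaccess lines that deny each of the given IP addresses.

    Every entry is validated with the ipaddress module, so a typo raises
    ValueError instead of silently producing a broken rule.
    """
    lines = ["Order allow,deny", "Allow from all"]
    seen = set()
    for raw in addresses:
        ip = str(ipaddress.ip_address(raw.strip()))
        if ip not in seen:  # skip duplicates
            seen.add(ip)
            lines.append(f"Deny from {ip}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder addresses, not real sploggers.
    print(htaccess_deny_block(["203.0.113.7", "198.51.100.23"]))
```

Keep the shared-hosting caveat in mind: every address you add here blocks all the sites and visitors behind it, not just the splog.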
#3 Feed obfuscation or cloaking
This is a technique that I found on another blog and I have not tried it myself. Basically the idea is to fool splog bots by providing them a different version of the feed content than what is seen by humans and clean bots. You may read about this feed cloaking technique here
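To make the idea concrete, here is a rough sketch of the decision a cloaking setup has to make for each feed request. This is not the code from the linked technique or from any plugin; it is my own illustration in Python, and the blocklist entries, function names and feed arguments are all hypothetical.

```python
import ipaddress

# Hypothetical blocklists; fill these in from your own logs.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_FRAGMENTS = ["examplescraper"]


def is_suspected_splog_bot(remote_ip, user_agent):
    """Return True if the request matches a blocked network or user agent."""
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in network for network in BLOCKED_NETWORKS):
        return True
    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)


def choose_feed(remote_ip, user_agent, real_feed, decoy_feed):
    """Serve the decoy feed to suspected splog bots, the real feed otherwise."""
    if is_suspected_splog_bot(remote_ip, user_agent):
        return decoy_feed
    return real_feed
```

For example, choose_feed("203.0.113.9", "Mozilla/5.0", real, decoy) returns the decoy because the address falls inside the blocked network, while a request from an unlisted address with a normal browser user agent gets the real feed. User-agent strings are trivial to fake, so treat that check as a convenience rather than a guarantee.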
#4 Contact the hosting service of scrapers
Since most splogs do not provide contact information or a contact page, once scraping is discovered it is a good idea to contact the hosting service provider of these spam bloggers directly. Most hosting services (unless they are blackhats themselves) will respond positively to such support queries in an effort to stop plagiarism.
#5 Legal action
If you run a professional blog that has some quality intellectual property, it is a good idea to maintain a copyright notice and to take legal action when copyright infringement is discovered. This may be an expensive process, but it is an effective way to protect yourself against organized splogging. It can also reveal the intentions behind the splogging, for example if it was done to malign your brand. Legal action is usually something that comes up when big corporate blogs or famous bloggers are suffering from scraping.