How to Foil Blog Content Harvesters

June 28, 2008

On the LinkedInBloggers group, there is an interesting discussion on how to prevent blog harvesting. Turning off RSS feeds and subscription feeds seemed to be the suggested solution. I think this is impractical: it makes your blog harder to find and harder to consume with RSS readers.

I wonder if disabling RSS/subscription widgets is really the only way. What if there were a simple way to ensure that your content displays in a browser only when it is served from your site, and that when it is displayed on a harvester’s site, the browser is simply redirected back to your site?

I came up with a solution that might work. My solution is based on two assumptions:

1) RSS readers ignore JavaScript

2) Most blog engines have a templating feature that allows the URL of the blog post to be injected anywhere on the page containing the post

The solution is pretty simple:

Embed a simple script in your blog post that checks to see if the location where your blog content is being displayed is valid (i.e. your blog) or invalid (i.e. harvester site). If it is invalid, then redirect the browser to your blog.

Not only does this approach thwart harvesters (at least until they filter out the script), but it has the added benefit of getting the search traffic from the harvester’s site back to your blog.

Let’s walk through the changes you would make to your blog’s template in order to enable this capability:

Original blog HTML:

This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog.

Steps for modifying blog HTML:

Step 1: Add DIV element wrapper for content

<div id="BlogContent" title="http://www.yourblogsite.com/URL-of-your-blog-post.htm">This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog.</div>

I used an id of “BlogContent” but you can use anything you want. If your blog displays the entire contents of more than one blog post on a page, each post needs its own id. In that case, try using “BlogContent{{ ID }}” where {{ ID }} is your blog engine’s token for some unique identifier associated with the post. If you take this approach, be sure to modify the “BlogContent” string in Step 2 to match.

Also, note the URL in the value of the “title” attribute of the <div> containing the blog content. You should not actually type in a URL there, but instead use the token feature of your blog engine that will inject the URL of the blog post page. Something like:

<div id="BlogContent" title="{{ PostURL }}">

({{ ID }} and {{ PostURL }} are not actual tokens; I just made them up. You will need to look at your blog engine’s documentation to figure out the tokens you should use.)

This URL serves two purposes:
– It provides a standards-compliant way to include the original URL of your blog post in the content so that no matter where the content is posted, the original URL is always in the HTML source code, and
– It gives the script in Step 2 a known place to find the original URL

Step 2: Embed script to foil harvesters

The script to embed is:

<script type="text/javascript">// <![CDATA[

var blogContent = document.getElementById("BlogContent");
if (location.href.toLowerCase().indexOf(blogContent.title.toLowerCase()) != 0) location.href = blogContent.title;
// ]]></script>

Here’s what the script is doing:

a) Find the HTML element containing the blog content

var blogContent = document.getElementById("BlogContent");

b) Test whether the content is running on the original site, that is, whether the address in the browser begins with the post’s URL; if not, redirect to the original site

if (location.href.toLowerCase().indexOf(blogContent.title.toLowerCase()) != 0) location.href = blogContent.title;
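To make the test in (b) easier to see, here is a rough sketch that pulls the comparison out into a function you can run outside a browser. The function name isHarvested and the example URLs are my own inventions for illustration; they are not part of the script above.

```javascript
// Hypothetical helper (not part of the original script): returns true when
// the page address does NOT begin with the post's original URL.
function isHarvested(currentUrl, postUrl) {
  // indexOf returns 0 only when currentUrl starts with postUrl;
  // any other result (including -1 for "not found") counts as a mismatch.
  return currentUrl.toLowerCase().indexOf(postUrl.toLowerCase()) !== 0;
}

// Example URLs are made up for illustration.
var postUrl = "http://www.yourblogsite.com/URL-of-your-blog-post.htm";

console.log(isHarvested(postUrl, postUrl));                             // false
console.log(isHarvested(postUrl + "#comments", postUrl));               // false
console.log(isHarvested("http://harvester.example/copy.htm", postUrl)); // true
```

Because this is a "starts with" test rather than an exact match, anything appended to the post URL (such as a #comments anchor) still counts as the original page.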

Here’s what the final content might look like:

<div id="BlogContent" title="{{PostURL}}">This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog.
<script type="text/javascript">// <![CDATA[

var blogContent = document.getElementById("BlogContent");
if (location.href.toLowerCase().indexOf(blogContent.title.toLowerCase()) != 0) location.href = blogContent.title;
// ]]></script></div>

Step 3: (Optional) Putting the script in a separate file

Instead of placing the script in each blog post as described above, you can also put the script into a separate file such as harvestblock.js. This reduces the page size because the script is not repeated in full for each blog post. You only need to include this part of the script in the file:

var blogContent = document.getElementById("BlogContent");
if (location.href.toLowerCase().indexOf(blogContent.title.toLowerCase()) != 0) location.href = blogContent.title;

If you do this, the revised content might look like:

<div id="BlogContent" title="{{PostURL}}">This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog. This is the content of my blog.
<script src="http://www.yourblog.com/harvestblock.js" type="text/javascript"></script></div>

Note: The URL used for the script must be a fully-qualified URL because it must work no matter whether the content is running on your site or on the harvester’s site.
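If you use a separate id for each post (the “BlogContent{{ ID }}” idea from Step 1), a shared file cannot contain a blog-engine token, so the script would have to find the wrapper elements on its own. Here is a minimal sketch of how that might look, assuming every wrapper id starts with “BlogContent” and carries the post URL in its title attribute. The function name findRedirectTarget is hypothetical, and I have not tried this against a real blog engine.

```javascript
// Hypothetical helper: given the current address and a list of wrapper
// elements (anything with a `title` property), return the URL to redirect
// to, or null when at least one post is being shown at its original address.
function findRedirectTarget(currentUrl, wrappers) {
  var here = currentUrl.toLowerCase();
  var firstUrl = null;
  for (var i = 0; i < wrappers.length; i++) {
    var postUrl = wrappers[i].title;
    if (!postUrl) continue; // skip wrappers that carry no URL
    if (here.indexOf(postUrl.toLowerCase()) === 0) return null; // on the original page
    if (firstUrl === null) firstUrl = postUrl; // remember the first candidate
  }
  return firstUrl;
}

// Browser wiring; skipped where no DOM exists (for example under Node).
if (typeof document !== "undefined") {
  var target = findRedirectTarget(
    location.href,
    document.querySelectorAll('[id^="BlogContent"]')
  );
  if (target !== null) location.href = target;
}
```

When no post on the page matches the browser’s address, this sketch sends the reader to the first post it finds. That choice is arbitrary; you could just as well redirect to your blog’s home page.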


Let’s look at what happens when you make this change:

1) User looking at content on your site

The script will detect a match between the URL being displayed in the browser and the URL of the blog post. As a result, it will do nothing and there will be no change in behavior from what your users are already seeing.

2) User looking at content in their RSS reader

The script will not run and as a result there will be no change in behavior from what your users are already seeing.

3) User looking at content on harvester site

The script will detect a mismatch between the URL being displayed in the browser and the URL of the blog post. As a result, it will redirect the user to the original blog post.
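One case the three above do not cover: a page on your own site that shows full posts at a different address, such as a home page or an archive page. There the browser’s address does not begin with the post URL, so the prefix test would redirect your own readers. A possible variant, sketched below, compares only the host names. The function name isForeignHost is hypothetical, and it relies on the URL constructor found in modern browsers, so treat it as an untested idea rather than a drop-in replacement.

```javascript
// Hypothetical variant: compare host names instead of full URLs, so that
// any page on the original site passes the check.
function isForeignHost(currentUrl, postUrl) {
  try {
    // The URL parser lower-cases host names, so no toLowerCase() is needed.
    return new URL(currentUrl).hostname !== new URL(postUrl).hostname;
  } catch (e) {
    // Unparseable or relative URL: do nothing rather than redirect blindly.
    return false;
  }
}

// Browser wiring; skipped where no DOM exists (for example under Node).
if (typeof document !== "undefined") {
  var blogContent = document.getElementById("BlogContent");
  if (blogContent && isForeignHost(location.href, blogContent.title)) {
    location.href = blogContent.title;
  }
}
```

Note that host names must match exactly here, so www.yourblogsite.com and yourblogsite.com would count as different sites unless the post URL token always produces the same form your readers see.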

This solution is not foolproof. If a harvester strips scripts embedded in a blog post, it will not work. I doubt this will happen very often because most harvested content is simply the content from the RSS feed as-is.

If you employ this solution, please share the specific token you use with your blogging engine in the comments.

Founder NftyDreams; founder Decentology; co-founder DNN Software; educator; Open Source proponent; Microsoft MVP; tech geek; creative thinker; husband; dad. Personal blog: http://www.kalyani.com. Twitter: @techbubble