
If you are not familiar with the terms “web scraping” and “scraper sites”, here is the definition from Wikipedia:

Web scraping is defined as a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair use snippets of text to plagiarized content pages solely for the purpose of earning revenue through advertising. The typical scraper website is monetized using Google AdSense hence the term Made For AdSense or MFA website.

Now that you are familiar with the term, let’s move on and start celebrating Google’s new similarity patent.

The inventor named on the patent is Dr. Moses Charikar, an Indian native who is now an assistant professor at Princeton University and who was a member of the Google research team back in December 2001, when Google applied for the patent. The patent for “Methods and apparatus for estimating similarity” was granted to Google on January 2, 2007 (US Patent 7,158,961).

What does this patent mean for Webmasters? Well, that’s pretty clear if we read the Abstract:

A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
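For the technically curious, here is a minimal Python sketch of the steps the abstract describes. It is only an illustration under my own assumptions, not Google's implementation: the 64-bit sketch length, the use of word counts as weights, the +1/-1 hashing vectors derived from SHA-256, and the final step of keeping only the sign of each summed coordinate are all choices made for this example, and the function names are hypothetical.

```python
import hashlib

SKETCH_BITS = 64  # assumed sketch length; the abstract does not specify one


def hashing_vector(feature):
    """Return a fixed vector of +1/-1 entries for a feature (assumed scheme)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    bits = int.from_bytes(digest[:SKETCH_BITS // 8], "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(SKETCH_BITS)]


def sketch(text):
    """Build a compact integer sketch of a text."""
    # Vector for the object: one coordinate per word, weighted by its count.
    weights = {}
    for word in text.lower().split():
        weights[word] = weights.get(word, 0) + 1

    # Multiply each weight by its hashing vector and sum the product vectors.
    summed = [0] * SKETCH_BITS
    for feature, weight in weights.items():
        for i, h in enumerate(hashing_vector(feature)):
            summed[i] += weight * h

    # Compact representation: keep one bit per coordinate (assumed: its sign).
    result = 0
    for i, total in enumerate(summed):
        if total > 0:
            result |= 1 << i
    return result


def similarity(sketch_a, sketch_b):
    """Fraction of sketch bits on which two sketches agree (0.0 to 1.0)."""
    differing = bin(sketch_a ^ sketch_b).count("1")
    return 1 - differing / SKETCH_BITS
```

Two pages would then be compared with something like `similarity(sketch(page_one), sketch(page_two))`. Identical texts always score 1.0, and the idea is that near-copies should tend to score close to it, while unrelated texts should tend to agree on only about half of the bits.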

Now in plain English: the similarity engine determines how similar two Web pages are. Meaning: Google will recognize duplicate content and decide whether to index a page or not. Google’s aim is to reduce the number of duplicate pages in its results. So if you were a fan of “free content”, if you used to visit article submission directories and get “free articles” to “feed” your website, you are probably in for a big surprise.

From now on you have to be really careful in your SEO endeavors. Any technique associated with duplicate content might get your site penalized or banned from Google. What you need is what professional SEOs have been promoting for a long time: unique content, quality content, the type of content that really is king. If you haven’t done it yet, it’s time to hire a writer and rewrite your Web pages with one thought in mind: original content. No more automatic writing software, no more “Turn 1 article into 100s of unique article content pages in 14 secs.” No more scraping! You want unique content on your site, because you don’t want Google to decide not to crawl or index it.

But as with social bookmarking and the recent abuse of Digg, something similar (yeah, we’re talking similarity) might happen with this similarity engine. What if competitors decide to copy your pages to spoil your work and your rankings? Let’s just hope Google has a solution for this and will know who was there first. Or let’s hope that Google will leave room for some similarity. The similarity algorithm is not implemented yet, but it’s just a matter of time until Google does it. In the meantime, if you have scraper sites, be afraid, be very afraid! And if you don’t scrape but use free content every now and then because you “find it useful”, rethink your strategy. Link to the “useful content” if you must, but do write your own copy!

