
04 Jan 2007
Posted by Mihaela Lica as News, SEO Advice
If you are not familiar with the terms “web scraping” and “scraper sites”, here is the definition from Wikipedia:
Web scraping is defined as a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair use snippets of text to plagiarized content pages solely for the purpose of earning revenue through advertising. The typical scraper website is monetized using Google AdSense hence the term Made For AdSense or MFA website.
Now that you are familiar with the term, let’s move on and start celebrating Google’s new similarity patent.
The creator of the patent is Dr. Moses Charikar, an Indian native who is now an assistant professor at Princeton University. Back in December 2001, when Google applied for the patent, he was a member of the Google research team. The patent for “Methods and apparatus for estimating similarity” was granted to Google on January 2, 2007 (US Patent 7,158,961).
What does this patent mean for Webmasters? Well, that’s pretty clear if we read the Abstract:
A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
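To make the abstract a little more concrete, here is a minimal sketch of the steps it describes, not Google's actual implementation. Several details are my own assumptions and are not stated in the abstract: words are used as the vector coordinates, word counts as the weights, the "predetermined hashing vector" is a fixed vector of +1/-1 values derived from a SHA-256 hash, and the sketch is 64 bits long. The names `hashing_vector`, `sketch` and `similarity` are hypothetical.

```python
import hashlib
from collections import Counter

SKETCH_BITS = 64  # assumed sketch length; the abstract does not give one


def hashing_vector(feature: str) -> list[int]:
    """Fixed +1/-1 vector for a feature (assumption: derived from SHA-256)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    value = int.from_bytes(digest[:SKETCH_BITS // 8], "big")
    return [1 if (value >> i) & 1 else -1 for i in range(SKETCH_BITS)]


def sketch(text: str) -> int:
    """Build a compact bit sketch of a text."""
    # Vector for the object: one coordinate per word, weighted by its count.
    weights = Counter(text.lower().split())
    summed = [0] * SKETCH_BITS
    for word, weight in weights.items():
        # Multiply the weight by the word's hashing vector and add it to the sum.
        for i, sign in enumerate(hashing_vector(word)):
            summed[i] += weight * sign
    # Compact representation: keep only the sign of each summed coordinate.
    bits = 0
    for i, total in enumerate(summed):
        if total > 0:
            bits |= 1 << i
    return bits


def similarity(a: int, b: int) -> float:
    """Fraction of sketch bits on which two sketches agree (0.0 to 1.0)."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / SKETCH_BITS
```

Calling `similarity(sketch(page_a), sketch(page_b))` on two page texts returns a score between 0.0 and 1.0. Under these assumptions, pages that share most of their words tend to agree on most sketch bits, so only the short sketches need to be compared rather than the full pages.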
Now in plain English: the similarity engine determines similarity between Web pages. Meaning: Google will recognize duplicate content and decide whether to index a page or not. Google's aim is to reduce the number of duplicate pages in its results. So if you were a fan of “free content”, if you used to visit article submission directories and get “free articles” to “feed” your website, you are probably in for a big surprise.
From now on you have to be really careful in your SEO endeavors. Any technique associated with duplicate content might get your site penalized or banned from Google. What you need is what professional SEOs have been promoting for a long time: unique content, quality content, the type of content that really is king. If you haven’t done it by now, it’s time for you to hire a writer and rewrite your Web pages with this thought in mind: original content. No more automatic writing software, no more “Turn 1 article into 100s of unique article content pages in 14 secs.” No more scraping! You want unique content on your site, because you don’t want Google to decide not to index or crawl your site.
But as with social bookmarking and the recent abuse of Digg, something similar (yeah, we talk similarity) might happen with this similarity engine. What if competitors decide to copy your pages to spoil your work and your rankings? Let’s just hope Google has a solution for this and will know who was there first. Or let’s hope that Google will leave room for some similarity. The similarity algorithm is not implemented yet, but it’s just a matter of time until Google does it. In the meantime, if you have scraper sites, be afraid, be very afraid! And if you don’t scrape but use free content every now and then because you “find it useful”, rethink your strategy. Link to the “useful content” if you must, but do write your own copy!
8 Responses
Cristian
January 4th, 2007 at 10:41 pm
Great to hear that Google has decided to take action against people stealing content off other sites.
I love your articles. Straight to the point and very informative. Thank you.
mig
January 5th, 2007 at 12:37 am
Thank you very much, Cristian. I am happy to give something back to the online community.
Alex
January 7th, 2007 at 2:10 am
I’ve been watching your work for a long time. If you were wondering who’s posting your articles at Stylegala, that’s me.
Anyway, thanks for keeping this blog up to date, and thanks for writing in English. And if I ever need a PR, I know where to contact you.
mig
January 7th, 2007 at 2:15 am
Wow! Really? I was wondering who was doing that! By the way, you also published an article there, from pamil-visions.com, that did not really match the “style” of Stylegala. It’s OK for me (I’ve got some traffic), but that article was for beginners and I admit the content and the writing style are not sensational. Thank you very much for being a fan. It’s nice to know that people actually read my blog and find it useful!
Liam Billington
January 8th, 2007 at 12:03 pm
Hopefully this will drop some Wikipedia pages from the index.
mig
January 8th, 2007 at 1:21 pm
If they were copied from somewhere else, yes, I agree. I don’t really trust Wikipedia. There are way too many errors within its contents. But I hope that the Made for AdSense sites drop, although Google does “sponsor” them. Time will tell.
Ricky
February 9th, 2007 at 7:52 pm
I am thinking about one aspect. I am usually very active on forums and sometimes I write some classic long posts, and to keep them in one place I started a blog. So does it mean that the post on that page and the post on the forum will be duplicate content? It’s just a matter of personal interest; many people do this.
Secondly, the Google AdSense guys should do AdSense approval on a per-site basis. This will surely demoralise those scrapers.
Mihaela Lica
February 9th, 2007 at 7:58 pm
Well, if you don’t rewrite the posts, you will have duplicate content. The similarity engine is not aimed solely at scraper sites. I think Google is simply trying to reduce the amount of irrelevant content.
You should really avoid duplicate content, especially if you don’t want your site to go supplemental.