2014-07-29

Posted by billslawski

This is my first official blog post at Moz.com, and I'm going to be requesting your help and expertise and imagination.

I'm going to be asking you to take over as Panda for a little while to see if you can identify the kinds of things that Google's Navneet Panda addressed when faced with what looked like an incomplete patent created to identify sites as parked domain pages, content farm pages, and link farm pages. You're probably better at this now than he was then.



You're a subject matter expert.

To put things in perspective, I'm going to include some information about what appears to be the very first Panda patent, and some of Google's effort behind what they were calling the "high-quality site algorithm."

I'm going to then include some of the patterns they describe in the patent to identify lower-quality pages, and then describe some of the features I personally would suggest to score and rank a higher-quality site of one type.

Google's Amit Singhal identified a number of questions about higher quality sites that he might use, and told us in the blog post where he listed those that it was an incomplete list because they didn't want to make it easy for people to abuse their algorithm.

In my opinion though, any discussion about improving the quality of webpages is one worth having, because it can help improve the quality of the Web for everyone, which Google should be happy to see anyway.

Warning searchers about low-quality content

In "Processing web pages based on content quality," the original patent filing for Panda, there's a somewhat mysterious statement that makes it sound as if Google might warn searchers before sending them to a low quality search result, and give them a choice whether or not they might actually click through to such a page.

As it notes, the types of low quality pages the patent was supposed to address included parked domain pages, content farm pages, and link farm pages (yes, link farm pages):

"The processor 260 is configured to receive from a client device (e.g., 110), a request for a web page (e.g., 206). The processor 260 is configured to determine the content quality of the requested web page based on whether the requested web page is a parked web page, a content farm web page, or a link farm web page.

Based on the content quality of the requested web page, the processor is configured to provide for display, a graphical component (e.g., a warning prompt). That is, the processor 260 is configured to provide for display a graphical component (e.g., a warning prompt) if the content quality of the requested web page is at or below a certain threshold.

The graphical component provided for display by the processor 260 includes options to proceed to the requested web page or to proceed to one or more alternate web pages relevant to the request for the web page (e.g., 206). The graphical component may also provide an option to stop proceeding to the requested web page.

The processor 260 is further configured to receive an indication of a selection of an option from the graphical component to proceed to the requested web page, or to proceed to an alternate web page. The processor 260 is further configured to provide for display, based on the received indication, the requested web page or the alternate web page."

This did not sound like a good idea.

Recently, Google announced in a Google Webmaster Central blog post, Promoting modern websites for modern devices in Google search results, that they would start providing warning notices to mobile searchers if there were issues on the pages those visitors might go to.

I imagine that as a site owner, you might be disappointed to see such a warning notice shown to searchers about technology used on your site possibly not working correctly on a specific device. That recent blog post mentions Flash as an example of a technology that might not work correctly on some devices. For example, we know that Apple's mobile devices and Flash don't work well together.

That's not a bad warning in that it provides enough information to act upon and fix to the benefit of a lot of potential visitors. :)

But imagine if you tried to visit your website in 2011, and instead of getting to the site, you received a Google warning that the page you were trying to visit was a content farm page or a link farm page, and it provided alternative pages to visit as well.

That "your website sucks" warning still doesn't sound like a good idea. One of the inventors listed on the patent is described in LinkedIn as presently working on the Google Play store. The warning for mobile devices might have been something he brought to Google from his work on this Panda patent.

We know that when the Panda update was released, it was targeting specific types of pages that people at places such as The New York Times were complaining about, such as parked domains and content farm sites. A follow-up from the Times after the algorithm update was released puts it into perspective for us.

It wasn't easy to know that your pages might have been targeted by that particular Google update either, or if your site was a false positive—and many site owners ended up posting in the Google Help forums after a Google search engineer invited them to post there if they believed that they were targeted by the update when they shouldn't have been.

The wording of that invitation is interesting in light of the original name of the Panda algorithm. (Note that the thread was broken into multiple threads when Google did a migration of posts to new software, and many appear to have disappeared at some point.)

As we were told in the invite from the Google search engineer:

"According to our metrics, this update improves overall search quality. However, we are interested in hearing feedback from site owners and the community as we continue to refine our algorithms. If you know of a high-quality site that has been negatively affected by this change, please bring it to our attention in this thread.

Note that as this is an algorithmic change we are unable to make manual exceptions, but in cases of high quality content we can pass the examples along to the engineers who will look at them as they work on future iterations and improvements to the algorithm.

So even if you don't see us responding, know that we're doing a lot of listening."

The timing for such in-SERP warnings might have been troublesome. A site that mysteriously stops appearing in search results for queries that it used to rank well for might be said to have run afoul of Google's guidelines. Instead, such a warning might be a little like the purposefully embarrassing "Scarlet A" in Nathaniel Hawthorne's novel The Scarlet Letter.



A page that shows up in search results with a warning to searchers stating that it was a content farm, or a link farm, or a parked domain probably shouldn't be ranking well to begin with. Having Google continuing to display those results ranking highly, showing both a link and a warning to those pages, and then diverting searchers to alternative pages might have been more than those site owners could handle. Keep in mind that the fates of those businesses are usually tied to such detoured traffic.

My imagination is filled with the filing of lawsuits against Google based upon such tantalizing warnings, rather than site owners filling up a Google Webmaster Help Forum with information about the circumstances involving their sites being impacted by the update.

In retrospect, it is probably a good idea that the warnings hinted at in the original Panda Patent were avoided.

Google seems to think that such warnings are appropriate now when it comes to multiple devices and technologies that may not work well together, like Flash and iPhones.

But there were still issues with how well or how poorly the algorithm described in the patent might work.

In the March 2011 interview with Google's Head of Search Quality, Amit Singhal, and his team member and Head of Web Spam at Google, Matt Cutts, titled TED 2011: The "Panda" That Hates Farms: A Q&A With Google’s Top Search Engineers, we learned that Google's code name for the algorithm update was "Panda," after an engineer with that name who came along and provided suggestions on patterns that could be used by the patent to identify high- and low-quality pages.

His input seems to have been pretty impactful—enough for Google to have changed the name of the update, from the "High Quality Site Algorithm" to the "Panda" update.

How the High-Quality Site Algorithm became Panda

Danny Sullivan named the update the "Farmer update" since it supposedly targeted content farm web sites. Soon afterwards the joint interview with Singhal and Cutts identified the Panda codename, and that's what it's been called ever since.

Google didn't completely abandon the name found in the original patent, the "high quality sites algorithm," as can be seen in the titles of these Google Blog posts:

Finding more high-quality sites in search

High-quality sites algorithm goes global, incorporates user feedback

More guidance on building high-quality sites

High-quality sites algorithm launched in additional languages

Another step to reward high-quality sites

The most interesting of those is the "more guidance" post, in which Amit Singhal lists 23 questions about things Google might look for on a page to determine whether or not it was high-quality. I've spent a lot of time since then looking at those questions thinking of features on a page that might convey quality.

The original patent is at:

Processing web pages based on content quality
Inventors: Brandon Bilinski and Stephen Kirkham

Assigned to Google

US Patent 8,775,924

Granted July 8, 2014

Filed: March 9, 2012

Abstract

"Computer-implemented methods of processing web pages based on content quality are provided. In one aspect, a method includes receiving a request for a web page.

The method includes determining the content quality of the requested web page based on whether it is a parked web page, a content farm web page, or a link farm web page. The method includes providing for display, based on the content quality of the requested web page, a graphical component providing options to proceed to the requested web page or to an alternate web page relevant to the request for the web page.

The method includes receiving an indication of a selection of an option from the graphical component to proceed to the requested web page or to an alternate web page. The method further includes providing, based on the received indication, the requested web page or an alternate web page."
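
Read as a flow, the abstract describes a threshold check followed by a choice. Here is a minimal Python sketch of that flow. The function names, the numeric quality scale, and the "proceed" and "alternate" selection strings are all hypothetical and made up for illustration; the patent does not specify any of them.

```python
def should_warn(content_quality, threshold):
    """The patent describes a warning when quality is at or below a threshold."""
    return content_quality <= threshold


def page_to_display(requested, alternate, content_quality, threshold, selection=None):
    """Return the page to show, or None when nothing should be shown yet.

    None covers two cases: the warning prompt is still waiting for a
    selection, or the searcher chose to stop.
    """
    if not should_warn(content_quality, threshold):
        return requested  # Quality is above the threshold: no prompt at all.
    if selection == "proceed":
        return requested
    if selection == "alternate":
        return alternate
    return None
```

A page scoring above the threshold is returned directly, while a page at or below it is only returned once the searcher explicitly chooses to proceed.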

The patent lists examples of low-quality web pages, including:

Parked web pages

Content farm web pages

Link farm web pages

Default pages

Pages that do not offer useful content, and/or pages that contain advertisements and little else

An invitation to crowdsource high-quality patterns

This is the section I mentioned above where I am asking for your help. You don't have to publish your thoughts on how quality might be identified, but I'm going to start with some examples.

Under the patent, a content quality value score is calculated for every page on a website based upon patterns found on known low-quality pages, "such as parked web pages, content farm web pages, and/or link farm web pages."

For each of the patterns identified on a page, the content quality value of the page might be reduced based upon the presence of that particular pattern—and each pattern might be weighted differently.
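
To make the weighting idea concrete, here is a minimal Python sketch. The pattern names, the penalty values, and the starting score of 100 are assumptions invented for illustration; the patent does not publish actual patterns with numbers attached.

```python
# Hypothetical penalties: how much each detected pattern lowers the score.
# None of these names or numbers come from the patent.
PATTERN_PENALTIES = {
    "references_parking_service": 50,
    "references_ad_network": 20,
    "mostly_hyperlinks": 30,
}


def content_quality_value(patterns_found, start=100):
    """Lower a starting score by the penalty of each distinct pattern found.

    Unknown pattern names carry no penalty, and the score never drops
    below zero.
    """
    score = start
    for name in set(patterns_found):
        score -= PATTERN_PENALTIES.get(name, 0)
    return max(score, 0)
```

Because each pattern carries its own penalty, a reference to a parking service can count for more than a reference to an ad network, which is the kind of differential weighting the patent describes.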

Some simple patterns that might be applied to a low-quality web page might be one or more references to:

A known advertising network,

A web page parking service, and/or

A content farm provider

One of these references may be in the form of an IP address that the destination hostname resolves to, a Domain Name Server ("DNS server") that the destination domain name is pointing to, an "a href" attribute on the destination page, and/or an "img src" attribute on the destination page.

That's a pretty simple pattern, but a web page resolving to an IP address known to exclusively serve parked web pages provided by a particular Internet domain registrar can be deemed a parked web page, so it can be pretty effective.

A web page with a DNS server known to be associated with web pages that contain little or no content other than advertisements may very well provide little or no content other than advertising. So that one can be effective, too.
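
The "a href" and "img src" part of this pattern is easy to sketch with Python's standard library. The hostnames below are placeholders standing in for a list of known ad networks, parking services, or content farm providers, and the class and function names are hypothetical. The IP address and DNS server checks would need live network lookups, so they are left out of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Placeholder hostnames; a real list of known services is an assumption here.
KNOWN_LOW_QUALITY_HOSTS = {"ads.example.net", "parking.example.com"}


class ReferenceCollector(HTMLParser):
    """Collect hostnames referenced by <a href> and <img src> attributes."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                host = urlparse(value).hostname  # None for relative URLs
                if host:
                    self.hosts.add(host)


def flagged_references(html):
    """Return the referenced hostnames that appear on the known list."""
    collector = ReferenceCollector()
    collector.feed(html)
    return collector.hosts & KNOWN_LOW_QUALITY_HOSTS
```

Relative links such as "/about" have no hostname and are ignored, so only references to other hosts can be flagged.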

Some of the patterns listed in the patent don't seem quite as useful or informative. One example is the pattern stating that a web page containing a common typographical error of a bona fide domain name is likely to be a low-quality web page, or a non-existent web page. I've seen more than a couple of legitimate sites with common misspellings of good domains, so I'm not too sure how helpful that pattern is.

Of course, the patent tells us that some textual content is a dead giveaway, such as pages containing phrases like "domain is for sale," "buy this domain," and/or "this page is parked."

Likewise, a web page with little or no content is probably (but not always) a low-quality web page.

This is a simple but effective pattern, even if not too imaginative:

... page providing 99% hyperlinks and 1% plain text is more likely to be a low-quality web page than a web page providing 50% hyperlinks and 50% plain text.
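
Both the giveaway phrases and the hyperlink-to-text comparison can be sketched in a few lines of Python. The three phrases are the ones quoted from the patent above. Measuring the split by counting visible characters inside and outside of links is my own assumption, since the patent doesn't say how the percentages are computed, and the names used here are hypothetical.

```python
from html.parser import HTMLParser

# Phrases the patent gives as examples of parked-page text.
PARKED_PHRASES = ("domain is for sale", "buy this domain", "this page is parked")


class LinkTextCounter(HTMLParser):
    """Count text characters overall and those inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        count = len(data.strip())
        self.total_chars += count
        if self.link_depth > 0:
            self.link_chars += count


def hyperlink_share(html):
    """Return the fraction of text characters that sit inside links (0.0 to 1.0)."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()  # Flush any text still buffered by the parser.
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars


def has_parked_phrase(text):
    """Case-insensitive check for any of the patent's example phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in PARKED_PHRASES)
```

This is a crude measure: it also counts text inside script and style elements, which a more careful version would skip.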

Another pattern is one that I often check upon and address in site audits, and it involves how functional and responsive pages on a site are.

"The determination of whether a web site is full functional may be based on an HTTP response code, information received from a DNS server (e.g., hostname records), and/or a lack of a response within a certain amount of time. As an example, an HTTP response that is anything other than 200 (e.g., "404 Not Found") would indicate that a web site is not fully functional.

As another example, a DNS server that does not return authoritative records for a hostname would indicate that the web site is not fully functional. Similarly, a lack of a response within a certain amount of time, from the IP address of the hostname for a web site would indicate that the web site is not fully functional."
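
The three signals in that passage can be combined into one check. This Python sketch works on observations that have already been collected, so it makes no network requests itself. The function name, its parameters, and the 10-second default are hypothetical; the patent only says "a certain amount of time."

```python
def is_fully_functional(status_code, has_authoritative_dns, response_seconds,
                        timeout_seconds=10.0):
    """Apply the quoted passage's three signals to collected observations.

    status_code and response_seconds are None when no HTTP response
    arrived at all. The 10-second default timeout is an assumption.
    """
    if not has_authoritative_dns:
        return False  # DNS did not return authoritative records.
    if response_seconds is None or response_seconds > timeout_seconds:
        return False  # No response within the allowed time.
    return status_code == 200  # Anything other than 200 fails, per the quote.
```

Taken literally, the "anything other than 200" rule would also mark a redirecting URL (a 301, for instance) as not fully functional, which is worth keeping in mind during a site audit.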

User data might sometimes play a role as well, as the patent tells us:

"A web page may be suggested for review and/or its content quality value may be adapted based on the amount of time spent on that page.

For example, if a user reaches a web page and then leaves immediately, the brief nature of the visit may cause the content quality value of that page to be reviewed and/or reduced. The amount of time spent on a particular web page may be determined through a variety of approaches. For example, web requests for web pages may be used to determine the amount of time spent on a particular web page."
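
One way web requests could be used to estimate time on a page is to take the gap between a visitor's successive requests. That approach is my assumption, not something the patent spells out, and the function names and the 5-second cutoff below are hypothetical.

```python
def time_on_pages(requests):
    """Estimate seconds spent on each page from one visitor's request log.

    requests is a list of (timestamp_seconds, url) tuples in time order.
    The last page has no following request, so its duration is unknown
    and it is left out.
    """
    durations = []
    for (start, url), (next_start, _next_url) in zip(requests, requests[1:]):
        durations.append((url, next_start - start))
    return durations


def brief_visits(requests, min_seconds=5):
    """Return URLs the visitor left in under min_seconds (an assumed cutoff)."""
    return [url for url, seconds in time_on_pages(requests) if seconds < min_seconds]
```

A visitor who requests /a, then /b two seconds later, then /c a minute after that would have /a flagged as a brief visit and /b not.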

My example of some patterns for an e-commerce website

There are a lot of things that you might want to include on an e-commerce site that help to indicate that it's high quality. If you look at the questions that Amit Singhal raised in the last Google Blog post I mentioned above, one of his questions was "Would you be comfortable giving your credit card information to this site?" Patterns that might fit with this question could include:

Is there a privacy policy linked to on pages of the site?

Is there a "terms of service" page linked to on pages of the site?

Is there a "customer service" page or section linked to on pages of the site?

Do ordering forms function fully on the site? Do they return 404 pages or 500 server errors?

If an order is made, does a thank-you or acknowledgement page show up?

Does the site use an https protocol when sending data or personally identifiable data (like a credit card number)?

As I mentioned above, the patent tells us that each pattern might be weighted differently when a page's content quality score is calculated.
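
A few of the e-commerce patterns above can be checked from a page's HTML alone. This Python sketch looks for the three kinds of links by their anchor text and checks whether forms submit over https. The phrases matched, the signal names, and the class and function names are all assumptions for illustration. Testing whether order forms return errors, or whether a thank-you page appears, would need live requests and isn't covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical anchor-text phrases for each trust signal.
TRUST_LINK_PHRASES = {
    "privacy_policy": "privacy policy",
    "terms_of_service": "terms of service",
    "customer_service": "customer service",
}


class TrustSignalParser(HTMLParser):
    """Collect anchor text and the resolved submission URL of every form."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.in_link = False
        self.link_text = []
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
        elif tag == "form":
            # A missing action submits to the page itself; relative
            # actions are resolved against the page URL.
            action = dict(attrs).get("action") or ""
            self.form_actions.append(urljoin(self.page_url, action))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_text.append(data.lower())


def trust_signals(page_url, html):
    """Return a dict of signal name to True/False for one page."""
    parser = TrustSignalParser(page_url)
    parser.feed(html)
    parser.close()
    anchor_text = " ".join(" ".join(parser.link_text).split())
    signals = {name: phrase in anchor_text
               for name, phrase in TRUST_LINK_PHRASES.items()}
    # Note: a page with no forms passes this check trivially.
    signals["forms_use_https"] = all(
        urlparse(action).scheme == "https" for action in parser.form_actions
    )
    return signals
```

Each signal found or missing could then be given its own weight when a page's score is calculated, in line with the differential weighting the patent describes.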

The questions from Amit Singhal imply a lot of other patterns, but as SEOs who work on and build and improve a lot of websites, this is an area where we probably have more expertise than Google's search engineers.



What other questions would you ask if you were tasked with looking at this original Panda Patent? What patterns would you suggest looking for when trying to identify high or low quality pages? Perhaps if we share with one another patterns or features on a site that Google might look for algorithmically, we could build pages that Google is less likely to interpret as low quality. I provided a few patterns for an e-commerce site above. What patterns would you suggest?

(Illustrations: Devin Holmes @DevinGoFish)
