Making AJAX Applications Crawlable using PHP and jQuery.

Unfortunately, there is not much documentation on the web that pertains to this (none that I could find, at least). So here's a quick PHP tutorial on how to go about this. I'll try to be as concise as possible.

Here's Google's specification regarding their proposal for crawling AJAX applications:
http://code.google.com/web/ajaxcrawling/docs/specification.html

Problem: Googlebot is unable to crawl content which is loaded via AJAX into a webpage.

Solution: Use hash fragments as a signal to Googlebot to request an alternate URL (the "ugly URL"), which returns a "snapshot" of the webpage, including the AJAX-loaded content, that can be crawled.

Definitions:

Pretty URLs are the URLs seen on the client end that load dynamic content. A pretty URL contains #! followed by an identifier/query string that tells the webpage what content to load.
For example, http://www.example.com/index.php#!content1.

Ugly URLs are the URLs that, when navigated to, display content that was pre-loaded from the server. These URLs, instead of having #! as the delimiter, have _escaped_fragment_=. For example, a pretty URL of http://www.example.com/index.php#!content1 would look like this to Googlebot: http://www.example.com/index.php?_escaped_fragment_=content1.

If you aren't familiar with how Google crawls webpages: one of its primary techniques is that Googlebot looks for hyperlinks, follows them, and crawls each page it lands on. So when Googlebot sees a hyperlink with the hypertext reference http://www.example.com/index.php#!content1, what it really requests is http://www.example.com/index.php?_escaped_fragment_=content1. As mentioned before, all of the HTML output on http://www.example.com/index.php?_escaped_fragment_=content1 is pre-loaded server-end, so no AJAX is happening there, which means Google crawls everything - yay!
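
To make the mapping concrete, here's a small PHP sketch of the #! to _escaped_fragment_= translation described in Google's specification. This is not part of the attached code - the function name is hypothetical, purely for illustration:

<?php
// Hypothetical helper: translate a pretty URL into the ugly URL that
// Googlebot actually requests, per Google's AJAX crawling specification.
function prettyToUgly($url) {
    $parts = explode('#!', $url, 2);
    if (count($parts) < 2) {
        return $url; // no hash fragment, nothing to translate
    }
    // Append with ? normally, or & if the URL already has a query string,
    // URL-encoding the fragment value.
    $separator = (strpos($parts[0], '?') === false) ? '?' : '&';
    return $parts[0] . $separator . '_escaped_fragment_=' . rawurlencode($parts[1]);
}

echo prettyToUgly('http://www.example.com/index.php#!content1');
// Outputs: http://www.example.com/index.php?_escaped_fragment_=content1
?>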

Steps to go about making an AJAX-crawlable application:

1. Tell Googlebot that the page is crawlable via hash fragments. In the <head> of your page, insert this:

<meta name="fragment" content="!" />

2. If you are loading your AJAX content as soon as the page loads (which presumably you are), you'll want to read the hash fragment once the document is ready. Here's an example of how it's done:


var escapedHash;

// Read the hash fragment, strip the #! prefix, and fetch the matching
// content from data.php via AJAX.
function findHash() {
    if (window.location.hash) {
        $("#content").html("Loading...");

        escapedHash = window.location.hash.replace("#!", "");

        // Encode the fragment so unusual characters can't break the query string.
        $.get("data.php?_escaped_fragment_=" + encodeURIComponent(escapedHash), function(data) {
            $("#content").html(data);
        });
    }
}

// Load the right content when the page first loads...
$(document).ready(function() {
    findHash();
});

// ...and again whenever the hash changes (e.g. a navigation link is clicked).
$(window).bind('hashchange', function() {
    findHash();
});


findHash() reads the hash of the URL, strips out the pretty-URL delimiter (#!), and requests another page, which we name data.php, passing the hash value in the _escaped_fragment_= query string (the ugly URL). The browser's "hashchange" event fires whenever the hash changes, so we bind a listener to it that executes findHash() again. The returned data is displayed inside the element with the ID content. For example, clicking a link to #!content2 triggers a GET request to data.php?_escaped_fragment_=content2, whose output replaces the contents of #content.

Here is the HTML of the page:


<table width="100%" cellspacing="0" cellpadding="10">
    <tr>
        <td style="background-color: #006600; color: #fff; padding: 10px; font-size: 18px" colspan="2"><b>AJAX-Crawlable content</b></td>
    </tr>
    <tr>
        <td width="25%" style="background-color: #e8e8e8; border-right: 1px solid #000">
            <b>Navigation:</b>
            <ul>
                <li><a href="#!content1">Content #1</a></li>
                <li><a href="#!content2">Content #2</a></li>
                <li><a href="#!content3">Content #3</a></li>
                <li><a href="#!content4">Content #4</a></li>
                <li><a href="#!content5">Content #5</a></li>
            </ul>
        </td>
        <td>
            <b>Content:</b>
            <p id="content">
                <?php include('data.php'); ?>
            </p>
        </td>
    </tr>
</table>


Notice the include of data.php inside the content element. This is there for when _escaped_fragment_= is set: when Googlebot requests the ugly URL, the include echoes the matching content straight into the page, so the snapshot contains it without any JavaScript running.
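
One edge case worth noting, as a sketch only (it isn't in the attached code): per Google's specification, the meta fragment tag from step 1 also makes Googlebot request the page's default, hash-less state as index.php?_escaped_fragment_= with an empty value. You may want to map that to some default content before the include, for example:

<?php
// Sketch only - handle the empty _escaped_fragment_= request that the
// <meta name="fragment" content="!"> tag causes Googlebot to make for
// the page's default (hash-less) state.
$fragment = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '';
if ($fragment === '') {
    $fragment = 'content1'; // assumed default state, just for this example
}
$_GET['_escaped_fragment_'] = $fragment; // data.php reads this value below
include 'data.php';
?>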

3. Create a server-side page that will read the _escaped_fragment_= query string and output data based on its value.

data.php:


<?php
// Map each hash fragment to the content it should display.
$content = array(
    "content1" => "This is some sample content.",
    "content2" => "Some more sample content.",
    "content3" => "Even more sample content.",
    "content4" => "And some more...",
    "content5" => "And even more!"
);

// Only echo something when _escaped_fragment_ is present and maps to known
// content; this avoids undefined-index notices for missing or unknown values.
if (isset($_GET['_escaped_fragment_'])) {
    $hash_frag = $_GET['_escaped_fragment_'];

    if (isset($content[$hash_frag])) {
        echo $content[$hash_frag];
    }
}


And now, when Googlebot requests the page, the hash fragment arrives as the _escaped_fragment_= query string instead, data.php echoes the matching content server-side, and Googlebot can read the AJAX-created pages.
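
As a small refinement (again, a sketch rather than something in the attached code), you could generate the navigation links from the same $content array that data.php serves from, so the pretty links and the crawlable content can never drift apart. This assumes you move the array somewhere both pages can include it:

<?php
// Sketch only: build the navigation <li> links from the same $content
// array used by data.php, keeping links and content in sync.
$content = array(
    "content1" => "This is some sample content.",
    "content2" => "Some more sample content."
    // ... and so on
);

foreach (array_keys($content) as $key) {
    echo '<li><a href="#!' . htmlspecialchars($key) . '">' . htmlspecialchars($key) . '</a></li>' . "\n";
}
?>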

Here are some tips to remember:

- When _escaped_fragment_ is activated, you only need a "snapshot" of the page. This means you really only need working links, text and the same general appearance as the webpage; everything else is unnecessary, since Googlebot won't be doing much except looking at the page.
- jQuery is your friend. It will make writing everything 10x faster, especially for this.


I attached the full code to this message. Here is a working example:
http://184.173.246.8/~joshso/ajax_crawl/

That's very interesting. I don't have a use for it at the moment but it will be useful at some point I'm sure.
Do you know if this is the same system used by other search engines?

I'm not entirely sure about that, but I think it's quite probable that other search engines will adopt it (if they haven't already). I know that large websites such as Twitter and Facebook use this technique. In any case, Google being able to crawl AJAX-created webpages is reason enough to build AJAX applications this way.

If you do find out whether other search engines have adopted this technique, please post it here.









