Google Search Appliance

Office of Information Technology

Indexing Database-driven Web Pages

This guide explains how to get Web pages indexed when they are served by a template script from a database. The Search Appliance typically finds only pages that are linked from other crawled pages, so any database-driven content that is not linked from anywhere will likely remain unsearchable.

Step 1: Creating the index page

Create an "index page" that contains links to every page (record) in your database. For example, each of the content pages may have a URL that differs from the next by only an ID number:

http://myserver.umn.edu/news/Article.php?ArticleID=1202
http://myserver.umn.edu/news/Article.php?ArticleID=1203
http://myserver.umn.edu/news/Article.php?ArticleID=1204
...

Run a query against your database to retrieve the list of unique record IDs, that is, the part that differs in each page's URL. To build the index page, iterate over the recordset and write a link to each content page (record). In the HEAD of the index page, include a robots META tag that tells search engine crawlers not to index the page itself but to follow the links on it. For example:

File: http://myserver.umn.edu/news/AllArticles.php
...

<HEAD>
  <META NAME="robots" CONTENT="noindex,follow" />
</HEAD>

...

(query-output-loop-begin)
  <a href="http://myserver.umn.edu/news/Article.php?ArticleID=thisID">thisID</a>
(query-output-loop-end)

...
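
The outline above can be sketched in Python as follows. This is a minimal illustration, not the only way to build the page: the table name `articles`, the column `id`, the use of SQLite, and the function names are all assumptions made for the example. Substitute your own database, query, and scripting language.

```python
import sqlite3
from html import escape
from urllib.parse import urlencode

# Base URL of the content pages, taken from the example in Step 1.
BASE_URL = "http://myserver.umn.edu/news/Article.php"


def fetch_article_ids(conn):
    """Return the unique record IDs (hypothetical table 'articles', column 'id')."""
    rows = conn.execute("SELECT DISTINCT id FROM articles ORDER BY id")
    return [row[0] for row in rows]


def build_index_page(article_ids):
    """Return HTML for an index page that links to every content page."""
    links = []
    for article_id in article_ids:
        url = f"{BASE_URL}?{urlencode({'ArticleID': article_id})}"
        links.append(f'  <a href="{escape(url)}">{escape(str(article_id))}</a><br>')
    return "\n".join([
        "<html>",
        "<head>",
        # Ask crawlers not to index this page, but to follow its links.
        '  <meta name="robots" content="noindex,follow">',
        "</head>",
        "<body>",
        *links,
        "</body>",
        "</html>",
    ])


if __name__ == "__main__":
    # Demo with an in-memory database standing in for the real one.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO articles (id) VALUES (?)",
                     [(1202,), (1203,), (1204,)])
    print(build_index_page(fetch_article_ids(conn)))
```

Generating the page on each request, rather than saving a static copy, keeps the list of links in step with the database as records are added or removed.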

Step 2: Crawling the index page

Now you need to give the Search Appliance an entry point to your index page so the page and all of its links can be crawled. There are two ways you could do this:

  • (preferred) Create a hidden link on a page in your Web site that the Search Appliance has already crawled and indexed. The Search Appliance will follow this link the next time it crawls the page. This method is preferred because it does not require extra Search Appliance configuration. It also allows other search engines to index your database content.
  • Submit the URL of the index page to the Search Appliance. It will be added to the crawler's list of starting URLs.

Step 3: Indexing the content pages (records)

Follow this step if your content pages (linked from the index page) contain '?' in their URLs. Currently, the Search Appliance is configured to ignore documents whose URLs contain a '?', due to excessive crawling that has historically occurred within database-driven Web applications.

Submit the URL of a content page and indicate the portion of the URL that is common to all such pages for your database. We will add an exception to the crawler's list of ignored URL patterns. Using the example from Step 1, you could submit the URL "http://myserver.umn.edu/news/Article.php?ArticleID=1202" to us, noting that the "1202" part is what changes from page to page.
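
If it helps when preparing the request, the common and changing parts of a content URL can be separated with a few lines of Python. This is only a convenience sketch: the function name `split_content_url` is hypothetical, and it says nothing about how the Search Appliance itself matches URL patterns.

```python
from urllib.parse import urlsplit, parse_qsl


def split_content_url(url, id_param="ArticleID"):
    """Split a content-page URL into (common part, record ID).

    id_param is the query parameter that carries the record ID;
    "ArticleID" matches the example URLs in Step 1.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    record_id = params[id_param]
    # Everything up to and including "ArticleID=" is shared by all pages.
    common = f"{parts.scheme}://{parts.netloc}{parts.path}?{id_param}="
    return common, record_id


if __name__ == "__main__":
    common, record_id = split_content_url(
        "http://myserver.umn.edu/news/Article.php?ArticleID=1202")
    print("Common part:  ", common)
    print("Changing part:", record_id)
```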

© Regents of the University of Minnesota. All rights reserved.
The University of Minnesota is an equal opportunity educator and employer.

Last modified 2007-03-09 14:22:47 CST · Retrieved 2009-11-25 18:24:14 CST · URL http://www.umn.edu/google/managers/database.html