Google Search Appliance

Office of Information Technology

Tips for Web (HTML) Authors


Increasing page rank

Web pages with a high page rank will tend to appear first in the search results among pages that match the search terms. The Search Appliance determines page rank using many criteria. There is nothing we can do on the Search Appliance to alter a page's rank.

Preventing indexing and further crawling

Certain types of Web pages typically have little to no value in a search index. Examples include "Document Not Found" (404) error pages, comment or reply forms for blog entries, and "printer versions" of news articles. A comment form is generic and meant for data input rather than output to the reader, and the printer version of an article only duplicates content already indexed in the regular version.

You can use META tags in an HTML document to prevent the Search Appliance from adding the document to its search index, or to prevent the crawler from following links from the document to other documents. The META name attribute is "robots", and the content attribute must contain "noindex", "nofollow", or both:

  <meta name="robots" content="noindex" />
  <meta name="robots" content="nofollow" />
  <meta name="robots" content="noindex,nofollow" />
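
To confirm that a page carries the directives you intended, you can list the robots META directives it contains. The following is a minimal Python sketch using only the standard library. The names RobotsMetaParser and robots_directives are hypothetical, chosen for illustration, and the sketch makes no claim about how the Search Appliance itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags (<meta ... />) here.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for item in attr.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html_text):
    """Return the set of robots directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
    found = robots_directives(page)
    print("will be indexed:", "noindex" not in found)
    print("links will be followed:", "nofollow" not in found)
```

Running the sketch against the printer version of an article, for example, is a quick way to check that "noindex" is actually present before the crawler next visits.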

Suppressing page caching and snippet display

The image below is an example of an item from a search results list. The first line is the document title, the next two lines are a document snippet in the context of the search terms, and the last line, after the document URL, contains a "Cached" link that retrieves a cached copy of the document from the Search Appliance.

[Image: item from search results, illustrating the context-based snippet and a link to the cached copy of the document]

To disable the snippet display and the "Cached" link for an HTML page, place the following META tag in the HEAD of the document:

  <meta name="googlebot" content="nosnippet" />

To prevent the document from being cached on the Search Appliance while still allowing a snippet in the search results, use this META tag instead:

  <meta name="robots" content="noarchive" />
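
The effect of the two directives on a search result item can be summarized in a few lines. This is a rough Python sketch of the behavior described in this section only; the function name result_features is hypothetical, and it is not the Search Appliance's own logic.

```python
def result_features(content):
    """Map a META content value to the result-item parts that remain visible.

    content is the comma-separated value of the META content attribute,
    for example "nosnippet" or "noarchive".
    """
    directives = {item.strip().lower() for item in content.split(",")}
    no_snippet = "nosnippet" in directives
    no_archive = "noarchive" in directives
    return {
        # "nosnippet" hides the snippet and, per the text above, the "Cached" link too.
        "snippet": not no_snippet,
        # "noarchive" hides only the "Cached" link.
        "cached_link": not (no_snippet or no_archive),
    }


if __name__ == "__main__":
    for value in ("", "noarchive", "nosnippet"):
        print(repr(value), result_features(value))
```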

Submitting your page to the Search Appliance index

We can add your Web page to the Search Appliance index if it is not appearing in the search results. To test whether your Web page has been indexed by the Search Appliance, try searching for an exact phrase that appears on your page, enclosing it in quotation marks. Choose a phrase you think will be unique to your page. Look carefully through the list of search results for your page.

Sometimes you will see this message at the bottom of the last page of results: "In order to show you the most relevant results, we have omitted some entries very similar to the [number] already displayed. If you like, you can repeat the search with the omitted results included." Click "repeat the search" and look again for your page. If you find it after including omitted results, the Search Appliance has already indexed your page, so do not submit its URL. It simply determined that your page was similar to higher-ranked pages also in the index. See the section "Increasing page rank" above.

If you need to add your entire Web site, please do not submit a URL for every page on your site. Typically, you will only need to submit the URL of your home page. The Search Appliance will use its crawler to start at your submitted URL and follow successive links into your site, indexing pages as it goes.

Adding a search form or link to your site

To allow your Web visitors to search all University Web content, you may provide a link to search.umn.edu. This is the "Search U of M" link that appears in the top banner of University Web sites using the standard Web Depot templates.

To search within a single Web site, or part of a site, you do not need to apply for Search Appliance service. Simply use the HTML code below. Note that you may omit the sitesearch input element for a University-wide search.

<form action="http://google.umn.edu/search" method="GET">
  <input type="text" name="q" maxlength="256" />
  <input type="hidden" name="site" value="default_collection" />
  <input type="hidden" name="client" value="searchumn_generic" />
  <input type="hidden" name="proxystylesheet" value="searchumn_generic" />
  <input type="hidden" name="output" value="xml_no_dtd" />
  <input type="hidden" name="sitesearch" value="website_url_base" />
  <input type="submit" value="search button text" />
</form>

client=searchumn_generic
  The "client" parameter can be set to searchumn if you would like context-sensitive University links to appear above the search results. For example, a search for "technology" might display a link to the Institute of Technology.
sitesearch=website_url_base
  The starting portion of the URL that is common to all pages of your site. For example, this Web site uses www.umn.edu/google. Do not include the http:// part.
proxystylesheet=searchumn_generic
  If you want the search results page to look exactly like that of "search.umn.edu", set this parameter to searchumn.
search button text
  The text to appear on the form's submit button.
other search parameters
  The Google Search Protocol Reference describes these and other search form parameters in more detail.
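
Because the form uses the GET method, submitting it is equivalent to requesting a URL whose query string carries the same parameters. The following minimal Python sketch builds that URL, which can be handy for plain "search this site" links. The function name build_search_url and its argument names are hypothetical; the endpoint and parameter values are copied from the form above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Form action taken from the HTML example in this section.
SEARCH_ENDPOINT = "http://google.umn.edu/search"


def build_search_url(query, sitesearch=None,
                     client="searchumn_generic",
                     proxystylesheet="searchumn_generic"):
    """Build the URL a browser would request when the search form is submitted."""
    params = [
        ("q", query),
        ("site", "default_collection"),
        ("client", client),
        ("proxystylesheet", proxystylesheet),
        ("output", "xml_no_dtd"),
    ]
    if sitesearch:
        # The sitesearch value must not include the http:// part.
        for scheme in ("http://", "https://"):
            if sitesearch.startswith(scheme):
                sitesearch = sitesearch[len(scheme):]
        params.append(("sitesearch", sitesearch))
    # Leaving sitesearch out gives a University-wide search.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_search_url("meta tags", sitesearch="www.umn.edu/google")
    print(url)
    print(parse_qs(urlsplit(url).query))
```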

If you are a search manager in charge of a Search Appliance front end and collection, please see our search manager guides.


Pass session data through cookies, not the URL

The Search Appliance treats two documents as distinct if they have different URLs. If you have worked with Web applications or CGI scripts, you know that a single "page" can have an unlimited number of URLs, simply by adding arbitrary characters after the '?' in the URL (the query string). Web application frameworks for PHP, ASP, and ColdFusion, to name a few, offer mechanisms to pass user session tokens in the URL when requesting a page. Because these tokens are random and change frequently, the Search Appliance will recrawl such session-enabled pages indefinitely, treating each URL as a unique page. This inflates our limited search index and places an unnecessary load on both your Web server and the Search Appliance.

The solution is to use cookies instead of URL tokens to pass session data. This is often an option in the Web application settings.

Common session tokens passed via URL

  Web application framework    Token name
  PHP                          PHPSESSID
  ASP                          ASPSESSIONID
  ColdFusion                   CFID, CFTOKEN
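
To check whether your own site exposes session tokens in its links, you can compare URLs with the tokens removed. The following minimal Python sketch shows that two URLs differing only in a session token refer to the same page once the token is dropped. The function names and the example.edu URLs are hypothetical. Matching ASP tokens by prefix is an assumption, since ASP session names typically carry a generated suffix, and both PHPSESSID (PHP's default session name) and the longer spelling PHPSESSIONID are accepted.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Token names compared case-insensitively.
SESSION_TOKENS = {"phpsessid", "phpsessionid", "cfid", "cftoken"}
# Assumption: ASP session names start with this prefix, followed by a suffix.
SESSION_PREFIXES = ("aspsessionid",)


def is_session_token(name):
    """Return True if a query parameter name looks like a session token."""
    name = name.lower()
    return name in SESSION_TOKENS or name.startswith(SESSION_PREFIXES)


def strip_session_tokens(url):
    """Return url with session-token query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not is_session_token(key)]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    first = "http://www.example.edu/page.php?id=7&PHPSESSID=abc123"
    second = "http://www.example.edu/page.php?id=7&PHPSESSID=zzz999"
    # A crawler sees two different URLs, but they name the same page.
    print(strip_session_tokens(first) == strip_session_tokens(second))
```

If links on your pages change under strip_session_tokens, the application is passing session data in the URL, and switching it to cookies will stop the duplicate crawling described above.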
© Regents of the University of Minnesota. All rights reserved.
The University of Minnesota is an equal opportunity educator and employer.

Last modified 2008-05-26 22:18:50 CDT · Retrieved 2008-10-15 14:05:48 CDT · URL http://www.umn.edu/google/webauthors.html