Is there any value in using keywords in the URLs of web pages? Would a search engine look at keywords that you might include in the addresses of your pages, and associate those keywords with the content of your pages in the search engine’s index?
If so, how would a search engine go about looking at the URLs of your pages and breaking them down into meaningful parts to identify keywords?
Breaking URLs down into parts may also play a role in how the pages of a web site might be crawled by a search engine.
A newly published Yahoo patent application gives us some ideas on how a search engine might extract keywords from the URLs of pages and rank them, and how it might use information uncovered in the process to determine which pages to crawl first from a web site.
Techniques for Tokenizing URLs
Invented by Krishna Leela Poola and Arun Ramanujapuram
Assigned to Yahoo
US Patent Application 20090083266
Published March 26, 2009
Filed November 6, 2007
A search engine will look at many different signals to determine what a page on the Web is about, and attempt to rank pages based upon keywords that might be an indication of the subject matter or content of those pages.
Many of those keywords are extracted from the content of pages themselves, but a search engine can look at other information associated with pages, such as the addresses of the pages.
Keywords may also be extracted from the URLs of pages by using an algorithm that breaks each URL into components, interprets the structure of those URLs, and pulls candidate keywords from the different parts found within the URL.
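To make the idea of pulling candidate keywords out of a URL more concrete, here is a minimal sketch in Python. It is not the method described in the patent application. It simply assumes that any run of non-alphanumeric characters acts as a delimiter, and the function name candidate_keywords and the example.com address are made up for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def candidate_keywords(url):
    """Return lowercase candidate tokens from the host, path, and query of a URL."""
    parts = urlsplit(url)

    def split(text):
        # Assumption: any run of non-alphanumeric characters is a delimiter.
        return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

    tokens = split(parts.hostname or "") + split(parts.path)
    # Query arguments are name/value pairs; both sides may hold keywords.
    for name, value in parse_qsl(parts.query):
        tokens += split(name) + split(value)
    return tokens


print(candidate_keywords(
    "http://www.example.com/mens-clothing/shirts.html?category=shirts"
))
```

A real system would presumably go further, for instance by discarding tokens such as "www", "com", and "html" that say little about a page, and by ranking what remains. Those steps are left out here.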
Parts of URLs
The patent application provides a definition for different parts of URLs:
Scheme - This section of a URL identifies the internet protocol used to access a resource, such as HTTP or FTP.
Authority - The part of a URL that identifies the host server where the documents or resources are located, or the domain name.
Path - This is the information following the slash character after the authority, or domain name, and it identifies the specific page or resource.
Query arguments - A string that may follow the path, usually after a “?” character, and that can be broken down into name and value pairs, such as “category=shirts”.
Fragments - A fragment identifies a subsection within a page that might be pointed to in a URL, usually introduced by the “#” symbol.
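The five parts listed above map closely onto what Python's standard urllib.parse module returns, so a short sketch can show them on a sample address. The helper name split_url and the example.com URL are assumptions made for illustration, and the patent application does not say which tools Yahoo might use.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Break a URL into the five parts named in the list above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,        # protocol, such as http or ftp
        "authority": parts.netloc,     # host server or domain name
        "path": parts.path,            # the specific page or resource
        "query_arguments": parse_qsl(parts.query),  # (name, value) pairs
        "fragment": parts.fragment,    # subsection of the page, after "#"
    }


url = "http://www.example.com/clothing/shirts.html?category=shirts&color=blue#reviews"
for name, value in split_url(url).items():
    print(f"{name}: {value}")
```

For the sample address, the query arguments come back as the pairs ("category", "shirts") and ("color", "blue"), and the fragment is "reviews".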