How to Compare Uniform Resource Identifiers
Author: Tim Bray
Abstract
This document discusses issues concerning the comparison of Uniform Resource Identifiers (URIs) and documents common practice.
Introduction
Software is commonly required to compare two URIs. Such comparison is always in respect of some particular purpose, and different software modules might reasonably come to different conclusions about the same pair of URIs. This document uses the terms “different” and “equivalent” to describe the possible outcomes of such comparisons, but, as the discussion of examples and procedures makes clear, there are many possible application-dependent versions of equivalence.
Since URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. This definition of equivalence is not of much practical use for reasons which include:
* Resources may have many different identifiers.
* Web architecture defines how resources are named and how their representations are interchanged, but does not define resource equivalence.
For these reasons, determination of equivalence or difference must be based on string comparison, perhaps augmented by reference to additional rules provided in one or more RFCs.
Software modules performing such comparisons differ in their requirements and therefore their URI equivalence criteria. This document describes a variety of methods which may be used to compare URIs, the trade-offs between them, and the types of applications which might use them.
The expressiveness of URIs is limited by their small character repertoire. The IRI specification currently under development is aimed at addressing this. The material in this note applies equally to URIs and IRIs.
Status of This Document
This is the second draft of this document. It reflects editorial input from members of the TAG and the broader community, but may not represent the consensus of the TAG.
Background
Inevitability of False Negatives
URIs exist to identify resources. A resource, in the Web Architecture, is an abstraction; a URI may in some cases be dereferenced to yield a representation of the resource. Any two different URIs may identify the same resource, in the view of the user or publisher of that resource. Thus, while comparison of two URIs can establish with confidence that they are equivalent and identify the same resource, such comparisons can always yield “false negatives”. Put another way, it is often possible to determine that two URIs are equivalent, but it is never possible to be sure that they identify different resources.
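The asymmetry described above can be illustrated with a minimal sketch. The function name and both example.com URIs are hypothetical, chosen only for illustration; nothing here is prescribed by RFC2396.

```python
def simple_string_equivalent(a, b):
    """Return True only when the two URI strings are character-for-character identical."""
    return a == b


# Hypothetical URIs: a publisher might well serve the same resource at both,
# but plain string comparison has no way of knowing that.
uri_a = "http://example.com/index.html"
uri_b = "http://www.example.com/index.html"

print(simple_string_equivalent(uri_a, uri_a))  # True
print(simple_string_equivalent(uri_a, uri_b))  # False
```

A True result is trustworthy evidence of equivalence. A False result says only that the strings differ, and may be a false negative.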
Rules Governing URIs
The syntax of URIs is defined by RFC2396; the present document cannot really be understood without reference to that RFC. RFC2396 defines a URI as a sequence of characters, with the definition of “character” not tied to any particular form of storage; the characters may be stored on disk one byte per character, in a Java string two bytes per character, painted on the side of a bus, or spoken in conversation.
The repertoire of characters in URIs is limited, comprising a subset of US-ASCII. Certain of these characters have special roles, for example : and /, and may not be otherwise used in URIs.
The world contains many characters useful in identifying resources beyond those in US-ASCII, and special characters such as : and / are often needed in their ordinary, non-special roles as well. RFC2396's “%-escaping” mechanism is helpful in these situations. %-escaping is a two-step process: first, the logical characters in the URI are encoded in some fashion (such as ASCII, UTF-8, or Shift-JIS) as a series of octets; then each octet is represented as a 2-digit hexadecimal code preceded by the percent sign %.
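The two steps can be sketched in a few lines of Python. The helper name percent_escape is hypothetical, and for simplicity it escapes every character it is given, including characters that would not need escaping in a real URI.

```python
def percent_escape(text, encoding="utf-8"):
    """Step 1: encode the characters as octets. Step 2: write each octet as %XX."""
    octets = text.encode(encoding)
    return "".join("%{:02X}".format(octet) for octet in octets)


print(percent_escape("/"))               # %2F
print(percent_escape("日", "utf-8"))      # %E6%97%A5
print(percent_escape("日", "shift_jis"))  # %93%FA
```

The same logical character yields different escaped strings under different encodings, which is one reason that comparing escaped URIs as plain strings can produce false negatives. In practice, the standard library function urllib.parse.quote performs the same two steps while leaving unreserved characters untouched.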
URI Schemes
RFC2396 specifies that every URI has a “scheme”, a leading sequence of characters delimited by a colon character :. Two examples are
Each URI scheme which is appropriately registered with the Internet Assigned Numbers Authority has a governing RFC; for example, HTTP URIs are described by RFC 2616. The syntax and semantics of URIs vary significantly as a function of their scheme. For example, URIs whose scheme is urn (commonly referred to as URNs) are not allowed to contain / characters, and certain parts of HTTP URIs (but not others) are meant to be processed case-insensitively.
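As a rough illustration, Python's urllib.parse.urlsplit separates the scheme from the rest of a URI. It also lower-cases the scheme and the host name while leaving the path untouched, which matches the parts of an HTTP URI that RFC 2616 treats as case-insensitive. The helper name and the example URIs below are hypothetical.

```python
from urllib.parse import urlsplit


def scheme_and_host(uri):
    """Return (scheme, host), both lower-cased; host is None if there is no authority part."""
    parts = urlsplit(uri)
    return parts.scheme, parts.hostname


# Hypothetical HTTP URI written with mixed case.
print(scheme_and_host("HTTP://WWW.Example.COM/Docs/Intro"))  # ('http', 'www.example.com')
print(urlsplit("HTTP://WWW.Example.COM/Docs/Intro").path)    # /Docs/Intro

# Hypothetical URN: it has a scheme but no host.
print(scheme_and_host("urn:example:a123"))                   # ('urn', None)
```

A comparison routine built this way would treat the scheme and host as equal regardless of case but would still distinguish /Docs/Intro from /docs/intro.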
Comparison of Relative URI References
RFC2396 defines a construct called a “URI reference” which differs syntactically from URIs in two ways:
* The URI may be followed by the character # (not otherwise allowed in URIs) and a string called the “fragment identifier”. The semantics of the fragment identifier are specific to the media type of the resource representation. An example is
* Initial portions of the URI may be omitted, producing “relative URI references”. The reference may be made absolute by prepending a “base URI” – there are a variety of mechanisms to establish the base URI. An example of a relative reference is intro#chap1.
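Both constructs can be handled with Python's standard library, as in the sketch below. The base URI is a hypothetical value, and urljoin follows the resolution algorithm of the later RFC 3986 rather than RFC2396, although the two agree for a simple case like this one.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical base URI; in practice it comes from the context of the reference.
base = "http://example.com/docs/"

# Make the relative reference absolute.
absolute = urljoin(base, "intro#chap1")
print(absolute)  # http://example.com/docs/intro#chap1

# Separate the URI proper from the fragment identifier.
uri, fragment = urldefrag(absolute)
print(uri)       # http://example.com/docs/intro
print(fragment)  # chap1
```

Whether a comparison should use the value of uri or the value of absolute depends on whether the application cares about fragment identifiers.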
Two principles apply to the comparison of URI references:
1. In testing for equivalence, it is generally not useful to compare relative URI references; they must be converted to their absolute form before comparison.
2. RFC2396 states that the trailing # and fragment identifier are not part of a URI, and that fragment processing happens in the context of a retrieved representation. Applications may choose to perform comparison operations on either the base URIs or the references including fragment identifiers. It is important that software and specifications be clear about which of these is being done. For example, when asked to navigate to