--- draft-fielding-uri-syntax-02.txt Wed Mar 4 16:00:39 1998 +++ draft-fielding-uri-syntax-03.txt Thu Jun 4 18:25:38 1998 @@ -1,7 +1,7 @@ Network Working Group T. Berners-Lee, MIT/LCS INTERNET-DRAFT R. Fielding, U.C. Irvine -draft-fielding-uri-syntax-02 L. Masinter, Xerox Corporation -Expires six months after publication date March 4, 1998 +draft-fielding-uri-syntax-03 L. Masinter, Xerox Corporation +Expires six months after publication date June 4, 1998 Uniform Resource Identifiers (URI): Generic Syntax @@ -20,11 +20,11 @@ as reference material or to cite them other than as ``work in progress.'' - To learn the current status of any Internet-Draft, please check the - ``1id-abstracts.txt'' listing contained in the Internet-Drafts - Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net - (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East - Coast), or ftp.isi.edu (US West Coast). + To view the entire list of current Internet-Drafts, please check the + "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow + Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern + Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific + Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). Instructions to RFC Editor: This document will obsolete RFC 1738 and RFC 1808. If the new version of the MHTML proposed standard is @@ -39,26 +39,33 @@ A Uniform Resource Identifier (URI) is a compact string of characters for identifying an abstract or physical resource. This document - defines the general syntax of URIs, including both absolute and + defines the generic syntax of URI, including both absolute and relative forms, and guidelines for their use; it revises and replaces the generic definitions in RFC 1738 and RFC 1808. 
+ This document defines a grammar that is a superset of all valid URI, + such that an implementation can parse the common components of a URI + reference without knowing the scheme-specific requirements of every + possible identifier type. This document does not define a generative + grammar for URI; that task will be performed by the individual + specifications of each URI scheme. + 1. Introduction - Uniform Resource Identifiers (URIs) provide a simple and extensible + Uniform Resource Identifiers (URI) provide a simple and extensible means for identifying a resource. This specification of URI syntax and semantics is derived from concepts introduced by the World Wide Web global information initiative, whose use of such objects dates from 1990 and is described in "Universal Resource Identifiers in WWW" - [RFC1630]. The specification of URIs is designed to meet the + [RFC1630]. The specification of URI is designed to meet the recommendations laid out in "Functional Recommendations for Internet Resource Locators" [RFC1736] and "Functional Requirements for Uniform Resource Names" [RFC1737]. This document updates and merges "Uniform Resource Locators" [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in - order to define a single, general syntax for all URIs. It excludes + order to define a single, generic syntax for all URI. It excludes those portions of RFC 1738 that defined the specific syntax of individual URL schemes; those portions will be updated as separate documents, as will the process for registration of new URI schemes. @@ -68,9 +75,9 @@ All significant changes from the prior RFCs are noted in Appendix G. -1.1 Overview of URIs +1.1 Overview of URI - URIs are characterized by the following definitions: + URI are characterized by the following definitions: Uniform Uniformity provides several benefits: it allows different types @@ -102,7 +109,7 @@ Identifier An identifier is an object that can act as a reference to - something that has identity. 
In the case of URIs, the object + something that has identity. In the case of URI, the object is a sequence of characters with a restricted syntax. Having identified a resource, a system may perform a variety of @@ -123,7 +130,7 @@ The URI scheme (Section 3.1) defines the namespace of the URI, and thus may further restrict the syntax and semantics of identifiers using that scheme. This specification defines those elements of the - URI syntax which are either required of all URI schemes or are common + URI syntax that are either required of all URI schemes or are common to many URI schemes. It thus defines the syntax and semantics that are needed to implement a scheme-independent parsing mechanism for URI references, such that the scheme-dependent handling of a URI can @@ -135,7 +142,7 @@ imply that the only way to access the URL's resource is via the named protocol. Gateways, proxies, caches, and name resolution services might be used to access some resources, independent of the protocol - of their origin, and the resolution of some URLs may require the use + of their origin, and the resolution of some URL may require the use of more than one protocol (e.g., both DNS and HTTP are typically used to access an "http" URL's resource when it can't be found in a local cache). @@ -148,7 +155,7 @@ namespace, as defined in "URN Syntax" [RFC2141] and its related specifications. - Most of the examples in this specification demonstrate URLs, since + Most of the examples in this specification demonstrate URL, since they allow the most varied use of the syntax and often have a hierarchical namespace. A parser of the URI syntax is capable of parsing both URL and URN references as a generic URI; once the scheme @@ -156,9 +163,9 @@ generic URI components. In other words, the URI syntax is a superset of the syntax of all URI schemes. -1.3. Example URIs +1.3. Example URI - The following examples illustrate URIs which are in common use. 
+ The following examples illustrate URI that are in common use. ftp://ftp.is.co.za/rfc/rfc1808.txt -- ftp scheme for File Transfer Protocol services @@ -178,20 +185,20 @@ telnet://melvyl.ucop.edu/ -- telnet scheme for interactive services via the TELNET Protocol -1.4. Hierarchical URIs and Relative Forms +1.4. Hierarchical URI and Relative Forms An absolute identifier refers to a resource independent of the context in which the identifier is used. In contrast, a relative identifier refers to a resource by describing the difference within a hierarchical namespace between the current context and an absolute identifier of the resource. - + Some URI schemes support a hierarchical naming system, where the hierarchy of the name is denoted by a "/" delimiter separating the components in the scheme. This document defines a scheme-independent `relative' form of URI reference that can be used in conjunction with a `base' URI (of a hierarchical scheme) to produce another URI. The - syntax of hierarchical URIs is described in Section 3; the relative + syntax of hierarchical URI is described in Section 3; the relative URI calculation is described in Section 5. 1.5. URI Transcribability @@ -219,7 +226,7 @@ represented as a sequence of octets. o A URI may be transcribed from a non-network source, and thus - should consist of characters which are most likely to be able + should consist of characters that are most likely to be able to be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales. @@ -230,7 +237,7 @@ These design concerns are not always in alignment. For example, it is often the case that the most meaningful name for a URI component - would require characters which cannot be typed into some systems. + would require characters that cannot be typed into some systems. 
The ability to transcribe the resource identifier from one medium to another was considered more important than having its URI consist of the most meaningful of components. In local and regional @@ -261,7 +268,7 @@ with * to designate n or more repetitions of the following element; n defaults to 0. - Unlike many specifications which use a BNF-like grammar to define the + Unlike many specifications that use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URI grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that @@ -291,23 +298,25 @@ 2. URI Characters and Escape Sequences - URIs consist of a restricted set of characters, primarily chosen to + URI consist of a restricted set of characters, primarily chosen to aid transcribability and usability both in computer systems and in non-computer communications. Characters used conventionally as - delimiters around URIs were excluded. The restricted set of + delimiters around URI were excluded. The restricted set of characters consists of digits, letters, and a few graphic symbols were chosen from those common to most of the character encodings and input facilities available to Internet users. + uric = reserved | unreserved | escaped + Within a URI, characters are either used as delimiters, or to represent strings of data (octets) within the delimited portions. Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding. This representation is elaborated below. -2.1 URIs and non-ASCII characters +2.1 URI and non-ASCII characters - The relationship between URIs and characters has been a source of + The relationship between URI and characters has been a source of confusion for characters that are not part of US-ASCII. 
To describe the relationship, it is useful to distinguish between a "character" (as a distinguishable semantic entity) and an "octet" (an 8-bit @@ -317,7 +326,7 @@ URI character sequence->octet sequence->original character sequence A URI is represented as a sequence of characters, not as a sequence - of octets. That is because URIs might be "transported" by means that + of octets. That is because URI might be "transported" by means that are not through a computer network, e.g., printed on paper, read over the radio, etc. @@ -353,12 +362,12 @@ charset used. It is expected that a systematic treatment of character encoding - within URIs will be developed as a future modification of this + within URI will be developed as a future modification of this specification. 2.2. Reserved Characters - Many URIs include components consisting of or delimited by, certain + Many URI include components consisting of or delimited by, certain special characters. These characters are called "reserved", since their usage within the URI component is limited to their reserved purpose. If the data for a URI component would conflict with the @@ -368,7 +377,7 @@ reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | "," - The "reserved" syntax class above refers to those characters which + The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax; they are used as delimiters of the components described in Section 3. @@ -381,7 +390,7 @@ 2.3. Unreserved Characters - Data characters which are allowed in a URI but do not have a reserved + Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols. 
@@ -392,7 +401,7 @@ Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used - in a context which does not allow the unescaped character to appear. + in a context that does not allow the unescaped character to appear. 2.4. Escape Sequences @@ -419,7 +428,7 @@ a completed URI might change its semantics. Normally, the only time escape encodings can safely be made is when the URI is being created from its component parts; each component may have its own - set of characters which are reserved, so only the mechanism + set of characters that are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its semantics. Likewise, a URI must be separated into its components @@ -445,7 +454,7 @@ 2.4.3. Excluded US-ASCII Characters Although they are disallowed within the URI syntax, we include here - a description of those US-ASCII characters which have been excluded + a description of those US-ASCII characters that have been excluded and the reasons for their exclusion. The control characters in the US-ASCII coded character set are not @@ -455,15 +464,15 @@ control = The space character is excluded because significant spaces may - disappear and insignificant spaces may be introduced when URIs are + disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of - word-processing programs. Whitespace is also used to delimit URIs + word-processing programs. Whitespace is also used to delimit URI in many contexts. space = The angle-bracket "<" and ">" and double-quote (") characters are - excluded because they are often used as the delimiters around URIs + excluded because they are often used as the delimiters around URI in text documents and protocol fields. 
The character "#" is excluded because it is used to delimit a URI from a fragment identifier in URI references (Section 4). The percent character "%" @@ -484,7 +493,7 @@ 3. URI Syntactic Components The URI syntax is dependent upon the scheme. In general, absolute - URIs are written as follows: + URI are written as follows: : @@ -494,9 +503,9 @@ The URI syntax does not require that the scheme-specific-part have any general structure or set of semantics which is common among all - URIs. However, a subset of URIs do share a common syntax for + URI. However, a subset of URI do share a common syntax for representing hierarchical relationships within the namespace. This - "generic-URI" syntax consists of a sequence of four main components: + "generic URI" syntax consists of a sequence of four main components: ://? @@ -504,20 +513,9 @@ For example, some URI schemes do not allow an component, and others do not use a component. - absoluteURI = generic-URI | opaque-URI - - opaque-URI = scheme ":" *uric - - generic-URI = scheme ":" relativeURI - - The separation of the URI grammar into and - is redundant, since both rules will successfully parse any string of - characters. The distinction is simply to clarify that a - parser of relative URI references (Section 5) will view a URI as a - generic-URI, whereas a handler of absolute references need only view - it as an opaque-URI. + absoluteURI = scheme ":" ( hier_part | opaque_part ) - URIs which are hierarchical in nature use the slash "/" character for + URI that are hierarchical in nature use the slash "/" character for separating hierarchical components. For some file systems, a "/" character (used to denote the hierarchical structure of a URI) is the delimiter used to construct a file name hierarchy, and thus the URI @@ -525,6 +523,25 @@ the resource is a file or that the URI maps to an actual filesystem pathname. + hier_part = ( net_path | abs_path ) [ "?" 
query ] + + net_path = "//" authority [ abs_path ] + + abs_path = "/" path_segments + + URI that do not make use of the slash "/" character for separating + hierarchical components are considered opaque by the generic URI + parser. + + opaque_part = uric_no_slash *uric + + uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | + "&" | "=" | "+" | "$" | "," + + We use the term <path> to refer to both the <abs_path> and + <opaque_part> constructs, since they are mutually exclusive for any + given URI and can be parsed as a single component. + 3.1. Scheme Component Just as there are many different methods of access to resources, @@ -536,13 +553,13 @@ Scheme names consist of a sequence of characters beginning with a lower case letter and followed by any combination of lower case letters, digits, plus ("+"), period ("."), or hyphen ("-"). For - resiliency, programs interpreting URIs should treat upper case + resiliency, programs interpreting URI should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http"). scheme = alpha *( alpha | digit | "+" | "-" | "." ) - Relative URI references are distinguished from absolute URIs in that + Relative URI references are distinguished from absolute URI in that they do not begin with a scheme name. Instead, the scheme is inherited from the base URI, as described in Section 5.2. @@ -597,7 +614,7 @@ Some URL schemes use the format "user:password" in the userinfo field. This practice is NOT RECOMMENDED, because the passing of - authentication information in clear text (such as URIs) has proven to + authentication information in clear text (such as URI) has proven to be a security risk in almost every case where it has been used. The host is a domain name of a network host, or its IPv4 address as @@ -640,7 +657,7 @@ scheme if there is no authority component), identifying the resource within the scope of that scheme and authority. 
- path = [ "/" ] path_segments + path = [ abs_path | opaque_part ] path_segments = segment *( "/" segment ) segment = *pchar *( ";" param ) @@ -671,19 +688,19 @@ The term "URI-reference" is used here to denote the common usage of a resource identifier. A URI reference may be absolute or relative, and may have additional information attached in the form of a - fragment identifier. However, "the URI" which results from such a + fragment identifier. However, "the URI" that results from such a reference includes only the absolute URI after the fragment identifier (if any) is removed and after any relative URI is resolved to its absolute form. Although it is possible to limit the discussion of URI syntax and semantics to that of the absolute - result, most usage of URIs is within general URI references, and it + result, most usage of URI is within general URI references, and it is impossible to obtain the URI from such a reference without also parsing the fragment and resolving the relative form. URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] - The syntax for relative URIs is a shortened form of that for absolute - URIs, where some prefix of the URI is missing and certain path + The syntax for relative URI is a shortened form of that for absolute + URI, where some prefix of the URI is missing and certain path components ("." and "..") have a special meaning when interpreting a relative path. The relative URI syntax is defined in Section 5. @@ -703,7 +720,7 @@ in the reference. Therefore, the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result. The character restrictions described in Section 2 - for URIs also apply to the fragment in a URI-reference. Individual + for URI also apply to the fragment in a URI-reference. 
Individual media types may define additional restrictions or structure within the fragment for specifying different types of "partial views" that can be identified within that media type. @@ -714,7 +731,7 @@ 4.2. Same-document References - A URI reference which does not contain a URI is a reference to the + A URI reference that does not contain a URI is a reference to the current document. In other words, an empty URI reference within a document is interpreted as a reference to the start of that document, and a reference containing only a fragment identifier is a reference @@ -732,9 +749,7 @@ components and fragment identifier in order to determine what components are present and whether the reference is relative or absolute. The individual components are then parsed for their - subparts and to verify their validity. A reference is parsed as if - it is a generic-URI, even though it might be considered opaque by - later processes. + subparts and, if not opaque, to verify their validity. Although the BNF defines what is allowed in each component, it is ambiguous in terms of differentiating between an authority component @@ -749,38 +764,43 @@ 5. Relative URI References It is often the case that a group or "tree" of documents has been - constructed to serve a common purpose; the vast majority of URIs in + constructed to serve a common purpose; the vast majority of URI in these documents point to resources within the tree rather than outside of it. Similarly, documents located at a particular site are much more likely to refer to other resources at that site than to resources at remote sites. - Relative addressing of URLs allows document trees to be partially + Relative addressing of URI allows document trees to be partially independent of their location and access scheme. 
For instance, it is possible for a single set of hypertext documents to be simultaneously accessible and traversable via each of the "file", "http", and "ftp" - schemes if the documents refer to each other using relative URIs. + schemes if the documents refer to each other using relative URI. Furthermore, such document trees can be moved, as a whole, without changing any of the relative references. Experience within the WWW has demonstrated that the ability to perform relative referencing - is necessary for the long-term usability of embedded URLs. + is necessary for the long-term usability of embedded URI. - relativeURI = net_path | abs_path | rel_path + The syntax for relative URI takes advantage of the syntax + of (Section 3) in order to express a reference that is + relative to the namespace of another hierarchical URI. - A relative reference beginning with two slash characters is termed a - network-path reference. Such references are rarely used. + relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] - net_path = "//" authority [ abs_path ] + A relative reference beginning with two slash characters is termed a + network-path reference, as defined by in Section 3. Such + references are rarely used. A relative reference beginning with a single slash character is - termed an absolute-path reference. + termed an absolute-path reference, as defined by in + Section 3. - abs_path = "/" rel_path - - A relative reference which does not begin with a scheme name or a + A relative reference that does not begin with a scheme name or a slash character is termed a relative-path reference. - rel_path = [ path_segments ] [ "?" query ] + rel_path = rel_segment [ abs_path ] + + rel_segment = 1*( unreserved | escaped | + ";" | "@" | "&" | "=" | "+" | "$" | "," ) Within a relative-path reference, the complete path segments "." and ".." 
have special meanings: "the current hierarchy level" and "the @@ -797,18 +817,18 @@ segments (e.g., "./this:that") in order for them to be referenced as a relative path. - It is not necessary for all URIs within a given scheme to be - restricted to the generic-URI syntax, since the hierarchical - properties of that syntax are only necessary when relative URIs are + It is not necessary for all URI within a given scheme to be + restricted to the syntax, since the hierarchical + properties of that syntax are only necessary when relative URI are used within a particular document. Documents can only make use of - relative URIs when their base URI fits within the generic-URI syntax. + relative URI when their base URI fits within the syntax. It is assumed that any document which contains a relative reference will also have a base URI that obeys the syntax. In other words, - relative URIs cannot be used within a document that has an unsuitable + relative URI cannot be used within a document that has an unsuitable base URI. Some URI schemes do not allow a hierarchical syntax matching the - generic-URI syntax, and thus cannot use relative references. + syntax, and thus cannot use relative references. 5.1. Establishing a Base URI @@ -816,7 +836,7 @@ URI" against which the relative reference is applied. Indeed, the base URI is necessary to define the semantics of any relative URI reference; without it, a relative reference is meaningless. In order - for relative URIs to be usable within a document, the base URI of + for relative URI to be usable within a document, the base URI of that document must be known to the parser. The base URI of a document can be established in one of four ways, @@ -893,15 +913,15 @@ application. It is the responsibility of the distributor(s) of a document - containing relative URIs to ensure that the base URI for that + containing relative URI to ensure that the base URI for that document can be established. 
It must be emphasized that relative - URIs cannot be used reliably in situations where the document's + URI cannot be used reliably in situations where the document's base URI is not well-defined. 5.2. Resolving Relative References to Absolute Form This section describes an example algorithm for resolving URI - references which might be relative to a given base URI. + references that might be relative to a given base URI. The base URI is established according to the rules of Section 5.1 and parsed into the four main components as described in Section 3. @@ -928,6 +948,17 @@ absolute URI and we are done. Otherwise, the reference URI's scheme is inherited from the base URI's scheme component. + Due to a loophole in prior specifications [RFC1630], some parsers + allow the scheme name to be present in a relative URI if it is the + same as the base URI scheme. Unfortunately, this can conflict + with the correct parsing of non-hierarchical URI. For backwards + compatibility, an implementation may work around such references + by removing the scheme if it matches that of the base URI and the + scheme is known to always use the syntax. The parser + can then continue with the steps below for the remainder of the + reference components. Validating parsers should mark such a + misformed relative reference as an error. + 4) If the authority component is defined, then the reference is a network-path and we skip to step 7. Otherwise, the reference URI's authority is inherited from the base URI's authority @@ -1025,7 +1056,7 @@ 6. URI Normalization and Equivalence In many cases, different URI strings may actually identify the - identical resource. For example, the host names used in URLs are + identical resource. For example, the host names used in URL are actually case insensitive, and the URL is equivalent to . In general, the rules for equivalence and definition of a normal form, if any, are scheme @@ -1054,17 +1085,17 @@ cause a possibly damaging remote operation to occur. 
The unsafe URL is typically constructed by specifying a port number other than that reserved for the network protocol in question. The client - unwittingly contacts a site which is in fact running a different - protocol. The content of the URL contains instructions which, when + unwittingly contacts a site that is in fact running a different + protocol. The content of the URL contains instructions that, when interpreted according to this other protocol, cause an unexpected - operation. An example has been the use of gopher URLs to cause an + operation. An example has been the use of a gopher URL to cause an unintended or impersonating message to be sent via a SMTP server. - Caution should be used when using any URL which specifies a port + Caution should be used when using any URL that specifies a port number other than the default for the protocol, especially when it is a number within the reserved space. - Care should be taken when URLs contain escaped delimiters for a + Care should be taken when a URL contains escaped delimiters for a given protocol (for example, CR and LF characters for telnet protocols) that these are not unescaped before transmission. This might violate the protocol, but avoids the potential for such @@ -1125,7 +1156,7 @@ [RFC1034] Mockapetris, P. "Domain Names - Concepts and Facilities", STD 13, RFC 1034, USC/Information Sciences Institute, November 1987. -[RFC2110] Palme, J., and A. Hopmann. "MIME E-mail Encapsulation of +[RFC2110] Palme, J., and A. Hopmann. "MIME E-mail Encapsulation of Aggregate Documents, such as HTML (MHTML)", RFC 2110, Stockholm University/KTH, Microsoft Corporation, March 1997. @@ -1156,7 +1187,7 @@ University of California, Irvine Irvine, CA 92697-3425 - Fax: +1(714)824-1715 + Fax: +1(949)824-1715 EMail: fielding@ics.uci.edu @@ -1171,17 +1202,24 @@ Appendices -A. Collected BNF for URIs +A. 
Collected BNF for URI URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] - absoluteURI = generic-URI | opaque-URI - opaque-URI = scheme ":" *uric - generic-URI = scheme ":" relativeURI + absoluteURI = scheme ":" ( hier_part | opaque_part ) + relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] + + hier_part = ( net_path | abs_path ) [ "?" query ] + opaque_part = uric_no_slash *uric + + uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | + "&" | "=" | "+" | "$" | "," - relativeURI = net_path | abs_path | rel_path net_path = "//" authority [ abs_path ] - abs_path = "/" rel_path - rel_path = [ path_segments ] [ "?" query ] + abs_path = "/" path_segments + rel_path = rel_segment [ abs_path ] + + rel_segment = 1*( unreserved | escaped | + ";" | "@" | "&" | "=" | "+" | "$" | "," ) scheme = alpha *( alpha | digit | "+" | "-" | "." ) @@ -1202,7 +1240,7 @@ IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit port = *digit - path = [ "/" ] path_segments + path = [ abs_path | opaque_part ] path_segments = segment *( "/" segment ) segment = *pchar *( ";" param ) param = *pchar @@ -1239,7 +1277,7 @@ B. Parsing a URI Reference with a Regular Expression - As described in Section 4.3, the generic-URI syntax is not sufficient + As described in Section 4.3, the generic URI syntax is not sufficient to disambiguate the components of some forms of URI. Since the "greedy algorithm" described in that section is identical to the disambiguation method used by POSIX regular expressions, it is @@ -1291,7 +1329,7 @@ http://a/b/c/d;p?q - the relative URIs would be resolved as follows: + the relative URI would be resolved as follows: C.1. Normal Examples @@ -1363,7 +1401,7 @@ g;x=1/../y = http://a/b/c/y All client applications remove the query component from the base URI - before resolving relative URIs. However, some applications fail to + before resolving relative URI. 
However, some applications fail to separate the reference's query and/or fragment components from a relative path before merging it with the base path. This error is rarely noticed, since typical usage of a fragment never includes the @@ -1377,12 +1415,11 @@ Some parsers allow the scheme name to be present in a relative URI if it is the same as the base URI scheme. This is considered to be - a loophole in prior specifications of partial URIs [RFC1630]. Its + a loophole in prior specifications of partial URI [RFC1630]. Its use should be avoided. - http:g = http:g - http: = http: - + http:g = http:g ; for validating parsers + | http://a/b/c/g ; for backwards compatibility D. Embedding the Base URI in HTML documents @@ -1396,7 +1433,7 @@ HTML defines a special element "BASE" which, when present in the "HEAD" portion of a document, signals that the parser should use the BASE element's "HREF" attribute as the base URI for resolving - any relative URIs. The "HREF" attribute must be an absolute URI. + any relative URI. The "HREF" attribute must be an absolute URI. Note that, in HTML, element and attribute names are case-insensitive. For example: @@ -1417,18 +1454,18 @@ obtained. -E. Recommendations for Delimiting URIs in Context +E. Recommendations for Delimiting URI in Context - URIs are often transmitted through formats which do not provide a + URI are often transmitted through formats that do not provide a clear context for their interpretation. For example, there are - many occasions when URIs are included in plain text; examples + many occasions when URI are included in plain text; examples include text sent in electronic mail, USENET news messages, and, most importantly, printed on paper. In such cases, it is important to be able to delimit the URI from the rest of the text, and in particular from punctuation marks that might be mistaken for part of the URI. 
-   In practice, URIs are delimited in a variety of ways, but usually
+   In practice, URI are delimited in a variety of ways, but usually
    within double-quotes "http://test.com/", angle brackets , or just using whitespace

@@ -1441,7 +1478,7 @@
    (separated from the URI with a "#" character).

    In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
-   may need to be added to break long URIs across lines. The
+   may need to be added to break long URI across lines. The
    whitespace should be ignored when extracting the URI.

    No whitespace should be introduced after a hyphen ("-") character.
@@ -1452,13 +1489,13 @@
    that the hyphen may or may not actually be part of the URI.

    Using <> angle brackets around each URI is especially recommended
-   as a delimiting style for URIs that contain whitespace.
+   as a delimiting style for URI that contain whitespace.

    The prefix "URL:" (with or without a trailing space) was
    recommended as a way to used to help distinguish a URL from other
    bracketed designators, although this is not common in practice.

-   For robustness, software that accepts user-typed URIs should
+   For robustness, software that accepts user-typed URI should
    attempt to recognize and strip both delimiters and embedded
    whitespace.

@@ -1514,12 +1551,12 @@
    given that they are not part of the URI, but are part of the URI
    syntax and parsing concerns. In addition, it provides a reference
    definition for use by other IETF specifications (HTML, HTTP, etc.)
-   which have previously attempted to redefine the URI syntax in order
+   that have previously attempted to redefine the URI syntax in order
    to account for the presence of fragment identifiers in URI
    references.

    Section 2.4 was rewritten to clarify a number of misinterpretations
-   and to leave room for fully internationalized URIs.
+   and to leave room for fully internationalized URI.

    Appendix F on abbreviated URLs was added to describe the shortened
    references often seen on television and magazine advertisements and
@@ -1542,7 +1579,7 @@
    set of characters with a reserved purpose (i.e., as meaning
    something other than the data to which the characters correspond),
    and that this set was fixed by the URI scheme. However, this has
-   not been true in practice; any character which is interpreted
+   not been true in practice; any character that is interpreted
    differently when it is escaped is, in effect, reserved.
    Furthermore, the interpreting engine on a HTTP server is often
    dependent on the resource, not just the URI scheme. The
@@ -1556,6 +1593,9 @@
    since it is extensively used on the Internet in spite of the
    difficulty to transcribe it with some keyboards.

+   The syntax for URI scheme has been changed to require that all
+   schemes begin with an alpha character.
+
    The "user:password" form in the previous BNF was changed to
    a "userinfo" token, and the possibility that it might be
    "user:password" made scheme specific. In particular, the use
@@ -1577,7 +1617,7 @@
    describe the parsing algorithm. RFC 1630 never had this problem,
    since it considered the slash to be part of the path. In writing
    this specification, it was found to be impossible to accurately
-   describe and retain the difference between the two URIs
+   describe and retain the difference between the two URI
    and without either
    considering the slash to be part of the path (as corresponds to
    actual practice) or creating a separate component just
@@ -1597,7 +1637,7 @@
    expected to handle the case where the ":" separator between host
    and port is supplied without a port.

-   The recommendations for delimiting URIs in context (Appendix E) have
+   The recommendations for delimiting URI in context (Appendix E) have
    been adjusted to reflect current practice.

 G.4. Modifications from RFC 1808

@@ -1617,9 +1657,9 @@
    MHTML [RFC2110].

    RFC 1808 described various schemes as either having or not having the
-   properties of the generic-URI syntax. However, the only requirement
+   properties of the generic URI syntax. However, the only requirement
    is that the particular document containing the relative references
-   have a base URI which abides by the generic-URI syntax, regardless of
+   have a base URI that abides by the generic URI syntax, regardless of
    the URI scheme, so the associated description has been updated to
    reflect that.

@@ -1636,6 +1676,10 @@
    has been removed from the algorithm for resolving a relative URI
    reference. The resolution examples in Appendix C have been
    modified to reflect this change.
+
+   Implementations are now allowed to work around misformed relative
+   references that are prefixed by the same scheme as the base URI,
+   but only for schemes known to use the syntax.

 H. Full Copyright Statement