*** draft-fielding-url-syntax-02.txt Sat Dec 7 03:59:22 1996 --- draft-fielding-url-syntax-03.txt Mon Jan 20 20:34:55 1997 *************** *** 1,18 **** - - Network Working Group T. Berners-Lee INTERNET-DRAFT MIT/LCS ! R. Fielding Expires six months after publication date. U.C. Irvine L. Masinter Xerox Corporation ! ! 07 December 1996 Uniform Resource Locators (URL) - Status of this Memo This document is an Internet-Draft. Internet-Drafts are working --- 1,14 ---- Network Working Group T. Berners-Lee INTERNET-DRAFT MIT/LCS ! R. Fielding Expires six months after publication date. U.C. Irvine L. Masinter Xerox Corporation ! 29 December 1996 Uniform Resource Locators (URL) Status of this Memo This document is an Internet-Draft. Internet-Drafts are working *************** *** 35,44 **** Issues: 1. We need to define a mechanism for using IPv6 addresses in the URL hostname which will not break existing systems too badly. ! 2. Section 6 (New URL Schemes) needs input from the Applications ! Area A.D.'s. ! ! Abstract A Uniform Resource Locator (URL) is a compact string representation --- 31,46 ---- Issues: 1. We need to define a mechanism for using IPv6 addresses in the URL hostname which will not break existing systems too badly. ! 2. Need a specific reference to the documents ! defining Content-Base and Content-Language. ! 3. Examples should include one with multiple parameters and ! one with multiple queries. ! 4. Suggestion to include a 'normalization' algorithm. Should we? ! 5. Are there semantics for empty fragment identifiers? ! 6. Clarify the issue with http://4kids/blah, where a non-FQDN is used. ! 7. Add [MHTML] reference ! 8. URN/URI/URL issue ! Abstract A Uniform Resource Locator (URL) is a compact string representation *************** *** 45,54 **** of a location for use in identifying an abstract or physical resource. This document defines the general syntax and semantics of URLs, including both absolute and relative locators, and guidelines ! 
for their use and for the definition of new URL schemes. It revises ! and replaces the generic definitions in RFC 1738 and RFC 1808. - 1. Introduction Uniform Resource Locators (URLs) provide a simple and extensible --- 47,55 ---- of a location for use in identifying an abstract or physical resource. This document defines the general syntax and semantics of URLs, including both absolute and relative locators, and guidelines ! for their use. It revises and replaces the generic definitions in ! RFC 1738 and RFC 1808. 1. Introduction Uniform Resource Locators (URLs) provide a simple and extensible *************** *** 58,70 **** objects dates from 1990 and is described in "Universal Resource Identifiers in WWW", RFC 1630 [1]. The specification of URLs is designed to meet the recommendations laid out in "Functional ! Recommendations for Internet Resource Locators", RFC 1736 [8]. This document updates and merges RFC 1738 "Uniform Resource Locators" ! [2] and RFC 1808 "Relative Uniform Resource Locators" [7] in order to define a single, general syntax for all URLs. It excludes those portions of RFC 1738 that defined the specific syntax of individual ! URL schemes; those portions will be updated as separate documents. All significant changes from the prior RFCs are noted in Appendix F. URLs are characterized by the following definitions: --- 59,73 ---- objects dates from 1990 and is described in "Universal Resource Identifiers in WWW", RFC 1630 [1]. The specification of URLs is designed to meet the recommendations laid out in "Functional ! Recommendations for Internet Resource Locators", RFC 1736 [9]. This document updates and merges RFC 1738 "Uniform Resource Locators" ! [2] and RFC 1808 "Relative Uniform Resource Locators" [6] in order to define a single, general syntax for all URLs. It excludes those portions of RFC 1738 that defined the specific syntax of individual ! URL schemes; those portions will be updated as separate documents, ! 
as will the process for registration of new URL schemes. ! All significant changes from the prior RFCs are noted in Appendix F. URLs are characterized by the following definitions: *************** *** 112,118 **** URLs are a subset of Uniform Resource Identifiers (URI), which also includes the notion of Uniform Resource Names (URN). A URN differs from a URL in that it identifies a resource in a location-independent ! fashion (see RFC 1737, [10]). URNs are defined by a separate set of specifications. Although this specification restricts its discussion to URLs, the --- 115,121 ---- URLs are a subset of Uniform Resource Identifiers (URI), which also includes the notion of Uniform Resource Names (URN). A URN differs from a URL in that it identifies a resource in a location-independent ! fashion (see RFC 1737, [11]). URNs are defined by a separate set of specifications. Although this specification restricts its discussion to URLs, the *************** *** 125,140 **** The following examples illustrate URLs which are in common use. ! ftp://ds.internic.net/rfc/rfc1808.txt -- ftp scheme for File Transfer Protocol services gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles -- gopher scheme for Gopher and Gopher+ Protocol services ! http://www.ics.uci.edu/pub/ietf/uri/ -- http scheme for Hypertext Transfer Protocol services ! mailto:masinter@parc.xerox.com -- mailto scheme for electronic mail addresses news:comp.infosystems.www.servers.unix --- 128,143 ---- The following examples illustrate URLs which are in common use. ! ftp://ftp.is.co.za/rfc/rfc1808.txt -- ftp scheme for File Transfer Protocol services gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles -- gopher scheme for Gopher and Gopher+ Protocol services ! http://www.math.uio.no/faq/compression-faq/part1.html -- http scheme for Hypertext Transfer Protocol services ! 
mailto:mduerst@ifi.unizh.ch -- mailto scheme for electronic mail addresses news:comp.infosystems.www.servers.unix *************** *** 143,150 **** telnet://melvyl.ucop.edu/ -- telnet scheme for interactive services via the TELNET Protocol ! Many other URL schemes have been defined. Section 6 describes how ! new schemes are defined and registered. The scheme defines the namespace of the URL. Although many URL schemes are named after protocols, this does not imply that the only --- 146,152 ---- telnet://melvyl.ucop.edu/ -- telnet scheme for interactive services via the TELNET Protocol ! Many other URL schemes have been defined. The scheme defines the namespace of the URL. Although many URL schemes are named after protocols, this does not imply that the only *************** *** 158,165 **** 1.3. URL Transcribability ! The URL syntax has been designed to promote transcribability over all ! other concerns. A URL is a sequence of characters, i.e., letters, digits, and special characters. A URL may be represented in a variety of ways: e.g., ink on paper, pixels on a screen, or a sequence of octets in a coded character set. The interpretation of a --- 160,167 ---- 1.3. URL Transcribability ! The URL syntax has been designed to promote transcribability as one ! of its main concerns. A URL is a sequence of characters, i.e., letters, digits, and special characters. A URL may be represented in a variety of ways: e.g., ink on paper, pixels on a screen, or a sequence of octets in a coded character set. The interpretation of a *************** *** 182,189 **** o A URL may be transcribed from a non-network source, and thus should consist of characters which are most likely to be able to be typed into a computer, within the constraints imposed by ! keyboards (and related input devices) across nationalities and ! languages. 
o A URL often needs to be remembered by people, and it is easier for people to remember a URL when it consists of meaningful --- 184,191 ---- o A URL may be transcribed from a non-network source, and thus should consist of characters which are most likely to be able to be typed into a computer, within the constraints imposed by ! keyboards (and related input devices) across languages and ! locales. o A URL often needs to be remembered by people, and it is easier for people to remember a URL when it consists of meaningful *************** *** 192,201 **** These design concerns are not always in alignment. For example, it is often the case that the most meaningful name for a URL component would require characters which cannot be typed on most keyboards. ! In such cases, the ability to access a resource is considered more important than having its URL consist of the most meaningful of components. 1.4. Syntax Notation and Common Elements This document uses two conventions to describe and define the syntax --- 194,208 ---- These design concerns are not always in alignment. For example, it is often the case that the most meaningful name for a URL component would require characters which cannot be typed on most keyboards. ! The ability to transcribe the resource ! location from one medium to another was considered more important than having its URL consist of the most meaningful of components. + In a few cases, exceptions were made for characters already in + widespread use within URLs: the "~", "$" and "#" characters might + have otherwise been excluded from URLs. + 1.4. Syntax Notation and Common Elements This document uses two conventions to describe and define the syntax *************** *** 211,217 **** the syntax requirements. The second convention is a BNF-like grammar, used to define the ! formal URL syntax. The grammar is that of RFC 822 [6], except that "|" is used to designate alternatives. 
Briefly, rules are separated from definitions by an equal "=", indentation is used to continue a rule definition over more than one line, literals are quoted with "", --- 218,224 ---- the syntax requirements. The second convention is a BNF-like grammar, used to define the ! formal URL syntax. The grammar is that of RFC 822 [5], except that "|" is used to designate alternatives. Briefly, rules are separated from definitions by an equal "=", indentation is used to continue a rule definition over more than one line, literals are quoted with "", *************** *** 231,243 **** The following definitions are common to many elements: ! alpha = lowalpha | hialpha lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ! hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" --- 238,250 ---- The following definitions are common to many elements: ! alpha = lowalpha | upalpha lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ! upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" *************** *** 246,296 **** alphanum = alpha | digit The complete URL syntax is collected in Appendix A. 2. URL Characters and Character Escaping ! All URLs consist of a restricted set of characters, chosen to ! maximize their transcribability and usability across varying computer ! systems, natural languages, and nationalities. This restricted set ! corresponds to a subset of the graphic printable characters of the ! US-ASCII coded character set [11]. ! ! 
The set of characters allowed for use within URLs can be described in three categories: reserved, unreserved, and escaped. ! urlchar = reserved | unreserved | escaped 2.1. Reserved Characters Many URLs include components consisting of, or delimited by, certain special characters. These characters are called "reserved", since their usage within the URL component is limited to their reserved ! purpose. If the data characters for a URL component would conflict ! with the reserved purpose, then the conflicting characters must be escaped before forming the URL. reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" ! This specification uses the "reserved" set to refer to those characters which are allowed within a URL, but which may not be allowed within a particular component of the generic URL syntax; they are used as delimiters of the components described in Section 4.3. ! Characters in the "reserved" set are not always reserved. The set of ! characters actually reserved within any given URL component is ! defined by that component. In general, a character is reserved if ! escaping that character would change the semantics of the URL. 2.2. Unreserved Characters Data characters which are allowed in a URL but do not have a reserved purpose are called unreserved. These include upper and lower case ! letters, decimal digits, and a subset of the punctuation marks and ! symbols found in US-ASCII. ! unreserved = alpha | digit | mark mark = "$" | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" | "," --- 253,341 ---- alphanum = alpha | digit + The complete URL syntax is collected in Appendix A. 2. URL Characters and Character Escaping ! All URLs consist of a restricted set of characters, primarily chosen ! to aid transcribability and usability both in computer ! systems and in non-computer communications. In addition, characters ! used conventionally as delimiters around URLs were excluded. The ! restricted set of characters consists of digits, letters, and a few ! 
graphic symbols corresponding to a subset of the graphic printable ! characters of the US-ASCII coded character set [12]; they are ! common to most of the character encodings and input facilities ! available to Internet users. ! ! Within a URL, characters are either used as delimiters, or to ! represent strings of data (octets) within delimited portions. When ! used to represent data directly, the character denotes the octet ! corresponding to the US-ASCII code for that character. In ! addition, an octet may be represented by an escaped encoding. ! ! Thus, the set of "characters" allowed within URLs can be described in three categories: reserved, unreserved, and escaped. ! urlc = reserved | unreserved | escaped + 1.5. Characters, octets, and encodings + + URLs are sequences of characters. Parts of those sequences of + characters are then used to represent sequences of octets. In turn, + sequences of octets are (frequently) used (with a character + encoding scheme) to represent characters. This means that when + dealing with URLs it's necessary to work at three levels: + + represented characters + ^ + | + v + octets + ^ + | + v + URL characters + + This looks more complicated than necessary if all one is dealing + with is file names in ASCII, but is necessary when dealing with the + wide variety of systems in use. URL characters may represent octets + directly or with escape sequences (Section 2.3). Octets may + sometimes represent characters in ASCII, in other character + encodings, or sometimes be used to represent data that does not + correspond to characters at all. + 2.1. Reserved Characters Many URLs include components consisting of, or delimited by, certain special characters. These characters are called "reserved", since their usage within the URL component is limited to their reserved ! purpose. If the data for a URL component would conflict ! with the reserved purpose, then the conflicting data must be escaped before forming the URL. 
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" ! The "reserved" syntax class above refers to those characters which are allowed within a URL, but which may not be allowed within a particular component of the generic URL syntax; they are used as delimiters of the components described in Section 4.3. ! Characters in the "reserved" set are not reserved in all contexts. ! The set of characters actually reserved within any given URL ! component is defined by that component. In general, a character is ! reserved if the semantics of the URL changes if the character is ! replaced with its escaped ASCII encoding. 2.2. Unreserved Characters Data characters which are allowed in a URL but do not have a reserved purpose are called unreserved. These include upper and lower case ! letters, decimal digits, and a limited set of punctuation marks and ! symbols. ! unreserved = alphanum | mark mark = "$" | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" | "," *************** *** 299,353 **** of the URL, but this should not be done unless the URL is being used in a context which does not allow the unescaped character to appear. ! 2.3. Escaped Characters ! A character must be escaped if it is non-printable, if it is often ! used to delimit a URL from its context, if it is not found in ! the US-ASCII coded character set, if it is known to cause problems ! when passed through some e-mail gateways, or if it is being used as ! normal data within a component in which it is reserved. Other ! characters should not be escaped unless the context of their use ! requires it. 2.3.1. Escaped Encoding ! An escaped character is encoded as a character triplet, consisting of ! the percent character "%" followed by the two hexadecimal digits ! representing the character's octet code in an 8-bit coded character ! set. For example, "%20" is the escaped encoding for the space ! character. 
escaped = "%" hex hex hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f" - The 8-bit coded character set of the octet must be a superset of the - US-ASCII coded character set, such that the US-ASCII characters have - the same escaped encoding regardless of the larger octet character - set. The coded character set chosen must correspond to the character - set of the mechanism that will interpret the URL component in which - the escaped character is used. A sequence of escape triplets are - used if the character is coded as a sequence of octets. - - Any character, from any character set, can be included in a URL via - the escaped encoding, provided that the mechanism which will - interpret the URL has an octet encoding for that character. However, - only that mechanism (the originator of the URL) can determine which - character is represented by the octet. A client without knowledge of - the origination mechanism cannot unescape the character for display. - 2.3.2. When to Escape and Unescape A URL is always in an escaped form, since escaping or unescaping a ! completed URL might change its semantics. The only time that ! characters within a URL can be safely escaped is when the URL is being created from its component parts. Each component may have its own set of characters which are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its ! semantics. Likewise, a URL must be separated into its components before the escaped characters within those components can be ! safely unescaped. Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to --- 344,387 ---- of the URL, but this should not be done unless the URL is being used in a context which does not allow the unescaped character to appear. ! 2.3. Escape Sequences ! 
Data must be escaped if it does not have a representation using an ! unreserved character; this includes data that does not correspond ! to a printable character of the US-ASCII coded character set, and ! also data that corresponds to characters used to delimit a URL from ! its context. 2.3.1. Escaped Encoding ! An escaped octet is encoded as a character triplet, consisting ! of the percent character "%" followed by the two hexadecimal digits ! representing the octet code. For example, "%20" is the escaped ! encoding for the US-ASCII space character. escaped = "%" hex hex hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f" 2.3.2. When to Escape and Unescape A URL is always in an escaped form, since escaping or unescaping a ! completed URL might change its semantics. Normally, the only time ! escape encodings can safely be made is when the URL is being created from its component parts. Each component may have its own set of characters which are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its ! semantics. Likewise, a URL must be separated into its components before the escaped characters within those components can be ! safely decoded. ! ! In some cases, data that could be represented by an unreserved ! character may appear escaped; for example, some of the unreserved ! mark characters are automatically escaped by some systems. It ! is safe to unescape these within the body of a URL. ! For example, "%7e" is sometimes used instead of "~" in an http URL ! path, but the two can be used interchangeably. Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to *************** *** 357,376 **** data character as another escaped character, or vice versa in the case of escaping an already escaped string. 
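The ordering rules in Section 2.3.2 (escape each component before assembling the URL, split into components before unescaping, and never escape an already escaped string) can be illustrated with a short sketch. This is not part of either draft. It assumes Python's standard urllib.parse module, whose set of always-unescaped characters (letters, digits, "_", ".", "-", "~") is narrower than the "mark" set defined above, so it escapes more than the draft strictly requires. The helper names build_path and split_path are hypothetical.

```python
from urllib.parse import quote, unquote


def build_path(segments):
    """Escape each path segment on its own, then join with the "/" delimiter."""
    # safe="" means "/" inside the data is escaped too, so it cannot be
    # mistaken for the hierarchy delimiter once the path is assembled.
    return "/".join(quote(seg, safe="") for seg in segments)


def split_path(path):
    """Split on the "/" delimiter first, then unescape each segment."""
    return [unquote(seg) for seg in path.split("/")]


# Hypothetical data: one segment contains "/" and another contains "%".
segments = ["docs", "a/b", "50%"]

path = build_path(segments)             # "%" in the data becomes "%25"
round_trip = split_path(path)           # recovers the original segments

# Unescaping the completed path before splitting turns the data "/"
# into a delimiter, so the segment boundaries are lost.
wrong_order = unquote(path).split("/")

# Escaping an already escaped string rewrites every "%" as "%25",
# so one later unescape no longer yields the original data.
twice = quote(path, safe="/")
```

In this sketch, split_path(build_path(segments)) returns the original segments, while wrong_order contains four segments instead of three. The "%7e" case mentioned above behaves as described: unquote("%7e") yields "~".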
- An exception to the unescaping rules is allowed when it is known that - some older systems are escaping a character that does not need to be - escaped, and when it is possible to reliably discriminate between - such an escaped data character and any reserved use for that - character. For example, it is generally safe to unescape "%7e" when - it occurs near the beginning of an http URL path, since many older - systems automatically escape the "~" character even though it is - unreserved. - 2.3.3. Excluded Characters Although they are not used within the URL syntax, we include here a ! description of those characters which have been excluded and the ! reasons for their exclusion. excluded = control | space | delims | unwise | national --- 391,401 ---- data character as another escaped character, or vice versa in the case of escaping an already escaped string. 2.3.3. Excluded Characters Although they are not used within the URL syntax, we include here a ! description of those US-ASCII characters which have been excluded ! and the reasons for their exclusion. excluded = control | space | delims | unwise | national *************** *** 393,405 **** excluded because they are often used as the delimiters around URLs in text documents and protocol fields. The character "#" is excluded because it is used to delimit a URL from a fragment identifier in URL ! references. The percent character "%" is excluded because it is used for the encoding of escaped characters. delims = "<" | ">" | "#" | "%" | <"> Other characters are excluded because gateways and other transport ! agents are known to sometimes modify such characters. unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" --- 418,431 ---- excluded because they are often used as the delimiters around URLs in text documents and protocol fields. The character "#" is excluded because it is used to delimit a URL from a fragment identifier in URL ! references (Section 3). 
The percent character "%" is excluded because it is used for the encoding of escaped characters. delims = "<" | ">" | "#" | "%" | <"> Other characters are excluded because gateways and other transport ! agents are known to sometimes modify such characters, or they are ! used as delimiters. unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" *************** *** 410,420 **** national = ! Excluded characters must be escaped in order to be properly ! represented within a URL. However, there do exist some systems that ! allow characters from the "unwise" and "national" sets to be used in ! URL references; a robust implementation should be prepared to handle ! those characters when it is possible to do so. 3. URL References --- 436,447 ---- national = ! Data corresponding to excluded characters must be escaped in order ! to be properly represented within a URL. However, there do exist ! some systems that allow characters from the "unwise" and "national" ! sets to be used in URL references (section 3); a robust ! implementation should be prepared to handle those characters when ! it is possible to do so. 3. URL References *************** *** 422,428 **** A common source of confusion in the use and interpretation of Uniform Resource Locators is the distinction between a reference to a URL and the URL itself. A URL reference may be absolute or relative, and may ! be attached to additional information in the form of a fragment identifier. However, "the URL" which results from such a reference includes only the absolute URL after the fragment identifier (if any) is removed and after any relative URL is resolved to its absolute --- 449,455 ---- A common source of confusion in the use and interpretation of Uniform Resource Locators is the distinction between a reference to a URL and the URL itself. A URL reference may be absolute or relative, and may ! have additional information attached in the form of a fragment identifier. 
However, "the URL" which results from such a reference includes only the absolute URL after the fragment identifier (if any) is removed and after any relative URL is resolved to its absolute *************** *** 446,454 **** retrieval action has been successfully completed. As such, it is not part of a URL, but is often used in conjunction with a URL. The format and interpretation of fragment identifiers is dependent on the ! media type of the retrieved resource. ! fragment = *urlchar A URL reference which does not contain a URL is a reference to the current document. In other words, an empty URL reference within a --- 473,481 ---- retrieval action has been successfully completed. As such, it is not part of a URL, but is often used in conjunction with a URL. The format and interpretation of fragment identifiers is dependent on the ! media type of the resource referenced by the URL. ! fragment = *urlc A URL reference which does not contain a URL is a reference to the current document. In other words, an empty URL reference within a *************** *** 498,510 **** absoluteURL = generic-URL | opaque-URL ! opaque-URL = scheme ":" *urlchar generic-URL = scheme ":" relativeURL URLs which are hierarchical in nature use the slash "/" character for ! separating hierarchical components. For some file systems, the "/" ! used to denote the hierarchical structure of a URL corresponds to the delimiter used to construct a file name hierarchy, and thus the URL path will look similar to a file pathname. This does NOT imply that the URL is a Unix pathname. --- 525,537 ---- absoluteURL = generic-URL | opaque-URL ! opaque-URL = scheme ":" *urlc generic-URL = scheme ":" relativeURL URLs which are hierarchical in nature use the slash "/" character for ! separating hierarchical components. For some file systems, a "/" ! 
character (used to denote the hierarchical structure of a URL) is the delimiter used to construct a file name hierarchy, and thus the URL path will look similar to a file pathname. This does NOT imply that the URL is a Unix pathname. *************** *** 566,572 **** port = *digit Domain names take the form as described in Section 3.5 of RFC 1034 ! [9] and Section 2.1 of RFC 1123 [5]: a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumerical character and possibly also containing "-" characters. The rightmost domain label will never start with a digit, though, --- 593,599 ---- port = *digit Domain names take the form as described in Section 3.5 of RFC 1034 ! [10] and Section 2.1 of RFC 1123 [4]: a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumerical character and possibly also containing "-" characters. The rightmost domain label will never start with a digit, though, *************** *** 608,614 **** The query component is a string of information to be interpreted by the resource. ! query = *urlchar Within a query component, the characters "/", "&", "=", and "+" are reserved. --- 635,641 ---- The query component is a string of information to be interpreted by the resource. ! query = *urlc Within a query component, the characters "/", "&", "=", and "+" are reserved. *************** *** 742,751 **** of how the base URL can be embedded in the Hypertext Markup Language (HTML) [3] is provided in Appendix D. ! Messages are considered to be composite documents. The base URL of a message can be specified within the message headers (or equivalent tagged metainformation) of the message. For protocols that make use ! of message headers like those described in MIME [4], the base URL can be specified by the Content-Base or Content-Location header fields. --- 769,778 ---- of how the base URL can be embedded in the Hypertext Markup Language (HTML) [3] is provided in Appendix D. ! 
MIME messages [7] are considered to be composite documents. The base URL of a message can be specified within the message headers (or equivalent tagged metainformation) of the message. For protocols that make use ! of message headers like those described in MIME [7], the base URL can be specified by the Content-Base or Content-Location header fields. *************** *** 786,792 **** encapsulated. Composite media types, such as the "multipart/*" and "message/*" ! media types defined by MIME (RFC 1521, [4]), define a hierarchy of retrieval context for their enclosed documents. In other words, the retrieval context of a component part is the base URL of the composite entity of which it is a part. Thus, a composite entity can --- 813,819 ---- encapsulated. Composite media types, such as the "multipart/*" and "message/*" ! media types defined by MIME[8], define a hierarchy of retrieval context for their enclosed documents. In other words, the retrieval context of a component part is the base URL of the composite entity of which it is a part. Thus, a composite entity can *************** *** 937,973 **** Resolution examples are provided in Appendix C. ! 6. Adding New Schemes ! ! The Internet Assigned Numbers Authority (IANA) maintains a registry ! of URL schemes. ! ! The current process for defining URL schemes is via the Internet ! standards process: new URL schemes should be described in ! standards-track RFCs. Over time, other methods of registering URL ! schemes may be added. ! ! URL schemes must have demonstrable utility and operability. One way ! to provide such a demonstration is via a gateway which provides ! objects in the new scheme for clients using an existing protocol. If ! the new scheme does not locate resources that are data objects, the ! properties of names in the new space must be clearly defined. ! ! URL schemes should follow the same syntactic conventions of existing ! schemes when appropriate. URL schemes should use the generic-URL ! 
syntax if they are intended to be used with relative URLs. A ! description of the allowed relative forms should be included in the ! scheme's definition. ! ! URL schemes cannot redefine the algorithm for resolving relative ! references. The resolution algorithm must remain independent of the ! scheme name in order to preserve the mobility of relative references ! between naming schemes and the ability to parse and resolve a ! relative reference without knowing the properties of any particular ! scheme. ! ! ! 7. Security Considerations A URL does not in itself pose a security threat. Users should beware that there is no general guarantee that a URL, which at one time --- 964,970 ---- Resolution examples are provided in Appendix C. ! 6. Security Considerations A URL does not in itself pose a security threat. Users should beware that there is no general guarantee that a URL, which at one time *************** *** 987,994 **** unwittingly contacts a server which is in fact running a different protocol. The content of the URL contains instructions which, when interpreted according to this other protocol, cause an unexpected ! operation. An example has been the use of gopher URLs to cause a rude ! message to be sent via a SMTP server. Caution should be used when using any URL which specifies a port number other than the default for the protocol, especially when it is a number within the reserved space. --- 984,993 ---- unwittingly contacts a server which is in fact running a different protocol. The content of the URL contains instructions which, when interpreted according to this other protocol, cause an unexpected ! operation. An example has been the use of gopher URLs to cause an ! unintended or impersonating message to be sent via a SMTP server. ! ! Caution should be used when using any URL which specifies a port number other than the default for the protocol, especially when it is a number within the reserved space. 
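The caution about non-default port numbers could be applied by a client along the following lines. This is a minimal sketch and not part of either draft. It assumes Python's standard urllib.parse.urlsplit, a small hypothetical table of default ports for four of the schemes used as examples in this document, and it reads "the reserved space" as ports below 1024, which is an interpretation rather than something the draft states. The function name port_warning is hypothetical.

```python
from urllib.parse import urlsplit

# Default ports for a few schemes; this table is illustrative, not exhaustive.
DEFAULT_PORTS = {"ftp": 21, "telnet": 23, "gopher": 70, "http": 80}


def port_warning(url):
    """Return a short warning string for a suspicious port, or None."""
    parts = urlsplit(url)
    port = parts.port                      # None when no port is given
    default = DEFAULT_PORTS.get(parts.scheme)
    if port is None or port == default:
        return None
    if port < 1024:                        # assumed meaning of "reserved space"
        return "non-default port in reserved space"
    return "non-default port"
```

For example, port_warning("gopher://example.org:25/") flags a gopher URL that points at the port normally used by SMTP, which is the situation described in the gopher example above. The host example.org is a placeholder.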
*************** *** 1004,1018 **** It is clearly unwise to use a URL that contains a password which is intended to be secret. ! ! 8. Acknowledgements ! This document was derived from RFC 1738 [2] and RFC 1808 [7]; the acknowledgements in those specifications still apply. In addition, ! this draft has benefited from comments by Lauren Wood. ! ! ! 9. References [1] Berners-Lee, T., "Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of --- 1003,1016 ---- It is clearly unwise to use a URL that contains a password which is intended to be secret. ! 7. Acknowledgements ! This document was derived from RFC 1738 [2] and RFC 1808 [6]; the acknowledgements in those specifications still apply. In addition, ! contributions by Lauren Wood, Martin Duerst, Gisle Aas, Martijn ! Koster, Ryan Moats and Foteos Macrides are gratefully acknowledged. ! ! 8. References [1] Berners-Lee, T., "Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of *************** *** 1026,1061 **** [3] Berners-Lee T., and D. Connolly, "HyperText Markup Language Specification -- 2.0", RFC 1866, MIT/W3C, November 1995. ! [4] Borenstein, N., and N. Freed, "MIME (Multipurpose Internet Mail ! Extensions): Mechanisms for Specifying and Describing the Format ! of Internet Message Bodies", RFC 1521, Bellcore, Innosoft, ! September 1993. ! ! [5] Braden, R., Editor, "Requirements for Internet Hosts -- Application and Support", STD 3, RFC 1123, IETF, October 1989. ! [6] Crocker, D., "Standard for the Format of ARPA Internet Text Messages", STD 11, RFC 822, UDEL, August 1982. ! [7] Fielding, R., "Relative Uniform Resource Locators", RFC 1808, UC Irvine, June 1995. ! [8] Kunze, J., "Functional Recommendations for Internet Resource Locators", RFC 1736, IS&T, UC Berkeley, February 1995. ! [9] Mockapetris, P., "Domain Names - Concepts and Facilities", STD 13, RFC 1034, USC/Information Sciences Institute, November 1987. ! 
[10] Sollins, K., and L. Masinter, "Functional Requirements for Uniform Resource Names", RFC 1737, MIT/LCS, Xerox Corporation, December 1994. ! [11] US-ASCII. "Coded Character Set -- 7-bit American Standard Code for Information Interchange", ANSI X3.4-1986. ! 10. Authors' Addresses Tim Berners-Lee World Wide Web Consortium --- 1024,1062 ---- [3] Berners-Lee T., and D. Connolly, "HyperText Markup Language Specification -- 2.0", RFC 1866, MIT/W3C, November 1995. ! [4] Braden, R., Editor, "Requirements for Internet Hosts -- Application and Support", STD 3, RFC 1123, IETF, October 1989. ! [5] Crocker, D., "Standard for the Format of ARPA Internet Text Messages", STD 11, RFC 822, UDEL, August 1982. ! [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808, UC Irvine, June 1995. ! [7] Freed, N., and N. Borenstein, "Multipurpose Internet Mail ! Extensions (MIME) Part One: Format of Internet Message Bodies", ! RFC 2045, November 1996. ! ! [8] Freed, N., and N. Borenstein, "Multipurpose Internet Mail ! Extensions (MIME) Part Two: Media Types", RFC 2046, Innosoft, Bellcore, ! November 1996. ! ! [9] Kunze, J., "Functional Recommendations for Internet Resource Locators", RFC 1736, IS&T, UC Berkeley, February 1995. ! [10] Mockapetris, P., "Domain Names - Concepts and Facilities", STD 13, RFC 1034, USC/Information Sciences Institute, November 1987. ! [11] Sollins, K., and L. Masinter, "Functional Requirements for Uniform Resource Names", RFC 1737, MIT/LCS, Xerox Corporation, December 1994. ! [12] US-ASCII. "Coded Character Set -- 7-bit American Standard Code for Information Interchange", ANSI X3.4-1986. ! 9. Authors' Addresses Tim Berners-Lee World Wide Web Consortium *************** *** 1091,1097 **** URL-reference = [ absoluteURL | relativeURL ] [ "#" fragment ] absoluteURL = generic-URL | opaque-URL !
opaque-URL = scheme ":" *urlchar generic-URL = scheme ":" relativeURL relativeURL = net_path | abs_path | rel_path --- 1092,1098 ---- URL-reference = [ absoluteURL | relativeURL ] [ "#" fragment ] absoluteURL = generic-URL | opaque-URL ! opaque-URL = scheme ":" *urlc generic-URL = scheme ":" relativeURL relativeURL = net_path | abs_path | rel_path *************** *** 1118,1128 **** param = *pchar pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" ! query = *urlchar ! fragment = *urlchar ! urlchar = reserved | unreserved | escaped reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" unreserved = alpha | digit | mark mark = "$" | "-" | "_" | "." | "!" | "~" | --- 1119,1129 ---- param = *pchar pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" ! query = *urlc ! fragment = *urlc ! urlc = reserved | unreserved | escaped reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" unreserved = alpha | digit | mark mark = "$" | "-" | "_" | "." | "!" | "~" | *************** *** 1133,1144 **** "a" | "b" | "c" | "d" | "e" | "f" alphanum = alpha | digit ! alpha = lowalpha | hialpha lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ! hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | --- 1134,1145 ---- "a" | "b" | "c" | "d" | "e" | "f" alphanum = alpha | digit ! alpha = lowalpha | upalpha lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ! 
upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | *************** *** 1157,1164 **** The following line is the regular expression for breaking-down a URL reference into its components. ! ^(([^/?#]+):)?(//([^/?#]*))?([^?#]*)?(\?([^#]*))?(#(.*))? ! 12 3 4 5 6 7 8 9 The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each --- 1158,1165 ---- The following line is the regular expression for breaking-down a URL reference into its components. ! ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? ! 12 3 4 5 6 7 8 9 The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each *************** *** 1325,1339 **** http://test.com/ ! The prefix "URL:", with or without a trailing space, is sometimes ! used to help distinguish a URL from normal text. These wrappers do ! not form part of the URL. In the case where a fragment identifier is ! associated with a URL reference, the fragment would be placed within ! the brackets as well (separated from the URL with a "#" character). ! ! In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may ! need to be added to break long URLs across lines. The whitespace ! should be ignored when extracting the URL. No whitespace should be introduced after a hyphen ("-") character. Because some typesetters and printers may (erroneously) introduce a --- 1326,1340 ---- http://test.com/ ! These wrappers do not form part of the URL. ! ! In the case where a fragment identifier is associated with a URL ! reference, the fragment would be placed within the brackets as well ! (separated from the URL with a "#" character). ! ! In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) ! 
may need to be added to break long URLs across lines. The ! whitespace should be ignored when extracting the URL. No whitespace should be introduced after a hyphen ("-") character. Because some typesetters and printers may (erroneously) introduce a *************** *** 1342,1347 **** --- 1343,1359 ---- all unescaped whitespace around the line break, and should be aware that the hyphen may or may not actually be part of the URL. + Using <> angle brackets around each URL is especially recommended + as a delimiting style for URLs that contain whitespace. + + The prefix "URL:" (with or without a trailing space) was + recommended as a way to help distinguish a URL from other + bracketed designators, although this is not common in practice. + + For robustness, software that accepts user-typed URLs should + attempt to recognize and strip both delimiters and embedded + whitespace. + Examples: Yes, Jim, I found it under "http://www.w3.org/pub/WWW/", *************** *** 1450,1456 **** The description of the mythical Base header field has been replaced with the Content-Base and Content-Location header fields defined by ! HTTP/1.1 and MHTML. RFC 1808 described various schemes as either having or not having the properties of the generic-URL syntax. However, the only requirement --- 1462,1468 ---- The description of the mythical Base header field has been replaced with the Content-Base and Content-Location header fields defined by ! HTTP/1.1 and MHTML.[palme] RFC 1808 described various schemes as either having or not having the properties of the generic-URL syntax. However, the only requirement *************** *** 1477,1480 **** append the reference's query component to a relative path before merging it with the base path. The resolution algorithm has been changed accordingly. - --- 1489,1491 ----
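
The two behaviours touched by the hunks above (the revised regular expression for breaking down a URL reference, and the advice to strip delimiters and embedded whitespace from user-typed URLs) can be tried out with a minimal Python sketch. This is an illustration under stated assumptions, not text from either draft: the function names `split_url_reference` and `strip_wrappers` are hypothetical, and the labels given to subexpressions 2, 4, 5, 7 and 9 (scheme, site, path, query, fragment) are a reading of the expression, since the hunk does not show the draft's own description of them. The sketch also removes all whitespace unconditionally and does not attempt the hyphen-at-line-break judgment the draft warns about.

```python
import re

# Expression copied from the -03 side of the hunk; unlike -02, the scheme
# character class excludes ":" and the path group is no longer optional.
URL_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Delimiter pairs recognised by this sketch (an assumption, not a draft list).
WRAPPERS = (("<", ">"), ('"', '"'))


def split_url_reference(ref):
    """Split a URL reference into components; absent parts are None."""
    # Every group is optional or may match the empty string, so the
    # expression matches any input and `m` is never None.
    m = URL_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),    # component labels are assumed, see lead-in
        "site": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def strip_wrappers(text):
    """Remove embedded whitespace, one pair of delimiters, and a "URL:" prefix."""
    s = "".join(text.split())  # drops spaces, tabs and line breaks
    for opener, closer in WRAPPERS:
        if len(s) >= 2 and s.startswith(opener) and s.endswith(closer):
            s = s[1:-1]
            break
    if s[:4].upper() == "URL:":
        s = s[4:]
    return s


if __name__ == "__main__":
    print(split_url_reference(strip_wrappers("<URL: http://test.com/>")))
```

Running `strip_wrappers` before `split_url_reference` keeps the two concerns separate: the first deals with how a URL was written down in surrounding text, the second only with the URL's own syntax.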