This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

url: resolve strips drive letters from Windows file URLs #5452

Closed
domenic opened this issue May 11, 2013 · 16 comments

Comments

@domenic

domenic commented May 11, 2013

url.resolve('file:///C:/file.txt', '/');

Got: file:///

Expected: file:///C:/

@awwright

This is the correct behavior according to RFC 3986: the hier-part is ///C:/file.txt, so the authority is blank and the path is /C:/file.txt. Therefore, resolving a URI-reference of / results in file:/// (empty authority and a path of /).
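
The decomposition described here can be reproduced with the regular expression given in RFC 3986, Appendix B. What follows is a minimal sketch, not Node's url implementation: `parseUri` and `resolveAbsolutePath` are hypothetical helper names, and the resolver only covers references that consist of an absolute path, skipping the dot-segment removal of RFC 3986 section 5.2.4.

```javascript
// RFC 3986, Appendix B: regular expression for splitting a URI reference.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: split a URI reference into its five components.
function parseUri(uri) {
  const m = URI_PARTS.exec(uri); // every group is optional, so this always matches
  return {
    scheme: m[2],    // undefined when absent
    authority: m[4], // '' when present but empty, undefined when absent
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

// Hypothetical helper: resolves only absolute-path references such as '/'.
// Scheme and authority come from the base; path, query, and fragment come
// from the reference. Dot-segment removal is intentionally omitted.
function resolveAbsolutePath(baseUri, reference) {
  const base = parseUri(baseUri);
  const ref = parseUri(reference);
  if (ref.scheme !== undefined || ref.authority !== undefined || !ref.path.startsWith('/')) {
    throw new Error('this sketch only handles absolute-path references');
  }
  let result = base.scheme + ':';
  if (base.authority !== undefined) result += '//' + base.authority;
  result += ref.path;
  if (ref.query !== undefined) result += '?' + ref.query;
  if (ref.fragment !== undefined) result += '#' + ref.fragment;
  return result;
}

const parsed = parseUri('file:///C:/file.txt');
console.log(parsed.authority); // '' (present but empty)
console.log(parsed.path);      // /C:/file.txt
console.log(resolveAbsolutePath('file:///C:/file.txt', '/')); // file:///
```

Under this reading the drive letter is just the first path segment, so replacing the path with / discards it.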

@bnoordhuis
Member

What @ACubed said. It's the expected behavior.

@domenic
Author

domenic commented May 12, 2013

It's correct according to the years-old RFC, but does not match real-world browser behavior nor the more recent URL standard: http://url.spec.whatwg.org/

@isaacs

isaacs commented May 12, 2013

I'm with @domenic on this. Our goal with the url module is to follow browser behavior. The WhatWG has been kind enough to make a proper spec, which would have been nice if it'd been around 4 years ago. We should follow that spec, since it's what browsers actually do.
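
For comparison, later Node.js releases and current browsers expose the WHATWG `URL` class as a global; it was not available in Node when this issue was filed. Assuming such a runtime, a short sketch shows the resolution result that the URL Standard produces for the case reported above:

```javascript
// Assumes a runtime that provides the WHATWG URL class as a global
// (a later Node.js release or a browser); this API postdates the issue.
const base = 'file:///C:/file.txt';

// Resolve the reference '/' against the file URL base using WHATWG parsing.
const whatwgResult = new URL('/', base).href;
console.log(whatwgResult); // file:///C:/
```

The drive letter survives because the URL Standard carries a base file URL's drive letter over when the reference is an absolute path that does not itself start with one.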

@isaacs isaacs reopened this May 12, 2013
@awwright

For URI resolution, the HTML 4.01 specification references RFC 2396, which was obsoleted by RFC 3986. The HTML 5 candidate recommendation normatively references only the newer RFC 3986. While the HTML 5 draft does explicitly vary its resolution from RFC 3986, the variation is limited in scope, and its stated rationale is to support older documents that predate RFC 3986 and would otherwise be illegal (not a problem for Node.js). Nowhere does it specially handle file URLs, in which drive letters are supposed to be considered a directory.

I already use URIs with colons and similar characters in a number of schemes, including file with and without an authority, a feature heavily used for CURIE, among other things. Any special behavior would break my application, and could pose security problems if, for instance, certain path segments could modify the authority or the resolved filesystem path. (HTTP/1.1 also mandates that servers accept absolute forms of URIs, and any URI, not just URLs; this is likely to become the only method in which requests are made in HTTP/2.0.)

The point is that the behavior of URIs is explicitly not supposed to change between applications or over time. They're, well, uniform.

@domenic
Author

domenic commented May 12, 2013

> The point is that the behavior of URIs is explicitly not supposed to change between applications or over time. They're, well, uniform.

Indeed, the web has not been following those RFCs for a very long time. Nothing has changed since the early days of web browsers. I've run tests so far in all web browsers plus .NET, and URL handling uniformly figures out Windows drive letter paths correctly. The RFCs are simply inaccurate.

@domenic
Author

domenic commented May 12, 2013

If it helps, Node's url module already has many improvements over the outdated RFCs that help it match real-world URL resolution behavior. This bug and #5453 are the only remaining missing pieces! But if you check out jsdom/jsdom#550 you'll see many, many other divergences, as we took a URL resolution algorithm designed from the RFC and turned it into one that matched browsers.

@awwright

If the browsers are doing it differently, they're doing it wrong. In the HTML specification itself, RFC 3986 is the normative (authoritative) reference in how to resolve and parse URIs.

Observing that implementations have done it differently over time is only a reason to make sure that Node.js follows the definition of the URI and not add to the tangle of inoperable implementations.

@domenic
Author

domenic commented May 12, 2013

Ah, good catch; I'll talk to the appropriate people and get the HTML5 spec updated. Thanks!

Edit: Looks like you were just wrong? After asking around #whatwg in IRC, looks like the HTML spec references the URL spec already:

http://www.whatwg.org/specs/web-apps/current-work/#url-parser

@awwright

The HTML5 spec already refers to RFC 3986? Even if it did define incompatible behavior, being a normative specification means that it can't be changed, even if such specifications wanted to - the URI behavior takes precedence.

For reference, here's a (very incomplete) list of standards or proposed standards that reference RFC 3986, RFC 3987, or a compatible older specification:

  • http://www.w3.org/TR/xml/ (URIs are resolved against the xml:base)
  • http://www.w3.org/TR/xml-names/ (Same here)
  • http://www.w3.org/TR/html5/ (HTML specifies multiple opportunities to define a document base, and several attributes that are URI references to be resolved against this base)
  • http://www.w3.org/TR/rdfa-core/ (Uses IRI references extensively)
  • http://www.w3.org/TR/DOM-Level-3-Core (and the rest of the DOM family, which exposes URIs resolved from URI references, in HTML and non-HTML XML documents alike; this behavior does not change between HTML and XML).
  • http://www.w3.org/TR/turtle/ (IRI references are allowed, are enclosed in <>, and resolved against a document base. The default base URI is almost certainly a file URI! These MUST be handled according to RFC 3986, even with Windows drive letters!)
  • http://www.w3.org/TR/rdf-primer/ (Two types of nodes are used to identify resources, URIs and bnodes, both must be well-formed)
  • http://www.w3.org/TR/rdf-interfaces/ (This ECMAScript API is expected to properly resolve URI references against the document base, usually provided by DOM)
  • http://www.w3.org/TR/CSS21/ (and many other documents defined in the CSS family of specifications, http://www.w3.org/TR/CSS/: URL references are written as strings enclosed by url() and are resolved relative to the document.)
  • http://www.w3.org/TR/curie/ (A format for expressing absolute URIs in a compact form, used in many Web technologies like RDFa, JSON, and notably Facebook's Open Graph)
  • http://www.w3.org/TR/json-ld/ (JSON-LD practically revolves around URI references)
  • http://tools.ietf.org/html/rfc2616 (HTTP allows URI references almost anywhere a URI is used, except in the request-line, where it must be an absolute URI or an absolute path. These references are resolved against the current resource URI.)
  • http://tools.ietf.org/html/rfc4287 (The Atom syndication format allows URI or IRI references and uses them in vocabularies defined for both XML and HTTP)
  • http://tools.ietf.org/html/rfc5988 (The Link header uses URI references, independent of the media type being transferred over HTTP, so for instance, a .png image can be given an "author" or "self" relation.)
  • http://tools.ietf.org/html/rfc6749 (OAuth requires that URI references be followed properly.)
  • http://tools.ietf.org/html/rfc6570 (URI templates form URI references)
  • The JSON Reference draft-standard, which allows embedding JSON documents one in another using a {"$ref": "uri-reference"} syntax, which is commonly going to be resolved against a file URI!
  • The JSON Schema draft-standard, which uses JSON Reference, and allows annotating properties as links using URI Template, and allows URIs and URI references, so that Link-compatible metadata may be extracted from the document.

Varying the behavior from RFC 3986 would break all of them.

@domenic
Author

domenic commented May 13, 2013

@ACubed thanks for doing all that leg work! I've passed it on to the appropriate parties, and we'll see updates to those specs soon to reflect web reality.

@awwright

@domenic thanks for your snarky, nonsensical help; it really contributes to the advancement of the Web. Not.

I actually do have a direct line of communication to the authors of many of those standards, and so far they've all told me it's nonsense.

Can we please get on with the reality of the Web now, thanks. RFC 3986 is the single authoritative specification. It is still in STANDARD status; it has not been superseded.

@domenic
Author

domenic commented May 13, 2013

What you seem to be missing is that standards reflect web reality; they do not create it. Browsers and other software all implement URLs in a way that diverges significantly from those outdated RFCs; the existence of the URL spec came about because vendors realized this and sought to codify the new reality in a document that they could all refer to for an interoperable implementation of edge cases. It's great that you've found places where older documents don't reflect that, and we'll work toward fixing that. But the reality of the software we work in is different, and that's not going to change---breaking many programs that rely on real-world URL spec behavior---just because an older document says so.

@awwright

I don't know where you get the impression they're outdated. RFC 2732 is "outdated" or, to use the industry vocabulary, "obsolete": it is superseded by RFC 3986. I just listed more than a dozen specifications that rely on the exact behavior of RFC 3986. Diverging from the behavior of the vast majority of specifications is what is out of touch.

As I described, HTML5 does accommodate a superset of URIs for backward compatibility, as you described, but it doesn't change its behavior.

@annevk

annevk commented May 14, 2013

There are some subtle differences actually. DOM has already changed: http://dom.spec.whatwg.org/ HTML has too: http://www.whatwg.org/specs/web-apps/current-work/multipage/ CSS will soon change too. Do not really know about the rest of the list @domenic mentioned.

@jasnell
Member

jasnell commented Jun 3, 2015

Given that there is a plan to update the url implementation to conform better with the updated specs, I'm going to mark this as defer-to-convergence.
