Link Fingerprints

Version 0.6 - 2008-07-09

Introduction

Link Fingerprints is a system for allowing an HTML link to reference not only a particular resource, but a particular version of that resource, in a way that the fetching client can validate. This is an updated and heavily modified version of an original proposal in a blog post.

Originally, link fingerprints were proposed as an extension to URL syntax. This idea met with a chilly reception on various IETF discussion lists, so it has been reformulated as an extension to HTML. This makes it useful in a narrower set of contexts, but half a loaf is better than no bread.

Idea

We define an additional attribute, "checksum", which is valid on all HTML elements which reference a remote resource using a URL - which I believe are <a>, <link>, <img>, <script>, <style>, <object>, <applet>, <area>, <iframe> and <frame>. (Some of these elements have more than one attribute which can contain a URI, but it seems to me that the 'main' one is obvious in all cases.) For example:

<a href="http://www.example.com/path/to/downloadfile.bin" checksum="sha256:09F9...">Download downloadfile.bin</a>

The "sha256" is a short lower-case string identifying a particular hash algorithm and the numbers and letters are the SHA256 digest, or "fingerprint", of the file in question. For ease of implementation, this standard supports a single hash algorithm, SHA-256, which is considered secure at the time of writing. However, if problems develop, further algorithms could be standardised.

When downloading a file from a link of this form, the client may compute the checksum of the downloaded file and tell the user if there is a mismatch. If there is a mismatch, the client MUST delete the file.
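
A supporting client's check might look like the following sketch (a hypothetical helper; it assumes the file has already been saved to disk and the "sha256:" prefix has been stripped from the attribute value):

  import hashlib
  import os

  def verify_download(path, expected_digest):
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(65536), b""):
              h.update(chunk)
      # Digest case is not specified, so compare case-insensitively.
      if h.hexdigest() != expected_digest.lower():
          os.remove(path)  # on mismatch, the file MUST be deleted
          raise ValueError("File corrupt. Download failed. Contact file owner.")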

This has several applications. For example, if a company is providing a security update, perhaps by linking to it from their website or in an email, then the downloader can be more certain that they are getting the intended file. To substitute a trojan, the attacker would need to hack both the download site and the website supplying the information - or the user's mailbox. In addition, if the company is using a mirror network to supply the updates, it gives them some confidence that the distributed machines in the network are not serving a corrupt or trojaned file.

This measure is not supposed to be a replacement for "proper" cryptography. It is merely a way of raising the bar, whose chief advantage is that it requires no work from the end user. However, it would have provided much quicker notification of at least some of the recent software download trojaning attacks.

Advantages

  • Neither file provider nor user needs any specialised PKI tools.
  • There is no need for any changes to web servers.
  • The improved security is provided in supporting user agents without any need for user interaction or configuration.
  • The security can be provided on any type of file, no matter what its internal structure.
  • No false sense of security is given - if the file downloads correctly, the user agent should behave as with a normal, unmarked download.

Error Behaviour

The checksum can be checked for well-formedness (correct number of characters, hex digits only) before the file is requested. If it is not well-formed, the user agent should give an error straight away and not request the file, e.g. "This link is broken; download is not possible."
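
A minimal well-formedness test (a SHA-256 digest is 256 bits, i.e. exactly 64 hex digits) might be:

  import re

  # Mandatory lower-case algorithm name, a colon, then 64 hex digits
  # in either case.
  WELL_FORMED = re.compile(r"^sha256:[0-9A-Fa-f]{64}$")

  def is_well_formed(checksum):
      return bool(WELL_FORMED.match(checksum))

  assert is_well_formed("sha256:" + "0" * 64)
  assert not is_well_formed("sha256:xyz")  # wrong length, not hex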

If, after download, the hashes do not match, the client MUST delete the file so as to prevent the user gaining access to it. I believe this strict behaviour is reasonable, as the mismatch could be for one of several reasons:

  1. The download has been unrecoverably corrupted
  2. The download was changed (trojaned)
  3. The original URL was incorrectly created

In all cases, giving the user access to the downloaded data is a risk - in the case of #3, because you can't tell the difference between it and #2. The URL creator obviously thought the data important enough to use link fingerprints, and the protection has flagged a problem. So going back to the owner for a proper URL is reasonable. If file owners want downloaders to retain the download even if the hash is different, they should simply not use link fingerprints.

So the error would say something like: "Warning: the file at URL <url> is corrupted and unusable. Please contact the person or site responsible for the file." Or, for something like a download manager: "File corrupt. Download failed. Contact file owner".

Advanced users could, of course, re-request the file without the fingerprint using Copy Link Location and pasting into the URL bar. But it is assumed that anyone who knows how to do that would understand the consequences.

Objections

This is just a reinvention of the Content-MD5 HTTP header
No. Content-MD5 is sent by the same server which is serving the file; if the file gets changed or corrupted, the web server will most likely recalculate the Content-MD5 header. So there is no increase in security.

If the browser gives an error, people will just use another browser
This argument could be used against any scheme which prevents browser users from doing dangerous things. Yet browsers have nonetheless recently implemented anti-phishing and anti-malware technologies. Also, if one person bypasses the error by switching browser, another will see the significance and alert the site. Only a subset of users have to be using an LF-supporting client for the takedown time for bogus versions of files to be improved, and therefore for the protection level to be increased.

Further Considerations

For security reasons, the fingerprint would need to be preserved through any redirects. Such redirects should not be able to override the fingerprint with another one.
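
In other words, the fingerprint taken from the original link is the one checked against the final bytes, wherever the redirects lead. A sketch using Python's standard library, which follows HTTP redirects automatically:

  import hashlib
  import urllib.request

  def fetch_and_hash(url):
      # Redirects are followed transparently; the resulting bytes are
      # still hashed against the fingerprint from the original link,
      # which nothing in an intermediate response can replace.
      with urllib.request.urlopen(url) as resp:
          return hashlib.sha256(resp.read()).hexdigest()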

Compatibility

Using an attribute to hold the checksum information makes the links entirely backward-compatible with current HTML user agents. They will merely fetch the file and not check the checksum, which is the same behaviour as before the introduction of Link Fingerprints. Their protection comes from the likelihood that a downloader using a link-fingerprint-aware client will raise the alarm sooner than the tiny number of downloaders who actually check provided checksums manually.

Link fingerprints could also be used on files intended for display rather than download, but this is not considered a primary use case.

Clients are encouraged not to implement any hash algorithm other than SHA-256, until and unless SHA-256 is found to have flaws. Proliferation of algorithms raises the burden on implementing clients. New algorithms should be agreed with other implementors, and must be free of patent encumbrance.

However, for forwards-compatibility and later extensibility, the parsing algorithm is as follows. Take the contents of the "checksum" attribute, and split on semicolon. For each resulting string (one, in the common case), strip all whitespace and then look for a known hash name (followed by colon) as a prefix. Process the remainder of the text according to the client's knowledge of that algorithm. Clients should silently ignore hash prefixes (e.g. "sha1:") they do not recognise. This scheme allows for multiple algorithms to be specified during any future period of transition.
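
A Python sketch of that parsing algorithm, treating only sha256 as known:

  KNOWN_ALGORITHMS = {"sha256"}

  def parse_checksum_attribute(value):
      results = {}
      for entry in value.split(";"):
          entry = "".join(entry.split())  # strip all whitespace
          name, sep, digest = entry.partition(":")
          # Silently ignore unrecognised hash prefixes.
          if sep and name in KNOWN_ALGORITHMS:
              results[name] = digest
      return results

  # parse_checksum_attribute("sha1:DEAD; sha256:09F9")
  # -> {"sha256": "09F9"}  (the sha1 entry is ignored)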

Credits

Thanks to Microsoft for unwittingly giving me the idea for this by storing security updates in directories named using GUIDs.

Original URL: http://www.gerv.net/security/link-fingerprints/