Open Data

This is the keynote presentation I gave at the Accessibility 2.0 conference held in London in April 2008.

We have come here to listen to a veritable pantheon of Web accessibility experts give us advice that is practical and relevant to working on the Web today. I’d like to offer something of an alternative. By that I don’t mean that I’m going to give you advice that is irrelevant and impractical. I mean that I’d like to take a step backwards and try to look at the bigger picture. There won’t be any code. There won’t be any hints and tips that will help you in working on the Web today. Let’s leave today behind and delve back into the past.

Let’s start with the Norman conquest of England in 1066. This was the watershed moment that split history into the Dark Ages of everything pre-Hastings and the Middle Ages of everything afterwards. Twenty years after the invasion, William the Conqueror commissioned the Domesday Book, a remarkably thorough snapshot of life in 11th-century England. This document still exists today. It rests in a glass case in the National Archives in this very city.

In the run-up to the 900th anniversary of the Domesday Book’s completion, the BBC began the Domesday Project, an ambitious attempt to create a multimedia version of the document. The medium they chose? Laserdiscs. The medium was out of date before the project was even finished. Vellum, while not the most popular storage medium these days, has proven to be far more durable.

The debacle of the Domesday Project is often cited in hand-wringing discussions around the problems of digital preservation. The laserdisc format was not durable. I don’t mean that the physical medium stopped storing its ones and zeros. I mean those ones and zeros don’t make any sense to a modern computer. To put it another way, laserdiscs are inaccessible.

Let’s go back even further to examine a medium that is even more durable than vellum. Egyptian hieroglyphic writing was often carved into stone. Symbols dating back as far as 3200 BC have survived to this day. But for most of the past two millennia, this writing was completely inaccessible. The ones and zeros had been preserved but the key to interpreting them had not. It was only thanks to the Rosetta Stone (also on display in this very city) and the valiant efforts of Champollion that we can read and understand hieroglyphics today.

By the way — and this is a complete tangent — do you know what the great-grandson of Champollion does for a living? I only know this because my wife is a translator: he writes software for translators. Well, I say software …he’s actually created a plugin for Word. So his legacy might not be quite as enduring as his ancestor’s.

Word suffers from the same problem as laserdiscs. It is not a good format for digital preservation. Given time, it will become an inaccessible format.

It is my contention that what is good for digital preservation is good for accessibility.

Here’s a tired old cliché: let’s compare digital documents to buildings. This conceptual metaphor is as old as the Web itself. We talk about web “sites”: an accurate description of places that so often feel as if they are under construction.

I’d like to compare the digital and the concrete in a slightly different way. In his book How Buildings Learn, Stewart Brand explains the concept of shearing layers, a term first coined by the architect Frank Duffy and explained thusly:

“Our basic argument is that there isn’t any such thing as a building. A building properly conceived is several layers of longevity of built components.”

Those layers are:

  • the site,
  • the structure,
  • the skin (which is the exterior surface),
  • the services (like wiring and pipes),
  • the space plan and
  • the stuff (like chairs, tables, carpets and pictures).

Each one of these shearing layers is dependent on the layer before. The stuff depends on the space plan, the skin depends on the structure, the structure depends on the site, and so on.

Already you might be seeing parallels with Web development, especially Web development carried out according to the principle of progressive enhancement. But let’s not get ahead of ourselves here. What I’d like to point out is the different pace at which each one of these shearing layers changes.

It’s easy to rearrange furniture. It’s more troublesome to change the wiring or pipes. Making changes to the fundamental structure of a building is a real pain in the ass. The site of a building is unlikely to change at all, discounting any unforeseen tectonic activity.

If we want to preserve information, we should aim to bury it in the deepest shearing layer available. Vellum and stone have worked out well because they are the informational equivalent of a reasonably deep shearing layer. But they don’t scale very well, they aren’t easily searchable and it’s extremely time-consuming to make non-destructive copies. That’s where digital storage excels.

So how can we ensure that we choose the right formats in which to store our information? How can we tell whether a storage medium is a deep shearing layer? How can we avoid reinventing the laserdisc?

We have a few rules of thumb to help us answer those questions.

Open formats are better than closed formats. I don’t mean they are necessarily qualitatively better but from the viewpoint of digital preservation (and therefore, accessibility), over a long enough timescale they are always better.

The terms “open” and “closed” are fairly nebulous. Rather than define them too rigidly, I’d like to point to the qualities that can be described as either open or closed. The truth is that most formats contain a mixture of open and closed qualities.

First of all, there’s the development process of creating a format in the first place. On the face of it, a closed process might seem preferable. It allows greater control of how a format develops. But it turns out that this isn’t always desirable. The open-source model of development, for all its chaotic flaws, has one huge advantage: evolution. Time and time again, the open-source community has produced efficient, well-honed gems instead of the Towers of Babel that would be logically expected. That’s because Darwinian selection, be it natural or otherwise, will always produce the best adaptations for any environment. It doesn’t matter if we’re talking about ones and zeros instead of strands of DNA; the Theory of Evolution is borne out in either case. Microsoft aren’t getting their ass kicked by the Linux penguin or the burning fox of fire; Microsoft are getting their ass kicked by Charles Darwin.

Open-source development is the most obvious open quality that a format can have. Another open quality is standardization. Again, at first glance, this might seem counter-intuitive. After all, the standardization process is all about defining boundaries and setting limits as to what is and isn’t permitted. But this deliberate reining in of the possibility space is what gives a format longevity. This will come as no surprise to the designers amongst you who are well aware that constraints breed creativity. As Douglas Adams said, we demand rigidly defined areas of doubt and uncertainty.

As a card-carrying member of The Web Standards Project, I’m rather fond of standards; that will probably come as no surprise. But my fondness for standards extends beyond the Web. When visiting Paris with my good friend and fellow geek Brian Suda, we tried calling up the International Bureau of Weights and Measures, which has its headquarters there. We wanted to see the meter. But we were rebuffed in brusque French fashion. “Zis is not a museum!”

Harrumph! Who needs the French anyway? The true father of standards is a British man, a member of The Royal Society, which was based, yes, right here in this city. His name was Joseph Whitworth and he was an engineer. A developer, in other words. He standardized screw threads. Before Whitworth, screws were made on a case-by-case basis, each one different from the next. That didn’t scale well for the ambitious project that Whitworth was working on: Charles Babbage’s difference engine, which, although it can’t boast a direct lineage to this computer, bears an uncanny resemblance to it in its internal design. I love the idea that there’s a connection between the screws that were created for the difference engine and the standards that we use to build the Web.

Standardization doesn’t necessarily lead to qualitatively better formats. Quite the opposite in fact. The standardization process, by its very nature, involves compromise. But I would rather use a compromised standardized format than a perfect proprietary one.

The Flash format, for example, while it has some open qualities, remains mostly closed as long as the Flash player remains under lock and key. I’ve discussed this with my fellow Brightonian Aral Balkan, who knows a thing or two about Flash. He sympathises with Adobe’s position, claiming that if anybody were able to build a Flash player, then developers would have to support buggy players. Aral recently made a foray into building a site using CSS for layout. Now that he’s experienced the pain of cross-browser, cross-platform development, the last thing he wants is to port that pain over to the Flash environment. I see his point, but personally I’m willing to pay the price for working with standardized formats… even if I sometimes do find myself tearing my hair out over some browser’s inconsistent rendering of some CSS rule.

The standardization of HTML, CSS and ECMAScript means that, in theory, anyone can make a Web browser. While I hope that remains just a theory (I don’t want any more browsers, thank you very much) that bodes very well for the longevity of data written in those formats.

Of that trio of formats, the one that’s most directly relevant to information storage and accessibility is HTML. It’s also a vital component in another trio of technologies: HTTP, URLs and HTML. If I had any slides, I’d probably be showing you a Venn diagram right now with HTML as the common component bridging the infrastructure and the content of the World Wide Web.

I’ve had the great pleasure of meeting some of the people who worked with Tim Berners-Lee at CERN ‘round about the time that he created the Hypertext Markup Language, the World Wide Web and the first Web browser. One of those people, the lovely Håkon Wium Lie, was so enamoured with HTML that he placed a bet that the language would be around for at least 50 years. That’s a good start. That’s in a different shearing layer to most of the file formats that our computers read today.

The Web was not the first distributed network of documents. Tim Berners-Lee stood on the shoulders of giants like Vannevar Bush and Doug Engelbart. HTML is far from the best possible hypertext system. Other systems envisioned two-way linkage, and Ted Nelson’s idea of transclusion would be a welcome addition to the World Wide Web.

The strength of HTML is its simplicity. Simplicity beats complexity for many of the same reasons that open beats closed. Simple formats are more likely to have a longer lifespan. Yes, it can sometimes feel limiting to work with a relatively small number of HTML elements but on balance, I don’t mind paying that price. Remember what I said about constraints breeding creativity? Just look at the amazing multi-faceted Web that we’ve managed to construct with this simple technology.

The same simplicity that informs HTML extends right down the stack into the infrastructure of the Hypertext Transfer Protocol. It’s not just simple, it’s downright dumb. By design. In retrospect, given the simple, open nature of HTTP plus URLs plus HTML, the rise of the stupid network looks inevitable.
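
To give you an idea of just how dumb, here’s roughly what a complete request and the start of its response look like on the wire (the address is invented for the sake of the example): a few lines of plain text asking for a resource, then a few lines of plain text describing what comes back.

    GET /articles/open-data HTTP/1.1
    Host: example.com

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8

    ...the HTML document itself follows...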

As you can probably tell, I’m a big fan of HTML. Not only do I believe it to be a relatively durable format, I believe its simplicity lends itself well to accessibility. The most obvious example of this is the way that HTML can be interpreted by a screen reader. But that’s just one example of information stored in markup being transformed into another format (in this case, speech). Another example would be transforming information from markup onto a piece of paper by printing out a web page.

I’ve come to realise that there are fundamentally two kinds of web designer. On one side, you’ve got the people who, perhaps with a background in print, think that when an HTML document is rendered on a screen in a browser, that’s the end of the line. For them, markup, CSS and JavaScript are the means to that end. Then there’s the other kind of web designer. Let’s call them the professionals. These are the people who realise that the very strength of the Web is the fact that you don’t know how someone is going to consume your information. They might have it printed out, they might have it read out or they might view it on a screen but even then, who knows what size that screen will be or what kind of device the screen is attached to? It might be a computer, it might be a mobile phone, it might be a fridge. How do you design for that?

The glib answer is to surrender control and embrace flexibility. Instead of battling against the anarchic nature of the Web, go with it.

I’m sure that piece of advice is old news to you but I think you can take it further. Embrace flexibility in your attitude towards accessibility.

We nerds tend to be a logical bunch. We like looking at the world as a binary system where there’s a right way and a wrong way to do something. But accessibility isn’t that simple. It’s not black and white — it’s a big messy shade of grey. Reducing accessibility down to a Boolean value is harmful.

Who was at @media last year? Remember when Joe Clark gave a nuanced and well-argued presentation entitled When Web Accessibility Is Not Your Problem? Before he had even left the stage, people were already claiming that Joe Clark was saying “Web accessibility is not your problem.”

This Reductio ad Absurdum has got to stop. It even creeps into our thinking about users. We start thinking about disability as a permanent state, either one or zero. It isn’t that simple. If I’m suffering from a dearth of sleep or a surfeit of alcohol, I am cognitively disabled. If I’m trying to use the trackpad on my laptop while I’m squashed into a seat on the train from Brighton to London, I am motor-impaired.

Also, let’s stop talking about making websites accessible. Instead, let’s talk about keeping websites accessible. I’m not saying that HTML is a magic bullet but as long as you are using the most semantically appropriate elements to mark up your content, you are creating something that is, by default, accessible. It’s only afterwards, when we start adding the bells and whistles, that the problems can begin.
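
Here’s a minimal sketch of what I mean (the session titles and addresses are made up). There’s nothing here but plain, semantically appropriate elements, yet a screen reader can already announce the headings, count the list items and follow the links without any extra effort on our part.

    <!-- Plain, semantically appropriate markup: headings, a list, links -->
    <h1>Conference programme</h1>
    <h2>Morning sessions</h2>
    <ul>
      <li><a href="/sessions/open-data">Open Data</a></li>
      <li><a href="/sessions/shearing-layers">Shearing Layers for the Web</a></li>
    </ul>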

Don’t get me wrong: I’m not saying that we should censor ourselves and stifle our innovative ideas. I’m just talking about having a good baseline of solid structure. For a start, don’t fuck with links and forms.

If you ask me what technology I think every web designer should know, I’m not going to answer with CSS or Ajax or any programming language. No, I think that every web designer should know the difference between GET and POST. Know when to use a link and when to use a form. This is basic stuff that was built into the infrastructure of the Web from day one.
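
Here’s a rough sketch of the distinction (the addresses and field names are invented): a link, or a form with method="get", is for a safe request that merely retrieves information and can be bookmarked, cached and revisited; a form with method="post" is for a request that changes something on the server.

    <!-- Retrieving information: safe, repeatable, bookmarkable. A link or a GET form will do. -->
    <a href="/archive/2008/april">Browse the April archive</a>

    <form action="/search" method="get">
      <label for="q">Search</label>
      <input type="text" name="q" id="q">
      <input type="submit" value="Search">
    </form>

    <!-- Changing something on the server: use POST, never a plain link. -->
    <form action="/comments" method="post">
      <label for="comment">Add a comment</label>
      <textarea name="comment" id="comment"></textarea>
      <input type="submit" value="Post comment">
    </form>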

GET and POST aren’t the only methods that were created at the birth of the Web. Tim Berners-Lee also gave us the lesser-known PUT and DELETE. From the start, the World Wide Web was conceived as a read/write environment. It just didn’t turn out that way …until now.

Speaking for myself, I’ve found that I’m increasingly using the Web to publish information as well as consume it. I’ve got a bookmarks folder called “my other head” which contains links to the services I use daily: Flickr, Twitter, Pownce, Magnolia. They aren’t just websites, they are publishing tools. On today’s Web, I read and write in equal measure.

Accessibility guidelines that deal with Web content just don’t cut it any more. Guidelines intended for authoring tools are more applicable (if I had my way, the number one guideline would be “don’t fuck with links and forms”).

Accessibility doesn’t just mean that everyone should be able to consume what’s on the Web, it also means that everyone should be able to publish on the Web.

On the face of it, the current situation does not look good. Most social media sites have dreadful markup, obtrusive JavaScript and inflexible designs. But at the same time, they have a pervasive sense of openness that I find very encouraging indeed. The shared ethos is that this is your data so you should have access to it.

These services provide the ability to read and write information not just through an HTML page rendered in a browser. They offer the same access in a multiplicity of ways, from the simplicity of microformats through to RSS and right up to fully-fledged APIs. The most successful social media websites are the ones where you don’t have to visit the site at all.
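
To give a flavour of that range (the names and addresses below are purely illustrative): the same contact details can be marked up as an hCard microformat within an ordinary page, and the same stream of updates can be advertised as an RSS feed with a single link element, ready to be read somewhere other than the site itself.

    <!-- An hCard microformat: ordinary HTML that machines can also parse as contact data -->
    <div class="vcard">
      <a class="fn url" href="http://example.com/">Jane Example</a>,
      <span class="org">Example Ltd</span>
    </div>

    <!-- RSS autodiscovery: the same content, available as a feed -->
    <link rel="alternate" type="application/rss+xml"
          title="Site updates" href="http://example.com/updates.rss">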

Time for another tired old cliché: information wants to be free. As trite as this sounds, I think that on the Web it’s fundamentally true. Lack of access to data is damage. People will find a way to route around it.

Matthew Somerville excels at routing around the damage of inaccessibility. He’s the guy who built the accessible version of the Odeon cinema listings. He also built traintimes.org.uk, a more accessible way of getting train timetable information. He had to scrape the original websites to build these. That’s hard work. APIs provide an easier way for us to create alternate, accessible versions of valuable services.

If APIs are an accessibility feature, then we need to change how we judge websites accordingly. Suppose we’re looking at a web page with a list of stories. If the document doesn’t make good use of headings (h1, h2 and so on), then that’s a minus point. But if there’s a link to an RSS equivalent, then that’s a plus point.

The more numerous and varied the formats in which you can access data, the more accessible that data is. I realise that this flies in the face of the programming principle of DRY: Don’t Repeat Yourself. But really, you can never have too much data.

I’m not suggesting that any inaccessible website that provides an API automatically receives a “get out of jail free” card. But I do think that the API offers more potential solutions to fixing the accessibility issues. Instead of bitching and moaning about bad markup and crappy Ajax, we could more constructively use our time hacking on the API to provide a more accessible alternative.

The idea that information must reside on one specific website is dying. I hope that outdated marketing terms like “eyeballs” and “stickiness” die along with it.

As with any great change, there’s plenty of fear. If you have a business model that is based on the premise that some data is centralised, scarce and closed, you are backing a losing horse. The inaccessibility of that model dooms it.

There is a spirit of openness and collaboration that has spread inexorably through the Web since its creation. That spirit extends beyond data formats and technology. Our concepts of ownership and property are also changing. Try to ignore any whiff of socialism you might detect — this process is much more natural and inevitable.

So we come to the most important and the most contentious quality of openness: the right to information.

In this country, we suffer many affronts to our right to information. We have to pay to access Ordnance Survey data that was gathered using our tax money. OpenStreetMap and Free The Postcode are the natural responses to these most egregious of insults. People are beginning to ask for other data too. The Guardian is spearheading a campaign called Free Our Data to do exactly what it says on the tin.

Data that comes laden with restrictive licensing is crippled. When those restrictions are encoded into the format itself, the data is doomed. I’m talking about what is so euphemistically referred to as Digital Rights Management.

Here’s one last tired old cliché, this one from the sphere of anthropology. If a visitor from another planet came to Earth, what would they make of our society?

This is the very situation that Iain M. Banks describes in his novella The State Of The Art. A visitor from a post-singularity culture, called simply The Culture, looks down from her ship above Earth and reflects on her recent sojourn there:

I stroked one of Tagm’s hands, gazed again at the slowly revolving planet, my gaze flicking in one glance from pole to equator. ‘You know, when I was in Paris, seeing Linter for the first time, I was standing at the top of some steps in the courtyard where Linter’s place was, and I looked across it and there was a little notice on the wall saying it was forbidden to take photographs of the courtyard without the man’s permission.’ I turned to Tagm. ‘They want to own the light!’

They want to own the light. They really do. They call it plugging the analogue hole. Even DRMd images and video must eventually be converted into photons. Even DRMd audio must eventually be converted into vibrations in the air. That’s the analogue hole. They don’t just want to own the light, they want to own our very culture.

Every day we write words, we record videos, we take photographs. We also read, we watch movies, we listen to music, we look at works of art. We are contributing to a digital record that is an order of magnitude greater than the Domesday Book. This is more than just data. This is who we are. It must be preserved. It must be accessible.

It’s time to take sides. It would be hyperbole to describe it as a battle between good and evil but it’s no exaggeration to say it’s a battle between good and bad.

We can either spend our time and effort locking data up into closed formats with restrictive licensing. Or we can make a concerted effort to act in the spirit of the Web: standards, simplicity, sharing… these are the qualities of openness that will help us preserve our culture. If we want to be remembered for a culture of accessibility, we must make a commitment to open data.

Licence

This presentation is licensed under a Creative Commons attribution licence. You are free to:

Share
Copy, distribute and transmit this presentation.
Remix
Adapt the presentation.

Under the following conditions:

Attribution
You must attribute the presentation to Jeremy Keith.

Further Reading

  • The Code Book
  • How Buildings Learn
  • The Cogwheel Brain
  • Weaving The Web
  • Glut: Mastering Information Through The Ages
  • The State Of The Art
  • Free Culture: The Nature and Future of Creativity