Journal tags: cache

Caching and storing

When I was speaking at conferences last year about service workers, I’d introduce the Cache API. I wanted some way of explaining the difference between caching and other kinds of storage.

The way I explained it was that, while you might store stuff for a long time, you’d only cache stuff that you knew you were going to need again. So according to that definition, when you make a backup of your hard drive, that’s not caching …because you hope you’ll never need to use the backup.

But that explanation never sat well with me. Then more recently, I was chatting with Amber about caching. Once again, we were trying to define the difference between, say, the Cache API and things like LocalStorage and IndexedDB. At some point, we realised the fundamental difference: caches are for copies.

Think about it. If you store something in LocalStorage or IndexedDB, that’s the canonical home for that data. But anything you put into a cache must be a copy of something that exists elsewhere. That’s true of the Cache API, the browser cache, and caches on the server. An item in one of those caches is never the original—it’s always a copy of something that has a canonical home elsewhere.

By that definition, backing up your hard drive definitely is caching.

Anyway, I was glad to finally have a working definition to differentiate between caching and storing.

Periodic background sync

Yesterday I wrote about how much I’d like to see silent push for the web:

I’d really like silent push for the web—the ability to update a cache with fresh content as soon as it’s published; that would be nifty! At the same time, I understand the concerns. It feels more powerful than other permission-based APIs like notifications.

Today, John Holt Ripley responded on Twitter:

hi there, just read your blog post about Silent Push for the web, and wondering if Periodic Background Sync would cover a few of those use cases?

Periodic background sync looks very interesting indeed!

It’s not the same as silent push. As the name suggests, this is about your service worker waking up periodically and potentially fetching (and caching) fresh content from the network. So the service worker is polling rather than receiving a push. But I’ll take it! It’s definitely close enough for the kind of use-cases I’ve been thinking about.
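
Here is a rough sketch of what that polling could look like in a service worker, assuming a Chromium-style implementation of Periodic Background Sync. The tag name, the cache name, and the list of URLs are placeholders made up for illustration, and the browser decides when (or whether) the event actually fires.

```javascript
// Hypothetical names: the tag, cache name, and URLs are placeholders.
const SYNC_TAG = 'refresh-content';
const CACHE_NAME = 'pages';
const URLS_TO_REFRESH = ['/', '/journal'];

// Fetch each URL and store a fresh copy in the cache.
// fetchFn and cacheStorage are parameters so the logic can be tried outside a browser.
async function refreshCache(urls, fetchFn, cacheStorage) {
  const cache = await cacheStorage.open(CACHE_NAME);
  for (const url of urls) {
    const response = await fetchFn(url);
    if (response.ok) {
      await cache.put(url, response);
    }
  }
}

// Only wire up the listener when running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  self.addEventListener('periodicsync', event => {
    if (event.tag === SYNC_TAG) {
      event.waitUntil(refreshCache(URLS_TO_REFRESH, url => fetch(url), caches));
    }
  });
}
```

The page itself would still have to ask for the sync, with something along the lines of registration.periodicSync.register('refresh-content', { minInterval: 24 * 60 * 60 * 1000 }), and the browser is free to refuse.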

Interestingly, periodic background sync also ties into the other part of what I was writing about: permissions. I mentioned that adding a site to the home screen could be interpreted as a signal to potentially allow more permissions (or at least allow prompts for more permissions).

Well, Chromium has a document outlining metrics for attempting to gauge site engagement. There’s some good thinking in there.

Silent push for the web

After Indie Web Camp in Berlin last year, I wrote about Seb’s nifty demo of push without notifications:

While I’m very unwilling to grant permission to be interrupted by intrusive notifications, I’d be more than willing to grant permission to allow a website to silently cache timely content in the background. It would be a more calm technology.

Phil Nash left a comment on the Medium copy of my post explaining that Seb’s demo of using the Push API without showing a notification wouldn’t work for long:

The browsers allow a certain number of mistakes(?) before they start to show a generic notification to say that your site sent a push notification without showing a notification. I believe that after ~10 or so notifications, and that’s different between browsers, they run out of patience.

He also provided me with the name to describe what I’m after:

You’re looking for “silent push” as are many others.

Silent push is something that is possible in native apps. It isn’t (yet?) available on the web, presumably because of security concerns.

It’s an API that would be ripe for abuse. I mean, just look at the mess we’ve made with APIs like notifications and geolocation. Sure, they require explicit user opt-in, but these opt-ins are seen so often that users are sick of seeing them. Silent push would be one more permission-based API to add to the stack of annoyances.

Still, I’d really like silent push for the web—the ability to update a cache with fresh content as soon as it’s published; that would be nifty! At the same time, I understand the concerns. It feels more powerful than other permission-based APIs like notifications.

Maybe there could be another layer of permissions. What if adding a site to your home screen was the first step? If a site is running on HTTPS, has a service worker, has a web app manifest, and has been added to the home screen, maybe then and only then should it be allowed to prompt for permission to do silent push.

In other words, what if certain very powerful APIs were only available to progressive web apps that have successfully been added to the home screen?

Frankly, I’d be happy if the same permissions model applied to web notifications too, but I guess that ship has sailed.

Anyway, all this is pure conjecture on my part. As far as I know, silent push isn’t on the roadmap for any of the browser vendors right now. That’s fair enough. Although it does annoy me that native apps have this capability that web sites don’t.

It used to be that there was a long list of features that only native apps could do, but that list has grown shorter and shorter. The web’s hare is catching up to native’s tortoise.

Going offline with microformats

For the offline page on my website, I’ve been using a mixture of the Cache API and the localStorage API. My service worker script uses the Cache API to store copies of pages for offline retrieval. But I used the localStorage API to store metadata about the page: title, description, and so on. Then, my offline page would rifle through the pages stored in a cache, and retrieve the corresponding metadata from localStorage.

It all worked fine, but as soon as I read Remy’s post about the forehead-slappingly brilliant technique he’s using, I knew I’d be switching my code over. Instead of using localStorage—or any other browser API—to store and retrieve metadata, he uses the pages themselves! Using the Cache API, you can examine the contents of the pages you’ve stored, and get at whatever information you need:

I realised I didn’t need to store anything. HTML is the API.

Refactoring the code for my offline page felt good for a couple of reasons. First of all, I was able to remove a dependency—localStorage—and simplify the JavaScript. That always feels good. But the other reason for the warm fuzzies is that I was able to use data instead of metadata.

Many years ago, Cory Doctorow wrote a piece called Metacrap. In it, he enumerates the many issues with metadata—data about data. The source of many problems is when the metadata is stored separately from the data it describes. The data may get updated, without a corresponding update happening to the metadata. Metadata tends to rot because it’s invisible—out of sight and out of mind.

In fact, that’s always been at the heart of one of the core principles behind microformats. Instead of duplicating information—once as data and again as metadata—repurpose the visible data; mark it up so its meta-information is directly attached to the information itself.

So if you have a person’s contact details on a web page, rather than repeating that information somewhere else—in the head of the document, say—you could instead attach some kind of marker to indicate which bits of the visible information are contact details. In the case of microformats, that’s done with class attributes. You can mark up a page that already has your contact information with classes from the h-card microformat.

Here on my website, I’ve marked up my blog posts, articles, and links using the h-entry microformat. These classes explicitly mark up the content to say “this is the title”, “this is the content”, and so on. This makes it easier for other people to repurpose my content. If, for example, I reply to a post on someone else’s website, and ping them with a webmention, they can retrieve my post and know which bit is the title, which bit is the content, and so on.

When I read Remy’s post about using the Cache API to retrieve information directly from cached pages, I knew I wouldn’t have to do much work. Because all of my posts are already marked up with h-entry classes, I could use those hooks to create a nice offline page.

The markup for my offline page looks like this:

<h1>Offline</h1>
<p>Sorry. It looks like the network connection isn’t working right now.</p>
<div id="history">
</div>

I’ll populate that “history” div with information from a cache called “pages” that I’ve created using the Cache API in my service worker.

I’m going to use async/await to do this because there are lots of steps that rely on the completion of the step before. “Open this cache, then get the keys of that cache, then loop through the pages, then…” All of those thens would lead to some serious indentation without async/await.

Async functions don’t have to be named (anonymous functions and arrow functions can be async too), but I’m giving this one a name: listPages, just like Remy is doing. I’m making the listPages function execute immediately:

(async function listPages() {
...
})();

Now for the code to go inside that immediately-invoked function.

I create an array called browsingHistory that I’ll populate with the data I’ll use for that “history” div.

const browsingHistory = [];

I’m going to be parsing web pages later on, so I’m going to need a DOM parser. I give it the imaginative name of …parser.

const parser = new DOMParser();

Time to open up my “pages” cache. This is the first await statement. When the cache is opened, this promise will resolve and I’ll have access to this cache using the variable …cache (again with the imaginative naming).

const cache = await caches.open('pages');

Now I get the keys of the cache—that’s a list of all the page requests in there. This is the second await. Once the keys have been retrieved, I’ll have a variable that’s got a list of all those pages. You’ll never guess what I’m calling the variable that stores the keys of the cache. That’s right …keys!

const keys = await cache.keys();

Time to get looping. I’m getting each request in the list of keys using a for/of loop:

for (const request of keys) {
...
}

Inside the loop, I pull the page out of the cache using the match() method of the Cache API. I’ll store what I get back in a variable called response. As with everything involving the Cache API, this is asynchronous so I need to use the await keyword here.

const response = await cache.match(request);

I’m not interested in the headers of the response. I’m specifically looking for the HTML itself. I can get at that using the text() method. Again, it’s asynchronous and I want this promise to resolve before doing anything else, so I use the await keyword. When the promise resolves, I’ll have a variable called html that contains the body of the response.

const html = await response.text();

Now I can use that DOM parser I created earlier. I’ve got a string of text in the html variable. I can generate a Document Object Model from that string using the parseFromString() method. This isn’t asynchronous so there’s no need for the await keyword.

const dom = parser.parseFromString(html, 'text/html');

Now I’ve got a DOM, which I have creatively stored in a variable called …dom.

I can poke at it using DOM methods like querySelector. I can test to see if this particular page has an h-entry on it by looking for an element with a class attribute containing the value “h-entry”:

if (dom.querySelector('.h-entry h1.p-name')) {
...
}

In this particular case, I’m also checking to see if the h1 element of the page is the title of the h-entry. That’s so that index pages (like my home page) won’t get past this if statement.

Inside the if statement, I’m going to store the data I retrieve from the DOM. I’ll save the data into an object called …data!

const data = new Object;

Well, the first piece of data isn’t actually in the markup: it’s the URL of the page. I can get that from the request variable in my for loop.

data.url = request.url;

I’m going to store the timestamp for this h-entry. I can get that from the datetime attribute of the time element marked up with a class of dt-published.

data.timestamp = new Date(dom.querySelector('.h-entry .dt-published').getAttribute('datetime'));

While I’m at it, I’m going to grab the human-readable date from the innerText property of that same time.dt-published element.

data.published = dom.querySelector('.h-entry .dt-published').innerText;

The title of the h-entry is in the innerText of the element with a class of p-name.

data.title = dom.querySelector('.h-entry .p-name').innerText;

At this point, I am actually going to use some metacrap instead of the visible h-entry content. I don’t output a description of the post anywhere in the body of the page, but I do put it in the head in a meta element. I’ll grab that now.

data.description = dom.querySelector('meta[name="description"]').getAttribute('content');

Alright. I’ve got a URL, a timestamp, a publication date, a title, and a description, all retrieved from the HTML. I’ll stick all of that data into my browsingHistory array.

browsingHistory.push(data);

My if statement and my for/of loop are finished at this point. Here’s how the whole loop looks:

for (const request of keys) {
  const response = await cache.match(request);
  const html = await response.text();
  const dom = parser.parseFromString(html, 'text/html');
  if (dom.querySelector('.h-entry h1.p-name')) {
    const data = new Object;
    data.url = request.url;
    data.timestamp = new Date(dom.querySelector('.h-entry .dt-published').getAttribute('datetime'));
    data.published = dom.querySelector('.h-entry .dt-published').innerText;
    data.title = dom.querySelector('.h-entry .p-name').innerText;
    data.description = dom.querySelector('meta[name="description"]').getAttribute('content');
    browsingHistory.push(data);
  }
}
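
One thing to watch: querySelector() returns null when nothing matches, so a cached page with an h-entry but no description meta element would throw inside the loop. Here is a slightly more defensive sketch of the same extraction. The function name readEntry is made up for illustration and isn’t part of the original code; it takes the parsed document and the request URL as arguments.

```javascript
// Pull h-entry data out of a parsed document, or return null
// if the page doesn't look like a single post.
function readEntry(dom, url) {
  const title = dom.querySelector('.h-entry h1.p-name');
  const time = dom.querySelector('.h-entry .dt-published');
  if (!title || !time) {
    return null;
  }
  // The description is optional: fall back to an empty string.
  const meta = dom.querySelector('meta[name="description"]');
  return {
    url: url,
    timestamp: new Date(time.getAttribute('datetime')),
    published: time.innerText,
    title: title.innerText,
    description: meta ? meta.getAttribute('content') : ''
  };
}

// Inside the loop, this could replace the if statement:
//   const data = readEntry(dom, request.url);
//   if (data) browsingHistory.push(data);
```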

That’s the data collection part of the code. Now I’m going to take all that yummy information and output it onto the page.

First of all, I want to make sure that the browsingHistory array isn’t empty. There’s no point going any further if it is.

if (browsingHistory.length) {
...
}

Within this if statement, I can do what I want with the data I’ve put into the browsingHistory array.

I’m going to arrange the data by date published. I’m not sure if this is the right thing to do. Maybe it makes more sense to show the pages in the order in which you last visited them. I may end up removing this at some point, but for now, here’s how I sort the browsingHistory array according to the timestamp property of each item within it:

browsingHistory.sort( (a,b) => {
  return b.timestamp - a.timestamp;
});
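
Subtracting one Date from another works because each Date is converted to its millisecond timestamp first. A small standalone sketch, using made-up sample data in place of the real browsingHistory array:

```javascript
// Made-up sample data standing in for the browsingHistory array.
const sampleHistory = [
  { title: 'Older post', timestamp: new Date('2018-01-05T10:00:00Z') },
  { title: 'Newest post', timestamp: new Date('2018-03-20T09:30:00Z') },
  { title: 'Middle post', timestamp: new Date('2018-02-11T17:45:00Z') }
];

// Newest first: a positive result sorts b ahead of a.
sampleHistory.sort((a, b) => b.timestamp - a.timestamp);

const titles = sampleHistory.map(item => item.title);
console.log(titles); // [ 'Newest post', 'Middle post', 'Older post' ]
```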

Now I’m going to concatenate some strings. This is the string of HTML text that will eventually be put into the “history” div. I’m storing the markup in a string called …markup (my imagination knows no bounds).

let markup = '<p>But you still have something to read:</p>';

I’m going to add a chunk of markup for each item of data.

browsingHistory.forEach( data => {
  markup += `
<h2><a href="${ data.url }">${ data.title }</a></h2>
<p>${ data.description }</p>
<p class="meta">${ data.published }</p>
`;
});

With my markup assembled, I can now insert it into the “history” part of my offline page. I’m using the handy insertAdjacentHTML() method to do this.

document.getElementById('history').insertAdjacentHTML('beforeend', markup);

Here’s what my finished JavaScript looks like:

<script>
(async function listPages() {
  const browsingHistory = [];
  const parser = new DOMParser();
  const cache = await caches.open('pages');
  const keys = await cache.keys();
  for (const request of keys) {
    const response = await cache.match(request);
    const html = await response.text();
    const dom = parser.parseFromString(html, 'text/html');
    if (dom.querySelector('.h-entry h1.p-name')) {
      const data = new Object;
      data.url = request.url;
      data.timestamp = new Date(dom.querySelector('.h-entry .dt-published').getAttribute('datetime'));
      data.published = dom.querySelector('.h-entry .dt-published').innerText;
      data.title = dom.querySelector('.h-entry .p-name').innerText;
      data.description = dom.querySelector('meta[name="description"]').getAttribute('content');
      browsingHistory.push(data);
    }
  }
  if (browsingHistory.length) {
    browsingHistory.sort( (a,b) => {
      return b.timestamp - a.timestamp;
    });
    let markup = '<p>But you still have something to read:</p>';
    browsingHistory.forEach( data => {
      markup += `
<h2><a href="${ data.url }">${ data.title }</a></h2>
<p>${ data.description }</p>
<p class="meta">${ data.published }</p>
`;
    });
    document.getElementById('history').insertAdjacentHTML('beforeend', markup);
  }
})();
</script>

I’m pretty happy with that. It’s not too long but it’s still quite readable (I hope). It shows that the Cache API and the h-entry microformat are a match made in heaven.

If you’ve got an offline strategy for your website, and you’re using h-entry to mark up your content, feel free to use that code.

If you don’t have an offline strategy for your website, there’s a book for that.

The trimCache function in Going Offline …again

It seems that some code that I wrote in Going Offline is haunted. It’s the trimCache function.

First, there was the issue of a typo. Or maybe it’s more of a brainfart than a typo, but either way, there’s a mistake in the syntax that was published in the book.

Now it turns out that there’s also a problem with my logic.

To recap, this is a function that takes two arguments: the name of a cache, and the maximum number of items that cache should hold.

function trimCache(cacheName, maxItems) {

First, we open up the cache:

caches.open(cacheName)
.then( cache => {

Then, we get the items (keys) in that cache:

cache.keys()
.then(keys => {

Now we compare the number of items (keys.length) to the maximum number of items allowed:

if (keys.length > maxItems) {

If there are too many items, delete the first item in the cache—that should be the oldest item:

cache.delete(keys[0])

And then run the function again:

.then(
    trimCache(cacheName, maxItems)
);

A-ha! See the problem?

Neither did I.

It turns out that, even though I’m using then, the function will be invoked immediately, instead of waiting until the first item has been deleted.

Trys helped me understand what was going on by making a useful analogy. You know when you use setTimeout, you can’t put a function—complete with parentheses—as the first argument?

window.setTimeout(doSomething(someValue), 1000);

In that example, doSomething(someValue) will be invoked immediately—not after 1000 milliseconds. Instead, you need to create an anonymous function like this:

window.setTimeout( function() {
    doSomething(someValue)
}, 1000);

Well, it’s the same in my trimCache function. Instead of this:

cache.delete(keys[0])
.then(
    trimCache(cacheName, maxItems)
);

I need to do this:

cache.delete(keys[0])
.then( function() {
    trimCache(cacheName, maxItems)
});

Or, if you prefer the more modern arrow function syntax:

cache.delete(keys[0])
.then( () => {
    trimCache(cacheName, maxItems)
});

Either way, I have to wrap the recursive function call in an anonymous function.
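
The difference is easy to see in isolation. In this sketch (the step function and the log array are just scaffolding for the demonstration), the first call runs straight away, while the wrapped call waits for the promise:

```javascript
const log = [];

function step(label) {
  log.push(label);
}

// step('eager') runs right now; then() is handed its return value (undefined).
Promise.resolve().then(step('eager'));

// The arrow function is handed to then(), so step('deferred')
// waits until the promise has resolved.
Promise.resolve().then(() => step('deferred'));

log.push('end of script');

// Once pending promise callbacks have run, show the order.
setTimeout(() => console.log(log), 0);
// [ 'eager', 'end of script', 'deferred' ]
```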

Here’s a gist with the updated trimCache function.
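
For reference, here is a sketch of how the whole function might look with the fix applied. It is a reconstruction from the steps above, not a copy of the gist. The third parameter, cacheStorage, is an addition for illustration so the function can be exercised without a browser; it defaults to the global caches object. Returning each promise is also an addition, so that callers can tell when the trimming has finished.

```javascript
function trimCache(cacheName, maxItems, cacheStorage = caches) {
  // Open the cache, then look at its keys.
  return cacheStorage.open(cacheName)
  .then( cache => {
    return cache.keys()
    .then( keys => {
      if (keys.length > maxItems) {
        // Delete the first (oldest) item, and only then check again.
        return cache.delete(keys[0])
        .then( () => {
          return trimCache(cacheName, maxItems, cacheStorage);
        });
      }
    });
  });
}
```

In a service worker it could be called as trimCache('pages', 30), where the cache name and the limit are whatever suits the site.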

What’s annoying is that this mistake wasn’t throwing an error. Instead, it was causing a performance problem. I’m using this pattern right here on my own site, and whenever my cache of pages or images got too big, the trimCache function would get called …and then wouldn’t stop running.

I’m very glad that, with the help of Trys at last week’s Homebrew Website Club Brighton, I was finally able to get to the bottom of this. If you’re using the trimCache function in your service worker, please update the code accordingly.

Management regrets the error.

Am I cached or not?

When I was writing about the lie-fi strategy I’ve added to adactio.com, I finished with this thought:

What I’d really like is some way to know—on the client side—whether or not the currently-loaded page came from a cache or from a network. Then I could add some kind of interface element that says, “Hey, this page might be stale—click here if you want to check for a fresher version.”

Trys heard my plea, and came up with a very clever technique to alter the HTML of a page when it’s put into a cache.

It’s a function that reads the response body stream in, returning a new stream. Whilst reading the stream, it searches for the character codes that make up: <html. If it finds them, it tacks on a data-cached attribute.

Nice!
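
Trys’s version works on the response stream as it is read. As a much-simplified illustration of the same idea, here is a string-based sketch. The function name markAsCached is hypothetical, and working on the whole body as one string is an assumption that gives up the benefits of streaming.

```javascript
// Add a data-cached attribute to the opening <html tag, if there is one.
function markAsCached(html) {
  return html.replace(/<html(?=[\s>])/i, match => `${match} data-cached`);
}

console.log(markAsCached('<!DOCTYPE html><html lang="en"><body>Hi</body></html>'));
// <!DOCTYPE html><html data-cached lang="en"><body>Hi</body></html>
```

In a service worker, the altered string would then need to be wrapped in a new Response before being put in the cache.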

But then I was discussing this issue with Tantek and Aaron late one night after Indie Web Camp Düsseldorf. I realised that I might have another potential solution that doesn’t involve the service worker at all.

Caveat: this will only work for pages that have some kind of server-side generation. This won’t work for static sites.

In my case, pages are generated by PHP. I’m not doing a database lookup every time you request a page—I’ve got a server-side cache of posts, for example—but there is a little bit of assembly done for every request: get the header from here; get the main content from over there; get the footer; put them all together into a single page and serve that up.

This means I can add a timestamp to the page (using PHP). I can mark the moment that it was served up. Then I can use JavaScript on the client side to compare that timestamp to the current time.

I’ve published the code as a gist.

In a script element on each page, I have this bit of coducken:

var serverTimestamp = <?php echo time(); ?>;

Now the JavaScript variable serverTimestamp holds the timestamp that the page was generated. When the page is put in the cache, this won’t change. This number should be the number of seconds since January 1st, 1970 in the UTC timezone (that’s what my server’s timezone is set to).

Starting with JavaScript’s Date object, I use a caravan of methods like toUTCString() and getTime() to end up with a variable called clientTimestamp. This will give the current number of seconds since January 1st, 1970, regardless of whether the page is coming from the server or from the cache.

var localDate = new Date();
var localUTCString = localDate.toUTCString();
var UTCDate = new Date(localUTCString);
var clientTimestamp = UTCDate.getTime() / 1000;

Then I compare the two and see if there’s a discrepancy greater than five minutes:

if (clientTimestamp - serverTimestamp > (60 * 5))

If there is, then I inject some markup into the page, telling the reader that this page might be stale:

document.querySelector('main').insertAdjacentHTML('afterbegin',`
  <p class="feedback">
    <button onclick="this.parentNode.remove()">dismiss</button>
    This page might be out of date. You can try <a href="javascript:window.location=window.location.href">refreshing</a>.
  </p>
`);

The reader has the option to refresh the page or dismiss the message.

This page might be out of date. You can try refreshing.

It’s not foolproof by any means. If the visitor’s computer has their clock set weirdly, then the comparison might return a false positive every time. Still, I thought that using UTC might be a safer bet.
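
Because getTime() already counts milliseconds since the epoch in UTC, whatever timezone the visitor is in, the comparison can also be sketched more compactly. The function name isProbablyStale and its parameters are illustrative and aren’t part of the published gist.

```javascript
const FIVE_MINUTES = 60 * 5; // in seconds

// serverTimestamp: seconds since the epoch, as output by PHP's time().
// nowMs: milliseconds since the epoch; defaults to the visitor's clock.
function isProbablyStale(serverTimestamp, nowMs = Date.now(), maxAge = FIVE_MINUTES) {
  const clientTimestamp = Math.floor(nowMs / 1000);
  return clientTimestamp - serverTimestamp > maxAge;
}

// A page generated at 12:00:00 UTC, viewed ten minutes later:
const generated = Date.UTC(2019, 0, 1, 12, 0, 0) / 1000;
const viewed = Date.UTC(2019, 0, 1, 12, 10, 0);
console.log(isProbablyStale(generated, viewed)); // true
```

A clock that is set wrongly on the visitor’s device will still throw the comparison off, just as before.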

All in all, I think this is a pretty good method for detecting if a page is being served from a cache. Remember, the goal here is not to determine if the user is offline—for that, there’s navigator.onLine.

The upshot is this: if you visit my site with a crappy internet connection (lie-fi), then after three seconds you may be served with a cached version of the page you’re requesting (if you visited that page previously). If that happens, you’ll now also be presented with a little message telling you that the page isn’t fresh. Then it’s up to you whether you want to have another go.

I like the way that this puts control back into the hands of the user.

Push without notifications

On the first day of Indie Web Camp Berlin, I led a session on going offline with service workers. This covered all the usual use-cases: pre-caching; custom offline pages; saving pages for offline reading.

But on the second day, Sebastiaan spent a fair bit of time investigating a more complex use of service workers with the Push API.

The Push API is what makes push notifications possible on the web. There are a lot of moving parts—browser, server, service worker—and, frankly, it’s way over my head. But I’m familiar with the general gist of how it works. Here’s a typical flow:

  1. A website prompts the user for permission to send push notifications.
  2. The user grants permission.
  3. A whole lot of complicated stuff happens behind the scenes.
  4. Next time the website publishes something relevant, it fires a push message containing the details of the new URL.
  5. The user’s service worker receives the push message (even if the site isn’t open).
  6. The service worker creates a notification linking to the URL, interrupting the user, and generally adding to the weight of information overload.

Here’s what Sebastiaan wanted to investigate: what if that last step weren’t so intrusive? Here’s the alternate flow he wanted to test:

  1. A website prompts the user for permission to send push notifications.
  2. The user grants permission.
  3. A whole lot of complicated stuff happens behind the scenes.
  4. Next time the website publishes something relevant, it fires a push message containing the details of the new URL.
  5. The user’s service worker receives the push message (even if the site isn’t open).
  6. The service worker fetches the contents of the URL provided in the push message and caches the page. Silently.

It worked.
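
What follows is a minimal sketch of that last step, not Sebastiaan’s actual code. It assumes the push message payload is simply the new URL as plain text, and the cache name is a placeholder.

```javascript
const CACHE_NAME = 'pages'; // placeholder cache name

// Fetch the URL named in the push payload and store a copy in the cache.
// fetchFn and cacheStorage are parameters so the logic can be tried outside a browser.
async function cachePushedUrl(url, fetchFn, cacheStorage) {
  const response = await fetchFn(url);
  if (!response.ok) {
    return false;
  }
  const cache = await cacheStorage.open(CACHE_NAME);
  await cache.put(url, response);
  return true;
}

// Only wire up the listener when running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  self.addEventListener('push', event => {
    // Assumption: the payload is the new URL as plain text.
    const url = event.data ? event.data.text() : null;
    if (url) {
      // No showNotification() call here: the page is cached silently.
      event.waitUntil(cachePushedUrl(url, u => fetch(u), caches));
    }
  });
}
```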

I think this could be a real game-changer. I don’t know about you, but I’m very, very wary of granting websites the ability to send me push notifications. In fact, I don’t think I’ve ever given a website permission to interrupt me with push notifications.

You’ve seen the annoying permission dialogues, right?

In Firefox, it looks like this:

Will you allow name-of-website to send notifications?

[Not Now] [Allow Notifications]

In Chrome, it’s:

name-of-website wants to

Show notifications

[Block] [Allow]

But in actual fact, these dialogues are asking for permission to do two things:

  1. Receive messages pushed from the server.
  2. Display notifications based on those messages.

There’s no way to ask for permission just to do the first part. That’s a shame. While I’m very unwilling to grant permission to be interrupted by intrusive notifications, I’d be more than willing to grant permission to allow a website to silently cache timely content in the background. It would be a more calm technology.

Think of the use cases:

  • I grant push permission to a magazine. When the magazine publishes a new article, it’s cached on my device.
  • I grant push permission to a podcast. Whenever a new episode is published, it’s cached on my device.
  • I grant push permission to a blog. When there’s a new blog post, it’s cached on my device.

Then when I’m on a plane, or in the subway, or in any other situation without a network connection, I could still visit these websites and get content that’s fresh to me. It’s kind of like background sync in reverse.

There’s plenty of opportunity for abuse—the cache could get filled with content. But websites can already do that, and they don’t need to be granted any permissions to do so; just by visiting a website, it can add multiple files to a cache.

So it seems that the reason for the permissions dialogue is all about displaying notifications …not so much about receiving push messages from the server.

I wish there were a way to implement this background-caching pattern without requiring the user to grant permission to a dialogue that contains the word “notification.”

I wonder if the act of adding a site to the home screen could implicitly grant permission to allow use of the Push API without notifications?

In the meantime, the proposal for periodic synchronisation (using background sync) could achieve similar results, but in a less elegant way; periodically polling for new content instead of receiving a push message when new content is published. Also, it requires permission. But at least in this case, the permission dialogue should be more specific, and wouldn’t include the word “notification” anywhere.

The trimCache function in Going Offline

Paul Yabsley wrote to let me know about an error in Going Offline. It’s rather embarrassing because it’s code that I’m using in the service worker for adactio.com but for some reason I messed it up in the book.

It’s the trimCache function in Chapter 7: Tidying Up. That’s the reusable piece of code that recursively reduces the number of items in a specified cache (cacheName) to a specified amount (maxItems). On pages 95 and 96 I describe the process of creating the function which, in the book, ends up like this:

 function trimCache(cacheName, maxItems) {
   cacheName.open( cache => {
     cache.keys()
     .then( items => {
       if (items.length > maxItems) {
         cache.delete(items[0])
         .then(
           trimCache(cacheName, maxItems)
         ); // end delete then
       } // end if
     }); // end keys then
   }); // end open
 } // end function

See the problem? It’s right there at the start when I try to open the cache like this:

cacheName.open( cache => {

That won’t work. The open method only works on the caches object—I should be passing the name of the cache into the caches.open method. So the code should look like this:

caches.open( cacheName )
.then( cache => {

Everything else remains the same. The corrected trimCache function is here:

function trimCache(cacheName, maxItems) {
  caches.open(cacheName)
  .then( cache => {
    cache.keys()
    .then(items => {
      if (items.length > maxItems) {
        cache.delete(items[0])
        .then(
          trimCache(cacheName, maxItems)
        ); // end delete then
      } // end if
    }); // end keys then
  }); // end open then
} // end function

Sorry about that! I must’ve had some kind of brainfart when I was writing (and describing) that one line of code.

You may want to deface your copy of Going Offline by taking a pen to that code example. Normally I consider the practice of writing in books to be barbarism, but in this case …go for it.

Update: There was another error in the code for trimCache! Here’s the fix.

Minimal viable service worker

I really, really like service workers. They’re one of those technologies that have such clear benefits to users that it seems like a no-brainer to add a service worker to just about any website.

The thing is, every website is different. So the service worker strategy for every website needs to be different too.

Still, I was wondering if it would be possible to create a service worker script that would work for most websites. Here’s the script I came up with.

The logic works like this:

  • If there’s a request for an HTML page, fetch it from the network and store a copy in a cache (but if the network request fails, try looking in the cache instead).
  • For any other files, look for a copy in the cache first but meanwhile fetch a fresh version from the network to update the cache (and if there’s no existing version in the cache, fetch the file from the network and store a copy of it in the cache).

So HTML files are served network-first, while all other files are served cache-first, but in both cases a fresh copy is always put in the cache. The idea is that HTML content will always be fresh (unless there’s a problem with the network), while all other content—images, style sheets, scripts—might be slightly stale, but get refreshed with every request.
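
The two branches can be sketched roughly as follows. This is an illustration of the logic described above rather than the script itself: the cache name and function names are placeholders, the fetch function and cache storage are passed in as an env object so the logic can be tried outside a browser, and details such as non-GET requests are ignored.

```javascript
const CACHE_NAME = 'files'; // placeholder cache name

// Store a copy of the response, leaving the original free to be returned.
async function saveCopy(request, response, env) {
  const cache = await env.caches.open(CACHE_NAME);
  await cache.put(request, response.clone());
}

// HTML: try the network first, fall back to the cache if the network fails.
async function networkFirst(request, env) {
  let response;
  try {
    response = await env.fetch(request);
  } catch (error) {
    return env.caches.match(request); // may resolve to undefined
  }
  await saveCopy(request, response, env);
  return response;
}

// Everything else: serve the cached copy if there is one,
// while fetching a fresh version to update the cache.
async function cacheFirst(request, env) {
  const cached = await env.caches.match(request);
  const refreshed = env.fetch(request).then(async response => {
    await saveCopy(request, response, env);
    return response;
  });
  if (cached) {
    refreshed.catch(() => {}); // a failed refresh is fine when a copy exists
    return cached;
  }
  return refreshed;
}

// Only wire up the listener when running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  const env = { fetch: request => fetch(request), caches: caches };
  self.addEventListener('fetch', event => {
    const accept = event.request.headers.get('Accept') || '';
    const handler = accept.includes('text/html') ? networkFirst : cacheFirst;
    event.respondWith(handler(event.request, env));
  });
}
```

In a real service worker, the background refresh in cacheFirst would probably also need to be handed to event.waitUntil() so that the worker isn’t shut down before the cache is updated.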

My original attempt was riddled with errors. Jake came to my rescue and we revised the script into something that actually worked. In the process, my misunderstanding of how await works led Jake to write a great blog post on await vs return vs return await.

I got there in the end and the script seems solid enough. It’s a fairly simplistic strategy that could work for quite a few sites, but it has some issues…

Service workers don’t perform any automatic cleanup of caches—that’s up to you to do (usually during the activate event). This script doesn’t do any cleanup so the cache might grow and grow and grow. For that reason, I think the script is best suited for fairly small sites.

The strategy also assumes that a file will either be fetched from the network or the cache. There’s no contingency for when both attempts fail. So there’s no fallback offline page, for example.

I decided to test it in the wild, but I expanded it slightly to fix the fallback issue. The version on the Ampersand 2018 website includes a worst-case-scenario option to show a custom offline page that has been pre-cached. (By the way, if you haven’t got a ticket for Ampersand yet, get a ticket now: it’s going to be a superb day of web typography nerdery.)

Anyway, this fairly basic script seems to be delivering some good performance improvements. If you’ve got a site that you think would benefit from this network/caching strategy, and it’s served over HTTPS, then:

  1. Feel free to download the script or copy and paste it into a file called serviceworker.js,
  2. Put that file in the root directory of your website,
  3. Add this in a script element at the bottom of your HTML pages:

if (navigator.serviceWorker && !navigator.serviceWorker.controller) {
  navigator.serviceWorker.register('/serviceworker.js');
}

You can also use the script as a starting point. You might find issues specific to your particular website. That’s okay—you can tweak and adjust the script to suit your needs.

If this minimal service worker script proves in any way useful to you, thank Jake.

In AMP we trust

AMP Conf was one of those deep dive events, with two days dedicated to one single technology: AMP.

Except AMP isn’t really one technology, is it? And therein lies the confusion. This was at the heart of the panel I was on. When we talk about AMP, we could be talking about one of three things:

  1. The AMP format. A bunch of web components. For instance, instead of using an img element on an AMP page, you use an amp-img element instead.
  2. The AMP rules. There’s one JavaScript file, hosted on Google’s servers, that turns those web components from spans into working elements. No other JavaScript is allowed. All your styles must be in a style element instead of an external file, and there’s a limit on what you can do with those styles.
  3. The AMP cache. The source of most confusion—and even downright enmity—this is what’s behind the fact that when you launch an AMP result from Google search, you don’t go to another website. You see Google’s cached copy of the page instead of the original.

The first piece of AMP—the format—is kind of like a collection of marginal gains. Where the img element might have some performance issues, the amp-img element optimises for perceived performance. But if you just used the AMP web components, it wouldn’t be enough to make your site blazingly fast.

The second part of AMP—the rules—is where the speed gains start to really show. You can’t have an external style sheet, and crucially, you can’t have any third-party scripts other than the AMP script itself. This is key to making AMP pages super fast. It’s not so much about what AMP does; it’s more about what it doesn’t allow. If you never used a single AMP component, but stuck to AMP’s rules disallowing external styles and scripts, you could easily make a page that’s even faster than what AMP can do.

At AMP Conf, Natalia pointed out that The Guardian’s non-AMP pages beat out the AMP pages for performance. So why even have AMP pages? Well, that’s down to the third, most contentious, part of the AMP puzzle.

The AMP cache turns the user experience of visiting an AMP page from fast to instant. While you’re still on the search results page, Google will pre-render an AMP page in the background. Not pre-fetch, pre-render. That’s why it opens so damn fast. It’s also what causes the most confusion for end users.

From my unscientific polling, the behaviour of AMP results confuses the hell out of people. The fact that the page opens instantly isn’t the problem, far from it. It’s the fact that you don’t actually go to another page. Technically, you’re still on Google. An analogous mental model would be an RSS reader, or an email client: you don’t go to an item or an email; you view it in situ.

Well, that mental model would be fine if it were consistent. But in Google search, only some results will behave that way (the AMP pages) and others will behave just like regular links to other websites. No wonder people are confused! Some search results take them away and some search results keep them on Google …even though the page looks like a different website.

The price that we pay for the instantly-opening AMP pages from the Google cache is the URL. Because we’re looking at Google’s pre-rendered copy instead of the original URL, the address bar is not pointing to the site the browser claims to be showing. Everything in the body of the browser looks like an article from The Guardian, but if I look at the URL (which is what security people have been telling us for years is important to avoid being phished), then I’ll see a domain that is not The Guardian’s.

But wait! Couldn’t Google pre-render the page at its original URL?

Yes, they could. But they won’t.

This was a point that Paul kept coming back to: trust. There’s no way that Google can trust that someone else’s URL will play by the AMP rules (no external scripts, only loading embedded content via web components, limited styles, etc.). They can only trust the copies that they themselves are serving up from their cache.

By the way, there was a joint AMP/search panel at AMP Conf with representatives from both teams. As you can imagine, there were many questions for the search team, most of which were Glomar’d. But one thing that the search people said time and again was that Google was not hosting our AMP pages. Now I don’t know if they were trying to make some fine-grained semantic distinction there, but that’s an outright falsehood. If I click on a link, and the URL I get taken to is a Google property, then I am looking at a page hosted by Google. Yes, it might be a copy of a document that started life somewhere else, but if Google are serving something from their cache, they are hosting it.

This is one of the reasons why AMP feels like such a bait’n’switch to me. When it first came along, it felt like a direct competitor to Facebook’s Instant Articles and Apple News. But the big difference, we were told, was that you get to host your own content. That appealed to me much more than having Facebook or Apple host the articles. But now it turns out that Google do host the articles.

This will be the point at which Googlers will say no, no, no, you can totally host your own AMP pages …but you won’t get the benefits of pre-rendering. But without the pre-rendering, what’s the point of even having AMP pages?

Well, there is one non-cache reason to use AMP and it’s a political reason. Beleaguered developers working for publishers of big bloated web pages have a hard time arguing with their boss when they’re told to add another crappy JavaScript tracking script or bloated library to their pages. But when they’re making AMP pages, they can easily refuse, pointing out that the AMP rules don’t allow it. Google plays the bad cop for us, and it’s a very valuable role. Sarah pointed this out on the panel we were on, and she was spot on.

Alright, but what about The Guardian? They’ve already got fast pages, but they still have to create separate AMP pages if they want to get the pre-rendering benefits when they show up in Google search results. Sorry, says Google, but it’s the only way we can trust that the pre-rendered page will be truly fast.

So here’s the impasse we’re at. Google have provided a list of best practices for making fast web pages, but the only way they can truly verify that a page is sticking to those best practices is by hosting their own copy, URLs be damned.

This was the crux of Paul’s argument when he was on the Shop Talk Show podcast (it’s a really good episode—I was genuinely reassured to hear that Paul is not gung-ho about drinking the AMP Kool Aid; he has genuine concerns about the potential downsides for the web).

Initially, I accepted this argument that Google just can’t trust the rest of the web. But the more I talked to people at AMP Conf—and I had some really, really good discussions with people away from the stage—the more I began to question it.

Here’s the thing: the regular Google search can’t guarantee that any web page is actually 100% the right result to return for a search. Instead there’s a lot of fuzziness involved: based on the content, the markup, and the number of trusted sources linking to this, it looks like it should be a good result. In other words, Google search trusts websites to—by and large—do the right thing. Sometimes websites abuse that trust and try to game the system with sneaky tricks. Google responds with penalties when that happens.

Why can’t it be the same for AMP pages? Let me host my own AMP pages (maybe even host my own AMP script) and then when the Googlebot crawls those pages—the same as it crawls any other pages—that’s when it can verify that the AMP page is abiding by the rules. If I do something sneaky and trick Google into flagging a page as fast when it actually isn’t, then take my pre-rendering reward away from me.

To be fair, Google has very, very strict rules about what and how to pre-render the AMP results it’s caching. I can see how allowing even the potential for a false positive would have a negative impact on the user experience of Google search. But c’mon, there are already false positives in regular search results—fake news, spam blogs. Googlers are smart people. They can solve—or at least mitigate—these problems.

Google says it can’t trust our self-hosted AMP pages enough to pre-render them. But they ask for a lot of trust from us. We’re supposed to trust Google to cache and host copies of our pages. We’re supposed to trust Google to provide some mechanism to users to get at the original canonical URL. I’d like to see trust work both ways.

Making Resilient Web Design work offline

I’ve written before about taking an online book offline, documenting the process behind the web version of HTML5 For Web Designers. A book is quite a static thing so it’s safe to take a fairly aggressive offline-first approach. In fact, a static unchanging book is one of the few situations that AppCache works for. Of course a service worker is better, but until AppCache is removed from browsers (and until service worker is supported across the board), I’m using both. I wouldn’t recommend that for most sites though—for most sites, use a service worker to enhance it, and avoid AppCache like the plague.

For Resilient Web Design, I took a similar approach to HTML5 For Web Designers but I knew that there was a good chance that some of the content would be getting tweaked at least for a while. So while the approach is still cache-first, I decided to keep the cache fairly fresh.

Here’s my service worker. It starts with the usual stuff: when the service worker is installed, there’s a list of static assets to cache. In this case, that list is literally everything; all the HTML, CSS, JavaScript, and images for the whole site. Again, this is a pattern that works well for a book, but wouldn’t be right for other kinds of websites.

The real heavy lifting happens with the fetch event. This is where the logic sits for what the service worker should do every time there’s a request for a resource. I’ve documented the logic with comments:

// Look in the cache first, fall back to the network
  // CACHE
  // Did we find the file in the cache?
      // If so, fetch a fresh copy from the network in the background
      // NETWORK
          // Stash the fresh copy in the cache
  // NETWORK
  // If the file wasn't in the cache, make a network request
      // Stash a fresh copy in the cache in the background
  // OFFLINE
  // If the request is for an image, show an offline placeholder
  // If the request is for a page, show an offline message

So my order of preference is:

  1. Try the cache first,
  2. Try the network second,
  3. Fall back to a placeholder as a last resort.
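
Here’s a minimal sketch of how that last resort might look. The SVG markup, the '/offline' URL, and the names offlineFallback and storage are all assumptions made for illustration; storage defaults to the global caches object and is only a parameter so the function can be tried on its own.

```javascript
// Sketch only: the SVG markup and the '/offline' URL are placeholders.
// `storage` is a hypothetical parameter that defaults to the global `caches`.
function offlineFallback(request, storage = caches) {
  const accept = request.headers.get('Accept') || '';
  // If the request is for a page, show the pre-cached offline message.
  // This check comes first because browsers also list image types
  // in the Accept header they send for pages.
  if (accept.includes('text/html')) {
    return storage.match('/offline');
  }
  // If the request is for an image, show an offline placeholder
  if (accept.includes('image')) {
    const svg = '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 400 300" role="img" aria-label="Offline">' +
      '<text x="200" y="150" text-anchor="middle">Offline</text></svg>';
    return Promise.resolve(new Response(svg, {
      headers: { 'Content-Type': 'image/svg+xml', 'Cache-Control': 'no-store' }
    }));
  }
  // Nothing sensible to offer for other kinds of files
  return Promise.resolve(undefined);
}
```

It would be used as the final catch on the network request, along the lines of fetch(request).catch( () => offlineFallback(request) ).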

Leaving aside that third part, regardless of whether the response is served straight from the cache or from the network, the cache gets a top-up. If the response is being served from the cache, there’s an additional network request made to get a fresh copy of the resource that was just served. This means that the user might be seeing a slightly stale version of a file, but they’ll get the fresher version next time round.

Again, I think this is acceptable for a book where the tweaks and changes should be fairly minor, but I definitely wouldn’t want to do it on a more dynamic site where the freshness matters more.

Here’s what it usually looks like when a file is served up from the cache:

caches.match(request)
  .then( responseFromCache => {
  // Did we find the file in the cache?
  if (responseFromCache) {
      return responseFromCache;
  }

I’ve introduced an extra step where the fresher version is fetched from the network. This is where the code can look a bit confusing: the network request is happening in the background after the cached file has already been returned, but the code appears before the return statement:

caches.match(request)
  .then( responseFromCache => {
  // Did we find the file in the cache?
  if (responseFromCache) {
      // If so, fetch a fresh copy from the network in the background
      event.waitUntil(
          // NETWORK
          fetch(request)
          .then( responseFromFetch => {
              // Stash the fresh copy in the cache
              // Return the promises so that waitUntil waits for the write
              return caches.open(staticCacheName)
              .then( cache => {
                  return cache.put(request, responseFromFetch);
              });
          })
      );
      return responseFromCache;
  }

It’s asynchronous, see? So even though all that network code appears before the return statement, it’s pretty much guaranteed to complete after the cache response has been returned. You can verify this by putting in some console.log statements:

caches.match(request)
.then( responseFromCache => {
  if (responseFromCache) {
      event.waitUntil(
          fetch(request)
          .then( responseFromFetch => {
              console.log('Got a response from the network.');
              return caches.open(staticCacheName)
              .then( cache => {
                  return cache.put(request, responseFromFetch);
              });
          })
      );
      console.log('Got a response from the cache.');
      return responseFromCache;
  }

Those log statements will appear in this order:

Got a response from the cache.
Got a response from the network.

That’s the opposite of the order in which they appear in the code. Everything inside the event.waitUntil part is asynchronous.

Here’s the catch: this kind of asynchronous waitUntil hasn’t landed in all the browsers yet. In browsers without it, the code I’ve written will fail.

But never fear! Jake has written a polyfill. All I need to do is include that at the start of my serviceworker.js file and I’m good to go:

// Import Jake's polyfill for async waitUntil
importScripts('/js/async-waituntil.js');

I’m also using it when a file isn’t found in the cache, and is returned from the network instead. Here’s what the usual network code looks like:

fetch(request)
  .then( responseFromFetch => {
    return responseFromFetch;
  })

I want to also store that response in the cache, but I want to do it asynchronously—I don’t care how long it takes to put the file in the cache as long as the user gets the response straight away.

Technically, I’m not putting the response in the cache; I’m putting a copy of the response in the cache (it’s a stream, so I need to clone it if I want to do more than one thing with it).

fetch(request)
  .then( responseFromFetch => {
    // Stash a fresh copy in the cache in the background
    let responseCopy = responseFromFetch.clone();
    event.waitUntil(
      caches.open(staticCacheName)
      .then( cache => {
          return cache.put(request, responseCopy);
      })
    );
    return responseFromFetch;
  })

That all seems to be working well in browsers that support service workers. For legacy browsers, like Mobile Safari, there’s the much blunter caveman logic of an AppCache manifest.

Here’s the JavaScript that decides whether a browser gets the service worker or the AppCache:

if ('serviceWorker' in navigator) {
  // If service workers are supported
  navigator.serviceWorker.register('/serviceworker.js');
} else if ('applicationCache' in window) {
  // Otherwise inject an iframe to use appcache
  var iframe = document.createElement('iframe');
  iframe.setAttribute('src', '/appcache.html');
  iframe.setAttribute('style', 'width: 0; height: 0; border: 0');
  document.querySelector('footer').appendChild(iframe);
}

Either way, people are making full use of the offline nature of the book and that makes me very happy indeed.

Taking an online book offline

Application Cache is—as Jake so infamously described—not a good API. It was specced and shipped before developers had a chance to figure out what they really needed, and so AppCache turned out to be frustrating at best and downright dangerous in some situations. Its over-zealous caching combined with its byzantine cache invalidation ensured it was never going to become a mainstream technology.

There are very few use-cases for AppCache, but I think I hit upon one of them. Six years ago, A Book Apart published HTML5 For Web Designers. A year and a half later, I put the book online. The contents are never going to change. There’s a second edition of the book out now but if you want to read all the extra bits that Rachel added, you’re going to have to buy the book. The website for the original book is static and unchanging. That’s what made it such a good candidate for using AppCache. I could just set it and forget.

Except that’s no longer true. AppCache is being deprecated and browsers are starting to withdraw support. Chrome is already making sure that AppCache—like geolocation—no longer works on sites that aren’t served over HTTPS. That’s for the best. In retrospect, those APIs should never have been allowed over unsecured HTTP.

I mentioned that I spent the weekend switching all my book websites over to HTTPS, so AppCache should continue to work …for now. It’s only a matter of time before AppCache is removed completely from many of the browsers that currently support it.

Seeing as I’ve got the HTML5 For Web Designers site running on HTTPS now, I might as well go all out and make it a progressive web app. By far the biggest barrier to making a progressive web app is that first step of setting up HTTPS. It’s gotten cheaper, thanks to Let’s Encrypt, but it still involves mucking around in the command line with root access; I never wanted to become a sysadmin. But once that’s finally all set up, the other technological building blocks (a Service Worker and a manifest file) are relatively easy.

In this case, the Service Worker is using a straightforward bit of logic:

  • On installation, cache absolutely everything: HTML, CSS, images.
  • When anything is requested, grab it from the cache.
  • If it isn’t in the cache, try the network.
  • If the network doesn’t work, show an offline page (or image).
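
Those four steps could be sketched like this. It’s a rough illustration under stated assumptions, not the code running on the site: the cache name, the list of files, the '/offline' URL, and the function names are all placeholders, and the image fallback is left out to keep it short.

```javascript
// Sketch only: the cache name and the file list are placeholders.
const bookCacheName = 'book';
const bookFiles = [
  '/',
  '/offline',
  '/path/to/stylesheet.css',
  '/path/to/someimage.png'
];

// On installation, cache absolutely everything.
// `storage` is a hypothetical parameter that defaults to the global `caches`.
function cacheEverything(storage = caches) {
  return storage.open(bookCacheName)
    .then( cache => cache.addAll(bookFiles) );
}

// Try the cache, then the network, then the offline page.
// `scope` stands in for the service worker globals (fetch and caches).
function respondFromCacheFirst(request, scope = self) {
  return scope.caches.match(request)
    .then( responseFromCache => {
      return responseFromCache || scope.fetch(request);
    })
    .catch( () => scope.caches.match('/offline') );
}

// Only wire up the listeners when running inside a service worker
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  self.addEventListener('install', event => {
    event.waitUntil(cacheEverything());
  });
  self.addEventListener('fetch', event => {
    event.respondWith(respondFromCacheFirst(event.request));
  });
}
```

Because nothing ever refreshes the cache after installation, a strategy like this only suits content that really is never going to change.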

Basically I’m reproducing AppCache’s overzealous approach. It works for this site because the content is never going to change. I hope that this time, I really can just set it and forget it. I want the site to be an historical artefact, available at the same URL for at least my lifetime. I don’t want to have to maintain it or revisit it every few years to swap out one API for another.

Which brings me back to the way AppCache is being deprecated…

The Firefox team are very eager to ditch AppCache as soon as possible. On the one hand, that’s commendable. They’re rightly proud of shipping Service Workers and they want to encourage people to use the better technology instead. But it sure stings for the suckers (like me) who actually went and built stuff using AppCache.

In a weird way, I think this rush to deprecate AppCache might actually hurt the adoption of Service Workers. Let me explain…

At last year’s Edge Conference, Nolan Lawson gave a great presentation on storing data in the browser. He enumerated the many ways—past and present—that we could store data locally: WebSQL, Local Storage, IndexedDB …the list goes on. He also posed the question: why aren’t more people using insert-name-of-latest-API-here? To me it seemed obvious why more people weren’t diving into using the latest and greatest option for local data storage. It was because they had been burned before. The developers who rushed into trying previous solutions end up being mocked for their choice. “Still using that ol’ thing? Pffftt!”

You can see that same attitude on display from Mozilla as they push towards removing AppCache. Like in a comment that refers to developers using AppCache in production as “the angry hordes”. Reminds me of something Tom said:

In that same Mozilla thread, Soledad echoes Tom’s point:

As a member of the devrel team: I think that this should be better addressed in a blog post that someone from the team responsible for switching AppCache off should write, so everyone can understand the reasons and ask questions to those people.

I’d rather warn people beforehand, pointing them to that post and help them with migration paths than apply emergency mitigation strategies when a lot of people find their stuff stopped working in the newer Firefox…

Bravo! That same approach should have also been taken by the Chrome team when it came to their thread about punishing display:browser in manifest files. There was absolutely no communication with developers about this major decision. I only found out about it because Paul happened to mention it to me.

I was genuinely shocked by this:

Withholding the “add to home screen” prompt like that has a whiff of blackmail about it.

I can confirm that smell. When I was making the manifest file for HTML5 For Web Designers, I really wanted to put display: browser because I want people to be able to copy and paste URLs (for the book, for individual chapters, and for sections within chapters). But knowing that if I did that, Android users would never see the “add to home screen” prompt made me question that decision. I felt strong-armed into declaring display: standalone. And no, I’m not mollified by hand-waving reassurances that the Chrome team will figure out some solution for this. Figure out the solution first, then punish the saps like me who want to use display: browser to allow people to share URLs.

Anyway, the website for HTML5 For Web Designers is now using AppCache and Service Workers. The AppCache part will probably be needed for quite a while yet to provide offline support on iOS. Apple are really dragging their heels on Service Worker support, with at least one WebKit engineer actively looking for reasons not to implement it.

There’s a lot of talk about making apps work offline, but I think it’s just as important that we consider making information work offline. Books are a great example of this. To use the tired transport tropes, the website for a book is something you might genuinely want to access when you’re on a plane, or in the underground, or out at sea.

I really, really like progressive web apps. But I also think it’s important that we don’t fall into the trap of just trying to imitate native apps on the web. I love the idea of taking the best of the web—like information being permanently available at a URL—and marrying that up with the best of native—like offline access. I also like the idea of taking the best of books—a tome of thought—and marrying it up with the best of the web—hypertext.

I’d love to see more experimentation around online/offline hypertext/books. For now, you can visit HTML5 For Web Designers, add it to your home screen, and revisit it whenever and wherever you like.

My first Service Worker

I’ve made no secret of the fact that I’m really excited about Service Workers. I’m not alone. At the Coldfront conference in Copenhagen, pretty much every talk mentioned Service Workers.

Obviously I’m excited about what Service Workers enable: offline caching, background processes, push notifications, and all sorts of other goodies that allow the web to compete with native. But more than that, I’m really excited about the way that the Service Worker spec has been designed. Instead of being an all-or-nothing technology that you have to bet the farm on, it has been deliberately crafted to be used as an enhancement on top of existing sites (oh, how I wish that web components would follow a similar path).

I’ve got plenty of ideas on how Service Workers could be used to enhance a community site like The Session or the kind of events sites that we produce at Clearleft, but to begin with, I figured it would make sense to use my own personal site as a playground.

To start with, I’ve already conquered the first hurdle: serving my site over HTTPS. Service Workers require a secure connection. But you can play around with running a Service Worker locally if you run a copy of your site on localhost.

That’s how I started experimenting with Service Workers: serving on localhost, and stopping and starting my local Apache server with apachectl stop and apachectl start on the command line.

That reminds me of another interesting use case for Service Workers: it’s not just about the user’s network connection failing (say, going into a train tunnel); it’s also about your web server not always being available. Both scenarios are covered equally.

I would never have even attempted to start if it weren’t for the existing examples from people who have been generous enough to share their work:

Also, I knew that Jake was coming to FF Conf so if I got stumped, I could pester him. That’s exactly what ended up happening (thanks, Jake!).

So if you decide to play around with Service Workers, please, please share your experience.

It’s entirely up to you how you use Service Workers. I figured for a personal site like this, it would be nice to:

  1. Explicitly cache resources like CSS, JavaScript, and some images.
  2. Cache the homepage so it can be displayed even when the network connection fails.
  3. For other pages, have a fallback “offline” page to display when the network connection fails.

So now I’ve got a Service Worker up and running on adactio.com. It will only work in Chrome, Android, Opera, and the forthcoming version of Firefox …and that’s just fine. It’s an enhancement. As more and more browsers start supporting it, this Service Worker will become more and more useful.

How very future friendly!

The code

If you’re interested in the nitty-gritty of what my Service Worker is doing, read on. If, on the other hand, code is not your bag, now would be a good time to bow out.

If you want to jump straight to the finished code, here’s a gist. Feel free to take it, break it, copy it, improve it, or do anything else you want with it.

To start with, let’s establish exactly what a Service Worker is. I like this definition by Matt Gaunt:

A service worker is a script that is run by your browser in the background, separate from a web page, opening the door to features which don’t need a web page or user interaction.

register

From inside my site’s global JavaScript file (or I could do this from a script element inside my pages), I’m going to do a quick bit of feature detection for Service Workers. If the browser supports it, then I’m going to register my Service Worker by pointing to another JavaScript file, which sits at the root of my site:

if (navigator.serviceWorker) {
  navigator.serviceWorker.register('/serviceworker.js', {
    scope: '/'
  });
}

The serviceworker.js file sits in the root of my site so that it can act on any requests to my domain. If I put it somewhere like /js/serviceworker.js, then it would only be able to act on requests to the /js directory.
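
To make that concrete, here’s a small sketch of how the default scope follows from where the script lives. The helper names defaultScopeFor and isControlled, and the example.com URLs, are made up for illustration; the browser works this out for itself, so there’s no need to run anything like this on a real site.

```javascript
// Sketch only: `defaultScopeFor` and `isControlled` are hypothetical helpers.
function defaultScopeFor(scriptURL) {
  // The default scope is the directory that contains the script
  return new URL('./', scriptURL).href;
}

function isControlled(pageURL, scriptURL) {
  // A page can be controlled if its URL starts with the scope
  return pageURL.startsWith(defaultScopeFor(scriptURL));
}

defaultScopeFor('https://example.com/serviceworker.js');    // 'https://example.com/'
defaultScopeFor('https://example.com/js/serviceworker.js'); // 'https://example.com/js/'
```

The scope option passed to register can narrow that default, which is why '/' only works as a scope when the script itself sits at the root.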

Once that file has been loaded, the installation of the Service Worker can begin. That means the script will be installed in the user’s browser …and it will live there even after the user has left my website.

install

I’m making the installation of the Service Worker dependent on a function called updateStaticCache that will populate a cache with the files I want to store:

self.addEventListener('install', function (event) {
  event.waitUntil(updateStaticCache());
});

That updateStaticCache function will be used for storing items in a cache. I’m going to make sure that the cache has a version number in its name, exactly as described in the Guardian’s use case. That way, when I want to update the cache, I only need to update the version number.

var staticCacheName = 'static';
var version = 'v1::';

Here’s the updateStaticCache function that puts the items I want into the cache. I’m storing my JavaScript, my CSS, some images referenced in the CSS, the home page of my site, and a page for displaying when offline.

function updateStaticCache() {
  return caches.open(version + staticCacheName)
    .then(function (cache) {
      return cache.addAll([
        '/path/to/javascript.js',
        '/path/to/stylesheet.css',
        '/path/to/someimage.png',
        '/path/to/someotherimage.png',
        '/',
        '/offline'
      ]);
    });
}

Because those items are part of the return statement for the Promise created by caches.open, the Service Worker won’t install until all of those items are in the cache. So you might want to keep them to a minimum.

You can still put other items in the cache, and not make them part of the return statement. That way, they’ll get added to the cache in their own good time, and the installation of the Service Worker won’t be delayed:

function updateStaticCache() {
  return caches.open(version + staticCacheName)
    .then(function (cache) {
      cache.addAll([
        '/path/to/somefile',
        '/path/to/someotherfile'
      ]);
      return cache.addAll([
        '/path/to/javascript.js',
        '/path/to/stylesheet.css',
        '/path/to/someimage.png',
        '/path/to/someotherimage.png',
        '/',
        '/offline'
      ]);
    });
}

Another option is to use completely different caches, but I’ve decided to just use one cache for now.

activate

When the activate event fires, it’s a good opportunity to clean up any caches that are out of date (by looking for anything that doesn’t match the current version number). I copied this straight from Nicolas’s code:

self.addEventListener('activate', function (event) {
  event.waitUntil(
    caches.keys()
      .then(function (keys) {
        return Promise.all(keys
          .filter(function (key) {
            return key.indexOf(version) !== 0;
          })
          .map(function (key) {
            return caches.delete(key);
          })
        );
      })
  );
});

fetch

The fetch event is fired every time the browser is going to request a file from my site. The magic of Service Worker is that I can intercept that request before it happens and decide what to do with it:

self.addEventListener('fetch', function (event) {
  var request = event.request;
  ...
});

POST requests

For a start, I’m going to just back off from any requests that aren’t GET requests:

if (request.method !== 'GET') {
  event.respondWith(
    fetch(request)
  );
  return;
}

That’s basically just replicating what the browser would do anyway. But even here I could decide to fall back to my offline page if the request doesn’t succeed. I do that using a catch clause appended to the fetch statement:

if (request.method !== 'GET') {
  event.respondWith(
    fetch(request)
      .catch(function () {
        return caches.match('/offline');
      })
  );
  return;
}

HTML requests

I’m going to treat requests for pages differently to requests for files. If the browser is requesting a page, then here’s the order I want:

  1. Try fetching the page from the network first.
  2. If that doesn’t work, try looking for the page in the cache.
  3. If all else fails, show the offline page.

First of all, I need to test to see if the request is for an HTML document. I’m doing this by sniffing the Accept headers, which probably isn’t the safest method:

if (request.headers.get('Accept').indexOf('text/html') !== -1) {
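
Since sniffing the Accept header isn’t bulletproof, here is a minimal sketch of a slightly more defensive version of that test. The helper name isHtmlRequest is made up for illustration and isn’t part of the original code; the only thing it adds is a guard for a request that arrives without an Accept header at all:

```javascript
// Hypothetical helper (not in the original Service Worker script).
// Returns true when the request's Accept header mentions text/html.
function isHtmlRequest(request) {
  // headers.get() returns null when the header is missing,
  // so fall back to an empty string before searching it.
  var accept = request.headers.get('Accept') || '';
  return accept.indexOf('text/html') !== -1;
}
```

With something like that in place, the test would read if (isHtmlRequest(request)) { and a missing header would simply be treated as a non-HTML request.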

Now I try to fetch the page from the network:

event.respondWith(
  fetch(request)
);

If the network is working fine, this will return the response from the site and I’ll pass that along.

But if that doesn’t work, I’m going to look for a match in the cache. Time for a catch clause:

.catch(function () {
  return caches.match(request);
})

So now the whole event.respondWith statement looks like this:

event.respondWith(
  fetch(request)
    .catch(function () {
      return caches.match(request);
    })
);

Finally, I need to take care of the situation when the page can’t be fetched from the network and it can’t be found in the cache.

Now, I first tried to do this by adding a catch clause to the caches.match statement, like this:

return caches.match(request)
  .catch(function () {
    return caches.match('/offline');
  })

That didn’t work and for the life of me, I couldn’t figure out why. Then Jake set me straight. It turns out that the Promise returned by caches.match always resolves …even when nothing matches, in which case it resolves with undefined. So the catch clause will never be triggered. Instead I need to return the offline page if the response from the cache is falsy:

return caches.match(request)
  .then(function (response) {
    return response || caches.match('/offline');
  })

With that cleared up, my code for handling HTML requests looks like this:

event.respondWith(
  fetch(request, { credentials: 'include' })
    .catch(function () {
      return caches.match(request)
        .then(function (response) {
          return response || caches.match('/offline');
        })
    })
);

Actually, there’s one more thing I’m doing with HTML requests. If the network request succeeds, I stash the response in the cache.

Well, that’s not exactly true. I stash a copy of the response in the cache. That’s because you’re only allowed to read the body of a response once. So if I want to put it in the cache and still hand it back to the browser, I have to clone it:

var copy = response.clone();
caches.open(version + staticCacheName)
  .then(function (cache) {
    cache.put(request, copy);
  });

I do that right before returning the actual response. Here’s how it fits together:

if (request.headers.get('Accept').indexOf('text/html') !== -1) {
  event.respondWith(
    fetch(request, { credentials: 'include' })
      .then(function (response) {
        var copy = response.clone();
        caches.open(version + staticCacheName)
          .then(function (cache) {
            cache.put(request, copy);
          });
        return response;
      })
      .catch(function () {
        return caches.match(request)
          .then(function (response) {
            return response || caches.match('/offline');
          })
      })
  );
  return;
}

Okay. So that’s requests for pages taken care of.

File requests

I want to handle requests for files differently to requests for pages. Here’s my list of priorities:

  1. Look for the file in the cache first.
  2. If that doesn’t work, make a network request.
  3. If all else fails, and it’s a request for an image, show a placeholder.

Step one: try getting the file from the cache:

event.respondWith(
  caches.match(request)
);

Step two: if that didn’t work, go out to the network. Now remember, I can’t use a catch clause here, because caches.match will always return something: either a response or undefined. So here’s what I do:

event.respondWith(
  caches.match(request)
    .then(function (response) {
      return response || fetch(request);
    })
);

Now that I’m back to dealing with a fetch statement, I can use a catch clause to take care of the third and final step: if the network request doesn’t succeed, check to see if the request was for an image, and if so, display a placeholder:

.catch(function () {
  if (request.headers.get('Accept').indexOf('image') !== -1) {
    return new Response('<svg>...</svg>',  { headers: { 'Content-Type': 'image/svg+xml' }});
  }
})

I could point to a placeholder image in the cache, but I’ve decided to send an SVG on the fly using a new Response object.

Here’s how the whole thing looks:

event.respondWith(
  caches.match(request)
    .then(function (response) {
      return response || fetch(request)
        .catch(function () {
          if (request.headers.get('Accept').indexOf('image') !== -1) {
            return new Response('<svg>...</svg>', { headers: { 'Content-Type': 'image/svg+xml' }});
          }
        })
    })
);

The overall shape of my code to handle fetch events now looks like this:

self.addEventListener('fetch', function (event) {
  var request = event.request;
  // Non-GET requests
  if (request.method !== 'GET') {
    event.respondWith(
      ... 
    );
    return;
  }
  // HTML requests
  if (request.headers.get('Accept').indexOf('text/html') !== -1) {
    event.respondWith(
      ...
    );
    return;
  }
  // Non-HTML requests
  event.respondWith(
    ...
  );
});

Feel free to peruse the code.

Next steps

The code I’m running now is fine for a first stab, but there’s room for improvement.

Right now I’m stashing any HTML pages the user visits into the cache. I don’t think that will get out of control—I imagine most people only ever visit just a handful of pages on my site. But there’s the chance that the cache could get quite bloated. Ideally I’d have some way of keeping the cache nice and lean.

I was thinking: maybe I should have a separate cache for HTML pages, and limit the number in that cache to, say, 20 or 30 items. Every time I push something new into that cache, I could pop the oldest item out.

I could imagine doing something similar for images: keeping a cache of just the most recent 10 or 20.
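
As a starting point, here is a rough sketch of what that trimming could look like. It is only an illustration under a couple of assumptions: trimCache and maxItems are made-up names, the function takes an already-opened cache object, and it relies on cache.keys() handing back the oldest entries first:

```javascript
// Sketch only: trimCache is a hypothetical helper, not part of the original code.
// It assumes cache.keys() resolves with the oldest entries first.
function trimCache(cache, maxItems) {
  return cache.keys()
    .then(function (keys) {
      if (keys.length <= maxItems) {
        // Small enough already, so there is nothing to delete.
        return;
      }
      // Delete the oldest entry, then check again until the cache fits.
      return cache.delete(keys[0])
        .then(function () {
          return trimCache(cache, maxItems);
        });
    });
}
```

One way to use it would be to open the cache for pages with caches.open, and call trimCache(cache, 20) after each cache.put. Deleting one item per pass keeps the logic simple, at the cost of calling cache.keys() again each time round.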

If you fancy having a go at coding that up, let me know.

Lessons learned

There were a few gotchas along the way. I already mentioned the fact that caches.match will always return something so you can’t use catch clauses to handle situations where a file isn’t found in the cache.

Something else worth noting is that this:

fetch(request);

…is functionally equivalent to this:

fetch(request)
  .then(function (response) {
    return response;
  });

That’s probably obvious but it took me a while to realise. Likewise:

caches.match(request);

…is the same as:

caches.match(request)
  .then(function (response) {
    return response;
  });

Here’s another thing… you’ll notice that sometimes I’ve used:

fetch(request);

…but sometimes I’ve used:

fetch(request, { credentials: 'include' } );

That’s because, by default, a fetch request doesn’t include cookies. That’s fine if the request is for a static file, but if it’s for a potentially-dynamic HTML page, you probably want to make sure that the Service Worker request is no different from a regular browser request. You can do that by passing through that second (optional) argument.

But probably the trickiest thing is getting your head around the idea of Promises. Writing JavaScript is generally a fairly procedural affair, but once you start dealing with then clauses, you have to come to grips with the fact that the code inside those clauses runs asynchronously. So statements written after a then clause will execute before the code inside the clause. It’s kind of hard to explain, but if you find problems with your Service Worker code, check to see if that’s the cause.
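
A tiny standalone sketch may make that ordering easier to see. Nothing here comes from the Service Worker script itself; the order array and the 'response' string are stand-ins used purely to record what runs when:

```javascript
// Illustration only: record the order in which the statements run.
var order = [];

var done = Promise.resolve('response')
  .then(function (value) {
    // This runs later, once the current block of code has finished.
    order.push('inside then');
    return value;
  });

// This statement is written after the then clause, but it runs first.
order.push('after then');
```

Once the done Promise has settled, order holds 'after then' followed by 'inside then', even though the lines appear the other way round in the source.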

And remember, please share your code and your gotchas: it’s early days for Service Workers so every implementation counts.

Updates

I got some very useful feedback from Jake after I published this…

Expires headers

By default, JavaScript files on my server are cached for a month. But a Service Worker script probably shouldn’t be cached at all (or cached for a very, very short time). I’ve updated my .htaccess rules accordingly:

<FilesMatch "serviceworker.js">
  ExpiresDefault "now"
</FilesMatch>

Credentials

If a request is initiated by the browser, I don’t need to say:

fetch(request, { credentials: 'include' } );

It’s enough to just say:

fetch(request);

Scope

I set the scope parameter of my Service Worker to be “/” …but because the Service Worker is sitting in the root directory anyway, I don’t really need to do that. I could just register it with:

if (navigator.serviceWorker) {
  navigator.serviceWorker.register('/serviceworker.js');
}

If, on the other hand, the Service Worker file were sitting in a folder, but I wanted it to act on the whole site, then I would need to specify the scope:

if (navigator.serviceWorker) {
  navigator.serviceWorker.register('/path/to/serviceworker.js', {
    scope: '/'
  });
}

…and I’d also need to send a special header (Service-Worker-Allowed) along with the script. So it’s probably easiest to just put Service Worker scripts in the root directory.