Journal tags: wiki


Play me off

One of the fun fringe events at Build in Belfast was The Standardistas’ Open Book Exam:

Unlike the typical quiz, the Open Book Exam demands the use of iPhones, iPads, Androids—even Zunes—to avail of the internet’s wealth of knowledge, required to answer many of the formidable questions.

Team Clearleft came joint third. Initially it was joint fourth but an obstreperous Andy Budd challenged the scoring.

Now one of the principles of this unusual pub quiz was that cheating was encouraged. Hence the encouragement to use internet-enabled devices to get to Google and Wikipedia as quickly as the network would allow. In that spirit, Andy suggested a strategy of “running interference.”

So while others on the team were taking information from the web, I created a Wikipedia account to add misinformation to the web.

Again, let me stress, this was entirely Andy’s idea.

The town of Clover, South Carolina ceased being twinned with Larne and became twinned with Belfast instead.

The world’s largest roller coaster became 465 feet tall instead of its previous 456 feet (requiring a corresponding change to a list page).

But the moment I changed the entry for Keyboard Cat to alter its real name from “Fatso” to “Freddy” …BAM! Instant revert.

You can mess with geography. You can mess with measurements. But you do. Not. Mess. With. Keyboard Cat.

For some good clean Wikipedia fun, you can always try wiki racing:

To Wikirace, first select a page off the top of your head. Using “Random page” works well, as well as the featured article of the day. This will be your beginning page. Next choose a destination page. Generally, this destination page is something very unrelated to the beginning page. For example, going from apple to orange would not be challenging, as you would simply start at the apple page, click a wikilink to fruit and then proceed to orange. A race from Jesus Christ to Subway (restaurant) would be more of a challenge, however. For a true test of skill, attempt Roman Colosseum to Orthographic projection.

Then there’s the simple pleasure of getting to Philosophy:

Some Wikipedia readers have observed that clicking on the first link in the main text of a Wikipedia article, and then repeating the process for subsequent articles, usually eventually gets you to the Philosophy article.

Seriously. Try it.
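If you’d rather let a script do the clicking, here’s a crude sketch using Python with the requests and BeautifulSoup libraries. It just follows the first article link in the body text of each page and skips the usual refinements (like ignoring links inside parentheses), so it won’t always behave exactly like a patient human player.

```python
import requests
from bs4 import BeautifulSoup

def first_link(title):
    """Return the first article linked from the body text of the given page."""
    html = requests.get("https://en.wikipedia.org/wiki/" + title).text
    content = BeautifulSoup(html, "html.parser").find(id="mw-content-text")
    for paragraph in content.find_all("p"):
        for anchor in paragraph.find_all("a", href=True):
            href = anchor["href"]
            # Only follow ordinary article links; skip files, footnotes and help pages.
            if href.startswith("/wiki/") and ":" not in href:
                return href[len("/wiki/"):]
    return None

title = "Keyboard_Cat"
for step in range(30):
    print(title.replace("_", " "))
    if title == "Philosophy":
        break
    title = first_link(title)
    if title is None:
        break
```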

Calling all scientists

So you like the idea of a Science Hack Day? Well, okay then. Let’s make it happen.

I’ve set up a wiki at sciencehackday.pbworks.com. You know the drill: if you see something that needs editing or adding to, go right ahead and do it.

There’s not much up there at the moment, so I’d appreciate anything you can add.

Don’t think that you have to be a coder to participate. There’s plenty of room for data visualisations, directories of awesomeness, educational tools and anything else you feel inspired to make.

There’s still the little matter of a venue to sort out but I’m sure we’ll come up with something. If you have any ideas… add ‘em to the wiki.

Let’s make this happen.

Using socially-authored content to provide new routes through existing content archives

Rob Lee is talking about making the most of user-authored (or user-generated) content. In other words, content written by you, Time’s person of the year.

Wikipedia is the poster child. It’s got lots of WWILFing: What Was I Looking For? (as illustrated by XKCD). Here’s a graph entitled Mapping the distraction that is Wikipedia, generated from a Greasemonkey script that tracks link paths.

Rob works for Rattle Research who were commissioned by the BBC Innovation Labs to do some research into bringing WWILFing to the BBC archive.

Grab the first ten internal links from any Wikipedia article and you will get ten terms that really define that subject matter. The external links at the end of an article provide interesting departure points. How could this be harnessed for BBC news articles? Categories are a bit flat. Semantic analysis is better but it takes a lot of time and resources to generate that for something as large as the BBC archives. Yahoo’s Term Extractor API is a handy shortcut. The terms extracted by the API can be related to pages on Wikipedia.
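To make that last step concrete: Yahoo’s Term Extractor API has since been retired, so the sketch below swaps in a deliberately naive extractor (it just pulls out capitalised phrases) and then checks each candidate term against Wikipedia using the standard MediaWiki query API. It’s an illustration of the idea, not what Rattle actually built.

```python
import re
import requests

def extract_terms(text):
    """Very naive term extraction: runs of one to three capitalised words."""
    return {term.strip() for term in re.findall(r"(?:[A-Z][a-z]+ ?){1,3}", text)}

def on_wikipedia(term):
    """Return True if Wikipedia has an article with (roughly) this title."""
    response = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "titles": term,
        "format": "json",
    }).json()
    # The query API uses the key "-1" for titles it can't find.
    return "-1" not in response["query"]["pages"]

article = "Sales of organic food have risen sharply in the United Kingdom, according to the Soil Association."
print([term for term in extract_terms(article) if on_wikipedia(term)])
```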

Look at this news story on organic food sales. The “see also” links point to related stories on organic food but don’t encourage WWILFing. The BBC is a bit of an ivory tower: it has lots of content that it can link to internally but it doesn’t spread out into the rest of the Web very well.

How do you decide what would be interesting terms to link off with? How do you define “interesting”? You could use Google page rank or Technorati buzz for the external pages to decide if they are considered “interesting”. But you still need contextual relevance. That’s where del.icio.us comes in. If extracted terms match well to tags for a URL, there’s a good chance it’s relevant (and del.icio.us also provides information on how many people have bookmarked a URL).
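The scoring part might look something like this sketch. I’ve left out the del.icio.us call itself (the service and its API are long gone); the tags and bookmark count in the example are made up purely for illustration.

```python
def relevance(extracted_terms, tags, bookmark_count):
    """Score a candidate link by how well its tags match the terms extracted
    from the article, weighted by how many people have bookmarked it."""
    if not tags:
        return 0
    terms = {term.lower() for term in extracted_terms}
    overlap = len(terms & {tag.lower() for tag in tags})
    return (overlap / len(tags)) * bookmark_count

# Made-up terms, tags and bookmark count, purely for illustration.
print(relevance(["organic", "food", "farming", "retail"],
                ["organic", "food", "environment"], 120))
```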

So that’s what they did. They called it “muddy boots” because it would create dirty footprints across the pristine content of the BBC.

The “muddy boots” links for the organic food article link off to articles on other news sites that are genuinely interesting for this subject matter.

Here’s another story, this one from last week about the dissection of a giant squid. In this case, the journalist has provided very good metadata. The result is that there’s some overlap between the “see also” links and the “muddy boots” links.

But there are problems. An article on Apple computing brings up a “muddy boots” link to an article on apples, the fruit. Disambiguation is hard. There are also performance problems if you are relying on an external API like del.icio.us’s. Also, try to make sure you recommend outside links that are written in the same language as the originating article.

Muddy boots was just one example of using some parts of the commons (Wikipedia and del.icio.us). There are plenty of others out there like Magnolia, for example.

But back to disambiguation, the big problem. Maybe the Semantic Web can help. Sources like Freebase and DBpedia add more semantic data to Wikipedia. They also pull in data from Geonames and MusicBrainz. DBpedia extracts the disambiguation data (for example, on the term “Apple”). Compare terms from disambiguation candidates to your extracted terms and see which page has the highest correlation.
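Here’s a rough sketch of that correlation step, assuming DBpedia’s public SPARQL endpoint and its dbo:wikiPageDisambiguates and dbo:abstract properties; those details are my assumptions, not something Rob specified.

```python
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def disambiguation_candidates(term):
    """Pages linked from the term's disambiguation page, with English abstracts."""
    query = """
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?page ?abstract WHERE {
          <http://dbpedia.org/resource/%s_(disambiguation)>
              dbo:wikiPageDisambiguates ?page .
          ?page dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }
    """ % term.replace(" ", "_")
    rows = requests.get(DBPEDIA_SPARQL, params={
        "query": query,
        "format": "application/sparql-results+json",
    }).json()["results"]["bindings"]
    return [(row["page"]["value"], row["abstract"]["value"]) for row in rows]

def best_match(term, extracted_terms):
    """Pick the candidate whose abstract shares the most words with the extracted terms."""
    words = {word for t in extracted_terms for word in t.lower().split()}
    scored = [(len(words & set(abstract.lower().split())), page)
              for page, abstract in disambiguation_candidates(term)]
    return max(scored)[1] if scored else None

# The terms here are made up for illustration; in the real pipeline they'd be
# the terms extracted from the article.
print(best_match("Apple", ["Steve Jobs", "Macintosh", "iPod", "computer"]))
```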

But why stop there? Why not allow routes back into our content? For example, having used DBpedia to determine that your article is about Apple, the computer company, you could add an hCard for the Apple company to that article.

If you’re worried about the accuracy of commons data, you can stop worrying. It looks like Wikipedia is more accurate than traditional encyclopedias. It has authority, a formal review process and other tools to promote accuracy. There are also third-party services that will mark revisions of Wikipedia articles as being particularly good and accurate.

There’s some great commons data out there. Use it.

Rob is done. That was a great talk and now there’s time for some questions.

Brian asks if they looked into tying in non-text content. In short, no. But that was mostly for time and cost reasons.

Another question, this one about the automation of the process. Is there still room for journalists to spend a few minutes on disambiguating stories? Yes, definitely.

Gavin asks about data as journalism. Rob says that this is particularly relevant for breaking news.

Ian’s got a question. Journalists don’t have much time to add metadata. What can be done to make it easier? Is it an interface issue? Rob says we can try to automate as much as possible to keep the time required to a minimum. But yes, building things into the BBC CMS would make a big difference.

Someone questions the wisdom of pushing people out to external sources. Doesn’t the BBC want to keep people on their site? In short, no. By providing good external references, people will keep coming back to you. The BBC understand this.

Taking back the Web

I’m at an event called Take Back The Web. It’s a cosy little unconference aimed at non-profits and activist groups.

There’s been plenty of education and discussion going on all day, mostly around things like blogs, wikis, RSS and podcasting. I followed up the RSS talk with a little spiel about APIs and how they can be used to pull in data from other places on the web.

I’m used to attending geekier events where everyone is fairly tech-savvy, but the crowd here is mostly made up of people on the ground who want to be able to use technology but who aren’t necessarily from a technological background. It really brought home to me just how far we have to go in making this stuff less geeky and scary-sounding.

Just about everyone gets blogs, and it’s pretty easy to get started with them. Wikis are a little bit trickier, but still attainable. RSS becomes harder again: it’s still too hard to subscribe, and even the term “subscribe” is itself misleading, implying payment. As for APIs, that’s still all pretty much rocket science so I just gave a basic overview of the benefits without really discussing the nitty-gritty of programming.

Notice how the terms change in complexity along that scale: from the word blog to the term API. We’re using way too many acronyms and too much technobabble for this stuff. Of course, we can’t change the names without upsetting the geeky programmers.

I got a lot of food for thought from the day so far, even though I already know about the technologies. It’s been fascinating to see how people are using the web now and also how much more they could be doing.

The guys from mySociety/They Work For You are talking through their services now and I’ve just found out about this nifty API. I’ll have a play around with that. I’ll quiz Matthew about it later; he’s staying over with me. More grist for the bedroll.