Simon Willison’s Weblog

Subscribe

25 items tagged “pdf”

2024

Compare PDFs. Inspired by this thread on Hacker News about the C++ diff-pdf tool I decided to see what it would take to produce a web-based PDF diff visualization tool using Claude 3.5 Sonnet.

It took two prompts:

Build a tool where I can drag and drop on two PDF files and it uses PDF.js to turn each of their pages into canvas elements and then displays those pages side by side with a third image that highlights any differences between them, if any differences exist

That give me a React app that didn't quite work, so I followed-up with this:

rewrite that code to not use React at all

Which gave me a working tool! You can see the full Claude transcript plus screenshots of the tool in action in this Gist.

Being able to knock out little custom interactive web tools like this in a couple of minutes is so much fun.

# 2nd July 2024, 7:54 pm / pdf, projects, llms, anthropic, claude

PDF to Podcast (via) At first glance this project by Stephan Fitzpatrick is a cute demo of a terrible sounding idea... but then I tried it out and the results are weirdly effective. You can listen to a fake podcast version of the transformers paper, or upload your own PDF (with your own OpenAI API key) to make your own.

It's open source (Apache 2) so I had a poke around in the code. It gets a lot done with a single 180 line Python script.

When I'm exploring code like this I always jump straight to the prompt - it's quite long, and starts like this:

Your task is to take the input text provided and turn it into an engaging, informative podcast dialogue. The input text may be messy or unstructured, as it could come from a variety of sources like PDFs or web pages. Don't worry about the formatting issues or any irrelevant information; your goal is to extract the key points and interesting facts that could be discussed in a podcast. [...]

So I grabbed a copy of it and pasted in my blog entry about WWDC, which produced this result when I ran it through Gemini Flash using llm-gemini:

cat prompt.txt | llm -m gemini-1.5-flash-latest

Then I piped the result through my ospeak CLI tool for running text-to-speech with the OpenAI TTS models (after truncating to 690 tokens with ttok because it turned out to be slightly too long for the API to handle):

llm logs --response | ttok -t 690 | ospeak -s -o wwdc-auto-podcast.mp3

And here's the result (3.9MB 3m14s MP3).

It's not as good as the PDF-to-Podcast version because Stephan has some really clever code that uses different TTS voices for each of the characters in the transcript, but it's still a surprisingly fun way of repurposing text from my blog. I enjoyed listening to it while I was cooking dinner.

# 13th June 2024, 1:03 am / pdf, podcasts, projects, text-to-speech, ai, openai, generative-ai, llms, gemini

Experimenting with local alt text generation in Firefox Nightly (via) The PDF editor in Firefox (confession: I did not know Firefox ships with a PDF editor) is getting an experimental feature that can help suggest alt text for images for the human editor to then adapt and improve on.

This is a great application of AI, made all the more interesting here because Firefox will run a local model on-device for this, using a custom trained model they describe as "our 182M parameters model using a Distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder".

The model uses WebAssembly with ONNX running in Transfomers.js, and will be downloaded the first time the feature is put to use.

# 2nd June 2024, 1:12 pm / firefox, javascript, mozilla, pdf, ai, webassembly, llms, transformers-js

Running OCR against PDFs and images directly in your browser

Visit Running OCR against PDFs and images directly in your browser

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?

[... 2,263 words]

unstructured. Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.

I got some good initial results against a PDF by running “pip install ’unstructured[pdf]’” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.

There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.

# 2nd February 2024, 2:47 am / ocr, pdf, python

Portable EPUBs. Will Crichton digs into the reasons people still prefer PDF over HTML as a format for sharing digital documents, concluding that the key issues are that HTML documents are not fully self-contained and may not be rendered consistently.

He proposes “Portable EPUBs” as the solution, defining a subset of the existing EPUB standard with some additional restrictions around avoiding loading extra assets over a network, sticking to a smaller (as-yet undefined) subset of HTML and encouraging interactive components to be built using self-contained Web Components.

Will also built his own lightweight EPUB reading system, called Bene—which is used to render this Portable EPUBs article. It provides a “download” link in the top right which produces the .epub file itself.

There’s a lot to like here. I’m constantly infuriated at the number of documents out there that are PDFs but really should be web pages (academic papers are a particularly bad example here), so I’m very excited by any initiatives that might help push things in the other direction.

# 25th January 2024, 8:32 pm / html, pdf, webcomponents

2023

textra (via) Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too!

# 23rd March 2023, 9:08 pm / audio, macosx, ocr, pdf

2022

Building a searchable archive for the San Francisco Microscopical Society

Visit Building a searchable archive for the San Francisco Microscopical Society

The San Francisco Microscopical Society was founded in 1870 by a group of scientists dedicated to advancing the field of microscopy.

[... 1,845 words]

s3-ocr: Extract text from PDF files stored in an S3 bucket

Visit s3-ocr: Extract text from PDF files stored in an S3 bucket

I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

[... 1,493 words]

2019

Automate the Boring Stuff with Python: Working with PDF and Word Documents. I stumbled across this while trying to extract some data from a PDF file (the kind of file with actual text in it as opposed to dodgy scanned images) and it worked perfectly: PyPDF2.PdfFileReader(open("file.pdf", "rb")).getPage(0).extractText()

# 6th November 2019, 4:17 pm / pdf, python

2017

arxiv-vanity (via) Beautiful new project from Ben Firshman and Andreas Jansson: “Arxiv Vanity renders academic papers from Arxiv as responsive web pages so you don’t have to squint at a PDF”. It works by pulling the raw LaTeX source code from Arxiv and rendering it to HTML using a heavily customized Pandoc workflow. The real fun is in the architecture: it’s a Django app running on Heroku which fires up on-demand Hyper.sh Docker containers for each individual rendering job.

# 25th October 2017, 8:06 pm / ben-firshman, django, pdf, science, docker

2010

pdf.js. A JavaScript library for creating simple PDF files. Works (flakily) in your browser using a data:URI hack, but is also compatible with server-side JavaScript implementations such as Node.js.

# 17th June 2010, 7:39 pm / datauri, javascript, node, nodejs, pdf, recovered

2009

node.js at JSConf.eu (PDF). node.js creator Ryan Dahl’s presentation at this year’s JSConf.eu. The principle philosophy is that I/O in web applications should be asynchronous—for everything. No blocking for database calls, no blocking for filesystem access. JavaScript is a mainstream programming language with a culture of callback APIs (thanks to the DOM) and is hence ideally suited to building asynchronous frameworks.

# 17th November 2009, 6:07 pm / asynchronous, eventio, javascript, node, pdf, ryan-dahl

Adobe is Bad for Open Government. The problem isn’t just that PDFs are a bad way of sharing data, it’s that Adobe have been actively lobbying the US government to use their PDF and Flash formats for open government initiatives.

# 1st November 2009, 12:51 pm / adobe, flash, opengovernment, pdf, sunlightfoundation

No PDFs! The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it in to something more useful. Even CSV is significantly more useful for many types of information.

# 1st November 2009, 12:04 pm / adobe, csv, opendata, opengovernment, pdf, sunlightfoundation

Prawn (via) Really nice PDF generation library for Ruby, used to generate Dopplr’s beautiful end of year reports.

# 16th January 2009, 4:04 pm / dopplr, pdf, prawn, ruby

Dopplr presents the Personal Annual Report 2008: freshly generated for you, and Barack Obama... So classy it hurts. I’d love to know what library they used to generate the PDF.

# 16th January 2009, 12:17 pm / barack-obama, dopplr, pdf

2008

Robust Defenses for Cross-Site Request Forgery [PDF]. Fascinating report which introduces the “login CSRF” attack, where an attacker uses CSRF to log a user in to a site (e.g. PayPal) using the attacker’s credentials, then waits for them to submit sensitive information or bind the account to their credit card. The paper also includes an in-depth study of potential protection measures, including research that shows that 3-11% of HTTP requests to a popular ad network have had their referer header stripped. Around 0.05%-0.10% of requests have custom HTTP headers such as X-Requested-By stripped.

# 24th September 2008, 9:40 am / csrf, http, logincsrf, paypal, pdf, phishing, security, xrequestedby

PDFMiner. Useful looking PDF parsing library in Python—can produce an XML representation of the text and style information in a PDF document.

# 3rd August 2008, 3:29 pm / pdf, pdfminer, python, screenscraping, xml

Scaling your website with the Perlbal web server (PDF) (via) Perlbal documentation is pretty thin on the ground; this is a really useful introduction from Frank Wiles.

# 17th June 2008, 10:39 pm / frank-wiles, load-balancing, pdf, perlbal

OSM Super-Strength Export. Awesome new feature on OpenStreetMap: you can browse to anywhere on the map, then hit “export” and download a rendered bitmap or vector (PDF and SVG) image of the currently displayed map—and because it’s OSM there’s no watermark and a very liberal usage license.

# 22nd April 2008, 9:56 am / mapping, maps, openstreetmap, pdf, svg, vector

2007

Restructured Text to Anything. Slick set of online tools for converting Restructured Text (one of the more mature wiki-style markup languages) to HTML or PDF. Includes a nice looking API. Powered by Django.

# 13th September 2007, 3:54 pm / django, html, pdf, python, restructuredtext

PDF Shrink. $35 OS X app that crunches down the size of PDF files—useful if you often embed photos in your presentations.

# 28th May 2007, 2:21 pm / osx, pdf, pdfshrink, presentations

The Adobe PDF XSS Vulnerability. If you host a PDF file anywhere on your site, you’re vulnerable to an XSS attack due to a bug in Acrobat Reader versions below 8. The fix is to serve PDFs as application/octet-stream to avoid them being displayed inline.

# 11th January 2007, 4:23 pm / adobe, pdf, security, vulnerability, xss

2002

PHP generated PDFs

R&OS PDF PHP classes (via tidak ada). This is the most useful PHP library I’ve seen in a long time. It allows dynamic generation of PDF files without needing any additional modules installed on the server (although GD is required if you want to add images to your PDFs). It is extremely easy to use and has an impressive set of features, including PDF drawing tools, built in page number support and excellent documentation. On the topic of PDFs, Yes You Can advocates their use for presentations and touches on a method of generating them using Python.