Notes on bookmarks from 1997

On August 30, 2014, I imported 264 bookmarks into Pinboard. The source was a file named "bookmark.htm" with a last modified date of October 12, 1997.

These bookmarks date between January 1995 and October 1997.

Upon import, Pinboard reported 163 (63%) of the URLs as being unavailable, with 403 Forbidden, 404 Not Found, 410 Gone, or 500 Server Error. Less than 2/3 link rot over ~17 years doesn't sound so bad.

However, despite reporting 200 on the rest, many URLs weren't the original content. As one example, "serve.com" was a web host named DataRealm, and is now an American Express prepaid card. As another, a VRML tutorial is now a video about birth control. Some of these 200s are only so because of repeated 3xx redirections to ultimately unrelated content, or because of domain name hoarders serving ads.

Of the 226:

That's 57%, which sounds even better than the original figure. But then I looked at those ninety-eight 200 OK URLs, too.

That's 205 failures, an actual link rot figure of 91%, not 57%.

That leaves only 21 URLs as 200 OK and containing effectively the same content.

In an attempt to confirm and/or recover as much of the original content as possible, I checked the Internet Archive's Wayback Machine for every URL.

That's 104 failures beaten back by the Internet Archive at some level of fidelity, reducing effective link rot over ~17 years to 45%.

In addition, 9 of the twenty-one 200 OK URLs had old enough copies in the Wayback Machine, which I selected simply to provide a more accurate representation of the content.
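The percentages above can be reproduced from the counts given in the text. This is only a rough sketch: the helper name `pct` and the variable names are made up for illustration, and the 128 "reported failures" figure is not stated directly but derived as 226 minus the ninety-eight 200 OK URLs.

```python
def pct(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)

TOTAL = 226        # URLs checked
OK_200 = 98        # URLs that returned 200 OK
STILL_VALID = 21   # 200 OK and effectively the same content
RECOVERED = 104    # failures found in the Wayback Machine

reported_failures = TOTAL - OK_200          # derived, not stated in the text
actual_failures = TOTAL - STILL_VALID       # 205
unrecovered = actual_failures - RECOVERED   # failures with no archived copy

print(pct(reported_failures, TOTAL))  # 57
print(pct(actual_failures, TOTAL))    # 91
print(pct(unrecovered, TOTAL))        # 45
```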

There are a couple of things you can do to help combat link rot for your own bookmarks moving forward.

First, donate to the Internet Archive: http://archive.org/donate/

Second, if you use a bookmark storage service like Pinboard or others, ask them to support adding submitted URLs to the Wayback Machine. Their bookmarklets or plugins could submit to the Wayback Machine's "Save Page Now" endpoint, or the service could do that on the back end. Services that provide full page archives could capture a full WARC (network headers plus content), so every successfully cached page could be donated to the Internet Archive and integrated into the Wayback Machine. Or all of the above.
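As a rough illustration of the back-end option, here is a minimal Python sketch that builds a "Save Page Now" request by prefixing the bookmarked URL with `https://web.archive.org/save/`. The function names, the User-Agent string, and the timeout are hypothetical, and I'm assuming a plain GET is accepted; the endpoint's rate limits and authentication rules may differ, so treat this as a starting point rather than how any service actually does it.

```python
import urllib.request
from urllib.parse import quote

SAVE_PREFIX = "https://web.archive.org/save/"

def save_page_now_url(url):
    """Build the Save Page Now URL for a bookmarked URL."""
    # Keep URL delimiters and existing %-escapes; escape spaces and the like.
    return SAVE_PREFIX + quote(url, safe=":/?&=#%+~,;@")

def submit(url, timeout=30):
    """Ask the Wayback Machine to capture url; returns the HTTP status.

    Assumption: an unauthenticated GET is enough. Not called here.
    """
    request = urllib.request.Request(
        save_page_now_url(url),
        headers={"User-Agent": "bookmark-archiver-sketch"},  # hypothetical
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

print(save_page_now_url("http://example.com/"))
```

A bookmarking service could call something like `submit()` from a queue after each new bookmark is stored, so a slow or failed capture never blocks the user.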

Every URL saved in more than one place increases the likelihood that its content will survive as domains change owners.

I've a lot more bookmarks to import, and doing this processing by hand is tedious.

Any 4xx or 5xx URL could be checked against the Wayback Machine, with the option to link to that instead.
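A sketch of that check in Python, using the Wayback Machine's availability API at `https://archive.org/wayback/available`. The function names are mine, the response shape (`archived_snapshots` → `closest`) is what I understand the API to return, and the default timestamp is simply the bookmark file's last modified date; all of these are assumptions to verify before relying on them.

```python
import json
import urllib.request
from urllib.parse import urlencode

AVAILABILITY_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build the availability API query; timestamp is YYYYMMDD (optional)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_API + "?" + urlencode(params)

def closest_snapshot(response):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None

def needs_archive_lookup(status):
    """True for any 4xx or 5xx status code."""
    return 400 <= status <= 599

def wayback_fallback(url, status, timestamp="19971012"):
    """For a failing URL, return an archived copy's URL if one exists."""
    if not needs_archive_lookup(status):
        return None  # no network request is made for healthy URLs
    query = availability_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Passing a timestamp near the date the bookmark was saved asks for the snapshot closest to what was originally bookmarked, rather than the most recent capture.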

It also seems like some heuristics could be developed to flag URLs as likely being 200 OK But Gone. Parked domains have common content on every URL. Advertising landers have a common format. The Wayback Machine could be checked, and content could be extracted from both and compared. URLs aren't supposed to change, and they're supposed to point to a persistent resource, but companies and domain squatters aren't playing nice. If we want our bookmarks to represent the content we saved as it was when we saved it, we have to be proactive about grooming them.
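One way such heuristics might look, as a minimal Python sketch: extract the visible words from two pages and compare them with the standard library's `difflib.SequenceMatcher`. Everything here is hypothetical, including the function names and both thresholds, which are untested guesses; a real tool would need tuning against known parked pages and known-good archived copies.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text outside <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def visible_words(html):
    """Return the page's visible text as a list of lowercase words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower().split()

def similarity(html_a, html_b):
    """Word-level similarity between two pages, from 0.0 to 1.0."""
    matcher = SequenceMatcher(None, visible_words(html_a),
                              visible_words(html_b), autojunk=False)
    return matcher.ratio()

def looks_parked(page_html, other_page_html, threshold=0.9):
    """Two different URLs on one domain serving near-identical text."""
    return similarity(page_html, other_page_html) >= threshold  # guess

def looks_gone(live_html, archived_html, threshold=0.3):
    """Live page shares little text with its archived copy."""
    return similarity(live_html, archived_html) < threshold  # guess
```

Comparing words rather than raw characters keeps unrelated pages from scoring as similar just because they share common letters, though pages that legitimately changed over time would still need a human look.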

Vitorio



last updated august 2014