Notes on bookmarks from 1997
On August 30, 2014, I imported 264 bookmarks into Pinboard. The source was a file named "bookmark.htm" with a last modified date of October 12, 1997.
- 264 bookmarks according to Pinboard's import status
- 260 bookmarks according to Pinboard's count of tags
These bookmarks date between January 1995 and October 1997.
- 2 were from January 1995
- 14 were from September-December 1996
- The rest were from 1997
Upon import, Pinboard reported 163 (63%) of the URLs as unavailable, returning 403 Forbidden, 404 Not Found, 410 Gone, or 500 Server Error. Link rot of less than two-thirds over ~17 years doesn't sound so bad.
However, although Pinboard reported 200 for the rest, many of those URLs no longer served the original content. As one example, "serve.com" was a web host named DataRealm, and is now the site for an American Express prepaid card. As another, a VRML tutorial is now a video about birth control. Some of these URLs return 200 only because repeated 3xx redirections lead to ultimately unrelated content, or because domain name hoarders are serving ads.
- 12 bookmarks were for FTP sites, all of which Pinboard reported as 500 Server Error. These were not tested with an FTP client.
- 22 bookmarks were for local resources, all of which Pinboard reported as 404 Not Found.
- 226 bookmarks were left for testing.
Of the 226:
- 1 was 410 Gone
- 2 were 403 Forbidden
- 49 were 500 Server Error
- 76 were 404 Not Found
That's 128 failures out of 226, or 57%, which sounds even better than the original figure. But then I looked at the remaining ninety-eight 200 OK URLs, too.
- 77 reported 200 OK, but were parked domains, advertising landing pages, or otherwise completely different content. This is link rot, too, just harder for an automated system to detect. I marked these as 210 OK But Gone.
That's 205 failures, an actual link rot figure of 91%, not 57%.
That leaves only 21 URLs as 200 OK and containing effectively the same content.
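The arithmetic behind those two figures can be checked with a short sketch. The counts are copied from the notes above; the helper name percent and the variable names are only illustrative.

```python
# Counts copied from the notes; the 226 excludes the FTP and local bookmarks.
tested = 226
reported_failures = {
    "410 Gone": 1,
    "403 Forbidden": 2,
    "500 Server Error": 49,
    "404 Not Found": 76,
}
ok_but_gone = 77  # the hand-assigned "210 OK But Gone" bucket


def percent(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


reported = sum(reported_failures.values())  # what an automated check sees
actual = reported + ok_but_gone             # adds the 200s that are really gone
still_original = tested - actual

print(f"reported link rot: {reported}/{tested} = {percent(reported, tested)}%")
print(f"actual link rot:   {actual}/{tested} = {percent(actual, tested)}%")
print(f"still original:    {still_original}")
```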
In an attempt to confirm and/or recover as much of the original content as possible, I checked the Internet Archive's Wayback Machine for every URL.
- 1 of the two 403 Forbidden URLs had an old enough copy in the Wayback Machine.
- 23, or 47%, of the forty-nine 500 Server Error URLs had copies.
- 45, or 59%, of the seventy-six 404 Not Found URLs had copies.
- 35, or 45%, of the seventy-seven 210 OK But Gone URLs had copies.
That's 104 of the 205 failures beaten back by the Internet Archive at some level of fidelity, leaving 101 of the 226 URLs unrecovered and reducing effective link rot over ~17 years to 45%.
In addition, 9 of the twenty-one 200 OK URLs had old enough copies in the Wayback Machine, which I selected simply to provide a more accurate representation of the content.
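The recovery figures can be reproduced the same way. This is only a sketch of the arithmetic, using the counts listed above; the single 410 Gone URL is not listed as recovered in the notes, so it is counted as still lost.

```python
tested = 226
failures = 205  # 128 reported errors plus 77 "210 OK But Gone"

# (category, failed URLs, URLs with an old enough Wayback Machine copy)
recovered_by_category = [
    ("403 Forbidden", 2, 1),
    ("500 Server Error", 49, 23),
    ("404 Not Found", 76, 45),
    ("210 OK But Gone", 77, 35),
]


def percent(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


for name, failed, recovered in recovered_by_category:
    print(f"{name}: {recovered}/{failed} = {percent(recovered, failed)}%")

recovered_total = sum(recovered for _, _, recovered in recovered_by_category)
unrecovered = failures - recovered_total
print(f"recovered: {recovered_total}, still lost: {unrecovered}")
print(f"effective link rot: {percent(unrecovered, tested)}%")
```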
There are a couple of things you can do to help combat link rot for your own bookmarks moving forward.
First, donate to the Internet Archive: http://archive.org/donate/
Second, if you use a bookmark storage service like Pinboard, ask them to support adding submitted URLs to the Wayback Machine. Their bookmarklets or plugins could submit to the Wayback Machine's "Save Page Now" endpoint, or the service could do it on the back end. Services that provide full page archives could capture a full WARC (network headers plus content), so every successfully cached page could be donated to the Internet Archive and integrated into the Wayback Machine. Or all of the above.
Saving every URL in more than one place increases the likelihood that its content will survive as domains change owners.
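As a rough idea of what a back-end submission might look like, here is a minimal sketch. It assumes the public form of "Save Page Now", where the page to capture is appended to https://web.archive.org/save/; the function names, the User-Agent string, and the timeout are made up for illustration, and a real service would need to respect the Archive's rate limits.

```python
import urllib.error
import urllib.request

# Assumption: the public Save Page Now form, with the target URL appended.
SAVE_PAGE_NOW = "https://web.archive.org/save/"


def save_page_now_url(bookmark_url):
    """Build the Save Page Now request URL for one bookmark."""
    return SAVE_PAGE_NOW + bookmark_url.strip()


def submit_to_wayback(bookmark_url, timeout=30):
    """Ask the Wayback Machine to capture a page; return the HTTP status."""
    request = urllib.request.Request(
        save_page_now_url(bookmark_url),
        headers={"User-Agent": "bookmark-archiver-sketch/0.1"},  # hypothetical
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # A 4xx or 5xx from the Archive means the capture was not accepted.
        return error.code
```

Calling submit_to_wayback("http://example.com/") when a bookmark is first saved would request a capture while the page still has its original content.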
I've a lot more bookmarks to import, and doing this processing by hand is tedious.
Any 4xx or 5xx URL could be checked against the Wayback Machine, with the option to link to that instead.
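One possible way to automate that check is the Wayback Machine's availability API, which returns the snapshot closest to a requested timestamp as JSON. The sketch below assumes that API's documented response shape (archived_snapshots, then closest); the function names and the default timestamp, taken from the bookmark file's modified date, are illustrative choices.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(bookmark_url, timestamp="19971012"):
    """Build the availability API URL, asking for a copy near timestamp."""
    params = urllib.parse.urlencode({"url": bookmark_url, "timestamp": timestamp})
    return f"{AVAILABILITY_API}?{params}"


def closest_snapshot(api_response):
    """Return the archived URL from a parsed API response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def replacement_link(bookmark_url, http_status, timestamp="19971012"):
    """For 4xx/5xx bookmarks, prefer a Wayback Machine copy if one exists."""
    if http_status < 400:
        return bookmark_url  # no lookup needed
    query = availability_query(bookmark_url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as response:
        return closest_snapshot(json.load(response)) or bookmark_url
```

The closest snapshot may still be years newer than the bookmark, so its own timestamp is worth comparing before trusting it as "old enough".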
It also seems like some heuristics could be developed to flag URLs as likely being 210 OK But Gone. Parked domains have common content on every URL. Advertising landers have a common format. The Wayback Machine could be checked, and content could be extracted from both and compared. URLs aren't supposed to change, and they're supposed to point to a persistent resource, but companies and domain squatters aren't playing nice. If we want our bookmarks to represent the content we saved as it was when we saved it, we have to be proactive about grooming them.
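A first pass at two of those heuristics might look like the sketch below. It assumes the HTML has already been fetched: the live page, a second "probe" page from a different path on the same host, and an archived copy. The function names and both thresholds are guesses for illustration, and the tag stripping is deliberately crude (it does not remove script or style contents).

```python
import re
from difflib import SequenceMatcher

TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[a-z0-9']+")


def visible_words(html):
    """Crudely strip tags and return the lowercase words that remain."""
    return WORD.findall(TAG.sub(" ", html).lower())


def similarity(html_a, html_b):
    """Word-level similarity between two pages, from 0.0 to 1.0."""
    matcher = SequenceMatcher(
        None, visible_words(html_a), visible_words(html_b), autojunk=False
    )
    return matcher.ratio()


def looks_parked(page_html, probe_html, threshold=0.9):
    """Parked domains tend to serve nearly the same page for every path."""
    return similarity(page_html, probe_html) >= threshold


def looks_replaced(live_html, archived_html, threshold=0.2):
    """Flag a 200 OK page that shares almost no text with its archived copy."""
    return similarity(live_html, archived_html) < threshold
```

A URL that trips either check would be a candidate for 210 OK But Gone, to be confirmed by hand until the thresholds have been tuned against real bookmarks.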
Vitorio