Linuxvoice.com

The Internet Archive

2016-03-23

Join Mayank Sharma and marvel at the vision of the group that’s on a mission to one-up the Greeks.

Some people collect stamps; others collect comics. Brewster Kahle collects the internet. Or at least, that’s how he started. Once his appetite was whetted, Kahle set his sights on bigger and better things. He now wants to archive and channel all the knowledge in the world. Kahle is the founder of the Internet Archive, a non-profit he set up in 1996 right around the time he co-founded the for-profit Alexa Internet. Recounting its start at the annual open house event at the Archive’s San Francisco HQ in late 2014, Kahle said that the initial plan was – funnily enough – just to build an archive of the internet. By the mid-90s, people had already started sharing things they knew and pouring their souls onto the internet, and Kahle didn’t want this information to disappear.

So the organisation started taking snapshots of websites and today has over 430 billion web pages, and is adding about a billion pages a week. Since there’s an endless stream of web pages, its archiving system prioritises websites and caches some more often than others, but the goal is to cache some pages for every website every two months.

Just when the team were getting good at collecting the web, Kahle discovered that there were a lot of things that were not on the internet yet: “So we swivelled and in 2002 we became an archive ON the internet.” Inspired by the ancient Greek Library of Alexandria, which housed the largest collection of text scrolls, Kahle set about building its 21st-century equivalent by archiving books. “We worked with libraries around the world that had different types of media and started to digitise them cost-effectively to bring them to the screen generation.”

Executing the vision

According to Google, there are over 129 million different published books, and scanning them all is a monumental task. After experimenting with robots and outsourcing the work to low-wage countries, the team decided to make their own book scanner.

Currently, 100 of the Archive’s 140 employees scan books, helped by several volunteers. Kahle told us that they have 33 scanning centres, with about 100 scanners in total, spread across eight countries. Together they scan about a thousand books daily and have scanned about 2.6 million in all. There are other similar projects, such as Google Books, which has scanned over 1 million public domain books. But one thing that sets the Archive apart from the others is its effort to preserve at least one physical copy of each scanned book. In a blog post (http://blog.archive.org/2011/06/06/why-preserve-books-the-new-physical-archive-of-the-internet-archive), the Archive talks of an unnamed library that throws out books based on what’s been digitised by Google. The Archive, on the other hand, has vowed to keep a copy of every book it digitises that isn’t returned to a library.
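To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the daily rate stays constant, that scanning happens every day of the year, and that Google's count is the right target; none of these assumptions comes from the Archive, and the function name is purely illustrative.

```python
# Back-of-the-envelope estimate using the figures quoted in the article.
TOTAL_BOOKS = 129_000_000    # Google's estimate of distinct published books
SCANNED_SO_FAR = 2_600_000   # books the Archive has scanned so far
BOOKS_PER_DAY = 1_000        # approximate combined daily output


def years_to_finish(total: int, done: int, per_day: int,
                    days_per_year: int = 365) -> float:
    """Years needed to scan the remaining books at a constant daily rate.

    Assumption: scanning runs every day of the year at the same pace.
    """
    remaining = max(total - done, 0)
    return remaining / per_day / days_per_year


if __name__ == "__main__":
    years = years_to_finish(TOTAL_BOOKS, SCANNED_SO_FAR, BOOKS_PER_DAY)
    print(f"Roughly {years:.0f} years at the current rate")
```

Under these assumptions the answer comes out at about 346 years, which gives a sense of why the Archive wants more scanners in more libraries.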

The Archive has a physical archive in Richmond, California, that can house up to 3 million books for up to 100 years. And it’s no ordinary warehouse. “We have high-density, long-term, deep storage devices. These units that we have are hooked up with thermocouples to measure temperature and humidity. Each one holds approximately 40,000 books”, explains Robert Miller, Global Director of Books at the Archive, in a documentary (https://vimeo.com/59207751).

Behind the archive.org redesign

By the time you read this, the Internet Archive’s website should be wearing a new look. But there’s more to the redesign than a cosmetic uplift. Explaining the redesign in a blog post, its Director of Web Services, Alexis Rossi, writes that the current look of the site dates back to 2002 and has only had minor design changes and some usability feature additions over the years. One of the biggest reasons for overhauling the interface is that the Archive now hosts a lot more data than it did over a decade ago. From just about 3TB worth of books, audio and video in 2002, the collection has now grown to over 10,000TB, and that doesn’t include the almost two decades’ worth of web pages. Similarly, the number of daily users has also grown exponentially (Archive.org is one of the top 200 websites on the web, and around 2.5 million people use the items it hosts every day). Furthermore, about 30% of these users access the Archive from a mobile device – a demographic that isn’t served well by the current website.
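The storage figures above imply a steep compound growth rate. The sketch below works it out in Python, assuming the 10,000TB figure refers to 2014 (a 12-year span from 2002); the article only says "now", so the span is an assumption, and the function names are illustrative.

```python
import math


def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years)


def doubling_time_years(factor: float) -> float:
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(factor)


if __name__ == "__main__":
    START_TB = 3        # collection size in 2002, from the article
    END_TB = 10_000     # collection size "now", from the article
    YEARS = 12          # assumption: 2002 to 2014

    factor = annual_growth_factor(START_TB, END_TB, YEARS)
    print(f"Yearly growth factor: {factor:.2f}")
    print(f"Doubling time: {doubling_time_years(factor):.2f} years")
```

Under that assumption the collection has roughly doubled in size every year.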

According to Rossi, the group got serious about overhauling the website in January 2014. It hired staff and conducted interviews to better understand how people interacted with the website and the archived items. After months of work, the new website was launched in beta in November 2014 with “more visual cues to help you find things, facets on collections to quickly get you where you want to go, easy searching within collections, user pages, and many more.”

Demoing the beta at the open house event, Kahle said the new website isn’t just designed to find and serve the collections it currently archives, but also caters to users who wish to add items and create collections.

The Internet Arcade

At its annual event in October 2014, the Archive took the wraps off the newest addition to its website – the Internet Arcade (https://archive.org/details/internetarcade). It’s a web-based library of vintage arcade games from the 70s, 80s and 90s. The best thing about the collection is that you can experience and play these games from within the browser itself!

The games are emulated in the JSMESS emulator, which is a JavaScript port of the popular Multi Emulator Super System (MESS). The JSMESS emulation project is one of many open source projects that the Internet Archive is involved with. In addition to the games for classic gaming consoles such as the Atari 2600, Atari 7800, and Astrocade on the Internet Arcade, you can also play over 2,400 classic DOS games in the Archive’s software library for MS-DOS games (https://archive.org/details/softwarelibrary_msdos_games) thanks to the efforts of Jason Scott, who is equally adept at hacking away on his computer and filming documentaries.

Zoom out a bit more and the Archive’s software library includes over 95,000 vintage and historical programs.

Knowledge repository

After getting a handle on scanning books, the Archive set its sights on other media types – audio and video. But unlike the relatively small ebooks, audio and video files typically require much more storage space.

Illustrating the challenge at TED, Kahle said “If you give something to a charity or to the public, you get a pat on the back and a tax donation. Except on the Net, where you can go broke. If you put up a video of your garage band, and it starts getting heavily accessed, you can lose your guitars or your house.” This realisation led the Archive to offer unlimited storage and bandwidth to “anybody who has something to share that belongs in a library.”

Since 2005, the Archive has been collecting moving images of all types. Besides theatrical releases of movies that are out of copyright, the Archive houses lots of other types of movies sourced from institutions and individuals around the world. These include political films, non-English language videos, stock footage, sports videos, and a lot of amateur films. For example, the Archive hosts over 250 hours of video lectures and interviews with Dr Timothy Leary, one of the 20th century’s most controversial figures and an inspiration for many of the early technologists, including Kahle.

The Archive has a special interest in television, particularly in news. The group recorded news channels from around the world, 24 hours a day, for one week from 11 September 2001 in a bid to understand and analyse the reporting of the worldwide media in the days following the attacks. Using this they were able to dispel the myth that the Palestinians were dancing in the streets post 9/11, shares Kahle in his TED talk. In his words: “How can we have critical thinking without being able to quote and being able to compare what happened in the past?”

The Archive is also a big collector of music and all sorts of audio. It has digitised music from all types of vinyl records and archived music from optical discs. In his open house address, Kahle mentioned that the Archive deliberated on ways to archive music so as to not disrupt musicians and people who are still trying to make money distributing music. The Archive approached a couple of labels and offered to archive their material and then brainstorm together on how to make it available. It found willing partners in Music Omnia and Other Minds, which offered their portfolio of CDs for digitisation and are working with the Archive to “figure out how far we can go in such a way that it’s a good balance between the commercial constraints of a real label with the interests of what you can do if you have it all in one place.” Similarly, the group has tied up with the Archive of Contemporary Music and is digitising its collection of 500,000 CDs before moving on to its couple of million vinyl records.

Since commercial music is such a heavily litigated area, Kahle mentions that the Archive is also looking at other niches “that aren’t served terribly well by the classic commercial publishing system.” One such niche is concert recordings. It started with recordings of the Grateful Dead (one of their lyricists was John Perry Barlow, co-founder of the Electronic Frontier Foundation). Now the Archive gets about two or three bands a day signing up. “They give permission, and we get about 40 or 50 concerts a day”, shares Kahle. The Archive has also partnered with the etree.org community and houses their collection of over 100,000 concert recordings. Additionally, the Archive has imported over 42,000 albums from the now defunct Internet Underground Music Archive community and over 58,000 items from the Creative Commons-licensed catalogues of netlabels.

As with video, Kahle’s intention is to preserve these classic musical collections that help define a generation’s musical heritage. The Archive is feeding its musical archive to researchers such as Prof. Daniel Ellis of Columbia University, who is studying the link between signal processing and listener behaviour. The group is also using technology developed by the Universitat Pompeu Fabra (UPF) in Barcelona, which can identify rhythmic structures, chord structures and other metadata from the music to help them sort it in novel ways.

Aaron Swartz, who helped establish the Archive’s Open Library project, is among those with a terracotta statue.

Universal access

Digitising books, audio and video is just one part (albeit a big one) of the process of building a generational archive. The Archive puts in a lot of effort to preserve data and to keep it relevant. But there’s more to do than just replacing bad disks. “Can you read the old formats? We’ve had to translate our movies over five times”, says Kahle.

However, the biggest weakness the Archive insulates against is institutional failure. “The problem with libraries is that they burn. They get burned by governments. That’s not a political statement, it’s just historically what happened. The Library of Congress has already burned once. So if that’s what happens to libraries, let’s design for it.” The biggest lesson the Archive has learnt from the burning of the ancient Library of Alexandria is to keep multiple copies, which is a relatively easy task in the digital age. So the Archive has made a partial mirror of itself and put it in the new Library of Alexandria, with another partial copy in Amsterdam.

Of course, archiving all this culture is a massive job, so the group is building a complete set of tools to help communities and individuals to store, catalogue and sort through culturally relevant collections. “What Wikimedia did for encyclopedia articles, the Internet Archive hopes to do for collections of media: give people the tools to build library collections together and make them accessible to everyone.”

The Internet Archive has preserved over 430bn web pages, and about 20m books are downloaded from its website every month. “We get more visitors in a year than most libraries do in a lifetime”, writes Kahle.

Thanks to the positive experience over the last decade, the Archive is of the firm belief that building a digital library of Alexandria is just a matter of scale and money. “Everything we do is open source, and all the things we do we try to give away. Can you make it work to give everything away? This is a real experiment and it’s turning out to work”.

The Table Top Scribe

The Internet Archive’s scanner is an all-round hardware, software and digital library solution. The scanner can capture A3, A4 and A5-sized pamphlets, bound or loose leaf material, archival items and more. The base system is built on two 18-megapixel digital cameras. The Table Top Scribe, as the device is known, has a V-shaped cradle for bound materials such as books and an add-on for scanning flat items such as maps. The scanner can digitise pages at the rate of 500–800 pages per hour.
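As a rough illustration of what that page rate means in practice, the Python sketch below converts it into minutes per book. The 300-page book is a hypothetical example rather than a figure from the Archive, and the function name is illustrative.

```python
def minutes_per_book(pages: int, pages_per_hour: int) -> float:
    """Minutes needed to scan `pages` at a steady `pages_per_hour` rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages * 60 / pages_per_hour


if __name__ == "__main__":
    BOOK_PAGES = 300            # hypothetical book length
    SLOW_RATE, FAST_RATE = 500, 800   # pages per hour, from the article

    slowest = minutes_per_book(BOOK_PAGES, SLOW_RATE)
    fastest = minutes_per_book(BOOK_PAGES, FAST_RATE)
    print(f"A {BOOK_PAGES}-page book takes {fastest:.1f} to {slowest:.1f} minutes")
```

For the hypothetical 300-page book, that works out at between 22.5 and 36 minutes of scanning time, before any handling or post-processing.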

The Internet Archive sells these scanners for a shade under $10,000 (about £6,800). Libraries can use the scanner to scan and store the images locally at no additional cost. The Archive also offers an add-on Gold Package, which offers several benefits including the ability to auto-upload the scanned items to archive.org and the Archive’s back-end processing including QA, OCR’d images, and more. It costs $0.04 per image and subscribers aren’t charged for the first 50 books or 12,000 pages.
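The Gold Package pricing can be sketched as a small Python calculation. This is a simplified reading of the terms rather than the Archive's actual billing logic: it assumes one image per page and applies only the 12,000-page free allowance, ignoring the alternative 50-book threshold. The function names are illustrative.

```python
PRICE_PER_IMAGE_CENTS = 4   # $0.04 per image, from the article
FREE_PAGES = 12_000         # free allowance, from the article


def gold_package_cost_cents(pages_scanned: int) -> int:
    """Cost in US cents for scanning `pages_scanned` pages.

    Assumptions: one image per page, and only the 12,000-page
    allowance is applied (the 50-book alternative is ignored).
    """
    if pages_scanned < 0:
        raise ValueError("pages_scanned cannot be negative")
    billable = max(pages_scanned - FREE_PAGES, 0)
    return billable * PRICE_PER_IMAGE_CENTS


def format_usd(cents: int) -> str:
    """Render a whole number of cents as a dollar amount."""
    return f"${cents // 100:,}.{cents % 100:02d}"


if __name__ == "__main__":
    for pages in (10_000, 12_000, 100_000):
        cost = gold_package_cost_cents(pages)
        print(f"{pages:>7,} pages: {format_usd(cost)}")
```

Working in whole cents avoids floating-point rounding on money. On this reading, a library that scans 100,000 pages would pay for 88,000 of them, or $3,520.00.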

Lan Zhu, a scanner at Internet Archive, scanning a book using the Table Top Scribe.

From Linux Voice issue 15.