“The Australian Web Archive [AWA] is one of the biggest in the world. And when we say big, we mean enormous,” says director general of the National Library of Australia, Dr Marie-Louise Ayres.
The new archive, which launched last week, contains around 600 terabytes of data across 9 billion records. In bookshelf terms, if the records were printed and stacked, they would stretch from Canberra to Cairns.
The archive contains thousands of .au domain web pages – some still popular and others defunct – allowing users to see how they looked at different points in time from 1996 to the present.
The project has been some 20 years in the making, with the archive's functionality developed over the last two years by the National Library's small technology team, led by chief information officer David Wong.
"The sites have been collected form when the Internet was more or less born," Wong says. "They’re all preserved in perpetuity so the intention is to have them available forever."
While the concept will be familiar to users of the Internet Archive's Wayback Machine, which was launched in 2001, the AWA features significantly more functionality as it is fully text-searchable.
To discover pages in the Wayback Machine, users need to know the URL. With AWA, they can find content via a Google-like search built in-house by Wong's team.
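The difference between looking up a known URL and searching by text can be illustrated with an inverted index, the data structure that typically sits behind full-text search. The following is a minimal sketch only: the article does not describe how the library's search is built, and the page data and the function names (tokenize, build_index, search) are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical archived pages, keyed by URL (illustrative data only).
pages = {
    "example.com.au/a": "Sydney Olympics 2000 ticket sales",
    "example.com.au/b": "Canberra weather and Sydney traffic",
    "example.org.au/c": "Olympics medal tally",
}
index = build_index(pages)
matches = search(index, "sydney olympics")
```

With an index like this, a reader needs only a search term rather than the address of a page. A production system over billions of records would also need ranking, sharding and on-disk storage, none of which is shown here.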
"You can type in a search term and find sites with that term on there. It makes content a hell of a lot more accessible and discoverable. It’s a very compelling feature," Wong explains.
Collecting what’s important tomorrow, today
The National Library was one of the first organisations of its kind globally to build an Internet archive.
The collection and storage of web pages began in 1996, when library curators picked out 'sites of significance' for archiving, forming the PANDORA Web Archive. Later, all government websites were included in the collection effort, in the Australian Government Web Archive.
“It was really good foresight. A lot of the sites you see disappeared many years ago. We’re grateful to our predecessors for having the idea and carrying it out," Wong says.
In 2005, the library began an annual 'bulk harvest' of all sites on the .au domain. The harvest relies on web crawlers: internet bots that browse the web, following links from page to page, and index what they find.
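The basic behaviour of a crawler restricted to one top-level domain can be sketched as a breadth-first walk over links. This is an illustration under stated assumptions, not the library's harvesting software: the fetch_links callback, the FAKE_WEB link map and the URLs are all hypothetical, and an in-memory dictionary stands in for real network requests.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url):
    """True when the URL's host name ends in .au."""
    host = urlparse(url).hostname or ""
    return host.endswith(".au")


def crawl(seeds, fetch_links, max_pages=100):
    """Visit in-scope pages breadth-first and return them in visit order."""
    queue = deque(u for u in seeds if in_scope(u))
    seen = set(queue)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            # Skip links outside .au and pages that are already queued.
            if in_scope(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link graph standing in for the live web.
FAKE_WEB = {
    "https://a.example.com.au/": [
        "https://b.example.com.au/",
        "https://elsewhere.example.com/",
    ],
    "https://b.example.com.au/": [
        "https://a.example.com.au/",
        "https://c.example.org.au/",
    ],
    "https://c.example.org.au/": [],
}

harvested = crawl(["https://a.example.com.au/"], lambda u: FAKE_WEB.get(u, []))
```

A real harvest would add politeness delays, robots.txt handling and storage of the page content itself.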
Although the bulk harvest means the collection is a truer picture of Australia online, the volumes involved make the repository much harder to browse.
"It’s our most complex search project we’ve done to date," Wong says.
Four of the library's developers combined their own code with open source code to allow text searches of the repository.
"We used our version of Google’s PageRank algorithm and we offer some Bayesian filtering and machine learning. It took us quite some time to come up with that algorithm, and tuning over many months," says Wong.
The team also used machine learning-based image recognition to identify and delete pages displaying pornographic material.
"We had to think of ways of suppressing the noise and the junk but also provide users with what they want. We used a few techniques combined," says Wong.
The size of the repository presented significant challenges. The team had experience with such problems, having developed the Trove service, an interface launched in 2008 to search more than 90 million items from libraries, museums, archives, repositories and other research organisations. The AWA is accessible via Trove, which can draw up to 70,000 users a day.
"We had to modify the approach and implement a new solution, because when you look at the Web Archive it's got a much larger corpus, but we expect the user base to be lower. We had to rearchitect the solution to suit the content. We had to invert the design," Wong explains.
"Our servers are specially configured to run the service. With Trove, the servers have lots of RAM and then for the Web Archive it’s the other way round — there's less RAM and more for disk," Wong adds.
For the medium term, the archive will remain on the library's on-premises servers, but it could move to the cloud as it expands.
"Going forward it’s going to be a challenge collecting what’s important tomorrow, today," Wong says.
The method of harvesting could also adapt.
"There’s so much content on the web, we’re going to have to explore other mechanisms for capturing using machine learning and artificial intelligence, but right now we have curators playing that role and doing it very effectively," Wong adds.
Custodians of history
The archive of Australia’s internet activity is expected to be a key resource for future historians and researchers.
“[We are] building the foundations for the next 100 years. When people want to find out about today, what happens today, rather than accessing digitised books and journals and newspapers they’re going to want to find out what was on Twitter, what was on news sites and in the comments section. It’s going to be based on content on the web,” says Wong.
The National Library considers the effort part of its responsibility as “custodians of Australian history,” says Ayres.
“For those of us who lived and worked before the dawn of the website, it’s a fascinating reminder of how much things have changed. For those who’ve never known a world without the web, it’s a remarkable history lesson,” she says.