The British Library: Helping archive the web
Each year more than six million searches are generated by the British Library online catalogue, and nearly 400,000 people visit the British Library reading rooms, looking for information. The British Library receives a copy of every physical publication produced in the UK and Ireland, amounting to more than 150 million maps, manuscripts, musical scores, newspapers and magazines that it must archive. Beyond just the physical assets, the British Library has been archiving selected Web pages from the UK domain since 2004.
The Web changes rapidly: new pages are created every day, producing an explosion of data that disappears almost as quickly as it is published. Recent research estimates the average life expectancy of a Web site at just 44 to 75 days, and every six months 10 percent of Web pages on the UK domain are lost. The challenge is to preserve the digital culture of the nation. With websites being published, modified or removed daily, capturing snapshots of the UK web domain is an enormous endeavor. In the case of the 2005 UK General Election alone, disused websites from former MPs, old campaigns, and bankrupt businesses need to be kept for the historians and researchers of the future.
"We estimate the UK Web space will contain over 11 million Web sites by 2011. To take on the enormous challenge of capturing this content, we need a system capable of taking the UK Web Archive to Web-scale," said Helen Hockx-Yu, Web Archiving Programme Manager, The British Library. "IBM can help us analyse the web archive containing millions of pages and unlock embedded knowledge which otherwise is difficult to discover using traditional search methods."
The new analytics software project, called IBM BigSheets, helps extract, annotate and visually analyse vast amounts of Web information using a Web browser. IBM's new technology preview is helping the British Library archive and preserve massive amounts of Web pages, and then unlock the virtual door to its archives for generations to come. Users can explore and generate new data insights using a Web application, and the IBM software then publishes Web 2.0 standard data feeds that British Library patrons can search.
What if it was your job to archive information--and that included the web? Listen to Dame Lynn Brindley, CEO of the British Library, explain on BBC Radio the importance of the international endeavor to preserve a nation's digital culture via web archiving techniques.
BigSheets is an extension of the mashup paradigm that integrates gigabytes, terabytes, or petabytes of unstructured data from Web-based repositories; collects a wide range of unstructured Web data stemming from user-defined seed URLs; extracts and enriches that data using an unstructured information management architecture; and lets the user explore and visualise this data in specific, user-defined contexts. For example, users can see search results in a pie chart and look at the data in a tag cloud.
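The extract-and-explore steps described above can be illustrated with a small sketch. This is not BigSheets code: it assumes the pages have already been collected, and the names `TextExtractor`, `extract_terms`, `tag_cloud`, the stopword list and the example URLs are all hypothetical. It pulls the visible text out of each page and counts terms, which is the kind of data a tag cloud is drawn from.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


def extract_terms(html):
    """Return lowercase words from the visible text of one page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def tag_cloud(pages, top_n=5):
    """Count terms across all pages and return the top_n (term, count) pairs."""
    counts = Counter()
    for html in pages.values():
        counts.update(extract_terms(html))
    return counts.most_common(top_n)


# Stand-in for pages gathered from user-defined seed URLs (made-up data).
pages = {
    "http://example.org/a": (
        "<html><body><h1>Election campaign</h1>"
        "<p>The campaign for the election.</p>"
        "<script>var x = 'ignored';</script></body></html>"
    ),
    "http://example.org/b": "<p>Campaign archive</p>",
}

if __name__ == "__main__":
    for term, count in tag_cloud(pages, top_n=3):
        print(term, count)
```

In a real system the term counts would feed a visualisation layer; here they are simply printed, with the most frequent term first.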
Value generated for the client
Whether it's someone interested in their own genealogy or a student working on a project for school, people need help making sense of this growing sea of information on the Web. For example, the 2005 election marked the first attempts by UK politicians to use the Web as a campaigning tool. With the use of Web campaigns expected to explode during the 2010 election, the 2005 collection will enable researchers studying the evolution of politics and the Web to access hugely valuable primary source material.
This year, the amount of digital information is expected to reach 988 exabytes, equivalent to a stack of books from the Sun to Pluto and back. The Web is exploding with data, and business professionals want to access that data, both structured and unstructured, to gain better insight into their business. IBM BigSheets is an insight engine that helps businesses draw insights from very large data sets easily and in a timely manner. Because it is built on top of the Apache Hadoop framework, IBM BigSheets can process large amounts of data quickly and efficiently.
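Hadoop is best known for the MapReduce programming model, in which work is split into a map step, a grouping step and a reduce step so that it can be spread across many machines. The sketch below imitates that model in a single Python process to show the idea; it does not use any Hadoop API, and the function names and sample records are hypothetical. It counts archived pages per host.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(record):
    """Emit a (host, 1) pair for one archived (url, content) record."""
    url, _content = record
    yield urlparse(url).netloc, 1


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one host."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence, in one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Made-up records standing in for archived pages.
records = [
    ("http://example.co.uk/a", "page one"),
    ("http://example.co.uk/b", "page two"),
    ("http://example.org.uk/", "home page"),
]

if __name__ == "__main__":
    print(run_job(records))
```

Because each map call looks at one record and each reduce call at one key, both steps can in principle run in parallel on separate machines, which is what makes the model suit very large data sets.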