wiki.arcwayback

ARCWayback is a set of utilities for maintaining archives stored in the Internet Archive's ARC file format. A Java application connected to a database provides access to the archive by URL and time, and can retrieve all archived versions of a document.

At present the National Library of the Czech Republic is crawling the whole .cz domain and some specific hosts. We use the Heritrix crawler, so our archive grows continuously as a collection of ARC files. For indexing and full-text searching in the archive we decided to use Nutch, the indexing software used by the Internet Archive. However, we cannot index all documents in our archive with NutchWAX because we do not have the resources to do so (we expected to need ten or more machines for this job).

We therefore want to create a layer between our archive and the search engine. The basic idea is to know exactly what is in the archive, which means we need to fetch a record from the archive quickly given its URL and time. The aim of this project is to develop a set of utilities in Java (connected to a database) that provide quick access to the archive, and to redesign the WERA web front end (previously the NWA tools) for retrieving records.
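
The lookup by URL and time described above can be illustrated with a minimal in-memory sketch. It is not the ARCWayback implementation: the class name `ArcIndexSketch`, the `Location` fields (ARC file name and byte offset) and the use of a `TreeMap` in place of the database are all assumptions made for illustration. Timestamps are assumed to be 14-digit `YYYYMMDDhhmmss` values stored as `long`, so numeric order matches chronological order.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a database-backed index of ARC records. */
public class ArcIndexSketch {

    /** Where one archived record lives: ARC file name and byte offset (assumed fields). */
    public static final class Location {
        public final String arcFile;
        public final long offset;

        public Location(String arcFile, long offset) {
            this.arcFile = arcFile;
            this.offset = offset;
        }
    }

    // URL -> (capture timestamp -> record location), timestamps kept sorted.
    private final Map<String, NavigableMap<Long, Location>> byUrl = new HashMap<>();

    /** Registers one captured version of a URL. */
    public void add(String url, long timestamp, String arcFile, long offset) {
        byUrl.computeIfAbsent(url, k -> new TreeMap<>())
             .put(timestamp, new Location(arcFile, offset));
    }

    /** Returns the latest capture at or before the given time, or null if there is none. */
    public Location lookup(String url, long timestamp) {
        NavigableMap<Long, Location> versions = byUrl.get(url);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, Location> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    /** Returns all known versions of a URL, ordered by capture time (empty if unknown). */
    public NavigableMap<Long, Location> versions(String url) {
        NavigableMap<Long, Location> versions = byUrl.get(url);
        return versions == null ? new TreeMap<>() : versions;
    }
}
```

With such an index, a front end would only need to open the returned ARC file at the given offset to read the record. Choosing the closest capture at or before the requested time is one possible policy; a nearest-in-either-direction policy would be equally plausible.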

contact: mail@webarchiv.cz