Saturday, September 4, 2010

Open source Java Web crawlers

While researching how to make our company intranet searchable, I had to investigate several open source web crawlers written in Java.
  • Heritrix was ruled out because it is geared toward archiving, which was not important for my use case.
  • Nutch has a built-in crawler, but there is no out-of-the-box script for periodic index updates.
  • I tried JSpider with a custom rule implementation, but a thread deadlock kept index creation from completing.
  • Finally, I came across Aperture (while reading Lucene in Action, 2nd Edition), which has an RDF-based web crawler. With a persistent RDF repository, incremental indexing became easy and fun. However, Aperture does not support politeness (for example, throttling requests to the same host), which limits its use for indexing websites that belong to other companies.
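To make the politeness point concrete, here is a minimal sketch of a per-host delay gate that could sit in front of any crawler's fetch call. It is not part of Aperture or any of the crawlers above; the class name `PolitenessGate`, the method `reserve`, and the fixed-delay policy are all my own assumptions for illustration. The current time is passed in by the caller so the logic can be checked without actually sleeping.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical helper (not an Aperture API): enforces a minimum delay
 * between two requests to the same host.
 */
class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next fetch slot for the host of the given URL and
     * returns how many milliseconds the caller should wait before fetching.
     */
    synchronized long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        // The following request to this host must wait at least minDelayMillis more.
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }
}
```

A crawler thread would call `reserve(url, System.currentTimeMillis())`, sleep for the returned number of milliseconds, and then fetch. A real polite crawler would also honor robots.txt, which this sketch leaves out.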