Saturday, September 4, 2010

Open source Java Web crawlers

While researching how to make a company intranet searchable, I investigated several open source web crawlers written in Java.
  • Heritrix was ruled out because archiving was not a requirement.
  • Nutch can crawl out of the box, but it ships no ready-made script for periodic index updates.
  • I tried JSpider with a custom rule implementation, but a thread deadlock prevented the index from being created.
  • Finally, I came across Aperture (while reading Lucene in Action, 2nd Edition), which has an RDF-based web crawler. With a persistent RDF repository, incremental indexing became a lot of fun. However, Aperture does not support politeness, which limits its use for indexing websites that belong to other companies.
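
To make the politeness point concrete: a polite crawler waits a minimum time between two requests to the same host (and also honours robots.txt, which is not covered here). The sketch below shows only the per-host delay. The class name PolitenessGate and its reserve method are made up for this post; they are not part of Aperture or any other crawler mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not from any crawler library): enforces a minimum
// delay between consecutive requests to the same host.
final class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in ms) at which the next request to each host may be sent.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Reserves the next request slot for the host and returns how many
    // milliseconds the caller should wait before sending the request.
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long sendAt = Math.max(earliest, nowMillis);
        nextAllowed.put(host, sendAt + minDelayMillis);
        return sendAt - nowMillis;
    }
}
```

A fetch loop could call Thread.sleep(gate.reserve(host, System.currentTimeMillis())) before each request. Passing the clock value in as a parameter keeps the logic easy to test, and requests to different hosts do not delay each other.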

2 comments:

Hoang said...

Hi Hafizur Rahman,

Thanks for your useful post! I have tried some of your code, and the Aperture crawler works quite well. But I am facing a problem with heap memory usage when I use FileAccessData to enable incremental crawling. It eats a lot of memory (about 1500 MB with 20,000 URLs/site), so the crawler runs very slowly.
Have you ever faced this problem? Can you give me any suggestions? Thanks in advance!

Hafizur Rahman said...

Later I used Crawler4j with some modifications to use Apache Tika. This works very well. I hope that will simplify things.