Jorogumo is a collection of tools for automatically visiting Web sites and collecting information from them. It can be used for tasks such as checking for broken links, updating a search engine's database, and collecting training corpora for machine learning projects. The design of the system is guided by the Unix philosophy of providing small, powerful tools that each do one thing and can be combined arbitrarily into larger systems.
The Jorogumo tools descend from an in-house Web crawler written to index the author's personal and business Web sites. The current 0.1 version is just a selection of code from that project, with some user configuration options and packaging added. The packaging, in particular, is still in a pre-alpha state, even though the crawler itself is mature code that had been deployed for years before this release.