I’ve recently been working to obtain some publicly available data from a few websites that list many projects. As part of that work, I wrote a small HTML crawler that identifies all the links in a given HTML page, navigates to them, and downloads their HTML content. Writing this crawler took me roughly a day of development in Ruby and is not very interesting in itself.
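
For context, a minimal sketch of that kind of crawler might look something like the following. It assumes Nokogiri for HTML parsing, and the class and method names are illustrative rather than the ones from my actual implementation:

require 'net/http'
require 'nokogiri'
require 'set'

class SimpleCrawler
  def initialize(start_url)
    @start_url = URI(start_url)
    @visited   = Set.new
  end

  # Breadth-first walk over every same-host link reachable from the start page.
  def crawl
    queue = [@start_url]
    until queue.empty?
      url = queue.shift
      next if @visited.include?(url)
      @visited << url

      response = Net::HTTP.get_response(url)
      next unless response.is_a?(Net::HTTPSuccess)

      Nokogiri::HTML(response.body).css('a[href]').each do |link|
        begin
          target = url.merge(link['href'])
        rescue URI::Error
          next # skip hrefs that cannot be resolved (javascript:, malformed, ...)
        end
        # only follow links that stay on the site being crawled
        queue << target if target.host == @start_url.host
      end
    end
    @visited
  end
end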

The results of crawling a website, however, are very interesting. Crawling is a great way to identify every page on your website that a user can reach by following links. It is in no way a guarantee that ONLY the pages you want are available, but it can give you valuable information about what can be publicly accessed, through which route and, in some cases, what was meant to be accessible but is not.

If you have ever had someone report that your website includes a link that isn’t working, the famous 404, you should definitely think about running a crawler against your staging environment (or similar) on every commit or, at least, on every deploy to that pre-production environment. Your test could be as simple as:



describe Application do
  it 'should not include any link that leads to a 404 url' do
    expect(Crawler.new(Application.new).errors).to_not include('404')
  end
end
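
The spec above assumes a Crawler class that takes the application and exposes the problems it found through an errors collection. As a purely hypothetical sketch of how those errors could be gathered, simplified here to take a list of URLs instead of the application object so it stays self-contained:

require 'net/http'

class Crawler
  attr_reader :errors

  def initialize(urls)
    @urls   = urls
    @errors = []
  end

  # Fetches every URL and records the status code of any non-2xx/3xx response,
  # so a spec can assert that '404' is not among the errors.
  def run
    @urls.each do |url|
      status = Net::HTTP.get_response(URI(url)).code
      @errors << status unless status.start_with?('2', '3')
    end
    self
  end
end

# Crawler.new(['https://staging.example.com/']).run.errors
# => [] when every page responds successfully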

Note that this is an end-to-end test, and it will be slow because it exercises every page of your web application. As such, it should not run until you have some degree of confidence that the application is working. For now, if you can’t afford to write such code, or need a compelling argument for spending time on it, I found a website that offers to run similar crawlers and report dead links. It is not really set up as a service you could easily integrate into your CI/CD pipeline, but there is clearly an opportunity here.