
Recently, I've decided to migrate my blog to Angular. One of the first challenges was to find out which URLs my blog has and to pull the HTML code of these articles from the web. There are more than 350 articles, so it was obvious I had to write a program for it. As it turns out, that's simple with node.js and a few lines of JavaScript.

Crawling one's own blog

When I started the project, I planned to download and install a ready-made program. A quick internet search turned up many crawlers, but for some reason, I was unhappy with most of them. I'm a programmer; I prefer a programmatic approach. That also settled the question of which language to choose: the second crawler I found and understood was a JavaScript library. So JavaScript it is.

Let's have a look at some snippets of the source code. As so often in the JavaScript world, it begins with npm:

npm init
npm install crawler --save

After that, it's a simple node.js program:

var Crawler = require('crawler');
var fs = require('fs');

// Remember which URLs have already been queued so no page is crawled twice
var alreadyCrawled = [];

var c = new Crawler({
  maxConnections: 1,
  rateLimit: 1000,
  // This will be called for each crawled page
  callback: function(error, res, done) {
    if (!error) {
      var body = res.body;
      var filename = res.options.uri; // todo: convert the URL into a filename
      // todo: create the folder
      fs.createWriteStream(filename).write(res.body);
      // Extract every link of the page and queue the internal ones
      var hrefs = body.split('href=');
      for (let i = 1; i < hrefs.length; i++) {
        var end = hrefs[i].indexOf('"', 1);
        var url = hrefs[i].substring(1, end);
        if (url.includes('example.com')) {
          if (!alreadyCrawled.includes(url)) {
            c.queue(url);
            alreadyCrawled.push(url);
          }
        }
      }
    }
    done();
  }
});

c.queue({ uri: 'https://www.example.com' });

I've omitted a few bits and pieces, but basically, it's just that simple.
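For the sake of completeness, here's one way those two todos could be filled in. This is only a sketch based on my own assumptions: the output folder, the index.html naming, and the helper names urlToFilename and savePage are mine, not part of the original code.

// A sketch of the two omitted steps, under my own assumptions: every page is
// stored as index.html inside a directory derived from its URL, below a
// hypothetical output/ folder.
var fs = require('fs');
var path = require('path');

function urlToFilename(uri) {
  var parsed = new URL(uri);
  // e.g. https://www.example.com/2020/05/post/ -> output/www.example.com/2020/05/post/index.html
  return path.join('output', parsed.hostname, parsed.pathname, 'index.html');
}

function savePage(uri, body) {
  var filename = urlToFilename(uri);
  // create the folder (recursively) before writing the file
  fs.mkdirSync(path.dirname(filename), { recursive: true });
  fs.writeFileSync(filename, body);
}

In the callback above, the fs.createWriteStream(filename).write(res.body) line would then simply become a call to savePage(res.options.uri, res.body).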

Is it legal?

Crawling a website is sometimes frowned upon, but of course, it's legal. Search engine companies do it all the time. Nonetheless, crawling a website without asking the owner first is an unfriendly act: it costs them bandwidth.

That's why I've limited both the rate limit and the number of parallel connections in the code snippet. One request per second over a single connection is slow enough not to cause the server any trouble.

Wrapping it up

Writing a website crawler proved to be surprisingly simple with node.js. At the end of the project, I had an almost complete list of the URLs my website provides. My crawler even found the search function and the feeds, revealing several surprises. For instance, I wasn't aware that every article of a WordPress blog offers its own comment feed.

