Scraping
I've been doing some scraping - writing apps that fetch HTML content using HTTP GET and the occasional POST. I've found two reasonably nice solutions for making scrapers easily:
- Scrapy - a Python framework, with an optional Splash server available if a full browser implementation (especially JavaScript) is needed.
- CasperJS - a JavaScript framework built on PhantomJS, a headless browser.
One recent accomplishment has to do with downloading files from a web site, but I'm under non-disclosure and can't talk about that. Dang.
That work was done in CasperJS, which has an interesting approach to defining and executing scrapers and spiders.
CasperJS
CasperJS handles PhantomJS "under the covers" and provides nice wrappers around important features like injecting JavaScript code into the browser and waiting on DOM elements, not to mention inputting keystrokes and mouse clicks.
Functions Inside Functions
Using CasperJS, you create a Casper object and then start it with an initial URL, which is requested using HTTP(S) via the PhantomJS headless browser. Instead of directly coding the spider or scraper, you define a series of CasperJS steps using casper.then() (or, inside a casper object, this.then()). Each definition is a function:
casper.then(function doSomething() {
    this.wait(250);
});
These functions are added to an array of function definitions and are not run immediately. When you are done defining them, you call casper.run() and the functions are invoked in order (mostly; see the bypass() function).
Functions frequently add new functions to the list, so you can be executing step 3 out of 4, and when the step completes you are now executing step 4 out of 7.
You can add logic that skips forward or backward through the array of functions, allowing loops and optional steps.
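Here's a rough sketch of the queueing idea in plain JavaScript. It is a toy model, not CasperJS source code: StepQueue, skip() and the step names are all made up for illustration, and the steps run synchronously, which real CasperJS steps do not. It assumes that steps added by a running step are placed right after that step, ahead of the ones already waiting.

```javascript
// Toy model of a deferred step queue (hypothetical, not CasperJS internals).
function StepQueue() {
    this.steps = [];      // queued step functions
    this.index = 0;       // index of the next step to run
    this.running = false; // true while run() is executing steps
    this.inserted = 0;    // steps added by the currently running step
    this.log = [];        // names of executed steps, in order
}

// Queue a step. Nothing runs yet. While a step is running, new steps
// are inserted right after it instead of at the end of the queue.
StepQueue.prototype.then = function then(fn) {
    if (this.running) {
        this.steps.splice(this.index + this.inserted, 0, fn);
        this.inserted += 1;
    } else {
        this.steps.push(fn);
    }
    return this;
};

// Jump over the next `count` queued steps.
StepQueue.prototype.skip = function skip(count) {
    this.index = Math.min(this.index + count, this.steps.length);
    return this;
};

// Execute the queued steps in order and return the names that ran.
StepQueue.prototype.run = function run() {
    this.running = true;
    while (this.index < this.steps.length) {
        var step = this.steps[this.index];
        this.index += 1;
        this.inserted = 0;
        this.log.push(step.name || '(anonymous)');
        step.call(this);
    }
    this.running = false;
    return this.log;
};

var queue = new StepQueue();
queue.then(function openPage() {
    // A running step can queue more steps; they run before "decide".
    this.then(function fillForm() {});
    this.then(function submitForm() {});
});
queue.then(function decide() {
    this.skip(1); // jump over optionalStep
});
queue.then(function optionalStep() {});
queue.then(function download() {});

var order = queue.run();
console.log(order.join(' -> '));
// prints "openPage -> fillForm -> submitForm -> decide -> download"
```

The queue starts with four steps, grows to six while openPage runs, and optionalStep never executes because decide skips over it.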
Most everything is asynchronous, which can bite you. If you code this.wait(500) twice in a row, both waits run asynchronously after the last active bit of this step completes, and finish at the same time. They do not add additional delay to each other at all if they are in the same .then().
The approach of adding functions everywhere for everything can lead to an accumulation of anonymous functions. This is actually a bad idea, since the debug/log mechanisms available will report the function names being processed - if they exist. It's best to add a unique function name to each and every function:
this.then(function check4errors() {
    var errorsFound = false;
    if (verbose) {
        this.echo('Check for errors');
    }
    // ... rest of the step omitted ...
});
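To see why the names matter, here is a small plain JavaScript sketch of what any logger can learn from a function object. describeStep and its message format are hypothetical, not CasperJS's actual log output; the only real feature used is the standard name property of functions.

```javascript
// Hypothetical helper: build a log line from a step function.
function describeStep(fn, position, total) {
    // Named function expressions carry their name; anonymous ones
    // stored in an array have an empty name.
    var label = fn.name || '(anonymous)';
    return 'Step ' + position + '/' + total + ': ' + label;
}

var steps = [
    function check4errors() {},
    function () {}
];

var lines = steps.map(function toLine(fn, i) {
    return describeStep(fn, i + 1, steps.length);
});

console.log(lines.join('\n'));
// prints:
// Step 1/2: check4errors
// Step 2/2: (anonymous)
```

With a dozen anonymous steps, every log line looks like the second one, which makes it hard to tell where a run went wrong.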
Be careful, though. There are also tight requirements around the casper.waitFor()/this.waitFor() and casper.waitUntil()/this.waitUntil() methods provided by CasperJS. The successful case has to be named function then() and the timeout case has to be named onTimeout() or things simply do not work. Here's an example of correct coding within a CasperJS object, so "this" is used rather than casper:
this.waitFor(function check() {
    return this.evaluate(function rptcheck() {
        // Runs inside the page. querySelectorAll() returns a list, which has
        // no onclick property, so look up the single button element instead.
        var button = document.querySelector('#_idJsp1\\:data\\:0\\:genRpt');
        return ((document.querySelectorAll('#reports\\:genRpt').length > 0) &&
            button !== null && button.onclick !== null &&
            (button.onclick.toString().length > 0));
    });
}, function then() {
    if (verbose) {
        this.echo('Found report generation button.');
        this.capture('ButtonFound.png');
    }
    this.wait(100);
}, function onTimeout() {
    this.echo('Timed out waiting for report generation button.');
    this.capture('NoButtonFound.png');
}, 20000);
Pluses and Minuses
CasperJS/PhantomJS scripts are different from most NodeJS apps, so integrating them (other than via command line execution) is complicated. Writing NodeJS wrappers that run CasperJS scrapers from the command line, on the other hand, is straightforward. This is how I solved the challenges that I can't talk about due to non-disclosure.
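A minimal sketch of such a wrapper, assuming the casperjs executable is on the PATH and the scraper prints its results to standard output. buildCasperArgs and runScraper are hypothetical names; the script path and option names are whatever your own scraper expects. Options are passed in the --name=value form that CasperJS scripts can read through casper.cli.

```javascript
var spawn = require('child_process').spawn;

// Build the casperjs argument list: the script path, then one
// --name=value entry per option.
function buildCasperArgs(scriptPath, options) {
    var args = [scriptPath];
    Object.keys(options || {}).forEach(function addOption(name) {
        args.push('--' + name + '=' + options[name]);
    });
    return args;
}

// Run a CasperJS script as a child process and collect its stdout.
// done(err, output) is called exactly once.
function runScraper(scriptPath, options, done) {
    var child = spawn('casperjs', buildCasperArgs(scriptPath, options));
    var output = '';
    var finished = false;

    function finish(err) {
        if (!finished) {
            finished = true;
            done(err, output);
        }
    }

    child.stdout.on('data', function collect(chunk) {
        output += chunk;
    });
    // Fires if the process cannot be started, e.g. casperjs is not installed.
    child.on('error', finish);
    child.on('close', function onClose(code) {
        finish(code === 0 ? null : new Error('casperjs exited with code ' + code));
    });
}
```

Calling runScraper('scraper.js', { verbose: true }, callback) would start casperjs scraper.js --verbose=true and hand the captured output to the callback. All communication goes through arguments, exit codes and output streams, which keeps the NodeJS and CasperJS sides cleanly separated.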
Inability to easily mix NodeJS and CasperJS code is a minus, but not horrendous. The ease of injecting JS into the browser is a plus, and the consistent JS language for both our code and the injected code has benefits too. Linting tools work well with the CasperJS code, enforcing correct coding in the CasperJS and browser injected code at the same time.
Scrapy Splash
Scrapy is nice, and the Scrapy/Splash combination looks like the best bet for truly large scale approaches. The tools allow you to run an extremely stateful back end process on a stateless server with the ScrapySplash adaptation layer handling the tricky state management bits for you. I got this working, but only to a limited degree. It looks like the best solution for a large scale professional approach, but it does have a higher barrier to entry with the separate back-end server and the adaptation layer on top of everything else. I ended up not using this approach, so for now my opinions on Scrapy and ScrapySplash are not well informed. If I ever get a chance to use it for real I'll revisit this article.