Wednesday, May 24, 2017

PhantomJS, CasperJS and Arrays of Functions

Scraping

I've been doing some scraping - writing apps that fetch HTML content using HTTP GET and the occasional POST.

I've found two reasonably nice solutions for making scrapers easily:

  1. Scrapy - a Python framework, with an optional Splash server available when a full browser implementation (especially Javascript support) is needed.
  2. CasperJS - a Javascript framework built on PhantomJS, a headless browser.


One recent accomplishment has to do with downloading files from a web site, but I'm under non-disclosure and can't talk about that. Dang.

That work was done in CasperJS, which has an interesting approach to defining and executing scrapers and spiders.

CasperJS

CasperJS handles PhantomJS "under the covers" and provides nice wrappers around important features like injecting Javascript code into the browser and waiting on DOM elements, not to mention inputting keystrokes and mouse clicks.

Functions Inside Functions

Using CasperJS, you create a Casper object and then start it with an initial URL, which is requested using HTTP(S) via the PhantomJS headless browser.

Instead of directly coding the spider or scraper, you define a series of CasperJS steps using casper.then() (or, inside a casper object, this.then()). Each definition is a function:
casper.then(function doSomething() {
    this.wait(250);
});

These functions are added to an array of function definitions and are not immediately run. When you're done defining them, you call casper.run() and the functions are invoked in order (mostly; see the bypass() function below).
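
A minimal sketch of the overall shape (example.com standing in for a real target):

var casper = require('casper').create();

// start() queues the initial navigation step.
casper.start('http://example.com/');

// then() only queues these functions; nothing has executed yet.
casper.then(function logTitle() {
    this.echo('Title: ' + this.getTitle());
});

casper.then(function pauseABit() {
    this.wait(250);
});

// run() walks the queued steps in order, then fires this callback.
casper.run(function allDone() {
    this.exit();
});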

Functions frequently add new functions to the list, so you can be executing step 3 of 4, and when that step completes you are executing step 4 of 7.

You can add logic that skips forward or backward through the array of functions, allowing loops and optional steps.
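
For example, this.bypass(n) skips the next n queued steps, and a step can re-queue itself to form a loop. A sketch (the '#loaded' selector is made up):

var attempts = 0;

function waitForLoaded() {
    attempts += 1;
    if (!this.exists('#loaded') && attempts < 5) {
        this.wait(500);
        this.then(waitForLoaded);   // appended to the step array: a loop
    }
}

casper.then(waitForLoaded);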

Most everything is asynchronous, which can bite you. If you code this.wait(500) and this.wait(500) in the same .then(), they both run asynchronously after the last active bit of that step completes, and they finish at the same time. They do not add additional delay to each other at all.
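
A sketch of the difference, given that behavior:

// Both waits are scheduled from the same step, so they overlap and the
// step delays roughly 500 ms total, not 1000 ms:
casper.then(function overlappingWaits() {
    this.wait(500);
    this.wait(500);
});

// Splitting them into separate steps makes the delays sequential:
casper.then(function firstWait() { this.wait(500); });
casper.then(function secondWait() { this.wait(500); });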

The approach of adding functions everywhere for everything can lead to an accumulation of anonymous functions. That's actually a bad idea: the debug/log mechanisms available will report the names of the functions being processed - if those names exist. It's best to give each and every function a unique name:
this.then(function check4errors() {
    var errorsFound = false;
    if (verbose) {
        this.echo('Check for errors');
    }
    // ... scan the page and set errorsFound accordingly ...
});

Be careful, though. There are also tight requirements around the casper.waitFor()/this.waitFor() and casper.waitUntil()/this.waitUntil() methods provided by CasperJS. The success case has to be a function named then() and the timeout case has to be named onTimeout(), or things simply do not work. Here's an example of correct coding within a CasperJS object, so "this" is used rather than casper:
this.waitFor(function check() {
    return this.evaluate(function rptcheck() {
        // querySelector returns the element (or null); guard it before
        // touching its onclick handler
        var btn = document.querySelector('#_idJsp1\\:data\\:0\\:genRpt');
        return ((document.querySelectorAll('#reports\\:genRpt').length > 0) &&
                (btn !== null) && (btn.onclick !== null));
    });
}, function then() {
    if (verbose) {
        this.echo('Found report generation button.');
        this.capture('ButtonFound.png');
    }
    this.wait(100);
}, function onTimeout() {
    this.echo('Timed out waiting for report generation button.');
    this.capture('NoButtonFound.png');
}, 20000);

Pluses and Minuses

CasperJS/PhantomJS are different from most NodeJS apps, so integrating them directly is complicated. Wrapping them via command-line execution from NodeJS, though, is straightforward. That's how I solved the challenges I can't talk about due to non-disclosure.
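
A minimal wrapper sketch (scrape.js is a hypothetical scraper script):

var spawn = require('child_process').spawn;

function runScraper(scriptPath, callback) {
    var proc = spawn('casperjs', [scriptPath]);
    var output = '';

    proc.stdout.on('data', function collect(chunk) {
        output += chunk;
    });

    proc.on('close', function report(code) {
        callback(code === 0 ? null : new Error('casperjs exited ' + code), output);
    });
}

runScraper('scrape.js', function handleResult(err, output) {
    if (err) { throw err; }
    console.log(output);
});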

The inability to easily mix NodeJS and CasperJS code is a minus, but not a horrendous one. The ease of injecting JS into the browser is a plus, and using Javascript consistently for both our code and the injected code has benefits too. Linting tools work well with CasperJS code, enforcing correct coding in the CasperJS and browser-injected code at the same time.

Scrapy Splash

Scrapy is nice, and the Scrapy/Splash combination looks like the best bet for truly large-scale work. The tools let you run an extremely stateful back-end process on a stateless server, with the ScrapySplash adaptation layer handling the tricky state-management bits for you. I got this working, but only to a limited degree. It looks like the best solution for a large-scale professional approach, but it has a higher barrier to entry, with the separate back-end server and the adaptation layer on top of everything else.

I ended up not using this approach, so for now my opinions on Scrapy and ScrapySplash are not well informed. If I ever get a chance to use it for real I'll revisit this article.

Friday, May 19, 2017

Really Tiny Embedded Code

I did a couple of generations of control systems based on 8051 derivatives. These had a bit of ROM - 8K, usually - but only 128 bytes of RAM. The stack used that 128 bytes, and so did your variables. A pretty darn tight fit for interrupt-driven code, which needs to use some stack.

I did the If VI Were IX guitar robots on the 8051. It handled MIDI input over the serial port in interrupts, and it had a timer-driven interrupt that we used both to update our servo loop and to derive the pulse chain that controlled position on the pluckers - the shaft with a plastic pick or nub that plucks the string when rotated properly.

We put 2 on each shaft - as I said, a pick and a nub - and added support for a MIDI mode switch to select between the two. Based on the requested velocity in MIDI note-on commands received from the serial port, we would set the PWM parameters higher (mostly on) for high velocities and lower (mostly off) for low velocities. To make the PWM average out and not be too jerky and noisy, we needed to update it as fast as we possibly could. Bear in mind that our 8051 used a 4 MHz clock, and many instructions took more than 4 cycles, so we got less than 1 million instructions per second. Not much power for handling real-time updates and asynchronous serial input while playing a guitar in real time.

(Old man ranting at lazy kids mode)
Micro-controller chips today usually have hardware PWM circuits, so we can just load a duty cycle number and provide a really fast MHz+ digital clock and we get a great PWM. Luxury! The 8051 I was using had no PWM hardware, so we implemented it in software using interrupts. Messier, less smooth, lots more code and a few variables in a system that had little room for either. We couldn't even get 1M instructions/sec.
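
The flavor of the software PWM, sketched in Javascript rather than 8051 assembly (the resolution and the output helper are made up for illustration):

// On every timer interrupt, compare a free-running counter against the
// duty value and drive the plucker output accordingly.
var DUTY_MAX = 32;   // assumed software PWM resolution
var duty = 0;        // set elsewhere from the MIDI note-on velocity
var tick = 0;

function onTimerInterrupt() {
    tick = (tick + 1) % DUTY_MAX;
    setPluckerOutput(tick < duty);   // hypothetical pin-write helper
}

function setPluckerOutput(on) {
    // stand-in for writing the actual output pin
}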

Micro-controllers today also have more RAM - either built in or external; they just don't make them with 128 bytes much any more. Luxury! (You're supposed to hear the Monty Python bit about having to walk to school uphill both ways, not like the lazy kids today; it doesn't come across well in text.) Clocks on modern micro-controllers often run at hundreds of megahertz to a gigahertz, giving a thousand or more times the instruction rate of that 8051, and they are 32 bits wide, so each instruction handles 4 times as much data as the old 8051s could.

So we had all of our local variables: assorted modes (damper tied to notes, or off to allow hammer-on and hammer-off, or on for a damped sound, etc.), state (plucking has several states, and we need requested and expected positions in order to include the error component in the feedback loop), limit details, channel details, note range details, and more. We also had to leave enough of the 128 bytes free for the registers to be stored during an interrupt (MIDI IO), with enough room for an additional stack frame for an overlapping timer interrupt (servo and position updating).

We managed to squeeze it all in, and it works fine. It helps that registers are only 8 bits and there aren't many of them, and the program counter (pushed onto the stack as the return address) is small too - not all that much needs to be pushed onto the stack. The upside of having little room is that you simply can't tolerate code bloat, and variables must be as simple as possible. The result is small enough that you can realistically squeeze all of the bugs out.

The If VI Were IX installation has never crashed due to a software fault, and it has outlived every single moving part in the system - MIDI sources had to be replaced with a solid-state source, pluckers and strings replaced - yet the 8051 micro-controllers are still fine a decade later.

If I was doing this over again from scratch today, I'd probably base it on a Raspberry Pi system with gigabytes of memory and a flash hard drive with tens of gigabytes more. Luxury!

In My Day We Had to Grind Square Roots Out A Bit At A Time - If We Were Smart

I'm old for my industry, getting into my late fifties. I also started very young, with my first paying tech gig in 1976. I have programmed computers by throwing switches and liked using punched paper tapes because that was a step up.

My first paying gig was a square root. I met Chuck through Paul, who lived down the block from him. Chuck was, like me, a bit of a math prodigy. He was working on a Z8000-based motion control system and was having problems with the square root calculations needed to figure out distances and how fast to go on each axis. You know: square each individual component or axis, add them, and take the square root. The result is the total distance of the motion as a vector on those axes. Given the desired speed, you can then work out how fast each individual axis should go to achieve it.

They were using the method of Newton, and it was painfully slow. Too slow. To do motion control in real time, you have to "get ahead" a bit. You start calculating when to take steps on each axis, but you don't take them yet. Instead you build a queue and store up a fair number of steps before starting the first one. This lets you survive slower portions (setting up for the next motion), as long as they are fairly short and the longer stretches are faster, with the queue providing the data you are momentarily too slow to calculate. How slow you are in the worst case directly determines how big that queue needs to be. Memory wasn't what it is now; if we were lucky we had 64K bytes for everything. Nowadays that's not even big enough for a stack. That meant the queues were quite a bit smaller still.

The method of Newton: given an integer, determine its square root by starting with an estimate (int/2, for example) and repeatedly:
  1. Divide the integer by the latest estimate.
  2. Average the estimate and the division result: new estimate = (latest estimate + division result) / 2.
  3. Repeat with the new estimate.

This will converge on the correct result. Slowly. For example, square root of 2 has these estimates:
1
1.5
1.333
1.41666
1.411764...

The 5th estimate is off by about 1/3000 or so of the actual square root of 2.
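
In code, the iteration looks something like this (a Javascript sketch; floating point stands in for the 64/32-bit integer math we actually used):

function newtonSqrt(n, rounds) {
    var estimate = n / 2;   // initial guess
    for (var i = 0; i < rounds; i++) {
        // each round costs a divide, the slowest instruction on the Z8000
        estimate = (estimate + n / estimate) / 2;
    }
    return estimate;
}

newtonSqrt(2, 4);   // ~1.4142, converging as in the list above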

Each round requires a divide, which is the slowest instruction on the Z8000 - really slow for the 64-bit by 32-bit divides we were dealing with. The biggest queue they could make would empty after 2 or 3 fairly short motions, used up by the divides in the square root calculations. The solution didn't work.

Chuck mentioned the difficulties to me and I got curious. Six years prior, in grade school, they had taught us how to take square roots manually. I sat down and worked that out, then figured out how to do it in binary. It turned out to be much simpler in binary.

To generate each digit (or bit), you select the largest digit that, when multiplied by the square root result so far, stays within the remaining input value. There's a doubling step in there too.

Doubling is just a bit shift, the fastest thing the Z8000 does. In binary the "largest digit" is always 1, which makes largest * so-far the same as so-far, so the whole multiply needed in base 10 drops out of the binary version, completely eliminated. It all reduces to a simple compare: if so-far is too big, this bit is not set and so-far = so-far * 2 (just a shift, adding a new 0 bit at the least significant end); otherwise the bit is set and so-far = so-far * 2 + 1 (another shift, plus an increment).
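
Here it is as a Javascript sketch of the standard shift-and-subtract integer square root; the Z8000 original was hand-tuned assembly, but the per-bit work is the same handful of shifts, compares and adds:

function isqrt(n) {
    var root = 0;
    var bit = 1 << 30;           // highest power of 4 in a 32-bit value

    while (bit > n) {            // shrink to the largest power of 4 <= n
        bit >>= 2;
    }
    while (bit !== 0) {
        if (n >= root + bit) {   // does this bit belong in the result?
            n -= root + bit;     // yes: subtract it out and set the bit
            root = (root >> 1) + bit;
        } else {
            root >>= 1;          // no: just shift
        }
        bit >>= 2;
    }
    return root;
}

isqrt(1105);   // 33, since 33 * 33 = 1089 <= 1105 < 34 * 34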

So we do 2 shifts and a compare for each 0 bit in the result, and 2 shifts, a compare and an increment for each 1 bit. That's already faster than a single divide. Since the short motions are the most challenging - they have the least time to build up the queue with the simple linear timing-generation algorithm - I optimized the algorithm for small distances. If the top 32 bits are 0, we only need to do a 32-bit-into-16-bit square root, taking half the time. In that case, if the top 16 bits are also 0, it turns into the square root of a 16-bit number, twice as fast again. Optimized to the byte, the shortest motions end up needing at most 8 cycles through our extremely fast two-shifts-plus-maybe-an-increment loop.

This was screamingly fast, and it immediately made the real-time system work. The queues were more than long enough, even for short motions. We were able to reduce the size of the queues, freeing up memory for the part program that the machine would cut and for code to be added when we added features.

They paid me $1,000 for solving their problem. At the time I was working a near-minimum-wage job at the Neptune Theater for $1.45 an hour or something like that, so this was more money than I made in 3 months.

This experience inclined me to go into the field I'm still in, software engineering and related technical specialties.

That gives me 41 years as a software engineer as of 2017; there can't be all that many around with more. The industry in the seventies was tiny by modern standards, and most of the engineers at the time were electrical engineers or other technical sorts who had switched over to meet the demand for the new skills, so they were already a decade into their careers for the most part. Those folk are in their seventies now, and the few who didn't leave the industry over the decades have largely retired.

If I can make it to my retirement in the industry in a bit over another decade, I'll probably be one of the most experienced software engineers on the planet. I wonder how many folk who started before the mid seventies are still in the industry? Where's the leader-board when you need it?

Non-disclosure

I worked recently on a project where I got to build a new set of services from scratch and deploy them to the cloud. Since the services are intended to be used with smartphone apps, we ended up hosting on Google's cloud (GCE), since they have quite a few useful tools for supporting smartphones, like Firebase, plus nicely integrated credentials and authorization management.

Unfortunately, since this is a commercial product that is likely to face competition, or at least inspire competition once it is released, I'm under non-disclosure. I'm not allowed to say much about what the services I designed, built and deployed actually do.

Once the product is actually released I'll be able to blog about it, but for now any blogs on the topic (like this one) can't talk about much and therefore end up being pretty short.

Guitar Robot Revisited

One of the cooler projects I've ever done was for Trimpin's "If VI Were IX" robot guitar installation at the EMP, which is now called the Museum of Pop Culture, or MoPOP.

Photo is by Thomas Upton who was nice enough to license it under the Creative Commons license. A higher resolution version from Mr. Upton is available on flickr.

The giant whirlwind-looking collection of instruments in MoPOP, shown in the picture above, is actually a guitar robot. You can see various guitar-ish looking items with rows of devices lined up on either side of the guitar fretboard. These are the guitar robot elements. A little above the center, slightly to the left, the purple and blue-ish guitars are both Trimpin's custom-built guitar robot elements.

Each element has a single string, a "plucker," and "frettles," which are the devices lined up along the fretboard. The frettles are solenoids: apply voltage to the coil and the solenoid pulls closed, pushing a pad down on the guitar string, one device per fret.

Each string has a controller - well, two for now, one to control the frettles and one to control the plucker. A MIDI input is routed to all of the controllers, and groups of 6 pluckers and their associated frettles are assigned to a single MIDI channel as a logical guitar. The mapping from MIDI notes to guitar frets is odd, since we skip octaves between strings so that each fret on each string is a unique note. This means the MIDI played on the robot guitar has to be processed to put notes into artificial octaves (incorrect musically) so that each note can only be played on one specific string, making the control easier to implement.
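
Something like this hypothetical sketch (the lowest note number and the 12-fret span are assumptions for illustration):

// Each string owns one artificial octave, so a MIDI note number picks
// out both the string and the fret with simple integer math.
var FIRST_NOTE = 36;          // assumed lowest mapped MIDI note
var FRETS_PER_STRING = 12;    // one artificial octave per string

function noteToElement(note) {
    var offset = note - FIRST_NOTE;
    return {
        string: Math.floor(offset / FRETS_PER_STRING),
        fret: offset % FRETS_PER_STRING
    };
}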

Trimpin made a second-generation version for his own use that was simplified and improved. The plucker controller now also controls the frettles for the string it plucks. The quality of the sound degrades as you move up the frets - by the 7th it's tinnier, and by the 12th fret it sounds lousy - so in the second version Trimpin only provides 5 frets per string and adds more strings to handle multiple notes simultaneously.

We also added a damper - a soft pad that rests on the strings, damping them so that a plucked note ends very quickly. When a note goes on, we lift the damper for that string, and when we get the note-off command we drop the damper back onto the string. If a note-on is followed by a different note-on for the same string, we switch frettles and keep the damper up.

We modified our response to MIDI note-on commands so that low-velocity notes (below 12% of full velocity) do not actually use the plucker; they just keep the damper up and deploy the correct frettle. This allows for hammer-on and hammer-off techniques - playing notes by hitting frets with the left hand rather than using the right hand to pluck.
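
Put together, the note-on handling looks roughly like this (the exact threshold and the hardware helpers are hypothetical stand-ins):

var HAMMER_THRESHOLD = Math.round(127 * 0.12);   // ~12% of full MIDI velocity

function onNoteOn(element, velocity) {
    liftDamper(element.string);                    // hypothetical helper
    engageFrettle(element.string, element.fret);   // hypothetical helper
    if (velocity >= HAMMER_THRESHOLD) {
        pluck(element.string, velocity);           // PWM duty set from velocity
    }
    // below the threshold: no pluck, just fret and damper, for hammer-on/off
}

function liftDamper(s) { /* raise the damper solenoid for string s */ }
function engageFrettle(s, f) { /* close frettle f on string s */ }
function pluck(s, v) { /* drive the plucker for string s at velocity v */ }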

We also added a mode that switches off the damper lifting, so the strings stay damped the whole time. This allows for a different sounding result - notes are brief and a little muffled.

Using MIDI commands, the new generation of controls has a wider variety of options and sounds available. Trimpin has used the second-generation guitar robot control in live performances from time to time, but I don't think there's a permanent installation using it yet.

We're hoping that the EMP will want to replace the current first-generation controls with second-generation controls (or, by then, probably a third generation, since we'll most likely add a few more tweaks). We've looked into the devices needed - our original micro-controllers are no longer available new, and using surplus used devices is dicey at best, so we'll be changing micro-controllers and tool sets if we get to do this work.

I hope we get to do it. I enjoy working on robots, and robotic guitars are among the more interesting robots I've ever worked on.