Sunday, February 19, 2017

Mini Data

Heard of "big data?" Billions of records, operations so complex and huge that fleets of computers work on them?
Yeah, I'm not doing that. I found an interesting source of useful data to mine; it's got a bit over 100K records. Not billions, no multiple sources to cross-tabulate, just a linear "fetch A, use it as a key for B" process. Not quite trivial, but pretty direct. I'm afraid I couldn't even call it medium-sized data, it's small enough that I'm just stuffing the results into a mySQL database I brought up for that purpose on my laptop. The horror! Not a pool of DB machines, not even a separate DB, just another process grinding away on the laptop. The overhead is so small that I don't particularly notice.

I'll christen this mini-data: not quite as small-scale and non-uniform as micro-data, but not big at all. Micro-data is the bottom-up stuff, the Internet-of-Everything background noise when every minor device is chattering away in some odd dialect out at the edge of the network mesh, in our kitchens and thermostats and cars and factories. Unlike micro-data, mini-data is still fairly regular and uniform. Mini-data comes from a single source or a small number of sources, so it's much easier to work with. Much easier to work with is a relative term, of course.

I got the import running, then a bit past 1,000 records it spewed chunks. Unicode characters I hadn't prepared to handle properly. Doh! Fixed that (a brutally ugly fix, but hey! it's a one-time process!). Somewhere over 2,000: crunch. Oddly formatted data led to a missing array element, and another run ended by an exception. Did I mention it's a one-off, so it's under-engineered? Like missing error checking and handling, so all errors are usually catastrophic the first time. I guess this is what you call "testing the quality in," and by the time I get my data gathered and organized from top to bottom I will have squeezed out all of the bugs that apply to my current data set.

The approach is quite fragile and not reusable, but that appears to be the appropriate approach, for now. Error handling will inevitably improve as I go from a fatal failure every thousand records to every 10,000, then 100,000, and finally none for the whole data set. At that point I've achieved my goal and won't run the process again. Until the inevitable source data update, which for this source often includes major formatting changes, so the code is possibly not reusable when I next need to repeat this operation. Ask me what I think of the trade-off in a month or two, and again after the source data updates. Meanwhile, I'm enjoying my work in mini-data. Hopefully it will be as useful as we think it can be!
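
For the curious, the whole "fetch A, use it as a key for B" pass is roughly the shape sketched below. This is only an illustration of the approach, not the actual script: the endpoint URLs, field names, and table layout are invented, and the encode/ignore trick stands in for my equally ugly one-time Unicode fix.

    import requests
    import mysql.connector

    # Hypothetical endpoints and table; the real source isn't named in this post.
    A_URL = "https://example.com/api/a_records"
    B_URL = "https://example.com/api/b_records/{}"

    db = mysql.connector.connect(user="root", password="...", database="minidata")
    cur = db.cursor()

    for a in requests.get(A_URL).json():
        key = a.get("id")
        if key is None:
            continue  # oddly formatted record; skip it rather than crash (this time)

        b = requests.get(B_URL.format(key)).json()

        # Brutally ugly one-time fix: drop any characters the local setup chokes on.
        name = a.get("name", "").encode("ascii", "ignore").decode("ascii")

        cur.execute(
            "INSERT INTO records (a_id, name, b_value) VALUES (%s, %s, %s)",
            (key, name, b.get("value")),
        )

    db.commit()

Note what's missing: no retries, no timeouts, no real error handling. That under-engineering is exactly where the exceptions came from.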

Saturday, February 18, 2017

Creaky or Cranky Code

Sometimes you build a code Taj Mahal, or at least try to. A thing of rare beauty, of soaring architecture and fine attention to detail, the combined efforts of many over time.

This is not about that.

Sometimes you're a bit off the reservation - you're exploring new technologies by doing simple tasks, and that is not going to look pretty. Building the hut of sticks comes first. It shows you can at least do something with the available materials, but the first few cuts are often a bit unstable and poorly designed.

We have an embarrassment of riches available for free - well, as long as you have a computer and can pay for internet access, anyway. The barrier is much lower than it once was, and keeps getting lower.

I've been playing with AngularJS and needed to get some REST data as a client. AngularJS does that well, so I made a simple app and added a service to supply the REST data via promises. I copied large portions of the solution from assorted "How Do I...?" posts on the internet with good answers. In no time it was working, and I added some controls and UI embellishments easily. That's also something AngularJS does well.

AngularJS does have its limitations. For local file access and more complex tasks without obvious solutions to crib from, it can be a slog.

Luckily, up my other sleeve I've got Python to cover the cases not easily handled via an MVC interface in the browser, like local file access and talking to backends. AngularJS isn't good at everything! Python, it seems, is good at (or at least capable of) most everything.

Python is great for cases like this recent one: I needed to grab some data from a rate-capped REST API, do assorted further processing, and ship the data, sliced various ways, to an SQL database. Using Python lets me solve all of those issues directly in one script.

Python handles pulling data from a REST API easily enough, and it's one of the more pleasant languages when it comes to getting things written to disk. It parses XML without much fuss, and the interfaces to SQL are simple. A script or two to create a database and tables, an import or two in the main script, and we're off.
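
The rate cap is the main wrinkle. A minimal sketch of the kind of throttled fetch loop I mean might look like the following; the one-request-per-second budget is an assumption for illustration, not the actual cap.

    import time
    import requests

    REQUEST_INTERVAL = 1.0  # assumed budget of roughly one request per second

    def fetch_throttled(urls):
        """Yield JSON responses while staying under the API's rate cap."""
        last_request = 0.0
        for url in urls:
            wait = REQUEST_INTERVAL - (time.time() - last_request)
            if wait > 0:
                time.sleep(wait)
            last_request = time.time()
            yield requests.get(url).json()

Keeping the throttling in one generator means the rest of the script can just loop over results and never think about the cap.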

I was able to mine logged streaming data for input keys used to request records from a different but related REST API. Handling the REST client details and getting all of the SQL tables updated properly for each data point turns out to be easy to do in Python - I got it running in an afternoon.
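
Mining the logs is mostly pattern matching. Something along these lines would do it; the log field name and key format here are made up, since the real streaming logs aren't public.

    import re

    # Illustrative only: the actual field name and key shape differ.
    KEY_PATTERN = re.compile(r'"track_id":\s*"([A-Za-z0-9]+)"')

    def mine_keys(log_path):
        """Collect unique keys from a log file, preserving first-seen order."""
        keys, seen = [], set()
        with open(log_path, encoding="utf-8") as log_file:
            for line in log_file:
                for key in KEY_PATTERN.findall(line):
                    if key not in seen:
                        seen.add(key)
                        keys.append(key)
        return keys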

Then the cranky part came in. Rate limiting made test runs slower, and my local mySQL installation periodically confused the mySQL Workbench tool, making tests even more tedious as I repeatedly exited and restarted utilities.

Just when I think I have it working, past 1,000 records into a 50K-record run with things looking good: bang! An exception. The first of many. Characters in JSON results that can't be translated to the local code page (exception!), remote requests failing in various interesting ways, most of them triggering exceptions because it never occurred to me they might happen, so I certainly didn't prepare for them.

Your good idea that was oh so close to completion and looked like a nice reasonable tight bit of code - well, something happened when we weren't looking. It's a creaky thing, prone to falling over at the first hint of trouble. There are cures for that, but they take time and attention and may have their own issues.

Your code accumulates "try:/except:"s all over, parameters get checked, comments describing important features get added, results and branches get logged; days and weeks go by and it mostly does the same thing, just more correctly. The code bloats up until you have to lose yet more time restructuring it and then chasing down and correcting the inevitable bugs that introduces.
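
The hardened version of any given piece ends up looking something like this sketch, with retries, timeouts, and logging wrapped around what used to be a one-liner (the function and its parameters are illustrative, not lifted from the real script):

    import logging
    import requests

    log = logging.getLogger("importer")

    def fetch_record(url, retries=3):
        """Fetch one record, logging and retrying instead of dying on the first surprise."""
        for attempt in range(1, retries + 1):
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                return response.json()
            except (requests.RequestException, ValueError) as exc:
                log.warning("attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
        log.error("giving up on %s", url)
        return None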

Even though the code no longer looks the same, or as nice, with some careful refactoring it won't be too bad. Get some good unit test coverage and automated tests going, then hook up the CI/CD. Uh, where to hook up the CI/CD? Well, that's another topic.

Simple ideas don't stay that way if they get worked on, I suppose. The process of using AngularJS to get the first level of details recorded, then Python scripts to do the heavier API work, data analysis, and SQL output gets more complex over time, but I'm also gaining increasing value.

A bit more re-architecting and getting past a few one-time startup issues (i.e. initial data load with throttling) and I'll have a much more useful set of information, broken out nicely in tables, where it can be searched and sorted and used to drive other processes.

The next layer of intellectual property is mastering how to manipulate that data to generate code, in this case for smartphone apps, that carries useful or important information.

I remember being told once that only an idiot, somebody who did not know what they were doing, used run-time code generation. He was wrong then, and still is. Plenty of smart people have joined the idiots doing run-time code generation, and it continues to be an appropriate way to solve various sorts of interesting problems.

This is simpler: this is back-end analysis and code generation, not real time. The code thus generated, in Java, will be compiled into an Android phone app and will apply the data to the user's benefit. This is data we gleaned from our complicated dance across the internet and an assortment of REST APIs, tricks, and simple approaches to aggregate and increase the usefulness of the results.
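
To make the idea concrete, back-end code generation here just means a script that reads the organized tables and writes out a Java source file for the Android project to compile. The sketch below is purely illustrative; the table, package, and class names are invented, and the real generated code stays under wraps for now.

    import mysql.connector

    db = mysql.connector.connect(user="root", password="...", database="minidata")
    cur = db.cursor()
    cur.execute("SELECT name, value FROM records")

    # Emit a plain Java class holding the data as a static map; the Android
    # project compiles it like any hand-written source file.
    lines = [
        "// Generated file - do not edit by hand.",
        "package com.example.generated;",
        "",
        "import java.util.HashMap;",
        "import java.util.Map;",
        "",
        "public final class VenueData {",
        "    public static final Map<String, String> ENTRIES = new HashMap<>();",
        "    static {",
    ]
    for name, value in cur:
        lines.append('        ENTRIES.put("%s", "%s");' % (name, value))
    lines += ["    }", "}"]

    with open("VenueData.java", "w") as out:
        out.write("\n".join(lines) + "\n")

Nothing happens at run time on the phone; the generated class just ships inside the app like any other source file.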

I enjoy the complex mental work, figuring out where something that can be leveraged into something useful is available without too much effort (and for free is always nicest). I can see how the IP will fit together before I've written a line of it. It will change a bit as I continue - next time I'll lead with Python, probably, it's just a bit more suited to ad-hoc on the fly API ingestion via code and reverse engineering - but the phases are all there, just the flavor of the mouthwash used in this step might change, or a different brush there...

Now I'm down in the guts of the architecture, across the first chasm and running the creaky/cranky engine that will cross the next one, getting me useful data properly organized in mySQL.

The final step is to flesh out a specific task that could use this data, write the code and design the data object (a map of something, basically) then write code to generate the specific implementation needed based on the data and the use case we're trying to solve. This would be easier to describe if it were public so I could explain exactly what is going on. The details are trade secrets for now. Maybe later, we'll see.

I plan on doing this final step multiple times, working out specific implementation forms then generating code for them based on insights from the data. The data can source an awful lot of different useful features if you can figure out how to extract the needed details and turn it into code. As I said, I enjoy this sort of work and look forward to the challenge.

Wednesday, February 15, 2017

What Was That?

My first 3 blog posts here on Gunkholing Bottomfeeders were kinda not really blog posts.

The first post is from 2012. I was teaching myself Android programming and I had quite a few videos of performances at the Vera Project, where I volunteered frequently.

I decided to build an app that would play my videos of bands at the Vera Project on your phone.

I figured out how to make that work with an awful UI - buttons all over the place, each for a different video, and you paged through the buttons. I had lots of videos, already in the hundreds back in 2012, so it was interesting. At least to me. I also added some buttons to get to the Vera Project Donation page, and info and links to tickets for upcoming shows.

UI By Engineer (me), Ouch!

I got it working, coded and hosted the back end on Google's App Engine, which had the most usable free tier at the time. I got an Android Developer account for $25, the only part of this that was not free. Well, not free in terms of money; the Android dev tools, the SDK, and Eclipse are all free. The time I put in recording videos and posting them and writing the app wasn't free, but it was fun and I learned all kinds of useful things.

Useful things like how to program and publish an Android app; Vera Video has been available since 2012 (closing in on five years as I write this) and has been installed a bit over 5,700 times total.

To make the list of upcoming shows work I had to provide a back end with a master list and have the application check in once a day to get the latest list. I also added a bit that allowed me to add more Vera Project videos to the phone apps on the fly. I "versioned" the lists, and the phone app would report its version when asking for the latest list of shows. If the version was out of date, code on the backend would add a list of all the new videos and their details to the response with the five upcoming shows, and the ugly UI would have even more pages of buttons that all show different Vera Project live music videos. I didn't maintain these for very long; manual updates to App Engine and Data Store are tedious and boring.
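
The versioning scheme was as simple as it sounds. A minimal sketch of the idea (not the original App Engine handler, and the data shapes are invented) looks something like this:

    CURRENT_VERSION = 7  # bumped whenever new videos are posted

    def build_daily_response(client_version, upcoming_shows, new_videos):
        """Return the next shows, plus new video details if the phone's list is stale."""
        response = {"version": CURRENT_VERSION, "shows": upcoming_shows[:5]}
        if client_version < CURRENT_VERSION:
            response["videos"] = new_videos
        return response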

When I went to publish the application I found that I had to post a privacy statement on the internet to link to from the Google Play Store. Finally I have reached the point of the story; that took a bit. The first post, from 2012, is the privacy statement I linked to from the Play Store.

VeraVideo is still up, so you can check it out if you feel like it; you can find it on the Google Play Store by searching for VeraVideo. I'm not sure if it even works anymore, but it's tiny and it shouldn't do any harm. Let me know in the comments if you tried it and whether it works.

The second post is less interesting technically and a bit bloody. I store photos of shows in Flickr and I wanted to blog about a particular photo that was mildly bloody.
EMP_BoB_Wk2--41
This was from a live performance, and shows don't get very bloody anyway, even punk or heavy metal shows. I noticed that Flickr allowed me to "post the photo to a blog" and I wanted to try that. I didn't want some funky one-off posting messing up my music blog, so I pointed it at the inactive (except for the privacy statement two years earlier) blog and tried it. The second blog post is the result. I ended up blogging on my music blog the way I always do; I didn't find the Flickr feature that useful. Good photo, anyway.

In the third post I introduce my vision and what I expect to blog about. It's more of an explanation and introduction to the blog than a blog post, to some degree.

That brings us to this, the fourth blog post and the first one that is really and completely a blog post. Hopefully this is a trend and from here on out there will be plenty of actual blog posts coming up. In any event, thanks for hanging in there and reading the whole blog post!

Tuesday, February 14, 2017

Vision

In most of my corporate jobs my employer made a point of having a formal "vision" statement. Why not?

My vision for this blog is to write about whatever takes my fancy on subjects that are at least tangentially related to tech.

Like all old farts, I've got stories. I've been working in tech since the seventies so I've loaded programs via switches (punched card readers? Luxury!) and worked on systems where assembly language was the highest level (and only) language available.

I don't think I'll be all that consistent in the kinds of posts I make. Sometimes I'll tell old stories, sometimes I'll talk about what I'm working on, sometimes I'll talk about what I wish I was working on. Sometimes I'll talk about interesting tech notes or applications I've come across.

At least that's the plan.

Enough vision, hopefully I'll get enough inspiration and time to fulfill my "sometimes" predictions.