Sunday, February 19, 2017

Mini Data

Heard of "big data"? Billions of records, operations so complex and huge that fleets of computers work on them?
Yeah, I'm not doing that. I found an interesting source of useful data to mine; it's got a bit over 100K records. Not billions, no multiple sources to cross-tabulate, just a linear "fetch A, use it as a key for B" process. Not quite trivial, but pretty direct. I'm afraid I couldn't even call it medium-sized data; it's small enough that I'm just stuffing the results into a MySQL database I brought up for that purpose on my laptop. The horror! Not a pool of DB machines, not even a separate DB machine, just another process grinding away on the laptop. The overhead is so small that I don't particularly notice it.

I'll christen this mini-data: not quite as small-scale and non-uniform as micro-data, but not big at all. Micro-data is the bottom-up stuff, the Internet-of-Everything background noise of every minor device chattering away in some odd dialect out at the edge of the network mesh, in our kitchens and thermostats and cars and factories. Unlike micro-data, mini-data is still fairly regular and uniform. It comes from a single source, or a small number of them, so it's much easier to work with. "Much easier to work with" is a relative term, of course.

I got the import running, and a bit past 1,000 records it spewed chunks: Unicode characters I hadn't prepared to handle properly. Doh! Fixed that (a brutally ugly bit of a fix, but hey, it's a one-time process!). Somewhere over 2,000: crunch. Oddly formatted data led to a missing array element, and another run ended in an exception. Did I mention it's a one-off, so it's under-engineered? Like missing error checking and handling, so every error is catastrophic the first time it shows up.

I guess this is what you call "testing the quality in": by the time I get my data gathered and organized from top to bottom, I will have squeezed out all of the bugs that apply to my current data set. The approach is fragile and not reusable, but that appears to be the appropriate approach, for now. Error handling will inevitably improve as I go from a fatal failure every thousand records, to every 10,000, then 100,000, and finally none for the whole data set. At that point I've achieved my goal and won't run the process again. Until the inevitable source data update, that is, which for this source often includes major formatting changes, so the code may well not be reusable when I next need to repeat this operation. Ask me what I think of the trade-off in a month or two, and again after the source data updates.

Meanwhile, I'm enjoying my work in mini-data. Hopefully it will be as useful as we think it can be!
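For the record, stripped to its skeleton the import looks something like the sketch below. Every specific in it is a made-up stand-in rather than the real source or schema: the endpoints, the field names, the table, and the choice of requests and the MySQL connector are all placeholders, and the catch-all at the bottom is more where the script is headed than where it started. The "fetch A, use it as a key for B" loop and the just-enough error handling are the gist of it, though.

```python
# A sketch of the "fetch A, use it as a key for B" loop. The endpoints,
# field names, and table are placeholders, not the real source or schema.
import requests
import mysql.connector

db = mysql.connector.connect(host="localhost", user="mini",
                             password="mini", database="minidata")
cur = db.cursor()

def import_records(keys):
    failures = 0
    for key in keys:
        try:
            # Fetch A: the summary record that hands us the key for B.
            a = requests.get("https://example.org/api/a/%s" % key, timeout=10).json()
            b_key = a["detail_id"]  # a missing element here ended one early run

            # Fetch B: the detail record we actually want to keep.
            b = requests.get("https://example.org/api/b/%s" % b_key, timeout=10).json()

            # The brutally ugly one-time Unicode fix: flatten anything the
            # table wasn't set up to store rather than crash on it.
            name = str(b.get("name", "")).encode("ascii", "replace").decode("ascii")

            cur.execute(
                "INSERT INTO records (src_key, detail_id, name) VALUES (%s, %s, %s)",
                (key, b_key, name),
            )
        except Exception as exc:
            # Catch-all: note the failure and keep grinding instead of dying.
            failures += 1
            print("record %s failed: %s" % (key, exc))
    db.commit()
    return failures
```

Under-engineered, like I said, but every time the source data surprises me another line of that sort accretes, and eventually it runs clean over the whole 100K.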
