Data: Analysis is the Easy Part

Rian Schmidt

October 20, 2023

You might be surprised how much time goes into just getting the data that all the amazing analytics magic runs on. I've read estimates that half (or more) of the time in data analysis is spent acquiring and cleaning data.

One of the convenient things about today's celebrity AI, large language models, and one that gives them an advantage over other approaches, is that they largely train on "unstructured" data. Here's the Internet, derive what you can from it.

Traditional analytics, on the other hand, requires very structured data in which you know exactly what each piece of data represents. A huge part of that job is acquiring and cleaning the data and making sure that the pipeline works from end to end.

Yesterday, I learned that we're receiving signals from one client in seven different ways-- pulled from BigQuery, Amazon, FTP, emailed internal data, faxed, tied to a brick and thrown through a window... OK, maybe not those last two, but seriously, seven different ways-- several of them requiring manual intervention on one end or the other.

Then, after we do get that data, the processing still breaks. Why? It's usually a fun surprise! Sheets or columns in Excel files change names because someone decided the old ones weren't descriptive enough. Or they change a timezone or apply a multiplier to some column to normalize it for their purposes. Maybe, for the first time, there's no value, so instead of a zero, they put "N/A" in there. Hey! That's not a number! Error!
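For the renamed-column and "N/A" surprises, here is a minimal sketch of the kind of check that turns them into a clear message instead of a mystery crash. It uses only Python's standard library, and everything specific in it is an assumption for illustration: the column names, the list of "missing" markers, and the sample data are all made up, not taken from any real client feed.

```python
import csv
import io

# Hypothetical schema and placeholder values, assumed for illustration only.
EXPECTED_COLUMNS = {"date", "region", "revenue"}
MISSING_MARKERS = {"", "N/A", "NA", "null", "-"}


def load_rows(text):
    """Parse CSV text, failing loudly on renamed or missing columns and
    mapping known missing-value markers to None instead of crashing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        raw = (row["revenue"] or "").strip()
        if raw in MISSING_MARKERS:
            row["revenue"] = None  # missing, which is not the same as zero
        else:
            try:
                row["revenue"] = float(raw)
            except ValueError:
                raise ValueError(
                    f"line {line_no}: revenue {raw!r} is not a number"
                ) from None
        rows.append(row)
    return rows


sample = "date,region,revenue\n2023-10-19,west,120.5\n2023-10-20,west,N/A\n"
rows = load_rows(sample)
print(rows)
```

This only catches the surprises you already thought of. A silently applied multiplier or a shifted timezone still parses as a perfectly good number, so it would sail straight through a check like this.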

Then, there's the sheer mechanical nature of interfacing anything with anything else. Emails get filtered. APIs don't answer. Data files get too big and processing times out. The network drops mid-transfer. Someone sends yesterday's data again today. Something else spins up on the processor and consumes all the available resources. And everything just stops. Surprise!
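Of those, "someone sends yesterday's data again today" is one of the easier ones to guard against. A rough sketch, assuming deliveries arrive as raw bytes and that byte-identical content means a resend: fingerprint each delivery and compare against what has already been seen. The `DuplicateGuard` class and the file names are hypothetical, and the in-memory dictionary stands in for whatever persistent store a real pipeline would use.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Content hash of a delivery; identical bytes give an identical digest."""
    return hashlib.sha256(payload).hexdigest()


class DuplicateGuard:
    """Remembers fingerprints of earlier deliveries (in memory only)."""

    def __init__(self):
        self._seen = {}  # digest -> label of the first delivery with it

    def check(self, label, payload):
        """Return the label of an earlier identical delivery, or None if new."""
        digest = fingerprint(payload)
        earlier = self._seen.get(digest)
        if earlier is None:
            self._seen[digest] = label
        return earlier


guard = DuplicateGuard()
monday = b"date,revenue\n2023-10-16,100\n"
tuesday = b"date,revenue\n2023-10-17,130\n"

first = guard.check("monday.csv", monday)       # new content
second = guard.check("tuesday.csv", tuesday)    # new content
resend = guard.check("wednesday.csv", tuesday)  # same bytes as tuesday.csv
print(first, second, resend)
```

A hash only flags exact resends. If the same data comes back with a new export timestamp or a different row order, the bytes differ and it looks brand new, which is its own fun surprise.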

In the data analytics world, much like software testing, all you can really do is automate and validate the scenarios you know about. It seems like every day, though, a new thing happens that makes us say "oh, right, I guess THAT could happen" and start thinking about how to prevent it next time.

Meanwhile, a rat chews through a cable somewhere.