
Data is Blood!


When one thinks about blood, one of the most obvious connections will probably be to vampires, especially for fans of horror films.


#TIL about Francis Marion Crawford (https://en.wikipedia.org/wiki/Francis_Marion_Crawford), who was an American author active around the turn of the 20th century. I had never heard of him, but when I did a search on the phrase “blood is life”, I found a reference to his short story “For the Blood is the Life” (https://gutenberg.org/files/40386/40386-h/40386-h.htm#Page_165), from his collection “Wandering Ghosts”.


It’s actually quite a good story, and got me thinking a bit about horror fiction in general, and vampire stories in particular. I’ve read a number of “vampire” books, from Bram Stoker’s Dracula (https://en.wikipedia.org/wiki/Dracula), to (most of) the Anne Rice Vampire Chronicles (https://en.wikipedia.org/wiki/The_Vampire_Chronicles), and even the Twilight series (https://en.wikipedia.org/wiki/Twilight_(novel_series)), though one of my favourites is “The Dracula Tape” (https://en.wikipedia.org/wiki/List_of_works_by_Fred_Saberhagen#The_Dracula_Series), by Fred Saberhagen. The latter is written from Dracula’s perspective and provides a wonderful contrast to most depictions, such as where Dracula refers to Abraham Van Helsing as “that idiot Van Helsing”, and holds him responsible for most of the tragedy in the book.

Then I had a look at some lists of vampire movies.

I had no idea how many vampire films I had seen over the years until I realized that I had seen a majority of the films on the “150 of the Best Vampire Movies Across all Genres” list (https://www.imdb.com/list/ls026145207/?sort=moviemeter,asc&st_dt=&mode=detail&page=1), ranging from the 1922 silent film Nosferatu (https://en.wikipedia.org/wiki/Nosferatu), to multiple versions of the Dracula story, most of the Hammer series (https://en.wikipedia.org/wiki/Dracula_(Hammer_film_series)), and even to comedies such as Mel Brooks’ “Dracula: Dead and Loving It” (https://en.wikipedia.org/wiki/Dracula:_Dead_and_Loving_It), and “Buffy the Vampire Slayer” (https://en.wikipedia.org/wiki/Buffy_the_Vampire_Slayer_(film)).

Back to my original point, though: describing data as the lifeblood of most organizations is actually a very apt metaphor, especially in the Internet Age.

I’ve written a bit about what databases are and how they work (https://www.til-technology.com/post/data-basics-what-is-a-database), and commented on the value of treating data as if it were radioactive (https://www.til-technology.com/post/data-is-radioactive), but haven’t really gone into what data is and how it is used.

The way people think about data varies widely, and depends on factors such as what they have available, what form it is in, how much is available, and what they want to do with it.

For most, the focus will be on a transaction, or single record. I bought a widget – was the payment processed correctly? I want to go to my friend’s house – do I have the correct address? This is what most might consider the “easy” part, though it really isn’t.

Take something as “simple” as a phone number. When I was young, a local call required 7 digits, long-distance required 10, and international calls generally required an operator. Now, we have “country codes”, which may or may not correspond to a single country (e.g., the US and Canada both use “1”), along with a variable number of digits.


In Canada (https://en.wikipedia.org/wiki/Telephone_numbers_in_Canada), we use the country code “+1”, followed by a three-digit area code, a three-digit central office code (aka “exchange code”), and a four-digit station code, but different countries handle things differently.


And what about the “type” of phone number? Can/should we try to distinguish between a “Home” number and an “Office” number? How about “Mobile” numbers? And how many should we store?
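
To make that concrete, here is a minimal sketch in Python of one way to store a number so that the country code, the variable-length national number, and the “type” stay separate. The field names and sample numbers are my own invention, purely for illustration:

from dataclasses import dataclass
from enum import Enum

# Hypothetical labels; the "right" set depends entirely on your business needs.
class PhoneType(Enum):
    HOME = "home"
    OFFICE = "office"
    MOBILE = "mobile"

@dataclass
class PhoneNumber:
    country_code: str      # "1" covers both Canada and the US
    national_number: str   # variable length, depending on the country
    number_type: PhoneType

    def formatted(self) -> str:
        # Render in the international "+<country code><national number>" form.
        return f"+{self.country_code}{self.national_number}"

# A contact might carry several numbers; how many to keep is a design decision.
contact_numbers = [
    PhoneNumber("1", "4165550123", PhoneType.MOBILE),
    PhoneNumber("1", "4165550199", PhoneType.OFFICE),
]
for number in contact_numbers:
    print(number.number_type.value, number.formatted())
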

Moving away from the transaction/record level, the sky is the limit, and it becomes vital to understand what you want to accomplish. Leaving aside larger-scale analysis, which can be lumped together under the label “big data” (https://en.wikipedia.org/wiki/Big_data), and about which I know little, let’s focus more on the area of business intelligence (https://en.wikipedia.org/wiki/Business_intelligence).


My first exposure to Business Intelligence was in the context of data warehousing. We were building a data warehouse (https://en.wikipedia.org/wiki/Data_warehouse) which focused mainly on financial and marketing data, and I worked on some of the data feeds from the source systems, along with the metadata and security sub-systems. One day, my supervisor came by my desk, handed me a box (software used to actually come in BOXES, which usually contained printed manuals and disks containing the software – in this case, it was a CD-ROM), and said: “Here. Learn this, please.”


The box was labelled “Brio”, and my initial response was: “That’s a drink, isn’t it?”


In fact, yes, it is a drink (https://en.wikipedia.org/wiki/Brio_%28soft_drink%29), which was quite familiar to people in Ontario, and #TIL that it was actually created in Toronto, Canada. I couldn’t remember whether I had tasted it before, though, so I tried it again at my earliest opportunity. (Spoiler: It’s not really to my taste.)

In any case, the box contained a version of the Brio (https://en.wikipedia.org/wiki/Brio_Technology) business intelligence tool (later acquired by Hyperion, and then by Oracle), with which I was fairly heavily involved for the next several years. Up until that point, my data analysis work was dominated by building my own SQL scripts, but Brio was a game-changer for me, and greatly simplified ad hoc data analysis and reporting.

One of the fascinating things about data analysis and data warehousing is that it can be both simple and difficult at the same time.

It’s really all about understanding the data model and breaking things down.

To invent a trivial example, take the database from our friend Alice’s bookstore. We have an inventory table and a sales table, and we want to build some summary reports.
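
Since Alice’s bookstore is imaginary, the table and column names below are my own assumptions, but a minimal sketch of the two tables (in Python, using an in-memory SQLite database) might look something like this; note that it includes both a current price and a sale-time price, a distinction that matters later on:

import sqlite3

# An in-memory stand-in for Alice's database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (
        book_id       INTEGER PRIMARY KEY,
        title         TEXT,
        current_price REAL,     -- the price today
        on_hand       INTEGER   -- copies currently in stock
    );
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        book_id    INTEGER REFERENCES inventory(book_id),
        sale_date  TEXT,
        sale_price REAL         -- the price at the time the book was sold
    );
""")

# A few made-up rows, including one sale made before a price increase.
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?, ?)",
                 [(1, "Dracula", 12.00, 5),
                  (2, "Wandering Ghosts", 9.50, 3)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [(1, 1, "2024-01-10", 10.00),
                  (2, 2, "2024-02-02", 9.50)])
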

A beginner would probably come in and build a query to pull together all the data, then show it to Alice, only to hear that the numbers are wrong. There would then be an exercise in trying to find out why the numbers are off, and what they should be, often at a higher-than-usual level of stress (because Alice will be concerned about the original report being “wrong”).

Someone with more experience might come along and try to understand exactly how the inventory and sales tables relate to each other, and might quickly discover that the inventory table has the current price, while the sales table has the price at the time sold. Then, they would get a total of all books in inventory, a count of books sold, and a total of all sales, and see if they match.

If they don’t match, something is wrong. Simple. Now, by “match”, I mean that the numbers can be reconciled, not necessarily that they are the same. In the example above, the analyst might work to determine the historical prices of the books, and confirm that the differences are explained by the historical vs current pricing.
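
Continuing the sketch above, a simple reconciliation step might look like the following; the query is purely illustrative, and assumes the hypothetical tables and connection from the earlier snippet:

import sqlite3

def reconcile(conn: sqlite3.Connection) -> None:
    # Compare what the sales actually brought in (sale-time prices) with what
    # the same books would total at today's prices; the gap should be
    # explainable by price changes.
    books_sold, at_sale_price, at_current_price = conn.execute("""
        SELECT COUNT(*),
               SUM(s.sale_price),
               SUM(i.current_price)
        FROM sales s
        JOIN inventory i ON i.book_id = s.book_id
    """).fetchone()

    print(f"books sold:                {books_sold}")
    print(f"total at sale-time prices: {at_sale_price:.2f}")
    print(f"total at current prices:   {at_current_price:.2f}")
    print(f"difference to explain:     {at_current_price - at_sale_price:.2f}")

reconcile(conn)  # with the sample data above, the 2.00 gap is the price change
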

However, and this is critical, getting the same number does NOT necessarily mean that the data is correct. You can’t prove a negative, so all you can really say is that you have tested X, Y, and Z, and those tests have all been successful. Still, if you take small steps that are easily reconcilable, it’s much easier to know exactly when and where something goes wrong. Then, you can dig into it and learn why.

Of course, if everything comes out perfectly the first time around, then you should worry. From my experience, it usually means something disastrous has happened, since things never come out perfectly on the first pass. See https://www.til-technology.com/post/oopsie for more on mistakes...

Cheers!
