Well, the illustration above IS normal, but that’s actually a bit unusual. It’s “normal” to assume that everyone has the same understanding of the word “normal”, but they often don’t.
I have often encountered issues where people stumble over words for which they assume everyone shares a definition. One of my favourite examples (not sure if I have mentioned it previously) was when we were trying to define reports by “office”, and managed to simplify things to a mere three definitions.
But wait! Everyone knows what an office is, right?
Yes. Absolutely. Sure. And everyone started by assuming that everyone shared their definition. There was a great deal of confusion until we realized that the HR team assumed a person’s office was their physical office location, the management team assumed it depended on their organizational reporting structure, and the Finance team assumed it depended on their cost centre. So you could (invented example) have someone who was in the “New York” office (HR), the “Corporate Finance” office (Org) and the “US Northeast” office (cost centre).
The word “normal” (https://www.merriam-webster.com/dictionary/normal) has several definitions, including a chemical definition and the mathematical one noted above. Among the others is “approximating the statistical average or norm”, which simply describes how common a situation is. Another is “generally free from physical or mental impairment or dysfunction : exhibiting or marked by healthy or sound functioning”, which relies on terms that are subjective, technical, or both.
Look at it this way. A person suffering from depression can be both “normal” and “not normal” at the same time, depending on which definition is used. This person may be struggling with the pressure of the COVID-19 pandemic and reacting much as “most people” might – so, “normal” according to the first definition listed above.
At the same time, they might (or might not) satisfy the criteria for a diagnosis of “depression” by a medical professional, which describes a “mental impairment or dysfunction”. Based on the second definition, this makes the person “not normal”.
In practice, it’s probably safest to avoid value judgments based on a term like “normal”.
So, with regard to normalization of data, I have previously commented on it, both when I noted that I originally learned about it without ever hearing the word “normalization” (https://www.til-technology.com/post/it-was-twenty-years-ago-today), and when defining what we mean by a database in the first place (https://www.til-technology.com/post/data-basics-what-is-a-database). Still, it’s hard to over-emphasize the importance of understanding the basic concepts if you want to understand business intelligence and reporting (https://www.til-technology.com/post/data-is-blood).
Basic data modelling for a system should formally be broken into “logical” data modelling and “physical” data modelling. (I’ll describe things in traditional “relational” terms, but the concept applies more generally.)
Logical modelling is where we try to understand how data will be gathered or used, generally from a business perspective. Going back to our example of Alice’s bookstore, we need to understand things like inventory, sales, and reporting. The data analyst needs to understand and dig into things which may seem trivial but can be vital. A superficial understanding might lead one to believe that a “book” has a “price”, and fail to understand that a “book” might include multiple editions with different prices, such as a hard-cover, soft-cover, e-book, spoken word recording, etc. There may also be significant differences in how these different editions are handled – there’s a big difference between a shipment of 1000 physical books, and hosting an e-book.
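To make that distinction concrete, here is a minimal sketch (in Python, with invented names like Book and Edition) of how a logical model might separate the “work” itself from its purchasable editions. It’s an illustration of the idea, not a real model from Alice’s system.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum


class Format(Enum):
    HARDCOVER = "hardcover"
    SOFTCOVER = "softcover"
    EBOOK = "ebook"
    AUDIOBOOK = "audiobook"


@dataclass
class Edition:
    """A purchasable form of a book -- each format can carry its own price."""
    format: Format
    price: Decimal
    isbn: str


@dataclass
class Book:
    """The logical 'work' itself: a title and an author, but no single price."""
    title: str
    author: str
    editions: list[Edition] = field(default_factory=list)


# One "book" from Alice's shelves, available in two editions at different prices.
example = Book(
    title="An Example Title",
    author="A. N. Author",
    editions=[
        Edition(Format.HARDCOVER, Decimal("34.99"), "978-0-0000-0000-1"),
        Edition(Format.EBOOK, Decimal("9.99"), "978-0-0000-0000-2"),
    ],
)
```

The point isn’t the code itself – it’s that the analyst has already decided a “price” belongs to an edition, not to a book, and that decision came from digging into how the business actually works.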
In contrast, physical modelling is where we take a logical model and make decisions about how to implement it. It is highly dependent on the quality of the logical model, and many organizations short-change the modelling process, leading to many issues and rework. (I was involved in a project where the logical modelling was done, and then leadership announced that the first version of the tool we were building would be deployed in a very short time frame. The practical impact was that we basically had to skip the physical modelling and implement the logical model directly – we got it to work, but had issues for years to come as a result of being forced to short-change that step.)
This is the point where things get interesting, though. As a general rule (over-simplifying, of course), a transactional application (such as a point-of-sale system) should be “fully” normalized (https://en.wikipedia.org/wiki/Database_normalization), so that transactional processing can be handled as quickly and efficiently as possible.
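As a rough illustration of what “fully normalized” might look like for Alice’s point-of-sale data, here is a small SQLite sketch. The table and column names are invented, and a real schema would go further (more entities, constraints, indexes), but the shape is the point: each fact lives in exactly one place, and tables refer to each other by key rather than repeating values.

```python
import sqlite3

# A small, normalized point-of-sale sketch: facts are stored once,
# and tables refer to each other by key rather than repeating values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    author_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE book (
    book_id     INTEGER PRIMARY KEY,
    author_id   INTEGER NOT NULL REFERENCES author(author_id),
    title       TEXT NOT NULL
);

CREATE TABLE edition (
    edition_id  INTEGER PRIMARY KEY,
    book_id     INTEGER NOT NULL REFERENCES book(book_id),
    format      TEXT NOT NULL,            -- e.g. 'hardcover', 'ebook'
    price_cents INTEGER NOT NULL
);

CREATE TABLE store (
    store_id    INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE sale (
    sale_id      INTEGER PRIMARY KEY,
    edition_id   INTEGER NOT NULL REFERENCES edition(edition_id),
    store_id     INTEGER NOT NULL REFERENCES store(store_id),
    sold_at      TEXT NOT NULL,           -- ISO-8601 timestamp
    amount_cents INTEGER NOT NULL
);
""")
```

Recording a sale only has to insert one narrow row into the sale table, which is a big part of what keeps transactional processing fast.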
But what about reporting? What if I want to generate monthly reports on sales across different stores, or annual reports on sales for books by a specific author?
In years past, many organizations built “data warehouses” (https://en.wikipedia.org/wiki/Data_warehouse) to support enterprise reporting, usually pulling data from multiple transactional systems into a single reporting system. There are several approaches available for organizing the data, along with endless debate and criticism that I won’t get into.
One model is the “snowflake” schema (https://en.wikipedia.org/wiki/Snowflake_schema), which keeps the data relatively normalized while organizing it for analytical reporting.
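A hand-wavy sketch of the idea, again with invented names: a central fact table records the sales, while the dimensions (book, author, store, region) stay in their own, relatively normalized tables.

```python
import sqlite3

# Snowflake-style sketch: one fact table, with dimensions kept relatively
# normalized (e.g. region is split out of the store dimension).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (
    region_id   INTEGER PRIMARY KEY,
    region_name TEXT NOT NULL
);

CREATE TABLE dim_store (
    store_id    INTEGER PRIMARY KEY,
    region_id   INTEGER NOT NULL REFERENCES dim_region(region_id),
    store_name  TEXT NOT NULL
);

CREATE TABLE dim_author (
    author_id   INTEGER PRIMARY KEY,
    author_name TEXT NOT NULL
);

CREATE TABLE dim_book (
    book_id     INTEGER PRIMARY KEY,
    author_id   INTEGER NOT NULL REFERENCES dim_author(author_id),
    title       TEXT NOT NULL
);

CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    book_id      INTEGER NOT NULL REFERENCES dim_book(book_id),
    store_id     INTEGER NOT NULL REFERENCES dim_store(store_id),
    sale_date    TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
);
""")
```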
In contrast, a number of reporting systems denormalize (https://en.wikipedia.org/wiki/Denormalization) their data to a greater degree, to simplify visualization and query generation. This might seem like it’s reversing the process of normalization and giving up the advantages, but it’s really not. The idea is to organize the data in such a way as to optimize the efficiency of reporting, rather than that of transactions.
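For contrast, here is the kind of wide, denormalized table a reporting layer might expose (column names invented): the descriptive attributes are copied onto every sale row, so typical reports need no joins at all.

```python
import sqlite3

# Denormalized reporting sketch: one wide row per sale, with descriptive
# attributes (title, author, store, region) repeated on every row.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales_reporting (
    sale_date    TEXT,
    store_name   TEXT,
    region_name  TEXT,
    title        TEXT,
    author_name  TEXT,
    format       TEXT,
    amount_cents INTEGER
)
""")

# "Monthly sales by author" becomes a single-table query, with no joins.
monthly_by_author = """
SELECT substr(sale_date, 1, 7) AS month,
       author_name,
       SUM(amount_cents) / 100.0 AS total_sales
FROM sales_reporting
GROUP BY month, author_name
ORDER BY month, total_sales DESC
"""
for row in conn.execute(monthly_by_author):
    print(row)
```

The storage is redundant, but the queries the business actually asks for become simple and cheap to generate – which is exactly the trade-off being made.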
One issue I have seen many times is dogmatism associated with different approaches, often (sadly) combined with a lack of thorough understanding of the underlying issues. In this case, people talk about denormalization as “good” or “bad”, often without even understanding that there is more than one type.
Similar to logical vs physical modelling, we can also have logical or physical denormalization. Logical denormalization refers to defining a database query (such as a view) and building reports on that query, while physical denormalization refers to storing the denormalized data in a separate table.
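As a rough sketch of the difference (using SQLite and invented names): logical denormalization might be a view over the normalized tables, while physical denormalization materializes the same join into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A few normalized base tables to denormalize against.
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE book   (book_id INTEGER PRIMARY KEY,
                     author_id INTEGER NOT NULL REFERENCES author(author_id),
                     title TEXT NOT NULL);
CREATE TABLE sale   (sale_id INTEGER PRIMARY KEY,
                     book_id INTEGER NOT NULL REFERENCES book(book_id),
                     sold_at TEXT NOT NULL,
                     amount_cents INTEGER NOT NULL);

-- Logical denormalization: a view; the join is re-run whenever it is queried.
CREATE VIEW v_sales_flat AS
SELECT s.sold_at, b.title, a.name AS author_name, s.amount_cents
FROM sale s
JOIN book b   ON b.book_id   = s.book_id
JOIN author a ON a.author_id = b.author_id;

-- Physical denormalization: the same join materialized as a separate table;
-- it costs extra storage and needs refreshing, but reads come straight back.
CREATE TABLE sales_flat AS
SELECT s.sold_at, b.title, a.name AS author_name, s.amount_cents
FROM sale s
JOIN book b   ON b.book_id   = s.book_id
JOIN author a ON a.author_id = b.author_id;
""")

# A report can be written identically against either one.
for source in ("v_sales_flat", "sales_flat"):
    print(source, conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0])
```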
From the end user’s perspective, the two approaches usually appear identical, so the debate is generally on the technology side. Some say that logical denormalization is superior due to the efficiency of storage, while critics say that the database queries are less efficient due to their complexity. Others say that physical denormalization is superior due to all data being pulled from a single location, while critics say that the cost of additional storage and pre-processing is sub-optimal.
As always, the issue is complex and the answers have changed over the years as the technology has evolved. In years past, a reporting system generally focused on structured data (i.e., data from transactional systems), but nowadays many reporting systems also include semi-structured or unstructured data, and the environment is vastly more complex than it used to be.
For my own part, I always seem to come back to the same perspective: the issue is almost always more complex than portrayed, and it’s important to understand at least a bit about the underlying concepts and issues.
I’m not sure what “normal” is, but everyone should try to be their best selves, and try to figure out what is really going on.
Cheers!