Data Basics – What is a database?

#TIL that the falling green code seen in the Matrix (https://en.wikipedia.org/wiki/The_Matrix_(franchise)) movies is sometimes referred to as “digital rain” (https://en.wikipedia.org/wiki/Matrix_digital_rain). Cool!

While the code is virtually meaningless (pun intended!), these iconic images are a fantastic way of visually expressing the idea of data as the lifeblood of the system. Not just the Matrix, but the real world. Ultimately, from the perspective of the Internet, we are all just virtual raindrops in a sea of data.

Data is all around us, but we don’t want or need most of it. We filter data, we examine it, we judge its quality, then we decide what to keep, and how, and for how long. On and on, endlessly.

I see terms like “data”, “information”, and “intelligence” used with reckless abandon, and near-interchangeably. I also see attempts to distinguish between them in wildly different ways, but the focus seems to generally be on the idea of data being “raw” and information being “processed” or given context.

At the simplistic end of the spectrum, one example (https://www.computerhope.com/issues/ch001629.htm) offers

“UT, 1234, Joe, Circle, SLC, 8015553211, 84084, Smith”

as an example of data, and

“Joe Smith 1234 Circle Salt Lake City, UT 84084 (801) 555-3211”

as an example of information – ie some formatting, and the context of a person’s contact information.

Technically correct, but not particularly useful in practice.

Daniel Miessler (https://danielmiessler.com/blog/difference-data-information-intelligence/) offers what I would consider a far more useful set of definitions for data, information, and intelligence. To summarize, data refers to facts or individual observations, information is a “useful” collection of data, and intelligence uses information to support actual decision-making.

I think this is an excellent practical definition, but not what I wanted to focus on. At this point, I’m more focused on how data is stored and organized. (I’ll use the term “data”, and won’t bother debating whether or not the organization and context provided by a database actually transforms it into “information”...)

When it comes to data storage and data management, it is important to note that there are a number of different ways to store and manage data (https://en.wikipedia.org/wiki/Database), but the most popular (currently) is the relational database. I may look into the early history of databases, but not today.

In recent years, “NoSQL” (https://en.wikipedia.org/wiki/NoSQL) databases have been rising dramatically in popularity, partly because of the incredible growth in the quantity of data collected, and the different types of data being generated. I mentioned MongoDB in the context of understanding “the rules” (https://www.til-technology.com/post/know-the-rules-before-you-break-them), and there is quite a lot of hype about these newer products, but there are certainly justifications for their use. (My pet peeve is mainly that people jump into new technologies because they’re “cool”, rather than because there is a clear justification for using them.)

But, back to relational databases.

A relational database is based on the “relational model” (https://en.wikipedia.org/wiki/Relational_model) which was developed by Edgar F Codd, an English computer scientist then working at IBM, and described in two papers, “Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks” (1969) and “A Relational Model of Data for Large Shared Data Banks" (1970).

It should be noted that, when we talk about “relational databases”, we usually include products which purists would condemn as not being “truly” relational. Codd published his “12 rules” https://en.wikipedia.org/wiki/Codd%27s_12_rules to try and emphasize this point, while Fabian Pascal (https://en.wikipedia.org/wiki/Fabian_Pascal) is notorious for his harshly-worded exchanges with some of his interlocutors (See http://www.dbazine.com/ofinterest/oi-articles/pascal11/ for an example, or https://www.dbdebunk.com/) Sometimes highly technical, but very interesting and they can be quite entertaining.

To simplify (and bypass the drama), a relational database is based on the idea of tables (aka “relations”), which consist of rows (aka “tuples”) and columns (aka “attributes”).

Let’s say that Alice runs a bookstore, and her inventory looks like this:

You could store this in a single spreadsheet (ie, table), but it would have a lot of redundancy, so any change would need to be made in several places. For example, if the publisher “Catastrophic Publishing” moved their headquarters to Belgium, you would need to update the “Publisher Country” in four records. And if you missed one, you might have problems that can be very difficult to find.

It should be noted that this example is trivial. If you are dealing with hundreds of thousands of records, the challenges of a poor data model can be catastrophic.

One of the features/benefits of a relational database is the “normalization” of the data (https://en.wikipedia.org/wiki/Database_normalization), which can dramatically improve the efficiency of storage and query of data elements.

If, instead of the list above, we designed our database a bit differently, we might have several tables.

Book:

Publisher:

Title-Format:

This might seem more complicated, but consider the case where the publisher moves. Rather than updating multiple records in the book table and potentially missing some, we simply update a single record in the publisher table, and include the publisher ID (which is independent of all the other data).

Ok, I guess, but what’s this “Title-Format” table for?

Well, the relational model is built on “one-to-one” or “one-to-many” relationships, so part of normalization is resolving any situations where we would have a “many-to-many” relationship. (The reason for this is that a many-to-many relationship creates redundancy, which is inefficient).

The relational model is based on relational algebra, and can be formally-defined with great clarity, which is part of why it has been so successful. For most applications of the last 50 years, relational databases are solid, reliable, stable, and efficient. It’s important to note that the data model doesn’t necessarily change how Alice updates the data, just how the data is stored. The vast majority of people use programs or interfaces, which can be built in a variety of ways. I may discuss the use of SQL (Structured Query Language) at some future date.

Ok. So why do many people consider relational databases “old-fashioned”?

To some extent, because of hype and lack of understanding of the theory underlying the technology, which can lead to adoption of new technologies without really understanding their benefits and drawbacks. As an example, a number of technology people (who should know better) do not understand the difference between logical and physical normalization – to vastly oversimplify, think of “logical” as being what the humans see, and “physical” as the way the database actually organizes the data.

That said, when we start having to deal with vast volumes of data scattered all over the world, or thousands of simultaneous transactions, or data models which change very rapidly, we start encountering the limitations of the relational model in practice. I may compare relational databases with other types at some future date.

In the end, data is essential to our modern world, so I think we all benefit from a better understanding of what it is, how it works, and how it is stored and managed.

Cheers!

“In science, 'fact' can only mean 'confirmed to such a degree that it would be perverse to withhold provisional assent.' I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms."

Stephen Jay Gould

Today I Learned

Data Basics – What is a database?

Recent Posts

Kommentare