Data is radioactive

Data is, like, literally radioactive.

I was going to grumble about the distortion and misuse of the word “literally”, and declare its use as an intensifier a barbaric anomaly, but I’m afraid it just ain’t so. I can still dislike the usage, but it has been around since the 18th century, and in the dictionary (Merriam-Webster, at least) since 1909 (https://www.merriam-webster.com/words-at-play/misuse-of-literally).

If this were a recent development, I might still indulge in a bit of linguistic snobbery, but I will not descend into the pedantic (and insufficiently-informed) position of saying: “Of all the words in the English language that we’ve bastardized over the past couple of decades, the misuse of literally is among the most hideous.” (https://crossworded.com/misuse-of-literally/)

Sigh. I really need to stop all this silliness about double-checking things before I start offering opinions – it really doesn’t pay, particularly in this “post-truth” world. Facts are such pesky things...

Seriously, though, while it can be frustrating, it is important to make sure that we live in the real world – the one where objective facts exist, and opinions need to take facts into account. The one where facts and opinions are different things. We bear a moral, ethical, and sometimes legal responsibility for confirming (to the best of our ability) the information we receive AND DISSEMINATE.

At any rate, back to data.

We need data, and many of us seem to be (figuratively) addicted to it. I discussed the applicability of Moore’s Law (https://www.til-technology.com/post/nasty-thoughts) to data storage, and we see a near-endless and increasing appetite for data everywhere we look.

Most of us think of more data being better, because we “might need it” someday, or we create multiple copies in case one copy fails. In this age of vast storage capacities, “what could possibly go wrong?” (Nowadays, I usually hear that in Steve Gibson’s voice – https://twit.tv/shows/security-now).

We can think of data in a number of ways. How much do we have? How much do we use? How much capacity do we have? How much does it cost? Most people just buy another hard drive and keep piling on the cat videos.

For most IT people, the focus is usually on the reliability of the storage, recoverability of the data, and data retention requirements imposed on them by others (i.e., the “what” and “how”, rather than the “why”).

InfoSec and Compliance people will think more about the “why”, which is where we can start.

Most companies consider data a valuable commodity. (Disclaimer: I am using the term “commodity” in a non-technical sense.) They gather it, store it, buy it, sell it, and worry about how to protect it. Looked at this way, the main focus will be on quality controls, effective storage, and protection, which seems to be the way data is viewed by most people.

We don’t generally worry about the risk of having too much of something, except in the context of cost of storage, risk of rot, or similar considerations. In general, more is better, so long as you have the capacity to store it safely.

Also, we don’t generally think of commodities as “dangerous” (energy commodities like oil and natural gas are a bit different, since the value of the commodity is actually tied to their energy storage capacity), so thinking of data as a commodity leads us to think of it as relatively “safe.”

So, why change? What’s the point of considering data to be radioactive?

The challenge with the “commodity” model is that most people will think about risks to the data, but not the risks of collecting or storing the data in the first place. When storing most commodities, we focus mainly on stable storage (so the substance doesn’t deteriorate, rot, or become contaminated), safe transport, and the safety of the people managing the system.

But radioactive substances are different. When we think of radioactive substances, we have all of the risks of commodities, but also have a whole list of other regulations around how to protect ourselves from it, how to handle it, how to store it, how long we can store it, how we use it, how we are permitted to use it, how we dispose of it, and so on. We use radioactive substances in different ways, in different quantities, at different levels of purity, and have different standards (and legal requirements) for transport, storage, retention periods, and disposal.

By thinking of data as analogous to radioactive substances, we will start to think a lot more seriously about the risks associated with acquisition, storage, and use. The stuff is DANGEROUS! We need to be careful with it.

It’s important to note that we won’t necessarily ask different questions – we’ll just think of the questions and the answers in different ways.

As an example, if Alice decides to start a cat-video site, what data is needed, what is it for, and what are the questions we should be thinking of?

Cat Videos:

No problem. We’ll just get storage for them and.... Wait. Where will the videos come from? Is there risk that someone might upload copyrighted material? Are there any regulations which address how they must be managed? What about the new laws in the newly-formed country of Catlandia, which bans the storage, transport, and distribution of cat-videos?

Sure. Fine. Whatever. We’ll figure out how to minimize the nightmares associated with DRM (Digital Rights Management https://en.wikipedia.org/wiki/Digital_rights_management) and such, block access to our site from Catlandia, and decide that our cat-video repository will be managed at the basic level we use in our data centre. Problem solv-

Wait!

What about PII (http://en.wikipedia.org/wiki/Personal_data) and SPI (Sensitive Personal Information)?

Huh? What are you talking about? They’re cat videos!

Well, the video metadata (i.e., the text information embedded in a video file) could include the name and other information for the person creating or uploading the video. Or, the video could include an image showing some document (in a silly example, the cat jumps on me while I am working on my tax return). Or what about hoomans in the video – what if their faces are visible and someone uses facial recognition to identify them? And what if the cat is a citizen of Catlandia? What sort of consent do we need?

Er, ok. Well, we could scrub all metadata prior to upload, and develop a process for obfuscating text or faces...

Good. We’re starting to get somewhere. Moving on.

Member Data:

We need some way of identifying our members, so we’ll collect their name, favourite colour, email address, name of first pet, physical address, favourite book, mother’s birth name, phone number, second-grade teacher’s name, place of birth, passport number, date of birth, credit card details, driver’s license number, sex/gender information, race, credit history, health information, GPS location...

STOP!!!

What’s wrong? We need to be able to identify our users, right?

Exactly. But what do we NEED, and how dangerous is it? This is where the concept of radioactivity comes in handy. PII data is highly radioactive, and SPI even more so. If we store PII / SPI differently, it will generally cost more, so we’ll want to separate it from our “normal” data. We’ll also want to ensure that we don’t “contaminate” our normal data, as contamination will mean that the data is now PII / SPI and must be managed accordingly.

Why would we keep large quantities of uranium lying around if we’re not going to use it? All it’s going to do is cost a lot of money and risk contaminating everything around it.
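To make the separation and “contamination” idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the three levels, the field names, and the classification table are hypothetical, and a real classification would come from your own legal and compliance requirements. The point is only that a record is as radioactive as its hottest field, and that unclassified fields should be assumed hot until proven otherwise.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical radioactivity levels, from least to most dangerous."""
    NORMAL = 0
    PII = 1
    SPI = 2


# Hypothetical classification table for Alice's site (illustration only).
FIELD_LEVELS = {
    "video_title": Level.NORMAL,
    "upload_date": Level.NORMAL,
    "email": Level.PII,
    "phone": Level.PII,
    "health_info": Level.SPI,
}


def field_level(name):
    # Unclassified fields are treated as the most dangerous level
    # until someone has actually classified them.
    return FIELD_LEVELS.get(name, Level.SPI)


def record_level(record):
    # Contamination: a record is as radioactive as its hottest field.
    return max((field_level(name) for name in record), default=Level.NORMAL)


def split_record(record):
    # Separate the fields by level so each group can go to its own store.
    stores = {level: {} for level in Level}
    for name, value in record.items():
        stores[field_level(name)][name] = value
    return stores
```

With this sketch, a record holding only a video title and an upload date stays NORMAL, but adding a single email address makes the whole record PII unless the email is split off and stored separately.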

Also, when we get to PII / SPI, we need to start thinking of privacy legislation, globally: GDPR (https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and CCPA (https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act), to name just two.

First, do we really need to identify our users? If so, then what is the minimum requirement? For example, can we allow the user to define their own username, use an email address for identification, and a mobile phone number as an optional second-factor? (See https://www.til-technology.com/post/infosec-basics-multi-factor-authentication for more on this)

Do we need to store more than that? If so, why? And how will we protect it? What about all of that other information? In the past, we might just decide to collect it as “it won’t hurt, and we might need it someday”. Treating it as radioactive, however, leads us to avoid it if we don’t need it. It serves no purpose, and simply increases storage cost and risk.

But here’s the best part: If you don’t need it, and don’t collect it in the first place, you don’t need to worry about it. The cost associated with data you don’t store is zero, and the risk of losing data you don’t have is also zero.
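One way to put “don’t collect it in the first place” into practice is an allowlist at the point of collection. The sketch below is only an illustration under assumed names: the field names and the minimize_signup function are hypothetical, and the required/optional split simply mirrors the username, email address, and optional phone number suggested above. Anything not on the list is dropped before it ever reaches storage.

```python
# Hypothetical minimum data set for Alice's site (illustration only).
REQUIRED_FIELDS = {"username", "email"}
OPTIONAL_FIELDS = {"phone"}  # optional second factor


def minimize_signup(form):
    """Return only the fields we have decided we actually need."""
    missing = REQUIRED_FIELDS - set(form)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    allowed = REQUIRED_FIELDS | OPTIONAL_FIELDS
    # Everything outside the allowlist is discarded, not stored "just in case".
    return {name: value for name, value in form.items() if name in allowed}
```

The design choice here is an allowlist rather than a blocklist: a new field has to be justified and added deliberately, instead of being stored by default until someone notices it is dangerous.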

So, leaving Alice’s cat videos aside, the key point is to stop loading things you don’t need. Storage is comparatively cheap, but the cost and risk associated with losing data is not, and most people can dramatically reduce the risk of radioactive data by simply avoiding it.

Interestingly, while reading up on a few things, I came across a paper which uses the radioactivity analogy in the context of using “radioactive data” to determine whether that data was used to train a machine learning model, in the same way that radioactive isotopes are used for medical imaging. Very cool! (https://proceedings.icml.cc/static/paper_files/icml/2020/3974-Paper.pdf)

Cheers!

“In science, 'fact' can only mean 'confirmed to such a degree that it would be perverse to withhold provisional assent.' I suppose that apples might start to rise tomorrow, but the possibility does not merit equal time in physics classrooms.”

Stephen Jay Gould
