Data Management for Science

I follow @AnneWHilborn on Twitter because she tweets out awesome wildlife photos. You should, too, because wildlife photos make your Twitter feed happier.

I am not a wildlife biologist, and so I usually don’t have much to contribute to the conversations in her tweet stream. But yesterday, she was having a conversation about data management, and that is something I know something about. In fact, scientific data management is my “technical home,” the subject area in which I’ve spent most of my career.

Here’s one of the early tweets. You should be able to find a lot of the conversation around this tweet.

My particular little corner of scientific data management has primarily been in drug discovery and development, but there are some aspects that I think are fairly broadly applicable. Watching the tweets about data management go by made me want to write something more substantial than 140 characters on the topic, and since I have a blog… I decided to do just that.

First of all, I want to make it clear that nothing in this post is meant as a judgment on anyone’s data management practices. The most fundamental thing I know about data management is that it is hard to get right, and that you will almost certainly have a stomach-sinking “oh no!” moment at some point. I have been working in this area for more than fifteen years, and there are still aspects of my personal data management that I would be embarrassed to share. There are always trade-offs.

However, I have learned some things over the years. Here are some general things to consider as you decide how you want to manage your scientific data.

There is no such thing as a “set and forget” data management solution.

No matter what you do, you will need to put some effort into keeping your data accessible. The main culprits that people forget about are (1) file formats become obsolete, (2) storage formats become obsolete (I finally threw out the Iomega Zip drive with files from my PhD thesis…), (3) storage formats get corrupted, and (4) catastrophes happen.

As you think about how to handle these issues, it is easy to get overwhelmed. The best way to avoid that is to take a risk-based approach. Think about how hard it would be to recover the data. Think about just how screwed you would be if the data were to be unrecoverable. Think about how the answer to that second question changes with time. And then think about how fast things change, and make a plan to keep your data reasonably safe and accessible over the time period in which you would be unacceptably screwed if the data got lost. This gives you a feeling for the minimum amount of effort you need to expend to keep your data safe. It is usually a good idea to then expend a little extra effort, just in case.

[Image: baby accessing computer. Caption: A potential catastrophe. Or a future data management expert.]

Scientists usually think their data should be saved forever, but when you start asking questions, it often becomes clear that after some point in time, techniques are likely to have improved to the point that it would be more efficient to regather data than to try to reanalyze old data. The point at which this happens is highly dependent on the type of data, but it is good to think about it, because trying to make all of your data absolutely safe forever is a very expensive proposition.

Let’s say you decide you want to try to keep your data safe and accessible for ten years. This is actually a very long time in data land. What do you need to do to achieve that? First, decide what physical format you’re going to use. If you choose something like DVD, learn about the decay rate of that storage format, and decide how often you’ll refresh to new storage. If you choose to keep it live in the cloud, check into what the service provider guarantees for data that haven’t been touched in years. If you have a local server on which you plan to store it, keep in mind that operating systems reach “end of life” and there will be migrations required every few years.

If your data is constantly changing (e.g., you are adding observations), you will also need to think about short-term backup and recovery. What happens if you screw up one data entry? How easily can you restore to an earlier point? How long do you have to realize that you need to restore to that point? It may be that you need two different backup solutions: one for long-term recovery, and one for short-term “oops” protection.
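To make the short-term “oops” protection concrete, here is a minimal sketch in Python, assuming your data lives in a single file. The function names `snapshot` and `restore_latest` and the timestamped naming scheme are my own illustration, not a standard tool.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(data_file: Path, snapshot_dir: Path) -> Path:
    """Copy data_file into snapshot_dir under a timestamped name (hypothetical scheme)."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    # Fixed-width UTC timestamp, so sorting by name is also sorting by time.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = snapshot_dir / f"{data_file.stem}_{stamp}{data_file.suffix}"
    shutil.copy2(data_file, target)  # copy2 also keeps file modification times
    return target


def restore_latest(data_file: Path, snapshot_dir: Path) -> Path:
    """Overwrite data_file with its most recent snapshot and return that snapshot."""
    snapshots = sorted(snapshot_dir.glob(f"{data_file.stem}_*{data_file.suffix}"))
    if not snapshots:
        raise FileNotFoundError(f"no snapshots of {data_file.name} in {snapshot_dir}")
    latest = snapshots[-1]
    shutil.copy2(latest, data_file)
    return latest
```

Calling `snapshot` before every data entry session gives you a series of restore points. How many of them you keep, and for how long, is the answer to the “how long do you have to realize” question above. This does not replace the long-term backup, since the snapshots sit on the same disk as the data.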

Next, think about the file formats. If you use software to produce or analyze your data, do not assume that software will always be available, even if you have the source code. Operating systems change, and given enough time, that source code will become uncompilable without a significant expenditure of effort. Do not assume that if the software is available, it will still be able to read the format it wrote a decade ago. Excel is actually safer than many formats, because Microsoft has a remarkable record of being backwards compatible, or at least providing migration tools. However, I would not count on that continuing. Microsoft is a business, and selling Excel to scientists is not a large part of their revenue stream.

The safest thing to do is to try to export into some sort of “plain text” format. You can tell the format is plain text if you can open it in something like Notepad and it is human readable. Excel will export as CSV (comma-separated values). XML is a good option, too, because although it might be a pain to figure out the details of the format later, it should always be at least theoretically possible. Well-designed XML is also self-describing, to at least some extent. This is better than a bunch of blocks of text with no context.
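If your data already lives in a script rather than a spreadsheet, the same export can be sketched with Python’s standard `csv` module. This is an illustration under assumptions: the function names and the example column names are hypothetical, and the plain-text check is only a rough automated stand-in for the Notepad test.

```python
import codecs
import csv
from pathlib import Path


def export_csv(rows: list[dict], path: Path) -> None:
    """Write rows (dicts sharing the same keys) as CSV with a header line."""
    if not rows:
        raise ValueError("nothing to export")
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def looks_like_plain_text(path: Path, sample_size: int = 4096) -> bool:
    """Rough check: the first bytes contain no NUL byte and decode as UTF-8."""
    with path.open("rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    try:
        # Incremental decoding tolerates a character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

For example, `export_csv([{"animal": "cheetah", "count": 3}], Path("observations.csv"))` writes a header line followed by one data line. Passing the check only means the bytes are text; whether a human can make sense of them is a separate question.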

If your data sets are really, really big, you may not have a good plain text option. This means that you need to make migrating your legacy data part of every upgrade decision. Lucky you!

Finally, think about your risk of catastrophe. Earthquakes, fires, floods, lost storage boxes… all of these things happen. You have to decide which you need to worry about. One good, standard practice is to have an “offsite backup.” However, if you live in an earthquake- or fire-prone location, you need to think about what, exactly, constitutes “offsite.” Also think about how much delay you can accept in getting access to your offsite backup. The rise of cloud-based storage has made keeping a true offsite backup much easier, but always remember to check on the service provider’s policies to make sure they meet your needs.

You will forget the context of your data.

I have been involved in a lot of data migration projects and also in a lot of data integration projects. The most expensive part of these projects is usually the process of understanding the data. This is true even when people who were involved in the development of the original systems are available to answer questions. When they aren’t… well, so far I have yet to come across a database I couldn’t figure out, but sometimes it takes a lot more time than we wanted.

This is why librarians and database geeks like myself get starry-eyed about “metadata”: that is, data about the data, the information that puts the data in context. Even a little bit of metadata can save a lot of time spelunking in the raw data, trying to rediscover connections.

The number one thing you can do to protect yourself from frustrating weeks trying to reconstruct the meaning in the data is to document that meaning up front. Assume you will forget just about everything, and write it down.
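One lightweight way to write it down is a small “sidecar” file that sits next to each data file. Here is a minimal sketch in Python. The `.meta.json` suffix and the field names are assumptions of mine, not a standard; use whatever fields capture the context you expect to forget.

```python
import json
from datetime import date
from pathlib import Path


def write_sidecar(data_file: Path, description: str, columns: dict[str, str]) -> Path:
    """Write <data_file>.meta.json describing what the data file contains."""
    metadata = {
        "data_file": data_file.name,
        "description": description,
        "columns": columns,  # column name -> what it means, including units
        "documented_on": date.today().isoformat(),
    }
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Because the sidecar is itself plain text, it survives the same format changes as the exported data, and it travels with the data file whenever you copy both.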

Your data is useless if you can’t find the bits you need.

Once again, assume you’ll forget just about everything, especially where you put things. Don’t give things cute names. You won’t think they are funny five years from now when you’re trying to find the data you need to finish some crucial analysis. Search technology has gotten a lot better than it used to be, but it still needs information to function. You may also be surprised by the extent to which the shreds of your memory plus a decent filing scheme beat out searching, particularly if you’re searching for things without a lot of metadata.

This is another reason we get starry-eyed about metadata. Metadata helps you find things.

Redundancy is your friend, and your enemy.

Redundancy is great because it allows recovery from catastrophe. But redundancy also introduces the possibility of confusion. This can be a huge problem, and it takes data management novices by surprise. Once you have two copies of something, there is a chance they will get out of sync. When that happens, which one is “correct”? Which do you believe?

The solution for this is to always designate a “gold standard,” which is the copy that is the one, true copy. Make it 100% clear to everyone—especially yourself!—that if there is any discrepancy, the gold standard copy is the one that will be believed. Then act on that. Make sure that any changes go in that copy first. Also try to make sure they are propagated to other copies, to avoid that awful “why do these two versions say different things?” moment. However, it is almost inevitable that you will have a moment like that. You practically guarantee it once you make a copy. But you need to copy for backup….

Trust me, just decide up front which copy is the gold standard and try to never deviate from that decision.
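Checking whether copies have drifted from the gold standard does not have to be done by eye. Here is a minimal sketch in Python that compares SHA-256 checksums, assuming each copy is a single file. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def out_of_sync(gold: Path, copies: list[Path]) -> list[Path]:
    """Return the copies whose contents differ from the gold standard copy."""
    expected = sha256_of(gold)
    return [copy for copy in copies if sha256_of(copy) != expected]
```

Anything this returns should be refreshed from the gold standard, not the other way around. The check tells you that two files differ, not which one is right. That is the decision you made up front.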

Those are the basics. Tell me what I missed in the comments! 


    • Melanie said:

      Oh, definitely, there are great data stores out there that are much older than 10 years. I tend to think of those as “paper land,” but I know many aren’t.

      One extra twist if you work at a company is that there might be some types of data you are required to destroy after X amount of time. It is so hard to do that as a scientist!

      March 2, 2016
  1. […] As scientists we often don’t think about data management until it’s too late and we end up losing data due to a computer crash or catastrophe! Don’t let this happen to you! Excellent advice from Melanie Nelson over at the Beyond Managing Blog on data management. […]

    March 8, 2016
