From Ancient DAOs to Modern Solutions: Why Self-Describing Data Matters
Dive into Storacha’s spicy approach to preserving data for generations to come.
You wake up with a smile on your face and slide yourself easily out of your bivy pod. Around you stretches an endless chaos of metal and plastic, the electronic waste of thousands of years of civilization hiding gems of pure information. Your scavenger drones have left you a pile of data storage devices, and after a hot cup of your finest ghost pepper infusion you grab a few disks to start the day’s work. The first one you check is a total loss: even after decryption the data doesn’t match any format you’ve ever seen. It’s probably an old database, but if it isn’t self-describing data the chances it’s worth anything are effectively zero. The second disk makes up for your disappointment: a multiformatted archive of the activities of an ancient DAO with enough social signal and complexity to fetch a few credits from researchers interested in this time period. More exciting than that, however, is a private key store. You spin up a cloud research bot and set it to work looking for digital resources attached to the key. If you’re lucky you might find an old stash of a token that still has market value, and if you’re REALLY lucky (you let yourself dream) you could find that holiest of holies, a piece from the original Racha NFT collection…
You sit back and pour yourself another cup of your steaming, spicy brew. If only your ancestor could see you now, scouring the earth for remnants of that long-gone era. You are Racha4000, and saving and preserving data will always be your first and most important calling…
When you meet a new person, you rely on a mix of contextual clues and “self description” in order to communicate with them. If you meet them on the streets of New York City, you might assume they are able to speak English, and once they tell you their name you will be able to greet them. You may be wrong in your assumption that they speak English, but if they tell you their name you can reliably assume that you can call them that.
Similarly, when an application encounters a new piece of data, it will generally rely on a mix of contextual clues and “self description” to decide what sorts of interaction that data supports. If data is moved from one storage location to another the contextual information may disappear, but “self description,” by its nature, moves with the data itself. Let’s look at a simple example:
Say we have a list of numbers, for instance hourly outside temperature readings. We’d like to store them as compactly as possible, so we want to avoid using a separator between the numbers. Since most temperature readings in Celsius fall between -99 and 99 degrees, we can use two digits for each reading, padding with a leading zero where needed and writing a minus sign in front of negative values. If the temperature readings were 15, 10, 5, 0, -6, 10, 3, we’d encode them as:
15100500-061003
To figure out which measurements this corresponds to, we just read digits two at a time, setting aside (but preserving) any negative sign. But what if it does get extremely hot or cold? We’d like to at least avoid breaking our data encoding scheme in this case, so we use a special trick: if a measurement has more than two digits, write the first two, then add an @ symbol followed by the next digit. Since temperatures REALLY shouldn’t need more than three digits, this shouldn’t take up much more space. For example, given the sequence 15, 103, 3, 34, -2, 20, we’d represent it as:
1510@30334-0220
This is a basic form of self description. Given a simple set of rules (“read two digits at a time, and read one more digit each time you see an @ symbol”), anyone can read arbitrarily long numbers from our list, because the data itself contains information about its own encoding.
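To make the rules concrete, here is a minimal Python sketch of the toy temperature encoding. The function names encode and decode are made up for illustration, and the sketch assumes a plain ASCII hyphen for the negative sign and that every digit beyond the second gets its own @ marker.

```python
def encode(readings):
    """Pack integer readings into the toy two-digit format."""
    parts = []
    for value in readings:
        sign = "-" if value < 0 else ""
        digits = str(abs(value)).zfill(2)  # pad single digits, e.g. 3 -> "03"
        # Assumption: each digit past the second is flagged with its own "@".
        extra = "".join("@" + d for d in digits[2:])
        parts.append(sign + digits[:2] + extra)
    return "".join(parts)


def decode(text):
    """Recover the readings using only the rules carried by the data."""
    readings = []
    i = 0
    while i < len(text):
        negative = text[i] == "-"
        if negative:
            i += 1
        digits = text[i:i + 2]  # always read two digits first
        if len(digits) != 2 or not digits.isdigit():
            raise ValueError(f"expected two digits at position {i}")
        i += 2
        # An "@" means this reading has one more digit.
        while text[i:i + 1] == "@":
            more = text[i + 1:i + 2]
            if not more.isdigit():
                raise ValueError(f"expected a digit after '@' at position {i}")
            digits += more
            i += 2
        readings.append(-int(digits) if negative else int(digits))
    return readings


print(encode([15, 103, 3, 34, -2, 20]))
print(decode("15100500-061003"))
```

The decoder never needs to be told how long each number is. The @ markers in the data say so, which is the whole point of self description.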
Self description is particularly helpful over the lifetime of long-lived data. A 2011 NASA study notes that “self-describing data formats have become a well accepted way of archiving and disseminating scientific data.” Because we at Storacha believe we need to work as long-term stewards of your data, these properties are also very useful for our work, and we make extensive use of the multiformats set of self-describing data formats to store your data. From the multihash and multicodecs in our CIDs to the indexes at the head of each of the CARs your data is packed into for storage, self-describing data formats are at the core of our next-generation data storage service.
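As a rough illustration of the same idea in multiformats, a multihash prefixes a digest with a code naming the hash function and the digest's length. The sketch below uses only Python's hashlib and handles just sha2-256 (code 0x12, 32-byte digest). It is a simplification, not Storacha's implementation: real multihashes encode the code and length as unsigned varints, which happen to fit in one byte each here, and the helper names to_multihash and describe are hypothetical.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash code for sha2-256


def to_multihash(data: bytes) -> bytes:
    """Prefix a sha2-256 digest with <hash code><digest length>."""
    digest = hashlib.sha256(data).digest()
    # Simplification: both values are below 128, so each varint is one byte.
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def describe(multihash: bytes):
    """Read the hash code and length back out of the bytes themselves."""
    code, length = multihash[0], multihash[1]
    digest = multihash[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match the declared length")
    return code, length, digest


mh = to_multihash(b"hello")
code, length, digest = describe(mh)
print(hex(code), length, digest.hex())
```

A reader who finds these bytes years later can tell which hash function produced them and how many bytes to expect without any outside context.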
Data isn’t just storage; it’s a story waiting to be uncovered, a treasure waiting to be claimed. Whether you’re piecing together the digital footprints of an ancient DAO or dreaming of a long-lost Racha NFT, one thing is clear — self-describing data makes it all possible. It ensures that, no matter how deep into the chaos of bytes and bits we dive, the map to understanding is always in the palm of our hands.
At Storacha, we don’t just store data; we preserve its identity, its meaning, and its potential. With self-describing formats at the core of our mission, we’re not just safeguarding information — we’re future-proofing its journey. Because in a world of endless possibilities, the only limit is what we’re bold enough to decode.