By: St John Karp
At the BitCurator Forum this year I presented a talk called “The Interconnectedness of All Things.” I discussed the potential use for a new tool in the digital archivist’s toolkit, something that would help archivists discover, navigate, and describe digital collections. I’d like to talk a little more about how a tool like this could work and why I think it’s important.
Traditional archival description imposes a hierarchical structure on records. Hierarchies, however, can fail to reflect the messy nature of real life. It can be impossible for a hierarchy to capture relationships between records that fall outside of the hierarchical structure, such as indicating that two digital files are drafts of the same document. Standards for archival description such as Records-in-Contexts (RiC) are making progress towards modeling archival data in more sophisticated ways by using RDF to describe semantic relationships between records. But who is going to do all the work to describe collections in even more detail? We barely have enough time to describe collections with simple hierarchies, let alone complex semantic webs.
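To make the idea of semantic relationships a little more concrete, here is a minimal Python sketch that stores relationships as subject-predicate-object triples, which is the basic shape of RDF data. The file names and the predicate names (isDraftOf, isEmbeddedIn) are invented for illustration and are not taken from the RiC vocabulary.

```python
from collections import defaultdict

# Each relationship is a (subject, predicate, object) triple, the same
# basic shape RDF uses. All file and predicate names are hypothetical.
triples = [
    ("novel_draft1.doc", "isDraftOf", "novel_final.doc"),
    ("novel_draft2.doc", "isDraftOf", "novel_final.doc"),
    ("cover.png", "isEmbeddedIn", "novel_final.doc"),
]


def related_to(target, triples):
    """Return {predicate: [subjects]} for every triple whose object is target."""
    found = defaultdict(list)
    for subject, predicate, obj in triples:
        if obj == target:
            found[predicate].append(subject)
    return dict(found)


print(related_to("novel_final.doc", triples))
```

Note that neither draft has to sit "under" the final document in a hierarchy. The relationships are recorded alongside whatever arrangement the collection already has.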
Physical records still have to be understood and described manually, but digital records hold new possibilities. A computer can scan one or more disk images, analyze the files, and automatically determine the relationships between them. A document, for example, may exist in many different versions including drafts, annotated copies, and copies with suggested changes. An image may have different versions too, such as the raw original, a cropped version, and a scaled version. Files may also be included in one another. An image might be embedded in a Word document, the Word document exported to a PDF, and that PDF attached to an email. Everything can be related in complex ways.
Ted Nelson has been describing documents this way for decades. Nelson is an information technology theorist who saw computers in the 1960s and understood the possibilities they had as document systems or “literary machines”:
Most things that people describe and model with hierarchies and categories are overlapping and cross-connected, and the hierarchical and categorical descriptions usually miss this. Everything is much more likely to be interconnected, overlapping, cross-connected, intertwined and intermingled (I like to say “intertwingled”).
—The Future of Information: Ideas, Connections, and the Gods of Electronic Literature, 1997
Nelson was ultimately let down by the fact that document systems, such as what eventually became the World Wide Web, never achieved the level of sophistication that he had imagined. All the components of a sophisticated system do exist, however. Jesse Kornblum described and implemented ssdeep, a fuzzy hashing tool that can identify matching binary sequences between files. Other algorithms can match images based not on their binary contents but on what they look like, which is how Google’s reverse image search works. Others, such as the algorithms used by Shazam, can match audio based on what it sounds like, and YouTube’s content moderation algorithms match videos based on what they look like. All these ideas can be packaged together into a new kind of tool for digital archivists.
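As a rough illustration of what "matching binary sequences between files" can mean, here is a minimal Python sketch that scores two byte strings by how many short byte sequences they share. This is not ssdeep's algorithm. It is a much simpler stand-in (Jaccard similarity over overlapping byte sequences), and the function names and sample texts are made up for the example.

```python
def shingles(data: bytes, size: int = 8) -> set:
    """Return the set of all overlapping byte sequences of the given length."""
    if len(data) < size:
        # Inputs shorter than one shingle are treated as a single shingle.
        return {data}
    return {data[i:i + size] for i in range(len(data) - size + 1)}


def similarity(a: bytes, b: bytes, size: int = 8) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical file contents: a draft, a lightly revised copy, and
# something unrelated.
draft = b"The quick brown fox jumps over the lazy dog."
revised = b"The quick brown fox leaps over the lazy dog."
unrelated = b"Completely different content lives in this file."

print(similarity(draft, revised))
print(similarity(draft, unrelated))
```

The revised copy scores well above the unrelated file because most of its byte sequences are unchanged. A production tool would need something far more economical than comparing every pair of files in full, which is part of what fuzzy hashes like ssdeep's are for.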
Institutional digital archiving is often straightforward because accessions have been planned and agreed upon in advance. Ingest workflows are settled ahead of time so that the archivist knows what they’re getting. Other kinds of collections, however, may be less organized. The digital assets belonging to an artist, a musician, or a writer, for example, may not have any consistent organization. There may be many file formats, spread across various hard drives, CDs, and DVDs, any of which may contain duplicate material or related records. These kinds of collections are often processed minimally due to the enormous amount of the archivist’s time that would be required to describe them in more detail. However, if the bar could be lowered, if automated tools could be used to make the archivist’s job easier and quicker, more digital collections could be described with a higher level of detail.
I’m currently developing a proof-of-concept called Eltrovo to test out some of these ideas and get a sense of what this kind of tool might look like. Eltrovo (Esperanto for “finding out”) would be able to scan any number of disk images and identify not just duplicates but files that are related in one way or another. It would thus enable the archivist to understand and describe the collection with the benefit of information that is almost impossible to discover by hand. It would also empower the archivist to make important decisions about deaccessioning content that is largely replicated elsewhere. Running the kinds of hashing algorithms required to do this is, unfortunately, computationally intensive, but I don’t see this as a deal breaker because that time doesn’t come out of the archivist’s day. While the computer is sitting in the corner and thinking, the archivist can be getting on with other work.
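The simplest piece of this, finding exact duplicates across several sources, can be sketched in a few lines of Python. This is not Eltrovo's code, only a minimal illustration. It assumes the disk images have already been mounted or extracted to ordinary directories, it catches byte-for-byte copies only (not related or similar files), and the function names and sample files are hypothetical.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Group files under the given directories by identical content."""
    by_hash = defaultdict(list)
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                by_hash[sha256_of(path)].append(path)
    # Keep only the groups that contain more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]


# Demonstration with two stand-in "disks" created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    disk_a = Path(tmp, "disk_a")
    disk_b = Path(tmp, "disk_b")
    disk_a.mkdir()
    disk_b.mkdir()
    (disk_a / "letter.txt").write_bytes(b"Dear Ted")
    (disk_b / "letter_copy.txt").write_bytes(b"Dear Ted")
    (disk_b / "notes.txt").write_bytes(b"something else")
    groups = find_duplicates([disk_a, disk_b])
    duplicate_names = [sorted(p.name for p in group) for group in groups]

print(duplicate_names)
```

Cryptographic hashes like SHA-256 are cheap compared with similarity hashing, so a pass like this could plausibly run first and leave the more expensive comparisons for the files that remain.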
I am excited by the prospect of archivists being supported and enabled by tools that take a more sophisticated approach to digital records. Many problems with Eltrovo have yet to be solved, so I don’t know what that future looks like yet. Still, as the Esperantists might say, “ni eltrovu” (let’s find out).
St John Karp is a graduate student in his last semester of library school with a concentration in archives. He has previously studied classics and spent over ten years working as a computer programmer. He is pursuing interests in metadata, cataloging, and standards for bibliographic and archival description.