The Shape of Information: A New Approach to Digital Collections

April 10, 2024 | April 4, 2024 | bloggERS! Editors

By: St John Karp

At the BitCurator Forum this year I presented a talk called “The Interconnectedness of All Things.” I discussed the potential use for a new tool in the digital archivist’s toolkit, something that would help archivists discover, navigate, and describe digital collections. I’d like to talk a little more about how a tool like this could work and why I think it’s important.

Traditional archival description imposes a hierarchical structure on records. Hierarchies, however, can fail to reflect the messy nature of real life. It can be impossible for a hierarchy to capture relationships between records that fall outside of the hierarchical structure, such as indicating that two digital files are drafts of the same document. Standards for archival description such as Records-in-Contexts (RiC) are making progress towards modeling archival data in more sophisticated ways by using RDF to describe semantic relationships between records. But who is going to do all the work to describe collections in even more detail? We barely have enough time to describe collections with simple hierarchies, let alone complex semantic webs.

Physical records still have to be understood and described manually, but digital records hold new possibilities. A computer can scan one or more disk images, analyze the files, and automatically determine the relationships between them. A document, for example, may exist in many different versions including drafts, annotated copies, and copies with suggested changes. An image may have different versions too, such as the raw original, a cropped version, and a scaled version. Files may also be included in one another. An image might be embedded in a Word document, the Word document exported to a PDF, and that PDF attached to an email. Everything can be related in complex ways.

Ted Nelson has been describing documents this way for decades. Nelson is an information technology theorist who saw computers in the 1960s and understood the possibilities they had as document systems or “literary machines”:

Most things that people describe and model with hierarchies and categories are overlapping and cross-connected, and the hierarchical and categorical descriptions usually miss this. Everything is much more likely to be interconnected, overlapping, cross-connected, intertwined and intermingled (I like to say “intertwingled”).
—The Future of Information: Ideas, Connections, and the Gods of Electronic Literature, 1997

Nelson was ultimately let down by the fact that document systems, such as what eventually became the World Wide Web, never achieved the level of sophistication that he had imagined. All the components of a sophisticated system do, however, exist. Jesse Kornblum described and implemented ssdeep, a fuzzy hashing tool that can identify matching binary sequences between files. Other algorithms can match images based not on their binary contents but on what they look like, which is how Google’s reverse image search works. Others, such as the algorithms used by Shazam, can match audio based on what it sounds like, and YouTube’s content moderation algorithms match videos based on what they look like. All these ideas can be packaged together into a new kind of tool for digital archivists.
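
To make the idea of matching files by shared byte sequences concrete, here is a minimal sketch in Python. It is not ssdeep and does not reproduce Kornblum's algorithm; it swaps in a much simpler stand-in, the Jaccard similarity of overlapping byte n-grams, and the function names, the n-gram size of 8, and the sample texts are all assumptions made for illustration.

```python
def shingles(data: bytes, size: int = 8) -> set[bytes]:
    """Return the set of overlapping byte n-grams ("shingles") found in data."""
    if len(data) < size:
        return {data} if data else set()
    return {data[i:i + size] for i in range(len(data) - size + 1)}


def similarity(a: bytes, b: bytes, size: int = 8) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty inputs are treated as identical
    return len(sa & sb) / len(sa | sb)


# Invented sample data: a draft, a lightly revised copy, and an unrelated file.
draft = b"Chapter one. It was a dark and stormy night, and the rain fell in torrents."
revised = b"Chapter one. It was a dark and stormy evening, and the rain fell in torrents."
unrelated = b"Quarterly budget figures for the facilities department, fiscal year 1998."

print(f"draft vs revised:   {similarity(draft, revised):.2f}")
print(f"draft vs unrelated: {similarity(draft, unrelated):.2f}")
```

The revised copy scores far higher against the draft than the unrelated file does, which is the kind of signal a tool could use to flag probable drafts of the same document. Comparing every pair of files this way scales poorly, which is one reason purpose-built fuzzy hashes reduce each file to a compact signature first.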

Institutional digital archiving is often straightforward because accessions have been planned and agreed upon in advance. Ingest workflows are agreed upon so that the archivist knows what they’re getting. Other kinds of collections, however, may be less organized. The digital assets belonging to an artist, musician, or a writer, for example, may not have any consistent organization. There may be many file formats, spread across various hard drives, CDs, and DVDs, any of which may contain duplicate material or related records. These kinds of collections are often processed minimally due to the enormous amount of the archivist’s time that would be required to describe them in more detail. However, if the bar could be lowered, if automated tools could be used to make the archivist’s job easier and quicker, more digital collections could be described with a higher level of detail.

I’m currently developing a proof-of-concept called Eltrovo to test some of these ideas and get a sense of what this kind of tool might look like. Eltrovo (Esperanto for “finding out”) would be able to scan any number of disk images and identify not just duplicates but files that are related in one way or another. It would thus enable the archivist to understand and describe the collection with the benefit of information that is almost impossible to discover by hand. It would also empower the archivist to make important decisions about deaccessioning content that is largely replicated elsewhere. Running the kinds of hashing algorithms required to do this is, unfortunately, computationally intensive, but I don’t see this as a deal breaker because that time doesn’t come out of the archivist’s day. While the computer is sitting in the corner and thinking, the archivist can be getting on with other work.
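
The simplest part of that job, finding exact duplicates, can be sketched in a few lines of Python. This is not Eltrovo's code, and nothing here is taken from it. The sketch assumes the disk images have already been mounted or extracted into ordinary directories under one root folder, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    # Only hashes shared by two or more files indicate duplicates.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("extracted_images"))` (a hypothetical folder name) would return one list of paths per set of identical files. A cryptographic hash only catches exact copies, so related-but-different files such as drafts or cropped images need similarity measures on top of it.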

I am excited by the prospect of archivists being supported and enabled by tools that take a more sophisticated approach to digital records. Many problems with Eltrovo have yet to be solved, so I don’t know what that future looks like yet. Still, as the Esperantists might say, “ni eltrovu” (let’s find out).

St John Karp is a graduate student in his last semester of library school with a concentration in archives. He has previously studied classics and spent over ten years working as a computer programmer. He is pursuing interests in metadata, cataloging, and standards for bibliographic and archival description.

Bridging Historical Gaps: The Importance of Inclusive Archival Practices

February 29, 2024 | bloggERS! Editors

By: Doreen Dixon

History, as it is commonly understood, is a constructed narrative of the past,¹ and within the vast corridors of history, archives stand as guardians of society’s collective memory. Charged with the crucial task of preserving records, they serve as repositories of our past, holding the narratives that shape our understanding of the world. However, the narratives they preserve have not always been inclusive. Archives have often overlooked and inadequately represented the stories of historically marginalized communities, stories that are integral to understanding the complete tapestry of our shared history. As a result, these communities have been silenced (underrepresented, misrepresented, and at times entirely omitted from archival holdings), leaving significant gaps in their documentation. This continuous exclusion perpetuates a cycle of erasure and reinforces existing power dynamics.

Recognizing the importance of addressing existing gaps, the Drake University Archives and Special Collections has embarked on a collaborative oral history project with the university’s Black Faculty and Staff Affinity Group (BFSA). McDonald, Lanctot, and Fernandez stated, “For communities that have traditionally been marginalized in both the historical record and in historiography, oral histories can be a form of empowerment, a way in which community members can literally add their voices to the historical narrative. The process of a community sharing its stories can provide personal opportunities for self-reflection, an appreciation for the struggles endured, and a celebration of the community’s accomplishments.”² The Black Faculty, Staff and Alumni of Drake University Oral History Project, launched in February 2022, seeks to rectify historical oversight by documenting the lives and experiences of Black alumni of Drake University, as well as former faculty and staff.

As the Archives representative assigned to the project, I have had the privilege of being part of this transformative endeavor. From the outset, it was clear that this initiative held immense significance, not just for the Archives and Special Collections but for the university as a whole. Working closely with the BFSA Group, I assisted in project planning, which involved drafting project descriptions and guidelines, crafting invitations to participate, and developing interview checklists and questions.

Beyond administrative tasks, my role extends to practical aspects of the oral history project. This includes providing training on recording devices, scheduling and conducting interviews, and following up with participants. After the interviews, I transcribe the audio recordings, edit the transcripts for accuracy, rename and organize files according to unit standards, and create accessible copies of files. I also upload audio files and associated documents to CONTENTdm and add metadata to their records.

The impact of the Black Faculty, Staff, and Alumni of Drake University oral history project goes beyond diversifying archival holdings; it enriches our understanding of the Drake University experience. By amplifying voices that were previously silenced or marginalized, the project adds depth and nuance to the university’s history, fostering a more inclusive narrative.

The online accessibility of the oral history interviews through the Voices of Drake Oral Histories digital collection was the culmination of months of collaborative effort. Researchers and community members now have unprecedented access to these invaluable resources, allowing for a deeper exploration of the university’s past and its connections to broader historical contexts.

Furthermore, the project complements the Archives’ other oral history initiatives, namely the Drake Bands, Women Remember Drake, and the Drake African American Alumni Reunion projects. As the archival landscape continues to evolve, these endeavors serve as prime examples of the transformative impact of inclusive archival practices.

The Black Faculty, Staff, and Alumni of Drake University oral history project is a testament to the importance of addressing historical gaps and amplifying marginalized voices within archival collections. Authors Krim, Gwynn, and Lawrimore state that “The lack of collecting and preserving documentation may be the result of unconscious decisions made on the part of the archivist, or it may be an intentional tool of oppression. Regardless of the intent, when a population’s life and struggles are lacking, the historical narrative cannot reflect their contributions making them appear inconsequential or absent.”³ Therefore, it is essential to intentionally support and embed inclusive practices into archival work. Through collaborative efforts and a commitment to inclusivity, archives can fulfill their mandate of preserving society’s collective memory more comprehensively and equitably.

¹ Stacey R. Krim, David Gwynn, and Erin Lawrimore, “Reconstructing History: Addressing Marginalization, Absences and Silences Through Community and Collaboration,” in Diversity, Equity, and Inclusion in Action: Planning, Leadership, and Programming, ed. Christine Bombaro (Chicago: ALA Editions, 2020).
² Beth McDonald, Heather Lanctot, and Natalia M. Fernandez, “Little Big Stories: Case Studies in Diversifying the Archival Record through Community Oral Histories,” Journal of Western Archives 12, no. 1 (2021): Article 4, https://digitalcommons.usu.edu/westernarchives/vol12/iss1/4.
³ Krim, Gwynn, and Lawrimore, “Reconstructing History.”

Doreen Dixon is the Electronic Records Archivist and Assistant Professor of Librarianship at Drake University, Des Moines, Iowa, responsible for managing born-digital and hybrid collections. Her primary responsibility is to develop the digital preservation program for the University Archives and Special Collections unit. She is an early career professional who has been in the archival field for two years and has gained certifications as an Archivist, Records Analyst, and Digital Archives Specialist.

Call for bloggERS: DEI in Practice

January 30, 2024 | bloggERS! Editors

Do you have any DEI-related archival projects that you are working on this year? What are you working on now? What projects are you thinking about working on? The bloggERS editorial team is interested in posts that discuss DEI-related projects and projects that are still in the planning phase. We want to hear about them all!

Writing for bloggERS!:

We encourage visual representations! Posts can include or largely consist of comics, flowcharts, a series of memes, etc.
Written content should be roughly 600-800 words in length.
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA.
Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com! We are looking forward to your submissions.

New Year, New Resources

January 25, 2024 | bloggERS! Editors

Happy new year from the editorial team at bloggERS! The past year was full of digital preservation developments, so as we leave 2023 behind, we’ve compiled a list of new resources to take forward into 2024. Of course, this list isn’t comprehensive, but we hope it will nevertheless come in handy for our dear bloggERS! readers. For a throwback, you can also look at last year’s write-up.

In addition to the list below, we’d love to hear what other new resources our readers are excited about! Feel free to respond in the comments and share widely with fellow electronic records practitioners.

Tools

Attachment Converter

Archivists working heavily with email may find this tool useful. To quote the documentation, “Attachment Converter is an open-source command-line tool for digital archivists. Given an email mailbox in MBOX format, it will go through that mailbox and batch convert all the attachments it can into preservation formats, putting a converted copy of each attachment back into the email from which it originated.”

Qiwi

Those interested in using the QEMU emulator but daunted by the prospect of running it from the command line can consider Ethan Gates’s Qiwi. This app aims to “flatten as much as possible the amount of setup time and/or technical understanding necessary to start using QEMU,” in part by implementing a GUI for it.

Rclone

Folks who frequently need to download or upload to cloud storage might appreciate this command-line tool. Rclone is modeled on rsync and affords easy cloud data transfer with fixity checking and timestamp preservation. This tool is at least a decade old, but a member of our editorial team learned about it this year and has found it so invaluable that we wanted to include it on our list.

Reports

Survey of the Video Game Reissue Market in the United States

This report from the Video Game History Foundation and Software Preservation Network received widespread media coverage for its finding that 87% of classic games released in the United States are not currently available from publishers.

Technical Guidelines for Digitizing Cultural Heritage Materials, 3rd edition

The Federal Agencies Digital Guidelines Initiative (FADGI) has released an updated version of their much-cited digitization guidelines.

Software Accessibility for Open Source Digital Preservation Applications

FADGI has also published a guide to “accessibility best practices for open-source software.” The project’s goal is “enhancing accessibility in open-source desktop applications for the digital preservation community.”

Reference and Roundups

AMPlifying AV: Leveraging AI to create metadata for audiovisual collections

If you work heavily with A/V materials, you may be interested in this Code4Lib presentation from some of the folks behind the Audiovisual Metadata Platform. It shares their findings from evaluating the performance of over a dozen machine-learning tools in generating metadata for audiovisual archival materials. Services tested included AWS Transcribe, MS Azure Video Indexer, PySceneDetect, and INA Speech Segmenter.

Browser Extensions and Shortcuts for Archivists

This community-compiled resource was presented by Pamela Campbell and Jona Whipple at the 2023 SAA conference under the title “Building the Toolbox”. It collects browser tools that may be useful to all sorts of archivists, digital archivists included. Some of us on the bloggERS! editorial team have found particular use in the code to generate a list of all links on a page.

DANNNG’s Tool Selection Factors

The group that brought us the Disk Imaging Decision Factors is back with this draft version of a rubric for evaluating digital preservation tools. Keep an eye out for the release of the final resource later this year.

Examples of Born Digital Description in Finding Aids

This site, created by the DLF Born Digital Access Working Group, gathers over 80 examples of how different institutions incorporate born-digital materials into their finding aids. While not strictly new in 2023, it’s recent enough and useful enough that members of the editorial team decided it was worth including.

Global Bit List of Endangered Digital Species

The past year saw a new version of the Digital Preservation Coalition (DPC)’s Bit List, “a community-sourced list of at-risk digital materials which is revised every two years.” The report is aimed specifically at archivists needing “independent evidence … to support their targeted advocacy” for digital preservation activities.

A Guide to the Installation of IsoBuster, IROMLAB and IROMSGL

This post by Niamh Murphy on the DPC’s blog, along with the more detailed accompanying guide, walks digital archivists through getting started with some common software tools for imaging optical media.

Level up with RAM

The DPC Rapid Assessment Model (RAM) has been around for several years. This new companion resource aims to help organizations address weaknesses revealed by the assessment by offering tips, resources, and case studies for each of the 11 sections of the RAM.

Updated Library of Congress file format resources

The File Formats team at LC has continued to expand and update its compendium of file format descriptions, in addition to making some changes to its recommended formats statement. More details on the new developments can be found in this blog post by LC staff.

2023 DigiPres Recap

December 20, 2023 | February 28, 2024 | bloggERS! Editors

By: Doreen Dixon and Kari May

The NDSA 2023 Digital Preservation Conference (DigiPres) in St. Louis, Missouri, on November 14-16 focused on the theme “Communities Through Time and Space.” The conference had 187 attendees, including 66 presenters, and 10 international participants from Ireland, the United Kingdom, the Netherlands, Norway, Canada, and Japan. The event provided a venue for experienced and new professionals to gather, network, and discuss advancing digital stewardship.

The DigiPres Opening Plenary delivered essential updates on organizational changes, aiding members in planning future involvement. It featured the 2023 Excellence Awards ceremony, highlighting notable achievements in digital stewardship. Dr. Jamie Lee, this year’s keynote speaker, presented “Kairotic and Kin-centric Archives: Addressing Abundances and Abandonments,” challenging the audience with a new perspective on archival bodies and advocating for a relational approach in working with non-dominant communities, incorporating oral history interviews and storytelling to redefine how archives are perceived, deployed, and accessed.

Sessions, workshops, and posters emphasized community-building through partnerships and collaboration. Attendees gained practical skills for daily archival work and were inspired to foster innovation through radical collaborative initiatives across institutions and professions. Some sessions presented case studies, such as “The Curricular Asset Warehouse at the University of Illinois: A Digital Archive’s Sustainability Case,” on sustainable digital asset management. This presentation offered strategies to minimize electronic waste by implementing the greenest options. Examples demonstrated the impact on workflows, resources, and policy, and suggested solutions. Another session highlighted the Arizona State Archives and Records Management’s advocacy for funding to preserve and store permanent public records electronically. Their session, “Is It Feasible? Arizona’s Digital Repository Feasibility Study,” outlined their process of gathering necessary information through inventories, surveys, statistics, and interviews, emphasizing the importance of communication in understanding needs and resources and in gaining support.

Other sessions explored approaches to understanding and working with digital data, like the “Grounded in Theory” session, which addressed provenance understanding, the intrinsic biases of computing technology and their impact on born-digital archives, and digital curation dualities. “Virtualization for Processing and Accessing Digital Archives” delved into containerization and desktop virtualization for born-digital workflows. Additionally, “Building a New Skill Takes Time and Directed Effort: a Practice Plan for Learning the Command Line” aimed at skill development, offering techniques for incorporating CLI practice into daily activities for process automation. Demonstrations included using a troubleshooting dialog for identifying and correcting command errors, with attendees receiving handouts, practice demo files, and useful online resources. Resources shared during the session included:

DigiPres 2023 also allowed presenters to showcase projects through poster presentations. One example was the University Libraries at Virginia Tech’s poster, “Digitization of the Largest Insect Collection in Virginia: A Quick Look at the Workflow.” It featured a 3D project using digital photography to capture details of approximately 300 specimens from their insect collection, transforming them into accessible online 3D models. For additional poster presentations and session talks, visit the NDSA notes and slides page.

Overall, the conference was great because it addressed professional and social needs. Wednesday night Dinearounds provided opportunities for DigiPres attendees to explore local restaurants, fostering smaller, more intimate spaces for conversations over a meal. These moments allowed individuals to relax, reconnect with old friends, and make new professional acquaintances.

Since 2016, NDSA has hosted nine events, comprising six in-person and three virtual gatherings. Open to both members and non-members, DigiPres has evolved into an inclusive platform for seasoned and early career professionals to share, discuss, and collaborate. While the 2023 DigiPres successfully aligned with its theme, NDSA recognized the community’s concern that returning to the pre-pandemic status quo was less than feasible or desirable. Efforts are underway to address conference format, length, cost, and content. For updates, visit the News from NDSA webpage. Although a conference is not planned for 2024, stay tuned for the DigiPres Virtual Gathering on January 31-February 1, 2024!

Kari May is the Digital Archives and Preservation Librarian heading the design, development, and management of digital preservation for the University of Pittsburgh Library System. She has worked in digital preservation for 11 years and is currently a member of the SAA Collection Management Section Steering Committee and Co-Chair of the NDSA Excellence Awards Working Group.

Recap: DLF Forum, DigiPres, and Collaboration

December 13, 2023 | bloggERS! Editors

By: Abby Beazer

This year, the DLF Forum and DigiPres 2023 were held in St. Louis, Missouri. Seeing as this was only my second time attending the DLF Forum in person, and my first time attending DigiPres, the conference week was full of opportunities to learn from my colleagues. Although the DLF Forum didn’t have a formal theme this year, I did notice some trends throughout the conference schedule and the sessions that I attended. Concerns around diversity, equity, and inclusion continued to be a common thread, in addition to digital asset management system migrations (both in-progress and complete), and collaborative efforts.

Collaboration is one thing that I have learned is very important over the few years that I have been a part of this field—digital libraries and archives, and the digital preservation work that accompanies them, don’t happen in a vacuum. At the very least, these fields require collaboration within your institution to accomplish projects and workflows. But, as digital library and archives professionals, collaboration and communication with those outside our institutions is very beneficial because of what we can learn from and share with each other. Some of the best things that I have learned or gained from a conference haven’t come from a session, but from my interactions and conversations with other conference-goers. At this year’s conference, I was able to make better connections with colleagues from institutions in my own state that I don’t see very often.

There were a number of sessions at DLF Forum and DigiPres that highlighted collaboration as a vital part of a project or process, and some sessions were even collaborative presentations across institutions. At DLF, one such presentation was “Slaying the Migration Dragon: Approaches to Navigating an Open Source System Migration,” presented by Lisa McFall and Shay Foley of Hamilton College, Sarah Walden McGowan of Amherst College, and Brenden McCarthy of Rensselaer Polytechnic Institute. These presenters were brought together through a common need to migrate away from Islandora 7 and met via the Islandora Collaboration Group. They also each independently determined to migrate to the same new system, Archipelago, based on it being open source and having good metadata schema flexibility. Their presentation consisted of a series of questions that each institution’s representative answered based on their experience during their migrations. This valuable collaborative presentation enabled attendees to gain not one perspective and understanding of an institution’s experience migrating digital asset management systems, but three. For example, each institution took a slightly different migration path. Hamilton College utilized Amazon Web Services to assist them during the migration and brought in an outside metadata consultant after previous experiences when they migrated to Islandora, while Rensselaer Polytechnic Institute had a local instance of Archipelago installed with mostly out-of-the-box settings and used OpenRefine to clean up their metadata. Similar to Hamilton College, Amherst College worked with vendors BornDigital and Metro to accomplish their migration because they had a low number of in-house staff to assist with the process. Every repository has its own issues to resolve and peculiar elements to work with; by featuring multiple perspectives and having a more varied set of information in a panel, there’s more to be learned and potentially applied by the audience.

Later in the week, during DigiPres, I attended another presentation that focused on collaboration: Brenna Edwards, Hyeeyoung Kim, and Christy Toms from the University of Texas at Austin participated in the “Praxis Makes Perfect?” session with their presentation “From Assessment to Standard: Using NDSA Levels of Digital Preservation to Define a Local Standard.” Their presentation focused on the work of the UT DigiPres group that had been created on the UT Austin campus. The group is made up of different departments and centers, all concerned with digital preservation work. Meetings for the group are organized on what they call a “TV season” schedule to accomplish an agreed-upon goal for that season while avoiding burnout. This “season” runs from about September through March with a hiatus from April through August, roughly following the main academic semesters. They noted that this setup was favorable to members because it wasn’t a year-long commitment and allowed for down time. For 2022-23, the group’s goal was to complete a preservation assessment based on the NDSA Levels of Digital Preservation. Through meetings every other month, this group was able to establish the current state of digital preservation at UT Austin and set goals for where they wanted to end up. At the end of that “season” of work, UT DigiPres was able to raise awareness of digital preservation work across campus and establish a more cohesive digital preservation ecosystem. They also discussed how this work led into the next focus for the group, documentation, which is still ongoing. The group’s final takeaway from reviewing their efforts was to encourage attendees to consider creating a collaborative group of their own. With talk that a DigiPres 2024 conference may not be scheduled, I made note of their entire presentation and recommendations.
A local digital preservation working group would be a great chance for more local and focused collaboration, an idea that I intend to discuss with colleagues at my own institution as well as others locally. Why? Because the power of collaboration lies in how it provides motivation, inspiration, and support to those involved. These efforts often lead to better work and solutions for more people.

Abby Beazer is the Digital Initiatives Technical Specialist at the Brigham Young University Library and holds an MLIS from the University of Arizona. She supervises teams of student employees completing on-demand special collections digitization requests and transcription projects. She also oversees digitization equipment maintenance, and collaborates with colleagues on cultural heritage digitization best practices, standards, and digital preservation.

Harnessing the Power of the ChatGPT API for Metadata Extraction from Print Bibliographies

November 1, 2023 | bloggERS! Editors

By: Sara Palmer

Metadata extraction from print bibliographies is a process that often gets underplayed in discussions about big data and AI. However, the fact remains that for researchers, archivists, and bibliophiles, it’s an essential task that traditionally requires countless hours of manual labor. Advancements in AI and natural language processing have made this large task more feasible and efficient. Here’s a case study on how the ChatGPT API revolutionized the extraction process for a large print bibliography project.

Breaking Down the Bibliography

The Wayfinder project at Emory University is an NEH-funded planning project to determine how to make an online version of James Danky and Maureen Hady’s 1998 African American Newspapers and Periodicals: A National Bibliography. The bibliography has over 6,500 periodicals cataloged with data in paragraph form. Traditionally, turning those paragraphs into structured data would involve a team of graduate students manually extracting the required metadata. But when an advisory board member demonstrated how ChatGPT could perform this work quickly on a sample entry, we knew we had to take advantage of this technology.

We started with a traditional programming tool, regular expressions, for pattern recognition and string parsing. Using regular expressions, we were able to break the bibliography into discrete entries. To test the potential of the ChatGPT API in aiding our extraction, we took a random 10-percent sample (650 entries) from the larger dataset. Each entry was treated as a unique query for the API. Combining a prompt that spelled out the data schema in natural language with a small amount of Python code, we obtained an XML serialization of the data for each entry.
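
The splitting and sampling steps might look something like the sketch below. It is an illustration only, not the project's code. The assumption that every entry begins on a new line with an entry number is hypothetical, as are the function names and the prompt wording. The field names come from the examples given later in this post, and the request to the API itself is deliberately left out.

```python
import random
import re

# Assumption: every entry begins on a new line with an entry number followed
# by whitespace. The real bibliography's layout may well differ.
ENTRY_START = re.compile(r"^(?=\d+\s)", flags=re.MULTILINE)

# Field names drawn from the examples mentioned in the post; the project's
# full schema is not shown there.
FIELDS = ["title", "editor names", "publication dates", "subjects"]


def split_entries(text: str) -> list[str]:
    """Split the raw bibliography text into one string per entry."""
    return [chunk.strip() for chunk in ENTRY_START.split(text) if chunk.strip()]


def sample_entries(entries: list[str], fraction: float = 0.1, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample covering roughly `fraction` of entries."""
    size = max(1, round(len(entries) * fraction))
    return random.Random(seed).sample(entries, size)


def build_prompt(entry: str, fields: list[str] = FIELDS) -> str:
    """Compose a natural-language instruction for a single entry."""
    wanted = ", ".join(fields)
    return (
        "Extract the following metadata from the bibliography entry below "
        f"and return it as XML: {wanted}.\n\nEntry:\n{entry}"
    )
```

Each sampled entry would then be sent as its own query, with the text returned by `build_prompt` as the message. Fixing the random seed keeps the sample stable between runs, which helps when comparing the output of different prompts on the same entries.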

The ChatGPT Effect

The initial results, though not without flaws, were commendable. For each entry, the ChatGPT API was provided with metadata terms to identify and extract, such as title, editor names, publication dates, and subjects. In many cases, the extraction was accurate, providing a structured output that would save us hours in manual labor.

However, the challenges were also evident. Despite specifying our desired metadata terms, the API had a penchant for creativity. In some instances, the same metadata was given varying names. For example, Library of Congress numbers labeled “LCCN” in one entry might be labeled “library_congress_card_no” in another. Such variances required additional post-processing to standardize the terms.
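The kind of post-processing this calls for can be sketched with Python's standard XML library. The alias table below is hypothetical, built only from the LCCN example above; a real table would be compiled by reviewing the element names the API actually returned.

```python
import xml.etree.ElementTree as ET

# Hypothetical alias table: variant element name -> preferred element name.
TAG_ALIASES = {
    "library_congress_card_no": "LCCN",
    "lccn": "LCCN",
}


def standardize_tags(xml_text, aliases=TAG_ALIASES):
    """Rename variant element names to their preferred form."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Leave any element name that is not in the alias table unchanged.
        element.tag = aliases.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")
```

Unrecognized element names pass through untouched, so they can be collected and reviewed by hand before being added to the table.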

Looking to the Future

With such promising preliminary results, the project is now venturing further into the realm of AI. Our experiences with the API have led us to consider its fine-tuning capabilities. We aim to use our perfected sample dataset to train a model that’s more closely tailored to the specific task of extracting metadata from our bibliography. With this fine-tuned model, we hope to tackle the entire dataset, achieving even higher accuracy and efficiency.

In conclusion, our journey with the ChatGPT API showcases the transformative potential of AI in traditional academic tasks. The ability for AI to interpret data in prose form does more than save time and money. It also frees up academic workers to think more deeply about the nature of their data and how it might best be presented to end users. In doing so, it can empower scholars and researchers to focus on creativity, innovation, and critical thinking, core aspects that drive the academic world forward.

Sara Palmer is a digital scholarship specialist at the Emory Center for Digital Scholarship within Emory University. She collaborates with researchers on creating online exhibits, designing databases, generating network visualizations, and performing textual markup and analysis.

Call for bloggERS: Blog Posts on the 2023 DLF Forum

October 18, 2023 | bloggERS! Editors

With just a few weeks to go before the Digital Library Federation 2023 Forum (November 13-15) kicks off in St. Louis, bloggERS! is seeking attendees who are interested in writing a re-cap or a blog post covering a particular session, theme, or topic relevant to SAA Electronic Records Section members. The program for the forum is available if you want inspiration.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com! You can also let us know if you’re interested in writing a general re-cap or if you’d like to cover something more specific.

Writing for bloggERS!:

We encourage visual representations! Posts can include or largely consist of comics, flowcharts, a series of memes, etc.
Written content should be roughly 600-800 words in length.
Write posts for a wide audience: anyone who stewards, studies, or has an interest in digital archives and electronic records, both within and beyond SAA.
Align with other editorial guidelines as outlined in the bloggERS! guidelines for writers.

Please let us know if you are interested in contributing by sending an email to ers.mailer.blog@gmail.com!

92NY Unterberg Poetry Center Email Archives: Approaches to Processing Large Email Archives

October 16, 2023 | bloggERS! Editors

The 92nd Street Y is home to the Unterberg Poetry Center, a revered venue for writers to share their work with the public. Like many cultural organizations, 92NY is an unintentional repository of social memory, and the email collections of the Poetry Center’s directors, documenting exchanges with illustrious writers alongside day-to-day operations, are no exception. The Poetry Center adopted email in the late 1990s, and it has been the primary tool of communication for at least the last 15 years. In September of 2022, the Poetry Center and the digital archives team at Stanford University Libraries Department of Special Collections began collaborating on an email assessment and preservation project generously supported by the Email Archives: Building Capacity and Community regrant program, which is administered by the University of Illinois at Urbana-Champaign and funded by the Andrew W. Mellon Foundation. This project will apply the pioneering ePADD email curation and access software to the email archive with the aim of developing a model of processing and accessibility that other cultural centers might learn from and adopt.

Our approach to processing the Poetry Center’s email collection was in many ways similar to how any collection, analog or digital, is processed. It was dictated by the sheer volume of the collection and, of course, the “more product, less process” ethos. Our overarching goals in processing the collection are to ensure that we have appropriately restricted access to any personal or legally protected information found in the collection, and to consolidate correspondents to facilitate researcher use of the collection. We used ePADD to process these email accounts, and while ePADD itself helps a great deal with the processing, an archivist is still necessary to cull through the information, identify false positives as well as messages for restriction, consolidate correspondence, and check image attachments.

We began by surveying each director’s account, as they differed dramatically based on length of tenure at the Poetry Center and when email was adopted as a tool for professional communication. We observed the number of messages, frequently messaged recipients, and dates of each account. Many of the directors continued a relationship with the Center well after they had moved on from the institution, helping to bring in authors and acting as a contact. The two most recent directors had by far the largest accounts, coming in at 68.9 gigabytes and 222.4 gigabytes. Interestingly, we noticed that early use of the email seemed to mimic paper correspondence in formality and length but, as people became more familiar and comfortable with the medium, messages became shorter in content but were sent and received much more frequently.

We decided that it would be beneficial to process the smaller, more manageable accounts first. This approach built momentum for the project, which created a sense of accomplishment and confidence for us, as well as a road map for processing that we could use for the larger accounts. It also fostered a familiarity with ePADD that was easier to generate when working with smaller amounts of data, a bit like learning to swim in a pool before wading into the ocean. The larger accounts required multiple accessions, so we made the decision to ingest five files at a time, amounting to between 100,000 and 200,000 messages.

We first consulted ePADD’s Sensitive Lexicon to survey and flag messages that may need to be restricted because they contain stand-alone personally identifiable information or words related to health, employment, recreational drug use, and prescription drugs, as well as anything pertaining to financial, criminal, and legal issues. Some of the messages identified by ePADD might be false positives, but the lexicon allows for a more focused and economical inspection of messages that may contain potentially confidential information. Then, using an email appraisal template that we created, we documented key search terms and correspondents that were found in messages that we restricted to streamline our review process.
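To illustrate the general idea of lexicon-based flagging, here is a small sketch. It is not ePADD's implementation, and the categories and terms are invented placeholders rather than ePADD's actual Sensitive Lexicon.

```python
import re

# Hypothetical mini-lexicon; these are not ePADD's actual term lists.
LEXICON = {
    "health": ["diagnosis", "surgery"],
    "financial": ["salary", "bank account"],
    "legal": ["lawsuit", "subpoena"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word pattern per category."""
    patterns = {}
    for category, terms in lexicon.items():
        alternatives = "|".join(re.escape(term) for term in terms)
        patterns[category] = re.compile(
            r"\b(?:" + alternatives + r")\b", re.IGNORECASE
        )
    return patterns


def flag_message(body, patterns):
    """Return the sorted categories whose terms appear in the message body."""
    return sorted(
        category for category, pattern in patterns.items() if pattern.search(body)
    )
```

A match only means the message deserves a look. A note about an author's "surgery on the manuscript" would be flagged under health, which is exactly the kind of false positive an archivist has to review.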

Next we turned to cleaning up the accounts’ address books. This task includes aggregating multiple email addresses that belong to one individual and potentially making minor corrections to the formatting of names. Individuals can maintain multiple email accounts and vary the form of their names within any of them, so their contact information and corresponding names may appear multiple times throughout the list of correspondents. Many of the accounts contained close to (and some well over) 10,000 correspondents. This proliferation of names and contacts makes for a messy and confusing experience for researchers trying to use the collection. ePADD automatically associates email addresses with names of individuals, but it may take some additional manual editing by the archivist to make final corrections. Although consolidating correspondent names can be a time-consuming process, it makes the collection much easier to use. With our timeline in mind, we decided to consolidate 100 of the most frequently messaged correspondents.
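The logic of consolidating addresses and ranking correspondents can be sketched outside of ePADD as follows. The function names and the example addresses are hypothetical, and the mapping from address to person stands in for the manual decisions an archivist makes.

```python
from collections import Counter


def consolidate(address_counts, canonical_names):
    """Merge per-address message counts into per-person totals.

    address_counts: mapping of email address -> number of messages.
    canonical_names: archivist-built mapping of address -> preferred name.
    Addresses without a mapping are kept under their own lowercased address.
    """
    totals = Counter()
    for address, count in address_counts.items():
        key = address.strip().lower()
        totals[canonical_names.get(key, key)] += count
    return totals


def top_correspondents(totals, n=100):
    """Return the n most frequently messaged correspondents."""
    return totals.most_common(n)
```

For example, if `a.poet@example.org` and `apoet@example.com` are both mapped to "Poet, A.", their message counts are added together before the ranking is made, so the person appears once in the list rather than twice.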

For our final processing step, we screened image attachments for sensitive visual content. ePADD pulls the images out of the original email and makes them viewable by year, so this step is fairly easy. After we have conducted this general processing on all of the accounts, we might go back in (depending on time and need) and process on a more granular level. This overall approach has worked well for the smaller collections, and we plan to use the same roadmap for processing the last two large collections which will require multiple accessions because of the sheer volume of emails collected over the years.

The vast majority of email collections outside of traditional libraries and archives are uncatalogued, unfindable, and unusable, yet we know that access to data is a high priority, institutionally and culturally. There is rich and untapped scholarship in these collections. As more cultural organizations begin to preserve and process their email collections, there will be inevitable challenges. We hope that sharing our own challenges, our solutions, and our approach to processing will make the preservation and archiving of emails more accessible to other cultural organizations.

Read more about this case study in their May 9, 2023 blog: 92NY Unterberg Poetry Center Email Archives: A Case Study on Troubleshooting Email File Transfer for Processing.

Marian Clarke is the project archivist working on the email collections of the 92nd Street Y’s Unterberg Poetry Center. She was previously a digital archivist at the Frick Art Reference Library Archives and an audiovisual archivist at LaGuardia and Wagner Archives, CUNY. She holds an MA in media studies from the University of Texas and MLIS from Pratt Institute.

Sally DeBauche is a Digital Archivist in the Department of Special Collections at Stanford University Libraries. She is responsible for creating policy and workflows related to born digital archiving and processing born digital collections, with a particular focus on email. She also project managed the development of the ePADD software from 2020-2021 and consulted on the most recent cycle of development led by the University of Manchester and Harvard University. Sally received a BA in History from the University of Wisconsin-Madison and an MSIS from the University of Texas at Austin.

Welcome to the newest series on bloggERS, Tales from ChatGPT!

August 23, 2023 | bloggERS! Editors

Welcome to the newest series on bloggERS, Tales from ChatGPT. In the coming months, bloggERS will feature posts from digital archives professionals that will explore the question: How have you used ChatGPT or other AI models to help solve a problem or complete a task in your archival workflow? Have something you’d like to contribute? Send us an email!

Transkribus: A Practical Tool for Achieving Our Digital Archive Dreams

by Aaron O’Donovan

Creating accurate transcriptions of historical materials has been an issue for libraries and archives since humans have been recording thoughts, ideas, and events in written form. Certainly things are more efficient than they were, but no amount of crowdsourcing or volunteers is going to make the process anything that we would deem efficient.

The greatest leap we have made in the transcription world is the automation of typed text. We have efficient optical character recognition (OCR) software—big names in this space like ABBYY FineReader and the open-source Tesseract do a fine job at transcribing machine created text if the formatting in the document is consistent and the text is clear. OCR at its best is transformative because it makes the once unfindable accessible, and it leads to discoveries that were impossible even a decade ago.

The archivist’s dream, of course, is handwritten text recognition (HTR). This past year our organization was preparing for our 150th anniversary and we knew we would have to look through many handwritten ledgers to understand our early history. At the same time that I was investigating software for this purpose, we had other projects in our queue that could benefit from OCR and HTR: a large set of cemetery interment records and several handwritten diaries of a local circus that traveled to Australia in 1891-1892. I had heard about Transkribus as far back as its origins as a Horizon 2020 “READ” EU project. When the time came to experiment with the software, I signed up for a free account with Transkribus Lite at https://app.transkribus.eu/ and tested the software on many of the projects we were working on that could benefit from handwritten text recognition.

While Transkribus bills itself mostly as an AI powered training model, its base models provided a good start for the projects I was interested in transcribing. I knew going into the testing that I didn’t have enough time to train the software on 10,000 or more words, so I was in a sense at the mercy of the other work that researchers and archivists had done before me. The set of data (text) that the models had been trained with was so vast that I had a good base to work with before having to edit anything unique to the documents that were in my project. With 500 free pages of HTR I was able to get a good sense of what the “out of the box” models could do without training the model further on a particular kind of handwriting.

The first document I uploaded was the set of notes from the library’s first board meeting in 1873. While the cursive handwriting was neat, I wasn’t sure what to expect from the software, considering I had not trained it on the hand of the person who took notes at the board meeting. Surprisingly, the “out of the box” English Handwriting M3 model performed exceptionally well, with no layout help and no other training data being provided to the model. In sum, I counted a few errors per paragraph in the board minutes that I uploaded. Usually, the errors were confined to reading vowels incorrectly: mistaking an O for an A, for example. The travel diaries I uploaded had similar issues, mistaking the author’s Ws for Vs but transcribing most other words correctly. While Transkribus strives for 95-98% accuracy after training on a document, I believe most of us working in archives would be fine with 75% accuracy “out of the box” and editing as needed. While the available public models will never be perfect, for our projects they are better than anything that we had before, which often was nothing at all. We now have something we can work with and improve, and for that I am beyond grateful. With more time dedicated to training the software on those documents I know we would have very good results, but as it is now, I am quite happy with what we were able to achieve with minimal effort on my part.
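For readers who want to put a number on “a few errors per paragraph,” one common approach is to compare the machine transcription against a hand-corrected version character by character. The sketch below uses edit distance for this; it is an illustration of the general measure, not a description of how Transkribus calculates its own accuracy figures.

```python
def edit_distance(a, b):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Share of the corrected (reference) text that the model got right."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, hypothesis) / len(reference)
```

A model that reads “waves” as “vaves” makes one substitution in five characters, which works out to 80% character accuracy for that word.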

With transcription projects it is always hard to resist the urge for instant gratification, and as a society I think it is instilled in us that technology should just always work quickly, and it should always be error free. While I’m not certain the “out of the box” Transkribus software will get to that point soon, it still is a very intriguing piece of software that seems to be getting better with each new version. As I write this, Transkribus Super Models—capable of recognizing multiple languages as well as handwritten and printed text simultaneously—are set for release. In addition to the Super Models, Transkribus is releasing Field and Table models. These new models have been specifically designed to tackle complex layouts often found in historical documents. Field models will let you build your own unique models to label or extract things like headings, paragraphs, marginalia, initials, name, and date fields on index cards and many more. With Table models, you can train AI to recognize and replicate the row and column structure of your documents, making it easy to turn them into spreadsheets. With improvements like these happening so quickly it is hard to imagine what Transkribus will look like in 5-10 years. Could we one day see historic manuscripts transcribed accurately with the click of a button, never having to worry about training the document? If that is the dream, it looks like it is gradually becoming a reality.

Aaron O’Donovan is a Special Collections Manager at the Columbus Metropolitan Library. He helps manage the local history collection at the library, as well as helping to manage the My History project that provides digital images of the city of Columbus to the world. He has a BA in Sociology from the Ohio State University and an MLIS from Kent State University. He has appeared numerous times on WOSU’s Columbus Neighborhoods series, as well as providing countless hours of research and images for the show.
