Reflections on Collaborative Cataloging, phase I

29 April 2019

We established Neogranadina in 2015 in order to carry out large-scale but low-cost digitization projects in Colombian archives and libraries with the express intention of making the results freely available to everyone. This sort of work in Colombia had previously been limited to wealthy institutions or collections maintained by the national government, and was firmly out of the reach of smaller, regional archives. Because the most immediate obstacle to digitization was the often prohibitive cost of the equipment and technology involved, Juan Cobo developed low-cost document scanners based on open-source designs and software, and we installed them in two key institutions in Tunja and Popayán. The results were very positive — we began to create tens of thousands of images of manuscripts and early printed books each week, progressing at a rate greatly beyond our expectations — but this influx of materials overwhelmed our ability to process them and make them available to the public. It was also at this point that we became more acutely aware of other obstacles to digitization projects: the cost of storing thousands of gigabytes worth of images and the logistical problems of organizing and describing the digitized materials in order to publish them.

Through the use of our custom scanners, which Juan continued to develop and refine, we met our first objective of dramatically lowering the cost of projects of this kind in Colombia. But in the digital realm we ran into the same problem that many institutions face with their physical holdings: the lack of comprehensive catalogs and finding aids. The individual manuscript documents in the collections we digitize have usually been bound together with dozens of other documents into single volumes, and most of these volumes have no comprehensive catalog. A notable handful have detailed descriptions in the form of spreadsheets, a few more have ad hoc, if often incomplete, paper catalogs (some not updated since the 1920s), and many more have none at all. As a result, our hundreds of thousands of digital surrogates were as inaccessible as the paper originals. So, once again, we turned to design and technology to find a solution to the problem.

The first step of this process was to generate information about the volumes that we digitized. Juan created a spreadsheet that the scanner operators fill in as they digitize each volume. It records basic information about the volume, including its archival reference, the number of photographs taken, and any comments the operator has about its physical condition. This allows us to keep track of the volumes we are digitizing and to cross-reference them with the inventories of the archives. However, it did not get us any closer to cataloging the individual documents held within those volumes.
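As a rough illustration of that cross-referencing step, the sketch below compares a digitization log with an archive inventory. The column names, the sample rows, and the VOL-xxx references are all invented for the example and are not taken from our actual spreadsheet.

```python
import csv
import io

# Hypothetical digitization log: one row per volume, filled in by the operator.
LOG_CSV = """reference,photographs,condition_notes
VOL-001,412,Water damage on the first folios
VOL-002,388,
VOL-009,120,Loose binding
"""

# Hypothetical inventory of references supplied by the archive.
INVENTORY = ["VOL-001", "VOL-002", "VOL-003"]


def cross_reference(log_rows, inventory_refs):
    """Compare digitized volumes against the archive's inventory."""
    digitized = {row["reference"] for row in log_rows}
    inventory = set(inventory_refs)
    return {
        # In the inventory but not yet digitized.
        "pending": sorted(inventory - digitized),
        # Digitized but missing from the inventory (worth checking by hand).
        "not_in_inventory": sorted(digitized - inventory),
    }


rows = list(csv.DictReader(io.StringIO(LOG_CSV)))
result = cross_reference(rows, INVENTORY)
print(result)
```

A check like this only tells us which volumes exist as images. It says nothing about the documents inside them.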

The solution that Juan, Santiago Muñoz, and Natalie Cobo came up with between April and June 2015, while they were going through the administrative process of legally founding Neogranadina, was to crowdsource this metadata capture. Juan had followed the progress of earlier crowdsourcing efforts for a number of years, notably UCL’s Transcribe Bentham project. Our friends at the Medici Archive Project, meanwhile, had built tools that let users submit document transcriptions and other metadata to their digital archive platform, BIA, which we hoped to use as the basis for our own digital archive. Santiago had experience working with student interns in Colombian universities, among whom we found our first volunteers.

María José Afanador joined Neogranadina in October 2015 and took the lead in supervising our volunteers and finding new ones, a role she performed until March 2018. She also developed resources to help them, in the form of guidelines for cataloging and blogging. In this she had the help of team member María Alejandra Quintero and of Samir Pinzón, an excellent student paleographer whose work and input were invaluable in the early discussion stages of this project. Our junior researchers, Rafael Nieto and Andrés Jácome, later assisted María José in supervising the catalogers, and Andrés produced an introductory guide to the type of documents we were using for this project.

After a few months of experimentation, the Collaborative Cataloging project was fully launched in January 2017 and attracted dozens of volunteers from all across the world. María Alejandra and María José were active in using social media to attract new volunteers, and our academic colleagues were similarly important in encouraging students to participate. Presentations about the project were made by Natalie at the Max Planck Institute for European Legal History in March 2016, by Santiago at the International Federation of Public History conference in 2016, by María Alejandra at the IV Semana del Libro y la Lectura Digital 2016 in Bogotá, and by María José at digital humanities events in Bogotá in 2016 and 2017, at the DH2018 conference in Mexico, and, shortly after leaving Neogranadina, at the University of Virginia in 2019. All these presentations served to attract new catalogers and fostered connections with other people and institutions interested in this project.

Creating the tools for digital cataloging involved two technical aspects, both largely handled by Juan, who is the most proficient programmer on the team. The first was to devise a way of sharing the images with catalogers so that they could see entire volumes and scroll through them quickly, without compromising the security of the materials or allowing them to circulate widely before they were ready for publication. He did this by setting up a server running IIPImage, a high-performance image server that allows users to view high-resolution images quickly and without needing much bandwidth, processing power, or memory. He created pyramidal tiled images from our archival originals and displayed them through a JavaScript viewer, Diva.js, that allowed them to be embedded in practically any website. Users were then able to view and scroll through entire volumes very quickly and without needing to download hundreds of images to their devices. Each volunteer was assigned one volume to work on at a time.
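To give a sense of why pyramidal tiled images are so light on bandwidth, the sketch below counts the resolution levels and tiles for a single photograph and estimates how many tiles a viewer needs at any one moment. It is a generic illustration of the idea, not a description of IIPImage's internals. The 256-pixel tile size and the 4000 x 6000 pixel image are assumptions chosen for the example.

```python
import math


def pyramid_levels(width, height, tile_size=256):
    """List (width, height, tiles_across, tiles_down) for each level,
    from full resolution down to the first level that fits in one tile."""
    levels = []
    w, h = width, height
    while True:
        across = math.ceil(w / tile_size)
        down = math.ceil(h / tile_size)
        levels.append((w, h, across, down))
        if across == 1 and down == 1:
            break
        # Each level has half the resolution of the one before it.
        w = max(1, math.ceil(w / 2))
        h = max(1, math.ceil(h / 2))
    return levels


def tiles_for_viewport(view_width, view_height, tile_size=256):
    """Upper bound on the tiles needed to cover a viewport at one level
    (one extra row and column for tiles that are only partly visible)."""
    across = math.ceil(view_width / tile_size) + 1
    down = math.ceil(view_height / tile_size) + 1
    return across * down


levels = pyramid_levels(4000, 6000)
total_tiles = sum(across * down for _, _, across, down in levels)
print(len(levels), "levels,", total_tiles, "tiles in the whole pyramid")
print(tiles_for_viewport(1024, 768), "tiles at most for a 1024 x 768 view")
```

Under these assumptions the pyramid holds several hundred tiles, but a viewer only ever requests the couple of dozen that cover the current view at the current zoom level.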

The second aspect was related to data capture. We wanted to open the cataloging to as many volunteers as possible but, given the difficulty of reading material from this period, we knew that what each individual could contribute would be limited by their paleographic skills. Catalogers could therefore engage in the process at different levels. At the most basic level, a cataloger could help with quality control by correlating the image number of each photograph with the folio number of the volume and noting whether images were blurry or duplicated. Those who were more skilled could also identify individual documents and record their title and corresponding folios, as well as write a brief description of the document. In late 2016, María Camila Salcedo, a specialist in developing tools for tracking and managing volunteers in non-profit organizations, helped us develop the platform we use for this purpose, which combines Google Forms and Google Sheets. This system is rather basic and reflects the financial limitations of the foundation. However, through the generosity of the UCHRI grant, we have been able to employ a programmer to create a more robust cataloging system, which will launch later this year and initiate phase II of collaborative cataloging, to be discussed in a later blog post.
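The basic quality-control level described above can be sketched in a few lines. The record structure and field names below are hypothetical and do not reflect the actual layout of our forms. The sketch simply shows how image-to-folio correlations could be checked for duplicated folios, blurry images, and images nobody has reviewed yet.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ImageCheck:
    """One cataloger observation about one photograph (hypothetical fields)."""
    image_number: int     # position of the photograph within the volume
    folio: str            # folio label read from the page, e.g. "12r"
    blurry: bool = False  # flagged by the cataloger


def quality_report(checks, expected_images):
    """Summarize the problems found in a volume with `expected_images` photographs."""
    images_by_folio = defaultdict(list)
    for check in checks:
        images_by_folio[check.folio].append(check.image_number)
    reviewed = {check.image_number for check in checks}
    return {
        # Folios photographed more than once.
        "duplicated_folios": {
            folio: sorted(numbers)
            for folio, numbers in images_by_folio.items()
            if len(numbers) > 1
        },
        "blurry_images": sorted(c.image_number for c in checks if c.blurry),
        # Photographs that no cataloger has looked at yet.
        "unreviewed_images": sorted(set(range(1, expected_images + 1)) - reviewed),
    }


checks = [
    ImageCheck(1, "1r"),
    ImageCheck(2, "1v"),
    ImageCheck(3, "1v", blurry=True),
    ImageCheck(5, "2v"),
]
report = quality_report(checks, expected_images=5)
print(report)
```

Keeping this level separate from document description means that someone with no paleographic training can still make a useful contribution.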

Crowdsourcing our cataloging has several benefits. First and foremost, it allows many more people to tackle this challenge than work for the institutions we collaborate with or for Neogranadina. Many hands make light work, and being able to divide this task among dozens of volunteers made it much more manageable without increasing the cost. Catalogers need not travel to these archives, and several people can work on the same volume at once.

Second, this proved to be an amazing pedagogical opportunity. Having access to thousands of images of early modern materials of different kinds gives instructors an invaluable resource with which to train new generations of scholars in the archival skills they need to work with them. Some of our volunteers organized reading groups and classes, and used the materials to train new readers. Colleagues who teach at universities have also been using them with their students, and Juan and Santiago have relied on them to train their own students in paleography.

Perhaps most encouragingly, the ultimate result has been the creation of a community of scholars, students, and members of the general public who are interested in contributing to the preservation and dissemination of these materials. We think this community is going to be crucial to ensuring the long-term success of our efforts as we go forward. After all, there is little point in maintaining a digital archive if no one is interested in using it.

While we have been delighted with the results of our efforts so far, we have also realized we have a very long way to go. We have only processed a tiny fraction of the materials we have digitized, and the pace of digitization far outstrips that of cataloging and even quality control. We are still looking to recruit volunteers for this process, especially as we transition into phase II, so if you are interested or would like to find out more, please visit our website.

We have faced a number of practical challenges, from the technical, such as which standard for archival description to use, to the human, such as how to keep dozens of people around the world motivated and interested. The latter is discussed in greater detail in a piece by María José about her experience of supervising the volunteers, which is in the process of being published. The technical challenges have been discussed above, and we are optimistic that the opening of phase II and the dedicated cataloging platform will greatly reduce them.

Natalie Cobo

Neogranadina