To create a digital humanities project, one must have data to analyze in a digital and machine-readable format. Yet much of the data used to create DH projects is not born-digital, but originates as physical objects, such as books, photos, and archival records. How does an item you can hold become something that can be analyzed by a computer? What is lost and gained in digitizing a physical object?
Digitization goes far beyond simply scanning: the item needs to be encoded, organized, and preserved in a collection to be discoverable. Digital collections, such as digital libraries, archives, and galleries, are typically created by cultural heritage and academic institutions. They are available on major platforms like Project Gutenberg and the Internet Archive, as well as on institutional websites like the UNBC Northern BC Digital Collection. These collections mostly consist of works in the public domain, which are free of copyright restrictions and legally distributable. With millions of items freely accessible to anyone with internet access, digital libraries and archives appear to be a utopian solution to the problem of inequitable access to cultural material, while providing a seemingly unlimited corpus for analysis.
However, digital collections are created by humans, who choose what to digitize, how to describe and encode it, and how to preserve it. What issues could arise when the work of preserving cultural memory is done almost exclusively by a select group in large, established institutions? Whose stories may be excluded from the digital cultural record, and what are the ethics of inclusion without consent? Long-term preservation of digital content also poses a significant challenge, as it requires infrastructure and ongoing funding to maintain. With much of the cultural record living not just on shelves but on servers, how do we ensure that digitized and born-digital content alike do not fall victim to digital decay and remain accessible in years to come?
Getting research materials into a digital form that you can search and computationally analyze can be a time-consuming first step in the research process. Converting documents, text, images, and sound files to digital and machine-readable formats is a prerequisite for many digital humanities projects. Digitization is the process of capturing analog materials as digital images. Optical Character Recognition (OCR) programs “read” these images and convert them to text documents, which can be easily searched, copied, edited, or used for computational text analysis.
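To make "machine-readable" concrete, here is a minimal Python sketch of what becomes possible once OCR has turned a scanned page into text. It is only an illustration, not a prescribed workflow: the sample text is invented, and the helper names `clean_ocr_text` and `word_counts` are hypothetical. It assumes the OCR output is already available as a plain string and uses only the Python standard library.

```python
import re
from collections import Counter

# Hypothetical OCR output (invented for illustration). It shows two common
# quirks of scanned text: a word hyphenated across a line break and
# uneven spacing.
ocr_text = """The archive holds letters, photo-
graphs, and maps.  The letters describe
daily life in the region."""


def clean_ocr_text(text: str) -> str:
    """Rejoin words split by a hyphen at a line break, then collapse whitespace."""
    # Caveat: this also joins genuine compounds that happen to break at
    # a line end (e.g. "well-" / "known"), so review the results by hand.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


cleaned = clean_ocr_text(ocr_text)
counts = word_counts(cleaned)

print(cleaned)                                      # the tidied text
print("Mentions 'letters':", "letters" in cleaned)  # a simple search
print(counts.most_common(3))                        # a simple frequency count
```

Searching and counting words like this is not possible with a page image alone, which is why the quality of the OCR step matters for everything that follows it.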
- This week, explore some OCR software for your computer or mobile device. There are many blogs and articles that review popular options; try a few.
- Find a document that you can scan with that software to digitize a page of text. The text can come from any magazine, newspaper, etc.
- Save that text to a PDF file and share it on your Omeka installation.
- Once that is done, create a post on your website describing the software or app you used for OCR. What worked well? What didn’t? Did you learn something new or useful during the process? Is this something you might use in your other classes or studies?