English professor develops virtual Open Corpus Project
Prof. Dorothy Kim (ENG) presented her progress on developing the Open Corpus Project to the Brandeis community.
Prof. Dorothy Kim (ENG) is developing a virtual corpus, or collection of written texts, of Early Middle English. The corpus would give researchers the opportunity to search across multiple archives and databases of manuscripts. The current status of the Open Corpus Project, as the site is titled, was unveiled at a Faculty Lunch Symposium on Thursday, March 17.
Kim presented her project in three sections: “the larger state of corpus projects,” “digital editing and property entity,” and an “environmental scan” of the pros and cons of existing corpus projects.
She began by explaining how online corpora shift away from the “print imaginary” and “print culture” into a digital realm of accessible and researchable media. The Open Corpus Project will be a centralized platform designed with this idea in mind, allowing users to layer and filter searches and to explore linked resources, manuscript information, and annotations, Kim said. Most importantly, however, the project will have a set of editorial standards that each manuscript submission must adhere to.
There are many existing corpora for Early Middle English and other languages, but each one has a different set of pros and cons, Kim explained. First, she analyzed the TEAMS: Middle English Texts Series designed through the University of Rochester. This HTML platform lacks searchability, linked data, and an editorial standard, Kim said, which makes the database difficult to navigate and its manuscripts difficult to analyze. “It is terrible for corpus linguists,” she added.
Second, Kim discussed the Piers Plowman Electronic Archive, which she said is “stuck in the print imaginary.” She said the PPEA also lacks key elements, describing it as “literally the most infuriating, unimaginative digital visual project ever.” The biggest issue with this platform, she said, is regulating editorial standards: individual editors can edit manuscripts in any way they choose, and they edit only the portions of the manuscripts published on PPEA rather than the entirety of each document.
Third, Digital Mappa is a platform designed to give editors a virtual workspace in which to develop their own corpora and other publications. However, Kim explained that Digital Mappa is “not scalable.” While one may be able to produce a short book using this software, she said that an entire corpus would be impossible since editors are limited to using 100 images.
Fourth, Parker Library on the Web, designed through Stanford University, is both “stuck in the print imaginary and the physical library imaginary,” Kim said. Although researchers can view many manuscripts within this corpus, searchability remains difficult. Kim explained that the site is organized like a library catalogue database.
This means that users can browse documents and retrieve bibliographical information on them, but in order to do more in-depth research, they would already need very specific knowledge of the manuscripts they are interested in locating.
Moreover, even once a user is viewing a manuscript, Parker Library on the Web offers no linked data to other virtual archives, Kim added. She did say, however, that while this platform has its issues, it is “a best practices example of a digitized manuscript archive” compared with the other corpora she analyzed.
Kim went on to discuss her ideas for the Open Corpus Project and how it would be a better platform for manuscript editors, researchers, linguists, and students alike. She explained that the design for the Open Corpus Project will be based mainly on a digital platform called Open Context, an open access archeological database. She said that Open Context has a landing page with a map, clickable links, and search filters; search results are presented in an organized list so that documents are easy to view and further searches can be run from the results. In order to develop the Open Corpus Project in a similar manner, Kim is partnering with Geocene, an engineering consultancy.
The goal for the Open Corpus Project is to develop a corpus platform that publishes Early Middle English manuscripts that are edited according to a set of specific standards, presented in an accessible manner, and linked to multiple other virtual corpora and archives for in-depth research, Kim explained. As with Open Context, she said she would like the site to have a landing page with refined search abilities and a structured list of available manuscripts, as well as the ability to download data.
Moreover, the Open Corpus Project would have built-in analytical tools. Kim said her hope is that each manuscript will be edited to have three different viewing options in order to make research more accessible to all types of users: a document edition, a critical edition, and a student reading edition. She explained that the document edition would be a manuscript that remains close to the original text, the critical edition would contain commentary and annotations of the text, and the student reading edition would be written with normalized spelling so that it is simpler for the general public to understand. The website would also be developed so that users can simply change the interface to display their desired edition, Kim added.
Kim also emphasized that the key element of the Open Corpus Project will be a regulated set of editorial standards. However, she said that developing these guidelines will take time as there is much to consider, particularly in terms of standardizing the vocabulary between texts.
Kim said that the second phase for the Open Corpus Project, once developed for Early Middle English, will be to add manuscripts of other early languages, and the third phase will be to add images, charts, and music.