Digital Dostoevsky has had a very busy spring, with several significant project developments. First, we worked with a developer, Simon Wiles, to automate the tagging of our remaining corpus of five novels and one novella for structure as well as speech and names. Simon helped us to add empty TEI headers, <p> tags to mark paragraphs, and three layers of <div> tags to mark the parts, chapters, and sections of our novels. He also created an automatic tagger that added speech tags with empty attributes such as @who, @toWhom, @direct, and @aloud, which we would populate manually at a later date. Automating speech tagging proved fiendishly difficult thanks to Russian speech-marking conventions, which we will explain in more detail in due course. He then used the named entity recognition (NER) component of the spaCy NLP library to predict person and place names within the text, and used those predictions to generate onomastica, or name lists, whose entries could then be tagged in the text with empty <persName> tags. Though some of the code proved tricky, autotagging saved a huge amount of time and helped us to prepare our texts for our exciting May schedule.
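To give a flavour of the name-tagging step, here is a minimal sketch in Python of how a name list might be applied to a passage of text. It is not Simon's actual code: the function `tag_names` and the sample names are hypothetical, and it assumes the onomasticon already contains every inflected form that should be matched, since Russian names change their endings by case.

```python
import re
from xml.sax.saxutils import escape


def tag_names(text, names):
    """Wrap each listed name found in `text` in a bare <persName> element.

    Hypothetical helper: `names` is assumed to be a list of exact surface
    forms (inflected forms must be listed separately).
    """
    if not names:
        return escape(text)
    # Try longer names first so a full name wins over a first name alone.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Only match whole words, so a short name is not tagged inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

    pieces = []
    position = 0
    for match in pattern.finditer(text):
        # Escape the untouched text so the result stays well-formed XML.
        pieces.append(escape(text[position:match.start()]))
        pieces.append("<persName>" + escape(match.group(0)) + "</persName>")
        position = match.end()
    pieces.append(escape(text[position:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(tag_names("Алёша и Иван вошли.", ["Алёша", "Иван"]))
```

The <persName> elements are left without attributes on purpose, so that a human encoder can later decide which character each name refers to.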
In May, we were chosen as one of the projects for the wonderful Jackman Humanities Centre Scholars in Residence program at the University of Toronto, in which undergraduate students are paired with faculty research projects that fit their interests. Student RAs work with faculty PIs on a project for three and a half hours every morning for four weeks in May. Digital Dostoevsky took on six wonderful undergraduate RAs from all three of our campuses: St. George, Mississauga, and Scarborough. We required our undergraduate RAs either to have at least three years of Russian study or to be native or heritage speakers. We ended up with an impressively multinational team of students, with two Ukrainians, two Russians, and two North Americans: Anastasiya Gordiychuk, Dmytro Ishchenko, Nadezhda Ivanova, Elijah Sciborowski, Veronika Sizova, and Eden Zorne. We spent the first few days training the students in TEI-XML and acquainting them with our corpus, and by the end of the first week they were ready to go. Each student took a different text and worked on it for the rest of the month. We met every morning on Zoom, together with our whole team. Students worked individually in Zoom breakout rooms, and when they came across TEI problems or coding quandaries they either messaged us on Slack or we joined them in their breakout rooms to discuss the peculiar Dostoevskian issues that arose, such as how to differentiate hierarchies of monks in The Brothers Karamazov, how to code speakers who exist only in characters’ imaginations, and how to add fictional and historical characters to our reference lists. The students gave a final presentation at the end of the last week, and it was impressive how much they had learned. We’re hoping to keep the students on the project as we move forward, and to publish some of their experiences in forthcoming blog entries.
Also in May, Katia and I took part (via Zoom) in the final workshop of the NEH-funded Princeton New Languages for NLP institute. We had been working on this project since the previous June, and in mid-May we gave our final presentation, “Challenges in the Development of NLP for New Languages: A (19th-Century) Russian Case Study,” on a panel with colleagues working in Kannada and Quechua! Suffice it to say that while we began by reinventing the wheel, we ended up with a much more ambitious plan dealing with temporality in Dostoevsky’s novels, and created a model which foundered on the methodological rocks. We will be publishing more about this project here soon!
And finally, in June the team attended the Digital Humanities Summer Institute (DHSI) online for the second year running, where we retook the course “Processing XML and TEI into What?”, a crash course in XPath, XQuery, and XSLT taught by the wonderful Elisa Beshero-Bondar. These are related languages for querying and transforming XML, and so for processing TEI-encoded documents and corpora. This year the course focused mainly on XPath and XSLT, and on the last day Elisa showed us how to process our encoding of speakers in Notes from Underground into a TSV file, which she then turned into a network graph using the Cytoscape platform.
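For readers curious about what such a speaker table looks like, here is a rough sketch of the same idea in Python rather than the XSLT used in the course. It is an illustration under stated assumptions, not Elisa's method: it assumes that speech is encoded in TEI <said> elements carrying @who and @toWhom, and the function names and character identifiers below are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# The standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def speech_edges(tei_xml):
    """Count how often each speaker addresses each addressee.

    Assumes speech is marked with <said who="#a" toWhom="#b">; speeches
    missing either attribute are skipped.
    """
    root = ET.fromstring(tei_xml)
    edges = Counter()
    for said in root.iterfind(".//tei:said", TEI_NS):
        who = said.get("who")
        to_whom = said.get("toWhom")
        if who and to_whom:
            # Drop the leading "#" of the pointer to get the bare identifier.
            edges[(who.lstrip("#"), to_whom.lstrip("#"))] += 1
    return edges


def edges_to_tsv(edges):
    """Serialise the edge counts as tab-separated source/target/weight rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


# Hypothetical sample; the identifiers are invented for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><said who="#underground_man" toWhom="#liza">...</said></p>
<p><said who="#liza" toWhom="#underground_man">...</said></p>
<p><said who="#underground_man" toWhom="#liza">...</said></p>
<p><said who="#underground_man">...</said></p>
</body></text></TEI>"""

if __name__ == "__main__":
    print(edges_to_tsv(speech_edges(SAMPLE)))
```

Each row is one directed edge from speaker to addressee, weighted by the number of speeches, which is the kind of delimited table that a network tool such as Cytoscape can import.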
Our aim is to write more focused blog posts on each of these aspects of our project, so stay tuned for more Digital Dostoevsky coming your way soon!