What is Digital Dostoevsky?
Digital Dostoevsky is a computational text analysis project on a corpus of five novels and two novellas by Fyodor Dostoevsky. It is a digital humanities project that emerges from our long-standing interest in traditional philological analysis. We are excited by how digital approaches such as TEI encoding, machine reading, and natural language processing can help to answer questions about the deep structure of Dostoevsky’s novels: questions about speech, character, space, temporality, affect, and fictionality, among other areas. The project is hosted at the University of Toronto and supported by an Insight Grant from the Social Sciences and Humanities Research Council of Canada.
Background
Computational text analysis has flourished in the last few years, and many 19th-century writers now have their own digital editions and digital archives. In the Russian context, computational text analysis seems like a natural fit, since Russian scholarship has a long tradition of textology; academic editions of canonical Russian works were produced with painstaking care by teams of editors throughout the Soviet period and beyond. Russia also has a strong tradition of computational methods in linguistics. The research questions that motivate our project are the same ones that scholars have been asking about Dostoevsky’s works for decades. Machine reading opens up possibilities for examining Dostoevsky’s corpus using technologies that neither the Formalists nor Bakhtin had at their disposal.
Dostoevsky’s works are already available online. There is a wonderful digital edition of Dostoevsky’s Complete Works based at Petrozavodsk State University in Karelia. This edition includes a digital concordance that can be used to parse the corpus. The academic Complete Works of Dostoevsky (both the 1972-1990 Soviet Academy of Sciences edition and the more recent Russian Academy of Sciences edition that is still being created) is also online at the Russian Academy of Sciences (Pushkin House) and elsewhere. One aim of the Digital Dostoevsky project is to create a digital edition of Dostoevsky’s works that prepares the ground for scholars beginning to work with computational methods. In addition to our analysis of the corpus, we hope that this project will serve as a resource for future projects like it.
Our corpus
Our plain text corpus documents are taken from the canonical Soviet Academy of Sciences 30-volume edition of the Complete Works of Dostoevsky. We stripped the texts of their commentary and converted them to plain text files. So far, our corpus consists of five novels and two novellas: The Double, Notes from Underground, Crime and Punishment, The Idiot, Demons, The Adolescent, and The Brothers Karamazov. We may eventually add the rest of Dostoevsky’s works, as well as translations into English and possibly French.
Our encoding
We are in the process of XML tagging our corpus using TEI. So far, we’ve manually tagged The Double (Dvoinik). We started with basic TEI tagging (paragraphs, speech, named entities), and have moved on to places, direct and indirect speech, addressers and addressees, and liminal spaces and states. You can see our tagged Dvoinik file in our GitHub repo. All of our working files will be publicly available on our Digital Dostoevsky project GitHub moving forward. In the coming months, we plan to automate more of our tagging using Oxygen XML Editor.
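To give a sense of what tagged text makes possible, here is a minimal sketch in Python of reading a TEI fragment and counting speech by speaker. It is an illustration under stated assumptions, not our actual workflow: the sample fragment is invented, the speaker identifiers (#golyadkin, #petrushka) and the function names are hypothetical, and our own files may use different elements or attributes than the standard TEI `said`, `persName`, and `placeName` shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment for illustration only; not taken from the project's files.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><said who="#golyadkin" direct="true">Good morning, <persName ref="#petrushka">Petrushka</persName>.</said></p>
    <p><persName ref="#golyadkin">Golyadkin</persName> walked to <placeName>Nevsky Prospect</placeName>.</p>
    <p><said who="#petrushka" direct="true">Good morning.</said></p>
    <p><said who="#golyadkin" direct="false">He said that he would be late.</said></p>
  </body></text>
</TEI>"""


def speech_counts(xml_text):
    """Count <said> elements per (speaker, mode) pair."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for said in root.iterfind(".//tei:said", TEI_NS):
        speaker = said.get("who", "unknown")
        # Assumption: a missing direct attribute is treated as direct speech.
        mode = "direct" if said.get("direct", "true") == "true" else "indirect"
        counts[(speaker, mode)] += 1
    return counts


def place_names(xml_text):
    """Return the text content of every <placeName> element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(place.itertext()).strip()
            for place in root.iterfind(".//tei:placeName", TEI_NS)]


if __name__ == "__main__":
    for (speaker, mode), n in sorted(speech_counts(SAMPLE).items()):
        print(speaker, mode, n)
    print(place_names(SAMPLE))
```

The same pattern extends to any other tagged category: once a feature is marked up consistently, it can be queried across a whole novel rather than counted by hand.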
Future plans
We are also exploring other computational methods beyond TEI tagging. We’re part of an NEH-funded institute based at Princeton, New Languages for NLP, which will help us to use natural language processing to build models to analyze Dostoevsky’s novels from the perspective of named entity recognition, named entity disambiguation, and other methodologies. Stay tuned for more!