Encoding Dostoevsky

One of the central aspects of our methodology in the project at present is encoding Dostoevsky’s novels. Here “encoding” means tagging using XML tags (XML = extendable markup language) following the TEI guidelines (TEI = Text Encoding Initiative). The TEI has created a massive, thousands of pages long guide to best practices in tagging and there are tags for all kinds of different aspects of literary and other texts. Want to tag a poem? TEI has got you. Want to tag a play? TEI has got you. Want to tag a bunch of Dostoevsky novels? TEI has got you.

When a computer reads any text, all it gets is a bunch of bits. Bits are the numbers (0 or 1) behind each letter, other number, special character, or punctuation mark. Tagging a text allows the computer to identify some aspects of what it is reading. So, you can tag paragraph breaks, speech, character names, locations. You can go as deep with this tagging as you want to. Want to tag all the nouns? Or verbs? Totally doable. It just depends on how much time and patience (and eyesight) you have to do this kind of encoding. The place to start with coding is your research questions. What needs to be tagged to get at the questions you’re interested in asking?

Team Digital Dostoevsky set out to tag The Double first. It is the shortest novel of those in our original corpus and the earliest, so it seemed like a logical place to start. We even thought it would be a good way to ease into encoding. This is hilarious to think of now. We started out by building a header that included all of the information about the project and the text: our names, Dostoevsky’s name, the novel’s name, the date of publication, the date we were tagging it. This seemed pretty straightforward.

Then we added in tags to designate the structure of the book: chapter and paragraph breaks. So far, so good.

Then we decided to go through and tag characters and speech. We cheerfully divided the novel into four sections, each claimed one, started trying to tag… and hit our first wall. The thing about tagging in this way is that it is not a neutral exercise. TEI is, in many cases, interpretive. And, while in many cases something as common as speech is straightforward (someone says something that is delineated in quote marks to someone else), we quickly ran into query after query. What if a character is talking to himself? What if the narrator is talking as a character? What if it isn’t clear if a character is saying something aloud or not? And these questions kept piling on. Tagging our first novel opened up an analytical and interpretive can of worms that we absolutely did not expect. We thought the analysis would come after the tagging, but it began during the tagging and we started having conversations about the minutia of the novel, that quickly spiraled into the bigger existential and philosophical questions at the heart of the work. Who knew such basic analytical questions as “who said that and to whom?” could bring so much up?

This is when we decided to begin this blog, as a way of introducing the project and also bringing you along with us as we grapple with and address these questions.

3 thoughts on “Encoding Dostoevsky”

Pingback: Tagging Speech in Dvoinik | Digital Dostoevsky
Pingback: A name or not a name? Tagging names in Dostoevsky | Digital Dostoevsky
Dave Douglass

January 12, 2025 at 1:38 am

Excellent to see you’ve been doing this work!

Have you looked at high-power tools to help (Large-Language Models , Large-Concept Models, Wolfram Language…)?

My background is in computational linguistics, machine translation specifically. I’m “retired” (and busier than ever, volunteering for some nonprofits), so I have lots of time for challenging work.

Have any of you looked at Mikhael Bakhtin’s approach to DIALOG analysis? He and his students (including many teaching today) have worked up a powerful, if under-formalized, framework for analyzing the “voices” in texts – who is REALLY talking to WHO? – which are “inside” and which “outside” the speaker — and so on.

Would you consider starting the analysis with an even-shorter piece? I suggest “Bobok” as a text to work on. Bakhtin did a much more thorough job on it (in “Problems of Dostoevsky’s Poetics”) than he did on ANY of FD’s novels.

I’ve been working with UD (Universal Dependencies) as a low-level markup (words, lemmas, morphology…). TEI has a way to integrate these tags into TEI document, but this seems to be under-utilized so far.

(I don’t know Russian. Sigh!)

Let’s coordinate! In the meantime, I’ll look at your work so far.
I’m in the East Asia – China/Singapore/Manila – time zone.

LikeLike