“Straying Into This Wilderness” : A First Taste of TEI [0]

TEI for Actual Beginners

This semester, I am taking Digital Scholarly Editing, a new module created by Dr. Mark Sweetnam at Trinity College, Dublin. Eventually, I will produce a fully-encoded TEI project; however, step one is getting my feet wet with a practice exercise. Learning to encode demands balance between theory and practice, like learning to ride a bike. You need to know how to sit in the saddle, hold the handlebars, and push the pedals before you start, but no matter how hard you listen, advice like “brake slowly” just doesn’t mean anything until you are on the pavement with a skinned knee.[1] Initially, I spent a few weeks familiarizing myself with TEI and doing warm-up exercises. For those interested in getting started with TEI themselves, the next paragraph reviews the resources I used. If you want to read about TEI in practice, skip to the following section.

A little girl riding a pink Strider.

Of course, if you are creative about it, you can always skip a few steps. (Image courtesy of Joel Hagan/Wikimedia Commons.)

I began my foray into TEI with a hefty advantage: my course had already introduced me to XML.[2] But understanding how tagging works is just the beginning; TEI, while customizable for savvy users, does come with a strict set of rules. I started with A Very Gentle Introduction to the TEI Markup Language, which does a great job explaining concepts such as the separation of content (XML) and display (HTML/CSS). However, I found its references to building the TEI DTD[3] confusing and unnecessary for a greenhorn. Next, I tried reading the P5 Guidelines straight, but, like a straight shot of tequila, this proved too much too quickly for a newbie.[4] Finally, I began the TEI by Example tutorials, a wonderful beginner’s tool. Fair warning: the tutorials themselves are rather dull and use some off-puttingly gendered language, but I found I could still learn what I needed to while dividing my attention between the screen and a podcast.[5] I recommend skipping the quizzes, which are multiple-choice and rather nit-picky, and heading straight for the practice exercises, which will give you the hands-on experience you need to get started.

A shot of tequila and a tequila sunrise.

Learning to encode is like going on a bender in Mexico, apparently. (Image courtesy Antonio Cavallo/Wikimedia Commons.)

Choosing a good practice document

TEI by Example was useful, but the exercises are a training-wheels version of TEI. It was time to ride my own bike.[6] I’m a Woolf nut, so I decided to work with a well-documented object, Virginia Woolf’s To the Lighthouse, and specifically the holograph of the “Time Passes” section, which has been digitized and transcribed by the Woolf Online project. At random, I chose to work with page 180. When choosing a practice document, look for something both interesting enough to present a few challenges and simple enough that you can predict which questions will arise and know where to find answers. The challenges I sought out were encoding character names, proper nouns (book titles, place names), and evidence of intervention in the text, namely Woolf’s additions and deletions. After I finished that work, however, I realized it would be interesting and relatively simple to encode differences between the holograph and a print edition. Because it was easily searchable, I used Project Gutenberg Australia’s edition of To the Lighthouse. Project Gutenberg’s books have the distinct disadvantage of not following any particular print edition to the letter, so were I to continue with the work, I would choose one of the print editions available at Woolf Online instead.

Editorial choices

Before beginning to encode a document, you will need to decide which divisions to preserve or discard, and which artefacts to tag. I wanted to easily compare my TEI edition to the digital photographs of the holograph pages, so I used a numbered page break at the top of the document (<pb n="180"/>) and numbered line breaks within the document (<lb n="1"/>). I also wished to facilitate searches by character, even when a character is called by multiple names, so I used <name ref="#REF">. <name> tags are easy to add, so I encoded other proper nouns as well, such as places and book titles, without the @ref.
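To make the payoff of those choices concrete, here is a minimal sketch of a character search over an encoded fragment, using only Python's standard library. The sample sentence and the identifier `mrs_ramsay` are invented for illustration; they are not a transcription of the holograph and not necessarily the identifiers used in the project.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample text (assumption): NOT a transcription of the holograph.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
<pb n="180"/>
<lb n="1"/>So <name ref="#mrs_ramsay">Mrs. Ramsay</name> sat by the window;
<lb n="2"/>and <name ref="#mrs_ramsay">Mrs. R.</name> thought of the <name>Hebrides</name>.
</div>"""


def names_by_line(xml_text, ref):
    """Return (line number, text) for every <name> whose @ref equals `ref`."""
    root = ET.fromstring(xml_text)
    hits = []
    current_line = None
    # iter() walks the tree in document order, so the most recent <lb>
    # seen is the manuscript line the name sits on.
    for el in root.iter():
        if el.tag == f"{{{TEI_NS}}}lb":
            current_line = el.get("n")
        elif el.tag == f"{{{TEI_NS}}}name" and el.get("ref") == ref:
            hits.append((current_line, el.text))
    return hits


hits = names_by_line(SAMPLE, "#mrs_ramsay")
print(hits)  # [('1', 'Mrs. Ramsay'), ('2', 'Mrs. R.')]
```

Both spellings of the character come back from one query, each tied to its manuscript line, while the un-referenced place name is ignored.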

Adding the Project Gutenberg text led to more editorial choices. First, I was faced with the question of how to handle Woolf’s punctuation. Woolf is well-known for experimenting with commas, ellipses, dashes, and semicolons,[7] so I decided that I would record differences such as the addition or subtraction of a comma, or an ampersand versus “and,” even if they did not make a noticeable difference to meaning. To encode deviations between the texts, I used a <choice> tag, with <orig> referring to the holograph and <reg> referring to the print edition.[8] I also had to decide if I would encode <name>s in both editions. I decided against it—for this practice project, I imagined the holograph as the base text, and the Project Gutenberg edition as a piece of critical apparatus, not a document which itself would be analyzed by users.
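As a rough sketch of how a <choice> with <orig> and <reg> lets a reader switch between the two texts, the Python below flattens a fragment into either reading. The sentence is invented for illustration (it is not Woolf's), and the `render` helper is hypothetical, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

# Invented sample (assumption): an ampersand and a missing comma in the
# "holograph" reading, regularized in the "print" reading.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">The wind rose '
    "<choice><orig>&amp;</orig><reg>and</reg></choice> the door "
    "<choice><orig>swung</orig><reg>swung,</reg></choice> creaking.</p>"
)


def local(tag):
    """Strip the namespace from an ElementTree tag name."""
    return tag.split("}")[-1]


def render(el, keep):
    """Flatten `el` to plain text, keeping only the `keep` child
    ('orig' or 'reg') of every <choice> element."""
    parts = [el.text or ""]
    for child in el:
        if local(child.tag) == "choice":
            for reading in child:
                if local(reading.tag) == keep:
                    parts.append(render(reading, keep))
        else:
            parts.append(render(child, keep))
        # Text that follows the child belongs to both readings.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
holograph = render(root, "orig")
print_text = render(root, "reg")
print(holograph)   # The wind rose & the door swung creaking.
print(print_text)  # The wind rose and the door swung, creaking.
```

Note that the spaces sit outside each <choice>, so either reading drops into the sentence without doubled or missing spaces.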

Some of my encoding decisions stemmed from my own digital ethics. I am committed to open linked data and the semantically intelligent Web it creates, so I used @xml:base to connect entities to their DBpedia pages whenever possible. Additionally, I licensed my project with a Creative Commons 4.0 ShareAlike license in the TEI header. This license would not be practicable for a project I actually intended to launch because the Woolf documents are not under a Creative Commons license themselves, but CC licensing is a practice I support as it allows others to build upon previous work. I also believe that linked data has particular promise for the DH community; embracing DBpedia as a hub which connects disparate information sources across the Internet could help remedy the persistent disappearance of defunct projects, driven in part, I believe, by the challenges interested laypeople face in finding those projects.[9]
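For readers who have not met @xml:base, here is a minimal sketch of the idea: a base URI declared once, against which short relative references are resolved. The exact markup is an assumption (the post does not show how the attribute was placed), the sentence is invented, and the DBpedia resource names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Invented sample (assumption): one possible placement of xml:base,
# not necessarily the markup used in the project.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0" '
    'xml:base="http://dbpedia.org/resource/">He had left '
    '<name ref="Middlemarch">Middlemarch</name> behind in the '
    '<name ref="Hebrides">Hebrides</name>.</p>'
)


def resolve_refs(el, base=""):
    """Return (text, absolute URI) for each element carrying a relative @ref,
    resolving it against the nearest xml:base declared on an ancestor."""
    base = urljoin(base, el.get(XML_BASE, ""))
    found = []
    ref = el.get("ref")
    # Pointers beginning with '#' are treated as local IDs and skipped here.
    if ref and not ref.startswith("#"):
        found.append((el.text, urljoin(base, ref)))
    for child in el:
        found.extend(resolve_refs(child, base))
    return found


links = resolve_refs(ET.fromstring(SAMPLE))
for text, uri in links:
    print(text, "->", uri)
```

One thing to watch in a real document: once a base is in force, local pointers such as #REF are formally resolved against it too, so it may be safer to scope @xml:base narrowly.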

A woman and man working on a crossword puzzle.

Really I’m only in it for the crossword puzzle answers. (Image courtesy of Ed Yourdon, Wikimedia Commons.)

Challenges and solutions

Working with TEI presents familiar challenges. The TEI by Example validator turned up many small typographical errors, as one would expect. I was caught off-guard by the number of tags that require further tags as content, especially in the TEI Header[10]. I wrote the header last, and found it particularly difficult and time-consuming. Unlike the majority of TEI, header element names are non-obvious, and it can be challenging to know which elements you should and shouldn’t include. I recommend writing the most basic header first, then reading up on the other elements and determining which ones make sense for your project.
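As an illustration of what "the most basic header" can look like, the sketch below embeds a bare-bones <teiHeader> and checks that the three parts TEI requires inside <fileDesc> are present. The title and descriptions are placeholders, and the check is a toy written for this post's purposes, not a substitute for a real validator.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Placeholder content (assumption): a bare-bones header, with <p> standing
# in wherever a fuller description could later go.
MINIMAL_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Practice encoding: one holograph page</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished practice exercise.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Encoded from a digitized manuscript page.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""

REQUIRED = ["titleStmt", "publicationStmt", "sourceDesc"]


def missing_parts(header_xml):
    """List the required <fileDesc> children that are absent from a header."""
    root = ET.fromstring(header_xml)
    file_desc = root.find(f"{{{TEI_NS}}}fileDesc")
    if file_desc is None:
        return ["fileDesc"] + REQUIRED
    return [
        name for name in REQUIRED
        if file_desc.find(f"{{{TEI_NS}}}{name}") is None
    ]


print(missing_parts(MINIMAL_HEADER))  # []
```

Starting from something this small and adding elements one at a time makes it easier to see which addition a validator is objecting to.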

For my project in particular, properly accounting for the spaces between words took a surprising amount of thought: I needed to plan the spacing so that it worked with additions, deletions, the holograph text, and the Gutenberg text toggled on and off in any combination. How to link to DBpedia also wasn’t immediately obvious, but a half-hour’s searching eventually led me to discover @xml:base. Finally, I struggled with the best way to reference characters. A @ref="#REF" needs to point to a matching @xml:id="REF", but where should that go? I decided to take advantage of the <front> matter of a TEI text, and created a list of characters, each character name encased in a <name xml:id="REF"> tag.
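The pairing of @ref and @xml:id is easy to get subtly wrong, so here is a small sketch that lists any pointer with no matching target. The character list, identifiers, and sentence are invented for illustration, and wrapping the list in a <div> inside <front> is an assumption about the layout.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample (assumption): a character list in <front>, pointed to
# from the body. "#tansley" deliberately has no matching xml:id.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <front>
    <div>
      <list>
        <item><name xml:id="mrs_ramsay">Mrs. Ramsay</name></item>
        <item><name xml:id="lily">Lily Briscoe</name></item>
      </list>
    </div>
  </front>
  <body>
    <p><name ref="#mrs_ramsay">Mrs. Ramsay</name> spoke to
    <name ref="#lily">Lily</name> and then to
    <name ref="#tansley">Tansley</name>.</p>
  </body>
</text>"""


def dangling_refs(xml_text):
    """Return the sorted local pointers (@ref="#...") with no matching @xml:id."""
    root = ET.fromstring(xml_text)
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    refs = {
        el.get("ref")
        for el in root.iter(f"{{{TEI_NS}}}name")
        if el.get("ref", "").startswith("#")
    }
    # Drop the leading '#' before comparing against the declared IDs.
    return sorted(r for r in refs if r[1:] not in ids)


print(dangling_refs(SAMPLE))  # ['#tansley']
```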

Conclusion: Why bother encoding?

As a literature student just beginning to explore the world of digital humanities, I’ve worried about the utility to my own career of making digital editions for others’ use. However, I was pleased to discover that the granular approach to a text necessitated by encoding can yield unexpected discoveries. I haven’t before had cause to look at drafts of Woolf’s work, and I was interested to see the kinds of changes she made between the holograph and final text. Most notably, she added more references to named characters and to the presence of humans in the landscape. These changes could conceivably connect to her famous argument with Arnold Bennett over Edwardian versus Modernist character-building techniques,[11] and the Modernist interest in stream-of-consciousness and psychoanalysis in character creation.[12] The process of encoding suggests further work, which itself is made simpler by the existence of the encoded document.

[0] Yes, this is a quote from the text to which I will refer in this blog post. Just to be fickle, half was actually crossed out in the original holograph. If this were a TEI document, I would encode that intervention, but as it is a WordPress blog title, I will let it stand and therefore mislead anyone who doesn’t choose to read this excessively long footnote.

[1] Pretty good extended metaphor, right? You can borrow it for presentations and stuff if you like—I’m into Creative Commons licensing.

[2] The language TEI is written in. If you had any sort of blog in the noughties, you already know the basics, too: XML lets you make HTML-style start and end tags for whatever the heck you want, with attributes. So instead of using HTML to make your blog pretty (e.g. <font face="Arial" color="blue">), you use TEI tags to label your blog’s content (e.g. <p> for a paragraph).

[3] The DTD is the document which defines all those pesky TEI rules. Most users will need to know how to use them, not make them, so I recommend cutting out of this tutorial after the “introduction to computers and coding” bits.

[4] The search bar on this page, however, is fabulous for finding quick, helpful information about the proper use of a particular tag. Just search <tag_name>. Attributes are a bit trickier—if you search @attribute_name, you will get a page containing information about an entire class of attributes; your attribute should be somewhere on that page.

[5] Might I recommend Radiolab, 99% Invisible, or Freakonomics Radio for your multitasking pleasure?

[6] No more bike metaphors for the rest of this blog post, I pinky-swear-promise.

[7] See H.R. Woudhuysen, “Punctuation and its Contents: Virginia Woolf and Evelyn Waugh” (Essays in Criticism 62.3, July 2012) for a treatment of the subject.

[8] After the fact, I realized that this was an incorrect use of the tag. Future work will use <app> instead.

[9] Example: The Story of the Beautiful, a fantastic Peacock Room project, doesn’t show up until page three of a Google search. (It shows up on page one of DuckDuckGo, though … which is another good reason to move away from the “personalization bubble” that is Google.)

[10] The relatively neutral <p> tag is a lifesaver in many of these cases.

[11] See Woolf’s “Mr. Bennett and Mrs. Brown” (1923) and Arnold Bennett’s “Another Criticism of the New School” (1926).

[12] See E.M. Forster, “The Early Novels of Virginia Woolf” (1936), Elaine Showalter’s introduction to the Penguin Mrs. Dalloway (1992).

