Assembling a Corpus

Our first task was to locate electronic versions of the first Harry Potter book, Harry Potter and the Sorcerer's Stone, in French, Russian, and Swedish. This was relatively easy for French and Russian, since those languages have far more speakers than Swedish, but we did eventually locate the text in all three languages. We then decided that, for the purposes of our research, tagging and analyzing the entire book would produce a corpus too large for us to address our research question effectively. We initially chose to analyze the first and last chapters and to identify neologisms created by J. K. Rowling (such as Quidditch and Expelliarmus), but we found few neologisms in these chapters. We added the eighth chapter as well, but even with a third chapter our list of true neologisms was still short. As a group, we decided to widen our scope. We concluded that although neologisms exist in the Harry Potter world, the first book (which is largely an introduction to the universe Rowling created) does not contain enough of them to address our research question. We therefore defined a new unit of analysis, the “Rowlingism.” A Rowlingism is a creation of J. K. Rowling’s that is specific to the Harry Potter world, such as certain spells, classes, houses, families, characters, foods, and places. We then assembled our three chapters into a separate XML document for each language to begin our markup process.

Structural Markup

First and foremost, we needed to develop tagging guidelines so that all three documents would follow the same structure. We established <HarryPotter> as the root element, subdivided into <chapter> elements containing <para> elements, which in turn contain <sentence> elements, which contain <word> elements. Quoted speech was tagged within sentences as <q>. Each <chapter> tag carries the attribute “title.” We tagged paragraphs, sentences, and words using regular expressions. Our regex worked for the most part, but we did run into some issues. In Swedish, it did not account for the three additional letters in the Swedish alphabet: ö, ä, and å. It split complete words in two because it treated these letters as word boundaries. For example, the word utanför, meaning “outside,” was tagged as <word>utanf</word><word>ör</word>. This was easy to fix using search and replace. In the Russian text, our regex for tagging words treated hyphenated words such as что-то as two words rather than one. This was also fixed using search and replace. With the bare bones of our markup established, we were able to move forward with more in-depth tagging.
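
The two word-tagging problems described above can be avoided at the regex level. The following is a minimal sketch in Python, not the expression we actually used (which is not reproduced here). The pattern, the function name tag_words, and the Unicode normalization step are all assumptions made for illustration. It relies on Python's Unicode-aware character classes so that ö, ä, å and Cyrillic letters count as letters, and it allows hyphens inside a word.

```python
import re
import unicodedata

# Hypothetical pattern: one or more letters, optionally followed by
# hyphen-joined groups of letters. [^\W\d_] means "a word character that
# is neither a digit nor an underscore", i.e. a letter in any script.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def tag_words(sentence: str) -> str:
    """Wrap each word in <word> tags, leaving spaces and punctuation as-is."""
    # Normalize to NFC so that a letter stored as "o" plus a combining
    # diaeresis becomes the single character "ö" before matching.
    normalized = unicodedata.normalize("NFC", sentence)
    return WORD.sub(lambda m: f"<word>{m.group(0)}</word>", normalized)

print(tag_words("Harry stod utanför."))
print(tag_words("что-то"))
```

Under these assumptions, utanför and что-то each come out as a single <word> element, so no search-and-replace cleanup is needed afterwards. The sketch does not escape characters such as & or <, which would need handling before the output is saved as XML.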

Contextual Markup

Our next step was to identify all Rowlingisms within the text and tag them properly. Initially, we tagged some Rowlingisms as <Rowl> with the attributes other, character, noun, or place. Others were tagged as <other>, and some nouns that are Rowlingisms were tagged as <noun> with the attributes object, creature, activity, house, housePersons, potion, spell, class, or edible. Some places ended up being tagged as <place> with the attributes magical or normal, and some characters ended up being tagged as <character>, also with the attributes magical or normal. We quickly realized that this initial contextual markup was not uniform across the three documents, so we decided to replace it with a simpler, uniform scheme. Because Rowlingisms were variously tagged as <Rowl> and as elements such as <noun>, <place>, <other>, and <character>, we decided that everything would be tagged as <Rowl>, with attributes stating its function. Instead of <Rowl character="">, we gave characters xml:ids to make it easier to search for characters in later analysis. Thus the <Rowl> element could carry the attributes id (xml:id), house, housepersons, family, class, spell, adj, noun, or place. Once we had a clean, functional schema for our contextual markup, we were able to quickly edit our three documents into a uniform shape, which facilitates further analysis and the use of other technologies that draw on our XML documents.
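
To illustrate how the unified <Rowl> scheme supports later searching, here is a small Python sketch using the standard library's ElementTree. The sample fragment is invented for illustration: the sentences, the attribute values ("magical", "activity"), the id values, and the placement of <Rowl> inside <word> are all assumptions, not excerpts from our documents. Because xml:id values must be unique within a document, and the handling of repeated mentions of one character is not described above, the sample uses each id only once.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment following the structure described above.
SAMPLE = """\
<HarryPotter>
  <chapter title="Sample chapter">
    <para>
      <sentence>
        <word><Rowl xml:id="harry">Harry</Rowl></word>
        <word>reached</word>
        <word><Rowl place="magical">Hogwarts</Rowl></word>
      </sentence>
      <sentence>
        <word><Rowl xml:id="hagrid">Hagrid</Rowl></word>
        <word>mentioned</word>
        <word><Rowl noun="activity">Quidditch</Rowl></word>
      </sentence>
    </para>
  </chapter>
</HarryPotter>
"""

def summarize(xml_text):
    """Return (characters by xml:id, count of Rowlingisms per category)."""
    root = ET.fromstring(xml_text)
    characters = {}
    categories = Counter()
    for rowl in root.iter("Rowl"):
        char_id = rowl.get(XML_ID)
        if char_id is not None:
            # A <Rowl> with an xml:id is treated as a character.
            characters[char_id] = rowl.text
            categories["character"] += 1
        else:
            # Otherwise the attribute name states the function.
            for name in rowl.attrib:
                categories[name] += 1
    return characters, categories

characters, categories = summarize(SAMPLE)
print(characters)
print(dict(categories))
```

Because every Rowlingism shares one element name, a single loop over <Rowl> is enough to collect characters and count categories. Under the earlier mixed scheme, the same query would have had to check <noun>, <place>, <other>, and <character> separately.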