Our research interest focuses primarily on structural aspects of dramatic texts. The structural data is extracted from the 465 dramatic texts that constitute our Sydney corpus and then screened and edited before it can be evaluated statistically with regard to literary history.
The structural abstraction is provided by a PHP script that processes the TEI files, collects all the data needed for our purpose and writes it to our own zwischenformat (which roughly translates as ‘intermediary format’; this is the DLINA data format we developed for this project and announced in our previous post). The script and what it produces, our zwischenformat, represent a structure-oriented form of data mining, so to speak.
Let’s assume that the basic structure of a drama looks as follows (without paratexts):
The <segment>s represent the predefined structures of a drama: acts and scenes. Our script extracts the structure of segments and speakers from the full-text TEI files and writes it into our zwischenformat. The actual content of the speeches is disregarded and represented instead by the number of speech acts, words, lines, and string length (in characters), each of which is summarised per occurring character, identified via its who attribute. Now we’re able to see at a glance how many words each character contributes to a play, and we’re able to do that for the whole Sydney corpus. Stay tuned for a post on the greatest chatterboxes in German literature, hehe!
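To give an idea of what this kind of counting involves, here is a minimal sketch in Python. It is not our PHP script, just an illustration under a few assumptions: the input is a simplified, namespace-free fragment in which each <segment> contains <sp who="…"> elements and each speech consists of <l> lines. The sample text, the character IDs and the function name `summarise_speakers` are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified input. Only <segment> and the who attribute are
# taken from the post; <sp>, <l> and the sample speeches are assumptions.
SAMPLE = """
<text>
  <segment>
    <sp who="#anna"><l>Good morning to you.</l><l>How are you?</l></sp>
    <sp who="#bert"><l>Fine.</l></sp>
  </segment>
  <segment>
    <sp who="#anna"><l>Glad to hear it.</l></sp>
  </segment>
</text>
"""


def summarise_speakers(xml_text):
    """Count speech acts, words, lines and characters per who value."""
    root = ET.fromstring(xml_text)
    stats = defaultdict(
        lambda: {"speech_acts": 0, "words": 0, "lines": 0, "chars": 0}
    )
    for sp in root.iter("sp"):
        who = sp.get("who")
        if who is None:
            continue  # speeches without a speaker reference are skipped here
        # The speech content itself is thrown away; only its size is kept.
        lines = [(line_el.text or "") for line_el in sp.findall("l")]
        entry = stats[who]
        entry["speech_acts"] += 1
        entry["lines"] += len(lines)
        entry["words"] += sum(len(line.split()) for line in lines)
        entry["chars"] += sum(len(line) for line in lines)
    return dict(stats)


if __name__ == "__main__":
    for who, counts in summarise_speakers(SAMPLE).items():
        print(who, counts)
```

Splitting on whitespace is a deliberately crude way of counting words; a real script has to decide how to treat punctuation, stage directions inside speeches, and so on.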
Anyhow, the result looks something like this:
The representation of drama structure (segmentations, speakers) is at the core of our zwischenformat. But it does even more. It captures metadata and it creates complete cast lists for each drama by making use of the who attributes.
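Deriving a cast list from the who attributes could look roughly like the following Python sketch. Again, this is an illustration and not our actual script: the simplified input, the character IDs and the function name `build_cast_list` are assumptions. The sketch also assumes that a who attribute may hold several whitespace-separated references (as TEI allows when several characters speak together).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified input; element names other than <segment> and
# the who attribute are assumptions for illustration.
SAMPLE = """
<text>
  <segment>
    <sp who="#anna">...</sp>
    <sp who="#bert">...</sp>
    <sp who="#anna #bert">...</sp>
  </segment>
  <segment>
    <sp who="#carla">...</sp>
    <sp who="#anna">...</sp>
  </segment>
</text>
"""


def build_cast_list(xml_text):
    """Return every distinct who reference, in order of first appearance."""
    root = ET.fromstring(xml_text)
    cast = []
    seen = set()
    for sp in root.iter("sp"):
        # A who value may contain more than one reference.
        for ref in (sp.get("who") or "").split():
            if ref not in seen:
                seen.add(ref)
                cast.append(ref)
    return cast


if __name__ == "__main__":
    print(build_cast_list(SAMPLE))
```

Keeping the order of first appearance is just one possible choice; sorting alphabetically or by amount of speech would work equally well.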
Our zwischenformat consists of three main parts (each of which is required):
- <header> (the metadata)
- <personae> (a cast list created with the help of all who attributes)
- <text> (drama segmentation and speakers)
Plus, there is also an optional part:
- <documentation> (for documenting non-trivial editing decisions)
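As a quick sanity check, the presence of the three required parts can be tested with a few lines of Python. This is only a sketch under assumptions: the root element name <play> is hypothetical, the parts are assumed to be direct, namespace-free children of the root, and `missing_parts` is a name made up for this example. It is no substitute for proper schema validation.

```python
import xml.etree.ElementTree as ET

REQUIRED_PARTS = ("header", "personae", "text")
OPTIONAL_PARTS = ("documentation",)


def missing_parts(xml_text):
    """Return the required top-level parts that are absent from a document."""
    root = ET.fromstring(xml_text)
    # Assumption: the parts are direct children of the root, without namespace.
    present = {child.tag for child in root}
    return [name for name in REQUIRED_PARTS if name not in present]


if __name__ == "__main__":
    # <play> is a hypothetical root element used only for this illustration.
    complete = "<play><header/><personae/><text/></play>"
    incomplete = "<play><header/><text/><documentation/></play>"
    print(missing_parts(complete))
    print(missing_parts(incomplete))
```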
A complete yet very short and simple one-act drama would be represented like this by our zwischenformat:
The zwischenformat is validated against:
- http://raw.githubusercontent.com/dlina/project/master/rules/lina.rnc
- http://raw.githubusercontent.com/dlina/project/master/rules/lina.sch
The raw zwischenformat versions of our Sydney corpus can be found here (i.e., the 465 files extracted from the TextGrid Repository before we started editing them):
The edited zwischenformat files can be found here (this is the deluxe version of our corpus, so to speak, the basis for all further analyses and visualisations; our editing rules will be published at a later point):
And now: