Network Analysis of Dramatic Texts

December Hackathon in Potsdam

2017-12-22T00:00:00+01:00

Thanks to the funding we received from the University of Potsdam (KoUP 1) and the Higher School of Economics (НУГ), we were able to organise two hackathons this year, one in September in Moscow, another one earlier this month at Fontane Archive in Potsdam. The latter concluded with a mini conference.

The network analysis of literary texts remains the main business of our German-Russian research group. In 2017, though, we rebuilt our whole infrastructure so we’re able to look beyond network-analytical research questions and combine the network approach with other (quantitative) methods. Some of the scientific outcome of our efforts throughout this year was presented at the mini conference and on Twitter via the hashtag #potsdam_digilit, some will find its way into our upcoming research papers.

To capture a bit of the hackathon spirit, this end-of-the-year blog post will just roll out some pics from our December meeting, so here goes:

Arriving at Berlin Central Station, magenta style.

Walking down Friedrichstrasse.

The bunch, first morning.

Welcome to Fontane Archive!

Literary hackathon in black and white.

Fine-tuning our new Shiny app.

Discussing our Chekhov conference poster for DHd2018.

Testing the new version of dramavis on Russian plays (a.k.a. "laptop-sticker competition").

Discussing next steps.

New API!

Lunch break: Lui and Moi on their way to the Café de la Régence.

GG and FF.

And a meta perspective.

Let's study ducks and swans.

A quick visit to the comma of SANS, SOUCI.

Let's go back.

Studying a real-life copy of Cäsar Flaischlen's "Graphische Litteratur-Tafel" (1890).

Still hacking.

@peertrilcke and @umblaetterer looking at things. (context)

Still hacking.

Danya on Tolstoy.

Conference break.

The inevitable night walk.

Visit to the Potsdam Christmas market …

… and some ice-skating.

Restaging a random swashbuckler movie …

… and a jump cut to the final scene of Emilia Galotti, "crazy Odoardo" edition.

Best wishes and see you all next year.

December Hackathon in Potsdam was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on December 22, 2017.

Know Your Implementation: Subgraphs in Literary Networks

2017-10-03T00:00:00+02:00

The network analysis of literary texts rests on a number of algorithmic foundations, which are often not sufficiently reflected in the field. In this regard, one problematic case is the existence of detached subgraphs. Here’s a classic example, the network of Goethe’s Faust, Part One (1808), visualised with our online tool ezlinavis (Faust being one of the examples you can select from the pull-down menu in the right upper corner):

We can visually distinguish three subgraphs:

the main graph revolving around Faust and Mephisto, which basically comprises the entire plot of the play, except for two detached single scenes:
- Vorspiel auf dem Theater (Prelude in the Theater)
- Walpurgisnachtstraum (Walpurgis Night’s Dream)

The two latter scenes do not feature any character from the main graph, which is problematic when starting to calculate network metrics. For example, if we want to calculate the average path length, which is the average of all average distances from one node to all other nodes, how long is the distance between, say, Faust and any of the characters in the detached Walpurgis Night’s Dream? It is, well, infinite. If we still want to calculate things like the average distance, we can do that, we just have to find a way to deal with unconnected pairs of nodes. In any case: “Computing the average distance in disconnected graphs needs careful consideration.” (Zweig 2016, p. 223).

There are different ways to implement this, and even if you’re just using network tools out of the box, you should be aware of the kind of algorithm that is used to calculate network metrics in unconnected graphs.
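Before choosing any of these options, it helps to check whether a network has detached subgraphs at all. The following is a minimal sketch in plain Python, not the code behind our tools; the function name `connected_components` and the tiny toy graph are invented for illustration and are not the actual Faust data.

```python
from collections import deque

def connected_components(adjacency):
    """Return the connected components of an undirected graph as a list of sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return components

# Toy graph (invented): a main plot plus one detached scene.
toy = {
    "Faust": {"Mephistopheles", "Wagner"},
    "Mephistopheles": {"Faust"},
    "Wagner": {"Faust"},
    "Weltkind": {"Sternschnuppe"},
    "Sternschnuppe": {"Weltkind"},
}
components = connected_components(toy)
print(len(components), "subgraphs found")
```

If more than one component comes back, every distance-based metric depends on how unconnected pairs are treated.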

One way is to only consider the paths that actually exist and neglect all other pairs of nodes. If we use that option, the results for six selected characters from Faust, Part One are as follows:

Character	Degree	Average Distance	Closeness Centrality
Faust	55	1.11	0.90
Mephistopheles	35	1.44	0.70
Wagner	25	1.71	0.58
Margarete	9	1.85	0.54
…	…	…	…
Weltkind	35	1.0	1.0
Sternschnuppe	35	1.0	1.0
…	…	…	…

This actually makes sense. Characters/speakers in Walpurgis Night’s Dream (represented by Weltkind and Sternschnuppe) are not interacting directly with characters in other scenes and “stay among themselves”, so to speak, which is why they all have an average distance of 1.0. – Yet if it is true that the central character, the protagonist if you will, is “the character that minimize[s] the sum of the distances to all other vertices” (Alberich/Miro-Julia/Rosselló 2002), we have a problem, because Faust stops being the protagonist of Faust, overrun by the 36 speakers of the Walpurgis Night’s Dream. In other words: Goethe’s Walpurgis Night’s Dream, with regard to network theory, is a link farm.
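To make the effect tangible, here is a small sketch of the "existing paths only" option in plain Python. It is an illustration under assumptions, not the implementation used for the table: the helper names are made up, the toy graph is invented, and closeness is simply taken as the inverse of the average distance, which is how the two columns in the table appear to relate.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path lengths from `source` to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_distance_reachable_only(adjacency, source):
    """Average distance to reachable nodes; unconnected pairs are ignored."""
    dist = bfs_distances(adjacency, source)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else float("inf")

# Toy graph (invented): a main plot and a detached pair of speakers.
toy = {
    "Faust": {"Mephistopheles", "Wagner"},
    "Mephistopheles": {"Faust", "Margarete"},
    "Wagner": {"Faust"},
    "Margarete": {"Mephistopheles"},
    "Weltkind": {"Sternschnuppe"},
    "Sternschnuppe": {"Weltkind"},
}
for name in ("Faust", "Weltkind"):
    avg = average_distance_reachable_only(toy, name)
    # Closeness taken as 1 / average distance (an assumption, see above).
    print(name, round(avg, 2), round(1 / avg, 2))
```

In this toy graph Faust ends up with an average distance of 4/3, while the detached Weltkind gets a perfect 1.0, which mirrors the problem described above.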

If we still want network metrics to be meaningful when it comes to determining who the central character of a play could be, we had better rely on a different option. For practical reasons, the distance between two unconnected nodes is sometimes declared as the length of the longest existing path, plus one. If we use this method to assume an (artificial) distance for every pair of nodes, the above table would look like this:

Character	Degree	Average Distance	Closeness Centrality
Faust	55	1.81	0.55
Mephistopheles	35	2.33	0.42
Wagner	25	2.78	0.35
Margarete	9	3.02	0.33
…	…	…	…
Weltkind	35	2.88	0.34
Sternschnuppe	35	2.88	0.34
…	…	…	…

And … Faust is back! Shortest average distance! – For our upcoming paper on the different kinds of extracting protagonists from plays, we are using this method to calculate average distances. But, having said that, it cannot be emphasised enough that since the concept of the protagonist is such a rich concept, we should not try to use but one simple measure to automatically determine such entities. Which is something we’ll address in said paper, stay tuned. 😊
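The "longest existing path, plus one" rule can be sketched in a few lines of plain Python. This is a hedged illustration, not our production code: function names and the toy graph are invented, and "longest existing path" is read here as the longest shortest path found anywhere in the graph.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path lengths from `source` to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_distances_longest_plus_one(adjacency):
    """Average distance per node; unreachable nodes count as longest path + 1."""
    all_dist = {node: bfs_distances(adjacency, node) for node in adjacency}
    longest = max(max(d.values()) for d in all_dist.values())
    penalty = longest + 1
    n = len(adjacency)
    result = {}
    for node, dist in all_dist.items():
        unreachable = n - len(dist)
        # sum(dist.values()) includes the 0 for the node itself.
        result[node] = (sum(dist.values()) + penalty * unreachable) / (n - 1)
    return result

# Toy graph (invented): Faust in the centre, one detached pair.
toy = {
    "Faust": {"Mephistopheles", "Wagner", "Margarete"},
    "Mephistopheles": {"Faust", "Margarete"},
    "Wagner": {"Faust"},
    "Margarete": {"Faust", "Mephistopheles"},
    "Weltkind": {"Sternschnuppe"},
    "Sternschnuppe": {"Weltkind"},
}
averages = average_distances_longest_plus_one(toy)
print(min(averages, key=averages.get))
```

Here the longest existing path has length 2, so every missing path counts as 3. Faust gets 1.8, Mephistopheles 2.0 and Weltkind 2.6, so the detached speakers no longer outrank the main character.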

Ok, let’s consider one last way to calculate distance values between unconnected nodes. E.g., when we used igraph as our network library (before switching to networkx), we saw results that were totally different, because we used a fallback that determined that “the length of the missing paths are counted having length vcount(graph), one longer than the longest possible geodesic in the network” (i.e., vcount being the number of vertices of a graph). The resulting metrics, although calculated correctly, don’t make much sense:

Character	Degree	Average Distance	Closeness Centrality
Faust	55	40.07	0.02
Mephistopheles	35	40.27	0.02
Wagner	25	40.44	0.02
Margarete	9	40.52	0.02
…	…	…	…
Weltkind	35	67.0	0.01
Sternschnuppe	35	67.0	0.01
…	…	…	…

In this approach, the assumed paths when bridging the infinite distance between two subgraphs are much longer than with the previous algorithms, and almost equal: differences in the average distances really only become visible after the decimal point. So while this approach might make sense in some contexts, it is not very helpful in our case.
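Why the values crowd together can be seen in a small plain-Python sketch where every missing path is counted with the number of vertices. It does not call igraph and does not reproduce the figures in the table; the helper names and the toy graph (a star plus a detached clique) are assumptions made for illustration.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path lengths from `source` to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_distance(adjacency, source, penalty):
    """Average distance from `source`, counting each unreachable node as `penalty`."""
    dist = bfs_distances(adjacency, source)
    unreachable = len(adjacency) - len(dist)
    return (sum(dist.values()) + penalty * unreachable) / (len(adjacency) - 1)

# Toy graph (invented): a star around "Faust" plus a detached clique of four.
toy = {"Faust": {f"m{i}" for i in range(1, 10)}}
for i in range(1, 10):
    toy[f"m{i}"] = {"Faust"}
clique = [f"w{i}" for i in range(1, 5)]
for name in clique:
    toy[name] = set(clique) - {name}

vcount = len(toy)  # number of vertices, used as the fallback distance
faust = average_distance(toy, "Faust", vcount)
minor = average_distance(toy, "m1", vcount)
print(round(faust, 2), round(minor, 2))
```

Without any penalty the hub and a leaf of the star would differ by almost a factor of two; with the vertex count as fallback they come out at 5.0 and roughly 5.62, because the penalty term dominates both sums.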

All told, our maxim really has to be, and not only when confronted with subgraphs: Know your implementation!

Know Your Implementation: Subgraphs in Literary Networks was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on October 03, 2017.

Network Analysis of Gogol's Metaplay "Leaving the Theatre …" (1842)

2017-07-09T00:00:00+02:00

A couple of days ago, we presented a first version of our TEI-encoded Russian Drama Corpus (RusDraCor) at the CORPORA 2017 conference in St. Petersburg (slides). Our goal is to assemble hundreds of Russian plays from the 1740s (Sumarokov) up to the 1930s with authors like Gorky and Mayakovsky.

Right in the middle, chronologically, our corpus features a number of plays by Gogol, one of which is “Театральный разъезд после представления новой комедии” (“Leaving the Theatre after the Presentation of a New Comedy”; full text at ilibrary.ru).

We don’t concentrate so much on individual networks in our research; we focus more on the structural evolution of a bulk of literary texts over time. But some networks are just special enough to warrant a bit more attention. So here is the network graph for “Leaving the Theatre”, extracted from our TEI version of the play and embellished with Gephi:

This is a ridiculously big social network for a theatre play (99 characters, it is hard to find plays with more characters). The reason is that Gogol’s “Leaving the Theatre” is a metaplay. Gogol started to draft it right after his infamous “Revizor” was released in 1836, but he didn’t publish “Leaving the Theatre” until 1842.

The plot, if we can call it that: A playwright is eavesdropping on the audience leaving the theatre after the presentation of his new play. We hear him comment sometimes, but he doesn’t directly interact with any of the other characters, and neither do they. They are just the exiting audience, ranting or raving about the play they just saw. They have no names, Gogol uses type descriptions to launch their speech acts. They go by names such as …

“Светский человек, щеголевато одетый” (“A society man, smartly dressed”)
“Господин, несколько беззаботный насчет литературы” (“A gentleman a little careless about literature”)
“Чиновник разговорчивого свойства” (“An official of talkative qualities”)
etc.

As mentioned above, we can distinguish 99 characters (or voices) in this play. Most of the people are just pouring out of the theatre, alone or in groups of two or three, contributing their bit, then vaporising into the evening. We cannot really apply our understanding of social interaction here (the ‘digital spectator’), but with a little tweak we can create a meaningful graph.

The play has no acts or scenes, so we segmented it to catch what Manfred Pfister called ‘configurations’, subsets of the character list of a play, i.e., groups of people present on the stage at a certain point during the play. For all characters present in the same segment, we would establish a relation. That way, we’d end up with many small, unconnected subnets. And here comes our tweak: Since our “author” character eavesdrops on all conversations, we added him to all 37 ‘configurations’, ending up with the star-like network you’ve seen above.
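The tweak can be illustrated with a short Python sketch. It is not the script we used: the function name `edges_from_configurations`, the `observer` parameter and the three English-labelled segments are invented for the example (the real play has 37 configurations).

```python
from itertools import combinations

def edges_from_configurations(configurations, observer=None):
    """Link all characters that share a configuration.

    If `observer` is given, it is added to every configuration first.
    """
    edges = set()
    for speakers in configurations:
        present = set(speakers)
        if observer is not None:
            present.add(observer)
        # Sorting makes each undirected pair appear in one canonical order.
        for a, b in combinations(sorted(present), 2):
            edges.add((a, b))
    return edges

# Hypothetical segments, invented for illustration.
segments = [
    {"Society man", "Careless gentleman"},
    {"Talkative official"},
    {"First lady", "Second lady", "Officer"},
]
without_author = edges_from_configurations(segments)
with_author = edges_from_configurations(segments, observer="Author")
print(len(without_author), len(with_author))
```

Without the observer, the three segments stay unconnected (and the lone official has no edge at all); with the observer added, every character is tied to one central node, which is what produces the star shape.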

Of course, this is an experimental extension of our approach, but it still helps to better understand the structure of Gogol’s metaplay. For example, we can easily tell apart single characters uttering their opinion and larger conversations involving a group of people, something that doesn’t become as clear when close-reading the play.

Btw, the underlying CSV file for “Leaving the Theatre” can be found here.

A Note on Laughter

Although we spent a lot of time getting our network data right, there’s still at least one shortcoming when we look at this nice quote from the concluding speech of Gogol’s alter ego in the play:

“Странно: мне жаль, что никто не заметил честного лица, бывшего в моей пьесе. Да, было одно честное, благородное лицо, действовавшее в нем во все продолжение ее. Это честное, благородное лицо был – смех.”

“It’s strange: I regret that no one noticed the one honest person in the play. Yes, there was an honest, noble person acting in it throughout its continuance. This honest, noble person was – laughter.” (our trans.)

Our current algorithms aren’t able to extract an abstract entity like “laughter” as part of a communication network, but who knows, involving more actor–network theory might bring us a whole bunch of new ideas.

Russian Drama Network as Shiny App

On a different note, we also released a Shiny App for the analysis of our networks at the aforementioned conference. It looks like this …

… and can be accessed at https://rusdracor.shinyapps.io/showcase/. It features live data, so to speak, continuously generated from our TEI files as the corpus grows. “Leaving the Theatre” is among the plays, as are works by Blok, Bulgakov, Chekhov, Fonvizin, Gorky, Gumilyov, Krylov, Mayakovsky, Ostrovsky, Plavilschikov, Prutkov ☺, Pushkin, Sumarokov, Leo Tolstoy and Turgenev. And more is to come.

Oh, our project will also be presented at the “Digitizing the Stage” conference starting tomorrow at the University of Oxford.

Etc. etc. etc.

Network Analysis of Gogol's Metaplay "Leaving the Theatre …" (1842) was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on July 09, 2017.

Extracting Network Data from Mayakovsky's Play "The Bedbug" (1928/29)

2016-09-18T00:00:00+02:00

We don’t know if you noticed, but the LINA research field (LIterary Network Analysis) has come up with pretty good PR videos lately. Look at this fancy YouTube clip produced by the “Nation, Genre & Gender” project at University College Dublin (their project homepage is here). The NG+G project applies Social Network Analysis to Irish and British Fiction (1800–1922), their corpus involves 46 novels from 29 authors (according to the video they identified 9,630 unique fictional characters). And although the automated extraction of characters from novels has made progress in recent years (see, for example, Jannidis et al.’s paper from DH2016), it is still rough on many edges. That’s why the UCD project chose manual annotation as their approach, and that’s why their data is of such high quality (but also limited in scope).

If you’re working with dramatic texts, automated character extraction is far less of a problem, since this kind of text comes pre-structured, so to speak. If you work with one of the many TEI-tagged corpora it is even easier to pull out interactions and start analysing them with network metrics. Although, admittedly, sometimes it’s harder than it seems, depending on the quality and depth of the mark-up (we covered that issue in multiple postings last year).

But what do you do if you can’t rely on a fine-grained TEI corpus? That’s what we’re confronted with when gathering network data from Russian drama. If you assemble all the plays that you can find on lib.ru, rvb.ru and ru.wikisource.org, you’ve got yourself a pretty good working corpus. The sustainable way would be to assemble all the works and then transform them into TEI and share them with the community. But corpus building is a task of its own and needs a lot of dedication. And after all, we “just” need some kind of network data, not a polished digital edition of the works. So one idea to go forward is to exploit the HTML structure of the texts.

Mayakovsky’s “The Bedbug”

At the beginning of July, we taught a Network Analysis course at the First Moscow-Tartu Digital Humanities Summer School in Yasnaya Polyana (if you speak Russian, slides are here). Originally, we wanted to analyse 19th-century drama, but one of the participants preferred to confront our methods with one of Vladimir Mayakovsky’s plays (hi G.! :-). He chose “Klop” (translated as “The Bedbug”, see en.wikipedia.org; an English adaptation by Snoo Wilson is here as PDF; a concise English summary can be found at sovlit.net), written in 1928 and first published the year after.

“Klop” is definitely one of the challenging plays when it comes to character extraction. And now, two months after the summer school, we tried to automatise the extraction process and used “Klop” as an example. Before we get into the details, this is the end result (visualised in Gephi 0.9.1 using its built-in modularity algorithm; the image is licensed under CC BY 4.0):

Network-Driven Synopsis

It’s the late 1920s in a mid-sized town in Soviet Russia. The protagonist in “Klop”, “Pierre Skripkin” (who changed his name from “Prisypkin”), abandons his socialist ideals, because after all the fighting and suffering he wants to start benefiting from what has been achieved. And because this is such an unusual play, we can actually base our synopsis on the network graph. The play consists of nine scenes:

In scene 1, we see Skripkin (dark-green, central node) with his friend Bayan and his soon-to-be mother-in-law Rozaliya (both orange) strolling through a warehouse where merchants praise their products (dark-green cluster).
In scene 2, Skripkin discusses his lifestyle with the characters in the light-brown/beige cluster.
Scene 3 shows Skripkin’s wedding with his bourgeois bride Elsevira (orange cluster). However, fire breaks out and everybody dies, except for Skripkin who, …
… in scene 4, goes unnoticed by the firefighters and is preserved in the icy water in the cellar. The firefighters and their captain are depicted in the red cluster, which is detached from the other clusters.
In scene 5, the play reaches the future, jumping 50 years ahead in time. It is now the end of the 1970s, a global socialist state has been created (kind of an aseptic one, though). We follow a call-in discussion among several participants led by an operator, depicted in the light-blue cluster. It is discussed if Skripkin’s recovered body shall be defrosted or not, and a majority votes in favour of unfreezing. Just like the red cluster, this light-blue one is also detached from the main cluster. So the transitional scenes between present and future are detached, character-wise, from the rest of the play, which is a nice structure-related finding: Skripkin is kind of tunnelling through these scenes into the 1970s.
In scene 6, we meet Skripkin’s ex-girlfriend Zoya Beryozkina, who already occurred in the first two scenes and who is the only other person next to Skripkin who makes it from the present to the future in this play. She shares scene 6 with the professor (purple), some doctors (dark-green) and the resurrected protagonist.
In scene 7, we see a journalist reporting about the “resurrected mammal” (purple cluster). It is said that Skripkin is dangerous since he started to spread these ancient diseases among the people (like dancing, drinking beer and falling in love). In the same scene, the equally dangerous bedbug, which was defrosted along with Skripkin, is hunted down. The eponymous insect, which clearly serves as a symbol in the play, is not featured in the network graph, since no speech act can be attributed to it. 😉 (Although you might well think of a different approach including the little bug in the network analysis.)
Scene 8 presents a disappointed Skripkin who doesn’t like this aseptic future and declares that he would have preferred to stay frozen. The scene is mainly shared between him, Zoya and the professor.
Scene 9 takes place in the zoo, where Skripkin and the bedbug are presented as attractions (light-green cluster). When Skripkin is released from his cage, he gives a speech, but people are appalled and he’s put behind bars again and, further on, “displayed as a specimen of society’s primitive past, where school children can feed him with cigarettes and alcohol” (dramaonlinelibrary.com).

Extracting the Network Data

Coming back to where we started, how did we extract the character network? The play was digitised at Wikisource. After having a closer look at the underlying HTML it was clear that extraction was easy: we just needed clear indicators for the beginning of a new scene and all speakers involved. A little Bash script (making use of xmllint) extracted the info like this:

I
Разносчик пуговиц
Разносчик кукол
Разносчица яблок
(…)
Присыпкин (Пьер Скрипкин)
Розалия Павловна
Присыпкин (Пьер Скрипкин)
Баян
(…)
II
Босой
Уборщик
Босой
Молодой рабочий
Девушка
Парень
(…)
III
Эльзевира
Присыпкин (Пьер Скрипкин)
Эльзевира
Присыпкин (Пьер Скрипкин)
Гость
(…)
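A listing like the one above is easy to read back into a data structure. The following Python sketch is not the Bash/xmllint script mentioned above; it starts from the intermediate listing and assumes that a scene header is a line consisting only of Roman numerals (I, V, X) and that every other non-empty line is a speaker label. The function name `parse_speakers_per_scene` is made up for the example.

```python
import re

# Assumption: scenes are headed by a line of Roman numerals only.
SCENE_HEADER = re.compile(r"^[IVX]+$")

def parse_speakers_per_scene(text):
    """Map each scene header to the list of speaker labels that follow it."""
    scenes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "(…)":
            continue  # skip blank lines and the elision marker used in the excerpt
        if SCENE_HEADER.match(line):
            current = line
            scenes[current] = []
        elif current is not None:
            scenes[current].append(line)
    return scenes

sample = """I
Разносчик пуговиц
Разносчик кукол
Присыпкин (Пьер Скрипкин)
Розалия Павловна
Присыпкин (Пьер Скрипкин)
II
Босой
Уборщик
Босой
"""
scenes = parse_speakers_per_scene(sample)
print({scene: len(speakers) for scene, speakers in scenes.items()})
```

Speaker labels are kept in order and with repetitions at this stage; deduplication and disambiguation are separate steps. A speaker whose label happened to consist only of the letters I, V and X would be mistaken for a scene header, which is not an issue with Cyrillic names.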

Disambiguation

Now came the tricky part. Since we’re relying on character names, just like the author put them in his play, we had to deal with plenty of ambiguities. This wouldn’t happen with proper TEI, when every <sp>eech act provides IDs for all involved characters. An additional problem is that you have different entities going by the same name, like “Голоса” (“Voices”) in the second and third scene.

So what we had to account for to get a really clean character network is the following:

“Зоя” = “Зоя Берёзкина”
“Присыпкин” and “Скрипкин” were combined to “Присыпкин (Пьер Скрипкин)” (since the protagonist proactively changed his name, see above)
1st scene: “Пуговичный разносчик” = “Разносчик пуговиц”
2nd scene: “Босой парень” and “Босой” are the same
2nd scene: “Молодой рабочий” and “Парень” are the same (just like “Парень с метлой”)
2nd scene: the “Девушка” in this scene is not the same as in scene 7 (disambiguation by numbering)
3rd scene: “Посажёный отец—бухгалтер” = “Бухгалтер”
3rd scene: “Крики” at the end eliminated
4th scene: “Пожарные” deleted (for the same reasons for which “Все” was deleted)
5th scene: “Старший и младший” deleted
5th scene: the incoming messages from the several outposts are not marked with their speakers (as a result, they don’t appear in the network)
6th scene: “Хором” deleted
9th scene: “Голос из толпы” occurs three times, all voices are apparently different, so we numbered them
9th scene: “Председатель совета” and “Председатель” are the same

We also eliminated all occurrences of “Все” (“All”): the idea is that characters contained in the “Все” already participate in the corresponding scene. That way, we avoid having “Все” as an additional character in the network. For the same reason we could have eliminated all occurrences of “Голоса” (“Voices”), but that’s a different thing since voices can come from unmentioned characters that don’t otherwise contribute to a speech act. So we let those in.

(The resulting TXT file can be found here: “mayakovsky-klop-speakers-per-scene.txt”.)
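The kind of clean-up described above boils down to three operations: merging aliases, dropping collective speakers, and numbering labels that mean different entities in different scenes. Here is a hedged Python sketch; it covers only a subset of the rules, applies the deletions globally rather than per scene (a simplification), and all constant and function names are invented for the illustration.

```python
# Subset of the rules listed above; names of these constants are made up.
ALIASES = {
    "Зоя": "Зоя Берёзкина",
    "Присыпкин": "Присыпкин (Пьер Скрипкин)",
    "Скрипкин": "Присыпкин (Пьер Скрипкин)",
    "Пуговичный разносчик": "Разносчик пуговиц",
    "Босой парень": "Босой",
}
DROPPED = {"Все", "Крики", "Пожарные", "Хором"}  # simplification: dropped in every scene
SCENE_SPECIFIC = {"Голоса", "Девушка"}  # same label, different entities per scene

def normalise(scenes):
    """Return a dict of scene -> set of cleaned, disambiguated speaker names."""
    cleaned = {}
    for scene, speakers in scenes.items():
        names = set()
        for name in speakers:
            name = ALIASES.get(name, name)
            if name in DROPPED:
                continue
            if name in SCENE_SPECIFIC:
                name = f"{name} {scene}"  # e.g. "Голоса II"
            names.add(name)
        cleaned[scene] = names
    return cleaned

raw = {
    "II": ["Босой парень", "Босой", "Голоса", "Все"],
    "III": ["Скрипкин", "Голоса", "Крики"],
}
cleaned = normalise(raw)
print(cleaned)
```

Keeping the rules in plain dictionaries and sets makes the editorial decisions visible and easy to revise, which matters once more than one play is processed.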

In comparison, the intermediary XML format we introduced when starting to work with our corpus of German drama can be much more fine-grained, because we’re working with a TEI-encoded corpus there. One of the purposes of this article, though, is to demonstrate that you can already do stuff with the most basic of interactional data.

Building the CSV File

After we had cleaned the names of all speakers, we wrote another small script, this time in Python, to generate a CSV file containing all the edges of the network, here’s a little excerpt:

Source,Target,Weight
Баян,Босой,1
Баян,Бухгалтер,1
Баян,Голос,1
Баян,Голоса II,1
Баян,Голоса III,1
(…)
Баян,Присыпкин (Пьер Скрипкин),3
(…)

Really just containing info on who is talking to whom in how many scenes. (The CSV file can be obtained here: “mayakovsky-klop-edges.csv”. This, of course, was the data we fed into Gephi to visualise the network shown above.)
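For readers who want to try this themselves, here is a minimal Python sketch of such an edge list generator. It is not our original script: the function names are invented, and the three-scene input is a simplified toy, not the full cast of the play. The weight of an edge is the number of scenes two speakers share.

```python
import csv
import io
from collections import Counter
from itertools import combinations

def edge_weights(scenes):
    """Count, for every pair of speakers, the number of scenes they share."""
    weights = Counter()
    for speakers in scenes.values():
        # set() removes repeated speech acts; sorted() fixes the pair order.
        for pair in combinations(sorted(set(speakers)), 2):
            weights[pair] += 1
    return weights

def to_csv(weights):
    """Serialise the weighted edges in the Source,Target,Weight layout."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()

# Simplified toy input, not the complete speaker lists.
scenes = {
    "I": ["Баян", "Присыпкин (Пьер Скрипкин)", "Розалия Павловна"],
    "II": ["Баян", "Босой", "Присыпкин (Пьер Скрипкин)"],
    "III": ["Баян", "Присыпкин (Пьер Скрипкин)", "Эльзевира"],
}
weights = edge_weights(scenes)
print(to_csv(weights))
```

Because the graph is undirected, each pair is stored once in sorted order, so “Source” and “Target” carry no direction here.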

Some Network Values

The network graph does well in demonstrating the structural uniqueness of Mayakovsky’s play. It is rather unusual that almost every scene can be identified as an individual cluster in the graph. The number of characters (= network size) is 94, the network density is fairly low, 0.17 (i.e., 17% of all possible connections between nodes are actually happening). The node-degree distribution shows traits of a power law, but it’s hard to draw any conclusions from that, since the play is so short and the interactional mode of the play so unique.
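As a reminder of what the density figure means: for an undirected network without self-loops it is the number of existing edges divided by the number of possible pairs, n(n-1)/2. A tiny Python sketch with invented function names follows; since the exact edge count of the play is not given here, only the number of possible pairs for 94 nodes is computed, alongside a toy example.

```python
def possible_edges(node_count):
    """Number of distinct pairs in an undirected graph without self-loops."""
    return node_count * (node_count - 1) // 2

def density(node_count, edge_count):
    """Share of possible edges that are actually present."""
    return edge_count / possible_edges(node_count)

pairs = possible_edges(94)    # possible connections among 94 characters
toy_density = density(4, 3)   # toy case: 4 nodes, 3 edges
print(pairs, toy_density)
```

With 94 characters there are 4371 possible pairs, so a density of 0.17 corresponds to somewhere around 740 realised connections (the exact number depends on rounding).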

If you have a look at the CSV file, almost all weights are “1”, meaning that characters share exactly one scene. The play is really about showing Pierre Skripkin in different contexts, in the present and the future. His closest contacts are his former lover Zoya Beryozkina and Oleg Bayan (3 shared scenes each), Rozaliya Pavlovna (bride’s mother) and the professor in the future (2 shared scenes each).

Something Like a Conclusion

You cannot reflect enough on the practice of character extraction from literary texts. The method you use has a big impact on the numbers that you’re working with later. You not only have to “know your corpus”, but you also have to keep in mind the rationale on which you based the information extraction. Especially if you want to process not just one file (like we did in this post) but hundreds or thousands of them.

Extracting Network Data from Mayakovsky's Play "The Bedbug" (1928/29) was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on September 18, 2016.

“Distant-Reading Showcase”: Designing Our DHd2016 Conference Poster

2016-03-30T00:00:00+02:00

Three weeks ago, we attended the annual Digital Humanities conference of the German-speaking countries (DHd2016), this time taking place at the University of Leipzig. We delivered two papers (more on them later) and a poster. And we were really excited to be awarded the prize for the best poster out of 78 poster submissions (listed in this PDF).

I will try to quickly explain what we tried to do when creating our poster. But first and foremost, this is the poster we’re talking about, its full title goes as follows: “Distant-Reading Showcase: 200 Years of German Drama History at a Glance”.

A full-res version can be downloaded from Figshare (PDF; 28.88 MB).

What we set out to do with this poster was to produce a data-driven showcase for what they™ call distant reading. We have a working definition of ‘distant reading’ that differs from the one that underlies Franco Moretti’s articles on the matter since he first coined the term in 2000. Just recently, Peer and I gave a talk on the matter, last November in Vienna, at a workshop dedicated to “Distant Reading and Discourse Analysis”. (The corresponding article will appear shortly, we just finished the final editing.) Let’s just point you to two aspects: Moretti never talks about programming or code and neither describes nor provides his working corpus so that anybody could reproduce his findings, two things we consider essential and tried to address throughout the course of the DLINA project (see our older postings). ‘Data-driven’ means that we wanted the computer to generate the better part of the poster, a job done by our tool dramavis which was revamped and completely rewritten from scratch just weeks before the conference (current version is v0.2).

In order to be a convincing Distant-Reading Showcase our poster should really show visualised data that could actually be read by viewers. The 465 character networks showing German-language dramas written/published between 1730 and 1930 are sorted chronologically, and one thing people should be able to spot is the decisive decade in which German authors started to binge-read and adapt Shakespeare. All of a sudden in the 1770s, they start to build character networks far bigger than the ones before: Goethe’s play “Götz von Berlichingen” is one of the first that, instead of only 8 or 12 or 16 characters, started to let more than 70 characters appear on stage. You can witness this ‘explosion’ in the 3rd row from above, 3rd column from the right. There are other things you can actually recognise in the poster, just take the network built from Schnitzler’s “Der Reigen” (“La Ronde”), which describes a circle in correspondence with the symptomatic course of the play (6th line from below, 7th column from the right; see also Gerrit Imsieke’s tweet on the matter).

At some point (when pottering about with Illustrator trying to open and convert a 20+ MB SVG) we had the notion that next time we should aim at generating the entire poster directly as script-driven SVG. But okay, this time we still managed to undertake the finishing steps on an old 2×2.8 GHz Quad-Core Intel Xeon Mac Pro with just about 6 GB of RAM using InDesign to properly fill the rest of the poster with descriptive info and some additional stuff: The two diagrams in the lower left of the sidebar already show further parts of our research, one of them the number of dramas with ‘small world’ characteristics, something we will also talk about at the DH2016 in Krakow, on July 14.

To add a bit of suspense, we arrived in Leipzig with a still unfinished poster. A tiny little night shift at Café Telegraph settled things and on Wednesday, the very day of the poster presentations, we printed the actual poster on glossy paper in A0 format at the local print shop sedruck, their store at Beethovenstraße 23. The result was amazing, one of the best A0 printing experiences we had so far.

Credits

Creating this poster was a team effort:

Some Criticism

It was keynote speaker Daniel Keim himself who uttered some criticism when discussing the poster with us later that evening, broaching the problems of spring-embedder algorithms. And we couldn’t agree more: Spring embedders have “an undeniable aesthetic appeal, […] yet a random layout is nearly always the default” (source). One side effect of this is that graphs always look a tad different when generating them anew. Thus, similar graphs don’t always look similar. This is a mere graph-visualisation problem and not too relevant for the actual research we’re conducting with the network measures we calculate with our dramavis tool. But feel free to give us a hint on how to normalise graphs generated with spring-embedding algorithms.

Closing Words

Despite the usual time pressure, it was great fun to plan, design and discuss our poster and to face some real competition. A big shout-out to our fellow winners who ranked 2nd (“Digitales Publizieren. Bedingungen – Optionen – Empfehlungen”) and 3rd (“Das Tool LAKomp und seine Anwendung auf Texte nichtstandardisierter Sprachstufen”). Right after the ceremony, we enjoyed a nice little dinner with the runners-up and some other friends at the dimly lit restaurant located in the Alte Nikolaischule building of which there is a twitpic here.

See y’all next year at the DHd2017 conference in Berne, CH.

“Distant-Reading Showcase”: Designing Our DHd2016 Conference Poster was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on March 30, 2016.

The Facebook of German Playwrights

2016-01-28T00:00:00+01:00

This short article is a follow-up to our last posting, “The Birth & Death of German Playwrights”. Plotting the birth and death places of our 178 authors onto a map brought us closer to understanding the character of our corpus which – codenamed “Sydney” – contains 465 German-language plays. But it didn’t bring us close enough to understanding who the authors are. So let’s build a gallery with their portraits, a facebook of German playwrights, so to speak, and let’s do that automatically.

We’re relying on Wikidata again: for each author, we extract a link to their principal image, which leads to the actual portrait file on Wikimedia Commons. We do this with nothing more than an XSLT transformation. Some simple Bash scripting was added to build the actual gallery for this post. The male and female silhouettes for authors who still lack an image on Commons were designed by our accomplice Ruth Reiche (thanks!). For some more details on how we did all this, scroll down to the end of the gallery. And now without further ado, this is the gallery (click on an image to get to the source file on Commons):

Neuber, Friederike Caroline
(1697–1760)
Bodmer, Johann Jacob
(1698–1783)
Gottsched, Johann Christoph
(1700–1766)
Borkenstein, Hinrich
(1705–1777)
Gottsched, Luise Adelgunde Victorie
(1713–1762)
Gellert, Christian Fürchtegott
(1715–1769)
Kurz, Joseph von
(1717–1784)
Schlegel, Johann Elias
(1719–1749)
Auenbrugger, Johann Leopold von
(1722–1809)
Mylius, Christlob
(1722–1754)
Quistorp, Theodor Johann
(1722–1776)
Krüger, Johann Christian
(1723–1750)
Klopstock, Friedrich Gottlieb
(1724–1803)
Weiße, Christian Felix
(1726–1804)
Lessing, Gotthold Ephraim
(1729–1781)
Gessner, Salomon
(1730–1788)
Cronegk, Johann Friedrich von
(1731–1758)
Hafner, Philipp
(1731–1764)
Pfeil, Johann Gottlob Benjamin
(1732–1800)
Ayrenhoff, Cornelius Hermann von
(1733–1819)
Wieland, Christoph Martin
(1733–1813)
Brandes, Johann Christian
(1735–1799)
Klemm, Christian Gottlob
(1736–1802)
Sturz, Helfrich Peter
(1736–1779)
Gerstenberg, Heinrich Wilhelm von
(1737–1823)
Brawe, Joachim Wilhelm von
(1738–1758)
Engel, Johann Jakob
(1741–1802)
Hippel, Theodor Gottlieb von
(1741–1796)
Stephanie, Johann Gottlieb (d. J.)
(1741–1800)
Schröder, Friedrich Ludwig
(1744–1816)
Weidmann, Paul
(1744–1801)
Gotter, Friedrich Wilhelm
(1746–1797)
Wagner, Heinrich Leopold
(1747–1779)
Bretzner, Christoph Friedrich
(1748–1807)
Goethe, Johann Wolfgang von
(1749–1832)
Müller, Friedrich (Maler Müller)
(1749–1825)
Lenz, Jakob Michael Reinhold
(1751–1792)
Schikaneder, Johann Emanuel
(1751–1812)
Klinger, Friedrich Maximilian
(1752–1831)
Leisewitz, Johann Anton
(1752–1806)
Törring, Josef August von
(1753–1826)
Soden, Julius von
(1754–1831)
Gemmingen-Hornberg, Otto Heinrich von
(1755–1836)
Schink, Johann Friedrich
(1755–1835)
Iffland, August Wilhelm
(1759–1814)
Hensler, Karl Friedrich
(1759–1825)
Schiller, Friedrich
(1759–1805)
Kotzebue, August von
(1761–1819)
Benkowitz, Karl Friedrich
(1764–1807)
Sonnleithner, Joseph Ferdinand von
(1766–1835)
Schlegel, August Wilhelm
(1767–1845)
Kind, Johann Friedrich
(1768–1843)
Voß, Julius von
(1768–1832)
Werner, Zacharias
(1768–1823)
Zschokke, Heinrich
(1771–1848)
Gleich, Joseph Alois
(1772–1841)
Schlegel, Friedrich
(1772–1829)
Tieck, Ludwig
(1773–1853)
Weißenthurn, Johanna von
(1773–1847)
Breuning, Stephan von
(1774–1827)
Müllner, Adolph
(1774–1829)
Meisl, Karl
(1775–1853)
Treitschke, Georg Friedrich
(1776–1842)
Kleist, Heinrich von
(1777–1811)
Klingemann, Ernst August Friedrich
(1777–1831)
Fouqué, Friedrich de la Motte
(1777–1843)
Brentano, Clemens
(1778–1842)
Bernard, Josef Karl
(1780–1850)
Günderode, Karoline von
(1780–1806)
Arnim, Ludwig Achim von
(1781–1831)
Chézy, Helmina von
(1783–1856)
Raupach, Ernst
(1784–1852)
Bäuerle, Adolf
(1786–1859)
Uhland, Ludwig
(1787–1862)
Eichendorff, Joseph von
(1788–1857)
Raimund, Ferdinand
(1790–1836)
Grillparzer, Franz
(1791–1872)
Körner, Theodor
(1791–1813)
Kupelwieser, Josef
(1791–1866)
Malß, Karl
(1792–1848)
Gehe, Eduard Heinrich
(1795–1830)
Wohlbrück, Wilhelm August
(1795–1848)
Immermann, Karl
(1796–1840)
Platen, August von
(1796–1835)
Schober, Franz von
(1796–1882)
Droste-Hülshoff, Annette von
(1797–1848)
Heine, Heinrich
(1797–1856)
Holtei, Karl von
(1798–1880)
Beer, Michael
(1800–1833)
Birch-Pfeiffer, Charlotte
(1800–1868)
Devrient, Philipp Eduard
(1801–1877)
Grabbe, Christian Dietrich
(1801–1836)
Lortzing, Albert (Gustav)
(1801–1851)
Nestroy, Johann
(1801–1862)
Bauernfeld, Eduard von
(1802–1890)
Braun von Braunthal, Karl Johann
(1802–1866)
Simrock, Karl
(1802–1876)
Kobell, Franz von
(1803–1882)
Haffner, Carl
(1804–1876)
Riese, Friedrich Wilhelm
(1805–1879)
Halm, Friedrich
(1806–1871)
Laube, Heinrich
(1806–1884)
Vischer, Friedrich Theodor
(1807–1887)
Schumann, Robert
(1810–1856)
Benedix, Julius Roderich
(1811–1873)
Gutzkow, Karl
(1811–1878)
Büchner, Georg
(1813–1837)
Hebbel, Friedrich
(1813–1863)
Ludwig, Otto
(1813–1865)
Wagner, Richard
(1813–1883)
Kaiser, Friedrich
(1814–1874)
Niebergall, Ernst Elias
(1815–1843)
Freytag, Gustav
(1816–1895)
Prutz, Robert Eduard
(1816–1872)
Dulk, Albert
(1819–1884)
Roeber, Friedrich
(1819–1901)
Kalisch, David
(1820–1872)
Mosenthal, Salomon Hermann von
(1821–1877)
Genée, Richard
(1823–1895)
Cornelius, Peter
(1824–1874)
Lassalle, Ferdinand
(1825–1864)
Moser, Gustav von
(1825–1903)
Heyse, Paul
(1830–1914)
Berg, O. F.
(1833–1886)
Schaefer, Wilhelm
(1835–1908)
Bunge, Rudolf
(1836–1907)
Wilbrandt, Adolf von
(1837–1911)
L'Arronge, Adolph
(1838–1908)
Anzengruber, Ludwig
(1839–1889)
Schnitzer, Ignaz
(1839–1921)
Goetz, Hermann
(1840–1876)
May, Karl
(1842–1912)
Widmann, Joseph Viktor
(1842–1911)
Wildenbruch, Ernst von
(1845–1909)
Schönthan, Franz von
(1849–1913)
Kadelburg, Gustav
(1851–1925)
Blumenthal, Oskar
(1852–1917)
Panizza, Oskar
(1853–1921)
Ganghofer, Ludwig
(1855–1920)
Jacoby, Wilhelm
(1855–1925)
Avenarius, Ferdinand
(1856–1923)
Sudermann, Hermann
(1857–1928)
Hauptmann, Carl
(1858–1921)
Laufs, Carl
(1858–1900)
Wette, Adelheid
(1858–1916)
Bleibtreu, Karl
(1859–1928)
Jerschke, Oskar
(1861–1928)
Ruederer, Josef
(1861–1915)
Alberti, Konrad
(1862–1918)
Schlaf, Johannes
(1862–1941)
Schnitzler, Arthur
(1862–1931)
Dehmel, Richard
(1863–1920)
Holz, Arno
(1863–1929)
Scheerbart, Paul
(1863–1915)
Hartleben, Otto Erich
(1864–1905)
Wedekind, Frank
(1864–1918)
Lachmann, Hedwig
(1865–1918)
Busoni, Ferruccio
(1866–1924)
Dovsky, Beatrice
(1866–1923)
Thoma, Ludwig
(1867–1921)
Gerhäuser, Emil
(1868–1917)
Rosenow, Emil
(1871–1904)
Hofmannsthal, Hugo von
(1874–1929)
Heiseler, Henry von
(1875–1928)
Rilke, Rainer Maria
(1875–1926)
Stavenhagen, Fritz
(1876–1906)
Boßdorf, Hermann
(1877–1921)
Essig, Hermann
(1878–1918)
Mühsam, Erich
(1878–1934)
Fock, Gorch
(1880–1916)
Lautensack, Heinrich
(1881–1919)
Rubiner, Ludwig
(1881–1920)
Wildgans, Anton
(1881–1932)
Ball, Hugo
(1886–1927)
Ertler, Bruno
(1889–1927)
Klabund
(1890–1928)
Sorge, Reinhard
(1892–1916)
Kaltneker, Hans
(1895–1919)

Some Details on How It Was Done

The XSLT file for the automatic generation of the gallery out of the TEI files that comprise our corpus can be found here.

The renaming of the image files was done with some regexps on BASH, the conversion and crunching of the images to 150px height were done with the ImageMagick command-line tool convert in a simple for loop:

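for file in "$SOURCE_DIR"/*
do
  # write a 150px-high JPEG; only a trailing .gif/.png extension is renamed to .jpg
  target="$TARGET_DIR/$(basename "$file" | sed 's/\.\(gif\|png\)$/.jpg/')"
  convert "$file" -strip -resize x150 -quality 60 "$target"
done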

Gender Data and Placeholder Images

Our XSLT file also extracts gender information from Wikidata showing that there are only 10 female writers among the 178 authors.

As stated above, the placeholder images were done by Ruth Reiche. We chose one for female and one for male authors out of a whole bunch of silhouettes she designed for her network visualisations of characters in Daniel Kehlmann’s novel “Ruhm”.

(End of transmission.)

The Facebook of German Playwrights was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on January 28, 2016.

The Birth and Death of German Playwrights

2015-10-22T00:00:00+02:00

“If your metadata is good, it can help you in many ways,” mumbled Captain Obvious when we last met, and we couldn’t agree more. So let’s toy around with some metadata today to get a better impression of what our corpus of roughly half a thousand German-language theatre plays actually contains.

You surely have seen the piece in Science, “A Network Framework of Cultural History”, and the corresponding lifetime-curve videos. Max Schich et al. set out to visualise “intellectual mobility” based on “spatiotemporal birth and death information (…) of more than 150,000 notable individuals”. That’s a lot of people, and we wouldn’t even dare to compare this little blog post to what they did. But anyway, we dabbled in telling the story of the birth and death of German playwrights by using a similar method with a much (like, much) smaller set of people – 178 authors altogether who wrote 465 plays published between 1731 and 1929.

The tl;dr version of how we did that: Wrote an XQuery script that uses the GND identifier for each author in our XML files to find our way to corresponding Wikidata objects where we extracted dates and places of birth and death of all the authors contained in our corpus. Generated two KML files and put them into the GeoBrowser – mission accomplished (feel free to zoom in a bit):

view full screen

Workflow, Bit More Detailed

Our Sydney corpus – which was derived from the “Digitale Bibliothek” corpus within the TextGrid Repository – holds 465 dramatic pieces from 1731 to 1929, written by 178 authors altogether. By plotting the places of birth and death of all of them onto a map we would probably find out if our corpus was balanced or if there were any (regional) biases we weren’t aware of.

All the documents in our repository contain authorship information, including GND identifiers. Their values are stored in an XML attribute (key) as follows (for legacy reasons, the value starts with pnd, not GND):

<author key="pnd:118540238">Goethe, Johann Wolfgang von</author>
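Reading the identifier back out of such an element takes only a few lines. The following Python sketch is not the XQuery script mentioned above. It assumes that the attribute always has the form pnd:<id> and that the linked-data URL follows the d-nb.info/gnd/ pattern used in this post; the function name gnd_url is made up for illustration.

```python
import xml.etree.ElementTree as ET

GND_BASE = "http://d-nb.info/gnd/"  # URL pattern as it appears in this post


def gnd_url(author_xml):
    """Return the GND URL for an <author key="pnd:..."> element, or None."""
    author = ET.fromstring(author_xml)
    key = author.get("key", "")
    prefix, _, identifier = key.partition(":")
    # Legacy data uses the prefix "pnd"; anything else is treated as missing.
    if prefix != "pnd" or not identifier:
        return None
    return GND_BASE + identifier


snippet = '<author key="pnd:118540238">Goethe, Johann Wolfgang von</author>'
print(gnd_url(snippet))
```

An element without a key attribute simply yields None, which makes it easy to list the authors that still lack an identifier.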

We had to update our schema to insert this attribute into our intermediary format (here’s the commit) to fully benefit from the beauty of linked open data (LOD). If you read German, there’s a nice chapter on the topic in the TextGrid compendium published last year (pp. 91, “Metadaten, LOD und der Mehrwert standardisierter und vernetzter Daten”, authored by Martin de la Iglesia, Nicolas Moretto and Max Brodhun).

The identifier stored in @key refers to an entry in the Integrated Authority File (the English name for the GND, Gemeinsame Normdatei) hosted by the German National Library. They provide an HTML view of the data, but you can also directly download the RDF and other representations. Let’s have a look at the data set on Goethe at http://d-nb.info/gnd/118540238. You’ll find basic info on him: aliases, occupation, dates and places of birth and death. In most cases, the places given have their own GND identifier, which is contained in the RDF file of each person record. In the case of Goethe we’re pointed to his birthplace Frankfurt am Main like this:

<gndo:placeOfBirth>
  <rdf:Description rdf:about="http://d-nb.info/gnd/4018118-2">
    <gndo:preferredNameForThePlaceOrGeographicName>Frankfurt am Main</gndo:preferredNameForThePlaceOrGeographicName>
  </rdf:Description>
</gndo:placeOfBirth>

Eventually, the Frankfurt am Main record gives away the geographical coordinates of the city:

<geo:hasGeometry rdf:parseType="Resource">
<rdf:type rdf:resource="http://www.opengis.net/ont/sf#Point" />
<geo:asWKT rdf:datatype="http://www.opengis.net/ont/geosparql#wktLiteral">Point ( +008.684166 +050.115277 )</geo:asWKT>
</geo:hasGeometry>

We just had to trim the string to +008.684166 +050.115277 and hand it over to a KML file (which can be interpreted by the majority of geo-visualisation tools) like this:

<kml>
  <Placemark>
    <address>Frankfurt am Main</address>
    <description>Place of Birth; 28 August 1749</description>
    <name>Gūta, Yūhān Wulfgāng fun</name>
    <Point>
      <coordinates>+008.684166 +050.115277</coordinates>
    </Point>
    <TimeStamp>
        <when>1749</when>
    </TimeStamp>
  </Placemark>
</kml>
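The two steps just described, trimming the WKT string and wrapping it in a Placemark, can be sketched in Python as follows. This is an illustration, not the script that produced our files: the helper names trim_wkt_point and placemark are invented here, and only the child elements shown in the snippet above are generated.

```python
import re
import xml.etree.ElementTree as ET


def trim_wkt_point(wkt):
    """Reduce 'Point ( +008.684166 +050.115277 )' to '+008.684166 +050.115277'."""
    match = re.fullmatch(r"\s*Point\s*\(\s*(\S+)\s+(\S+)\s*\)\s*", wkt)
    if match is None:
        raise ValueError("not a WKT point: %r" % wkt)
    return "%s %s" % match.groups()


def placemark(name, address, description, coordinates, year):
    """Build one <Placemark> with the child elements shown above."""
    pm = ET.Element("Placemark")
    ET.SubElement(pm, "address").text = address
    ET.SubElement(pm, "description").text = description
    ET.SubElement(pm, "name").text = name
    point = ET.SubElement(pm, "Point")
    ET.SubElement(point, "coordinates").text = coordinates
    stamp = ET.SubElement(pm, "TimeStamp")
    ET.SubElement(stamp, "when").text = str(year)
    return pm


coords = trim_wkt_point("Point ( +008.684166 +050.115277 )")
kml = ET.Element("kml")
kml.append(placemark("Goethe, Johann Wolfgang von", "Frankfurt am Main",
                     "Place of Birth; 28 August 1749", coords, 1749))
print(ET.tostring(kml, encoding="unicode"))
```

The sketch keeps the space-separated coordinate string exactly as trimmed. The KML reference itself describes coordinates as comma-separated longitude,latitude values, so other tools may expect that form.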

Easy enough, we just had to repeat this for the other authors to fill up our KML file and we’d be all set, we thought.

Wikidata Comes Into Play

But there was a catch. We only found coordinates for about two thirds of the places. Now, instead of manually adding the missing data, we wanted to try out if Wikidata was a good way out of this problem. We are keen followers of Magnus Manske’s Twitter and blog and he’s undertaking great efforts to enhance Wikidata, so our expectations were high.

There’s probably a more elegant way to do this, but we went in brute force, extracted the Wikipedia link from the RDF representations over at the GND, fetched the Wikipedia page, extracted the Q identifier from it and went over to the corresponding Wikidata record. Luckily, there’s a simple way to obtain the RDF representation of a single Wikidata object, something that Magnus helped us find out via Twitter (thanks again!).
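For readers who want to try this themselves, here is a small Python sketch. It assumes Wikidata's Special:EntityData endpoint, which serves one item in formats such as RDF or JSON; whether this is the exact route we took back then is not spelled out above. The coordinate lookup works on the JSON form and reads property P625 (coordinate location). The sample data is a hand-made fragment that mimics the shape of that JSON, filled with the Frankfurt values from this post.

```python
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.{fmt}"


def entity_data_url(qid, fmt="rdf"):
    """URL of the machine-readable representation of one Wikidata item."""
    return ENTITY_DATA.format(qid=qid, fmt=fmt)


def coordinates(entity):
    """Return (longitude, latitude) from the first usable P625 claim, or None."""
    claims = entity.get("claims", {}).get("P625", [])
    for claim in claims:
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value:
            return value["longitude"], value["latitude"]
    return None


# Hand-made fragment in the shape of the JSON output (not fetched from Wikidata).
place = {"claims": {"P625": [{"mainsnak": {"datavalue": {"value": {
    "latitude": 50.115277, "longitude": 8.684166}}}}]}}

print(entity_data_url("Q5879"))  # Q5879 is the Wikidata item for Goethe
print(coordinates(place))
```

An item without a P625 claim returns None, which is the case a crawler would have to log and fix by hand.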

Once we could directly examine the XML/RDF representation it was dead easy to get hold of all the geographical coordinates. We put the two resulting KML files on our GitHub:

https://dlina.github.io/data/geobrowser/lina-birth.kml
https://dlina.github.io/data/geobrowser/lina-death.kml

Pushing Our Data Into the GeoBrowser

Now we could finally feed the files into the GeoBrowser, our spatio-temporal visualisation playground of choice (after years in beta, it finally went 1.0 just this month). GeoBrowser supports both CSV and KML files. There is a pretty nice datasheet editor with autofill of coordinates based on the Getty Thesaurus of Geographic Names for those who want to copy/paste lists of place names. You can also spice up your KML files with HTML elements and link back to your edition or to wherever you like. And btw, if you want to feed the GeoBrowser directly from your own server, just ask the developers to add your domain to the whitelist.

You already viewed the result and thus the story of the birth and death of (some) German playwrights in the 18th, 19th and 20th century in the iframe above.

Analysis

As with most visualisations in the Humanities, this one needs a bit of explanation. First off, orange circles indicate places of birth, purple circles indicate places of death. As background map we chose the 1880 one. Bearing in mind that our corpus covers texts from ca. 1730 to 1930, you can also change the layout to a 1783, 1815, 1914 or 1920 map up in the GeoBrowser interface.

Now what is it we can see there? Feel free to zoom in and out as you please. One first impression is that our corpus is pretty well-balanced since there is no regional bias, i.e., no over-representation of authors from specific regions (like, no emphasis on Hessian, or Swabian, or Saxon, or East Prussian writers, etc., plus we’ve got a fair handful of Swiss and Austrian writers, too).

The biggest bubbles surround Berlin (11 births, 15 deaths) and Vienna (13 births, 20 deaths), the two metropolises of the Holy Roman Empire (and later the German and Austro-Hungarian Empires). But again, the two do not dominate the whole picture. So the well-balancedness is something we can state, even if we know that birth and death places are just basic metadata not saying anything about where the authors spent the most part of their lives.

Some Geospatial Peculiarities

Let’s take a look at geospatial extremities here. Of course, we cannot say anything about German-language literature in general, just about the 178 authors whose works are contained in our corpus of 465 German dramas. The outermost places are, clockwise:

Direction	Playwright	Place
N	Henry von Heiseler	born 1875 in St. Petersburg
E	J.M.R. Lenz	died 1792 in Moscow
S	Ernst von Wildenbruch	born 1845 in Beirut
W	Christlob Mylius	died 1754 in London

Lenz and Mylius surely add behavioural and artistic extremism to their geographical one (btw, there are some nice passages on Mylius in Hugh Barr Nisbet’s 2008 biography on Lessing, start reading here, pp. 51). Oh, and let’s not forget Heinrich Heine being the westward runner-up having died in Paris in 1856.

Another thing you can see in the visualisation is that some German-language authors preferred to die in Italy:

Author	Time and place
Maler Müller	1825 in Rome
August von Platen	1835 in Syracuse, Sicily
Friedrich Wilhelm Riese	1879 in Naples
Richard Wagner	1883 in Venice
Otto Erich Hartleben	1905 in Salò

Some More Notes on the Balancedness of Our Corpus

In addition to the regional well-balancedness of the corpus, there is also a temporal one, if we might say so. Have a look at the time-bar diagram right underneath the map (you can use the pull-down menus to change the scale). The first author appearing on the time bar, born in 1697, is Caroline Neuber. The first one to die is Johann Elias Schlegel, in 1749. Our youngest author is Hans Kaltneker, born in 1895. The last author to die is Johannes Schlaf, in 1941. The reason for him being the most recent author is copyright, of course (German copyright expires 70 years after the author’s death).
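The same extremes can be read off programmatically. The Python sketch below uses the life dates of just five of the 178 authors, so the results only match the statements above because those five were picked on purpose. The copyright check is a simplification that assumes protection runs until the end of the 70th calendar year after the author's death.

```python
# (name, year of birth, year of death) for five authors of the corpus
authors = [
    ("Neuber, Friederike Caroline", 1697, 1760),
    ("Schlegel, Johann Elias", 1719, 1749),
    ("Lessing, Gotthold Ephraim", 1729, 1781),
    ("Schlaf, Johannes", 1862, 1941),
    ("Kaltneker, Hans", 1895, 1919),
]


def public_domain(death_year, current_year):
    """Simplified rule: free from 1 January of the year death_year + 71."""
    return current_year > death_year + 70


first_born = min(authors, key=lambda a: a[1])
first_dead = min(authors, key=lambda a: a[2])
last_born = max(authors, key=lambda a: a[1])
last_dead = max(authors, key=lambda a: a[2])

print(first_born[0], first_dead[0], last_born[0], last_dead[0])
print(public_domain(last_dead[2], 2015))
```

Run over the full author list, a check like public_domain would flag any author whose death year is too recent for the corpus.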

Obstacles

Some of the minor issues we encountered on our way were the usual amounts of strange (unrelatable) values and nonexistent data, like missing Wikipedia entries or missing properties on Wikidata (they were not many and we fixed them while we went along, i.e., two playwrights finally got their Wikipedia article, and Wikidata was filled with some new properties).

While building our bridge from the GND entries to the corresponding Wikipedia articles, we found an accordant relation in the RDF file – good. Yet it turned out not every RDF file contains something like

<foaf:page rdf:resource="http://de.wikipedia.org/wiki/Johann_Wolfgang_von_Goethe"/>

Instead, the HTML presentation of the data contains a link to Wikipedia, automatically generated by help of a BEACON file. So we had to parse the entire webpage. If we had encountered an XHTML page we could have made use of the doc() function. Alas, the German National Library uses redirects (not supported by the doc() function) rather than URL rewriting (supported by the doc() function), so we had to let the EXPath HTTP client grab the page.

The case of Karl Haffner was a tad more complicated. The RDF file did contain a link to Wikipedia, but it nowadays leads to a disambiguation page where we obviously wouldn’t find the corresponding Wikidata object. So we had to add an exception (just this one) to our crawler.

One last thing: in our initial data set we found an author who died in 1952, which would have violated the 70-year copyright rule. A very early adopter in terms of open-source publishing, we thought. 😉 But the Wilhelm Schäfer (pnd:118794868) referenced in our source was not the author who should be referenced for writing Faustine, der weibliche Faust. So we corrected the data and pointed to the real Wilhelm Schaefer (pnd:117099309) instead. The same happened with one of Arno Schmidt’s favourite authors, Friedrich de la Motte Fouqué, who was mistaken for his grandson (correcting commit here). When we started, we took over the wrong PNDs from the TextGrid Repository, and things can go wrong any time, sure, especially when you (have to) apply automated tagging. In this case, we only found two wrong identifiers, but just imagine a slightly bigger project where you cannot double-check everything anymore: a wee bit of a nightmare for LOD.

Conclusion

So what did we achieve here? Nothing much, really. This is just one possible response to the imperative: “Know your data!” By automatically visualising the birth and death places of the playwrights that build our corpus of dramatic texts, we added a useful layer of description. And this will help us to classify any new results that our research on the corpus might yield in the future.

(End of transmission.)

The Birth and Death of German Playwrights was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on October 22, 2015.

dramavis: A Tool for Visualising and Calculating Literary Network Data

2015-08-06T00:00:00+02:00

Some of you will have seen our distant-reading showcase poster, this one (hi-res version on figshare):

These are the character networks of 465 German-language dramas from 1731 (left upper corner) to 1929 (bottom right) at one glance. You can see how networks are changing over time, the first network explosions occurring with Klopstock’s “Hermanns Schlacht” (1769) and Goethe’s “Götz von Berlichingen” (1773): second row, fifth and second from the right.

The network of Klopstock’s piece can be studied in detail here, the Goethe one here. All 465 network graphs can be accessed in a folder on GitHub.

Character-Centric Data

Visualisations are nice, especially when they set you up with the ability to kind of read a large number of literary texts from a distance. But there are two other things we did with our data. We used network values like size, min/avg/max degree, density and avg path length to make assumptions about literary evolution over time, as described in our last posting (“Comedy vs. Tragedy: Network Values by Genre”).

But we also calculated character-centric data, play by play, to make assumptions about single characters and their position in a network. We haven’t written anything about the character-centric data yet, but the data is all there (and will probably overwhelm you at first sight), in a single HTML document.

For each character of a play, you will find the following values in the tables: degree, betweenness centrality, average distance, closeness centrality. Let me give you a small example of how to make all this data talk. In the second pamphlet of the Stanford Literary Lab series, Franco Moretti takes a look at the average of the distance of a character to each of the other characters, suggesting that the one with the lowest score would be the protagonist of a play (cf. “Network Theory, Plot Analysis”).
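To make the average-distance idea concrete, here is a small Python sketch. The five-node network is a toy example and is not taken from any play in the corpus; the functions are written from scratch with a breadth-first search and are not the dramavis code.

```python
from collections import deque

# Toy network (hypothetical): undirected links between characters A to E
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]


def adjacency(edge_list):
    """Turn an edge list into a dict mapping each node to its neighbours."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_distance(adj, source):
    """Mean shortest-path length from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others)


adj = adjacency(edges)
scores = {name: average_distance(adj, name) for name in adj}
candidate = min(scores, key=scores.get)
print(candidate, scores[candidate])
```

In this toy network, A has the lowest average distance and would be the protagonist candidate in Moretti's sense.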

Another promising way to find the most important/most central person in a network of people is the betweenness-centrality score. To quote Wikipedia: “A node with high betweenness centrality has a large influence on the transfer of items through the network, under the assumption that item transfer follows the shortest paths.” We still have to discuss in what way this could apply to character networks of dramatic texts (“items”, in this case, could be information passed on from character to character), but let’s assume for a minute that the betweenness centrality score does correlate to the importance of a character. Then the numbers would tell us in an instant that, e.g., Emilia Galotti is not the most central character in Lessing’s play that bears her name. We knew this already, of course, but with this method we can easily generate a long list of plays whose title characters are not the most central ones, without having to actually read or reread any of the plays of our corpus. “Just think of this,” says Moretti, “I am discussing Hamlet, and saying nothing about Shakespeare’s words.” In fact, there will be some surprises if you look at our numbers. Such as this one: Lessing’s eponymous Nathan the Wise only ranks second after the sultan, Saladin.
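How a title character can end up behind another figure is easy to reproduce on a toy network. The Python sketch below computes betweenness centrality with Brandes' algorithm for unweighted, undirected graphs. It is written from scratch for illustration (dramavis relies on graph libraries instead), the character labels are hypothetical, and the scores are raw counts of shortest paths, not normalised values.

```python
from collections import deque

# Hypothetical cast: the mediator links the title character to two others
edges = [("Title", "Mediator"), ("Mediator", "X"), ("Mediator", "Y"),
         ("Title", "Z")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def betweenness(adj):
    """Brandes' algorithm: shortest paths through each node, unnormalised."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):       # accumulate from the farthest node back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every pair was counted from both ends in an undirected graph
    return {v: c / 2 for v, c in score.items()}


ranking = betweenness(adj)
print(sorted(ranking.items(), key=lambda item: -item[1]))
```

Here the mediator lies on more shortest paths than the title character, which is the kind of pattern the paragraph above describes.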

Certainly, these are only very simple examples of how to leverage all the data we calculated. Working on this kind of model rather than on the actual text can bring a whole set of new results, and it can draw our attention to aspects that went unnoticed so far. Just look at the centrality scores of Schiller’s first play “Die Räuber” which are in sharp disaccord with traditional research and our own intuition when reading the play. No doubt about it, there will have to be a lot of further research on these things.

How dramavis Works

But now onto the main thing, the foremost reason for this post is to introduce you to the tool we developed for the purposes described. It is a Python script called dramavis and was written by Christopher Kittel and me. You can find it on my GitHub account (https://github.com/lehkost/dramavis). Feel free to use it for your own purposes. To facilitate that a little, here is how “dramavis.py” works:

The script reads character networks of dramatic pieces from CSV files,
plots these networks into PNG graphs (using the igraph library and Fruchterman–Reingold as layout, things you can change in the code, of course),
writes drama network values to a CSV file,
writes drama character values to an HTML file (using the Django template language).
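The first and third steps can be sketched without any graph library. The CSV layout below (a header row, then one source,target pair per line) is an assumption made for this example; the real input files of dramavis may look different, and the function name network_values is invented here.

```python
import csv
import io

# Hypothetical input: one undirected edge per row
SAMPLE = """source,target
A,B
A,C
B,C
C,D
"""


def network_values(csv_text):
    """Compute size, max degree, average degree and density of one network."""
    degree = {}
    edges = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        u, v = row["source"], row["target"]
        if u == v or frozenset((u, v)) in edges:
            continue  # ignore self-loops and duplicate edges
        edges.add(frozenset((u, v)))
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    n, m = len(degree), len(edges)
    return {
        "size": n,
        "max_degree": max(degree.values()),
        "avg_degree": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
    }


print(network_values(SAMPLE))
```

The sketch expects at least one edge; an empty file would need its own handling before the values are written out.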

There are input/output directories on GitHub, so if you clone the whole shebang to your harddrive and have all the necessary libraries installed it should work out of the box and you can start adapting it to work with your own data.

As for a little history, the first version of the script was written in August, 2014, during the DARIAH International Digital Humanities Summer School in Göttingen. We were basically toying around with the networkx and igraph libraries and fed them with some literary network data. We showed some first results at workshops in Würzburg and Munich and at DH conferences in Graz and Sydney where some people were asking for the code. We didn’t wanna put it on GitHub until we revised the somewhat chaotic script (ha!), and that’s what we did at a spontaneous 2-day hackathon at the Göttingen Centre for Digital Humanities, at Heyne-Haus, in June, 2015.

Depending on your machine, it can take up to five minutes or so to process the 465 standard input files and generate all the different outputs. So that you can tell the script is still running, we included a simple progress bar. We want to add other things in the future (input formats other than CSV would be nice, for example), so if you have any suggestions, please bring them forward.

Other Approaches to Visualise Literary Network Data

This Python-based approach runs parallel to another approach based on D3.js leveraging our intermediary XML format to generate different kinds of outputs (as demonstrated on this slide from our talk at the DH2015 in Sydney). You can have a look at all the data generated via this approach at dlina.github.io/linas. There are still some bugs we have to fix, but feel free to toy around a bit. This small collection of dynamic visualisations already has traits of a toolbox for the structural analysis of dramatic texts. Either way, that’s where we’re headed.

Nothing more to say today. Happy distant reading!

dramavis: A Tool for Visualising and Calculating Literary Network Data was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on August 06, 2015.

Comedy vs. Tragedy: Network Values by Genre

2015-07-31T00:00:00+02:00

As described in a previous post, our DLINA intermediary format stores structural data extracted from the full-text TEI files of the TextGrid Repository as well as various metadata, including the author’s name and date of origin of a play (and its publication and/or premiere date). In addition, the DLINA format also stores specific title information, three in total: the main title of a play, its subtitle (if available) and a genre title (only if a genre can be derived from the official subtitle of a play). To give an example, the first piece of our Sydney corpus, Gottsched’s “Der sterbende Cato” from 1731, looks something like this:

<header>
 <title>Der sterbende Cato</title>
 <subtitle>Ein Trauerspiel</subtitle>
 <genretitle>Trauerspiel</genretitle>
 [...]
</header>

As said before, we only inserted a <genretitle> if the subtitle of a play contained a definite and largely conventional genre indication. Terms like “dramatische Skizze” (dramatic sketch) or an unspecified indication like “Drama” we did not regard as conventional genres, in the same way as we neglected unconventionally specified genres like “Ein Ammenmärchen in vier Akten” (An Old Wives’ Tale in Four Acts) or “Arabische Fantasia in zwei Akten” (Arabian Fantasy in Two Acts).

The resulting set of genre titles included the classic genres “Tragödie” and “Komödie”, or, “Trauerspiel” and “Lustspiel”, but also the general “Schauspiel”, “Posse” or “Oper”. These genre titles help us to better describe our corpus. We can now state that of 465 dramas in our Sydney corpus,

101 are marked as tragedy (“Tragödie” or “Trauerspiel”) and
92 are marked as comedy (“Komödie” or “Lustspiel”).
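The way a genre title follows from a subtitle can be mimicked with simple keyword matching. This Python sketch is an assumption about how such a rule could look, not our actual procedure: the term list is taken from the genres named in this post, and the function names are invented for illustration.

```python
import re

# Conventional genre terms named in this post (assumed to be matched as words)
CONVENTIONAL = ["Trauerspiel", "Tragödie", "Lustspiel", "Komödie",
                "Schauspiel", "Posse", "Oper"]
TRAGEDY = {"Trauerspiel", "Tragödie"}
COMEDY = {"Lustspiel", "Komödie"}


def genre_title(subtitle):
    """Return the first conventional genre term found as a whole word."""
    if not subtitle:
        return None
    for term in CONVENTIONAL:
        if re.search(r"\b%s\b" % re.escape(term), subtitle):
            return term
    return None


def genre_class(subtitle):
    """Map a subtitle to 'tragedy', 'comedy' or None."""
    term = genre_title(subtitle)
    if term in TRAGEDY:
        return "tragedy"
    if term in COMEDY:
        return "comedy"
    return None


print(genre_title("Ein Trauerspiel"), genre_class("Ein Trauerspiel"))
print(genre_title("Ein Ammenmärchen in vier Akten"))
```

Unspecified subtitles such as “Drama” fall through and get no genre title, in line with the rule described above.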

En plus, we were interested in how many of the texts were combined with music of any sort (i.e., “Opern”, “Operetten”, “Singspiele”, “Musikdramen”, etc.). For reasons of simplicity, we marked these texts as “Libretti”. Not all of these texts bear a corresponding genre indication in their subtitle. Wagner’s “Master-Singers of Nuremberg”, for example, doesn’t feature a subtitle we could directly use as <genretitle>. In these cases we did a little research to identify all libretti. The result is that

56 texts from our Sydney corpus are marked as “Libretti”.

With this kind of metadata, we could now easily build generic subcorpora and have a differentiated, genre-specific look into our network data. The corresponding median values and averages look like this:

Table 1: Network Measures, by Genre

	N=	Number of Characters (Median)	Max Degree (Median)	Average Degree (Average)	Density (Average)	Average Path Length (Average)
Corpus	465	16	13	9,01	0,59	1,46
Tragedy	101	19	16	9,57	0,52	1,56
Comedy	92	14	11	8,61	0,67	1,36
Libretto	56	16	13,5	9,09	0,64	1,39
Other	216	17	14	8,88	0,59	1,48

Let’s feed this data into some diagrams:

Fig. 1: Network Size (Median), by Genre

We can see that tragedy peaks, while comedy troughs. This trend is also confirmed when looking at other values like the network density, only this time comedy values are peaking and tragedy values troughing:

Fig. 2: Density (Mean), by Genre

Results for the other values are similar, suggesting that there is an evident connection between the size of a network and the other values. But we still need to further examine this connection, it isn’t as simple as it looks. It can be assumed that typical genre conventions also have a strong influence on the values: Like, in tragedies, we often have two (or more) opposing groups of people who don’t share the stage too often, and if they do, it is mainly in shape of single representatives. Comedies, on the other hand, have a tendency to make as many characters as possible once more appear on stage, together, at the end, typically for the purpose of a wedding (or even multiple weddings). Just take George Bernard Shaw who argued that comedies were “plays in which everyone was married in the last act”. These genre conventions have a crucial influence on, e.g., the density values (many characters on stage at the same time would make for higher density values, whereas density decreases if characters from two antagonising parties hardly ever meet).
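The effect of these two conventions on density can be illustrated with an invented cast of eight characters. The Python sketch below is purely schematic and uses no corpus data: in the first network two camps of four only meet through one representative each, in the second a final scene brings everybody on stage together.

```python
from itertools import combinations


def density(nodes, edges):
    """Share of realised links among all possible pairs of characters."""
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1))


def clique(members):
    """All pairs within a group that shares the stage."""
    return {frozenset(pair) for pair in combinations(members, 2)}


camp_a = ["a1", "a2", "a3", "a4"]
camp_b = ["b1", "b2", "b3", "b4"]
everyone = camp_a + camp_b

# "Tragedy-like": two camps that only meet through one representative each
tragedy = clique(camp_a) | clique(camp_b) | {frozenset(("a1", "b1"))}
# "Comedy-like": a final scene puts all eight characters on stage together
comedy = tragedy | clique(everyone)

print(round(density(everyone, tragedy), 2))
print(round(density(everyone, comedy), 2))
```

With the same number of characters, the first network realises 13 of 28 possible links and the second all 28, so size alone does not determine density.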

[Edit, 3 June 2018 – Nils and Marcus pointed us to this nice quote: “Somebody always has to die onstage, die or marry; that’s the only difference between a comedy and a tragedy as far as the world knows.” – from Mary di Michele, Tenor of Love, 2005, p. 20]

Regarding the density values, Figure 2 suggests a proximity between comedy and libretto. This is confirmed if we don’t consider the median, but the mean values:

Fig. 3: Network Size (Mean), by Genre

The structural similarity of comedy and libretto and their coinciding distance to the tragedy is showing up even if we look at the temporal evolution over two centuries, another simple subdivision of our corpus. These are the values we calculated:

Table 2: Network Size (Median), by Genre and Century

Genre	18th Century	19th Century	20th Century
Tragedy	11,00	24,50	20,00
Comedy	9,00	16,50	16,00
Libretto	10,00	16,00	17,50

Let’s put them into a diagram:

Fig. 4: Network Size (Median), by Genre and Century

And now have a look at the table with the density values and the corresponding diagram:

Table 3: Density (Mean), by Genre and Century

Genre	18th Century	19th Century	20th Century
Tragedy	0,56	0,49	0,58
Comedy	0,71	0,59	0,75
Libretto	0,67	0,60	0,75

Fig. 5: Density (Mean), by Genre and Century

So a closer look at the evolution over two centuries shows even more clearly the proximity of comedy and libretto and the persistent distance from the tragedy. Let’s keep in mind that our corpus only contains texts from 1731 to 1929, so the 18th and the 20th century are only partially covered. Nevertheless, we can recognise some particularities at second glance.
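A subdivision by genre and century of this kind boils down to a simple grouping step. The records in the following Python sketch are made up and are not corpus data. The century function uses strict calendar centuries (1800 still counts as 18th century), which is an assumption, since the post does not say how boundary years were assigned.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (year, genre, number of characters)
plays = [
    (1731, "Tragedy", 10), (1772, "Tragedy", 12), (1850, "Tragedy", 25),
    (1755, "Comedy", 9), (1840, "Comedy", 16), (1905, "Comedy", 17),
]


def century(year):
    """1731 -> 18, 1800 -> 18, 1801 -> 19 (strict calendar centuries)."""
    return (year - 1) // 100 + 1


def median_size_by_genre_and_century(records):
    """Group network sizes by (genre, century) and take the median of each."""
    groups = defaultdict(list)
    for year, genre, size in records:
        groups[(genre, century(year))].append(size)
    return {key: median(sizes) for key, sizes in groups.items()}


table = median_size_by_genre_and_century(plays)
print(table[("Tragedy", 18)])
```

Swapping median for statistics.mean in the last step would give the grouping used for the density table.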

First, it is interesting that the distances regarding the network densities remain fairly constant (Fig. 5), but not regarding network sizes (Fig. 4). Especially in the 18th century, network-size differences between the three genres are not as clear as in the 19th century, whereas differences regarding network densities are even slightly bigger than in the 19th century. This would be further proof that the network size, i.e., the number of characters in a play, is indeed an important factor influencing all other values, but there is no strict correlation. Because if there was one, the density values of the three genres would have to be very close to each other in the 18th century. Yet this is not the case, which could indicate that the above-mentioned genre conventions are another crucial factor for all network values and shouldn’t be underestimated.

Second, we can observe how the tragedy stands out in the 19th century. In other words: When looking at our network data, the 19th century proves to be a time of strong generic differences, at least in regard to the structural data we collected.

All in all, what we presented in this post are so far mere indications. We will have to look further into our data in order to better understand the evolution of subgenres over time as well as the impact of genre conventions on network measures. We also want to build larger generic subcorpora in the future. For example, it is very tempting to analyse the structure of the corpus of bourgeois tragedies discussed in Cornelia Mönch’s dissertation “Abschrecken oder Mitleiden. Das deutsche bürgerliche Trauerspiel im 18. Jahrhundert. Versuch einer Typologie” (1993). But, as they say, a lot of water will certainly flow down the river Rhine before we get there. We will continue to report in this blog. Stay tuned.

Great bow in the Rhine at Boppard. Source: Wikimedia Commons.

Comedy vs. Tragedy: Network Values by Genre was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on July 31, 2015.

Our Talk at DH2015 in Sydney (Full Text and Slides)

2015-07-13T00:00:00+02:00

That’s right, we transcribed the talk we gave at the DH2015 in Sydney, on 2 July 2015, entitled “Digital Network Analysis of Dramatic Texts”. Please note that our grammar might appear a bit jetlagged here and there. ;) We were the last group to speak in a very interesting network-analysis centric session chaired by Glenn Roe. If you take a veeery close look (hehe) at this panorama pic, you will recognise us setting up the room together with the other speakers, Elisa Beshero-Bondar and Ryan Heuser (big hello there!):

Since we used reveal.js as presentation framework, we can easily reference individual slides so you can follow both our transcript and the slides simultaneously. (For further reference: Our original abstract can be found here.) But let’s now start with our presentation:

Slide 0/0: Title

Slide 1/0: TOC

1. Approach

Slide 2/0: Basic Ideas

The tradition of structural approaches in Literary Studies reaches back to (at least) the different flavours of European structuralism developed since the 1960s. Our project sets out to continue this tradition, but our new take on the issue is to apply an automated data analysis to identify and characterise structural features of literary texts, or, more precisely: of dramatic texts. The long-term objective of our project is to gather and provide structural data which can be used, for example, to describe different compositional types of plays. What we mean by compositional types, or, types of structural composition, is best illustrated by an example.

Slide 2/1: Different Styles of Structural Composition

Let’s have a look at these two network graphs generated during our data analysis. The graph on the left visualises the scenic interactions between characters in Goethe’s neo-classical play from 1787, “Iphigenie auf Tauris” (“Iphigenia in Tauris”), which is influenced by Aristotelian poetics. The graph on the right side visualises the interactions in a work also written by Goethe, the historical play “Götz von Berlichingen”, which, for his part, is strongly influenced by Shakespearean poetics. We cannot discuss this in detail here. But even at first glance you can clearly see the structural differences. The two works are composed in very different ways, exhibiting two very different types of structural or compositional style.

Slide 2/2: The Digital Spectator

In our project, we are looking into these different structural styles in the context of the history of dramatic texts. As mentioned before, we are stepping into the tradition of structuralistic approaches, and we combine these approaches with methods developed in the field of Social Network Analysis – as done by several scholars since the early 21st century, by de Nooy, Rydberg-Cox or Moretti, to name but a few. We are borrowing our specific definition of structure from Social Network Analysis, meaning: The structure of a dramatic text we are analysing originates from interactions between the characters in a play. To be more precise: A relation between two characters as we define it is given if both characters are performing a speech act in a given segment of a play, generally in a scene. So, if character A and character B are speaking in a scene, they are – following our definition – linked to each other.
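The linking rule can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the project's actual code: the function name `build_network` and the toy input are invented, and each segment is simply represented as a list of speaker names.

```python
from itertools import combinations


def build_network(segments):
    """Return the set of undirected edges implied by the co-speaking rule.

    `segments` is a list of speaker lists, one per scene (or other segment).
    Two characters are linked if both speak in at least one common segment.
    """
    edges = set()
    for speakers in segments:
        # sorted(set(...)) drops repeated speakers and gives each pair a stable order
        for a, b in combinations(sorted(set(speakers)), 2):
            edges.add((a, b))
    return edges


# Toy play with three segments; the names are placeholders, not corpus data.
toy_segments = [
    ["A", "B"],
    ["B", "C", "B"],
    ["A", "B", "D"],
]
toy_edges = build_network(toy_segments)
print(sorted(toy_edges))
```

Note that the rule is binary: it does not matter how often or how much two characters speak in a shared segment, only that both do so at least once.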

This definition is inspired by works of Romanian mathematician Solomon Marcus. In his intriguing study “Mathematical Poetics” from 1973, Marcus suggests a concept which one might call “the digital spectator”. This is right up our alley, because with our digitally-driven method we are simulating this digital spectator who is not looking at a performance of the play on stage, but at an XML file. By means of our definition of interaction we are collecting our data; we are calculating network measures, generating graphs, running statistics and doing some other quantitative-analysis stuff.

Slide 2/3: 465 Network Graphs

And we are doing this, at the moment, with a corpus of 465 German-language plays – which you can see here. (By the way: You can download this fancy poster containing the network graphs of our whole corpus of 465 texts from figshare.) We are planning to include plays from other periods in our analysis, and we will, in the future, include plays written in other languages. But for the time being, our focus is on German plays from two centuries.

Slide 2/4: Workflow

Let’s now introduce our workflow and its three main steps: data mining, data editing and the display and analysis of our data.

2. Data Mining

Slide 3/0: Corpus

What makes our analysis quite difficult is that we are dealing with (excuse our French:) dirty data. If you happen to work with TEI documents like the ones from the Folger Shakespeare Library corpus, it’s pretty easy to do the kind of structural analysis we have in mind. But such a well-tagged TEI corpus is rare. Typically, we can consider ourselves lucky if we can obtain a digitised version of a focal text, and even luckier if it contains some basic markup. This applies, for example, to the bigger corpora containing German-language literature. The quite large and freely available TextGrid Repository (comprising texts from around 1500 to the 1930s), for instance, features TEI markup converted automatically from a proprietary format. For that reason, we have to deal with quite basic TEI markup and loads of different tagging errors. We had two options, basically: We could try to initiate a 6-figure project and manually improve the faulty tagging over several years; or, we could try to extract just the specific data we need and take it from there, improving just the bits essential for our approach. Lacking several hundreds of thousands of Euros, we chose the second way.

In order to build our corpus, we first had to extract all dramatic texts from the TextGrid Repository. This might sound relatively easy, but it’s not. We wrote a larger blog post on this subject, but to make it short: We ended up with 666 dramatic texts containing some basic TEI markup.

Slide 3/1: DLINA Corpus 15.07 (“Codename Sydney”)

Out of this corpus, we constructed our own subcorpus; its current version is entitled “DLINA Corpus 15.07 (Codename: Sydney)”. This corpus comprises 465 dramatic texts, so we discarded some 200 files. There are several reasons for excluding texts from our corpus. First, we wanted to limit our research to a specific time span, beginning with the modernisation of German drama at the outset of the Enlightenment era, which means – following a well-established academic position – to start with the 1730s works of Johann Christoph Gottsched. We further ruled out foreign-language plays, translations, mere pantomime plays, and fragments. Plus, we sorted out a few texts with very defective TEI markup. All in all, our Sydney corpus comprises 465 German-language dramatic texts ranging from 1731 to 1929.

This was just our starting point. The next step turned out to be another major task, the data editing.

3. Data Editing

Slide 4: Extracting Structural Data

As mentioned before, the TEI markup in the TextGrid Repository data was quite rudimentary – and often erroneous. So we had to find a way to edit and improve the data. Because we are just interested in a specific kind of structural data, we decided not to edit the original XML files as a whole but to selectively extract the data we are interested in, and then just edit these specific data. This was the moment in which we invented what we call the “DLINA zwischenformat”, which roughly translates as ‘DLINA intermediary format’, or simply ‘DLINA data file’. This intermediary format can be considered a structural abstraction of the fulltext TEI documents in our corpus. It is an XML file which is validated against a specific RNG schema. A zwischenformat file is created for each drama. It stores metadata, the actual structural data, and documentation (optional).

The structural data stored are the acts and scenes, the speakers occurring in the respective segments, and (optionally) some amounts, e.g., the number of speech acts, words, etc. for these characters. This DLINA data file not only makes improving the data quality easier but it also allows for a relatively quick way of gathering new data by basically just writing down the structure of a play and the speakers in the segments. You may have a look at our blog post that introduces the DLINA intermediary format. Or maybe you like to explore all the 465 DLINA files which we generated from our corpus and which are stored on GitHub.

Slide 4/1: Editing Process

However, just generating the DLINA data files was not enough. This is because the raw DLINA data files still retained some of the dirty data from the original XMLs – they were, in other words, full of bugs. So, manual intervention was often necessary to improve the data quality and correct errors in the source data.

In this editing process we had to face some errors that came about due to the automated conversion to TEI from the proprietary XML format; and we had to face some, so to say, intrinsic problems, i.e., characteristics typical for a play.

One recurrent problem of the first group was OCR errors in the <speaker> name. One of the phenomena resulting from intrinsic characteristics of a play was that there were different ways of referring to a character. For example, a character’s full name might be given on the first appearance and only the first name on further appearances. In this case, we had to manually identify both names and attribute them to the one character they refer to.
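As a rough illustration of this kind of name consolidation, here is a minimal sketch, assuming that the name variants have already been identified by hand and written into a lookup table. The function `normalise_speakers`, the alias table and all names in it are invented for this example; they are not part of our editing tools or data.

```python
def normalise_speakers(segments, aliases):
    """Replace every name variant by its canonical character name.

    `aliases` maps a variant (as found in the markup) to the canonical name;
    names without an entry are kept unchanged.
    """
    return [[aliases.get(name, name) for name in speakers] for speakers in segments]


# Hypothetical example: names and variants are invented for illustration.
aliases = {
    "ANNA MÜLLER": "ANNA",  # full name given on first appearance
    "ANNNA": "ANNA",        # OCR-style misreading of the speaker name
}
raw = [["ANNA MÜLLER", "KARL"], ["ANNA", "KARL"], ["ANNNA"]]
clean = normalise_speakers(raw, aliases)
print(clean)
```

Without this step, "ANNA MÜLLER", "ANNA" and "ANNNA" would show up as three separate nodes in the network, which would inflate the network size and distort every measure derived from it.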

There were a lot more problems we had to solve while editing the DLINA intermediary format. We established a larger set of editing rules; complete documentation including examples can be found on our blog.

Slide 4/2: Outlook

We will further improve our editing process. At this time, for example, we are developing a GUI for some of the simpler editing procedures. This GUI will also include some gamification elements, so maybe we will have some crowd-editing option in the future.

4. Display and Analysis

Slide 5/0: Four Types of Visualisation

After editing and cleaning our 465 data files, we did two things: We started to publish our data and commented on it in larger blog posts, and we ran some statistics, also with detailed comments. As we stressed before and can’t stress enough, this project is still very much a work in progress, but we can and we will show you some promising first results. As mentioned before, all our data is stored on our GitHub, and therefore very transparent. On top of that, we built a small network-data publishing machine to provide easy access to our data. We created a homepage for each of the 465 plays. There is a list of all these plays if you click on this link. On each of the homepages you’ll find 4 links leading to further information on the particular play.

Slide 5/1: Example: G. E. Lessing’s “Emilia Galotti” (1772)

One of the pages shows the network graph, with edges between all the people speaking in the same segment. There is a static graph and one with sticky nodes. Another page shows a matrix of encounters. And there is a page where you can have a look at our intermediary source file. That’s our data in plain daylight! Finally, there is one page that contains a bar chart with word counts for each character of the play. You can also interactively sort them.

Slide 5/2: Skit (The Biggest Chatterboxes in German Literature)

Which brings me to a little skit, a little interlude. With the kind of data we gathered, it was easy for us to make a list of the biggest chatterboxes in German literature, of course, only based on our middle-sized corpus. And for all of you who didn’t do their German Literature 101: It doesn’t matter, I’m sure you will at least know Faust and his counterpart Mephistopheles, from Goethe’s play “Faust, part 1”. And both of them are very talkative, earning places 3 and 4 of this top-10 of the biggest chatterboxes in German literature. Again, there’s a blog post on the subject, but of course, this one is a bit “tongue-in-cheek” and not part of our actual research.

Let’s rather have a look at some more meaningful facts. We actually started out to process our data by means of Social Network Analysis. Again, our measures are currently very basic, for example, we’re computing the size of our drama networks, their density, their average degree and so on. For now, let us acquaint you with just two charts we’re currently discussing in the group and with other colleagues.

Slide 5/3: Network Size (Median) by Decade (1730–1930)

Here you can see the evolution of network sizes between 1730 and 1930. On the x-axis you can see the 20 decades. The y-axis features the median values of the number of characters of all the plays of a decade.

Let’s now try some cautious-cautious interpretation of this diagram. Something is happening there, that’s for sure, but what exactly? Well, some of the ups and downs we did expect. Like for example: The increase of this value in the second half of the 18th century might be associated with the beginning reception of Shakespeare in Germany, which led, among other things, to the rise of the Historical Play in German literature. Or, another quick glimpse, the dropping values at the end of the 19th century might be associated with the rise of the Naturalistic Drama, which – to make a long story short – returned to the ideas of something like an Aristotelian poetics.

We published many more charts on our website, we also started to discuss them there, and this process will continue, of course. If you like, please have a look – and join the discussion. For now, we just show you one last chart, one that introduces another idea we will address with our statistical approaches. I’m talking about concepts of genre, or rather: subgenre.

Slide 5/4: Network Density (Mean) by Genre and Century

While editing our intermediary files we also included basic genre data, with the main focus on the usual suspects, major genres like tragedy, comedy, and opera libretti. With this kind of genre data we could now build subcorpora to have a look at genres and their specific network measures.

And we immediately noticed some interesting things. What we can see here is a multiple-line chart featuring the arithmetic means of density values by genre and century. Just looking at this single value over time, we can conclude that comedies and libretti implement a very similar structural composition over the centuries, while character networks of tragedies (the lowermost line in our diagram) show a much lower density. What is more, the values shown are pretty consistent over the centuries. This might be a first indication that we could actually cluster genres of dramatic texts by just looking at a few basic measures.

But as stated before, today we’re only talking about very basic data. Actually, we calculated a lot more network data and started to look into them. But we should not jump to conclusions too fast. It is still a long way to integrate our network analysis of dramatic texts into a holistic study of literary evolution. We will be pushing out more data on our blog in the next few months. For example, we’re putting the finishing touches on an article on Network Values by Genre, which should be ready in two or three weeks.

5. Further Research

Slide 6/0: Yada Yada

Wrapping things up, here is a slide with some notes on further research ideas. We need more statistical data, and we need to interpret them thoroughly. In addition, we will enlarge our German-language corpus. We will also look into existing foreign-language corpora, which also opens up the field of comparative studies. I’m especially thinking of Paul Fièvre’s excellent corpus comprising more than 750 French plays, but we will also be looking into a collection of American drama and we’re also happy to cooperate with other scholars on the subject. But our first and foremost task will be to find ways to contribute to traditional Literary Studies, to evaluate existing hypotheses reached by close-reading approaches, by traditional means, so to speak. Plus, we will try to arrive at our own set of interpretations and hold them up against established hypotheses in the field of Literary Studies. That is and should be our long-term plan at least.

6. Bibliography

Slide 7/0: Literary Theory, Social Network Analysis

Slide 7/1: Literary Studies & SNA

Slide 7/2: Literary Studies & SNA (Cont’d)

That was it. Thanks a lot for all the feedback we got, for the nice talks after the session and throughout the whole conference. Even if we aren’t even halfway there, it is nice to see how the network analysis of literary texts progresses. It’s definitely something to look out for at upcoming DH conferences.

Our Talk at DH2015 in Sydney (Full Text and Slides) was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on July 13, 2015.

200 Years of Literary Network Data

2015-06-25T00:00:00+02:00

After creating our corpus and extracting the structural data that are of interest to us it’s time to run some statistics. As it is with statistical data, they can evoke manifold interpretations and sometimes have the inclination to speak in riddles. We will certainly need a few more months to make sense of all the values we computed and collected.

Nevertheless, we’re prepared to offer at least some observations and insights already, all of which is still very much a work in progress. Our statistical analyses are quite rudimentary for the time being, more complex calculations will follow. However, some things can already be recognised in our data, or at least we can put them in front of you and open them up for discussion.

We already gave an example of how to query our data in our last posting on the biggest chatterboxes in German literature. But these kinds of rankings are only one thing; our main purpose is to look at the network values we computed in the context of Social Network Analysis (SNA). Again, we will start with very rudimentary data and concentrate on the following five measures:

Number of characters, i.e., the number of characters appearing in each drama network; equates to the ‘size’ of any given network.
Maximum degree, i.e., the highest degree of an actor of a drama network; degree here refers to the sum of scenic co-presences of a character in a play (that is, how many of the other characters does a character ‘meet’/’speak to’ throughout the whole play).
Average degree, i.e., the average of all character degrees of a dramatic text.
Density, i.e., the ratio of the number of actual co-presences to the number of potential co-presences among all the characters of a play; the density value is always somewhere between 0 and 1: if it is 1, every character speaks to every other character at least once.
Average path length, which is (quote Wikipedia:) “calculated by finding the shortest path between all pairs of nodes, adding them up, and then dividing by the total number of pairs. This shows us, on average, the number of steps it takes to get from one member of the network to another.”
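For readers who want to see these definitions at work, here is a minimal pure-Python sketch that computes all five measures for a small undirected network. It is an illustration under stated assumptions, not the code we ran on the corpus: the function name `network_measures` and the toy network are invented, each undirected edge is expected to appear only once, and pairs of characters with no connecting path are simply skipped when averaging path lengths (how to treat disconnected networks is a choice the definitions above do not settle).

```python
from collections import deque


def network_measures(nodes, edges):
    """Compute the five measures for an undirected, unweighted network."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    n = len(nodes)
    degrees = [len(neighbours[v]) for v in nodes]
    possible_pairs = n * (n - 1) // 2

    # Breadth-first search from every node yields all shortest path lengths.
    total, reachable_pairs = 0, 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbours[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        reachable_pairs += len(dist) - 1

    return {
        "size": n,
        "max_degree": max(degrees) if degrees else 0,
        "average_degree": sum(degrees) / n if n else 0.0,
        "density": len(edges) / possible_pairs if possible_pairs else 0.0,
        # Assumption: unreachable pairs are skipped rather than counted as infinite.
        "average_path_length": total / reachable_pairs if reachable_pairs else 0.0,
    }


# Toy network with four characters; not taken from the corpus.
toy = network_measures(
    ["A", "B", "C", "D"],
    {("A", "B"), ("A", "D"), ("B", "C"), ("B", "D")},
)
print(toy)
```

In the toy network, four of the six possible pairs are linked, so the density is 4/6; character B meets all three others and therefore has the maximum degree of 3.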

As stated before, these are very basic measures. But let’s go ahead and have a look at what these measures tell us about our Sydney corpus that includes 465 German-language plays from about 1730 to 1930.

In order to observe literary evolution throughout time, we grouped our dramatic texts by decades. This decision is contingent, of course, and we will also experiment with other periodisations (see below). But for a first look into the data, this approach will fulfill its purpose.
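The grouping itself is simple; a minimal sketch using only the Python standard library could look like this. The input values are invented toy data, the function name `stats_by_decade` is ours, and we assume here that a play is assigned to the decade its year falls into and that the sample standard deviation (`statistics.stdev`) is used; the population variant (`statistics.pstdev`) would be an equally plausible choice.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def stats_by_decade(plays):
    """Group (year, number_of_characters) pairs by decade and summarise them."""
    groups = defaultdict(list)
    for year, size in plays:
        groups[year // 10 * 10].append(size)
    summary = {}
    for decade in sorted(groups):
        sizes = groups[decade]
        summary[decade] = {
            "n": len(sizes),
            "average": round(mean(sizes), 2),
            "median": median(sizes),
            # Assumption: sample standard deviation; needs at least two plays.
            "sd": round(stdev(sizes), 2) if len(sizes) > 1 else 0.0,
        }
    return summary


# Invented toy input, not taken from the corpus.
toy_plays = [(1731, 10), (1736, 12), (1739, 14), (1741, 8), (1748, 9)]
toy_summary = stats_by_decade(toy_plays)
for decade, values in toy_summary.items():
    print(decade, values)
```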

First example, a table referring to the “Number of characters” of a play, revealing the average, median and standard-deviation values:

Table: Number of Characters

Decade	N	Average	Median	Standard Deviation
1730	5	11,6	11	3,51
1740	18	8,33	8	2,4
1750	10	9,2	8,5	3,58
1760	15	11,2	10	9,65
1770	36	13,42	12,5	11,74
1780	20	18,1	15,5	11,36
1790	20	27,1	20,5	28,42
1800	23	27,96	15	27,26
1810	24	32,75	23	22,62
1820	31	27,29	25	14,24
1830	31	39,55	25	45,32
1840	43	19,35	17	11,09
1850	16	21,81	17,5	13,47
1860	11	24,45	21	18,83
1870	14	21,29	23	6,28
1880	14	24,86	23	12,7
1890	36	18,06	15	13,2
1900	49	11,88	9	8,83
1910	33	22,85	18	17,46
1920	16	29,25	24,5	15,7

Let’s now acquaint you with some visualisations by putting our data into some diagrams:

Fig. 01: Number of Characters (Median)

The standard-deviation values look like this:

Fig. 02: Number of Characters (Standard Deviation)

As you can see, there is something going on in our corpus. For example, in the second half of the 18th century, we witness a period of gradual increase in the number of characters that can be brought in connection with the renunciation of classical drama poetics and the beginning reception of Shakespeare. We also recognise a peak in the 1830s, not least due to the success of the historical drama in this period. In the late 19th century, we can observe a significant reduction in the number of characters, probably an effect owed to the naturalistic drama and its recourse to the classical poetics and their idea of the three unities.

At the same time, it is significant how the standard deviation goes up towards the end of the 18th century. This indicates an increased number of different structural styles of drama composition. What we can observe here is a differentiation of dramatic production, in structural terms, away from the uniformity of the years from 1730 to 1750. This, however, changes again in the mid-19th century.

We don’t want to further discuss these statistical values at this point, especially because we don’t want to espouse any monocausal explanations.

Instead, let’s throw a glance at some more charts dedicated to the other values, i.e., Max Degree, Average Degree, Density and Average Path Length.

Fig. 03: Max Degree (Median)

Fig. 04: Average Degree (Average)

Fig. 05: Density (Average)

Fig. 06: Average Path Length (Average)

As stated above, we will evaluate and discuss these results later.

Before closing this post, we want to suggest one more way to analyse our data. We already mentioned that the classification by decades is rather arbitrary. However, there’s another option to pursue this idea. Why don’t we sort our corpus by already established periodisations of German literature and take it from there? Does our data reproduce established divisions into literary epochs?

This question must be approached with great caution. Established divisions into literary epochs do not just rely on a set of very specific structural elements (like our approach), no, they are, of course, much richer. We are absolutely not able to evaluate whether the known divisions into literary epochs are ‘correct’ or anything. That sort of thing is not possible with that kind of structural data. But anyhow, we can always check how our data relates to the established division into literary periods.

For that purpose, we picked two different divisions into epochs. The first was developed in the context of German Structuralism (cf., inter alia, Titzmann 1991a, Titzmann 1991b, Titzmann 2002, Titzmann 2012a, Titzmann 2012b, Wünsch 1991, Wünsch 1998, Wünsch 2007). The other classification was pulled from the timespans of the separate volumes of “Hansers Sozialgeschichte der deutschen Literatur vom 16. Jahrhundert bis in die Gegenwart” (Grimminger 1980–2009).

In the context of German structuralism, the following epoch classifications are discussed (all time spans are give or take, of course):

1720–1750: Literatursystem ‘Frühaufklärung’ (‘Early Enlightenment’)
1750–1770: Literatursystem ‘Empfindsamkeit’ (‘Sentimentalism’)
1770–1830: Literatursystem ‘Goethezeit’
1830–1850: Literatursystem ‘Biedermeier’
1850–1890: Literatursystem ‘Realismus’
1890–1930: Literatursystem ‘Frühe Moderne’

The separate volumes of “Hansers Sozialgeschichte der deutschen Literatur” are divided like this:

1680–1789 (Vol. 3)
1789–1815 (Vol. 4)
1815–1848 (Vol. 5)
1848–1890 (Vol. 6)
1890–1918 (Vol. 7)
1918–1933 (Vol. 8)
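Assigning a play to one of these time spans is a simple lookup. The following is a minimal sketch: the boundary years are the ones listed above, while the function name `period_of` and the rule that a boundary year (e.g. 1830) belongs to the period that starts in it are assumptions made for this illustration.

```python
from bisect import bisect_right

# Start years as listed above; labels shortened for this sketch.
STRUCTURALIST = [
    (1720, "Frühaufklärung"), (1750, "Empfindsamkeit"), (1770, "Goethezeit"),
    (1830, "Biedermeier"), (1850, "Realismus"), (1890, "Frühe Moderne"),
]
STRUCTURALIST_END = 1930

HANSER = [
    (1680, "Vol. 3"), (1789, "Vol. 4"), (1815, "Vol. 5"),
    (1848, "Vol. 6"), (1890, "Vol. 7"), (1918, "Vol. 8"),
]
HANSER_END = 1933


def period_of(year, periods, end):
    """Return the label of the period containing `year`, or None if outside.

    Assumption: a boundary year belongs to the period that starts in it.
    """
    if year < periods[0][0] or year >= end:
        return None
    starts = [start for start, _ in periods]
    return periods[bisect_right(starts, year) - 1][1]


# A play from 1772 under both periodisations.
print(period_of(1772, STRUCTURALIST, STRUCTURALIST_END))
print(period_of(1772, HANSER, HANSER_END))
```

Once every play carries such a label, the per-period medians and standard deviations can be computed in the same way as the per-decade values.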

So let’s see how our network values relate to these periodisations (this time around, we’re limiting this venture to the number of characters and network density).

The first four charts are dedicated to the Structuralist periodisation (since our Sydney corpus contains texts only from 1730 to 1930, the X-axes start at 1730):

Fig. 07: Number of Characters (Median), time spans according to Structuralist approach

Fig. 08: Number of Characters (Standard Deviation), time spans according to Structuralist approach

Fig. 09: Density (Average), time spans according to Structuralist approach

Fig. 10: Density (Standard Deviation), time spans according to Structuralist approach

Let’s now map our values onto the time spans suggested by the volumes of “Hansers Sozialgeschichte” (yet again: our Sydney corpus contains texts only from 1730 to 1930; hence, our X-axes are limited to this period of time):

Fig. 11: Number of Characters (Median), time spans according to “Hansers Sozialgeschichte”

Fig. 12: Number of Characters (Standard Deviation), time spans according to “Hansers Sozialgeschichte”

Fig. 13: Density (Average), time spans according to “Hansers Sozialgeschichte”

Fig. 14: Density (Standard Deviation), time spans according to “Hansers Sozialgeschichte”

Disclaimer

All results we’re presenting here are initial explorations of our corpus of 465 dramatic pieces and the network data we pulled out of the texts. Their significance is limited. But we do have network data that can be toyed around with, and that is what we are going to do in the near future. We will have to readjust and we will have to recalculate things. On that note, always bear in mind to never trust any statistics you didn’t forge yourself. Right?

Bibliography

Rolf Grimminger et al., Hansers Sozialgeschichte der deutschen Literatur vom 16. Jahrhundert bis in die Gegenwart, München 1980–2009.
Michael Titzmann (ed.), Modelle des literarischen Strukturwandels, Tübingen 1991.
Michael Titzmann, Skizze einer integrativen Literaturgeschichte und ihres Ortes in einer Systematik der Literaturwissenschaft, in: Michael Titzmann (ed.), Modelle des literarischen Strukturwandels, Tübingen 1991, 395–438.
Michael Titzmann, Epoche und Literatursystem. Ein terminologisch-methodologischer Vorschlag, in: Epochen. Mitteilungen des Deutschen Germanistenverbandes 49.3 (2002), 294–307.
Michael Titzmann, Probleme des Epochenbegriffs in der Literaturgeschichtsschreibung, in: Michael Titzmann, Anthropologie der Goethezeit. Studien zur Literatur und Wissensgeschichte, Berlin/Boston 2012, 31–67.
Michael Titzmann, “Empfindung” und “Leidenschaft”. Strukturen, Kontexte, Transformationen der Affektivität/Emotionalität in der deutschen Literatur in der 2. Hälfte des 18. Jahrhunderts, in: Michael Titzmann, Anthropologie der Goethezeit. Studien zur Literatur und Wissensgeschichte, Berlin/Boston 2012, 333–371.
Marianne Wünsch, Vom späten “Realismus” zur “Frühen Moderne”. Versuch eines Modells des literarischen Strukturwandels, in: Michael Titzmann (ed.), Modelle des literarischen Strukturwandels, Tübingen 1991, 187–203.
Marianne Wünsch, Die Fantastische Literatur der Frühen Moderne (1890–1930). Definition. Denkgeschichtlicher Kontext. Strukturen, München 1998.
Marianne Wünsch, Realismus (1850–1890). Zugänge zu einer literarischen Epoche, Kiel 2007.

200 Years of Literary Network Data was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 25, 2015.

The Biggest Chatterbox in German Literature

2015-06-23T00:00:00+02:00

The DLINA zwischenformat we recently introduced also stores counts of speech acts, words, lines and chars. Truth be told, we will always have to cope with some erroneous and inaccurate markup in the TextGrid Repository TEI files here and there, but now we can roughly specify how many speech acts are performed by each character, how many words are uttered by each of them, and the number of letters used by each. These values were collected from all dramas of our Sydney corpus, i.e., 465 dramas written or published between 1731 and 1929.

A complete list of all 9,913 characters contained in our corpus can be found here (i.e., the average cast list of a play has 21 characters).

But today we’re not interested in the list as a whole (we’ll get back to that later), but the top 20. So may we acquaint you with the biggest chatterboxes of German literature (omitting the ones who are not in our corpus, of course):

	Character	Title	Author	Chars	Words	Speech acts	Additional data
1	GEORG	Ignorabimus	Holz, Arno	133443	20859	952	http://dlina.github.io/390/
2	DUFROY	Ignorabimus	Holz, Arno	107534	16588	885	http://dlina.github.io/390/
3	FAUST	Faust. Der Tragödie erster Teil	Goethe, Johann Wolfgang	97546	9037	225	http://dlina.github.io/243/
4	MEPHISTOPHELES	Faust. Der Tragödie erster Teil	Goethe, Johann Wolfgang	92536	8408	257	http://dlina.github.io/243/
5	GOTHLAND	Herzog Theodor von Gothland	Grabbe, Christian Dietrich	86529	16325	508	http://dlina.github.io/158/
6	HOLLRIEDER	Sonnenfinsternis	Holz, Arno	85600	13544	663	http://dlina.github.io/174/
7	ONKEL LUDWIG	Ignorabimus	Holz, Arno	79066	12322	864	http://dlina.github.io/390/
8	LIBUSSA	Die Gründung Prags	Brentano, Clemens	70723	13139	308	http://dlina.github.io/384/
9	FRANZ	Franz von Sickingen	Lassalle, Ferdinand	67829	12445	219	http://dlina.github.io/287/
10	CARDENIO	Halle	Arnim, Ludwig Achim von	67167	12299	237	http://dlina.github.io/301/
11	MARIANNE	Ignorabimus	Holz, Arno	66707	10383	766	http://dlina.github.io/390/
12	ANATOL	Anatol	Schnitzler, Arthur	61885	11526	723	http://dlina.github.io/89/
13	FIESCO	Die Verschwörung des Fiesco zu Genua	Schiller, Friedrich	61633	10412	326	http://dlina.github.io/451/
14	MEPHISTOPHELES	Faust. Der Tragödie zweiter Teil	Goethe, Johann Wolfgang von	61231	10845	240	http://dlina.github.io/201/
15	CROMWELL	Ein Faust der That	Bleibtreu, Karl	61034	10581	257	http://dlina.github.io/322/
16	LA BELLA CENCI	Sonnenfinsternis	Holz, Arno	60956	10000	453	http://dlina.github.io/174/
17	DOKTOR FAUST	Doktor Faust	Soden, Julius von	60696	10640	543	http://dlina.github.io/450/
18	TASSO	Torquato Tasso	Goethe, Johann Wolfgang von	60095	11338	123	http://dlina.github.io/82/
19	FRANZ VON MOOR	Die Räuber	Schiller, Friedrich	57676	10303	172	http://dlina.github.io/8/
20	CARLOS	Don Carlos, Infant von Spanien	Schiller, Friedrich	55514	10444	333	http://dlina.github.io/217/

The appearance of Arno Holz and his play Ignorabimus seems natural, given that it’s the longest play in our corpus (cf. the corresponding blog entry).

But let’s have a closer look at the first four lines regarding only two dramas, aforementioned Ignorabimus and Goethe’s Faust, part I. Their values document quite some structural differences between the two texts, or rather, they indicate a completely different way of speaking:

	Character	Title	Author	Chars	Words	Speech acts	Additional data
1	GEORG	Ignorabimus	Holz, Arno	133443	20859	952	http://dlina.github.io/390/
2	DUFROY	Ignorabimus	Holz, Arno	107534	16588	885	http://dlina.github.io/390/
3	FAUST	Faust. Der Tragödie erster Teil	Goethe, Johann Wolfgang	97546	9037	225	http://dlina.github.io/243/
4	MEPHISTOPHELES	Faust. Der Tragödie erster Teil	Goethe, Johann Wolfgang	92536	8408	257	http://dlina.github.io/243/

Simply put: In Arno Holz’s play, characters speak much more often, but their utterances are quite short and so are the words they use. In Goethe’s play, the characters speak less often, but their speeches are much longer, as are the words they use.
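The comparison can be made explicit by dividing the columns of the table by one another. The following is only a minimal sketch: the figures are copied from the four rows above, while the helper name `ratios` and the choice of "characters per speech act" and "characters per word" as indicators are our own assumptions for illustration, not part of the DLINA tooling.

```python
# Figures copied from the four-row table above: (chars, words, speech acts).
rows = {
    "GEORG": (133443, 20859, 952),
    "DUFROY": (107534, 16588, 885),
    "FAUST": (97546, 9037, 225),
    "MEPHISTOPHELES": (92536, 8408, 257),
}


def ratios(chars, words, speech_acts):
    """Return (chars per speech act, chars per word), rounded to one decimal."""
    return round(chars / speech_acts, 1), round(chars / words, 1)


# Hypothetical helper structure: character name -> (per speech act, per word).
summary = {name: ratios(*values) for name, values in rows.items()}

for name, (per_speech, per_word) in summary.items():
    print(f"{name:15} {per_speech:7.1f} chars/speech act {per_word:5.1f} chars/word")
```

On these figures, Georg averages roughly 140 characters per speech act and Faust roughly 434, and the characters-per-word ratio is about 6.4 for Georg against about 10.8 for Faust.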

In any case, this difference can plausibly be explained by the entirely different eras that produced the two plays, which are separated by more than a century. Perhaps we’re witnessing the difference between two historical styles here: Isn’t this difference all about pre-modern vs. modern drama? We will discuss this further, of course, since this is the kind of quantitative evidence we are looking for when researching structural styles of dramatic texts.

But let’s leave the stage to the poets for now. First, we’ll have some text by our winner, Arno Holz, followed by some notorious Goethe lines:

1124 characters from Ignorabimus (1914)

GEORG
in diesem Augenblick durch die Tür links; schlanke, nervöse Erscheinung; in ihrer ganzen Haltung den ehemaligen Offizier noch verratend; das dunkle Haar an den Schläfen bereits stark ergraut; Schnurrbart noch dunkel, die Augen hellgrau und durchdringend.
Guten Morgen!
MARIANNE
herzklopfend aufgestanden; ihn groß anstarrend; sie hat unwillkürlich versucht, die Blumen etwas zu verbergen.
…
GEORG
unruhig, dabei eine Zigarette rauchend, auf und ab; seine Sprechweise ist hastig knapp.
Du brauchst die Dinger nicht zu verstecken! … Laßt euch nicht stören!
ONKEL LUDWIG
die Blumen ergreifend und sie vor sich hinlegend; ruhig.
Gib sie mir, Kind. Ich werde sie mir oben auf meine stille Stube stellen.
MARIANNE
die sich erst jetzt etwas gefaßt hat; stockend; zu Georg.
Hat dir der Diener … deinen Tee schon gebracht?
GEORG
durch dessen Ton fast permanent etwas wie Unruhe, federnde Unzufriedenheit oder Gereiztheit klingt.
Danke. Ich rauche! … Hatte nur so aus Gewohnheit geschellt. Reflexbewegung! Kann ihn wieder wegtragen. Pferdegetrappel.
ONKEL LUDWIG
ablenkend; nach dem Garten hin.
Eine Hitze draußen …
GEORG
kurz; sachlich.
Ja.

4565 characters from Faust. Der Tragödie erster Teil (1808)

FAUST.
Habe nun, ach! Philosophie,
Juristerei und Medizin,
Und leider auch Theologie
Durchaus studiert, mit heißem Bemühn.
Da steh’ ich nun, ich armer Tor,
Und bin so klug als wie zuvor!
Heiße Magister, heiße Doktor gar,
Und ziehe schon an die zehen Jahr’
Herauf, herab und quer und krumm
Meine Schüler an der Nase herum –
Und sehe, daß wir nichts wissen können!
Das will mir schier das Herz verbrennen.
Zwar bin ich gescheiter als alle die Laffen,
Doktoren, Magister, Schreiber und Pfaffen;
Mich plagen keine Skrupel noch Zweifel,
Fürchte mich weder vor Hölle noch Teufel –
Dafür ist mir auch alle Freud’ entrissen,
Bilde mir nicht ein, was Rechts zu wissen,
Bilde mir nicht ein, ich könnte was lehren,
Die Menschen zu bessern und zu bekehren.
Auch hab’ ich weder Gut noch Geld,
Noch Ehr’ und Herrlichkeit der Welt;
Es möchte kein Hund so länger leben!
Drum hab’ ich mich der Magie ergeben,
Ob mir durch Geistes Kraft und Mund
Nicht manch Geheimnis würde kund;
Daß ich nicht mehr mit sauerm Schweiß
Zu sagen brauche, was ich nicht weiß;
Daß ich erkenne, was die Welt
Im Innersten zusammenhält,
Schau’ alle Wirkenskraft und Samen,
Und tu’ nicht mehr in Worten kramen.

O sähst du, voller Mondenschein,
Zum letztenmal auf meine Pein,
Den ich so manche Mitternacht
An diesem Pult herangewacht:
Dann über Büchern und Papier,
Trübsel’ger Freund, erschienst du mir!
Ach! könnt’ ich doch auf Bergeshöhn
In deinem lieben Lichte gehn,
Um Bergeshöhle mit Geistern schweben,
Auf Wiesen in deinem Dämmer weben,
Von allem Wissensqualm entladen,
In deinem Tau gesund mich baden!

Weh! steck’ ich in dem Kerker noch?
Verfluchtes dumpfes Mauerloch,
Wo selbst das liebe Himmelslicht
Trüb durch gemalte Scheiben bricht!
Beschränkt von diesem Bücherhauf,
Den Würme nagen, Staub bedeckt,
Den, bis ans hohe Gewölb’ hinauf,
Ein angeraucht Papier umsteckt;
Mit Gläsern, Büchsen rings umstellt,
Mit Instrumenten vollgepfropft,
Urväter-Hausrat drein gestopft –
Das ist deine Welt! das heißt eine Welt!

Und fragst du noch, warum dein Herz
Sich bang in deinem Busen klemmt?
Warum ein unerklärter Schmerz
Dir alle Lebensregung hemmt?
Statt der lebendigen Natur,
Da Gott die Menschen schuf hinein,
Umgibt in Rauch und Moder nur
Dich Tiergeripp’ und Totenbein.

Flieh! auf! hinaus ins weite Land!
Und dies geheimnisvolle Buch,
Von Nostradamus’ eigner Hand,
Ist dir es nicht Geleit genug?
Erkennest dann der Sterne Lauf,
Und wenn Natur dich unterweist,
Dann geht die Seelenkraft dir auf,
Wie spricht ein Geist zum andern Geist.
Umsonst, daß trocknes Sinnen hier
Die heil’gen Zeichen dir erklärt:
Ihr schwebt, ihr Geister, neben mir;
Antwortet mir, wenn ihr mich hört!

Er schlägt das Buch auf und erblickt das Zeichen des Makrokosmus.

Ha! welche Wonne fließt in diesem Blick
Auf einmal mir durch alle meine Sinnen!
Ich fühle junges, heil’ges Lebensglück
Neuglühend mir durch Nerv’ und Adern rinnen.
War es ein Gott, der diese Zeichen schrieb,
Die mir das innre Toben stillen,
Das arme Herz mit Freude füllen
Und mit geheimnisvollem Trieb
Die Kräfte der Natur rings um mich her enthüllen?
Bin ich ein Gott? Mir wird so licht!
Ich schau’ in diesen reinen Zügen
Die wirkende Natur vor meiner Seele liegen.
Jetzt erst erkenn’ ich, was der Weise spricht:
›Die Geisterwelt ist nicht verschlossen;
Dein Sinn ist zu, dein Herz ist tot!
Auf, bade, Schüler, unverdrossen
Die ird’sche Brust im Morgenrot!‹

Er beschaut das Zeichen.

Wie alles sich zum Ganzen webt,
Eins in dem andern wirkt und lebt!
Wie Himmelskräfte auf und nieder steigen
Und sich die goldnen Eimer reichen!
Mit segenduftenden Schwingen
Vom Himmel durch die Erde dringen,
Harmonisch all das All durchklingen!

Welch Schauspiel! Aber ach! ein Schauspiel nur!
Wo fass’ ich dich, unendliche Natur?
Euch Brüste, wo? Ihr Quellen alles Lebens,
An denen Himmel und Erde hängt,
Dahin die welke Brust sich drängt –
Ihr quellt, ihr tränkt, und schmacht’ ich so vergebens?

Er schlägt unwillig das Buch um und erblickt das Zeichen des Erdgeistes.

Wie anders wirkt dies Zeichen auf mich ein!
Du, Geist der Erde, bist mir näher;
Schon fühl’ ich meine Kräfte höher,
Schon glüh’ ich wie von neuem Wein,
Ich fühle Mut, mich in die Welt zu wagen,
Der Erde Weh, der Erde Glück zu tragen,
Mit Stürmen mich herumzuschlagen
Und in des Schiffbruchs Knirschen nicht zu zagen.
Es wölkt sich über mir –
Der Mond verbirgt sein Licht –
Die Lampe schwindet!
Es dampft – Es zucken rote Strahlen
Mir um das Haupt – Es weht
Ein Schauer vom Gewölb’ herab
Und faßt mich an!
Ich fühl’s, du schwebst um mich, erflehter Geist.
Enthülle dich!
Ha! wie’s in meinem Herzen reißt!
Zu neuen Gefühlen
All’ meine Sinnen sich erwühlen!
Ich fühle ganz mein Herz dir hingegeben!
Du mußt! du mußt! und kostet’ es mein Leben!

The Biggest Chatterbox in German Literature was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 23, 2015.

Editing Rules

2015-06-22T00:00:00+02:00

Introduction

After the structural data have been extracted and put into the DLINA zwischenformat, manual intervention is often necessary to improve the data quality and correct errors in the source data. The TextGrid data in particular proved quite problematic because of OCR errors and incorrect tagging.

Some of the “external” problems we encountered (that is, problems not inherent to the text per se but introduced when it was converted, automatically or manually, to a computer-readable format and marked up) are:

no or insufficient structural data encoded,
OCR errors in <speaker> names (strings),
stage directions interpreted as part of a speaker’s name.

Additionally, there are a few “internal” phenomena – i.e. characteristics typical for a play – that have to be taken into account:

different ways of referring to a person – e.g., the full name might be given on the first appearance and only the first name on further appearances,
collectives or groups of speakers, e.g., “Alle” (all), “Einige” (some), “Andere” (others),
indeterminate speakers, e.g., “Ein Diener” (a servant), “Erster Ritter” (first knight) which might refer to different characters throughout a play.

In order to get around these problems, we had to manually edit the DLINA data files. We established a fixed set of rules (see below) to cover the most common problems and added comments to the data files if the changes involved non-trivial interpretation.

Rules for editing our zwischenformat (DLINA data files)

Rule 1 – Add the schema files as a PI
Rule 2 – Edit the metadata header
Rule 3 – Identification of characters
Rule 4 – Multiple speakers (explicit)
Rule 5 – Multiple speakers (implicit)
Rule 6 – Multiple speakers (collective)
Rule 7 – Same name for different characters
Rule 8 – Collectives as part of a collective

Rule 1: Add the schema files as a Processing Instruction – example

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://raw.githubusercontent.com/DLiNa/project/master/rules/lina.rnc"?>
<?xml-model href="http://raw.githubusercontent.com/DLiNa/project/master/rules/lina.sch"?>

Rule 2: Edit the metadata header – example

The TextGrid sources come with incorrect and/or incomplete metadata tagging in their (usually two) <tei:teiHeader> elements. This information has to be brought into a consistent state, and crucial information has to be added. This usually means:

removing surplus <title> tags,
if applicable, adding <subtitle> and <genretitle> (the former usually including a self-attributed genre like “Ein Trauerspiel in 5 Akten” and the latter containing the genre in a normalised way, in this case: “Trauerspiel”; to make things comparable, we’re considering adding attribute lists for the major genres),
adding known dates (when the play was written, first printed and premiered),
adding the URI of the data source(s) – in case we had to add structural information, a second <source> tag is added.

Before editing

  <head>
    <title>Dramen</title>
    <title>Gottsched, Johann Christoph</title>
    <title>Der sterbende Cato</title>
    <author>Gottsched, Johann Christoph</author>
    <date when="1730"/>
    <source> </source>
  </head>

After editing

  <header>
    <title>Der sterbende Cato</title>
    <subtitle>Ein Trauerspiel</subtitle>
    <genretitle>Trauerspiel</genretitle>
    <author>Gottsched, Johann Christoph</author>
    <date type="print" when="1732">1732</date>
    <date type="premiere" when="1731">1731</date>
    <date type="written" when="1730">1730</date>
    <source>https://textgridlab.org/1.0/tgcrud-public/rest/textgrid:nks0.0/data</source>
  </header>

Rule 3: Identification of characters – example 1

The easiest case is two similar and easily understandable names for one character. Often, a character is introduced by a full name, possibly including a title or an article, and later referred to only by the given name or the title alone. Another possibility is a simple typo in a character’s name. Here, we move the <alias> of one <character> (usually the less frequent, or the one containing a typo) to the “right” one.

Before editing

<personae>
  <character>
      <name>ODOARDO GALOTTI</name>
      <alias xml:id="odoardo_galotti">
        <name>ODOARDO GALOTTI</name>
      </alias>
  </character>
  <character>
      <name>ODOARDO </name>
      <alias xml:id="odoardo">
        <name>ODOARDO</name>
      </alias>
  </character>
</personae>

After editing

<personae>
  <character>
      <name>ODOARDO GALOTTI</name>
      <alias xml:id="odoardo_galotti">
        <name>ODOARDO GALOTTI</name>
      </alias>
      <alias xml:id="odoardo">
        <name>ODOARDO</name>
      </alias>
  </character>
</personae>

Rule 3: Identification of characters – example 2

A second, less obvious possibility is that a character is not visible on stage but its voice can be heard. In these cases, we add an <alias> to the lina:character and give it the attribute @type="voiceOf". The attribute allows us to differentiate later between a character actually on stage and one merely heard.

Before editing

<personae>
  <character>
      <name>MARIANE</name>
      <alias xml:id="mariane">
        <name>MARIANE</name>
      </alias>
  </character>
  <character>
      <name> MARIANENS STIMME </name>
      <alias xml:id="marianens_stimme">
        <name>MARIANENS STIMME</name>
      </alias>
  </character>
</personae>

After editing

<personae>
  <character>
      <name>MARIANE</name>
      <alias xml:id="mariane">
        <name>MARIANE</name>
      </alias>
      <alias xml:id="marianens_stimme" type="voiceOf">
        <name>MARIANENS STIMME</name>
      </alias>
  </character>
</personae>
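Both examples of Rule 3 amount to the same operation: the aliases of a surplus <character> are moved to the character they really belong to, optionally marked with a type such as "voiceOf". The following is a minimal sketch of that step, assuming Python's standard xml.etree.ElementTree and a namespace-free fragment as printed above (the real files live in the http://lina.digital namespace). The function name `merge_character` and its parameters are hypothetical; we edited the files by hand.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def merge_character(personae, keep_id, merge_id, alias_type=None):
    """Move all aliases of the character owning merge_id to the one owning keep_id.

    Hypothetical helper: the surplus <character> is removed afterwards.
    """
    def owner(alias_id):
        for character in personae.findall("character"):
            for alias in character.findall("alias"):
                if alias.get(XML_ID) == alias_id:
                    return character
        raise KeyError(alias_id)

    target, surplus = owner(keep_id), owner(merge_id)
    if target is surplus:
        return  # both IDs already belong to the same character
    for alias in surplus.findall("alias"):
        if alias_type:
            alias.set("type", alias_type)  # e.g. "voiceOf"
        target.append(alias)
    personae.remove(surplus)


personae = ET.fromstring("""
<personae>
  <character>
    <name>MARIANE</name>
    <alias xml:id="mariane"><name>MARIANE</name></alias>
  </character>
  <character>
    <name>MARIANENS STIMME</name>
    <alias xml:id="marianens_stimme"><name>MARIANENS STIMME</name></alias>
  </character>
</personae>
""")
merge_character(personae, "mariane", "marianens_stimme", alias_type="voiceOf")
merged_ids = [a.get(XML_ID) for a in personae.find("character").findall("alias")]
```

Leaving out `alias_type` gives the plain merge of example 1 (ODOARDO GALOTTI and ODOARDO).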

Rule 4: Multiple speakers (explicit) – example

A common “internal” phenomenon of plays is two or more characters speaking at the same time. In the easy cases they are explicitly named, separated by a comma or a conjunction like “und” (“and”). In these cases, we split the @who of the affected //lina:text//lina:sp into its constituents, removing any comma or conjunction. Additionally, the lina:character for the combined name is deleted from lina:personae.

Before editing

<text>
  <sp who="#madame_welldorf_und_luise">
    <amount n="1" unit="speech_acts"/>
    <amount n="5" unit="words"/>
    <amount n="1" unit="lines"/>
    <amount n="21" unit="chars"/>
  </sp>
</text>

After editing

<text>
  <sp who="#madame_welldorf #luise">
    <amount n="1" unit="speech_acts"/>
    <amount n="5" unit="words"/>
    <amount n="1" unit="lines"/>
    <amount n="21" unit="chars"/>
  </sp>
</text>
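The splitting step of Rule 4 could be sketched as follows. This is an illustration only: the function name `split_who` is hypothetical, and the assumption that combined IDs join their parts with "_und_" or "_and_" is derived solely from the example ID above. To stay on the safe side, the sketch only splits when every resulting part is a known character ID.

```python
import re


def split_who(who, known_ids):
    """Split a combined @who pointer into pointers to known characters.

    The value is returned unchanged unless it splits into more than one
    part and every part is a known character ID.
    """
    parts = re.split(r"_(?:und|and)_", who.lstrip("#"))
    if len(parts) > 1 and all(part in known_ids for part in parts):
        return " ".join(f"#{part}" for part in parts)
    return who


known = {"madame_welldorf", "luise"}
split = split_who("#madame_welldorf_und_luise", known)
untouched = split_who("#die_beiden_mägde", known)
```

Here `split` becomes "#madame_welldorf #luise", while the second pointer is left alone because it does not decompose into known IDs; such cases fall under Rules 5 and 6.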

Rule 5: Multiple speakers (implicit) – example

In the “implicit” case, the speakers are not named individually but referred to by their role or by some attribute they have in common. Here, the surplus <character> is deleted and the @who expanded to contain pointers to all the individual characters.

Before editing

<personae>
  <character>
    <name>ERSTE MAGD</name>
    <alias xml:id="erste_magd">
      <name>ERSTE MAGD</name>
    </alias>
  </character>
  <character>
    <name>ZWEITE MAGD</name>
    <alias xml:id="zweite_magd">
      <name>ZWEITE MAGD</name>
    </alias>
  </character>
  <character>
    <name>DIE BEIDEN MÄGDE</name>
    <alias xml:id="die_beiden_mägde">
      <name>DIE BEIDEN MÄGDE</name>
    </alias>
  </character>
</personae>
<text>
  <sp who="#die_beiden_mägde">
    <amount n="1" unit="speech_acts"/>
    <amount n="7" unit="words"/>
    <amount n="1" unit="lines"/>
    <amount n="32" unit="chars"/>
  </sp>
</text>

After editing

<personae>
  <character>
    <name>ERSTE MAGD</name>
    <alias xml:id="erste_magd">
      <name>ERSTE MAGD</name>
    </alias>
  </character>
  <character>
    <name>ZWEITE MAGD</name>
    <alias xml:id="zweite_magd">
      <name>ZWEITE MAGD</name>
    </alias>
  </character>
</personae>
<text>
  <sp who="#erste_magd #zweite_magd">
    <amount n="1" unit="speech_acts"/>
    <amount n="7" unit="words"/>
    <amount n="1" unit="lines"/>
    <amount n="32" unit="chars"/>
  </sp>
</text>
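Once it is clear who belongs to a group, expanding the pointer and deleting the surplus <character> is mechanical. A minimal sketch, assuming Python's xml.etree.ElementTree, a namespace-free fragment wrapped in a <play> root, and a hypothetical function name `expand_collective`; deciding on the member list remains a matter of reading the play.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def expand_collective(root, collective_id, member_ids):
    """Point every <sp> referencing the collective at its members instead,
    then drop the collective's <character> from <personae>."""
    old = f"#{collective_id}"
    new = [f"#{member}" for member in member_ids]
    for sp in root.iter("sp"):
        pointers = sp.get("who", "").split()
        if old in pointers:
            expanded = []
            for pointer in pointers:
                expanded.extend(new if pointer == old else [pointer])
            sp.set("who", " ".join(expanded))
    # Note: this removes the whole <character>, including any other aliases.
    for personae in list(root.iter("personae")):
        for character in personae.findall("character"):
            if any(a.get(XML_ID) == collective_id for a in character.findall("alias")):
                personae.remove(character)


play = ET.fromstring("""
<play>
  <personae>
    <character><name>ERSTE MAGD</name>
      <alias xml:id="erste_magd"><name>ERSTE MAGD</name></alias></character>
    <character><name>ZWEITE MAGD</name>
      <alias xml:id="zweite_magd"><name>ZWEITE MAGD</name></alias></character>
    <character><name>DIE BEIDEN MÄGDE</name>
      <alias xml:id="die_beiden_mägde"><name>DIE BEIDEN MÄGDE</name></alias></character>
  </personae>
  <text>
    <sp who="#die_beiden_mägde"><amount n="7" unit="words"/></sp>
  </text>
</play>
""")
expand_collective(play, "die_beiden_mägde", ["erste_magd", "zweite_magd"])
new_who = play.find("text/sp").get("who")
```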

Rule 6: Multiple speakers (collective) – example 1

When no explicit names are given but the speakers form an easily discernible collective, the <character> for the collective name is deleted and the @who edited to point to all characters speaking.

Before editing

<personae>
  <character>
    <name>MADAME WELLDORF</name>
    <alias xml:id="madame_welldorf">
      <name>MADAME WELLDORF</name>
    </alias>
  </character>
  <character>
    <name>LUISE</name>
    <alias xml:id="luise">
      <name>LUISE</name>
    </alias>
  </character>
  <character>
    <name>BEIDE</name>
    <alias xml:id="beide">
      <name>BEIDE</name>
    </alias>
  </character>
</personae>
<text>
  <sp who="#beide">
    <amount n="1" unit="speech_acts"/>
    <amount n="5" unit="words"/>
    <amount n="1" unit="lines"/>
    <amount n="21" unit="chars"/>
  </sp>
</text>

After editing

<personae>
  <character>
    <name>MADAME WELLDORF</name>
    <alias xml:id="madame_welldorf">
      <name>MADAME WELLDORF</name>
    </alias>
  </character>
  <character>
    <name>LUISE</name>
    <alias xml:id="luise">
      <name>LUISE</name>
    </alias>
  </character>
</personae>
<text>
  <sp who="#madame_welldorf #luise">
    <amount n="1" unit="speech_acts"/>
    <amount n="5" unit="words"/>
    <amount n="1" unit="lines"/>
    <amount n="21" unit="chars"/>
  </sp>
</text>

Rule 6: Multiple speakers (collective) – example 2

Often, multiple speakers are not named explicitly; instead, a collective reference is given, e.g., “Einige” (“some”), “Alle” (“all”), “the Borg”, etc. In these cases it is often necessary to resort to close reading to discern who is actually meant. We usually add a <change> to the <documentation> section if the expansion to explicit names is not obvious or requires lengthy close reading or a lot of interpretation.

Before editing

<div>
  <head>1. Akt</head>
  <div>
    <head>Erster Akt</head>
    <sp who="#mana">
      <amount n="12" unit="speech_acts"/>
      <amount n="177" unit="words"/>
      <amount n="10" unit="lines"/>
      <amount n="966" unit="chars"/>
    </sp>
    <sp who="#sora">
      <amount n="18" unit="speech_acts"/>
      <amount n="193" unit="words"/>
      <amount n="15" unit="lines"/>
      <amount n="972" unit="chars"/>
    </sp>
    <sp who="#feria">
      <amount n="14" unit="speech_acts"/>
      <amount n="168" unit="words"/>
      <amount n="11" unit="lines"/>
      <amount n="958" unit="chars"/>
    </sp>
    <sp who="#lato">
      <amount n="13" unit="speech_acts"/>
      <amount n="88" unit="words"/>
      <amount n="13" unit="lines"/>
      <amount n="439" unit="chars"/>
    </sp>
    <sp who="#alle">
      <amount n="3" unit="speech_acts"/>
      <amount n="9" unit="words"/>
      <amount n="3" unit="lines"/>
      <amount n="41" unit="chars"/>
    </sp>
    <sp who="#andrason">
      <amount n="39" unit="speech_acts"/>
      <amount n="1428" unit="words"/>
      <amount n="23" unit="lines"/>
      <amount n="7989" unit="chars"/>
    </sp>
    <sp who="#mela">
      <amount n="6" unit="speech_acts"/>
      <amount n="38" unit="words"/>
      <amount n="6" unit="lines"/>
      <amount n="228" unit="chars"/>
    </sp>
  </div>
</div>

Inspecting speech act and stage direction

Andrason kommt.
FERIA.
Sei uns willkommen! herzlich willkommen!
ALLE.
Willkommen!
ANDRASON.
Ich umarme dich, meine Schwester! Ich grüße euch, meine Kinder! Eure Freude macht mich glücklich, eure Liebe tröstet mich.

After editing

<div>
  <head>1. Akt</head>
  <div>
    <head>Erster Akt</head>
    <sp who="#mana">
      <amount n="12" unit="speech_acts"/>
      <amount n="177" unit="words"/>
      <amount n="10" unit="lines"/>
      <amount n="966" unit="chars"/>
    </sp>
    <sp who="#sora">
      <amount n="18" unit="speech_acts"/>
      <amount n="193" unit="words"/>
      <amount n="15" unit="lines"/>
      <amount n="972" unit="chars"/>
    </sp>
    <sp who="#feria">
      <amount n="14" unit="speech_acts"/>
      <amount n="168" unit="words"/>
      <amount n="11" unit="lines"/>
      <amount n="958" unit="chars"/>
    </sp>
    <sp who="#lato">
      <amount n="13" unit="speech_acts"/>
      <amount n="88" unit="words"/>
      <amount n="13" unit="lines"/>
      <amount n="439" unit="chars"/>
    </sp>
    <sp who="#mana #sora #feria #lato #mela">
      <amount n="3" unit="speech_acts"/>
      <amount n="9" unit="words"/>
      <amount n="3" unit="lines"/>
      <amount n="41" unit="chars"/>
    </sp>
    <sp who="#andrason">
      <amount n="39" unit="speech_acts"/>
      <amount n="1428" unit="words"/>
      <amount n="23" unit="lines"/>
      <amount n="7989" unit="chars"/>
    </sp>
    <sp who="#mela">
      <amount n="6" unit="speech_acts"/>
      <amount n="38" unit="words"/>
      <amount n="6" unit="lines"/>
      <amount n="228" unit="chars"/>
    </sp>
  </div>
</div>

Rule 7: Same name for different characters – example

Sometimes, two different characters are referred to by the same name, e.g., a servant to the president and a servant to the prince are both named “servant”. Here, it is necessary to add a <character> for the second individual, give both an easily recognisable name and ID, and edit the @who attributes to reflect which of the two each speech belongs to.

Before editing

<personae>
  <character>
    <name>EIN KAMMERDIENER</name>
    <alias xml:id="ein_kammerdiener">
      <name>EIN KAMMERDIENER</name>
    </alias>
  </character>
  <character>
    <name>KAMMERDIENER</name>
    <alias xml:id="kammerdiener">
      <name>KAMMERDIENER</name>
    </alias>
  </character>
</personae>

Inspecting speech acts and stage directions

1. Akt
Fünfte Szene
[…]
PRÄSIDENT.
Zwar du bist mir gewiß. Ich halte dich an deiner eigenen Schurkerei, wie den Schröter am Faden!
EIN KAMMERDIENER
tritt herein.
Hofmarschall von Kalb –
PRÄSIDENT.
Kommt, wie gerufen. – Er soll mir angenehm sein.
Kammerdiener geht.

2. Akt
Zweite Szene
Ein alter Kammerdiener des Fürsten, der ein Schmuckkästchen trägt.
[…]
KAMMERDIENER.
Seine Durchlaucht der Herzog empfehlen sich Mylady zu Gnaden, und schicken Ihnen diese Brillanten zur Hochzeit. Sie kommen soeben erst aus Venedig.

After editing

<personae>
  <character>
    <name>EIN KAMMERDIENER (PRÄSIDENT)</name>
    <alias xml:id="ein_kammerdiener_präsident">
      <name>EIN KAMMERDIENER</name>
    </alias>
  </character>
  <character>
    <name>EIN KAMMERDIENER (FÜRST)</name>
    <alias xml:id="kammerdiener_fürst">
      <name>EIN KAMMERDIENER (FÜRST)</name>
    </alias>
  </character>
</personae>

Rule 8: Collectives as part of a collective – example

Especially in dramas with several large crowds, subdivisions of these crowds take action and speak out while there is no explicit reference to who is actually part of this subdivision (no Six-of-Twelve here). Usually, these groups include none of the major characters and the utterances – while important for the atmosphere of the setting – are quite short. Here, we decided not to partition the collective, but rather to build it up: “Some of the crowd”, “Others of the crowd” etc. are considered an <alias> of the larger collective’s <character>.

Before editing

<personae>
  <character>
    <name>DAS VOLK</name>
    <alias xml:id="das_volk">
      <name>DAS VOLK</name>
    </alias>
  </character>
  <character>
    <name>DAS GANZE VOLK</name>
    <alias xml:id="das_ganze_volk">
      <name>DAS GANZE VOLK</name>
    </alias>
  </character>
  <character>
    <name>EINIGE VOM VOLK</name>
    <alias xml:id="einige_vom_volk">
      <name>EINIGE VOM VOLK</name>
    </alias>
  </character>
  <character>
    <name>STIMMEN AUS DEM VOLK</name>
    <alias xml:id="stimmen_aus_dem_volk">
      <name>STIMMEN AUS DEM VOLK</name>
    </alias>
  </character>
</personae>

After editing

<character>
  <name>DAS VOLK</name>
  <alias xml:id="das_volk">
    <name>DAS VOLK</name>
  </alias>
  <alias xml:id="das_ganze_volk">
    <name>DAS GANZE VOLK</name>
  </alias>
  <alias xml:id="stimmen_aus_dem_volk">
    <name>STIMMEN AUS DEM VOLK</name>
  </alias>
  <alias xml:id="einige_vom_volk">
    <name>EINIGE VOM VOLK</name>
  </alias>
</character>

Conclusion and caveat

Using these rules, we were able to work around most of the problems. The resulting data are much more consistent than what we started out with. But one always has to bear in mind that improving the data is still limited by some constraints of the source texts:

We had to assume that the structure as given in the source files was generally correct. Only in a few cases where the results were grossly wrong did we manually add the missing information to the sources, as with Goethe’s “Götz von Berlichingen”, where no scenes were tagged.
Characters that are not tagged as a <speaker> will not be recognised. If two speakers speak collectively and are tagged <sp>Kolja und Mitja</sp> in the source, the script will correctly recognise both speakers. However, there are instances of incorrect tagging where only one speaker is tagged (and the other might “disappear” into a stage direction). In these cases, the second speaker will not be recognised and thus not be present in the zwischenformat data. Usually, it is impossible to recognise these errors at first glance.
Stage directions might be tagged as parts of a speech, and vice versa. This will result in erroneous amounts in the zwischenformat’s <lina:sp>. Our worst case is a missing speaker, for example if all utterances of a character were falsely tagged as stage directions.

Editing Rules was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 22, 2015.

Introducing Our 'Zwischenformat'

2015-06-21T00:00:00+02:00

Our research interest focuses primarily on structural aspects of dramatic texts. The structural data is extracted from the 465 dramatic texts that constitute our Sydney corpus and then screened and edited before it can be evaluated statistically with regard to literary history.

The structural abstraction is provided by a PHP script that processes the TEI files, collects all the data needed for our purpose and puts it in our own zwischenformat (roughly translates as ‘intermediary format’, the DLINA data format we developed for this project and announced in our previous post). The script and what it produces, our zwischenformat, represent a structure-oriented form of data mining, so to speak.

Let’s assume that the basic structure of a drama looks as follows (without paratexts):

<segment>
 <sp who="#speaker1"></sp>
 <sp who="#speaker2"></sp>
 <sp who="#speaker3"></sp>
 <sp who="#speaker1"></sp>
 <sp who="#speaker3"></sp>
 ...
</segment>
<segment>
 <sp who="#speaker4"></sp>
 <sp who="#speaker2"></sp>
 ...
</segment>
...

The <segment>s represent the predefined structures of a drama: acts and scenes. Our script will extract the structure of segments and speakers from the full-text TEI files and write it into our zwischenformat. The actual content of the speeches is disregarded and represented instead by the number of speech acts, words, lines, and string length (in characters), each of which is summarised per occurring character, identified via its who attribute. Now we’re able to see at a glance how many words each character is contributing to a play, and we’re able to do that for the whole Sydney corpus. Stay tuned for a post on the greatest chatterboxes in German literature, hehe!

Anyhow, the result looks something like this:

<text>
 <div>
  <head>Vierte Szene</head>
  <sp who="#ferdinand">
   <amount n="7" unit="speech_acts"/>
   <amount n="481" unit="words"/>
   <amount n="2" unit="lines"/>
   <amount n="2585" unit="chars"/>
  </sp>
  <sp who="#luise">
   <amount n="7" unit="speech_acts"/>
   <amount n="208" unit="words"/>
   <amount n="3" unit="lines"/>
   <amount n="1057" unit="chars"/>
  </sp>
 </div>
</text>
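To illustrate the aggregation step, here is a minimal Python sketch that works on the simplified <segment> structure shown earlier. It is not our actual script (which is written in PHP and processes full TEI files), and the counting conventions are assumptions made for illustration: words are whitespace-separated tokens, characters are counted after whitespace normalisation, and the "lines" unit is left out. The function name `summarise_segment` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarise_segment(segment):
    """Aggregate speech acts, words and characters per @who within one segment."""
    totals = defaultdict(lambda: {"speech_acts": 0, "words": 0, "chars": 0})
    for sp in segment.findall("sp"):
        # Collapse runs of whitespace so that line breaks do not inflate counts.
        speech = " ".join("".join(sp.itertext()).split())
        entry = totals[sp.get("who")]
        entry["speech_acts"] += 1
        entry["words"] += len(speech.split())
        entry["chars"] += len(speech)
    return dict(totals)


segment = ET.fromstring("""
<segment>
  <sp who="#speaker1">Guten Morgen!</sp>
  <sp who="#speaker2">Eine Hitze draußen …</sp>
  <sp who="#speaker1">Ja.</sp>
</segment>
""")
totals = summarise_segment(segment)
```

In a real TEI file, each <sp> also contains the <speaker> label and possibly stage directions, which would have to be excluded before counting.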

The representation of drama structure (segmentations, speakers) is at the core of our zwischenformat. But it does even more. It captures metadata and it creates complete cast lists for each drama by making use of the who attributes.

Our zwischenformat consists of three main parts (each of which is required):

<header> (the metadata)
<personae> (a cast list created with the help of all who attributes)
<text> (drama segmentation and speakers)

Plus, there is also an optional part:

<documentation> (for documenting non-trivial editing decisions)

<documentation>
 <change n="1" type="expandCollective" who="peertrilcke">
   <path>/play/text[1]/div[4]/div[2]/div[1]</path>
   <orig>#die_abziehenden</orig>
   <corr>#fritz_kleinmichel #berta #kämpe #frau_piepenbrink #bellmaus #bolz #piepenbrink</corr>
   <comment>Siehe Text: "Fritz Kleinmichel mit seiner Braut, Kämpe mit Kleinmichel, Frau Piepenbrink mit Bellmaus, zuletzt Bolz mit Piepenbrink"; "Braut" i.e. Berta</comment>
 </change>
</documentation>

A complete yet very short and simple one-act drama would be represented like this by our zwischenformat:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://raw.githubusercontent.com/DLiNa/project/master/rules/lina.rnc"?>
<?xml-model href="http://raw.githubusercontent.com/DLiNa/project/master/rules/lina.sch"?>
<play xmlns="http://lina.digital">
 <header>
  <title>Die Urgrossmutter</title>
  <subtitle>Eine Tragi-Komödie in einem Aufzuge</subtitle>
  <genretitle></genretitle>
  <author>Scheerbart, Paul</author>
  <date type="print" when="1904" />
  <date type="premiere" />
  <date type="written" />
  <source>https://textgridlab.org/1.0/tgcrud-public/rest/textgrid:tv6f.0/data</source>
 </header>
 <personae>
  <character>
   <name>URGROSSMUTTER</name>
   <alias xml:id="urgrossmutter">
    <name>URGROSSMUTTER</name>
   </alias>
  </character>
  <character>
   <name>MANELLA</name>
   <alias xml:id="manella">
    <name>MANELLA</name>
   </alias>
  </character>
  <character>
   <name>CONSTANTIN</name>
   <alias xml:id="constantin">
    <name>CONSTANTIN</name>
   </alias>
  </character>
 </personae>
 <text>
  <div>
   <head>Personen</head>
  </div>
  <div>
   <head>[Stücktext]</head>
   <div>
    <head>[Stücktext]</head>
    <sp who="#urgrossmutter">
     <amount n="17" unit="speech_acts"/>
     <amount n="497" unit="words"/>
     <amount n="7" unit="lines"/>
     <amount n="2795" unit="chars"/>
    </sp>
    <sp who="#manella">
     <amount n="3" unit="speech_acts"/>
     <amount n="22" unit="words"/>
     <amount n="3" unit="lines"/>
     <amount n="154" unit="chars"/>
    </sp>
    <sp who="#constantin">
     <amount n="13" unit="speech_acts"/>
     <amount n="154" unit="words"/>
     <amount n="10" unit="lines"/>
     <amount n="948" unit="chars"/>
    </sp>
   </div>
  </div>
 </text>
</play>
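Reading such a file back is straightforward. The sketch below, which assumes Python's xml.etree.ElementTree, sums the word counts per speaker; the only thing to watch out for is the default namespace http://lina.digital declared on <play>. The function name `words_per_speaker` is hypothetical, the sample is an abbreviated version of the play above, and crediting a shared speech to every speaker listed in @who is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the default namespace declared on <play>.
NS = {"lina": "http://lina.digital"}


def words_per_speaker(play):
    """Sum the 'words' amounts of all <sp> elements, keyed by @who pointer."""
    totals = {}
    for sp in play.iterfind(".//lina:sp", NS):
        for amount in sp.findall("lina:amount[@unit='words']", NS):
            # A speech shared by several speakers is credited to each of them.
            for pointer in sp.get("who").split():
                totals[pointer] = totals.get(pointer, 0) + int(amount.get("n"))
    return totals


play = ET.fromstring("""
<play xmlns="http://lina.digital">
 <text>
  <div>
   <sp who="#urgrossmutter"><amount n="497" unit="words"/></sp>
   <sp who="#manella"><amount n="22" unit="words"/></sp>
   <sp who="#constantin"><amount n="154" unit="words"/></sp>
  </div>
 </text>
</play>
""")
totals = words_per_speaker(play)
ranking = sorted(totals, key=totals.get, reverse=True)
```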

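The zwischenformat is validated against the two schema files referenced in the processing instructions at the top of each file, lina.rnc and lina.sch.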

The raw zwischenformat versions of our Sydney corpus can be found here (i.e., the 465 files extracted from the TextGrid Repository before we started editing them):

https://github.com/dlina/project/tree/master/data/zwischenformat/raw_lina_data

The edited zwischenformat files can be found here (this is the deluxe version of our corpus, so to speak, the basis for all further analyses and visualisations; our editing rules will be published at a later point):

https://github.com/dlina/project/tree/master/data/zwischenformat

And now:

<div>
 <sp who="#everybody_and_their_aunt">
  <p>Long live the zwischenformat!</p>
 </sp>
</div>

Introducing Our 'Zwischenformat' was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 21, 2015.

Introducing DLINA Corpus 15.07 (Codename: Sydney)

2015-06-20T00:00:00+02:00

Our working corpus is based on the 666 dramas extracted from the TextGrid Repository (the not-so-simple extraction process was described by Frank and Mathias in an earlier post). This blog post describes the criteria for selecting the 465 dramas from said repository that make up our working corpus. The version number 15.07 refers to ‘July 2015’, as we’re going to present our results at the DH2015 conference on July 2, 2015. Later versions of the DLINA Corpus will be numbered accordingly. As the immediate reason for needing a reliable corpus with clean data is the upcoming conference in Sydney, it was also very easy to pick a codename for the corpus.

Anyway, in order to build our corpus for Sydney we started with a quick survey and picked out 497 plays that seemed suitable. I.e., we ruled out 169 of the TextGrid plays by following these assumptions:

Our corpus should be limited to a specific time span: We will start with the German Enlightenment drama, focussing on the modernisation of the German drama in the first half of the 18th century, a process associated with the name of Johann Christoph Gottsched. It is a well-established academic position that Gottsched’s dramatic writings as well as his dramatic theory mark a turning point in the history of German drama (see, for example, Rochow 1994, Catholy 1982, Koopmann 1979). Therefore, we ruled out 147 dramas that saw the light of day before Gottsched’s Der sterbende Cato (printed in 1732).
We also discarded:
- foreign-language originals,
- translations,
- mere pantomime plays, that is to say, plays that don’t feature <sp>eech elements,
- fragments, i.e., texts that were clearly left unfinished by their author.

While we were editing our data using our very own zwischenformat (roughly translatable as “intermediate format”; a corresponding blog post will be published shortly), we sorted out another 32 texts for the following reasons:

- if the TEI markup was too defective (missing <speaker> elements and such),
- if additional texts turned out to be fragments that had escaped our attention before,
- if the structure of a text proved to be too complicated (the treatment of 11 dramas had to be postponed for this reason).

All in all, our DLINA Corpus 15.07 (Codename: Sydney) comprises 465 dramatic texts, in the shape of 465 XML zwischenformat files.

Bibliography

Christian Rochow, Das Drama hohen Stils. Aufklärung und Tragödie in Deutschland (1730–1790), Heidelberg 1994 (DNB)
Eckehard Catholy, Das deutsche Lustspiel. Von der Aufklärung bis zur Romantik, Stuttgart 1982 (DNB)
Helmut Koopmann, Drama der Aufklärung. Kommentar zu einer Epoche, München 1979 (DNB)

Introducing DLINA Corpus 15.07 (Codename: Sydney) was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 20, 2015.

Working With Inconsistent Metadata

2015-06-19T00:00:00+02:00

As we underlined before, we can’t stop celebrating the fact that there are so many literary corpora on the web today. Just a fortnight ago, Martin Müller released the Shakespeare His Contemporaries (SHC) collection, a corpus of early modern English drama, encoded in TEI Simple. We will definitely look into this corpus at a later point, but today we will again be bothering you with the depths of the TextGrid Repository. No worries, today’s blog entry won’t be as excessive as the one we published yesterday. ;)

If you’re trying to work with corpora you didn’t create yourself, you will always run into metadata problems. The metadata may be inconsistent, incomplete or simply missing. Maybe the corpus builders just didn’t have the same metadata needs as you.

So let’s get back to our drama collection derived from the TextGrid Repository, kind of picking up our recent blog post on the top 10 longest German-language theatre plays contained in this very corpus. Today we want to look at the available metadata and try to put all the hundreds of plays in chronological order by just relying on the (inconsistent) metadata provided in the documents.

There are many purposes for doing so, one being the creation of a subcorpus of, let’s say, 18th-century drama. For this, you will need metadata that tells you when a theatre piece was written, or published, or when it premiered. Now, TEI provides a <creation> element to include information like that. Yet, it is not used consistently in the TextGrid Repository. In many cases, the <creation> slot is left empty. In other cases, it features something like this: <date notBefore="1837" notAfter="1872"/>, the mentioned years being the lifespan of an author. In a way, this information is still helpful to narrow down a text’s date of origin, but it is as vague as can be, of course.

So for the sake of putting all the hundreds of theatre pieces in chronological order, we had to work around this problem. Luckily, the TextGrid Repository also provides some publication info within the <note> element, something like this:

<note>Erstdruck in: »Urania«, 1826. Uraufführung am 22.12.1823, Königliches Theater, Berlin.</note>

In this example, we’ve got two year specifications, 1823 for the premiere, 1826 for the first print. It is always possible that a piece was written years or decades before it premiered or before it was printed (take, for example, Goethe’s “Urfaust”). If we had the resources, we would definitely try to add the missing metadata by hand. But what we were trying to do here is to work with what we have in order to narrow down the date of origin of a play. So in the mentioned example, we would opt for the earlier date, 1823.
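The step of picking the earliest year out of such a note can be sketched in a few lines of Python. This is only an illustration, not the code we actually ran: the function name `years_in_note` is made up, and the sketch assumes that every four-digit number in a note is a year (the XQuery further down also accepts three-digit numbers).

```python
import re

def years_in_note(note):
    """Return all four-digit numbers in a note, assuming they are years."""
    # Day and month numbers such as "22.12." have fewer than four digits
    # and are therefore skipped.
    return [int(match) for match in re.findall(r"\b\d{4}\b", note)]

note = ("Erstdruck in: »Urania«, 1826. "
        "Uraufführung am 22.12.1823, Königliches Theater, Berlin.")
print(min(years_in_note(note)))  # 1823
```

Taking the minimum of the list gives the earlier of the two dates, in line with the choice described above.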

Our decision tree would thus look something like this:

1. Look for an exact year in <creation>. If no such year is provided then:
2. Look for the earliest year mentioned within the <note> element. If that doesn’t yield a satisfactory result then:
3. Take the author’s year of death as the latest possible year of creation of a piece.
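As a rough sketch, the three rules can be written down as a small Python function. The name `approximate_year` and its parameters are hypothetical, and the sketch assumes that the three pieces of metadata have already been extracted from the TEI header. It also leaves out the additional checks the XQuery below performs, such as capping the result at the @notAfter year.

```python
from typing import Iterable, Optional

def approximate_year(creation_year: Optional[int],
                     note_years: Iterable[int],
                     author_death_year: Optional[int]) -> Optional[int]:
    """Approximate a play's year of origin from partial metadata."""
    # Rule 1: an exact year in <creation> wins.
    if creation_year is not None:
        return creation_year
    # Rule 2: otherwise take the earliest year mentioned in <note>.
    note_years = list(note_years)
    if note_years:
        return min(note_years)
    # Rule 3: fall back to the author's year of death (may be None).
    return author_death_year

print(approximate_year(None, [1826, 1823], 1872))  # 1823
print(approximate_year(None, [], 1616))            # 1616
```

The second call shows why all Shakespeare plays end up under 1616: with neither an exact creation year nor a usable note, only the author's year of death is left.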

For easier processing, we decided to use the detected year as part of the filename, followed by the name of the author and the title of the play. You can have a look at the result at the respective GitHub folder. Due to our treatment, the plays are automatically listed in chronological order, with the little exception of the 10 Greek and Roman plays written BC (to be found at the end of the file list).
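The filename schema can be illustrated with a short Python sketch. The function name `target_filename` is made up, and the input values are just an example, not necessarily the exact strings stored in the corpus. The special "BC" prefix for the ancient plays is left out.

```python
import re

def target_filename(year, author, title):
    """Build '<year>_<author>_-_<title>.xml' with whitespace turned into underscores."""
    def underscored(text):
        return re.sub(r"\s+", "_", text)
    return f"{year}_{underscored(author)}_-_{underscored(title)}.xml"

print(target_filename(1772, "Lessing, Gotthold Ephraim", "Emilia Galotti"))
# 1772_Lessing,_Gotthold_Ephraim_-_Emilia_Galotti.xml
```

Because the year comes first, a plain alphabetical directory listing doubles as a chronological one.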

As we stressed before, we chose this approach just to approximate the dates of origin. Such an approach never replaces the proper integration of metadata. For example, all Shakespeare plays are referenced by the year 1616 (rule 3 of our decision tree), due to the lack of better metadata. Again, we could start to repair this by hand, but that was not the purpose of this venture. If your corpus is big enough and you can’t just fix all the metadata with your bare hands, this is what you can do to get an approximation.

But let’s cut to the chase. Let’s have a look at the XQuery we used to work out the year specifications from the metadata provided. The query creates a list of Bash commands to replace the original filenames with the filename schema we described above. The last five lines starting with the mv command feature problematic filenames. It was a bit late yesterday and we, errm, decided to hardcode so we could eventually call it a day (the collection is still the same we used for our previous post):

xquery version "3.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
let $collection := '/db/data/textgrid-repository-dramas/'
return

(: build one mv command per file in the collection :)
((for $filename in xmldb:get-child-resources($collection)
let $doc := doc($collection || $filename)
(: collect every token of the note that contains a run of three or four digits and keep the smallest :)
let $noteStmt := for $item in tokenize($doc//tei:notesStmt/tei:note, '\W')[matches(., '\d{3,4}')] return number($item)
let $noteStmt := number(min($noteStmt))[1]
(: an exact year in <creation> takes precedence over the note :)
let $noteStmt := if ($doc//tei:creation/tei:date/@when or $doc//tei:creation/tei:date/string-length() = 4)
                    then min((number(min($doc//tei:creation/tei:date/@when)), if ($doc//tei:creation/tei:date/string-length() = 4) then number($doc//tei:creation/tei:date/string()) else ()) )
                    else $noteStmt
(: cap the result at @notAfter, i.e. the author's year of death :)
let $noteStmt := if ($noteStmt gt min($doc//tei:profileDesc/tei:creation/tei:date/number(@notAfter)))
                    then $doc//tei:profileDesc/tei:creation/tei:date/@notAfter
                    else $noteStmt
(: ok, if we still have no date, we look for the pubStmt and compare with creation@notAfter :)
let $noteStmt :=
        if (string($noteStmt) = 'NaN')
            then
                let $pub := number($doc//tei:biblFull/tei:publicationStmt/tei:date/@when)
                let $creation := number($doc//tei:profileDesc/tei:creation/tei:date/@notAfter)
                return
                min(($pub, $creation))
            else
            $noteStmt
let $noteStmt :=
        if (string($noteStmt) = 'NaN') then number($doc//tei:profileDesc/tei:creation/tei:date/@notAfter) else $noteStmt
let $noteStmt := if (string-length($noteStmt) = 3) then 'BC0' || $noteStmt else $noteStmt

let $target := $noteStmt || '_' || replace(string(($doc//tei:author)[1]), '\s+', '_') || '_-_' || replace(($doc//tei:fileDesc[1]/tei:titleStmt/tei:title/string())[1], '\s+', '_')
let $mv  :=
"mv '" || replace(xmldb:decode($filename), "[']",  "'\\$0'") || "' '" || replace($target, "[']",  "'\\$0'") || ".xml'
"

(:replace($mv, "[!|\(|\)|,|'|:|;|-]", '\\$0'):)
return
    $mv)
    ,   "
mv 'Aischylos_-_Der_gefesselte_Proemetheus_(-0525--0456).xml' 'BC0470_Aischylos_-_Der_gefesselte_Proemetheus.xml'
mv 'Aischylos_-_Die_Orestie_(-0525--0456).xml' 'BC0456_Aischylos_-_Die_Orestie.xml'
mv 'Euripides_-_Iphigenie_in_Aulis_(-0480--0406).xml' 'BC0406_Euripides_-_Iphigenie_in_Aulis.xml'
mv 'Euripides_-_Medea_(-0480--0406).xml' 'BC0431_Euripides_-_Medea.xml'
mv 'Plautus,_Titus_Maccius_-_Amphitryon_(-0250--0184).xml' 'BC0207_Plautus,_Titus_Maccius_-_Amphitryon.xml'
" )

Let’s conclude this rather dry blog post with some eye candy. We will introduce our “dramavis” script at a later point, but here is what it does. Among other things, it creates network graphs out of theatre pieces. The resulting PNGs can be glued together using ImageMagick and this is what we did to create a superposter of all the 666 dramas contained in the TextGrid Repository. Attention: In this initial version of the poster, the graphs are mostly erroneous due to inconsistent markup. We mainly used these graphs to find and correct markup errors since it’s a lot easier to look at a graph than read thousands of lines of TEI markup. The cleaning of dirty network data based on problematic markup is something we will address later. But for now, here’s a small version of our superposter in JPG format. The actual PNG version weighs 74 MB and was uploaded to Figshare, where you can download it in all its dubious beauty:

Well, this must be how Núñez de Balboa felt when he first saw the Pacific Ocean. ;) But apart from looking nice, this little superposter of 666 theatre plays can definitely be part of a distant-reading strategy once it is based on reliable network data, and this is definitely where we’re headed.

Working With Inconsistent Metadata was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 19, 2015.

A (Not So) Simple Question and a Somewhat Diabolic Answer

2015-06-18T00:00:00+02:00

How Many Dramatic Pieces Are Contained in the TextGrid Repository?

Simple question, seemingly. Before we try to answer it, a little heads-up: This blog post is ridiculously long. It can be regarded as a proof of concept of what Mareike König recently said at the “Wissensspeicher” conference in Düsseldorf at the beginning of March: “Blogs have no space constraints.” (In this video, 17:45 mins. in.) True that! So here we go:

Corpus building is a crucial task of many Digital Humanities projects and it is great to see a number of new corpora appear on a fairly regular basis. Many of these text collections feature markup following the TEI Guidelines. However, the mere existence of a corpus and its application of standardised formats doesn’t relieve you from working your way through its peculiarities. The purpose of this article is to demonstrate how you can start this process, our example being the vast TextGrid Repository and its subset of German-language drama.

The TextGrid Repository is the largest TEI-tagged corpus of German literature and is released freely under a CC-BY 3.0 licence. It contains thousands of literary texts from around 1500 to the 1930s: novels, theatre pieces, poems, etc. The corpus is accessible through a web interface here, but it can also be downloaded in its entirety so you can toy around with it in your own environment.

Using the Web Interface

The answer to the question posed in the title of this post seems to be a piece of cake. But it really isn’t, for several reasons. By trying to find the correct answer we turn the corpus upside down which will help us to gain insights on what to expect from the corpus when we start to build our theories around it.

Now, the first approach to answer our not-so-simple question leads us to the TextGrid Rep search form. If we make use of existing metadata and enter genre:"drama" as search term, the TextGrid Rep search engine returns 1462 results. These are far too many due to the fact that a search in the repository also considers ‘work objects’ according to TextGrid’s metadata schema (see the corresponding cheat sheet here).

If we limit our search to just XML documents we get a much better approximation, so let’s specify our search term: genre:"drama" format:"text/xml". And we’re down to 690! This is a promising answer and the good news is that we’re halfway there. Easy as pie, so far. But wait. We wouldn’t have written this article if it was that easy, right? The second half of our trip will take a lot (like, a lot) longer. But we will learn a plethora of things about our corpus and its constraints.

When we started getting acquainted with our corpus we found certain anomalies:

Some dramas are split into parts, each of which comes in its own XML document frame and has its own TEI header with the genre information we took advantage of before. These parts are counted as separate dramas when just looking for genre info in TEI headers and, for this reason, distort our results.
The second big problem is doublets. There are several dramatic pieces that appear twice or even three times. This happens due to co-authorship. E.g., O. F. Berg and David Kalisch both authored the dramatic text “Berlin, wie es weint und lacht” (1858). The full text appears only once in the corpus, but there’s a reference to the text for every co-author and each of these references features another genre value, which falsely increases the number of dramatic pieces we are counting.

To get rid of those things, we need to dive deep and therefore we need tools that are a bit more flexible, in this case, an XML database that we can use to build our own queries. So let’s download the whole corpus and load it into a local eXist-db instance.

eXist-db, an Open-Source Native XML Database

If you haven’t done so already, please go ahead and download eXist-db. After installing and starting it, you can access it via your browser on port 8080 of localhost. Just let it run for the time being.

Loading Data into Our Own XML Database

The corpus can be downloaded as one integral ZIP file from the TextGrid website. There are two versions of the corpus. The differences are explained on the website but aren’t that noteworthy, so let’s just go ahead and download the second version (390 MB, zipped). Unzip the file. All XML files are contained in the “12-publication” folder. There is one XML file for every author, 695 altogether (there are several Goethe files, but never mind). Apart from these exceptions, all the works of the same author are contained in one file.

Let’s load all the XML files into our XML database:

On the eXist-db dashboard, click on “Collections” and enter login and password (if you haven’t specified any login data, enter “admin” as login and leave the password field empty). Now click on the icon “New collection” (third from right) and create a new folder for our collection. Let’s call it “data”; from now on, this is where we will put all our data (hence the name, for good practice!). Let’s create a subfolder called “tgrep” for our repository and then click on “Upload resources” (the icon on the far right). Look for the folder we unzipped earlier, change into the “12-publication” folder and select all XML files (CTRL + A is your friend) so all of them will be loaded into our collection. This will take some time, around 5 minutes, exactly the time you need to squeeze two or three oranges and set yourself up with a glass of fresh juice.

If you wonder what’s contained in the XML files, just double-click on one. A new browser tab will open and give away the plain XML with some syntax highlighting to easily differentiate between TEI elements, plain text, URLs, etc. The document starts with the <teiCorpus> element, meaning that the file contains several works. According to the TextGrid metadata schema based on FRBR there may be several <teiCorpus> nested within the root element. So there are several hierarchies which in this case are not uniform, but let’s leave that for now.

The genre of a text is specified within the TEI element textClass. The schema (an .xsd file) specifies that the genre info in this corpus is contained within <tei:term>.

So once again, how many dramatic pieces are contained in the TextGrid Repository???

Building an XQuery

Let’s start by reproducing our result of 690 with a basic XQuery. This is to show you that we can easily reproduce the results of the search form.

So we want to find all works that are marked as “drama” in the genre-specific metadata. As indicated before, the TEI element <textClass> contains info on the genre. So let’s count all occurrences in the whole TextGrid Repository by using eXide, “a cool, handy, fully integrated editor for working with XQuery, XML, and other resources stored in eXist” (O’Reilly). Close the Collection Browser and click on the “eXide – XQuery IDE” logo. You should see a fresh sheet for your own queries.

First of all, we need to declare a namespace for technical reasons. Just insert this as line two:

declare namespace tei = "http://www.tei-c.org/ns/1.0";

To address our imported collection we write in the next line:

collection('/db/data/tgrep/')

If we now want to count the occurrences, we can use a count function. Just wrap a count() around the specified collection. Then we have to determine what to count, so let’s have a look at the genre information as described above: //tei:textClass/tei:keywords/tei:term[text() = 'drama']

Eventually, our query looks like this:

count(collection('/db/data/tgrep/')//tei:textClass/tei:keywords/tei:term[text() = 'drama'])

To evaluate it, just click on the “Eval” button and see what happens (after some seconds, anyway).
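If you want to cross-check such a count outside of eXist-db, the same path can be walked with Python's standard library. This is just a sketch under our own assumptions: the function name `count_drama_terms` is hypothetical and the tiny sample document is made up for illustration, it is not taken from the repository.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def count_drama_terms(root):
    """Count textClass/keywords/term elements whose text is exactly 'drama'."""
    terms = root.findall(".//tei:textClass/tei:keywords/tei:term", NS)
    return sum(1 for term in terms if term.text == "drama")

# Made-up sample: one document tagged "drama", one tagged "verse".
sample = ET.fromstring("""
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><profileDesc><textClass><keywords>
    <term>drama</term>
  </keywords></textClass></profileDesc></teiHeader></TEI>
  <TEI><teiHeader><profileDesc><textClass><keywords>
    <term>verse</term>
  </keywords></textClass></profileDesc></teiHeader></TEI>
</teiCorpus>
""")
print(count_drama_terms(sample))  # 1
```

For the real corpus you would parse each author file with `ET.parse(path).getroot()` and add up the per-file counts.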

Most of the stuff in this query is a so-called XPath. Basically, XPath is a language for browsing through and operating on your XML documents. XPath, XSLT and XQuery share the same function set. We can get the same results by using a loop, which helps us generate more readable and sometimes more efficient queries. This will become more important in a later step:

count(
    for $occurrence in collection('/db/data/tgrep/')//tei:textClass/tei:keywords/tei:term[text() = 'drama']
    return $occurrence)

Click on “Eval” and wait some seconds after which the output window returns a number, but what is this: 703? Are there, all of a sudden, 703 dramas in our corpus? Rhetorical question, of course not. So what happened? Obviously, there are some appearances of “drama” outside of TEI documents. So let’s specify our query and look just for occurrences of “drama” as a genre in TEI documents:

count(collection('/db/data/tgrep/')//tei:textClass[ancestor::tei:TEI]/tei:keywords/tei:term[text() = 'drama'])

We added the part [ancestor::tei:TEI] which tells the engine that we look for the occurrence in TEI documents only, and we leave the teiCorpus uncounted. “TEI” here is the root element of a TEI document. And look, we end up at 690, good! We just reproduced the result we got from the search form. The nice thing about reproducing this result is that we don’t stop here. With XQuery we can do much more.
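The effect of the ancestor test can also be sketched in Python. ElementTree has no ancestor axis, so the sketch builds a child-to-parent map and walks upwards from each term. The function name `split_drama_terms` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

def split_drama_terms(root):
    """Return (inside_tei, outside_tei) counts of 'drama' genre terms."""
    parent = {child: node for node in root.iter() for child in node}
    inside = outside = 0
    for term in root.iterfind(".//tei:textClass/tei:keywords/tei:term", NS):
        if term.text != "drama":
            continue
        # Walk upwards until we hit a TEI element or run out of ancestors.
        node = term
        while node is not None and node.tag != f"{{{TEI_NS}}}TEI":
            node = parent.get(node)
        if node is not None:
            inside += 1
        else:
            outside += 1
    return inside, outside

genre = ("<teiHeader><profileDesc><textClass><keywords><term>drama</term>"
         "</keywords></textClass></profileDesc></teiHeader>")
# Made-up sample: a corpus-level header plus two TEI documents.
sample = ET.fromstring(
    f'<teiCorpus xmlns="{TEI_NS}">{genre}<TEI>{genre}</TEI><TEI>{genre}</TEI></teiCorpus>'
)
print(split_drama_terms(sample))  # (2, 1)
```

The first number corresponds to what the query with [ancestor::tei:TEI] counts; the second is the remainder that sits in teiCorpus headers.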

For example, let’s try to subtract the 690 from the 703 pieces found earlier. This is interesting as it points us to a bunch of subcorpora in the repository containing a number of dramas. By executing the following query …

collection('/db/data/tgrep/')//tei:textClass[not(ancestor::tei:TEI)]/tei:keywords/tei:term[text() = 'drama']/base-uri()

… we get 13 hits. More precisely, we get the resource addresses within the database (comparable to the file name):

/db/data/tgrep/Literatur-Arnim%2C-Ludwig-Achim-von.xml
/db/data/tgrep/Literatur-Goethe%2C-Johann-Wolfgang-001.xml
/db/data/tgrep/Literatur-Grabbe%2C-Christian-Dietrich.xml
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml
/db/data/tgrep/Literatur-Hebbel%2C-Friedrich.xml
/db/data/tgrep/Literatur-Immermann%2C-Karl.xml
/db/data/tgrep/Literatur-Metastasio%2C-Pietro.xml
/db/data/tgrep/Literatur-Scheerbart%2C-Paul.xml
/db/data/tgrep/Literatur-Schiller%2C-Friedrich.xml
/db/data/tgrep/Literatur-Schnitzler%2C-Arthur.xml
/db/data/tgrep/Literatur-Scribe%2C-Eugene.xml
/db/data/tgrep/Literatur-Wagner%2C-Richard.xml

So what about these 13 hits? Each of them belongs to the header of a teiCorpus rather than to a TEI document, i.e., it describes a subcorpus aggregating several dramatic texts.

Why does this happen? Because some dramas are split into several TEI subdocuments. How do we find out which? Here’s our query:

collection('/db/data/tgrep/')//tei:textClass[not(ancestor::tei:TEI)]/tei:keywords/tei:term[text() = 'drama']/concat(base-uri(), ': ', (ancestor::tei:teiCorpus[1]//tei:fileDesc[1]/tei:titleStmt/tei:title/string())[1], ' >  ', count(ancestor::tei:teiCorpus[1]//tei:TEI))

Yields the following output:

/db/data/tgrep/Literatur-Arnim%2C-Ludwig-Achim-von.xml: Halle und Jerusalem > 4
/db/data/tgrep/Literatur-Goethe%2C-Johann-Wolfgang-001.xml: Faust. Eine Tragödie > 5
/db/data/tgrep/Literatur-Grabbe%2C-Christian-Dietrich.xml: Die Hohenstaufen > 2
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml: Panspiele > 4
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml: Die goldnen Straßen > 3
/db/data/tgrep/Literatur-Hebbel%2C-Friedrich.xml: Die Nibelungen > 5
/db/data/tgrep/Literatur-Immermann%2C-Karl.xml: Alexis > 3
/db/data/tgrep/Literatur-Metastasio%2C-Pietro.xml: L’isola disabitata > 2
/db/data/tgrep/Literatur-Scheerbart%2C-Paul.xml: Revolutionäre Theaterbibliothek > 23
/db/data/tgrep/Literatur-Schiller%2C-Friedrich.xml: Wallenstein > 4
/db/data/tgrep/Literatur-Schnitzler%2C-Arthur.xml: Marionetten > 3
/db/data/tgrep/Literatur-Scribe%2C-Eugene.xml: La dame blanche > 2
/db/data/tgrep/Literatur-Wagner%2C-Richard.xml: Der Ring des Nibelungen > 4

The number at the end of each line shows us how many separate texts are contained in each subcorpus. So, Wagner’s “Ring of the Nibelung”: check. Etc. etc. But there are still problems. E.g., Hebbel’s “Nibelungs”, in reality, consist of merely 3 parts, not 5. So let’s refine our query to leave out all TEI documents that aren’t marked as “drama”:

collection('/db/data/tgrep/')//tei:textClass[not(ancestor::tei:TEI)]/tei:keywords/tei:term[text() = 'drama']/concat(base-uri(), ': ', (ancestor::tei:teiCorpus[1]//tei:fileDesc[1]/tei:titleStmt/tei:title/string())[1], ' >  ', count(ancestor::tei:teiCorpus[1]//tei:TEI[descendant::tei:term/text() = 'drama']))

/db/data/tgrep/Literatur-Arnim%2C-Ludwig-Achim-von.xml: Halle und Jerusalem > 2
/db/data/tgrep/Literatur-Goethe%2C-Johann-Wolfgang-001.xml: Faust. Eine Tragödie > 5
/db/data/tgrep/Literatur-Grabbe%2C-Christian-Dietrich.xml: Die Hohenstaufen > 2
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml: Panspiele > 4
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml: Die goldnen Straßen > 3
/db/data/tgrep/Literatur-Hebbel%2C-Friedrich.xml: Die Nibelungen > 3
/db/data/tgrep/Literatur-Immermann%2C-Karl.xml: Alexis > 3
/db/data/tgrep/Literatur-Metastasio%2C-Pietro.xml: L’isola disabitata > 2
/db/data/tgrep/Literatur-Scheerbart%2C-Paul.xml: Revolutionäre Theaterbibliothek > 22
/db/data/tgrep/Literatur-Schiller%2C-Friedrich.xml: Wallenstein > 4
/db/data/tgrep/Literatur-Schnitzler%2C-Arthur.xml: Marionetten > 3
/db/data/tgrep/Literatur-Scribe%2C-Eugene.xml: La dame blanche > 2
/db/data/tgrep/Literatur-Wagner%2C-Richard.xml: Der Ring des Nibelungen > 4
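The logic of the refined query can be approximated in Python as follows. This is a sketch under assumptions: the names `drama_parts_per_subcorpus` and `make_header` are hypothetical, the sample data (including the "verse" label) is invented, and we assume that a subcorpus carries its own teiHeader as a direct child.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

def drama_parts_per_subcorpus(root):
    """List (title, number of TEI documents tagged 'drama') per drama subcorpus."""
    results = []
    for corpus in root.iter(f"{{{TEI_NS}}}teiCorpus"):
        header = corpus.find("tei:teiHeader", NS)  # the subcorpus's own header
        if header is None:
            continue
        genres = [t.text for t in header.iterfind(".//tei:keywords/tei:term", NS)]
        if "drama" not in genres:
            continue
        title = header.findtext(".//tei:titleStmt/tei:title", default="", namespaces=NS)
        parts = sum(
            1
            for tei in corpus.iter(f"{{{TEI_NS}}}TEI")
            if any(t.text == "drama" for t in tei.iter(f"{{{TEI_NS}}}term"))
        )
        results.append((title, parts))
    return results

def make_header(genre, title="Untitled"):
    """Build a minimal teiHeader string for the made-up sample."""
    return (f"<teiHeader><fileDesc><titleStmt><title>{title}</title></titleStmt></fileDesc>"
            f"<profileDesc><textClass><keywords><term>{genre}</term></keywords>"
            f"</textClass></profileDesc></teiHeader>")

sample = ET.fromstring(
    f'<teiCorpus xmlns="{TEI_NS}">'
    f'<teiCorpus>{make_header("drama", "Sample Trilogy")}'
    f'<TEI>{make_header("drama")}</TEI>'
    f'<TEI>{make_header("drama")}</TEI>'
    f'<TEI>{make_header("verse")}</TEI>'
    f'</teiCorpus></teiCorpus>'
)
print(drama_parts_per_subcorpus(sample))  # [('Sample Trilogy', 2)]
```

The third document is ignored because its genre is not "drama", which is the same effect that brought Hebbel's count down from 5 to 3.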

What do we have here? We received a list with all segmented dramas. How do we check if these numbers are reliable? Well, this one is not for the computer to decide, but for the humanist’s eye. Goethe’s “Faust”, in our repository, still consists of these 5 files:

Zueignung
Vorspiel auf dem Theater
Prolog im Himmel
Faust. Der Tragödie erster Teil
Faust. Der Tragödie zweiter Teil

We could argue that the whole “Faust” is one integral piece. We could argue that Wagner’s “Ring of the Nibelung” is one piece. But we probably can’t declare the same thing for Scheerbart’s “Revolutionäre Theaterbibliothek” which consists of 22 pieces, and we probably shouldn’t count them as one.

Why this strange segmentation of some of the plays? This has to do with the origin of the TextGrid Repository, the zeno.org project. As we can see at the zeno.org website, Goethe’s Faust is split into 5 parts there when it really should be split into 2 parts only, “Faust, part 1”, and “Faust, part 2”.

So let’s use the human brain and some semesters of studying literature (hehe) and decide what to count as a separate text and what not:

/db/data/tgrep/Literatur-Arnim%2C-Ludwig-Achim-von.xml: Halle und Jerusalem > 2
- double drama, new number of plays: 1
/db/data/tgrep/Literatur-Goethe%2C-Johann-Wolfgang-001.xml: Faust. Eine Tragödie > 5
- two original parts, new number of plays: 2
/db/data/tgrep/Literatur-Grabbe%2C-Christian-Dietrich.xml: Die Hohenstaufen > 2
- remains 2
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml: Panspiele > 4
- no overlaps in personnel, remains 4
/db/data/tgrep/Literatur-Hauptmann%2C-Carl.xml: Die goldnen Straßen > 3
- no overlaps in personnel, remains 3
/db/data/tgrep/Literatur-Hebbel%2C-Friedrich.xml: Die Nibelungen > 3
- Hebbel himself describes the 3 parts as “one integral tragedy”, new number of plays: 1
/db/data/tgrep/Literatur-Immermann%2C-Karl.xml: Alexis > 3
- overlaps in personnel, new number of plays: 1
/db/data/tgrep/Literatur-Metastasio%2C-Pietro.xml: L’isola disabitata > 2
- one of the 2 parts is the Italian original, new number of plays: 1
/db/data/tgrep/Literatur-Scheerbart%2C-Paul.xml: Revolutionäre Theaterbibliothek > 22
- entirely separate dramas, remains 22
/db/data/tgrep/Literatur-Schiller%2C-Friedrich.xml: Wallenstein > 4
- new number of plays: 1
/db/data/tgrep/Literatur-Schnitzler%2C-Arthur.xml: Marionetten > 3
- no overlaps in personnel, remains 3
/db/data/tgrep/Literatur-Scribe%2C-Eugene.xml: La dame blanche > 2
- one of the 2 parts is the French original, new number of plays: 1
/db/data/tgrep/Literatur-Wagner%2C-Richard.xml: Der Ring des Nibelungen > 4
- remains 4

You will notice that some of our decisions are contingent. E.g., there are overlaps in personnel in the two parts of Goethe’s “Faust”. And the two parts of “Faust” have been put on stage together (cf. Peter Stein’s Faust-Projekt). Yet we would still argue that they are two different pieces. Others may think otherwise.

So we have to subtract the result of this calculation from our 690 found dramas:

690-((2-1)+(5-2)+(2-2)+(4-4)+(3-3)+(3-1)+(3-1)+(2-1)+(22-22)+(4-1)+(3-3)+(2-1)+(4-4)) = 690-13
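The subtraction is easy to double-check with a few lines of Python. The pairs are simply the numbers from the list of decisions above: files tagged as drama first, plays we decided to count second. The variable names are our own.

```python
# (files tagged as drama, plays we decided to count)
decisions = {
    "Halle und Jerusalem": (2, 1),
    "Faust. Eine Tragödie": (5, 2),
    "Die Hohenstaufen": (2, 2),
    "Panspiele": (4, 4),
    "Die goldnen Straßen": (3, 3),
    "Die Nibelungen": (3, 1),
    "Alexis": (3, 1),
    "L’isola disabitata": (2, 1),
    "Revolutionäre Theaterbibliothek": (22, 22),
    "Wallenstein": (4, 1),
    "Marionetten": (3, 3),
    "La dame blanche": (2, 1),
    "Der Ring des Nibelungen": (4, 4),
}

surplus = sum(files - plays for files, plays in decisions.values())
print(surplus)        # 13
print(690 - surplus)  # 677
```

Keeping the decisions in one table like this also makes it easy to change a single contingent decision (say, counting “Faust” as one play) and see how the total shifts.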

And we’re down to 677 dramas. We’re almost there! But there’s another thing we came across while working on the corpus: doublets.

How to Find Doublets

Due to the specific mapping in the repository, every work is assigned to all of its authors, which falsely doubles the number of dramas in cases of co-authorship. The full text can be found in only one of those documents; the others just contain the title and a reference (tei:ref) to the full text. If a piece has two authors, it has got two TEI headers. So when looking for occurrences of the genre value “drama” in the TEI element textClass, we’re counting the drama twice. But altogether, that one’s easy-peasy, we just have to subtract the redundant item.

But how do we find out how many theatre pieces are counted twice when using our previous query? This is the last step in order to answer our central question!

To determine the differences of the documents created by more than one author we have to look at the TEI code. The <text> node we find in Kalisch’s document is not empty which makes it a bit more complicated:

<text>
 <body>
  <div type="text" xml:id="tg4.3">
   <milestone unit="sigel" n="Berg-Berlin" xml:id="tg4.3.1"/>
    <head type="h4" xml:id="tg4.3.3">O. F. Berg / David Kalisch</head>
    <head type="h2" xml:id="tg4.3.4">
     <ref cRef="/Literatur/M/Berg, O. F./Drama/Berlin, wie es weint und lacht" xml:id="tg4.3.4.1">Berlin, wie es weint und lacht</ref>
    </head>
    <head type="h4" xml:id="tg4.3.5">Volksstück mit Gesang</head>
    <head type="h4" xml:id="tg4.3.6">in 3 Aufzügen und 11 Bildern</head>
  </div>
 </body>
</text>

The cRef attribute tells us that the actual text is to be found in the XML file dedicated to the other co-author, in this case, O. F. Berg. Now, to be able to distinguish between actual documents containing the dramatic text and documents that only contain a reference, we have to find a distinctive feature. Let’s try this one: The referenced TEI document contains <div> elements featuring subtype="work:no" attributes (this is to make sure that single scenes are not marked as separate “works”). The Kalisch document doesn’t have this feature, so that’s a good way to differentiate between the two. Mind you, you can always find other XML properties that suit you better, for example, look for <sp> elements (Berg has them, Kalisch does not). But anyway, let’s execute a query that gives us all the documents lacking the mentioned subtype="work:no" attribute:

for $item in collection('/db/data/tgrep')//tei:TEI
    where $item/tei:teiHeader//tei:keywords/tei:term/string() = 'drama' and not($item//tei:text//tei:div/@subtype="work:no")
    return ($item//tei:title)[1]

The result is a list of 27 tei:title elements:

<title xmlns="http://www.tei-c.org/ns/1.0">Der gefesselte Proemetheus</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Fidelio</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Die Geisterinsel</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Iphigenie in Aulis</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Medea</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Die Fledermaus</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Die jüngste Walpurgisnacht</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Zueignung</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Vorspiel auf dem Theater</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Prolog im Himmel</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Der Widerspenstigen Zähmung</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Pension Schöller</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Traumulus</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Im weißen Rößl</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Berlin, wie es weint und lacht</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Die Pfandung</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Der Besuch um Mitternacht</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Ablaßkrämer</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Doktor Faustus</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Das Mirakel</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Prolog</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Die Familie Selicke</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Genoveva</title>
<title xmlns="http://www.tei-c.org/ns/1.0">König Ödipus</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Antigone</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Gespenstersonate</title>
<title xmlns="http://www.tei-c.org/ns/1.0">Fidelio</title>

The goal of our query was to find documents that don’t feature the actual text of a drama. But the very first result (“Der gefesselte Proemetheus”) shows us that we have to refine our query because this play does contain text but does not feature any <div> element with a specified subtype="work:no" attribute. To correct our results, let’s exclude all documents that contain a <tei:l> or a <tei:p> element (because, obviously, then they do contain running text):

count(for $item in collection('/db/data/tgrep')//tei:TEI
    where $item/tei:teiHeader//tei:keywords/tei:term/string() = 'drama'
       and not($item//tei:text//tei:div/@subtype="work:no")
       and not($item//tei:text//tei:l)
       and not($item//tei:text//tei:p)
    return $item//tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title)

Ok, let’s try to translate this query into a humanly readable form:

First off, we use a for loop as explained before. This separates the single TEI documents starting with a TEI node (//tei:TEI) from our whole data set (collection('/db/data/tgrep')). We now operate on single documents until the loop finishes. Then we specify a condition for the documents we want to focus on (the where part). We …

select all documents where the genre specification is “drama”,
exclude all documents that contain a tei:div where the attribute subtype has a “work:no” value,
also exclude every document that contains at least a single tei:l and, finally,
exclude all documents with at least a single paragraph (tei:p).
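
For readers without an eXist-db instance at hand, the four conditions can also be tried out on a few toy documents. The following is only a minimal sketch in Python (standard library only), not the query we actually ran: the helper names tei, sample and is_drama_without_running_text as well as the five sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(tag):
    """Return a tag name in ElementTree's '{namespace}tag' notation."""
    return "{" + TEI_NS + "}" + tag


def is_drama_without_running_text(root):
    """Apply the four conditions of the query to one parsed TEI document."""
    header = root.find(tei("teiHeader"))
    if header is None:
        return False
    # Condition 1: some keywords/term in the header has the value 'drama'.
    terms = [
        "".join(term.itertext())
        for keywords in header.iter(tei("keywords"))
        for term in keywords.findall(tei("term"))
    ]
    if "drama" not in terms:
        return False
    # Conditions 2-4 only look inside tei:text, never inside the header.
    for text in root.iter(tei("text")):
        if any(div.get("subtype") == "work:no" for div in text.iter(tei("div"))):
            return False
        if any(True for _ in text.iter(tei("l"))):
            return False
        if any(True for _ in text.iter(tei("p"))):
            return False
    return True


def sample(genre, body):
    """Build an invented toy document; note the <p> inside the header."""
    return ET.fromstring(
        '<TEI xmlns="' + TEI_NS + '"><teiHeader>'
        "<fileDesc><sourceDesc><p>Quelle</p></sourceDesc></fileDesc>"
        "<profileDesc><textClass><keywords><term>" + genre + "</term>"
        "</keywords></textClass></profileDesc></teiHeader>"
        "<text><body>" + body + "</body></text></TEI>"
    )


docs = {
    "with_verse_line": sample("drama", "<div><sp><l>Ein Vers.</l></sp></div>"),
    "with_paragraph": sample("drama", "<div><sp><p>Ein Satz.</p></sp></div>"),
    "with_work_no_div": sample("drama", '<div subtype="work:no"><head>Titel</head></div>'),
    "no_l_p_or_work_no_div": sample("drama", "<div><head>Titel</head></div>"),
    "not_drama": sample("verse", "<div><head>Titel</head></div>"),
}

matches = [name for name, root in docs.items() if is_drama_without_running_text(root)]
print(matches)
```

Only the toy document without tei:l, tei:p or a "work:no" div passes the filter, even though its header contains a paragraph.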

Regarding the exclusions, we pay attention to the ancestor elements of each node: we exclude a document only if we find the tei:div, tei:l or tei:p inside tei:text. Our loop returns the number of documents that match our pattern. If we omit the count function, we receive the actual title information from the teiHeader (tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title) and the author information as well. Our query returns 11 items, which are:

Arno Holz und Oskar Jerschke: Traumulus. Achtes bis zehntes Tausend, Dresden: Carl Reißner, 1909. Author in TG Rep: Jerschke, Oskar (link)
Carl Laufs: Pension Schöller. Nach einer Idee von W. Jacoby, elfte Auflage, Berlin: Eduard Bloch Theaterverlag, [o.J.]. Author in TG Rep: Jacoby, Wilhelm (link)
Hermann Goetz: Der Widerspenstigen Zähmung. Komische Oper in vier Akten, nach Shakespeares gleichnamigen Lustspiel frei bearbeitet von Joseph Viktor Widmann, Musik von Hermann Goetz, Zürich, Wien, München: Apollo-Verlag, [ca. 1925]. Author in TG Rep: Goetz, Hermann Gustav (link)
Johann Friedrich Reichardt: Die Geisterinsel. Ein Singspiel in drey Akten, in: Friedrich Wilhelm Gotter: Literarischer Nachlass, Gotha: J. Perthes, 1802, S. 419–564. Author in TG Rep: Einsiedel, Friedrich Hildebrand von (link)
Johann Strauß: Die Fledermaus. Operette in drei Aufzügen, Text nach H. Meilhac und L. Halévy von C. Haffner und Richard Genée, hg. v. Wilhelm Zentner, Stuttgart: Reclam, 1976. Author in TG Rep: Genée, Richard (link)
Ludwig van Beethoven: Fidelio. Oper in zwei Aufzügen, hg. v. Wilhelm Zentner, Stuttgart: Reclam, 1970. Author in TG Rep: Breuning, Stephan von (link)
Ludwig van Beethoven: Fidelio. Oper in zwei Aufzügen, hg. v. Wilhelm Zentner, Stuttgart: Reclam, 1970. Author in TG Rep: Treitschke, Georg Friedrich (link)
Naturalismus – Dramen. Lyrik. Prosa. Herausgegeben und mit einem Nachwort von Ursula Münchow, Band 1: 1885–1891, Berlin und Weimar: Aufbau, 1970. Author in TG Rep: Schlaf, Johannes (link)
O.F. Berg und D[avid] Kalisch: Berlin, wie es weint und lacht. Leipzig: Verlag von Phillipp Reclam jun., [o.J.] [Universal-Bibliothek Nr. 4689]. Author in TG Rep: Kalisch, David (link)
Oskar Blumenthal und Gustav Kadelburg: Im weißen Rössl. 16. Auflage, Berlin: Eduard Bloch Verlag, [o.J.]. Author in TG Rep: Kadelburg, Gustav (link)
Robert Schumann: Genoveva. Oper in vier Akten nach Tieck und Hebbel, Berlin: Eduard Bloch, [1960]. Author in TG Rep: Schumann, Robert Alexander (link)

The majority of these texts are opera libretti written by two authors; one work (Beethoven’s “Fidelio”, to be precise) was written by three collaborators.

But let’s jump back to our initial question and to the final answer. How many dramas are contained in the TextGrid Rep? To answer that, we just have to subtract these 11 duplicates and we end up at: 666 dramas! A bit diabolic, but, in the end, just a number. (Speaking of which, have you heard the story of Route 666 and how it was renamed to Route 491? It’s a fun story, you can read it on Wikipedia.)

A list with all the 666 dramas can be obtained via our GitHub account. Or, you can generate it yourself using the following XQuery where we also added an option in order to prepare this list for a website. You can store this query (Shift+Ctrl+s), for example, within the /db/apps/ collection using the filename tgrep.xql and call it via this link.

xquery version "3.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare option exist:serialize "method=html5 media-type=text/html";
<ol>
{for $item in collection('/db/data/tgrep')//tei:TEI
    where
        $item/tei:teiHeader//tei:keywords/tei:term/string() = 'drama'
        and ($item//tei:text//tei:div/@subtype="work:no"
        or $item//tei:text//tei:l
        or $item//tei:text//tei:p)
    order by ($item//tei:author)[1] || $item//tei:fileDesc[1]/tei:titleStmt/tei:title
    return
    <li>
	    {($item//tei:author)[1]/string() || ': ' || $item//tei:fileDesc[1]/tei:titleStmt/tei:title/string()}
	</li>
}
</ol>

Please mind that this list still contains 679 texts. We still have to subtract the texts that belong to an integral play. As described before, we decided to bundle 5 dramatic pieces that consist of several parts, gluing the parts of each piece together in a new XML file:

Arnim: “Halle und Jerusalem”,
Goethe: “Faust, Teil 1”,
Hebbel: “Nibelungen”,
Immermann: “Alexis”,
Schiller: “Wallenstein”.

Plus, we had to delete the two original (non-German) pieces (a French and an Italian one) to get down to our 666 pieces. Now our list only contains German-language texts of the genre ‘drama’. We uploaded the 666 XML files to our GitHub here. A list of all the plays can be found here (in a .txt file).

Conclusion

Whenever you obtain a corpus on the web, one that you didn’t build yourself, you have to look deeply into it to know your way around it. Trying to answer simple questions, as we did in this blog post, can help a great deal to lay the groundwork.

So now you’ve made it. This paragraph concludes this 30,000-character blog post. Tomorrow we will deliver a shorter piece revolving around inconsistent metadata and what you can do about it. Howgh!

A (Not So) Simple Question and a Somewhat Diabolic Answer was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 18, 2015.

Longest German-Language Theatre Plays

2015-06-08T00:00:00+02:00

Ok, time for some Digital Humanities fun facts! We had another meeting today and, as always, were working our way through the vast TextGrid Repository. Since we’re only interested in the dramatic texts contained in the corpus, we had to find a way to automatically extract these kinds of texts, which isn’t as easy as it sounds. Anyway, we finally managed to do so and also wrote a small (well …) 30,000-character piece on the subject which is to appear later. For the time being, the extracted dramas can be found as single XML files here on our GitHub.

When we were looking at the files we had the quick idea to make a list of the top 10 longest German-language theatre plays contained in the TextGrid Repository. And here they are, measured by their file size:

Holz, Arno: Ignorabimus (2.1 MB)
Schiller, Friedrich: Wallenstein (1.99 MB)
Fouqué, Friedrich de la Motte: Der Held des Nordens (1.88 MB)
Brentano, Clemens: Die Gründung Prags (1.81 MB)
Baggesen, Jens: Der vollendete Faust oder Romanien in Jauer (1.69 MB)
Hebbel, Friedrich: Die Nibelungen (1.61 MB)
Immermann, Karl: Alexis (1.49 MB)
Rosner, Ferdinand: Oberammergauer Passionspiel (1.48 MB)
Grabbe, Christian Dietrich: Herzog Theodor von Gothland (1.40 MB)
Arnim, Ludwig Achim von: Halle und Jerusalem (1.35 MB)

At least two thirds of each file is TEI markup (wild guess). In some cases, the markup is really bloating the file size, so here is another version of our top 10, this time measured by the number of words inside <sp> (since we’re talking about theatre plays here):

Holz, Arno: Ignorabimus (100,283 words)
Arnim, Ludwig Achim von: Halle und Jerusalem (74,675 words)
Brentano, Clemens: Die Gründung Prags (70,672 words)
Fouqué, Friedrich de la Motte: Der Held des Nordens (63,074 words)
Schiller, Friedrich: Wallenstein (56,820 words)
Tieck, Ludwig: Prinz Zerbino oder die Reise nach dem guten Geschmack (56,759 words)
Holz, Arno: Sonnenfinsternis (53,909 words)
Rosner, Ferdinand: Oberammergauer Passionspiel (52,717 words)
Goethe, Johann Wolfgang: Faust. Der Tragödie zweiter Teil (46,180 words)
Müller, Friedrich (Maler Müller): Golo und Genovefa (45,904 words)

As you can see, Arno Holz rules them all! His monstrous naturalistic drama Ignorabimus from 1913 is a fair 500-pager, as a quick glance at the catalogue of the German National Library shows.

For the fans, this is our query for the second list, using eXist-db (“textgrid-repository-dramas” is the name of our collection):

xquery version "3.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
(: count the words inside <sp> once per file, then list the files longest first :)
for $file in xmldb:get-child-resources('/db/data/textgrid-repository-dramas')
let $words := count(tokenize(string-join(doc('/db/data/textgrid-repository-dramas/' || $file)//tei:sp), '\W+')[. != ''])
order by $words descending
return ($words, $file)
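
If you want to try the counting idea without eXist-db, here is a rough Python sketch of the same steps (join the text of all <sp> elements, split on '\W+', drop empty tokens). It is an illustration under assumptions, not what produced the list above: the function names count_spoken_words and rank_by_spoken_words and the two tiny sample plays are made up. As in the query, speaker labels inside <sp> are counted too, while stage directions outside <sp> are not.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_spoken_words(tei_xml):
    """Count the non-empty '\\W+'-separated tokens inside all tei:sp elements."""
    root = ET.fromstring(tei_xml)
    # Join the string values of all <sp> elements without a separator.
    spoken = "".join(
        "".join(sp.itertext()) for sp in root.iter("{" + TEI_NS + "}sp")
    )
    return len([token for token in re.split(r"\W+", spoken) if token])


def rank_by_spoken_words(plays):
    """Return (file name, word count) pairs, longest play first."""
    counts = {name: count_spoken_words(xml) for name, xml in plays.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Two invented mini plays; the <stage> direction sits outside <sp>.
SHORT = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<stage>Ein Zimmer.</stage>"
    "<sp><speaker>ANNA.</speaker><p>Guten Morgen! </p></sp>"
    "</body></text></TEI>"
)
LONG = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<stage>Ein Zimmer.</stage>"
    "<sp><speaker>ANNA.</speaker><p>Guten Morgen, Karl! </p></sp>"
    "<sp><speaker>KARL.</speaker><l>Schön, dass du da bist, </l>"
    "<l>über alle Maßen schön.</l></sp>"
    "</body></text></TEI>"
)

ranking = rank_by_spoken_words({"short.xml": SHORT, "long.xml": LONG})
print(ranking)
```

Python’s \W treats umlauts and ß as word characters for str input, so German words are not split in the middle.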

Ok, there’s more where this came from, stay tuned! :-)

Update (one hour after touchdown)

Quickly answering a question raised by Nils on Twitter: “Where is Karl Kraus: Die letzten Tage der Menschheit?!” Well, unfortunately, the ultimate German-language mega drama is not contained in the TextGrid Repository. But it would certainly crush all the other plays. We dug out the Gutenberg-DE DVD and counted the words like this:

w3m -dump -I 'iso-8859-1' -T text/html letzttag.xml | wc -w

This yielded 187,696 words. In short: Karl Kraus beats Arno Holz any time. Please mind that we did not limit the Kraus word count to just the spoken words like we did with the XML files (by just counting the words uttered inside <sp>). But even if we have to subtract a couple of thousand words, the result remains the same.

Longest German-Language Theatre Plays was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on June 08, 2015.

Road to Sydney

2015-05-08T00:00:00+02:00

Met today to work on our stuff for Sydney. Office panorama:

Wanted to include a Sydney screenshot from International Karate (spirit of 1986!), but a link to the screenshot will do.

Road to Sydney was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on May 08, 2015.

Conference in Munich

2015-03-10T00:00:00+01:00

In a few days, March 12/13, we’re taking part in a conference at the Bayerische Akademie der Wissenschaften, Computer-based analysis of drama and its uses for literary criticism and historiography:

The CfP is here. The program can be found here (PDF).
Our presentation will be held on Thursday, 12 March 2015, 17:15, in German: Digitale Netzwerkanalyse dramatischer Texte.

Update:

The conference can be relived on Twitter: #CompDrama15.

Conference in Munich was originally published by Frank Fischer, Mathias Göbel, Dario Kampkaspar, Peer Trilcke at Network Analysis of Dramatic Texts on March 10, 2015.