In a museum I once overheard a conversation between two visitors about a painting both thought to be by a particular artist. They discussed how clearly one could see the characteristics of that artist’s style and how this painting would fit into a particular period of his life’s work. After five minutes they finally looked at the tag next to the painting, only to be surprised that it was by a different artist they had never heard of.
In an article (Stefanowitsch (2006): Konstruktionsgrammatik und Korpuslinguistik) that I recently read for a seminar on construction grammar (CxG), I stumbled across a use of Cohen’s kappa coefficient that I had not seen before. The article is about how to verify via quantitative techniques that a linguistic structure is a construction in the sense of CxG. The author takes the German structure haben zu + Infinitiv, as in “Sie haben zu gehorchen.”, as a case study and, in a first step, identifies five classes of the general structure NP-NOM hab- XP* zu V-INF, only one of which he deems to be the potential construction to consider. In order to verify his classification, he asks a colleague to classify a set of example structures according to his scheme, providing him with a paraphrase of each class (X ist beschäftigt for the class containing sentences like “Wir hatten zu tun”) and nothing else. The author then calculates inter-rater agreement using kappa in order to prove his scheme correct.
Now, is this a valid use of kappa (or any other measure of inter-rater agreement, for that matter)? Kappa is often used to make a statement about the quality of, e.g., an annotation, but also about the difficulty of a particular task. In this case, there are several problems whose severity I am not fully certain about. One is that there are only two raters, one of whom is the author himself, which makes the results questionable. The other is that I don’t see a reason to assume that a classification scheme is “the right one” just because it yields high kappa values. In principle, it should be possible to develop any kind of scheme and achieve good agreement rates as long as the rater instructions are clear and adequately describe all classes. Just to be fair: the author’s classes made intuitive sense and were perfectly acceptable. But generally, agreeing on the wrong things does not make them right.
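To make the point concrete, here is a minimal sketch of how Cohen’s kappa is computed for two raters, using the standard definition (observed agreement corrected for the agreement expected by chance). The label lists are invented for illustration and are not data from the article; the function name `cohens_kappa` and the labels "V" (for “Verpflichtung”) and "O" (for everything else) are my own assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same label throughout;
        # kappa is undefined in that case.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of ten sentences (invented, not from the article).
rater_1 = ["V", "V", "V", "V", "V", "V", "O", "O", "O", "O"]
rater_2 = ["V", "V", "V", "V", "V", "O", "O", "O", "O", "V"]

print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.58
print(cohens_kappa(rater_1, rater_1))            # 1.0
```

Note that the function only ever compares the two label sequences with each other. Nothing in it refers to what the labels mean, so two raters who consistently apply an arbitrary scheme in the same way will score just as highly as two raters applying a well-motivated one.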

#1 by A.S. at June 25th, 2009
Hi,
Exactly. Kappa is being used here as a measure of interrater reliability, not as a measure of the validity of the annotation scheme itself. In other words, a high kappa value indicates that the instructions are clear enough for different raters to achieve essentially the same results when using the annotation scheme. Whether the categories in the scheme make sense is an entirely different matter.
#2 by Armin at June 26th, 2009
Thanks for the clarification. I guess what tripped me up was that it didn’t become entirely clear to me what the article tries to show by calculating inter-annotator agreement at that point. Classifying different kinds of haben+zu+inf didn’t seem very central to the argumentation, since we knew from the beginning that the “Verpflichtung” class was the one we were interested in. One could just as well have made a binary decision (“Verpflichtung” vs. not “Verpflichtung”), which is why readers may be led to assume that the five-class scheme is meant to be some kind of natural one (whatever that might mean).
#3 by Armin at June 26th, 2009
Hi again,
Just another two quick questions:
1. This blog is quite new and I was not aware of any people actually reading it. How did you find it?
2. Are there any follow-ups on the work described in that (btw. extremely interesting) paper? There were a few other things that I found worth discussing but which, I suspect, have already been addressed in subsequent work.
#4 by A.S. at June 29th, 2009
I was doing research on a very important — okay, actually, I just was googling my name.
There’s my work with Stefan Gries on Collostructional Analysis, and I believe he has actually done some more sophisticated work on dispersion. Stefanie Wulff’s dissertation (which I quote in the paper) has since been published as “Rethinking Idiomaticity” with Continuum Press. If there are any specific aspects that you’re interested in, feel free to email me.