In a museum I once overheard a conversation between two visitors about a painting that both thought to be by a particular artist. They discussed how clearly one could see the characteristics of that artist’s style and how the painting would fit into a particular period of his life’s work. After five minutes they finally looked at the tag next to the painting, only to be surprised that it was by a different artist they had never heard of.

In an article (Stefanowitsch (2006): Konstruktionsgrammatik und Korpuslinguistik) that I recently read for a seminar on construction grammar (CxG), I stumbled across a use of Cohen’s kappa coefficient that I had not seen before. The article is about how to verify via quantitative techniques that a linguistic structure is a construction in the sense of CxG. The author takes the German structure haben zu + Infinitiv (“have to” + infinitive), as in “Sie haben zu gehorchen.” (“You have to obey.”), as a case study and, in a first step, identifies five classes of the general structure NP-NOM hab- XP* zu V-INF, only one of which he deems to be the potential construction to consider. In order to verify his classification, he asked a colleague to classify a set of example structures according to his scheme, providing him with a paraphrase of each class (e.g. X ist beschäftigt, “X is busy”, for the class containing sentences like “Wir hatten zu tun”, “We were busy”) and nothing else. The author then calculates inter-rater agreement using kappa in order to prove his scheme correct.

Now, is this a valid use of kappa (or any other measure of inter-rater agreement, for that matter)? Kappa is often used to make a statement about the quality of, e.g., an annotation, but also about the difficulty of a particular task. In this case, there are several problems whose severity I am not fully certain about. One is that the author has only two raters, one of whom is himself, which makes the results questionable. The other is that I don’t see why high kappa values should show that a classification scheme is “the right one”. In principle, it should be possible to develop any kind of scheme and achieve good agreement rates as long as the rater instructions are clear and adequately describe all classes. Just to be fair: the author’s classes made intuitive sense and were perfectly acceptable. But generally, agreeing on the wrong things does not make them right.
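
To make that last point concrete, here is a minimal sketch of Cohen’s kappa for two raters, written from the standard definition (observed agreement corrected for chance agreement). It is not the computation from the article: the function name `cohens_kappa`, the even/odd scheme and all labels below are made up purely for illustration. The toy scheme sorts sentences by whether they have an even or odd number of words, which is linguistically meaningless but trivially easy to describe to a rater.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap given each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1:
        # Both raters used one and the same label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six sentences under the arbitrary even/odd scheme.
rater_1 = ["even", "odd", "odd", "even", "odd", "even"]
rater_2 = ["even", "odd", "odd", "even", "odd", "even"]  # full agreement
rater_3 = ["even", "odd", "odd", "even", "odd", "odd"]   # one disagreement

kappa_full = cohens_kappa(rater_1, rater_2)
kappa_partial = cohens_kappa(rater_1, rater_3)
print(kappa_full, kappa_partial)
```

With these invented labels, the fully agreeing pair reaches a kappa of 1 even though the scheme says nothing about constructions. The coefficient only tells us that the instructions were clear enough to follow consistently.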