Data Preprocessing – Lessons Learned

Finally, I turned in my Master’s thesis. Altogether, it has 12 chapters, 87 pages, roughly 23,000 words (tokens), 140,000 characters (including whitespace), 27 tables, 10 figures (7 of which are colored), a few mathematical formulas, and 4 pages of references.

I think most of what I learned during the work on this project was of practical importance and not so much on the theoretical side. After all, around 90% of the work on this thesis went straight into dumb work like pre-processing data, converting them from one format to another, and controlling the processing overhead of tools like parsers and taggers. I could have saved a lot of time had I known better what problems can arise when dealing with a setup like mine. So I compiled a list of recommendations that I can look up the next time I do something similar. If you’re new to natural language processing, writing a thesis, working with large amounts of textual data, conducting experiments or just want to parse some text corpus, this may be useful to you. If you’re an experienced NLP’er, I’d be happy to read further suggestions in your comment.

Roughly, this was my setup: I wanted to apply machine learning to the classification of verbs into semantic classes. I had training data which was labeled via a file that contained the labels and pointers to the corresponding tokens in specific sentences and files. The size of the data was manageable and the texts clean. In order to get the features for the classifier, I had to run these data through other programs like semantic role labelers, named entity recognizers, word sense disambiguation tools, and parsers. The output of these programs had to be aligned on sentence and word level, which was problematic as they all used different input and output formats and sometimes couldn’t produce an analysis for a sentence. In addition, I had a very large amount of unlabeled data extracted from British websites which contained a considerable amount of noise. Since I wanted to use them in a semi-supervised learning algorithm, these data needed to be processed by all kinds of tools in order to compute features. If you have ever worked with a setting remotely similar to mine, you will know the problems that accompany it. If you’re like me, you try to find reusable techniques and develop some kind of scheme that helps you deal with the issues in a structured way. Here’s what I found to be useful:

  1. Decide on a consistent file naming scheme. Consistent file naming is very practical when you want to use scripting routines to iterate over a large number of files. For me, it has proven useful to encode the name of the tool that produced a file in the file extension. For instance, file.1.txt, file.1.parser.txt and file.1.tagger.txt are preferable to parser1.txt, tagger1, and file1-Input.txt. The base names should always stay the same, and the extension should make clear what the respective file contains. Consistently using telling file extensions gives you a lot of flexibility and makes it very easy to use wildcards in, for example, shell scripts. Not defining a naming scheme will probably force you to rename some of your files at some point. If the number of files is large, chances are that you will miss some of them or accidentally overwrite others.
  2. Expect noisy data and test with erroneous input. When working with tools like taggers and parsers, you will find that they may have certain restrictions regarding the input they can handle. They might have constraints on the length of sentences or on the input encoding, the character set, or the length of tokens. If documentation exists, don’t rely on it alone. Run a couple of tests with erroneous input like very short or very long lines, special characters and long, non-existent words. Test all these restrictions for all the tools you want to use. If you don’t, you may run into the situation where you have successfully processed the data with one tool but the next one has problems with it, so you end up repeating the entire processing pipeline or having to develop complex and error-prone ways to align the output afterwards.
  3. Test for all requirements on I/O, hard disk, main memory. Some tools might attempt to load the entire input file into main memory or write a lot of temporary files to disk. Always find out what the requirements are before you start experimenting. Check disk and main memory usage and compare them to what is available on your machine. Splitting your data into many small files lets you keep memory overhead low and provides you with test data that you can process quickly and whose results you can check. If a tool writes and reads a lot of temporary files, you should definitely check I/O usage, because having too many I/O-intensive processes run in parallel may put them into sleep state.
  4. Take “weird” output seriously. Even if you’re certain that you took all potential issues into consideration, the last test should always be checking your data manually, i.e. by looking at it. Look at the first and last lines/instances of the respective files, and also at some random ones. Use standard Unix tools like head, tail and shuf for these tasks. If you stumble across things that look “weird” to you, don’t just ignore them, but check the input, compare it to what you would have expected, and test these instances separately.
  5. Do all of the above steps as early as possible. The later you do them, the greater the risk that you will have to repeat certain steps, potentially wasting several days of your limited time. In particular, splitting a large file into smaller ones after you’ve already run the data through the parser or the like for several days will make it hard to align everything properly afterwards. The same is true for cleaning data.
  6. Define input and output formats and write transform routines. Avoid the mistake of handling several file formats in the same script or even the same function. If you write a program that extracts features for the machine learner or implements some logic that is important to your approach, first define a clear and easy-to-manage input format that your program accepts. If the output of your external tools does not conform to the format you require, implement the conversion functionality separately.
  7. Document everything. If you’re writing a thesis or working on a longer project, always take notes about what you’ve done while it is still apparent to you. Document your code. You may reasonably anticipate that you will have to repeat some steps or revoke certain simplifying assumptions you came up with at an earlier stage. In these situations you may have to understand what you did many weeks or even months ago. Having everything clearly documented may save you hours or even days of work trying to figure out why there’s this funny if-statement in your code or which parameters you invoked your parser with. Also make notes about the papers you read; this saves you from having to read them again when you write a literature overview.
  8. Follow best programming practices. There’s not much to say about this point. Well-structured code, good documentation, clearly defined interfaces, a good class design, telling method names, etc. are all indispensable even for small projects. At the beginning you may think it would be easier to just write a small script for converting this particular file into another format or for filtering certain lines in the input. Then, after a while, you will find yourself with dozens of more or less incomprehensible scripts, each one supposed to take care of one small task, and you will no longer know when to use which one, how, and why.
  9. Use version control. Even if you’re working on your own and not in a team, versioning tools are extremely useful. They serve two main purposes in a one-player setting: documenting changes and backup. The importance of backing up everything will become clear as soon as you’ve accidentally deleted an important file or would like to revert changes you’ve made to your program now that nothing works anymore. Commit often and clearly state in the commit message what you’ve done, how and why.
  10. Have one file per chapter. This has the advantage that you can give someone a chapter for proofreading without interfering with what you’re currently writing. It also helps keep the thesis organized and well-structured. You won’t even have to decide on a structure from the beginning, since you just write all the parts more or less separately and won’t have to bother about how to assemble them until the very end. Finally, you just merge everything into one file in the way that makes most sense.
  11. Plan your time wisely. If your first guess is that something will take an hour, plan a day. If you think it’ll take a day, assign it a week. If it turns out your first guess was about right, well, great! Most of the time, though, you will face some unexpected difficulties that will waste another day before you have even started to deal with the real task.
  12. Know when to stop. Even if the results are not what you expected, finish experiments on time and start writing. There are always things one hasn’t yet tried out, and it’s easy to waste another week in the attempt to improve the results by 0.5%. Writing a good and coherent thesis is something that needs time and a cool head. You don’t want to write the last chapter the night before the deadline, like I did.
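
To make item 1 a little more concrete, here is a minimal Python sketch of how a naming scheme like file.1.txt / file.1.parser.txt / file.1.tagger.txt can be exploited to find inputs for which a tool has not yet produced output. The tool names in KNOWN_TOOLS and both function names are assumptions chosen for illustration; they are not the scripts I actually used.

```python
from collections import defaultdict

# Hypothetical tool names; adapt to whatever tools are in the pipeline.
KNOWN_TOOLS = {"parser", "tagger"}


def split_name(filename):
    """Split '<base>.<tool>.txt' into (base, tool).

    A file without a known tool component, e.g. 'file.1.txt',
    is treated as raw input and reported with the tool 'input'.
    """
    if not filename.endswith(".txt"):
        raise ValueError(f"unexpected file name: {filename}")
    stem = filename[: -len(".txt")]
    base, dot, last = stem.rpartition(".")
    if dot and last in KNOWN_TOOLS:
        return base, last
    return stem, "input"


def missing_outputs(filenames):
    """Return {base: [missing tools]} for every incomplete base name."""
    seen = defaultdict(set)
    for name in filenames:
        base, tool = split_name(name)
        seen[base].add(tool)
    required = KNOWN_TOOLS | {"input"}
    return {
        base: sorted(required - tools)
        for base, tools in seen.items()
        if required - tools
    }


if __name__ == "__main__":
    names = [
        "file.1.txt", "file.1.parser.txt", "file.1.tagger.txt",
        "file.2.txt", "file.2.parser.txt",
    ]
    print(missing_outputs(names))
```

Because the base name never changes, a check like this stays a few lines long; with names like parser1.txt and file1-Input.txt, the same check would need a special case per tool.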


Enclosed Spaces

That Heidelbergers in general, and the residents of the Weststadt in particular, have their quirks is something you notice quickly when you live here. Still, I would only call them a closed society to a limited degree, because the turnover among the residents here is too high for that. Presumably my new GPS running watch means something else anyway when it asks me, while I’m jogging through the Weststadt, whether I am in an enclosed space. Hm, can’t really put it that way, I think, and I’m not surprised when the next question is whether I, as a resident of this district, consider myself a cut above. In reality, the Forerunner305 wants to know whether it has moved hundreds of kilometers since it was last used and what the current date is. Shouldn’t it be able to tell me that?

But basically I do like my watch. It can do pretty much everything a runner’s heart desires. It knows all sorts of different training types, and at last I know how far, how fast, how high, and in which heart rate zone I am running. I forgive it for not being able to display the latter during interval training, and also for looking as if you were wearing your phone on your wrist. Or your netbook. Only the fact that it always takes forever until it has connected to enough satellites is terribly annoying. Nothing is more irritating than wanting to set off and then having to stand around for five minutes until the positioning works. Yesterday I already had about three kilometers behind me when the Forerunner finally knew where it was. Or when you have enabled the auto-pause feature and the watch beeps every few moments while you are running and claims that you are not moving. Auto-pause is now disabled by default for me.

But who knows, maybe the Weststadt really is an enclosed space. Then I can only hope that Zürich is the open, cosmopolitan city it claims to be.


Running Diary: Weißer Stein

The Weißer Stein lies 548 meters above sea level, north of Heidelberg. Today I ran there on my most demanding route so far. Starting from the Römerkreis, you climb an elevation difference of 436 meters. Since I was out without a map, compass or GPS device, and the forest paths on site always look completely different from how GoogleMaps or GoogleEarth made them look, I must have gone slightly wrong a few times on the way there, so the outbound leg was probably longer than the stated 9.3k. On the way back there was also room for a small detour via the Thingstätte on the Heiligenberg.

In total the route was about 20k. Personally, I always find the beginning the most strenuous; my body, it seems, needs the first 5k just to get the pump going. Unless you head west from Heidelberg, i.e. toward the Rhine plain, the first climb always comes after 3k at the latest. This time I was already a bit out of breath up on the Philosophenweg and could make good use of the one level kilometer with its panoramic view of the castle. From the end of the Philosophenweg it is uphill only, although the gradient is, with a few exceptions, relatively humane and drawn out. The forest along the route is (relative to the Gaiberg or Königstuhl) pleasantly solitary and scenically stunning: at first open beech stands and lots of ferns, further up conifers and dense undergrowth. At times the forest is so dark and wild that you might just as well be running at the edge of the northern wilderness of Lapland. And just when I thought I had finally escaped civilization, the hilltop restaurant “Zum weißen Stein” appeared in front of me, where, cruelly, it smelled of roast pork and bread dumplings (note that I already had about 10k behind me at that point). On the descent from the Thingstätte to the Philosophenweg my legs slowly started to make themselves felt, and I actually walked the last hundred meters across the Römerkreis. My net time was a generously estimated 1:50h.


Agreeing on the wrong things

In a museum I once overheard a conversation between two visitors about a painting both thought to be by a particular artist. They discussed how one could well see the characteristics of that artist’s style and how this painting would fit into a particular period of his life’s work. After five minutes they finally looked at the tag next to the painting, only to be surprised by the fact that it was by a different artist they had never heard of.

In an article (Stefanowitsch (2006): Konstruktionsgrammatik und Korpuslinguistik) that I recently read for a seminar on construction grammar (CxG), I stumbled across a use of Cohen’s kappa coefficient that I had not seen before. The article is about how to verify via quantitative techniques that a linguistic structure is a construction in the sense of CxG. The author takes the German structure haben zu + Infinitiv, as in “Sie haben zu gehorchen.” (roughly, “You are to obey.”), as a case study and, in a first step, identifies five classes of the general structure NP-NOM hab- XP* zu V-INF, only one of which he deems to be the potential construction to consider. In order to verify his classification, he asked a colleague to classify a set of example structures according to his scheme, providing him with a paraphrase of each class (X ist beschäftigt, “X is busy”, for the class containing sentences like “Wir hatten zu tun”) and nothing else. The author then calculates inter-rater agreement using kappa in order to prove his scheme correct.

Now, is this a valid use of kappa (or any other measure of inter-rater agreement, for that matter)? Kappa is often used to make a statement about the quality of, e.g., an annotation, but also about the difficulty of a particular task. In this case, there are several problems whose severity I am not fully certain about. One is that the author has only two raters, one of whom is himself, which makes the results questionable. The other problem is that I don’t see a reason to assume that a classification scheme is “the right one” just because it yields high kappa values. In principle, it should be possible to develop any kind of scheme and achieve good agreement rates as long as the rater instructions are clear and adequately describe all classes. Just to be fair: the author’s classes made intuitive sense and were perfectly acceptable. But generally, agreeing on the wrong things does not make them right.
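
For readers who have not met kappa before, here is a small Python sketch of the standard two-rater definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The labels "A" and "B" and both example annotations are made up for illustration and have nothing to do with the article's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented labels from two hypothetical raters.
    rater_1 = ["A", "B", "A", "B", "A", "B"]
    rater_2 = ["A", "B", "A", "B", "B", "A"]
    print(cohens_kappa(rater_1, rater_1))  # identical annotations
    print(cohens_kappa(rater_1, rater_2))  # two disagreements
```

Note that the function only ever sees the two label sequences. Nothing in the computation knows what "A" and "B" mean, so a perfectly arbitrary scheme that both raters apply consistently scores just as high as a linguistically meaningful one.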
