Finally, I turned in my Master’s thesis. Altogether, it has 12 chapters, 87 pages, roughly 23,000 words (tokens), 140,000 characters (including whitespace), 27 tables, 10 figures (7 of which are in color), a few mathematical formulas, and 4 pages of references.
I think most of what I learned while working on this project was practical rather than theoretical. After all, around 90% of the work on this thesis went straight into dumb work like pre-processing data, converting it from one format to another, and controlling the processing overhead of tools like parsers and taggers. I could have saved a lot of time had I known better what problems can arise with a setup like mine. So I compiled a list of recommendations that I can look up next time I do something similar. If you’re new to natural language processing, writing a thesis, working with large amounts of textual data, conducting experiments, or just want to parse some text corpus, this may be useful to you. If you’re an experienced NLP’er, I’d be happy to read further suggestions in the comments.
Roughly, this was my setup: I wanted to apply machine learning to the classification of verbs into semantic classes. I had training data that was labeled via a file containing the labels and pointers to the corresponding tokens in specific sentences and files. The size of the data was manageable and the texts were clean. In order to get the features for the classifier, I had to run this data through other programs like semantic role labelers, named entity recognizers, word sense disambiguation tools, and parsers. The output of these programs had to be aligned at the sentence and word level, which was problematic because they all used different input and output formats and sometimes couldn’t produce an analysis for a sentence. In addition, I had a very large amount of unlabeled data extracted from British websites, which contained a considerable amount of noise. Since I wanted to use it in a semi-supervised learning algorithm, this data also had to be processed by all kinds of tools in order to compute features. If you have ever worked with a setting remotely similar to mine, you will know the problems that accompany it. If you’re like me, you try to find reusable techniques and develop some kind of scheme that helps you deal with the issues in a structured way. Here’s what I found to be useful:
- Decide on a consistent file naming scheme. Consistent file names are very practical when you want to use scripts to iterate over a large number of files. For me, it has proven useful to encode the name of the tool that produced a file in the file extension. For instance, file.1.txt, file.1.parser.txt and file.1.tagger.txt are preferable to parser1.txt, tagger1, and file1-Input.txt. The base names should always stay the same, and the extension should make clear what the respective file contains. Consistently using telling file extensions gives you a lot of flexibility and makes it very easy to use wildcards in, for example, shell scripts. Not defining a naming scheme will probably force you to rename some of your files at some point. If the number of files is large, chances are that you will miss some of them or accidentally overwrite others.
- Expect noisy data and test for error input. When working with tools like taggers and parsers, you will find that they may have certain restrictions regarding the input they can handle. They might have constraints on the length of sentences, the input encoding, the character set, or the length of tokens. If documentation exists, don’t rely on it alone. Run a couple of tests with error input like very short or very long lines, special characters, and long, non-existent words. Test all of these restrictions for all the tools you want to use. If you don’t, you may run into the situation where you have successfully processed the data with one tool but the next one has problems with it, so you end up repeating the entire processing pipeline or having to develop complex and error-prone ways to align the output afterwards.
- Test for all requirements on I/O, hard disk, main memory. Some tools might attempt to load the entire input file into main memory or write a lot of temporary files to disk. Always find out what the requirements are before you start experimenting. Check disk and main memory usage and compare them to what is available on your machine. Splitting your data into many small files keeps memory overhead low and gives you test data that you can process quickly and whose results you can check easily. If a tool writes and reads a lot of temporary files, you should definitely check I/O usage, because running too many I/O-intensive processes in parallel may put them into a sleep state while they wait for the disk.
- Take “weird” output seriously. Even if you’re certain that you took all potential issues into consideration, the last test should always be checking your data manually, i.e. by looking at it. Look at the first and last lines/instances of the respective files, and also at some random ones. Use standard Unix tools like head, tail and shuf for these tasks. If you stumble across things that look “weird” to you, don’t just ignore them, but check the input, compare it to what you would have expected, and test these instances separately.
- Do all of the above steps as early as possible. The later you do them, the greater the risk that you will have to repeat certain steps, potentially wasting several days of your limited time. In particular, splitting large files into smaller ones after you’ve already spent several days running the data through the parser or the like will make it hard to align everything properly afterwards. The same is true for cleaning data.
- Define input and output formats and write transform routines. Avoid the mistake of handling several file formats in the same script or even the same function. If you write a program that extracts features for the machine learner or implements some logic that is important to your approach, first define a clear and easy-to-manage input format that your program accepts. If the output of your external tools does not conform to the format you require, implement the conversion separately.
- Document everything. If you’re writing a thesis or working on a longer project, always take notes about what you’ve done while it is still clear to you. Document your code. You can reasonably expect that you will have to repeat some steps or drop certain simplifying assumptions you came up with at an earlier stage. In these situations you may have to understand what you did many weeks or even months ago. Having everything clearly documented may save you hours or even days of trying to figure out why there’s this funny if-statement in your code or which parameters you invoked your parser with. Also make notes about the papers you read; this saves you from having to read them again when you write a literature overview.
- Follow best programming practices. There’s not much to say about this point. Well-structured code, good documentation, clearly defined interfaces, a good class design, telling method names, etc. are all indispensable even for small projects. At the beginning you may think it would be easier to just write a small script to convert this particular file into another format or to filter certain lines in the input. Then, after a while, you will find yourself with dozens of more or less incomprehensible scripts, each one supposed to take care of one small task, and you won’t know anymore when to use which one, how, and why.
- Use version control. Even if you’re working on your own and not in a team, versioning tools are extremely useful. They serve two main purposes in a one-player setting: documenting changes and backing up your work. The importance of backing up everything will become clear as soon as you’ve accidentally deleted an important file or would like to revert changes you’ve made to your program now that nothing works anymore. Commit often and clearly state in the commit message what you’ve done, how, and why.
- Have one file per chapter. This has the advantage that you can give someone a chapter for proofreading without interfering with what you’re currently writing. It also helps keep the thesis organized and well-structured. You won’t even have to decide on a structure from the beginning, since you just write all the parts more or less separately and won’t have to bother about how to assemble them until the very end. Finally, you just merge everything into one file in the way that makes the most sense.
- Plan your time wisely. If your first guess is that something will take an hour, plan a day. If you think it’ll take a day, assign it a week. If it turns out your first guess was about right, well, great! Most of the time, though, you will face some unexpected difficulties that will cost you another day before you have even started to deal with the real task.
- Know when to stop. Even if the results are not what you expected, finish experiments on time and start writing. There are always things one hasn’t tried out yet, and it’s easy to waste another week in the attempt to improve the results by 0.5%. Writing a good and coherent thesis takes time and a cool head. You don’t want to write the last chapter the night before the deadline, like I did.
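
The first two recommendations can be sketched in a few lines of Python. This is only an illustration under assumptions that do not come from the thesis setup: the tool names in EXPECTED_TOOLS, the helper names group_by_base, missing_outputs and stress_lines, and the default lengths are all hypothetical. The first part uses the tool-in-the-extension naming scheme to spot inputs for which some tool produced no output; the second builds a handful of error-input lines to feed to a tool before the real run.

```python
from pathlib import Path

# Assumption: illustrative tool names; replace them with the tools in your own pipeline.
EXPECTED_TOOLS = {"parser", "tagger"}


def group_by_base(directory):
    """Map each base name (e.g. 'file.1') to the set of tools that have an output file for it."""
    groups = {}
    for path in Path(directory).glob("*.txt"):
        parts = path.name.split(".")  # 'file.1.parser.txt' -> ['file', '1', 'parser', 'txt']
        if len(parts) >= 3 and parts[-2] in EXPECTED_TOOLS:
            base, tool = ".".join(parts[:-2]), parts[-2]
            groups.setdefault(base, set()).add(tool)
        else:
            # Raw input such as 'file.1.txt': register the base name without any tool.
            groups.setdefault(".".join(parts[:-1]), set())
    return groups


def missing_outputs(directory):
    """Return {base name: sorted list of tools that produced no output file}."""
    result = {}
    for base, tools in group_by_base(directory).items():
        missing = EXPECTED_TOOLS - tools
        if missing:
            result[base] = sorted(missing)
    return result


def stress_lines(long_tokens=2000, long_chars=500):
    """Build a few pathological input lines for probing a tool's limits."""
    return [
        "",                                # empty line
        "a",                               # very short line
        " ".join(["word"] * long_tokens),  # very long sentence
        "x" * long_chars,                  # one very long, non-existent word
        "naïve café € § ¶",                # non-ASCII and special characters
    ]
```

Calling missing_outputs on the data directory after each processing stage lists the files that need another look, and writing the result of stress_lines to a small test file gives you something to run through every tool before you commit days of processing time.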

#1 by Raza at July 25th, 2010
@OG Dude pre-processing can be a pain, but it is the MOST important step in any data mining project. Unless feature selection/extraction and data representation are done right, your DM technique is next to useless. Spending more time on pre-processing is actually a good thing.
“Worry about the data first before you worry about the algorithm” – Peter Norvig
#2 by OG Dude at June 20th, 2010
Yeah, I guess you could carve those truisms in stone… pre-processing especially is a b*tch. Frustrating to know you spend more time on squeezing stuff into the right shape than actually doing something noteworthy with the data.
Another thing I find hard to get right is the trade-off between “early optimization”, which is bad, and doing a quick and dirty job, which is also bad. I mean, most of the time you’ll NEVER use code you wrote for a particular project again, but if you’re like me, then doing a half-ass job under that premise is like watching people use MS Comic Sans in PowerPoint: it makes you cringe on so many levels…
#3 by DrNI@AM at March 3rd, 2010
Congrats on finishing your thesis! And thanks for the advice in this post. It comes too late for me, so I can confirm most of the points from experience.
We tend to think of pre-processing tools as perfect and readily available things… but depending on the application, the errors in sentence splitting, tokenization, and tagging multiply. Fixing these things could fill a thesis of its own.