TreeTagger widget: lemma and POS-tag annotation

case study

Since Textable v3.1.0, the TreeTagger widget lets you integrate the morphosyntactic analysis of TreeTagger right into your Textable workflows. In order to use the widget, you need a working install of the TreeTagger package (which is free). If you don’t already have one, instructions on how to get it can be found at the end of this post. If you already have a working install, keep reading!

Connecting the widget with TreeTagger

The first time you create an instance of the TreeTagger widget, it tries to locate your TreeTagger distribution automatically in a number of standard locations (such as C:\ on Windows and /Applications on MacOS). If TreeTagger is not in one of these locations, the widget issues a warning and you need to locate the TreeTagger folder manually by clicking on Locate TreeTagger:

This opens a dialog that enables you to navigate to the location of the TreeTagger base folder (which contains the lib, bin and cmd subdirectories) and select it. You only need to do this once, as the location is saved in the widget’s configuration.
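If you want to check a folder yourself before pointing the widget at it, the test described above (a base folder that contains lib, bin and cmd) is easy to reproduce. The following is only a minimal sketch of that check, not the widget's actual code: the function names and the candidate paths are assumptions made up for illustration.

```python
from pathlib import Path

# Subdirectories that a TreeTagger base folder contains, according to this post.
REQUIRED_SUBDIRS = ("lib", "bin", "cmd")


def looks_like_treetagger_folder(folder):
    """Return True if `folder` contains the lib, bin and cmd subdirectories."""
    base = Path(folder)
    return all((base / name).is_dir() for name in REQUIRED_SUBDIRS)


def find_treetagger(candidates):
    """Return the first candidate that looks like a TreeTagger folder, or None."""
    for candidate in candidates:
        if looks_like_treetagger_folder(candidate):
            return Path(candidate)
    return None


# Hypothetical candidate locations; the widget's real search list may differ.
candidates = ["C:/TreeTagger", "/Applications/TreeTagger"]
print(find_treetagger(candidates))
```

If the function returns None for the folder you expected, the most likely cause is that you selected a parent or child folder rather than the base folder itself.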

Using the TreeTagger widget

Once the widget is connected with TreeTagger, using it is straightforward. First connect the text source you want to analyze to the widget’s input. For example, you might retrieve the text of the Universal Declaration of Human Rights using the URLs widget:

Once this is done, you need to configure the TreeTagger widget, in particular by choosing the input language (provided the corresponding parameter files have been installed):

The two other options concern the output of the widget. If the Output format is set to Segment into words, every token identified by TreeTagger is placed in a separate segment, with lemma and POS-tag annotations:

This format is most useful if you simply want to count lemmas or POS-tags in the entire input. If, on the other hand, the input is already segmented into meaningful units (e.g. the content of several URLs) and you want to preserve this information, set the Output format to Add XML tags. This way, the output contains the same number of segments as the input, and the lemma and POS-tag information is encoded by means of XML tags:

If needed, these data can then be segmented into individual tokens using the Extract XML widget. Here’s a complete example using this design pattern to compare the frequency of POS-tags in the Universal Declaration of Human Rights and in the Declaration of the Rights of Man and of the Citizen, which you’re encouraged to download and try for yourself: Comparing POS-tag frequency in two texts
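For readers who would like to see the idea behind such a comparison outside Textable, here is a rough sketch in Python. It does not reproduce the Textable workflow: it assumes you have TreeTagger's plain command-line output, one token per line in the form token, tab, POS-tag, tab, lemma. The function names are hypothetical, and the two tiny samples are hand-written for illustration, not real tagger output.

```python
from collections import Counter


def pos_frequencies(tagged_text):
    """Return the relative frequency of each POS-tag in tab-separated output."""
    counts = Counter()
    for line in tagged_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank lines and anything that is not token/tag/lemma
        _token, tag, _lemma = fields
        counts[tag] += 1
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def compare(freq_a, freq_b):
    """Return {tag: (frequency in A, frequency in B)} for tags seen in either text."""
    tags = sorted(set(freq_a) | set(freq_b))
    return {tag: (freq_a.get(tag, 0.0), freq_b.get(tag, 0.0)) for tag in tags}


# Hand-written samples in the assumed token<TAB>tag<TAB>lemma layout.
sample_a = "All\tDT\tall\nhuman\tJJ\thuman\nbeings\tNNS\tbeing\n"
sample_b = "Men\tNNS\tman\nare\tVBP\tbe\nborn\tVVN\tbear\nfree\tJJ\tfree\n"

for tag, (a, b) in compare(pos_frequencies(sample_a), pos_frequencies(sample_b)).items():
    print(f"{tag}\t{a:.2f}\t{b:.2f}")
```

Relative frequencies are used rather than raw counts so that two texts of different lengths can be compared directly.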

Installing TreeTagger (for use with Textable)

On Windows:

  1. Download the Windows distribution of TreeTagger.
  2. Unzip it and copy the TreeTagger folder it contains to your computer (preferably to the root of your main hard disk, e.g. C:\).
  3. From the TreeTagger website, download the parameter files for the languages you’re interested in.
  4. Unzip each parameter file and move it to the lib subdirectory of the TreeTagger folder.
  5. Rename each parameter file to <language>-utf8.par (e.g. rename french-par-linux-3.2-utf8.bin to french-utf8.par).
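If you have installed many languages, the renaming in the last step can be scripted. The sketch below is one possible way to do it, assuming every parameter file follows the same naming pattern as the French example (<language>-par-...-utf8.bin). The function names are made up for illustration, and the script performs a dry run by default so that you can review the planned changes first.

```python
from pathlib import Path


def target_name(filename):
    """Map e.g. 'french-par-linux-3.2-utf8.bin' to 'french-utf8.par'."""
    language, separator, _rest = filename.partition("-par-")
    if not separator:
        raise ValueError(f"unexpected parameter file name: {filename}")
    return f"{language}-utf8.par"


def rename_parameter_files(lib_dir, dry_run=True):
    """Rename matching files in lib_dir; return the list of (old, new) names."""
    renamed = []
    # Assumed pattern, based only on the single example given in this post.
    for path in sorted(Path(lib_dir).glob("*-par-*-utf8.bin")):
        new_path = path.with_name(target_name(path.name))
        if not dry_run:
            path.rename(new_path)
        renamed.append((path.name, new_path.name))
    return renamed


print(target_name("french-par-linux-3.2-utf8.bin"))
```

To apply the changes, call rename_parameter_files with the path of your lib folder and dry_run=False once the dry-run list looks right.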

On MacOS:

  1. Create a new Treetagger subdirectory in the Applications folder.
  2. Download the MacOS distribution of TreeTagger and place it in the Treetagger folder (don’t extract the downloaded file).
  3. From the TreeTagger website, download the parameter files for the languages you’re interested in and place them in the Treetagger folder (don’t extract them either).
  4. Download the installation script and place it in the Treetagger folder.
  5. Launch Terminal (in Applications/Utilities), type the following command and press Enter to navigate to the Treetagger folder:
    cd /Applications/Treetagger
  6. To complete the install process, type the following command in Terminal and press Enter:
    bash install-tagger.sh
