Why does NLP need POS tagging?

Introduction

In computational linguistics, "tagging" generally means annotating corpora with linguistic information. In a narrower sense, it means automatic part-of-speech (POS) tagging, in which a computer program assigns each word of a corpus its part of speech. For example, the German sentence "Er liest das Buch, das sie ihm empfohlen hat." ("He is reading the book that she recommended to him.") is annotated as follows:

Er/PPER liest/VVFIN das/ART Buch/NN ,/$, das/PRELS sie/PPER ihm/PPER empfohlen/VVPP hat/VAFIN ./$.
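
An annotation in this token/TAG notation is easy to read back into a program. The following is a minimal sketch, assuming the items are separated by whitespace and each item has the form token/TAG; the function name parse_tagged is hypothetical, and the sample line assumes the German wording of the example sentence, since STTS is a tag set for German.

```python
def parse_tagged(line):
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST slash, so a token that itself
        # contains a slash is still separated from its tag correctly.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


# Assumed sample input in token/TAG notation (STTS tags).
tagged = ("Er/PPER liest/VVFIN das/ART Buch/NN ,/$, das/PRELS "
          "sie/PPER ihm/PPER empfohlen/VVPP hat/VAFIN ./$.")
pairs = parse_tagged(tagged)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Note that punctuation marks are tokens of their own and receive their own tags ($, for the comma and $. for the sentence-final period).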

The inventory of part-of-speech labels used is called a "tag set". Depending on how fine-grained it is and which morphosyntactic information (number, gender, case, tense, etc.) it encodes, a tag set can contain anywhere from around 15 to over a thousand part-of-speech tags. The example above uses the STTS (Stuttgart-Tübingen-TagSet), a tag set for German.

Part-of-speech tagging is important for many applications, including information extraction, speech synthesis, machine translation, and parsing.

Part-of-speech taggers can be classified as follows:

  • rule-based taggers
    • manually created rules (Constraint Grammar)
    • automatically learned rules (Brill Tagger)
  • statistical taggers
    • based on hidden Markov models (TnT, TreeTagger, HunPos)
    • based on support vector machines (SVMTool)
    • based on maximum entropy models (MXPOST, Stanford Tagger)
    • based on neural networks (Morce)
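
To make the hidden Markov model entry in the list above more concrete, here is a toy sketch of Viterbi decoding for a bigram HMM tagger. It is not the implementation of any of the taggers named above: the three-tag inventory, all probabilities, and the floor value used for unseen events are made-up assumptions chosen only to keep the example small.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a bigram HMM."""

    def lp(p):
        # Log-probability; unseen events get a small floor instead of log(0).
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-prob of the best path ending in tag t at position i,
    #               tag at position i-1 on that path)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(tokens[0], 0)), None)
             for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            emit = lp(emit_p[t].get(tokens[i], 0))
            column[t] = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)) + emit, p)
                for p in tags
            )
        best.append(column)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy model (all numbers are illustrative assumptions).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"book": 0.5, "sells": 0.01},
    "VERB": {"book": 0.3, "sells": 0.4},
}

path = viterbi(["the", "book", "sells"], tags, start_p, trans_p, emit_p)
print(path)  # prints ['DET', 'NOUN', 'VERB']
```

Here "book" could be a noun or a verb, and the transition probabilities (a determiner is usually followed by a noun) decide in favor of the noun reading. A real tagger would estimate these probabilities from an annotated corpus.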

All of these systems except those based on manually created rules require a training corpus that has been manually annotated with parts of speech. The main difficulty in tagging is correctly handling words that have several possible parts of speech (ambiguity) and words that did not occur in the training data (unknown words).
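
Both difficulties already show up in the simplest trained tagger one can write. The sketch below is a unigram baseline that assigns each word its most frequent tag from the training data and falls back to a default tag for unknown words. It is not the method of any tagger named above; the tiny training corpus, the coarse tag names, and the function names train_unigram and tag_tokens are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Count how often each (lowercased) word occurs with each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag_tokens(tokens, counts, default_tag="NOUN"):
    """Assign each token its most frequent training tag."""
    result = []
    for token in tokens:
        seen = counts.get(token.lower())
        if seen:
            # Ambiguous words simply get their most frequent tag.
            result.append((token, seen.most_common(1)[0][0]))
        else:
            # Unknown word: fall back to a fixed default tag.
            result.append((token, default_tag))
    return result


# Tiny hand-made training corpus (illustrative only).
corpus = [
    [("the", "DET"), ("book", "NOUN")],
    [("they", "PRON"), ("book", "VERB"), ("flights", "NOUN")],
    [("a", "DET"), ("book", "NOUN")],
]
counts = train_unigram(corpus)
result = tag_tokens(["They", "book", "the", "hotel"], counts)
print(result)
```

In this run "book" is tagged as a noun although it is a verb in the sentence, because the baseline ignores context, and "hotel" only receives a tag through the default. Context-aware models and better unknown-word handling (for example, using word endings) address exactly these two weaknesses.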

Some part-of-speech taggers split the input text into individual words, punctuation marks, brackets, etc. themselves. This step is called "tokenization". Other taggers expect input that has already been tokenized. Some taggers (such as the TreeTagger) output the lemma of each word in addition to its part of speech.
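
As a rough illustration of tokenization, the following sketch splits text into runs of word characters and single punctuation characters using a regular expression. It is a deliberately naive assumption, not the tokenizer of any tagger mentioned here: it does not treat abbreviations, numbers with decimal points, or clitics such as "He's" specially.

```python
import re

# A token is either a run of word characters or one single character
# that is neither a word character nor whitespace (punctuation, brackets).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Er liest das Buch, das sie ihm empfohlen hat.")
print(tokens)
```

The output can then be passed to a tagger that expects one token per input unit.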

Literature

Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the 6th Applied Natural Language Processing Conference.

Giménez, J., and Márquez, L. 2004. SVMTool: A general POS tagger generator based on Support Vector Machines. Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04). Lisbon, Portugal.

Ratnaparkhi, Adwait. 1996. A Maximum Entropy Model for Part-Of-Speech Tagging. Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP), University of Pennsylvania.

Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of HLT-NAACL 2003, pages 252-259.

Spoustová, Drahomíra "Johanka", Jan Hajič, Jan Raab and Miroslav Spousta. 2009. Semi-supervised Training for the Averaged Perceptron POS Tagger. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL), pages 763-771.

Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608, pp. 171-189. Springer.

Supervised by: Helmut Schmid, IMS, Uni Stuttgart