diachronic corpus

Content type
Author
Year
Language
Publication Type
Platform/Software
License
All Rights reserved
Record Status
Description (in English)

Lack of lemmatization often undermines the quality of concordances, a problem that is especially relevant for diachronic corpora. A significant share of existing lemmatizers are designed for Modern English. This paper presents MiddleEnglishLem, an application for dictionary-based lemmatization of Middle English texts. We use a hybrid algorithm to lemmatize the Helsinki Corpus of English Texts, a long-time-span diachronic corpus that includes Middle English texts of different genres, 608 570 words in total. MiddleEnglishLem can associate multiple inflected forms and orthographic variants with their canonical forms. Comprehensive premade dictionaries make the lemmatization more accurate. The lemmatizer's competitiveness is shown by its low average error rate of less than 2.5 percent, while its prebuilt stemmer has a strength of 0.38, a relatively high value. The accuracy of the lemmatization process can be improved further by implementing syntagmatic analysis at the part-of-speech identification step. MiddleEnglishLem can be applied to diachronic corpora to study the development of English.
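
To illustrate the general idea of mapping inflected forms and spelling variants to one canonical form, here is a minimal sketch of dictionary-first lemmatization with a rule-based fallback. It is not the MiddleEnglishLem implementation: the names LEMMA_DICT, FALLBACK_SUFFIXES and lemmatize, the dictionary entries, and the suffix list are all hypothetical and chosen only for demonstration.

```python
# Illustrative sketch only; not taken from MiddleEnglishLem.
# Dictionary entries and suffixes below are assumptions for demonstration.

# Several inflected forms and spelling variants point to one canonical form.
LEMMA_DICT = {
    "knyght": "knight",
    "knyghtes": "knight",
    "knightes": "knight",
    "kniht": "knight",
    "louede": "loven",
    "loueth": "loven",
}

# Fallback suffixes, longest first, used only when the dictionary has no entry.
FALLBACK_SUFFIXES = ("eth", "ede", "es", "en", "e")


def lemmatize(token):
    """Return (lemma, method) for a single token."""
    form = token.lower()
    # Step 1: dictionary lookup covers known inflections and variants.
    if form in LEMMA_DICT:
        return LEMMA_DICT[form], "dictionary"
    # Step 2: crude suffix stripping, keeping a stem of at least 3 letters.
    for suffix in FALLBACK_SUFFIXES:
        if form.endswith(suffix) and len(form) - len(suffix) >= 3:
            return form[: -len(suffix)], "stemmer"
    # Step 3: nothing applies, so return the lowercased form unchanged.
    return form, "unchanged"


if __name__ == "__main__":
    for word in ["Knyghtes", "kniht", "loueth", "lordes", "god"]:
        print(word, lemmatize(word))
```

In a sketch like this, the dictionary determines accuracy for known forms, while the fallback only guesses a stem, which is one reason a part-of-speech step could help disambiguate unknown forms.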

Source: https://www.semanticscholar.org/paper/Using-a-Hybrid-Algorithm-for-Lemm…
