The lack of lemmatization often undermines the quality of concordances, a problem that is especially relevant for diachronic corpora. Many lemmatizers are designed for Modern English. This paper presents MiddleEnglishLem, an application for dictionary-based lemmatization of Middle English texts. We use a hybrid algorithm to lemmatize the Helsinki Corpus of English Texts, a long-time-span diachronic corpus whose Middle English texts cover different genres and total 608 570 words. MiddleEnglishLem can associate multiple inflected forms and orthographic variants with their canonical forms. Comprehensive premade dictionaries make the lemmatization more accurate. The lemmatizer is competitive: its average error rate is below 2.5 percent, and its prebuilt stemmer has a strength of 0.38, a relatively high value. The accuracy of the lemmatization process could be improved further by adding syntagmatic analysis at the part-of-speech identification step. MiddleEnglishLem can be applied to diachronic corpora to study the development of English.
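To illustrate the general idea of a dictionary lookup combined with a stemmer fallback, here is a minimal Python sketch. It is not the MiddleEnglishLem implementation: the dictionary entries, the suffix list, and the names `FORM_TO_LEMMA`, `SUFFIXES`, and `lemmatize` are all assumptions made for illustration, and the abstract does not specify how the hybrid algorithm actually combines its components.

```python
# Toy form-to-lemma dictionary (hypothetical entries, not taken from the paper).
# Spelling variants and inflected forms all map to one canonical form.
FORM_TO_LEMMA = {
    "knight": "knight",
    "knyght": "knight",
    "cniht": "knight",
    "knyghtes": "knight",
    "hous": "hous",
}

# Hypothetical inflectional suffixes, tried longest first.
SUFFIXES = ("es", "en", "e", "s")


def lemmatize(token):
    """Return (lemma, method) for a token.

    method is "dictionary" for a direct hit, "stemmer" when a
    suffix-stripped stem is found in the dictionary, and "unknown"
    when the normalized token is returned unchanged.
    """
    form = token.lower().strip(".,;:!?\"'()")

    # Step 1: direct dictionary lookup.
    if form in FORM_TO_LEMMA:
        return FORM_TO_LEMMA[form], "dictionary"

    # Step 2: fallback, strip one suffix and retry the lookup.
    for suffix in SUFFIXES:
        stem = form[: -len(suffix)]
        if form.endswith(suffix) and len(stem) >= 3 and stem in FORM_TO_LEMMA:
            return FORM_TO_LEMMA[stem], "stemmer"

    # Step 3: no match; leave the form for manual review.
    return form, "unknown"


if __name__ == "__main__":
    for word in ["Knyghtes", "houses", "cnihten", "swerd"]:
        print(word, lemmatize(word))
```

In this sketch the dictionary is consulted first, so listed variants never reach the stemmer; the fallback only handles forms the dictionary does not list, and unmatched tokens are flagged rather than guessed.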
Source: https://www.semanticscholar.org/paper/Using-a-Hybrid-Algorithm-for-Lemm…