One of the major challenges in developing artificial intelligence algorithms is poor data quality or a very limited amount of data. Text data brings an additional problem: its form is heterogeneous. Data from sources such as mailboxes, comments or answers to open-ended survey questions often contains spelling mistakes and simple typos, and is frequently written carelessly.
Before testing various machine learning algorithms to determine which one gives the best results, we need to normalise the data. Simple machine learning algorithms require simple input. The simplest text classifier accepts a multidimensional vector as input, where the number of dimensions corresponds to the number of unique words in the training set. Note that, without normalisation, the algorithm treats the words “teach” and “teaching” as two completely different words. Normalisation of text data includes, among others:
- normalisation of letter case – by default, all data should be converted to lowercase,
- depending on the expected result – deleting digits,
- removing punctuation and whitespace characters (often called noise removal),
- deleting stop words, i.e. words that are very popular and bear no significant value (for example “the”, “a”, “on”, “is”, “all” in English),
- optional removal of names, surnames and proper names,
- correcting typos based on similarity to words in the dictionary,
- stemming – the process of reducing inflected (or sometimes derived) words to their word stem,
- lemmatization – the process of transforming a word to its basic form; this process is carried out based on specially prepared dictionaries,
- part-of-speech tagging – assigning to each word the corresponding part of speech; in addition, nouns can be marked as places or persons.
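A few of the simpler steps above (lowercasing, removing digits and punctuation, dropping stop words) and the resulting input vector can be sketched as follows. This is only a minimal illustration using the Python standard library: the function names `normalise` and `bag_of_words`, the tiny `STOP_WORDS` set and the example sentences are assumptions made for this sketch, not part of any particular library or of the project described here.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "on", "is", "all"}


def normalise(text, remove_digits=True):
    """Lowercase the text, optionally drop digits, strip ASCII punctuation
    and extra whitespace, and remove stop words. Returns a list of tokens."""
    text = text.lower()
    if remove_digits:
        text = re.sub(r"\d+", " ", text)
    # Replace every ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # split() without arguments also collapses repeated whitespace.
    return [token for token in text.split() if token not in STOP_WORDS]


def bag_of_words(documents):
    """Build a sorted vocabulary and one word-count vector per document."""
    vocabulary = sorted({token for doc in documents for token in doc})
    index = {token: i for i, token in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        vector = [0] * len(vocabulary)
        for token in doc:
            vector[index[token]] += 1
        vectors.append(vector)
    return vocabulary, vectors


docs = [
    normalise("The teacher is teaching 2 classes!"),
    normalise("All students teach, too."),
]
vocabulary, vectors = bag_of_words(docs)
# vocabulary: ['classes', 'students', 'teach', 'teacher', 'teaching', 'too']
# vectors:    [[1, 0, 0, 1, 1, 0], [0, 1, 1, 0, 0, 1]]
```

Note that “teach” and “teaching” still occupy two separate dimensions here; merging them is the job of stemming or lemmatization, which this sketch deliberately leaves out.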
Depending on the language of the text being normalised, these tasks can be very simple or very complicated. For popular and grammatically simpler languages such as English, many tools are available that allow fast and accurate data processing. Polish, however, poses many difficulties, which means that data processing takes much longer.
The main challenges are:
- grammatical complexity – each word can be inflected in dozens of different ways, which makes lemmatization dictionaries very extensive; in addition, some inflected forms are homonymous with other words (e.g. “mam” as “I have” and “mam” as the genitive plural of “mama”, i.e. “mother”),
- due to the lower popularity of the language, there are far fewer dictionaries and tools that facilitate text processing; such tools do exist, but they are less convenient to use and not always completely reliable.
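The homonymy problem can be illustrated with a toy dictionary lookup. The sketch below is hypothetical: `LEMMA_DICT` contains only three hand-written entries, and the helper names are made up for this example. A real Polish lemmatization dictionary holds a very large number of inflected forms and needs sentence context to choose between candidates.

```python
# Hypothetical toy dictionary: surface form -> list of (lemma, description).
LEMMA_DICT = {
    "mam": [
        ("mieć", "verb, 1st person singular"),
        ("mama", "noun, genitive plural"),
    ],
    "mamy": [
        ("mieć", "verb, 1st person plural"),
        ("mama", "noun, nominative plural"),
    ],
    "domu": [("dom", "noun, genitive singular")],
}


def lemma_candidates(word):
    """Return every lemma the toy dictionary lists for the given form."""
    return [lemma for lemma, _ in LEMMA_DICT.get(word.lower(), [])]


def is_ambiguous(word):
    """A form is ambiguous when it maps to more than one distinct lemma."""
    return len(set(lemma_candidates(word))) > 1


for form in ("mam", "domu", "kot"):
    print(form, lemma_candidates(form), is_ambiguous(form))
```

A lookup alone cannot decide whether “mam” should become “mieć” or “mama”; an unknown form such as “kot” simply returns no candidates, which is the other common failure mode of dictionary-based lemmatization.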
After normalising the text data, it is worth choosing several random samples and comparing the normalised text with the original text. When comparing, pay close attention to:
- deleted words – whether words crucial for the meaning of the sentence have been deleted,
- comprehensibility – whether the content can still be understood despite the artificiality of the normalised sentence.
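The manual review can be supported by a small helper that draws random samples and lists the words that disappeared during normalisation. This is a rough sketch with made-up names (`deleted_words`, `review_samples`); it assumes the normalisation step only removes tokens, so if stemming or lemmatization rewrites words, those words will also show up as "deleted".

```python
import random


def deleted_words(original, normalised_tokens):
    """List words of the original text that are absent from the normalised tokens."""
    kept = set(normalised_tokens)
    # Lowercase and strip common punctuation so that "right." matches "right".
    words = [w.strip(".,!?;:\"'()") for w in original.lower().split()]
    return [w for w in words if w and w not in kept]


def review_samples(pairs, k=3, seed=0):
    """Pick up to k random (original, tokens) pairs and attach the deleted words."""
    rng = random.Random(seed)  # fixed seed so the review is reproducible
    sample = rng.sample(pairs, min(k, len(pairs)))
    return [(orig, tokens, deleted_words(orig, tokens)) for orig, tokens in sample]


pairs = [
    ("This is not all right.", ["this", "right"]),
    ("The invoice is on the desk.", ["invoice", "desk"]),
]
for original, tokens, removed in review_samples(pairs, k=2):
    print(original, "->", tokens, "| removed:", removed)
```

In the first example pair the negation “not” is among the removed words, which is exactly the kind of meaning-changing deletion this check is meant to surface.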
It is also good practice to compile statistics on the most common words to assess whether these words really carry value for the task under consideration.
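Such statistics are easy to compile with `collections.Counter` from the Python standard library. The function name `most_common_words` and the sample token lists below are illustrative assumptions.

```python
from collections import Counter


def most_common_words(token_lists, n=10):
    """Count tokens across all normalised documents and return the n most frequent."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(n)


documents = [
    ["invoice", "payment", "please"],
    ["invoice", "please"],
    ["invoice"],
]
print(most_common_words(documents, n=2))
# [('invoice', 3), ('please', 2)]
```

A word like “please” ranking near the top is a hint that it may belong on the stop-word list, since it is frequent but unlikely to help distinguish categories.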
The process described above is more complicated than it seems, especially for more complex languages. Sometimes you have to consider whether a given word is valuable, retrain the model and check whether your assumptions were correct. Over the last two months, we have learned a lot about Polish text classification. For one of our customers, our task was to classify every incoming email into one of 19 categories to help the organisation deliver a high-quality customer experience.