Major challenges in developing artificial intelligence (AI) algorithms are poor data quality and limited amounts of available data. Text data brings additional problems because of how diverse it can be. Data from sources such as mailboxes, comments, or answers to open questions in surveys routinely contains spelling mistakes and idioms, and is often simply inaccurately written. Text is rarely straightforward, and it can be an extremely poor data source for AI.
The To-Do List
To test various machine learning (ML) algorithms and determine which one returns the best results, the data needs to be normalised; given the challenges above, this is even more necessary with text data. Simple machine learning algorithms require simple input. The simplest text classifier accepts a multidimensional vector as input, where the number of dimensions corresponds to the number of unique words in the learning set. Words such as “teach” and “teaching” are treated as two completely different words; the algorithm does not spot the connection between them. Normalisation of text data includes, among other things:
- Normalisation of letter size (defaulting all letters to lowercase)
- Deleting digits depending on the expected result
- Removing punctuation and whitespace characters (often called noise removal)
- Deleting stop words (words that are used often but carry little value on their own, e.g. “the”, “a”, “on” in an English text)
- Optionally removing names, surnames, and proper names
- Correcting typos based on similarity to words in the dictionary
- Stemming (the process of reducing words to their word stem, e.g. “flying” to “fly”)
- Lemmatisation (the process of transforming a word to its basic form; this process is carried out based on specially prepared dictionaries, e.g. “studies” to “study”)
- Part-of-speech tagging (assigning each word its corresponding part of speech)
Depending on the language of the normalised text, these tasks can range from very simple to very complicated. In the case of popular and syntactically simple languages such as English, a lot of tools are available that allow for fast and accurate data processing. As far as Polish is concerned, there are a lot more difficulties and data processing can take much longer.
To elaborate, the main challenges with a language such as Polish are:
- Grammatical complications (a single word can be inflected in dozens of different ways, so lemmatisation dictionaries become extensive, and the prevalence of homonyms adds to the problem)
- Because the language is less widely used, fewer dictionaries and tools that facilitate text processing are available. This does not mean that tools do not exist, but they tend to be less convenient and less reliable
After normalising text data, it is worthwhile to choose several random samples and compare the normalised text with the base text. When comparing, you should pay close attention to the following:
- Deleted words (Were the words crucial to the meaning of the sentence?)
- Content comprehension (Can the content be understood with the normalised sentences?)
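A spot check like this takes only a few lines. The parallel lists of raw and normalised documents below are hypothetical data; in practice they would come from your own corpus:

```python
import random

# Hypothetical parallel lists of raw documents and their normalised forms.
raw = [
    "The pilot is flying 2 planes.",
    "Cats were sleeping on the mat!",
    "Teaching is rewarding.",
]
normalised = [
    "pilot fly plane",
    "cat sleep mat",
    "teach reward",
]

# Draw a few random indices and inspect each pair side by side,
# asking: were crucial words deleted, and is the content still clear?
for i in random.sample(range(len(raw)), k=2):
    print("BASE:      ", raw[i])
    print("NORMALISED:", normalised[i])
    print("-" * 40)
```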
It is also a good practice to compile statistics on the most common words and to assess whether these words are of value in the task under consideration.
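Compiling such statistics is straightforward with `collections.Counter`; the token list here is illustrative:

```python
from collections import Counter

# Tokens from a normalised corpus (illustrative data).
tokens = "pilot fly plane pilot fly fly mail mail customer".split()

# Inspect the most frequent words and judge whether they carry
# value for the task; frequent but meaningless words are candidates
# for the stop-word list.
print(Counter(tokens).most_common(3))
```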
Taking the next step:
The process described above is more complicated than it may seem, especially with more complex languages. You need to consider whether certain words are valuable, retrain the model, and check whether your assumptions held true. In the last few months, we have learned a lot about Polish text classification. For one of our customers, our task was to classify every incoming email into one of nineteen categories to help the organisation deliver a higher-quality customer experience. Text processing and classification became our most powerful tool, and we have learned more each day thanks to projects just like this one.
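As described earlier, the simplest classifiers consume fixed-length vectors with one dimension per unique word in the learning set. A minimal bag-of-words sketch on illustrative documents shows why normalisation matters: without stemming, “teach” and “teaching” occupy separate dimensions and the model sees no connection between them:

```python
# Illustrative normalised documents; the vocabulary is built from them.
docs = ["teach fly", "teaching fly fly"]

# One dimension per unique word across the corpus.
vocab = sorted({w for d in docs for w in d.split()})

# Each document becomes a vector of word counts over that vocabulary.
vectors = [[d.split().count(w) for w in vocab] for d in docs]

print(vocab)    # → ['fly', 'teach', 'teaching']
print(vectors)  # → [[1, 1, 0], [2, 0, 1]]
```

With stemming applied first, both documents would share a single “teach” dimension, shrinking the vector and letting the classifier generalise across word forms.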