AI, Technicalities
12 February 2020


One of the major challenges in developing artificial intelligence algorithms is poor data quality or a very limited amount of data. Text data poses an additional problem: its form is heterogeneous. Data from sources such as mailboxes, comments or answers to open-ended survey questions often contains spelling mistakes and simple typos, and is often carelessly written.

To do list

Before we test various machine learning algorithms to determine which one gives the best results, we need to normalise the data. Simple machine learning algorithms require simple input. The simplest text classifier accepts a multidimensional vector as input, where the number of dimensions corresponds to the number of unique words in the training set. Note that the algorithm treats the word “teach” and the word “teaching” as two completely different words. Normalisation of text data includes, among others:

  • normalisation of letter case – by default, all data should be converted to lowercase,
  • depending on the expected result – deleting digits,
  • removing punctuation and whitespace characters (often called noise removal),
  • deleting stop words, i.e. words that are very popular and bear no significant value (for example “the”, “a”, “on”, “is”, “all” in English),
  • optional removal of names, surnames and proper names,
  • correcting typos based on similarity to words in the dictionary,
  • stemming – the process of reducing inflected (or sometimes derived) words to their word stem,
  • lemmatization – the process of transforming a word to its basic form; this process is carried out based on specially prepared dictionaries,
  • part-of-speech tagging – assigning the corresponding part of speech to each word; in addition, nouns can be marked as places or persons.
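A few of the steps above can be sketched in plain Python. This is only a minimal illustration using the standard library, not the pipeline used in the project: the stop-word set reuses the English examples from the list, while `DICTIONARY`, the function names and the `cutoff` value are assumptions made for the example. Stemming, lemmatization and part-of-speech tagging need language-specific resources and are left out.

```python
import re
import string
from difflib import get_close_matches

# Stop words taken from the examples in the list above.
STOP_WORDS = {"the", "a", "on", "is", "all"}

# Hypothetical reference dictionary used for typo correction.
DICTIONARY = ["teach", "teaching", "invoice", "payment"]

# Translation table that replaces every ASCII punctuation mark with a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalise(text, remove_digits=True):
    """Lowercase, optionally drop digits, strip punctuation and stop words."""
    text = text.lower()
    if remove_digits:
        text = re.sub(r"\d+", " ", text)
    text = text.translate(_PUNCT_TABLE)
    # split() without arguments also collapses repeated whitespace.
    return [token for token in text.split() if token not in STOP_WORDS]


def correct_typo(token, dictionary=DICTIONARY, cutoff=0.8):
    """Replace a token with the most similar dictionary word, if any is close enough."""
    matches = get_close_matches(token, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def bag_of_words(tokens, vocabulary):
    """Build a count vector with one dimension per vocabulary word."""
    return [tokens.count(word) for word in vocabulary]


tokens = normalise("The invoice 123 is on ALL the desks!")
vector = bag_of_words(["teach", "teaching", "teach"], ["teach", "teaching"])
```

Note that `bag_of_words` keeps “teach” and “teaching” in separate dimensions, which is exactly why stemming or lemmatization is usually applied before the vector is built.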

Language complexity

Depending on the language of the normalised text, these tasks can be very simple or very complicated. For popular and syntactically simple languages such as English, many tools are available that allow for fast and accurate data processing. For Polish, however, there are many difficulties, which means that data processing takes much longer.

The main challenges are:

  • grammatical complexity – a word can be inflected in dozens of different ways, which makes lemmatization dictionaries very extensive; in addition, some inflected forms are homonymous with other words (e.g. “mam” meaning “I have” and “mam” as the genitive plural of “mama”, i.e. “mum”),
  • limited tooling – because the language is less widely used, there are far fewer dictionaries and tools that facilitate text processing; such tools do exist, but they are less convenient to use and not always completely reliable.
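The homonym problem can be illustrated with a tiny, hand-written lemmatization dictionary. This is a hypothetical sketch, not a real Polish resource: the `LEMMAS` table and both function names are made up for the example, and a real dictionary would hold vastly more entries.

```python
# Hypothetical miniature dictionary: inflected form -> candidate lemmas.
LEMMAS = {
    "mam": ["mieć", "mama"],   # "I have" / genitive plural of "mama"
    "mamy": ["mieć", "mama"],  # "we have" / a form of "mama"
    "mamie": ["mama"],         # unambiguous form of "mama"
}


def lemmatise(token):
    """Return all candidate lemmas; unknown tokens map to themselves."""
    token = token.lower()
    return LEMMAS.get(token, [token])


def is_ambiguous(token):
    """True when the dictionary alone cannot pick a single lemma."""
    return len(lemmatise(token)) > 1
```

When `is_ambiguous` returns `True`, the dictionary lookup is not enough and the surrounding context (for example part-of-speech tags) has to decide which lemma is meant.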

After normalising the text data, it is worth choosing several random samples and comparing the normalised text with the original text. When comparing, pay close attention to:

  • deleted words – whether words crucial for the meaning of the sentence have been deleted,
  • comprehensibility – whether the content can still be understood despite the artificiality of the normalised sentence.

It is also good practice to compile statistics on the most common words to assess whether these words are really of value in the task under consideration.
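Both checks are easy to script. The sketch below assumes that the raw documents are stored as strings and the normalised documents as lists of tokens in the same order; the function names, the default sample size and the fixed seed are illustrative choices, not part of the original project.

```python
import random
from collections import Counter


def most_common_words(normalised_docs, n=10):
    """Count tokens across all normalised documents and return the top n."""
    counts = Counter(token for doc in normalised_docs for token in doc)
    return counts.most_common(n)


def sample_pairs(raw_docs, normalised_docs, k=3, seed=0):
    """Pick k random documents and pair each raw text with its normalised form."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    indices = rng.sample(range(len(raw_docs)), min(k, len(raw_docs)))
    return [(raw_docs[i], " ".join(normalised_docs[i])) for i in indices]


raw = ["The invoice is on the desk!", "Payment for invoice 123."]
normalised = [["invoice", "desk"], ["payment", "for", "invoice"]]

for original, cleaned in sample_pairs(raw, normalised, k=2):
    print(f"{original!r} -> {cleaned!r}")
print(most_common_words(normalised, n=3))
```

Reading the pairs side by side shows whether important words were removed, and the frequency list points to candidates for the stop-word list.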

The process described above is more complicated than it seems, especially for more complex languages. Sometimes you have to decide whether a given word is valuable, retrain the model and check whether your guesses were right. Over the last two months, we have learned a lot about Polish text classification. For one of our customers, our task was to classify every incoming email into one of 19 categories to help the organisation deliver a high-quality customer experience.
