Machine Translation 101 – Part 2

11 Nov 2020

Machine Translation: How to Clean Data

There’s no getting around it: cleaning your data is a critical step in any machine-learning workflow. The famous computer-science saying, “garbage in, garbage out,” holds especially true for machine translation. Computers aren’t magic; they can only work with what they receive as input. That’s why your training datasets must be carefully prepared to make them as accurate as possible.

In this article, we’ll walk you through the basics of data cleaning: the main issues that make data dirty, the problems dirty data creates in machine translation workflows, and the most effective data-cleaning techniques, along with why they matter.

What Makes Data ‘Dirty’?

What exactly does it mean for data to be ‘dirty’? Well, data gathered from the real world often has specific issues that make it difficult for the computer to be trained correctly. Datasets for machine translation tend to be amalgamated from various sources, which can lead to inconsistencies in structure and quality. Here are a few examples of common dirty data that can cause problems for your machine translation models.

Duplicate values

There’s a high risk of duplicates when data is amalgamated from multiple sources. Exact duplicates can often be identified with a simple script. However, you may also have near-duplicates: entries with different values for the same meaning, which are harder to catch automatically.
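
As a rough sketch of such a script, the snippet below drops exact duplicates from a parallel corpus; the CSV file name and the source and target column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical parallel corpus with one source/target sentence pair per row.
corpus = pd.read_csv("parallel_corpus.csv")  # columns: source, target

# Drop rows where the full sentence pair is an exact duplicate.
corpus = corpus.drop_duplicates(subset=["source", "target"])

# Optionally, keep only one target per source sentence to remove
# conflicting translations of the same segment.
corpus = corpus.drop_duplicates(subset=["source"], keep="first")

corpus.to_csv("parallel_corpus_deduped.csv", index=False)
```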

Missing or mangled values

This is especially common when your dataset has been scraped from the web. Scrapers can easily produce malformed or truncated values that risk throwing your translation model off balance.
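
As a rough illustration, a scraped corpus can be screened for empty or obviously mangled segments with a few simple checks; the column names and the 50% letters threshold below are assumptions, not fixed rules.

```python
import pandas as pd

corpus = pd.read_csv("scraped_corpus.csv")  # hypothetical columns: source, target

# Remove rows where either side is missing or empty after stripping whitespace.
corpus = corpus.dropna(subset=["source", "target"])
corpus = corpus[corpus["source"].str.strip().astype(bool)]
corpus = corpus[corpus["target"].str.strip().astype(bool)]

# Drop segments that are mostly non-letter characters, a common sign of
# scraper artefacts such as leftover markup or encoding noise.
def mostly_text(segment: str, threshold: float = 0.5) -> bool:
    letters = sum(ch.isalpha() for ch in segment)
    return letters / max(len(segment), 1) >= threshold

corpus = corpus[corpus["source"].apply(mostly_text) & corpus["target"].apply(mostly_text)]
```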

Non-standard values

One typical example of non-standard values in machine translation is the use of different date formats. When using human-generated data, it’s also necessary to check whether people spell and capitalize words consistently. Any confusion in the input will result in confusion in the model’s output.
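
One way to standardize dates, sketched below, is to parse every value that matches a known format and re-emit it in a single canonical form; the list of input formats is just an example and would need to reflect your actual data.

```python
from datetime import datetime

# Candidate input formats we expect to encounter (illustrative, not exhaustive).
DATE_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d", "%d %B %Y"]

def normalize_date(text: str) -> str:
    """Return the date in ISO form (YYYY-MM-DD) if the text parses as a date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text  # leave unrecognized values untouched

print(normalize_date("11/11/2020"))        # 2020-11-11
print(normalize_date("11 November 2020"))  # 2020-11-11
```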

Input errors

These often happen when a human manually adds the data, misspelling or mistyping certain words. Another human could easily spot this, but a computer cannot. For machine translation models, this can have significant adverse effects on the finished output.
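
A cleaning script can still flag likely typos for human review by fuzzy-matching each token against a reference lexicon, as in this minimal sketch using Python’s standard difflib; the tiny lexicon here is a stand-in for a real word list.

```python
import difflib

# Stand-in lexicon; in practice this would be a full word list for the language.
LEXICON = {"translation", "language", "dataset", "model", "training"}

def flag_possible_typos(tokens):
    """Yield (token, suggestion) pairs for tokens that nearly match a known word."""
    for token in tokens:
        if token.lower() in LEXICON:
            continue
        matches = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=0.8)
        if matches:
            yield token, matches[0]

print(list(flag_possible_typos(["trainning", "data", "modle"])))
# e.g. [('trainning', 'training'), ('modle', 'model')]
```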

Unbalanced or biased data

When your model needs to translate for a specific domain, unbalanced training data can become an issue when the dataset contains too much text from other domains.
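
If each sentence pair carries a domain label (a hypothetical field here), a quick check of the domain distribution plus some downsampling of over-represented domains might look like this sketch.

```python
import random
from collections import Counter

# Hypothetical corpus of (source, target, domain) triples.
corpus = [
    ("Contract terms apply.", "Les conditions contractuelles s'appliquent.", "legal"),
    ("Stocks fell sharply today.", "Les actions ont fortement chuté aujourd'hui.", "news"),
    ("Shares rallied this morning.", "Les actions ont rebondi ce matin.", "news"),
]

counts = Counter(domain for _, _, domain in corpus)
print(counts)  # inspect how skewed the domain mix is

# Crude rebalancing: downsample every domain to the size of the smallest one.
target_size = min(counts.values())
balanced = []
for domain in counts:
    rows = [row for row in corpus if row[2] == domain]
    balanced.extend(random.sample(rows, target_size))
```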

Incorrect or imprecise translations

Datasets found around the web can be littered with poor translations. These may come in the form of incorrect words or sentences loosely or carelessly translated, causing the original meaning to be lost.
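
One common heuristic for catching such pairs, not tied to any particular toolkit, is to drop sentence pairs whose source and target lengths differ wildly, since extreme length ratios often indicate loose or truncated translations; the 3.0 threshold below is an assumption to tune per language pair.

```python
def plausible_pair(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Reject pairs whose token counts differ by more than max_ratio."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio

pairs = [
    ("The cat sat on the mat.", "Le chat était assis sur le tapis."),
    ("The cat sat on the mat.", "Oui."),  # suspiciously short target
]
filtered = [p for p in pairs if plausible_pair(*p)]  # keeps only the first pair
```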

The Importance of Data Cleaning in Machine Translation

All machine translation systems learn from examining patterns in language. Elements such as emojis and usernames risk confusing the algorithm because they are not translatable. Words in all capital letters can also be problematic, as can particular punctuation. You should start your data cleaning process by removing these elements from your training dataset.
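
A first cleanup pass along these lines might strip usernames and emoji and lower-case shouted all-caps words with a few regular expressions, as in this sketch; the patterns are illustrative rather than exhaustive, and dedicated emoji libraries exist for production use.

```python
import re

EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji range
USERNAME_PATTERN = re.compile(r"@\w+")

def strip_noise(text: str) -> str:
    text = USERNAME_PATTERN.sub("", text)  # remove @usernames
    text = EMOJI_PATTERN.sub("", text)     # remove emoji
    # Lower-case words written entirely in capitals (3+ letters).
    text = re.sub(r"\b[A-Z]{3,}\b", lambda m: m.group(0).lower(), text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(strip_noise("THANKS @maria for the update 🙂"))  # "thanks for the update"
```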

Another critical step in the data-cleaning process is data normalization. This involves removing certain parts of the data, such as numbers, which are usually the same across all languages and therefore not relevant to the translation process. Normalization reduces unnecessary variation in the input, which makes it easier to see what the model is actually learning from, identify the best data for it, and further optimize the input for better end results.

A Typical Data Cleaning Process for Machine Translation

The exact nature of each data-cleaning workflow will depend on the data being processed. Unlike machine learning for text analytics, machine translation has no single standard pre-processing pipeline. Your data scientists may decide to adopt the pipeline below, adapting it to the model architecture and the particular challenges observed in the data.

A typical data cleaning workflow uses the following steps for pre-processing the text: lowercasing, tokenization, normalization, and removal of unwanted characters (punctuation, URLs, numbers, HTML tags, and emoticons).

Lowercasing

This step involves transforming all words in the dataset into lower-case forms. It is helpful in cases where your datasets contain a mixture of different capitalization patterns, which may lead to translation errors, for example, having ‘portugal’ and ‘Portugal’ in the same dataset. However, lowercasing isn’t suitable for all translation projects. Some languages rely on capitalization to generate meaning.
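
In Python this is a one-liner over each segment; note that casefold() is slightly more aggressive than lower() for languages such as German, which matters given the caveat above.

```python
segments = ["Portugal is beautiful", "PORTUGAL is beautiful", "Straße"]

lowercased = [s.lower() for s in segments]      # 'portugal is beautiful', ..., 'straße'
casefolded = [s.casefold() for s in segments]   # 'straße' becomes 'strasse'
```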

Tokenization

This refers to the process of splitting sentences up into individual words. It’s a central part of the data-cleaning workflow, essential to enable the algorithm to make sense of the text.
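
As a minimal illustration, a regex that separates words from punctuation can serve as a naive tokenizer; production pipelines typically rely on dedicated tools (Moses, SentencePiece, or language-specific tokenizers) instead.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(sentence: str) -> list:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenize("Don't split this, please."))
# ['Don', "'", 't', 'split', 'this', ',', 'please', '.']
# Note how the contraction is split; this is why dedicated tokenizers are preferred.
```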

Normalization

This is the process of transforming a text into a standard (canonical) form. For example, the words ‘2mor’, ‘2moro’, and ‘2mrw’ can all be normalized into a single standard word: ‘tomorrow.’ This is an essential step in data cleaning, especially when handling user-generated content from social media, blogs, or forum comments.
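
A simple lookup-table normalizer covering the example above could be sketched like this; real systems combine such dictionaries with rules or learned models, and the table here is purely illustrative.

```python
# Hypothetical mapping from noisy user-generated spellings to canonical forms.
NORMALIZATION_TABLE = {
    "2mor": "tomorrow",
    "2moro": "tomorrow",
    "2mrw": "tomorrow",
    "u": "you",
}

def normalize_tokens(tokens):
    return [NORMALIZATION_TABLE.get(token.lower(), token) for token in tokens]

print(normalize_tokens(["see", "u", "2moro"]))  # ['see', 'you', 'tomorrow']
```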

Removal of unwanted characters

The data cleaning process should also include removing other parts of the data that don’t add to the translation meaning, including emojis, URLs, HTML tags, and numbers. Also, most punctuation is unnecessary because it doesn’t provide additional meaning for a machine translation model.
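
Pulling these removals together, a cleanup function might look like the following sketch; the exact patterns, and whether numbers or punctuation should really be dropped, are assumptions to revisit for each project.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_PATTERN = re.compile(r"<[^>]+>")
NUMBER_PATTERN = re.compile(r"\d+")

def remove_unwanted(text: str) -> str:
    text = URL_PATTERN.sub("", text)     # URLs
    text = HTML_PATTERN.sub("", text)    # HTML tags
    text = NUMBER_PATTERN.sub("", text)  # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return re.sub(r"\s+", " ", text).strip()

print(remove_unwanted("<p>Visit https://example.com before 2021!</p>"))  # "Visit before"
```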

Don’t Sacrifice Accuracy For Volume

If the human translations in the training data have inaccurate meanings, the model’s overall quality will suffer. This often happens when using data scraped from bilingual or multilingual websites. Generally, having as much training data as possible is better, but not if it’s inaccurate; in that case, it’s better to go with a smaller dataset of high-quality, accurate data. Cleaning the data also removes irrelevant words, reducing the size of the dataset the machine translation model has to deal with and making the model perform more efficiently.

Data Cleaning Challenges by Language

Certain languages present additional challenges for data cleaning, not because of the structure of the target language itself, but rather the difficulties in obtaining sufficient volumes of data. It’s important to have a good lexicon containing all essential words in that language. You also need to have some suitable normalization tools for that language and to understand how the language works.

For example, it’s essential to understand when to preserve capital letters, such as those in German nouns. Your data science team must also understand how to deal with symbols and user-generated content, such as emoticons. Different languages present different challenges, but how difficult they seem often depends on which languages your tools and experience are geared toward.

Enhancing Your Data Cleaning With Defined.ai

Scraping bilingual websites to create a set of training data for machine translation may seem like a good idea, but this is far from an ideal solution.

Scraping websites produces a large quantity of natural language text, which requires extensive data cleaning to prepare it for use as training data. Your data science team will have to spend significant time and effort making this data suitable for building a reliable machine translation model. Defined.ai saves you from this process by providing ready-made and fully cleaned datasets in a wide range of languages. You can plug the dataset directly into your machine translation model and have confidence that it will give you reliable results. Defined.ai can also provide datasets in more raw natural language formats for your data science team to process using their own data-cleaning tools.

Unclean data can cause disastrous mistranslations in machine translation systems. With Defined.ai, you can access clean, high-quality, and domain-specific data to train your machine translation models.

In case you missed it, Part 1 of this series, How to Create a Perfect Dataset for Machine Translation Models, explains how to build such a dataset and why it is so important to successful machine translation.

Get the final part of our machine translation series, How to Perform Machine Translation Evaluation, here.

