Free FROM Tokenizers For Digital Sovereignty
More Efficiency and Independence for Natural Language Processing
Processing natural language in such a way that computers and digital devices can identify and understand its content is the objective of Natural Language Processing (NLP). Tokenizers are typically used for this type of artificial intelligence (AI). These are programs that break text down into smaller units, a step that requires a lot of computing power and energy. What alternative there is, and what it means for the digital sovereignty of companies, was shown in a presentation by Aleph Alpha on the sidelines of the World Economic Forum 2025 in Davos.
Digital sovereignty is becoming strategically relevant for companies. It’s about maintaining control over the firm’s own data and value chain. “Whoever builds artificial intelligence decides what is truth or fake,” Jonas Andrulis said in Davos during the World Economic Forum 2025. This also applies, for example, to data in industrial production, according to the founder of the German AI company Aleph Alpha. Securing digital sovereignty requires technological independence. This relates to a technical peculiarity of Large Language Models (LLMs), the deep learning algorithms that are widely used for text processing with NLP.
The background
To make a text understandable to machines, several steps are required. First, stop words and special characters are removed or the text content is normalized, e.g. by converting it to lowercase letters. A program called a “tokenizer” then breaks the text down into smaller units; these are called “tokens”. A token can be a word, a punctuation mark or a symbol, for instance. Once the text has been split, the tokens are structured and classified; this can also include attaching metadata such as the position in the text or the part of speech. Finally, the tokens are stored in a structure that can be used by the subsequent steps of the NLP pipeline (a minimal sketch of such a pipeline follows at the end of this section). Tokenizers are important for various applications:
– Text classification: Text is placed in predefined categories. Tokenizers play a crucial role as they break down the text into analyzable units that can then be classified by AI models.
– Machine translation: Text is translated from one language to another. Tokenizers help analyze the text in the source language and rephrase it in the target language.
– Language generation: New texts are generated by AI models. Tokenizers are important here to ensure that the generated texts are grammatically correct and coherent.
With the help of tokenizers, chatbots or translation services, for example, can work more precisely and efficiently.
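
The following Python sketch illustrates these preprocessing steps in a deliberately simplified form. The stop-word list, the tokenize function and its metadata fields are illustrative assumptions, not part of any specific NLP library or of Aleph Alpha’s software.

```python
import re

# Minimal, illustrative preprocessing pipeline: normalize, split into tokens,
# drop stop words, attach simple metadata. Names and the tiny stop-word list
# are examples only.

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}

def tokenize(text: str) -> list[dict]:
    # Step 1: normalization, e.g. converting to lowercase letters
    normalized = text.lower()
    # Step 2: break the text down into word and punctuation tokens
    raw_tokens = re.findall(r"\w+|[^\w\s]", normalized)
    # Step 3: remove stop words, keep content words and punctuation
    tokens = []
    for position, tok in enumerate(raw_tokens):
        if tok in STOP_WORDS:
            continue
        # Step 4: attach metadata such as the position in the text
        tokens.append({"token": tok, "position": position})
    return tokens

print(tokenize("The tokenizer breaks the text into smaller units."))
# [{'token': 'tokenizer', 'position': 1}, {'token': 'breaks', 'position': 2}, ...]
```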
The challenge
Tokenizers are part of the processing pipeline of conventional Large Language Models (LLMs), which are based on the so-called transformer architecture. LLMs are deep learning algorithms that are often used in the field of NLP. Examples are Meta’s Llama 3.1 8B, which focuses in particular on the English language, or Viking-7B from SiloAI, which mainly processes Nordic languages. The Llama model’s vocabulary contains roughly 128,000 tokens – a collection that can be described as a dictionary. It includes combinations of letters, numbers, spaces and punctuation marks. If an LLM is trained on English-language texts, as is mostly the case, it can draft a text in English with good quality and comparatively little effort. If, on the other hand, the model is to be trained for other languages, more computing power is required, which means higher power consumption and thus higher costs and increased CO2 emissions. In a country such as India, where around 800 languages are spoken, this makes practical implementation difficult.
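
To make the effect of a fixed vocabulary tangible, the following toy example assumes an invented mini-vocabulary built from English fragments and encodes text by greedy longest match. It is only a sketch of the principle: the vocabulary, the encode function and the token IDs are made up, and real tokenizers such as Llama’s work with far larger, statistically learned vocabularies.

```python
# Toy illustration of a fixed token vocabulary (a "dictionary" of character
# sequences) and greedy longest-match encoding. The vocabulary, IDs and the
# encode function are invented for illustration; real vocabularies such as
# Llama 3.1's contain roughly 128,000 statistically learned entries.

VOCAB = {
    "comput": 0, "er": 1, "ing": 2, "power": 3, " ": 4,
    # Single characters as a fallback so any lowercase input can be encoded.
    **{c: 5 + i for i, c in enumerate("abcdefghijklmnopqrstuvwxyzäö")},
}

def encode(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            i += 1  # skip characters the vocabulary does not cover
    return ids

# A phrase the vocabulary was built for needs few tokens ...
print(len(encode("computer power")))   # 4 tokens
# ... while a Finnish word falls back to single characters, i.e. many more
# tokens per word and therefore more compute during training and inference.
print(len(encode("tietokone")))        # 9 tokens
```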
The solution
Against this background, Aleph Alpha, in collaboration with the U.S. chip company AMD and its subsidiary SiloAI, has developed the “T-Free” approach, which Jonas Andrulis presented at the World Economic Forum 2025. Instead of using tokenizers, the model processes groups of three adjacent characters of a word (character trigrams) directly. In this way, an LLM that has been trained for a specific language can be adapted to another language. Moreover, this approach offers the advantage that the model can be trained more efficiently on specific technical terms from a company or industry. Aleph Alpha reports a 400 percent increase in performance for Finnish compared to Llama on the H100. In English, the T-Free approach performs 150 percent better than Llama. In addition, the use of AMD’s MI300X chip can significantly increase efficiency: for the Finnish-language application, training costs and CO2 emissions were reduced by 70 percent compared to alternative solutions.
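
The following sketch shows only the basic trigram idea in simplified form: a word is described by its overlapping groups of three adjacent characters, which can then be mapped onto a fixed-size embedding table, so no fixed vocabulary for a specific language is required. The function names and the hashing step are illustrative assumptions and do not reproduce Aleph Alpha’s actual T-Free implementation.

```python
# Simplified sketch of the trigram idea: each word is represented by its
# overlapping groups of three adjacent characters, so no fixed vocabulary for
# a specific language is required. All names and the hashing step are
# illustrative and do not reproduce Aleph Alpha's T-Free implementation.

def char_trigrams(word: str) -> list[str]:
    """Overlapping three-character groups, with markers for word boundaries."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_slots(word: str, table_size: int = 32768) -> set[int]:
    """Map the trigrams of a word onto slots of a fixed-size embedding table.
    (A real system would use a stable hash, not Python's salted hash().)"""
    return {hash(tri) % table_size for tri in char_trigrams(word)}

print(char_trigrams("tokenizer"))
# ['_to', 'tok', 'oke', 'ken', 'eni', 'niz', 'ize', 'zer', 'er_']
print(char_trigrams("tietokone"))   # works for a Finnish word just as well
# ['_ti', 'tie', 'iet', 'eto', 'tok', 'oko', 'kon', 'one', 'ne_']
```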

The opportunity
Aleph Alpha, AMD and SiloAI now want to apply the T-Free approach together with companies from various industries. This gives entrepreneurs and executives the opportunity to integrate generative AI efficiently into their business processes and to train it in a company- and industry-specific way. At the same time, this can be the starting point for advancing digital sovereignty as a strategic issue.
