Innovationskunst

Multilingual and open source: OpenGPT-X research project publishes large AI language model

The large AI language model of the OpenGPT-X research project is now available for download on Hugging Face: ‘Teuken-7B’ was trained from scratch in all 24 official languages of the EU and comprises seven billion parameters. Researchers and companies can use the commercially deployable open source model for their own artificial intelligence (AI) applications. The OpenGPT-X consortium project is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS. With this release, its partners have launched a large AI language model as a freely usable open source model with a European perspective.

 

About the project

‘In the OpenGPT-X project, we have spent the past two years researching the basic technology for large AI foundation models and training corresponding models with strong partners from research and industry. We are delighted that we can now make our Teuken-7B model freely available worldwide and thus offer an alternative from public research for science and business,’ says Prof Dr Stefan Wrobel, Institute Director at Fraunhofer IAIS. ‘Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt or further develop the model for their own work and applications. In this way, we want to help meet the growing demand for transparent and customisable generative artificial intelligence solutions, both within the scientific community and together with companies from different industries.’

Teuken-7B is currently one of the few AI language models that have been developed multilingually from the ground up. Around 50 per cent of its pre-training data is non-English, and it has been trained in all 24 official languages of the EU. Its performance has proven stable and reliable across several languages. This offers added value, particularly for international companies with multilingual communication requirements and product and service offerings. Because the model is provided as open source, companies and organisations can operate their own customised models in real applications, and sensitive data can remain within the company.

In addition to model training, the OpenGPT-X team also addressed numerous research questions, such as how multilingual AI language models can be trained and operated in a more energy- and cost-efficient manner. To this end, a multilingual ‘tokeniser’ was developed in the project. A tokeniser breaks words down into individual word components, called tokens: the fewer tokens a text requires, the faster and more energy-efficiently a language model generates its answer. The tokeniser developed in the project reduced training costs compared with other multilingual tokenisers, such as those of Llama3 or Mistral. This is particularly important for European languages with long words, such as German, Finnish or Hungarian. Efficiency gains can also be achieved in the operation of multilingual AI applications.
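The link between vocabulary and token count can be illustrated with a small sketch. It is not the OpenGPT-X tokeniser: the greedy longest-match splitter and both toy vocabularies below are invented for illustration, standing in for the subword algorithms real tokenisers use. The sketch only shows why a vocabulary that covers German word components needs fewer tokens for a long compound word.

```python
def greedy_tokenise(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest vocabulary pieces, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always serves as fallback.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def tokens_per_word(text: str, vocab: set[str]) -> float:
    """Average number of tokens per whitespace-separated word (lower means cheaper)."""
    words = text.lower().split()
    total = sum(len(greedy_tokenise(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical vocabularies, invented for this example.
ENGLISH_CENTRIC = {"the", "data", "protect", "ion", "regul", "ation"}
MULTILINGUAL = ENGLISH_CENTRIC | {"daten", "schutz", "grund", "verordnung"}

text = "Datenschutzgrundverordnung"
print(greedy_tokenise(text.lower(), MULTILINGUAL))
print(tokens_per_word(text, ENGLISH_CENTRIC), tokens_per_word(text, MULTILINGUAL))
```

With the multilingual toy vocabulary the compound splits into four pieces, whereas the English-centric one falls back to single characters. Since a model processes and generates one token at a time, a lower tokens-per-word ratio means less computation for the same text.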

The OpenGPT-X joint project was funded as part of the BMWK funding programme ‘Innovative and practical applications and data spaces in the Gaia-X digital ecosystem’. Teuken-7B is therefore also accessible via the Gaia-X infrastructure. Stakeholders in the Gaia-X ecosystem can thus develop innovative language applications and transfer them into concrete application scenarios in their respective domains. In contrast to existing cloud solutions, Gaia-X is a federated system that allows different service providers and data owners to connect with each other. The data always remains with the owner and is only shared according to defined conditions.

‘I am delighted with today's release of the Gaia-X-based AI language model Teuken-7B and congratulate the OpenGPT-X project for reaching this important milestone. What is special is that Teuken-7B also enables the secure use of sensitive company data, as the Gaia-X standards guarantee data storage and processing in accordance with the highest European data protection and security regulations. Innovations such as these strengthen digital sovereignty, competitiveness and also the resilience of Germany and Europe. This is why the BMWK is funding the project with around 14 million euros,’ says Dr Franziska Brantner, Parliamentary State Secretary at the BMWK.

Bernhard Grill, Institute Director at Fraunhofer IIS, emphasises the importance for safety-relevant applications: ‘With the completely independently trained language model published here, the project partners are demonstrating their ability to generate their own large models. The associated access to a large AI language model enables applications that offer much better control over this technology without the need for non-visible third-party components - e.g. for specific, particularly safety-critical applications in the automotive sector, robotics, medicine or finance. By training with the data relevant to the specific application and using application-specific architectures, companies can create customised AI solutions that do not require black box components.’

Generative AI from a strong network - with a European perspective

Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing very large amounts of data, utilising powerful European HPC infrastructures and carrying out efficient model training. Teuken-7B was trained on the JUWELS supercomputer at Forschungszentrum Jülich. In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the German AI Association (KI Bundesverband), TU Dresden, the German Research Centre for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert and Westdeutscher Rundfunk (WDR) also collaborated on OpenGPT-X as partners. The technology developed in OpenGPT-X will also provide the partners with the basis for training their own models in the future.

‘OpenGPT-X serves as an example of how valuable basic technology can be created with the funds of a publicly funded project and the joint efforts of a broad-based consortium, from the underlying infrastructure to the training of models and their use in production. In the interests of technology and data sovereignty, it is now important to build on this foundation: we hope that OpenGPT-X will be used as the basis for many subsequent activities,’ emphasises Daniel Abbou, Managing Director of the German AI Association and President of the European AI Forum.

The research project, which was launched at the beginning of 2022, is now nearing completion. It will run until 31 March 2025 so that further optimisations and evaluations of the models can be carried out.

The path to using Teuken-7B

Interested developers from the scientific community or from companies can download Teuken-7B free of charge from Hugging Face and work with it in their own development environment. The model has already been optimised for chat by means of instruction tuning. Instruction tuning adapts a large AI language model so that it correctly follows instructions from users, which is particularly relevant for practical use, for example in a chat application.

Teuken-7B is available in two versions: a version that can be used for research purposes and a version under the ‘Apache 2.0’ licence, which companies can use for commercial purposes in addition to research and integrate into their own AI applications. The performance of both models is roughly comparable, but some of the data sets used for instruction tuning exclude commercial use and were therefore not used in the Apache 2.0 version.

Download options and model cards can be found under the following link: https://huggingface.co/openGPT-X
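As a rough sketch of what working with the model in a local environment might look like, the following uses the Hugging Face `transformers` library. The two repository names are assumptions and should be checked against the model cards at the link above. The `trust_remote_code=True` flag is likewise an assumption about how the repositories are packaged, and the plain-text prompt is a simplification: the model cards describe the intended chat format.

```python
# Repository names are assumptions; check https://huggingface.co/openGPT-X for the current ones.
RESEARCH_REPO = "openGPT-X/Teuken-7B-instruct-research-v0.4"
COMMERCIAL_REPO = "openGPT-X/Teuken-7B-instruct-commercial-v0.4"


def pick_repo(commercial_use: bool) -> str:
    """Return the repository matching the intended use (Apache 2.0 or research-only)."""
    return COMMERCIAL_REPO if commercial_use else RESEARCH_REPO


def generate_reply(prompt: str, commercial_use: bool = True, max_new_tokens: int = 128) -> str:
    """Download the selected model on first use and generate a continuation of the prompt."""
    # Imported lazily so that pick_repo works without the heavy dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = pick_repo(commercial_use)
    # Assumption: the repository may ship custom code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_reply("Wie heißt die Hauptstadt von Frankreich?")` would download several gigabytes of weights on first use and requires `transformers` and PyTorch to be installed. The `commercial_use` switch reflects the two versions described above.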

The OpenGPT-X Discord Server is available to the specialist community for technical feedback, questions and specialist discussions: https://discord.gg/RvdHpGMvB3

Companies in particular also have the opportunity to take part in free demo sessions in which Fraunhofer scientists explain which applications can be realised with Teuken-7B. Registration for demo appointments is possible via www.iais.fraunhofer.de/opengpt-x.

Detailed technical background information and benchmarks as well as an overview of all research results from the OpenGPT-X project can be found on the project website: https://opengpt-x.de/en/models/teuken-7b

    About OpenGPT-X

    The OpenGPT-X project started on 1 January 2022 with funding from the Federal Ministry for Economic Affairs and Climate Action (BMWK) amounting to around 14 million euros and will end on 31 March 2025. The ten project partners are Fraunhofer IAIS, Fraunhofer IIS, Forschungszentrum Jülich, KI Bundesverband, TU Dresden, DFKI, IONOS, Aleph Alpha, ControlExpert and WDR. Under the leadership of Fraunhofer IAIS and Fraunhofer IIS, the project is researching the entire value chain of generative AI: from the highly scalable, GPU-based infrastructure and the data for training large language models, to the development of the models, through to productive application in the form of prototypes and proofs of concept (PoCs). The overarching goal of the project was to develop its own large AI language model, to be made available as open source for research and companies and geared in particular towards the multilingual needs of Europe. With the publication of Teuken-7B, the project has achieved this goal and thus provides an alternative from public research for future scientific studies and commercial applications of generative AI.

    About Fraunhofer IAIS

    As part of the largest organisation for application-oriented research in Europe, the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, based in Sankt Augustin/Bonn and with a site in Dresden, is one of the leading scientific institutes in the fields of artificial intelligence (AI), machine learning and big data in Germany and Europe. Around 380 employees support companies in the optimisation of products, services and processes as well as in the development of new digital business models. Fraunhofer IAIS is shaping the digital transformation of our working and living environment: with innovative AI applications for industry, health and sustainability, with future-oriented technologies such as large AI language models or quantum machine learning, with offers for training and further education or for testing AI applications for security and trustworthiness.

    About the Audio and Media Technologies division of Fraunhofer IIS

    The Audio and Media Technologies division at Fraunhofer IIS has been shaping the standards and technologies used worldwide in the audio and film industry for over 30 years. Starting with the invention of mp3 and continuing with the development of AAC and the test plan of the Digital Cinema Initiative, today systems and technologies from Erlangen can be found in almost all consumer electronics and (mobile) communication devices. Our latest generation of media technologies such as MPEG-H Audio, xHE-AAC, LC3/LC3plus, Symphoria and upHear are also already in use worldwide. We have also been working with speech technologies for over 20 years. Most recently, we developed the EVS standard, which benefits all 5G voice services. Today, we are expanding our activities in the direction of voice signal processing and voice assistance systems.
