The introduction of the European General Data Protection Regulation (GDPR) in 2018 had far-reaching effects on the handling and use of personal data. Anonymized data falls outside the scope of the GDPR because, ideally, it no longer allows conclusions to be drawn about natural persons.
In response, global interest in data anonymization has increased considerably, as reflected in the development of various new anonymization techniques. Anonymization is of particular interest for Large Language Models (LLMs), since it has been shown that training data can be extracted from a model after training. GDPR-compliant results require high-performing anonymization models. While many such models exist for English, models for German texts are still lacking.
The main goal of the NERMAN project is to research and develop machine learning models and methods for:
- identifying personal information in German texts, and
- adequately anonymizing the identified data.
To achieve this goal, we plan to develop a Named Entity Recognition (NER) model focused on detecting personal data. The model will be developed on the basis of two use cases to be defined within the project. A special focus lies on the anonymization of texts provided by the Austrian Federal Ministry of the Interior (Bundesministerium für Inneres, BMI), which primarily consist of email and chat correspondence.
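The detect-then-replace idea behind NER-based anonymization can be illustrated with a minimal sketch. It is not the project's model: the regular expressions below are a hypothetical stand-in for a trained NER model, and the names `Entity`, `PATTERNS`, `detect_entities`, and `anonymize` are assumptions made for illustration only.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # entity type, e.g. "EMAIL"


# Hypothetical stand-in patterns; a trained NER model would replace them.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d /-]{6,}\d"),
}


def detect_entities(text: str) -> List[Entity]:
    """Return all pattern matches as entities, sorted by start offset."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.start(), match.end(), label))
    return sorted(entities)


def anonymize(text: str, entities: List[Entity]) -> str:
    """Replace each entity span with a placeholder such as [EMAIL]."""
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, reverse=True):
        text = text[:ent.start] + f"[{ent.label}]" + text[ent.end:]
    return text


if __name__ == "__main__":
    sample = ("Bitte schreiben Sie an max.mustermann@example.org "
              "oder rufen Sie +43 316 123456 an.")
    print(anonymize(sample, detect_entities(sample)))
```

The sketch assumes that detected spans do not overlap. A real pipeline would also need to cover names, addresses, and other categories that simple patterns cannot capture, which is where a learned model comes in.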
Developing a high-performing model requires high-quality training data. To acquire such data, we will combine web scraping of public information with synthetic data generation. The resulting datasets will be compared against a ground truth using statistical and linguistic metrics to ensure validity and representativeness. As there is currently a lack of German NER datasets, a new benchmark dataset will be created based on our collected and generated data.
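One possible statistical metric for such a comparison is the Jensen-Shannon divergence between the token frequency distributions of two corpora. The sketch below is only an illustration under that assumption; the project does not name its metrics, and the functions `token_distribution` and `js_divergence` as well as the whitespace tokenization are hypothetical simplifications.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def token_distribution(texts: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of lowercased, whitespace-separated tokens."""
    counts = Counter(tok.lower() for text in texts for tok in text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: count / total for tok, count in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a: Dict[str, float]) -> float:
        # Kullback-Leibler divergence KL(a || M); zero-probability terms add 0.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    real = ["Sehr geehrte Frau Muster", "Danke für Ihre Nachricht"]
    synthetic = ["Sehr geehrter Herr Muster", "Danke für die Nachricht"]
    print(js_divergence(token_distribution(real),
                        token_distribution(synthetic)))
```

A value close to 0 indicates that the two corpora use a similar vocabulary with similar frequencies. Linguistic metrics such as sentence length or part-of-speech distributions would have to be computed separately.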
The developed models will be thoroughly evaluated in terms of performance, efficiency, and resource usage, as well as ethical and legal aspects. The resulting ethical and legal framework for handling personal data and anonymization techniques in AI will include metrics for evaluating anonymization quality. To demonstrate the practical applicability of our research, a proof-of-concept demonstrator will be developed.
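As an illustration of what a performance metric for a detection model might look like, the following sketch computes span-level precision, recall, and F1 against gold annotations. This is an assumption for illustration, not the project's evaluation protocol; the function name `span_scores` and the `(start, end, label)` span format are hypothetical. For anonymization, recall is arguably the more critical figure, since every missed entity remains in the text.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def span_scores(gold: Iterable[Span],
                predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [(0, 5, "PER"), (10, 20, "EMAIL"), (25, 30, "LOC"), (35, 40, "PER")]
    predicted = [(0, 5, "PER"), (10, 20, "EMAIL"), (50, 55, "PER")]
    precision, recall, f1 = span_scores(gold, predicted)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact-match scoring counts a span as correct only if both its boundaries and its label agree with the gold annotation. Partial-overlap scoring would be a more lenient alternative.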
A major innovation of the NERMAN project is an NER model specifically tailored to German-language email and chat data. Furthermore, the project plans to use LLMs to generate synthetic datasets whose linguistic characteristics resemble those of real communication data. For the first time, representative, privacy-compliant synthetic test datasets will be created and made available for a highly sensitive domain such as security administration. Lastly, quantitative criteria will be established to enable reliable assessment of both the personal nature of data and the quality of anonymization processes.
Project Lead
DI Ulrike Kleb
JOANNEUM RESEARCH Forschungsgesellschaft mbH
POLICIES – Institut für Wirtschafts-, Sozial- und Innovationsforschung
Partners
Bundesministerium für Inneres
Axtesys GmbH
Universität für Weiterbildung Krems - Department für E-Governance in Wirtschaft und Verwaltung
Contact
DI Ulrike Kleb
JOANNEUM RESEARCH Forschungsgesellschaft mbH
POLICIES – Institut für Wirtschafts-, Sozial- und Innovationsforschung
Leonhardstraße 59
8010 Graz
Tel.: +43 316 876-1555
E-Mail: ulrike.kleb(at)joanneum.at
Web: https://www.joanneum.at/policies/
