Text containing personal data must be anonymized or pseudonymized before it can be used for training AI models or for research and educational purposes. Anonymization may also be necessary for publishing parliamentary materials or judicial decisions. Anonymization requires reliable and comprehensible identification of personal information.
CLEAR addresses generic, transparent, reliable, and sustainable AI solutions for named entity recognition (NER) and its application to identifying personal data. This will involve a combination of rule-based and ML-based methods that exploits the advantages of both paradigms.
State-of-the-art NER solutions rely on task-specific fine-tuning of large neural language models. Such models require high-quality annotated training data and still fail to generalize. Their tendency to hallucinate reduces user trust and causes misinformation. Their inherent “black box” nature results in decisions that are neither predictable nor explainable. Such models are not configurable, are prone to bias, and create a significant environmental burden. Common rule-based systems, on the other hand, require laborious manual configuration to adapt to changing requirements.
The CLEAR project seeks to develop and evaluate hybrid NER methods for the processing of German texts:
(1) Rule learning for NER via prompting and fine-tuning of LLMs.
(2) Generation of entity candidates with deep learning models, followed by the selection of entities using learned rules.
CLEAR relies on a human-in-the-loop learning approach for legal NER that mitigates the aforementioned issues. Rule-based models are explainable, predictable, and auditable, as well as configurable and comprehensible for their users. CLEAR also offers a learning paradigm that greatly reduces the need to train LLMs and thereby lowers environmental costs.
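To illustrate the two-step hybrid idea in (2), the following minimal Python sketch combines a candidate-generation step with rule-based entity selection. The generator here is a simple heuristic stand-in for a neural model, and the selection rules, names, and thresholds are illustrative assumptions rather than the project's actual implementation.

```python
# Minimal sketch of candidate generation followed by rule-based selection.
# The generator, rules, and thresholds are illustrative assumptions only.
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str      # surface form of the candidate span
    start: int     # character offset in the document
    end: int
    label: str     # candidate type proposed by the generator, e.g. "PER"
    score: float   # generator confidence


def generate_candidates(text: str) -> list[Candidate]:
    """Stand-in for a neural candidate generator.

    In a hybrid pipeline this step would be a fine-tuned token-classification
    model that over-generates entity candidates with confidence scores; here a
    capitalized-word heuristic keeps the sketch self-contained.
    """
    candidates = []
    for m in re.finditer(r"\b[A-ZÄÖÜ][a-zäöüß]+(?:\s+[A-ZÄÖÜ][a-zäöüß]+)*\b", text):
        candidates.append(Candidate(m.group(), m.start(), m.end(), "PER", 0.5))
    return candidates


# Learned (or hand-curated) selection rules: each rule inspects a candidate in
# its textual context and either confirms or rejects it.
TITLE_CUES = ("Herr", "Frau", "Dr.", "Mag.")


def select_entities(text: str, candidates: list[Candidate]) -> list[Candidate]:
    selected = []
    for c in candidates:
        left_context = text[max(0, c.start - 12):c.start]
        # Rule 1: accept person candidates preceded by a salutation or title.
        if any(cue in left_context for cue in TITLE_CUES):
            selected.append(c)
        # Rule 2: accept high-confidence candidates regardless of context.
        elif c.score >= 0.9:
            selected.append(c)
        # Otherwise the candidate is rejected.
    return selected


if __name__ == "__main__":
    doc = "Die Beschwerde von Dr. Maria Musterfrau wurde abgewiesen."
    for e in select_entities(doc, generate_candidates(doc)):
        print(e.label, e.text, e.start, e.end)
```

Because each accepted or rejected candidate can be traced back to a specific rule, such a pipeline keeps the neural component's recall while preserving the explainability and auditability of the rule-based selection step.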
The concept of anonymization raises important questions in the field of legal research. New EU legislation such as the Data Act and the Data Governance Act relies on the GDPR’s concept of anonymization without settling the open questions, thus creating the need for a practical and legally secure anonymization strategy. Unresolved questions also remain around intellectual property law in connection with the use of data for training AI models, and within the European legal framework of the AI Act, e.g. regarding the research exemption or the risk classification of AI systems.
The flexible, trainable, and trustworthy NER architecture to be developed in CLEAR will have an impact on various KIRAS research fields concerning data governance, as well as on several other use cases, including digital forensics and the fight against cybercrime.
Project Lead
Doris Ipsmiller, m2n – consulting and development gmbh
Partners
Bundesministerium für Finanzen
Bundesministerium für Justiz
Republik Österreich Parlamentsdirektion
Universität Wien Institut für Innovation und Digitalisierung im Recht
Technische Universität Wien Institut für Information Systems Engineering
Contact
Doris Ipsmiller
m2n – consulting and development gmbh
Knagg 1, 3034 Maria Anzbach
Phone: +43 660 711987 2
office(at)m2n.at
www.m2n.at
