Large Language Models applied to the XML files from The National Archives

by Livio Robaldo

Departments
Zienkiewicz Institute for Modelling, Data and AI
Description
The UK is one of the few countries in the world whose official Publication Office (The National Archives, https://www.nationalarchives.gov.uk) has invested considerable effort in converting all legislation and case law into XML format. This XML encoding greatly simplifies pre-processing by making it far easier to retrieve and reorganize the textual content of individual sections or paragraphs within legislative and case law acts. By contrast, in most other countries legislative texts are available only as PDFs, which are notoriously difficult to process programmatically, making pre-processing far more time-consuming.

For this reason, the UK provides an ideal environment for initiating business activities in LegalTech, that is, the application of Artificial Intelligence to the legal domain. LegalTech primarily relies on Natural Language Processing (NLP), a core subfield of AI, since legal documents such as legislation, case law, and administrative records are inherently textual. Recent advances in Large Language Models (LLMs), and their impressive performance, have opened new research avenues for annotating and semantically interpreting legal texts. These techniques significantly enhance the quality and granularity of information extraction.

This project will develop an LLM-based methodology for processing legal documents in XML format from The National Archives, with the aim of extracting key information and reorganizing it into a format that is accessible, easy to understand, and supportive of learning, while maintaining traceability to the original official sources.

The choice of which specific information to extract and restructure during the project is open to discussion. Please contact Livio Robaldo (livio.robaldo@swansea.ac.uk) to agree on the focus. Possible directions include:
- Extraction and reorganization of regulative norms (obligations, permissions, and prohibitions) from legislative text.
- Extraction and reorganization of key legal concept definitions from legislative text.
- Extraction and reorganization of legal interpretations of legislative texts from case law.
- Extraction and reorganization of legal arguments from case law, used by lawyers to support specific interpretations.
- Extraction and reorganization of personal data from case law for anonymization purposes.
- Etc.

By the end of the project, the student will be familiar with the XML documents of The National Archives and able to work with them in the development of advanced legal services. This experience could even provide the foundation for a LegalTech start-up project (cf., for example, https://www.casesnappy.com).
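
As a rough illustration of why XML simplifies pre-processing and how traceability can be preserved, the sketch below splits a document into sections and builds one LLM prompt per section, carrying the section identifier along. It is only a minimal sketch under stated assumptions: the sample fragment, its element and attribute names (legislation, section, id, number, title, text), and the helper names extract_sections and build_prompt are hypothetical and do not reproduce the actual schema used by The National Archives. No LLM is called; the prompt would be passed to whichever model the project adopts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Element and attribute names are
# illustrative assumptions, not the real schema of The National Archives.
SAMPLE_XML = """
<legislation>
  <section id="section-1">
    <number>1</number>
    <title>Duty to register</title>
    <text>A controller must register with the authority.</text>
  </section>
  <section id="section-2">
    <number>2</number>
    <title>Exemptions</title>
    <text>A controller may apply for an exemption.</text>
  </section>
</legislation>
"""


def extract_sections(xml_string):
    """Return one dict per <section>, keeping its id for traceability."""
    root = ET.fromstring(xml_string)
    sections = []
    for sec in root.iter("section"):
        sections.append(
            {
                "id": sec.get("id"),
                "number": sec.findtext("number", default="").strip(),
                "title": sec.findtext("title", default="").strip(),
                "text": sec.findtext("text", default="").strip(),
            }
        )
    return sections


def build_prompt(section):
    """Build a prompt asking for regulative norms found in one section.

    The section id is embedded so that any extracted norm can be traced
    back to the original provision.
    """
    return (
        "Identify every obligation, permission, and prohibition in the "
        "provision below. Quote the source id with each item.\n"
        f"Source id: {section['id']}\n"
        f"Section {section['number']} ({section['title']}): {section['text']}"
    )


if __name__ == "__main__":
    for section in extract_sections(SAMPLE_XML):
        print(build_prompt(section))
        print("---")
```

Keeping the identifier inside both the extracted record and the prompt is one simple way to link each extracted item back to its official source; with PDF input, the same step would first require recovering section boundaries from layout.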
Preparation
- Basic knowledge of the XML format (see, e.g., https://www.w3schools.com/xml).
- Basic familiarity with working with Large Language Models (e.g., GPT, Claude, LLaMA, DeepSeek).


Level of Studies

Level 6 (Undergraduate Year 3) yes
Level 7 (Masters) yes
Level 8 (PhD) yes