Empowering Organizations: Unraveling the EU AI Act with a Custom LLM

By: Jacintha Walters, jacintha@babelfish.nl, 11-07-2023. The full article is available here.

This article explores the development of a large language model that is customised to serve as a resource for organizations seeking answers to their questions concerning the AIA. The goal of this article will be to create a proof of concept that examines how a custom large language model can be created to answer AIA-related questions and if this model adds value compared to existing general models like ChatGPT.

1. Introduction

The EU AI Act (AIA) is proposed legislation on AI models in the EU. The goal of this legislation is to strike a balance between promoting innovation and protecting fundamental rights and societal well-being. The AIA sets obligations for high-risk AI systems to ensure transparency, fairness and trustworthiness in the design, implementation, and use of AI systems [1].

This article is an extension of a previous paper on the extent to which organizations are prepared for the EU AIA in managing their functional requirements [2]. The previous paper creates a questionnaire based on the EU AI Act, for which fifteen responses were gathered. This data is analyzed to determine the biggest bottlenecks for organizations and how organization characteristics, such as size and sector, influence the degree of compliance. In addition, the 20 questions that were most prevalent among the respondents are listed in the paper.

The previous paper concludes that the overall readiness of organizations, as measured by an average score of 58% for all responses, indicates room for improvement to achieve compliance with the AIA. Through interviews, it becomes evident that organizations harbor numerous questions concerning the AIA, encompassing both theoretical concerns regarding the act's contents and practical considerations on the application of the AIA in their organization.

To address the needs of these organizations, this paper will extend the previous paper by examining a cutting-edge solution. A Large Language Model (LLM) will be specifically designed to guide organizations through the complexities of the AIA. This LLM takes the form of a question-and-answer system, catering to both theoretical and practical queries. The primary objective is to present a proof of concept that showcases the added value of employing a custom LLM, compared to using a general LLM like ChatGPT.

This paper delves into the development of the proof of concept. After stating the requirements for this project, a range of approaches is identified and weighed against the requirements. From this evaluation, a shortlist of approaches is drawn up, leading to different prototypes and an evaluation to determine the most promising solution.

To ensure that the proof of concept aligns with the context in which it will be used, the test set consists of the questions that were identified in the related study as most prevalent among the responses [2]. Testing the prototypes on these questions ensures that the developed proof of concept addresses the specific challenges organizations are dealing with.

2. Requirements for the prototypes

To ensure that the proof of concept fits the problem description, requirements are predefined and used to select an appropriate technique. The requirements for this project are based on quality metrics often used in the field of AI. For this project, the scope is narrowed to English only. This is done for practical reasons, since more and higher-quality pre-trained models exist for the English language.

  1. Relevant questions from the organization should be answered in correct English.
  2. The solution should be explainable so that the user understands where the information from the AI is coming from.
  3. Speed: a question should be answered in less than one minute. Otherwise, the end product won’t be user-friendly.
  4. Off the shelf: due to the time frame of this project, it should not be necessary to build the LLM from the ground up.
  5. Accuracy: a question should be answered accurately and in its entirety.
  6. Resource demand: the model should not be expensive to run since this would mean it is not scalable.
  7. The solutions should be easily integrated into a web application so the proof of concept works for a full product.
  8. The solution should answer questions about the content of the AIA, but it should also give advice to organizations on how to improve their compliance.

3. Longlist

To create a longlist of libraries that align with the project's requirements, libraries are considered that are relevant to the project and worth exploring to determine if they meet the requirements. The libraries chosen for examination are those frequently mentioned in blogs and papers when searching for a "Custom LLM based on my own data." However, it's important to note that not all relevant libraries are examined due to the fast-paced nature of this field, where new libraries are constantly emerging. The goal of this paper is not necessarily to find the best library for this purpose, but to examine if these solutions can add any value to the problem as described. For this, it is important to find solutions that fit the context in order to draw the right conclusions.

The following libraries were included in the longlist:

  • OpenAI GPT-3:
    • SearchWithOpenAI
    • Llama Index
  • BERT:
    • Google Vertex AI
    • FastChat
    • XTuring
    • Databricks (Dolly)

An essential prerequisite for a library to be considered usable is its availability as an off-the-shelf solution. To assess the learning curve associated with each library, a one-hour time box test was conducted, allowing each library a limited timeframe to make progress on the project. After the hour, an evaluation was conducted based on the requirements. Not all requirements could be examined within the one-hour time frame. The results of the one-hour time box test are listed in Figure 1.

Figure 1 - Results of the one-hour time box test

4. Shortlist

Based on the results of the one-hour time box test, a shortlist is compiled. This shortlist contains two libraries. The first is SearchWithOpenAI. This library uses the OpenAI LLM API and uses embeddings to set up the AIA context [3]. The second library is Llama Index. This library also uses the OpenAI API to create a data framework for LLM applications [4]. Both libraries build on similar dependencies, such as Langchain, OpenAI, Tiktoken and PyPDF.
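To make the idea of using embeddings to set up the AIA context more concrete, the following is a minimal sketch of retrieval by similarity. It is not the code of SearchWithOpenAI or Llama Index. The `embed` function is a toy bag-of-words stand-in for a real embedding model, the passages in `CHUNKS` are placeholders rather than text from the AIA, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, chunks: list[str], k: int = 1) -> str:
    """Put the retrieved chunks in front of the question as context."""
    context = "\n".join(top_chunks(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Placeholder passages (assumption): real chunks would come from the AIA documents.
CHUNKS = [
    "placeholder passage about human oversight of AI systems",
    "placeholder passage about technical documentation requirements",
    "placeholder passage about data governance and training data",
]

QUESTION = "What does the AIA mean by human oversight?"
print(build_prompt(QUESTION, CHUNKS))
```

In a real setup the prompt would be sent to the LLM, and the retrieved chunk could also be shown to the user as the source of the answer.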

In the next section, both libraries will be compared. It will become clear that, though similar, the two models generate different answers and have specific advantages and disadvantages. In the future, testing a more diverse range of libraries would be worthwhile. For now, the focus is solely on finding the best model that matches the requirements. To determine if the custom LLMs that result from these libraries add any value compared to existing general LLMs, ChatGPT is added to the comparison as a baseline. The code for both fine-tuned models can be found in Appendix A.

5. Method for comparing the custom LLMs

In this section, the method for comparing the three models will be discussed. Three models will be compared: SearchWithOpenAI, Llama Index and ChatGPT. First, the test set is created on which the models will be tested. Second, the training data is constructed based on existing material. Then the results of the models are compared using three custom metrics.

5.1 Testset - interview questions

This section develops the test set of questions on which all three models will be tested. This test set will be based on questions that many organizations have concerning the AIA, as identified in the related study [2]. These questions were distilled from interviews with organizations. This was done by interpreting common mistakes in the questionnaires. For instance, when asked what data risks organizations have dealt with in the past years, almost all organizations gave an answer relating to GDPR compliance. In reality, there are many other risks besides GDPR compliance that the AIA is concerned with, so the question would be, ‘What other risks besides data privacy should my organization be concerned with?’. Two sets of questions are made, one that focuses on the substantive questions regarding AIA itself (which are called Core questions), and one that focuses on the practical/advice questions (which are called Process questions).

Questions on the content of the AIA (Core):

  1. Should technical documentation also be written for non-technical people?
  2. Does the AIA stipulate that we need someone to monitor the AI models full-time?
  3. Does the AIA require me to work with encrypted data only?
  4. How should we deal with missing data according to the AIA?
  5. What other data risks besides data privacy should my organization be concerned with?
  6. What does the AIA mean by high-risk AI?
  7. Does the AIA require an external audit?
  8. Which documents should be included in the compliance documentation?
  9. Does the AIA mention metrics that should be used to determine a model’s risks for rights and discrimination?
  10. What does the AIA mean by ‘human oversight’?

Questions on the application of the AIA (Process):

  1. To which extent does my ISO certification help towards AIA compliance?
  2. Does GDPR training also include data bias and model bias training?
  3. What are the biggest risks to AIA compliance when data is gathered in-house?
  4. Our organization uses data from customers; what are some of the biggest risks when aiming for AIA compliance?
  5. We only use ChatGPT and other out-of-the-box AI models; should we still be concerned with the AIA?
  6. What can we do to improve AIA compliance concerning our technical documentation?
  7. We currently don’t communicate anything about our models with our users; how can we better communicate information with the users for AIA compliance?
  8. Our organization is very small, and no one is specialized in compliance; where do we even begin to achieve AIA compliance?
  9. We currently have no idea if we communicate with our stakeholders according to the AIA; how should we assess this to make improvements?
  10. The AIA stipulates that accuracy should be according to the state of the art. This seems very vague; how should I go about achieving state-of-the-art accuracy?

5.2 Trainset - clusters

Both models will be fine-tuned using a trainset. This set will consist of documents relevant to the AIA. The choice of documents used for fine-tuning will be crucial to generating accurate answers. However, due to the time frame of this project, only two different sets will be tested. Each set is called a cluster, and the two clusters are chosen as follows:

  1. The EU core text of the AIA, together with annexes.
  2. Cluster 1, plus:
    • Other EU sources on the AIA approach and opinions of various committees, such as the European Economic and Social Committee.
    • Other relevant sources in the press and scientific community (published papers, some of which have been peer-reviewed).

The goal of these two sets will be to determine if adding external sources is beneficial for the LLM. This will serve as a foundation on which further iterations could be performed. The documents of both clusters can be found in Appendix B.

5.3 Generating responses

Using the trainset, four models have been fine-tuned, one for each combination of library and cluster. ChatGPT is added as a fifth model for a baseline comparison. To generate responses, each model is asked to generate a response for each question in the test set. ChatGPT did require a preamble to introduce the questions. For this, the following text is used:

5.4 Scoring the responses

To determine if the custom LLMs add value compared to the baseline model ChatGPT, the answers must be scored in some way. Scoring the answers with an AI model, for instance on semantics and the presence of certain words, was considered. However, it was decided that manually scoring the answers would be better. Rating the quality of a response with AI is very hard, since the situation requires a lot of nuance. The semantics and words can be right, but through small misinterpretations the overall implications of the answer might still be slightly off. Besides this, an AI would compare each answer to the ‘perfect’ answer, while many of the questions in the test set do not have one perfect answer. This is especially true for the process questions. To demonstrate, one process question is as follows: ‘What can we do to improve AIA compliance concerning our technical documentation?’. This question can be answered by listing a number of things that the AIA requires, for instance, documenting which libraries are used. However, it can also be answered by stating that the organization must perform a self-assessment of its technical documentation. Both are valid answers, though completely different. For this reason, it was decided to manually score the answers, ensuring a transparent comparison between the generated responses.

**5.4.1 Introducing scoring metrics**

To score the generated responses, a response to all questions was first written by the author of this paper. This is to show the reader what a correct answer would be and to help the researcher judge the correctness of the generated responses. Based on the acquired knowledge of the official documents regarding the AI Act, it is assumed that the author is capable of answering these questions. However, note that the author is an IT specialist and not a legal expert. The answers from the author, including references, can be found in Appendix D.

As mentioned, a question can have different responses that are all valid. For this reason, when judging the responses, the response is not matched with the ideal answer per se, but it is compared to the author’s response using three metrics: implications (I), correctness (c1), and completeness (c2). Here, implications focuses on whether the response leads to a different action than the author’s answer. However, implications alone are not enough. Sometimes, the response comes to the right conclusion but uses the wrong information to support this conclusion. For instance, when asked if the AIA mentions metrics to determine the risks on rights and discrimination, one LLM answered ‘No, the AIA does not mention any specific metrics that should be used to determine a model's risks for rights and discrimination. However, it does mention the risks of discrimination/manipulation, profiling practices, and automated decision making.’ Though the conclusion is correct, the AIA does not mention any of the things that are listed.

That’s why correctness is also added as a metric. This refers to whether or not the arguments that are given for the conclusion are correct and in line with the act. Finally, completeness is added because some answers lack relevant arguments.

To increase the reproducibility of this project, more details will be given on the scoring process. Each generated answer is scored by reading the author’s answer and then going through the metrics. The following scales were used during scoring:

  • Implications, for this score, only the first sentence is used, which gives the conclusion of the answer.
    1. The implication is completely wrong
    2. The implication is somewhat right
    3. The implication is right
  • Correctness focuses on the text following the first sentence. It can be the case that the conclusion is wrong, but the given information is still correct.
    1. The information that is given is wrong
    2. Some of the information is wrong
    3. The information is correct
  • Completeness, this means all the relevant information is in the answer. Sometimes the answer is technically correct but not very usable by an organization. In this case, this will also have negative results for the completeness.
    1. A lot of crucial information is missing
    2. Some crucial information is missing
    3. All crucial information is in the answer
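As an illustration of how these scales could be recorded during manual scoring, the sketch below stores one score per metric and rejects values outside the 1 to 3 range. The class name `Score` and its fields are hypothetical, and the example values are invented for illustration rather than taken from Appendix C.

```python
from dataclasses import dataclass

SCALE = (1, 2, 3)  # 1 = wrong or missing, 2 = partly right, 3 = right or complete


@dataclass(frozen=True)
class Score:
    """Manual score of one generated answer on the three metrics."""
    implications: int  # I: judged on the first sentence (the conclusion)
    correctness: int   # c1: judged on the text after the first sentence
    completeness: int  # c2: is all crucial information present?

    def __post_init__(self):
        # Reject anything that is not on the three-point scale.
        for name in ("implications", "correctness", "completeness"):
            value = getattr(self, name)
            if value not in SCALE:
                raise ValueError(f"{name} must be 1, 2 or 3, got {value}")

    def total(self) -> int:
        """Sum of the three metric scores (between 3 and 9)."""
        return self.implications + self.correctness + self.completeness


# Hypothetical example: right conclusion, wrong supporting information,
# and some crucial information missing.
example = Score(implications=3, correctness=1, completeness=2)
print(example.total())
```

Keeping the three metrics separate, rather than storing only the total, makes it possible to compare models per metric afterwards.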

To summarize the method for comparing the different LLMs, an overview is given in Figure 2. This figure depicts the process of fine-tuning using the two libraries and the two clusters, followed by the scoring of all five models on the two question sets using the three metrics.

Figure 2 - Method for comparing the Custom LLMs

The generated responses of each model along with the score for each metric can be found in Appendix C.

6. Analyzing the results

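| Question set | SearchWithOpenAI (SWOAI) | Llama Index (Llama) | ChatGPT (CGPT) |
| --- | --- | --- | --- |
| Cluster 1 - core | 69% | 73% | 77% |
| Cluster 1 - process | 81% | 87% | 78% |
| Average cluster 1 | 75% | 80% | 78% |
| Cluster 2 - core | 84% | 74% | 77% |
| Cluster 2 - process | 87% | 90% | 78% |
| Average cluster 2 | 86% | 82% | 78% |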

The above table shows the results of the scoring process in %. This percentage is calculated by determining how many points are scored out of nine (three points for each of the three metrics). Note that ChatGPT is the same in both clusters since this model is not fine-tuned. In the first cluster, ChatGPT outperforms the other models on the core questions, meaning ChatGPT responded better to the substantive questions on the AIA. However, the fine-tuned models outperformed ChatGPT on the process questions, meaning the fine-tuned models are better at answering questions related to the practical implementation of the AIA. In cluster 2, external documents are added to the fine-tuning trainset. This improves the score of SearchWithOpenAI significantly and improves the score of the Llama Index model slightly. ChatGPT still outperforms Llama Index on the core questions, but Llama Index is the best of all models at answering the process questions. SearchWithOpenAI outperforms ChatGPT in answering both the core questions and the process questions. An important advantage of SearchWithOpenAI to note here is that this model also gives the reference where a certain answer was found in the documents, as seen in Figure 3. This makes the model explainable to users and makes it easy to validate whether an answer is correct. The reference is not correct 100% of the time, but in most generated answers it was relevant.
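The percentage calculation described above can be sketched as follows. The function names are hypothetical and the example scores are invented for illustration. The `cluster_average` function assumes that the average rows are the mean of the core and process percentages rounded half up; the article does not state this, but under that assumption the rows of the table are reproduced.

```python
def percentage(scores: list[tuple[int, int, int]]) -> float:
    """Percent of the maximum points for a list of (I, c1, c2) scores.

    Each metric is scored 1 to 3, so every answer can earn at most nine points.
    """
    max_points = 9 * len(scores)
    points = sum(sum(score) for score in scores)
    return 100 * points / max_points


def cluster_average(core_pct: float, process_pct: float) -> int:
    """Mean of the two question-set percentages, rounded half up (assumption)."""
    return int((core_pct + process_pct) / 2 + 0.5)


# Invented scores for two answers, for illustration only.
print(percentage([(3, 3, 3), (3, 2, 1)]))

# Values from the table: SWOAI cluster 1, ChatGPT, SWOAI cluster 2.
print(cluster_average(69, 81), cluster_average(77, 78), cluster_average(84, 87))
```

Because every metric scores at least one point, the lowest percentage an answer can reach under this reading is about 33%, so the percentages are best read relative to each other rather than on an absolute 0 to 100 scale.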

Figure 3 - The generated answer with reference.

ChatGPT will never mention where it got its information from and, when asked, often gives a wrong reference or gives a text that is not in the cited document. Llama Index will give a citation at random, but these citations are either non-existent or they do not match the context of the question.

Figure 4 - Results of each cluster and question set for the three models and metrics - CGPT is the same in cluster 1 and 2 since it is not fine-tuned.

Figure 4 depicts the average score for each model and metric. Cluster 2 outperforms cluster 1 on all metrics of the fine-tuned models except for a slight decrease in the correctness metric of SWOAI. It is also noteworthy that the cluster 1 models generally score higher in correctness than in implications and completeness. This means the information that was given was often correct, but sometimes the wrong conclusion was drawn and crucial information was missing. In cluster 2, the balance between the implications and the correctness metric is much better, meaning that correct information was given and the right conclusion was drawn. However, all the models, especially ChatGPT, often missed crucial information.

7. Conclusion

The goal of this article was two-fold. First, to determine if the custom-trained LLMs would add value compared to a general language model such as ChatGPT. The results show that when fine-tuning on the AI Act, related documents from the EU, and external research, the models outperform ChatGPT on average, and SearchWithOpenAI does so for both substantive and process questions. This shows that customizing LLMs can add value compared to using a general LLM. Secondly, this article has created a foundation on which further research can be performed. The project indicates that the SearchWithOpenAI model gives promising results and is especially relevant for this scenario because it can, in most cases, refer to the relevant parts of the literature on which the answer is based. This is crucial for this application because it ensures that even if the answer is sometimes wrong or not useful, organizations can easily validate it themselves or find where in the literature the answer might be.

Using LLMs to answer the questions that organizations have seems very promising. The AIA itself gives a lot of answers but requires a lot of time to read carefully. On top of that, external research helps to implement the AIA into the organization and to clear up any ambiguity from the act. But reading all this material can be a time-consuming task. If LLMs can help organizations to digest this information and find the answers they are looking for quickly, this can help organizations to comply with the AIA faster.

8. Future Developments

This project has identified some aspects that would be relevant for further iterations:

  • Test more clusters: as mentioned, determining which documents are added to the fine-tuning set can be crucial for achieving accurate answers. In this project, two clusters are used, which have shown that adding external documents to the set helps to achieve more relevant responses. The next iteration would be to find external research on different topics and to create more clusters that can be tested and scored in the same way. Research should be categorised into ‘core’ and ‘process’ research. It would be interesting to understand how these different papers influence the LLM and which yield the best result.
  • Personalize the LLM: thinking outside-the-box, future research can focus on personalising the LLM to the organization. The responses to the current questions are very broad because the model does not know which organization is asking. In the next iteration, personas can be developed for different organizations and include documents of the organization in the trainset (for instance, the information on the organization’s website). Then by asking the same questions, it could be measured if a relevant answer was given for each persona.
  • Testing the LLM in real-life: in this project, the models were tested in context by gathering frequently asked questions from real organizations. Ideally, if the LLM has been developed for a few more iterations, the model should be tested in a real-life scenario. This can be done by going to an organization and asking them what their current obstacles are to understand and implement the AIA. Then by discussing the generated responses and getting feedback on the response, the LLM can be further fine-tuned.
  • Test a non-GPT-based model: all three models that were compared are based on OpenAI's GPT models. It would be interesting to see if a BERT-based model, for instance, would outperform the GPT-based models.

Bibliography

  1. “The Act”, The Artificial Intelligence Act, 9 September 2021. https://artificialintelligenceact.eu/the-act/
  2. Walters J, “Complying with the EU AI Act”, 5 July 2023.
  3. Ushakrishnan, “GitHub - ushakrishnan/SearchWithOpenAI: Quick start.”, GitHub. https://github.com/ushakrishnan/SearchWithOpenAI
  4. “LlamaIndex 🦙 0.7.4”. https://gpt-index.readthedocs.io/en/latest/index.html

Appendices

Appendices can be found in the full article (babelfish.nl/user/pages/03.srvcs/blog).
