By: Jacintha Walters, firstname.lastname@example.org, 11-07-2023. The full article is available here.
This article explores the development of a large language model customized to serve as a resource for organizations seeking answers to their questions about the AIA. The goal is to create a proof of concept that examines how a custom large language model can be built to answer AIA-related questions, and whether this model adds value compared to existing general models like ChatGPT.
The EU AI Act (AIA) is proposed EU legislation on AI models. Its goal is to strike a balance between promoting innovation and protecting fundamental rights and societal well-being. The AIA sets obligations for high-risk AI systems to ensure transparency, fairness, and trustworthiness in the design, implementation, and use of AI systems.
This article is an extension of a previous paper on the extent to which organizations are prepared for the EU AIA in managing their functional requirements. That paper created a questionnaire based on the EU AI Act, for which 15 responses were gathered. The data was analyzed to determine the biggest bottlenecks for organizations and how organization characteristics, such as size and sector, influence the degree of compliance. In addition, the paper lists the 20 questions that were most prevalent among the respondents.
The previous paper concludes that the overall readiness of organizations, measured at an average score of 58% across all responses, leaves room for improvement before compliance with the AIA is achieved. Interviews made it evident that organizations harbor numerous questions concerning the AIA, encompassing both theoretical concerns about the act's contents and practical considerations on applying the AIA in their organization.
To address the needs of these organizations, this paper will extend the previous paper by examining a cutting-edge solution. A Large Language Model (LLM) will be specifically designed to guide organizations through the complexities of the AIA. This LLM takes the form of a question-and-answer system, catering to both theoretical and practical queries. The primary objective is to present a proof of concept that showcases the added value of employing a custom LLM, compared to using a general LLM like ChatGPT.
This paper delves into the development of the proof of concept. After stating the requirements for this project, a range of approaches is identified and weighed against those requirements. From this evaluation, a shortlist of approaches is drawn up, leading to different prototypes that are evaluated to determine the most promising solution.
To ensure that the proof of concept aligns with the context in which it will be used, the test set is constructed from the questions identified in the related study as most prevalent among the respondents. By testing the prototypes on these questions, it is ensured that the developed proof of concept addresses the specific challenges organizations are dealing with.
To ensure that the proof of concept fits the problem description, requirements are predefined and used to select an appropriate technique. The requirements for this project are based on quality metrics often used in the field of AI. The scope is narrowed to English only, for practical reasons: more, and higher-quality, pre-trained models exist for the English language.
To create a longlist of libraries that align with the project's requirements, libraries were selected that are relevant to the project and worth exploring. The libraries chosen for examination are those frequently mentioned in blogs and papers when searching for a "custom LLM based on my own data". Note that not all relevant libraries could be examined, given the fast pace of this field, where new libraries constantly emerge. The goal of this paper is not necessarily to find the best library for this purpose, but to examine whether these solutions add value to the problem as described. For that, it is important to find solutions that fit the context, in order to draw the right conclusions.
The following libraries were included in the longlist:
An essential prerequisite for a library to be considered usable is its availability as an off-the-shelf solution. To assess the learning curve associated with each library, a one-hour timebox test was conducted: each library was given this limited timeframe to make progress on the project. After the hour, an evaluation was conducted against the requirements; not all requirements could be examined within the one-hour timeframe. The results of the one-hour timebox test are listed in Figure 1.
Figure 1 - Results of the one-hour time box test
Based on the results of the one-hour timebox test, a shortlist of two libraries was compiled. The first is SearchWithOpenAI, which uses the OpenAI LLM API and embeddings to set up the AIA context. The second is Llama Index, which also uses the OpenAI API to create a data framework for LLM applications. Both build on similar underlying libraries, such as LangChain, OpenAI, Tiktoken, and PyPDF.
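Both shortlisted libraries rest on the same idea: embed chunks of the AIA documents, then retrieve the chunks most similar to the question and hand them to the LLM as context. The following is a minimal, self-contained sketch of that retrieval step; the vectors are hand-written toy values, whereas the real libraries obtain them from an embedding model such as OpenAI's.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" of two document chunks. In SearchWithOpenAI or
# Llama Index these vectors come from an embedding model, not by hand.
chunks = {
    "Article 11 requires technical documentation for high-risk systems.": [0.9, 0.1, 0.2],
    "Article 52 covers transparency obligations for certain AI systems.": [0.1, 0.8, 0.3],
}

def retrieve(query_vec, k=1):
    """Return the k chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
    return ranked[:k]

# A query embedded near the first chunk retrieves the documentation article,
# which would then be passed to the LLM as context for the answer.
top = retrieve([0.85, 0.15, 0.1])
```

The retrieved chunk, rather than the whole act, is what the LLM sees, which is also what makes source references possible.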
In the next section, both libraries are compared. It will become clear that, though similar, the models generate different answers and have specific advantages and disadvantages. Testing a more diverse range of libraries would be worthwhile in the future; for now, the focus is solely on finding the model that best matches the requirements. To determine whether the custom LLMs resulting from these libraries add value compared to existing general LLMs, ChatGPT is added to the comparison as a baseline. The code for both fine-tuned models can be found in Appendix A.
In this section, the method for comparing the three models, SearchWithOpenAI, Llama Index, and ChatGPT, is discussed. First, the test set on which the models will be tested is created. Second, the training data is constructed from existing material. Finally, the results of the models are compared using three custom metrics.
This section develops the test set of questions on which all three models will be tested. The set is based on questions that many organizations have concerning the AIA, as identified in the related study. These questions were distilled from interviews with organizations by interpreting common mistakes in the questionnaires. For instance, when asked what data risks they had dealt with in past years, almost all organizations gave an answer relating to GDPR compliance. In reality, the AIA is concerned with many risks beyond GDPR compliance, so the corresponding question becomes: 'What other risks besides data privacy should my organization be concerned with?'. Two sets of questions were made: one focusing on substantive questions about the AIA itself (called Core questions) and one focusing on practical/advice questions (called Process questions).
Questions on the content of the AIA (Core):
Questions on the application of the AIA (Process):
Both models will be fine-tuned using a training set consisting of documents relevant to the AIA. Which documents are used for fine-tuning is crucial for generating accurate answers. However, given the time available for this project, only two different sets are tested. Each set is called a cluster, and the two clusters are chosen as follows:
- Cluster 1: the official documents of the AI Act itself.
- Cluster 2: Cluster 1, plus other EU sources on the AIA approach and opinions of various committees (such as the European Economic and Social Committee), plus other relevant sources from the press and scientific community (published papers, some of which have been peer-reviewed).
The goal of these two sets will be to determine if adding external sources is beneficial for the LLM. This will serve as a foundation on which further iterations could be performed. The documents of both clusters can be found in Appendix B.
Using the training set, four models were fine-tuned: one for each combination of library and cluster. ChatGPT is added as a fifth model for a baseline comparison. Each model is then asked to generate a response for every question in the test set. ChatGPT required a preamble to introduce the questions. For this, the following text is used:
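This response-generation step can be sketched as a small harness; `ask` below is a hypothetical stand-in for each model's actual API call (the real code is in Appendix A), and the preamble text is illustrative, not the exact wording used in the study.

```python
# Illustrative preamble, NOT the exact text used in the study.
PREAMBLE = ("The following questions concern the EU AI Act (AIA). "
            "Answer from the perspective of an organization seeking compliance.")

def ask(model, prompt):
    # Stub: a real implementation would call the model's API here
    # (SearchWithOpenAI, Llama Index, or the ChatGPT interface).
    return f"[{model}] answer to: {prompt}"

def run_test_set(models, questions):
    """Collect one answer per model per question; only ChatGPT gets the preamble."""
    results = {}
    for model in models:
        answers = []
        for q in questions:
            prompt = f"{PREAMBLE}\n\n{q}" if model == "ChatGPT" else q
            answers.append(ask(model, prompt))
        results[model] = answers
    return results

out = run_test_set(["ChatGPT", "SearchWithOpenAI"],
                   ["Is our chatbot high-risk under the AIA?"])
```

The fine-tuned models need no preamble because the AIA context is already baked into their document index.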
To determine whether the custom LLMs add value compared to the baseline model ChatGPT, the answers must be scored in some way. Scoring the answers automatically with an AI model, for instance based on semantics and certain keywords, was considered. However, manually scoring the answers was judged to be better: it is very hard to rate the quality of a response automatically, since the situation requires a lot of nuance. The semantics and words can be right, yet small misinterpretations can leave the overall implications of the answer slightly off.

Moreover, an AI scorer would compare each answer to one 'perfect' answer, while many questions in the test set, especially the process questions, do not have a single perfect answer. To demonstrate, one process question reads: 'What can we do to improve AIA compliance concerning our technical documentation?'. This can be answered by listing things the AIA requires, for instance documenting which libraries are used, but also by stating that the organization must perform a self-assessment of its technical documentation. Both are valid answers, though completely different. For this reason, the answers are scored manually, ensuring a transparent comparison between the generated responses.
**5.4.1 Introducing scoring metrics**
To score the generated responses, a response to every question was first written by the author of this paper. This shows the reader what a correct answer looks like and helps the researcher judge the correctness of the generated responses. Based on the acquired knowledge of the official documents regarding the AI Act, the author is assumed capable of answering these questions; note, however, that the author is an IT specialist and not a legal expert. The author's answers, including references, can be found in Appendix D.
As mentioned, a question can have several valid responses. For this reason, a generated response is not matched against one ideal answer per se, but compared to the author's response using three metrics: implications (I), correctness (c1), and completeness (c2). Implications focuses on whether the response leads to a different action than the author's answer. Implications alone are not enough, however: sometimes a response reaches the right conclusion but uses the wrong information to support it. For instance, when asked whether the AIA mentions metrics to determine the risks to rights and discrimination, one LLM answered: 'No, the AIA does not mention any specific metrics that should be used to determine a model's risks for rights and discrimination. However, it does mention the risks of discrimination/manipulation, profiling practices, and automated decision making.' Though the conclusion is correct, the AIA does not mention any of the things listed.
That is why correctness is added as a second metric: it refers to whether the arguments given for the conclusion are correct and in line with the act. Finally, completeness is added because some answers lack relevant arguments.
To increase the reproducibility of this project, more details will be given on the scoring process. Each generated answer is scored by reading the ideal answer and then going through the metrics. The following scales were used during scoring:
To summarize the method for comparing the different LLMs, an overview is given in Figure 2. The figure depicts the process of fine-tuning using the two libraries and the two clusters, and then scoring all five models on the two question sets using the three metrics.
Figure 2 - Method for comparing the Custom LLMs
The generated responses of each model along with the score for each metric can be found in Appendix C.
| | SearchWithOpenAI | Llama Index | ChatGPT |
|---|---|---|---|
| Cluster 1 - core | 69% | 73% | 77% |
| Cluster 1 - process | 81% | 87% | 78% |
| Average cluster 1 | 75% | 80% | 78% |
| Cluster 2 - core | 84% | 74% | 77% |
| Cluster 2 - process | 87% | 90% | 78% |
| Average cluster 2 | 86% | 82% | 78% |
The table above shows the results of the scoring process in percent, calculated as the number of points scored out of nine (up to three points for each of the three metrics). Note that ChatGPT is the same in both clusters, since this model is not fine-tuned.

In the first cluster, ChatGPT outperforms the other models on the core questions, meaning it responded better to the substantive questions on the AIA. However, the fine-tuned models outperformed ChatGPT on the process questions, meaning they are better at answering questions related to the practical implementation of the AIA. In cluster 2, external documents are added to the fine-tuning training set. This improves the score of SearchWithOpenAI significantly and the score of the Llama Index model slightly. ChatGPT still outperforms Llama Index on the core questions, but Llama Index is the best of all models at answering the process questions. SearchWithOpenAI outperforms ChatGPT on both the core and the process questions.

An important advantage of SearchWithOpenAI is that it also gives the reference where an answer was found in the documents, as seen in Figure 3. This makes the model very explainable to users and makes it easy to validate whether an answer is correct. The reference is not correct 100% of the time, but in most generated answers it was relevant.
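As a sanity check on the reported numbers, the per-cluster averages can be recomputed from the core and process rows, assuming each average is the plain mean of the two question sets rounded to a whole percent:

```python
# (core %, process %) pairs per column of the results table, in table order.
rows = {
    "cluster 1": [(69, 81), (73, 87), (77, 78)],
    "cluster 2": [(84, 87), (74, 90), (77, 78)],
}

averages = {
    name: [round((core + process) / 2) for core, process in pairs]
    for name, pairs in rows.items()
}
# Recovers the reported averages (75/80/78 and 86/82/78), up to rounding.
```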
Figure 3 - The generated answer with reference.
ChatGPT never mentions where it got its information from and, when asked, often gives a wrong reference or quotes text that is not in the cited document. Llama Index gives citations seemingly at random; these citations are either non-existent or do not match the context of the question.
Figure 4 - Results of each cluster and question set for the three models and metrics - CGPT is the same in cluster 1 and 2 since it is not fine-tuned.
Figure 4 depicts the average score for each model and metric. Cluster 2 outperforms cluster 1 on all metrics of the fine-tuned models, except for a slight decrease in the correctness metric of SearchWithOpenAI. It is also noteworthy that the cluster 1 models generally score higher on correctness than on implications and completeness: the information given was often correct, but sometimes the wrong conclusion was drawn and crucial information was missing. In cluster 2, the balance between the implications and correctness metrics is much better, meaning correct information was given and the right conclusion was drawn. However, all models, especially ChatGPT, often missed crucial information.
The goal of this article was twofold. First, to determine whether custom-trained LLMs add value compared to a general language model such as ChatGPT. The results show that when fine-tuning on the AI Act, related documents from the EU, and external research, the models outperform ChatGPT on both substantive and process questions. This shows that customizing LLMs can add value compared to using a general LLM. Second, this article has created a foundation on which further research can be performed. The project shows that SearchWithOpenAI delivers promising results and is especially relevant for this scenario because it can accurately refer to the parts of the literature on which an answer is based. This is crucial for this application: even if an answer is sometimes wrong or not useful, organizations can easily validate it themselves or find where in the literature the answer might be.
Using LLMs to answer the questions that organizations have seems very promising. The AIA itself gives a lot of answers but requires a lot of time to read carefully. On top of that, external research helps to implement the AIA in the organization and to clear up ambiguity in the act, but reading all this material is time-consuming. If LLMs can help organizations digest this information and quickly find the answers they are looking for, this can help organizations comply with the AIA faster.
This project has identified some aspects that would be relevant for further iterations:
Appendices can be found in the full article, located at babelfish.nl/user/pages/03.srvcs/blog.