I'm really pleased to introduce today Malte Pietsch. He's going to talk about connecting GPT with your data and how to avoid hallucination, which is something that we obviously talked about in the Q&A about an hour ago. Malte is the co-founder and CTO at deepset, where he builds Haystack and deepset Cloud to enable developers all over the world to use NLP effectively in their business applications. Previously, he conducted NLP research at Carnegie Mellon University and was a data scientist for multiple startups. He's been crafting NLP applications for all kinds of businesses for more than eight years now and is convinced that the development workflow and user orientation are key criteria for successful NLP projects. So Malte, please take it away.

Thanks, John. Thanks for the kind intro. Yeah, I'm super happy to be here today and share a few of our thoughts on LLMs. Welcome to the session. I think we are all aware that LLMs are the future, right? And that they have actually already unlocked many exciting use cases; we heard about a few in the sessions earlier today. Still, I personally feel that there are sometimes two realities out there. There's this crazy buzz world, when I read my Twitter feed, for example, where everything seems possible and so easy.
And then there's the enterprise reality, when I talk to engineers and product managers out there in the industry. So if we look at the title of this whole event today, my talk will be more on the pitfalls side of things, and I want to share some of our learnings, some of the challenges we saw with our enterprise customers when building LLM applications for production use cases. To give this whole session a useful and positive twist, I will of course not only share pitfalls but also some pragmatic tips to actually overcome them.

Over the next 20 minutes, I will give you an overview of the most common challenges when building LLM applications, then highlight one key method that helps a lot to overcome some of these challenges, and then walk you through some additional smaller tips on what you could actually do to optimize things even further.

So let's get started with the most common challenges we observed with our customers when building LLM applications. Number one: most of these high-value use cases we see in the enterprise involve some kind of internal company data, that is, data that is not available on the public internet.
It's often from a special domain: it could be, for example, aircraft maintenance documents, legal contracts, or financial reports. And the typical question that comes up then is: how can we tune the model to our data? How can we teach it about our internal knowledge that it doesn't know yet?

Challenge number two is actually closely related to the first one. When our use case relies on this internal data, how can we make sure we stay in control of it? It must be safe, it must be secure. Maybe our company policy doesn't allow sharing this kind of data with third parties.

Number three: if you overcome these first two challenges, the next big question is typically around quality assurance once you want to go to production. I think we all saw these models hallucinating, simply making up answers. And if your application is being exposed to customers, for example, this could be extremely damaging to your brand. So how do you know if it's safe enough to deploy? How can you assess this quality?

I could probably talk about each of those challenges for hours. Today I just want to focus on challenges one and three and get the key ideas across there. So let's start with the basics: what actually is a hallucination, and why does it happen?
Well, LLMs are often wrong with their answers and, even worse, they are confident about it. Some of you might remember this bank that used to exist, Silicon Valley Bank. If you ask an LLM whether Silicon Valley Bank collapsed, you can easily get an answer like this: no, it did not collapse, it's a successful financial services company, and so on. This is what we call hallucination: the model is simply making up a fake reality.

So let's try something else. What if we ask: why did Silicon Valley Bank collapse? All of a sudden the model switches its opinion; now SVB actually collapsed. Great. It's even giving us some arguments for why the bank failed. But wait: it's mentioning something about subprime mortgages and the financial crisis of 2008. I may be very bad at remembering dates and timelines, but even I'm pretty sure that the SVB collapse happened this year, not back in 2008. And it was actually related to rising interest rates and a bank run, not subprime mortgages.

So this kind of behavior is what causes problems when developing an LLM application. The model is not stable in its opinion, it gives wrong answers, and it's often very eloquent in its argumentation; it can sound very convincing.
This makes it really hard to spot these kinds of hallucinations as a user.

Now, where do these hallucinations come from? Well, in this case it's probably rather easy. SVB happened this year, and the LLM that I asked here only saw training data up to September 2021, so its internal knowledge is simply outdated. It could most likely be fixed by training another model on more recent web data.

Okay, cool. So are hallucinations only a problem when we ask about these kinds of recent events? Unfortunately not.

So let's ask GPT about something from 2018. Let's ask about the number of Model 3 vehicles that Tesla produced in the first quarter of that year. The answer that we get here sounds convincing; it's even mentioning a source, the shareholder letter. However, if you open the shareholder letter, you find that this number is unfortunately completely wrong. Instead, 9,700 vehicles of that type were produced.

So why did the model hallucinate here? Either it did not see any information about this at all at training time, which is quite unlikely in this case as it explicitly mentions this shareholder letter, or, the more likely option in this case:
The model has actually seen some related information but mixed it up. It's not confident enough, maybe it hasn't seen enough examples of it, so it's guessing.

And indeed, if you look through the shareholder letter, this number of 34,000 cars that the model brought up here is the total number of vehicles produced in that quarter, not for the Model 3.

So now that we know what hallucinations are, how can we actually reduce them to ensure quality in production? And let's not forget about our other challenge: how can we teach the LLM new information for use cases about this internal company data? It turns out there's one key method that can actually really help with both of these challenges. It's called retrieval-augmented generation. Let me explain the key idea of this method.

Let's first see how an answer gets generated by an LLM in the standard way. First, you have your question. This is typically converted to a prompt, which still contains your original question but also maybe some optional text around it to instruct the model what to do, or maybe in what style it should respond. This prompt is then sent to an LLM, and this generates the answer that we then see. Simple.
Now let's look at what changes with this approach of retrieval-augmented generation. We still have our question here on the left, and we still have a prompt that gets fed to our LLM. However, what we do now is insert some more information into our prompt, which you see in green here. We want to provide the LLM with some more useful information that helps the model answer our question in a really truthful way. So in our example of Silicon Valley Bank, this could be news articles or analyst reports that explain this collapse of the bank.

Of course, we need some automatic way of searching for and inserting these relevant documents in here; we can't do this manually. So what we do now is connect our pipeline to some external database and use a so-called retriever model to find a few relevant documents that we can then insert into our prompt. Think of it as search: first find relevant pieces of information, then insert them into our prompt.

And when we do that, we can actually solve several problems at once. First of all, we reduce hallucinations, as the model will now ground its answers on some actual information: the documents.
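The retrieve-then-prompt flow just described can be sketched in a few lines of plain Python. This is only an illustration: the keyword-overlap retriever and the stubbed `generate` function are placeholders for whatever retriever model, document store, and LLM you actually use (a framework like Haystack provides these as real pipeline components).

```python
# Minimal sketch of retrieval-augmented generation (RAG):
# retrieve relevant documents, insert them into the prompt, call the LLM.
# The keyword-overlap retriever and the stubbed LLM call are placeholders.

DOCUMENTS = [
    "Silicon Valley Bank collapsed in March 2023 after a bank run.",
    "Rising interest rates hurt the value of SVB's bond portfolio.",
    "Tesla's Q1 2018 shareholder letter reports Model 3 production numbers.",
]

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Insert the retrieved documents into the prompt to ground the answer."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer the question using only the documents below. "
        "If they don't contain the answer, say \"I don't know\".\n"
        f"Documents:\n{context_block}\n"
        f"Question: {question}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Stub standing in for the real LLM call (OpenAI, Haystack, etc.)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

question = "Why did Silicon Valley Bank collapse?"
docs = retrieve(question, DOCUMENTS)
answer = generate(build_prompt(question, docs))
```

In a real pipeline the retriever would be a BM25 or embedding model over a document store and `generate` would call an actual LLM, but the structure stays the same: search first, then insert the results into the prompt.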
You can also now teach the LLM new information, so that it stays up to date and also becomes aware of the private company data that we might have in this database.

Last but not least, it helps with explainability and verifiability for users. As a user, I can now very easily browse the documents behind my generated answer, similar to what we heard earlier in the talk about Bing, and I can verify that these answers make sense and where they come from.

So let's now have a quick live demo and see this in action. This is a small demo that we put together. It's on Hugging Face Spaces, so all of you can actually access it. It's using the Haystack open-source framework under the hood, and the code is also available here in the files. What you can do here now is basically ask questions around the Silicon Valley Bank collapse. So we already have one example here, "Did SVB collapse?", and then we see basically two different answers: one using plain GPT and one using retrieval augmentation. You can also switch the dataset that is used: you can either use a static news dataset with some articles, or you can do a live web search to augment your prompt. So let's maybe try one of these queries and run it.
And yeah, here we see again a plain answer with a hallucination, and down here we should hopefully see, yes, the actual answer: SVB collapsed due to a bank run caused by VCs and founders withdrawing their funds, due to headwinds from continued higher interest rates. So that seems much more truthful, and we can even browse the source, the text behind it that was used to generate the result.

So have a look at it, play around with it; it's out there. And you can not only play around with it: there's also a link in the slides to the code for the demo, and a tutorial that actually walks you through how you can build this easily with open-source code yourself. A basic pipeline like this should probably take you about an hour to build yourself.

So now that you've learned about one key method, retrieval-augmented generation, what can we do on top to have less hallucination? We can actually do a lot. Let me use the remaining few minutes to share some quick tips and general directions that you can then further explore yourself, or we can discuss them in our Q&A session in the breakout room.

So my first tip would be: invest some time in optimizing your prompts.
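To make that tip concrete, here are two illustrative prompt templates in the spirit of the advice that follows (allowing the model to say "I don't know", and reframing question answering as summarization). These are my own sketches, not the exact prompts shown on the talk's slides.

```python
# Illustrative prompt templates for reducing hallucination.
# These are sketches in the spirit of the talk, not its exact prompts.

# 1) Allow the model to say "I don't know" instead of guessing.
GROUNDED_QA_TEMPLATE = """\
Answer the question based only on the documents provided.
If the documents do not contain the answer, reply exactly: I don't know.

Documents:
{documents}

Question: {question}
Answer:"""

# 2) Reframe question answering as summarization of the documents.
SUMMARIZATION_TEMPLATE = """\
Summarize what the documents below say about: {question}
Only include information that appears in the documents.

Documents:
{documents}

Summary:"""

def render(template: str, documents: list[str], question: str) -> str:
    """Fill a template with numbered retrieved documents and the question."""
    doc_text = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return template.format(documents=doc_text, question=question)

prompt = render(
    GROUNDED_QA_TEMPLATE,
    documents=["SVB collapsed in March 2023 after a bank run."],
    question="Did Silicon Valley Bank collapse?",
)
```

The exact wording matters less than the pattern: state the grounding rule explicitly, and give the model a sanctioned way out instead of forcing it to guess.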
A few things that we found particularly helpful here in practice: allowing the model to say "I don't know". Often you'd rather return nothing than a wrong answer. It depends on your use case, of course, but in many use cases you'd prefer this. And you can simply instruct your model to do so. One example prompt you see here on the slide in green; that's basically the instruction that the model should say "I don't know" when the answer is not actually grounded in the documents.

Two things that are less obvious but work astonishingly well in practice: you can instruct the model with a different task than the one you actually have in mind. So, for example, formulate it like a summarization task rather than a question-answering task, as in our example here. You can also formulate the document context as the opinion of a person. These are nice little tricks to ground the model even more in the context that you provide, in the documents that we give it. There's a nice recent paper.
I linked it here as well; it's from Zhou et al., where they show that this opinion-based approach actually reduces hallucination quite a bit. The summarization-task trick is based on our own experience in a recent customer project, where we made the observation that you generally get less hallucination in summarization tasks, and we kind of hijacked this for Q&A tasks as well. My intuition here would be that this is probably related to how these models were trained, with reinforcement learning from human feedback: maybe the labelers were kind of strict for summarization tasks, or labeled slightly differently in a way that penalizes hallucinations more.

So there's much more you can do, of course, but that was prompt engineering at a glance. Let's now spend a few more minutes on other useful directions. What else can we do beyond prompt engineering?

Well, so far we've optimized the input to the LLM; if you look at this chart here, we basically just adjusted what goes into the LLM. Is there maybe also something you can do after the LLM, when we get back the generated answer? Yes, for sure, there is, and one of these helpful things that you can do there is so-called self-reflection.
This is basically the idea of letting the model reflect on its own generated response, and giving the model a chance to correct it. Let's look at a basic example.

Let's ask the LLM who the CEO of Twitter is. As we know, this has changed rather recently. We also supply directly the information that is actually needed to answer our question: we let the model know that Elon Musk is the actual owner and CEO, just as you would do with a retrieval-augmented generation approach. If we look now at the response, it's a bit disappointing. The model ignored the extra information we provided. It warns us about its internal knowledge cutoff, which is nice, okay, but it did not really answer our question. So what can we do to improve this answer?

We can actually ask a follow-up question. In this simple example, we can just ask whether the answer really reflects the information we shared with the model, and as you can see, it helps: the next response basically contains what we're after; the model corrects itself. Now, we could of course ask these follow-up questions manually as a user, but we can also automate this in our application. If you automate it, this is typically what you call self-reflection: you would add a step in your pipeline
that asks these kinds of questions automatically.

Okay, so now we've talked a lot about practical ways of reducing hallucinations. But how do we know that we actually improved? How can we be sure that the level of hallucination is acceptable for our production application?

Well, ideally you would have a way of detecting and measuring hallucination, just like any other metric we have for machine learning models. Unfortunately, it turns out that this is not easy. It's, I would say, probably one of the biggest unsolved problems in the generative AI space, and a super active area of research. It's not that there are no methods out there; there are actually quite a few. Just, honestly, I haven't seen them working very well yet. They are developing.

You can categorize the approaches that are out there into three buckets, and in each of these buckets we see great advances and things are developing. So let me walk you quickly through these buckets and the different approaches. First, there are statistical metrics. Think of them as classical machine-learning metrics, like F1, ROUGE, or METEOR: very easy to compute, relying on some kind of statistical formula.
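As a small illustration of this statistical bucket, a token-level F1 score between a generated answer and a reference answer can be computed in a few lines. This is a simplified sketch; real evaluations would use proper ROUGE or METEOR implementations with normalization and stemming.

```python
# Token-level F1 between a generated answer and a reference answer,
# a simplified example of the "statistical metrics" bucket.
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "svb collapsed due to a bank run",
    "svb collapsed after a bank run caused by withdrawals",
)
```

Metrics like this are trivial to compute at scale, which is exactly their appeal; their weakness, as discussed next, is how loosely they track human judgment of hallucination.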
We benchmarked a few of them recently. Unfortunately, they are not working very well for detecting hallucinations. The main problem: they don't correlate well with human judgment, so a human would flag different answers as hallucinations than the metric does.

Which brings me to my next point: collecting human feedback is currently probably the most precise thing you can do. You can ask some users to check the predictions of your prototype pipeline and tag all the hallucinations. This of course works; you get some metric out of it, and with the right tooling it doesn't take too much time either. We actually do this quite a lot with our customers. However, there's quite some manual effort involved there, right? Maybe you can do it once or twice before going to production, but what then? How can you continuously monitor hallucinations in your live application? So somehow we still really want an automated way of detecting hallucinations. And this is where the third approach out there fits in, and it is quite promising: model-based detection of hallucination.
This approach lets you detect hallucinations automatically by just using another machine-learning model that is specialized for this task. The only problem: so far there's no really good model specialized for it out there, and that's why we decided to train such a model ourselves. So let me give you a very quick sneak preview of what we're working on right now.

The idea is straightforward: the model that we train takes two inputs, the generated answer and some ground-truth data that it can compare it to. And in the case of retrieval-augmented generation, we actually have this for free: we have our retrieved documents, which already contain the base information, and we can use them as a kind of ground-truth data. So the model gets those two text inputs and then classifies whether the generated answer from the LLM is actually grounded in the retrieved documents or not, and it returns a score between 0 and 1 that we call the faithfulness score.

And with that score, you can now automatically evaluate every answer our model generates and, depending on your use case, route questions or answers differently, add a self-reflection loop, or do whatever else you want with this answer. There's more still in the making on our side, but rest assured, it's coming.
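To show where such a score would sit in a pipeline, here is a toy sketch. The word-overlap scorer below is only a crude stand-in (deepset's actual detector is a trained classifier, and it isn't released yet); the point is the routing step that gates answers on a faithfulness threshold.

```python
# Toy sketch of model-based hallucination detection: score how "faithful"
# a generated answer is to the retrieved documents, then route on the score.
# The word-overlap scorer is a crude stand-in for a trained classifier.

def faithfulness_score(answer: str, documents: list[str]) -> float:
    """Fraction of answer words that appear in the retrieved documents."""
    answer_words = answer.lower().split()
    doc_words = set(" ".join(documents).lower().split())
    if not answer_words:
        return 0.0
    grounded = sum(1 for w in answer_words if w in doc_words)
    return grounded / len(answer_words)

def route(answer: str, documents: list[str], threshold: float = 0.5) -> str:
    """Return the answer if it looks grounded, otherwise flag it."""
    if faithfulness_score(answer, documents) >= threshold:
        return answer
    return "I don't know (answer failed the faithfulness check)."

docs = ["svb collapsed after a bank run driven by rising interest rates"]
good = route("svb collapsed after a bank run", docs)
bad = route("svb is a successful financial services company", docs)
```

Instead of rejecting low-scoring answers outright, the same gate could trigger a self-reflection loop or escalate to a human reviewer, as mentioned in the talk.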
Once we have a model ready here, we will open-source it, so stay tuned.

And with that, I'm also already coming to the end of this short session. I hope you saw that the threat of hallucination is real, and that it's not only about outdated information. We also saw how retrieval-augmented generation can actually help with reducing these hallucinations, but also with tailoring your responses towards your own data, and I shared a few further directions and tips on what you can do.

Last but not least, I also want to stress that this not only holds for simple pipelines with a single LLM call, but also for more complex systems like agents, with often dozens of LLM calls. To make these agents actually robust, it's even more important to reduce hallucinations, as they stack on top of each other.

Yeah, with that: thank you.