I'm really pleased to introduce today Malte Pietsch. He's going to talk about connecting GPT with your data and how to avoid hallucination, which is something that we obviously talked about in the Q&A about an hour ago. Malte is the co-founder and CTO at deepset, where he builds Haystack and deepset Cloud to enable developers all over the world to use NLP effectively in their business applications. Previously, he conducted NLP research at Carnegie Mellon University and was a data scientist for multiple startups. He's been crafting NLP applications for all kinds of businesses for more than eight years now and is convinced that the development workflow and user orientation are key criteria for successful NLP projects. So Malte, please take it away.

Thanks, John. Thanks for the kind intro. Yeah, I'm super happy to be here today and share a few of our thoughts on LLMs. Welcome to the session. I think we are all aware that LLMs are the future, right? And that they have actually already unlocked many exciting use cases; we heard about a few in the sessions earlier today. Still, I personally feel that there are sometimes two realities out there. There's this crazy buzz world, when I read my Twitter feed, for example, where everything seems possible and so easy.
And then there's the enterprise reality, when I talk to engineers and product managers out there in the industry. So if we look at the title of this whole event today, my talk will be more on the pitfalls side of things, and I want to share some of our learnings, some of the challenges we saw with our enterprise customers when building LLM applications for production use cases. To give this whole session a useful and positive twist, I will of course not only share pitfalls but also some pragmatic tips to actually overcome them.

Over the next 20 minutes, I will give you an overview of the most common challenges when building LLM applications, then highlight one key method that helps a lot to overcome some of these challenges, and then walk you through some additional smaller tips on what you could actually do to optimize things even further.

So let's get started with the most common challenges we observed with our customers when building LLM applications. Number one: most of these high-value use cases we see in the enterprise involve some kind of internal company data, that is, data that is not available on the public internet.
It's often from a special domain: it could be, for example, aircraft maintenance documents, legal contracts, or financial reports. And the typical question that comes up then is: how can we tune the model to our data? How can we teach it about our internal knowledge that it doesn't know yet?

Challenge number two is actually closely related to the first one. When our use case relies on this internal data, how can we make sure we stay in control of it? It must be safe, it must be secure. Maybe our company policy doesn't allow sharing this kind of data with third parties.

Number three: if you overcome these first two challenges, the next big question is typically around quality assurance once you want to go to production. I think we all saw these models hallucinating, simply making up answers. And if your application is being exposed to customers, for example, this could be extremely damaging to your brand. So how do you know if it's safe enough to deploy? How can you assess this quality?

I could probably talk about each of those challenges for hours. Today I just want to focus on challenges one and three and get the key ideas across there. So let's start with the basics: what actually is a hallucination, and why does it happen?
Well, LLMs are often wrong with their answers and, even worse, they are confident about it. Some of you might remember this bank that used to exist, Silicon Valley Bank. If you ask an LLM whether Silicon Valley Bank collapsed, you can easily get an answer like this: no, it did not collapse, it's a successful financial services company, and so on. This is what we call hallucination: the model is simply making up a fake reality.

So let's try something else. What if we ask: why did Silicon Valley Bank collapse? All of a sudden the model switches its opinion; now SVB actually collapsed. Great. It's even giving us some arguments for why the bank failed. But wait: it's mentioning something about subprime mortgages and the financial crisis of 2008. I may be very bad at remembering dates and timelines, but even I'm pretty sure that the SVB collapse happened this year, not back in 2008. And it was actually related to rising interest rates and a bank run, not subprime mortgages.

So this kind of behavior is what causes problems when developing an LLM application. The model is not stable in its opinion, it gives wrong answers, and it's often very eloquent in its argumentation; it can sound very convincing.
This makes it really hard to spot these kinds of hallucinations as a user.

Now, where do these hallucinations come from? Well, in this case it's probably rather easy. SVB happened this year, and the LLM that I asked here only saw training data up to September 2021, so its internal knowledge is simply outdated. It could most likely be fixed by training another model on more recent web data.

Okay, cool. So are hallucinations only a problem when we ask about these kinds of recent events? Unfortunately not.

So let's ask GPT about something from 2018. Let's ask about the number of Model 3 vehicles that Tesla produced in the first quarter of that year. The answer that we get here sounds convincing; it's even mentioning a source, the shareholder letter. However, if you open the shareholder letter, you find that this number is unfortunately completely wrong. Instead, 9,700 vehicles of that type were produced.

So why did the model hallucinate here? Either it did not see any information about this at all at training time, which is quite unlikely in this case as it explicitly mentions this shareholder letter, or, the more likely option in this case:
The model has actually seen some related information but mixed it up. It's not confident enough, maybe it hasn't seen enough examples of it, so it's guessing.

And indeed, if you look through the shareholder letter, this number of 34,000 cars that the model brought up here is the total number of vehicles produced in that quarter, not for the Model 3.

So now that we know what hallucinations are, how can we actually reduce them to ensure quality in production? And let's not forget about our other challenge: how can we teach the LLM new information for use cases about this internal company data? It turns out there's one key method that can actually really help with both of these challenges. It's called retrieval-augmented generation. Let me explain the key idea of this method.

Let's first see how an answer gets generated by an LLM in the standard way. First, you have your question. This is typically converted to a prompt, which still contains your original question but also maybe some optional text around it to instruct the model what to do, or maybe in what style it should respond. This prompt is then sent to an LLM, and this generates the answer that we then see. Simple.
Now let's look at what changes with this approach of retrieval-augmented generation. We still have our question here on the left, and we still have a prompt that gets fed to our LLM. However, what we do now is insert some more information into our prompt, which you see in green here. We want to provide the LLM with some more useful information that helps the model answer our question in a really truthful way. So in our example of Silicon Valley Bank, this could be news articles or analyst reports that explain this collapse of the bank.

Of course, we need some automatic way of searching for and inserting these relevant documents in here; we can't do this manually. So what we do now is connect our pipeline to some external database and use a so-called retriever model to find a few relevant documents that we can then insert into our prompt. Think of it as search: first find relevant pieces of information, then insert them into our prompt.

And when we do that, we can actually solve several problems at once. First of all, we reduce hallucinations, as the model will now ground its answers on some actual information: the documents.
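The retrieve-then-prompt flow just described can be sketched in a few lines of plain Python. This is only an illustration: the keyword-overlap retriever and the stubbed `generate` function are placeholders for whatever retriever model, document store, and LLM you actually use (a framework like Haystack provides these as real pipeline components).

```python
# Minimal sketch of retrieval-augmented generation (RAG):
# retrieve relevant documents, insert them into the prompt, call the LLM.
# The keyword-overlap retriever and the stubbed LLM call are placeholders.

DOCUMENTS = [
    "Silicon Valley Bank collapsed in March 2023 after a bank run.",
    "Rising interest rates hurt the value of SVB's bond portfolio.",
    "Tesla's Q1 2018 shareholder letter reports Model 3 production numbers.",
]

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Insert the retrieved documents into the prompt to ground the answer."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer the question using only the documents below. "
        "If they don't contain the answer, say \"I don't know\".\n"
        f"Documents:\n{context_block}\n"
        f"Question: {question}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Stub standing in for the real LLM call (OpenAI, Haystack, etc.)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

question = "Why did Silicon Valley Bank collapse?"
docs = retrieve(question, DOCUMENTS)
answer = generate(build_prompt(question, docs))
```

In a real pipeline the retriever would be a BM25 or embedding model over a document store and `generate` would call an actual LLM, but the structure stays the same: search first, then insert the results into the prompt.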
You can also now teach the LLM new information, so that it stays up to date and also becomes aware of the private company data that we might have in this database.

Last but not least, it helps with explainability and verifiability for users. As a user, I can now very easily browse the documents behind my generated answer, similar to what we heard earlier in the talk about Bing, and I can verify that these answers make sense and where they come from.

So let's now have a quick live demo and see this in action. This is a small demo that we put together. It's on Hugging Face Spaces, so all of you can actually access it. It's using the Haystack open-source framework under the hood, and the code is also available here in the files. What you can do here now is basically ask questions around the Silicon Valley Bank collapse. So we already have one example here, "Did SVB collapse?", and then we see basically two different answers: one using plain GPT and one using retrieval augmentation. You can also switch the dataset that is used: you can either use a static news dataset with some articles, or you can do a live web search to augment your prompt. So let's maybe try one of these queries and run it.
And yeah, here we see again a plain answer with a hallucination, and down here we should hopefully see, yes, the actual answer: SVB collapsed due to a bank run caused by VCs and founders withdrawing their funds, due to headwinds from continued higher interest rates. So that seems much more truthful, and we can even browse the source, the text behind it that was used to generate the result.

So have a look at it, play around with it; it's out there. And you can not only play around with it: there's also a link in the slides to the code for the demo, and a tutorial that actually walks you through how you can build this easily with open-source code yourself. A basic pipeline like this should probably take you about an hour to build yourself.

So now that you've learned about one key method, retrieval-augmented generation, what can we do on top to have less hallucination? We can actually do a lot. Let me use the remaining few minutes to share some quick tips and general directions that you can then further explore yourself, or we can discuss them in our Q&A session in the breakout room.

So my first tip would be: invest some time in optimizing your prompts.
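To make that tip concrete, here are two illustrative prompt templates in the spirit of the advice that follows (allowing the model to say "I don't know", and reframing question answering as summarization). These are my own sketches, not the exact prompts shown on the talk's slides.

```python
# Illustrative prompt templates for reducing hallucination.
# These are sketches in the spirit of the talk, not its exact prompts.

# 1) Allow the model to say "I don't know" instead of guessing.
GROUNDED_QA_TEMPLATE = """\
Answer the question based only on the documents provided.
If the documents do not contain the answer, reply exactly: I don't know.

Documents:
{documents}

Question: {question}
Answer:"""

# 2) Reframe question answering as summarization of the documents.
SUMMARIZATION_TEMPLATE = """\
Summarize what the documents below say about: {question}
Only include information that appears in the documents.

Documents:
{documents}

Summary:"""

def render(template: str, documents: list[str], question: str) -> str:
    """Fill a template with numbered retrieved documents and the question."""
    doc_text = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return template.format(documents=doc_text, question=question)

prompt = render(
    GROUNDED_QA_TEMPLATE,
    documents=["SVB collapsed in March 2023 after a bank run."],
    question="Did Silicon Valley Bank collapse?",
)
```

The exact wording matters less than the pattern: state the grounding rule explicitly, and give the model a sanctioned way out instead of forcing it to guess.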
A few things that we found particularly helpful here in practice: allowing the model to say "I don't know". Often you'd rather return nothing than a wrong answer. It depends on your use case, of course, but in many use cases you'd prefer this. And you can simply instruct your model to do so. One example prompt you see here on the slide in green; that's basically the instruction that the model should say "I don't know" when the answer is not actually grounded in the documents.

Two things that are less obvious but work astonishingly well in practice: you can instruct the model with a different task than the one you actually have in mind. So, for example, formulate it like a summarization task rather than a question-answering task, as in our example here. You can also formulate the document context as the opinion of a person. These are nice little tricks to ground the model even more in the context that you provide, in the documents that we give it. There's a nice recent paper.
I linked it here as well; it's from Zhou et al., where they show that this opinion-based approach actually reduces hallucination quite a bit. The summarization-task trick is based on our own experience in a recent customer project, where we made the observation that you generally get less hallucination in summarization tasks, and we kind of hijacked this for Q&A tasks as well. My intuition here would be that this is probably related to how these models were trained, with reinforcement learning from human feedback: maybe the labelers were kind of strict for summarization tasks, or labeled slightly differently in a way that penalizes hallucinations more.

So there's much more you can do, of course, but that was prompt engineering at a glance. Let's now spend a few more minutes on other useful directions. What else can we do beyond prompt engineering?

Well, so far we've optimized the input to the LLM; if you look at this chart here, we basically just adjusted what goes into the LLM. Is there maybe also something you can do after the LLM, when we get back the generated answer? Yes, for sure, there is, and one of these helpful things that you can do there is so-called self-reflection.
This is basically the idea of letting the model reflect on its own generated response, and giving the model a chance to correct it. Let's look at a basic example.

Let's ask the LLM who the CEO of Twitter is. As we know, this has changed rather recently. We also supply directly the information that is actually needed to answer our question: we let the model know that Elon Musk is the actual owner and CEO, just as you would do with a retrieval-augmented generation approach. If we look now at the response, it's a bit disappointing. The model ignored the extra information we provided. It warns us about its internal knowledge cutoff, which is nice, okay, but it did not really answer our question. So what can we do to improve this answer?

We can actually ask a follow-up question. In this simple example, we can just ask whether the answer really reflects the information we shared with the model, and as you can see, it helps: the next response basically contains what we're after; the model corrects itself. Now, we could of course ask these follow-up questions manually as a user, but we can also automate this in our application. If you automate it, this is typically what you call self-reflection: you would add a step in your pipeline
that asks these kinds of questions automatically.

Okay, so now we've talked a lot about practical ways of reducing hallucinations. But how do we know that we actually improved? How can we be sure that the level of hallucination is acceptable for our production application?

Well, ideally you would have a way of detecting and measuring hallucination, just like any other metric we have for machine learning models. Unfortunately, it turns out that this is not easy. It's, I would say, probably one of the biggest unsolved problems in the generative AI space, and a super active area of research. It's not that there are no methods out there; there are actually quite a few. Just, honestly, I haven't seen them working very well yet. They are developing.

You can categorize the approaches that are out there into three buckets, and in each of these buckets we see great advances and things are developing. So let me walk you quickly through these buckets and the different approaches. First, there are statistical metrics. Think of them as classical machine-learning metrics, like F1, ROUGE, or METEOR: very easy to compute, relying on some kind of statistical formula.
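As a small illustration of this statistical bucket, a token-level F1 score between a generated answer and a reference answer can be computed in a few lines. This is a simplified sketch; real evaluations would use proper ROUGE or METEOR implementations with normalization and stemming.

```python
# Token-level F1 between a generated answer and a reference answer,
# a simplified example of the "statistical metrics" bucket.
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "svb collapsed due to a bank run",
    "svb collapsed after a bank run caused by withdrawals",
)
```

Metrics like this are trivial to compute at scale, which is exactly their appeal; their weakness, as discussed next, is how loosely they track human judgment of hallucination.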
We benchmarked a few of them recently. Unfortunately, they are not working very well for detecting hallucinations. The main problem: they don't correlate well with human judgment, so a human would flag different answers as hallucinations than the metric does.

Which brings me to my next point: collecting human feedback is currently probably the most precise thing you can do. You can ask some users to check the predictions of your prototype pipeline and tag all the hallucinations. This of course works; you get some metric out of it, and with the right tooling it doesn't take too much time either. We actually do this quite a lot with our customers. However, there's quite some manual effort involved there, right? Maybe you can do it once or twice before going to production, but what then? How can you continuously monitor hallucinations in your live application? So somehow we still really want an automated way of detecting hallucinations. And this is where the third approach out there fits in, and it is quite promising: model-based detection of hallucination.
This approach lets you detect hallucinations automatically by just using another machine-learning model that is specialized for this task. The only problem: so far there's no really good model specialized for it out there, and that's why we decided to train such a model ourselves. So let me give you a very quick sneak preview of what we're working on right now.

The idea is straightforward: the model that we train takes two inputs, the generated answer and some ground-truth data that it can compare it to. And in the case of retrieval-augmented generation, we actually have this for free: we have our retrieved documents, which already contain the base information, and we can use them as a kind of ground-truth data. So the model gets those two text inputs and then classifies whether the generated answer from the LLM is actually grounded in the retrieved documents or not, and it returns a score between 0 and 1 that we call the faithfulness score.

And with that score, you can now automatically evaluate every answer our model generates and, depending on your use case, route questions or answers differently, add a self-reflection loop, or do whatever else you want with this answer. There's more still in the making on our side, but rest assured, it's coming.
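To show where such a score would sit in a pipeline, here is a toy sketch. The word-overlap scorer below is only a crude stand-in (deepset's actual detector is a trained classifier, and it isn't released yet); the point is the routing step that gates answers on a faithfulness threshold.

```python
# Toy sketch of model-based hallucination detection: score how "faithful"
# a generated answer is to the retrieved documents, then route on the score.
# The word-overlap scorer is a crude stand-in for a trained classifier.

def faithfulness_score(answer: str, documents: list[str]) -> float:
    """Fraction of answer words that appear in the retrieved documents."""
    answer_words = answer.lower().split()
    doc_words = set(" ".join(documents).lower().split())
    if not answer_words:
        return 0.0
    grounded = sum(1 for w in answer_words if w in doc_words)
    return grounded / len(answer_words)

def route(answer: str, documents: list[str], threshold: float = 0.5) -> str:
    """Return the answer if it looks grounded, otherwise flag it."""
    if faithfulness_score(answer, documents) >= threshold:
        return answer
    return "I don't know (answer failed the faithfulness check)."

docs = ["svb collapsed after a bank run driven by rising interest rates"]
good = route("svb collapsed after a bank run", docs)
bad = route("svb is a successful financial services company", docs)
```

Instead of rejecting low-scoring answers outright, the same gate could trigger a self-reflection loop or escalate to a human reviewer, as mentioned in the talk.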
Once we have a model ready here, we will open-source it, so stay tuned.

And with that, I'm also already coming to the end of this short session. I hope you saw that the threat of hallucination is real, and that it's not only about outdated information. We also saw how retrieval-augmented generation can actually help with reducing these hallucinations, but also with tailoring your responses towards your own data, and I shared a few further directions and tips on what you can do.

Last but not least, I also want to stress that this not only holds for simple pipelines with a single LLM call, but also for more complex systems like agents, with often dozens of LLM calls. To make these agents actually robust, it's even more important to reduce hallucinations, as they stack on top of each other.

Yeah, with that: thank you.