So another important tool in the quiver of tools that a data scientist has is Apache Iceberg. Sam Redai is a senior software engineer on the experimentation platform at Netflix. Sam, you're responsible for telling me all the amazing things that I should or shouldn't be watching, I'm guessing, and obviously you have a background beyond just media, in medical research, hospitals, and so on. We've talked a lot about data. Maybe you choose an algorithm you think is right, but ultimately it's garbage in, garbage out, so data quality is going to drive whether or not your models are successful. Why don't you talk a little bit about how we test data quality at scale?

Awesome, thanks a lot for the introduction. I'm excited to be here, and I'm happy to talk to everyone about Apache Iceberg, so we can jump right into it. I'm going to cover something called the write-audit-publish pattern, which is a general pattern, but one that's particularly well implemented via Apache Iceberg.

A high-level overview of what I'm planning to cover in the talk: I'm going to talk about data quality, as in, what exactly does data quality mean?
In the modern era of data, what are some common patterns that are inspired by the goal of achieving data quality? Then I'm going to cover Apache Iceberg, just a high-level introduction on what it is, and specifically its integrated audits feature, which I think makes it very easy to implement this exact pattern at very large data scale. And I'm going to talk a little bit about the hard part, which is automating this feature via your orchestration system, so that people can use the write-audit-publish pattern in a much more declarative way.

So let's start with data quality: what exactly is data quality? If you look at Wikipedia, it says that people's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. Furthermore, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular purpose. And this is my favorite sentence: defining data quality in a sentence is difficult. In reality, this is what we're trying to avoid. Data quality really means that the consumers of our data don't lose trust in what is actually contained in the data that we're delivering.

All right, so the big question then is: how can I get people to trust my data? Here are just a couple of solutions; anyone who's worked with data has tried one or more of these at some point. You can write your data to production and leave it to your consumers to run validations.

Another solution is that you write it somewhere else. You have, say, a test data warehouse that you write everything to, that sort of no one knows about, and you audit your data there in private before you rerun everything to move it to production.

And then some people are clever: they write data quality metrics and ship those along with their data, so that all consumers can go look at the metrics and investigate them for whatever their definition of data quality is; they can frame it in the context of the metrics you provide. And maybe you use some combination of these three options.

Some of you are really expert data engineers, and you look at this and say: oh, it's not that complicated, I have my own mechanism here. And many data engineers do; many data engineers have really fantastic bespoke data quality solutions that work very well.
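That third approach, shipping data quality metrics along with the data, can be sketched in a few lines of plain Python. Everything here (the row shape, the metric names, the fields) is invented for illustration rather than taken from any particular tool:

```python
# Illustrative sketch: compute simple quality metrics for a batch of rows
# and publish them alongside the data, so consumers can judge fitness for
# their own purpose. The row shape and metric names are made up.
def quality_metrics(rows, required_fields):
    """Return row count and per-field null rates for a batch of dict rows."""
    total = len(rows)
    null_rates = {}
    for field in required_fields:
        missing = sum(1 for r in rows if r.get(field) is None)
        null_rates[field] = missing / total if total else 0.0
    return {"row_count": total, "null_rates": null_rates}

batch = [
    {"user_id": 1, "title": "A"},
    {"user_id": 2, "title": None},
    {"user_id": None, "title": "C"},
]
metrics = quality_metrics(batch, ["user_id", "title"])
# Shipped next to the batch, e.g. as a sidecar table or file, this tells
# consumers: 3 rows, with a one-third null rate on each required field.
```

The point of the pattern is only that the metrics travel with the data; what you actually measure is up to each team.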
And so the big thing that I want to cover is that when you're working with data at massive scale, copying data is essentially not an option. That's the big thing that knocks out a lot of the solutions that maybe were valid a long time ago. Once you reach the scale of data that pretty much exists at any organization now, you can't really copy the data, and so you try to figure out: how can I maintain data quality, how can I assure data quality, without having to duplicate my data separately just to run my audits?

So I want to talk about Apache Iceberg's integrated audits feature, and just to start at a high level: what is Apache Iceberg, for anyone who's never heard of it? It's a high-performance format for huge analytic tables; that's sort of the long description from the Iceberg doc site. The description I usually give to people is that Iceberg provides massive-scale, cloud-native SQL tables, and it's accessible by many compute engines. And that list of compute engines is continuing to grow; over the past year the growth has been extraordinary, both in the different compute engines that can access Iceberg tables and in the data resting in Iceberg tables.
So, just a high-level overview of what the integrated audits feature is. The big core of it really is that it allows you to write your data to production in an unpublished state. And by unpublished, I mean downstream consumers can't see that data when they just query the table directly; when they write a SQL query to select from the table, they won't actually see the data that you've written in its unpublished state.

The other thing it does (and a lot of this is going to be specific to the Spark implementation) is that this integrated audits feature lives in the core Iceberg layer, so any compute engine can actually add support for it. There are only a few now, but the raw material does exist there to implement it in any compute engine. In Spark in particular, the way it's implemented is that this spark.wap.id value from your Spark session tags the unpublished snapshot. Iceberg has this concept of snapshots, and when you write data unpublished, that snapshot is tagged with this write-audit-publish ID from your Spark session. Time travel is a big feature in Iceberg: it lets you select any historical snapshot explicitly. That's a core feature, not specific to integrated audits but core to Iceberg, and it's what you use to time travel to select this unpublished data.

And then, when you have confidence in your data, Iceberg has a metadata-only cherry-pick operation, which lets you take the difference between this unpublished data and the current head of your table that people have access to, cherry-pick the metadata, and make a new current snapshot. It's the equivalent of making those unpublished changes published. Another one of the best parts is that Iceberg snapshot expiration cleans up data that's unused, meaning not referenced by the current state of the table. So if you have a weekly or monthly snapshot expiration process, any data that you don't end up cherry-picking, any data that remains unpublished, is automatically cleaned up. So this is an overview of the core features of Iceberg that really enable everything I'm going to talk about in the next slides.

So let's cover the three stages: write, audit, publish, and what each entails. For write: this write.wap.enabled setting is actually a table property.
By default, WAP is not enabled on Iceberg tables, but you can enable it simply by setting this table property to true on your table, and you only need to do it once; you don't need to do it every time you run a write-audit-publish session.

The second piece is that in the Spark configuration for the job that you're running, you just need to set this UUID, the WAP ID, in the Spark session configuration. Once that ID is set for the entire Spark session you're running in, it's a signal to Iceberg that this run is going to follow the write-audit-publish pattern. And then the best part is, you just run your production ETL code. That sounds scary to a lot of people, but once you see this enough and use it enough, you'll gain the confidence that Iceberg won't publish your data, because you have a WAP ID set in your Spark session configuration. So you run your production ETL code as-is, you insert into your production table, and you change nothing about your production code except setting the WAP ID in your Spark session configuration.
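As a concrete sketch, those two settings look roughly like this. The statements are built as plain strings so the example runs anywhere; the table name is made up, while write.wap.enabled and spark.wap.id are the Iceberg table property and Spark session config being described:

```python
# Sketch of the write-audit-publish setup, assuming Spark with an Iceberg
# catalog. The table name "prod.db.events" is illustrative.
import uuid

def wap_setup(table: str) -> dict:
    """Build the one-time table property DDL and the per-session conf pair."""
    wap_id = str(uuid.uuid4())  # in practice, usually the orchestrator's run ID
    return {
        # One-time table property: it persists on the table, so it does not
        # need to be re-set for every write-audit-publish session.
        "enable_wap_sql": (
            f"ALTER TABLE {table} SET TBLPROPERTIES ('write.wap.enabled'='true')"
        ),
        # Per-session Spark conf: with this set, writes to the table are
        # staged as an unpublished snapshot tagged with this ID.
        "session_conf": ("spark.wap.id", wap_id),
    }

setup = wap_setup("prod.db.events")
# In a real job you would run:
#   spark.sql(setup["enable_wap_sql"])        # once per table
#   spark.conf.set(*setup["session_conf"])    # once per WAP session
# ...and then run the production ETL unchanged, e.g. INSERT INTO prod.db.events.
```
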
For the auditing, you simply have to find the snapshot ID on the production table that's tagged with the WAP ID that was set in your Spark session configuration. That tag is essentially a pointer that says: okay, this data is unpublished, so I can't access it by just querying the table; I have to actually use the time travel feature to select a different snapshot, the unpublished one. The WAP ID is the link that lets you look up which snapshot was created by your production write job. And then you can perform these validations against the data using any data auditing tool. Anything that has support for Iceberg tables and can use time travel, you can use for auditing. So a Spark-based auditing tool, a Trino-based auditing tool, a Flink-based auditing tool: anything that can select a specific snapshot of an Iceberg table, you can utilize for your audits.

And then what happens after you've finished your audits and you want to publish? Well, what if your audits fail? You can just go back to the drawing board.
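To make that audit-side lookup concrete: the snapshot rows are mocked as plain dicts here, but in Spark they would come from the table's snapshots metadata table, where Iceberg records the session ID under the wap.id summary key:

```python
# Sketch: find the unpublished snapshot whose summary carries this session's
# WAP ID, then time-travel to it with whatever engine runs your audits.
def find_staged_snapshot(snapshots, wap_id):
    """Return the snapshot_id tagged with wap_id, or None if not found."""
    for snap in snapshots:
        if snap["summary"].get("wap.id") == wap_id:
            return snap["snapshot_id"]
    return None

snapshots = [
    {"snapshot_id": 100, "summary": {}},                    # published head
    {"snapshot_id": 101, "summary": {"wap.id": "run-42"}},  # staged by our job
]
staged_id = find_staged_snapshot(snapshots, "run-42")  # -> 101
# With the ID in hand, any engine with Iceberg time travel can audit the
# staged data, e.g. in Spark SQL:
#   SELECT count(*) FROM prod.db.events VERSION AS OF 101
```
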
Remember, there's the automatic snapshot expiration cleanup that's part of the maintenance of an Iceberg warehouse, so you can be assured that any of this unpublished data will be picked up automatically and cleaned out. If you have a number of sessions that fail and data lying around, you don't have to concern yourself with it; that will all get cleaned up. And if your audits pass, then to publish that unpublished data you run the cherry-pick operation on the specific snapshot ID that was tagged with that write-audit-publish session. This is a metadata-only operation, so it's super fast. This goes back to the cardinal sin of copying data: when you're working with data at scale, it's just not feasible; when you have jobs that take 56 hours to run, you're moving massive amounts of data. The great part is that the publish here is just a cherry-pick. It takes the metadata of the unpublished snapshot, looks at the current head of the table, and cherry-picks those changes to be available. It's a very fast operation, and immediately all of your data is published and available to downstream consumers.

So it might feel like this, right?
You're seeing all these steps and you're saying: okay, I have to set this Spark WAP session ID, I have to look it up when I want to audit, and I have to make sure that once those audits complete, based on some conditional, I publish or don't publish. So let's zoom out on what this feature is aiming to do. Iceberg is a data quality enabler; this is the high-level view of what it's trying to solve with this particular feature. Here's a gross simplification of a data engineer's pipeline: you have some data source, you have an ingestion pipeline that you run your code in, you do some manipulation, maybe some joins and maybe some filters, and then you put this data in your production data warehouse. Now, if something is wrong upstream with your data sources, that's not the best thing in the world, but your ingestion pipeline fails and you just tell your downstream consumers: oh, the data is on the way, it's not ready, we're debugging some issues with upstream data sources. If the data sources are fine but it's your ingestion pipeline, something's wrong with your code.
Similarly, the ingestion pipeline fails, and you tell your downstream consumers: my pipeline failed, I'm working on figuring out what's going on with the code. People are waiting for the data, but it's not the worst-case scenario. The worst-case scenario is when something goes wrong and bad data is published to your production data warehouse. This portion right here is where all of the anxiety lives (well, I wouldn't say all; most of the anxiety) for a lot of data engineers: you don't want to actually productionalize, or publish, bad data. You can think of that as where this integrated audits feature lives. It covers the case when everything else goes right: the data sources are solid, your ingestion pipeline runs and completes successfully, but you want to include a set of audits or checks before that data is actually made available in the production data warehouse, and you want to do that without actually copying your data, those massive amounts of data processed by your pipeline.

So what it really means is: no more running your data twice. No more having to remember to clean up artifacts like test tables. No more having to remember to keep test and prod schemas synced; even if you could copy your data twice, there's this extra thinking you have to do every time you migrate schemas for tables, which always gets challenging and makes you want to change them less. No more locking yourself into a single auditing tool: anything that supports Iceberg, which is a very long list of compute engines that's growing every day, and anything built on those tools, can actually be used to audit your Iceberg tables. And no more coupling of your ETL logic with your validation logic: you can keep your production code as-is, which lets you store your validation logic as a separate component.

So, the hard part, really (and this is sort of the last section of the talk): there are a couple of steps here. We saw that you have to set the WAP ID in your Spark configuration, you have to ensure that you utilize it when you're running your audits, and you also have to cherry-pick the right snapshot ID when you go to publish. And so the hard part is really automating this feature, right?
As part of your orchestration system. So I want to cover what that looks like, how I've seen it done, and also talk about some of the challenges there. Let's start at a high level. You have your orchestration system; it could be anything, Airflow or a custom scheduler, and there are many orchestration systems out there that handle scheduling your workflows. On the right is just a simple version of a production table; you could have multiple production tables, but for the sake of this example assume it's one. And then you have an auditing tool, the generic tool you use to run the audit step of the write-audit-publish pattern.

So what's step one? Step one, as we mentioned, is verifying that write.wap.enabled is true. That's a very cheap thing to do every time, so if you wanted to, you could just have a check that says: is this enabled on the table? If it's not, set it to true. You can actually do that on every workflow, as an automated part of the orchestration system. You can also leave this outside of the orchestration system's job and just have users enable WAP on their table before they start using this pattern; that's another option as well. But it is a cheap enough metadata operation that you can check for it every time.

The second step is that you run the Spark application the user scheduled, as part of the orchestration system. And here, the orchestration system should take responsibility for generating the run ID. You shouldn't have to think up some ID for this WAP session yourself; the orchestration system should automatically generate it. Usually orchestration systems have some concept of a workflow instance ID or run ID for this particular execution, and oftentimes that's enough to just utilize as the run ID. It's easy for the orchestration system to grab that, and it should inject it into the Spark session configuration that the user's job is going to run in.

And then, once that job finishes successfully, the orchestration system should trigger the audits; it should trigger the auditing tool. Now, the auditing tool has to look up: what actual snapshot should I be auditing?
So you need a very declarative way, to make this as easy as possible: a declarative way in the auditing tool to say, hey, run these audits for this specific snapshot of the table. Because then it's a matter of just finding out which snapshot is tied to this execution ID and providing that snapshot to the auditing tool for the actual audit executions, so that they can run. When the audits are complete, that gives you a signal: should I publish this data or should I not? That signal is returned to the orchestration system, which, if the auditing tool gives a publish signal, runs the cherry-pick operation, specifically for the snapshot that's tagged with the run ID the orchestration system itself created and set on the Spark session.

So this is, at a high level, how an orchestration system (and the orchestration system is sort of the core component in all of this) can orchestrate and automate this feature, so that users can just provide their ingestion pipeline, define some audits as part of some auditing tool, and have all of this taken care of automatically for them by the system that's orchestrating the execution of their pipeline.

One thing I want to say is that this part right here is super tricky. It seems very easy: publish or don't publish. If the audits pass, publish the data; if the audits fail, don't publish the data. But there's a lot of nuance there, and this is usually where human intervention is often required, or requested explicitly.

To give some examples: there are different hierarchies of audits. Some are what we like to call blocking audits. These are audits where, under no circumstances, should the data be published if they fail. This is something like: if you're expecting millions of records and you get under 100,000 records in this data set, something's wrong, and it should never be published. And then there are non-blocking audits, which, even when they fail, are more about notifying the ingestion owner, who may want to look historically and go: oh, this check of mine fails every Friday and Saturday when volume picks up, for example. That is very different from something where you want to actually stop the pipeline; you may actually want that data to publish anyway. So that's a different category of audits.

But then there are also blocking audits that fail where users actually want the ability to override the failure. That's another level of control users will request: hey, when my blocking audit fails, I want to be able to do some investigation. Pause the pipeline at that stage, let me investigate; maybe I talk to the owners of the upstream tables that I use, and there's a good reason why this audit failed, so I'll adjust the audit next week, and for now let me just skip it. A lot of nuance exists in this stage right here. The rest of it, not so much, although there is some as well. But this in particular is usually the one where you really want to tailor the user experience to something that's intuitive to the people who are creating these jobs and using this pattern.

Now, some of the gotchas to be aware of. First: pipelines that both write to and read from the same target table.
So Iceberg actually has a really cool feature here: when you have that WAP ID set on the Spark session, and you write to the table unpublished, and then you read from that table, not specifying a snapshot but just reading the latest view of the table, it understands that you're still within this write-audit-publish session and it will actually return the unpublished data as well. That's useful when you have pipelines that do multiple reads and writes against the same table, or reads and writes across multiple tables. It allows you to do those read-and-write cycles and combine them into a single unit of ingestion, a unit of writes that publishes together or doesn't publish together. Where that's tricky, however, is with the WAP ID itself. On success there's no issue, but when the pipeline fails, you have to ensure that you don't reuse the same write-audit-publish ID when you run the next batch, because then that feature can sort of work against you.
Because when you're starting a new, fresh session, you want a new, fresh WAP ID, since this is a new WAP session. In situations where you're not writing to and reading from the same target table, you can reuse the same WAP ID across multiple sessions, because you're not actually reading that staged data. But keep in mind that this is a potential gotcha for that particular scenario. Another one is running parallel WAP jobs. A great example to think of here is backfills: if you're doing daily backfills for the past year, for example, you may want to launch those 365 jobs in parallel, each using its own write-audit-publish session. That all works, particularly when you're auditing each of those individual days separately. One area that still has room for improvement is when you're running parallel WAP jobs and you want to audit the combined view: you run the backfill for all 365 days, but then want a combined view across all of these WAP sessions and do your auditing there. It is possible.
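To make the session behavior described above concrete, here is a toy, in-memory simulation of the semantics, not Iceberg's actual API (in real Spark you would set a session-level WAP ID on a table with write-audit-publish enabled). It shows why a reader inside the session sees its own staged writes while everyone else sees only published data, and why a stale WAP ID would leak old staged rows into a new run.

```python
class WapTable:
    """Toy model of a table with staged (unpublished) writes keyed by WAP ID."""

    def __init__(self):
        self.published = []  # rows visible to every reader
        self.staged = {}     # wap_id -> rows written but not yet published

    def write(self, wap_id, rows):
        self.staged.setdefault(wap_id, []).extend(rows)

    def read(self, wap_id=None):
        # A reader inside a WAP session also sees its own unpublished rows.
        rows = list(self.published)
        if wap_id is not None:
            rows += self.staged.get(wap_id, [])
        return rows

    def publish(self, wap_id):
        # The audit passed: atomically promote the staged rows.
        self.published += self.staged.pop(wap_id, [])

t = WapTable()
t.write("wap-1", [1, 2, 3])
assert t.read() == []                        # outside the session: nothing visible
assert t.read(wap_id="wap-1") == [1, 2, 3]   # inside the session: staged rows visible
t.publish("wap-1")
assert t.read() == [1, 2, 3]                 # now visible to everyone
```

If a failed run left rows staged under `"wap-1"` and the next batch reused that same ID, its in-session reads would silently include the leftover data, which is exactly the gotcha above.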
It's actually very possible to do if you're managing the WAP IDs yourself. But when you're automating and orchestrating this, it becomes very tricky, because you need a higher-level construct: a WAP session made of a collection of WAP sessions that are related and need to be audited together. The new branching and tagging work that's happening in open-source Iceberg will solve this problem very well, so if this is a use case you're interested in, it would be great to keep an eye on what's happening in the open-source community. The idea there is that you can create a branch off the head of your table, and, just as you would think of a branch in Git, it becomes a named branch off of main; you give it an explicit name. So instead of tracking unpublished snapshot IDs, you can create a branch at the beginning of your session and use that branch across these parallel WAP jobs: all of the parallel jobs can run their own WAP sessions but publish to the same branch, and then you can use that branch name to get the collective view at the end when you run your audits.
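A toy sketch of that branch idea (again a simulation of the concept, not Iceberg's actual branch API, which was still being implemented at the time of this talk): parallel backfill jobs all commit to one shared named branch, the audit reads the branch's collective view, and only after the audits pass is main fast-forwarded to the branch.

```python
class BranchedTable:
    """Toy model: named branches accumulate commits; main is the published view."""

    def __init__(self):
        self.refs = {"main": []}

    def create_branch(self, name):
        # A new branch starts from the current head of main.
        self.refs[name] = list(self.refs["main"])

    def commit(self, branch, rows):
        self.refs[branch].extend(rows)

    def read(self, branch="main"):
        return self.refs[branch]

    def fast_forward(self, branch):
        # "Publish": point main at the audited branch's state.
        self.refs["main"] = list(self.refs[branch])

t = BranchedTable()
t.create_branch("backfill")
for day in range(3):                      # three parallel backfill jobs
    t.commit("backfill", [f"day-{day}"])
assert len(t.read("backfill")) == 3       # the audit sees the collective view
assert t.read() == []                     # main is still untouched
t.fast_forward("backfill")                # audits passed: publish everything at once
assert t.read() == ["day-0", "day-1", "day-2"]
```

The point of the sketch is the grouping: the branch name is the "higher-level construct" that ties the related sessions together, so one audit can gate all of them.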
So branching is a super exciting feature that's part of the Iceberg spec, and the implementation is currently being actively worked on. This last one is really more about the orchestration engine: the overhead of the WAP steps is small, call it about five minutes. Really, checking the table properties is about 200 milliseconds, and running the actual publish is probably two minutes or less. Where the five minutes really comes from is running these as separate steps, via containers on a Spark cluster; the overhead of starting up each of these steps individually is unnoticeable for multi-hour batch ETL jobs, but it can be significant when you're doing parallel jobs. So that's something to keep in mind, especially for shorter-running jobs of about ten minutes: with the additional five minutes of overhead, backfilled over 365 days, each of your 10-or-15-minute backfills ends up becoming 20 minutes, and as a percentage there's some overhead there to consider.
There are ways to optimize it, by keeping the WAP operational steps closer to the actual Spark ETL run so they can reuse some of the same resources, but it's a small gotcha I figured would be good to list here.

So that's all. I just put a bullet here to check out the Iceberg open-source community. There are lots of ways to contribute; it's a big, growing community with people from many different organizations. If you go to the iceberg.apache.org community page, you'll see lots of ways to join: the Slack channel, the weekly sync, and the community Google Docs where you can add agenda items. So there are a number of ways to join and start contributing. Thanks, everyone.

Thank you so much, Sam. It looks like we have a little time to discuss some of these things, so fortunately we can jump to Q&A here. I really appreciate you going through a lot of this stuff.
I have a bigger question about what you're seeing as the business requirements around auditing and data. As companies start to rely more and more on automated models, and more and more of those decisions are being scrutinized by the public (obviously recommending the wrong movie is not that bad; recommending the wrong drug is probably really bad), what are you seeing happen in industries like law and governance? Are they starting to look at the output of these systems to understand whether data is valid, or whether there's liability involved?

Yeah, that's a great question. My personal experience has been that before data velocity and data scale started to grow tremendously, there was a big question of just: how do we process this data? It was much less about how we can process it in a way that ensures data quality; it was just how we can process it, period. And so there was a big revolution in how we manage databases: the data warehouse was born, open table formats are here to stay. A lot of these new things were created around how we can process data at scale in a way where we're not falling behind on the amount of data we're processing.
I think now that those technologies have matured, data quality at scale is the new problem being solved, and so you see a lot of new things showing up: data quality tools being one, but also concepts like metrics layers or semantic layers that have data quality checks built in. So I think right now is a moment where we're taking all those mature data processing technologies, which have really made incredible things possible at truly ridiculous scale, and figuring out how to have that same revolution for how data quality should be done: how to do it reliably, and inject it with the same level of engineering that people put into creating these ingestion pipelines.

So let's talk a little bit about what Iceberg can do to backfill when there's downtime or an outage at some point, and it kind of leaves a hole in the data, right?
Your baselines don't work anymore because there's a gap there. There's obviously a difference between not having any data versus having anomalous data, and it's hard for code, or for models, to deal with changes or gaps in data. Can you talk a little about backfilling: how you would go about backfilling a gap in data once that data is restored, and then redoing the analysis to fix your models?

Yeah, so backfilling is really probably one of the most complicated arts of data engineering, and there are a lot of challenges there. But Iceberg, and in particular its snapshot feature, is really core for backfilling. Two things I'll say: the snapshot feature is very core for auditing, and also the writes are atomic, so you can actually run these jobs in parallel, and the actual publish step, when the data is made available, happens instantaneously through a metadata commit. That allows you to parallelize these various backfills at super large scale. The other thing is that you mentioned having missing data or bad data, right?
You can actually roll back super easily with Iceberg as well. These snapshots come in handy in those cases where maybe you do have bad data: maybe you performed a backfill, it ran for eight hours, and then you found out that something about your backfill logic was off and injected some nuances into the data that make it fundamentally wrong. Very easily with Iceberg, with a metadata operation that runs in probably a few seconds, definitely under a minute, you can roll back petabyte-size tables to the previous snapshot that existed before the backfill.

Are there competing formats, like Delta and Hudi versus Iceberg? And do you think there are valid reasons for those three things beyond each vendor or each group wanting its own format, or are we going to see interoperability between them down the road?

Yeah, that's an interesting one, and a harder one to predict. I think these were created at a time when the others didn't exist, so I don't think there was necessarily a mature product that existed and then someone made another mature product as a competitive product.
I think they were all created at separate times by organizations with really, really strong requirements, building things that happened to converge on this concept of an open table format. So I think the concept of an open table format is here to stay. I'm much more familiar with Iceberg, and the thing I would say gives Iceberg a pretty good advantage there is that it has always been open source from the beginning, and it keeps the table spec separate from the actual Java implementation. The table spec is completely laid out on the docs site, so you can see every nuance of the spec, which is very solid, and all implementations follow that spec. So in terms of adoptability and integration, I think Iceberg has a little bit of a better story; but as far as where this all goes, we'll just have to wait and see.

It does feel like convergent evolution, as you said, rather than specific differentiation. A couple more questions. If the data were event-driven in your data lake or lakehouse, then auditing would be natural, and presumably Iceberg could help?

Yeah.
So, for event-driven data, you could actually use this if you're running streaming data, for example in a Flink application; you could implement this there. I think that adds another dimension: your checkpointing strategy, and when you want to run your audits, and over what interval. If you're checkpointing every two minutes, you maybe don't want to run your audit suite every two minutes. But you absolutely could use this as part of an event-driven pipeline.

Yeah, it's fascinating to think about: the DataOps person is now wearing the pager. We used to think of that as only happening when there's a problem with the hardware, or the database itself goes down. But now you have people instrumenting everything, everyone's on PagerDuty now, wherever the data flows are. Two more quick questions. Any tips on implementing automated testing for enterprise data warehouses?

So, yeah, at Netflix we have a homegrown tool that we use for that. There are tons out there that are available.
I think it really is unique to the type of data quality checks you're doing, the type of data you have, and the particular engines that you use. So that's the compatibility I would look for: that it can integrate with the particular compute engines you're using, and that it has the right suite of audits that you need, which really comes down to the nature of the data. If you're doing ML data, for example, you may want something that's more statistics-heavy, with statistical auditing functions. If you're doing financial data, you might want something that has more forecasting features, so that maybe you can audit each daily partition against a forecast, or something along those lines. So that one really depends a lot. But the good thing about Iceberg in this particular pattern is that you can inject really any auditing tool; it brings no opinions about that.

Awesome, last quick question. You've obviously worked as a software engineer in the very life-critical medical world, where a false positive or a false negative is really bad.
And now the sort of luxury world of watching content, where a false positive is not that big a deal. What has changed in your approach to software engineering at those two extremes of data science?

Yeah, that's a great observation. I think manual human intervention has always been critical in clinical settings, for sure. So that factor has always been present: it's not enough to say you can define these audits in code and they'll run automatically; you always have to think of the review process, of how someone can look at this and sign off. The human element is always much more present in those life-critical environments. When it comes to a product, something that's not life-critical but more a user experience that you're really trying to drive, latency and velocity, being able to run these things automated at high velocity, is the bigger requirement.
So the difference, I think, is that you can be much more innovative and creative, and there's a lot to gain from getting audit checks there for ninety percent of the time, if it increases your velocity and your productivity by some order of magnitude,