So another important tool in the quiver of tools that a data scientist has is Apache Iceberg. Sam Redai is a senior software engineer on the experimentation platform at Netflix. Sam, you're responsible for telling me all the amazing things that I should or shouldn't be watching, I'm guessing, and obviously you have a background beyond just media, in medical research, hospitals, and so on. We've talked a lot about data. Maybe you choose an algorithm you think is right, but ultimately it's garbage in, garbage out, so data quality is going to drive whether or not your models are successful. Why don't you talk a little bit about how we test data quality at scale?

Awesome, thanks a lot for the introduction. I'm excited to be here, and I'm happy to talk to everyone about Apache Iceberg, so we can jump right into it. I'm going to cover something called the write-audit-publish pattern, which is a general pattern, but one that's particularly well implemented via Apache Iceberg.

A high-level overview of what I'm planning to cover in the talk: I'm going to talk about data quality, as in, what exactly does data quality mean?
In the modern era of data, what are some common patterns that are inspired by the goal of achieving data quality? Then I'm going to cover Apache Iceberg, just a high-level introduction on what it is, and specifically its integrated audits feature, which I think makes it very easy to implement this exact pattern at very large data scale. And I'm going to talk a little bit about the hard part, which is automating this feature via your orchestration system, so that people can use the write-audit-publish pattern in a much more declarative way.

So let's start with data quality: what exactly is data quality? If you look at Wikipedia, it says that people's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. Furthermore, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular purpose. And this is my favorite sentence: defining data quality in a sentence is difficult. In reality, this is what we're trying to avoid. Data quality really means that the consumers of our data don't lose trust in what is actually contained in the data that we're delivering.

All right, so the big question then is: how can I get people to trust my data? Here are just a couple of solutions; anyone who's worked with data has tried one or more of these at some point. You can write your data to production and leave it to your consumers to run validations.

Another solution is that you write it somewhere else. You have, say, a test data warehouse that you write everything to, that sort of no one knows about, and you audit your data there in private before you rerun everything to move it to production.

And then some people are clever: they write data quality metrics and ship those along with their data, so that all consumers can go look at the metrics and investigate them for whatever their definition of data quality is; they can frame it in the context of the metrics you provide. And maybe you use some combination of these three options.

Some of you are really expert data engineers, and you look at this and say: oh, it's not that complicated, I have my own mechanism here. And many data engineers do; many data engineers have really fantastic bespoke data quality solutions that work very well.
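That third approach, shipping data quality metrics along with the data, can be sketched in a few lines of plain Python. Everything here (the row shape, the metric names, the fields) is invented for illustration rather than taken from any particular tool:

```python
# Illustrative sketch: compute simple quality metrics for a batch of rows
# and publish them alongside the data, so consumers can judge fitness for
# their own purpose. The row shape and metric names are made up.
def quality_metrics(rows, required_fields):
    """Return row count and per-field null rates for a batch of dict rows."""
    total = len(rows)
    null_rates = {}
    for field in required_fields:
        missing = sum(1 for r in rows if r.get(field) is None)
        null_rates[field] = missing / total if total else 0.0
    return {"row_count": total, "null_rates": null_rates}

batch = [
    {"user_id": 1, "title": "A"},
    {"user_id": 2, "title": None},
    {"user_id": None, "title": "C"},
]
metrics = quality_metrics(batch, ["user_id", "title"])
# Shipped next to the batch, e.g. as a sidecar table or file, this tells
# consumers: 3 rows, with a one-third null rate on each required field.
```

The point of the pattern is only that the metrics travel with the data; what you actually measure is up to each team.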
And so the big thing that I want to cover is that when you're working with data at massive scale, copying data is essentially not an option. That's the big thing that knocks out a lot of the solutions that maybe were valid a long time ago. Once you reach the scale of data that pretty much exists at any organization now, you can't really copy the data, and so you try to figure out: how can I maintain data quality, how can I assure data quality, without having to duplicate my data separately just to run my audits?

So I want to talk about Apache Iceberg's integrated audits feature, and just to start at a high level: what is Apache Iceberg, for anyone who's never heard of it? It's a high-performance format for huge analytic tables; that's sort of the long description from the Iceberg doc site. The description I usually give to people is that Iceberg provides massive-scale, cloud-native SQL tables, and it's accessible by many compute engines. And that list of compute engines is continuing to grow; over the past year the growth has been extraordinary, both in the different compute engines that can access Iceberg tables and in the data resting in Iceberg tables.
So, just a high-level overview of what the integrated audits feature is. The big core of it really is that it allows you to write your data to production in an unpublished state. And by unpublished, I mean downstream consumers can't see that data when they just query the table directly; when they write a SQL query to select from the table, they won't actually see the data that you've written in its unpublished state.

The other thing it does (and a lot of this is going to be specific to the Spark implementation) is that this integrated audits feature lives in the core Iceberg layer, so any compute engine can actually add support for it. There are only a few now, but the raw material does exist there to implement it in any compute engine. In Spark in particular, the way it's implemented is that this spark.wap.id value from your Spark session tags the unpublished snapshot. Iceberg has this concept of snapshots, and when you write data unpublished, that snapshot is tagged with this write-audit-publish ID from your Spark session. Time travel is a big feature in Iceberg: it lets you select any historical snapshot explicitly. That's a core feature, not specific to integrated audits but core to Iceberg, and it's what you use to time travel to select this unpublished data.

And then, when you have confidence in your data, Iceberg has a metadata-only cherry-pick operation, which lets you take the difference between this unpublished data and the current head of your table that people have access to, cherry-pick the metadata, and make a new current snapshot. It's the equivalent of making those unpublished changes published. Another one of the best parts is that Iceberg snapshot expiration cleans up data that's unused, meaning not referenced by the current state of the table. So if you have a weekly or monthly snapshot expiration process, any data that you don't end up cherry-picking, any data that remains unpublished, is automatically cleaned up. So this is an overview of the core features of Iceberg that really enable everything I'm going to talk about in the next slides.

So let's cover the three stages: write, audit, publish, and what each entails. For write: this write.wap.enabled setting is actually a table property.
By default, WAP is not enabled on Iceberg tables, but you can enable it simply by setting this table property to true on your table, and you only need to do it once; you don't need to do it every time you run a write-audit-publish session.

The second piece is that in the Spark configuration for the job that you're running, you just need to set this UUID, the WAP ID, in the Spark session configuration. Once that ID is set for the entire Spark session you're running in, it's a signal to Iceberg that this run is going to follow the write-audit-publish pattern. And then the best part is, you just run your production ETL code. That sounds scary to a lot of people, but once you see this enough and use it enough, you'll gain the confidence that Iceberg won't publish your data, because you have a WAP ID set in your Spark session configuration. So you run your production ETL code as-is, you insert into your production table, and you change nothing about your production code except setting the WAP ID in your Spark session configuration.
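As a concrete sketch, those two settings look roughly like this. The statements are built as plain strings so the example runs anywhere; the table name is made up, while write.wap.enabled and spark.wap.id are the Iceberg table property and Spark session config being described:

```python
# Sketch of the write-audit-publish setup, assuming Spark with an Iceberg
# catalog. The table name "prod.db.events" is illustrative.
import uuid

def wap_setup(table: str) -> dict:
    """Build the one-time table property DDL and the per-session conf pair."""
    wap_id = str(uuid.uuid4())  # in practice, usually the orchestrator's run ID
    return {
        # One-time table property: it persists on the table, so it does not
        # need to be re-set for every write-audit-publish session.
        "enable_wap_sql": (
            f"ALTER TABLE {table} SET TBLPROPERTIES ('write.wap.enabled'='true')"
        ),
        # Per-session Spark conf: with this set, writes to the table are
        # staged as an unpublished snapshot tagged with this ID.
        "session_conf": ("spark.wap.id", wap_id),
    }

setup = wap_setup("prod.db.events")
# In a real job you would run:
#   spark.sql(setup["enable_wap_sql"])        # once per table
#   spark.conf.set(*setup["session_conf"])    # once per WAP session
# ...and then run the production ETL unchanged, e.g. INSERT INTO prod.db.events.
```
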
For the auditing, you simply have to find the snapshot ID on the production table that's tagged with the WAP ID that was set in your Spark session configuration. That tag is essentially a pointer that says: okay, this data is unpublished, so I can't access it by just querying the table; I have to actually use the time travel feature to select a different snapshot, the unpublished one. The WAP ID is the link that lets you look up which snapshot was created by your production write job. And then you can perform these validations against the data using any data auditing tool. Anything that has support for Iceberg tables and can use time travel, you can use for auditing. So a Spark-based auditing tool, a Trino-based auditing tool, a Flink-based auditing tool: anything that can select a specific snapshot of an Iceberg table, you can utilize for your audits.

And then what happens after you've finished your audits and you want to publish? Well, what if your audits fail? You can just go back to the drawing board.
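To make that audit-side lookup concrete: the snapshot rows are mocked as plain dicts here, but in Spark they would come from the table's snapshots metadata table, where Iceberg records the session ID under the wap.id summary key:

```python
# Sketch: find the unpublished snapshot whose summary carries this session's
# WAP ID, then time-travel to it with whatever engine runs your audits.
def find_staged_snapshot(snapshots, wap_id):
    """Return the snapshot_id tagged with wap_id, or None if not found."""
    for snap in snapshots:
        if snap["summary"].get("wap.id") == wap_id:
            return snap["snapshot_id"]
    return None

snapshots = [
    {"snapshot_id": 100, "summary": {}},                    # published head
    {"snapshot_id": 101, "summary": {"wap.id": "run-42"}},  # staged by our job
]
staged_id = find_staged_snapshot(snapshots, "run-42")  # -> 101
# With the ID in hand, any engine with Iceberg time travel can audit the
# staged data, e.g. in Spark SQL:
#   SELECT count(*) FROM prod.db.events VERSION AS OF 101
```
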
Remember, there's the automatic snapshot expiration cleanup that's part of the maintenance of an Iceberg warehouse, so you can be assured that any of this unpublished data will be picked up automatically and cleaned out. If you have a number of sessions that fail and data lying around, you don't have to concern yourself with it; that will all get cleaned up. And if your audits pass, then to publish that unpublished data you run the cherry-pick operation on the specific snapshot ID that was tagged with that write-audit-publish session. This is a metadata-only operation, so it's super fast. This goes back to the cardinal sin of copying data: when you're working with data at scale, it's just not feasible; when you have jobs that take 56 hours to run, you're moving massive amounts of data. The great part is that the publish here is just a cherry-pick. It takes the metadata of the unpublished snapshot, looks at the current head of the table, and cherry-picks those changes to be available. It's a very fast operation, and immediately all of your data is published and available to downstream consumers.

So it might feel like this, right?
You're seeing all these steps and you're saying: okay, I have to set this Spark WAP session ID, I have to look it up when I want to audit, and I have to make sure that once those audits complete, based on some conditional, I publish or don't publish. So let's zoom out on what this feature is aiming to do. Iceberg is a data quality enabler; this is the high-level view of what it's trying to solve with this particular feature. Here's a gross simplification of a data engineer's pipeline: you have some data source, you have an ingestion pipeline that you run your code in, you do some manipulation, maybe some joins and maybe some filters, and then you put this data in your production data warehouse. Now, if something is wrong upstream with your data sources, that's not the best thing in the world, but your ingestion pipeline fails and you just tell your downstream consumers: oh, the data is on the way, it's not ready, we're debugging some issues with upstream data sources. If the data sources are fine but it's your ingestion pipeline, something's wrong with your code.
Similarly, the ingestion pipeline fails, and you tell your downstream consumers: my pipeline failed, I'm working on figuring out what's going on with the code. People are waiting for the data, but it's not the worst-case scenario. The worst-case scenario is when something goes wrong and bad data is published to your production data warehouse. This portion right here is where all of the anxiety lives (well, I wouldn't say all; most of the anxiety) for a lot of data engineers: you don't want to actually productionalize, or publish, bad data. You can think of that as where this integrated audits feature lives. It covers the case when everything else goes right: the data sources are solid, your ingestion pipeline runs and completes successfully, but you want to include a set of audits or checks before that data is actually made available in the production data warehouse, and you want to do that without actually copying your data, those massive amounts of data processed by your pipeline.

So what it really means is: no more running your data twice. No more having to remember to clean up artifacts like test tables. No more having to remember to keep test and prod schemas synced; even if you could copy your data twice, there's this extra thinking you have to do every time you migrate schemas for tables, which always gets challenging and makes you want to change them less. No more locking yourself into a single auditing tool: anything that supports Iceberg, which is a very long list of compute engines that's growing every day, and anything built on those tools, can actually be used to audit your Iceberg tables. And no more coupling of your ETL logic with your validation logic: you can keep your production code as-is, which lets you store your validation logic as a separate component.

So, the hard part, really (and this is sort of the last section of the talk): there are a couple of steps here. We saw that you have to set the WAP ID in your Spark configuration, you have to ensure that you utilize it when you're running your audits, and you also have to cherry-pick the right snapshot ID when you go to publish. And so the hard part is really automating this feature, right?
As part of your orchestration system. So I want to cover what that looks like, how I've seen it done, and also talk about some of the challenges there. Let's start at a high level. You have your orchestration system; it could be anything, Airflow or a custom scheduler, and there are many orchestration systems out there that handle scheduling your workflows. On the right is just a simple version of a production table; you could have multiple production tables, but for the sake of this example assume it's one. And then you have an auditing tool, the generic tool you use to run the audit step of the write-audit-publish pattern.

So what's step one? Step one, as we mentioned, is verifying that write.wap.enabled is true. That's a very cheap thing to do every time, so if you wanted to, you could just have a check that says: is this enabled on the table? If it's not, set it to true. You can actually do that on every workflow, as an automated part of the orchestration system. You can also leave this outside of the orchestration system's job and just have users enable WAP on their table before they start using this pattern; that's another option as well. But it is a cheap enough metadata operation that you can check for it every time.

The second step is that you run the Spark application the user scheduled, as part of the orchestration system. And here, the orchestration system should take responsibility for generating the run ID. You shouldn't have to think up some ID for this WAP session yourself; the orchestration system should automatically generate it. Usually orchestration systems have some concept of a workflow instance ID or run ID for this particular execution, and oftentimes that's enough to just utilize as the run ID. It's easy for the orchestration system to grab that, and it should inject it into the Spark session configuration that the user's job is going to run in.

And then, once that job finishes successfully, the orchestration system should trigger the audits; it should trigger the auditing tool. Now, the auditing tool has to look up: what actual snapshot should I be auditing?
So you need a very declarative way, to make this as easy as possible: a declarative way in the auditing tool to say, hey, run these audits for this specific snapshot of the table. Because then it's a matter of just finding out which snapshot is tied to this execution ID and providing that snapshot to the auditing tool for the actual audit executions, so that they can run. When the audits are complete, that gives you a signal: should I publish this data or should I not? That signal is returned to the orchestration system, which, if the auditing tool gives a publish signal, runs the cherry-pick operation, specifically for the snapshot that's tagged with the run ID the orchestration system itself created and set on the Spark session.

So this is, at a high level, how an orchestration system (and the orchestration system is sort of the core component in all of this) can orchestrate and automate this feature, so that users can just provide their ingestion pipeline, define some audits as part of some auditing tool, and have all of this taken care of automatically for them by the system that's orchestrating the execution of their pipeline.

One thing I want to say is that this part right here is super tricky. It seems very easy: publish or don't publish. If the audits pass, publish the data; if the audits fail, don't publish the data. But there's a lot of nuance there, and this is usually where human intervention is often required, or requested explicitly.

To give some examples: there are different hierarchies of audits. Some are what we like to call blocking audits. These are audits where, under no circumstances, should the data be published if they fail. This is something like: if you're expecting millions of records and you get under 100,000 records in this data set, something's wrong, and it should never be published. And then there are non-blocking audits, which, even when they fail, are more about notifying the ingestion owner, who may want to look historically and go: oh, this check of mine fails every Friday and Saturday when volume picks up, for example. That is very different from something where you want to actually stop the pipeline; you may actually want that data to publish anyway. So that's a different category of audits.

But then there are also blocking audits that fail where users actually want the ability to override the failure. That's another level of control users will request: hey, when my blocking audit fails, I want to be able to do some investigation. Pause the pipeline at that stage, let me investigate; maybe I talk to the owners of the upstream tables that I use, and there's a good reason why this audit failed, so I'll adjust the audit next week, and for now let me just skip it. A lot of nuance exists in this stage right here. The rest of it, not so much, although there is some as well. But this in particular is usually the one where you really want to tailor the user experience to something that's intuitive to the people who are creating these jobs and using this pattern.

Now, some of the gotchas to be aware of. First: pipelines that both write to and read from the same target table.
So Iceberg actually has a really cool feature here: when you have that WAP ID set on the Spark session, and you write to the table unpublished, and then you read from that table, not specifying a snapshot but just reading the latest view of the table, it understands that you're still within this write-audit-publish session and it will actually return the unpublished data as well. That's useful when you have pipelines that do multiple reads and writes against the same table, or reads and writes across multiple tables. It allows you to do those read-and-write cycles and combine them into a single unit of ingestion, a unit of writes that publishes together or doesn't publish together. Where that's tricky, however, is with the WAP ID itself. On success there's no issue, but when the pipeline fails, you have to ensure that you don't reuse the same write-audit-publish ID when you run the next batch, because then that feature can sort of work against you.
Because when you're starting a new, fresh session, you want a new, fresh WAP ID, since this is a new WAP session. In situations where you're not writing to and reading from the same target table, you can reuse the same WAP ID across multiple sessions, because you're not actually reading that staged data. But keep in mind that this is a potential gotcha for that particular scenario. Another one is running parallel WAP jobs. A great example to think of here is backfills: if you're doing daily backfills for the past year, for example, you may want to launch those 365 jobs in parallel, each using its own write-audit-publish session. That all works, particularly when you're auditing each of those individual days separately. One area that still has room for improvement is when you're running parallel WAP jobs and you want to audit the combined view: you run the backfill for all 365 days, but then want a combined view across all of these WAP sessions and do your auditing there. It is possible.
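To make the session behavior described above concrete, here is a toy, in-memory simulation of the semantics, not Iceberg's actual API (in real Spark you would set a session-level WAP ID on a table with write-audit-publish enabled). It shows why a reader inside the session sees its own staged writes while everyone else sees only published data, and why a stale WAP ID would leak old staged rows into a new run.

```python
class WapTable:
    """Toy model of a table with staged (unpublished) writes keyed by WAP ID."""

    def __init__(self):
        self.published = []  # rows visible to every reader
        self.staged = {}     # wap_id -> rows written but not yet published

    def write(self, wap_id, rows):
        self.staged.setdefault(wap_id, []).extend(rows)

    def read(self, wap_id=None):
        # A reader inside a WAP session also sees its own unpublished rows.
        rows = list(self.published)
        if wap_id is not None:
            rows += self.staged.get(wap_id, [])
        return rows

    def publish(self, wap_id):
        # The audit passed: atomically promote the staged rows.
        self.published += self.staged.pop(wap_id, [])

t = WapTable()
t.write("wap-1", [1, 2, 3])
assert t.read() == []                        # outside the session: nothing visible
assert t.read(wap_id="wap-1") == [1, 2, 3]   # inside the session: staged rows visible
t.publish("wap-1")
assert t.read() == [1, 2, 3]                 # now visible to everyone
```

If a failed run left rows staged under `"wap-1"` and the next batch reused that same ID, its in-session reads would silently include the leftover data, which is exactly the gotcha above.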
It's actually very possible to do if you're managing the WAP IDs yourself. But when you're automating and orchestrating this, it becomes very tricky, because you need a higher-level construct: a WAP session made of a collection of WAP sessions that are related and need to be audited together. The new branching and tagging work that's happening in open-source Iceberg will solve this problem very well, so if this is a use case you're interested in, it would be great to keep an eye on what's happening in the open-source community. The idea there is that you can create a branch off the head of your table, and, just as you would think of a branch in Git, it becomes a named branch off of main; you give it an explicit name. So instead of tracking unpublished snapshot IDs, you can create a branch at the beginning of your session and use that branch across these parallel WAP jobs: all of the parallel jobs can run their own WAP sessions but publish to the same branch, and then you can use that branch name to get the collective view at the end when you run your audits.
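A toy sketch of that branch idea (again a simulation of the concept, not Iceberg's actual branch API, which was still being implemented at the time of this talk): parallel backfill jobs all commit to one shared named branch, the audit reads the branch's collective view, and only after the audits pass is main fast-forwarded to the branch.

```python
class BranchedTable:
    """Toy model: named branches accumulate commits; main is the published view."""

    def __init__(self):
        self.refs = {"main": []}

    def create_branch(self, name):
        # A new branch starts from the current head of main.
        self.refs[name] = list(self.refs["main"])

    def commit(self, branch, rows):
        self.refs[branch].extend(rows)

    def read(self, branch="main"):
        return self.refs[branch]

    def fast_forward(self, branch):
        # "Publish": point main at the audited branch's state.
        self.refs["main"] = list(self.refs[branch])

t = BranchedTable()
t.create_branch("backfill")
for day in range(3):                      # three parallel backfill jobs
    t.commit("backfill", [f"day-{day}"])
assert len(t.read("backfill")) == 3       # the audit sees the collective view
assert t.read() == []                     # main is still untouched
t.fast_forward("backfill")                # audits passed: publish everything at once
assert t.read() == ["day-0", "day-1", "day-2"]
```

The point of the sketch is the grouping: the branch name is the "higher-level construct" that ties the related sessions together, so one audit can gate all of them.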
So branching is a super exciting feature that's part of the Iceberg spec, and the implementation is currently being actively worked on. This last one is really more about the orchestration engine: the overhead of the WAP steps is small, call it about five minutes. Really, checking the table properties is about 200 milliseconds, and running the actual publish is probably two minutes or less. Where the five minutes really comes from is running these as separate steps, via containers on a Spark cluster; the overhead of starting up each of these steps individually is unnoticeable for multi-hour batch ETL jobs, but it can be significant when you're doing parallel jobs. So that's something to keep in mind, especially for shorter-running jobs of about ten minutes: with the additional five minutes of overhead, backfilled over 365 days, each of your 10-or-15-minute backfills ends up becoming 20 minutes, and as a percentage there's some overhead there to consider.
There are ways to optimize it, by keeping the WAP operational steps closer to the actual Spark ETL run so they can reuse some of the same resources, but it's a small gotcha I figured would be good to list here.

So that's all. I just put a bullet here to check out the Iceberg open-source community. There are lots of ways to contribute; it's a big, growing community with people from many different organizations. If you go to the iceberg.apache.org community page, you'll see lots of ways to join: the Slack channel, the weekly sync, and the community Google Docs where you can add agenda items. So there are a number of ways to join and start contributing. Thanks, everyone.

Thank you so much, Sam. It looks like we have a little time to discuss some of these things, so fortunately we can jump to Q&A here. I really appreciate you going through a lot of this stuff.
I have a bigger question about what you're seeing as the business requirements around auditing and data. As companies start to rely more and more on automated models, and more and more of those decisions are being scrutinized by the public (obviously recommending the wrong movie is not that bad; recommending the wrong drug is probably really bad), what are you seeing happen in industries like law and governance? Are they starting to look at the output of these systems to understand whether data is valid, or whether there's liability involved?

Yeah, that's a great question. My personal experience has been that before data velocity and data scale started to grow tremendously, there was a big question of just: how do we process this data? It was much less about how we can process it in a way that ensures data quality; it was just how we can process it, period. And so there was a big revolution in how we manage databases: the data warehouse was born, open table formats are here to stay. A lot of these new things were created around how we can process data at scale in a way where we're not falling behind on the amount of data we're processing.
I think now that those technologies have matured, data quality at scale is the new problem being solved, and so you see a lot of new things showing up: data quality tools being one, but also concepts like metrics layers or semantic layers that have data quality checks built in. So I think right now is a moment where we're taking all those mature data processing technologies, which have really made incredible things possible at truly ridiculous scale, and figuring out how to have that same revolution for how data quality should be done: how to do it reliably, and inject it with the same level of engineering that people put into creating these ingestion pipelines.

So let's talk a little bit about what Iceberg can do to backfill when there's downtime or an outage at some point, and it kind of leaves a hole in the data, right?
Your baselines don't work anymore because there's a gap there. There's obviously a difference between not having any data versus having anomalous data, and it's hard for code, or for models, to deal with changes or gaps in data. Can you talk a little about backfilling: how you would go about backfilling a gap in data once that data is restored, and then redoing the analysis to fix your models?

Yeah, so backfilling is really probably one of the most complicated arts of data engineering, and there are a lot of challenges there. But Iceberg, and in particular its snapshot feature, is really core for backfilling. Two things I'll say: the snapshot feature is very core for auditing, and also the writes are atomic, so you can actually run these jobs in parallel, and the actual publish step, when the data is made available, happens instantaneously through a metadata commit. That allows you to parallelize these various backfills at super large scale. The other thing is that you mentioned having missing data or bad data, right?
You can actually roll back super easily with Iceberg as well. These snapshots come in handy in those cases where maybe you do have bad data: maybe you performed a backfill, it ran for eight hours, and then you found out that something about your backfill logic was off and injected some nuances into the data that make it fundamentally wrong. Very easily with Iceberg, with a metadata operation that runs in probably a few seconds, definitely under a minute, you can roll back petabyte-size tables to the previous snapshot that existed before the backfill.

Are there competing formats, like Delta and Hudi versus Iceberg? And do you think there are valid reasons for those three things beyond each vendor or each group wanting its own format, or are we going to see interoperability between them down the road?

Yeah, that's an interesting one, and a harder one to predict. I think these were created at a time when the others didn't exist, so I don't think there was necessarily a mature product that existed and then someone made another mature product as a competitive product.
I think they were all created at separate times by organizations with really, really strong requirements, building things that happened to converge on this concept of an open table format. So I think the concept of an open table format is here to stay. I'm much more familiar with Iceberg, and the thing I would say gives Iceberg a pretty good advantage there is that it has always been open source from the beginning, and it keeps the table spec separate from the actual Java implementation. The table spec is completely laid out on the docs site, so you can see every nuance of the spec, which is very solid, and all implementations follow that spec. So in terms of adoptability and integration, I think Iceberg has a little bit of a better story; but as far as where this all goes, we'll just have to wait and see.

It does feel like convergent evolution, as you said, rather than specific differentiation. A couple more questions. If the data were event-driven in your data lake or lakehouse, then auditing would be natural, and presumably Iceberg could help?

Yeah.
So, for event-driven data, you could actually use this if you're running streaming data, for example in a Flink application; you could implement this there. I think that adds another dimension: your checkpointing strategy, and when you want to run your audits, and over what interval. If you're checkpointing every two minutes, you maybe don't want to run your audit suite every two minutes. But you absolutely could use this as part of an event-driven pipeline.

Yeah, it's fascinating to think about: the DataOps person is now wearing the pager. We used to think of that as only happening when there's a problem with the hardware, or the database itself goes down. But now you have people instrumenting everything, everyone's on PagerDuty now, wherever the data flows are. Two more quick questions. Any tips on implementing automated testing for enterprise data warehouses?

So, yeah, at Netflix we have a homegrown tool that we use for that. There are tons out there that are available.
I think it really is unique to the type of data quality checks you're doing, the type of data you have, and the particular engines that you use. So that's the compatibility I would look for: that it can integrate with the particular compute engines you're using, and that it has the right suite of audits that you need, which really comes down to the nature of the data. If you're doing ML data, for example, you may want something that's more statistics-heavy, with statistical auditing functions. If you're doing financial data, you might want something that has more forecasting features, so that maybe you can audit each daily partition against a forecast, or something along those lines. So that one really depends a lot. But the good thing about Iceberg in this particular pattern is that you can inject really any auditing tool; it brings no opinions about that.

Awesome, last quick question. You've obviously worked as a software engineer in the very life-critical medical world, where a false positive or a false negative is really bad.
And now the sort of luxury world of watching content, where a false positive is not that big a deal. What has changed in your approach to software engineering at those two extremes of data science?

Yeah, that's a great observation. I think manual human intervention has always been critical in clinical settings, for sure. So that factor has always been present: it's not enough to say you can define these audits in code and they'll run automatically; you always have to think of the review process, of how someone can look at this and sign off. The human element is always much more present in those life-critical environments. When it comes to a product, something that's not life-critical but more a user experience that you're really trying to drive, latency and velocity, being able to run these things automated at high velocity, is the bigger requirement.
So the difference, I think, is that you can be much more innovative and creative, and there's a lot to gain from getting audit checks there for ninety percent of the time, if it increases your velocity and your productivity by some order of magnitude,