1 00:00:06,800 --> 00:00:10,800 Next, I want to go through and try to understand 2 00:00:11,000 --> 00:00:15,800 ZFS from a structural perspective. So here, we see the layout of 3 00:00:15,800 --> 00:00:19,700 an actual ZFS pool. Starting at the top, 4 00:00:19,700 --> 00:00:23,900 we have the uberblock. The uberblock is 5 00:00:23,900 --> 00:00:27,900 the thing that anchors the entire pool. So, when 6 00:00:27,900 --> 00:00:31,800 we talked about doing a checkpoint, what we were really checkpointing is everything that 7 00:00:31,800 --> 00:00:35,800 you see underneath that uberblock. And so, we will get 8 00:00:35,800 --> 00:00:36,500 everything written 9 00:00:36,700 --> 00:00:40,800 all the way up to the uberblock. And then the very last thing that we do is to 10 00:00:40,800 --> 00:00:44,700 update the uberblock itself. And that will then essentially take us 11 00:00:44,700 --> 00:00:48,800 from the previous version of the ZFS pool 12 00:00:48,800 --> 00:00:52,500 to the new version. So, what you can see from this 13 00:00:52,500 --> 00:00:56,700 is that when we do a checkpoint, we are not checkpointing an individual file 14 00:00:56,700 --> 00:01:00,900 system. We are checkpointing the entire pool and 15 00:01:00,900 --> 00:01:04,900 everything that's within that pool. So a checkpoint is not just 16 00:01:04,900 --> 00:01:06,700 for one filesystem, 17 00:01:06,800 --> 00:01:10,400 not just for a clone, not just for a zvol. It's 18 00:01:10,400 --> 00:01:14,800 everything that's being managed by that pool. Below the 19 00:01:14,800 --> 00:01:18,900 uberblock, we have what's called the meta-object set. And this is the 20 00:01:18,900 --> 00:01:22,600 thing that's going to describe all of the file systems, the 21 00:01:22,600 --> 00:01:26,400 clones, the snapshots, the zvols that are being 22 00:01:26,400 --> 00:01:29,900 supported by this particular ZFS pool. 
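The ordering described above (write everything below the uberblock first, update the uberblock last) can be sketched as a toy model. This is only an illustration under stated assumptions, not how ZFS is implemented: the `Pool` class, its `write_block` and `checkpoint` methods, and the dictionary used as "disk" are all hypothetical names invented for this example.

```python
class Pool:
    """Toy pool: blocks are never overwritten; only the uberblock slot is."""

    def __init__(self):
        self.blocks = {}       # block id -> contents (hypothetical "disk")
        self.next_id = 0
        self.uberblock = None  # id of the root of the current pool version

    def write_block(self, contents):
        # Always allocate a fresh block; existing blocks stay untouched.
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def checkpoint(self, pending_state):
        # Step 1: write the whole new pool state below the uberblock.
        new_root = self.write_block(dict(pending_state))
        # Step 2: the very last step switches the pool to the new version.
        self.uberblock = new_root
        return new_root

    def current_state(self):
        if self.uberblock is None:
            return None
        return self.blocks[self.uberblock]


pool = Pool()
first = pool.checkpoint({"filesystem": "v1"})
second = pool.checkpoint({"filesystem": "v2", "clone": "c1", "zvol": "z1"})
```

After the second checkpoint, `pool.current_state()` reflects the file system, clone, and zvol together, because the checkpoint covers the whole pool, while the block written by the first checkpoint is still intact on the toy disk.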
23 00:01:30,800 --> 00:01:34,800 So, the meta-object layer there is the thing that's between the top dotted 24 00:01:34,800 --> 00:01:36,400 line and the lower dotted line. 25 00:01:37,200 --> 00:01:41,900 And you can see there that we have a thing called an object set, and 26 00:01:41,900 --> 00:01:45,500 an object set looks a lot like an inode. 27 00:01:45,900 --> 00:01:49,300 And by a lot like an inode, what I really mean is that it 28 00:01:49,300 --> 00:01:53,800 describes an object of arbitrary size. So you see that 29 00:01:53,800 --> 00:01:57,500 sort of triangle that's coming down off the bottom of the object set. 30 00:01:58,000 --> 00:02:02,900 Any place you see a triangle in this figure, that really just represents a set of indirect blocks. 31 00:02:03,400 --> 00:02:06,800 So there's a set of indirect blocks so that we can 32 00:02:06,900 --> 00:02:10,900 describe that thing that looks kind of like a file at the bottom 33 00:02:10,900 --> 00:02:14,800 of the meta-object layer there. And what that file 34 00:02:14,800 --> 00:02:17,800 contains is some 35 00:02:17,800 --> 00:02:21,800 descriptors of file systems, of snapshots, of clones, of 36 00:02:21,800 --> 00:02:25,900 zvols. You'll see at the very beginning is a thing that says 37 00:02:25,900 --> 00:02:29,400 master. And that's just a place to store information 38 00:02:29,400 --> 00:02:33,700 that's pool-wide. So that's where we're going to store information, 39 00:02:33,700 --> 00:02:36,400 for example, about reservations or quotas, 40 00:02:36,900 --> 00:02:40,900 other things of that sort. And then on the far right of the meta-object 41 00:02:40,900 --> 00:02:44,800 set layer, you see space map, and that space map is 42 00:02:44,800 --> 00:02:48,500 keeping track of all of the blocks that are in the pool. 43 00:02:49,200 --> 00:02:53,800 And it's actually a much more complex thing than what's actually shown there. 
It's not 44 00:02:53,800 --> 00:02:57,800 just a little box, but it actually has a whole set 45 00:02:57,800 --> 00:03:01,600 of things under it, essentially a space map for each one of the 46 00:03:01,600 --> 00:03:05,900 physical drives that it's managing. Each one of those keeps track of the blocks on 47 00:03:05,900 --> 00:03:06,800 the drive as 48 00:03:07,000 --> 00:03:11,700 to whether each is currently in use by one of the objects in the 49 00:03:11,700 --> 00:03:15,900 meta-object layer or whether it is currently free. So, if 50 00:03:15,900 --> 00:03:19,600 we want to create another file system, all we really need to do is just 51 00:03:19,600 --> 00:03:23,600 allocate another thing in that meta-object set layer. And 52 00:03:23,600 --> 00:03:27,900 we may just need to make the layer slightly bigger if we don't have 53 00:03:27,900 --> 00:03:31,600 any other free space that's in there from a previous deletion. 54 00:03:32,600 --> 00:03:36,900 Again, taking a snapshot is no more difficult. You just allocate another one 55 00:03:36,900 --> 00:03:40,300 of those objects in the meta-object set layer, set up a few 56 00:03:40,300 --> 00:03:44,800 linkages so you know that this is a snapshot that's associated with a 57 00:03:44,800 --> 00:03:48,900 particular file system, and beyond that 58 00:03:48,900 --> 00:03:52,600 it's just, again, another thing that's created in there. So creating 59 00:03:53,100 --> 00:03:57,900 any of these instances, creating a snapshot or a file system or a zvol or a clone, 60 00:03:58,100 --> 00:04:02,200 is really not much more difficult than creating a file in a traditional file 61 00:04:02,300 --> 00:04:06,700 system. Or maybe we'll say it's a little more complex: it's like creating a directory in a traditional 62 00:04:06,700 --> 00:04:10,700 file system. But really, you just create another object in this array of 63 00:04:10,700 --> 00:04:14,300 objects. And it's a fully extensible array. 
So you just make it bigger as you need to, 64 00:04:14,300 --> 00:04:18,200 to make the thing fit. Each one of those 65 00:04:18,200 --> 00:04:22,600 objects in the meta-object set, each MOS object, 66 00:04:22,600 --> 00:04:26,400 references an object set that describes its 67 00:04:26,400 --> 00:04:30,700 object. In particular, a filesystem type of object is going to 68 00:04:30,700 --> 00:04:32,200 describe an array of 69 00:04:32,300 --> 00:04:36,800 files, directories, and so on, as you would expect to see in any 70 00:04:36,800 --> 00:04:40,500 traditional filesystem. So I've chosen in the 71 00:04:40,500 --> 00:04:44,700 object set layer to break out just the file system. There would be a similar 72 00:04:44,700 --> 00:04:48,800 set of data structures that would be below a snapshot or clone or any of the others. 73 00:04:48,800 --> 00:04:51,900 But for space reasons, I've only shown one of these here. 74 00:04:51,900 --> 00:04:55,800 So that filesystem object that you see up in the 75 00:04:55,800 --> 00:04:59,600 meta-object layer is tracking sort of the high-level 76 00:04:59,600 --> 00:05:02,200 information about the file system. In many ways, it's 77 00:05:02,300 --> 00:05:06,800 sort of like the superblock of the UFS filesystem. But in particular, 78 00:05:06,800 --> 00:05:10,900 it points to another one of these object 79 00:05:10,900 --> 00:05:14,700 set things, this sort of inode-like thing that keeps track of an 80 00:05:14,700 --> 00:05:18,900 arbitrary-size thing. In this case, you 81 00:05:18,900 --> 00:05:22,200 could think of it as sort of a file of inodes. So 82 00:05:22,200 --> 00:05:26,900 every inode in the system is stored in that object set. When you create a new 83 00:05:26,900 --> 00:05:30,600 file, we just tack another inode onto the end of that 84 00:05:30,600 --> 00:05:31,600 object set. 
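The "extensible array of objects" idea, used both for the meta-object set and for the per-filesystem file of inodes, can be sketched roughly as follows. This is a simplified model under assumptions, not the ZFS data structure: the `ObjectArray` class, its methods, and the dictionary fields such as `kind`, `object_set`, and `of` are hypothetical names chosen for illustration.

```python
class ObjectArray:
    """Toy extensible array of objects with a master entry in slot 0."""

    def __init__(self):
        # Slot 0 holds the "master" information that is global to the array.
        self.slots = [{"kind": "master"}]

    def allocate(self, kind, **fields):
        entry = {"kind": kind, **fields}
        # Reuse a slot left free by a previous deletion if there is one.
        for index, slot in enumerate(self.slots):
            if slot is None:
                self.slots[index] = entry
                return index
        # Otherwise make the array slightly bigger.
        self.slots.append(entry)
        return len(self.slots) - 1

    def free(self, index):
        # Mark the slot as not in use; the array itself is not compacted.
        self.slots[index] = None


mos = ObjectArray()        # stands in for the meta-object set
fs_inodes = ObjectArray()  # stands in for one filesystem's object set

fs = mos.allocate("filesystem", object_set=fs_inodes)
snap = mos.allocate("snapshot", of=fs)   # linkage back to its filesystem
new_file = fs_inodes.allocate("file")    # creating a file tacks on an inode
new_dir = fs_inodes.allocate("directory")
```

In this sketch, creating a file system or a snapshot costs one `allocate` call in the top-level array, which is the sense in which it is about as cheap as creating a file or directory in a traditional file system.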
85 00:05:32,300 --> 00:05:36,800 When you delete one, we just delete it out of there and mark it as not 86 00:05:36,800 --> 00:05:40,800 in use. And so of course there are the different kinds of inodes, as 87 00:05:40,800 --> 00:05:44,700 we've already seen: you've got directories and files and symbolic links and so on. 88 00:05:45,000 --> 00:05:49,700 As with the meta-object layer, the object set has a master at the beginning where we 89 00:05:49,700 --> 00:05:53,900 store information that's sort of global to this particular file system. But 90 00:05:53,900 --> 00:05:57,900 other than that, it's just all of the various inodes, and we simply make it bigger when we 91 00:05:57,900 --> 00:06:01,800 have more inodes and can shrink it down when we have fewer inodes. Finally, 92 00:06:01,800 --> 00:06:02,100 the 93 00:06:02,200 --> 00:06:06,700 file system object just describes an array of bytes, again exactly what 94 00:06:06,700 --> 00:06:10,200 you would have seen in the case of the regular traditional filesystem. 95 00:06:10,200 --> 00:06:14,600 So where it says file there and you see the big triangle, the triangle 96 00:06:14,600 --> 00:06:18,700 is always representing a number of indirect blocks, however 97 00:06:18,700 --> 00:06:22,900 many are needed to ultimately describe the user data. And the 98 00:06:22,900 --> 00:06:26,700 user data is just this array of blocks, and the indirect block pointers 99 00:06:26,700 --> 00:06:30,800 point to each of those blocks. So, in that sense, it's very, very similar to 100 00:06:30,800 --> 00:06:32,100 what we saw with UFS. 101 00:06:32,300 --> 00:06:36,900 So if I were to blow up that thing that says file, it would look very much 102 00:06:36,900 --> 00:06:40,900 like the inode that we saw for UFS, and you'd have the pointers 103 00:06:40,900 --> 00:06:44,900 to the data blocks like we have with UFS, and owner and group. 
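To make "however many indirect blocks are needed" concrete, here is a small arithmetic sketch of how the depth of the triangle grows with file size. The numbers are assumptions picked for illustration (a 128 KiB data block, 1024 pointers per indirect block, and a single pointer held in the inode-like object itself); the talk does not state them, and the function name `indirect_levels` is hypothetical.

```python
def indirect_levels(file_size, block_size=128 * 1024, pointers_per_block=1024):
    """Return how many levels of indirect blocks sit above the data blocks.

    Assumes the inode-like object holds exactly one block pointer, so a
    file that fits in one data block needs no indirect blocks at all.
    """
    # Ceiling division: number of data blocks needed (at least one).
    data_blocks = max(1, -(-file_size // block_size))
    levels = 0
    reachable = 1  # data blocks reachable with the current number of levels
    while reachable < data_blocks:
        reachable *= pointers_per_block
        levels += 1
    return levels


small = indirect_levels(100 * 1024)              # fits in one data block
medium = indirect_levels(1024 * 128 * 1024)      # exactly 1024 data blocks
large = indirect_levels(1024 * 128 * 1024 + 1)   # one byte more
```

Under these assumed sizes, each extra level multiplies the reachable data by 1024, so the tree stays shallow even for very large objects, which is why the same inode-like structure can describe a file, a file of inodes, or the meta-object set.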
104 00:06:44,900 --> 00:06:48,200 And other things that we store that are associated with that file. 105 00:06:48,300 --> 00:06:52,900 What happens then is, when we go to write data, 106 00:06:52,900 --> 00:06:56,800 let's say we write data to that file, we aren't able to overwrite any of the things 107 00:06:56,800 --> 00:07:00,600 that you see in this picture. When new data gets added, a new block 108 00:07:00,600 --> 00:07:01,400 obviously gets allocated. 109 00:07:02,200 --> 00:07:06,600 And now an indirect block pointer has to point to that. That means that we can't 110 00:07:06,600 --> 00:07:10,500 overwrite the existing indirect block, so we have to allocate a new indirect block 111 00:07:10,500 --> 00:07:14,900 and the thing that points to that, and so on. We have to create a new inode for the file because 112 00:07:14,900 --> 00:07:18,100 it's now got a different size and potentially more pointers. 113 00:07:18,100 --> 00:07:22,900 And so that's going to have to change, and that's going to change all the indirect blocks 114 00:07:22,900 --> 00:07:26,800 between the file and the object set. And so that means now there's a 115 00:07:26,800 --> 00:07:30,600 new object set, which means that the filesystem thing that points at it has to change, 116 00:07:31,400 --> 00:07:32,200 which means all the things 117 00:07:32,200 --> 00:07:36,800 that point to that have to change, which means the object set at the top has to change. So you can see that 118 00:07:36,800 --> 00:07:40,900 just writing data to one file is going to cause at least 119 00:07:41,000 --> 00:07:45,900 nine or ten other things to cascade all the way back up, because the only thing that we ever can 120 00:07:45,900 --> 00:07:47,300 overwrite is the uberblock. 121 00:07:48,300 --> 00:07:52,900 And that's the reason that checkpoints have to accumulate stuff. 
If we did a checkpoint 122 00:07:52,900 --> 00:07:56,500 after every modification to a file, we would end up having so much data being 123 00:07:56,500 --> 00:08:00,500 written that it would just be prohibitive. But of course, we don't 124 00:08:00,500 --> 00:08:04,500 just write to one file. We're writing to lots of files in lots of file systems. 125 00:08:04,900 --> 00:08:08,700 And once we've changed one file, if we change, say, the directory to the left of 126 00:08:08,700 --> 00:08:12,900 it, well, everything that changed above it to update the file also would have to be 127 00:08:12,900 --> 00:08:16,600 changed because of updating that directory. So all that will now be 128 00:08:16,600 --> 00:08:18,000 amortized across 129 00:08:18,100 --> 00:08:22,800 updating the file and the directory. So that's why, after we've collected changes, you know, hundreds or 130 00:08:22,800 --> 00:08:26,800 thousands of changes in memory, the extra overhead of changing the indirect 131 00:08:26,800 --> 00:08:30,900 blocks and other things just gets lost in the noise, because most of the write is 132 00:08:30,900 --> 00:08:34,700 actually useful data that has been newly created.
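The cascade and its amortization can be illustrated with a small counting model. It is a deliberate simplification under assumptions: each changed file is represented by a fixed path of eight hypothetical layers from the meta-object set down to its data block (fewer than the "nine or ten" mentioned in the talk), every file gets its own inode and indirect block, and the helper names `path_to` and `blocks_written` are invented for this example. The uberblock is left out because it is the one thing overwritten in place.

```python
def path_to(file_name):
    """Hypothetical chain of blocks from the top of the pool to one file."""
    return (
        "meta-object set",
        "MOS indirect",
        "filesystem object",
        "object set",
        "object set indirect",
        file_name + " inode",
        file_name + " indirect",
        file_name + " data",
    )


def blocks_written(dirty_paths):
    """Count the distinct new blocks one checkpoint must write."""
    dirty = set()
    for path in dirty_paths:
        # Every ancestor of a changed block must itself be rewritten.
        for depth in range(1, len(path) + 1):
            dirty.add(path[:depth])
    return len(dirty)


one_write = blocks_written([path_to("a")])
two_batched = blocks_written([path_to("a"), path_to("b")])
many_batched = blocks_written([path_to("f%d" % i) for i in range(1000)])
```

In this model a checkpoint after a single write produces 8 new blocks for 1 block of data, while batching 1000 writes in the same file system produces 3005 blocks instead of 8000, because the five shared upper layers are rewritten only once per checkpoint.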