So we've described the various modules that make up ZFS, and how they fit into the picture of ZFS versus UFS. Now, what I want to do is look at how these modules all interact with each other.

In this picture here, we see the functional organization, and I'm going to show you the logical organization in a future lesson. In this picture we have a dotted line across the top, which divides user from kernel. So everything below that horizontal dotted line is inside the kernel, and the stuff above it is in user space. And we've also got the vertical dotted line, which divides the data-path-related stuff on the left from the more management-oriented stuff on the right. Now, you might find it a little odd that we would have a management interface to ZFS. But again, you need to remember that ZFS is doing much more than just being a filesystem. You have to be able to do things like tell it to take snapshots, manage its RAID array, create filesystems and set properties on them, and all kinds of other things, which is far more functionality than we get out of just the filesystem itself. Sometimes we have the ability to do that with UFS through configuration.
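Those management operations are all driven through the zfs and zpool commands. A minimal sketch, assuming a hypothetical pool called "tank" (the pool and dataset names here are made up for illustration, and the commands need a live pool to run against):

```shell
# Hypothetical pool/dataset names for illustration.
zfs create tank/home              # create a new filesystem within the pool
zfs set compression=on tank/home  # set a property on it
zfs snapshot tank/home@monday     # take a snapshot
zpool status tank                 # inspect the state of the pool's devices
```

All of these go through the one integrated management interface, rather than through separate per-subsystem tools as with UFS.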
And then there's a different set of things that do configuration for geom, and so on. And so in the case of ZFS, because everything else is integrated, the management layer just gets integrated as well.

All right, so let's start on the left with the data path. Here we have just the traditional applications that are doing read and write system calls, and they, of course, think that they are talking to a traditional POSIX filesystem. So they come through the VFS interface, just as they would for any other filesystem, and into the ZFS POSIX layer, which is the thing that's going to be doing all of the interpretation of the metadata, pathname lookup, and all those sorts of things. You'll notice that when we deal with a directory, we're going to go over and use the ZAP objects, because there's a ZAP object for each directory, storing all of the names and mappings to inodes and the other things that get put in a directory.

Also, as I/O is done, things are going to have to be put into the intent log. Everything that gets done after a checkpoint has to be put in that intent log, both data and metadata, so that after a crash we're going to be able to replay that log to put in all the changes that happened since the last checkpoint.
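The intent-log behavior just described can be observed and tuned per dataset through the sync property. A sketch, with a hypothetical dataset name:

```shell
zfs get sync tank/home           # default "standard": ZIL flushed on fsync
zfs set sync=always tank/home    # force every write through the intent log
zfs set sync=standard tank/home  # restore the default behavior
```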
And of course the ZIL itself is going to need to go down through the I/O system in order to get the data that's in that log committed onto stable store, and that's going to have to happen with some regularity. For example, any time an fsync comes in, that's going to require that the ZIL, up to that point, gets written to stable storage. So there's a fairly heavy traffic path from the ZIL down to the I/O layer, below the POSIX layer and the directory layer. And once we've figured out what file we're working with, we then have to go into the data management layer, which is the thing that's going to deal with getting us the actual disk blocks that we need to store the data into, and then of course it's potentially going to have to look things up in the ARC, the cache: if we're going to be overwriting existing blocks, we need the old contents, and if we're referencing directory blocks, there's a good chance that they're going to be sitting in the cache, and so on. The ARC, of course, is backed by a second-level cache, the L2ARC, to which the older stuff has been migrated, so that may have to come wandering in.
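On FreeBSD you can watch the ARC at work through the kstat sysctls that the ZFS port exports (a sketch; the counters shown assume the FreeBSD sysctl naming):

```shell
sysctl kstat.zfs.misc.arcstats.size    # current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.hits    # cache hits so far
sysctl kstat.zfs.misc.arcstats.misses  # cache misses so far
```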
And meanwhile, all of this stuff that's doing I/O is working through the I/O module there, which is in turn cooperating with the devices, and with RAID-Z if we're using that, and that's going to be sitting on top of geom. And again, it may be a virtual device from geom, but generally it's just the very bottom of geom, where the only thing that we're essentially dealing with is the actual raw hardware.

Okay, moving over to the right, to the other side of the data path there, we also have these things called zvols. A zvol is made to look like a raw disk partition; that is, you could think of it as really just one giant file. And it is one way of getting a slice off of a physical drive, but in fact what is generally done is that it is treated almost like a separate filesystem that has just one file in it. It looks to the outside world like it is a disk, so you can see that it can be exported to geom, and geom can do all of its regular magic on that and pass a virtual volume back up.
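A zvol is created like any other dataset, just with a fixed size. A sketch (the pool and volume names are made up; the device path assumes FreeBSD's /dev/zvol naming):

```shell
zfs create -V 10G tank/vol0  # carve out a 10 GB zvol
ls -l /dev/zvol/tank/vol0    # it appears as an ordinary disk device
```

From here, geom and any consumer of raw disks can use it like real hardware.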
So you can, for example, build a UFS filesystem on top of a zvol, or run a database on a zvol. The benefit of running on the zvol is that, first of all, you've got the intent log, so that you can keep it more up to date without having to actually do the write. But the other thing is that you can take snapshots of a zvol. So you can run a UFS filesystem on it, and it provides a really cheap way of taking snapshots of that UFS filesystem, which is much cheaper than actually having that filesystem running on a regular piece of raw disk hardware.

Okay, finally, we'll go over to the management side. There's a thing called /dev/zfs, which is the handle that you use to get access to the various commands that deal with ZFS. Among the sorts of things that you need to do, you can do configuration on devices: you can increase the size of your RAID pool, or your RAID-Z pool, or you can do something called a scrub. One of the problems with hard disks is that they have this bad habit that you write something there and it gets written perfectly well, and then, for no apparent reason, it just goes bad.
And of course, if you're not accessing it, you won't know that. If you actually try and read it, you'll get some kind of a read error and you'll discover that there's something wrong with it; but absent actually going out and reading it, you just don't know that that data has gone bad. And if enough blocks of data go bad, at some point you can lose a disk, and then when you go to reconstruct, you find that some of the blocks that you need to do the reconstruction have, horror of horrors, gone bad, and so you can't complete your reconstruction. So one of the features that ZFS provides is what's called a scrub command, and that says: go out onto my RAID-Z and read all the blocks that have been allocated by any filesystem. So it doesn't actually have to read every block on every disk; it just reads all the blocks that are actually in use and makes sure that they can be read. And if it finds one that doesn't read, or doesn't read easily (say it has to be reread several times before it will read cleanly), then it will reconstruct that block and rewrite it, or move it to some other place, so that in the future you'll be able to get access to it. The other thing that the management layer is responsible for is send and receive, which is essentially the dump and restore.
The difference is that it's coordinated within ZFS; it's not a separate program, as it is in UFS. And then, finally, we have the dataset and snapshot layer, the DSL. This is the thing that's dealing with reservations and quotas, and it will be updating various things in the data management unit to make sure that those things can, in fact, be enforced.

Just to recap: we have the filesystem and the logical disk access; we've got the management of the pool; and we've got the geom import and export. So we're importing to get the raw disks in, and we're exporting zvols out to the outside world.
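From the command line, the scrub and send/receive operations described above look roughly like this (the pool, dataset, snapshot, and host names are hypothetical, and a live pool is assumed):

```shell
zpool scrub tank   # read and verify every allocated block in the pool
zpool status tank  # shows scrub progress and any repaired blocks

# Replicate a snapshot to another machine, dump/restore style:
zfs send tank/home@monday | ssh backuphost zfs receive backup/home
```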