In this and the next couple of videos we're going to use pandas to read in a popular data set that's often used by people learning about data science, and work with that data set in the context of pandas to explore that data. Now, on the screen at the moment I have a URL for a popular data set repository called R Datasets. This repository has over 1,100 comma-separated-value data sets that you can play with, and one of them is the Titanic disaster data set, which has information about just over 1,300 passengers who were on the ship and what their fate was when the ship sank. Now, I wanted to show you this just so you have a sense of the fact that there are these kinds of massive repositories out there in which you can find data to work with and learn from as you get into data science, machine learning, and deep learning. The particular data set we'll be working with, like I said, has just over 1,300 records; as you can see, some of these have many fewer records.
Some of them have many more records, but this is just one data set repository of the many that are out there from which you can obtain data to work with for your own studies and for learning purposes as well. I just want to show you that if I search on this page for Titanic, you'll see that the first hit is the Titanic survival data set, and what I did to use this with pandas is I right-clicked the CSV link over here and copied the URL so that I can load the data set directly into pandas. With that said, let's switch over to a terminal window here; you can see I've already imported pandas. What we're going to do next is load that data set into a pandas DataFrame. Because the URL was long, I broke it into two pieces here and concatenated the strings, but you can see we're using that same read_csv function that we demonstrated in the preceding video, this time to load from a URL instead of from a local file on disk. So let's go ahead and read that in.
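The loading step described above might look something like the sketch below. The variable names are assumptions, and a small inline CSV stands in for the remote file so the example runs without network access; in the video, the same read_csv call simply takes the concatenated URL string instead.

```python
import io
import pandas as pd

# In the video the long R Datasets URL is split into two string literals that
# Python concatenates. Here a tiny made-up CSV stands in for the remote file.
csv_text = (
    '"","survived","sex","age","passengerClass"\n'
    '"Allen, Miss. Elisabeth Walton","yes","female",29,"1st"\n'
    '"Allison, Master. Hudson Trevor","yes","male",0.92,"1st"\n'
)

# read_csv accepts a URL, a local file path, or any file-like object.
titanic = pd.read_csv(io.StringIO(csv_text))
print(titanic.shape)  # (rows, columns)
```

Because read_csv treats a URL and a local path the same way, switching between the two is just a matter of changing the string you pass in.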
I'm also going to set an option within pandas to display floating-point numbers with two digits to the right of the decimal point, and that will enable me to compact the output in this example a little bit. set_option is a function in the pandas library; when you pass it the precision option, that indicates you're setting the display precision to 2 for all floating-point numbers that pandas formats. And that's for on-screen formatting purposes. I'll go ahead and set that as well. One of the great things about pandas is that it makes it really easy to explore your data, which is a key thing you need to do. You need to get to know your data before you can work with that data. For example, I can go ahead and type titanic.head() and it will give me just the first few rows of the Titanic data set to look at. The indices are just the indices that are provided automatically by pandas, starting from index number 0 for the first row. We have several columns of information here. This one is not named very well, but it's the column that specifies passenger names.
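The two settings just described could be sketched as follows. Note that recent pandas versions spell the option 'display.precision', while some older releases also accepted just 'precision'; the sample values here are made up for illustration.

```python
import pandas as pd

# Display all floats with two digits after the decimal point. Recent pandas
# uses the full option name 'display.precision'.
pd.set_option('display.precision', 2)

# A small frame with a fractional age, like the infants in the Titanic data.
df = pd.DataFrame({'age': [29.0, 0.9167, 2.0]})
print(df.head())  # head() shows the first rows (five by default)
```

With the option set, the 0.9167 value renders as 0.92; the underlying data is unchanged, only the on-screen formatting differs.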
We have a column called survived, which indicates whether they survived the disaster or not. We have a sex column, which indicates whether they were male or female; an age column for their age at the time of the disaster; and a passenger class column that indicates which class they were traveling in on the ship. Now, just like you can look at the first few, you also have the ability to look at the last few with the tail method. So these are the last 5 records of the data set, and all these folks were 3rd-class passengers. By the way, notice the NaN here in record number 305: that person didn't have an age recorded, so that's a missing piece of data in this particular data set. At this point we've now displayed just a few of the records so we can get a sense of what information is there. Let's also make a change. These column names are perhaps not as readable as they could be; it would be nice if this one were simply called name, and this one here is really wide while we have pretty narrow data, so let's also rename that one as well.
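The tail method and the NaN the narration points out could look like this in miniature. The two rows here are made up for illustration, with one missing age standing in for the gap in the real data set; isna() is one common way to flag such gaps programmatically.

```python
import numpy as np
import pandas as pd

# A tiny frame with one missing age, mirroring the NaN seen in the video.
titanic = pd.DataFrame({
    'name': ['Zimmerman, Mr. Leo', 'Zabour, Miss. Hileni'],
    'age': [29.0, np.nan],
})

print(titanic.tail(5))        # tail() shows the last rows (five by default)
print(titanic['age'].isna())  # True wherever a value is missing (NaN)
```

Missing values print as NaN, and isna() gives you a boolean mask you can use to count or filter them later.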
To do that, I'm going to set the DataFrame's columns attribute. I have 5 columns here, and I'm going to set all of those column names. For a few of them we're using the same names they had previously, but the first column will now be called name and the last column will now be called class. After doing that, if I go ahead and display the head of the DataFrame once again, you can see we now have our customized column headings displayed across the top of the DataFrame. So at this point we've loaded the DataFrame and we've started to get to know the data; in the next video we're going to do a little bit of simple data analysis with the data set.
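The renaming step above might be sketched like this. The starting headers are an assumption about how the R Datasets CSV loads (a blank header typically becomes "Unnamed: 0" in pandas), and the single sample row is made up:

```python
import pandas as pd

# One sample row under the original headers, as the CSV might load them.
titanic = pd.DataFrame(
    [['Allen, Miss. Elisabeth Walton', 'yes', 'female', 29.0, '1st']],
    columns=['Unnamed: 0', 'survived', 'sex', 'age', 'passengerClass'],
)

# Assigning to .columns replaces every header at once, so the list must name
# all five columns even though only the first and last actually change.
titanic.columns = ['name', 'survived', 'sex', 'age', 'class']
print(titanic.head())
```

If you only wanted to change a couple of headers without listing them all, the rename method with a mapping dictionary is a common alternative.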