In this video and the next couple of videos, we're going to continue our introduction to data science, and also continue our introduction to Pandas, by presenting how you can use Pandas with some regular expressions to do a little bit of what is known as data munging or data wrangling. It is said that data scientists spend about 75-80% of their time preparing their data for use in data science studies. So, this concept of getting data ready for a study is of great importance to anybody who's going to become a data scientist.

Now, two of the common operations performed as part of data munging or data wrangling are cleaning data to get it ready and transforming data. So, some operations that you might perform for cleaning up data are things like deleting observations that have missing values, because you won't be able to process them correctly, or possibly substituting reasonable values for missing values.
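As a minimal sketch of those two options in Pandas — the patient names and readings below are made up for illustration:

```python
import pandas as pd

# Hypothetical patients and readings; NaN marks a missing value.
df = pd.DataFrame({"patient": ["Adams", "Baker", "Clark"],
                   "temp_f": [98.6, None, 99.1]})

# Option 1: delete observations that have missing values.
dropped = df.dropna()

# Option 2: substitute a reasonable value, here the mean of the
# readings that did come through.
filled = df.fillna({"temp_f": df["temp_f"].mean()})
```

Which option is appropriate depends, as discussed, on the study and the data.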
And by the way, which items you choose to do to clean your data will often depend on the type of study and the type of data you're manipulating. Other cleaning operations include deleting observations that have bad values, possibly substituting reasonable values for bad values, and tossing outliers. In some studies you may want to get rid of values that are far outside of reasonable ranges, while in other studies those might be useful, so again, it does depend on the particular study. You may or may not want to perform duplicate elimination, and you may or may not want to deal with inconsistent data as well.

Now, one example of cleaning would be a patient's temperature readings in the hospital. We might have something like this, where we have the name of the patient and some temperature readings. 0.0 clearly is not a valid temperature reading for a patient in a hospital; maybe the sensor got disconnected or malfunctioned in some way, and therefore the reading did not come through properly.
So, if we were to average out the first three temperatures, we'd get 98.57, which is approximately correct for a person's body temperature. But if we average out all four of these, we would get only 73.93, which clearly, in Fahrenheit, is not a valid body temperature, and would potentially be of great concern to a doctor caring for that patient. So, a couple of ways we might clean this data would be to either delete the 0.0, recognizing that it can't possibly be right, or possibly replace it, maybe with the average of the other three temperatures in the list.

Now, another thing that we often do while munging is transforming our data. So, we might do things like removing unnecessary data, and features of the data that we don't need for a particular study. When you get into big data, you may be dealing with massive amounts of information, so removing the stuff you don't need can often save time and space. You may also want to combine related features.
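Returning to the temperature example for a moment, both cleaning options can be sketched in Pandas. The source gives only the two averages (98.57 and 73.93), so the individual readings below are hypothetical values chosen to match:

```python
import pandas as pd

# Hypothetical readings; the last one is a sensor failure.
temps = pd.Series([98.6, 98.4, 98.7, 0.0],
                  index=["9am", "1pm", "5pm", "9pm"])

bad_mean = temps.mean()                # about 73.93, clearly wrong
good_mean = temps[temps > 90].mean()   # about 98.57

# Option 1: delete the impossible reading.
cleaned = temps[temps > 90]

# Option 2: replace it with the average of the valid readings.
repaired = temps.where(temps > 90, good_mean)
```

The `temps > 90` cutoff is itself a judgment call: it encodes the assumption that no live patient reads below 90°F, which is exactly the kind of study-specific decision discussed above.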
So, let's say you had a first name and a last name, but you wanted to combine them into a single string, just as a very simple example. You may also want to take a random sample of a massive data set to get a representative subset of that data for more efficient processing. For example, when people are doing polling for presidential elections, they don't poll every single person in the United States; they poll a random sample of people, often just a few thousand, and then extrapolate from that information. You might need to standardize some of your data formats, or you might need to group data differently as well.

So, these are just a couple of the types of things you might do while preparing your data, and in the next couple of videos, we would like to take a look at some basic manipulations of data in the context of Pandas using regular expressions.
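To close, the two transformations mentioned above, combining related features and taking a random sample, can be sketched in Pandas; the names here are, of course, made up:

```python
import pandas as pd

# Hypothetical data with separate first- and last-name features.
df = pd.DataFrame({"first": ["Ada", "Grace", "Alan"],
                   "last": ["Lovelace", "Hopper", "Turing"]})

# Combine related features: one full-name column replaces two.
df["name"] = df["first"] + " " + df["last"]
df = df.drop(columns=["first", "last"])

# Take a random sample; a real data set would be far larger than
# three rows, and the sample far smaller than the population.
subset = df.sample(n=2, random_state=0)
```

Fixing `random_state` makes the sample reproducible, which is often useful when a study has to be rerun on the same representative subset.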