In this video and the next couple of videos, we're going to continue our introduction to data science, and also continue our introduction to Pandas, by presenting how you can use Pandas with some regular expressions to do a little bit of what is known as data munging or data wrangling. It is said that data scientists spend about 75-80% of their time preparing their data for use in data science studies. So, this concept of getting data ready for a study is of great importance to anybody who's going to become a data scientist.

Now, two of the common operations performed as part of data munging or data wrangling are cleaning data to get it ready and transforming data. So, some operations that you might perform for cleaning up data are things like deleting observations that have missing values, because you won't be able to process them correctly, or possibly substituting reasonable values for missing values.
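As a minimal sketch of those two options in Pandas — the patient names and readings below are made up for illustration:

```python
import pandas as pd

# Hypothetical patients and readings; NaN marks a missing value.
df = pd.DataFrame({"patient": ["Adams", "Baker", "Clark"],
                   "temp_f": [98.6, None, 99.1]})

# Option 1: delete observations that have missing values.
dropped = df.dropna()

# Option 2: substitute a reasonable value, here the mean of the
# readings that did come through.
filled = df.fillna({"temp_f": df["temp_f"].mean()})
```

Which option is appropriate depends, as discussed, on the study and the data.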
And by the way, which items you choose to do to clean your data will often depend on the type of study and the type of data you're manipulating. Other cleaning operations include deleting observations that have bad values, possibly substituting reasonable values for bad values, and tossing outliers. In some studies you may want to get rid of values that are far outside of reasonable ranges, while in other studies those might be useful, so again, it does depend on the particular study. You may or may not want to perform duplicate elimination, and you may or may not want to deal with inconsistent data as well.

Now, one example of cleaning would be a patient's temperature readings in the hospital. We might have something like this, where we have the name of the patient and some temperature readings. 0.0 clearly is not a valid temperature reading for a patient in a hospital; maybe the sensor got disconnected or malfunctioned in some way, and therefore the reading did not come through properly.
So, if we were to average out the first three temperatures, we'd get 98.57, which is approximately correct for a person's body temperature. But if we average out all four of these, we would get only 73.93, which clearly, in Fahrenheit, is not a valid body temperature, and would potentially be of great concern to a doctor caring for that patient. So, a couple of ways we might clean this data would be to either delete the 0.0, recognizing that it can't possibly be right, or possibly replace it, maybe with the average of the other three temperatures in the list.

Now, another thing that we often do while munging is transforming our data. So, we might do things like removing unnecessary data, and features of the data that we don't need for a particular study. When you get into big data, you may be dealing with massive amounts of information, so removing the stuff you don't need can often save time and space. You may also want to combine related features.
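Returning to the temperature example for a moment, both cleaning options can be sketched in Pandas. The source gives only the two averages (98.57 and 73.93), so the individual readings below are hypothetical values chosen to match:

```python
import pandas as pd

# Hypothetical readings; the last one is a sensor failure.
temps = pd.Series([98.6, 98.4, 98.7, 0.0],
                  index=["9am", "1pm", "5pm", "9pm"])

bad_mean = temps.mean()                # about 73.93, clearly wrong
good_mean = temps[temps > 90].mean()   # about 98.57

# Option 1: delete the impossible reading.
cleaned = temps[temps > 90]

# Option 2: replace it with the average of the valid readings.
repaired = temps.where(temps > 90, good_mean)
```

The `temps > 90` cutoff is itself a judgment call: it encodes the assumption that no live patient reads below 90°F, which is exactly the kind of study-specific decision discussed above.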
So, let's say you had a first name and a last name, but you wanted to combine them into a single string, just as a very simple example. You may also want to take a random sample of a massive data set to get a representative subset of that data for more efficient processing. For example, when people are doing polling for presidential elections, they don't poll every single person in the United States; they poll a random sample of people, often just a few thousand, and then extrapolate from that information. You might need to standardize some of your data formats, or you might need to group data differently as well.

So, these are just a couple of the types of things you might do while preparing your data, and in the next couple of videos, we would like to take a look at some basic manipulations of data in the context of Pandas using regular expressions.
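To close, the two transformations mentioned above, combining related features and taking a random sample, can be sketched in Pandas; the names here are, of course, made up:

```python
import pandas as pd

# Hypothetical data with separate first- and last-name features.
df = pd.DataFrame({"first": ["Ada", "Grace", "Alan"],
                   "last": ["Lovelace", "Hopper", "Turing"]})

# Combine related features: one full-name column replaces two.
df["name"] = df["first"] + " " + df["last"]
df = df.drop(columns=["first", "last"])

# Take a random sample; a real data set would be far larger than
# three rows, and the sample far smaller than the population.
subset = df.sample(n=2, random_state=0)
```

Fixing `random_state` makes the sample reproducible, which is often useful when a study has to be rerun on the same representative subset.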