1
00:00:00,330 --> 00:00:04,578
- In my previous two Intro to
Data Science presentations,

2
00:00:04,578 --> 00:00:08,760
I focused on the Pandas
library for working

3
00:00:08,760 --> 00:00:11,610
with one-dimensional
and two-dimensional data

4
00:00:11,610 --> 00:00:14,580
as Series and DataFrames, respectively.

5
00:00:14,580 --> 00:00:17,010
Here I want to continue that discussion,

6
00:00:17,010 --> 00:00:21,060
bringing in Pandas with CSV files.

7
00:00:21,060 --> 00:00:22,260
So, as you can see,

8
00:00:22,260 --> 00:00:24,650
I've already imported the Pandas library,

9
00:00:24,650 --> 00:00:28,080
and it turns out that
Pandas makes it super easy

10
00:00:28,080 --> 00:00:30,770
to read an entire CSV file

11
00:00:30,770 --> 00:00:32,540
into a Python program,

12
00:00:32,540 --> 00:00:37,540
and also to create CSV files
from data frames, as well.

13
00:00:37,600 --> 00:00:40,380
So let's paste in a statement here.

14
00:00:40,380 --> 00:00:42,410
We're going to create a data frame

15
00:00:42,410 --> 00:00:46,721
by calling the Pandas
library's read_csv function;

16
00:00:46,721 --> 00:00:49,540
the name of the file that we want to read

17
00:00:49,540 --> 00:00:51,010
is provided as an argument.

18
00:00:51,010 --> 00:00:54,470
And I'm assuming this
file is in the same folder

19
00:00:54,470 --> 00:00:56,660
from where I launched IPython,

20
00:00:56,660 --> 00:00:58,400
and it is in this case.

21
00:00:58,400 --> 00:01:02,700
And the second argument
is the column names

22
00:01:02,700 --> 00:01:05,660
for the new data frame
that we're going to create.
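
Here's a rough sketch of what that statement might look like. Only the accounts.csv file name and the account, name, and balance column names come from the video; the five sample records and the temporary folder are made up so the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample records (the video does not show the file's contents).
sample = (
    "100,Jones,24.98\n"
    "200,Doe,345.67\n"
    "300,White,0.0\n"
    "400,Stone,-42.16\n"
    "500,Rich,224.62\n"
)

# Write the sample to a temporary folder so the example is self-contained.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'accounts.csv')
with open(path, 'w') as f:
    f.write(sample)

# names= supplies the column names, because the file has no header row.
df = pd.read_csv(path, names=['account', 'name', 'balance'])
print(df)
```

Because no custom index is given, the rows get the default integer index.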

23
00:01:05,660 --> 00:01:08,866
Now, if you do not provide
this second argument,

24
00:01:08,866 --> 00:01:12,190
the read_csv function is going to assume

25
00:01:12,190 --> 00:01:16,244
that the very first row in the
comma-separated values file

26
00:01:16,244 --> 00:01:19,840
contains the names of the columns already.

27
00:01:19,840 --> 00:01:22,550
So, in our case, you may recall

28
00:01:22,550 --> 00:01:26,070
that we did not write out
a line of text indicating

29
00:01:26,070 --> 00:01:30,290
what the column names were
in the accounts.csv file.

30
00:01:30,290 --> 00:01:33,940
So if we do not include this,
then it would accidentally

31
00:01:33,940 --> 00:01:36,240
use the very first record of information

32
00:01:36,240 --> 00:01:39,690
as the column names in
our Pandas data frame.
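
To see that pitfall for yourself, here's a minimal sketch that reads the same headerless text with and without the names argument. The two records are hypothetical, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical headerless CSV text with two records.
sample = "100,Jones,24.98\n200,Doe,345.67\n"

# Without names=, the first record is consumed as the header row.
no_names = pd.read_csv(io.StringIO(sample))
print(list(no_names.columns))
print(len(no_names))

# With names=, every line is treated as data.
with_names = pd.read_csv(io.StringIO(sample),
                         names=['account', 'name', 'balance'])
print(list(with_names.columns))
print(len(with_names))
```

The first read loses a record to the header, so it ends up with one row instead of two.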

33
00:01:39,690 --> 00:01:42,120
So let's go ahead and
create that data frame,

34
00:01:42,120 --> 00:01:45,570
and of course we can evaluate
it to see what it looks like.

35
00:01:45,570 --> 00:01:47,720
And you'll notice that
we get the account, name,

36
00:01:47,720 --> 00:01:50,170
and balance columns as specified

37
00:01:50,170 --> 00:01:52,210
by the second argument up here.

38
00:01:52,210 --> 00:01:55,030
It provides indices for the rows,

39
00:01:55,030 --> 00:01:56,670
which are just zero through four

40
00:01:56,670 --> 00:01:58,250
by default because we didn't

41
00:01:58,250 --> 00:02:01,220
give it custom index values.

42
00:02:01,220 --> 00:02:03,860
And you can in fact see the same data

43
00:02:03,860 --> 00:02:06,910
that we've worked with
previously in text files

44
00:02:06,910 --> 00:02:10,160
and in the previous couple of videos

45
00:02:10,160 --> 00:02:14,080
with our presentation of the csv module.

46
00:02:14,080 --> 00:02:17,100
Now, in addition to being able to create

47
00:02:17,100 --> 00:02:19,140
data frames really easily

48
00:02:19,140 --> 00:02:22,610
from a properly formatted CSV file,

49
00:02:22,610 --> 00:02:27,610
you also have the ability
to output a CSV file

50
00:02:27,870 --> 00:02:29,530
from a data frame.

51
00:02:29,530 --> 00:02:33,480
So let's assume you've
already created a data frame.

52
00:02:33,480 --> 00:02:36,800
You've done all sorts of
data wrangling and munging

53
00:02:36,800 --> 00:02:39,030
to get the data into
the form that you want,

54
00:02:39,030 --> 00:02:40,995
and now what you want to do is create

55
00:02:40,995 --> 00:02:44,410
a comma-separated values file that stores

56
00:02:44,410 --> 00:02:47,380
that information in a persistent manner.

57
00:02:47,380 --> 00:02:50,110
So I'm going to go
ahead and copy and paste

58
00:02:50,110 --> 00:02:54,320
a statement in here that's
going to demonstrate

59
00:02:54,320 --> 00:02:59,320
the to_csv method
of a data frame.

60
00:03:00,050 --> 00:03:03,530
The first argument will be
the filename that you specify

61
00:03:03,530 --> 00:03:06,100
for the comma-separated values file.

62
00:03:06,100 --> 00:03:09,030
Of course, we did not
provide any path information,

63
00:03:09,030 --> 00:03:12,430
so this will simply be
placed in the same folder

64
00:03:12,430 --> 00:03:14,680
from which I launched IPython.

65
00:03:14,680 --> 00:03:18,270
And the second argument,
index=False, simply means

66
00:03:18,270 --> 00:03:22,600
that we should not output the
zero, one, two, three, four

67
00:03:22,600 --> 00:03:26,980
index numbers as part of the
comma-separated values file.

68
00:03:26,980 --> 00:03:29,580
So what's actually going to get written is

69
00:03:29,580 --> 00:03:32,890
everything that you see
in these three columns:

70
00:03:32,890 --> 00:03:34,620
the account column, the name column,

71
00:03:34,620 --> 00:03:35,860
and the balance column.

72
00:03:35,860 --> 00:03:38,660
And the first row of the file will be

73
00:03:38,660 --> 00:03:41,350
the strings account, name, and balance

74
00:03:41,350 --> 00:03:46,350
separated by commas, so
that it includes in the file

75
00:03:46,650 --> 00:03:50,640
what the purpose of each
value is in a given record

76
00:03:50,640 --> 00:03:51,850
of information.
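
As a sketch of that call, the snippet below builds a small data frame, writes it with index=False, and reads the text back. The output file name accounts_copy.csv and the three records are hypothetical; the video does not name the output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records using the column names from the video.
df = pd.DataFrame({
    'account': [100, 200, 300],
    'name': ['Jones', 'Doe', 'White'],
    'balance': [24.98, 345.67, 0.0],
})

# Hypothetical output file name, placed in a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), 'accounts_copy.csv')

# index=False leaves the 0, 1, 2 row labels out of the file.
df.to_csv(out_path, index=False)

with open(out_path) as f:
    text = f.read()

print(text)  # first line: account,name,balance
```

Leaving out index=False would add an unnamed leading column holding the row labels.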

77
00:03:51,850 --> 00:03:53,440
So I'll go ahead and do that,

78
00:03:53,440 --> 00:03:57,120
and to show you that
it did write the file,

79
00:03:57,120 --> 00:03:59,100
and also that it wrote out

80
00:03:59,100 --> 00:04:02,130
the line of column names,

81
00:04:02,130 --> 00:04:06,800
let's display the contents
of the file here in IPython,

82
00:04:06,800 --> 00:04:10,090
and notice that we have
this one line of text,

83
00:04:10,090 --> 00:04:13,120
which is common by the way
in most of the data sets

84
00:04:13,120 --> 00:04:15,340
that you have access to online.

85
00:04:15,340 --> 00:04:17,740
Typically the first line is going to have

86
00:04:17,740 --> 00:04:19,770
the column names to give you a sense

87
00:04:19,770 --> 00:04:21,830
of what the purpose of each column is

88
00:04:21,830 --> 00:04:23,870
in a given dataset file.

89
00:04:23,870 --> 00:04:28,810
So as you can see, it's
really easy to load a CSV file

90
00:04:28,810 --> 00:04:30,800
into a data frame for processing,

91
00:04:30,800 --> 00:04:34,280
and also once you've processed
your data in the data frame,

92
00:04:34,280 --> 00:04:38,583
to save that information
out to a CSV file as well.