1 00:00:00,330 --> 00:00:04,578 - In my previous two Intro to Data Science presentations, 2 00:00:04,578 --> 00:00:08,760 I focused on the Pandas library for working 3 00:00:08,760 --> 00:00:11,610 with one-dimensional and two-dimensional data 4 00:00:11,610 --> 00:00:14,580 as series and data frames, respectively. 5 00:00:14,580 --> 00:00:17,010 Here I want to continue that discussion, 6 00:00:17,010 --> 00:00:21,060 bringing in Pandas with CSV files. 7 00:00:21,060 --> 00:00:22,260 So, as you can see, 8 00:00:22,260 --> 00:00:24,650 I've already imported the Pandas library, 9 00:00:24,650 --> 00:00:28,080 and it turns out that Pandas makes it super easy 10 00:00:28,080 --> 00:00:30,770 to read an entire CSV file 11 00:00:30,770 --> 00:00:32,540 into a Python program, 12 00:00:32,540 --> 00:00:37,540 and also to create CSV files from data frames. 13 00:00:37,600 --> 00:00:40,380 So let's paste in a statement here. 14 00:00:40,380 --> 00:00:42,410 We're going to create a data frame 15 00:00:42,410 --> 00:00:46,721 by calling the Pandas library's read_csv function. 16 00:00:46,721 --> 00:00:49,540 The name of the file that we want to read 17 00:00:49,540 --> 00:00:51,010 is provided as an argument. 18 00:00:51,010 --> 00:00:54,470 And I'm assuming this file is in the same folder 19 00:00:54,470 --> 00:00:56,660 from which I launched IPython, 20 00:00:56,660 --> 00:00:58,400 and it is in this case. 21 00:00:58,400 --> 00:01:02,700 And the second argument is the column names 22 00:01:02,700 --> 00:01:05,660 for the new data frame that we're going to create. 23 00:01:05,660 --> 00:01:08,866 Now, if you do not provide this second argument, 24 00:01:08,866 --> 00:01:12,190 the read_csv function is going to assume 25 00:01:12,190 --> 00:01:16,244 that the very first row in the comma-separated values file 26 00:01:16,244 --> 00:01:19,840 contains the names of the columns already. 
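The read_csv behavior described here can be sketched as follows. This is a minimal illustration, not the statement pasted in the video: the five records are hypothetical stand-ins for the contents of accounts.csv, and an in-memory `io.StringIO` buffer replaces the file so the snippet runs anywhere. Passing the column names through the `names` keyword is how pandas accepts them.

```python
import io
import pandas as pd

# Hypothetical stand-in for accounts.csv: five records and NO header line.
csv_text = (
    "100,Jones,24.98\n"
    "200,Doe,345.67\n"
    "300,White,0.0\n"
    "400,Stone,-42.16\n"
    "500,Rich,224.62\n"
)

# With names=..., every line in the file is treated as a data record.
df = pd.read_csv(io.StringIO(csv_text), names=['account', 'name', 'balance'])

# Without names=..., the first record is consumed as the column names,
# so only four data rows remain.
df_no_names = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df_no_names.columns.tolist())
```

With a real file, the first argument would simply be the filename (for example `'accounts.csv'`) in place of the `StringIO` object.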
27 00:01:19,840 --> 00:01:22,550 So, in our case, you may recall 28 00:01:22,550 --> 00:01:26,070 that we did not write out a line of text indicating 29 00:01:26,070 --> 00:01:30,290 what the column names were in the accounts.csv file. 30 00:01:30,290 --> 00:01:33,940 So if we do not include this argument, then it would accidentally 31 00:01:33,940 --> 00:01:36,240 use the very first record of information 32 00:01:36,240 --> 00:01:39,690 as the column names in our Pandas data frame. 33 00:01:39,690 --> 00:01:42,120 So let's go ahead and create that data frame, 34 00:01:42,120 --> 00:01:45,570 and of course we can evaluate it to see what it looks like. 35 00:01:45,570 --> 00:01:47,720 And you'll notice that we get the account, name, 36 00:01:47,720 --> 00:01:50,170 and balance columns as specified 37 00:01:50,170 --> 00:01:52,210 by the second argument up here. 38 00:01:52,210 --> 00:01:55,030 It provides indices for the rows, 39 00:01:55,030 --> 00:01:56,670 which are just zero through four 40 00:01:56,670 --> 00:01:58,250 by default because we didn't 41 00:01:58,250 --> 00:02:01,220 give it custom index values. 42 00:02:01,220 --> 00:02:03,860 And you can in fact see the same data 43 00:02:03,860 --> 00:02:06,910 that we've worked with previously in text files 44 00:02:06,910 --> 00:02:10,160 and in the previous couple of videos 45 00:02:10,160 --> 00:02:14,080 with our presentation of the csv module. 46 00:02:14,080 --> 00:02:17,100 Now, in addition to being able to create 47 00:02:17,100 --> 00:02:19,140 data frames really easily 48 00:02:19,140 --> 00:02:22,610 from a properly formatted CSV file, 49 00:02:22,610 --> 00:02:27,610 you also have the ability to output a CSV file 50 00:02:27,870 --> 00:02:29,530 from a data frame. 51 00:02:29,530 --> 00:02:33,480 So let's assume you've already created a data frame. 
52 00:02:33,480 --> 00:02:36,800 You've done all sorts of data wrangling and munging 53 00:02:36,800 --> 00:02:39,030 to get the data into the form that you want, 54 00:02:39,030 --> 00:02:40,995 and now what you want to do is create 55 00:02:40,995 --> 00:02:44,410 a comma-separated values file that stores 56 00:02:44,410 --> 00:02:47,380 that information in a persistent manner. 57 00:02:47,380 --> 00:02:50,110 So I'm going to go ahead and copy and paste 58 00:02:50,110 --> 00:02:54,320 a statement in here that's going to demonstrate 59 00:02:54,320 --> 00:02:59,320 the to_csv method of a data frame. 60 00:03:00,050 --> 00:03:03,530 The first argument will be the filename that you specify 61 00:03:03,530 --> 00:03:06,100 for the comma-separated values file. 62 00:03:06,100 --> 00:03:09,030 Of course, we did not provide any path information, 63 00:03:09,030 --> 00:03:12,430 so this will simply be placed in the same folder 64 00:03:12,430 --> 00:03:14,680 from which I launched IPython. 65 00:03:14,680 --> 00:03:18,270 And the second argument, index=False, simply means 66 00:03:18,270 --> 00:03:22,600 that we should not output the zero, one, two, three, four 67 00:03:22,600 --> 00:03:26,980 index numbers as part of the comma-separated values file. 68 00:03:26,980 --> 00:03:29,580 So what's actually going to get written is 69 00:03:29,580 --> 00:03:32,890 everything that you see in these three columns: 70 00:03:32,890 --> 00:03:34,620 the account column, the name column, 71 00:03:34,620 --> 00:03:35,860 and the balance column. 72 00:03:35,860 --> 00:03:38,660 And the first row of the file will be 73 00:03:38,660 --> 00:03:41,350 the strings account, name, and balance 74 00:03:41,350 --> 00:03:46,350 separated by commas, so that it includes in the file 75 00:03:46,650 --> 00:03:50,640 what the purpose of each value is in a given record 76 00:03:50,640 --> 00:03:51,850 of information. 
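The effect of index=False can be sketched like this. It is an illustration under stated assumptions rather than the statement used in the video: the two-row data frame and the output filename `accounts_out.csv` are hypothetical, and the file is written to a temporary folder instead of the folder IPython was launched from.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-record data frame standing in for the accounts data.
df = pd.DataFrame({
    'account': [100, 200],
    'name': ['Jones', 'Doe'],
    'balance': [24.98, 345.67],
})

# Hypothetical output location: a temporary folder keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), 'accounts_out.csv')

# index=False leaves the 0, 1, ... row labels out of the file.
df.to_csv(out_path, index=False)

with open(out_path) as f:
    lines = f.read().splitlines()

# For contrast: with no path, to_csv returns the CSV text, and the default
# index=True adds a leading column holding the row labels.
with_index = df.to_csv().splitlines()

print(lines)
print(with_index)
```

Comparing the two printed lists shows what the argument changes: the first starts with the column names alone, while the second has an extra leading field on every line.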
77 00:03:51,850 --> 00:03:53,440 So I'll go ahead and do that, 78 00:03:53,440 --> 00:03:57,120 and to show you that it did write the file, 79 00:03:57,120 --> 00:03:59,100 and also that it wrote out 80 00:03:59,100 --> 00:04:02,130 the line of column names, 81 00:04:02,130 --> 00:04:06,800 let's display the contents of the file here in IPython, 82 00:04:06,800 --> 00:04:10,090 and notice that we have this one line of text, 83 00:04:10,090 --> 00:04:13,120 which is common, by the way, in most of the data sets 84 00:04:13,120 --> 00:04:15,340 that you have access to online. 85 00:04:15,340 --> 00:04:17,740 Typically the first line is going to have 86 00:04:17,740 --> 00:04:19,770 the column names to give you a sense 87 00:04:19,770 --> 00:04:21,830 of what the purpose of each column is 88 00:04:21,830 --> 00:04:23,870 in a given data set file. 89 00:04:23,870 --> 00:04:28,810 So as you can see, it's really easy to load a CSV file 90 00:04:28,810 --> 00:04:30,800 into a data frame for processing, 91 00:04:30,800 --> 00:04:34,280 and also once you've processed your data in the data frame, 92 00:04:34,280 --> 00:04:38,583 to save that information out to a CSV file as well.
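Because the written file now starts with a line of column names, it can be loaded again without supplying the names. The sketch below assumes a small hypothetical file in that shape, held in an in-memory buffer so it runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first line holds the column names,
# as is typical of data sets found online.
csv_with_header = (
    "account,name,balance\n"
    "100,Jones,24.98\n"
    "200,Doe,345.67\n"
)

# No names argument needed: read_csv takes the column names from the first line.
df = pd.read_csv(io.StringIO(csv_with_header))

print(df.columns.tolist())
print(len(df))
```

This is the same default that would misread a file with no header line, which is why the column names had to be passed explicitly for accounts.csv.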