1 00:00:00,630 --> 00:00:02,430 - [Instructor] In this and the next several videos, 2 00:00:02,430 --> 00:00:04,440 we'll be taking a look at our next 3 00:00:04,440 --> 00:00:06,610 Intro to Data Science section 4 00:00:06,610 --> 00:00:11,610 on working with CSV files or comma-separated value files. 5 00:00:12,010 --> 00:00:14,500 Now, many of the most popular datasets 6 00:00:14,500 --> 00:00:16,080 for learning data science 7 00:00:16,080 --> 00:00:18,390 and artificial intelligence techniques 8 00:00:18,390 --> 00:00:21,850 are provided in comma-separated value form, 9 00:00:21,850 --> 00:00:24,040 so it is quite handy to understand 10 00:00:24,040 --> 00:00:27,541 how to process comma-separated value files. 11 00:00:27,541 --> 00:00:30,890 Now, it also turns out that a lot of the datasets 12 00:00:30,890 --> 00:00:33,323 we will be working with in higher-end lessons 13 00:00:33,323 --> 00:00:36,230 are bundled with the various libraries 14 00:00:36,230 --> 00:00:37,610 that we'll be presenting, 15 00:00:37,610 --> 00:00:40,460 and they're often bundled in such a way 16 00:00:40,460 --> 00:00:42,240 that you can load up datasets 17 00:00:42,240 --> 00:00:46,610 with a simple line of code that handles all the details 18 00:00:46,610 --> 00:00:50,500 of processing the comma-separated value files for you. 19 00:00:50,500 --> 00:00:53,932 So, although I am going to show you some basics here, 20 00:00:53,932 --> 00:00:57,710 initially, of working with comma-separated value files, 21 00:00:57,710 --> 00:01:00,710 a lot of times that will be hidden from you 22 00:01:00,710 --> 00:01:03,300 by the libraries that you're using. 23 00:01:03,300 --> 00:01:04,900 So, in this video in particular, 24 00:01:04,900 --> 00:01:06,690 we're going to start by presenting 25 00:01:06,690 --> 00:01:09,420 some basic capabilities of the Python 26 00:01:09,420 --> 00:01:12,480 standard library module CSV, 27 00:01:12,480 --> 00:01:15,070 which stands for comma-separated values. 
28 00:01:15,070 --> 00:01:17,083 So, I'm gonna go ahead and import that, 29 00:01:17,940 --> 00:01:20,220 and to start out, what I want to demonstrate 30 00:01:20,220 --> 00:01:22,670 is writing to a CSV file. 31 00:01:22,670 --> 00:01:24,850 So, for that purpose, I've gone ahead 32 00:01:24,850 --> 00:01:27,660 and pasted in now a with statement 33 00:01:27,660 --> 00:01:31,570 that's very similar to the text file processing statements 34 00:01:31,570 --> 00:01:35,740 that we demonstrated in earlier videos of this lesson. 35 00:01:35,740 --> 00:01:38,880 The first thing we're going to do is open up a file, 36 00:01:38,880 --> 00:01:42,340 and in this case, we've specified accounts.csv. 37 00:01:42,340 --> 00:01:44,993 CSV is a common file name extension 38 00:01:44,993 --> 00:01:48,470 for comma-separated value files. 39 00:01:48,470 --> 00:01:51,390 Now, we're opening this with the w mode 40 00:01:51,390 --> 00:01:54,560 so if the file does not exist, it will be created. 41 00:01:54,560 --> 00:01:56,840 If it does exist, it will be wiped out, 42 00:01:56,840 --> 00:02:00,390 and we will be writing a new file from scratch. 43 00:02:00,390 --> 00:02:03,130 And if you look at the CSV module's 44 00:02:03,130 --> 00:02:06,200 online documentation at python.org, 45 00:02:06,200 --> 00:02:07,800 you'll see that they recommend 46 00:02:07,800 --> 00:02:10,495 as a third argument to the open function 47 00:02:10,495 --> 00:02:13,930 newline= empty string. 48 00:02:13,930 --> 00:02:15,957 And what this is going to do is enable 49 00:02:15,957 --> 00:02:19,820 the CSV module to properly handle 50 00:02:19,820 --> 00:02:22,500 new line characters on its own. 51 00:02:22,500 --> 00:02:24,438 That's actually built into the module. 
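As a rough sketch of the open call described above, the snippet below opens a file in w mode with newline='' and then inspects the raw bytes. The temporary directory and the sample record are assumptions made for illustration; the video works with accounts.csv in the current folder.

```python
import csv
import os
import tempfile

# Hypothetical scratch location so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'accounts.csv')

# The file does not exist yet, so 'w' mode creates it.
with open(path, mode='w', newline='') as accounts:
    accounts.write('old contents that will be wiped out\n')

# Opening in 'w' mode again discards the previous contents.
with open(path, mode='w', newline='') as accounts:
    csv.writer(accounts).writerow([100, 'Jones', 24.98])  # illustrative record

# Read the raw bytes to see exactly what ended up on disk.
with open(path, 'rb') as raw:
    data = raw.read()

print(data)
```

Because newline='' turns off newline translation, the line ending chosen by the csv module (by default a carriage return followed by a line feed) reaches the file unchanged on every platform.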
52 00:02:24,438 --> 00:02:27,030 If you do not include this, 53 00:02:27,030 --> 00:02:30,180 then new lines will not be handled correctly 54 00:02:30,180 --> 00:02:33,200 across all platforms, and that may result 55 00:02:33,200 --> 00:02:37,000 in some problems in your CSV handling code. 56 00:02:37,000 --> 00:02:38,610 So, we're going to open, in this case, 57 00:02:38,610 --> 00:02:41,549 accounts.csv as accounts, so that will be 58 00:02:41,549 --> 00:02:45,230 the variable name we use to refer to the file object, 59 00:02:45,230 --> 00:02:47,350 and this is similar in concept 60 00:02:47,350 --> 00:02:51,370 to what we did to write into a text file, 61 00:02:51,370 --> 00:02:53,530 but the first thing we do in the context 62 00:02:53,530 --> 00:02:57,290 of CSV files is obtain a writer. 63 00:02:57,290 --> 00:03:00,810 So, there's a function in the CSV module called writer. 64 00:03:00,810 --> 00:03:03,440 You give it a file object as an argument, 65 00:03:03,440 --> 00:03:05,690 and it gives you back a new object 66 00:03:05,690 --> 00:03:09,010 called the writer that is used to write 67 00:03:09,010 --> 00:03:14,010 specifically comma-separated values into that text file. 68 00:03:14,250 --> 00:03:16,300 Now, in this case, we've chosen to write 69 00:03:16,300 --> 00:03:19,470 five separate writerow statements. 70 00:03:19,470 --> 00:03:21,650 Every writerow statement is going to create 71 00:03:21,650 --> 00:03:25,340 a line of text in the comma-separated value file, 72 00:03:25,340 --> 00:03:29,660 and the argument to writerow is a sequence of values 73 00:03:29,660 --> 00:03:33,650 that will be output as comma-separated values. 74 00:03:33,650 --> 00:03:36,397 So, we're using the same records of information 75 00:03:36,397 --> 00:03:38,730 that we demonstrated earlier 76 00:03:38,730 --> 00:03:41,963 when we introduced text file processing. 
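The with statement described above might look something like the following sketch. The five records are placeholder values chosen for illustration, since the transcript does not list them, and the temporary path is an assumption so the example stays self-contained.

```python
import csv
import os
import tempfile

# Hypothetical scratch location; the video writes accounts.csv in place.
path = os.path.join(tempfile.mkdtemp(), 'accounts.csv')

with open(path, mode='w', newline='') as accounts:
    writer = csv.writer(accounts)  # wraps the file object
    # Each writerow call produces one line of comma-separated values.
    writer.writerow([100, 'Jones', 24.98])   # placeholder records
    writer.writerow([200, 'Doe', 345.67])
    writer.writerow([300, 'White', 0.0])
    writer.writerow([400, 'Stone', -42.16])
    writer.writerow([500, 'Rich', 224.62])

# Read the file back as plain text to check what was written.
with open(path, newline='') as accounts:
    text = accounts.read()

print(text)
```

Each argument to writerow here is a list, but any sequence of values, such as a tuple, would work the same way.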
77 00:03:43,150 --> 00:03:46,737 So, let's go ahead and execute this with statement, 78 00:03:46,737 --> 00:03:49,920 and at this point, we've now created 79 00:03:49,920 --> 00:03:51,920 the comma-separated value file 80 00:03:51,920 --> 00:03:55,660 and remember that you can use !cat, 81 00:03:55,660 --> 00:03:58,800 or !more if you're on Windows, 82 00:03:58,800 --> 00:04:02,920 to view the contents of the file that you just created 83 00:04:02,920 --> 00:04:05,160 and make sure it actually got written correctly. 84 00:04:05,160 --> 00:04:07,350 So, let's go ahead and take a look 85 00:04:07,350 --> 00:04:12,060 at the accounts.csv file here, and you can see that, indeed, 86 00:04:12,060 --> 00:04:14,730 we have comma-separated values. 87 00:04:14,730 --> 00:04:17,509 Now, we happen to have written values, 88 00:04:17,509 --> 00:04:19,750 in particular for these strings, 89 00:04:19,750 --> 00:04:22,255 that were one string with no space characters 90 00:04:22,255 --> 00:04:25,390 or other information in them. 91 00:04:25,390 --> 00:04:27,843 If you had, for example, 92 00:04:28,720 --> 00:04:32,680 Jones, up here, Jones comma Sue, 93 00:04:32,680 --> 00:04:33,780 then the 94 00:04:35,520 --> 00:04:37,955 CSV module would actually output that 95 00:04:37,955 --> 00:04:41,750 with double quote characters around the entire field. 96 00:04:41,750 --> 00:04:43,941 So, if there are commas in the strings 97 00:04:43,941 --> 00:04:45,800 that you are writing out, 98 00:04:45,800 --> 00:04:48,700 they will be enclosed in double quote characters 99 00:04:48,700 --> 00:04:52,050 automatically by the CSV module, 100 00:04:52,050 --> 00:04:55,909 and similarly, when you read in that CSV file, 101 00:04:55,909 --> 00:04:58,580 it will see those double quote characters 102 00:04:58,580 --> 00:05:01,230 and know to treat everything within them 103 00:05:01,230 --> 00:05:05,020 as one field of information to read in. 
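To see the automatic quoting in action, here is a small sketch that writes one record containing a comma and reads it back. It uses an in-memory io.StringIO buffer instead of a file, which is an assumption made to keep the example short; the record values are illustrative.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for a file object
writer = csv.writer(buffer)
writer.writerow([100, 'Jones, Sue', 24.98])  # the name contains a comma

line = buffer.getvalue()
print(line)

# Rewind and read the same text back with a csv reader.
buffer.seek(0)
record = next(csv.reader(buffer))
print(record)
```

The writer wraps only the field that contains the comma in double quotes, and the reader returns that field as a single string with the quotes removed.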
104 00:05:05,020 --> 00:05:06,450 Now, another thing I want to point out 105 00:05:06,450 --> 00:05:08,580 about these five statements, here again, 106 00:05:08,580 --> 00:05:12,460 we chose writerow, which writes one row at a time, 107 00:05:12,460 --> 00:05:15,610 but the writer object also has a method 108 00:05:15,610 --> 00:05:19,019 called writerows, which can receive 109 00:05:19,019 --> 00:05:23,130 a nested list, for example, a list of lists, 110 00:05:23,130 --> 00:05:26,040 and write every row of that list 111 00:05:26,040 --> 00:05:29,150 as a separate line of text in the file. 112 00:05:29,150 --> 00:05:31,340 So, we could have combined all of this 113 00:05:31,340 --> 00:05:34,510 into a single statement if we wanted. 114 00:05:34,510 --> 00:05:38,150 So, now that we've written the data out to the CSV file, 115 00:05:38,150 --> 00:05:41,730 let's also prove that we can read that data back in. 116 00:05:41,730 --> 00:05:44,990 And, again, we'll use the CSV module for that. 117 00:05:44,990 --> 00:05:46,920 Here, I've pasted in a with statement 118 00:05:46,920 --> 00:05:50,800 that's going to open up the file, accounts.csv. 119 00:05:50,800 --> 00:05:52,612 Remember, it was automatically closed 120 00:05:52,612 --> 00:05:54,789 by the first with statement. 121 00:05:54,789 --> 00:05:56,820 We're gonna open it for reading, 122 00:05:56,820 --> 00:05:59,220 and again, the online docs for this module 123 00:05:59,220 --> 00:06:03,360 say you should always include newline= empty string 124 00:06:03,360 --> 00:06:05,820 as a third argument when you're opening 125 00:06:05,820 --> 00:06:08,570 a comma-separated value file to ensure 126 00:06:08,570 --> 00:06:12,230 that the new line characters get processed correctly. 127 00:06:12,230 --> 00:06:14,600 And we will call this file accounts.
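As a sketch of the writerows alternative mentioned above, the five separate writerow calls could be collapsed into one call on a list of lists. The records are placeholder values, and the in-memory buffer is an assumption used in place of accounts.csv.

```python
import csv
import io

# Placeholder records, one inner list per row.
records = [
    [100, 'Jones', 24.98],
    [200, 'Doe', 345.67],
    [300, 'White', 0.0],
    [400, 'Stone', -42.16],
    [500, 'Rich', 224.62],
]

buffer = io.StringIO()  # in-memory stand-in for the opened file
csv.writer(buffer).writerows(records)  # one call writes every row

written = buffer.getvalue().splitlines()
print(written)
```

The result is the same five lines of text that the individual writerow calls would produce.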
128 00:06:14,600 --> 00:06:16,860 We're going to print out some column heads 129 00:06:16,860 --> 00:06:20,150 so that we can produce nice looking output 130 00:06:20,150 --> 00:06:22,910 for this part of our demonstration. 131 00:06:22,910 --> 00:06:25,070 And similar to what we did up above here, 132 00:06:25,070 --> 00:06:29,000 when we write to a comma-separated value file, 133 00:06:29,000 --> 00:06:31,970 we use the file object to create a writer, 134 00:06:31,970 --> 00:06:33,950 while when we're going to read from it, 135 00:06:33,950 --> 00:06:36,060 we use the file object, accounts, 136 00:06:36,060 --> 00:06:39,145 to create a reader that specifically knows 137 00:06:39,145 --> 00:06:41,620 it's looking for comma-separated values 138 00:06:41,620 --> 00:06:44,100 and will parse them automatically. 139 00:06:44,100 --> 00:06:47,940 So, we create our reader, and just like with a file 140 00:06:47,940 --> 00:06:49,860 where you can iterate through the file 141 00:06:49,860 --> 00:06:52,190 one line at a time in a for-loop, 142 00:06:52,190 --> 00:06:56,810 you can iterate through the CSV reader one line at a time. 143 00:06:56,810 --> 00:06:58,827 And, in our case, we know that every one 144 00:06:58,827 --> 00:07:03,060 of our comma-separated value rows has three values in it, 145 00:07:03,060 --> 00:07:06,000 so we're taking the record that we're reading, 146 00:07:06,000 --> 00:07:09,010 and we're splitting it into three pieces of information: 147 00:07:09,010 --> 00:07:11,600 the account, the name, and the balance. 148 00:07:11,600 --> 00:07:13,700 And we'll then display the account, the name, 149 00:07:13,700 --> 00:07:17,290 and the balance individually in a formatted string. 150 00:07:17,290 --> 00:07:19,870 So, let's go ahead and execute that, 151 00:07:19,870 --> 00:07:24,150 and as you can see, we got the five rows of information. 
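The reading loop just described can be sketched roughly as follows. The example first writes a small file so it can run on its own; the records, the temporary path, and the column widths are all assumptions for illustration rather than the exact code shown in the video.

```python
import csv
import os
import tempfile

# Hypothetical scratch file with placeholder records.
path = os.path.join(tempfile.mkdtemp(), 'accounts.csv')
with open(path, 'w', newline='') as accounts:
    csv.writer(accounts).writerows([
        [100, 'Jones', 24.98],
        [200, 'Doe', 345.67],
        [300, 'White', 0.0],
        [400, 'Stone', -42.16],
        [500, 'Rich', 224.62],
    ])

# Column heads, followed by one formatted line per record.
lines = [f'{"Account":<10}{"Name":<10}{"Balance":>10}']

with open(path, 'r', newline='') as accounts:
    reader = csv.reader(accounts)
    for record in reader:
        # Each record is a list of three strings.
        account, name, balance = record
        lines.append(f'{account:<10}{name:<10}{balance:>10}')

print('\n'.join(lines))
```

Note that the reader hands every field back as a string, so a balance would need to be converted, for example with float, before doing arithmetic on it.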
152 00:07:24,150 --> 00:07:26,660 We can confirm by looking back up above here 153 00:07:26,660 --> 00:07:29,870 that the information is correct and in the right order, 154 00:07:29,870 --> 00:07:33,200 and we also got the nice column-based formatting 155 00:07:33,200 --> 00:07:36,620 because we were using the f-string notation 156 00:07:36,620 --> 00:07:38,890 that you've seen previously. 157 00:07:38,890 --> 00:07:41,750 So, as you can see, it's real easy in Python 158 00:07:41,750 --> 00:07:45,000 using the CSV module to both create 159 00:07:45,000 --> 00:07:48,773 comma-separated value files and read them in as well.