1 00:00:00,840 --> 00:00:01,850 - [Instructor] In this video, we're going 2 00:00:01,850 --> 00:00:05,080 to talk about cleaning and pre-processing tweets 3 00:00:05,080 --> 00:00:07,090 to get them ready for text-based 4 00:00:07,090 --> 00:00:10,860 analysis using NLP techniques. 5 00:00:10,860 --> 00:00:12,790 Now, as we were writing our book, 6 00:00:12,790 --> 00:00:14,160 on which these videos were based, 7 00:00:14,160 --> 00:00:16,990 one of the articles that we came across 8 00:00:16,990 --> 00:00:20,270 pointed out that about 80 percent of the work 9 00:00:20,270 --> 00:00:23,790 that data scientists do is on cleaning 10 00:00:23,790 --> 00:00:26,360 and preparing data for analysis 11 00:00:26,360 --> 00:00:27,590 in the first place. 12 00:00:27,590 --> 00:00:30,110 So, as you know and as you've seen 13 00:00:30,110 --> 00:00:32,210 in earlier videos in this lesson, 14 00:00:32,210 --> 00:00:34,340 tweets come back with all sorts of, 15 00:00:34,340 --> 00:00:36,550 I'll call it gunk, in them. 16 00:00:36,550 --> 00:00:39,820 And, if you want to analyze the actual text 17 00:00:39,820 --> 00:00:42,730 within the tweet, you may need to clean that up 18 00:00:42,730 --> 00:00:45,280 and get the text ready for analysis. 19 00:00:45,280 --> 00:00:48,920 So, some of the common natural language processing tasks 20 00:00:48,920 --> 00:00:52,270 that you'll use to clean and normalize the text 21 00:00:52,270 --> 00:00:55,030 within tweets for the purpose of analysis 22 00:00:55,030 --> 00:00:59,360 include things like converting the text all to one case. 23 00:00:59,360 --> 00:01:03,210 That way, if you're, for example, parsing the words in tweets 24 00:01:03,210 --> 00:01:05,140 and you want to keep track of how many occurrences 25 00:01:05,140 --> 00:01:07,150 of a given word there are, you can treat 26 00:01:07,150 --> 00:01:10,240 both uppercase and lowercase as the same.
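To make the case-conversion point concrete, here is a minimal sketch that uses only the Python standard library. The sample strings and the word_counts helper are made up for illustration; they are not from the video.

```python
from collections import Counter


def word_counts(texts):
    """Count words across texts, treating uppercase and lowercase as the same."""
    counts = Counter()
    for text in texts:
        # lower() puts every word in one case before it is counted
        counts.update(text.lower().split())
    return counts


# Hypothetical sample tweets
tweets = ['Mars rover lands on MARS', 'mars is cold']
counts = word_counts(tweets)
print(counts['mars'])  # 3
```

Without the call to lower(), 'Mars', 'MARS' and 'mars' would be tallied as three different words.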
27 00:01:10,240 --> 00:01:14,820 Removing things like the hash symbol from hashtags, 28 00:01:14,820 --> 00:01:18,310 @mentions that mention other user accounts, 29 00:01:18,310 --> 00:01:21,120 removing duplicate tweets because of the fact 30 00:01:21,120 --> 00:01:22,500 that there's lots of retweets. 31 00:01:22,500 --> 00:01:25,010 You might want to remove duplicates. 32 00:01:25,010 --> 00:01:28,670 Removing the hashtags themselves from tweets as well. 33 00:01:28,670 --> 00:01:32,070 Also, getting rid of excess white space and punctuation, 34 00:01:32,070 --> 00:01:34,420 removing stop words like we demonstrated 35 00:01:34,420 --> 00:01:37,580 with TextBlob in the previous lesson, 36 00:01:37,580 --> 00:01:40,720 removing URLs. Not that you would do all of this every time; 37 00:01:40,720 --> 00:01:44,730 it depends, of course, on what analysis you wish to perform. 38 00:01:44,730 --> 00:01:49,730 You'll also often see people remove Twitter keywords 39 00:01:50,000 --> 00:01:54,940 such as RT for retweet and FAV for favorite 40 00:01:54,940 --> 00:01:58,140 which is similar to a like on Facebook 41 00:01:58,140 --> 00:01:59,703 and some other platforms. 42 00:02:00,540 --> 00:02:03,730 You'll perform tasks like stemming and lemmatization 43 00:02:03,730 --> 00:02:06,340 that we introduced in the NLP lesson 44 00:02:06,340 --> 00:02:09,760 and also tokenizing the text inside of tweets 45 00:02:09,760 --> 00:02:11,600 to get to the individual words 46 00:02:11,600 --> 00:02:14,030 for analysis purposes as well. 47 00:02:14,030 --> 00:02:17,090 Now, as you might expect, there are libraries 48 00:02:17,090 --> 00:02:19,820 that can help you with tasks like this. 49 00:02:19,820 --> 00:02:23,170 So, one such library is called tweet-preprocessor.
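For comparison with what a library does for you, a few of the cleanup steps listed above can be sketched by hand with the standard library. The basic_clean and drop_duplicates helpers and the sample tweets are hypothetical, and the regular expressions are deliberately simple; real tweets will need more careful patterns.

```python
import re


def basic_clean(text):
    """Apply a few of the cleanup steps mentioned above to one tweet."""
    text = re.sub(r'@\w+', '', text)          # drop @mentions
    text = text.replace('#', '')              # keep the hashtag word, drop the hash symbol
    text = re.sub(r'\s+', ' ', text).strip()  # collapse excess white space
    return text


def drop_duplicates(texts):
    """Remove repeated tweets while keeping the original order."""
    return list(dict.fromkeys(texts))


# Hypothetical sample tweets, including one exact duplicate
tweets = ['@nasa  landed on  #Mars', '@nasa  landed on  #Mars', 'Cold on #Mars']
cleaned = [basic_clean(t) for t in drop_duplicates(tweets)]
print(cleaned)  # ['landed on Mars', 'Cold on Mars']
```

The tweet-preprocessor library handles several of these steps for you. It is published on PyPI under that name, so the install command is pip install tweet-preprocessor.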
50 00:02:23,170 --> 00:02:25,540 This is the command you'll use to install it, 51 00:02:25,540 --> 00:02:28,160 so you will want to take a moment to go ahead 52 00:02:28,160 --> 00:02:31,150 and execute that command from your command line. 53 00:02:31,150 --> 00:02:32,930 And, again, if you're a Windows user, 54 00:02:32,930 --> 00:02:35,650 I recommend that you run the Anaconda Prompt 55 00:02:35,650 --> 00:02:39,370 as administrator for the purpose of installing libraries 56 00:02:39,370 --> 00:02:42,290 to ensure that they get installed properly. 57 00:02:42,290 --> 00:02:45,610 Now, the tweet-preprocessor library has the ability 58 00:02:45,610 --> 00:02:49,200 to automatically remove from a tweet any combination 59 00:02:49,200 --> 00:02:51,200 of the items that you see here: 60 00:02:51,200 --> 00:02:54,730 URLs, @mentions like @NASA which represents 61 00:02:54,730 --> 00:02:58,380 the NASA Twitter account, 62 00:02:58,380 --> 00:03:02,870 hashtags like pound Mars or hash Mars, 63 00:03:02,870 --> 00:03:05,120 Twitter reserved words like the RT and FAV 64 00:03:05,120 --> 00:03:07,110 that I just mentioned a moment ago, 65 00:03:07,110 --> 00:03:09,820 emoji characters, so you can tell it to, 66 00:03:09,820 --> 00:03:13,610 actually, remove all emojis or just the smileys, 67 00:03:13,610 --> 00:03:16,820 like smiley face, sad face, et cetera. 68 00:03:16,820 --> 00:03:20,350 You also have the ability to remove numbers as well 69 00:03:20,350 --> 00:03:25,130 and, in fact, there are a bunch of options that you can pass 70 00:03:25,130 --> 00:03:28,640 when you're configuring tweet-preprocessor 71 00:03:28,640 --> 00:03:31,730 to specify which options you would like 72 00:03:31,730 --> 00:03:33,580 to take advantage of, so I'll show you 73 00:03:33,580 --> 00:03:35,660 that table in just a moment.
74 00:03:35,660 --> 00:03:38,090 Now, it turns out that there are also 75 00:03:38,090 --> 00:03:42,250 some handy utility functions in the TextBlob module's 76 00:03:42,250 --> 00:03:46,680 submodule called utils, so textblob.utils has a function, 77 00:03:46,680 --> 00:03:48,880 for example, called strip_punc 78 00:03:48,880 --> 00:03:53,880 which strips out punctuation either anywhere in a string, 79 00:03:54,630 --> 00:03:56,620 or only at the ends of a string. 80 00:03:56,620 --> 00:04:01,090 So, depending on what the contents are of the text 81 00:04:01,090 --> 00:04:03,360 that you're processing, you can decide where 82 00:04:03,360 --> 00:04:06,360 to strip the punctuation from, and that could be important 83 00:04:06,360 --> 00:04:09,653 for properly tokenizing text as well. 84 00:04:11,030 --> 00:04:15,110 So, before we actually do a demo using tweet-preprocessor, 85 00:04:15,110 --> 00:04:17,820 let's just take a look at the option constants 86 00:04:17,820 --> 00:04:19,590 that are available to you as part 87 00:04:19,590 --> 00:04:22,080 of the tweet-preprocessor module. 88 00:04:22,080 --> 00:04:24,740 So, those constants are listed down the right-hand column, 89 00:04:24,740 --> 00:04:28,190 here, and their corresponding meanings are shown to you 90 00:04:28,190 --> 00:04:30,180 over in the left-hand column. 91 00:04:30,180 --> 00:04:32,160 So, for our little demonstration 92 00:04:32,160 --> 00:04:33,390 that we're about to show you, 93 00:04:33,390 --> 00:04:36,080 which we'll do in a new IPython session, 94 00:04:36,080 --> 00:04:39,610 we're going to say that we want to remove URLs 95 00:04:39,610 --> 00:04:42,270 and we also want to remove reserved words, 96 00:04:42,270 --> 00:04:46,710 but again, any combination of these constants is possible.
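Returning briefly to punctuation stripping: the two behaviors described for strip_punc can be sketched with the standard library alone. The strip_punctuation helper below is a hypothetical stand-in written for illustration, not TextBlob's own code, and its strip_all flag is an assumed name for the switch between the two modes.

```python
import string


def strip_punctuation(text, strip_all=False):
    """Strip punctuation everywhere, or only from the ends of the string."""
    text = text.strip()
    if strip_all:
        # remove every punctuation character, wherever it appears
        return text.translate(str.maketrans('', '', string.punctuation))
    # remove punctuation only from the two ends; the interior is untouched
    return text.strip(string.punctuation)


sample = "...what's up, Mars?!"  # hypothetical sample text
print(strip_punctuation(sample))                  # what's up, Mars
print(strip_punctuation(sample, strip_all=True))  # whats up Mars
```

Notice that stripping everywhere also removes the apostrophe in "what's", which is one reason the choice of mode matters before tokenizing.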
97 00:04:46,710 --> 00:04:48,930 So, with that said, I'm going to jump back out 98 00:04:48,930 --> 00:04:52,560 to the IPython command prompt and, as you can see here, 99 00:04:52,560 --> 00:04:57,160 I have started a new IPython session for this purpose. 100 00:04:57,160 --> 00:05:00,290 And, as we go forward, we'll use new sessions 101 00:05:00,290 --> 00:05:03,520 for several of the subsequent examples as well. 102 00:05:03,520 --> 00:05:05,610 So, the first thing we need to do is import 103 00:05:05,610 --> 00:05:07,630 the tweet-preprocessor module 104 00:05:07,630 --> 00:05:10,220 which happens to be called preprocessor. 105 00:05:10,220 --> 00:05:12,210 And, their documentation recommends 106 00:05:12,210 --> 00:05:15,260 that you import the module as the letter p, 107 00:05:15,260 --> 00:05:17,120 so that's what I've done here. 108 00:05:17,120 --> 00:05:19,160 Now, you can set the options 109 00:05:19,160 --> 00:05:22,840 with the module's function named set_options; 110 00:05:22,840 --> 00:05:27,300 you simply provide the arguments comma delimited. 111 00:05:27,300 --> 00:05:29,760 You do need to specify the module name 112 00:05:29,760 --> 00:05:31,740 in order to access each constant. 113 00:05:31,740 --> 00:05:35,930 So, p.OPT.URL gives me the URL option 114 00:05:35,930 --> 00:05:40,810 and p.OPT.RESERVED gives me the reserved words option 115 00:05:40,810 --> 00:05:45,720 for things like RT for retweet and FAV for a favorite. 116 00:05:45,720 --> 00:05:47,430 So, we'll go ahead and configure that. 117 00:05:47,430 --> 00:05:49,740 And, for demo purposes, rather than actually going 118 00:05:49,740 --> 00:05:52,860 and grabbing a tweet here, we just created a string 119 00:05:52,860 --> 00:05:56,760 called tweet_text that has some sample text in it 120 00:05:56,760 --> 00:06:01,760 including RT to represent a retweet and a URL as well.
121 00:06:02,290 --> 00:06:05,800 So, let's go ahead and create that. Cleaning a tweet, 122 00:06:05,800 --> 00:06:09,220 once you have tweet-preprocessor set up, 123 00:06:09,220 --> 00:06:11,970 is as simple as calling the clean function 124 00:06:11,970 --> 00:06:15,150 from that module and handing it the text 125 00:06:15,150 --> 00:06:16,460 that you wish to clean. 126 00:06:16,460 --> 00:06:18,820 What you will get back is a new string 127 00:06:18,820 --> 00:06:21,700 with the options having been applied to it. 128 00:06:21,700 --> 00:06:23,250 So, any of the text that needs 129 00:06:23,250 --> 00:06:25,290 to be cleaned out will be gone. 130 00:06:25,290 --> 00:06:28,810 In our case, we're going to lose the RT for retweet 131 00:06:28,810 --> 00:06:31,750 and we're going to lose the URL as well. 132 00:06:31,750 --> 00:06:34,400 And, as you can see, it removed the extra space 133 00:06:34,400 --> 00:06:38,070 before the letter a and it also removed the extra space 134 00:06:38,070 --> 00:06:43,070 after the word URL as it gave us back that cleaned string.
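Put together, the session described above looks roughly like the following sketch. It assumes tweet-preprocessor is installed. The sample string is a reconstruction from the spoken description, not a copy of what was on screen, and the clean_tweet wrapper is a hypothetical convenience added here so the snippet simply returns None when the library is missing.

```python
def clean_tweet(text):
    """Remove URLs and reserved words (RT, FAV) using tweet-preprocessor."""
    try:
        # the tweet-preprocessor package provides a module named preprocessor
        import preprocessor as p
    except ImportError:
        return None  # library not installed; nothing to demonstrate
    # choose what to strip; any combination of the OPT constants can be passed
    p.set_options(p.OPT.URL, p.OPT.RESERVED)
    # clean returns a new string with the selected items removed
    return p.clean(text)


# Assumed sample text containing RT and a URL, as described in the video
tweet_text = 'RT A sample retweet with a URL https://nasa.gov'
print(clean_tweet(tweet_text))
```

In an interactive IPython session you would normally skip the wrapper and type the import, the set_options call and the clean call directly. The options are set on the module itself, so they stay in effect for later calls to clean until set_options is called again.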