1 00:00:00,810 --> 00:00:01,740 - [Instructor] Next let's look at 2 00:00:01,740 --> 00:00:03,460 the speech_to_text function. 3 00:00:03,460 --> 00:00:05,860 Which as you saw we call twice 4 00:00:05,860 --> 00:00:08,230 in our run_translator function. 5 00:00:08,230 --> 00:00:12,860 Once to get English-spoken audio into text 6 00:00:12,860 --> 00:00:16,310 and once to get Spanish-spoken audio into text. 7 00:00:16,310 --> 00:00:18,650 And in each case we are going to need 8 00:00:18,650 --> 00:00:22,260 to create a SpeechToTextV1 object. 9 00:00:22,260 --> 00:00:25,110 Now as you can see here when we create the object, 10 00:00:25,110 --> 00:00:28,380 we need to give it our API key 11 00:00:28,380 --> 00:00:31,860 in order to access the Watson Speech to Text service. 12 00:00:31,860 --> 00:00:35,690 So here's our keys module that we imported 13 00:00:35,690 --> 00:00:37,240 at the beginning of the script. 14 00:00:37,240 --> 00:00:40,160 And inside that module we have some variable names 15 00:00:40,160 --> 00:00:43,960 that we defined where you stored your versions 16 00:00:43,960 --> 00:00:45,660 of the API keys. 17 00:00:45,660 --> 00:00:48,760 Now we take this object that we've created 18 00:00:48,760 --> 00:00:50,930 and assign it to the variable stt 19 00:00:50,930 --> 00:00:54,580 which is a common shorthand for speech to text. 20 00:00:54,580 --> 00:00:58,850 And we're now going to use that to invoke the Web service. 21 00:00:58,850 --> 00:01:01,200 And you can see we're doing that here 22 00:01:01,200 --> 00:01:03,660 in the context of a with statement. 23 00:01:03,660 --> 00:01:06,270 The with statement is going to open a file 24 00:01:06,270 --> 00:01:08,520 that we specified as an argument 25 00:01:08,520 --> 00:01:10,860 to our speech_to_text function. 26 00:01:10,860 --> 00:01:13,580 And that file is going to be opened 27 00:01:13,580 --> 00:01:17,440 for reading in binary format. 
28 00:01:17,440 --> 00:01:21,480 So we're going to get the data out of that file. 29 00:01:21,480 --> 00:01:23,440 We're going to call that object 30 00:01:23,440 --> 00:01:26,570 that we use to manipulate the data audio_file. 31 00:01:26,570 --> 00:01:30,800 And the result that we're going to get back 32 00:01:30,800 --> 00:01:34,290 is going to be the result of calling the speech to text 33 00:01:34,290 --> 00:01:38,040 object's recognize function or method rather. 34 00:01:38,040 --> 00:01:41,070 Now as you can see, we're using several arguments here. 35 00:01:41,070 --> 00:01:43,400 We have three of them in particular, 36 00:01:43,400 --> 00:01:45,190 and there are other ones as well 37 00:01:45,190 --> 00:01:47,920 that you'll find in the online documentation. 38 00:01:47,920 --> 00:01:50,240 But here we're using three key ones. 39 00:01:50,240 --> 00:01:53,390 The audio argument specifies the file 40 00:01:53,390 --> 00:01:56,160 from which we're going to get the bytes 41 00:01:56,160 --> 00:01:57,890 that are going to get sent over 42 00:01:57,890 --> 00:01:59,780 to the speech to text service. 43 00:01:59,780 --> 00:02:04,620 The content_type argument is the so-called media type. 44 00:02:04,620 --> 00:02:08,640 That used to be called the MIME type which was a shorthand 45 00:02:08,640 --> 00:02:11,780 for Multipurpose Internet Mail Extensions. 46 00:02:11,780 --> 00:02:16,110 That was relatively recently renamed media type 47 00:02:16,110 --> 00:02:18,300 for more modern uses. 48 00:02:18,300 --> 00:02:20,780 So MIME was something that's been around 49 00:02:20,780 --> 00:02:25,310 for a couple of decades now for attachments on email. 
50 00:02:25,310 --> 00:02:28,780 And then finally the last argument that we specified here 51 00:02:28,780 --> 00:02:31,540 is the model argument, which if you recall 52 00:02:31,540 --> 00:02:35,620 from the earlier discussion for going from English speech 53 00:02:35,620 --> 00:02:38,410 to English text, we're going to take advantage 54 00:02:38,410 --> 00:02:41,900 of the US English Broadband model. 55 00:02:41,900 --> 00:02:44,930 So again if you go way back up here, 56 00:02:44,930 --> 00:02:47,960 you can see in Step 2 the actual name 57 00:02:47,960 --> 00:02:50,780 of that model that's going to be passed through 58 00:02:50,780 --> 00:02:52,120 to the Web service. 59 00:02:52,120 --> 00:02:55,130 So coming back down to our speech_to_text function here. 60 00:02:55,130 --> 00:02:57,100 So this call to recognize 61 00:02:57,100 --> 00:03:00,060 is actually what invokes the Web service, 62 00:03:00,060 --> 00:03:03,360 and what we're going to get back from that 63 00:03:03,360 --> 00:03:08,360 is what they call a detailed response object. 64 00:03:08,850 --> 00:03:10,750 And the detailed response object 65 00:03:10,750 --> 00:03:13,890 is actually a JavaScript object notation object 66 00:03:13,890 --> 00:03:15,880 that looks like this. 67 00:03:15,880 --> 00:03:17,980 And here in this diagram, 68 00:03:17,980 --> 00:03:20,640 we've associated with these boxes 69 00:03:20,640 --> 00:03:23,970 that you see on the screen the source code line numbers 70 00:03:23,970 --> 00:03:27,520 where corresponding statements are going to access 71 00:03:27,520 --> 00:03:31,120 this JavaScript object notation representation. 72 00:03:31,120 --> 00:03:35,960 Now this is a JSON representation which, 73 00:03:35,960 --> 00:03:38,620 when it gets converted into a Python object, 74 00:03:38,620 --> 00:03:42,950 is basically a set of nested dictionaries and lists. 75 00:03:42,950 --> 00:03:47,010 This particular dictionary has two key value pairs. 
76 00:03:47,010 --> 00:03:49,530 The first one is called results which is a list, 77 00:03:49,530 --> 00:03:52,750 and the second one is called result_index. 78 00:03:52,750 --> 00:03:55,210 And in the results list, 79 00:03:55,210 --> 00:03:57,930 you're going to have the transcription results. 80 00:03:57,930 --> 00:04:00,060 Now one of the things that's interesting 81 00:04:00,060 --> 00:04:02,590 about the speech to text service 82 00:04:02,590 --> 00:04:05,440 is it can give you back final results 83 00:04:05,440 --> 00:04:07,090 which is what we are getting here. 84 00:04:07,090 --> 00:04:09,940 Final equals true in this case. 85 00:04:09,940 --> 00:04:12,720 But it can also give back intermediate results. 86 00:04:12,720 --> 00:04:17,220 So for instance if you've ever watched a live newscast 87 00:04:17,220 --> 00:04:21,030 that has closed captions showing up at the bottom 88 00:04:21,030 --> 00:04:23,860 of the screen, you'll sometimes notice 89 00:04:23,860 --> 00:04:27,510 that the transcription results are not as accurate 90 00:04:27,510 --> 00:04:30,450 as they could be, and then they'll suddenly change. 91 00:04:30,450 --> 00:04:32,980 And that's because they are using tools 92 00:04:32,980 --> 00:04:37,880 that are dynamically processing the spoken word into text, 93 00:04:37,880 --> 00:04:41,690 and as they get a better representation of that text, 94 00:04:41,690 --> 00:04:45,170 it will actually get updated automatically at that point. 95 00:04:45,170 --> 00:04:50,170 Now if you provide the arguments to the recognize method 96 00:04:51,630 --> 00:04:54,200 that enable you to get intermediate responses, 97 00:04:54,200 --> 00:04:56,690 you might actually have multiple elements 98 00:04:56,690 --> 00:04:58,650 within the results list. 99 00:04:58,650 --> 00:05:03,640 In our case we have just one dictionary in that results list 100 00:05:03,640 --> 00:05:08,290 which represents the final transcription for demo purposes. 
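The final versus intermediate distinction can be sketched in a few lines of Python. This is only an illustration: it assumes, as the narration suggests, that intermediate entries appear in the same results list with final set to False. The intermediate transcripts below are made up and are not real service output.

```python
# Hypothetical results list with two intermediate entries and one final
# entry. Only the last transcript and its confidence come from the video;
# the intermediate entries are invented for illustration.
results = [
    {'final': False, 'alternatives': [{'transcript': 'where is the'}]},
    {'final': False, 'alternatives': [{'transcript': 'where is the closest'}]},
    {'final': True,
     'alternatives': [{'confidence': 0.983,
                       'transcript': 'where is the closest bathroom'}]},
]

# Keep only the entries marked as final.
final_results = [entry for entry in results if entry['final']]

for entry in final_results:
    print(entry['alternatives'][0]['transcript'])
```

In the demo script no intermediate results are requested, so this filtering step is not needed there.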
101 00:05:08,290 --> 00:05:10,150 Now within that dictionary 102 00:05:10,150 --> 00:05:13,700 there's an alternatives key value pair 103 00:05:13,700 --> 00:05:15,630 that has a list object in it, 104 00:05:15,630 --> 00:05:17,810 and the only element we got back 105 00:05:17,810 --> 00:05:20,410 in this case was the final transcription. 106 00:05:20,410 --> 00:05:23,110 And you can see there's two different key value pairs 107 00:05:23,110 --> 00:05:24,790 in this object. 108 00:05:24,790 --> 00:05:28,340 One is the confidence level of the transcription. 109 00:05:28,340 --> 00:05:33,090 So according to IBM they're 98.3 percent sure 110 00:05:33,090 --> 00:05:36,330 that they got our transcription correct, 111 00:05:36,330 --> 00:05:38,350 and the transcript that they gave us back 112 00:05:38,350 --> 00:05:41,290 for the question I asked is "where is the closest bathroom" 113 00:05:41,290 --> 00:05:44,970 which is precisely what I asked in that video. 114 00:05:44,970 --> 00:05:49,240 So unlike what we did in the data mining Twitter chapter 115 00:05:49,240 --> 00:05:54,050 or lesson rather where the tweepy module gave us properties 116 00:05:54,050 --> 00:05:56,440 for accessing all of this information, 117 00:05:56,440 --> 00:06:00,150 in the context of Watson we do have to actually navigate 118 00:06:00,150 --> 00:06:04,910 into this structure to get down to the transcript level 119 00:06:04,910 --> 00:06:07,640 and access the information 120 00:06:07,640 --> 00:06:10,040 that is coming back to us as text. 121 00:06:10,040 --> 00:06:10,890 So with that said 122 00:06:10,890 --> 00:06:13,260 let me switch back over to the source code here. 
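For reference, the structure just described can be written out as the Python object the JSON converts to. This is a sketch based on the narration: the confidence and transcript are the values mentioned in the video, while the result_index value of 0 is an assumption, since the narration does not state it.

```python
# The response as nested dictionaries and lists.
# Assumption: result_index is 0; the video does not give its value.
result = {
    'results': [
        {
            'final': True,
            'alternatives': [
                {
                    'confidence': 0.983,
                    'transcript': 'where is the closest bathroom',
                }
            ],
        }
    ],
    'result_index': 0,
}

# Navigating down to the transcript level in a single expression.
transcript = result['results'][0]['alternatives'][0]['transcript']
print(transcript)
```

Unlike an object with properties, each level here has to be reached with a key or an index, which is why the script walks down through the structure.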
123 00:06:13,260 --> 00:06:14,630 So what's going to happen 124 00:06:14,630 --> 00:06:17,949 is we will get back from this recognize method 125 00:06:17,949 --> 00:06:22,949 the detailed response object representing the JSON, 126 00:06:23,470 --> 00:06:26,890 and then on that object we call get_result 127 00:06:26,890 --> 00:06:30,680 which gives you back the actual JavaScript object notation 128 00:06:30,680 --> 00:06:35,020 but as a Python object consisting of dictionaries and lists. 129 00:06:35,020 --> 00:06:38,060 We store that in the variable called result, 130 00:06:38,060 --> 00:06:42,120 and then we can start to pick apart the information 131 00:06:42,120 --> 00:06:43,790 in that JSON object. 132 00:06:43,790 --> 00:06:45,890 So first thing we do, 133 00:06:45,890 --> 00:06:48,620 and we broke it down into a bunch of statements here 134 00:06:48,620 --> 00:06:52,100 to make it easier for you to see what's going on. 135 00:06:52,100 --> 00:06:53,500 So the first thing we do 136 00:06:53,500 --> 00:06:57,370 is we access the main object's results key, 137 00:06:57,370 --> 00:06:59,390 and I'm gonna bounce back and forth 138 00:06:59,390 --> 00:07:02,000 between my diagram and the code here 139 00:07:02,000 --> 00:07:03,160 so you see what's going on. 140 00:07:03,160 --> 00:07:04,730 So we get the results key 141 00:07:04,730 --> 00:07:08,220 which is a list containing dictionaries. 142 00:07:08,220 --> 00:07:10,173 So now we have that list. 143 00:07:11,290 --> 00:07:15,570 Next up we go into the results list and access element zero 144 00:07:15,570 --> 00:07:16,620 which is the element 145 00:07:16,620 --> 00:07:20,470 that contains the speech recognition result for us. 146 00:07:20,470 --> 00:07:24,810 So again we have a list, and it contains only one object 147 00:07:24,810 --> 00:07:26,363 which is another dictionary. 148 00:07:27,330 --> 00:07:29,550 Next up we go into. 149 00:07:29,550 --> 00:07:30,410 Whoops. 
150 00:07:30,410 --> 00:07:33,744 Next up we go into that list. 151 00:07:33,744 --> 00:07:34,610 (clears throat) Excuse me. 152 00:07:34,610 --> 00:07:35,443 I'm sorry. 153 00:07:35,443 --> 00:07:38,700 Dictionary rather and we access its alternatives key. 154 00:07:38,700 --> 00:07:41,570 So showing you again here's the alternatives key 155 00:07:41,570 --> 00:07:44,853 which is a list that contains one dictionary. 156 00:07:46,100 --> 00:07:50,420 Then we go into that list and get the only dictionary 157 00:07:50,420 --> 00:07:52,690 that's in there at element zero. 158 00:07:52,690 --> 00:07:54,730 And then we go into that dictionary, 159 00:07:54,730 --> 00:07:57,320 and we access its transcript key 160 00:07:57,320 --> 00:07:59,697 which is going to give us back the string 161 00:07:59,697 --> 00:08:03,280 "where is the closest bathroom" in our example. 162 00:08:03,280 --> 00:08:06,650 Now once we have that string, we're going to return it 163 00:08:06,650 --> 00:08:09,910 from our function speech_to_text, 164 00:08:09,910 --> 00:08:12,030 and that's when it will get displayed to you 165 00:08:12,030 --> 00:08:14,033 as part of the script's execution.
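The walk-through above can be sketched as follows. This is not the script shown in the video: the variable names, the extract_transcript helper, and the parameter order are hypothetical, the 'audio/wav' media type is an assumption about the audio file, and stt is assumed to be an already-configured SpeechToTextV1 object passed in by the caller (which sidesteps how the API key is supplied).

```python
def extract_transcript(result):
    """Walk the nested dictionaries and lists one step at a time."""
    results_list = result['results']                   # list of dictionaries
    speech_result = results_list[0]                    # the only result
    alternatives_list = speech_result['alternatives']  # list with one dictionary
    first_alternative = alternatives_list[0]           # that one dictionary
    return first_alternative['transcript']             # the transcribed string


def speech_to_text(stt, file_name, model_id):
    """Send an audio file to the service and return the transcript.

    Assumptions: stt is a configured SpeechToTextV1 object and the file
    holds WAV audio, hence the 'audio/wav' media type.
    """
    with open(file_name, 'rb') as audio_file:  # read the audio as bytes
        result = stt.recognize(audio=audio_file,
                               content_type='audio/wav',
                               model=model_id).get_result()
    return extract_transcript(result)


# Offline check with a hand-built result; no Web service call is made.
sample = {
    'results': [
        {'final': True,
         'alternatives': [{'confidence': 0.983,
                           'transcript': 'where is the closest bathroom'}]}
    ],
    'result_index': 0,  # assumed value, not stated in the video
}
print(extract_transcript(sample))
```

Splitting the navigation into its own function keeps it testable without an API key; the same steps could also be collapsed into one chained expression.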