1
00:00:00,810 --> 00:00:01,740
- [Instructor] Next let's look at

2
00:00:01,740 --> 00:00:03,460
the speech_to_text function,

3
00:00:03,460 --> 00:00:05,860
which, as you saw, we call twice

4
00:00:05,860 --> 00:00:08,230
in our run_translator function.

5
00:00:08,230 --> 00:00:12,860
Once to get English-spoken audio into text

6
00:00:12,860 --> 00:00:16,310
and once to get Spanish-spoken
audio into text.

7
00:00:16,310 --> 00:00:18,650
And in each case we are going to need

8
00:00:18,650 --> 00:00:22,260
to create a SpeechToTextV1 object.

9
00:00:22,260 --> 00:00:25,110
Now as you can see here
when we create the object,

10
00:00:25,110 --> 00:00:28,380
we need to give it our API key

11
00:00:28,380 --> 00:00:31,860
in order to access the Watson
Speech to Text service.

12
00:00:31,860 --> 00:00:35,690
So here's our keys module that we imported

13
00:00:35,690 --> 00:00:37,240
at the beginning of the script.

14
00:00:37,240 --> 00:00:40,160
And inside that module we
have some variable names

15
00:00:40,160 --> 00:00:43,960
that we defined where
you stored your versions

16
00:00:43,960 --> 00:00:45,660
of the API keys.

17
00:00:45,660 --> 00:00:48,760
Now we take this object that we've created

18
00:00:48,760 --> 00:00:50,930
and assign it to the variable stt

19
00:00:50,930 --> 00:00:54,580
which is a common shorthand
for speech to text.

20
00:00:54,580 --> 00:00:58,850
And we're now going to use
that to invoke the Web service.

21
00:00:58,850 --> 00:01:01,200
And you can see we're doing that here

22
00:01:01,200 --> 00:01:03,660
in the context of a with statement.

23
00:01:03,660 --> 00:01:06,270
The with statement is going to open a file

24
00:01:06,270 --> 00:01:08,520
that we specified as an argument

25
00:01:08,520 --> 00:01:10,860
to our speech_to_text function.

26
00:01:10,860 --> 00:01:13,580
And that file is going to be opened

27
00:01:13,580 --> 00:01:17,440
for reading in binary format.

28
00:01:17,440 --> 00:01:21,480
So we're going to get the
data out of that file.

29
00:01:21,480 --> 00:01:23,440
We're going to call that object

30
00:01:23,440 --> 00:01:26,570
that we use to manipulate
the data audio_file.

31
00:01:26,570 --> 00:01:30,800
And the result that
we're going to get back

32
00:01:30,800 --> 00:01:34,290
is going to be the result of
calling this speech-to-text

33
00:01:34,290 --> 00:01:38,040
object's recognize
function, or method rather.

34
00:01:38,040 --> 00:01:41,070
Now as you can see, we're
using several arguments here.

35
00:01:41,070 --> 00:01:43,400
We have three of them in particular,

36
00:01:43,400 --> 00:01:45,190
and there are other ones as well

37
00:01:45,190 --> 00:01:47,920
that you'll find in the
online documentation.

38
00:01:47,920 --> 00:01:50,240
But here we're using three key ones.

39
00:01:50,240 --> 00:01:53,390
The audio argument specifies the file

40
00:01:53,390 --> 00:01:56,160
from which we're going to get the bytes

41
00:01:56,160 --> 00:01:57,890
that are going to get sent over

42
00:01:57,890 --> 00:01:59,780
to the Speech to Text service.

43
00:01:59,780 --> 00:02:04,620
The content_type argument
is the so-called media type.

44
00:02:04,620 --> 00:02:08,640
That used to be called the
MIME type which was a shorthand

45
00:02:08,640 --> 00:02:11,780
for Multipurpose Internet Mail Extensions.

46
00:02:11,780 --> 00:02:16,110
That was relatively
recently renamed media type

47
00:02:16,110 --> 00:02:18,300
for more modern uses.

48
00:02:18,300 --> 00:02:20,780
So MIME was something that's been around

49
00:02:20,780 --> 00:02:25,310
for a couple of decades now
for attachments on email.

50
00:02:25,310 --> 00:02:28,780
And then finally the last
argument that we specified here

51
00:02:28,780 --> 00:02:31,540
is the model argument, which if you recall

52
00:02:31,540 --> 00:02:35,620
from the earlier discussion
for going from English speech

53
00:02:35,620 --> 00:02:38,410
to English text, we're
going to take advantage

54
00:02:38,410 --> 00:02:41,900
of the US English Broadband model.

55
00:02:41,900 --> 00:02:44,930
So again if you go way back up here,

56
00:02:44,930 --> 00:02:47,960
you can see in Step 2 the actual name

57
00:02:47,960 --> 00:02:50,780
of that model that's
going to be passed through

58
00:02:50,780 --> 00:02:52,120
to the Web service.

59
00:02:52,120 --> 00:02:55,130
So coming back down to our
speech_to_text function here.

60
00:02:55,130 --> 00:02:57,100
So this call to recognize

61
00:02:57,100 --> 00:03:00,060
is actually what invokes the Web service,

62
00:03:00,060 --> 00:03:03,360
and what we're going to get back from that

63
00:03:03,360 --> 00:03:08,360
is what they call a
DetailedResponse object.

64
00:03:08,850 --> 00:03:10,750
And the DetailedResponse object

65
00:03:10,750 --> 00:03:13,890
is actually a JavaScript
Object Notation object

66
00:03:13,890 --> 00:03:15,880
that looks like this.

67
00:03:15,880 --> 00:03:17,980
And here in this diagram,

68
00:03:17,980 --> 00:03:20,640
we've associated with these boxes

69
00:03:20,640 --> 00:03:23,970
that you see on the screen
the source code line numbers

70
00:03:23,970 --> 00:03:27,520
where corresponding
statements are going to access

71
00:03:27,520 --> 00:03:31,120
this JavaScript Object
Notation representation.

72
00:03:31,120 --> 00:03:35,960
Now this is a JSON representation which,

73
00:03:35,960 --> 00:03:38,620
when it gets converted
into a Python object,

74
00:03:38,620 --> 00:03:42,950
is basically a set of nested
dictionaries and lists.

75
00:03:42,950 --> 00:03:47,010
This particular dictionary
has two key-value pairs.

76
00:03:47,010 --> 00:03:49,530
The first one is called
results which is a list,

77
00:03:49,530 --> 00:03:52,750
and the second one is called result_index.

78
00:03:52,750 --> 00:03:55,210
And in the results list,

79
00:03:55,210 --> 00:03:57,930
you're going to have the
transcription results.

80
00:03:57,930 --> 00:04:00,060
Now one of the things that's interesting

81
00:04:00,060 --> 00:04:02,590
about the Speech to Text service

82
00:04:02,590 --> 00:04:05,440
is it can give you back final results

83
00:04:05,440 --> 00:04:07,090
which is what we are getting here.

84
00:04:07,090 --> 00:04:09,940
Final equals true in this case.

85
00:04:09,940 --> 00:04:12,720
But it can also give back
intermediate results.

86
00:04:12,720 --> 00:04:17,220
So for instance if you've
ever watched a live newscast

87
00:04:17,220 --> 00:04:21,030
that has closed captions
showing up at the bottom

88
00:04:21,030 --> 00:04:23,860
of the screen, you'll sometimes notice

89
00:04:23,860 --> 00:04:27,510
that the transcription
results are not as accurate

90
00:04:27,510 --> 00:04:30,450
as they could be, and then
they'll suddenly change.

91
00:04:30,450 --> 00:04:32,980
And that's because they are using tools

92
00:04:32,980 --> 00:04:37,880
that are dynamically processing
the spoken word into text,

93
00:04:37,880 --> 00:04:41,690
and as they get a better
representation of that text,

94
00:04:41,690 --> 00:04:45,170
it will actually get updated
automatically at that point.

95
00:04:45,170 --> 00:04:50,170
Now if you provide the arguments
to the recognize method

96
00:04:51,630 --> 00:04:54,200
that enable you to get
intermediate responses,

97
00:04:54,200 --> 00:04:56,690
you might actually have multiple elements

98
00:04:56,690 --> 00:04:58,650
within the results list.

99
00:04:58,650 --> 00:05:03,640
In our case we have just one
dictionary in that results list

100
00:05:03,640 --> 00:05:08,290
which represents the final
transcription for demo purposes.

101
00:05:08,290 --> 00:05:10,150
Now within that dictionary

102
00:05:10,150 --> 00:05:13,700
there's an alternatives key-value pair

103
00:05:13,700 --> 00:05:15,630
that has a list object in it,

104
00:05:15,630 --> 00:05:17,810
and the only element we got back

105
00:05:17,810 --> 00:05:20,410
in this case was the final transcription.

106
00:05:20,410 --> 00:05:23,110
And you can see there are two
different key-value pairs

107
00:05:23,110 --> 00:05:24,790
in this object.

108
00:05:24,790 --> 00:05:28,340
One is the confidence
level of the transcription.

109
00:05:28,340 --> 00:05:33,090
So according to IBM
they're 98.3 percent sure

110
00:05:33,090 --> 00:05:36,330
that they got our transcription correct,

111
00:05:36,330 --> 00:05:38,350
and the transcript that they gave us back

112
00:05:38,350 --> 00:05:41,290
for the question I asked is
"where is the closest bathroom"

113
00:05:41,290 --> 00:05:44,970
which is precisely what
I asked in that video.
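
The structure just described can be sketched as a plain Python object. This is only an illustrative stand-in, not output captured from the service: the confidence, the final flag, and the transcript come from the walkthrough, while the name sample_result and the result_index value of 0 are assumptions.

```python
# Illustrative stand-in for the nested dictionaries and lists described above.
# Assumptions: the variable name sample_result and the result_index value of 0.
sample_result = {
    'results': [                      # list of transcription results
        {
            'final': True,            # a final (not intermediate) result
            'alternatives': [         # list of candidate transcriptions
                {
                    'confidence': 0.983,
                    'transcript': 'where is the closest bathroom',
                },
            ],
        },
    ],
    'result_index': 0,
}

# The outer dictionary has two keys, and the results list has one element.
print(sorted(sample_result))
print(len(sample_result['results']))
```

With intermediate results enabled, the results list could hold more than one dictionary, so code that reads it should not assume a length of one in general.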

114
00:05:44,970 --> 00:05:49,240
So unlike what we did in the
data mining Twitter chapter

115
00:05:49,240 --> 00:05:54,050
or lesson rather where the
tweepy module gave us properties

116
00:05:54,050 --> 00:05:56,440
for accessing all of this information

117
00:05:56,440 --> 00:06:00,150
in the context of Watson, we
do have to actually navigate

118
00:06:00,150 --> 00:06:04,910
into this structure to get down
to the the transcript level

119
00:06:04,910 --> 00:06:07,640
and access the information

120
00:06:07,640 --> 00:06:10,040
that is coming back to us as text.

121
00:06:10,040 --> 00:06:10,890
So with that said

122
00:06:10,890 --> 00:06:13,260
let me switch back over
to the source code here.

123
00:06:13,260 --> 00:06:14,630
So what's going to happen

124
00:06:14,630 --> 00:06:17,949
is we will get back from
this recognize method

125
00:06:17,949 --> 00:06:22,949
the DetailedResponse object
representing the JSON,

126
00:06:23,470 --> 00:06:26,890
and then on that object we call get_result

127
00:06:26,890 --> 00:06:30,680
which gives you back the actual
JavaScript Object Notation

128
00:06:30,680 --> 00:06:35,020
but as a Python object consisting
of dictionaries and lists.

129
00:06:35,020 --> 00:06:38,060
We store that in the
variable called result,

130
00:06:38,060 --> 00:06:42,120
and then we can start to
pick apart the information

131
00:06:42,120 --> 00:06:43,790
in that JSON object.

132
00:06:43,790 --> 00:06:45,890
So first thing we do,

133
00:06:45,890 --> 00:06:48,620
and we broke it down into
a bunch of statements here

134
00:06:48,620 --> 00:06:52,100
to make it easier for you
to see what's going on.

135
00:06:52,100 --> 00:06:53,500
So the first thing we do

136
00:06:53,500 --> 00:06:57,370
is we access the main
object's results key,

137
00:06:57,370 --> 00:06:59,390
and I'm gonna bounce back and forth

138
00:06:59,390 --> 00:07:02,000
between my diagram and the code here

139
00:07:02,000 --> 00:07:03,160
so you see what's going on.

140
00:07:03,160 --> 00:07:04,730
So we get the results key

141
00:07:04,730 --> 00:07:08,220
which is a list containing dictionaries.

142
00:07:08,220 --> 00:07:10,173
So now we have that list.

143
00:07:11,290 --> 00:07:15,570
Next up we go into the results
list and access element zero

144
00:07:15,570 --> 00:07:16,620
which is the element

145
00:07:16,620 --> 00:07:20,470
that contains the speech
recognition result for us.

146
00:07:20,470 --> 00:07:24,810
So again we have a list, and
it contains only one object

147
00:07:24,810 --> 00:07:26,363
which is another dictionary.

148
00:07:27,330 --> 00:07:29,550
Next up we go into.

149
00:07:29,550 --> 00:07:30,410
Whoops.

150
00:07:30,410 --> 00:07:33,744
Next up we go into that list.

151
00:07:33,744 --> 00:07:34,610
(clears throat)
Excuse me.

152
00:07:34,610 --> 00:07:35,443
I'm sorry.

153
00:07:35,443 --> 00:07:38,700
Dictionary rather and we
access its alternatives key.

154
00:07:38,700 --> 00:07:41,570
So showing you again
here's the alternatives key

155
00:07:41,570 --> 00:07:44,853
which is a list that
contains one dictionary.

156
00:07:46,100 --> 00:07:50,420
Then we go into that list
and get the only dictionary

157
00:07:50,420 --> 00:07:52,690
that's in there at element zero.

158
00:07:52,690 --> 00:07:54,730
And then we go into that dictionary,

159
00:07:54,730 --> 00:07:57,320
and we access its transcript key

160
00:07:57,320 --> 00:07:59,697
which is going to give us back the string

161
00:07:59,697 --> 00:08:03,280
"where is the closest
bathroom" in our example.

162
00:08:03,280 --> 00:08:06,650
Now once we have that string,
we're going to return it

163
00:08:06,650 --> 00:08:09,910
from our function speech_to_text,

164
00:08:09,910 --> 00:08:12,030
and that's when it will
get displayed to you

165
00:08:12,030 --> 00:08:14,033
as part of the script's execution.
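
The step-by-step navigation walked through above can be sketched as follows. This is a minimal sketch rather than the script's actual code: the helper name extract_transcript, the intermediate variable names, and the hand-built sample_result dictionary are assumptions, and no call to the Watson service is made.

```python
def extract_transcript(result):
    """Walk the nested dictionaries and lists one step at a time."""
    results_list = result['results']                   # list of dictionaries
    speech_result = results_list[0]                    # the single final result
    alternatives_list = speech_result['alternatives']  # list of dictionaries
    first_alternative = alternatives_list[0]           # the only alternative
    return first_alternative['transcript']             # the transcribed string


# Hypothetical stand-in for the object that get_result gives back.
sample_result = {
    'results': [
        {
            'final': True,
            'alternatives': [
                {
                    'confidence': 0.983,
                    'transcript': 'where is the closest bathroom',
                },
            ],
        },
    ],
    'result_index': 0,
}

text = extract_transcript(sample_result)
print(text)

# The same lookup written as a single chained expression.
same_text = sample_result['results'][0]['alternatives'][0]['transcript']
print(same_text == text)
```

Breaking the lookup into separate statements mirrors the walkthrough and makes each level easy to inspect; the chained expression is the compact equivalent.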