1
00:00:01,300 --> 00:00:03,180
- So next, let's take a look at the

2
00:00:03,180 --> 00:00:06,900
Speech to Text service from IBM Watson.

3
00:00:06,900 --> 00:00:09,210
We'll be using this a little
bit later in this lesson

4
00:00:09,210 --> 00:00:11,450
for the purpose of taking audio files

5
00:00:11,450 --> 00:00:14,490
that we record and save out to disk,

6
00:00:14,490 --> 00:00:18,110
and turning them into text
transcriptions of that audio.

7
00:00:18,110 --> 00:00:22,130
And we'll do that both for
English text and Spanish text.

8
00:00:22,130 --> 00:00:24,640
Now, one of the things that's
interesting about this service

9
00:00:24,640 --> 00:00:26,960
is you can also give it specific keywords

10
00:00:26,960 --> 00:00:28,630
that you want it to listen for,

11
00:00:28,630 --> 00:00:31,440
and it can tell you whether it finds them

12
00:00:31,440 --> 00:00:34,370
and with what likelihood it
found them within the text

13
00:00:34,370 --> 00:00:36,600
that it's transcribing from the audio.

14
00:00:36,600 --> 00:00:38,560
It's also capable, interestingly, of

15
00:00:38,560 --> 00:00:41,290
distinguishing amongst multiple speakers.

16
00:00:41,290 --> 00:00:43,760
So for instance, when you're watching

17
00:00:43,760 --> 00:00:47,170
news broadcasts nowadays,
sometimes you'll see them

18
00:00:47,170 --> 00:00:49,990
with closed captioning at
the bottom of the screen

19
00:00:49,990 --> 00:00:53,720
and that's happening live
as the people are speaking.

20
00:00:53,720 --> 00:00:57,910
And you'll notice that it
shows different speakers' text

21
00:00:57,910 --> 00:00:59,670
in those audio transcriptions,

22
00:00:59,670 --> 00:01:02,710
and Watson is capable
of doing that as well.

23
00:01:02,710 --> 00:01:05,640
So let me switch over to the demo here.

24
00:01:05,640 --> 00:01:08,210
This is the Speech to Text demo page

25
00:01:08,210 --> 00:01:09,980
at the URL that you see up here.

26
00:01:09,980 --> 00:01:12,310
And if you scroll down, you'll notice

27
00:01:12,310 --> 00:01:14,440
that you have the ability
to record your own audio

28
00:01:14,440 --> 00:01:16,200
so you could try this with your own voice.

29
00:01:16,200 --> 00:01:18,270
You can upload existing audio files

30
00:01:18,270 --> 00:01:20,540
but they also give you a couple of samples

31
00:01:20,540 --> 00:01:21,630
that you can play.

32
00:01:21,630 --> 00:01:24,480
And I want to run through
this first sample for you

33
00:01:24,480 --> 00:01:27,440
so I'm going to stop talking
and let you listen to this,

34
00:01:27,440 --> 00:01:30,150
and you'll see down here it's going to

35
00:01:30,150 --> 00:01:33,140
transcribe as the audio is playing.

36
00:01:33,140 --> 00:01:37,300
And it will eventually distinguish
between the two speakers

37
00:01:37,300 --> 00:01:40,970
and sometimes you'll see that immediately.

38
00:01:40,970 --> 00:01:42,670
Sometimes you'll see it later on.

39
00:01:42,670 --> 00:01:44,940
It will adjust what's coming out

40
00:01:44,940 --> 00:01:47,600
as it works through the example.

41
00:01:47,600 --> 00:01:50,751
So, whoops, let me go
ahead and click that.

42
00:01:50,751 --> 00:01:52,640
- (Michael) So thank you
very much for coming, David.

43
00:01:52,640 --> 00:01:53,920
It's good to have you here.

44
00:01:53,920 --> 00:01:55,140
- (David) Good, it's my pleasure, Michael.

45
00:01:55,140 --> 00:01:56,500
Glad to be with you.

46
00:01:56,500 --> 00:01:59,520
- How real is artificial intelligence?

47
00:01:59,520 --> 00:02:00,780
- The question of how real

48
00:02:00,780 --> 00:02:03,779
is artificial intelligence
is a complex one.

49
00:02:03,779 --> 00:02:05,890
- (Voiceover) Now as of
right now, it hasn't detected

50
00:02:05,890 --> 00:02:08,285
the second speaker yet
but it will eventually.

51
00:02:08,285 --> 00:02:10,440
- (David) We define artificial
intelligence as the ability

52
00:02:10,440 --> 00:02:15,330
of a machine on its own to
understand large volumes of data.

53
00:02:15,330 --> 00:02:18,990
To reason that data with a
purpose to predict the future

54
00:02:18,990 --> 00:02:21,910
and then to continue and
to learn and get better.

55
00:02:21,910 --> 00:02:23,784
That is happening today in certain fields.

56
00:02:23,784 --> 00:02:27,170
- (Michael) How far in the
continuum is IBM Watson

57
00:02:27,170 --> 00:02:30,900
in operability artificial intelligence?

58
00:02:30,900 --> 00:02:32,980
- (Voiceover) Just a few
more seconds of audio here.

59
00:02:32,980 --> 00:02:35,540
- (David) So first of all,
once it's actually intelligent

60
00:02:35,540 --> 00:02:37,220
it will no longer be artificial.

61
00:02:37,220 --> 00:02:40,620
So we're moving to the
point that these systems

62
00:02:40,620 --> 00:02:44,673
increasingly understand
enormous volumes of data.

63
00:02:45,633 --> 00:02:47,050
- (Voiceover) Okay so at this point

64
00:02:47,050 --> 00:02:49,000
it finished that audio sample

65
00:02:49,000 --> 00:02:51,660
and you'll notice that it
then updated everything

66
00:02:51,660 --> 00:02:53,991
that it had put in here previously,

67
00:02:53,991 --> 00:02:56,850
showing the two different
speakers along the way

68
00:02:56,850 --> 00:02:58,580
and if you were to go play that again,

69
00:02:58,580 --> 00:03:00,640
you'd be able to see that these indeed

70
00:03:00,640 --> 00:03:02,150
were the two different speakers.

71
00:03:02,150 --> 00:03:06,000
But also up here, there
were some keywords to spot.

72
00:03:06,000 --> 00:03:08,040
And you'll notice if we switch over to

73
00:03:08,040 --> 00:03:10,630
this Word Timings and Alternatives tab,

74
00:03:10,630 --> 00:03:13,630
you see the words in the transcription.

75
00:03:13,630 --> 00:03:16,420
If you go to the Keywords
tab, you see the keywords

76
00:03:16,420 --> 00:03:18,310
that we were looking for and

77
00:03:18,310 --> 00:03:20,150
the likelihood that they were found.

78
00:03:20,150 --> 00:03:23,770
And you can also see the
JavaScript Object Notation response

79
00:03:23,770 --> 00:03:28,410
that came back from the Watson service,

80
00:03:28,410 --> 00:03:30,210
the Speech to Text service as well.

81
00:03:30,210 --> 00:03:34,190
And we'll be picking off
information from that JSON response

82
00:03:34,190 --> 00:03:36,760
when we write our app a little bit later.

83
00:03:36,760 --> 00:03:39,670
So go ahead and play around
with this on your own.

84
00:03:39,670 --> 00:03:41,770
You'll notice, by the
way, that there are lots

85
00:03:41,770 --> 00:03:43,230
of different languages supported

86
00:03:43,230 --> 00:03:45,310
so if you speak one of these languages,

87
00:03:45,310 --> 00:03:48,100
you might try using
the Record Audio option

88
00:03:48,100 --> 00:03:51,850
along with a corresponding
model as it's called,

89
00:03:51,850 --> 00:03:55,520
to go ahead and try transcribing text

90
00:03:55,520 --> 00:03:56,703
in your own language.