- Now let's review a use case of analytics for a subscription video on demand service. As you can see, when it comes to analytics we have a lot of options, so we'll just start from the top, work our way down, and explain what's going on. Let's say that we have an API, and that API is powered by containers running within ECS or EKS. It could be running on EC2 directly, but this example comes from personal experience running a video subscription service, very similar to Netflix, where all of the microservices were powered by ECS, and they were all available through a load balancer. All of those microservices, a couple of dozen different applications, were sending logs to CloudWatch Logs. The benefit of streaming those logs to CloudWatch Logs is that if a container went away, which it often did, we still had those logs; we didn't have to worry about them being unavailable.
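The logging setup just described can be sketched as follows. The service name and field names are hypothetical, but emitting one JSON object per log line is what makes the logs easy to index later in CloudWatch Logs and Elasticsearch, and what preserves them after a container goes away.

```python
import json
from datetime import datetime, timezone

def format_log_event(service: str, level: str, message: str, **fields) -> str:
    """Render one structured log line. A container writes this to stdout,
    and the container log driver forwards it to CloudWatch Logs."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        **fields,  # arbitrary extra context, e.g. user_id, request_id
    }
    return json.dumps(event)

# Example: a playback microservice logging a stream start
line = format_log_event("playback-api", "INFO", "stream started",
                        user_id="u-123", video_id="v-456")
```

Because the line is self-describing JSON rather than free text, the logs survive the container and stay queryable downstream.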
It was then very easy to connect CloudWatch Logs to the Amazon Elasticsearch Service and have those logs stream, essentially in real time, into Elasticsearch, so that developers could very easily access them there. Now technically, developers could access logs in CloudWatch Logs directly, but Elasticsearch gave us much more power and flexibility, not only in how we could view those logs, but in how we could run analysis on them as well. The biggest benefit to our developers was really understanding how our applications were performing by looking at those logs. We also had several different databases within the Relational Database Service. We had MySQL data stores, and we also had PostgreSQL databases, within both Aurora- and non-Aurora-based RDS instances. Certain data was stored there, like user account information, video metadata, and a number of other things, and we also had a lot of financial information stored in a SQL Server database.
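As a rough illustration of the extra power Elasticsearch gives over plain log viewing, a query body like the following (the field names are assumptions carried over from structured logging, not anything from the original service) filters on several fields at once and aggregates the results, which is exactly the kind of analysis the transcript refers to:

```python
# Hypothetical Elasticsearch query body: find recent ERROR-level events
# for one microservice, and count them per endpoint. Multi-field filtering
# plus aggregation is where Elasticsearch outpaces raw log streams.
def build_log_query(service: str, level: str, since: str) -> dict:
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            # bucket matching errors by endpoint to spot hotspots
            "by_endpoint": {"terms": {"field": "endpoint"}}
        },
    }

query = build_log_query("playback-api", "ERROR", "now-1h")
```

A developer would post this body to an Elasticsearch search endpoint; the same question against a raw log stream would require scanning and parsing every line by hand.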
That was primarily because the finance and accounting departments' tools were all Windows-based, and we stored information related not just to billing, but also to the cost of producing videos and the cost of marketing those videos. By storing all of that there, finance and accounting could use their Windows-based tools to work with that data, but we could also, as we'll talk more about later, pull that into Redshift. We also stored a number of other types of information in DynamoDB. We actually duplicated a lot of the user information, and the video information was duplicated from MySQL into DynamoDB. That's because we had CMS tools, content management systems, that allowed editors and content creators to edit that content in MySQL. But in production, our users, coming from their mobile applications and from the website, got much better performance reading it from DynamoDB.
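The duplication just described, editors writing to MySQL while readers hit DynamoDB, implies a reshaping step somewhere in between. A minimal sketch of that transform, with entirely hypothetical column and key names:

```python
def mysql_row_to_dynamodb_item(row: dict) -> dict:
    """Reshape a relational video row into a DynamoDB item keyed for
    fast reads by the website and mobile apps. Names are illustrative."""
    return {
        "pk": f"VIDEO#{row['video_id']}",        # DynamoDB partition key
        "title": row["title"],
        "duration_seconds": row["duration_seconds"],
        "published": bool(row["published"]),     # MySQL tinyint -> boolean
    }

# Example: one row as it might come back from the CMS's MySQL table
item = mysql_row_to_dynamodb_item(
    {"video_id": 42, "title": "Episode 1", "duration_seconds": 1380, "published": 1}
)
```

The key design choice is denormalization: the item is keyed exactly the way the apps read it, so a lookup is a single fast get rather than a relational join.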
And so we had other services in the background that could replicate that data from MySQL over to DynamoDB, performing an ETL operation. We had other data coming in as users performed different actions: as they clicked on a web page, as they searched for something, as they pressed play, as they paused, as they fast-forwarded, and as they continued to watch a video. There were so many different events that users were generating and that our API was collecting. A lot of those events were sent into Amazon Kinesis, and Amazon Kinesis gave us a reliable way of ingesting a very high volume of data coming in at a high velocity. We split that in a couple of different ways. We had what we would call Kinesis-enabled applications, or consumers, Kinesis consumers, that were reading from this Kinesis stream and performing a number of different kinds of analysis on that stream. So we were producing real-time information, such as how many people are viewing the website right now and how many people are viewing the videos right now.
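A player event headed for Kinesis might be shaped like this. The payload fields are assumptions, but the Data/PartitionKey pairing is the shape a Kinesis put-record call expects, and partitioning by user ID is a common way to spread a high-velocity stream across shards while keeping one user's events in order:

```python
import json

def make_kinesis_record(user_id: str, video_id: str, event_type: str,
                        position_seconds: int) -> dict:
    """Build the Data/PartitionKey pair for a Kinesis put-record call.
    (The actual AWS call is omitted; this just shows the record shape.)"""
    payload = {
        "user_id": user_id,
        "video_id": video_id,
        "event": event_type,          # e.g. "play", "pause", "seek"
        "position_seconds": position_seconds,
    }
    return {
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": user_id,      # one user's events land on one shard, in order
    }

# Example: a user pressing play at the start of a video
record = make_kinesis_record("u-123", "v-456", "play", 0)
```

Choosing the partition key is the main design decision here: keying by user preserves per-user ordering, while keying by video would instead group all activity for one title.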
Then there were other kinds of longer-term trend analysis, such as which videos are more popular based on how long they were watched. Are people watching to a certain point and then skipping ahead? That kind of information. So we could use Kinesis-enabled applications, consumers, to read from that stream, perform analysis on real-time metrics, and store the results in DynamoDB. And then, of course, other consumers could potentially read that. For example, if we wanted to display the popularity of a particular video, these Kinesis applications could determine that popularity, and then others could read it from DynamoDB. These Kinesis streams could also connect to Kinesis Data Firehose, which is sort of an out-of-the-box solution for writing Kinesis data either directly to S3 or directly to Redshift, among a couple of other destinations. So here, we could leverage Kinesis Data Firehose, without writing our own code, to get that data into flat files in S3, where they would remain for some time.
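The "how many people are watching right now" consumer can be sketched as a sliding-window counter. A real Kinesis consumer would feed stream events into something like this and periodically write the count to DynamoDB; all names and the window length are illustrative:

```python
class ConcurrentViewerCounter:
    """Track roughly how many users are watching right now, based on the
    most recent play/heartbeat event seen per user within a time window."""

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self.last_seen = {}  # user_id -> timestamp of most recent event

    def record_event(self, user_id: str, timestamp: float) -> None:
        self.last_seen[user_id] = timestamp

    def current_viewers(self, now: float) -> int:
        # a user counts as "watching" if they sent an event inside the window
        cutoff = now - self.window_seconds
        return sum(1 for t in self.last_seen.values() if t >= cutoff)

counter = ConcurrentViewerCounter(window_seconds=60)
counter.record_event("u-1", 100.0)
counter.record_event("u-2", 150.0)
counter.record_event("u-1", 155.0)  # u-1 still active
```

At time 160 both users fall inside the 60-second window; by time 260 neither does, so the "right now" number naturally decays as events stop arriving.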
And from there, we had access to large datasets that we could run analysis on using Amazon Athena; so our analytics team, our data team, could use their own SQL-based tools, connecting to Athena using ODBC or JDBC drivers, and run whatever ad-hoc queries they could think of against that large dataset within S3. And then, of course, as I mentioned earlier, we had user data, video data, and a number of other types of data in MySQL, and we had financial information related to those videos in SQL Server. We also had real-time popularity information within DynamoDB, and so we could use Data Pipeline to pull all of that data in from these various sources, performing an extract and transformation, and then load it into Redshift, so that our analytics team could perform regular queries around questions like: how does the popularity of a video relate to the money that we put in to produce it? We spent a certain amount of money to make a particular video or series of videos, and we spent a certain amount of money to market it.
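An ad-hoc Athena query over the Firehose-delivered event files might look like the one below. The table and column names are invented for illustration; the point is the pattern, standard SQL run directly against flat files in S3, with no loading step:

```python
# Hypothetical Athena query: which videos were started most often in the
# last week? Athena scans the Firehose-delivered files in S3 directly.
athena_query = """
SELECT video_id,
       COUNT(*) AS plays
FROM   player_events
WHERE  event = 'play'
  AND  event_ts >= date_add('day', -7, current_date)
GROUP  BY video_id
ORDER  BY plays DESC
LIMIT  20;
"""
```

An analyst would submit this through any ODBC/JDBC-connected SQL tool; Athena charges per data scanned, which is why teams typically partition and compress the S3 files.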
Are we seeing a level of engagement that justifies that expense? The only way to know that is to be able to join user data and account data with financial data and popularity data. And again, we could do that by pulling all of that data into Redshift and allowing our analytics team, very much like with Athena, to connect to Amazon Redshift and run any ad-hoc query they could think of across a very large dataset. They could determine not only popularity as it relates to cost, but also how our marketing efforts relate to users continuing their memberships and subscriptions. Are users canceling their subscriptions? Is there a relationship between activity on a video and users either canceling or renewing their memberships? Is there a relationship between activity on a new series and whether or not we were getting new users signing up? And is the money we're making on that allowing us to be profitable, considering the money we spent to create and market it?
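The question above, does engagement justify the spend, comes down to joining watch data with cost data and computing a ratio. A toy version of the arithmetic the Redshift queries would produce, with invented numbers:

```python
def cost_per_hour_watched(production_cost: float, marketing_cost: float,
                          total_seconds_watched: int) -> float:
    """One simple way to relate spend to engagement: total dollars in,
    divided by total hours of viewing out. Lower is better."""
    hours = total_seconds_watched / 3600
    return (production_cost + marketing_cost) / hours

# e.g. a series that cost $50,000 to produce, $10,000 to market,
# and accumulated 120,000 hours of viewing across all users:
metric = cost_per_hour_watched(50_000.0, 10_000.0, 120_000 * 3600)
# -> $0.50 of spend per hour watched
```

In practice the numerator comes from the SQL Server finance data and the denominator from the event stream, which is exactly why both had to land in the same warehouse before the division could happen.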
Right, so there was a lot of information here that required some very complex SQL statements that could really only be run in one place, and Redshift served that really well. So, as you can see, when it comes to analytics within AWS, we have a lot of options, a lot of very powerful options, and it's very common for applications to use a number of different data stores, because each one of these is targeted at a particular use case, and it serves that kind of data, that kind of scenario, that kind of access pattern, really well. And of course there are others. There's Amazon EMR, which was not really used in this particular scenario, but that is also an option. So again, as you continue to move forward in your exploration of AWS, I would highly encourage you to explore some of these tools, such as Kinesis, DynamoDB, Athena, Redshift, Elasticsearch, and of course Amazon EMR as well.