- Now let's review monitoring with Amazon CloudWatch. With Amazon CloudWatch we can gain instrumentation for our infrastructure by, one, collecting metrics: key performance metrics from our EC2 instances, EBS volumes, RDS instances, and so on. We can also collect logs from those systems, and we'll talk more about logs later on, but for now let's focus on collecting metrics. So again, one of the key things that Amazon CloudWatch does is collect performance metrics. And it's important to remember that these metrics are only stored for up to two weeks. So if you need the ability to see much more historical data, if, for example, you wanted to compare this week this year with this week last year, then you would need to pull those metrics into either your own system, perhaps on EC2 or on premises, or rely on a third-party service such as New Relic or Datadog. And of course, like all Amazon services, CloudWatch allows us to pull these metrics out by accessing the API.
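Pulling metrics out through the API might look like the following sketch, using boto3's `get_metric_statistics`. The instance ID and the 24-hour window are placeholder assumptions for illustration.

```python
# Sketch: pulling average CPU for one EC2 instance out of CloudWatch,
# e.g. to archive it beyond the retention window or hand it to a
# third-party tool. Instance ID and time window are placeholders.
from datetime import datetime, timedelta, timezone

def cpu_stats_request(instance_id, hours=24):
    """Build the GetMetricStatistics parameters for average CPU."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,          # the default 5-minute collection interval
        "Statistics": ["Average"],
    }

def fetch_cpu_stats(instance_id):
    # boto3 is imported lazily so the pure helper above can be used
    # (and tested) without the AWS SDK or credentials available.
    import boto3
    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(**cpu_stats_request(instance_id))["Datapoints"]
```

A third party given API access would make essentially the same call against your account.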
And we can give third parties access to that API, so that a third party like New Relic or Datadog, or Nagios running on premises, can connect to CloudWatch and pull those metrics in. Now, every service, or most services anyway, will generate its own unique set of metrics. So EC2 has a particular set of metrics; RDS, Redshift, ElastiCache, and so on all generate their own unique sets of metrics. We also have the ability to publish custom metrics into CloudWatch. So if we wanted our own applications to publish something like JVM heap size, the number of processes, the number of concurrent threads, render times, those kinds of things, if there are metrics that are easy for our application to readily grab, then it's fairly trivial to publish those to CloudWatch, so that we can monitor not only things reported by, say, the hypervisor, but also things reported by our applications. And then of course we can correlate things that are happening within our application with things that are happening within our infrastructure. So now for example, if we take a look at EC2, the default collection interval is every 5 minutes. Right?
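Publishing one of those application-side metrics, say JVM heap size, might be sketched like this with boto3's `put_metric_data`. The namespace and dimension names here are my own illustrative choices, not anything AWS defines.

```python
# Sketch: publishing an application-level custom metric (JVM heap used)
# into CloudWatch. The "MyApp" namespace and "Application" dimension are
# illustrative assumptions.
def heap_metric(app_name, heap_bytes):
    """Build one MetricData entry for PutMetricData."""
    return {
        "MetricName": "JVMHeapUsed",
        "Dimensions": [{"Name": "Application", "Value": app_name}],
        "Value": float(heap_bytes),
        "Unit": "Bytes",
    }

def publish_heap(app_name, heap_bytes):
    # Lazy import: the builder above stays usable without boto3 installed.
    import boto3
    cw = boto3.client("cloudwatch")
    cw.put_metric_data(Namespace="MyApp",
                       MetricData=[heap_metric(app_name, heap_bytes)])
```

Once published, the custom metric can be graphed and alarmed on exactly like the hypervisor-reported ones.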
And so for things like auto scaling, that's going to be too long; we may need to scale within a minute or two. So we can pay an extra fee per instance to gain access to a detailed one-minute interval. Now, a thing to keep in mind about EC2 is that the metrics it collects and reports come from the hypervisor, and the hypervisor will have a very accurate, perhaps the most accurate, measure of something like CPU usage, network I/O, disk I/O, and a few other metrics. But one thing to keep in mind is that the hypervisor has no idea how memory is being used. While the hypervisor may know how much memory an EC2 instance actually has, it doesn't know how that memory is being used. So if you need more detailed memory usage, that kind of metric needs to be reported to CloudWatch from inside the instance. And of course, CloudWatch does provide a very powerful agent that can, readily out of the box, report very detailed memory usage. Those are reported as custom metrics, and we do pay an additional fee per custom metric. Now, for the Elastic Load Balancer, as an example, the default is 1 minute, and that's so that the service can support scaling.
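For reference, the piece of the CloudWatch agent's JSON configuration that turns on memory reporting from inside the instance might look like this minimal sketch; the agent publishes `mem_used_percent` as a custom metric under its `CWAgent` namespace, and the 60-second interval here is an assumption, not a default you must use:

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
```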
For the Relational Database Service, we do get access to memory. Because the database engine is already running, and database engines are typically aware of memory usage, connections, disk I/O, and things like that, we are relying on the software that's already a part of that instance to report those things. With DynamoDB, we get things like read and write throughput. And again, this is not an exhaustive list, but just an example of the types of metrics that we might see with different services. Another powerful feature that we have with CloudWatch is the ability to create alarms, where we either want to be notified when something is happening, or we want the breach of a threshold to trigger some other process. And so with CloudWatch alarms, we define a threshold. We say, I want something to happen, I want to create an alarm, when some number, some metric, is either too high or too low. And even though we use the word alarm, it does not necessarily signal an emergency.
When the alarm is triggered, it simply means that a number has been too high or too low for too long a period of time. And whether or not there's an emergency, that's up to you to determine, based on the nature of your application on that particular infrastructure. But we can use those alarms to trigger things like auto scaling. We can also use an alarm to simply terminate an instance, or to reboot an instance. There could be a case where, maybe, you're aware of a memory leak, as an example, and you're waiting on developers to fix it in code, but in the meantime you're trying to mitigate the issue within the infrastructure. So perhaps, as a stopgap measure, you're collecting memory as a custom metric, and when memory usage becomes too high you simply reboot the machine. Of course that's not a long-term solution, just one example of something you could do with alarms. And it's also important to remember that with most Amazon services there are both hard limits and soft limits.
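That stopgap reboot could be sketched as an alarm on the agent's memory metric, wired to CloudWatch's built-in EC2 reboot action. The namespace, metric name, threshold, and region below are assumptions matching the agent sketch earlier, not required values.

```python
# Sketch: reboot an instance when its memory custom metric stays too
# high, as a stopgap for a known leak. Assumes "mem_used_percent" is
# being published under "CWAgent"; threshold and region are placeholders.
def reboot_alarm(instance_id, region="us-east-1"):
    """Build PutMetricAlarm parameters that reboot on sustained high memory."""
    return {
        "AlarmName": f"high-memory-reboot-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,     # five one-minute periods above threshold
        "Threshold": 90.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Built-in action ARN: reboot the instance that breached the alarm.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }

def create_reboot_alarm(instance_id):
    import boto3  # lazy import so the builder is testable without boto3
    boto3.client("cloudwatch").put_metric_alarm(**reboot_alarm(instance_id))
```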
Hard limits are just the nature of the technology, but soft limits can be overridden by submitting a ticket to Amazon support. And so with CloudWatch there is an initial limit of 5,000 alarms per account; that's 5,000 alarms across all regions. Some limits within AWS are specific to a region, and some are broader, specific to an account across all regions. So, let's take a look here at a diagram. In this diagram, you can see that we have a load balancer, and this load balancer is sending metrics to CloudWatch, such as requests per minute, the number of 500s, the number of 400s, backend errors, and those kinds of things. Our EC2 instances are also individually reporting metrics, such as CPU usage, disk I/O, and network I/O. And then our auto scaling group is also reporting metrics about that group in aggregate, so average CPU, average disk usage, network usage, and so on. Our RDS instance is also reporting metrics into CloudWatch. So CloudWatch is collecting these metrics from various different types of places.
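Pulling two of those sources side by side, load-balancer traffic against instance CPU, might be sketched with a single `GetMetricData` request. The classic-ELB namespace is used here, and the load balancer name and instance ID are placeholder assumptions.

```python
# Sketch: build the MetricDataQueries for one GetMetricData call that
# fetches load-balancer request volume and instance CPU together, so the
# two series can be overlaid. Names and IDs below are placeholders.
def correlation_queries(lb_name, instance_id):
    return [
        {
            "Id": "requests",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ELB",
                    "MetricName": "RequestCount",
                    "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
                },
                "Period": 60,
                "Stat": "Sum",      # total requests per minute
            },
        },
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": 60,
                "Stat": "Average",  # average CPU over the same minute
            },
        },
    ]
```

These queries would be passed to `boto3.client("cloudwatch").get_metric_data(MetricDataQueries=..., StartTime=..., EndTime=...)`.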
And so, at the very least, one really helpful thing is that if you are seeing some performance degradation, you can look back at those CloudWatch metrics in the console, create a graph, overlay different metrics, and then correlate what you're seeing in the load balancer, maybe requests per minute, with how that affects CPU or disk I/O within your EC2 instances, and/or how that is affecting your database. Right? And then of course, here we have an alarm. You can see that we've defined an alarm that says: when CPU utilization is greater than 80% for two periods of one minute. So essentially, when CPU is greater than 80% for two minutes, we want an alarm to go off. Now again, it's not necessarily an emergency, and what happens is totally up to us. We get to configure what happens when that alarm goes off. It's very possible that nothing happens; we can have an alarm that just goes off and then nothing happens. But here, in this example, you can see that we could use that alarm to trigger auto scaling.
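The alarm from the diagram, CPU above 80% for two one-minute periods, might be sketched like this. The auto scaling group name and scaling policy ARN are placeholder assumptions.

```python
# Sketch: "CPU > 80% for two periods of one minute" on an auto scaling
# group, triggering a scale-out policy. Group name and policy ARN are
# placeholders, not real resources.
def cpu_scaling_alarm(asg_name, scale_out_policy_arn):
    """Build PutMetricAlarm parameters for the diagram's CPU alarm."""
    return {
        "AlarmName": f"{asg_name}-cpu-high",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 60,             # one-minute periods...
        "EvaluationPeriods": 2,   # ...evaluated twice: above 80% for two minutes
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [scale_out_policy_arn],
    }
```

An alarm with an empty `AlarmActions` list is the "goes off and nothing happens" case; swapping in an SNS topic ARN instead gives the notification path discussed next.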
So perhaps that alarm signifies that there is more work to be done than what can be done with our current set of instances. And so this could signal the need to grow and expand our fleet of EC2 instances in order to meet that demand. We can also send that alarm out through the Simple Notification Service. And so here we have an SNS topic, and then from there we can do a number of things. We can have Lambda respond to it, right? So we can write a Lambda function that would respond to that alarm in some kind of intelligent way, doing some kind of automated process in response to that alarm.
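Such a Lambda function might be sketched as below. SNS delivers the CloudWatch alarm as a JSON document inside the record's `Message` field; the "automated process" itself is left as a placeholder.

```python
# Sketch: a Lambda handler subscribed to the SNS topic that receives the
# alarm. SNS wraps the alarm document, JSON-encoded, in Records[0].Sns.Message.
import json

def handler(event, context=None):
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    if alarm.get("NewStateValue") == "ALARM":
        # Placeholder for a real automated response (scale, tag, page, ...).
        return f"responding to {alarm['AlarmName']}"
    return "nothing to do"
```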
Another thing that we could do, if that alarm is for a particular application, is send that alarm to a Slack channel so that our developers see it right away. In many teams that I've worked with, developers have two things in front of them all the time: their IDE and their Slack channel. And so instead of putting notifications and alarms off in some place that requires them to go look for them, it's better, in my experience, to have that alarm go right to where they already are, such as Slack. And also, instead of that alarm or notification going to one person, and waiting for that one person, we can send the notification to an entire team of people, so that we have a greater chance of multiple people being aware of an issue. And I've seen that be a really powerful pattern for helping development teams jump on issues much faster. We can also have those alarms sent to some other kind of ticketing or bug-tracking software like Jira, or some kind of SIEM system. Or, as I mentioned, maybe you're not using Slack, but some other kind of chat system.
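Forwarding the alarm into a team channel might be sketched as a small webhook post. The webhook URL below is a hypothetical placeholder; a real one would come from Slack's incoming-webhooks feature.

```python
# Sketch: format a CloudWatch alarm as a Slack message and post it to an
# incoming webhook. The URL is a placeholder, not a working endpoint.
import json
from urllib import request

WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # placeholder

def slack_payload(alarm_name, new_state, reason):
    """Build the Slack message body for an alarm notification."""
    return {"text": f":rotating_light: {alarm_name} is {new_state}: {reason}"}

def notify_slack(alarm_name, new_state, reason):
    body = json.dumps(slack_payload(alarm_name, new_state, reason)).encode()
    req = request.Request(WEBHOOK_URL, data=body,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # fires the webhook; not invoked in this sketch
```

The same payload-building function could feed any other chat system's webhook with minor changes.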
The point is to automate the collection of metrics, automate the triggering of alarms, and then get those notifications to the appropriate people in an efficient way. So you can see that CloudWatch plays a very key role within our AWS infrastructure. And we will take a closer look at CloudWatch and CloudWatch logs as the course progresses.