- Now let's review monitoring with Amazon CloudWatch. With Amazon CloudWatch we can gain instrumentation for our infrastructure by, one, collecting metrics: key performance metrics from our EC2 instances, EBS volumes, RDS instances, and so on. We can also collect logs from those systems, and we'll talk more about logs later on, but for now let's focus on collecting metrics. So again, one of the key things that Amazon CloudWatch does is collect performance metrics. And it's important to remember that these metrics are only stored for up to two weeks. So if you need the ability to see much more historical data, if, for example, you wanted to compare this week this year with this week last year, then you would need to pull those metrics into either your own system, perhaps on EC2 or on premises, or rely on a third-party service such as New Relic or Datadog. And of course, like all Amazon services, CloudWatch allows us to pull these metrics out by accessing the API.
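Pulling metrics out through the API might look like the following sketch, using boto3's `get_metric_statistics`. The instance ID and the 24-hour window are placeholder assumptions for illustration.

```python
# Sketch: pulling average CPU for one EC2 instance out of CloudWatch,
# e.g. to archive it beyond the retention window or hand it to a
# third-party tool. Instance ID and time window are placeholders.
from datetime import datetime, timedelta, timezone

def cpu_stats_request(instance_id, hours=24):
    """Build the GetMetricStatistics parameters for average CPU."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,          # the default 5-minute collection interval
        "Statistics": ["Average"],
    }

def fetch_cpu_stats(instance_id):
    # boto3 is imported lazily so the pure helper above can be used
    # (and tested) without the AWS SDK or credentials available.
    import boto3
    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(**cpu_stats_request(instance_id))["Datapoints"]
```

A third party given API access would make essentially the same call against your account.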
And we can give third parties access to that API, so that a third party like New Relic or Datadog, or Nagios running on premises, can connect to CloudWatch and pull those metrics in. Now, every service, or most services anyway, will generate its own unique set of metrics. So EC2 has a particular set of metrics; RDS, Redshift, ElastiCache, and so on all generate their own unique sets of metrics. We also have the ability to publish custom metrics into CloudWatch. So if we wanted our own applications to publish something like JVM heap size, the number of processes, the number of concurrent threads, render times, those kinds of things, if there are metrics that are easy for our application to readily grab, then it's fairly trivial to publish those to CloudWatch, so that we can monitor not only things reported by, say, the hypervisor, but also things reported by our applications. And then of course we can correlate things that are happening within our application with things that are happening within our infrastructure. So now for example, if we take a look at EC2, the default collection interval is every 5 minutes. Right?
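Publishing one of those application-side metrics, say JVM heap size, might be sketched like this with boto3's `put_metric_data`. The namespace and dimension names here are my own illustrative choices, not anything AWS defines.

```python
# Sketch: publishing an application-level custom metric (JVM heap used)
# into CloudWatch. The "MyApp" namespace and "Application" dimension are
# illustrative assumptions.
def heap_metric(app_name, heap_bytes):
    """Build one MetricData entry for PutMetricData."""
    return {
        "MetricName": "JVMHeapUsed",
        "Dimensions": [{"Name": "Application", "Value": app_name}],
        "Value": float(heap_bytes),
        "Unit": "Bytes",
    }

def publish_heap(app_name, heap_bytes):
    # Lazy import: the builder above stays usable without boto3 installed.
    import boto3
    cw = boto3.client("cloudwatch")
    cw.put_metric_data(Namespace="MyApp",
                       MetricData=[heap_metric(app_name, heap_bytes)])
```

Once published, the custom metric can be graphed and alarmed on exactly like the hypervisor-reported ones.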
And so for things like auto scaling, that's going to be too long; we may need to scale within a minute or two. So we can pay an extra fee per instance to gain access to a detailed one-minute interval. Now, a thing to keep in mind about EC2 is that the metrics it collects and reports come from the hypervisor, and the hypervisor will have a very accurate, perhaps the most accurate, measure of something like CPU usage, network I/O, disk I/O, and a few other metrics. But one thing to keep in mind is that the hypervisor has no idea how memory is being used. While the hypervisor may know how much memory an EC2 instance actually has, it doesn't know how that memory is being used. So if you need more detailed memory usage, that kind of metric needs to be reported to CloudWatch from inside the instance. And of course, CloudWatch does provide a very powerful agent that can, readily out of the box, report very detailed memory usage. Those are reported as custom metrics, and we do pay an additional fee per custom metric. Now, for the Elastic Load Balancer, as an example, the default is 1 minute, and that's so that the service can support scaling.
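For reference, the piece of the CloudWatch agent's JSON configuration that turns on memory reporting from inside the instance might look like this minimal sketch; the agent publishes `mem_used_percent` as a custom metric under its `CWAgent` namespace, and the 60-second interval here is an assumption, not a default you must use:

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
```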
For the Relational Database Service, we do get access to memory. Because the database engine is already running, and database engines are typically aware of memory usage, connections, disk I/O, and things like that, we are relying on the software that's already a part of that instance to report those things. With DynamoDB, we get things like read and write throughput. And again, this is not an exhaustive list, but just an example of the types of metrics that we might see with different services. Another powerful feature that we have with CloudWatch is the ability to create alarms, where we either want to be notified when something is happening, or we want the breach of a threshold to trigger some other process. And so with CloudWatch alarms, we define a threshold. We say, I want something to happen, I want to create an alarm, when some number, some metric, is either too high or too low. And even though we use the word alarm, it does not necessarily signal an emergency.
When the alarm is triggered, it simply means that a number has been too high or too low for too long a period of time. And whether or not there's an emergency, that's up to you to determine, based on the nature of your application on that particular infrastructure. But we can use those alarms to trigger things like auto scaling. We can also use an alarm to simply terminate an instance, or to reboot an instance. There could be a case where, maybe, you're aware of a memory leak, as an example, and you're waiting on developers to fix it in code, but in the meantime you're trying to mitigate the issue within the infrastructure. So perhaps, as a stopgap measure, you're collecting memory as a custom metric, and when memory usage becomes too high you simply reboot the machine. Of course that's not a long-term solution, just one example of something you could do with alarms. And it's also important to remember that with most Amazon services there are both hard limits and soft limits.
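That stopgap reboot could be sketched as an alarm on the agent's memory metric, wired to CloudWatch's built-in EC2 reboot action. The namespace, metric name, threshold, and region below are assumptions matching the agent sketch earlier, not required values.

```python
# Sketch: reboot an instance when its memory custom metric stays too
# high, as a stopgap for a known leak. Assumes "mem_used_percent" is
# being published under "CWAgent"; threshold and region are placeholders.
def reboot_alarm(instance_id, region="us-east-1"):
    """Build PutMetricAlarm parameters that reboot on sustained high memory."""
    return {
        "AlarmName": f"high-memory-reboot-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,     # five one-minute periods above threshold
        "Threshold": 90.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Built-in action ARN: reboot the instance that breached the alarm.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }

def create_reboot_alarm(instance_id):
    import boto3  # lazy import so the builder is testable without boto3
    boto3.client("cloudwatch").put_metric_alarm(**reboot_alarm(instance_id))
```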
Hard limits are just the nature of the technology, but soft limits can be overridden by submitting a ticket to Amazon support. And so with CloudWatch there is an initial limit of 5,000 alarms per account; that's 5,000 alarms across all regions. Some limits within AWS are specific to a region, and some are broader, specific to an account across all regions. So, let's take a look here at a diagram. In this diagram, you can see that we have a load balancer, and this load balancer is sending metrics to CloudWatch, such as requests per minute, the number of 500s, the number of 400s, backend errors, and those kinds of things. Our EC2 instances are also individually reporting metrics, such as CPU usage, disk I/O, and network I/O. And then our auto scaling group is also reporting metrics about that group in aggregate, so average CPU, average disk usage, network usage, and so on. Our RDS instance is also reporting metrics into CloudWatch. So CloudWatch is collecting these metrics from various different types of places.
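Pulling two of those sources side by side, load-balancer traffic against instance CPU, might be sketched with a single `GetMetricData` request. The classic-ELB namespace is used here, and the load balancer name and instance ID are placeholder assumptions.

```python
# Sketch: build the MetricDataQueries for one GetMetricData call that
# fetches load-balancer request volume and instance CPU together, so the
# two series can be overlaid. Names and IDs below are placeholders.
def correlation_queries(lb_name, instance_id):
    return [
        {
            "Id": "requests",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ELB",
                    "MetricName": "RequestCount",
                    "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
                },
                "Period": 60,
                "Stat": "Sum",      # total requests per minute
            },
        },
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": 60,
                "Stat": "Average",  # average CPU over the same minute
            },
        },
    ]
```

These queries would be passed to `boto3.client("cloudwatch").get_metric_data(MetricDataQueries=..., StartTime=..., EndTime=...)`.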
And so, at the very least, one really helpful thing is that if you are seeing some performance degradation, you can look back at those CloudWatch metrics in the console, create a graph, overlay different metrics, and then correlate what you're seeing in the load balancer, maybe requests per minute, with how that affects CPU or disk I/O within your EC2 instances, and/or how that is affecting your database. Right? And then of course, here we have an alarm. You can see that we've defined an alarm that says: when CPU utilization is greater than 80% for two periods of one minute. So essentially, when CPU is greater than 80% for two minutes, we want an alarm to go off. Now again, it's not necessarily an emergency, and what happens is totally up to us. We get to configure what happens when that alarm goes off. It's very possible that nothing happens; we can have an alarm that just goes off and then nothing happens. But here, in this example, you can see that we could use that alarm to trigger auto scaling.
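The alarm from the diagram, CPU above 80% for two one-minute periods, might be sketched like this. The auto scaling group name and scaling policy ARN are placeholder assumptions.

```python
# Sketch: "CPU > 80% for two periods of one minute" on an auto scaling
# group, triggering a scale-out policy. Group name and policy ARN are
# placeholders, not real resources.
def cpu_scaling_alarm(asg_name, scale_out_policy_arn):
    """Build PutMetricAlarm parameters for the diagram's CPU alarm."""
    return {
        "AlarmName": f"{asg_name}-cpu-high",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 60,             # one-minute periods...
        "EvaluationPeriods": 2,   # ...evaluated twice: above 80% for two minutes
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [scale_out_policy_arn],
    }
```

An alarm with an empty `AlarmActions` list is the "goes off and nothing happens" case; swapping in an SNS topic ARN instead gives the notification path discussed next.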
So perhaps that alarm signifies that there is more work to be done than what can be done with our current set of instances. And so this could signal the need to grow and expand our fleet of EC2 instances in order to meet that demand. We can also send that alarm out through the Simple Notification Service. And so here we have an SNS topic, and then from there we can do a number of things. We can have Lambda respond to it, right? So we can write a Lambda function that would respond to that alarm in some kind of intelligent way, doing some kind of automated process in response to that alarm.
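Such a Lambda function might be sketched as below. SNS delivers the CloudWatch alarm as a JSON document inside the record's `Message` field; the "automated process" itself is left as a placeholder.

```python
# Sketch: a Lambda handler subscribed to the SNS topic that receives the
# alarm. SNS wraps the alarm document, JSON-encoded, in Records[0].Sns.Message.
import json

def handler(event, context=None):
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    if alarm.get("NewStateValue") == "ALARM":
        # Placeholder for a real automated response (scale, tag, page, ...).
        return f"responding to {alarm['AlarmName']}"
    return "nothing to do"
```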
Another thing that we could do, if that alarm is for a particular application, is send that alarm to a Slack channel so that our developers see it right away. In many teams that I've worked with, developers have two things in front of them all the time: their IDE and their Slack channel. And so instead of putting notifications and alarms off in some place that requires them to go look for them, it's better, in my experience, to have that alarm go right to where they already are, such as Slack. And also, instead of that alarm or notification going to one person, and waiting for that one person, we can send the notification to an entire team of people, so that we have a greater chance of multiple people being aware of an issue. And I've seen that be a really powerful pattern for helping development teams jump on issues much faster. We can also have those alarms sent to some other kind of ticketing or bug-tracking software like Jira, or some kind of SIEM system. Or, as I mentioned, maybe you're not using Slack, but some other kind of chat system.
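Forwarding the alarm into a team channel might be sketched as a small webhook post. The webhook URL below is a hypothetical placeholder; a real one would come from Slack's incoming-webhooks feature.

```python
# Sketch: format a CloudWatch alarm as a Slack message and post it to an
# incoming webhook. The URL is a placeholder, not a working endpoint.
import json
from urllib import request

WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # placeholder

def slack_payload(alarm_name, new_state, reason):
    """Build the Slack message body for an alarm notification."""
    return {"text": f":rotating_light: {alarm_name} is {new_state}: {reason}"}

def notify_slack(alarm_name, new_state, reason):
    body = json.dumps(slack_payload(alarm_name, new_state, reason)).encode()
    req = request.Request(WEBHOOK_URL, data=body,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # fires the webhook; not invoked in this sketch
```

The same payload-building function could feed any other chat system's webhook with minor changes.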
The point is to automate the collection of metrics, automate the triggering of alarms, and then get those notifications to the appropriate people in an efficient way. So you can see that CloudWatch plays a very key role within our AWS infrastructure. And we will take a closer look at CloudWatch and CloudWatch logs as the course progresses.