- Now let's review a use case of analytics for a subscription video on demand service. As you can see, when it comes to analytics we have a lot of options, so we'll just start from the top, work our way down, and explain what's going on. Let's say that we have an API, and that API is powered by containers running within ECS or EKS. It could be running on EC2 directly, but this example comes from personal experience running a video subscription service, very similar to Netflix, where all of the microservices were powered by ECS, and they were all available through a load balancer. All of those microservices, a couple of dozen different applications, were sending logs to CloudWatch Logs. The benefit of streaming those logs to CloudWatch Logs is that if a container went away, which it often did, we still had those logs; we didn't have to worry about them being unavailable.
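The logging setup just described can be sketched as follows. The service name and field names are hypothetical, but emitting one JSON object per log line is what makes the logs easy to index later in CloudWatch Logs and Elasticsearch, and what preserves them after a container goes away.

```python
import json
from datetime import datetime, timezone

def format_log_event(service: str, level: str, message: str, **fields) -> str:
    """Render one structured log line. A container writes this to stdout,
    and the container log driver forwards it to CloudWatch Logs."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        **fields,  # arbitrary extra context, e.g. user_id, request_id
    }
    return json.dumps(event)

# Example: a playback microservice logging a stream start
line = format_log_event("playback-api", "INFO", "stream started",
                        user_id="u-123", video_id="v-456")
```

Because the line is self-describing JSON rather than free text, the logs survive the container and stay queryable downstream.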
It was then very easy to connect CloudWatch Logs to the Amazon Elasticsearch Service and have those logs stream, essentially in real time, into Elasticsearch, so that developers could very easily access them there. Now technically, developers could access logs in CloudWatch Logs directly, but Elasticsearch gave us much more power and flexibility, not only in how we could view those logs, but in how we could run analysis on them as well. The biggest benefit to our developers was really understanding how our applications were performing by looking at those logs. We also had several different databases within the Relational Database Service. We had MySQL data stores, and we also had PostgreSQL databases, within both Aurora- and non-Aurora-based RDS instances. Certain data was stored there, like user account information, video metadata, and a number of other things, and we also had a lot of financial information stored in a SQL Server database.
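As a rough illustration of the extra power Elasticsearch gives over plain log viewing, a query body like the following (the field names are assumptions carried over from structured logging, not anything from the original service) filters on several fields at once and aggregates the results, which is exactly the kind of analysis the transcript refers to:

```python
# Hypothetical Elasticsearch query body: find recent ERROR-level events
# for one microservice, and count them per endpoint. Multi-field filtering
# plus aggregation is where Elasticsearch outpaces raw log streams.
def build_log_query(service: str, level: str, since: str) -> dict:
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            # bucket matching errors by endpoint to spot hotspots
            "by_endpoint": {"terms": {"field": "endpoint"}}
        },
    }

query = build_log_query("playback-api", "ERROR", "now-1h")
```

A developer would post this body to an Elasticsearch search endpoint; the same question against a raw log stream would require scanning and parsing every line by hand.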
That was primarily because the finance and accounting departments' tools were all Windows-based, and we stored information related not just to billing, but also to the cost of producing videos and the cost of marketing those videos. By storing all of that there, finance and accounting could use their Windows-based tools to work with that data, but we could also, as we'll talk more about later, pull that into Redshift. We also stored a number of other types of information in DynamoDB. We actually duplicated a lot of the user information, and the video information was duplicated from MySQL into DynamoDB. That's because we had CMS tools, content management systems, that allowed editors and content creators to edit that content in MySQL. But in production, our users, coming from their mobile applications and from the website, got much better performance reading it from DynamoDB.
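The duplication just described, editors writing to MySQL while readers hit DynamoDB, implies a reshaping step somewhere in between. A minimal sketch of that transform, with entirely hypothetical column and key names:

```python
def mysql_row_to_dynamodb_item(row: dict) -> dict:
    """Reshape a relational video row into a DynamoDB item keyed for
    fast reads by the website and mobile apps. Names are illustrative."""
    return {
        "pk": f"VIDEO#{row['video_id']}",        # DynamoDB partition key
        "title": row["title"],
        "duration_seconds": row["duration_seconds"],
        "published": bool(row["published"]),     # MySQL tinyint -> boolean
    }

# Example: one row as it might come back from the CMS's MySQL table
item = mysql_row_to_dynamodb_item(
    {"video_id": 42, "title": "Episode 1", "duration_seconds": 1380, "published": 1}
)
```

The key design choice is denormalization: the item is keyed exactly the way the apps read it, so a lookup is a single fast get rather than a relational join.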
And so we had other services in the background that could replicate that data from MySQL over to DynamoDB, performing an ETL operation. We had other data coming in as users performed different actions: as they clicked on a web page, as they searched for something, as they pressed play, as they paused, as they fast-forwarded, and as they continued to watch a video. There were so many different events that users were generating and that our API was collecting. A lot of those events were sent into Amazon Kinesis, and Amazon Kinesis gave us a reliable way of ingesting a very high volume of data coming in at a high velocity. We split that in a couple of different ways. We had what we would call Kinesis-enabled applications, or consumers, Kinesis consumers, that were reading from this Kinesis stream and performing a number of different kinds of analysis on that stream. So we were producing real-time information, such as how many people are viewing the website right now and how many people are viewing the videos right now.
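A player event headed for Kinesis might be shaped like this. The payload fields are assumptions, but the Data/PartitionKey pairing is the shape a Kinesis put-record call expects, and partitioning by user ID is a common way to spread a high-velocity stream across shards while keeping one user's events in order:

```python
import json

def make_kinesis_record(user_id: str, video_id: str, event_type: str,
                        position_seconds: int) -> dict:
    """Build the Data/PartitionKey pair for a Kinesis put-record call.
    (The actual AWS call is omitted; this just shows the record shape.)"""
    payload = {
        "user_id": user_id,
        "video_id": video_id,
        "event": event_type,          # e.g. "play", "pause", "seek"
        "position_seconds": position_seconds,
    }
    return {
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": user_id,      # one user's events land on one shard, in order
    }

# Example: a user pressing play at the start of a video
record = make_kinesis_record("u-123", "v-456", "play", 0)
```

Choosing the partition key is the main design decision here: keying by user preserves per-user ordering, while keying by video would instead group all activity for one title.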
Then there were other kinds of longer-term trend analysis, such as which videos are more popular based on how long they were watched. Are people watching to a certain point and then skipping ahead? That kind of information. So we could use Kinesis-enabled applications, consumers, to read from that stream, perform analysis on real-time metrics, and store the results in DynamoDB. And then, of course, other consumers could potentially read that. For example, if we wanted to display the popularity of a particular video, these Kinesis applications could determine that popularity, and then others could read it from DynamoDB. These Kinesis streams could also connect to Kinesis Data Firehose, which is sort of an out-of-the-box solution for writing Kinesis data either directly to S3 or directly to Redshift, among a couple of other destinations. So here, we could leverage Kinesis Data Firehose, without writing our own code, to get that data into flat files in S3, where they would remain for some time.
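The "how many people are watching right now" consumer can be sketched as a sliding-window counter. A real Kinesis consumer would feed stream events into something like this and periodically write the count to DynamoDB; all names and the window length are illustrative:

```python
class ConcurrentViewerCounter:
    """Track roughly how many users are watching right now, based on the
    most recent play/heartbeat event seen per user within a time window."""

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self.last_seen = {}  # user_id -> timestamp of most recent event

    def record_event(self, user_id: str, timestamp: float) -> None:
        self.last_seen[user_id] = timestamp

    def current_viewers(self, now: float) -> int:
        # a user counts as "watching" if they sent an event inside the window
        cutoff = now - self.window_seconds
        return sum(1 for t in self.last_seen.values() if t >= cutoff)

counter = ConcurrentViewerCounter(window_seconds=60)
counter.record_event("u-1", 100.0)
counter.record_event("u-2", 150.0)
counter.record_event("u-1", 155.0)  # u-1 still active
```

At time 160 both users fall inside the 60-second window; by time 260 neither does, so the "right now" number naturally decays as events stop arriving.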
And from there, we had access to large datasets that we could run analysis on using Amazon Athena; so our analytics team, our data team, could use their own SQL-based tools, connecting to Athena using ODBC or JDBC drivers, and run whatever ad-hoc queries they could think of against that large dataset within S3. And then, of course, as I mentioned earlier, we had user data, video data, and a number of other types of data in MySQL, and we had financial information related to those videos in SQL Server. We also had real-time popularity information within DynamoDB, and so we could use Data Pipeline to pull all of that data in from these various sources, performing an extract and transformation, and then load it into Redshift, so that our analytics team could perform regular queries around questions like: how does the popularity of a video relate to the money that we put in to produce it? We spent a certain amount of money to make a particular video or series of videos, and we spent a certain amount of money to market it.
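An ad-hoc Athena query over the Firehose-delivered event files might look like the one below. The table and column names are invented for illustration; the point is the pattern, standard SQL run directly against flat files in S3, with no loading step:

```python
# Hypothetical Athena query: which videos were started most often in the
# last week? Athena scans the Firehose-delivered files in S3 directly.
athena_query = """
SELECT video_id,
       COUNT(*) AS plays
FROM   player_events
WHERE  event = 'play'
  AND  event_ts >= date_add('day', -7, current_date)
GROUP  BY video_id
ORDER  BY plays DESC
LIMIT  20;
"""
```

An analyst would submit this through any ODBC/JDBC-connected SQL tool; Athena charges per data scanned, which is why teams typically partition and compress the S3 files.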
Are we seeing a level of engagement that justifies that expense? The only way to know that is to be able to join user data and account data with financial data and popularity data. And again, we could do that by pulling all of that data into Redshift and allowing our analytics team, very much like with Athena, to connect to Amazon Redshift and run any ad-hoc query they could think of across a very large dataset. They could determine not only popularity as it relates to cost, but also how our marketing efforts relate to users continuing their memberships and subscriptions. Are users canceling their subscriptions? Is there a relationship between activity on a video and users either canceling or renewing their memberships? Is there a relationship between activity on a new series and whether or not we were getting new users signing up? And is the money we're making on that allowing us to be profitable, considering the money we spent to create and market it?
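The question above, does engagement justify the spend, comes down to joining watch data with cost data and computing a ratio. A toy version of the arithmetic the Redshift queries would produce, with invented numbers:

```python
def cost_per_hour_watched(production_cost: float, marketing_cost: float,
                          total_seconds_watched: int) -> float:
    """One simple way to relate spend to engagement: total dollars in,
    divided by total hours of viewing out. Lower is better."""
    hours = total_seconds_watched / 3600
    return (production_cost + marketing_cost) / hours

# e.g. a series that cost $50,000 to produce, $10,000 to market,
# and accumulated 120,000 hours of viewing across all users:
metric = cost_per_hour_watched(50_000.0, 10_000.0, 120_000 * 3600)
# -> $0.50 of spend per hour watched
```

In practice the numerator comes from the SQL Server finance data and the denominator from the event stream, which is exactly why both had to land in the same warehouse before the division could happen.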
Right, so there was a lot of information here that required some very complex SQL statements that could really only be run in one place, and Redshift served that really well. So, as you can see, when it comes to analytics within AWS, we have a lot of options, a lot of very powerful options, and it's very common for applications to use a number of different data stores, because each one of these is targeted at a particular use case, and it serves that kind of data, that kind of scenario, that kind of access pattern, really well. And of course there are others. There's Amazon EMR, which was not really used in this particular scenario, but that is also an option. So again, as you continue to move forward in your exploration of AWS, I would highly encourage you to explore some of these tools, such as Kinesis, DynamoDB, Athena, Redshift, Elasticsearch, and of course Amazon EMR as well.