- The Amazon Athena service is a great example of highlighting the differences between on-prem infrastructure and cloud-native architecture. So what is Athena? It is region scoped, and it's designed around data analysis, but it does it in a special way that you really can't achieve in an on-prem environment, because it utilizes data that is sitting in S3, yet you can query it as if it were a relational database, using SQL queries. The underlying engine is entirely serverless, but it utilizes third-party offerings that you may have heard of, including Presto for distributing those SQL queries, and Apache Hive as a metastore to help understand how the data is organized into individual tables and schemas.

Now, what does an actual Athena table look like? Well, the table itself is just a container, a logical resource that contains the definition of the metadata. It specifies the location of the data in terms of S3 buckets and prefixes, as well as the actual structure of the data itself. The tables in Athena don't actually contain the data; all of the data is in the individual objects in S3.
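To make that concrete, here is a minimal sketch of the kind of DDL an Athena table definition boils down to. The table name, column names, bucket, and prefix below are all hypothetical; the point is that the statement holds only metadata, with LOCATION pointing at the objects in S3 that hold the actual data.

```python
# Hypothetical sketch: an Athena table is only metadata -- a schema plus
# an S3 LOCATION. The table, columns, bucket, and prefix here are made up.
def make_table_ddl(table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement pointing at data in S3."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"LOCATION '{s3_location}';"
    )

ddl = make_table_ddl(
    "web_logs",
    [("request_time", "timestamp"), ("status", "int"), ("uri", "string")],
    "s3://example-bucket/logs/",  # this prefix holds the actual data objects
)
print(ddl)
```

Dropping the table only removes this metadata; the objects under the S3 prefix are untouched.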
You can create Athena tables automatically, or you can create them by hand, especially if you have a number of disparate sources for that data that might even be in other AWS accounts. And finally, you can organize your individual Athena tables together to create a database, a logical grouping of those tables, if you want to be able to issue queries across multiple tables at the same time.

Now, the third and most important element of Athena is called the data catalog, and this is really important for big data and data lake uses. This is a system for organizing all of the individual tables into a cohesive unit. The data catalog is what helps to combine the dataset, which is the actual underlying data, as well as the tables and schemas, those definitions, into a data source that can then be treated as a logical object.

There's a legacy version of data catalogs that are managed by the Athena service itself, and these are being phased out. You may or may not have access to these in your AWS account.
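That grouping is what makes cross-table queries possible: a database acts as a namespace, so a query can qualify each table with the database name and join across them. The sketch below builds such a query; the database, table, and key names are hypothetical.

```python
# Hypothetical sketch: a database is a namespace over tables, so a
# cross-table query just qualifies each table with the database name.
def cross_table_query(database, left, right, join_key):
    """Build a SQL query joining two Athena tables in the same database."""
    return (
        f"SELECT *\nFROM {database}.{left} AS l\n"
        f"JOIN {database}.{right} AS r\n"
        f"  ON l.{join_key} = r.{join_key};"
    )

sql = cross_table_query("analytics", "web_logs", "app_logs", "request_id")
print(sql)
```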
If you create a data catalog today, though, it actually uses another data analytics service called 'Glue', which we're going to be talking about in a later lesson, to help organize those data catalogs.
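As a rough sketch of how all these pieces come together at query time, the function below assembles the request shape that boto3's Athena `start_query_execution` call expects: a query string, the database (resolved through the catalog), and an S3 location for results. The database name and buckets are placeholders, and no AWS call is actually made here.

```python
# Sketch of the request shape for boto3's Athena start_query_execution.
# The database name and S3 output bucket are placeholders; nothing below
# contacts AWS -- it only builds the parameter dictionary.
def build_query_request(query, database, output_location):
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

req = build_query_request(
    "SELECT count(*) FROM web_logs;",
    "analytics",
    "s3://example-query-results/",
)
# With credentials configured, this could be passed along as:
#   boto3.client("athena").start_query_execution(**req)
print(req["QueryString"])
```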