Organizations often use cloud-based applications to analyze large amounts of data, including system and application logs, business metrics, external data sources, public data sets and many others.
AWS, the largest public cloud provider, has more than a dozen data analytics offerings. These services occasionally have overlapping functionality, which can make it harder to know which one to choose.
Let's look at three services for data analysis on AWS -- Amazon Redshift, Amazon EMR and Amazon Athena -- to help find the right fit for your data analysis needs.
Redshift is a managed data warehouse that stores data and runs analysis queries against it in a centralized location. The work is done in a Redshift cluster that consists of one or more compute nodes that also store data. While Redshift also supports analyzing data stored in Amazon S3 using Amazon Redshift Spectrum, its main focus is on analyzing data stored in the cluster itself.
Redshift is designed to pull together data from lots of different sources and store it in a structured fashion. This builds data consistency rules directly into the tables of the database. Amazon Redshift is the best service to use when you need to perform complex queries on massive collections of structured and semi-structured data with fast performance. This makes it ideal for, say, a retail business that needs to regularly create performance reports based on data from inventory, financial and retail sales systems.
However, because the data is stored directly in the cluster, this model makes it difficult to reduce the size of a Redshift cluster based on demand. This typically results in high cost, given that Redshift clusters usually run around the clock. IT teams can alleviate this issue with reserved node offerings, which are billed at discounted hourly rates for either a one-year or three-year period.
Athena is a serverless service for data analysis on AWS, mainly geared toward accessing data stored in Amazon S3. But since it can access data sources defined in the AWS Glue Data Catalog, it can also query Amazon DynamoDB, Redshift and sources reachable through ODBC/JDBC drivers.
Data analysts use Athena, which is built on Presto, to execute queries using SQL syntax. Unlike with Redshift and EMR, users don't have to explicitly configure the underlying compute infrastructure, and they only pay for data scanned ($5 per terabyte in most regions), which makes it a cost-effective tool in most cases.
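To make the pay-per-scan model concrete, here is a minimal Python sketch of the arithmetic. It uses the $5-per-terabyte figure quoted above; the function names, the binary definition of a terabyte and the sample workload are assumptions for illustration, and the sketch ignores any per-query minimums or regional price differences.

```python
# Rough Athena cost estimate based on data scanned.
# Assumption: 1 TB is treated as 1024**4 bytes; check current AWS pricing
# for the exact unit and for per-query minimums, which are ignored here.
BYTES_PER_TB = 1024 ** 4
PRICE_PER_TB_USD = 5.00  # figure quoted in the article for most regions


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated cost in USD of a single query."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def estimate_monthly_cost(queries_per_day: int,
                          avg_bytes_per_query: int,
                          days: int = 30,
                          price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated monthly cost in USD for a steady query workload."""
    total_bytes = queries_per_day * days * avg_bytes_per_query
    return estimate_query_cost(total_bytes, price_per_tb)


if __name__ == "__main__":
    # Hypothetical workload: 20 queries a day, each scanning 50 GB.
    fifty_gb = 50 * 1024 ** 3
    print(f"One query:  ${estimate_query_cost(fifty_gb):.2f}")
    print(f"Per month:  ${estimate_monthly_cost(20, fifty_gb):.2f}")
```

Because cost scales with bytes scanned rather than with time, anything that reduces the scan, such as partitioning data in S3 or using a columnar format, lowers the bill directly.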
Athena is widely used to analyze log data stored in S3 for services such as Application Load Balancer, Amazon CloudFront, AWS CloudTrail, Amazon Kinesis Data Firehose and any type of log data exported into S3. While this service is the easiest way to get visibility into data stored in S3, services such as EMR can deliver better performance, albeit at a potentially higher cost, since developers control the underlying infrastructure.
Athena is a good fit for infrequent or ad hoc data analysis needs, since users don't have to launch any infrastructure and the service is always ready to query data.
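As an illustration of that ad hoc workflow, the following Python sketch submits a SQL query and waits for it to finish. It assumes a boto3 Athena client is passed in; the database name, table name, column name and S3 output location are hypothetical placeholders, and this is one possible approach rather than a recommended implementation.

```python
import time

# Hypothetical query against a table of load balancer logs.
# The table and column names are placeholders, not taken from the article.
EXAMPLE_SQL = """
SELECT elb_status_code, COUNT(*) AS requests
FROM alb_logs
GROUP BY elb_status_code
ORDER BY requests DESC
"""

FINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Start a query, poll until it ends, and return (state, bytes_scanned).

    `client` is expected to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    for _ in range(max_polls):
        response = client.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        if state in FINAL_STATES:
            # Bytes scanned is what the per-terabyte charge is based on.
            stats = execution.get("Statistics", {})
            return state, stats.get("DataScannedInBytes", 0)
        time.sleep(poll_seconds)

    raise TimeoutError(f"Query {query_id} did not finish in time")
```

With AWS credentials configured, a call might look like `run_athena_query(boto3.client("athena"), EXAMPLE_SQL, "logs_db", "s3://example-bucket/athena-results/")`, where the database and bucket are again placeholders. No cluster has to be launched first, which is the point of the serverless model.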
Amazon EMR provides managed deployments of popular data analytics platforms, such as Presto, Spark, Hadoop, Hive and HBase, among others. EMR automates the launch of compute and storage nodes powered by Amazon EC2 instances, and more recently AWS Fargate.
While data can be stored inside EC2 instances using HDFS (Hadoop Distributed File System), the service also supports querying data stored in sources outside the cluster, such as relational databases or S3.
Keeping data outside the cluster makes it possible to reduce the cluster size based on demand and therefore optimize cost. The service is a good fit for teams that prefer or need to use any of the popular platforms supported in EMR (e.g., Presto or Spark). It also supports Reserved Instances and Savings Plans for EC2 clusters and Savings Plans for Fargate, which can help lower cost.
EMR is a good fit for predictable data analysis tasks, typically on clusters that need to be available for extended periods of time. This includes data workloads in which having control over the underlying infrastructure -- EC2 instances and S3 storage -- would optimize performance and justify the additional work.