Why big data, testing make good Amazon EC2 spot instance candidates

Amazon EC2 spot instances are ideal for use with some cloud apps. Running tasks like big data analysis on them can even improve your bottom line.

When deploying applications in the cloud, you need to determine acceptable performance parameters. If you’re hosting applications in Amazon EC2 that are not time-sensitive and have fault-tolerant designs, those apps make perfect candidates for Amazon EC2 spot instances.

Batch-oriented processing jobs, such as testing, big data analysis, and extract, transform and load (ETL) operations, are good candidates for spot instances. These jobs are typically scheduled to run without much end-user interaction. They also lend themselves to dividing the entire workload into smaller tasks that can be completed independently. For example, if your software development group is doing regression testing on a new cloud-based app, they can submit tests for one module to one instance and tests for another module to a different instance. Tests that span both modules can be submitted to a third instance.
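As a rough illustration of that split, the sketch below groups regression tests by the set of modules each one touches, so that every group could be handed to its own instance. The test names, module names and the `group_tests_by_modules` helper are hypothetical, and submitting a batch to an actual instance is left out.

```python
from collections import defaultdict


def group_tests_by_modules(tests):
    """Group test names by the set of modules each test touches.

    `tests` maps a test name to the modules it exercises (made-up data).
    Each resulting batch is independent and could run on its own instance.
    """
    batches = defaultdict(list)
    for name, modules in tests.items():
        batches[frozenset(modules)].append(name)
    return {key: sorted(names) for key, names in batches.items()}


# Hypothetical regression suite for an app with two modules.
tests = {
    "test_login": ["auth"],
    "test_logout": ["auth"],
    "test_invoice": ["billing"],
    "test_paid_signup": ["auth", "billing"],  # spans both modules
}

batches = group_tests_by_modules(tests)
for modules, names in batches.items():
    print(sorted(modules), "->", names)
```

With this data the suite falls into three batches: one per module, plus one for the test that spans both.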

Big data analysis and data warehouse ETL operations also fit well on spot instances, even though their subtasks are not necessarily independent of one another. Consider, for example, a typical data warehouse aggregation problem.

A data warehouse collects data from multiple source locations, such as branch offices. Some branch data is aggregated vertically -- from branch office totals into regional totals and then to corporate totals. Other data, such as sales totals for individual products across all stores, is aggregated horizontally.

Amazon EC2 spot instances can be used to perform ETL operations by region or by product line. A summary script could be used to aggregate data from multiple spot instances. The same script could detect when a region or product total has not been completed, presumably because Amazon reclaimed a spot instance, and restart the job to calculate the missing data.
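A minimal sketch of such a summary script follows. It assumes each spot instance writes its regional total somewhere the script can read, represented here by a plain dictionary; the region names, figures and the `summarize` function are hypothetical, and the resubmission step is only printed rather than performed.

```python
def summarize(expected_regions, completed):
    """Return (partial corporate total, regions still missing).

    `completed` maps a region to the total produced by a finished job.
    A region absent from `completed` is treated as not yet calculated.
    """
    missing = sorted(r for r in expected_regions if r not in completed)
    corporate_total = sum(completed[r] for r in expected_regions if r in completed)
    return corporate_total, missing


# Hypothetical state: the job for "north" never reported a result.
expected = ["east", "north", "south", "west"]
completed = {"east": 120.0, "south": 80.0, "west": 50.0}

total, missing = summarize(expected, completed)
for region in missing:
    # A real script would resubmit the ETL job here.
    print(f"resubmitting ETL job for region: {region}")
```

The corporate total stays provisional until the list of missing regions is empty.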

Challenges with big data analysis and Amazon EC2 spot instances 
Amazon spot instances are ideal for use with big data analysis in the cloud, but you can encounter some problems. Moving big data to the cloud can be expensive and slow.

If the large volumes of data you want to analyze are already in the public cloud, then spot instances will work. However, if the cost of uploading and storing large volumes of data outweighs the savings of using cloud computing resources, consider using in-house clusters or a private on-premises cloud.

If spot instances are an option for your big data analysis requirements, you might be able to use Amazon Elastic MapReduce (EMR) with EC2 spot instances. EMR implements the MapReduce paradigm to process big data. This computational model works well for tasks in which large volumes of data can be split into pieces and analyzed independently (the map phase); the results are then combined into a new set of data (the reduce phase), which can itself be processed in another map-reduce pass.
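The map and reduce phases can be illustrated in a few lines of plain Python. This is only a sketch of the pattern running in a single process, not EMR's API; the sales records and function names are made up for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (product, amount) pairs from one chunk, independently of other chunks."""
    return [(record["product"], record["amount"]) for record in chunk]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sales records, split into two chunks (for example, one per branch).
chunks = [
    [{"product": "widget", "amount": 5}, {"product": "gadget", "amount": 2}],
    [{"product": "widget", "amount": 3}],
]

totals = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(totals)
```

Because each call to `map_phase` depends only on its own chunk, the chunks could be processed on separate instances, and a chunk lost to a reclaimed instance could be mapped again without touching the others.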

Many -- but not all -- big data projects are a good fit for MapReduce. Network analysis problems, such as analyzing social networks or the flow of email messages, do not lend themselves to MapReduce. In addition to providing a scalable platform for analyzing large data sets, EMR provides fault-tolerance capabilities that support recovery when spot instances are reclaimed.

Amazon EMR is just one way to introduce fault tolerance into your application architecture. If you are working with custom applications that were not designed with fault tolerance in mind, consider using a checkpointing strategy that periodically saves the application's computational state to persistent storage. When your application starts, it can read the last saved checkpoint and continue processing from that point.
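One way such a checkpointing strategy might look is sketched below. A local JSON file stands in for persistent storage; on a real spot instance the checkpoint would need to live somewhere that survives the instance. The file layout, function names and the `stop_after` parameter (which only simulates an interruption) are assumptions made for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "running_total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint,
    so an interruption never leaves a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def process(items, path, stop_after=None):
    """Sum `items`, saving a checkpoint after each one.

    `stop_after` simulates the instance being reclaimed after that many items.
    """
    state = load_checkpoint(path)
    handled = 0
    while state["next_index"] < len(items):
        if stop_after is not None and handled >= stop_after:
            return state  # simulated interruption
        state["running_total"] += items[state["next_index"]]
        state["next_index"] += 1
        save_checkpoint(path, state)
        handled += 1
    return state


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
items = [10, 20, 30, 40]

first_run = process(items, checkpoint_path, stop_after=2)  # interrupted early
second_run = process(items, checkpoint_path)               # resumes at index 2
print(first_run, second_run)
```

The second run does not repeat the first two items; it picks up the saved index and running total and finishes the remaining work.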

You can also use message queuing to keep a list of tasks that still need to be processed. An application running on an Amazon EC2 spot instance can take a task from a pending-jobs queue and add a message to an in-process queue to indicate that the spot instance is working on the task. When the job is complete, the application removes it from the in-process queue. A script can run periodically to check the age of jobs in the in-process queue and add jobs back to the pending queue if they have not been completed in a reasonable amount of time (presumably because the spot instance was reclaimed).
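The two-queue pattern can be sketched with in-memory structures, as below. In practice a hosted message queue would take their place; the `TaskQueues` class, the task names and the explicit `now` timestamps are hypothetical and exist only to keep the example self-contained and repeatable.

```python
from collections import deque


class TaskQueues:
    """A pending queue plus an in-process record of claimed tasks."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.in_process = {}  # task -> time at which a worker claimed it

    def claim(self, now):
        """Move the next pending task to the in-process queue and return it."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.in_process[task] = now
        return task

    def complete(self, task):
        """Remove a finished task from the in-process queue."""
        self.in_process.pop(task, None)

    def requeue_stale(self, now, max_age):
        """Put tasks claimed more than `max_age` ago back on the pending queue."""
        stale = [t for t, started in self.in_process.items() if now - started > max_age]
        for task in stale:
            del self.in_process[task]
            self.pending.append(task)
        return stale


queues = TaskQueues(["region-east", "region-west"])
task_a = queues.claim(now=0)  # worker A takes region-east
task_b = queues.claim(now=5)  # worker B takes region-west
queues.complete(task_b)       # worker B finishes; worker A never reports back

stale = queues.requeue_stale(now=100, max_age=60)
print(stale, list(queues.pending))
```

Choosing `max_age` is a judgment call: too short and slow but healthy jobs get run twice, too long and work lost to a reclaimed instance sits idle. Tasks should therefore be safe to run more than once.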

Spot instances can help reduce costs when running big data analytics operations in the cloud. Be sure to consider the performance requirements and fault-tolerance characteristics of your applications before running them on spot instances.


Dan Sullivan, M.Sc., is an author, systems architect and consultant with over 20 years of IT experience with engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education, among others. Dan has written extensively about topics ranging from data warehousing, cloud computing and advanced analytics to security management, collaboration, and text mining.
