Cloud computing gives us the ability to rapidly deploy large numbers of servers for compute-intensive tasks. Do you need to transform, filter and analyze terabytes of data? Do you have thousands of media files that need to be converted to a different format?
Distributing a large workload over a cluster of servers in the Amazon cloud is an option for many of these large, data-intensive tasks. It can be a challenge to set up and manage a cluster, but the
StarCluster software allows users to create clusters with simple command line procedures. Clusters consist of a single master server, multiple worker servers and Elastic Block Store (EBS) volumes. When you issue a command to create a cluster, StarCluster does the following:
- Instantiates virtual machine instances
- Configures a new security group
- Defines user-friendly host names (e.g., node001)
- Creates a nonadministrator user account
- Configures Secure Shell (SSH) for password-less logins
- Defines Network File System (NFS) file shares across the cluster
- Configures the Oracle Grid Engine queuing system for managing jobs across the cluster
Using StarCluster in Amazon cloud
StarCluster is written in Python, so you can install it from the Python Package Index using the easy_install command. Linux and Mac users likely already have Python installed, but Windows users may need to install Python 2.7 and the Python setup tools first. Once the StarCluster tool is installed, you will need to define a configuration file with basic information about your cluster and Amazon cloud account, such as access IDs and key locations. StarCluster creates a default template you can edit to specify your account-specific information.
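A typical first session might look like the following sketch; the config file path is StarCluster's documented default, but treat the exact prompts as assumptions:

```shell
# Install StarCluster from the Python Package Index
easy_install StarCluster

# On first run, StarCluster offers to write a config template
# (by default to ~/.starcluster/config) for you to edit
starcluster help
```

You would then open the generated template and fill in your AWS access keys and the path to your EC2 key pair before starting a cluster.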
StarCluster uses a small cluster default configuration, but you can specify machine sizes, EBS volumes and number of servers as command-line options or in the configuration file. You can also specify the Amazon Web Services region in which to host your cluster. StarCluster parameters include cluster size, the Amazon machine image (AMI) to run on worker nodes, worker node instance type, master node AMI and EBS volumes attached to and NFS-shared to the cluster.
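The parameters described above live in cluster templates in the configuration file. The fragment below is a hypothetical template; every ID and value shown is a placeholder, not a working default:

```ini
; Hypothetical cluster template in ~/.starcluster/config
; (all IDs and sizes below are placeholders)
[cluster smallcluster]
KEYNAME = mykey                  ; EC2 key pair to use for SSH
CLUSTER_SIZE = 4                 ; total number of nodes
CLUSTER_USER = sgeadmin          ; nonadministrator account created on each node
NODE_IMAGE_ID = ami-xxxxxxxx     ; AMI to run on worker nodes
NODE_INSTANCE_TYPE = m1.small    ; worker node instance type
MASTER_IMAGE_ID = ami-xxxxxxxx   ; AMI for the master node
VOLUMES = mydata                 ; EBS volume(s) to attach and NFS-share
```

The same settings can generally be overridden with command-line options when you start a cluster.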
A security group is created for the cluster, allowing you to specify firewall rules for the entire cluster. Since the firewall rules are part of EC2's core functionality, they can be managed through the AWS Management Console as well as through the configuration file.
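Firewall rules can be expressed in the configuration file as permission sections referenced from a cluster template. A hedged sketch, assuming you want to open HTTP to the whole cluster:

```ini
; Hypothetical permission section: open TCP port 80 cluster-wide
[permission http]
ip_protocol = tcp
from_port = 80
to_port = 80
```

A cluster template would then reference it (e.g., PERMISSIONS = http), and the resulting rules appear in the cluster's security group in the AWS Management Console.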
Streamlining management with Oracle Grid Engine
The Oracle Grid Engine (formerly the Sun Grid Engine) is a queue system that helps streamline job management in a cluster. For example, if you have to apply transformations to thousands of files, you can use the Oracle Grid Engine to schedule a job for each of those files at once, allowing the queue to manage processing as resources become available.
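Concretely, each queued job is usually a small script. The sketch below is a hypothetical per-file job script; the script name, job name and conversion tool are assumptions, not from the article:

```shell
#!/bin/sh
# Hypothetical Grid Engine job script (convert.sh).
# Lines beginning with "#$" are qsub directives; when the script is run
# outside the queue, they are ordinary comments.
#$ -N convert-media   # job name as shown by qstat
#$ -cwd               # run from the directory the job was submitted from
#$ -j y               # merge stderr into the stdout log

# The submission loop passes each input file as the first argument.
infile="${1:-sample.avi}"
outfile="${infile%.*}.mp4"   # derive output name: sample.avi -> sample.mp4
echo "converting $infile to $outfile"
# ffmpeg -i "$infile" "$outfile"   # the actual conversion tool is an assumption
```

You would then submit one job per file, for example `for f in *.avi; do qsub convert.sh "$f"; done`, and let the queue dispatch work as nodes become free.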
Another advantage of using the Oracle Grid Engine is that it can load balance jobs across all servers in the cluster. Depending on how you configure the engine, it can also add or remove nodes from the cluster to accommodate changes in demand. The Oracle Grid Engine also monitors jobs in the queue, tracks errors and maintains other useful information about the state of jobs in the cluster.
Storing StarCluster output data
Jobs running in the cluster typically generate output, and some, if not all, of that output may need to persist after the cluster is shut down. Since these clusters are composed of Amazon virtual machines, you'll need to plan to store data either in Simple Storage Service (S3) or on EBS volumes. If you are using S3 object storage, your jobs can read and write data to S3 buckets as they would if your program were running outside a cluster. The StarCluster configuration file has an optional section for specifying volume IDs and mount points for EBS volumes. Any data written to EBS volumes mounted to a cluster will be available to other Amazon EC2 instances that attach those volumes after the cluster is shut down.
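The optional volume section mentioned above takes roughly this shape; the volume ID and mount point below are placeholders:

```ini
; Hypothetical EBS volume section; a cluster template references it
; by name (e.g., VOLUMES = mydata)
[volume mydata]
VOLUME_ID = vol-xxxxxxxx   ; existing EBS volume to attach
MOUNT_PATH = /data         ; mounted on the master and NFS-shared to workers
```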
StarCluster supports the use of spot instances for worker nodes with a command-line option for specifying a spot instance bid on a particular server type. In addition, a spothistory command shows the current, maximum and average prices for an instance type over the prior 30 days.
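A hedged sketch of those two commands, based on the options the article names (the instance type, bid amount and cluster name are placeholders):

```shell
# Check spot price history for an instance type before bidding
starcluster spothistory m1.small

# Start a cluster, bidding a hypothetical $0.50/hour for worker spot instances
starcluster start -b 0.50 mycluster
```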
StarCluster was created for scientific computing research and is configured with tools commonly used in scientific computing. Some of the tools, especially NumPy, a Python numerical package, and ATLAS, an optimized linear algebra package, are especially useful for data analysis. Plugins for StarCluster support widely used tools, such as Hadoop and MySQL.
About the author:
Dan Sullivan, M.Sc., is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
This was first published in October 2013