The idea of limitless storage capacity in the cloud can be enticing. IT admins may be tempted to keep several snapshots, backups and redundant copies of data and files in the cloud to ease the worry of losing them. That’s reasonable if this data will make or break your company. But it’s important to store data according to a cloud-based storage policy that balances the risk of lost data with the cost of storing too much for too long.
There are a few questions IT pros need to ask when creating a practical cloud storage policy, including:
- Should I use object storage, block storage or database storage?
- How many redundant copies are necessary?
- How frequently should I create snapshots?
- How long should I retain snapshots?
This tip looks at Amazon storage options as an example, but the principles apply to other public cloud providers with comparable services.
For the cloud, should I choose object storage or block storage?
Object stores are a good option if you store, update and retrieve data one object at a time, using a URL-addressable identifier. If your data fits well into an object storage model, as scanned images do, you might want to consider Amazon Simple Storage Service (S3).
When you need to store data in a file system and randomly access subsets of that data programmatically, Amazon Elastic Block Store (EBS) is a better option. With Amazon EBS, you can create a 1 GB to 1 TB storage volume that can be mounted to your Amazon Elastic Compute Cloud (EC2) instances as a raw block device. You can create a file system on the device and use it as an additional hard drive. IT teams should use block storage if they want to run a relational database on an Amazon EC2 instance and maintain persistent data files when EC2 instances are shut down.
Comparing the costs of object storage and block storage is straightforward. If you are storing up to a terabyte of data, Amazon S3 will cost $0.125 per GB per month for 99.999999999% durability or $0.093 per GB per month for 99.99% durability. Durability measures how unlikely you are to lose an object: the higher the percentage, the lower the risk. Unit costs decrease as the amount of data stored increases.
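To make the comparison concrete, the short Python sketch below works through the arithmetic using the per-gigabyte prices quoted above. The function and constant names are illustrative and not part of any AWS tool. The loss estimate assumes the durability percentage applies to each object over one year, which is an assumption about how the figure is defined rather than something this tip states. Volume discounts at larger sizes are ignored.

```python
# Prices quoted in this tip (USD per GB per month, first-terabyte tier).
S3_STANDARD_PER_GB = 0.125
S3_REDUCED_REDUNDANCY_PER_GB = 0.093


def monthly_cost(gigabytes: float, price_per_gb: float) -> float:
    """Return the flat monthly storage cost in dollars, rounded to cents."""
    if gigabytes < 0 or price_per_gb < 0:
        raise ValueError("size and price must be non-negative")
    return round(gigabytes * price_per_gb, 2)


def expected_annual_losses(object_count: int, durability_percent: float) -> float:
    """Estimate how many objects might be lost in a year.

    Assumption: durability is the yearly survival probability of a
    single object, so the loss probability is its complement.
    """
    if not 0 <= durability_percent <= 100:
        raise ValueError("durability must be a percentage between 0 and 100")
    return object_count * (1 - durability_percent / 100)
```

Under these assumptions, 1,000 GB works out to about $125 per month at full redundancy and about $93 per month with reduced redundancy, a difference of roughly $32. Calling `expected_annual_losses(10_000, 99.99)` suggests that storing 10,000 objects at 99.99% durability risks losing about one object per year, which is the trade-off being paid for.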
How many redundant copies are too many?
If you can’t readily reconstruct data and can’t risk losing it, then think twice before storing it with reduced redundancy storage. But it’s also important to remember that not all data is mission-critical. Large datasets used for test and development are suitable for reduced redundancy storage; however, you need to have the scripts used to generate test datasets so you can recreate them, if needed.
Another cloud storage option for database applications is Amazon’s DynamoDB, a NoSQL database without associated administration overhead. Storing 1,000 GB of data in Amazon DynamoDB will cost around $1,098, according to the AWS Simple Monthly Calculator. This may seem expensive, but you get more than simple storage. With Amazon DynamoDB, you have a scalable database that you can query against -- and you do not need to install and administer a database application. Simpler database services using Amazon SimpleDB will cost closer to $250 per month.
When should I create snapshots in Amazon S3?
Cloud storage depends on hardware, and hardware can fail. It’s not unreasonable to assume some of the hardware components storing your data will fail at some point in the project lifecycle. One way to mitigate that risk is to keep redundant copies of your data.
Some data storage services like Amazon S3 incorporate redundant storage to improve durability. A reduced redundancy option means you pay less while assuming the additional risk of data loss. Amazon EBS, for example, redundantly stores block devices within an Availability Zone, so a single hardware failure does not result in data loss. Amazon EBS costs $0.10 per GB per month -- less than the cost of storing on full-redundancy Amazon S3. To maintain backups of devices, you can create snapshots and store them in S3 at standard rates.
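The Amazon EBS price looks lower on its own, but snapshots add to the bill. The Python sketch below is one rough way to estimate the combined monthly cost. It assumes that "standard rates" means the $0.125 per GB per month figure quoted for full-redundancy Amazon S3, and that you already know how many gigabytes of snapshot data you are keeping. The names are hypothetical, and request or I/O charges are left out.

```python
# Rates taken from this tip (USD per GB per month).
EBS_PER_GB_MONTH = 0.10
# Assumption: snapshots are billed at the full-redundancy S3 rate.
S3_STANDARD_PER_GB_MONTH = 0.125


def monthly_block_storage_cost(volume_gb: float, snapshot_gb_stored: float) -> dict:
    """Estimate monthly cost of an EBS volume plus its retained snapshots.

    volume_gb: provisioned size of the block device.
    snapshot_gb_stored: total snapshot data currently retained in S3.
    """
    if volume_gb < 0 or snapshot_gb_stored < 0:
        raise ValueError("sizes must be non-negative")
    volume_cost = volume_gb * EBS_PER_GB_MONTH
    snapshot_cost = snapshot_gb_stored * S3_STANDARD_PER_GB_MONTH
    return {
        "volume": round(volume_cost, 2),
        "snapshots": round(snapshot_cost, 2),
        "total": round(volume_cost + snapshot_cost, 2),
    }
```

For example, a 500 GB volume with 200 GB of retained snapshots comes to about $50 for the volume and $25 for the snapshots, or $75 in total. The snapshot share grows with every extra copy you keep, which is why retention decisions matter.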
How long should I retain copies of cloud-based data?
This is a tough question. First, IT teams need to specify file retention periods in a document retention policy. Different types of data will require different retention periods.
If regulations require your company to retain data or if legal counsel advises you that some data is subject to e-discovery, you may have to retain it for a specific period to stay in compliance. Other data isn’t as important. Developers may only need to keep test data sets long enough to finish testing or until another test set has been generated.
When it comes to system-level backups, you should decide how much risk you are willing to take. For example, you might perform full backups every week and keep them for four weeks. You also might run nightly incremental backups that you keep until the next full backup. This approach limits the amount of storage you dedicate to backups, but it could leave you without a way to recover data that was deleted before the oldest backup was taken.
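The example schedule above can be expressed as a small pruning rule. The following Python sketch is one possible interpretation and not a feature of any backup product: the `Backup` class and `backups_to_keep` function are hypothetical names, a full backup is treated as still retained on the day it turns exactly four weeks old, and incremental backups are kept only if they are newer than the most recent full backup.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Backup:
    taken_on: date
    kind: str  # "full" or "incremental"


def backups_to_keep(backups, today, full_retention_weeks=4):
    """Return the backups to retain, oldest first.

    Full backups are kept for `full_retention_weeks`. Incremental
    backups are kept only until the next full backup supersedes them.
    """
    cutoff = today - timedelta(weeks=full_retention_weeks)
    full_dates = [b.taken_on for b in backups if b.kind == "full"]
    latest_full = max(full_dates, default=None)

    keep = []
    for backup in backups:
        if backup.kind == "full":
            # Retain full backups that are still inside the window.
            if backup.taken_on >= cutoff:
                keep.append(backup)
        elif latest_full is None or backup.taken_on > latest_full:
            # Retain incrementals not yet superseded by a full backup.
            keep.append(backup)
    return sorted(keep, key=lambda b: b.taken_on)
```

Anything not returned by `backups_to_keep` is a candidate for deletion. Running a rule like this on a schedule keeps storage use bounded, at the cost noted above: data deleted before the oldest retained full backup cannot be recovered.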
With Amazon EBS running as a file system, you can use your favorite backup application to create full, differential or incremental backups. If you are running a relational database, you can use the relational database’s dump utility to create files that can be stored on S3.
Even if hardware never fails, backups and snapshots can help preserve data integrity. One bug in an application can corrupt all your data. Backups and snapshots could be the key to restoring the last known good state for data.
Public cloud computing providers bill for storage in addition to regular service charges, which keeps us very aware of all costs associated with our storage habits. Practical storage policies help balance the risk of data loss with the cost of storing everything in the cloud. Too many copies and backups will affect the bottom line; too few copies can affect the enterprise in other ways.
Dan Sullivan, M.Sc., is an author, systems architect and consultant with over 20 years of IT experience with engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education, among others. Dan has written extensively about topics ranging from data warehousing, cloud computing and advanced analytics to security management, collaboration, and text mining.