For all the buzz around cloud computing, applications in public cloud services today represent a small portion of total IT spending. That will remain the case unless mainstream, mission-critical applications that make up the meat of enterprise budgets can move to public cloud.
The biggest hang-up for public cloud adoption appears to be high cloud storage costs. While Web-related cloud applications may store a few hundred megabytes of data, mission-critical apps store terabytes -- and, at prevailing prices, that's too rich for many users' blood. Fortunately, there are two strategies to address cloud storage costs: data abstraction and query-distributed access to data.
Cutting BI, analytics costs with data abstraction
Business intelligence (BI) and analytics are some of the most promising applications for cloud. These apps are clustered in time around major IT decisions and spread in space over the whole spectrum of planners and decision makers. That makes them ideal cloud applications, but enterprises often estimate the cost of one BI app trial run at more than $30,000, which is quite pricey.
Making big data real without becoming excessively large is an exercise in the first of our two data cost management approaches, data abstraction. Data abstraction is a mechanism that produces one or more summary databases from raw company information -- small enough to be stored in the cloud economically.
One of our clients in a healthcare organization reported that creating a set of databases summarizing patient information by diagnosis code, treatment code and age/gender reduced the volume of information by a factor of more than 300. That means data storage and access costs in the cloud would have been roughly three-tenths of one percent of what they'd be for full access to detail.
Making data abstraction an effective cost management approach requires analysis of how and what you analyze. Most BI runs aren't looking for detail; they're looking for trends. For most industries there are clear variables that will be important -- diagnosis and treatment in healthcare, for example. By creating summary databases on these variables, you can cut storage costs and speed up access without impacting the analytical work itself. It's also easy, once a specific combination of variables has been identified as being of interest, to go back and extract detail for that combination from the un-summarized data, if necessary. Abstraction-based analytics then becomes a cloud application, while detailed analysis remains a data center process.
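As a rough illustration of the summary-database idea, the sketch below builds a small summary table from a raw table using Python's built-in sqlite3 module. The table names (patient_visits, visit_summary), the columns and the 10-year age bands are all hypothetical, chosen only to mirror the healthcare example; a real deployment would use whatever variables its analysts actually query.

```python
import sqlite3


def build_summary(conn):
    """Summarize the raw (hypothetical) patient_visits table by key variables."""
    conn.execute("DROP TABLE IF EXISTS visit_summary")
    conn.execute(
        """
        CREATE TABLE visit_summary AS
        SELECT diagnosis_code,
               treatment_code,
               (age / 10) * 10 AS age_band,  -- integer division: 10-year bands
               gender,
               COUNT(*)  AS visits,
               SUM(cost) AS total_cost
        FROM patient_visits
        GROUP BY diagnosis_code, treatment_code, (age / 10) * 10, gender
        """
    )
    conn.commit()


def demo():
    """Build 1,000 synthetic detail rows and return (raw_rows, summary_rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patient_visits (patient_id INTEGER, diagnosis_code TEXT, "
        "treatment_code TEXT, age INTEGER, gender TEXT, cost REAL)"
    )
    rows = [
        (i, "D1" if i % 2 else "D2", "T1", 30 + (i % 5), "F" if i % 3 else "M", 100.0)
        for i in range(1000)
    ]
    conn.executemany("INSERT INTO patient_visits VALUES (?, ?, ?, ?, ?, ?)", rows)
    build_summary(conn)
    raw = conn.execute("SELECT COUNT(*) FROM patient_visits").fetchone()[0]
    summary = conn.execute("SELECT COUNT(*) FROM visit_summary").fetchone()[0]
    return raw, summary
```

Only the summary table would need to live in the cloud; the detail table stays on inexpensive local storage for the occasional drill-down.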
Using query-distributed access for unstructured data
The abstraction approach works well for applications that analyze transactional data structured around a small number of important variables. Where it doesn't work is with big data in its traditional, unstructured form, because unstructured data can't be easily abstracted. Some companies have had some success with creating databases that identify the rate of appearance of specific words or word combinations in emails, but this presumes that the important words/combinations can be known in advance. For most applications, a more general approach is needed. That approach is query-distributed access to data, our second data cost management strategy.
A data processing task usually has three components: the actual processing of data; database management access, to locate the data; and storage access, to retrieve information from mass storage devices. If large amounts of information can't be placed in the cloud for cost reasons, they can't be pulled into the cloud record-by-record either. The best approach is to host data and query logic in one place, outside the cloud, and send database management system (DBMS) queries to extract a subset of the data for processing in the cloud. Keeping the DBMS engine functions on-premises and moving only queries and results in and out of the cloud can reduce data storage and access costs significantly.
It's relatively easy to structure applications for this kind of division of functionality and, in fact, more and more vendors offer DBMS engines or appliances that combine storage/query features. However, it may be necessary to build checks into the applications to prevent faulty query construction from delivering all the data. Pilot testing isn't enough here; the query logic should test the size of results before delivery.
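One way to sketch such a result-size check is shown below, again using Python's sqlite3 module as a stand-in for the on-premises DBMS. The function name guarded_query, the ResultTooLarge exception and the max_rows default are assumptions made for illustration, not features of any particular product. The check counts the rows a query would return and refuses delivery when the count exceeds the limit.

```python
import sqlite3


class ResultTooLarge(Exception):
    """Raised when a query would ship more rows than the configured limit."""


def guarded_query(conn, sql, params=(), max_rows=10_000):
    """Run `sql` only if its result set is small enough to send to the cloud.

    Assumes `sql` is a single trusted SELECT statement built by the
    application itself. Counting first means the query is evaluated twice,
    a simplification that a production system might avoid.
    """
    (row_count,) = conn.execute(
        f"SELECT COUNT(*) FROM ({sql})", params
    ).fetchone()
    if row_count > max_rows:
        raise ResultTooLarge(
            f"query would return {row_count} rows; limit is {max_rows}"
        )
    return conn.execute(sql, params).fetchall()
```

A query whose filter was accidentally dropped would then fail fast on-premises instead of pushing the whole table across the cloud connection.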
Acknowledging the problem of distributed query processing
A special situation with big data is the chance that the information isn't stored in one place. Email, instant messaging and collaborative information are often stored where they're generated, so companies can have dozens or hundreds of sites. This creates the problem of distributed query processing, which is typically associated with MapReduce, the solution architecture, or Hadoop, the open-source implementation most commonly used.
But even structured data can involve distributed queries; a financial company reported that its customer loan experience analysis draws data from more than 30 databases located in major metropolitan areas. For structured DBMS analysis, it's possible to use SQL/DBMS commands to "join" results from multiple sites, even if the queries are sent to each site to be run individually. Here, the issue is to ensure that the queries can be subdivided to run entirely on the data in each location; otherwise, each will require access to the other locations to run, raising costs considerably.
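The subdivision requirement can be illustrated with a minimal sketch in which each site runs the same aggregate query on its own data and only the small partial results are merged centrally. The loans table, its columns and the helper names are hypothetical, and in-memory sqlite3 databases stand in for the remote sites. Counts and sums are used because they can be combined across sites; a rate is computed only after merging, since averaging per-site rates would give a different answer.

```python
import sqlite3
from collections import defaultdict

# The same query runs independently at every site; it touches only local data.
SITE_QUERY = """
    SELECT loan_type, COUNT(*) AS loans, SUM(defaulted) AS defaults
    FROM loans
    GROUP BY loan_type
"""


def make_site(rows):
    """Create an in-memory database standing in for one remote site."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE loans (loan_type TEXT, defaulted INTEGER)")
    conn.executemany("INSERT INTO loans VALUES (?, ?)", rows)
    return conn


def query_site(conn):
    """Run the per-site aggregate and return its (small) partial result."""
    return conn.execute(SITE_QUERY).fetchall()


def merge_partials(partials):
    """Combine per-site counts and sums, then derive the overall rate."""
    totals = defaultdict(lambda: [0, 0])
    for rows in partials:
        for loan_type, loans, defaults in rows:
            totals[loan_type][0] += loans
            totals[loan_type][1] += defaults
    return {
        loan_type: {
            "loans": loans,
            "defaults": defaults,
            "default_rate": defaults / loans,
        }
        for loan_type, (loans, defaults) in totals.items()
    }
```

A query that needed to compare individual rows across sites could not be split this way, which is the case the article warns will raise costs.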
Despite the considerable focus on how to create hybrid clouds, it may well be that creating "hybrid data" will be more important for the future of the cloud in mission-critical applications. Absent a way of optimizing the use of inexpensive local storage and highly flexible cloud processing, users will likely find their large databases an anchor holding them inside traditional IT architectures. That will not only cost cloud providers mission-critical application revenue, it will also cost enterprises the benefits of the cloud.
About the author
Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982.