Big data apps differ from traditional applications because they have a slightly different development cycle, involving data scientists, developers and production engineers. Data engineers may spend weeks fine-tuning the algorithms and normalizing the data sources to drive new applications. Then, it is up to programmers to turn these into production-ready systems. It's also important to consider the economic effect of different approaches.
At Spark Summit, distinguished Amazon engineer Marvin Theimer weighed in on some of the best practices they have identified to get big data apps built on Apache Spark production-ready. "Most of this conference is about getting to success, which means going from prototype to production," he said. "Once the enterprise has designed its big data algorithms, it needs to put them into production. In the car world, it is one thing to design a new car. But then, the hard work comes of stamping out a million of them."
Understand new requirements
Enterprise architects need to consider a variety of what Theimer called "ilities" to get the apps ready for production, including scalability, high availability, maintainability, evolvability, auditability and reproducibility. Many of these requirements have always been important to enterprise architects. Some, like maintainability, must now account for the faster pace of change in new data analytics tools. Others, like auditability and reproducibility, grow in importance, as these apps drive business decisions and interact with the system of record in new ways.
With maintainability, enterprises need to plan for the rapid pace of upgrades to core data infrastructure underlying the Spark platform. It's like changing the tires on a car as it is driving, so organizations have to be prepared to quickly deploy new security patches.
It is also important to remember that these applications are developed using test data. The production application, by contrast, needs to manage personally identifiable information, like credit card data, so it does not get hacked and users don't accidentally see each other's data.
The enterprise needs to be able to meter usage of customers and send out an appropriate bill. When customers dispute a bill, there needs to be an ability to go back with an audit log of metering records and reproducibly show how the bill was calculated in order to come to a resolution. All of these capabilities have to be built into the cloud infrastructure behind the big data apps.
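As a rough illustration of reproducible billing, the sketch below recomputes a bill purely from an append-only list of metering records, so the same log always yields the same total and the same line items. The record fields, customer names and per-unit rates are all hypothetical, chosen only for the example; they do not come from Theimer's talk or from any real price list.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class MeteringRecord:
    """One immutable audit-log entry (hypothetical schema)."""
    customer_id: str
    resource: str
    units: Decimal


# Hypothetical per-unit rates, for illustration only.
RATES = {"compute_hours": Decimal("0.10"), "storage_gb": Decimal("0.02")}


def compute_bill(records, customer_id, rates=RATES):
    """Recompute a customer's bill from the audit log alone.

    Returns the total plus the line items that produced it, so a
    disputed bill can be traced back to individual metering records.
    """
    line_items = []
    total = Decimal("0")
    for rec in records:
        if rec.customer_id != customer_id:
            continue
        cost = rec.units * rates[rec.resource]
        line_items.append((rec, cost))
        total += cost
    return total, line_items


audit_log = [
    MeteringRecord("acme", "compute_hours", Decimal("120")),
    MeteringRecord("acme", "storage_gb", Decimal("500")),
    MeteringRecord("globex", "compute_hours", Decimal("40")),
]

total, items = compute_bill(audit_log, "acme")
for rec, cost in items:
    print(rec.resource, rec.units, cost)
print("total:", total)
```

Using Decimal rather than floating point keeps the arithmetic exact, which matters when the same calculation has to be repeated later during a dispute.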
Aim for mundane efficiency
Developers need to think about going from algorithmic efficiency to mundane efficiency. It's fine to use the most performant services in development, but in production, choosing a slower storage service like Amazon Glacier over Simple Storage Service (S3) can have a big effect on the bill. It's also important to have a procedure in place for getting rid of data that never gets used. In the beginning, it may make sense to store everything to enable new use cases. But enterprises should consider a regular analysis to identify and toss out data that is neither used nor required to be stored. "You have to remember which to throw away," Theimer said.
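A regular analysis of this kind could start from something as simple as the sketch below, which sorts data sets into those worth keeping on fast storage, those that must be retained but are rarely read and so could move to a cheaper tier, and those that are candidates for deletion. The field names, the 90-day idle threshold and the sample data sets are assumptions made for the example.

```python
from datetime import datetime, timedelta


def plan_storage(datasets, now, idle_days=90):
    """Classify data sets by recent use and retention requirement.

    idle_days is a hypothetical threshold; last_accessed is None
    for data that has never been read.
    """
    cutoff = now - timedelta(days=idle_days)
    plan = {"keep_hot": [], "archive": [], "review_for_deletion": []}
    for ds in datasets:
        last = ds["last_accessed"]
        if last is not None and last >= cutoff:
            plan["keep_hot"].append(ds["name"])
        elif ds["must_retain"]:
            # Required but idle: a slower, cheaper tier is a candidate.
            plan["archive"].append(ds["name"])
        else:
            # Neither used recently nor required: flag for a human to review.
            plan["review_for_deletion"].append(ds["name"])
    return plan


now = datetime(2024, 6, 1)
datasets = [
    {"name": "clickstream_daily", "last_accessed": datetime(2024, 5, 28), "must_retain": False},
    {"name": "invoices_2019", "last_accessed": datetime(2021, 1, 15), "must_retain": True},
    {"name": "debug_dump_tmp", "last_accessed": None, "must_retain": False},
]

plan = plan_storage(datasets, now)
print(plan)
```

The sketch only produces a plan; actually moving or deleting data would be a separate, audited step.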
This also applies to managing millions of jobs. Some of these will fail in ways the enterprise will not be able to predict. The enterprise needs to have a process in place for identifying and killing these zombie services.
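One common way to spot zombies, sketched here under the assumption that every job periodically reports a heartbeat, is to flag jobs that still claim to be running but have been silent for longer than a timeout. The job registry layout, state names and 15-minute timeout are hypothetical.

```python
from datetime import datetime, timedelta


def find_zombie_jobs(jobs, now, timeout=timedelta(minutes=15)):
    """Return IDs of jobs marked RUNNING whose heartbeat is stale."""
    zombies = []
    for job_id, job in jobs.items():
        stalled = now - job["last_heartbeat"] > timeout
        if job["state"] == "RUNNING" and stalled:
            zombies.append(job_id)
    return sorted(zombies)


now = datetime(2024, 6, 1, 12, 0)
jobs = {
    "etl-001": {"state": "RUNNING", "last_heartbeat": datetime(2024, 6, 1, 11, 58)},
    "etl-002": {"state": "RUNNING", "last_heartbeat": datetime(2024, 6, 1, 9, 30)},
    "etl-003": {"state": "FINISHED", "last_heartbeat": datetime(2024, 6, 1, 8, 0)},
}

zombies = find_zombie_jobs(jobs, now)
print(zombies)
```

The list of flagged IDs would then feed whatever kill or cleanup procedure the enterprise has in place.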
Additionally, the company should consider supporting a different kind of user. Early adopters may want fine-grained control, while the majority in the enterprise will want ease of use. As big data apps go into production, they will be adopted by people with greater skill in other areas, like business, biology or astronomy, rather than data science.
Plan for reckless use
It is also important to keep track of how the applications are being used. As enterprises begin to open this big data infrastructure to others in the organization, many of these apps will be used in highly inefficient and costly ways. "Developers will craft Ferraris, but then users will take them into the mud. This could be disastrous when the application is running on a multi-tenant system," Theimer explained. "Consequently, enterprises need to think about how to get users other applications, like a Jeep, or make it easier to take the Ferraris off-road and still get good results. You need to design systems that will tolerate usages you never expected and still give a good user experience."
Another good practice is to identify and plan for the effect of large-scale events, such as an entire data center going offline. At this point, everyone in that data center will be trying to move their applications to different data centers, overwhelming the network. It's akin to everyone clogging up the freeways to escape a tornado. It is important to architect conservative recovery mechanisms so recovery processes don't overwhelm the network.
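One way to picture a conservative recovery mechanism is to move applications in waves, each capped by a network budget, instead of letting every transfer start at once. The sketch below is only an illustration of that idea: the application names, sizes and the 100 GB per-wave budget are invented, and a real system would also weigh priorities and live network conditions.

```python
def plan_recovery_waves(apps, max_gb_per_wave):
    """Group (name, size_gb) transfers into sequential waves.

    Apps are taken in the given order. A new wave starts whenever
    adding the next app would exceed the budget; an app larger than
    the budget gets a wave to itself.
    """
    waves = []
    current, used = [], 0
    for name, size_gb in apps:
        if current and used + size_gb > max_gb_per_wave:
            waves.append(current)
            current, used = [], 0
        current.append(name)
        used += size_gb
    if current:
        waves.append(current)
    return waves


apps = [
    ("billing", 40),
    ("search", 70),
    ("reports", 30),
    ("ml-training", 120),
    ("cache", 10),
]

waves = plan_recovery_waves(apps, max_gb_per_wave=100)
for i, wave in enumerate(waves, start=1):
    print(f"wave {i}: {wave}")
```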
Create robust stopping points
Implement stopping points to make it easier to roll back parts of a process when something goes wrong. When data is corrupted, it is more efficient to reprocess only the part of the pipeline from the point of corruption onward than to rerun the whole process.
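A minimal sketch of the idea, assuming each stage's output is saved as a checkpoint: when a later stage turns out to have produced bad data, the pipeline restarts from the last good checkpoint instead of from the raw source. The stage functions and the in-memory checkpoint dictionary are stand-ins for the example; a production system would persist checkpoints to durable storage.

```python
def run_pipeline(stages, source, checkpoints, start_at=0):
    """Run stages in order, saving each stage's output as a stopping point.

    start_at is the index of the first stage to (re)run. Earlier
    stages are skipped and their saved output is reused.
    """
    data = source if start_at == 0 else checkpoints[start_at - 1]
    for i in range(start_at, len(stages)):
        data = stages[i](data)
        checkpoints[i] = data
    return data


calls = []  # records which stages actually execute


def parse(rows):
    calls.append("parse")
    return [int(r) for r in rows]


def clean(values):
    calls.append("clean")
    return [v for v in values if v >= 0]


def aggregate(values):
    calls.append("aggregate")
    return sum(values)


stages = [parse, clean, aggregate]
checkpoints = {}
first_result = run_pipeline(stages, ["3", "-1", "4"], checkpoints)

# Suppose the output of "clean" is later found to be corrupted:
# rerun from stage 1 using the saved output of "parse".
calls.clear()
rerun_result = run_pipeline(stages, None, checkpoints, start_at=1)
print(first_result, rerun_result, calls)
```

In the rerun, only the clean and aggregate stages execute; the parse stage's saved output is reused.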
One approach is to create a bread crumb infrastructure showing the path of data as it is transformed through the data pipeline. If the system spits out the wrong number, this makes it easier to go back and find the place where a problem occurred.
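A bread crumb trail can be as light as one record per transformation, noting the step name and a fingerprint of its input and output. The sketch below assumes JSON-serializable data and uses a truncated SHA-256 hash as the fingerprint; the helper names and the sample steps are hypothetical.

```python
import hashlib
import json


def fingerprint(value):
    """Short, stable hash of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def traced(step_name, fn, value, trail):
    """Apply fn to value and leave a bread crumb describing the step."""
    result = fn(value)
    trail.append({
        "step": step_name,
        "input": fingerprint(value),
        "output": fingerprint(result),
    })
    return result


trail = []
amounts = [1.5, 2.25]
cents = traced("to_cents", lambda xs: [round(x * 100) for x in xs], amounts, trail)
total = traced("total", sum, cents, trail)

for crumb in trail:
    print(crumb)
```

Because each crumb's output fingerprint should match the next crumb's input fingerprint, a break in that chain points to the place where the data changed unexpectedly.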
Organizations should consider adapting the kind of accounting mechanisms used in the financial industry to track representations of money as it flows through different systems. Data will inevitably go wrong. This kind of approach makes it easier to trace a problem back to its input sources, so solutions are easier to identify.
Consider the economics of the implementation
Mundane efficiency involves being proactive in working with usage reports, billing alerts, cost explorer and lifecycle management. As these applications start to go into products, business imperatives are likely to drive many of these cloud applications toward multi-tenant services to lower costs. Enterprise architects also need to think through tracking the machine learning resource usage in order to manage costs.
"To be production-ready, you need to build this stuff from the beginning. After you have gotten to the 'eureka' point of getting data in, then you have to automate everything," Theimer noted. "It is also important to make this stuff secure by default. It is also important to think about pushing through in the direction of serverless event-driven computing. The good news is that the cloud can help a lot with all of this."