Big data can be invaluable for enterprise business intelligence. However, mining loads of unstructured text data for that value means you'll need at least a basic search service, or sometimes more advanced text analysis methods.
Cloud administrators and developers working with Amazon Web Services (AWS) have the option of implementing their own search server -- using popular open source tools such as Lucene and Solr -- or using Amazon CloudSearch. Here are a few issues to consider when deciding between Search as a Service and do-it-yourself (DIY) search.
Search as a Service with Amazon CloudSearch
Amazon CloudSearch is a cloud-based search service that companies can integrate into applications to index documents and respond to search queries. Like other AWS services, Amazon manages the server implementation -- not you. Amazon CloudSearch provides free text search as well as more advanced features, such as faceted searching and customizable relevance rankings.
- Faceted searching. Faceted search allows application users to narrow the set of documents searched by using a classification scheme for documents. For example, a document repository may classify documents according to several facets or fields, such as creation date, document type or key topic.
- Customizable relevance rankings. By default, all fields in a search index are considered equally relevant, which is not always the best weighting scheme. Relative field weighting allows developers to weight some fields (e.g., keywords) more heavily than others when computing the relevance scores of documents and, ultimately, the ranking of a document in a result set.
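The two features above can be illustrated with a minimal Python sketch. It does not call the Amazon CloudSearch API. The documents, field names, weights and the term-count scoring rule are all assumptions made up for illustration, and real services use more sophisticated relevance formulas.

```python
from collections import Counter

# Hypothetical documents; every field name and value here is illustrative.
DOCS = [
    {"id": 1, "doc_type": "report", "title": "search pricing",
     "keywords": "cloud search", "body": "pricing overview for managed search"},
    {"id": 2, "doc_type": "guide", "title": "backup guide",
     "keywords": "storage backup", "body": "cloud backup for cloud servers"},
    {"id": 3, "doc_type": "report", "title": "sales summary",
     "keywords": "sales revenue", "body": "revenue grew last quarter"},
]

# Two hypothetical weighting schemes: equal weights, and keywords weighted 3x.
EQUAL = {"title": 1, "keywords": 1, "body": 1}
KEYWORD_HEAVY = {"title": 1, "keywords": 3, "body": 1}


def facet_counts(docs, facet):
    """Count how many documents fall under each value of one facet."""
    return Counter(doc[facet] for doc in docs)


def filter_by_facet(docs, facet, value):
    """Narrow the document set to those matching one facet value."""
    return [doc for doc in docs if doc[facet] == value]


def score(doc, term, weights):
    """Weighted term count: occurrences in each field times that field's weight."""
    return sum(
        weight * doc[field].lower().split().count(term)
        for field, weight in weights.items()
    )


def search(docs, term, weights):
    """Return ids of matching documents, highest score first (ties by id)."""
    scored = [(score(doc, term, weights), doc["id"]) for doc in docs]
    hits = [(s, doc_id) for s, doc_id in scored if s > 0]
    return [doc_id for s, doc_id in sorted(hits, key=lambda p: (-p[0], p[1]))]
```

With equal weights, document 2 ranks first for "cloud" because the term appears twice in its body. Once the keywords field is weighted three times as heavily, document 1 moves to the top because "cloud" is one of its keywords.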
In addition to providing core search services for application developers and administrators, Amazon CloudSearch will scale according to demand. It also maintains search indexes in memory to reduce latency.
Do-it-yourself search with Solr, Lucene
Amazon services are often cost-competitive with running your own services. However, if you are willing to accept potentially higher costs from application administration overhead in exchange for greater control and more features, you may want to look at a third-party tool. For example, the open source search platform Apache Solr is free and supports advanced text searching, linear scalability, near-real-time indexing and an extensible plug-in architecture. Solr also supports more advanced text analysis operations, such as word splitting, regular expressions and sounds-like filters. The platform also includes support for internationalization -- an important feature for applications with a global user base.
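To make the text analysis operations mentioned above more concrete, here is a rough Python sketch of an analysis chain that applies a regular-expression cleanup, splits words and attaches a sounds-like code. This is not Solr's implementation or configuration. The function names are hypothetical, the possessive-stripping pattern is an arbitrary example, and American Soundex is used as just one possible sounds-like encoding.

```python
import re

# Letter-to-digit table for American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code, or "" if the word has no letters."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (encoded + "000")[:4]


def split_words(text):
    """Split on punctuation, case changes and letter/digit boundaries."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", text)


def analyze(text):
    """Regex cleanup, then word splitting, then (lowercased token, code) pairs."""
    text = re.sub(r"'s\b", "", text)  # example pattern: drop possessive 's
    return [(token.lower(), soundex(token)) for token in split_words(text)]
```

In this sketch, "Smith" and "Smyth" both encode to S530, so a query for one spelling could match the other, and `split_words("Wi-Fi PowerShot SD500")` yields the tokens Wi, Fi, Power, Shot, SD and 500.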
Another advantage of using Solr is access to specialized applications that can reduce the demands on your own developers. LucidWorks, for example, offers add-ons to perform named entity recognition; integrates with Drools, the open source business rules engine; and tunes search parameters to improve the quality of search results and rankings.
Lucene, a Java-based search and indexing library, is another option, but it offers fewer features than Solr. In fact, Solr is built on Lucene and adds search and management features on top of it.
Comparing costs of CloudSearch and DIY
Charges for Amazon CloudSearch are based on the size of search instances, document batch uploads, number of document index operations and the volume of data transfer. The cost of search instances ranges from $0.10 per hour for a small instance to $1.10 per hour for a double extra-large instance.
If the search service will be needed for extended periods of time, you may want to consider comparing Amazon CloudSearch costs to reserved instance pricing rather than on-demand pricing. Reserved instances are available for one- and three-year commitments.
Table 1 shows the costs of several scenarios with varying repository size, query load and re-indexes.
Estimating the cost of running your own search service, such as a Lucene or Solr server, is more difficult because administration costs can vary, but we can estimate the cost of running instances comparable to those used by Amazon CloudSearch. Using on-demand pricing and assuming instances run 24 hours per day for 30 days per month, the cost of a general-purpose small instance (m1.small) is $43.20, the cost of a large instance is $172.80 and the cost of an extra-large instance is $345.60. The differences in the cost of DIY instances and CloudSearch services are not significant compared to the cost of service administration. In a use case requiring a large instance, the DIY savings might cover the cost of less than two hours of administrator time.
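The monthly figures above follow from simple arithmetic, which the short Python sketch below reproduces. The CloudSearch hourly rates are the ones quoted in this article. The DIY hourly rates are not stated in the article and are back-calculated from its monthly figures (for example, $43.20 / 720 hours = $0.06), so treat them as assumptions. The comparison covers instance hours only and leaves out batch uploads, index operations, data transfer and administrator time.

```python
from decimal import Decimal

HOURS_PER_MONTH = 24 * 30  # 24 hours per day for 30 days = 720 hours

# CloudSearch search instance rates quoted in the article (USD per hour).
CLOUDSEARCH_HOURLY = {"small": "0.10", "double extra-large": "1.10"}

# Assumed DIY instance rates, derived from the article's monthly figures.
DIY_HOURLY = {"small": "0.06", "large": "0.24", "extra-large": "0.48"}


def monthly_cost(hourly_rate):
    """Cost of running one instance for a full 720-hour month."""
    return (Decimal(hourly_rate) * HOURS_PER_MONTH).quantize(Decimal("0.01"))


def monthly_difference(size):
    """CloudSearch minus DIY instance cost for a size present in both tables."""
    return monthly_cost(CLOUDSEARCH_HOURLY[size]) - monthly_cost(DIY_HOURLY[size])


if __name__ == "__main__":
    for size, rate in DIY_HOURLY.items():
        print(f"DIY {size}: ${monthly_cost(rate)} per month")
    print(f"Small instance difference: ${monthly_difference('small')} per month")
```

Under these assumptions, a small CloudSearch instance costs $72.00 per month against $43.20 for a small DIY instance, a gap of $28.80 in instance charges before any administration costs are counted.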
Amazon CloudSearch can enable developers to quickly implement search capabilities for their cloud-based applications. The service includes support for basic search operations, as well as some more advanced features, at costs competitive with a DIY approach. For users with more advanced requirements, the benefits of advanced search and text analysis may be well worth the additional overhead of managing their own service.