Social media is a gold mine of potential business intelligence about customer opinions. Microsoft's Social Analytics...
team will use QuickView, an advanced search service, and sentiment analysis technology to mine a mother lode of filtered tweet streams from Windows Azure for real-time, crowd-sourced opinions.
Social media data is a byproduct of Software as a Service (SaaS) sites like Twitter, Facebook, LinkedIn and StackExchange. And filtered versions of this data are worth millions to retailers, manufacturers, pollsters and politicians, as well as marketing and financial firms and news analysts. QuickView, the advanced search service that uses natural language processing (NLP) technologies, can be used to extract meaningful information from tweets, including sentiment analysis, and then index them in real time.
Industry pundits claim the revenue social media sites garner from syndicating their trove of big data exceeds advertising costs. For example, Twitter reached 250 million tweets per day in mid-October. The company syndicates those tweets to organizations like DataSift, which passes along license fees of $0.01 per 1,000 custom-filtered tweets. DataSift partners, such as Datameer, offer Apache Hadoop-powered big data business intelligence (BI) applications for data visualization and analysis.
Microsoft can see the value of this. This year, the company gained access to the entire Twitter firehose feed. In July, the company reportedly paid Twitter $30 million per year to extend the agreement for the Bing search engine to index tweets in real time. And Microsoft also joined the big data syndication sweepstakes in 2010 when it introduced Windows Azure Marketplace DataMarket, previously called Project “Dallas.”
Social Analytics experimental cloud
In late October, the SQL Azure Labs team announced the private trial of a new Microsoft Codename “Social Analytics” experimental cloud service, which targets organizations that want to integrate social Web information into business applications. The lab offers an Engagement client user interface (UI) for browsing datasets that implement Microsoft’s QuikView technology.
Figure 1 shows the Engagement client displaying a live feed of tweets filtered for Windows 8 content in the left pane and tweets ranked by 24-hour buzz in the right pane.
The rate of tweet addition to the left pane depends on the popularity of the topic for which you filter; the Windows 8 dataset received about one new tweet per second over a recent 24-day period. The current “Social Analytics” OData version also supports LINQPad and PowerPivot for Excel clients, but not the DataMarket Add-in for Excel (CPT2) or Tableau Public.
APIs support BI data analysis and collection
To accommodate enterprises that design BI data analysis and collection applications, the Social Analytics team also provides a RESTful application programming interface (API) that supports LINQ queries with a .NET IQueryable interface and returns up to 500 rows per request. For Windows 8, the average data download rate is about 10,000 new tweets per minute with a 3 Mbps DSL connection. Microsoft is currently capturing about 150,000 MessageThreads daily across its environments, including the labs datasets.
The company evaluates and discards about 200 to 300 times more content items that don’t meet its criteria for data capture, said Kim Saunders, program manager for Social Analytics at Microsoft. “In the initial lab, our target is to maintain at least 30 days of history. We plan to grow this target over time,” Saunders added.
Figure 2 shows a screenshot of Microsoft Codename “Social Analytics” Windows Form Client .NET 4.0 C# application after retrieving 120,000 ContextItems. You can download the project with full source code from my SkyDrive account. Each ContextItem represents an individual tweet from the VancouverWindows8 dataset. (“Vancouver” was the original codename for the service.)
The DataGrid columns display an assigned graphical user ID, the first 50 or so characters of the tweet as the Title, a Calculated Tone ID value (3 is neutral, 4 is negative and 5 is positive), and a Tone Reliability index, which represents the confidence in the correctness of the Tone value, or sentiment. The Tone is based only on English content as it does not support other languages yet; the client application included in the labs equates any Tone with reliability less than 0.8 as neutral, Saunders said.
Hadoop implementation for Azure still in the works
Regarding the forthcoming Hadoop implementation for Windows Azure or Windows Server, Saunders said Microsoft’s current storage strategy is based on SQL Azure and Windows Azure Blob storage. The company is evaluating Hadoop storage longer term, but there are no specific plans to expose MapReduce capabilities over the datasets. Microsoft is working on a collection of aggregates that will enable analytics and trending for top keywords, handles, sources and threads.
“Our first step toward enabling users to create their own datasets is a shared proof-of-concept environment that will enable a small number of selected partners to create their own filters in a shared environment,” Saunders said. In addition to Twitter, the service’s SocialProfilesTypes collection includes blogs, Facebook, LinkedIn, Microsoft Developer Network (MSDN), StackExchange and YouTube.
Microsoft may be late to the social analytics market, but it has the advantages of Microsoft Research’s deep NLP resources, Windows Azure for big data delivery from the cloud, strong relationships with Twitter and Facebook and an up-and-running DataMarket presence -- not to mention very deep pockets.
About the author:
Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, the principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.