Project Botticelli

Apache Hadoop, Big Data, Microsoft

2 April 2012

What Can We Use Big Data For?

Apache Hadoop technologies integrate with the entire Microsoft Application Platform on many levels. The purpose of this article is to outline some of those integration points, and to survey the solutions and applications that this combination enables. If you are new to Big Data and Apache Hadoop, you may also want to watch a portion of the Introduction to SQL Server Business Intelligence video, in which the fundamental concepts of Hadoop, including MapReduce, HDFS, Hive, and Pig Latin, and the basics of distributed data and distributed processing, are explained in more detail.

Why Big Data?

Today’s Big Data will be tomorrow’s little data. What makes Big Data big, today, is its complexity, or, as I prefer to think of it, our present inability to process it using traditional approaches. Some prefer to think of Big Data as data that is physically large (volume), or generated very fast (velocity), or sufficiently varied to be considered complex (variety): Gartner calls these the three Vs of Big Data. This is an oversimplification, because what makes Big Data big is not just the complexity of the data but, even more importantly, the complexity of the problem (the query) that you are trying to resolve against the given data set. It is interesting that Google calls its Big Data offering BigQuery: of course, it is the combination of the two that matters, not either alone.

Let’s say you want to perform an in-depth associative analysis of events that occurred across the entire library of logs generated by your enterprise systems and your devices, each night. With a sufficient degree of detail required, even small logs would make this quest one of Big Data, today. On the other hand, the same data set, paired with a simpler query, such as an event count, would not be a Big Data problem.

Assuming that you happen to have a sufficiently complex combination of a problem to solve and its data, Big Data technologies, such as Apache Hadoop, present an opportunity that has not been available to the majority of organisations, and their users, in the past. To put it simply, with regards to Hadoop: in the past, unless you were Bing, Facebook, Google, or Yahoo!, you could not dream of running an exhaustive search for a useful business answer, such as a product popularity survey across the web’s vast expanse of blogs, pages, and tweets. With Hadoop, on the other hand, a mid-size, and perhaps even a small, organisation can have at its disposal the ability to do just that, on demand, and quite inexpensively, too.

Transformative Technology

As Big Data is a new technology, it has the potential to transform your business by offering you a competitive advantage that others do not have, yet. This is, of course, a formidable opportunity. At the same time, being so bleeding-edge today, it is not yet well understood by businesses; it presents risks, primarily that it might not provide anything useful at all while consuming resources; and it is largely untested in common business scenarios, so no well-known patterns and practices yet exist to control Big Data projects. This characterises new, transformative technologies. Equally, it is not possible to list all the classical uses of such an early technology, as there has been no time yet for anything to become classical. I would argue that if there were classical applications of Big Data, it would no longer be a transformative technology, but rather a growth-oriented one, focused on those apparently tried and tested applications.

There is a window of opportunity, a chance for you to consider Big Data’s wildest uses, at the moment. Keep an open mindset, and expect the work to be more like research, and entirely unlike the deployment of a well-understood data warehouse project. With that in mind, however, there are already a number of scenarios in which Big Data and Hadoop seem to work. Let me point some of those out.

Potential Uses of Big Data

While Big Data Analytics can cope equally well with numerical and textual applications, it seems that the latter are of more interest to early users at this stage, perhaps because of the novelty of actually being able to do something useful with unstructured text files. Blogs, postings, pages, documents available on the web, messages, tweets, and so on all take many forms, and usually have no semantic definition that would present them as structured data.

For example, in one large financial company for which I worked a while ago, all the PowerPoint documents, strewn across servers, shares, and workstations, happened to mention events and people connected with the company’s customers. The financial house wanted to be able to better research future projects, business deals, and upcoming transactions by accessing this wealth of unstructured data. They wanted to find documents with similarities, and, above all, they wanted to ensure that a project manager could know everything there was to be known about a particular customer, in terms of its past dealings, without having to rely on a CRM system. Of course, on the scale of even a large financial company, this is not a Big Data problem today, anymore. It is just a matter of having a good crawl and an in-house search engine, like SharePoint FAST or Apache Solr (related to the Apache Lucene project), perhaps aided by judicious tagging. Even better, this could be a case for the new Semantic Search and FileTable technologies of SQL Server 2012, which allow you to store such live, unstructured data (or even an entire file system of such data) right inside SQL Server, and query it at will.

However, if we expanded this business scenario to cover not just documents located on the internal servers of the company but, in fact, the entire web out there, we would hit on a golden opportunity for Big Data Analytics. You might think that, to tackle that, one has to be one of the big search engines. Indeed, that used to be the case. But what if you had at your disposal the ability to prepare that data, and to provide meaningful answers, on a nightly, or at least a weekly, basis, all by yourselves?

Or, what if you would like to know how many times your product name was mentioned on the web last week? Or how many times your competitors’ products were mentioned next to yours? What if you would like to measure the sentiment of those mentions, trying to assess the mood as positive or otherwise? All of these types of analysis, extraordinarily useful to every business, are possible using Big Data Analytics, and for a relatively low fee, of perhaps hundreds or low thousands of dollars, in terms of computing cost. Of course, a good statistical survey, run by an experienced metrics organisation, Nielsen perhaps, can yield a good answer today, within the bounds of statistical correctness. But what if you could do all of that yourself, without the need for surveys, and do it more interactively, expanding the nature of a query to add questions that are specific to each case being analysed? You might want to count the lexical distance between the mentions of your product name and that of a competitor, and, if a certain threshold were reached, add additional metrics, not computed in other cases.
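
To make that last idea concrete, here is a minimal sketch, in plain Java, of the kind of per-document metric I have in mind: the smallest number of words separating a mention of your product from a mention of a competitor’s. The product names and the threshold of 10 are invented for illustration; in a Hadoop job, a mapper would apply such a function to each crawled document and emit the result for aggregation.

    // A minimal sketch (plain Java, no Hadoop required) of the per-document
    // metric described above: the smallest number of words separating a
    // mention of one term from a mention of another.
    public class LexicalDistance {

        // Returns the minimum token distance between any mention of termA and
        // any mention of termB in the tokenised document, or -1 if either
        // term never occurs.
        static int minDistance(String[] tokens, String termA, String termB) {
            int lastA = -1, lastB = -1, best = -1;
            for (int i = 0; i < tokens.length; i++) {
                if (tokens[i].equalsIgnoreCase(termA)) lastA = i;
                if (tokens[i].equalsIgnoreCase(termB)) lastB = i;
                if (lastA >= 0 && lastB >= 0) {
                    int d = Math.abs(lastA - lastB);
                    if (best < 0 || d < best) best = d;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            String doc = "I replaced my Contoso phone with a Fabrikam one last week";
            int d = minDistance(doc.split("\\s+"), "Contoso", "Fabrikam");
            // If the mentions are close enough, a real job might go on to
            // compute further metrics, such as the sentiment of nearby words.
            System.out.println("Distance: " + d + (d >= 0 && d <= 10 ? " (co-mention)" : ""));
        }
    }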

What if you could absorb the content of the entire World Wide Web, every night, to find answers that provide your company with actionable insight? That would be transformative to your business, wouldn’t it?

Or perhaps you have logs of activity on your networked devices, firewalls, and web servers. What if you could actually run meaningful queries, using such simple techniques as pivot charts and filters, on a regular basis, over this rather large and non-uniform data source?

Perhaps you manufacture goods, and you have thousands of sensors across your lines which, at present, just send alarms when something exceeds a threshold. What if you could get all of their data joined together, to understand the patterns of behaviour on your production line and cross-relate them to later warranty claims or product quality issues? How about interacting with sensors in the cars you make once they are on the road, used by their owners, with their permission and observing stringent privacy goals, to provide an early understanding of safety issues before they arise, and before anyone suffers in an accident?

Perhaps, as a large health organisation, using your sensors and data, you could predict, from subtle variations from the norm, when and where an epidemic might strike?

As a generalisation, any pattern search, aimed at finding meaningful correlations, not too dissimilar in nature to some of the goals of traditional data mining, is at the heart of Big Data Analytics. I expect that all the typical uses of data mining might find their way into Big Data one day—bearing in mind that the current techniques work very well on even very large data sets.

Further, there are opportunities to re-analyse old data that we have not managed to understand in the past. Ultimately, in a matter of a few months, or short years, we will know of plenty of interesting Big Data applications, at which point this technology will start to mature, and focus on bringing growth to the organisations that can employ it. In the meantime, it can help you compete, before anyone else catches on.

Why Apache Hadoop?

Apache Hadoop is beautiful because it is based on the astonishingly simple idea of MapReduce. Just as Julius Caesar practised divide et impera, divide and rule, as a way of managing a large chunk of our planet two thousand years ago, MapReduce takes the divide-and-conquer approach: it splits a complex problem and its data into smaller chunks, processes each independently, and reduces the numerous partial results into one grand answer. All the while, Hadoop takes care to avoid the major limitation of today’s distributed processing, network speed, recognising that both CPU and memory are fairly plentiful and inexpensive. Add to that resilience, through data replication in HDFS (the Hadoop Distributed File System), and you have the bare bones of a modern platform for writing highly distributed software, for both Big Data preparation and its analytics.
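
To see how simple the idea really is, consider the canonical word count example, sketched below using the Hadoop Java API: the mapper independently turns each chunk of the input into (word, 1) pairs, and the reducer sums them into the grand answer. This is an illustrative sketch, not code from any particular distribution.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: each node, working on its own chunk of the data,
        // emits a (word, 1) pair for every word it sees.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token.toLowerCase());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: all the counts for a given word arrive together,
        // and are summed into the final answer.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Notice how little of this code concerns distribution: Hadoop takes care of splitting the input, moving the computation to the data, and routing each word’s counts to the right reducer.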

While the concepts of parallel computing have been with us for a few decades, no major, sufficiently generic, yet practical implementation had ever descended into the world of business reality from the higher echelons of academic research. What is particularly striking about Hadoop, and perhaps a wonderful, circular joke of chance, is its origin: a seminal 2004 Google whitepaper on MapReduce (updated in 2008) inspired Doug Cutting, then working on the open source search engine Nutch, to reimplement the idea, and that work gave birth to Hadoop, now used by virtually every large search engine and social network. What this means, in my opinion, is that just as the proof of a cake is in the eating of it, the proof of Hadoop is its continued use by those companies, allegedly in preference to their own, older, in-house approaches.

Hadoop is amazingly powerful yet insanely practical. You can throw a bunch of cheap, old computers together to form a processing cluster, or you can build a fabulous, and very expensive, data centre, or, and this is of utmost importance to small and medium companies, you can hire it by the hour from a cloud, for dollars today, and for pennies in the near future.

Why Apache Hadoop and Microsoft?

Microsoft has gone Hadoop-happy on many levels. First of all, as there is an opportunity to let you analyse, using Hadoop, data that you may already hold in SQL Server, the already available Hadoop connectors for SQL Server 2008 R2 and 2012 let you move data in and out of HDFS. But this is just the beginning.

With the ODBC driver and an Excel add-in for Hadoop Hive, the data warehouse-like layer over HDFS, you can query and study the results of your Hadoop analytics in the familiar environment of the most popular BI tool on the planet. Further, thanks to ODBC, it becomes easier for your own applications to access the results of such analytics.

Of most interest, however, is an architecture comprising a Hadoop processing cluster, functioning as a sort of big vacuum cleaner for data, which can absorb it, simplify it, and present it as higher-level results, combined with a more traditional analytical SQL Server database, cube, or tabular model, functioning as a timeline of those results, perhaps with support for hierarchical analysis. Imagine not just knowing the answer to the question of how many times your product was mentioned last week, but actually having a week-by-week table, or even a cube, that includes various levels of that analysis across your usual dimensions, such as product categories or regions. Further, by combining Hadoop and SQL Server this way, you benefit from not needing to re-architect your existing analytical applications for Hadoop: they can continue to use SQL as they did in the past. Add to that the simplicity of offering SQL models to Power View or PowerPivot, and you can have Big Data on a small screen, which your users will understand and will want to use.
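
As an illustration of that architecture, here is a minimal sketch of its nightly glue: a Java job that asks Hive to aggregate the latest results and appends them to a timeline table in SQL Server, where your cubes, tabular models, and Power View reports can pick them up. The server names, credentials, and both tables (mentions in Hive, dbo.WeeklyMentions in SQL Server) are invented for illustration; the Hive JDBC driver plays the role here that the ODBC driver plays on the Windows side.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // A sketch of the "vacuum cleaner" pattern: Hadoop, via Hive, reduces
    // raw web mentions to weekly aggregates, which we append to a timeline
    // table in a traditional SQL Server data warehouse.
    public class HiveToSqlServer {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");

            Connection hive = DriverManager.getConnection(
                    "jdbc:hive://hadoop-head-node:10000/default", "", "");
            Connection sql = DriverManager.getConnection(
                    "jdbc:sqlserver://dw-server;databaseName=Marketing",
                    "etl_user", "etl_password");

            // Hive compiles this familiar-looking query into MapReduce jobs.
            Statement hq = hive.createStatement();
            ResultSet rs = hq.executeQuery(
                    "SELECT week_ending, product, region, COUNT(*) " +
                    "FROM mentions GROUP BY week_ending, product, region");

            // Append the higher-level results to the SQL Server timeline.
            PreparedStatement ins = sql.prepareStatement(
                    "INSERT INTO dbo.WeeklyMentions (WeekEnding, Product, Region, Mentions) " +
                    "VALUES (?, ?, ?, ?)");
            while (rs.next()) {
                ins.setString(1, rs.getString(1));
                ins.setString(2, rs.getString(2));
                ins.setString(3, rs.getString(3));
                ins.setLong(4, rs.getLong(4));
                ins.addBatch();
            }
            ins.executeBatch();

            hive.close();
            sql.close();
        }
    }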

There is also a case for combining Hadoop with SQL Server StreamInsight for high-velocity data, for example to pre-process events from sensors or financial systems.

There is more. Microsoft also wants to help you leverage your existing software, or perhaps its components, to run inside Hadoop. Perhaps you have worked on a piece of .NET logic that could serve as the heart of a mapper or a reducer. Rather than making you rewrite it and re-host it under Linux (which, by the way, is not that difficult with Cygwin), you may just prefer to have it run on Windows. Hence, a Windows Server distribution of Apache Hadoop is in the works for your data centre, to be available under the name of Microsoft HDInsight.
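
The mechanism that makes this practical is Hadoop Streaming, which runs any executable as a mapper or a reducer, exchanging records over standard input and output as tab-separated key/value lines; your .NET logic only needs to honour that contract. Here is a minimal sketch of the contract, shown in Java for consistency with the earlier examples, though the whole point is that any language could take its place.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // A minimal sketch of the Hadoop Streaming contract that any executable
    // mapper, .NET or otherwise, must honour: read input records as lines on
    // standard input, and emit key/value pairs as tab-separated lines on
    // standard output.
    public class StreamingMapperSketch {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                // Emit one (token, 1) pair per word; a real mapper would host
                // your existing business logic here instead.
                for (String token : line.split("\\s+")) {
                    if (!token.isEmpty()) {
                        System.out.println(token.toLowerCase() + "\t1");
                    }
                }
            }
        }
    }

You would point the -mapper option of a standard Hadoop streaming job at such an executable, and Hadoop would take care of the distribution, sorting, and reducing, exactly as it does for native Java jobs.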

Hadoop is also going to be available as HDInsight for Windows Azure, letting you use it in a similar way to Amazon Web Services Elastic MapReduce: as a supercomputer-for-pennies, allowing you to hire a cluster of a sufficient size to run your processing without ever having to build one, or even needing to know how to configure it.

Personally, I am quite excited by this latest prospect. As a younger programmer, aeons ago, I started out punching cards for an ICL 1905 (Odra 1305) mainframe running the George 3 operating system, and I have dabbled with the IBM 360 (ask me about the DD command some time) and the CDC Cyber-72 supercomputer. I quite liked the power of those mainframes and the kind of at-that-time-big-data problems we could solve with them. I even wanted to have one of those, but had to settle for a set of manuals… Well, now anyone can have that kind of, for now, almost unlimited computing power, for a very reasonable price, which, may I boldly predict, will become cheaper by a factor of 100 or more in the course of this decade.

Microsoft is new to the world of open source, and I am also very pleased, if a bit surprised, by its willingness to support the Apache Hadoop development process co-operatively, the open way, submitting its own, pretty major, additions back to the community, such as support for JavaScript as a first-class language for writing Hadoop code, the JS Console, and, previously, contributions to Node.js. Incidentally, I never thought I would see the day ASP.NET MVC got open-sourced by Microsoft, but that is another story.

And finally, if you do go Hadoop in a big way, you will probably want the usual enterprise-grade security (like Active Directory) over what is otherwise a rather too-friendly, co-operative system. You will probably also ask for manageability features, such as updates, monitoring, and the like. This forms the final part of Microsoft’s efforts in this area: integrating Hadoop with Windows Server, System Center, and so on.

Big Data Solutions for Small Data Problems?

I owe it to my colleague, Mr Simon Lidberg, to point out that while Big Data platforms, especially Hadoop, are great for solving big data problems, they are not very suitable for small data. Indeed, small data problems, even over large amounts of data, are usually better solved using cubes, tabular models, or plain old data warehouse queries: for not-really-Big Data, Hadoop, with its batch-oriented nature, can be slower, and more cumbersome, than OLAP or PowerPivot.

Of course, the big question is whether you have a big data problem, or a small data problem with just a big problem of some other sort.

Privacy Concerns

It pains me to see, yet again, an amazing and very powerful technology being invented and made available to the world without so much as a conference having been spent on discussing how such powerful analytics might further erode our privacy and our slowly ebbing-away personal liberty. I worry greatly about the younger generation’s apparent lack of interest in those matters, and about the world’s preference for not paying a fee, favouring free online services that turn their users, and the data about them, into a product to be sold on. I worry that this technology, in particular sentiment analytics, can be used to study data not just about your products but, indeed, about others, and I do not just mean companies. It is again one of those times when democratic processes have not caught up with the possibilities offered by science and technology. At the same time, there are a lot of people who are not willing to take a measured and self-restrained approach to matters of privacy. Some have even built serious businesses out of a blatant disdain for it.

May I wish you a transformative experience of Big Data, but also one that you, and others, can truly feel good about.

Rafal