I recently had the opportunity to participate in a series of panel discussions as part of a Big Data and Hadoop roadshow conducted by Avnet, HPE, Hortonworks and Suse.
In addition to great industry and technology presentations from Hortonworks and HPE, each event included an interactive panel discussion, featuring domain experts from HPE, Hortonworks and Avnet. The HPE and Hortonworks panelists varied by city, with a couple of repeat participants. My calendar lined up so I agreed to attend all three events for Avnet.
While I might not have been in love with the idea at the time (3 cities in 1.5 weeks), participating in all three events turned out to be very beneficial. In addition to gaining new and expanded insights from my fellow panelists, I got to engage with a broad set of individuals and organizations on big data and Hadoop. As the one constant on all three panels, I had the opportunity to experience the differences and similarities between the different audiences.
The audience dynamics varied from city to city, from local culture to types of companies/industries present to their average level of big data/Hadoop experience. The attendees in one city were (on average) fairly new or early in their big data strategies, whereas the audience in another city was (on average) further along in their big data journey. Naturally, this resulted in some different and unique questions and discussions from city to city. What really struck me however, were the similarities between the three diverse audiences when it came to their questions around big data and Hadoop. I was surprised at how often the same questions kept coming up and how similar many of the discussions were from event to event.
With that in mind, I thought it might be beneficial to our partner community to highlight some of the common questions that kept coming up around Hadoop and big data and summarize the panels’ responses. We actually agreed most of the time, making the task of summarizing the panel opinions not overly daunting.
Does Hadoop replace my existing Data Warehouse?
Panel says: No. Hadoop can be an extremely valuable extension to your data warehouse and can even off-load some services from it (such as ETL), but it does not replace it. Hadoop is not an RDBMS, it's not an ACID-compliant database – it's not even a database. It is a distributed file system (the Hadoop Distributed File System, or HDFS) paired with an analytic/calculation engine (MapReduce). Yes, we can add SQL services like Hive and other processing engines like Spark, but that still doesn't replace an enterprise data warehouse. Hive and other SQL-on-Hadoop tools don't implement the full ANSI SQL standard; rather, they support a subset of ANSI SQL-92 features, and pushing warehouse-style workloads onto them has significant speed/performance implications. Hadoop is complementary to your data warehouse.
Of course, if we really wanted to complicate things, we could dig deeper into what you consider to be a data warehouse – and we would get a variety of answers that run the spectrum. And if the answer was something like "our data warehouse is really just a repository of data from a handful of sources, without any complex schemas or modeling" – then maybe you "could" actually move everything to Hadoop. But since that is fairly academic and probably of limited applicability to most enterprise customers, I'll stick with my original answer of: no.
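To make the "file system plus calculation engine" point concrete, the classic MapReduce example is word count: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Here is a minimal plain-Python sketch of those three phases – purely illustrative, not the Hadoop API, and with no cluster or HDFS involved:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop is not a database", "hadoop is a file system"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # 2
```

Notice there's no schema, no transactions and no SQL anywhere in that model – which is exactly why the raw engine complements, rather than replaces, a data warehouse.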
What about Spark, does it replace Hadoop?
Once again: No. Spark is an in-memory processing engine that can run on top of HDFS or stand-alone. Because it keeps intermediate results in memory rather than writing them to disk between stages, Spark is much faster than the traditional MapReduce approach. Spark can process data from HDFS, Hive, Flume and other data sources extremely quickly, allowing Hadoop to serve as an effective streaming or real-time analytics platform. Spark can replace MapReduce as the right tool for many jobs, but it is just one part of the Hadoop ecosystem, which also includes tools such as MapReduce, Storm, Hive, HBase, Flume, etc.
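The practical difference is the programming model: Spark expresses a job as a chain of transformations whose intermediate results stay in memory, whereas classic MapReduce persists each stage's output back to HDFS before the next stage starts. As a rough plain-Python illustration (not the actual Spark API – the comments note the roughly analogous RDD operations):

```python
from functools import reduce

# A Spark-style pipeline: filter -> map -> reduce, with every intermediate
# result held in memory. Classic MapReduce would write each stage's output
# to HDFS between jobs. Plain-Python sketch only, not the Spark API.
events = [("click", 3), ("view", 1), ("click", 5), ("view", 2), ("click", 1)]

clicks  = [v for k, v in events if k == "click"]  # ~ rdd.filter(...)
squared = [v * v for v in clicks]                 # ~ rdd.map(...)
total   = reduce(lambda a, b: a + b, squared)     # ~ rdd.reduce(...)

print(total)  # 3*3 + 5*5 + 1*1 = 35
```

Chaining stages without round-trips to disk is what makes Spark viable for iterative and near-real-time workloads – but it still relies on the rest of the ecosystem for storage, ingestion and SQL access.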
Are dedicated programmers/developers needed to deploy/manage a Hadoop system? Do I need to hire a Data Scientist?
You will certainly need some folks with Hadoop skills, database/data management skills, system admin skills, programming skills and analytics skills. Currently, the market isn't oversaturated with Hadoop admins who possess all of these skills along with several deployments and a few years of management experience under their belts (I think we'll see more over the next few years). Experienced DBAs can usually become effective Hadoop admins, as can good system admins (i.e. folks who know more than just navigating the GUI).
As for the data scientist, they’re great if you can find one (and afford him/her). You’re talking about someone who gets statistics, algorithms, coding, data and database technologies and the underlying business logic. In many cases, companies are leveraging the skills of multiple individuals already on staff as opposed to hiring a dedicated data scientist.
We hear about a lot of cool “science projects” but what are companies actually doing with Hadoop in production scenarios?
Over the last couple of years, we have seen more organizations using Hadoop in production environments. Some common examples include:
- Consolidating data from multiple sources/methods into a “data lake”
- Offloading ETL processes from an existing data warehouse
- Predictive modeling/analytics (related to security, maintenance, marketing, supply-chain etc.)
- Real-time or streaming analytics (when front-ended with an in-memory engine like Spark or SAP HANA)
The specific use-case examples are plentiful now, across most verticals like healthcare, retail, financial services, manufacturing etc.
How do I start the Big Data journey? Which use cases are low-hanging fruit to try out first?
We were all passionately unanimous in our response to this one – and the answer is: have a use case. Ok, have a well-defined, small in scope, manageable, measurable use case that has the support of the business. Work with the business stakeholders to identify an attainable use case that will return measurable business value and secure their buy-in. One of the most cited reasons for failed big data projects is the lack of a well-defined, business-relevant use case.
Now, here is where we diverged a little. Some of the panelists advocate starting with an operational IT use case (such as offloading ETL or log management) as a first Hadoop project, then using that success as a proof point to secure business buy-in for a more business centric Hadoop project. While I don’t disagree with that approach, I’d still prefer to start with a use case that directly impacts business objectives.
What infrastructure is most appropriate for Hadoop?
One of the key tenets of Hadoop is that it was designed to leverage "commodity" hardware. As our panels were part of an HPE-centric event, we focused on HPE solutions. HPE infrastructure is obviously a great platform for a Hadoop cluster. Additionally, the HPE folks had some very interesting testing and benchmark data showing significant performance gains for some Hadoop workloads using HPE Moonshot systems with 3Par arrays. Yes, the approach is completely counter-intuitive, but the test results were compelling. Of course, there are many solid infrastructure options for Hadoop – Cisco, IBM and Lenovo, to name a few – many of which have validated reference architectures or frameworks for Hadoop, making design and deployment MUCH easier. There are even a few somewhat "turn-key" Hadoop infrastructure solutions that can be delivered pre-configured, pre-integrated and workload-optimized, either from the manufacturer or from Avnet.
There were plenty of other questions but in the interest of brevity (I realize that ship may have already sailed at this point), I'll stop here. These seemed to be the most common questions across the three audiences. I won't subject you to stuff about the law of large numbers and statistical inference as related to Hadoop – unless that's your thing, in which case please feel free to reach out directly.