How to catch more fish in the elusive “Data Lake”…

I love working in High Tech! From my point of view there really is no other industry that moves as fast, changes as often and provides as much opportunity for quick learning and adaptation. With all of this excitement, however, one pattern has become clear: most new technology concepts fall short in their first iteration and then return years later in a refined form that is ready for the mass market.

One of the most intriguing technologies that illustrates this point is the “Data Lake”. The term was originally coined by James Dixon, founder and CTO of Pentaho. In his blog post, James defined a Data Lake as follows:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples” 

The philosophy of a Data Lake is very compelling. It allows organizations to connect their various sources of data and create actionable insights. These insights form the basis of incremental revenue opportunities, cost savings and risk avoidance. In turn, these insights evolve into a sustainable competitive advantage and a platform on which to drive Digital Transformation.

From 2010 onward we have seen almost every organization build out a Data Lake strategy in an attempt to seize these benefits. Unfortunately, in many cases these projects have not been a total success.

Having spent many years chasing the “elusive lunker” across many lakes and rivers, let me try to share what I believe happened, using a fishing analogy.

In order to catch big fish in any body of water, a few key elements need to be in place. Let’s compare this to capturing insights in a Big Data lake.

  1. You need an appropriate rod and reel – equipment that will present your bait appropriately and handle the stress
    • In the Big Data world this would be the infrastructure and databases – whether in the cloud or on premises
  2. You need appropriate bait – Something to draw the fish out of hiding and entice them to bite
    • In the Big Data world this would be your Analytics and Query tools
  3. You need an appropriate guide or guidebooks – to help ensure you are “fishing where the fish are”
    • In the Big Data world this would be your Data Scientists or Consultants

Much of the focus in Big Data over the last few years has centered on these three areas. The industry has spent a ton of time and money on tools, infrastructure and people to help unleash the power of our data, but very few organizations have actually seen the benefits. From my point of view we have missed one very important factor –

“Before we can even think about catching anything our Data Lake needs to have fish in it!”

Simply dumping data into a common repository doesn’t create a Data Lake. Data must be cleansed, prepared and appropriately indexed with metadata before it is fed into the data lake. This preparation allows it to be associated and blended with other data, which is what delivers the promise of deep connections and insights.
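To make the “cleanse, prepare, index” idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the names `LakeRecord`, `cleanse`, `prepare` and `build_index`, the sample fields, and the cleansing rules are all hypothetical, and none of this represents the API of Pentaho, Hitachi Content Platform or any other product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LakeRecord:
    """A cleansed payload plus the metadata used to find and blend it later."""
    payload: dict
    metadata: dict = field(default_factory=dict)


def cleanse(raw: dict) -> dict:
    """Normalize field names, trim strings and drop empty values.

    These rules are a stand-in (assumption) for real cleansing logic.
    """
    cleaned = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            cleaned[key.strip().lower()] = value
    return cleaned


def prepare(raw: dict, source: str, join_key: str) -> LakeRecord:
    """Cleanse a raw record and tag it with metadata before ingestion."""
    payload = cleanse(raw)
    metadata = {
        "source": source,                      # where the data came from
        "join_key": join_key,                  # field used to associate records
        "join_value": payload.get(join_key),
        "fields": sorted(payload),             # simple schema description
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return LakeRecord(payload, metadata)


def build_index(records):
    """Group records by join value so different sources can be blended."""
    index = {}
    for record in records:
        value = record.metadata["join_value"]
        if value is not None:
            index.setdefault(value, []).append(record)
    return index


# Two hypothetical sources describing the same customer.
crm = prepare({" Customer_ID ": "C-17 ", "Name": " Ada ", "Fax": ""}, "crm", "customer_id")
orders = prepare({"customer_id": "C-17", "total": 42.5}, "orders", "customer_id")
index = build_index([crm, orders])
```

Because both records were cleansed and tagged the same way, looking up `index["C-17"]` returns the CRM record and the order record together. Skip the preparation step and the two would never line up, since the raw CRM key and value carry stray whitespace and different casing.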

In fact, this particular issue is now getting even more complex for a few key reasons –

  1. Applications are now being created beyond the Data Center across various Private Clouds – We need a ubiquitous set of data services that allow data to be consolidated and associated regardless of where it originated!
  2. Most of the volume of data being created is “unstructured” (photos, videos, audio files) where Metadata composition is complex – We need a set of data services that can query and index unstructured data at the time of ingestion to ensure that it can easily be associated with structured data (in ERP systems, for instance).
  3. We are now on the cusp of the “IoT Era” where sensor data has the ability to deliver “real-time” insights, but we will need to be able to handle the volume of data that will be produced – We need a way to optimize data ingestion so that we can handle this coming flood of data!

I joined Hitachi eight months ago because of its compelling data strategy and its proven ability to solve these kinds of specific issues. In fact, Pentaho, the same company that coined the term “Data Lake”, was acquired by Hitachi two years ago, and when coupled with Hitachi’s legacy in the storage space the outcome is a simpler and more comprehensive data services platform. Simply put, our experience in enterprise storage, coupled with the metadata tagging capabilities of our Hitachi Content Platform and integrated with Pentaho for data blending, analytics and visualization, brings data lakes to life! Add our operational experience, and we have not only the technology platform but also the industry insight to help organizations solve their most complex issues and drive better insights.

I’m excited about the future!
