Let’s face it: using a data lake to wrangle big data scares people. There are too many horror stories of companies getting burned when implementing can’t-miss technologies. For example, there’s a long list of companies that were sold on the promise of data warehousing, only to end up pouring money into what became a dumpster fire that incinerated the careers of many CIOs.
At first glance, data lakes seem to improve on the promise that data warehouses once held: to help you gain better insight and foresight into your business. But as with data warehousing, there are pitfalls to avoid when you implement a data lake. If you manage to dodge these traps, you can shepherd your organization through the process of building a data lake. And when you couple the data lake with best-of-breed analytics tools, you can fulfill the promise of gaining deeper insight and foresight and grow your bottom line.
Avoid Creating Data Silos
Most data lakes use Hadoop to build data clusters, which are multiple storage spaces that hold large amounts of structured and unstructured data in its native format. Data is formatted only when data scientists and analysts request it. This “schema on read” architecture is less expensive and quicker to implement than traditional “schema on write” architectures (such as those that underpin data warehouses), which force data to be cleansed, de-duplicated, and formatted as it goes into the database.
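To make the “schema on read” idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code; the sample records, the field names, and the read_with_schema helper are all hypothetical and exist only for illustration. The raw data is stored untouched, and each group applies its own schema when it reads.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "bo", "amount": "5", "ts": "2024-01-06", "coupon": "X1"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    schema maps field name -> casting function. Fields missing from a
    record come back as None; fields not in the schema are ignored.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows

# Two user groups read the same raw data with different schemas.
finance_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
marketing_view = read_with_schema(RAW_EVENTS, {"user": str, "coupon": str})
```

Nothing was cleansed or reshaped on the way in. The finance view gets numeric amounts, the marketing view gets coupon codes, and neither view constrains the other.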
The problem with clustering with Hadoop, or a similar technology, is that you can “over-cluster” and create too many storage spaces. Over-clustering leads to duplication of data across the organization and the potential creation of data silos, which exacerbates any existing problems you have with inconsistent information from your IT systems. It also eliminates one key benefit of a data lake: a centralized, flexible data store that gives users access to data across the organization.
The solution is to constantly monitor the Hadoop clusters you’re creating—especially if you’re creating clusters for different departments or business functions—and keep them to a minimum to avoid having data duplication and inconsistency issues.
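One way to monitor for duplication is to fingerprint each dataset and flag content that shows up in more than one cluster. The sketch below assumes you can list each cluster's datasets and read their bytes; the inventory layout, the cluster names, and the find_duplicates helper are hypothetical. It only catches byte-identical copies, so treat it as a starting point rather than a complete audit.

```python
import hashlib
from collections import defaultdict

def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the dataset's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def find_duplicates(inventory):
    """Find identical content stored in more than one cluster.

    inventory maps (cluster, dataset_name) -> bytes.
    Returns a list of sorted location lists, one per duplicated dataset.
    """
    by_hash = defaultdict(list)
    for location, data in inventory.items():
        by_hash[content_fingerprint(data)].append(location)
    return [
        sorted(locations)
        for locations in by_hash.values()
        if len({cluster for cluster, _ in locations}) > 1
    ]

# Hypothetical inventory: sales and marketing hold the same customer file.
INVENTORY = {
    ("sales", "customers.csv"): b"id,name\n1,Ana\n",
    ("marketing", "contacts.csv"): b"id,name\n1,Ana\n",
    ("finance", "ledger.csv"): b"id,total\n1,20\n",
}

duplicates = find_duplicates(INVENTORY)
```

Here duplicates contains a single entry pairing the marketing and sales copies, which is the kind of overlap that tends to grow into a silo if nobody is watching.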
Don’t Be Rigid with Your Data
The beauty of using a data lake is that data can be formatted in multiple ways at the time it’s requested for analysis, depending on who’s requesting it and for what purpose. This gives companies that build a data lake agile, powerful data analysis capabilities that can be customized to satisfy multiple user constituencies. However, if you build an inflexible architecture and implement overly rigid data governance policies, you risk losing the flexibility and power of the data lake.
A scalable technical infrastructure is a must when implementing a data lake. As data types and volumes change, your architecture (especially your database and analytics capabilities) must be able to flex and grow with those changes. Otherwise, you’re just creating bottlenecks that sap the power of the technology.
Also, if you govern data access too tightly, you risk inhibiting the “freedom of movement and speech” of data throughout the organization, thus stifling analytic breakthroughs and insight generation. Conversely, if you don’t monitor how data is ingested and what data comes in, and don’t implement at least minimal, high-level governance policies, you’ll end up with garbage piles of data that no one uses because it’s either irrelevant or so dirty that it’s useless.
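What minimal, high-level governance looks like will vary by organization, but a lightweight check at ingestion is one option. The sketch below assumes a hypothetical policy in which every dataset must name an owner, a source, and a description before it enters the lake. The field names and the admit function are illustrative, not part of any particular tool.

```python
# Hypothetical minimal policy: every dataset must carry these fields.
REQUIRED_METADATA = ("owner", "source", "description")

def admit(dataset_metadata):
    """Check a dataset's metadata against the minimal policy.

    Returns (accepted, missing_fields). A field counts as missing when
    it is absent, not a string, or blank.
    """
    missing = []
    for field in REQUIRED_METADATA:
        value = dataset_metadata.get(field)
        if not (isinstance(value, str) and value.strip()):
            missing.append(field)
    return (not missing, missing)

# A well-described dataset is admitted; an anonymous dump is not.
ok = admit({"owner": "ops", "source": "crm", "description": "daily export"})
rejected = admit({"owner": "ops"})
```

The check says nothing about how the data is structured or who may read it, so it leaves the lake flexible while keeping unlabeled data from piling up.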
Do What’s Right for the Company
Everyone (except those curmudgeons who are resistant to change) will be excited about new technology. This is especially true when you tell them that they’ll be able to get better answers to the questions they have and be able to do their jobs more efficiently and effectively. That’s a good thing. However, don’t let excitement and politics rule the data lake implementation process. Do what’s right for the company, not what the most influential constituencies in the organization demand.
To pick a good pilot or proof-of-concept project, use an information-gathering process that suits your organization to uncover the issues that hamper decision-making and business performance. Then, use the results of that process to understand the needs of different user groups. This will help you clearly define the benefits of each candidate project, uncover the most significant opportunities, and pick the project that delivers the most value in the shortest time.
Building a data lake is a worthwhile undertaking for just about any company. Data lakes can provide you with the foundation to take advantage of incredible analytics techniques that help you leverage new insights to leapfrog your competition. Pick the right project, avoid creating or perpetuating data silos, and keep things flexible, and you’ll be on your way.
I’d love to hear about your perspective and experiences. Leave a comment below, contact me by email at email@example.com, or find me on Twitter.