What is a data lake? Flexible big data management explained
- 24 September, 2018 20:00
If you are tuned in to the latest technology concepts around big data, you’ve likely heard the term “data lake.” The image conjures up a large reservoir of water—and that’s what a data lake is, in concept: a reservoir. Only it’s for data.
Data lake defined
A data lake holds a vast amount of raw, unstructured data in its native format.
Therefore, all you need is a device that supports a flat file system, which means you can use a mainframe if you want. The data is moved to other servers for processing. Most enterprises go with the Hadoop Distributed File System (HDFS), because it is designed for fast processing of large data sets and is commonly used in the big data environments where a data lake is likely to be deployed.
That support for native-format data brings a key benefit. “If I want to get a ridiculous amount of data and figure out what to do with it later, that fits in the mantra of what we do with data lakes now,” says Michael Hiskey, head of strategy at Semarchy, a vendor of data management software.
“We have things known and unknown that people on the data lake side are taking [the approach to] keep everything that might be interesting and take order out of madness later. We could not guess today what’s valuable from the things I’m throwing away, but that could turn out to be interesting in the future,” he says.
Jake Stein, CEO of Stitch, an ETL service that connects multiple cloud data sources, echoed the future-proofing sentiment. “If you’re not sure when you’re going to use the data and it’s not important to have subsecond access and want to store it in a low-cost form, the data lake is the right format. It’s often a case of if you don’t capture the data now, you will never get it again, so it’s important to future-proof yourself in that aspect.”
Data lake vs. data warehouse
Data repositories are nothing new; data warehouses have been around for decades. And while it is natural to compare data warehouses to data lakes, there are fundamental differences that separate data warehouses from data lakes, ranging from the kind of data stored to how it is processed.
Data lakes don’t require specialty hardware
One key difference is that a data lake, unlike a data warehouse, does not require special hardware or software.
Data lakes are more flexible
As noted, a data lake holds a vast amount of raw, unstructured data in its native format, whereas the data warehouse is much more structured into folders, rows, and columns. As a result, a data lake is much more flexible about its data than a data warehouse is.
That’s important because of the 80 percent rule: Back in 1998, Merrill Lynch estimated that 80 percent of corporate data is unstructured, and that has remained essentially true. That in turn means data warehouses are severely limited in their potential data analysis scope.
Hiskey argues that data lakes are more useful than data warehouses because you can gather and store data now, even if you are not yet using parts of it, and then go back weeks, months, or years later to analyze old data that might otherwise have been discarded.
A flexibility-related difference between the data lake and the data warehouse is schema-on-read vs. schema-on-write. A schema is a logical description of the entire database, with the name and description of records of all record types.
A data warehouse applies schema-on-write, so you have to know exactly how to structure the data before you save it. That means a lot of preparation before intake, or at least before storage. By contrast, data lakes apply schema-on-read, so you can format the data as you read and process it. Schema-on-read means you can throw everything into a bucket, like log files, web files, or things with no meaningful structure, and then figure it out later.
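The difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the REQUIRED_FIELDS schema, and the function names are all hypothetical, and a plain list stands in for both the lake and the warehouse.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (the "lake").
RAW_LAKE = [
    '{"user": "ana", "action": "login", "ts": "2018-09-24T20:00:00"}',
    '203.0.113.7 - - [24/Sep/2018:20:00:05] "GET /index.html" 200',
    "free-form note with no meaningful structure",
]

# Hypothetical schema: the fields a structured consumer expects.
REQUIRED_FIELDS = ("user", "action", "ts")


def write_with_schema(store, line):
    """Schema-on-write: reject anything that does not fit the schema at intake."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        raise ValueError("record is not valid JSON") from None
    if not isinstance(record, dict) or any(f not in record for f in REQUIRED_FIELDS):
        raise ValueError("record does not match the declared schema")
    store.append({f: record[f] for f in REQUIRED_FIELDS})


def read_with_schema(lake):
    """Schema-on-read: every raw line stays stored; structure is applied per read."""
    for line in lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unusable under this schema, but still kept in the lake
        if isinstance(record, dict) and all(f in record for f in REQUIRED_FIELDS):
            yield {f: record[f] for f in REQUIRED_FIELDS}


warehouse = []
rejected = 0
for raw in RAW_LAKE:
    try:
        write_with_schema(warehouse, raw)
    except ValueError:
        rejected += 1  # with schema-on-write, this data never gets stored

logins = list(read_with_schema(RAW_LAKE))
```

Both paths end up with the same structured record, but only the lake still holds the two lines that did not fit, so a different schema could be applied to them later.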
“A data warehouse is highly structured. You have to really understand the data before you do anything on it,” said Joe Wilhelmy, director of data engineering at the American Association of Insurance Services (AAIS). “With a data lake, you can bring it iteratively through a maturity cycle from raw source data to structured projection. You can see it along the way [and] don’t have to be beholden to data engineers and IT to productize that data before it’s usable.”
Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When someone runs a business query based on certain metadata, all the data carrying those tags is then analyzed to answer the query or question.
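A minimal sketch of that idea, assuming an in-memory catalog: the LakeObject and Catalog classes, the tag names, and the sample payloads are hypothetical and do not reflect any particular product's API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake plus its metadata (hypothetical layout)."""
    payload: bytes
    tags: dict
    # Each element gets its own unique identifier at ingestion time.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Catalog:
    """Tiny in-memory stand-in for a data lake's metadata catalog."""

    def __init__(self):
        self._objects = {}

    def ingest(self, payload, **tags):
        obj = LakeObject(payload=payload, tags=dict(tags))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def find(self, **wanted):
        """Return objects whose tags match every requested key/value pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]


catalog = Catalog()
id_a = catalog.ingest(b"raw clickstream bytes", source="web", year=2018)
id_b = catalog.ingest(b"raw call-center audio", source="phone", year=2018)
id_c = catalog.ingest(b"older clickstream bytes", source="web", year=2017)

# A query selects data by metadata first; only the matches are then analyzed.
web_2018 = catalog.find(source="web", year=2018)
```

The payloads are never inspected during the lookup; the tags alone decide which raw objects are handed to the analysis step.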
Unlike a data warehouse, data lakes don’t have an underlying database. Instead, data lakes use a flat file system. With a database, you have to choose data and columns before you write to it. The trade-off is that it might take a while to insert the data into a database, but when you do a query it is a lot faster than in a data lake, which has to process the data as it is read.
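The trade-off can be illustrated with Python's standard library. This is a sketch under assumptions, not a benchmark: the table, column names, and sample rows are hypothetical, an in-memory SQLite database stands in for the warehouse, and an in-memory text buffer stands in for a flat file.

```python
import csv
import io
import sqlite3

# Hypothetical sample events: (user_name, action, amount).
rows = [("ana", "login", 3), ("ben", "purchase", 42), ("ana", "purchase", 17)]

# Database side: columns (and an index) are chosen before any data is written.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_name TEXT, action TEXT, amount INTEGER)")
db.execute("CREATE INDEX idx_events_user ON events (user_name)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)


def db_total(user_name):
    """The query can use the index and the declared column types."""
    cur = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM events WHERE user_name = ?",
        (user_name,),
    )
    return cur.fetchone()[0]


# Lake side: rows are appended to a flat file with no declared structure.
flat_file = io.StringIO()  # stands in for a file on a flat file system
csv.writer(flat_file).writerows(rows)


def lake_total(user_name):
    """Every read scans the whole file and interprets each line on the fly."""
    flat_file.seek(0)
    total = 0
    for name, _action, amount in csv.reader(flat_file):
        if name == user_name:
            total += int(amount)
    return total
```

Both functions return the same totals, but the database paid its cost up front at write time, while the flat file pays it again on every read.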
“With a data lake, you can put data into a store any way you like. That allows you to write data with a flexible schema and query later, but orders of magnitude slower,” said Stein. “The one element those servers don’t do well is metadata management. Things like what goes in which folder, when is it aged out. You have to roll your own when doing a service like that.”
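As a rough idea of what "rolling your own" might involve, the sketch below implements a simple age-out rule per folder. Everything here is an assumption made for illustration: the folder names, the retention periods, and the helper functions are hypothetical, and the code only decides what has expired rather than deleting anything.

```python
from datetime import date, timedelta

# Hypothetical per-folder retention rules, in days; None means keep forever.
RETENTION_DAYS = {"raw/logs": 90, "raw/clickstream": 365, "curated": None}


def folder_for(path):
    """Pick the longest configured folder prefix that contains the path."""
    matches = [f for f in RETENTION_DAYS if path.startswith(f + "/")]
    return max(matches, key=len) if matches else None


def expired(files, today):
    """Return paths whose ingestion date is older than their folder's limit."""
    out = []
    for path, ingested in files:
        folder = folder_for(path)
        limit = RETENTION_DAYS.get(folder) if folder else None
        # Files in unknown folders, or folders with no limit, are never aged out.
        if limit is not None and today - ingested > timedelta(days=limit):
            out.append(path)
    return out


# Hypothetical inventory of (path, ingestion date) pairs.
files = [
    ("raw/logs/app-2018-01-01.log", date(2018, 1, 1)),
    ("raw/logs/app-2018-09-01.log", date(2018, 9, 1)),
    ("raw/clickstream/2018-01-01.json", date(2018, 1, 1)),
    ("curated/customers.parquet", date(2016, 1, 1)),
]
to_delete = expired(files, today=date(2018, 9, 24))
```

Keeping the policy in a small table like RETENTION_DAYS separates the decision about what ages out from the storage layer, which does not track it for you.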
Enterprise-class data lake software now available
For the longest time, data lakes were a double-edged sword. The advantage was that they could be built with your existing hardware and free, open source software. The problem was the lack of commercially supported software from a traditional, mature data warehouse firm, which most people want.
That has since changed, and traditional companies like Teradata and Oracle offer commercial data lake products, as do specialized big data vendors like Hortonworks and Cloudera.
Amazon, Microsoft, Google, and IBM all offer a variety of data lake tools along with their basic cloud storage services, so you can build your data lake on premises or in the cloud.
Other commercial data lake products include:
- Apache NiFi: This Apache-licensed open-source tool is used for data routing and transformation in data lakes and analytics. It’s available as a commercial product from Hortonworks under the name DataFlow.
- Cambridge Semantics: The latest version of its Anzo Smart data lake product adds a semantic layer to data on both ingestion and read, so you can do on-demand preparation and analysis. It also has graph models to display the data analysis visually.
- Hitachi Vantara: Hitachi Vantara owns Pentaho, which first coined the term “data lake.” Pentaho is known for its data integration tools beyond just data lakes and offers integration with Hadoop, Spark, Kafka, and NoSQL to provide security, governance, integration, and data transformation.
- Trifacta: Its Wrangler software uses AI and machine learning algorithms to automate and simplify the processing of data and interaction with the analysts or business user. It visually tracks and presents the lineage of data transformation steps for specific data sets and across multiple workflows.
- Zaloni: Zaloni offers an enterprise data lake platform called Zaloni Data Platform, which includes support for cloud and on-premises deployment, a management platform, data catalog, zones for data governance, and self-service data-prep tools that cover end-to-end processing.
When to avoid a data lake
A data lake is not for everyone. Some companies may not need it, and it might make things worse. For example, Hiskey says data lakes are not for real-time work. “If you are looking for real-time, up-to-date info, a data lake is not for you. It’s for historical data. You’re still going to need a fast, transactional system.”
Wilhelmy says some industries won’t allow data lakes due to their unorganized nature. “There’s no strong data governance of random bits and files, and no one understands what governance processes are around the data lake. A prerequisite would be a strong data-governance position. The organization would have to be at an intermediate or advanced level of maturity to govern data processes in a data lake, from taking it in and cleaning it to passing it out to the organization.”
And Joshua Greenbaum, principal analyst with Enterprise Applications Consulting, doesn’t think data lakes are a good idea at all. “In most cases, data lakes are a sign of laziness on the side of IT and not a case of strategic thinking. The laziness is ‘Let’s put our data in one place and think about it later,’” he says.
Greenbaum argues if you don’t know the problems you are trying to solve, you’re collecting as many bricks as you can because one day you want to build something. “But if you don’t have a plan, all you have is a pile of bricks, and what if you need wooden beams? If you started with a design, you would know what you need to have.”
His cynicism comes from seeing this happen before with data warehouses. “This is a movie we’ve seen before, with different actors but the plot is the same and the end is the same. You are going to waste a lot of money on a data lake like [you did on] a data warehouse if you don’t do it strategically,” said Greenbaum.
A data lake with no purpose is an expensive “just in case” approach. But done strategically, it’s an excellent way to store information that you want to analyze and act on in different ways over time (customer patterns, for example), because you didn’t process it to the point where it can be used to do only one thing, as in a typical data warehouse.