What is a data lake? Flexible big data management explained

A data lake may be a far more versatile repository than a data warehouse. Or it may be a trash dump that grows and grows.

If you are tuned in to the latest technology concepts around big data, you’ve likely heard the term “data lake.” The image conjures up a large reservoir of water, and that’s what a data lake is, in concept: a reservoir. Only it’s for data.

Data lake defined
A data lake holds a vast amount of raw, unstructured data in its native format.

Therefore, all you need is a device that supports a flat file system, which means you can use a mainframe if you like. The data is moved to other servers for processing. Most enterprises go with the Hadoop Distributed File System (HDFS), because it is designed for fast processing of large data sets and is used in the big data environments where a data lake is likely to be used.

That support for native-format data brings a key benefit. “If I want to get a ridiculous amount of data and figure out what to do with it later, that fits in the mantra of what we do with data lakes now,” says Michael Hiskey, head of strategy at Semarchy, a vendor of data management software.

“We have things known and unknown that people on the data lake side are taking keep everything that might be interesting and take order out of madness later. We could not guess today what’s valuable from the things I’m throwing away, but that could turn out to be interesting in the future,” he says.

Jake Stein, CEO of Stitch, an ETL service that connects multiple cloud data sources, echoed the future-proofing sentiment. “If you’re not sure when you’re going to use the data and it’s not important to have subsecond access and want to store it in a low-cost form, the data lake is the right format. It’s often a case of if you don’t capture the data now, you will never get it again, so it’s important to future-proof yourself in that aspect.”

Data lake vs. data warehouse
Data repositories are nothing new; data warehouses have been around for many years. And while it is natural to compare data warehouses to data lakes, there are fundamental differences that separate the two, ranging from the type of data stored to the way it is processed.

Data lakes don’t require specialty hardware
One of the key differences between a data lake and a data warehouse is that a data lake does not require special hardware or software, unlike a data warehouse.

Data lakes are more versatile
As noted, a data lake holds a vast amount of raw, unstructured data in its native format, while the data warehouse is much more structured, organized into folders, rows, and columns. As a result, a data lake is much more flexible about its data than a data warehouse is.

That’s important because of the 80 percent rule: Back in 1998, Merrill Lynch estimated that 80 percent of corporate data is unstructured, and that has remained essentially true. That in turn means data warehouses are severely limited in their potential data analysis scope.

Hiskey argues that data lakes are more useful than data warehouses because you can gather and store data now, even if you are not using parts of that data, and then go back weeks, months, or years later and perform analysis on old data that might otherwise have been discarded.

A flexibility-related difference between the data lake and the data warehouse is schema-on-read vs. schema-on-write. A schema is a logical description of the entire database, with the name and description of records of all record types.

A data warehouse applies schema-on-write, so you must know exactly how to structure the data before you save it. That means a lot of preparation before intake, or at least before storage. By contrast, data lakes apply schema-on-read, so you can format the data as you read and process it. Schema-on-read means you can throw everything into a bucket, such as log files, web data, or things with no meaningful structure, and then figure it out later.
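To make the contrast concrete, here is a minimal Python sketch of the two approaches. The sample events, the `REQUIRED_FIELDS` mapping, and both function names are hypothetical and exist only for illustration; real warehouses and lakes enforce or apply schemas with their own tooling.

```python
import json

# Hypothetical raw events as they might land in a lake: one item per line,
# with inconsistent fields and one line that is not JSON at all.
RAW_LINES = [
    '{"user": "ana", "action": "login", "ms": 120}',
    '{"user": "bo", "action": "view", "page": "/pricing"}',
    'plain text log line with no structure',
]

# Assumed schema for the schema-on-write path: field name -> required type.
REQUIRED_FIELDS = {"user": str, "action": str, "ms": int}


def write_with_schema(lines):
    """Schema-on-write: validate each record before storing it; reject the rest."""
    stored, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        valid = isinstance(record, dict) and all(
            isinstance(record.get(name), kind)
            for name, kind in REQUIRED_FIELDS.items()
        )
        if valid:
            stored.append(record)
        else:
            rejected.append(line)
    return stored, rejected


def read_with_schema(lines, fields):
    """Schema-on-read: every raw line stays stored; fields are projected per query."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not usable for this query, but still kept in the lake
        if isinstance(record, dict):
            yield {name: record.get(name) for name in fields}
```

With schema-on-write, only the first event survives intake. With schema-on-read, all three lines stay in storage, and a later query such as `list(read_with_schema(RAW_LINES, ["user", "page"]))` can still pull the `page` field that the write-time schema never anticipated.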

“A data warehouse is highly structured. You have to really understand the data before you do anything on it,” said Joe Wilhelmy, director of data engineering at the American Association of Insurance Services (AAIS). “With a data lake, you can bring it iteratively through a maturity cycle from raw source data to structured projection. You can see it along the way don’t have to be beholden to data engineers and IT to productize that data before it’s usable.”

Every data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When someone performs a business query based on certain metadata, all the data tagged with it is then analyzed to answer the query or question.
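The identifier-plus-tags idea can be sketched as a toy in-memory catalog. The `LakeCatalog` class and its `put` and `find` methods are hypothetical names invented for this example; they do not correspond to any particular data lake product.

```python
import uuid
from collections import defaultdict


class LakeCatalog:
    """Toy in-memory catalog: maps identifiers to objects and tags to identifiers."""

    def __init__(self):
        self.objects = {}               # object_id -> raw payload
        self.tags = {}                  # object_id -> dict of metadata tags
        self.index = defaultdict(set)   # (tag key, tag value) -> set of object_ids

    def put(self, payload, **tags):
        """Store a payload as-is and record its metadata tags."""
        object_id = str(uuid.uuid4())   # unique identifier for this data element
        self.objects[object_id] = payload
        self.tags[object_id] = dict(tags)
        for item in tags.items():
            self.index[item].add(object_id)
        return object_id

    def find(self, **tags):
        """Return the ids of all objects carrying every requested tag."""
        if not tags:
            return set(self.objects)
        matches = [self.index.get(item, set()) for item in tags.items()]
        return set.intersection(*matches)
```

A query such as `catalog.find(source="web", year=2018)` then narrows the lake down to the tagged objects, and only those need to be read and analyzed.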

Unlike a data warehouse, data lakes don’t have an underlying database. Instead, data lakes use a flat file system. With a database, you need to pick the records and columns before you write to it. The trade-off is that it might take a while to insert the data into a database, but once you run a query it is much faster than in a data lake, which has to process the data as it is read.
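A small sketch of that trade-off, using Python’s built-in `sqlite3` module as a stand-in for a warehouse database and an in-memory CSV buffer as a stand-in for a flat file. The table, column, and function names are assumptions made for illustration, and the example shows where the work happens rather than measuring speed.

```python
import csv
import io
import sqlite3

ROWS = [("ana", "login", 120), ("bo", "view", 45), ("ana", "view", 80)]

# Database path: columns are declared before any row is written,
# and the index is updated on every insert.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_name TEXT, action TEXT, ms INTEGER)")
db.execute("CREATE INDEX idx_user_name ON events (user_name)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)", ROWS)


def total_ms_db(user_name):
    """Query the pre-structured table; the index narrows the rows to read."""
    row = db.execute(
        "SELECT SUM(ms) FROM events WHERE user_name = ?", (user_name,)
    ).fetchone()
    return row[0] or 0


# Flat-file path: rows are appended as text with no declared structure,
# so every query has to parse the whole file as it reads it.
flat_file = io.StringIO()
csv.writer(flat_file).writerows(ROWS)


def total_ms_flat(user_name):
    """Scan and parse every line at read time, then filter."""
    flat_file.seek(0)
    return sum(
        int(ms) for name, _action, ms in csv.reader(flat_file) if name == user_name
    )
```

Both functions return the same answer. The difference is that the database did its structuring work at insert time, while the flat file defers all of it to each read.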

“With a data lake, you can put data into a store any way you like. That allows you to write data with a flexible schema and query later, but orders of magnitude slower,” said Stein. “The one element those servers don’t do well is metadata management. Things like what goes in which folder, when is it aged out. You have to roll your own when doing a service like that.”
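As a rough idea of what “rolling your own” might involve, here is a minimal sketch of an age-out policy keyed on folder prefixes. The folder names, the retention periods, and the helper functions are all hypothetical, and a real lake would also need to delete or archive the files it flags.

```python
from datetime import date, timedelta

# Hypothetical policy: folder prefix -> maximum age in days (None = keep forever).
RETENTION_DAYS = {"raw/logs/": 90, "raw/clickstream/": 365, "curated/": None}


def expired(path, written_on, today):
    """Return True if the file is older than its folder's retention period."""
    for prefix, max_days in RETENTION_DAYS.items():
        if path.startswith(prefix):
            if max_days is None:
                return False
            return today - written_on > timedelta(days=max_days)
    return False  # no matching policy: keep the file


def files_to_age_out(listing, today):
    """listing is an iterable of (path, date written) pairs."""
    return [path for path, written_on in listing if expired(path, written_on, today)]
```

Keeping the policy in one small table like `RETENTION_DAYS` at least makes the “what goes in which folder, when is it aged out” decisions explicit and reviewable.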

Enterprise-class data lake software now available
For the longest time, the double-edged sword around data lakes was that they could be built with existing hardware and free, open source software. The advantage was that they used your existing hardware and free, open source software. The problem was the lack of commercially supported software from a traditional, mature data warehouse company, which most people want.

That has since changed, and traditional companies like Teradata and Oracle offer commercial data lake products, as do specialized big data vendors like Hortonworks and Cloudera.

Amazon, Microsoft, Google, and IBM all offer a number of data lake tools along with their basic cloud storage services, so you can build your data lake on premises or in the cloud.

Other commercial data lake products include:

Apache NiFi: This Apache-licensed open source software is used for data routing and transformation in data lakes and analytics. It’s available as a commercial product from Hortonworks under the name DataFlow.
Cambridge Semantics: The latest version of its Anzo Smart Data Lake product adds a semantic layer to data on both ingestion and read, so you can do on-demand preparation and analysis. It also has graph models to display the data analysis visually.
Hitachi Vantara: Hitachi Vantara owns Pentaho, which first coined the term “data lake.” Pentaho is known for its data integration tools beyond just data lakes and offers integration with Hadoop, Spark, Kafka, and NoSQL to provide security, governance, integration, and data transformation.
Trifacta: Its Wrangler software uses AI and machine learning algorithms to automate and simplify the processing of data and interaction with the analyst or business user. It visually tracks and presents the lineage of data transformation steps for specific data sets and across multiple workflows.
Zaloni: Zaloni offers an enterprise data lake platform called Zaloni Data Platform, which includes support for cloud and on-premises deployment, a management platform, a data catalog, zones for data governance, and self-service data-prep tools that cover end-to-end processing.

When to avoid a data lake
A data lake is not for everyone. Some companies may not need it, and it might make things worse. For example, Hiskey says data lakes are not for real-time work. “If you are looking for real-time, up-to-date info, a data lake is not for you. It’s for historical data. You’re still going to need a fast, transactional system.”

Wilhelmy says some industries won’t allow data lakes because of their unorganized nature. “There’s no strong data governance of random bits and files, and no one understands what governance processes are around the data lake. A prerequisite would be a strong data-governance position. The organization would have to be at an intermediate or advanced level of maturity to govern data processes in a data lake, from taking it in and cleaning it to passing it out to the organization.”

And Joshua Greenbaum, principal analyst with Enterprise Applications Consulting, doesn’t think data lakes are a good idea at all. “In most cases, data lakes are a sign of laziness on the side of IT and not a case of strategic thinking. The laziness is ‘Let’s put our data in one place and think about it later,’” he says.

Greenbaum argues that when you don’t know the problems you are trying to solve, you’re collecting as many bricks as you can because someday you want to build something. “But if you don’t have a plan, all you have is a pile of bricks, and what if you need wooden beams? If you started with a design, you would know what you need to have.”

His cynicism comes from seeing this happen before with data warehouses. “This is a movie we’ve seen before, with different actors but the plot is the same and the end is the same. You are going to waste a lot of money on a data lake like [you did on] a data warehouse if you don’t do it strategically,” said Greenbaum.

A data lake with no purpose is an expensive “just in case” strategy. But done strategically, it’s a great way to store information that you want to analyze and act on in different ways over time, such as customer patterns, because you didn’t process it to the point where it can be used to do only one thing, as in a typical data warehouse.