What Is a Data Lake? Is It Right for Your Company?

A few questions to help determine whether it’s the data architecture your company really needs.

A data lake is a repository that can ingest massive amounts of raw data of any kind at high speed.

The idea for the data lake grew out of Hadoop, a technology a few engineers developed to provide a storage model that could accommodate the immense amounts of data search platforms need. Hadoop eventually became an open source project under the Apache Software Foundation, making its code freely available to everyone.

Data lakes have been around for well over a decade now, and in that time enterprise software vendors such as Microsoft and Amazon have introduced their own versions, such as Azure Data Lake and AWS Lake Formation.

Companies are still trying to determine if and when a data lake is the appropriate solution for their data.

Why Was the Data Lake Invented?

To decide whether your company needs a data lake, it can help to think about why the data lake was created in the first place. In some ways, the data lake is a response to an earlier concept: the data warehouse.

A data warehouse is a systematized process and repository for the data a company might need for its operations and analytics. That strict structure keeps data ready for analysis, but it does so at the expense of collection and processing speed. This makes the warehouse impractical for companies that need to collect massive volumes of data at high speed, in a mix of structured and unstructured formats - in other words, big data companies.

The data lake’s ability to handle high volumes of data quickly makes it a desirable solution for companies dealing with big data. Much of that speed comes from clustering - spreading storage and computation across multiple servers rather than relying on a single machine.

Also known as distributed computing, this technique allows for expansion of the data infrastructure as needed, scaling it to meet the immense demands of companies such as Yahoo, Facebook or eBay.

A Primer Question: Do We Really Have Big Data?

If you are considering a data lake, the first question to ask is whether your organization actually produces big data. “Big” is subjective, but in general it refers to data sets that are too large or complex for traditional data storage and processing technologies.

When weighing a data lake, it is important to think about both the amount and the type of data you collect. A data lake is usually not the best place for purely relational data, because it lacks the structure and safeguards of a relational database management system (RDBMS). If you are dealing with large amounts of semi-structured and unstructured data, however, a data lake may be the right solution.

Assuming your data got past that gatekeeper question, here are some more questions to help you plan a data lake implementation:

  • What’s our plan for dealing with small data?
  • Would it be easy for my data science team to work in the lake?
  • How do we keep track of the data once it is in the data lake?
  • Is it possible to integrate a data lake with existing data infrastructure? If so, how would that be done?

What’s Our Plan for Dealing With Small Data?

As mentioned, data lakes aren't a good fit for every situation. Hadoop, for example, has more difficulty processing smaller datasets, and it is far better equipped to handle one large file than many small files that together add up to the same amount of data.

If you want a system that can manage both small and large datasets, you might want to consider using Apache Hadoop Ozone.

Another option is to configure your data pipeline to bundle many small files into one container, such as a SequenceFile, an Avro file or a Hadoop archive (.har) file.
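
As a rough illustration of the same idea, here is a minimal PySpark sketch that compacts a directory of small files into a handful of larger ones. The paths, file count and choice of Parquet are assumptions for the example, not a prescription:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    # Read a directory full of small JSON files (hypothetical landing-zone path).
    events = spark.read.json("s3a://landing-zone/events/")

    # Rewrite the same data as a few larger Parquet files so the lake
    # isn't cluttered with thousands of tiny objects.
    events.coalesce(8).write.mode("overwrite").parquet("s3a://data-lake/events/")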

Some commercial enterprise-grade ETL (extract, transform, load) solutions, such as Xplenty, can automatically optimize small files for storage in a data lake while loading new data into the system.

Can My Data Science Team Easily Work in the Lake?

Implementing a data architecture that your data science team can't work with is the last thing you want to do.

Hadoop can pose challenges for data science teams because some of its technologies, like MapReduce, are difficult to work with. In particular, the programming languages data scientists rely on most, like Python and R, can't operate directly on Hadoop's distributed data sets. Apache Spark is open source software that can help improve performance and simplify data lake operations.

Through PySpark and SparkR, Spark provides access to an extensive machine learning library while letting data scientists work in interfaces that closely resemble ordinary Python and R.
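
To make that concrete, here is a hedged PySpark sketch of reading a dataset from the lake and fitting a model with Spark's machine learning library. The path, table layout and column names are invented for the example:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("lake-ml-sketch").getOrCreate()

    # Load a dataset straight from the lake (hypothetical path and columns).
    sales = spark.read.parquet("s3a://data-lake/sales/")

    # Assemble feature columns into the vector format Spark's MLlib expects.
    assembler = VectorAssembler(inputCols=["units", "discount"], outputCol="features")
    training = assembler.transform(sales).select("features", "revenue")

    # Fit a simple regression model; the work is distributed across the cluster.
    model = LinearRegression(featuresCol="features", labelCol="revenue").fit(training)
    print(model.coefficients)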

It is important to consider the skill sets required for a successful Spark implementation, such as the ability to integrate and manage open source software code.

While Spark may simplify the process of working with data lakes, it does require some specialized knowledge for proper integration. If your developer team is struggling to integrate Spark, you could consider using one of the commercial versions of the product. Although this would still require some level of management, it would enable you to get up and running on a data lake in less than an hour.

Remember, too, that PySpark and SparkR require extra training, even for people who are already comfortable in Python and R.

How Are We Going to Keep Track of the Data Once We Put It in the Data Lake?

The importance of this cannot be overstated - the data lake's ability to quickly ingest any data type comes at the expense of providing the structure needed to make the data useful. Without a plan to address this in your data lake strategy, you may end up with something akin to a roach motel - datasets check in, but never leave.

Data lakes have given rise to solutions, like data catalogs, that provide a data warehouse-like level of organization.

There are many data catalog solutions available if you are looking to buy one. Before you start shopping, though, know that this product category has evolved significantly over the past few years, so choose a solution that can keep pace with the rest of your data infrastructure rather than simply inventory what's in the lake today.

Frankly, most companies' data infrastructure is a complicated web of different types of repositories - relational databases, NoSQL databases, warehouses, marts, operational data stores - often from various vendors, owned by different departments and residing in different locations.

Before making a purchase, ask about a data catalog's compatibility not just with your data lake but with all of your existing data infrastructure.

Can I Integrate a Data Lake With Current Data Infrastructure? And If So, How?

Although a data lake can manage a large volume of data, you would not want to put all of your data into the system.

For one thing, moving data is a huge undertaking, and unless there's a compelling reason to move it - like AWS pulling the plug on you - it's generally not something you'll want to do.

So what is the best course of action for integrating a data architecture that includes, say, Hadoop, Oracle and Teradata?

Matt Aslett of 451 Research has defined an enterprise intelligence platform (EIP) as technology that allows analytics projects to run concurrently on datasets that may reside in various data repositories. John Santaferraro of Enterprise Management Associates (EMA) has described a unified analytics warehouse (UAW) as a system that unifies interactions with data and analytics tools.

But how can analysts' forward-thinking ideas be turned into a working data infrastructure at your own company?

We asked ourselves this question when my company was building a system to automate SQL queries over a dispersed, distributed data architecture. Researching the issue led us to an open source technology called Trino. Trino's key selling point is its ability to integrate disparate systems (e.g., Hadoop plus Oracle Database, MongoDB or Teradata) without needing to transfer or modify any data. By leveraging data virtualization, it creates an abstraction layer that lets you treat a group of different systems as if they were one data source.
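
As a hedged illustration, here is roughly what a federated query looks like through Trino's Python client (trino-python-client). The host, catalog names and table names are placeholders; the point is simply that one SQL statement can span systems without moving any data:

    import trino

    # Connection details are placeholders; in Trino, each catalog maps to a
    # backing system (e.g. "hive" for the data lake, "oracle" for an RDBMS).
    conn = trino.dbapi.connect(host="trino.example.internal", port=8080, user="analyst")
    cur = conn.cursor()

    # One query joins a table in the lake with a table in Oracle,
    # without copying data between the two systems.
    cur.execute("""
        SELECT o.customer_id, o.order_total, c.segment
        FROM hive.sales.orders AS o
        JOIN oracle.crm.customers AS c
          ON o.customer_id = c.customer_id
    """)
    print(cur.fetchall())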

When you assess data catalogs, look for capabilities that give you an abstracted access layer over multiple data repositories. If data movement is required, you are likely to face a separate set of challenges that will delay your progress.

If you want a data lake, then, think through how you will manage small datasets, keep the lake organized and integrate it with your existing data infrastructure. Left unaddressed, any of these obstacles could keep your data lake investment from producing the desired results - but all of them can be overcome with the approaches described above.
