What are AWS Data Lakes?

Posted by Martin Townend on Mar 23, 2021 10:00:00 AM

On a global scale, roughly 2.5 quintillion bytes of data are created each day, and that volume grows roughly tenfold every five years. As data creation continues to accelerate, the need to store, clean, process, and analyze it is a growing concern for many organizations. We are now storing and analyzing data well beyond CRM and ERP systems: social media, web analytics, IoT data from a variety of devices, and machine-generated log data.

You may have heard of Data Lakes, but what are they?

Traditionally, organizations generate many kinds of data from applications, databases, log files, and much more. This data often takes up enormous amounts of space, with no structure or transformation. Basically, it is a mess!

So, what is a Data Lake, and what can it do? A Data Lake provides organizations with a centralized repository for a wide variety of data forms: a central platform that supports structured, semi-structured, and unstructured data. Data Lakes allow you to break down data silos and support a wide range of applications across analytics and machine learning use cases. Did I forget to mention that you can do all this without moving your data, duplicating it, or interfering with these different use cases?

Why AWS for Data Lakes?

Most organizations are already on (or are thinking about) their cloud journey. This could be your first footprint into the cloud to get you started, or it could expand your cloud footprint. As I described above, Data Lakes provide you with a way to store both relational and non-relational data at massive scale. They support a wide variety of tools that help you analyze the data and give you deeper insights.

  • You have a central data catalogue that provides a view of the data you own and its properties.
  • Services like Amazon EMR can run your big data applications, and Amazon Athena supports ad hoc, interactive analytics.
  • Amazon Redshift can serve as your data warehouse, and Redshift Spectrum can run exabyte-scale queries across data stored in your data lake in S3 and in Redshift itself.
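As a sketch of that ad hoc analytics workflow, here is how a query might be submitted to Athena with boto3, the AWS SDK for Python. The database and S3 result location names are hypothetical placeholders, not part of the original post:

```python
def build_athena_request(sql: str, database: str, results_s3: str) -> dict:
    """Assemble the parameters that Athena's StartQueryExecution API expects."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_s3},
    }

def run_ad_hoc_query(sql: str, database: str, results_s3: str) -> str:
    """Submit an ad hoc query to Athena and return its execution ID."""
    import boto3  # AWS SDK for Python; imported here so the request builder
                  # above can be used and tested without the SDK installed
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(sql, database, results_s3))
    return response["QueryExecutionId"]
```

Athena runs the query directly against the files in S3, so there is no data movement or duplication before you can start exploring.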

What Service do I use to store my Data Lake?

S3 is a great place for the Data Lake's central repository, as it provides a wide range of features, including storage analytics and file system integrations. You can use services like AWS Lake Formation to stand up a data lake within days, or attach Amazon FSx for Lustre to your data lake for HPC, machine learning, or media workloads.
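One common convention when landing data in an S3 data lake is Hive-style partitioned prefixes, which let query engines such as Athena and Redshift Spectrum scan only the partitions they need. A minimal sketch (the `raw/clickstream` layout is an illustrative assumption, not a prescribed structure):

```python
from datetime import date

def object_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    raw/clickstream/year=2021/month=03/day=23/events.json.gz
    so query engines can prune scans by year/month/day."""
    return (
        f"{source}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )
```

With a key builder like this, an upload is just `s3.upload_file(local_path, bucket, object_key("raw", "clickstream", date.today(), "events.json.gz"))` using a boto3 S3 client.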

How do I build my AWS Data Lake?

There are three steps involved in building out a data lake:

1. Collect and centralize your data
2. Catalogue and transform your data
3. Analyze and gain insights into your data
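The three steps above can be sketched end to end with boto3. Every resource name here (bucket, crawler, database, IAM role) is a hypothetical placeholder, and a real build would add error handling and wait for the crawler to finish before querying:

```python
def crawler_config(name: str, database: str, s3_path: str, role_arn: str) -> dict:
    """Parameters for Glue's CreateCrawler API: point a crawler at the
    centralized S3 data so table schemas are catalogued automatically."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def build_lake(bucket: str, role_arn: str) -> None:
    """Sketch of the three data lake steps, with placeholder names."""
    import boto3  # AWS SDK for Python
    s3 = boto3.client("s3")
    glue = boto3.client("glue")
    athena = boto3.client("athena")
    # 1. Collect and centralize: land raw files in the S3 repository.
    s3.upload_file("events.json.gz", bucket, "raw/events/events.json.gz")
    # 2. Catalogue and transform: crawl the data into the Glue Data Catalog.
    glue.create_crawler(
        **crawler_config("lake-crawler", "lake_db", f"s3://{bucket}/raw/", role_arn))
    glue.start_crawler(Name="lake-crawler")
    # 3. Analyze and gain insights: query the catalogued tables with Athena.
    athena.start_query_execution(
        QueryString="SELECT * FROM events LIMIT 10",
        QueryExecutionContext={"Database": "lake_db"},
        ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},
    )
```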

Let ConvergeOne help build out your data lake and start ingesting data. Reach out to Martin Townend or Mike McGuire for more information.


This ConvergeOne white paper will answer the following questions:

  • What exactly is unstructured data (also known as qualitative data)?
  • How is it different from structured data (also known as quantitative data)?
  • What are the latest trends and inherent challenges businesses need to be aware of concerning unstructured data?

Download the white paper to learn why it is becoming more critical than ever to extract value from unstructured data.


Topics: Data Center


Martin Townend
Martin Townend is a Senior Cloud Solutions Architect for ConvergeOne's National Data Center Practice. He is six-time AWS Certified and Microsoft Azure Certified, with a deep understanding of cloud security and the major public clouds. Martin has focused on cloud for over 10 years, helping organizations on their cloud journey and designing secure, scalable environments, and he continues to innovate within cloud and emerging technologies.