You may have heard of Data Lakes, but what are they?
Traditionally, organizations have various kinds of data being generated from applications, databases, log files, and much more. This data often takes up exorbitant amounts of space, with no structure or transformation. Basically, it is a mess!
So, what is a Data Lake, and what can it do? A Data Lake gives organizations a centralized repository that supports structured, semi-structured, and unstructured data. Data Lakes let you break down data silos and support a wide range of analytics and machine learning use cases. Did I forget to mention that you can do all this without moving or duplicating your data, and without one use case interfering with another?
Why AWS for Data Lakes
Most organizations are already on (or are thinking about) their cloud journey. A Data Lake could be your first step into the cloud, or it could expand your existing cloud footprint. As described above, Data Lakes provide you with a way to store both relational and non-relational data at massive scale. They support a wide variety of tools that help you analyze the data and give you deeper insights.
- You have a central data catalogue that provides you with a view of data you own and the properties of this data.
- Services like Amazon EMR can run your big data applications, and Amazon Athena handles ad hoc interactive analytics.
- Amazon Redshift can be used for your data warehouse, and Redshift Spectrum can run scale-out queries against exabytes of data in your S3 data lake without first loading it into Redshift.
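To make the Athena bullet more concrete, here is a minimal sketch of submitting an ad hoc query with boto3. It assumes boto3 is installed and AWS credentials are configured. The database `my_data_lake`, the table `app_logs`, and the results bucket are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names: replace with your own database, table, and bucket.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM app_logs GROUP BY status",
    database="my_data_lake",
    output_location="s3://my-athena-results/queries/",
)
```

Calling `run_query(request)` submits the query. Athena runs it asynchronously, so you would poll for completion before reading the results from the output location.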
What service do I use to store my Data Lake?
Amazon S3 is a great place for the Data Lake central repository, as it provides a vast number of features like analytics and file system integration. You can use services like AWS Lake Formation to stand up a data lake within days, or pair S3 with Amazon FSx for Lustre for HPC, machine learning, or media workloads.
How do I build my AWS Data Lake?
There are three steps involved in building out a data lake:
1. Collect and centralize your data
2. Catalogue and transform your data
3. Analyze and gain insights into your data
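As a small illustration of step 1, one common convention (not something the steps above prescribe) is to name collected objects with Hive-style `key=value` path segments, which cataloguing and query tools can typically treat as partitions in steps 2 and 3. The sketch below only builds the object key. The `raw/` prefix, the `source` segment, and the function name are assumptions chosen for illustration.

```python
from datetime import date


def raw_object_key(source: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned key for a newly collected file."""
    return (
        f"raw/source={source}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical source name and file, for illustration only.
key = raw_object_key("app-logs", date(2024, 1, 15), "events.json")
print(key)  # raw/source=app-logs/year=2024/month=01/day=15/events.json
```

Keeping the layout consistent from the start means queries can later be restricted to a single source or date range instead of scanning everything.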