A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data lake can be established “on premises” (within an organization’s data centers) or “in the cloud” (using cloud services from vendors such as Amazon, Microsoft, or Google).
While data warehouses can only work with structured information, such as information in a relational database where the data is organized into clearly identified columns and rows, data lakes can work with any type of data.
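The idea of keeping data in its raw format can be illustrated with a minimal sketch. It assumes a local directory standing in for an object store, and the helper names (`ingest`, `read_records`, `demo`), the `raw/<source>/<name>` layout, and the sample payloads are all hypothetical choices made for this illustration rather than part of any particular data lake product. Files are written byte-for-byte as they arrive, and structure is interpreted only when a file is read.

```python
import csv
import io
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def ingest(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged under raw/<source>/<name> (hypothetical layout)."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_records(path: Path) -> List[Dict]:
    """Interpret a stored file at read time, chosen by its file extension."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"no reader for {path.suffix!r}")


def demo() -> Dict:
    """Ingest structured, semi-structured, and binary data into one store."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        csv_path = ingest(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Lin\n")
        json_path = ingest(
            lake, "sensors", "reading.json", b'{"device": "t-1", "celsius": 21.5}'
        )
        bin_path = ingest(lake, "camera", "frame.bin", bytes([0, 255, 16, 32]))
        return {
            "customers": read_records(csv_path),
            "reading": read_records(json_path),
            "image_bytes": bin_path.read_bytes(),
            "files": sorted(
                p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()
            ),
        }


if __name__ == "__main__":
    print(demo()["files"])
```

In this sketch the binary file is stored alongside the CSV and JSON files even though no reader exists for it, which mirrors the point that a data lake accepts data regardless of whether a structure has been defined for it.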
Many companies build their data lakes on cloud storage services such as Google Cloud Storage and Amazon S3, or on a distributed file system such as the Hadoop Distributed File System (HDFS) of Apache Hadoop.
The concept of data lakes has also attracted academic interest.
For example, Personal DataLake at Cardiff University is a type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organising, and sharing personal data.