5 Leading Data Lake Analytics Platforms and Services: Part 1
31st December 2021
By Michael A
In recent years there has been an insatiable hunger for data lakes because of their ability to store data regardless of whether it is structured, semi-structured, or unstructured. This capability is especially important because the volume of unstructured and semi-structured data is growing far faster than that of structured data. You will find that most data lakes are built on one of three cloud object stores: Amazon Simple Storage Service (Amazon S3), Azure Data Lake Storage Gen2 (ADLS Gen2), or Google Cloud Storage (GCS). It is also becoming increasingly common for data lakes to span multiple cloud providers and, as a result, more than one storage service.
Why Cloud Object Stores are Important
Cloud object stores are typically low cost, highly scalable, extremely secure, and compliant with several international standards, and they provide virtually unlimited storage capacity. Implementing an on-premises data lake with all these attributes would be impractical for all but the largest of companies, not only because of the associated operational costs, but also because of the exceptional degree of skill and experience that is required to configure, maintain, and support the data infrastructure.
One of the biggest challenges when creating and maintaining a data lake is to avoid it turning into a data swamp. Companies can avoid creating data swamps by carefully organising the data that is ingested, and ensuring there are effective data governance controls, policies, and procedures in place.
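One way to keep ingested data organised is to agree on a folder convention and generate every path from it rather than typing paths by hand. The following is a minimal sketch of that idea, not a prescribed layout: the zone names (`raw`, `refined`, `curated`), the function name `build_lake_path`, and the `year=/month=/day=` partition folders are all assumptions made for illustration.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zone names; substitute whatever your governance policy defines.
ALLOWED_ZONES = ("raw", "refined", "curated")


def build_lake_path(zone: str, source: str, dataset: str, ingested_on: date) -> str:
    """Return a consistent, date-partitioned folder path for a dataset."""
    if zone not in ALLOWED_ZONES:
        raise ValueError(f"Unknown zone {zone!r}; expected one of {ALLOWED_ZONES}")
    path = (
        PurePosixPath(zone)
        / source
        / dataset
        / f"year={ingested_on:%Y}"
        / f"month={ingested_on:%m}"
        / f"day={ingested_on:%d}"
    )
    return str(path)


print(build_lake_path("raw", "sales", "orders", date(2021, 12, 31)))
```

Rejecting unknown zones at the point where paths are built is one simple control that stops ad hoc folders from accumulating in the lake.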
Another significant challenge, once a data lake has been successfully established, is enabling data engineers, data analysts, data scientists, machine learning (ML) or artificial intelligence (AI) engineers, and other members of the analytics team to unlock value from the data. They need flexible, scalable, and highly performant platforms and services that allow them to analyse and transform the data into solutions that provide value to their organisation.
Leading Data Lake Analytics Platforms and Services
This five-part blog series will introduce what are arguably the five best-of-breed data lake analytics platforms or services right now and present some ideas around how they could be used by an analytics team to deliver business value.
The platforms and services that will be explored in this blog series are:
- Azure Synapse Analytics
- Amazon Athena
- Databricks
- Google BigQuery
- Dremio
Azure Synapse Analytics
Azure Synapse Analytics is a unified analytics platform for building end-to-end analytics solutions. It is a superset of several well-integrated services that enable members of the analytics team to work with the same set of data in a way that is aligned with their analytic workflows. The data lake is implemented using one or more ADLS Gen2 accounts, and the services require data to be ingested there first before the full spectrum of Azure Synapse Analytics capabilities can be used.
Significant Capabilities
Azure Synapse Pipelines is a data orchestration service that enables data engineers to build data pipelines at varying levels of complexity. It is based on the Azure Data Factory service, a mature cloud data orchestration service that has a proven track record.
Azure Synapse Serverless SQL Pools is a scalable data lake SQL query engine that makes it possible to flexibly project a schema over semi-structured and structured files, enabling them to be queried like relational database tables, with a comparable level of performance. There is also an Azure Synapse Dedicated SQL Pools service, which is a massively parallel relational database service. Because this blog post focuses on services that can query the data lake directly, it is mentioned here only to make you aware of it.
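To make "querying files like tables" more concrete, the sketch below builds the kind of T-SQL statement a serverless SQL pool accepts for Parquet files, using the `OPENROWSET(BULK ..., FORMAT = 'PARQUET')` form. The Python helper `build_parquet_query`, the storage account name `examplelake`, and the folder path are hypothetical and used purely for illustration. The code only produces the query text; running it would still require a connection to a Synapse workspace.

```python
def build_parquet_query(file_url: str, top: int = 10) -> str:
    """Build a T-SQL query that reads Parquet files in place via OPENROWSET."""
    if top <= 0:
        raise ValueError("top must be a positive integer")
    # Escape single quotes so the URL is a valid T-SQL string literal.
    safe_url = file_url.replace("'", "''")
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [rows];"
    )


# Hypothetical ADLS Gen2 location; replace with a real account and path.
query = build_parquet_query(
    "https://examplelake.dfs.core.windows.net/refined/sales/orders/*.parquet",
    top=5,
)
print(query)
```

The wildcard in the path lets a single query cover every Parquet file in the folder, which is what allows a set of files to behave like one table.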
Azure Synapse Spark Pools is based on a special version of Apache Spark that enables data in the data lake to be handled using powerful programming languages including Scala, Python, R, and SQL. What makes this version of Apache Spark unique is that it also includes support for C#.
Power BI is one of the leading business intelligence software-as-a-service (SaaS) platforms, and this is integrated into the Azure Synapse Analytics service in a way that enables reports and dashboards to be quickly created from the refined data assets in the data lake.
Delivering Business Value
Azure Synapse Analytics caters to the many different roles found in a modern analytics team including the data engineer, data analyst, data scientist, and ML/AI engineer. Data engineers will spend most of their time ingesting data into the data lake (i.e. one or more ADLS Gen2 storage accounts) using Azure Synapse Pipelines and then transforming it for downstream use using Azure Synapse serverless SQL pools, Azure Synapse serverless Spark pools, or a combination of both.
Data scientists will typically use Azure Synapse serverless Spark pools to perform iterative tasks on the ingested data such as feature engineering, ML/AI model training, and ML/AI model evaluation. This could be done using Spark MLlib, a machine learning library built into Apache Spark, or with Azure Machine Learning services, a cloud-based machine learning service that enables data scientists and ML/AI engineers to rapidly iterate on machine learning models using automated ML (AutoML). Once the models have been trained and validated, the ML/AI engineers can take these production-ready ML/AI models and integrate them into their data pipelines using Azure Synapse Pipelines.
Data analysts can use the integrated Power BI experience to create business intelligence semantic data models and build reports and dashboards on top of them, making it possible to share insights across their organisation. The Power BI data model would use refined data from the data lake as a primary source and augment this with external data from other systems and data services, enabling the data analyst to stay on top of the ever-evolving reporting requirements. They could also use the Power BI composite models and recently announced hybrid tables features to combine an in-memory analytics cache, which provides consistently quick query response times, with massive data volumes that are queried directly in the data lake. These features can also be used to satisfy real-time reporting requirements.
Coming Up Next
The next instalment of this five-part blog series will explore how an analytics team could use Amazon Athena and other complementary AWS services to deliver business value from a data lake implemented with Amazon S3.
Blog hero image by Joshua Sortino on Unsplash.