Data Lake – Page 9 – Curated SQL

Searching Industry Templates for Lake Databases in Synapse

Published 2022-07-11 by Kevin Feasel

With Azure Synapse Database Templates generally available, our customers are constantly wanting to see and learn more about how to use these templates. Through these blogs we want to share tips and tricks our customers can use to help them utilize these templates in an efficient way. We’ve recently received several questions around the different ways a user can navigate these templates to create their lake databases. In this blog, I’d like to walk through a few options that may come handy as you give database templates a try.
Azure Synapse Analytics offers a no-code database designer which allows you to browse these database templates, select and customize the tables you want to use, to model your enterprise data. There are several ways to browse the tables provided by the comprehensive industry templates within the designer’s exploration experience. Though the user experience is super intuitive, there are a few tips and tricks that can make this process even easier. Let’s do a quick tour to learn about the different ways to browse these templates.

Click through for a few different ways to look at standard tables for different industries.

Comments closed

Summarizing Data & AI Summit Announcements

Published 2022-07-05 by Kevin Feasel

Zach Stagers hits the high notes:

One of the biggest cheers of the keynote was that Delta is being fully open sourced! Databricks continue to share their incredible work to help drive our industry forward. Delta already has wide adoption, but with the open sourced version now being levelled up to the same standard as the ‘proprietary’ one, this should help cement it as the default choice for lake-based storage.
There were some announcements of things to come with Delta too, such as a optimised deletes and updates by removing single rows instead of having to completely rewrite the file. It’ll be really interesting to see how this works, and just how much it boosts performance.

Read on for more notes on several big announcements.

Comments closed

Semantics Layers for Data Lakehouses

Published 2022-06-27 by Kevin Feasel

Jans Aasman explains why semantic modeling is so important for a data lakehouse:

Data lakehouses would not exist — especially not at enterprise scale — without semantic consistency. The provisioning of a universal semantic layer is not only one of the key attributes of this emergent data architecture, but also one of its cardinal enablers.
In fact, the critical distinction between a data lake and a data lakehouse is that the latter supplies a vital semantic understanding of data so users can view and comprehend these enterprise assets. It paves the way for data governance, metadata management, role-based access, and data quality.

For a deeper dive into the topic, Kyle Hale has a post covering this with Databricks and Power BI as examples.

Comments closed

Delta Live Tables and Power BI Data Modeling

Published 2022-06-21 by Kevin Feasel

Tahir Fayyaz goes from Delta Lake to Power BI:

To get the optimal performance from Power BI it is recommended to use a star schema data model and to make use of user-defined aggregated tables. However, as you build out your facts, dimensions, and aggregation tables and views in Delta Lake, ready to be used by the Power BI data model, it can become complicated to manage all the pipelines, dependencies, and data quality as you need to consider the following:
– How to easily develop and manage the data model’s transformation code.
– How to run and scale data pipelines for the model as data volumes grow.
– How to keep all the Delta Lake tables updated as new data arrives.
– How to view the lineage for all tables as the model gets more complex.
– How to actively stop data quality issues that result in incorrect reports.

Read on for recommendations, a couple architectural diagrams, and some sample code.

Comments closed

Processing Security Logs in Databricks with Delta Live Tables

Published 2022-06-06 by Kevin Feasel

Silvio Fiortio ingests some data:

Databricks recently introduced Workflows to enable data engineers, data scientists, and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure. Workflows allows users to build ETL pipelines that are automatically managed, including ingestion, and lineage, using Delta Live Tables. The benefits of Workflows and Delta Live Tables easily apply to security data sources, allowing us to scale to any volume or latency required for our operational needs.
In this article we’ll demonstrate some of the key benefits of Delta Live Tables for ingesting and processing security logs, with a few examples of common data sources we’ve seen our customers load into their cyber Lakehouse.

Click through to learn more.

Comments closed

Understanding the Data Lakehouse

Published 2022-04-18 by Kevin Feasel

Tom Jordan explains what data lakehouses are:

When we are thinking about data platforms, there are many different services and architectures that can be used – sometimes this can be a bit overwhelming! Data warehouses, data models, data lakes and reports are all typical components of an enterprise data platform, which have different uses and skills required. However, in the past few years a new architecture has been rising; the data lakehouse. This is an architecture that borrows ideas and concepts from several different areas, which we will be exploring in greater detail in this blog.

Click through to learn more about the origin of this term and how it draws + differs from both a data lake and a data warehouse.

Comments closed

Delta Lake Operability in Azure Synapse Analytics

Published 2022-03-04 by Kevin Feasel

James Serra lets us know when and where we can use Delta Lake within Azure Synapse Analytics:

Many companies are seeing the value in collecting data to help them make better business decisions. When building a solution in Azure to collect the data, nearly everyone is using a data lake. A majority of those are also using delta lake, which is basically a software layer over a data lake that gives additional features. I have yet to see anyone using competing technologies to delta lake in Azure, such as Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi and Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption).

Read on for more information.

Comments closed

Organizing Synapse Workspaces and Lakehouses

Published 2021-11-26 by Kevin Feasel

Jovan Popovic confirms that Microsoft is using the term “Lakehouse” like Databricks does:

The lakehouse pattern enables you to keep a large amount of your data in Data Lake and to get the analytic capabilities without a need to move your data to some data warehouse to start an analysis. A lakehouse represents a good trade-off between query performance and the ability to access the latest version of data without the need to wait for data to be reloaded.
Azure Synapse Analytics workspace enables you to implement the Lakehouse pattern on top of Azure Data Lake storage.
When you think about your lakehouse solution, be aware that there are two options for creating databases over the lake:
– Lake databases that are created using Spark or database template
– SQL databases that are created using serverless SQL pools on top of data lake.
Although you might use different tools and languages to create these types of databases, the principles described in this article apply to both types. I will use the term “lakehouse” whenever i reference Spak Lake database or SQL database created using the serverless SQL pools.

Click through for Jovan’s guidance.

Comments closed

Querying Delta Lake via Azure Synapse Analytics Serverless SQL Pool

Published 2021-11-11 by Kevin Feasel

Tony Truong uses T -SQL to query Delta Lake files:

How to query Delta Lake with SQL on Azure Synapse
As mentioned earlier, Azure Synapse has several compute pools for the evolving analytical workload. There is the Apache Spark pool for data engineers and serverless SQL pool for analysts. Let us break down how the two personas work together to query a shared Delta Lake.

Read on for the setup and the payoff.

Comments closed

Automatic Backups on a Data Lake or Lakehouse

Published 2021-10-22 by Kevin Feasel

Dave Ruijter backs that thing up:

Out of the box, Azure Data Lake Storage Gen2 provides redundant storage. Therefore, the data in your Data Lake(house) is resilient to transient hardware failures within a datacenter through automated replicas. This ensures durability and high availability. In this blog post, I provide a backup strategy on how to further protect your data from accidental deletions, data corruption, or any other data failures. This strategy works for Data Lake as well as Data Lakehouse implementations. It uses native Azure services, no additional tools, software, or licenses are required.

Read on for a detailed strategy.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Category: Data Lake