Press "Enter" to skip to content

Month: December 2020

Spark Streaming in a Databricks Notebook

Tomaz Kastrun shows off Spark Streaming in a Databricks notebook:

Spark Streaming is a process that can analyse not only batches of data but also streams of data in near real-time. It powers interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault-tolerant system: thanks to the lineage of operations, Spark will always remember where you stopped, and in case of a worker failure, another worker can recreate all of the data transformations from the partitioned RDD (assuming that all of the RDD transformations are deterministic).
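
For a point of reference, here is a minimal Structured Streaming sketch you could run in a Databricks (or any Spark 3.x) notebook; the rate source and checkpoint path are illustrative and not taken from Tomaz's demo:

```python
# A minimal Structured Streaming sketch; the rate source and checkpoint path
# are illustrative, not from the linked demo.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A deterministic transformation: count events per 10-second window.
counts = stream.groupBy(window("timestamp", "10 seconds")).count()

# The checkpoint location is what lets Spark pick up where it stopped after a
# worker failure, replaying transformations from saved offsets and state.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/rate-demo")
    .start()
)

query.awaitTermination(30)  # let it run briefly
query.stop()
```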

Click through for the demo.


Using the Cosmos DB Analytics Storage Engine

Hasan Savran explains the purpose of the Cosmos DB Analytics Storage Engine:

Analytical storage uses a column store format to save your data. This means data is written to disk column by column rather than row by row, which makes aggregation functions run fast because the disk no longer needs to work hard to find data row by row. Cosmos DB takes responsibility for moving data from the Transactional Store to the Analytical Store, too. You do not need to write any ETL packages to accomplish this, which means you do not need to figure out which data needs to be updated or which data should be deleted. Azure Cosmos DB figures it all out for you and syncs the data between these two storage engines. This gives us the isolation we have been looking for between transactional and analytical environments. Data written to transactional storage will be available in the Analytical Store in less than 5 minutes. In my experience, it really depends on the size of the database; with a smaller database, data usually becomes available in the Analytical Store in less than a minute.
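
Enabling the analytical store is a one-time container setting. Here's a minimal sketch using the azure-cosmos Python SDK; the account URL, key, and database/container names are placeholders:

```python
# A minimal sketch using the azure-cosmos Python SDK (v4); account URL, key,
# and names are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists(id="sales")

# Setting analytical_storage_ttl enables the analytical (column) store on the
# container; -1 means synced data never expires from it.
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    analytical_storage_ttl=-1,
)

# Writes land in the transactional (row) store as usual; Cosmos DB syncs them
# to the analytical store automatically, with no ETL to build or maintain.
container.upsert_item({"id": "1", "customerId": "c-42", "total": 19.99})
```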

This makes the data easy to ingest into Azure Synapse Analytics, for example.


PASS: the End of an Era

Mala Mahadevan reflects on 22 years of association with PASS:

I finally decided I would write about the lessons I’ve learned in my 22-year association with them. This is necessary for me to move on and may be worth reading for those who think similarly.
There is the common line that PASS is not the #sqlfamily, and that line is currently true. But back in those days, it was. At least it was our introduction to the community commonly known as #sqlfamily. So many lessons here are in fact lessons in dealing with and living with community issues.

Read on to learn from Mala.


Integrating Power BI with Azure Synapse Analytics

Santosh Balasubramanian walks us through the process of querying Azure Synapse Analytics data with Power BI:

In this guide, you will integrate an existing Power BI workspace with Azure Synapse Analytics so that you can quickly access datasets, edit reports directly in Synapse Studio, and automatically see updates to the report in the Power BI workspace. We will be using a Power BI report developed using the Movie Analytics dataset from the previous guide to show the functionality of the Power BI integration in Azure Synapse.

Click through for the demo.


Linking between Notebooks in Azure Data Studio

Julie Koesmarno shows us the rules of linking notebooks in Azure Data Studio:

When writing a notebook, it can be very handy to be able to refer to a specific part of a notebook and allow readers to jump to that part, i.e., linking or anchoring. Using this technique, you can also create an index list or a table of contents, or cross-reference parts of other notebooks. Check out my demo notebook for this linking topic in the MsSQLGirl GitHub repo.
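
As a quick illustration (a hypothetical sketch, not Julie's demo notebook), a Markdown cell can link to headings in the same notebook or to another notebook file; anchor IDs are typically the heading text lowercased with spaces replaced by hyphens:

```markdown
## Table of Contents
- [Setup](#setup)                         <!-- anchor within this notebook -->
- [Load Data](#load-data)
- [Cleanup](./cleanup-notebook.ipynb)     <!-- link to another notebook -->

## Setup
...

## Load Data
...
```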

Read on for those rules.


Internal and External Azure Data Factory Pipeline Activities

Paul Andrew differentiates two forms of pipeline activity:

Firstly, you might be thinking, why do I need to know this? Well, in my opinion, there are three main reasons for having an understanding of internal vs external activities:

1. Microsoft cryptically charges you a different rate of execution hours depending on the activity category when the pipeline is triggered. See the Azure Price Calculator.

2. Different resource limitations are enforced per subscription and region (not per Data Factory instance) depending on the activity category. See Azure Data Factory Resource Limitations.

3. I would suggest that understanding what compute is used for a given pipeline is good practice when building out complex control flows. For example, this relates to things like Hosted IR job concurrency, what resources can connect to what infrastructure, and when activities might become queued.

Paul warns that this is a dry topic, but these are important reasons to know the difference.


Working with Serverless and Dedicated SQL Pools in Azure Synapse Analytics

Igor Stanko takes us through both dedicated and serverless SQL Pools in Azure Synapse Analytics:

Both serverless and dedicated SQL pools can be used within the same Synapse workspace, providing the flexibility to choose one or both options to cost-effectively manage your SQL analytics workloads. With Azure Synapse, you can use T-SQL to directly query data within a data lake for rapid data exploration and take advantage of the full capabilities of a data warehouse for more predictable and mission-critical workloads. With both query options available, you can choose the most cost-effective option for each of your use cases, resulting in cost savings across your business.

This post explores the two consumption choices available when running analytics using Synapse SQL (serverless and dedicated SQL pools) and examines the power and flexibility Azure Synapse provides when both are used to execute T-SQL workloads. In addition, we will explore options to control cost when using both models.
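
To make the two models concrete, here is a minimal sketch of querying each from Python; the workspace names, credentials, database names, and storage path are all placeholders, and it assumes pyodbc with the ODBC Driver 17 for SQL Server:

```python
# A minimal sketch; workspace, database, credential, and storage paths are
# placeholders. Assumes pyodbc + ODBC Driver 17 for SQL Server.
import pyodbc

driver = "DRIVER={ODBC Driver 17 for SQL Server};UID=sqladminuser;PWD=<password>;"

# Serverless SQL pool: pay per data scanned, query files in the lake directly.
serverless = pyodbc.connect(
    driver + "SERVER=myworkspace-ondemand.sql.azuresynapse.net;DATABASE=demo;"
)
lake_rows = serverless.cursor().execute("""
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/data/movies/*.parquet',
        FORMAT = 'PARQUET'
    ) AS movies;
""").fetchall()

# Dedicated SQL pool: provisioned compute, query tables loaded into the warehouse.
dedicated = pyodbc.connect(
    driver + "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=dedicatedpool;"
)
warehouse_rows = dedicated.cursor().execute(
    "SELECT TOP 10 * FROM dbo.Movies;"
).fetchall()
```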

Click through for details, including hints on minimizing costs.


Multiple Slicers and AND Logic

Stephanie Bruno embraces the healing power of AND:

When using slicers in Power BI reports, multiple selections filter data with OR logic. For example, if you have a slicer with products and your visuals are displaying total number of invoices, then when “bicycles” and “helmets” are selected in the products slicer, your visual will show the number of invoices that include bicycles OR helmets. But what if you instead need it to show only the number of invoices that include bicycles AND helmets? Read on to find out how you can do just that with DAX.
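
One common DAX pattern for this (a sketch built on a hypothetical model with Sales and Products tables, and not necessarily the approach in Stephanie's post) is to keep only the invoices whose distinct product count matches the number of slicer selections:

```dax
-- Hypothetical model: Sales[InvoiceID], Sales[Product]; slicer on Products[Product].
Invoices With All Selected Products :=
VAR SelectedCount = COUNTROWS ( VALUES ( Products[Product] ) )
RETURN
    COUNTROWS (
        FILTER (
            VALUES ( Sales[InvoiceID] ),
            -- Context transition: for this invoice, count the selected
            -- products it contains; keep it only if it has all of them.
            CALCULATE ( DISTINCTCOUNT ( Sales[Product] ) ) = SelectedCount
        )
    )
```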

Read on for the solution.
