
Day: December 23, 2020

Visualization and the Value of Expectations

Alex Velez thinks about violating expectations in visuals:

This isn’t to say we should never deviate from normal graphing conventions, but we should have a good reason for doing so—a reason that makes up for any unintended consequences. 

What other design decisions might also take our audience by surprise—going against normal graphing expectations? I’ll outline a few. 

Click through for examples. One thing not explicitly brought up is that we follow conventions to reduce the amount of thought needed to understand something. For circumstances in which there’s a major benefit, you might want to run that risk. Also, there’s an argument in here that, at some point, it’s better to have something radically different than marginally different.


Spark Streaming in a Databricks Notebook

Tomaz Kastrun shows off Spark Streaming in a Databricks notebook:

Spark Streaming is the process that can analyse not only batches of data but also streams of data in near real-time. It enables powerful interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault-tolerant system, meaning that, thanks to the lineage of operations, Spark always remembers where you stopped, and in the case of a worker failure another worker can recreate all of the data transformations from the partitioned RDDs (assuming that all of the RDD transformations are deterministic).

Click through for the demo.
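
As a rough illustration of the idea, here is a minimal Structured Streaming sketch in PySpark using the built-in rate source and an in-memory sink. The source, sink, window size, and query name are arbitrary choices for demonstration, not taken from Tomaz's demo:

```python
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` already exists; getOrCreate() is a no-op there.
spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in rate source emits rows with `timestamp` and `value` columns.
stream_df = (
    spark.readStream
         .format("rate")
         .option("rowsPerSecond", 10)
         .load()
)

# A simple tumbling-window count over the stream.
counts = (
    stream_df
        .groupBy(F.window("timestamp", "30 seconds"))
        .count()
)

# Write the running counts to an in-memory table so they can be inspected
# interactively, e.g. spark.sql("SELECT * FROM rate_counts").show()
query = (
    counts.writeStream
          .outputMode("complete")
          .format("memory")
          .queryName("rate_counts")
          .start()
)
```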


Using the Cosmos DB Analytics Storage Engine

Hasan Savran explains the purpose of the Cosmos DB Analytics Storage Engine:

Analytics storage uses Column Store format to save your data. This means data is written to disk column by column rather than row by row, which makes aggregation functions run fast because the disk no longer needs to work hard to find data row by row. Cosmos DB takes responsibility for moving data from the Transaction Store to the Analytical Store too. You do not need to write any ETL packages to accomplish this, which means you do not need to figure out which data needs to be updated or which data should be deleted. Azure Cosmos DB figures all of that out for you and syncs the data between these two storage engines. This gives us the isolation we have been looking for between transactional and analytical environments. Data written to transactional storage will be available in the Analytical Store in less than 5 minutes. In my experience, it really depends on the size of the database; if you have a smaller database, data usually becomes available in the Analytical Store in less than a minute.

This makes the data easy to ingest into Azure Synapse Analytics, for example.
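For instance, once Synapse Link is enabled, a Synapse Spark notebook can read the analytical store directly. The sketch below assumes a linked service named CosmosDbLinkedService, a container named SalesOrders, and a status column; all of those names are placeholders:

```python
# Runs in a Synapse Spark notebook, where `spark` is already defined.
# Reads the Cosmos DB analytical store (column store) via Synapse Link.
df = (
    spark.read
         .format("cosmos.olap")
         .option("spark.synapse.linkedService", "CosmosDbLinkedService")  # placeholder
         .option("spark.cosmos.container", "SalesOrders")                 # placeholder
         .load()
)

# Aggregations like this hit the analytical store rather than the
# transactional store; "status" is a hypothetical column in the container.
df.groupBy("status").count().show()
```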


PASS: the End of an Era

Mala Mahadevan reflects on 22 years of association with PASS:

I finally decided I would write about the lessons I’ve learned in my 22-year association with them. This is necessary for me to move on and may be worth reading for those who think similarly.
There is the common line that PASS is not the #sqlfamily, and that line is currently true. But back in those days, it was. At least it was our introduction to the community commonly known as #sqlfamily. So many lessons here are in fact lessons in dealing with and living with community issues.

Read on to learn from Mala.


Integrating Power BI with Azure Synapse Analytics

Santosh Balasubramanian walks us through the process of querying Azure Synapse Analytics data with Power BI:

In this guide, you will be integrating an already-existing Power BI workspace with Azure Synapse Analytics so that you can quickly access datasets, edit reports directly in the Synapse Studio, and automatically see updates to the report in the Power BI workspace. We will be using a Power BI report developed using the Movie Analytics dataset of the previous guide to show the functionalities of the Power BI integration in Azure Synapse.

Click through for the demo.


Linking between Notebooks in Azure Data Studio

Julie Koesmarno shows us the rules of linking notebooks in Azure Data Studio:

When writing a notebook, it can be very handy to be able to refer to a specific part of a notebook and allow the readers to jump to that part, i.e. linking or anchoring. Using this technique, you can also create an index list, a table of contents, or cross-references to parts of other notebooks. Check out my demo notebook for this linking topic, from the MsSQLGirl GitHub repo.

Read on for those rules.
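
As a rough sketch of what such links look like in a markdown cell, here are the three common cases. The anchor text, file names, and headings are made up, and exactly how an anchor is derived from a heading (case, spaces, punctuation) is one of the rules Julie's post and demo notebook cover:

```markdown
[Jump to the Setup section](#setup)                       <!-- heading in the same notebook -->
[Open the data-loading notebook](./01-load-data.ipynb)    <!-- another notebook file -->
[Cleansing steps](./01-load-data.ipynb#cleansing-steps)   <!-- heading inside another notebook -->
```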


Internal and External Azure Data Factory Pipeline Activities

Paul Andrew differentiates two forms of pipeline activity:

Firstly, you might be thinking, why do I need to know this? Well, in my opinion, there are three main reasons for having an understanding of internal vs external activities:

1. Microsoft cryptically charges you a different rate of execution hours depending on the activity category when the pipeline is triggered. See the Azure Price Calculator.

2. Different resource limitations are enforced per subscription and region (not per Data Factory instance) depending on the activity category. See Azure Data Factory Resource Limitations.

3. I would suggest that understanding what compute is used for a given pipeline is good practice when building out complex control flows. For example, this relates to things like Hosted IR job concurrency, what resources can connect to what infrastructure, and when activities might become queued.

Paul warns that this is a dry topic, but these are important reasons to know the difference.
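
To make the distinction concrete, here is a hedged sketch of a pipeline definition containing one activity from each category: a Get Metadata activity, which executes on the integration runtime, and a Databricks Notebook activity, which hands work off to linked external compute. All names, datasets, and paths below are placeholders rather than anything from Paul's post:

```json
{
  "name": "ExamplePipeline",
  "properties": {
    "activities": [
      {
        "name": "CheckSourceFolder",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "TransformInDatabricks",
        "type": "DatabricksNotebook",
        "dependsOn": [
          { "activity": "CheckSourceFolder", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": {
          "referenceName": "AzureDatabricksLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": { "notebookPath": "/Shared/transform-sales" }
      }
    ]
  }
}
```

The first activity falls into the internal (pipeline activity) category for billing and resource limits, while the second is dispatched to external compute and is charged and limited as an external pipeline activity.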
