Month: December 2021

Log4j and the Microsoft Data Platform

Published 2021-12-20 by Kevin Feasel

Andreas Wolter has some guidance for us:

Microsoft published guidance regarding Log4j 2 vulnerability for customers using Azure Data services. Please find the latest information here:
Microsoft’s Response to CVE-2021-44228 Apache Log4j 2 – Microsoft Security Response Center
The published list shows affected products only.

For SQL Server, even components which use log4j, the version is old enough that it is not affected by the series of exploits, bugs, and exploits of bugs which were introduced to try to fix the prior round of exploited bugs. The big exception is Big Data Clusters and if you happened to install log4j on your own.

Comments closed

Diving into Spark Streaming

Published 2021-12-20 by Kevin Feasel

Tomaz Kastrun continues a series on Spark and is well into a section on Spark Streaming. Part 17 looks at watermarks:

Streaming data is considered as continuously ingested data with particular frequency and latency. It is considered “big data” and data that has no discrete beginning nor end.
The primary goal of any real-time stream processing system is to process the streaming data within a window frame (or considered this as frequency). Usually this frequency is “as soon as it arrives”. On the other hand, latency in streaming processing model is considered to have the means to work or deal with all the possible latencies (one second or one minute) and provides an end-to-end low latency system. If frequency of data analysing is on user’s side (destination), latency is considered on the device’s side (source).

Part 18 enumerates the supported types of windows:

Tumbling windows are fixed sized and static. They are non-overlapping and are contiguous intervals. Every ingested data can be (must be) bound to a singled window.
Sliding windows are also fixed sized and also static. Windows will overlap when the duration of the slide is smaller than the duration of the window. Ingested data can therefore be bound to two or more windows
Session windows are dynamic in size of the window length. The size depends on the ingested data. A session starts with an input and expands if the following input expands if the next ingested record has fallen within the gap duration.

Part 19 includes good information on how data engineers can work with streams of data:

Streaming data can be used in conjunction with other datasets. You can have Joining streaming data, joining data with watermarking, deduplication, outputting the data, applying foreach logic, using triggers and creating Stream API Tables.
All of the functions are available in Python, Scala and Java and some are not available with R. We will be focusing on Python and R.

Check out all three of these posts.

Comments closed

Interactive Jupyter Plots with Plotly Express

Published 2021-12-20 by Kevin Feasel

Kevin Jacobs starts a series on Plotly Express:

This blog series is a beginners’ tutorial on how you can make interactive plots in a Jupyter notebook using Plotly Express. In this first blog post on this topic, we will go through the steps needed for creating a basic line Python plot and a 3D scatter plot.

Click through for the examples.

Comments closed

Simulating Slow Data Sources in Power BI

Published 2021-12-20 by Kevin Feasel

Chris Webb builds a particular kind of test:

As a postscript to my series on Power BI refresh timeouts (see part 1, part 2 and part 3) I thought it would be useful to document how I was able to simulate a slow data source in Power BI without using large data volumes or deliberately complex M code.
It’s relatively easy to create an M query that returns a table of data after a given delay. For example, this query returns a table with one column and one row after one hour and ten minutes:

Read on for a version of the function which slowly emits rows, as well as some T-SQL which slowly emits rows.

Comments closed

Power BI Desktop Hardening

Published 2021-12-20 by Kevin Feasel

Matthew Roche explains a bit of jargon:

What I am going to do here is talk a little about “desktop hardening,” which is a term I’ve heard hundreds of times in meetings and work conversations, but which I discovered is practically undefined and unmentioned on the public internet. The only place where I could find this term used in context was in this reply from Power BI PM Christian Wade… and if you don’t already know what “desktop hardening” is, seeing it used in context here isn’t going to help too much.

When I hear that term, I definitely do not think about APIs; instead, the first thing which comes to mind is security, which is a totally different story.

Comments closed

SERVERPROPERTY() and DATABASEPROPERTYEX() Views

Published 2021-12-20 by Kevin Feasel

Andy Mallon provides a public service:

The thing I hate the most about these two functions is that you need to know the right magic spells to make them work. Let’s look at SERVERPROPERTY() first. The syntax for the function is SERVERPROPERTY( 'propertyname' ), which is easy enough syntax, but the list of values for propertyname isn’t discoverable from SQL Server metadata, DMVs, or even IntelliSense. Instead, I need to check the docs for the list of allowable values. These property names are essentially magic words, and I need to check my spell book to make sure I get it right.
Invalid values for propertyname just return NULL–which is easy enough to handle, but also means your code will compile and run, but might do unintended things if you get your magic spell wrong, due to a typo.

Click through for Andy’s solution to the problem.

Comments closed

A Primer on the Serverless SQL Pool

Published 2021-12-20 by Kevin Feasel

Tino Zishiri has some tips for people trying out the Azure Synapse Analytics serverless SQL pool:

Serverless SQL Pools or SQL on-demand is a serverless distributed data processing service offered by Microsoft. The service is comparable to Amazon Athena. The serverless nature of the service means that there is no infrastructure to manage, and you only pay for what you use (pay-per-query model).
Through Serverless SQL pools, you query the data in your data lake using T-SQL. The architecture behind the service is optimized for querying and analyzing big data by running queries in parallel.

Read on to understand where the serverless SQL pool fits, as well as some tips about data transformation with this pool.

Comments closed

Trying out Spark Streaming

Published 2021-12-17 by Kevin Feasel

Tomaz Kastrun continues a series on Spark. Part 15 provides an introduction to Spark Streaming:

Spark Streaming or Structured Streaming is a scalable and fault-tolerant, end-to-end stream processing engine. it is built on the Spark SQL engine. Spark SQL engine will is responsible for running results sets for streaming data, regardless of static or continuously in coming stream data.
Spark stream can use Dataframe (or Datasets) API in Scala, Python, R or Java to work on handling data ingest, creating streaming analytics and do all the computations. All these requests and workloads are done against Spark SQL engine.

I don’t think I’ve ever seen an example of using Spark Streaming in R, so that one’s new to me.

Part 16 looks at DataFrame operations in Spark Streaming:

When working with Spark Streaming from file based ingestion, user must predefine the schema. This will require not only better performance but consistent data ingest for streaming data. There is always possibility to set the spark.sql.streaming.schemaInference to true to enable Spark to infer schema on read or automatically.

Check out both of those posts.

Comments closed

Log Replay for Azure SQL Managed Instance

Published 2021-12-17 by Kevin Feasel

Joey D’Antoni has some quick notes on the Log Replay Service:

Recently, I’ve started on a project where we are migrating a customer to Azure SQL Managed Instance, which now supports a few different migration paths. You can simply backup and restore from a database backup, but you can’t apply a log or differential backup to that database. You can also use the Database Migration Service, but that requires a lot of infrastructure and Azure configuration. The log replay service, or LRS, is the functional equivalent of log shipping to your new managed instance database. While log shipping is a well known methodology for both database migrations or disaster recovery. However, the implementation is a little different–let’s talk about how it works.

Click through to see how it differs.

Comments closed

Reviewing Azure Options for PostgreSQL and MySQL

Published 2021-12-17 by Kevin Feasel

Maria Zakourdaev has a pair of info sheets. First up is Azure Database for MySQL:

MySQL is an open-source relational database that is widely used for web applications, it’s easy to use, reliable, secure, and fast.
Recently Microsoft have announced a new deployment option, Flexible Server, that is now generally available.
If we have a quick look at the available options, we now have Single Server and Flexible server deployment options.

Then we have Azure Database for PostgreSQL:

PostgreSQL is an open-sourced, feature rich and extendable relational database that handles high concurrency workloads easily. It supports complex structures, many advanced data types, Search Tree indexes and also got highly sophisticated query optimizer.
Azure Database for PostgreSQL is an Azure managed services running PostgreSQL community edition. With Flexible Server announced recently, you now have 3 deployment options: Single Server, Flexible Server and Hyperscale/Citus.

Click through for a quick comparison of each available option.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31