Month: October 2020

Access Violation Querying System Table Functions with Parallelism

Eitan Blumin has a write-up of an interesting bug in SQL Server:

The Access Violation error is triggered when an execution plan with parallelism involves specific system table functions. We found that the error occurs ONLY with parallel execution plans.

Therefore, in order to reproduce it, you’ll need:

– SQL Server instance with MaxDOP setting not equal to 1
– At least 2 available CPU cores
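Those two prerequisites are easy to express as a pre-flight check. Here is a minimal Python sketch; the function name and its inputs are hypothetical, and it assumes you have already read the MaxDOP value from the instance configuration. A MaxDOP of 0 lets SQL Server use all available processors, so it also permits parallel plans.

```python
def can_produce_parallel_plan(maxdop: int, cpu_cores: int) -> bool:
    """Return True when both prerequisites for the repro are met.

    maxdop:    the instance's max degree of parallelism setting
               (0 = use all available processors, 1 = parallelism disabled).
    cpu_cores: number of CPU cores available to the instance.
    """
    if maxdop < 0 or cpu_cores < 1:
        raise ValueError("maxdop must be >= 0 and cpu_cores must be >= 1")
    # Parallel plans need MaxDOP != 1 and at least two cores.
    return maxdop != 1 and cpu_cores >= 2
```

This only covers the two conditions quoted above; it says nothing about whether the optimizer will actually choose a parallel plan for a given query.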

Click through for additional details, including a script to generate the same error yourself.

The Main Components of Apache Spark

Manoj Pandey walks us through the key components in Apache Spark:

1. Spark Driver:

– The Driver program can run various operations in parallel on a Spark cluster.

– It is responsible to communicate with the Cluster Manager for allocation of resources for launching Spark Executors.

– And in parallel it instantiates SparkSession for the Spark Application.

– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka Spark execution plan). Each DAG internally has various Stages based upon different operations to perform, and finally each Stage gets divided into multiple Tasks such that each Task maps to a single partition of data.

– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
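The Job, Stage, and Task hierarchy described above can be pictured with a toy model. This is plain Python, not the Spark API: the class names and the plan_job helper are made up for illustration, and it assumes every Stage has the same number of partitions, which real Spark does not guarantee (a shuffle can change the partition count).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    partition_id: int  # each Task maps to exactly one partition of data


@dataclass
class Stage:
    name: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    # The Job's DAG, flattened here into an ordered list of Stages.
    stages: List[Stage] = field(default_factory=list)


def plan_job(stage_names: List[str], num_partitions: int) -> Job:
    """Build one Job whose Stages each get one Task per partition."""
    return Job(
        stages=[
            Stage(name, [Task(p) for p in range(num_partitions)])
            for name in stage_names
        ]
    )


job = plan_job(["read", "aggregate"], num_partitions=4)
total_tasks = sum(len(stage.tasks) for stage in job.stages)
print(total_tasks)  # 2 stages x 4 partitions = 8
```

In this picture, the Driver is whatever calls plan_job and then hands the resulting Tasks out to Executors.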
 

Click through for additional elements and how they fit together.

Optical Character Recognition with Tesseract and Databricks

Alex Aleksandrov takes a look at optical character recognition with the Tesseract library:

The topic of Optical Character Recognition (OCR) is not an unexplored field to the Adatis audience. Some Adati like Kalina Ivanova (link1, link2) and Francesco Sbrescia (link3) have already explored this topic from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to explore this topic from a different perspective: using Tesseract and Databricks.

Click through for instructions.

Changing Power BI Slicer Appearance

Prathy Kamasani has a video:

In my recent open data project, I created a single page report model with a sparse slicer. It’s a good trick for anyone who wants to make their slicer look a bit sleeker. Like any other visual in Power BI, Slicers also have many properties. By default, below is how slicer looks in Power BI, but I made few changes to make it look like the one on left, in a few steps.

Click through for the video.

Parsing Parameter Default Values in PowerShell

Aaron Bertrand continues a series:

In part 1 and part 2 of this series, I introduced ParamParser: a PowerShell module that helps parse parameter information – including default values – from stored procedures and user-defined functions, because SQL Server isn’t going to do it for us.

In the first few iterations of the code, I simply had a .ps1 file that allowed you to paste one or more module bodies into a hard-coded $procedure variable.

Read on to see what’s new in the ParamParser repo.

Swart’s Ten Percent Rule: User Connections

Michael J. Swart applies Swart’s 10% Rule to maximum simultaneous user connections:

The maximum number of user connections that SQL Server can support is 32,767. That’s it. That’s the end of the line. You can buy faster I.O. or a server with more CPUs but you can’t buy more connections.

I actually mentioned this limit in the post where I introduced Swart’s 10% rule: “If you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong” In that post, I was guarded about that statement as it applied to the user connection limit. But I’d like to upgrade that to elevated.

This is Threat Level Vermillion, people!
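For a sense of scale, here is the arithmetic behind the rule as a small Python sketch. The constant comes from the limit quoted above; the function name is hypothetical, and how you obtain the current connection count is left out.

```python
MAX_USER_CONNECTIONS = 32_767  # hard limit quoted above

# Largest whole number of connections that is still within 10% of the limit.
TEN_PERCENT_THRESHOLD = MAX_USER_CONNECTIONS // 10


def exceeds_ten_percent(current_connections: int) -> bool:
    """Return True when usage is over 10% of the user connection limit."""
    # Integer arithmetic avoids any floating-point rounding at the boundary.
    return current_connections * 10 > MAX_USER_CONNECTIONS


print(TEN_PERCENT_THRESHOLD)      # 3276
print(exceeds_ten_percent(3276))  # False
print(exceeds_ten_percent(3277))  # True
```

In other words, by this rule of thumb anything past roughly 3,276 simultaneous user connections is worth a hard look.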

Validating Data Model Results

Paul Turley continues a discussion on Power BI data model validation:

We often have users of a business intelligence solution tell us that they have found a discrepancy between the numbers in a Power BI report and a report produced by their line-of-business (LOB) system, which they believe to be the correct information.

Using the LOB reports as a data source for Power BI is usually not ideal because at best, we would only reproduce the same results in a different report. We typically connect to raw data sources and transform that detail data, along with other data sources with historical information to analyze trends, comparisons and ratios to produce more insightful reports.

However, if the LOB reports are really the north star for data validation, these can provide an effective means to certify that a BI semantic model and analytic reports are correct and reliable.
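One way to picture that kind of certification is a simple reconciliation of totals. The sketch below is not Paul's method: the function name, the tolerance, and the sample figures are all made up for illustration, and it assumes both sides have already been summarized to the same keys (here, month).

```python
from typing import Dict, List


def find_discrepancies(
    lob_totals: Dict[str, float],
    model_totals: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return the sorted keys where the two sources disagree.

    A key counts as a discrepancy when it is missing from either side
    or when the totals differ by more than `tolerance`.
    """
    mismatched = []
    for key in set(lob_totals) | set(model_totals):
        if key not in lob_totals or key not in model_totals:
            mismatched.append(key)
        elif abs(lob_totals[key] - model_totals[key]) > tolerance:
            mismatched.append(key)
    return sorted(mismatched)


# Illustrative, made-up monthly totals.
lob = {"2020-08": 1000.0, "2020-09": 1250.5}
model = {"2020-08": 1000.0, "2020-09": 1249.0, "2020-10": 10.0}
print(find_discrepancies(lob, model))  # ['2020-09', '2020-10']
```

An empty result would suggest the model agrees with the LOB report at that grain; any keys returned are where to start digging.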

Click through for more details.

The Big Red Button for Query Store

Erin Stellato shows us the emergency off switch for Query Store:

Have you ever tried to turn off Query Store when there was an issue, and you thought the problem might be related to Query Store, and the ALTER DATABASE statement was blocked?  And then you couldn’t do anything but wait?  Me too.  Imagine my excitement when I discovered that the SQL Server team snuck a helpful back door into ALL versions for which Query Store is supported. 

Read on for more, including which SP / CU levels support it.

Indexing S3 Data with NiFi and CDP Data Hubs

Eva Nahari, et al., walk us through text indexing of S3 data with Solr, NiFi, and Cloudera Data Platform:

Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e. Solr) is most commonly used for batch indexing data residing in cloud storage, or if you want to do heavy transformations of the data as a pre-step before sending it to indexing for easy exploration. NiFi (as depicted in this blog) is used for real time and often voluminous incoming event streams that need to be explorable (e.g. logs, twitter feeds, file appends etc).

Our ambition is not to use any terminal or a single shell command to achieve this. We have a UI tool for every step we need to take. 

Click through to see how well they do at that.
