Press "Enter" to skip to content

Month: March 2020

Issue with PolyBase and Cosmos DB

I found an issue with connecting to Cosmos DB from PolyBase after installing SQL Server 2019 CU2:

After upgrading to SQL Server 2019 CU2, I noticed some issues when trying to connect to a Cosmos DB collection via PolyBase. Specifically, I started getting the following error message:

Msg 105082, Level 16, State 1, Line 35
105082;Generic ODBC error: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: Server at <<my Cosmos account name>>.documents.azure.com:10255 reports wire version 2, but this version of libmongoc requires at least 3 (MongoDB 3.0) (Error Code: 15) Additional error <2>: ErrorMsg: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: Server at <<my Cosmos account name>> .documents.azure.com:10255 reports wire version 2, but this version of libmongoc requires at least 3 (MongoDB 3.0) (Error Code: 15), SqlState: HY000, NativeError: 110 .

Read on for a couple of attempts at a solution and some more detail.
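For context, the kind of PolyBase setup in play looks roughly like the sketch below; the account name, credential, collection, columns, and connection options are all placeholders rather than anything from the post, and the exact options can vary by environment:

-- Rough sketch only: names, secrets, and options are placeholders.
-- (Assumes a database master key already exists.)
CREATE DATABASE SCOPED CREDENTIAL CosmosCredential
WITH IDENTITY = '<cosmos account name>', SECRET = '<cosmos account key>';
GO

CREATE EXTERNAL DATA SOURCE CosmosMongo
WITH (
    LOCATION = 'mongodb://<cosmos account name>.documents.azure.com:10255',
    CREDENTIAL = CosmosCredential,
    CONNECTION_OPTIONS = 'ssl=true'
);
GO

-- For the MongoDB connector, LOCATION is '<database>.<collection>' on the remote side.
CREATE EXTERNAL TABLE dbo.Orders_External
(
    _id NVARCHAR(24),
    OrderTotal FLOAT
)
WITH (
    LOCATION = 'MyCosmosDatabase.Orders',
    DATA_SOURCE = CosmosMongo
);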


Fun with Deadlocks

Jana Sattainathan diagnoses a deadlocking issue:

We know what deadlocks are and some of the common reasons they happen. If you need a refresher, I recommend this good article. I am not going to rehash what has already been said, although these high-level points are worth noting when resolving them:

1) Examine known parallelism (where you have parallelized jobs)
2) Examine unknown parallelism (unknown jobs or users interfering with your jobs in parallel)
3) Arrange the order of tables receiving DML to be the same across all code. E.g., always Customers first, Orders second, OrderDetails third.
4) Examine the indexes on the affected tables to eliminate full-table scans
5) Reduce the amount of time spent in a transaction
6) Update in chunks, especially if updating/deleting across sessions (see the sketch after this list)
7) Avoid RBAR (Row By Agonizing Row) CRUD operations! Do statement-based mass changes.
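As a quick illustration of points 5 and 6, here is a minimal sketch of a chunked delete; the table name and filter are made up, and each iteration is its own short statement, which keeps the lock footprint small:

-- Hypothetical table and filter; delete in small batches so no single
-- statement holds a large number of locks for a long time.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (5000)
    FROM dbo.OrderDetails
    WHERE OrderDate < '20150101';

    SET @rows = @@ROWCOUNT;
END;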

Read on to understand Jana’s situation and solution.


Transforming JSON to CSV with Azure Data Factory

Rayis Imayev shows how you can use the Flatten task in Azure Data Factory to convert JSON text to CSV:

What this new task does is help to transform/transpose/flatten your JSON structure into a denormalized, flattened dataset that you can upload into a new or existing flat database table.

I like the analogy of the Transpose function in Excel, which helps to rotate your vertical set of data pairs (name : value) into a table with column names and values for the corresponding objects. And when this vertical JSON structure contains several similar sets (an array), the ADF Mapping Data Flows Flatten task does a really good job of transforming it into a table with several rows (records).

Click through for a demonstration.
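As a loose analogy in T-SQL terms (the Flatten transformation itself is not T-SQL, of course, and this JSON document is made up), OPENJSON performs the same kind of array-to-rows flattening:

DECLARE @json nvarchar(max) = N'{
  "customer": "Contoso",
  "orders": [
    { "orderId": 1, "total": 10.50 },
    { "orderId": 2, "total": 99.95 }
  ]
}';

-- One row per array element, with the parent value repeated on each row.
SELECT
    JSON_VALUE(@json, '$.customer') AS customer,
    o.orderId,
    o.total
FROM OPENJSON(@json, '$.orders')
    WITH (
        orderId int           '$.orderId',
        total   decimal(10,2) '$.total'
    ) AS o;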


CI/CD with Databricks

Sumit Mehrotra takes us through the continuous integration story around Databricks:

Development environment – Now that you have delivered a fully configured data environment to the product (or services) team in your organization, the data scientists have started working on it. They are using the data science notebook interface that they are familiar with to do exploratory analysis. The data engineers have also started working in the environment and they like working in the context of their IDEs. They would prefer a connection between their favorite IDE and the data environment that allows them to use the familiar interface of their IDE to code and, at the same time, use the power of the data environment to run through unit tests, all in the context of their IDE.

Any disciplined engineering team would take their code from the developer’s desktop to production, running through various quality gates and feedback loops. As a start, the team needs to connect their data environment to their code repository on a service like git so that the code base is properly versioned and the team can work collaboratively on the codebase.

This is more of a conceptual post than a direct how-to guide, but it does a good job of getting you on the right path.


Distributed Model Training with Dask and SciKit-Learn

Matthieu Lamairesse shows us how we can use Dask to perform distributed ML model training:

Dask is an open-source parallel computing framework written natively in Python (initially released 2014). It has a significant following and support largely due to its good integration with the popular Python ML ecosystem triumvirate that is NumPy, Pandas and Scikit-learn. 

Why Dask over other distributed machine learning frameworks? 

In the context of this article, it’s about Dask’s tight integration with Scikit-learn’s joblib parallel computing library, which allows us to distribute Scikit-learn code with (almost) no code change, making it a very interesting framework for accelerating ML training.

Click through for an interesting article and an example of using this on Cloudera’s ML platform.


Exploring the Extended Events Live Data Window

Grant Fritchey takes us through the Extended Events Live Data window:

One reason a lot of people don’t like Extended Events is because the output is in XML. Let’s face it, XML is a pain in the bottom. However, there are a bunch of ways around dealing with the XML data. The first, and easiest, is to ignore it completely and use the Live Data window built into SQL Server Management Studio.

I’ve written about the Live Data window before, and I’ve been using it throughout this series of posts on Extended Events. There’s a lot more to this tool than is immediately apparent. Today, we’re going to explore the basics around this tool.

Read on to see what you can do with this. It’s a lot more powerful than it first appears to be.
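If you do eventually want to shred the XML yourself rather than rely on the Live Data window, a minimal sketch against the built-in system_health session’s file target looks something like this:

-- Read the .xel files for the system_health session and pull a couple
-- of attributes out of each event's XML payload.
WITH raw_events AS
(
    SELECT CAST(event_data AS xml) AS event_xml
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
)
SELECT
    event_xml.value('(event/@name)[1]', 'sysname')            AS event_name,
    event_xml.value('(event/@timestamp)[1]', 'datetime2(3)')  AS event_time
FROM raw_events;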


Monitoring with WhoIsActive

Hadi Fadlallah looks at one of my favorite stored procedures:

For this reason, Adam Machanic (a Microsoft MVP since 2004) developed a more powerful stored procedure called “sp_whoisactive” to fill in the gap between the actual needs of DBAs and the currently provided procedures (sp_who and sp_who2).

In the following sections, we will talk briefly about the sp_who and sp_who2 stored procedures, then we will illustrate how to download and use the sp_whoisactive stored procedure.

Though there is a nicer way to insert WhoIsActive’s output into a table.
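For instance, a sketch of one approach using the procedure’s schema-generation and destination table parameters (the logging table name here is made up; check the documentation for the full parameter list):

-- Ask sp_WhoIsActive for a CREATE TABLE statement matching its output.
DECLARE @schema nvarchar(max);

EXEC dbo.sp_WhoIsActive
    @return_schema = 1,
    @schema = @schema OUTPUT;

-- Point the generated DDL at a logging table of your choosing and run it once.
SET @schema = REPLACE(@schema, '<table_name>', 'dbo.WhoIsActiveLog');
EXEC (@schema);

-- Subsequent collections (e.g., from an Agent job) write straight to the table.
EXEC dbo.sp_WhoIsActive
    @destination_table = 'dbo.WhoIsActiveLog';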


Trimming Strings with T-SQL

Andy Levy saves us all several characters at a time:

Every now and then, we encounter data that needs to be cleaned up because it’s got leading and/or trailing spaces. Or maybe you’re storing short data in a CHAR(N) field, so when you query it, you’re getting trailing spaces. For time immemorial, we’ve had to wrap these fields in rtrim(ltrim(fieldname)) to do the deed.

Effective with SQL Server 2017, that’s no longer the case. 

The eight keystrokes add up over time. In all seriousness, I am happy that TRIM() is a thing in SQL Server 2017. And Andy gives us a little bonus to make it worth your refactoring while.
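A quick sketch of the difference (the padded value is made up):

DECLARE @padded varchar(30) = '   SQL Server   ';

-- The old way, prior to SQL Server 2017
SELECT LTRIM(RTRIM(@padded)) AS old_way;

-- SQL Server 2017 and later
SELECT TRIM(@padded) AS new_way;

-- TRIM can also strip characters other than spaces
SELECT TRIM('.,' FROM '...Hello, world,,,') AS no_more_punctuation;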
