Press "Enter" to skip to content

Author: Kevin Feasel

Azure Data Studio October Release

Alan Yu announces the October release of Azure Data Studio:

As announced at Microsoft Ignite, one of the most exciting extensions to share in our September GA release was the release of the SQL Server 2019 Preview extension. If you were following the blog announcements, starting with SQL Server 2019 preview, SQL Server big data clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS Docker containers running on Kubernetes.

These components are running side by side to enable you to read, write, and process big data from Transact-SQL or Spark. SQL Server big data clusters allow you to easily combine and analyze your high-value relational data with high-volume big data. To learn about all the excitement of SQL Server Big Data Clusters, follow the documentation here.

These experiences are built as an extension to Azure Data Studio. We can go into full depth about all the great capabilities this extension includes, but deep-diving into any one of these features can be a full blog post itself. Here is a high-level summary of these features, and then you can see a full demo of the features below.

There’s plenty more in here as well.

Comments closed

Beautiful Deadlock Graphs And Tying RIDs Back To Object Names

Josh Simar shares a deadlock graph which I have entitled The Pit Of Despair:

I can’t make heads or tails of that but I can tell you that seems like a really bad brawl for resources. It’s like a Jerry Springer show with a few extras thrown in. Since I knew that my graph wasn’t going to be helpful in this instance I went to the actual xml and tried to figure out how I could tune this to make it better in the future. I needed to know exactly where the issue was so the waitresource pointer is a good place to start.

You will see many blog articles on how to find SQL wait resources when the resource type is a key, a page, or an object (I suggest Kendra Little’s blog post) There is however a noticeable glut on articles explaining RID (a RID is a key on a table with no clustered index). I finally found how to tie a RID to an actual resource name but it was used for corruption so the details were a bit hazy at first.

Click through for this work of database art as well as a script which links RIDs back to specific object names.

Comments closed

Fuzzy Matching In Power BI

Reza Rad looks at a preview feature in Power BI to perform fuzzy matching:

Fuzzy Merge is a way of joining two tables together, but not on exact matching criteria, but on the similarity threshold. If you want to learn what is the Merge operation itself and the difference of that with Append, read my blog post here. If you want to learn more details about what is Merge and the different types of join or merge, read my other blog post here. Merge or Join is simply the act of combining two tables with different structures, but with link/join columns, to access columns from one of the tables in the other one.

To use Merge operation on the “source” query, You can click on the Merge Queries as New option in the Home tab of Power Query Editor window.

This kind of functionality was in SQL Server Integration Services as well but suffered from a huge scaling problem, where the component worked pretty well with small numbers of records, but once you got into the 100K+ range, everything started to fall apart.  I’d be interested to see where that limit is in Power BI.

Comments closed

Understanding Query Optimizer Timeouts

Joseph Pilov answers frequently asked questions about SQL Server’s query optimizer when it times out:

What Is Optimizer Timeout?

SQL Server uses a cost-based query optimizer. Therefore, it selects a query plan with the lowest cost after it has built and examined multiple query plans. One of the objectives of the SQL Server query optimizer (QO) is to spend a “reasonable time” in query optimization as compared to query execution. Therefore, QO has a built-in threshold of tasks to consider before it stops the optimization process. If this threshold is reached before QO has considered most, if not all, possible plans then it has reached the Optimizer TimeOut limit. An event is reported in the query plan as Time Out under “Reason For Early Termination of Statement Optimization.” It’s important to understand that this threshold isn’t based on clock time but on number of possibilities considered. In current SQL QO versions, over a half million possibilities are considered before a time out is reached.

Optimizer timeout is designed in Microsoft SQL Server and in many cases encountering it is not a factor affecting query performance. However, in some cases the SQL query plan choice may be affected by optimizer timeout and thus performance could be impacted. When you encounter such issues, if you understand optimizer timeout mechanism and how complex queries can be affected in SQL Server, it can help you to better troubleshoot and improve your performance issue.

Read the whole thing.

Comments closed

Packages For Testing R Packages

Maelle Salmon shows us how to test our R packages within R:

If you’re brand-new to unit testing your R package, I’d recommend reading this chapter from Hadley Wickham’s book about R packages.

There’s an R package called RUnit for unit testing, but in the whole post we’ll mention resources around the testthat package since it’s the one we use in our packages, and arguably the most popular one. testthat is great! Don’t hesitate to reads its docs again if you started using it a while ago, since the latest major release added the setup() and teardown() functions to run code before and after all tests, very handy.

To setup testing in an existing package i.e. creating the test folder and adding testthat as a dependency, run usethis::use_testthat(). In our WIP pRojects package, we set up the tests directory for you so you don’t forget. Then, in any case, add new tests for a function using usethis::use_test().

The testthis package might help make your testing workflow even smoother. In particular, test_this() “reloads the package and runs tests associated with the currently open R script file.”, and there’s also a function for opening the test file associated with the current R script.

This is an area where I know I need to get better, and Maelle gives us a plethora of tooling for tests.

Comments closed

Clients For Working With HDFS

Mark Litwintschik reviews several clients for working with the Hadoop Distributed Filesystem:

The Hadoop Distributed File System (HDFS) allows you to both federate storage across many computers as well as distribute files in a redundant manor across a cluster. HDFS is a key component to many storage clusters that possess more than a petabyte of capacity.

Each computer acting as a storage node in a cluster can contain one or more storage devices. This can allow several mechanical storage drives to both store data more reliably than SSDs, keep the cost per gigabyte down as well as go some way to exhausting the SATA bus capacity of a given system.

Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many that interact with HDFS directly it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I’ll walk through a few of these offerings.

Read on for reviews of those offerings.

Comments closed

Monitoring Apache NiFi With A Custom Dashboard

Tim Spann has started a new series on monitoring Apache NiFi:

In this little proof of concept work, we grab some of these flows process them in Apache NiFi and then store them in Apache Hive 3 tables for analytics. We should probably push the data to HBase for aggregates and Druid for time series. We will see as this expands.

There are also other data access options including the NiFi REST API and the NiFi Python APIs.

Boostrap Notifier

  • Send notification when the NiFi starts, stops or died unexpectedly
  • Two OOTB notifications
  • Email notification service
  • HTTP notification service
  • It’s easy to write a custom notification service

Reporting Tasks

  • AmbariReportingTask (global, per process group)

  • MonitorDiskUsage(Flowfile, content, provenance repositories)

  • MonitorMemory

Much of this is an overview of the tools and measures available.

Comments closed

Reshaping Data Frames With tidyr

Anisa Dhana shows off some of the data reshaping functionality available in the tidyr package:

As it is shown above, the variable agegp has 6 groups (i.e., 25-34, 35-44) which has different alcohol intake and smoking use combinations. I think it would be interesting to transform this dataset from long to wide and to create a column for each age group and show the respective cases. Let see how the dataset will look like.

dt %>% 
  spread(agegp, ncases) %>% 
  slice(1:5)

Click through for a few additional transformations.

Comments closed

Querying Web API From Power BI

Paul Turley shows us how to hit secured Web API endpoints with Power BI:

Having recently worked-through numerous issues with API data feeds and deployed report configurations, I’ve learned a few important best practices and caveats – at least for some common use cases.  In one example, we have a client who expose their software-as-a-service (SaaS) customer data through several web API endpoints.  Each SaaS customer has a unique security key which they can use with Power BI, Power Query or Excel and other tools to create reporting solutions.  If we need a list of available products, it is a simple matter to create a long URL string consisting of the web address for the  endpoint, security key and other parameters; an then just pass this to Power Query as a web data source.  However, it’s not quite that easy for non-trivial reporting scenarios.

Thanks to Jamie Mikami from CSG Pro for helping me with the Azure function code for demonstrating this with demo data.  Thanks also to Chris Webb who has meticulously covered several facets of API data sources in great detail on his blog, making this process much easier.

Click through for the instructions.

Comments closed

Understanding ANY And ALL In SQL

Doug Kline explains the ANY and ALL operators in SQL:

-- note that this creates a single column of values
-- which could be used in something like IN
-- for example
SELECT   1
WHERE    12 IN    (  SELECT   tempField
                     FROM     (VALUES(11),(12),(7)) tempTable(tempField))

-- I could rephrase this as:
SELECT   1
WHERE    12 = ANY (  SELECT   tempField
                     FROM     (VALUES(11),(12),(7)) tempTable(tempField))

I rarely see these operators in the wild and might have used them in production code a couple of times if that.

Comments closed