Press "Enter" to skip to content

Month: September 2019

Diagnosing TCP SACKs-Related Slowdown in Databricks

Chris Stevens, et al., walk us through troubleshooting a slowdown after using Linux images patched for the TCP SACK vulnerabilities:

In order to figure out why the straggler task took 15 minutes, we needed to catch it in the act. We reran the benchmark while monitoring the Spark UI, knowing that all but one of the tasks for the save operation would complete within a few minutes. Sorting the tasks in that stage by the Status column, it did not take long for there to be only one task in the RUNNING state. We had found our skewed task and the IP address in the Host column pointed us at the executor experiencing the regression.

This is a nice case study of network troubleshooting, so of course there are Wireshark screenshots in it.


Accessing Data in Azure Data Lake Storage Gen 2

James Serra gives us several methods to access data in Azure Data Lake Storage Gen 2:

With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly are less ETL, less cost, the ability to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to train a model, or for using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level security, column-level security, dynamic data masking, etc.) and the difficulty in accessing it compared to accessing a relational database.

Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it.

Read on for the solution.
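To make the idea concrete, here is a minimal sketch (mine, not from James’s post) of reading ADLS Gen2 data directly from Spark; the storage account, container, and file path are hypothetical placeholders:

```python
# A minimal PySpark sketch: reading a CSV directly from ADLS Gen2.
# "mystorageacct", "mycontainer", and the path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-gen2-read").getOrCreate()

# Authenticate with a storage account key (service principals and
# managed identities are other options).
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    "<storage-account-key>"
)

# ADLS Gen2 paths use the abfss:// scheme.
df = spark.read.csv(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/raw/sales.csv",
    header=True,
    inferSchema=True
)
df.show(5)
```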


Disambiguating Azure SQL Database Classes

Arun Sirpal explains the different types of Azure SQL Database available to us:

I want to do a quick summary post of the many different types of Azure SQL Database available. I am not talking about elastic pools, VMs, etc., but rather the singleton type.

Azure SQL Database (what I call normal mode) – a choice between the DTU model (Basic, Standard, and Premium) and vCore (General Purpose and Business Critical). Within this space, there are two different architecture types used by Microsoft under the covers.

As the product expands, we get more and more options, and Arun clarifies where each fits.
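If you are ever unsure which class a given database is actually running under, you can ask the database itself. A quick sketch using pyodbc (server, database, and credentials are placeholders; the T-SQL property names are real):

```python
# A hedged sketch: querying an Azure SQL Database for its edition and
# service objective. Connection details are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=<password>"
)
cursor = conn.cursor()

# DATABASEPROPERTYEX reports the DTU or vCore service objective,
# e.g. 'S3' for Standard or 'GP_Gen5_2' for General Purpose vCore.
# CAST because the function returns sql_variant.
cursor.execute("""
    SELECT CAST(DATABASEPROPERTYEX(DB_NAME(), 'Edition') AS nvarchar(64)) AS edition,
           CAST(DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS nvarchar(64)) AS service_objective
""")
row = cursor.fetchone()
print(f"Edition: {row.edition}, Service objective: {row.service_objective}")
```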


Alerting on Refresh Failure in Power BI

Matt Allington shows how you can set up an alert to contact you when a scheduled Power BI data refresh fails:

In this article I am going to share with you a concept to manage refresh failures in production PowerBI.com reports. In a perfect world, you should configure your queries so that they “prevent” possible refresh failure issues from occurring, and also notify you when something goes wrong without the report refresh failing in the first place. There are many things that can go wrong with report refreshes, and you probably can’t prevent all of them from occurring. In the example I use in this article, I will show you how to prevent a refresh failure caused by duplicates appearing in a lookup table after the report has been built, the model has been loaded to PowerBI.com, and the scheduled refresh has been set up using a gateway. If during refresh a duplicate key occurs in any of the lookup tables in the data model, the refresh fails and the updated data does not go live.

Matt’s specific scenario is around duplicate data, but it can extend to other issues as well.
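Matt’s technique lives in Power Query, but the underlying idea—catch duplicate keys in a lookup table before they break the model—translates anywhere. A rough sketch of the same check in Python/pandas (the table and column names are made up):

```python
# A language-neutral sketch of the idea: detect duplicate keys in a
# lookup table before they can break a refresh. Data is hypothetical.
import pandas as pd

lookup = pd.DataFrame({
    "ProductKey": [1, 2, 2, 3],   # note the duplicate key 2
    "ProductName": ["Apple", "Banana", "Banana (old)", "Cherry"],
})

dupes = lookup[lookup.duplicated(subset="ProductKey", keep=False)]
if not dupes.empty:
    # Instead of failing the load outright, keep the first row per key
    # and surface the problem rows for follow-up.
    print("Duplicate keys found:\n", dupes)
    lookup = lookup.drop_duplicates(subset="ProductKey", keep="first")
```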


Troubleshooting Power BI Refresh Failures

Annie Xu gives us a few reasons why Power BI refreshes might fail:

License level: Power BI Premium licenses have different levels, and Power BI Premium capacity is shared within a tenant and can be shared by multiple workspaces. The maximum number of models (datasets) that can be refreshed in parallel is based on the capacity node. For example, P1 = 6 models, P2 = 12 models, and P3 = 24 models.

Click through for the set of possibilities.


IIS Log Analysis in Power BI

Joy George Kunjikkur shows how you can build a Power BI dashboard to analyze IIS log files:

As developers, we all might have encountered the situation of analyzing IIS web server logs. During development, the file is small and easy to analyze in Notepad or Excel. But when it grows to GBs on production servers, we use other tools. One such popular tool to query IIS logs is LogParser, a free command-line tool from Microsoft. There are graphical applications around it that can even generate charts. One such free tool is Log Parser Studio, also from Microsoft.

Once we move up in our careers and have to deal with managers, product stakeholders, or even the CXO to show what the IIS logs say, we need more visuals or a dashboard reflecting the IIS logs. Though we can create visuals using Log Parser Studio, it is tedious to create reports and charts one by one.

Click through for a solution.
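If you want to poke at the same data outside of LogParser, IIS logs are just space-delimited W3C text with a #Fields header line. A hypothetical sketch in Python/pandas (the log file name is a placeholder):

```python
# A rough sketch of parsing a W3C-format IIS log into a DataFrame.
# Real deployments usually span many log files; this handles one.
import pandas as pd

def read_iis_log(path):
    columns, rows = [], []
    with open(path) as f:
        for line in f:
            if line.startswith("#Fields:"):
                # e.g. "#Fields: date time cs-method cs-uri-stem sc-status ..."
                columns = line.split()[1:]
            elif line.strip() and not line.startswith("#"):
                rows.append(line.split())
    return pd.DataFrame(rows, columns=columns)

log = read_iis_log("u_ex190901.log")
# Top requested URLs by hit count, a typical dashboard measure.
print(log["cs-uri-stem"].value_counts().head(10))
```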


Upgrading Azure Kubernetes Service

Chris Taylor shows that upgrading Azure Kubernetes Service across several versions means jumping through intermediate point updates:

As it was late at night, my brain wasn’t working as it should, but I thought I’d put a quick blog out there to say that if you are on v1.11.5 and want to upgrade to >= v1.13.10, then you have to do this in a two-stage process by upgrading to v1.12.8 first:

Fortunately, upgrading is pretty easy using the Azure command line or even the Azure portal.
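As a rough illustration of the two-hop process, here is a sketch driving the Azure CLI from Python; the resource group and cluster names are made up:

```python
# A sketch of the two-stage AKS upgrade described above, driving the
# Azure CLI from Python. Resource group and cluster names are made up.
# ("az aks get-upgrades" will list the versions available to you.)
import subprocess

def az_aks_upgrade(version):
    subprocess.run(
        ["az", "aks", "upgrade",
         "--resource-group", "my-rg",
         "--name", "my-aks-cluster",
         "--kubernetes-version", version,
         "--yes"],
        check=True,
    )

# v1.11.5 cannot jump straight to v1.13.x; hop through v1.12.8 first.
az_aks_upgrade("1.12.8")
az_aks_upgrade("1.13.10")
```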


WVPlots

Nina Zumel announces a new version of WVPlots on CRAN:

WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of this is that the older visualizations had our preferred color schemes hard-coded in. More recent additions to the package sometimes had palette or color controls, but not in a consistent way. Making color controls more consistent has been a “todo” for a while—one that I’d been putting off. A recent request from user Brice Richard (thanks Brice!) has pushed me to finally make the changes.

Click through to see what’s changed and for an example vignette.


RDDs, DataFrames, and Datasets in Spark

Brad Llewellyn walks us through the three key data structures in Apache Spark:

We see that creating an RDD can be done with one easy function. In this snippet, sc represents the default SparkContext. This is extremely important, but is better left for a later post. SparkContext offers the .textFile() function, which creates an RDD from a text file, parsing each line into its own element in the RDD. These lines happen to represent CSV records. However, there are many other common examples that use lines of free text.

It should also be noted that we can use the .collect() and .take() functions to view the contents of an RDD.  The difference between .collect() and .take() is that .take() allows us to specify the number of elements we want to retrieve, whereas .collect() returns the entire RDD.

My tendencies are probably skewed pretty heavily, but I live in DataFrames and almost never use raw RDDs anymore.
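To make the comparison concrete, here is a small hypothetical PySpark sketch showing both the RDD route Brad describes and the DataFrame route; the file path is a placeholder:

```python
# A small sketch contrasting raw RDDs with DataFrames in PySpark.
# "data/records.csv" stands in for any CSV of records.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD route: each line of the text file becomes one string element.
rdd = sc.textFile("data/records.csv")
print(rdd.take(5))   # retrieve just the first five elements
# rdd.collect() would pull the entire RDD to the driver--use with care.

# DataFrame route: schema-aware, and where most day-to-day work happens.
df = spark.read.csv("data/records.csv", header=True, inferSchema=True)
df.show(5)
```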


Computing Time to Payment on Invoices

Daniel Hutmacher has a painful but realistic problem to solve:

Here’s an example customer. You’ll notice right off the bat that we’re sending this customer an invoice on the 20th of every month. To add some complexity, the customer will arbitrarily pay parts of the invoiced amount over time, and to add insult to injury, the banking interface won’t tell us which invoice the customer is paying for, so we’ll just decide that each payment goes towards the oldest outstanding invoice.

Our task is to calculate how many days have elapsed, for each invoice, from invoice date to payment in full.

Daniel has an excellent solution to the problem, so check it out.
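Daniel solves this in T-SQL; purely to spell out the allocation logic, here is a hypothetical Python sketch that applies each payment to the oldest invoice with an outstanding balance and records when each invoice is paid in full (the amounts and dates are made up):

```python
# A sketch of FIFO payment allocation: each payment goes toward the
# oldest invoice with an outstanding balance. Data is hypothetical.
from datetime import date

invoices = [  # oldest first
    {"date": date(2019, 7, 20), "amount": 100.0, "remaining": 100.0, "paid_date": None},
    {"date": date(2019, 8, 20), "amount": 100.0, "remaining": 100.0, "paid_date": None},
]
payments = [(date(2019, 8, 5), 60.0), (date(2019, 8, 28), 150.0)]

for pay_date, pay_amount in payments:
    for inv in invoices:
        if pay_amount <= 0:
            break
        if inv["remaining"] > 0:
            applied = min(inv["remaining"], pay_amount)
            inv["remaining"] -= applied
            pay_amount -= applied
            if inv["remaining"] == 0:
                inv["paid_date"] = pay_date  # paid in full by this payment

for inv in invoices:
    if inv["paid_date"]:
        days = (inv["paid_date"] - inv["date"]).days
        print(f"Invoice dated {inv['date']} paid in full after {days} days")
```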
