
Day: October 6, 2020

Xenographs

Alex Velez talks about xenographs:

I recall the first time I came across a horizon chart. Two thoughts came to mind: 1) this looks cool; and 2) I don’t have the energy to figure this out. Fast forward to now. I’ve learned how to read horizon charts, and I’ve even identified a few good use cases for them. This illustrates both the problem and the potential of xenographs. Let’s explore the potentially problematic side first.

Novel approaches to visualizing data can intimidate audiences. They introduce a learning curve because a never-before-seen graph typically requires time and energy to decipher. This obstacle could be enough to dissuade audiences from consuming the data altogether. Even if your audience does invest their time, the resulting conversation is often about reading the visual instead of the primary takeaway. This seems counterintuitive, especially in the explanatory analytics space, but it doesn’t mean we should denounce everything novel.

My response to this depends heavily on the medium. If you’re giving a presentation, a novel or underused chart can be good if it helps tell the story. You have the advantage of being there to explain the dynamics of the diagram to people who have never seen it before. For an informative article, you have some ability to elaborate, as in this bracket win probabilities diagram, which is exactly the type of thing you’d see in certain newspapers and magazines. But unless your visual is immediately intuitive (and I’d consider things like a Manhattan plot or maybe a Dot-boxplot to be intuitive enough for most audiences), I don’t think I would include many of those on public-facing or corporate dashboards, as they’re liable to confuse people and you might not have the space available to explain how the visual works.

Persisting an RDD in Spark

Sarfaraz Hussain takes us through caching / persisting RDDs in Apache Spark:

Spark RDD persistence is an optimization technique which saves the result of RDD evaluation in cache memory. Using this we save the intermediate result so that we can use it further if required. It reduces the computation overhead.

When we persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other actions on that RDD (or RDDs derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Read on to see how you can do this and some of the options available to you when caching. This is extremely useful when working with external data sources, as then you don’t risk hitting the external source multiple times.
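
The RDD API handles this with rdd.cache() and rdd.persist() (the latter taking a storage level). If you mostly work at the SQL layer instead, Spark SQL exposes a similar capability through CACHE TABLE; here is a minimal sketch, assuming a Spark SQL session and an existing table or view named my_table:

-- Mark the table for caching; LAZY defers materialization until the first
-- query that actually reads it (omit LAZY to cache eagerly).
CACHE LAZY TABLE my_table;

-- The first query pays the full evaluation cost and populates the cache...
SELECT COUNT(*) FROM my_table;

-- ...and later queries against my_table reuse the cached partitions.
SELECT some_column, COUNT(*) FROM my_table GROUP BY some_column;

-- Release the memory once you no longer need it.
UNCACHE TABLE my_table;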

Multi-Subnet Availability Groups and MultiSubnetFailover

Andy Mallon takes us through an all-too-common scenario:

With a default configuration, multi-subnet AGs require that the clients connecting to them include MultiSubnetFailover=true as a connection string attribute. This attribute tells the driver to expect DNS to provide multiple IP addresses for the Listener name, and to try all of them to find the correct IP to connect to for that network name. Clients that do not specify this attribute will get multiple IPs and not know how to handle them properly; most drivers will pick up one of the returned IPs at random (or maybe just seemingly random), and try to connect to that. This can result in random (or seemingly random) connection failures when it picks the wrong IP.

However, not every client or application will support this connection string attribute. In my experience there are two extremely common reasons that you can’t use MultiSubnetFailover=true:

Read on to see what you can do in that case.
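
As a server-side sanity check, you can see whether a listener actually spans multiple subnets (and therefore hands out more than one IP for its DNS name) by querying the availability group catalog views. A rough sketch, run from one of the replicas:

-- One row per listener IP address; a listener with rows in more than one
-- subnet is the multi-subnet case described above, and clients connecting
-- to it should include MultiSubnetFailover=true in their connection strings.
SELECT l.dns_name,
       l.port,
       ip.ip_address,
       ip.network_subnet_ip,
       ip.network_subnet_ipv4_mask,
       ip.state_desc
FROM sys.availability_group_listeners AS l
JOIN sys.availability_group_listener_ip_addresses AS ip
    ON ip.listener_id = l.listener_id;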

ODBC Scalar Functions

Shane O’Neill discovers ODBC scalar functions:

Can you imagine my shock when I came across a piece of code that not only was not for finding and replacing but, even though I did not think it would compile, it did!

If you can imagine my shock, then you’re going to need to increase it more when I tell you that there is a whole family of the same functions!
Here is the code that threw me for a loop the first time I saw it.

SELECT {d '1970-01-01'};

I wasn’t familiar with this syntax either, but if you work heavily with multiple data sources, it can be quite useful. For example, Teradata and DB2 support them, in addition to Shane’s examples of SQL Server and Oracle.
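
To give a taste of the rest of the family, here are a few more ODBC escape sequences which, as far as I know, run as-is on SQL Server:

-- Date/time literals plus a few ODBC scalar functions
SELECT {d '1970-01-01'}            AS odbc_date_literal,
       {ts '1970-01-01 00:00:00'}  AS odbc_timestamp_literal,
       {fn CURDATE()}              AS odbc_current_date,
       {fn UCASE('shane')}         AS odbc_upper,
       {fn CONCAT('Cur', 'ated')}  AS odbc_concat;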

Integrating Power BI into Azure Synapse Analytics

Ginger Grant walks us through two methods of integrating Power BI and Azure Synapse Analytics:

From within Synapse you have the ability to access a Power BI workspace so that you can use Power BI from within Synapse. Your Power BI tenant can be in a different data center than the Azure Synapse Workspace, but they both must be in the same Power BI Tenant. You can use Power BI to look at any data you wish, as the data you use can be from any location. When this blog was written, it was only possible to connect to one Power BI workspace from within Azure Synapse. In order to run Power BI as shown here, first I needed to create a Linked Service from within Synapse.

Read on for more.

ANSI String Comparison

Greg Dodd takes us through one of the oddities of ANSI string comparison:

So let’s play a game: what will the output be if I pass in the following? I’ve included my guesses:

EXEC AreStringsTheSame N'Greg', N'Greg' --Yes
EXEC AreStringsTheSame N'Greg', N'Dodd' --No
EXEC AreStringsTheSame N'', N'' --Yes
EXEC AreStringsTheSame N' ', N' ' --Yes
EXEC AreStringsTheSame N' ', N'' --No
EXEC AreStringsTheSame N' Greg', 'Greg' --No, leading space
EXEC AreStringsTheSame N'Greg ', 'Greg' --No, trailing space

Read on to see what Greg ended up guessing wrong and why.
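
Without spoiling too much of the post, the behaviour in play is that ANSI/SQL-92 comparison pads the shorter string with spaces before comparing, so trailing spaces are effectively ignored while leading spaces still matter. A standalone illustration (plain comparisons, not Greg’s procedure):

-- Trailing spaces are padded away during comparison; leading spaces are not.
SELECT CASE WHEN N'Greg ' = N'Greg' THEN 'Same' ELSE 'Different' END AS trailing_space,
       CASE WHEN N' ' = N'' THEN 'Same' ELSE 'Different' END AS space_vs_empty,
       CASE WHEN N' Greg' = N'Greg' THEN 'Same' ELSE 'Different' END AS leading_space;
-- Returns Same, Same, Different. LEN() ignores trailing spaces as well,
-- whereas DATALENGTH() counts them, which is one way to tell such strings apart.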

Troubleshooting High Threadpool Waits and Deadlocked Schedulers

Eitan Blumin takes us through a troubleshooting scenario:

In short, high THREADPOOL waits can happen when SQL Server doesn’t have enough “worker threads” to handle new tasks, which could cause SQL Server to hang and refuse connections. When a task is waiting for a worker thread to become available, that wait type is called THREADPOOL wait.

A background process, called "Scheduler Monitor", will identify when the same worker threads are "stuck" in the same state for 60 seconds or more, in which case it will resolve the issue as a Deadlocked Scheduler, and that’ll cause dropped connections, rollbacks, and even failovers.

When a Deadlocked Scheduler event happens, SQL Server will automatically generate a memory dump file (SQLDump#####.mdmp), and log the incident in the SQL Server Error Log.

Read on to understand what causes this as well as why we always fumble our keys under the car as the scary monster approaches.
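
If you want to check whether an instance is already accumulating this wait, here are a couple of starting-point queries (assuming you can still get a connection in; the Dedicated Admin Connection helps when you can’t):

-- Cumulative THREADPOOL waits since the last restart or wait-stats clear
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';

-- Worker and task counts per scheduler; a growing work_queue_count means
-- tasks are waiting for workers to free up
SELECT scheduler_id, current_tasks_count, runnable_tasks_count,
       current_workers_count, active_workers_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';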

Adding Row Numbers to ADF Data Flows

Rayis Imayev shows two methods of generating unique, ascending row numbers in Azure Data Factory data flows:

Adding a row number to your dataset could be a trivial task. Both ANSI and Spark SQL have the row_number() window function that can enrich your data with a unique number for your whole or partitioned data recordset.

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.

Read on for a couple of attempts which didn’t work, followed by two that do, including an assist from Joseph Edwards.
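
For comparison with the data flow approach, this is the window function Rayis mentions; the table and column names below are made up for illustration:

-- One unique, ascending number across the whole result set
SELECT ROW_NUMBER() OVER (ORDER BY OrderDate, OrderID) AS RowNum,
       OrderID,
       OrderDate
FROM dbo.Orders;

-- Or restarted within each partition
SELECT ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate, OrderID) AS RowNumPerCustomer,
       CustomerID,
       OrderID,
       OrderDate
FROM dbo.Orders;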
