Month: October 2020

Persisting an RDD in Spark

Sarfaraz Hussain takes us through caching / persisting RDDs in Apache Spark:

Spark RDD persistence is an optimization technique that saves the result of RDD evaluation in cache memory. Using this, we save the intermediate result so that we can use it again if required, which reduces the computation overhead.

When we persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other actions on that RDD (or RDDs derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Read on to see how you can do this and some of the options available to you when caching. This is extremely useful when working with external data sources, as then you don’t risk hitting the external source multiple times.
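
The RDD API calls behind this are rdd.persist() and rdd.cache(). If you work at the SQL layer instead, Spark exposes an analogous CACHE TABLE command; here is a minimal Spark SQL sketch (the table name is hypothetical):

-- Cache the table so later queries reuse the stored partitions instead of recomputing them
CACHE TABLE sales_raw OPTIONS ('storageLevel' 'MEMORY_AND_DISK');

-- This count is now served from the cache
SELECT COUNT(*) FROM sales_raw;

-- Release the cached copy once you no longer need it
UNCACHE TABLE sales_raw;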

Multi-Subnet Availability Groups and MultiSubnetFailover

Andy Mallon takes us through an all-too-common scenario:

With a default configuration, multi-subnet AGs require that the clients connecting to them include MultiSubnetFailover=true as a connection string attribute. This attribute tells the driver to expect DNS to provide multiple IP addresses for the Listener name, and to try all of them to find the correct IP to connect to for that network name. Clients that do not specify this attribute will get multiple IPs and not know how to handle them properly–most drivers will pick up one of the returned IPs at random (or maybe just seemingly random), and try to connect to that. This can result in random (or seemingly random) connection failures when it picks the wrong IP.

However, not every client or application will support this connection string attribute. In my experience there are two extremely common reasons that you can’t use MultiSubnetFailover=true:

Read on to see what you can do in that case.
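
For reference, here is what the attribute looks like in an ADO.NET-style connection string (the server and database names are hypothetical):

Server=tcp:ag-listener.contoso.com,1433;Database=SalesDb;Integrated Security=SSPI;MultiSubnetFailover=True;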

ODBC Scalar Functions

Shane O’Neill discovers ODBC scalar functions:

Can you imagine my shock when I came across a piece of code that not only wasn’t for finding and replacing, but that compiled even though I didn’t think it would!

If you can imagine my shock, then you’re going to need to increase it even more when I tell you that there is a whole family of these functions!
Here is the code that threw me for a loop the first time I saw it.

SELECT {d '1970-01-01'};

I wasn’t familiar with this syntax either, but if you work heavily with multiple data sources, it can be quite useful: for example, Teradata and DB2 support these escape sequences, as do Shane’s examples of SQL Server and Oracle.
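
That date literal is just one member of a larger family of ODBC escape sequences. A few more that run as-is on SQL Server:

SELECT {d '1970-01-01'}, {t '23:59:59'}, {ts '1970-01-01 23:59:59'}; -- date, time, and timestamp literals
SELECT {fn CURDATE()}; -- current date
SELECT {fn UCASE('shane')}; -- upper-case a string
SELECT {fn CONCAT('O''Ne', 'ill')}; -- concatenate two strings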

Integrating Power BI into Azure Synapse Analytics

Ginger Grant walks us through two methods of integrating Power BI and Azure Synapse Analytics:

From within Synapse you have the ability to access a Power BI workspace so that you can use Power BI from within Synapse. Your Power BI tenant can be in a different data center than the Azure Synapse Workspace, but both must be in the same tenant. You can use Power BI to look at any data you wish, as the data you use can be from any location. When this blog was written, it was only possible to connect to one Power BI workspace from within Azure Synapse. In order to run Power BI as shown here, first I needed to create a Linked Service from within Synapse.

Read on for more.

ANSI String Comparison

Greg Dodd takes us through one of the oddities of ANSI string comparison:

So let’s play a game: what will the output be if I pass in the following? I’ve included my guesses.

EXEC AreStringsTheSame N'Greg', N'Greg' --Yes
EXEC AreStringsTheSame N'Greg', N'Dodd' --No
EXEC AreStringsTheSame N'', N'' --Yes
EXEC AreStringsTheSame N' ', N' ' --Yes
EXEC AreStringsTheSame N' ', N'' --No
EXEC AreStringsTheSame N' Greg', 'Greg' --No, leading space
EXEC AreStringsTheSame N'Greg ', 'Greg' --No, trailing space

Read on to see what Greg ended up guessing wrong and why.
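
Greg’s procedure definition isn’t in the excerpt; a minimal sketch consistent with those calls might look like this:

CREATE OR ALTER PROCEDURE AreStringsTheSame
    @String1 NVARCHAR(100),
    @String2 NVARCHAR(100)
AS
BEGIN
    -- ANSI comparison: the shorter string is padded with trailing
    -- spaces before the equality check is made
    IF @String1 = @String2
        PRINT 'Yes';
    ELSE
        PRINT 'No';
END;

The surprises all trace back to that padding rule: trailing spaces alone won’t make two strings unequal, though leading spaces will.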

Troubleshooting High Threadpool Waits and Deadlocked Schedulers

Eitan Blumin takes us through a troubleshooting scenario:

In short, high THREADPOOL waits can happen when SQL Server doesn’t have enough “worker threads” to handle new tasks, which could cause SQL Server to hang and refuse connections. When a task is waiting for a worker thread to become available, that wait type is called THREADPOOL wait.

A background process, called the “Scheduler Monitor”, will identify when the same worker threads are “stuck” in the same state for 60 seconds or more, in which case it will resolve the issue as a Deadlocked Scheduler, causing dropped connections, rollbacks, and even failovers.

When a Deadlocked Scheduler event happens, SQL Server will automatically generate a memory dump file (SQLDump#####.mdmp), and log the incident in the SQL Server Error Log.

Read on to understand what causes this as well as why we always fumble our keys under the car as the scary monster approaches.
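
If you want to see whether an instance has been accumulating these waits, or how close it is to its worker thread cap, a couple of quick queries against the standard DMVs can help:

-- Cumulative THREADPOOL waits since the last restart
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';

-- Current worker usage versus the configured maximum
SELECT (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers,
       SUM(current_workers_count) AS current_workers
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';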

Adding Row Numbers to ADF Data Flows

Rayis Imayev shows two methods of generating unique, ascending row numbers in Azure Data Factory data flows:

Adding a row number to your dataset could be a trivial task. Both ANSI SQL and Spark SQL have the row_number() window function, which can enrich your data with a unique number across your whole (or partitioned) data recordset.

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.

Read on for a couple of attempts that didn’t work, followed by two that do, including an assist from Joseph Edwards.
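
For comparison, the window function version Rayis mentions is a one-liner in SQL; the challenge in the post is reproducing that behavior inside a data flow (the table and column names here are hypothetical):

SELECT ROW_NUMBER() OVER (ORDER BY OrderDate, OrderID) AS RowNum,
       OrderID,
       OrderDate
FROM dbo.Orders;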

Stream Processing with ksqldb

Michael Drogalis takes us through how stream processing works with ksqldb:

ksqlDB, the event streaming database, is becoming one of the most popular ways to work with Apache Kafka®. Every day, we answer many questions about the project, but here’s a question with an answer that we are always trying to improve: How does ksqlDB work?

The mechanics behind stream processing can be challenging to grasp. The concepts are abstract, and many of them involve motion—two things that are hard for the mind’s eye to visualize. Let’s pop open the hood of ksqlDB to explore its essential concepts, how each works, and how it all relates to Kafka.

Click through for a demo with animations.
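
To make the concepts concrete, here is what a stream and a persistent query look like in ksqlDB’s SQL dialect (the topic and column names are hypothetical):

-- Declare a stream over an existing Kafka topic
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- A persistent query: the aggregate updates continuously as new events arrive
CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id
  EMIT CHANGES;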

Reasons Data Science Projects Fail

Ryohei Fujimaki summarizes some of the reasons why data science projects can fail:

According to Gartner analyst Nick Heudecker, over 85% of data science projects fail. A report from Dimensional Research indicated that only 4% of companies have succeeded in deploying ML models to a production environment.

Even more critical, the economic downturn caused by the COVID-19 pandemic has placed increased pressure on data science and BI teams to deliver more with less. In this down market, organizations are reassessing which AI/ML models they should develop, how to optimize resources and how to best use valuable budget dollars for maximum impact. In this type of environment, AI/ML project failure is simply not acceptable.

That 85% sounds suspiciously like the percentage of failed business intelligence and data warehouse projects, as well as the percentage of failed big data projects. It’s close enough that it makes me want to come up with some overarching idea that projects based on the consolidation of multiple independent data systems across several business units are liable to fail about 5/6 of the time.

Delta Lake DML Internals

Tathagata Das, et al, take us through how Delta Lake handles update, delete, and merge operations:

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.

Click through for a video as well as a blog post with the details.
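
In SQL, the operations described above look like this (the table name and predicate are hypothetical); note that VACUUM is the separate step which actually removes tombstoned files:

-- Rewrites the affected files; the old ones stay on disk, tombstoned in the transaction log
DELETE FROM events WHERE event_date < '2020-01-01';

-- Permanently removes files that no version within the retention window still references
VACUUM events RETAIN 168 HOURS;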
