Press "Enter" to skip to content

Curated SQL Posts

PolyBase and Dockerized Hadoop

I have a solution to a problem which vexed me for quite some time:

Quite some time ago, I posted about PolyBase and the Hortonworks Data Platform 2.5 (and later) sandbox.

The summary of the problem is that data nodes in HDP 2.5 and later are on a Docker private network. For most cases, this works fine, but PolyBase expects publicly accessible data nodes by default—one of its performance enhancements with Hadoop was to have PolyBase scale-out group members interact directly with the Hadoop data nodes rather than having everything go through the NameNode and PolyBase control node.

Click through for the solution.


Null Checks in Spark DataFrames

Bipin Patwardhan gives us four techniques for validating whether data in Spark exists:

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”

Click through for those techniques.
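
For a sense of what such null checks look like in practice, here is a minimal PySpark sketch of a few common idioms (these are generic patterns rather than necessarily the exact techniques from the article, and the sample DataFrame is made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("null-checks").getOrCreate()

# Hypothetical DataFrame standing in for the article's "df"
df = spark.createDataFrame([(1, "alice"), (2, None), (3, "carol")], ["id", "name"])

# Idiom 1: filter + count, i.e. how many rows have a null in one column?
null_count = df.filter(F.col("name").isNull()).count()

# Idiom 2: conditional aggregation, i.e. null counts for every column in one pass
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).first().asDict()

# Idiom 3: act on the result, either dropping failing rows or flagging them
clean_df = df.na.drop(subset=["name"])
flagged_df = df.withColumn("name_is_null", F.col("name").isNull())

print(null_count, null_counts)
```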


strace and SQL Server Containers

Anthony Nocentino tries using strace to diagnose SQL Server process activity in a container:

We’re attaching to an already running Docker container running SQL. But what we get is an idle SQL Server process. This is great if we have a running workload we want to analyze, but my goal for all of this is to see how SQL Server starts up, and this isn’t going to cut it.
 
My next attempt was to stop the sql19 container and quickly start the strace container, but the strace container still missed events at the startup of the sql19 container. So I needed a better way.

Don’t worry—Anthony finds a better way.


dbatools: the Book

Chrissy LeMaire has an exciting announcement:

After nearly 10 months of work, early access to Learn dbatools in a Month of Lunches is now available from our favorite publisher, Manning Publications!

For years, people have asked if any dbatools books are available and the answer now can finally be yes, mostly. Learn dbatools in a Month of Lunches, written by me and Rob Sewell (the DBA with the beard), is now available for purchase, even as we’re still writing it. And as of today, you can even use the code bldbatools50 to get a whopping 50% off.

The book is still in active development, so buy a copy now and watch as it evolves.


Backup to URL in Azure

Joey D’Antoni recommends backing up databases to Blob Storage via URL in Azure:

Unlike in your on-premises environment, where you might have up to a 32 Gbps fibre channel connection to your storage array and then a separate 10 Gbps connection to the file share where you write your SQL Server backups, in the cloud you have a single connection to both storage and the rest of the network. That single connection is metered and correlates to the size (and $$$) of your VM. So bandwidth is somewhat sacred, since backups and normal storage traffic go over the same limited tunnel. This doesn’t mean you can’t have good storage performance; it just means you have to think about things. In the case of the customer I mentioned, by writing backups to the file system and then having the Azure backup service back up their VM, they were saturating their network pipe and making regular SQL Server I/Os wait.

Read on to see what the alternative is.


Exploratory Data Analysis with ExPanDaR

Joachim Gassen walks us through the ExPanDaR package in R:

The ‘ExPanDaR’ package offers a toolbox for interactive exploratory data analysis (EDA). You can read more about it here. The ‘ExPanD’ shiny app allows you to customize your analysis to some extent but often you might want to continue and extend your analysis with additional models and visualizations that are not part of the ‘ExPanDaR’ package.

Thus, I am currently developing an option to export the ‘ExPanD’ data and analysis to an R Notebook. While it is not ready for CRAN yet, it seems to work reasonably well and I would love to see some people trying it and letting me know about any bugs or other issues that they encounter. Hence, this blog post.

Looks like an interesting package. H/T R-bloggers


Paired RDDs in Spark

Ramandeep Kaur explains how Paired Resilient Distributed Datasets (PairRDDs) differ from regular RDDs:

So, assuming that you have a fair idea about what Spark is and the basics of RDDs: a paired RDD is one of the kinds of RDDs, and these RDDs contain key/value pairs of data. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.

Paired RDDs bring us back to that key-value pair paradigm which Hadoop’s version of MapReduce brought to the forefront.
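
As a quick illustration of that pairing, here is a generic PySpark sketch (not code from the article; the sample data is invented) showing reduceByKey() and join() at work:

```python
from pyspark import SparkContext

sc = SparkContext(appName="pair-rdd-demo")

# A pair RDD is simply an RDD whose elements are (key, value) tuples
sales = sc.parallelize([("apples", 3), ("pears", 2), ("apples", 5)])
prices = sc.parallelize([("apples", 0.50), ("pears", 0.75)])

# reduceByKey aggregates values per key, combining within each partition
# before shuffling, so only one value per key crosses the network
totals = sales.reduceByKey(lambda a, b: a + b)   # ("apples", 8), ("pears", 2)

# join matches up elements of two pair RDDs that share the same key
revenue = totals.join(prices).mapValues(lambda t: t[0] * t[1])

print(revenue.collect())                         # [("apples", 4.0), ("pears", 1.5)]
sc.stop()
```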


Saving Data in Docker Containers

Anthony Nocentino has a three-part series on persisting SQL Server data in Docker containers. Part 1 takes us through volumes:

Let’s talk about how we can use Docker Volumes and SQL Server to persist data. If we want to run SQL Server in a container, we will want to decouple our data from the container itself. Doing so will enable us to delete the container, replace it, and start up a new one pointing at our existing data. When running SQL Server in a container, it will store data in /var/opt/mssql by default. When the container starts up for the first time, it will put the system databases in that location, and any user databases created will also be placed at this location by default.
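
To make that concrete, here is a rough sketch of mounting a named volume at /var/opt/mssql so the databases outlive the container. It uses the Docker SDK for Python rather than the docker CLI the series uses, and the container name, volume name, and password are made up for illustration:

```python
import docker

client = docker.from_env()

# A named volume has a lifecycle independent of any one container
client.volumes.create(name="sqldata")

# Start SQL Server with /var/opt/mssql backed by that volume; deleting and
# recreating the container later leaves the system and user databases intact
client.containers.run(
    "mcr.microsoft.com/mssql/server:2019-latest",
    name="sql1",
    environment={"ACCEPT_EULA": "Y", "SA_PASSWORD": "S0meStr0ngP@ssword"},
    ports={"1433/tcp": 1433},
    volumes={"sqldata": {"bind": "/var/opt/mssql", "mode": "rw"}},
    detach=True,
)
```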

Part 2 looks at how volumes differ between the Linux and Mac/Windows versions of Docker:

So in my previous post, we discussed Docker Volumes and how they have a lifecycle independent of the container, enabling us to service the container image independent of the data inside the container. Now let’s dig into Volumes a little bit more and learn where Docker actually stores that data on the underlying operating system.

Part 3 ties it in with SQL Server:

Makes sense…we changed where SQL Server is reading/writing data. macOS doesn’t support a file mode called O_DIRECT, which allows for unbuffered read/write access to the file opened using the open system call. O_DIRECT is used by systems that manage their own file caching, like relational database management systems (RDBMS). So as SQL starts up and tries to open the master database with O_DIRECT, the files can’t be opened because the macOS kernel doesn’t support this mode. And this is the reason why we have to have that Linux VM around. That Linux VM will support the O_DIRECT option on the file opened. See more about this at the GitHub issue here.
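
If you want to see that O_DIRECT gap for yourself, here is a small probe (a generic Python sketch, not something from the post) showing that the flag simply does not exist on macOS:

```python
import os
import platform
import tempfile

# SQL Server opens database files with O_DIRECT for unbuffered I/O.
# Python only exposes the flag where the platform's open(2) supports it:
# os.O_DIRECT exists on Linux but not on macOS, which is exactly the
# mismatch that forces Docker on a Mac to run containers inside a Linux VM.
print(platform.system(), "has O_DIRECT:", hasattr(os, "O_DIRECT"))

if hasattr(os, "O_DIRECT"):
    path = os.path.join(tempfile.mkdtemp(), "odirect_probe")
    try:
        # Opening with O_DIRECT demonstrates kernel support; note that some
        # filesystems (tmpfs, for example) may still reject it with EINVAL.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
        os.close(fd)
        print("open() with O_DIRECT succeeded")
    except OSError as err:
        print("open() with O_DIRECT failed:", err)
```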

Definitely worth getting a handle on this if you’re interested in containers.


Azure Data Studio Server Groups

Dave Bland explains how we can get different tab colors in Azure Data Studio:

Have you ever been in an environment where you worked with both DEV\QA servers and Production servers? Of course you have, we all have. However, being in such an environment brings a certain level of risk. We all try to be very careful to run any code we are asked to run on the correct server and in the correct environment. Despite all of our checks and our due diligence, we are still human and do make a mistake from time to time. Azure Data Studio has a nice feature that allows us to change the color of tabs based on the environment the server is in. The image below is an example of what it would look like.

I really like this functionality because it gives you one additional warning that you’re about to run a dev script against a production server by accident.
