Month: May 2019

Load Testing Databases

Ed Elliott shares some tips on load testing a database:

Testing database performance is hard and takes a great deal of work to probably not even do particularly well. Instead of thinking about how you can load test a database think about how you can drive the application.

For instance, if you have a web app then use JMeter to simulate load. If you have a “fat app”, then you might need to write some code to call specific workflows through the application.

Click through for good advice, particularly around what you should not do.

Syncing Google Calendar and Outlook with Microsoft Flow

Eugene Meidinger shows how to keep a couple of calendars in sync:

Back when I worked a normal job, I had two calendars: Office 365 for work and Google for home. Now that I work for myself, that’s a lot more complicated. Sometimes a customer will create an account for me in their network. Sometimes I’ll partner with other consultants and work as part of their team. And of course, I’ve got my own work email at eugene@sqlgene.com.

I need all of these calendars to consolidate to one place. My natural inclination and personal preference is to put it all into Google. Now, there are sync apps available, but this sort of problem is a perfect use case for Flow. A calendar event is created in Outlook, a flow is triggered, and that information is transferred to Google.

I was just complaining about this yesterday and then I see the post this morning. I’m pleased though simultaneously concerned that Eugene has bugged my hotel room.

Automating Jupyter Notebooks

I have some early thoughts on automating Jupyter notebooks:

In the command above, I included the date of execution. That way, I can script this to run once a day, storing results in an HTML file in some directory. Then, I can compare results over time and see when issues popped up.

I can also parse the resultant HTML if need be. Note that this won’t be trivial: even though the output looks like a simple [1] "PROBLEM ALERT", there’s a more complicated HTML blob. 
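
The command in question isn't reproduced above, but Jupyter's nbconvert is the natural tool for this kind of scheduled execution. A minimal sketch (the notebook name here is hypothetical):

# Hypothetical notebook name; --execute runs the notebook top to bottom,
# --to html renders the executed result, and the date lands in the file name.
jupyter nbconvert --to html --execute DailyChecks.ipynb \
    --output "DailyChecks-$(date +%Y-%m-%d).html"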

At some point I’ll probably have follow-up thoughts on the topic. Probably.

Deploying SQL Server Containers to Azure with Terraform

Andrew Pruski has a post covering deployment of SQL Server containers to Azure using Terraform:

What this is going to do is create an Azure Container Instance Group with one container in it, running SQL Server 2019 CTP 2.5. It’ll be publicly exposed to the internet on port 1433 (I’ll cover fixing that in a future post), so we’ll get a public IP that we can use to connect to it.

Notice that the location and resource_group_name are set using variables that retrieve the values of the resource group we are going to create.

Cool! We are ready to go!
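
As a rough sketch of that configuration (not Andrew's exact code: the resource names, image tag, and password below are placeholders, and attribute shapes vary somewhat across azurerm provider versions):

resource "azurerm_resource_group" "rg" {
  name     = "sqlcontainer-rg"   # placeholder
  location = "East US"
}

resource "azurerm_container_group" "sql" {
  name                = "sqlcontainer-aci"   # placeholder
  location            = "${azurerm_resource_group.rg.location}"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  ip_address_type     = "Public"
  os_type             = "Linux"

  container {
    name   = "sqlserver"
    image  = "mcr.microsoft.com/mssql/server:2019-CTP2.5-ubuntu"
    cpu    = "2"
    memory = "4"

    # Publicly exposed on 1433, as the post warns.
    ports {
      port     = 1433
      protocol = "TCP"
    }

    environment_variables = {
      "ACCEPT_EULA" = "Y"
      "SA_PASSWORD" = "placeholder-P@ssw0rd"   # placeholder; don't hard-code in practice
    }
  }
}

The location and resource_group_name references are what the quoted paragraph describes: the container group inherits both from the resource group defined above it.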

Fun stuff, and Andrew promises more.

Azure Open Datasets

Jen Stirrup is pleased that Microsoft is bringing back open datasets:

Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and the geography dimension, since they really helped me to get started with data warehousing.

So I’m glad to say that the concept is back, revamped and rebuilt for the data scientists of today. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, or for improving machine learning accuracy.

Go check it out.

Creating Temp Staging Tables to Avoid Spooling

Bert Wagner shows how you can create your own tables in tempdb to avoid eager or lazy spools:

SQL Server Spool operators are a mixed bag. On one hand, they can negatively impact performance when writing data to disk in tempdb. On the other hand, they allow filtered and transformed result sets to be temporarily staged, making it easier for that data to be reused again during that query execution.

The problem with the latter scenario is that SQL Server doesn’t always decide to use a spool; often it’s happy to re-read (and re-process) the same data repeatedly. When this happens, one option you have is to explicitly create your own temporary staging table that will help SQL Server cache data it needs to reuse.

The other problem with spooling is that the spool doesn’t have indexes and so performance can be awful. When I look at an execution plan, one of my immediate red flags is spooling: if we have that, removing it is one of the first candidates for optimization after the trivial stuff (expected scan/seek behavior, “fat pipes” from excessive row counts, residual I/O, etc.).
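
The pattern itself is short; a minimal sketch, with hypothetical table and column names:

-- Stage the rows SQL Server would otherwise re-read: filter and
-- aggregate once into a temp table.
SELECT o.CustomerID,
       SUM(o.TotalDue) AS TotalDue
INTO #CustomerTotals
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20190101'
GROUP BY o.CustomerID;

-- Unlike a spool, the staging table can be indexed.
CREATE CLUSTERED INDEX CX_CustomerTotals
    ON #CustomerTotals (CustomerID);

-- Later references hit the small, indexed staged set instead of
-- re-processing the base table.
SELECT c.CustomerName,
       ct.TotalDue
FROM dbo.Customers AS c
    INNER JOIN #CustomerTotals AS ct
        ON ct.CustomerID = c.CustomerID;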

Splitting Arrays with OPENJSON

Dave Mason continues a journey into parsing JSON with T-SQL:

Starting with SQL Server 2016, Microsoft provided a STRING_SPLIT function. It is a table-valued function that splits a string into rows of substrings, based on a specified separator character. It’s been a welcome addition that we waited a long time for. It has one shortcoming, though: the order of the output rows is not guaranteed to match the order of the substrings in the input string.

Microsoft also provided support for parsing JSON data starting with SQL Server 2016. I discovered the OPENJSON function can be used to split strings, and it can also return the ordinal position of each substring from the original input string.

There are some limitations which you’d expect, namely around requirements for valid JSON.
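
The core trick, sketched here with hypothetical values (not Dave's exact code), is to dress the delimited string up as a JSON array and read OPENJSON's key column for the ordinal:

DECLARE @list nvarchar(max) = N'alpha,bravo,charlie';

-- STRING_ESCAPE guards against characters that would otherwise
-- produce invalid JSON; REPLACE turns the commas into array delimiters.
SELECT
    [key]   AS OrdinalPosition,  -- zero-based position in the input string
    [value] AS Element
FROM OPENJSON(N'["' + REPLACE(STRING_ESCAPE(@list, 'json'), N',', N'","') + N'"]');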

Horizontal Labels with ggplot

Michael Toth shows us how to ensure we use horizontal text labels in ggplot:

There are several things we could do to improve this graph, but in this guide let’s focus on rotating the y-axis label. This simple change will make your graph so much better. That way, people won’t have to tilt their heads like me to understand what’s going on in your graph:
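
The fix itself is a one-line theme() tweak. A minimal sketch on a built-in dataset:

library(ggplot2)

# angle = 0 makes the y-axis title horizontal; vjust = 0.5 centers it
# vertically along the axis so it sits beside the middle of the scale.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight", y = "Miles per\nGallon") +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5))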

It may not seem like much when you’re creating the visual, but it can make a difference for a viewer.

Repeated Cross-Validation in R

Ludvig Olsen walks us through a couple of nice R packages:

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.
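
In outline, the two packages fit together like so. This is a hedged sketch: df and its columns are hypothetical, and cvms argument names have shifted a bit across versions:

library(groupdata2)
library(cvms)

set.seed(1)

# df: hypothetical data frame with a binary outcome (diagnosis),
# a participant ID, and predictor columns (score, age).
folded <- fold(
  df,
  k = 5,
  cat_col = "diagnosis",    # balance outcome ratios across the folds
  id_col  = "participant",  # keep each participant's rows in one fold
  num_fold_cols = 3         # three fold columns = repeated cross-validation
)

# Cross-validate two candidate formulas at once and compare their metrics.
results <- cross_validate(
  folded,
  formulas  = c("diagnosis ~ score", "diagnosis ~ score + age"),
  fold_cols = paste0(".folds_", 1:3),
  family    = "binomial"
)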

Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers.
