Press "Enter" to skip to content

Curated SQL Posts

Getting beyond Pandas

Shittu Olumide recommends a few other packages:

If you’ve worked with data in Python, chances are you’ve used Pandas many times. And for good reason; it’s intuitive, flexible, and great for day-to-day analysis. But as your datasets start to grow, Pandas starts to show its limits. Maybe it’s memory issues, sluggish performance, or the fact that your machine sounds like it’s about to lift off when you try to group by a few million rows.

That’s the point where a lot of data analysts and scientists start asking the same question: what else is out there?

Read on for seven options, including six libraries and one built-in programming technique.

Leave a Comment

Animated Maps in R with gganimate

Osheen MacOscar looks at a new version of an old package:

In this blog post, we are going to use data from the {gapminder} R package, along with global spatial boundaries from ‘opendatasoft’. We are going to plot the life expectancy of each country in the Americas and animate it to see the changes from 1957 to 2007.

The {gapminder} package we are using is from the Gapminder foundation, an independent educational non-profit fighting global misconceptions. The cover issues like global warming, plastic in the oceans and life satisfaction.

There are several common gotchas that Osheen takes us through before building an animated map of the western hemisphere.

Leave a Comment

Oracle Password-Related Profile Settings

David Fitzjarrell takes a look at some settings:

Passwords expire, and, depending upon how various profiles are configured, accounts are either locked or provided a grace period during which the old password can be changed. In any recent enterprise password verification functions are provided to police new passwords to ensure a modicum of security. Let’s dig into what Oracle provides to assist in password security.

Basic elements of password security that Oracle provides start with the profile; listed below are the associated resources:

Read on for the available options you can set on a per-profile basis.

Leave a Comment

Auto-Scale Billing for Spark in Microsoft Fabric now GA

Santhosh Kumar Ravindran announces a feature in general availability:

We’re thrilled to announce the general availability (GA) of Autoscale Billing for Apache Spark in Microsoft Fabric — a serverless billing model designed to offer greater flexibility, transparency, and cost efficiency for running Spark workloads at scale.

With this model now fully supported, Spark Jobs can run independently of your Fabric capacity and are billed on a pay-as-you-go basis — similar to how Spark works in Azure Synapse. This gives teams the freedom to scale compute as needed without impacting other workloads running on your shared Fabric capacity.

I’m of two minds here. On the one hand, there is value to having this as an option. On the other hand, one of the talking points for Microsoft Fabric is that you have one billing model. But because it’s an optional thing you can enable rather than something you must use, I’m fine with it.

Leave a Comment

Stored Procedures and Headers

Andy Brownsword lays out an argument:

Code is an ever moving target. Version control and documentation only go so far, if they even exist. Sometimes all you have is the code in front of you.

This is why I always start stored procedures with a header.

There was a time I strongly resisted this idea, but if you are diligent about keeping this up to date, it can be very useful for record-keeping, especially if your company has a tendency to switch source control systems and not keep the history between moves.

Leave a Comment

Using Python in R in Excel

Adam Gladstone wraps up a series on an R add-in for Excel:

In the last post in this series I am going to look at calling Python from R. Even though Excel now provides a means of calling Python scripts directly, using the =PY() formula in a worksheet, there are still occasions when it is beneficial to call Python via R. For example, it turns out that importing yfinance produces a ‘module not found’ error using Excel’s function. According to the documentation, yfinance is not one of the open source libraries that the Excel Python secure distribution supports. To get around this issue, we can use the R package Reticulate. This lets us load and run Python scripts from R. As we have seen in the previous parts of this series, the ExcelRAddIn allows us to run R scripts from an Excel worksheet. And putting these two together is quite simple.

I’m glad Adam mentioned this because my first question was going to be, why use this when Excel has Python capabilities built in? And that’s a reasonable answer.

Leave a Comment

DBAs and Data Access

Brent Ozar wraps up a survey:

Last week, I asked if your database administrators could read all of the data in all databases. The results (which may be different from this post, because I’m writing the post ahead of time and the poll is still open):

In a lot of cases, this doesn’t really matter much. In places where it does matter (for example, reading protected health information or critical financial data), there should be controls in place. I’ve always been on the side of this issue that says that yes, you do need to be able to trust your administrators at the end of the day, because somebody’s going to need a way to get to that data in case of company emergency. But as a company grows and there are additional opportunities for division of labor and specialization, you do open up the possibility of stronger controls, proper auditing, limiting certain data access to privileged accounts, and consequences for violating the rules.

Leave a Comment

Three Years of PSBlitz

Vlad Drumea looks back:

In case you’re not familiar with it, PSBlitz is a PowerShell script that outputs SQL Server health and performance diagnostics data to either Excel or HTML, and saves execution plans and deadlock graphs as .sqlplan and .xdl files.

It’s based on modified non-stored procedure versions of Brent Ozar’s SQL Server First Responder Kit and Erik Darling’s sp_QuickieStore.
Along with some other custom scripts I’ve put together.

You can get more details about its capabilities and usage from my PSBlitz GitHub repo.

Read on to learn about the origin story of PSBlitz, some of the major changes that have taken place over the past three years, and a bit more on the tool.

Leave a Comment

Randomization of Personally Identifiable Information

Rich Benner tries out a couple of techniques:

The main issue we see in dev environments is that people take a nice little version of their database, a few hundred rows of data per table, and develop on that. This is great for checking that your logic is correct, but not good when it comes to actually deploying the code to production. Suddenly, your nice, pretty code has to deal with millions of rows of data and grinds to a halt because you didn’t write it for big data sets. No, you wrote it for your development system, and it was fast on your machine. We see this a lot.

Rich shows a couple of techniques for data randomization. My biggest challenge to this is that if you need a proper distribution of data, you lose it. Using the telephone number example, if you have lookups or data analysis by area code, randomly generating across every area code would be bad. Also, if your application is smart enough to deal with valid or invalid area codes and exchanges (the middle three digits of the three-three-four phone number style in the US), you generating arbitrary area codes or exchanges might prevent app developers from using the application in the proper way, perhaps requiring them to fix phone numbers after viewing a data entry screen.

In short, there are easier and harder ways to do this, and several factors may push you into the harder way.

Leave a Comment