Day: July 23, 2019

Polishing Uncalibrated Models

Nina Zumel takes an uncalibrated random forest model and applies a calibration technique to improve the estimate on one variable:

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This “polishing step” potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.

This is a great explanation of the process as well as its risks and limitations.
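
To make the idea of polishing concrete, here is a minimal Python sketch of one simple variant: shifting each prediction by the mean residual of its group so that group totals line up on the calibration data. This is an illustrative assumption, not necessarily the method used in the linked article, and the function names and data are hypothetical. Fitting the offsets on the same data used to train the model risks overfitting, so held-out calibration data is the safer choice.

```python
from collections import defaultdict


def fit_group_offsets(groups, y_true, y_pred):
    """Return the mean residual (actual - predicted) for each group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, y, p in zip(groups, y_true, y_pred):
        totals[g] += y - p
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}


def polish(groups, y_pred, offsets):
    """Shift each prediction by its group's offset; unseen groups are unchanged."""
    return [p + offsets.get(g, 0.0) for g, p in zip(groups, y_pred)]


# Hypothetical calibration data: group labels, actuals, and raw model predictions.
groups = ["a", "a", "b", "b"]
y_true = [10.0, 12.0, 20.0, 24.0]
y_pred = [9.0, 11.0, 23.0, 25.0]

offsets = fit_group_offsets(groups, y_true, y_pred)
polished = polish(groups, y_pred, offsets)

# After polishing, each group's predicted total matches its actual total
# on the calibration data.
for g in sorted(set(groups)):
    actual = sum(y for gg, y in zip(groups, y_true) if gg == g)
    predicted = sum(p for gg, p in zip(groups, polished) if gg == g)
    print(g, actual, predicted)
```

The additive shift keeps the ordering of predictions within a group intact, which is why individual-level ranking quality is unaffected; only the group-level bias is removed.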

Generating Excel Spreadsheets from Shiny

Richard Hill and Andy Merlino show how you can export data from a Shiny app into Excel:

R is great for report generation. Shiny allows us to easily create web apps that generate a variety of reports with R.

This post details a demo Shiny app that generates an Excel report, a PowerPoint report, and a PDF report.

The full Shiny app source code is available here. Also, we included a more basic Shiny app that generates an Excel report at the end of this post. Follow up posts will include similar simple Shiny apps generating PowerPoint and PDF reports.

Excel is still the most popular business intelligence tool and Excel support tends to be one of the first requests people get with third-party apps, so it’s good to know you can do this in Shiny without too much rigmarole.

When tempdb Spills Attack

Josh Darnell ran into a problem with a SQL Agent job:

One of my colleagues reached out to me recently about a production issue where a SQL Server Agent job had failed with this error message:

Msg 1105, Level 17, State 2, Line 15
Could not allocate space for object ‘dbo.SORT temporary run storage: 140737513062400’ in database ‘tempdb’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

I fully expected this to be a scheduled maintenance task, like index rebuilds or statistics updates. I’ve seen this error before in those contexts (rebuilding large indexes in tempdb, or updating statistics with FULLSCAN).

But watch as SQL Server subverts Josh’s expectations.

Power BI Conditional Formatting and Icons

Matt Allington shows how you can now use icons as the output of conditional formatting in Power BI:

Note how the icons above have both shape and colour so you can differentiate between them even if you are colour blind.  This is best practice.

You can also change the default formatting to work on the hard coded number settings that you specify. In the example below I have changed the settings to work on absolute numbers instead of percentages (note the changes in the highlighted boxes).  Also note that I have set the minimum and maximum numbers shown as 1 and 2.  To do this, simply delete the value in these boxes.  Thanks to Chris Webb for finally helping me understand how this works.

It’s easy to go overboard with this, but I’m happy to see conditional formatted icons in place; done right, you can pack a lot of information into a small space with them.

The SQL Notebook Experience, Featuring Powershell

Rob Sewell takes a break from book-writing and talks about using PowerShell in SQL Notebooks:

Yes, it’s funny but also it carries a serious warning. Without understanding what it is doing, please don’t enable PowerShell to be run in a SQL Notebook that someone sent you in an email or you find on a GitHub. In the same way as you don’t open the word document attachment which will get a thousand million trillion pounddollars into your bank account or run code you copy from the internet on production without understanding what it does, this could be a very dangerous thing to do.

With that warning out of the way, there are loads of really useful and fantastic use cases for this. SQL Notebooks make great run-books or incident response recorders and PowerShell is an obvious tool for this. (If only we could save the PowerShell output in a SQL Notebook, this would be even better)

“It’s a bit hacky” is a generous statement, but it’s really cool that Rob figured out a way to do this. There is a PowerShell kernel for Jupyter, but I’ve not had the best experience adding new kernels to Azure Data Studio (at least not F#’s kernel, which I tried).

Simple Query Zen

Erik Darling wants you to simplify your queries:

See, when a query is big and complicated to you, there’s a pretty good chance you’re gonna get a big and complicated query plan, because it’s big and complicated to the optimizer, too.

This isn’t to say the optimizer is dumb or bad or ugly; it’s just that there’s only so long it’s willing to spend coming up with a plan.

Remember, cheap plan fast. Not perfect, not great, maybe good enough.

It’s a good operating philosophy: if you have a query which has gone off the rails, one of the best things you can do is try to turn the query into several small steps. It’s possible to reduce complexity that way…though you may also gain complexity in the process if you do it wrong.
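
As a rough illustration of splitting one query into several small steps, the sketch below stages an aggregate in a temporary table and then joins against that small intermediate result. It uses SQLite through Python's built-in sqlite3 module purely so the example is self-contained; the tables and data are hypothetical, and in SQL Server the analogous staging step would typically use a #temp table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and data, just to have something to query.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 25.0), (3, 2, 40.0);
""")

# Step 1: materialize the aggregate on its own, so this piece
# is planned and executed in isolation.
conn.execute("""
    CREATE TEMP TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# Step 2: join the small intermediate result to the lookup table.
rows = conn.execute("""
    SELECT c.region, t.total
    FROM customer_totals AS t
    JOIN customers AS c ON c.customer_id = t.customer_id
    ORDER BY c.region
""").fetchall()

print(rows)
conn.close()
```

Each step is simple enough to reason about and tune separately. The trade-off is the extra write for the intermediate table, which is one way the split can add complexity or cost rather than remove it.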

Transactional Replication Tips

Nate Johnson has a few things which might make SQL Server transactional replication easier for you:

For what seems like years, I’ve bemoaned the fact that SQL Transactional Replication doesn’t come with a “Just Trust Me” option. I’ll explain more about what I mean in a moment. The other thing I’ve complained about is that there’s no “Pause” button, which is not entirely accurate, since obviously you could just stop the distribution and subscription agents. But specifically what I mean is, it’s not easy to ‘put it on hold so you can make some schema changes to one of the tables that’s being replicated’, and then easily “Resume” it after you’re done with said changes.

Well, I’m happy to say that now I have both of these tools/methodologies in my arsenal!

Read on for those tips and a couple more.

Keeping S3 and Blob Storage in Sync

Sheldon Hull shares with us a technique to keep an S3 bucket in sync with an Azure Blob Storage container:

Moving data between two cloud providers can be painful, and it requires more provider-specific scripting if you are making API calls yourself. For this, you can benefit from a tool that abstracts the calls into a seamless synchronization tool.

I’ve used RClone before when needing to deduplicate several terabytes of data in my own Google Drive, so I figured I’d see if it could help me sync up 25GB of json files from Azure to S3.

You’ll have to do a few of the steps on your own, but this looks like a good way of parking data in two clouds.
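
For a sense of what the sync step might look like, here is a small Python sketch that assembles an rclone sync command. The remote names `azure` and `s3`, the container and bucket names, and the helper function are all hypothetical; the remotes would need to be set up beforehand with `rclone config`. Note that `rclone sync` makes the destination match the source, including deleting files that exist only at the destination, so the sketch defaults to `--dry-run`.

```python
import shlex


def build_rclone_sync(source, dest, dry_run=True):
    """Build an rclone sync command that copies only JSON files.

    `source` and `dest` are rclone paths in "remote:path" form.
    """
    cmd = ["rclone", "sync", source, dest, "--include", "*.json"]
    if dry_run:
        # Report what would change without transferring or deleting anything.
        cmd.append("--dry-run")
    return cmd


# Hypothetical remotes and names; substitute your own configuration.
cmd = build_rclone_sync("azure:my-container", "s3:my-bucket")
print(shlex.join(cmd))
```

Once the dry run looks right, the same list can be passed to `subprocess.run(cmd, check=True)` with `dry_run=False` to perform the actual transfer.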
