Curated SQL Posts

Finding the Latest in the PowerShell Gallery

Jeffrey Hicks does some marketing:

It is easy to search the PowerShell Gallery online and from a PowerShell prompt. But I felt there could be more. To address this “marketing” need, I’ve started PSGallery Report.

This is not a PowerShell module. I have set up a process to generate daily reports on what has recently been published in the PowerShell Gallery. I’m generating a set of reports in markdown and posting them to GitHub.

This is an interesting idea, though you might need to browse to yesterday’s version of the code (if you’re reading this today, 2022-04-11) because the latest run appears to have removed everything.

KEEP PLAN: A Not-So-Useful Query Hint

Tom Zika digs into a query hint:

Like Dwarves of Moria, I delved too greedily and too deep. I found the answer in a Plan Caching and Recompilation in SQL Server 2012 whitepaper before trying it out on my own. Relevant excerpt:

KEEP PLAN

The KEEP PLAN query hint changes the recompilation thresholds for temporary tables, and makes them identical to those for permanent tables. Therefore, if changes to temporary tables are causing many recompilations, this query hint can be used.

But since it doesn’t show the proof, I decided to test it out anyway.

And it’s a good thing, too, as it turns out this isn’t quite how things work anymore.

Saving Space with 6NF in SQL Server

Aaron Bertrand has a two-parter. Part one sets up the problem:

We often build logging or other insert-only tables where we store large strings like URLs, host names, or error messages. It’s usually not until the table has terabytes of data that we realize there might have been a better way. If we are logging traffic or exceptions for our own application, it’s likely that we record the same URL, host name, or error message on millions of rows. What if we only had to write that URL or host name or message text once, the first time we saw it? In this tip, I want to share one idea for abstracting away recurring values, reducing storage, and making search queries faster (especially those with wildcards) without requiring immediate changes in the application layer.

Part two maximizes the savings:

In my previous tip, I showed how we can make a growing logging table leaner by moving large, repeating strings to their own dimension tables. The solution there involved an AFTER INSERT trigger and assumed that we could change the applications to recognize the new table structure in relatively short order.

Check out both posts for more details. If you’re confused about my calling this 6NF and Aaron mentioning dimension tables, the answer is that he’s talking about the end result and I’m describing the process.
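
As a rough illustration of the idea in part one (write each repeated string once, then reference it by key), here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server and does the lookup in application code rather than in the AFTER INSERT trigger Aaron describes. The table names, column names, and the `log_request` helper are hypothetical and not taken from his tips.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Urls (
        UrlId INTEGER PRIMARY KEY,
        Url   TEXT NOT NULL UNIQUE      -- each distinct URL is stored once
    );
    CREATE TABLE RequestLog (
        LogId    INTEGER PRIMARY KEY,
        UrlId    INTEGER NOT NULL REFERENCES Urls(UrlId),
        LoggedAt TEXT NOT NULL
    );
""")

def log_request(url, logged_at):
    """Hypothetical helper: store the URL text only the first time it is seen."""
    conn.execute("INSERT OR IGNORE INTO Urls (Url) VALUES (?)", (url,))
    url_id = conn.execute(
        "SELECT UrlId FROM Urls WHERE Url = ?", (url,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO RequestLog (UrlId, LoggedAt) VALUES (?, ?)",
        (url_id, logged_at),
    )

# Made-up traffic: one URL hit 1,000 times, another hit once.
for i in range(1000):
    log_request("https://example.com/products", f"2022-04-11T00:00:{i % 60:02d}")
log_request("https://example.com/cart", "2022-04-11T00:01:00")

url_rows = conn.execute("SELECT COUNT(*) FROM Urls").fetchone()[0]
log_rows = conn.execute("SELECT COUNT(*) FROM RequestLog").fetchone()[0]

# A wildcard search scans the small Urls table, then joins to the log rows.
cart_hits = conn.execute(
    """
    SELECT COUNT(*)
    FROM RequestLog AS r
    JOIN Urls AS u ON u.UrlId = r.UrlId
    WHERE u.Url LIKE '%cart%'
    """
).fetchone()[0]
```

The log table holds only small integer keys, so the long strings are written once no matter how many rows reference them.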

Topic Modeling with Python

Sanil Mhatre takes us through topic modeling:

Topic modeling is a powerful Natural Language Processing technique for finding relationships among data in text documents. It falls under the category of unsupervised learning and works by representing a text document as a collection of topics (set of keywords) that best represent the prevalent contents of that document. This article will focus on a probabilistic modeling approach called Latent Dirichlet Allocation (LDA), by walking readers through topic modeling using the team health demo dataset. Demonstrations will use Python and a Jupyter notebook running on Anaconda. Please follow instructions from the “Initial setup” section of the previous article to install Anaconda and set up a Jupyter notebook.

The second article of this series, Text Mining and Sentiment Analysis: Power BI Visualizations, introduced readers to the Word Cloud, a common technique to represent the frequency of keywords in a body of text. Word Cloud is an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text. This technique is limited in its ability to discover underlying topics and themes in the text, because it only relies on the frequency of keywords to determine their popularity. Topic modeling overcomes these limitations and uncovers deeper insights from text data using statistical modeling for discovering the topics (collection of words) that occur in text documents.

Read on for an informative article with plenty of code.
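
To make the contrast in the excerpt concrete, the sketch below shows the frequency counting that a Word Cloud is built on, using only the standard library. It is not LDA and not Sanil's code. The sample documents, the stop-word list, and the `keyword_frequencies` helper are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "was"}

def keyword_frequencies(documents):
    """Count how often each non-stop word appears across all documents."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z']+", doc.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# Made-up survey comments, not the dataset used in the article.
docs = [
    "The nurse was kind and the wait was short",
    "Long wait, but the nurse was helpful",
    "Billing was confusing and the wait was long",
]

freq = keyword_frequencies(docs)
most_frequent = freq.most_common(1)[0]
```

The counts say that "wait" is the most common keyword, but not that "wait" and "long" tend to appear together in the same comments. That kind of co-occurrence structure is what a topic model such as LDA tries to recover.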

Logging in Python

Daniel Chung shows off the logging module in Python:

Note that now all five messages were output, so the default level that the root logger logs is now “DEBUG.” The log record attributes (such as %(asctime)s) that can be used to format the output can be found in the logging documentation.

Although there is a default logger, we usually want to make and use other loggers that can be configured separately. This is because we may want a different severity level or format for different loggers. 

Next on the to-do list is to implement the Reader monad to hide that logging deeper in your code base so that you a) don’t need to see logging code everywhere, and b) don’t forget to include logging in some function.
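
A minimal sketch of the point about separately configured loggers, using only the standard logging module. The logger names, format strings, and the `build_logger` helper are hypothetical and not taken from Daniel's article. In-memory streams stand in for real log destinations so the result is easy to inspect.

```python
import io
import logging

def build_logger(name, level, fmt, stream):
    """Hypothetical helper: a named logger with its own level, format, and handler."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False       # keep records away from the root logger's handlers
    logger.handlers.clear()        # avoid duplicate handlers if this runs twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)
    return logger

# In-memory streams used only for illustration.
audit_stream = io.StringIO()
debug_stream = io.StringIO()

audit_log = build_logger(
    "app.audit", logging.WARNING, "%(levelname)s|%(name)s|%(message)s", audit_stream
)
debug_log = build_logger(
    "app.debug", logging.DEBUG, "%(name)s: %(message)s", debug_stream
)

audit_log.info("ignored because it is below WARNING")
audit_log.error("disk nearly full")
debug_log.debug("cache miss for key 42")
```

Each logger filters and formats independently: the audit logger drops the INFO record, while the debug logger keeps everything from DEBUG up.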

Accessing Twitter’s V2 API via R

Bryan Shalloway dives into the mess known as Twitter:

academictwitteR is probably the most established package that provides a quickstart entry point to the V2 API. However, it requires creating an academic account on Twitter, i.e. the user must be affiliated with a university. I also stumbled onto RTwitterV2 and voson.tcn which both also provide quickstarts on the V2 API, but did not explore these.

Instead I followed the tutorial Getting started with R and v2 of the Twitter API by Twitter Developer Advocate Jessica Garson that uses {httr} to interact more directly with the API. I highly recommend reading her tutorial. The code below is mostly just copied from there but changed to provide an example of getting the usernames of those that liked a tweet.

Read on for a how-to and some notes.

Tokenizing Results in SSIS

Peter Schott shows off a handy trick:

I’ve worked with SSIS for some time now, but a recent question and post pointed out an expression I’d never used before. We often have a need to split strings and pick out some portion of that string. Sometimes we have a need to pull in everything in the “nth” occurrence of a string. The TOKEN expression can be used to get that particular value. I tested this out by mocking up a really simple package.

Read on to see how.
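
For readers who have not seen this kind of expression, here is a rough Python analogue of picking the nth delimited piece out of a string. It is a sketch, not SSIS code. The `token` helper is hypothetical, and two of its behaviors are assumptions of this sketch rather than anything the excerpt confirms: runs of consecutive delimiters are treated as a single separator, and an out-of-range occurrence returns an empty string.

```python
def token(text, delimiters, occurrence):
    """Return the occurrence-th (1-based) piece of text, split on any delimiter character."""
    if occurrence < 1:
        raise ValueError("occurrence is 1-based")

    pieces = []
    current = []
    for ch in text:
        if ch in delimiters:
            # Assumption: empty pieces between adjacent delimiters are skipped.
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))

    # Assumption: asking for a piece that does not exist yields an empty string.
    return pieces[occurrence - 1] if occurrence <= len(pieces) else ""

year = token("2022-04-11", "-", 1)
second = token("alpha|beta,gamma", "|,", 2)
```

Passing several characters in `delimiters` splits on any of them, which is why the second call can mix pipes and commas.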

Building a Gantt Chart with Power BI Paginated Reports

Paul Turley unlocks one of my guilty pleasures:

A Gantt chart is a running list of activities with the duration for each displayed as a horizontal bar depicting the beginning and ending day along a horizontal scale. The challenge is that this is not a standard chart type in either Power BI or SSRS/Paginated Reports. Furthermore, project planners may prefer to see activities as rows in the format of a printed page.

I like Gantt charts more than is probably healthy. Paul shows a method which isn’t exactly easy but it does the trick.

Reviewing SQL Server Logins

Lee Markum takes a look at logins:

You know you need to be thinking about SQL Server security, but maybe you’re not sure where to start. Topics like firewalls and ports and port scanners and such may be dancing in your mind. Those are good things to think about, but they are not under your sphere of influence as a data professional in charge of SQL Server. So, what can you do?

Your first place to start is by looking at the Logins, which, as I’ve explained in a previous post, are at the SQL Server instance level.

Read on for two approaches.
