Press "Enter" to skip to content

Month: April 2022

Saving Space with 6NF in SQL Server

Aaron Bertrand has a two-parter. Part one sets up the problem:

We often build logging or other insert-only tables where we store large strings like URLs, host names, or error messages. It’s usually not until the table has terabytes of data that we realize there might have been a better way. If we are logging traffic or exceptions for our own application, it’s likely that we record the same URL, host name, or error message on millions of rows. What if we only had to write that URL or host name or message text once, the first time we saw it? In this tip, I want to share one idea for abstracting away recurring values, reducing storage, and making search queries faster (especially those with wildcards) without requiring immediate changes in the application layer.

Part two maximizes the savings:

In my previous tip, I showed how we can make a growing logging table leaner by moving large, repeating strings to their own dimension tables. The solution there involved an AFTER INSERT trigger and assumed that we could change the applications to recognize the new table structure in relatively short order.

Check out both posts for more details. If you’re confused about my calling this 6NF and Aaron mentioning dimension tables, the answer is that he’s talking about the end result and I’m describing the process.

Comments closed

Topic Modeling with Python

Sanil Mhatre takes us through topic modeling:

Topic modeling is a powerful Natural Language Processing technique for finding relationships among data in text documents. It falls under the category of unsupervised learning and works by representing a text document as a collection of topics (set of keywords) that best represent the prevalent contents of that document. This article will focus on a probabilistic modeling approach called Latent Dirichlet Allocation (LDA), by walking readers through topic modeling using the team health demo dataset. Demonstrations will use Python and a Jupyter notebook running on Anaconda. Please follow instructions from the “Initial setup” section of the previous article to install Anaconda and set up a Jupyter notebook.

The second article of this series, Text Mining and Sentiment Analysis: Power BI Visualizations, introduced readers to the Word Cloud, a common technique to represent the frequency of keywords in a body of text. Word Cloud is an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text. This technique is limited in its ability to discover underlying topics and themes in the text, because it only relies on the frequency of keywords to determine their popularity. Topic modeling overcomes these limitations and uncovers deeper insights from text data using statistical modeling for discovering the topics (collection of words) that occur in text documents.

Read on for an informative article with plenty of code.

Comments closed

Logging in Python

Daniel Chung shows off the logging module in Python:

Note that now all five messages were output, so the default level that the root logger logs is now “DEBUG.” The log record attributes (such as %(asctime)s) that can be used to format the output can be found in the logging documentation.

Although there is a default logger, we usually want to make and use other loggers that can be configured separately. This is because we may want a different severity level or format for different loggers. 

Next on the todo list is to implement the Reader monad to hide that logging deeper in your code base so that you a) don’t need to see logging code everywhere, and b) don’t forget to include logging in some function.

Comments closed

Accessing Twitter’s V2 API via R

Bryan Shalloway dives into the mess known as Twitter:

academictwitteR is probably the most established package that provides a quickstart entry point to the V2 API. However it requires creating an academic account in twitter, i.e. the user must be affiliated with a university. I also stumbled onto RTwitterV2 and voson.tcn which both also provide quickstarts on the V2 API, but did not explore these.

Instead I followed the tutorial Getting started with R and v2 of the Twitter API by Twitter Developer Advocate Jessica Garson that uses {httr} to interact more directly with the API. I highly recommend reading her tutorial. The code below is mostly just copied from there but changed to provide an example of getting the usernames of those that liked a tweet.

Read on for a how-to and some notes.

Comments closed

Tokenizing Results in SSIS

Peter Schott shows off a handy trick:

I’ve worked with SSIS for some time now, but a recent question and post pointed out an expression I’d never used before. We often have a need to split strings and pick out some portion of that string. Sometimes we have a need to pull in everything in the “nth” occurrence of a string. The TOKEN expression can be used to get that particular value. I tested this out by mocking up a really simple package.

Read on to see how.

Comments closed

Building a Gantt Chart with Power BI Paginated Reports

Paul Turley unlocks one of my guilty pleasures:

A Gantt chart is a running list of activities with the duration for each displayed as a horizontal bar depicting the beginning and ending day along a horizontal scale. The challenge is that this is not a standard chart type in either Power BI or SSRS/Paginated Reports. Furthermore, project planners may prefer to see activities as rows in the format of a printed page, as you can see in this example:

I like Gantt charts more than is probably healthy. Paul shows a method which isn’t exactly easy but it does the trick.

Comments closed

Reviewing SQL Server Logins

Lee Markum takes a look at logins:

You know you need to be thinking about SQL Server security, but maybe you’re not sure where to start. Topics like firewalls and ports and port scanners and such may be dancing your mind. Those are good things to think about, but they are not under your sphere of influence as a data professional in charge of SQL Server. So, what can you do?

Your first place to start is by looking at the Logins, which as I’ve explained in a previous  post, are at the level of the SQL Server instance level.

Read on for two approaches.

Comments closed

Skipping SQLPS

Ben Miller avoids SQLPS:

In researching some things for a presentation, I was to give at Intersections, Spring 2022, I came across this documentation page for SQL Server PowerShell that in the SQL Server Agent section it indicates that there is a way to get the Agent Job Step to skip loading the SQLPS module and you get to use PowerShell natively and can load whichever modules you would like.

What? Huge News!

That is good news. SQLPS is so outdated it’s not even funny, so this gives you a much better alternative.

Comments closed

Replacing PWDENCRYPT with HASHBYTES

Randolph West does a switcharoo:

Dave is referring to the Microsoft Docs page for PWDENCRYPT(), which has been deprecated for some time. Unfortunately, although the recommended replacement is HASHBYTES(), there isn’t an example on either page of how to replicate the functionality of PWDENCRYPT().

So, borrowing from Sebastian Meine who wrote an article titled Hash Algorithms – How does SQL Server store Passwords?, this is how you can replicate the functionality of PWDENCRYPT() to create a login, using the HASHBYTES() function instead:

If this is what you have, so be it…but an algorithm like bcrypt or scrypt would be so much better for this purpose than SHA2 or SHA3. That means using a third party library for it but there are plenty for React, .NET, Python, etc.

1 Comment

Printing ggplot2 Plots as Receipts

Bob Rudis has fun with a Point of Sale printer:

At the end of March, I caught a fleeting tweet that showcased an Epson thermal receipt printer generating a new “ticket” whenever a new GitHub issue was filed on a repository. @aschmelyun documents it well in this blog post. It’s a pretty cool hack, self-contained on a Pi Zero.

Andrew’s project birthed an idea: could I write an R package that will let me plot {ggplot2}/{grid} objects to it? The form factor of the receipt printer is tiny (~280 “pixels” wide), but the near infinite length of the paper means one can play with some data visualizations that cannot be done in other formats (and it would be cool to be able to play with other content to print to it in and outside of R).

Read on for a fun story which gets an entry in my most coveted category. H/T R-Bloggers

Comments closed