Press "Enter" to skip to content

Curated SQL Posts

Log Tokenization and Reduction in Azure Data Explorer

Brian Bønk tries out some new functions:

Before the release described below, the ADX service had a good handful of features to help with anomaly detection and clustering on semi-structured data.

With functions like basket() and autocluster(), the service can find patterns based on common values across columns. The problem with these functions is that they are not able to parse free-text columns and extract tokens and repeatable patterns.

Yes, you could use the diffpatterns_text() function, but it is not strong enough to cover the real diversity of free-text log records.

It’s interesting that the end result is looking for log entries whose shape differs from the norm. That’s a clever approach to log file analysis.
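
If you want to poke at the existing plugins yourself, here is a minimal sketch (not from Brian’s post) that runs autocluster() from Python via the azure-kusto-data client. It uses the public help cluster’s StormEvents sample; the authentication method and the filter are just placeholder choices.

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Public ADX sample cluster and database; swap in your own
cluster = "https://help.kusto.windows.net"
database = "Samples"

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# autocluster() mines common value combinations across columns, but it
# does not tokenize free-text columns - the gap the new functions address.
query = """
StormEvents
| where EventType == "Flood"
| evaluate autocluster()
"""

response = client.execute(database, query)
for row in response.primary_results[0]:
    print(row)
```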


Tips for Using psql

Ryan Booz shows off a tool:

Having access to the psql command-line tool is essential for any developer or DBA actively working with and connecting to PostgreSQL databases. In our first article, we discussed the brief history of psql and demonstrated how to install it on your platform of choice and connect to a PostgreSQL database.

In this article we’ll get you up and running with all of the essential things you need to know to start on your journey to becoming a psql power-user. From basic command syntax to the most common (and helpful) meta-commands, it’s all covered throughout the rest of the article.

Also check out the comments for a link to a pager which works with psql.
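
If you would rather script a few of those meta-commands than type them interactively, something like this works (a rough sketch, not from the article; the connection URI is a placeholder):

```python
import subprocess

# Placeholder connection URI; psql accepts it in place of a database name
conn = "postgresql://postgres@localhost:5432/postgres"

# A few of the meta-commands the article covers:
#   \conninfo - show current connection details
#   \l        - list databases
#   \dt       - list tables in the current schema
for meta in (r"\conninfo", r"\l", r"\dt"):
    subprocess.run(["psql", conn, "-c", meta], check=True)
```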


Building a Model with shiny and tidyAML

Steven Sanderson has a series on using the tidyAML Model Builder. Part 1 builds a simple model:

The first reactive expression, data, reads in the data file uploaded by the user or selects a built-in dataset, depending on which option the user chooses. If the user uploads a file, the read.csv() function is used to read the data file into a data frame. If the user selects a built-in dataset, the get() function is used to retrieve the data frame associated with that dataset. In both cases, the column names of the data frame are used to update the choices in the predictor_col select input, so that the user can select which column to use as the predictor variable.

Part 2 builds on it by adding new regression algorithms:

Yesterday I spoke about building tidymodels models using my package {tidyAML} and {shiny}. I have made an update to it, and will continue to make updates to it this week.

I have added all of the supported engines for regression problems only, NOT classification yet; that will be tomorrow’s work. I will then add a drop-down for users to pick which backend function they want to use from {parsnip}, like linear_reg().
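
As a rough paraphrase (not Steven’s R/Shiny code), the data-loading logic described in Part 1 boils down to something like the sketch below; the built-in dataset and the pandas stand-ins for read.csv() and get() are assumptions.

```python
from typing import Optional

import pandas as pd

# Placeholder stand-in for R's built-in datasets
BUILT_IN = {
    "mtcars": pd.DataFrame({"mpg": [21.0, 22.8, 21.4], "hp": [110, 93, 110]}),
}

def load_data(uploaded_path: Optional[str], builtin_name: str) -> pd.DataFrame:
    """Prefer an uploaded CSV; otherwise fall back to a built-in dataset."""
    if uploaded_path is not None:
        return pd.read_csv(uploaded_path)   # analogue of read.csv()
    return BUILT_IN[builtin_name]           # analogue of get()

df = load_data(None, "mtcars")
predictor_choices = list(df.columns)        # feeds the predictor_col select input
print(predictor_choices)
```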


Notes for Power BI Governance

Soheil Bakhshi provides advice:

In recent years, Power BI has become one of the most widely used business intelligence (BI) tools. Power BI is more than just a reporting tool; it is a comprehensive analytical platform that enables users to collaborate on data insights and share them internally and externally. In addition to creating reports and dashboards, Power BI allows users to collaborate and share their work with others. For instance, users can share dashboards with their colleagues, allowing them to view, interact with, and engage with the data quickly. However, as more organisations adopt Power BI, it becomes essential to ensure appropriate governance processes, policies, and rules are in place. This blog post explains Power BI governance and why business owners need to be conscious of it.

Read on for more advice.


Trying Query Parameterization Settings in SQL Server

Tibor Karaszi builds a test:

You have probably seen the recommendation to turn on the “optimize for ad hoc workloads” setting. You might even have seen a more recent recommendation to set the database setting parameterization to forced (instead of the default, which is simple). The aim of this post is to briefly describe each and then do some tests with various settings.

Click through for that test. This is a good example of how we need to temper guidance with context. In Tibor’s scenario, forced parameterization is a no-brainer and optimize for ad hoc workloads gives a pretty nice reduction in plan cache utilization. But then, with optimize for ad hoc workloads on, you lose the ability to see the first run of a query in Query Store and lose the opportunity to tune the different variations of a query which only ran once. Pretty much every setting in SQL Server exists because there is a scenario in which that is the most appropriate setting. Except auto-shrink. Auto-shrink delenda est.
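
For reference, here is a minimal sketch (not Tibor’s script) of flipping the two settings from Python via pyodbc; the server, database, and driver names are placeholders, and both changes need elevated permissions, so try this somewhere safe.

```python
# pip install pyodbc
import pyodbc

# Placeholder connection details
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=TestDb;Trusted_Connection=yes;TrustServerCertificate=yes",
    autocommit=True,
)
cur = conn.cursor()

# Instance-level: optimize for ad hoc workloads (an advanced option)
cur.execute("EXEC sys.sp_configure N'show advanced options', 1; RECONFIGURE;")
cur.execute("EXEC sys.sp_configure N'optimize for ad hoc workloads', 1; RECONFIGURE;")

# Database-level: forced parameterization (the default is SIMPLE)
cur.execute("ALTER DATABASE TestDb SET PARAMETERIZATION FORCED;")
```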


Installing SqlPackage for a Deployment Pipeline

Kevin Chant uses a deployment tool to install a deployment tool for his deployment tools:

I decided to do this post after some feedback I received about SqlPackage after a series of posts about deploying dacpacs to serverless SQL Pools. For example, my post about deploying a dacpac to a serverless SQL pool.

This is because, in order to deploy dacpacs to serverless SQL Pools, you must update SqlPackage.

With this in mind, I thought I’d better go through various ways to update SqlPackage if you intend to use it to deploy dacpacs to serverless SQL Pools.

Read on to see how you can do this.
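
One route (a sketch under assumptions, not necessarily the one Kevin lands on) is to treat SqlPackage as a .NET global tool and script both the update and the publish. The paths, server name, and database name below are placeholders, and authentication arguments are omitted.

```python
import subprocess

# Install or refresh SqlPackage as a .NET global tool
# (use "install" instead of "update" on older SDKs if it is not yet installed)
subprocess.run(
    ["dotnet", "tool", "update", "--global", "microsoft.sqlpackage"],
    check=True,
)

# Publish a dacpac to a (hypothetical) serverless SQL pool endpoint;
# add whatever authentication parameters your environment requires
subprocess.run(
    [
        "sqlpackage",
        "/Action:Publish",
        "/SourceFile:./MyDatabase.dacpac",
        "/TargetServerName:myworkspace-ondemand.sql.azuresynapse.net",
        "/TargetDatabaseName:MyServerlessDb",
    ],
    check=True,
)
```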


Measuring Power BI Dataset Memory and CPU Utilization

Chris Webb checks resource utilization:

This post is a follow-up to my recent post on identifying CPU and memory-intensive Power Query queries in Power BI. In that post I pointed out that Profiler and Log Analytics now give you information on the CPU and memory used by an individual Power Query query when importing data into Power BI. What I didn’t notice when I wrote that post is that there is also now information available in Profiler and Log Analytics that tells you about peak memory and CPU usage across all Power Query queries for a single refresh in the Power BI Service, as well as memory usage for the refresh as a whole.

Click through for a demonstration using Profiler.


Understanding the Kafka Partitioner

Bill Bejeck talks partitions:

Apache Kafka is the de facto standard for event streaming today. Part of what makes Kafka so successful is its ability to handle tremendous volumes of data; a throughput of millions of records per second is not unheard of in production environments. One part of Kafka’s design that makes this possible is partitioning.

Kafka uses partitions to spread the load of data across brokers in a cluster, and the partition is also the unit of parallelism; more partitions mean higher throughput. Since Kafka works with key-value pairs, getting records with the same key onto the same partition is essential.

Read on to learn a bit about how that partitioning works and why it’s important for application design, especially across multiple programming languages.
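
As a quick illustration (not from Bill’s post), here is a minimal keyed producer in Python using confluent-kafka; the broker address and topic name are placeholders. Because the partitioner hashes the key, the two customer-42 events land on the same partition and keep their relative order.

```python
# pip install confluent-kafka
from confluent_kafka import Producer

# Placeholder broker address
producer = Producer({"bootstrap.servers": "localhost:9092"})

events = [
    ("customer-42", "order created"),
    ("customer-42", "order shipped"),
    ("customer-7", "order created"),
]

for key, value in events:
    # Records with the same key hash to the same partition of the topic
    producer.produce("orders", key=key, value=value)

producer.flush()
```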


Creating a Clickable Word Cloud with Shiny

Mandy Norrbo builds a word cloud:

Word clouds are a visual representation of text data where words are arranged in a cluster, with the size of each word reflecting its frequency or importance in the data set. Word clouds are a great way of displaying the most prominent topics or keywords in free text data obtained from websites, social media feeds, reviews, articles and more. If you want to learn more about working with unstructured text data, we recommend attending our Text Mining in R course.

Usually, a word cloud will be used solely as an output. But what if you wanted to use a word cloud as an input? For example, let’s say we visualised the most common words in reviews for a hotel. Imagine we could then click on a specific word in the word cloud, and it would then show us only the reviews which mention that specific word. Useful, right?

Read on to see how you can create one of these.
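
The idea ports to any stack. Here is a rough Python sketch (not Mandy’s R/Shiny code) of the click-to-filter behaviour: word frequencies drive the cloud, and a clicked word filters the underlying reviews. The review text is made up for illustration.

```python
import re
from collections import Counter

reviews = [
    "Lovely staff and a great breakfast",
    "Breakfast was cold but the staff were friendly",
    "Quiet room, great location",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

# Frequencies that would size the words in the cloud
freq = Counter(word for review in reviews for word in tokenize(review))

# Simulate clicking the word "breakfast" in the cloud
clicked = "breakfast"
matching = [review for review in reviews if clicked in tokenize(review)]

print(freq.most_common(5))
print(matching)
```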
