Curated SQL Posts

Generating Artificial Data with the Databricks Labs Data Generator

Ust Oldfield shows off a new tool:

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.
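To give a flavor of that, here is a minimal sketch of what attaching a distribution to a generated numeric column looks like with the Databricks Labs Data Generator. This assumes the current dbldatagen package name, a Databricks notebook where spark already exists, and made-up column names and row counts; check the project's documentation for the exact parameter names in your version.

import dbldatagen as dg

# Assumes a Databricks notebook, so `spark` is already defined.
spec = (
    dg.DataGenerator(spark, name="demo_data", rows=100_000, partitions=4)
    .withIdOutput()                                  # surrogate id column
    .withColumn("category", "int", minValue=1, maxValue=5, random=True)
    .withColumn("amount", "float", minValue=0.0, maxValue=1000.0,
                random=True, distribution="normal")  # shape the random values
)

df = spec.build()  # returns a regular Spark DataFrame
df.show(5)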

Kafka and SIEM/SOAR Tools

Kai Waehner wraps up a series on Apache Kafka and network security:

SIEM combines security information management (SIM) and security event management (SEM). They provide analysis of security alerts generated by applications and network hardware. Vendors sell SIEM as software, as appliances, or as managed services; these products are also used for logging security data and generating reports for compliance purposes.

SOAR tools automate security incident management investigations via a workflow automation workbook. The cyber intelligence API enables the playbook to automate research related to the ticket (lookup potential phishing URL, suspicious hash, etc.). The first responder determines the criticality of the event. At this level, it is either a normal or an escalation event. SOAR includes security incident response platforms (SIRPs), Security orchestration and automation (SOA), and threat intelligence platforms (TIPs).

In summary, SIEM and SOAR are key pieces of a modern cybersecurity infrastructure. The capabilities, use cases, and architectures are different for every company.

Click through to see where Kafka can fit in all of this.
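One common pattern in that architecture is to publish normalized security events to a Kafka topic that the SIEM or SOAR tooling then consumes. Here is a minimal, hedged sketch using the kafka-python client; the broker address, topic name, and event shape are all placeholders.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"source_ip": "10.0.0.5", "action": "login_failure", "severity": "medium"}
producer.send("security-events", value=event)              # placeholder topic
producer.flush()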

Exporting a Hive Table to CSV

The Hadoop in Real World team shows how you can export data from a Hive table into a comma-separated values (CSV) file:

It is a pretty common use case to export the contents of a Hive table into a CSV file. It’s pretty simple if you are using a recent version of Hive. In this post, we will see how to achieve this with both newer and older versions of Hive.

Read on to see both versions of the answer.
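As a rough alternative to the HiveQL approaches the post walks through (and not what the post itself shows), a Spark session with Hive support can also dump a table to CSV. The database, table, and output path below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

(spark.table("sales_db.orders")        # placeholder database.table
      .coalesce(1)                     # single output file; fine for modest tables
      .write.mode("overwrite")
      .option("header", True)
      .csv("/tmp/orders_csv"))         # placeholder output directory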

BCP from R into SQL Server

Thomas Roh shows how you can perform bulk insert operations into SQL Server using the bcputility package in R:

Writing large datasets to SQL Server can be very slow using the DBI package with an odbc connection. The issue with writing data is that individual INSERT statements are generated for each row of data. I’ve also had issues with remote connections that can make large writes to SQL Server take a very long time. SQL Server Management Studio does provide a GUI interface to import data that is much more efficient. For those that want to include the data import in their reproducible R workflows there are a couple of options.

Read on to see how it works. It’s still calling bcp.exe under the covers, so expect the same foibles from it that you’d get from bcp directly. H/T R-Bloggers.
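For comparison only (the post itself covers the R bcputility package), the same bulk-load idea can be sketched from Python by shelling out to bcp.exe. The server, database, table, and file names are placeholders.

import subprocess

cmd = [
    "bcp", "dbo.SalesStaging", "in", r"C:\data\sales.csv",
    "-S", "myserver",   # target SQL Server instance (placeholder)
    "-d", "SalesDB",    # target database (placeholder)
    "-T",               # trusted (Windows) authentication
    "-c",               # character mode
    "-t", ",",          # field terminator
]
subprocess.run(cmd, check=True)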

Importing SQL Server Extended Properties into Azure Purview

Daniel Janik shows how you can use PyApacheAtlas to move specific SQL Server extended properties into Azure Purview:

This post is going to be restricted to only SQL Server table columns and only extended properties named MS_Description. Quite a few years ago I worked on a data catalog project where we added descriptions for many of the tables, views, and columns to the database using extended properties named MS_Description. Let’s assume you have some of these for this post, keeping in mind that the Purview APIs provide so many functions beyond what this post covers and that the code here could be modified to do so much more as well.

Starting out I thought it would be great to import the sensitivity classifications that SSMS creates. Pre-SQL 2019 these were held in Extended Properties and now have their very own DMV (sys.sensitivity_classifications). While this sounded great in theory it wasn’t as exciting when I wrote the code. This is because Azure Purview already has system classifications at a more granular scale for each of the ones you find in SSMS and Purview also adds these as it executes a scan on the data source. It does a pretty good job too. With that said, I shifted my focus to adding descriptions instead.

Read on to see how you can do this.
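To give a feel for the moving parts, here is a hedged sketch of the extraction side: pull the MS_Description values for table columns out of SQL Server and stand up a PyApacheAtlas client against Purview. The connection string, account name, and credentials are placeholders, and the actual entity updates would follow the steps in the linked post.

import pyodbc
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

# Pull MS_Description extended properties for table columns.
sql = """
SELECT s.name AS schema_name, t.name AS table_name, c.name AS column_name,
       CAST(ep.value AS nvarchar(max)) AS description
FROM sys.extended_properties AS ep
JOIN sys.tables  AS t ON ep.major_id = t.object_id
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
JOIN sys.columns AS c ON c.object_id = t.object_id AND c.column_id = ep.minor_id
WHERE ep.class = 1 AND ep.name = 'MS_Description' AND ep.minor_id > 0;
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=myserver;"
    "Database=MyDb;Trusted_Connection=yes;"            # placeholder connection
)
descriptions = conn.cursor().execute(sql).fetchall()

# Authenticate to Purview with a service principal (all values are placeholders).
auth = ServicePrincipalAuthentication(tenant_id="...", client_id="...", client_secret="...")
client = PurviewClient(account_name="my-purview-account", authentication=auth)

# From here, each column entity gets looked up by qualified name and its
# description updated, as Daniel's post walks through in detail.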

Journey before Destination and Disagreeing with Your Heroes

Mark Seemann reasons through a difficult problem:

Perhaps we’re really talking past each other. Perhaps we’re trying to solve different problems, and thereby arrive at different solutions.

I can only guess at the kinds of problems that my heroes think of when they prefer dynamic languages, and I don’t want to misrepresent them. What I can do, however, is outline the kind of problem that I typically have in mind.

I’ve spent much of my career trying to balance sustainability with correctness. I consider correctness as a prerequisite for all code. As Gerald Weinberg implies, if a program doesn’t have to work, anything goes. Thus, sustainability is a major focus for me: how do we develop software that can sustain our organisation now and in the future? How do we structure and organise code so that future change is possible?

Reading Mark’s essay makes me want to break out my copy of Tractatus Logico-Philosophicus, but let’s not go crazy here… This is a good reminder, however, that incentives (implicit as well as explicit), experiences, and a host of other factors make it really difficult to say conclusively “X is a better solution than Y” without laying out the specific premises.

Power BI and Information Protection in the Enterprise

Marc Lelijveld continues a series on taking Power BI to the enterprise level:

On to the next topic, which is super important for global, enterprise grade solutions if you ask me. Security and information protection! The fifth blog in the series of transforming local into global Power BI solutions.

Good to know up front is that information protection is a feature that is not specific to Power BI but also applies to other Microsoft products and services. In this blog, though, I will explain the functionality of information protection, also known as sensitivity labels, and how it applies to Power BI. Let’s first look at what the functionality is, where you can apply it and why it is important.

Read on for more information concerning sensitivity labels and where they become useful.

When Query Cost Goes Astray

Erik Darling warns that man cannot live on query cost alone:

There are also rather misguided efforts to figure out parallelism settings based on plan costs. The main problem with that being that if you currently have a lot of parallel queries, all that means is that the estimated cost of the serial plan was higher than your current Cost Threshold For Parallelism setting, and the cost of the parallel plan was less than the cost of the serial plan.

If you increase Cost Threshold For Parallelism, you may very well still end up with a parallel plan, because the serial version was still more expensive. If you eventually change Cost Threshold For Parallelism to the point where some queries are no longer eligible for parallelism, you may eventually find yourself unhappy with the performance of the serial version of the query plan.

Read on for Erik’s take on query cost, which is a good one.
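To make that concrete, here is a toy illustration (made-up numbers, not SQL Server internals) of why raising Cost Threshold For Parallelism does not necessarily change the plan:

def chosen_plan(serial_cost, parallel_cost, cost_threshold):
    # Parallelism is only considered when the serial plan's estimated cost
    # exceeds the threshold, and the parallel plan wins only if it is
    # estimated cheaper than the serial plan.
    if serial_cost > cost_threshold and parallel_cost < serial_cost:
        return "parallel"
    return "serial"

print(chosen_plan(serial_cost=80, parallel_cost=45, cost_threshold=50))  # parallel
print(chosen_plan(serial_cost=80, parallel_cost=45, cost_threshold=70))  # still parallel
print(chosen_plan(serial_cost=80, parallel_cost=45, cost_threshold=90))  # serial at last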

Row Level Security in Azure Analysis Services and Power BI PPU

Gilbert Quevauvilliers continues a series on moving from Azure Analysis Services to Power BI Premium Per User:

In this blog post I am going to cover how to implement Row Level Security (RLS) when using AAS and how this can be done on PPU.

In my example below I am going to show creating two simple RLS roles which will limit data for the users who belong to the 2 roles.

Despite the simplification, we can see how row-level security applies to both products and how the two differ.
