Press "Enter" to skip to content

Curated SQL Posts

CPU Hot-Add And NUMA

Frank Denneman discusses VMware NUMA behavior when you hot-add more CPUs:

But what happens when the VM is configured with fewer vCPUs than the core count of the physical CPU package and CPU Hot-Add is enabled? Will there be a performance impact? The answer is no. The VPD configured for the VM fits inside a NUMA node, and thus the CPU scheduler and the NUMA scheduler optimize memory operations. It’s all about memory locality. Let’s use an application workload test to determine the behavior of the VMkernel CPU scheduling.

For this test, I’ve installed DVD Store 3.0 and run some test loads on the MS-SQL server. To determine the baseline, I’ve logged in to the ESXi host via an SSH session and executed the command sched-stats -t numa-pnode. This command shows the CPU and memory configuration of each NUMA node in the system. The screenshot shows that the system is only running the ESXi operating system; hardly any memory is consumed. TotalMem indicates the total amount of physical memory in the NUMA node in KB. FreeMem indicates the amount of free physical memory in the NUMA node in KB.
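For reference, here’s the command he runs from an SSH session on the ESXi host, annotated with the two columns he calls out:

    sched-stats -t numa-pnode
    # Reports per-NUMA-node CPU and memory configuration, including:
    #   TotalMem - total physical memory in the NUMA node (KB)
    #   FreeMem  - free physical memory in the NUMA node (KB)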

Interesting reading.

Logistic Regression With R

Raghavan Madabusi runs through a sample logistic regression:

Input Variables: These variables are called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are called response or dependent variables. Since the output variable (Churn value) takes the binary form “0” or “1”, this falls under classification in supervised machine learning.
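The post works in R, but to get a feel for the overall shape of the exercise, here is a minimal sketch in Python with scikit-learn. The file name and column names are hypothetical stand-ins for the telco churn fields described above:

    # Minimal logistic regression sketch; file and column names are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("telco_churn.csv")
    # One-hot encode the categorical predictors (demographics, billing, services).
    X = pd.get_dummies(df[["gender", "SeniorCitizen", "tenure", "MonthlyCharges",
                           "Contract", "PaymentMethod"]], drop_first=True)
    y = df["Churn"].map({"No": 0, "Yes": 1})  # binary response: churned or not
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"Holdout accuracy: {model.score(X_test, y_test):.3f}")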

One of the interesting things in this post was the use of missmap, which is part of Amelia.

PowerShell Difficulties

Dave Mason shares some difficulties he has had grokking PowerShell:

The developer in me thinks this is nuts. Run the same few lines of code twice, with no changes in between, and get different outputs? Madness!

Here’s another example. Nothing too complex here: I connect to an instance of SQL, SELECT CURRENT_TIMESTAMP, and show the returned value in the output window. (There’s a fixable issue here that I would go on to discover later. But hold that thought for now.)

Even when you’re conceptually familiar with a language, getting into the particular foibles of that language can expose all sorts of behavior which is strange to newcomers.

Pipeline Architecture With Kafka

Alexandra Wang describes how Pandora Media has used Apache Kafka for real-time ad serving using Kafka Connect:

Our ad server publishes billions of messages per day to Kafka. We soon realized that writing a proprietary Kafka consumer able to handle that amount of data with the desired offset management logic would be non-trivial, especially when requiring exactly-once delivery semantics. We found that the Kafka Connect API paired with the HDFS connector developed by Confluent would be perfect for our use case.

We’ve also found it painful not having a central authority on data structures that can share their respective schemas across all services and applications. Without a central registry for message schemas, data serialization and deserialization for a variety of applications are troublesome, and the pipeline is fragile when schema evolution happens. We found that Schema Registry is a great solution to this problem.

To address the above two problems, we integrated the Kafka Connect API and Schema Registry into our Kafka-centered data pipeline.
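To make that concrete, a Kafka Connect HDFS sink is driven by a small properties file along these lines; every value below is illustrative rather than taken from Pandora’s actual setup:

    # Illustrative Kafka Connect HDFS sink configuration (all values made up).
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=4
    topics=ad-events
    hdfs.url=hdfs://namenode:8020
    flush.size=10000
    # Avro plus Schema Registry keeps producers and the sink agreeing on schemas.
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://schema-registry:8081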

Well worth reading, especially the difficulties that they’ve had during maintenance periods and in lower environments.

Community Localization For Cross-Platform Tools

Mona Nasr and Andy Gonzalez are looking for tool translation support:

The community has completed translations of the VS Code SQL Server extension into six languages: Brazilian Portuguese, French, Japanese, Italian, Russian, and Spanish.

We still need help with other languages. If you know anyone with language expertise, refer them to the Team Page.

Your contributions are valuable and will help us improve the product in your languages. We hope to continue working with the community in future projects.

Hit up the Team Page link to learn more about how to contribute.

Linear Regression In SQL

Phil Factor shows how to generate a quick linear regression using SQL, Powershell, and Gnuplot:

It looks a bit like someone has fired a shotgun at a wall, but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here, so we can be fairly confident of plotting a trendline.

Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.
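For reference, the least-squares quantities being computed are:

    slope:       b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
    intercept:   a = ȳ − b·x̄
    correlation: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Each term is just a sum over the rows, which is why the whole calculation is expressible in straight SQL.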

It’s good to know that this is possible, but I’d switch to R or Python long before resorting to it.

Power BI Row-Level Security

Steve Hughes has some resources on implementing row-level security in Power BI:

Row-level security is the ability to filter content based on a user’s role. There are two primary ways to implement row-level security in Power BI: through Power BI itself or using SSAS. Power BI Desktop has the ability to create roles based on DAX filters, which affect what users see in the various assets in Power BI.

In order for this to work, you will need to deploy to a Workspace where users only have read permissions. If the members of the group associated with the Workspace have edit permissions, row-level security in Power BI will be ignored.
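A role filter is just a DAX expression evaluated row by row against a table; a typical dynamic filter (with a hypothetical Email column) looks like this:

    [Email] = USERPRINCIPALNAME()

Rows for which the expression returns false are invisible to members of that role.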

Read on for more details as well as a set of how-to links.

Managing The Pace Of Change

Kellan Danielson and the rest of the Power Pivot Pro team discuss the pace of change in the data platform:

@djharshany I’ve found Pocket (https://getpocket.com/) really useful for saving items for later. I’m on a schedule as well – I save a lot of articles and then pore through them when I’m on an airplane or waiting in line somewhere. #productivityhack

I think this furious pace of technological development has made me much more aware 1) of the amount of noise out in the world that I’m safe ignoring and 2) of how we need to stay vigilant in producing content that cuts through the noise.

Given that these are people who specialize in the fastest-moving part of the Microsoft data platform, it’s worth getting their thoughts on the rapid pace of change.

Text Normalization With Spark

Engineers at Treselle Systems have put together a two-part series on text normalization using Apache Spark. First, they walk through normalizing the text:

We have used the Spark shared variable “broadcast” to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached in executors. “stopwords_en.txt” is not a large dataset, but we have used it in our use case to make use of that feature.

What are Broadcast Variables?
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead. With broadcast variables, they are shipped once to all executors and cached for future reference.
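As a rough PySpark sketch of that pattern (file paths are placeholders, and the original series may differ in detail), broadcasting the stop word list looks like this:

    # Rough sketch: broadcast a stop word list so each executor caches one copy.
    from pyspark import SparkContext

    sc = SparkContext(appName="text-normalization")
    # Read the stop word file once on the driver, then broadcast it.
    stopwords = set(sc.textFile("stopwords_en.txt").collect())
    stopwords_bc = sc.broadcast(stopwords)

    tokens = (sc.textFile("input.txt")
                .flatMap(lambda line: line.lower().split())
                .filter(lambda w: w not in stopwords_bc.value))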

From there, they dig into details on what the Spark engine did and why we see what we do:

Note: Stage 2 has both reduceByKey() and sortByKey() operations, and as indicated in the job summary, the saveAsTextFile() action triggered Job 2. Do you have any guess whether Stage 2 will be further divided into other stages in Job 2? The answer is yes. Job 2’s DAG clearly indicates the list of operations used before saveAsTextFile(). Because reduceByKey() and sortByKey() can both shuffle the data, Stage 2 in Job 1 is broken down into Stage 4 and Stage 5 in Job 2. There are three stages in this job, but Stage 3 is skipped; the reason is explained below in the “What does ‘Skipped Stages’ mean in Spark?” section.
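Continuing the sketch from above, you can see where those stage boundaries come from: each wide (shuffling) transformation ends a stage, and the action is what actually triggers the job:

    # reduceByKey() and sortByKey() each shuffle, so each one ends a stage;
    # saveAsTextFile() is the action that triggers the job.
    counts = (tokens.map(lambda w: (w, 1))
                    .reduceByKey(lambda a, b: a + b)  # shuffle boundary
                    .sortByKey())                     # another shuffle boundary
    counts.saveAsTextFile("word-counts-output")       # hypothetical output path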

There’s some good information here if you want to become more familiar with how Spark works.
