Press "Enter" to skip to content

Curated SQL Posts

Working With Missing Values In R

Anisa Dhana has a few examples of ways we can work with data containing missing values in R:

Imputation is a complex process that requires a good knowledge of your data. For example, it is crucial to know whether the missing is at random or not before you impute the data. I have read a nice tutorial which visualize the missing data and help to understand the type of missing, and another post showing how to impute the data with MICE package.

In this short post, I will focus on management of the missing data using the tidyverse package. Specifically, I will show how to manage missings in the long data format (i.e., more than one observation for id).

Anisa shows a few different techniques, depending upon what you need to do with the data.  I’d caution about using mean in the second example and instead typically prefer median, as replacing missing values with the median won’t alter the distribution in the way that it can with mean.

Comments closed

Configuring Kafka Streams For Least Privilege

Gwen Shapira explains how we can assign minimal rights to Kafka Streams and KSQL users:

The principle of least privilege dictates that each user and application will have the minimal privileges required to do their job. When applied to Apache Kafka® and its Streams API, it usually means that each team and application will have read and write access only to a selected few relevant topics.

Organizations need to balance developer velocity and security, which means that each organization will likely have their own requirements and best practices for access control.

There are two simple patterns you can use to easily configure the right privileges for any Kafka Streams application—one provides tighter security, and the other is for more agile organizations. First, we’ll start with a bit of background on why configuring proper privileges for Kafka Streams applications was challenging in the past.

Read the whole thing; “granting everybody all rights” generally isn’t a good idea, no matter what your data platform of choice may be.

Comments closed

Working With Key-Value Pairs In Spark

Teena Vashist shows us a few of the functions available with Spark for working with key-value pairs:

1. Creating Key/Value Pair RDD: 
The pair RDD arranges the data of a row into two parts. The first part is the Key and the second part is the Value. In the below example, I used a parallelize method to create a RDD, and then I used the length method to create a Pair RDD. The key is the length of the each word and the value is the word itself.

scala> val rdd = sc.parallelize(List("hello","world","good","morning"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val pairRdd = rdd.map(a => (a.length,a))
pairRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:26
scala> pairRdd.collect().foreach(println)
(5,hello)
(5,world)
(4,good)
(7,morning)

Click through for more operations.  Spark is a bit less KV-centric than classic MapReduce jobs, but there are still plenty of places where you want to use them.

Comments closed

Getting Maintenance Plan Information From Powershell

Shane O’Neill gives us the low-down on what we need to do in order to retrieve maintenance plan information from SQL Server using Powershell:

It’s surprisingly difficult to get this information in SQL Server. In fact I was quite stuck trying to figure out how to get this information when I realized that the good people over at Brent Ozar Unlimited already do some checking on this for their sp_Blitz tool.

A quick look at the above code showed me that dbo.sysssispackages was what I was looking for. Combine this with:

  • 1. Some hack-y SQL for the frequency in human readable format, and
  • 2. Some even more hack-y SQL to join on the SQL Agent Job name

And we had pretty much the XML for the Maintenance Plan and the SQL Agent Job Schedule that they were against.

Shane has made the code available as well, so check it out if you have any maintenance plans you’re trying to understand and maybe even get away from.

Comments closed

Gaining Business Understanding Through Paying Attention

Laura Ellis lays down some good tips for understanding business problems:

I know this sounds somewhat silly. But, when thinking through the steps that I take to solve a business problem, I realized that I do employ a strategy. The backbone of that strategy is based on the principals of solving a word problem. Yes, that’s right. Does anyone else remember staring at those first complex word problems as a kid and not quite knowing where to start? I do! However, when my teacher provided us with strategies to break down the problem into less intimidating, actionable steps, everything became rather doable. The steps: circle the question, highlight the important information and cross out unnecessary information. Do these steps and all of a sudden the problem is simplified and much less scary. What a relief! By employing the same basic strategy, we too can feel that sense of calm when working on a business problem.

It sounds blase but paying attention to what people are saying (or writing) versus hearing a few words and assuming the rest.

Comments closed

SQL Managed Instance Business Critical Tier Now Available

Jovan Popovic announces Azure SQL Managed Instance Business Critical tier has reached GA:

We are happy to announce General availability of Business Critical tier in Azure SQL Managed Instance – architectural model built for high-performance and IO demanding databases.

After 5 months of public preview period Azure SQL Managed Instance Business Critical Service tier is generally available.

Azure SQL Managed Instance Business Critical tier is built for high performance databases and applications that require low IO latency of 1-2ms in average with up to 100K IOPS that can be achieved using fast local SSD that this tier uses to place database files.

Click through to see what Business Critical tier in particular has to offer.

Comments closed

The Power Of Dual Storage Mode For Power BI Aggregations

Reza Rad continues a series on Power BI aggregations by explaining how using the Dual storage mode can make queries faster if you use both Import and DirectQuery sources:

This is not what we actually expect to see. The whole purpose of Sales Agg table is to speed up the process from DirectQuery mode, but we are still querying the DimDate from the database. So, what is the solution? Do we change the storage mode of DimDate to Import? If we do that, then what about the connection between DimDate and FactInternetSales? We want that connection to work as DirectQuery of course.

Now that you learned about the challenge, is a good time to talk about the third storage mode; Dual.

Read on for an example-filled tutorial.

Comments closed

A Review Of AWS Managed Kafka Public Preview

Stephane Maarek is not impressed with AWS’s managed Kafka offering so far:

For me, the more people use Apache Kafka, the more business I get. As I teach Apache Kafka online on Udemy (links at https://kafka-tutorials.com/), the prospect of having an entire user base from AWS wanting to learn Apache Kafka is exciting! And as an Apache Kafka consultant, it’s always more fun to spend time deploying data pipelines than deploying infrastructure.

Unfortunately what AWS released today misses the mark. I think it’s reminiscent of managed services of open source software in AWS overall: they’re released early and lack features that I think should be MVP. In my opinion this will deter future users.

Based on Stephane’s reading, this is a product which should have sat in development for another 3-6 months to flesh out the features, upgrade the version of Kafka used, etc.  Definitely read this before jumping on AWS MSK.

Comments closed

Load Testing Spark To MongoDB

Abdelghani Tassi has a quick load test to see how fast Spark can load data into MongoDB:

Recently, my company faced the serious challenge of loading a 10 million rows of CSV-formatted geographic data to MongoDB in real-time.

We first tried to make a simple Python script to load CSV files in memory and send data to MongoDB. Processing 10 million rows this way took 26 minutes!

26 minutes for processing a dataset in real-time is unacceptable so we decided to proceed differently.

I’m not sure the test was totally fair, but the results comport to my biases…  There is some good advice here:  storing data in optimized formats (Parquet in this instance) can make a big difference, Spark is useful for ETL style operations, and Scala is generally the fastest language in the Spark world.

Comments closed

Hive And Spark Integrated Together

Bikas Saha and Saumitra Buragohain share some of the direction the Apache Hive team is going in version 3:

The latest release of Apache Hive 3 (part of HDP 3) provides significant new capabilities including ACID support for data ingest. This functionality has many applications, a crucial one being privacy support for data modifications and deletions for GDPR. In addition, ACID also significantly reduces the time to ingest for data, thereby improving data freshness for Hive queries. To provide these features, Hive needs to take full control of the files that store the table data and thus this data is no longer directly accessible by third party systems like Apache Spark. Thus Apache Spark’s built-in support for Hive table data is no longer supported for data managed by Hive 3.

At the same time, Apache Spark has become the de-facto standard for a wide variety complex processing use cases on Big Data. This includes data stored in Hive 3 tables and thus we need a way to provide efficient, high-performance, ACID compliant access to Hive 3 table data from Spark. Fortunately, Apache Spark supports a pluggable approach for various data sources and Apache Hive itself can also be considered as one data source. We have implemented the Hive Warehouse Connector (HWC) as library to provide first class support for Spark to read Hive 3 data for subsequent complex processing (like machine learning) in Spark.

Spark is also commonly used to ETL raw data into Hive tables and this scenario should continue to be supported in the Hive ACID world. To do that, HWC integrates with the latest Hive Streaming APIs to support ingest into Hive both from batch jobs as well as structured streaming jobs.

Overall the Hive Warehouse connector provide efficient read write access to Hive warehouse data from Spark jobs, while providing transparent user identity propagation and maintaining consistent security and access control.

Spark has had some dependencies on Hive (or at least expectations of certain Hive conventions like /tmp/hive existing), but the two systems have historically been more (friendly) competitors than tools integrated in the same chain.

Comments closed