Press "Enter" to skip to content

Curated SQL Posts

Exploratory Data Analysis In R

Laura Ellis walks us through some easy techniques for learning about our data using R:

DIM AND GLIMPSE

Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.

And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data types and sample data.

Spending some quality time doing EDA can save you in the long run, as it can help you get a feel for things like data quality, the distributions of variables, and completeness of data.
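For readers who work in Python rather than R, here is a rough pandas analogue of those same first-look checks (dim roughly maps to .shape, glimpse to .info() plus .head()). The column names and values below are invented for illustration and are not from the original post:

```python
import pandas as pd

# A tiny illustrative frame standing in for the dataset used in the post
df = pd.DataFrame(
    {"city": ["Seattle", "Portland", "Spokane"], "temp": [71.5, 73.0, 80.2]}
)

# dim: the dimensions of the table as (rows, columns)
print(df.shape)

# glimpse: a vertical preview of column names, dtypes, and non-null counts
df.info()

# plus a few sample rows
print(df.head())
```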

Comments closed

Using Datadog To Monitor Spark Clusters On EMR

Priya Matpadi walks us through one way to monitor Spark clusters on Amazon Elastic MapReduce:

We recently implemented a Spark streaming application, which consumes data from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.

As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted to ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.

In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.

Check it out.  If this is interesting, Priya’s blog has the full series.

Comments closed

Pivoting With Spark SQL

MaryAnn Xue shows us how to use the PIVOT operator in Spark SQL:

Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.

The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperature recordings in Seattle, we’ll show how we can use this common SQL Pivot feature to achieve complex data transformations.

The syntax is quite similar to the PIVOT syntax that SQL Server uses.
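As a rough sketch of what that looks like from PySpark (the table, columns, and temperature values are invented for illustration, loosely echoing the Seattle temperatures example the post mentions):

```python
# A minimal sketch of Spark SQL's PIVOT clause (Spark 2.4+), run via spark.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

spark.createDataFrame(
    [("2018-06-01", 72.0), ("2018-06-02", 75.5), ("2018-07-01", 81.0), ("2018-08-01", 79.5)],
    ["day", "temp"],
).createOrReplaceTempView("temps")

# Rotate the distinct month values into their own columns, aggregating with AVG
pivoted = spark.sql("""
    SELECT * FROM (
        SELECT month(to_date(day)) AS month, temp FROM temps
    )
    PIVOT (
        AVG(temp) FOR month IN (6 AS JUN, 7 AS JUL, 8 AS AUG)
    )
""")
pivoted.show()
```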

Comments closed

Showing Forecasts With Actuals In Power BI

Alberto Ferrari shows us how we can incorporate actuals and forecasted values in the same Power BI visuals:

The Forecast measure in the demo model is quite an advanced piece of DAX code that would require a full article by itself. The curious reader will find more information on how to reallocate budget at different granularities in the video Budgeting with Power BI. In this article, we use the Forecast measure without detailed explanations; our goal is to explain how to compute the next measure: Remaining Forecast.

The Remaining Forecast measure must analyze the Sales table, finding the last day for which there are sales, and only then computing the forecasts.

Read the whole thing.

Comments closed

Logistic Regression With Apache Spark

Manoj Gautam shows how to perform a logistic regression with Apache Spark:

Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. There are 2 ways we can do this.

  1. Category Indexing
  2. One-Hot Encoding

Here, we will use a combination of StringIndexer and OneHotEncoderEstimator to convert the categorical variables. The OneHotEncoderEstimator will return a SparseVector.

Click through for the code and explanation.
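In the meantime, here is a minimal PySpark sketch of the indexing and encoding flow described above, for Spark 2.3/2.4 where OneHotEncoderEstimator exists (it was renamed OneHotEncoder in Spark 3.x). The column names and rows are illustrative and not taken from Manoj's dataset:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("logreg-demo").getOrCreate()

df = spark.createDataFrame(
    [("yes", "blue", 34.0, 1.0), ("no", "red", 12.5, 0.0), ("yes", "red", 20.0, 1.0)],
    ["subscribed", "colour", "amount", "label"],
)

# 1. Category indexing: map each string category to a numeric index
indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx") for c in ["subscribed", "colour"]
]

# 2. One-hot encoding: expand the indexes into sparse dummy vectors
encoder = OneHotEncoderEstimator(
    inputCols=["subscribed_idx", "colour_idx"],
    outputCols=["subscribed_vec", "colour_vec"],
)

# Assemble the encoded categoricals plus numeric columns into one features vector
assembler = VectorAssembler(
    inputCols=["subscribed_vec", "colour_vec", "amount"], outputCol="features"
)

lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=indexers + [encoder, assembler, lr]).fit(df)
model.transform(df).select("features", "label", "prediction").show(truncate=False)
```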

Comments closed

Test Data Generation In SQL Server

Ahmad Yaseen walks through a couple of techniques for creating test data in SQL Server:

Generating test data to fill the development database tables can also be performed easily, without wasting time writing scripts for each data type, by using third-party tools. You can find various tools in the market that can be used to generate testing data. One of these wonderful tools is dbForge Data Generator for SQL Server. It is a powerful GUI tool for fast generation of meaningful test data for development databases. The dbForge data generation tool includes 200+ predefined data generators with sensible configuration options that allow you to emulate column-intelligent random data. The tool also allows generating demo data for SQL Server databases already filled with data, as well as creating your own custom test data generators. dbForge Data Generator for SQL Server can save the time and effort spent on demo data generation by populating SQL Server tables with millions of rows of sample data that look just like real data. It helps populate tables with the most frequently used data types, such as Basic, Business, Health, IT, Location, Payment, and Person.

I have a love-hate relationship with test data generation tools, as they tend not to create reasonable data, where reasonable is a combination of domain (hi, birth date in the early 1800s!) and distribution.
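To make the domain-and-distribution point concrete, here is a small, purely illustrative Python sketch contrasting values that are merely valid with values constrained to a sensible domain and skewed toward a realistic distribution; none of the numbers come from the article or from dbForge:

```python
import random
from datetime import date, timedelta

random.seed(42)

# Naive: uniform birth dates anywhere from 1800 onward -- valid, but unreasonable
naive_birth = date(1800, 1, 1) + timedelta(days=random.randint(0, 80000))

# Better domain: adults born between 1940 and 2005
start, end = date(1940, 1, 1), date(2005, 12, 31)
birth = start + timedelta(days=random.randint(0, (end - start).days))

# Naive: order amounts drawn uniformly vs. a long-tailed, mostly-small distribution
naive_amount = round(random.uniform(0, 100000), 2)
amount = round(random.lognormvariate(mu=3.5, sigma=1.0), 2)

print(naive_birth, birth, naive_amount, amount)
```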

Comments closed

Testing Package Properties With ssisUnit

Bartosz Ratajczyk shows how you can test certain properties on an Integration Services package using ssisUnit:

The command is simple. You can get or set the property using the value for a given property path. As usual – when you get the value, you leave the value blank. The path – well – is the path to the element in the package or the project. You use backslashes to separate elements in the package tree, and at the end, you use .Properties[PropertyName] to read the property. If you use the elements collection – like connection managers – you can pick a single element using square brackets and the name of this element.

Read on for more, including limitations and useful testing scenarios.

Comments closed

Thoughts On Snowflake DB

Koen Verbeeck shares some thoughts after working with Snowflake DB for a few months:

Let’s start with the positive.

  • Snowflake is a really scalable database. Storage is virtually limitless, since the data is stored on blob storage (S3 on AWS and Blob Storage on Azure). The compute layer (called warehouses) is completely separated from the storage layer and you can scale it independently from storage.

  • It is really easy to use. This is one of Snowflake’s core goals: make it easy to use for everyone. Most of the technical aspects (clustering, storage, etc.) are hidden from the user. If you thought SQL Server was easy with its “next-next-finish” installation, you’ll be blown away by Snowflake. I really like this aspect, since you have really powerful data warehousing at your fingertips, and the only thing you have to worry about is how to get your data into it. With Azure SQL DW, for example, you have to think about the distribution of the data, how you are going to set things up, etc. Not here.

It’s not all positive, but Koen seems quite happy to work with the product.

Comments closed

SMO And Clear-Text Passwords

Cody Konior looks at a case where SMO can leak SQL authentication passwords:

SMO connects to SQL Server using the ADO.NET SQLClient library, which has 13+ years of features that help mask the passwords you pass in for SQL Authentication. SMO bypasses some of those features, often leaking the passwords in clear-text.

We’ll prove it through repeatable tests that can be used to track if Microsoft fix the problem or not.

Read the whole thing.

Comments closed

Automated Testing With Power Query

Fred Kaffenberger walks us through handling query failures with Power Query:

I loved Nar’s post on Automated Testing using DAX. I especially like the rule of always including controls so that business readers can share responsibility for data quality. For my part, I sometimes use hidden pages in Power BI reports to assure myself of data quality. I also set alerts on testing dashboards in the Power BI Service to notify me if something is not right. Sometimes, however, a more proactive approach is needed. So, we’ll be doing automated testing with Power Query.

If the query can’t connect to the data source, it will fail. When this happens, the report in Power BI Service is stale, but accurate. I’m fine with this.

It can also happen that the query succeeds but is incomplete. In this case, the result is that the report is wrong. Why does this happen? It can happen because of an overtaxed transactional data source. The ERP or CRM or work order system just can’t deliver the amount of data. Maybe it’s linked SQL tables using ODBC. For whatever reason, the query succeeds, but data is missing. I’m NOT fine with this. The long-term solution is to move to a more reliable data source (data warehouse, anybody?). In the short run, refreshes must be stopped. Stale data is better than bad data.

Also check out the comments.

Comments closed