Day: November 5, 2018

Logistic Regression With Apache Spark

Manoj Gautam shows how to perform a logistic regression with Apache Spark:

Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. There are 2 ways we can do this.

  1. Category Indexing
  2. One-Hot Encoding

Here, we will use a combination of StringIndexer and OneHotEncoderEstimator to convert the categorical variables. The OneHotEncoderEstimator will return a SparseVector.

Click through for the code and explanation.
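As a rough sketch of the pattern the excerpt describes (assuming Spark 2.3+, where OneHotEncoderEstimator was introduced, and a hypothetical DataFrame df with a string column "job"):

```python
# A minimal sketch, not the code from the post: index a categorical
# string column, then one-hot encode the index into a SparseVector.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

indexer = StringIndexer(inputCol="job", outputCol="job_index")
encoder = OneHotEncoderEstimator(inputCols=["job_index"],
                                 outputCols=["job_vec"])

# df is assumed to exist already, with a categorical "job" column.
model = Pipeline(stages=[indexer, encoder]).fit(df)
model.transform(df).select("job", "job_index", "job_vec").show(5)
```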

Test Data Generation In SQL Server

Ahmad Yaseen walks through a couple of techniques for creating test data in SQL Server:

Generating test data to fill the development database tables can also be performed easily and without wasting time writing scripts for each data type or using third-party tools. You can find various tools in the market that can be used to generate testing data. One of these wonderful tools is dbForge Data Generator for SQL Server. It is a powerful GUI tool for fast generation of meaningful test data for development databases. The tool includes 200+ predefined data generators with sensible configuration options that allow you to emulate column-intelligent random data, generate demo data for SQL Server databases already filled with data, and create your own custom test data generators. dbForge Data Generator for SQL Server can save the time and effort you spend on demo data generation by populating SQL Server tables with millions of rows of sample data that look just like real data, covering the most frequently used data types, such as Basic, Business, Health, IT, Location, Payment, and Person.

I have a love-hate relationship with test data generation tools, as they tend not to create reasonable data, where “reasonable” covers both domain (hi, birth dates in the early 1800s!) and distribution.
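As a toy illustration of that point (pure Python, with made-up age parameters), a generator that respects both domain and distribution might look like this, as opposed to sampling dates uniformly between 1800 and today:

```python
# Illustrative sketch only: keep birth dates in a plausible domain and
# draw ages from a rough distribution instead of a uniform one.
import random
from datetime import date, timedelta

def random_birth_date(min_age=18, max_age=95, mean_age=42, sd=14):
    """Draw an age from a clipped normal distribution and convert it to a date."""
    age = min(max(random.gauss(mean_age, sd), min_age), max_age)
    return date.today() - timedelta(days=int(age * 365.25))

print([random_birth_date() for _ in range(5)])  # clustered around middle age
```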

Testing Package Properties With ssisUnit

Bartosz Ratajczyk shows how you can test certain properties on an Integration Services package using ssisUnit:

The command is simple. You can get or set the property using the value for a given property path. As usual – when you get the value, you leave the value blank. The path – well – is the path to the element in the package or the project. You use backslashes to separate elements in the package tree, and at the end, you use .Properties[PropertyName] to read the property. If you use an elements collection – like connection managers – you can pick a single element using square brackets and the name of that element.
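Piecing that description together, a property path looks something like the following (these exact paths are hypothetical illustrations; see the post for real ones):

```
\Package.Properties[Description]
\Package\Sequence Container\Script Task.Properties[Name]
\Package.Connections[SourceDB].Properties[ConnectionString]
```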

Read on for more, including limitations and useful testing scenarios.

Thoughts On Snowflake DB

Koen Verbeeck shares some thoughts after working with Snowflake DB for a few months:

Let’s start with the positive.

  • Snowflake is a really scalable database. Storage is virtually limitless, since the data is stored on blob storage (S3 on AWS and Blob Storage on Azure). The compute layer (called warehouses) is completely separated from the storage layer and you can scale it independently from storage.

  • It is really easy to use. This is one of Snowflake’s core goals: make it easy to use for everyone. Most of the technical aspects (clustering, storage, etc.) are hidden from the user. If you thought SQL Server was easy with its “next-next-finish” installation, you’ll be blown away by Snowflake. I really like this aspect, since you have really powerful data warehousing at your fingertips, and the only thing you have to worry about is how to get your data into it. With Azure SQL DW, for example, you have to think about distribution of the data, how you are going to set things up, and so on. Not here.

It’s not all positive, but Koen seems quite happy to work with the product.
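As a flavor of the compute/storage separation Koen describes, resizing a warehouse is a one-line operation that never touches the stored data. A hypothetical sketch using the snowflake-connector-python package (account, credentials, and warehouse name are all invented):

```python
# Hypothetical sketch: scale the compute layer (a warehouse) up and
# back down; the storage layer is unaffected either way.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # invented for illustration
    user="my_user",
    password="my_password",
)
cur = conn.cursor()
cur.execute("ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE'")   # scale up
cur.execute("ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL'")  # scale down
conn.close()
```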

SMO And Clear-Text Passwords

Cody Konior looks at a case where SMO can leak SQL authentication passwords:

SMO connects to SQL Server using the ADO.NET SQLClient library which has 13+ years of features which help mask the passwords you pass in for SQL Authentication. SMO bypasses some of those features to often leak the passwords in clear-text.

We’ll prove it through repeatable tests that can be used to track whether Microsoft fixes the problem.

Read the whole thing.

Automated Testing With Power Query

Fred Kaffenberger walks us through detecting query failures with Power Query:

I loved Nar’s post on Automated Testing using DAX. I especially like the rule of always including controls so that business readers can share responsibility for data quality. For my part, I sometimes use hidden pages in Power BI reports to assure myself of data quality. I also set alerts on testing dashboards in the Power BI Service to notify me if something is not right. Sometimes, however, a more proactive approach is needed. So, we’ll be doing automated testing with Power Query.

If the query can’t connect to the data source, it will fail. When this happens, the report in the Power BI Service is stale, but accurate. I’m fine with this.

It can also happen that the query succeeds but is incomplete. In this case, the result is that the report is wrong. Why does this happen? It can happen because of an overtaxed transactional data source. The ERP or CRM or work order system just can’t deliver the amount of data. Maybe it’s linked SQL tables using ODBC. For whatever reason, the query succeeds, but data is missing. I’m NOT fine with this. The long-term solution is to move to a more reliable data source (data warehouse, anybody?). In the short run, refreshes must be stopped. Stale data is better than bad data.
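Fred’s guards are written in Power Query’s M, of course; as a language-neutral sketch of the underlying idea (the threshold and loader function here are hypothetical), the pattern is simply to treat a suspiciously small result as a failure:

```python
# Sketch of the concept, not Fred's M code: refuse to accept a refresh
# that returns implausibly little data.
def check_refresh(rows, expected_min=10_000):
    """Raise if the refreshed result set is too small to be believable."""
    if len(rows) < expected_min:
        raise RuntimeError(
            f"Refresh returned {len(rows)} rows, expected at least "
            f"{expected_min}; keep the previous (stale) data instead."
        )
    return rows

# Hypothetical usage: load_from_erp() stands in for the real source query.
# data = check_refresh(load_from_erp())
```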

Also check out the comments.
