Press "Enter" to skip to content

Day: February 7, 2019

Sqoop From MySQL To Cloudera

Alan Choi and Laurel Hale show us how to use Sqoop to migrate data from MySQL into Impala:

The basic import steps described for tiny tables apply to importing bigger tables into Impala. The difference occurs when you construct your sqoop import command. For large tables, you want it to run fast, so setting parallelism to 1, which specifies one map task during the import, won't work well. Instead, using the default parallelism setting, which is 4 map tasks to import in parallel, is a good place to start. So you don't need to specify a value for the -m option unless you want to increase the number of parallel map tasks.
Another difference is that bigger tables usually have a primary key, which becomes a good candidate for splitting the data without skewing it. The tiny_table we imported earlier doesn't have a primary key. Also note that the -e option for the sqoop import command, which instructs Sqoop to import the data returned for the specified SQL statement, doesn't work if you split data on a string column. If string columns are used to split the data with the -e option, it generates incompatible SQL. So if you decide to split data on the primary key for your bigger table, make sure the primary key is on a column of a numeric data type, such as int, which works best with the -e option because it generates compatible SQL.

Read the whole thing. Sqoop has been around for a while because it does its job well.
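
To put the excerpt's advice in concrete terms, here is a minimal sketch of such an import, driven from a small Python wrapper for convenience. The connection string, table, and split column are placeholders rather than anything from the linked post, and the -e/--query form is omitted because, as the excerpt notes, splitting a free-form query on a string column can generate incompatible SQL.

```python
# Hypothetical sketch of a parallel Sqoop import; all connection details,
# table names, and paths are placeholders.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host/sales_db",   # placeholder source database
    "--username", "sqoop_user",
    "--password-file", "/user/sqoop/.mysql-password",
    "--table", "big_table",                            # placeholder table with a numeric primary key
    "--split-by", "id",                                # numeric PK avoids the string-split problem
    "--num-mappers", "4",                              # default parallelism; same as -m 4
    "--target-dir", "/user/hive/warehouse/big_table",
    "--as-parquetfile",
]

subprocess.run(sqoop_cmd, check=True)
```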


Wait Stats And Missing Indexes

Arthur Daniels explains that missing indexes can cause high wait stat counts to appear:

At first, this statement might sound a bit confusing. Usually, we expect wait statistics to show us what a query is waiting on, whether it’s waiting on memory, loading pages from disk, or any of the other numerous wait types.
Once you start collecting wait statistics, you'll have a lot of data to sort through. You might find waits like CXPACKET, CXCONSUMER, and PAGEIOLATCH. Surprisingly, these could mean that your databases aren't well indexed.

This makes sense. At its core, wait stats tell you where SQL Server is hurting: where the bottleneck is. But just like a person at the doctor, SQL Server can only be so specific in how it relates that pain to you, and that specificity generally boils down to hardware components. The solution might be "get more hardware," but as Arthur points out, writing better queries and using better indexes can mitigate those pains too.
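
If you want a quick look at which waits dominate on your own server before reading Arthur's analysis, something like the sketch below works as a starting point. This is not his script: the server name is a placeholder and the benign-wait exclusion list is heavily abbreviated.

```python
# Minimal sketch: pull the top waits to see whether parallelism and I/O waits
# (CXPACKET, CXCONSUMER, PAGEIOLATCH_*) dominate. Server and auth details are
# placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-sql-server;"
    "Trusted_Connection=yes;"
)

top_waits = conn.execute("""
    SELECT TOP (10)
           wait_type,
           wait_time_ms,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    -- a few benign waits; a production query filters out many more
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP')
    ORDER BY wait_time_ms DESC;
""").fetchall()

for wait_type, wait_time_ms, task_count in top_waits:
    print(f"{wait_type:<30} {wait_time_ms:>12} ms over {task_count} tasks")
```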


Pipelines Everywhere

John Mount explains the benefit of pipes and pipelines, and shows us an advanced pipe in R:

The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of each function’s arguments is primary, and the rest are parameters. For data science applications this is particularly common, so having convenient pipeline notation can be a plus. An example of a non-trivial data processing pipeline can be found here.

In this note we will discuss the advanced R pipeline operator “dot arrow pipe” and an S4 class (wrapr::UnaryFn) that makes working with pipeline notation much more powerful and much easier.

As you’d expect from John, there’s a lot of detail and it’s an interesting approach.


BPE: The Answer To A Question Nobody Asked

Randolph West is not a big fan of Buffer Pool Extension in practice:

Unfortunately there are some practical problems with the Buffer Pool Extension in 2019.
Firstly, let’s talk about the licensing contradiction with SQL Server Standard Edition. Since 2016 Service Pack 1, we have been able to access 128 GB of memory for the buffer pool, plus additional RAM for in-memory OLTP and Columnstore indexes. The practical limits of Standard Edition have been — for the most part — all but lifted. By the same token, if we can afford Enterprise Edition, we can afford more physical RAM. In other words, if we need more than 128 GB RAM for the buffer pool, perhaps we should move away from Standard Edition.

Title aside, Randolph makes a really good point: Buffer Pool Extension was one of those features which sounded great during development, but hardware quickly overtook it and made it all but irrelevant. On the whole, that’s a positive for us as SQL Server users.


Power BI Workspace V2

Reza Rad shows us the differences between Power BI Workspace V1 and V2:

Workspace version 2 has been available in the Power BI Service for more than six months now. The new version was introduced in August 2018; however, many people still don't know what it is, how it differs from the old version, and the most important question: should you create new workspaces in V2 or V1? Should you click the Try Now button when you create a new workspace or not? I have previously written about workspaces and their important role in creating a collaborative environment. In this post, I'll answer all of the questions above to help you make the right decision when creating a workspace. If you'd like to learn more about Power BI, read the Power BI from Rookie to Rock Star book.

I’ll admit I was unaware of V2 workspaces. This was interesting reading.


Testing SQL Logins For Weak Passwords

Tom LaRock shows how you can test SQL authenticated logins for weak passwords using a built-in SQL Server function:

DATA SECURITY IS HARD BECAUSE PEOPLE ARE DUMB.

Don’t just take my word for it though. Do a quick search for “common password list” and you’ll see examples of passwords scraped from breaches. These are passwords often used by default to secure systems and data.
Chances are, these passwords are in your environment, right now.
Here’s what you can do to protect your data.

Read on to see what you can do, both to detect weak passwords and to make it harder for users to use them.
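
A common way to do this kind of check is SQL Server's built-in PWDCOMPARE() function, which tests a candidate password against the stored hashes in sys.sql_logins; Tom's post has the full treatment. As a rough sketch of that idea driven from Python, with a placeholder server name and a toy wordlist:

```python
# Sketch: flag SQL logins whose password matches an entry in a small wordlist.
# Requires permission to read password_hash from sys.sql_logins.
import pyodbc

COMMON_PASSWORDS = ["password", "123456", "qwerty", "P@ssw0rd", "sa"]  # toy list

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-sql-server;"
    "Trusted_Connection=yes;"
)

weak = []
for candidate in COMMON_PASSWORDS:
    rows = conn.execute(
        "SELECT name FROM sys.sql_logins WHERE PWDCOMPARE(?, password_hash) = 1;",
        candidate,
    ).fetchall()
    weak.extend((row.name, candidate) for row in rows)

for login, password in weak:
    print(f"Login '{login}' is using the weak password '{password}'")
```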


Integrating Azure Data Factory With GitHub

Rayis Imayev shows us how to tie Azure Data Factory pipelines to GitHub, allowing automatic check-ins as ADF pipelines change:

Working with Azure Data Factory (ADF) enables me to build and monitor my Extract Transform Load (ETL) workflows in Azure. My ADF pipelines are a cloud version of the ETL projects I previously built in SQL Server SSIS.

And prior to this point, all my sample ADF pipelines were developed in the so-called "Live Data Factory Mode" using my personal workspace, i.e. all changes had to be published in order to be saved. This hasn't been best practice on my part, and I needed to start using a source control tool to preserve and version my development code.

Click through for a detailed demo.
