Press "Enter" to skip to content

Curated SQL Posts

Parallelism In R

Florian Prive shows off a few methods for parallelizing code in R:

Parallelize with foreach

You need to do at least two things:

  • replace %do% with %dopar%. Basically, always use %dopar% because you can use registerDoSEQ() if you really want to run the foreach sequentially.

  • register a parallel backend using one of the packages that begin with do (such as doParallel, doMC, doMPI and more). I will list only the two main parallel backends because there are too many of them.

Check it out.  Florian spends a lot of time with foreach and doParallel, a little bit of time with flock, and mentions Microsoft R Open.  H/T R-Bloggers
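The post itself is all R, but if it helps to see the same register-a-pool-then-parallelize-the-loop idea in another language, here is a loose Python analogue using concurrent.futures; the simulate function is just a made-up stand-in for whatever work you would put inside the foreach body:

    from concurrent.futures import ProcessPoolExecutor

    # Not Florian's R code -- just a loose Python analogue of the same pattern:
    # pick a pool of workers ("register a backend"), then map the loop body over it.
    def simulate(seed):
        # placeholder for the per-iteration work
        return sum((seed * i) % 7 for i in range(100_000))

    if __name__ == "__main__":
        # roughly the equivalent of registerDoParallel(4) followed by %dopar%
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(simulate, range(20)))
        print(results[:5])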

Dealing With Trace Flags In The Registry

Wayne Sheffield shows us how to configure SQL Server trace flags within T-SQL using registry access commands:

In a recent post, I introduced you to how to work with the registry directly from within SQL Server. Continuing this theme, this post provides an example situation where you would do so.

In this example, we will want to configure SQL Server to enable a few trace flags (TF) when SQL Server starts. Specifically, let’s set trace flags 1117 and 1118 so that they are enabled when SQL Server starts up, and enable them now. Additionally, we have trace flags 1204 and 1222 now enabled on some servers, and we want to disable those (since we have the deadlocks being captured in the system health XE, we don’t need them in the error log also). We also don’t want to force a restart of the SQL Server services.

I’ve always felt a little icky about writing to the registry from SQL Server, but Wayne shows how to do it right.

Multi-Structured Data In U-SQL

Melissa Coates shows us how to use U-SQL to normalize JSON files in which different rows may have differing structures:

Handling the varying formats in U-SQL involves a few steps if it’s the first time you’ve done this:

  1. Upload custom JSON assemblies  [one time setup]

  2. Create a database   [one time setup]

  3. Register custom JSON assemblies   [one time setup]

  4. Upload JSON file to Azure Data Lake Store [manual step as an example–usually automated]

  5. Run U-SQL script to “standardize” the JSON file(s) into a consistent CSV column/row format

Melissa then shows us how to do this step-by-step.

Azure SQL Data Warehouse Patterns

Murshed Zaman shows us a couple of patterns and anti-patterns for Azure SQL Data Warehouse:

Azure SQL DW is a Massively Parallel Processing (MPP) data warehousing service. It is a service because Microsoft maintains the infrastructure and software patching to make sure it’s always running on up-to-date hardware and software in Azure. The service makes it easy for a customer to start loading tables on day one and running queries quickly, and it allows scaling of compute nodes when needed.

In an MPP database, table data is distributed among many servers (known as compute or slave nodes), and in many MPP systems shared-nothing storage subsystems are attached to those servers. Queries come through a head (or master) node where the location metadata for all the tables/data blocks resides. This head node knows how to deconstruct the query into smaller queries, introduce various data movement operations as needed, and pass the smaller queries on to the compute nodes for parallel execution. Data movement is needed to align the data by the join keys from the original query. Data movement in an MPP system is a whole blog topic by itself, which we will tackle in a different post. Besides Azure SQL DW, some other examples of MPP data warehouses are Hadoop (Hive and Spark), Teradata, Amazon Redshift, Vertica, etc.

The opposite of MPP is SMP (Symmetric Multiprocessing), which basically means traditional single-server systems. Until the invention of MPP, we had SMP systems. In the database world, examples are traditional SQL Server, Oracle, MySQL, etc. These SMP databases can be used for both OLTP and OLAP purposes.

Murshed spends the majority of this blog post covering things you should not do, which is probably for the best.

Drillthrough In Power BI Reports

Dustin Ryan shows off drillthrough actions in Power BI:

At some point this weekend, the Microsoft Power BI folks turned on the drillthrough feature in the Power BI service. This is the same drillthrough feature that was demonstrated during day 1 of the Microsoft Data Insights Summit, which you can read about here. So I thought it’d be good to quickly walk through how you can set up the drillthrough action.

First, navigate to a report that you’ve already published to the Power BI service. This report will become the target of your drill-through action. Start editing the report and you’ll notice that on the right side there is a Drillthrough Filters section.

Read on for more.  At the moment, this is only available within the Power BI service, not in Desktop.

Automating Table Creation Within Biml

Bill Fellows shows off some C# skills that he uses to build table definitions inside Biml:

Good grief, that’s a lot of code, how do I use it? The basic usage would be something like

 <Tables>
     <#= GetAstTableNodeFromQuery(this.RootNode.OleDbConnections["msdb"], "SELECT 100 AS demo", "dbo", "DemoQuery").GetBiml() #>
 </Tables>

The call to GetAstTableNodeFromQuery returns an AstTableNode, which is great, but what we really want is the Biml behind it, so we chain a call to .GetBiml() onto the end.

What would make that better, though, is to make it a little more dynamic. Let’s improve the code to create tables based on pairs of names and queries. I’m going to use a Dictionary called namedQueries to hold the names and queries and then enumerate through them, calling our GetAstTableNodeFromQuery for each entry.

Read on for the underlying code.  Bill also promises to refactor this code and make it more versatile.

Learning Naive Bayes

Sunil Ray explains the Naive Bayes algorithm:

What are the Pros and Cons of Naive Bayes?

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category in the test data set which was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.

  • On the flip side, Naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.

Read the whole thing.  Naive Bayes is such an easy algorithm, yet it works remarkably well for categorization problems.  It’s typically not the best solution, but it’s a great first solution.  H/T Data Science Central
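If you want to see those pros and cons in action, here is a minimal scikit-learn sketch (with a reasonably recent scikit-learn); the tiny weather data set is made up for illustration, alpha=1.0 on MultinomialNB is the Laplace smoothing mentioned above, and predict_proba gives the rough probabilities you shouldn’t take too seriously:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import OneHotEncoder

    # Toy categorical data, invented for illustration.
    X_raw = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "cool"]]
    y = ["no", "no", "yes", "yes"]

    # One-hot encode the categorical inputs; handle_unknown="ignore" keeps an
    # unseen category in the test set from blowing up the encoder.
    encoder = OneHotEncoder(handle_unknown="ignore")
    X = encoder.fit_transform(X_raw)

    # alpha=1.0 is Laplace (add-one) smoothing, which addresses the zero-frequency problem.
    model = MultinomialNB(alpha=1.0)
    model.fit(X, y)

    X_new = encoder.transform([["sunny", "cool"]])
    print(model.predict(X_new))
    print(model.predict_proba(X_new))  # probability estimates -- treat them with caution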

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works:

At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore the properties of the data being processed, which severely limits their applicability. Let us demonstrate this with a simple example. Consider a query shown below that filters a table t1 of size 500GB and joins the output with another table t2 of size 20GB. Spark implements this query using a hash join by choosing the smaller join relation as the build side (to build a hash table) and the larger relation as the probe side. Given that t2 is smaller than t1, Apache Spark 2.1 would choose the right side as the build side without factoring in the effect of the filter operator (which in this case filters out the majority of t1’s records). Choosing the incorrect side as the build side often forces the system to give up on a fast hash join and turn to sort-merge join due to memory constraints.

Click through for a very interesting look at this query optimizer.
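If you want to experiment with this yourself, a minimal PySpark sketch might look like the following; the table names match the example above, but the column names are hypothetical, and the spark.sql.cbo settings are the ones that shipped with the Spark 2.2 cost-based optimizer:

    from pyspark.sql import SparkSession

    # A minimal sketch of enabling the cost-based optimizer (Spark 2.2+).
    # t1/t2 follow the quoted example; the column names here are made up.
    spark = (SparkSession.builder
             .appName("cbo-demo")
             .config("spark.sql.cbo.enabled", "true")
             .config("spark.sql.cbo.joinReorder.enabled", "true")
             .getOrCreate())

    # Collect the table and column statistics the optimizer relies on;
    # without them, planning falls back to the structural heuristics described above.
    spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")
    spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR COLUMNS id")

    # With statistics in place, the filter's selectivity on t1 can change the
    # build-side choice for the hash join.
    df = spark.sql("""
        SELECT t1.id, t1.value
        FROM t1 JOIN t2 ON t1.id = t2.id
        WHERE t1.value < 100
    """)
    df.explain(True)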

Connecting To Kafka Via SSL

Harikiran Nayak shows how to work with secure Kafka connections:

First, the Kafka broker must be configured to accept client connections over SSL. Please refer to the Apache Kafka Documentation to configure your broker. If your Kafka cluster is already SSL-enabled, you can look up the port number in your Kafka broker configuration file (or the broker logs). Look for the listeners=SSL://host.name:port configuration option. To ensure that the Kafka broker is correctly configured to accept SSL connections, run the following command from the same host that you are running SDC on. If SDC is running from within a docker container, log in to that docker container and run the command.

Read on for more.
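The post focuses on configuring StreamSets Data Collector, but the same broker-side SSL settings apply to any client. Here is a hedged sketch using the kafka-python package, with placeholder host, topic, and certificate paths:

    from kafka import KafkaConsumer

    # Connecting to an SSL-enabled broker with kafka-python (not SDC itself).
    # The host, port, topic name, and certificate paths are placeholders.
    consumer = KafkaConsumer(
        "my-topic",
        bootstrap_servers="host.name:9093",       # the SSL listener from listeners=SSL://host.name:port
        security_protocol="SSL",
        ssl_cafile="/path/to/ca-cert.pem",        # CA that signed the broker's certificate
        ssl_certfile="/path/to/client-cert.pem",  # only needed if the broker requires client authentication
        ssl_keyfile="/path/to/client-key.pem",
    )

    for message in consumer:
        print(message.topic, message.partition, message.offset, message.value)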
