
Month: September 2017

Multi-Structured Data In U-SQL

Melissa Coates shows us how to use U-SQL to normalize JSON files in which different rows may have differing structures:

Handling the varying formats in U-SQL involves a few steps if it’s the first time you’ve done this:

  1. Upload custom JSON assemblies  [one time setup]

  2. Create a database   [one time setup]

  3. Register custom JSON assemblies   [one time setup]

  4. Upload JSON file to Azure Data Lake Store [manual step as an example–usually automated]

  5. Run U-SQL script to “standardize” the JSON file(s) into a consistent CSV column/row format

Melissa then shows us how to do this step-by-step.
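
For a sense of what step 5 looks like, here is a minimal, hedged U-SQL sketch rather than Melissa's actual script; the database name JsonDemo, the file paths, and the column list are all assumptions, and it presumes the Newtonsoft.Json and Microsoft.Analytics.Samples.Formats assemblies were registered in steps 1-3:

    // Reference the JSON assemblies registered in the JsonDemo database (steps 1-3).
    REFERENCE ASSEMBLY JsonDemo.[Newtonsoft.Json];
    REFERENCE ASSEMBLY JsonDemo.[Microsoft.Analytics.Samples.Formats];

    USING Microsoft.Analytics.Samples.Formats.Json;

    // Extract the JSON documents; properties missing from a given document
    // come back as null, which is how the varying structures get flattened
    // into one consistent shape.
    @json =
        EXTRACT CustomerId string,
                Name       string,
                Region     string
        FROM "/raw/customers/{*}.json"
        USING new JsonExtractor();

    // Write the standardized rows out as CSV.
    OUTPUT @json
    TO "/standardized/customers.csv"
    USING Outputters.Csv(outputHeader : true);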


Azure SQL Data Warehouse Patterns

Murshed Zaman shows us a couple of patterns and anti-patterns for Azure SQL Data Warehouse:

Azure SQL DW is a Massively Parallel Processing (MPP) data warehousing service. It is a service because Microsoft maintains the infrastructure and software patching to make sure it's always running on up-to-date hardware and software on Azure. The service makes it easy for a customer to start loading their tables on day one and start running queries quickly, and it allows scaling of compute nodes when needed.

In an MPP database, table data is distributed among many servers (known as compute or slave nodes), and in many MPP systems shared-nothing storage subsystems are attached to those servers. Queries come through a head (or master) node where the location metadata for all the tables/data blocks resides. This head node knows how to deconstruct the query into smaller queries, introduce various data movement operations as needed, and pass the smaller queries on to the compute nodes for parallel execution. Data movement is needed to align the data by the join keys from the original query. Data movement in an MPP system is a whole topic by itself, which we will tackle in a different blog. Besides Azure SQL DW, some other examples of MPP data warehouses are Hadoop (Hive and Spark), Teradata, Amazon Redshift, Vertica, etc.

The opposite of MPP is SMP (Symmetric Multiprocessing), which basically means the traditional single-server system. Until the invention of MPP, we had SMP systems. In the database world, examples are traditional SQL Server, Oracle, MySQL, etc. These SMP databases can also be used for both OLTP and OLAP purposes.
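
As a rough T-SQL sketch of that distribution idea (the table and column names here are made up, not from Murshed's post), both tables are hash-distributed on the join key, so matching rows land on the same compute node and a join on CustomerId needs no data movement:

    -- Hash-distribute the fact table on the join key.
    CREATE TABLE dbo.FactSales
    (
        SaleId     BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        SaleAmount DECIMAL(18,2) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);

    -- Distribute the dimension on the same key so the join is co-located.
    CREATE TABLE dbo.DimCustomer
    (
        CustomerId   INT           NOT NULL,
        CustomerName NVARCHAR(200) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerId), HEAP);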

Murshed spends the majority of this blog post covering things you should not do, which is probably for the best.


Drillthrough In Power BI Reports

Dustin Ryan shows off drillthrough actions in Power BI:

At some point this weekend, the Microsoft Power BI folks turned on the drillthrough feature in the Power BI service. This is the same drillthrough feature that was demonstrated during day 1 of the Microsoft Data Insights Summit, which you can read about here. So I thought it'd be good to quickly walk through how you can set up the drillthrough action.

First, navigate to a report that you’ve already published to the Power BI service. This report will become the target of your drill-through action. Start editing the report and you’ll notice that on the right side there is a Drillthrough Filters section.

Read on for more.  At the moment, this is only available within the Power BI service, not in Desktop.


Automating Table Creation Within Biml

Bill Fellows shows off some C# skills that he uses to build table definitions inside Biml:

Good grief, that's a lot of code. How do I use it? The basic usage would be something like:

    <Tables>
        <#=  GetAstTableNodeFromQuery(this.RootNode.OleDbConnections["msdb"], "SELECT 100 AS demo", "dbo", "DemoQuery").GetBiml() #>
    </Tables>

The call to GetAstTableNodeFromQuery returns an AstTableNode, which is great, but what we really want is the Biml behind it, so we chain a call to .GetBiml() onto the end.

What would make that better, though, is to make it a little more dynamic. Let's improve the code to create tables based on pairs of names and queries. I'm going to use a Dictionary called namedQueries to hold the names and queries and then enumerate through them, calling our GetAstTableNodeFromQuery for each entry.
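
As a rough idea of what that enumeration might look like as a BimlScript nugget (this is a sketch, not Bill's code; it assumes namedQueries is a Dictionary<string, string> mapping table names to queries):

    <Tables>
        <# foreach (var entry in namedQueries) { #>
        <#=  GetAstTableNodeFromQuery(this.RootNode.OleDbConnections["msdb"], entry.Value, "dbo", entry.Key).GetBiml() #>
        <# } #>
    </Tables>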

Read on for the underlying code.  Bill also promises to refactor this code and make it more versatile.


Learning Naive Bayes

Sunil Ray explains the Naive Bayes algorithm:

What are the Pros and Cons of Naive Bayes?

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category in the test data set which was not observed in the training data set, the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing technique; one of the simplest is Laplace estimation (a worked form of the formula follows this list).

  • On the other side, Naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.
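
To make the "Zero Frequency" fix concrete, Laplace (add-one) smoothing replaces the raw conditional probability estimate with the following, where $N_{c,v}$ is the number of training rows of class $c$ with category value $v$, $N_c$ is the total number of rows of class $c$, and $K$ is the number of distinct values the variable can take:

$$ P(x = v \mid c) = \frac{N_{c,v} + 1}{N_c + K} $$

Even when $N_{c,v} = 0$ for a value never seen in training, the estimate stays above zero, so the classifier can still score the record.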

Read the whole thing.  Naive Bayes is such an easy algorithm, yet it works remarkably well for categorization problems.  It’s typically not the best solution, but it’s a great first solution.  H/T Data Science Central


Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works:

At its core, Spark's Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query's structure and ignore the properties of the data being processed, which severely limits their applicability. Let us demonstrate this with a simple example. Consider a query shown below that filters a table t1 of size 500GB and joins the output with another table t2 of size 20GB. Spark implements this query using a hash join by choosing the smaller join relation as the build side (to build a hash table) and the larger relation as the probe side. Given that t2 is smaller than t1, Apache Spark 2.1 would choose the right side as the build side without factoring in the effect of the filter operator (which in this case filters out the majority of t1's records). Choosing the incorrect side as the build side often forces the system to give up on a fast hash join and turn to sort-merge join due to memory constraints.
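
For reference, this is roughly how the cost-based optimizer gets switched on and fed statistics via Spark SQL in Spark 2.2 (the column names here are illustrative, not taken from the example query):

    -- Enable the cost-based optimizer (off by default).
    SET spark.sql.cbo.enabled=true;

    -- Collect table-level and column-level statistics so the optimizer can
    -- estimate cardinalities after the filter and pick the right build side.
    ANALYZE TABLE t1 COMPUTE STATISTICS;
    ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value;
    ANALYZE TABLE t2 COMPUTE STATISTICS;
    ANALYZE TABLE t2 COMPUTE STATISTICS FOR COLUMNS id;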

Click through for a very interesting look at this query optimizer.


Connecting To Kafka Via SSL

Harikiran Nayak shows how to work with secure Kafka connections:

First, the Kafka broker must be configured to accept client connections over SSL. Please refer to the Apache Kafka Documentation to configure your broker. If your Kafka cluster is already SSL-enabled, you can look up the port number in your Kafka broker configuration file (or the broker logs). Look for the listeners=SSL://host.name:port configuration option. To ensure that the Kafka broker is correctly configured to accept SSL connections, run the following command from the same host that you are running SDC on. If SDC is running from within a docker container, log in to that docker container and run the command.

Read on for more.


Finding Out Whodunnit Using The Transaction Log

David Fowler shows us how to figure out which user made a bad data change when you don’t have auditing mechanisms in place:

So it's looking like things are in a bad way. Obviously we could go to a backup and get the old values back, but that's never going to tell us who made the change. So, back to that transaction log: how do we actually go about getting our hands dirty and having a look at it?

Well, there's a nice little undocumented function called fn_dblog. Let's try giving that a go and see what we get back. By the way, the two parameters are the first and last LSNs that you want to look between. Leaving them as NULL will return the entire log.
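
As a hedged illustration of the kind of query involved (not David's exact script), something along these lines lists each transaction in the log with the login that started it, by passing the transaction's SID to SUSER_SNAME():

    -- List transactions from the log, with the login that began each one.
    -- NULL, NULL means no LSN range filter, i.e. the entire log.
    SELECT [Transaction ID],
           [Transaction Name],
           [Begin Time],
           SUSER_SNAME([Transaction SID]) AS LoginName
    FROM fn_dblog(NULL, NULL)
    WHERE Operation = N'LOP_BEGIN_XACT';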

This is great unless you have connection pooling and the problem happened through an application.  In that case, the returned username will be the application’s username.


CHECKDB On Azure SQL Database

Arun Sirpal ponders running DBCC CHECKDB on Azure SQL Database:

I was exchanging messages with Azure Support, and even though I didn't get a concrete answer to confirm this, I ended up asking the question within a Microsoft-based Yammer group, and yes, they do automatically carry out consistency checks.

This is great, as it is one less thing for me to worry about, and if there is serious corruption, you know, potential data loss (which would be rare), then they will definitely tell you and work with you.

However, it doesn't mean you CAN'T run it. I was curious, so I ran DBCC CHECKDB on my Azure SQL Databases, but as with any other consistency check, it is best to do it during OFF-PEAK hours. I would probably take it a step further and wouldn't even bother running it.
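
If you do decide to run it yourself, a minimal sketch of the commands (the same DBCC syntax as on-premises SQL Server):

    -- Full consistency check against the current database, suppressing informational messages.
    DBCC CHECKDB WITH NO_INFOMSGS;

    -- A lighter-weight option that checks physical page integrity only.
    DBCC CHECKDB WITH PHYSICAL_ONLY;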

It’s an interesting post, reminding us that administering an Azure database isn’t the same as on-prem.
