Press "Enter" to skip to content

Day: October 10, 2016

Word Clouds In Python

Allison Tharp shows how to generate a word cloud using Python:

Every week, someone on Reddit posts a “word cloud” on all of the NFL team’s subreddits.  These word clouds show the most used words on that subreddit for the week (the larger the word, the more it was used).  These word plots are always really fascinating to me, so I wanted to try to make some for myself.  In this tutorial, we’ll be making the following word cloud from my board game stats twitter feed, @BGGStats

Looks like the implementation is fairly straightforward, so check it out.

Comments closed

Azure Data Lake Updates

Saveen Reddy points out a few updates to Azure Data Lake Store & the Azure Data Lake Analytics portal:

Use Custom Delimeters when Previewing Files

Previously, we had supported comma, colon, space, tab, ampersand, and bar delimiters. With the many different kinds of files used in Azure Data Lake Store and Azure Storage, we’ve added a “Custom” delimiter options for you to define your own delimiter.

To change the delimiter on the Azure Portal:

  1. Open the file you want to preview using Data Explorer.

  2. Click on Format

  3. Under Delimiter, click the dropdown and change it to Custom

  4. A new Custom Delimiter field will appear, type in your delimiter here

  5. Click OK

Read on for more updates.

Comments closed

Change With Automatic Stats Update

Jack Li notes that SQL Server 2016 has changed when automatic statistics update gets called:

Old threshold: it takes 20% of row changes before auto update stats kicks (there are some tweaks for small tables, for large tables, 20% change is needed).  For a table with 100 million rows, it requires 20 million row change for auto stats to kick in. For vast majority of large tables, auto stats basically doesn’t do much.

New threshold: Starting SQL 2008 R2 SP1, we introduced a trace flag 2371 to control auto update statistics better (new threshold).  Under trace flag 2371, percentage of changes requires is dramatically reduced with large tables.  In other words, trace flag 2371 can cause more frequent update.  This new threshold is off by default and is enabled by the trace flag.  But in SQL 2016, this new threshold is enabled by default for a database with compatibility level 130.

Important to know.

Comments closed

Generating JSON In Power BI

Chris Webb shows how to generate JSON in M:

Often, when calling web services from Power BI or Power Query, you’ll need to generate some JSON inside your query to send to these web services. The M language makes this relatively easy to do with the Json.FromValue() function but there’s only one example of how to use it in the help so I though it might be useful to provide some worked examples of M data types and how Json.FromValue() turns them into JSON.

First, here’s a function – that I’ve called GetJson() for the examples here – that takes a parameter of any data type, passes it to Json.FromValue() and returns the JSON representation of the input as text:

Read on for the code sample.

Comments closed

T-SQL And R Performance Comparisons

Tomaz Kastrun does several performance comparisons between various R packages and T-SQL constructs:

Couple of packages I will mention for data manipulations are plyr, dplyr and data.table and compare the execution time, simplicity and ease of writing with general T-SQL code and RevoScaleR package. For this blog post I will use R packagedplyr and T-SQL with possibilites of RevoScaleR computation functions.

My initial query will be. Available in WideWorldImportersDW database. No other alterations have been done to underlying tables (fact.sale or dimension.city).

Read on for code and conclusions.  I don’t think there are any shocking conclusions:  the upshot is to filter data as early as possible.

Comments closed

Always Encrypted Powershell Cmdlets

Sanjay Mishra alerts us to new Powershell cmdlets for enabling Always Encrypted on columns:

The July 2016 release of SSMS (and later versions) introduced a set of PowerShell cmdlets through a new ‘SqlServer’ module. This pagedescribes the various capabilities that these cmdlets bring to the table. Of most interest to the specific scenario described above is the Set-SqlColumnEncryption cmdlet. In the post below, we will walk through the steps required to use this – first from a PowerShell session to test the capability, and then finally from a C# application which is using PowerShell Automation to invoke the cmdlets from an application.

As a side note it is worth knowing that the cmdlets in the ‘SqlServer’ PowerShell module can also be used for automating key setup and management (and are, in many ways, more powerful than SSMS – they expose more granular tasks, and thus can be used to achieve role separation and to develop a custom key management workflow – but that is likely a topic for a separate post!)

Sanjay also includes a sample Powershell script to show how it works.

Comments closed

Starting Azure Stream Analytics Jobs From Code

Hylke Peek wants to kick off an Azure Stream Analytics job from a Universal Windows Platform application:

I had one of those feelings while working with Azure Stream Analytics (ASA). My solution worked but there was one ‘elementary and simple’ thing I wanted: Start the ASA-jobs within my C#-code. That shouldn’t be hard and there’s some documentation. But no, I needed to combine several opposed solutions to a new one to make it possible.

In this post I shortly explain how you can start ASA-jobs within your C# UWP application:

  • I explain which components you have in the authentication process and which parameters you need.

  • Example code is provided. You only need to enter your parameter values.

Click through for the code.

Comments closed

Sequentially Increasing Indexes

Joe Chang discusses benchmarking and looks at a particular scenario around maximizing insert performance:

The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled. Turbo-boost is disabled for consistency. The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164). Some tests were conducted on a single socket Xeon E5 v4 with 10 cores, but most are on the E3 system. In the past, I used to maintain two-socket systems for investigating issues, but only up to the Core2 processor, which were not NUMA.

The test table has 8 fixed length not null columns, 4 bigint, 2 guids, 1 int, and a 3-byte date. This adds up to 70 bytes. With file and row pointer overhead, this works out to 100 rows per page at 100% fill-factor.

Both heap and clustered index organized tables were tested. The indexes tested were 1) single column key sequentially increasing and 2) two column key leading with a grouping value followed by a sequentially increasing value. The grouping value was chosen so that inserts go to many different pages.

The test was for a client to insert a single row per call. Note that the recommended practice is to consolidate multiple SQL statements into a single RPC, aka network roundtrip, and if appropriate, bracket multiple Insert, Update and Delete statements with a BEGIN and COMMIT TRAN. This test was contrived to determine the worst case insert scenario.

With that setup in mind, click through to learn his results.

Comments closed

ISNULL And COALESCE Behavior Difference

Vladimir Oselsky notes an edge case where ISNULL and COALESCE can behave differently:

Even though we would expect to see both records returned we only get 1 record. Huh? This is exactly what puzzled a coworker, ofcourse query was not as simple as this one but same issue caused him to hit a road block.

In the case of COALESCE and OR methods, results are identical.

The underlying issue here is that the variable data type differs from the column’s data type, and exposes a difference in how COALESCE and ISNULL work.

Comments closed