
Author: Kevin Feasel

Data Wrangling With Power Query

Eugene Meidinger parses a complex report using Power Query:

Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems. So, I’m including my screw ups in this post.

Apparently, I created a linked table and I can’t see how to edit the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but doesn’t solve our problem.

Come for the data analysis, stay for the spelling bee.  This is part one of a two-parter, focusing on techniques to get the data in a digestible format; part two will do interesting things with the data.


The Multifaceted Nature Of R

John Mount points out that there are many ways to skin a cat in R:

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: str(), head(), and the tibble package’s glimpse().

This is a small example of a large phenomenon.


Production-Quality PowerShell Functions

Missy Januszko has some tips on turning those PowerShell scripts into reusable functions:

Breaking down your code may mean chopping apart your lengthy script into smaller pieces. As a best practice, a function should do only one thing. A retrieval cmdlet retrieves information and sends that information to the pipeline. Conversely, a functional cmdlet performs an act but not a retrieval act. It may take input from another cmdlet and act upon that input. It may or may not send output information to the pipeline. Lastly, output cmdlets format output in a desired display. As a result, this will allow us to use the pipeline more effectively to pass parameters between functions. In the above example, most of the function is a retrieval function. The exception is that it formats the output into a table with the last line. I will remove that line and let the user of the function decide how they want it formatted.

As a friendly warning to operations folks who are using more and more PowerShell, when you do it right, you end up being a developer.  But we can keep that a secret, just between you and me.
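
To make the retrieval-versus-output split concrete, here is a minimal sketch. It is in Python rather than PowerShell, and everything in it (the SERVICES sample data, get_services, format_table) is hypothetical and not taken from Missy's post. The point is only that the retrieval function returns plain records and leaves the formatting decision to the caller.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical sample data standing in for whatever a real retrieval step would query.
SERVICES = [
    {"name": "spooler", "status": "running"},
    {"name": "sqlagent", "status": "stopped"},
    {"name": "w32time", "status": "running"},
]


def get_services(status: str) -> Iterator[Dict[str, str]]:
    """Retrieval only: yield matching records and do no formatting."""
    for svc in SERVICES:
        if svc["status"] == status:
            yield svc


def format_table(rows: Iterable[Dict[str, str]]) -> str:
    """Output only: render whatever records it is handed as a fixed-width table."""
    lines = [f"{'name':<10}{'status':<10}"]
    lines += [f"{row['name']:<10}{row['status']:<10}" for row in rows]
    return "\n".join(lines)
```

A caller who wants a table writes format_table(get_services("running")); a caller who wants to count or filter the records uses get_services on its own.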


JSON Data Sources In SSIS

Chris Koester shows how to read JSON data sources in SQL Server Integration Services:

Once the Script Component has been defined as a source, the output columns can be defined. For this post, the same USGS Earthquake data that was used in the “Download JSON data with PowerShell” post will serve as an example. Be careful to choose the correct data types here. This can be tedious because you have to choose the correct data types in the C# code as well, and ensure that they correspond with the SSIS types. It’s helpful to bookmark an SSIS data type translation table for reference.

It does involve creating a script component, but aside from the tedium that Chris mentions, it’s not too bad.
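
The type-matching chore is easier to see in a small example. This is a rough Python sketch, not Chris's C# script component. The SAMPLE document is made up, and its field names (features, properties, place, mag, time) are assumptions about the general shape of an earthquake feed rather than anything confirmed by the post. Each output column gets one explicit type, and every value is coerced to it while parsing.

```python
import json
from datetime import datetime, timezone
from typing import List, NamedTuple


class QuakeRow(NamedTuple):
    """One output row; each field has a single declared type."""
    place: str
    magnitude: float
    event_time: datetime


# Hypothetical sample payload; field names are assumptions for illustration only.
SAMPLE = """
{"features": [
  {"properties": {"place": "10km N of Somewhere", "mag": 4.5, "time": 1489000000000}},
  {"properties": {"place": "Offshore Elsewhere", "mag": 5, "time": 1489003600000}}
]}
"""


def parse_quakes(text: str) -> List[QuakeRow]:
    """Parse the JSON text and coerce every value to its declared column type."""
    doc = json.loads(text)
    rows = []
    for feature in doc["features"]:
        props = feature["properties"]
        rows.append(QuakeRow(
            place=str(props["place"]),
            # A JSON value of 5 parses as int, so coerce to float explicitly.
            magnitude=float(props["mag"]),
            # Assumes the timestamp is milliseconds since the Unix epoch.
            event_time=datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc),
        ))
    return rows
```

The explicit float() call is the interesting line: the second sample record carries an integer magnitude, which is exactly the sort of mismatch that bites when the declared column type and the parsed type disagree.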


Sentiment Analysis In R

Stefan Feuerriegel and Nicolas Pröllochs have a new package in CRAN:

Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.

I’m not sure how it stacks up to external services, but it’s another option available to us.
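
If dictionary-based scoring is new to you, the basic idea fits in a few lines. This is a Python sketch of the general technique only. It is not the SentimentAnalysis package's API, the TOY_DICTIONARY is invented for illustration, and it makes no attempt at the LASSO-based dictionary generation the authors describe.

```python
import re
from typing import Dict

# Hypothetical toy dictionary; real dictionaries contain thousands of scored terms.
TOY_DICTIONARY: Dict[str, int] = {
    "good": 1, "great": 1, "gain": 1,
    "bad": -1, "loss": -1, "poor": -1,
}


def sentiment_score(text: str, dictionary: Dict[str, int] = TOY_DICTIONARY) -> float:
    """Return the net dictionary score divided by the number of words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    # Words missing from the dictionary contribute zero to the score.
    return sum(dictionary.get(word, 0) for word in words) / len(words)
```

Dividing by the word count keeps long documents from scoring higher just because they are long.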


Hypervisor-Driven Wait Stats

Paul Randal explains that delays in the hypervisor layer could be responsible for SOS_SCHEDULER_YIELD waits in SQL Server:

Specifically, I was concerned about SOS_SCHEDULER_YIELD waits. This is a special wait type that occurs when a thread is able to run for 4ms of CPU time (called the thread quantum) without needing to get suspended waiting for an unavailable resource. In a nutshell, a thread must call into the SQLOS layer every so often to see whether it has exhausted its thread quantum, and if so it must voluntarily yield the processor. When that happens, a context switch occurs, and so a wait type must be registered: SOS_SCHEDULER_YIELD. A deeper explanation of this wait type is in my waits library here.

My theory was this: if a VM is prevented from running for a few milliseconds or more, that could mean that a thread that’s executing might exhaust its thread quantum without actually getting 4ms of CPU time, and so yield the processor causing an SOS_SCHEDULER_YIELD wait to be registered. If this happened a lot, it could produce a set of wait statistics for a virtualized workload that appears to have lots of SOS_SCHEDULER_YIELDs, when in fact it’s actually a VM performance problem and the SOS_SCHEDULER_YIELD waits are really ‘fake’.

Read on for more details, and definitely check out the link.  It was an eye-opener when I learned that SOS_SCHEDULER_YIELD didn’t mean “need more/more powerful CPUs.”
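
Here is a toy model of the theory as Paul states it, written in Python. It is emphatically not a description of SQLOS internals. It assumes the quantum is tracked by elapsed time that keeps advancing while the hypervisor has the VM descheduled, and the count_yields function and its one-tick-per-millisecond input are my own invention for illustration.

```python
from typing import Iterable, Tuple

QUANTUM_MS = 4  # thread quantum, per the quoted post


def count_yields(ticks: Iterable[bool], quantum_ms: int = QUANTUM_MS) -> Tuple[int, int]:
    """Simulate one thread, one tick per elapsed millisecond.

    Each tick is True if the VM actually ran during that millisecond and
    False if the hypervisor had it descheduled. Returns a tuple of
    (number of yields, milliseconds of CPU time actually received).
    """
    yields = 0
    elapsed_in_quantum = 0
    cpu_ms = 0
    for vm_running in ticks:
        # Assumption: the quantum clock advances whether or not the VM ran.
        elapsed_in_quantum += 1
        if vm_running:
            cpu_ms += 1
        if elapsed_in_quantum == quantum_ms:
            yields += 1
            elapsed_in_quantum = 0
    return yields, cpu_ms
```

Feed it sixteen healthy milliseconds, [True] * 16, and the thread yields four times for sixteen milliseconds of CPU. Feed it [True, False, False, True] * 4 and it still yields four times but receives only eight milliseconds of CPU, so the yields per unit of real work double without the workload changing at all.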


Abusing The Uniquifier

Denis Gobo shows what happens when you run out of unique values available to the uniquifier:

You would get the following error, straight from the beast himself apparently:

Msg 666, Level 16, State 2, Line 1
The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 72057594039173120. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.

I will be using DBCC PAGE and DBCC IND in this blog post. If you want to learn how to use these yourself, take a look at How to use DBCC PAGE.

One horror story along these lines I’ve heard was a system where the developers would insert every new row with a clustered index value of 0 and then subsequently update the row to set the column to its correct value.  This does not decrement the uniquifier, though, so eventually you hit the limit even if there are only a relatively small number of 0-valued rows.
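
That insert-then-update pattern is easy to model. The Python sketch below is a toy, not how the storage engine works: ToyClusteredIndex, its per-key counter, and the tiny max_uniquifier limit are all hypothetical, chosen so the failure shows up after three rows. The only behavior it borrows from the story above is that moving a row to a new key does not hand back the value consumed under the old key.

```python
from collections import defaultdict
from typing import Dict


class ToyClusteredIndex:
    """Toy model: each key has a uniquifier counter that only ever increases."""

    def __init__(self, max_uniquifier: int) -> None:
        self.max_uniquifier = max_uniquifier
        self.next_uniquifier: Dict[int, int] = defaultdict(int)

    def insert(self, key: int) -> int:
        """Consume and return the next uniquifier value for this key."""
        value = self.next_uniquifier[key]
        if value > self.max_uniquifier:
            raise OverflowError(f"uniquifier exhausted for key {key}")
        self.next_uniquifier[key] = value + 1
        return value

    def update_key(self, old_key: int, new_key: int) -> int:
        """Move a row to new_key; old_key's counter is deliberately left alone."""
        return self.insert(new_key)
```

With max_uniquifier=2, inserting three rows at key 0 and updating each one to its real key (1, 2, 3) leaves no rows at key 0, yet the fourth insert at key 0 raises OverflowError because the counter never came back down.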


Make Those Clustered Indexes Unique

Thomas Rushton shows what happens when your clustered index is not unique and you have a lot of time to kill:

The theory behind clustered indexes is that they are (usually) unique – after all, they define the logical layout of your table on disk. And if you have multiple records with the same clustering index key, then which order would they be in? If you don’t define the CI as unique, then SQL Server will add (behind the scenes) a so-called “Uniqueifier” (or maybe “uniquifier”) to fix that. Grant’s first post in the thread referenced above gives some information about how to see this Uniqu[e]ifier in the table structure itself.

Read the whole thing.


BDD In Spark

Aaron Colcord and Zachary Nanfelt explain how to use Cucumber to create behavior-driven development tests on Apache Spark:

Cucumber allows us to write a portion of our software in a simple, language-based approach that enables all team members to easily read the unit tests. Our focus is on detailing the results we want the system to return. Non-Technical members of the team can easily create, read, and validate the testing of the system.

Often Apache Spark is one component among many in processing data and this can encourage multiple testing frameworks. Cucumber can help us provide a consistent unit testing strategy when the project may extend past Apache Spark for data processing. Instead of mixing the different unit testing strategies between sub-projects, we create one readable agile acceptance framework. This is creating a form of ‘Automated Acceptance Testing’.

Best of all, we are able to create ‘living documentation’ produced during development. Rather than a separate Documentation process, the Unit Tests form a readable document that can be made readable to external parties. Each time the code is updated, the Documentation is updated. It is a true win-win.

It’s an interesting mix.  I’m not the biggest fan of BDD but I’m happy that this information is out there.
