Press "Enter" to skip to content

Day: June 7, 2017

Riddler Nation: Game Theory In Action

Curtis Miller goes over a multi-round allocation game in which players commit their strategies without knowing what their opponents will submit:

The winning strategy of the last round, submitted by Vince Vatter, was (0, 1, 2, 16, 21, 3, 2, 1, 32, 22), with an official record of 751 wins, 175 losses, and 5 ties. Naturally, the top-performing strategies look similar. This should not be surprising; winning strategies exploit common vulnerabilities among submissions.

I’ve downloaded the submitted strategies for the second round (I already have the first round’s strategies). Let’s load them in and start analyzing them.

This is a great blog post, which looks at using evolutionary algorithms to evolve a winning strategy.
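If you want a feel for the approach without opening R, here is a rough Python sketch. The castle values and head-to-head scoring rule are the standard Battle for Riddler Nation setup (ten castles worth 1 through 10 points, 100 soldiers, ties split a castle’s points); the naive mutate-and-keep-the-better loop below is only a stand-in for the genuine evolutionary machinery in Curtis’s post.

```python
import random

CASTLE_VALUES = range(1, 11)  # castle i is worth i points

def score(a, b):
    """Score one head-to-head matchup of two 10-castle soldier allocations."""
    pa = pb = 0.0
    for value, sa, sb in zip(CASTLE_VALUES, a, b):
        if sa > sb:
            pa += value
        elif sb > sa:
            pb += value
        else:            # tie: split the castle's points
            pa += value / 2
            pb += value / 2
    return pa, pb

def wins(strategy, field):
    """Count outright wins against every strategy in the field."""
    return sum(pa > pb for pa, pb in (score(strategy, opp) for opp in field))

def mutate(strategy):
    """Move one soldier between two random castles; the total stays at 100."""
    s = list(strategy)
    src = random.choice([i for i in range(10) if s[i] > 0])
    s[src] -= 1
    s[random.randrange(10)] += 1
    return s

def evolve(field, generations=5000):
    """Naive hill climb: keep a mutation whenever it does at least as well."""
    current = [10] * 10                # start from a uniform allocation
    best = wins(current, field)
    for _ in range(generations):
        candidate = mutate(current)
        w = wins(candidate, field)
        if w >= best:
            current, best = candidate, w
    return current, best
```

Here `field` would be the list of submitted strategies; a real evolutionary algorithm maintains a population with crossover and selection rather than this single hill climber.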

Comments closed

Data Wrangling With Power Query

Eugene Meidinger parses a complex report using Power Query:

Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems. So, I’m including my screw ups in this post.

Apparently, I created a linked table and I can’t see how to edit the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but doesn’t solve our problem.

Come for the data analysis, stay for the spelling bee.  This is part one of a two-parter, focusing on techniques to get the data in a digestible format; part two will do interesting things with the data.

Comments closed

The Multifaceted Nature Of R

John Mount points out that there are many ways to skin a cat in R:

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one-- and preferably only one --obvious way to do it.

Frankly in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: str(), head(), and the tibble package’s glimpse().

This is a small example of a large phenomenon.
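In fairness, the “one obvious way” ideal frays a bit once data frames are involved, even on the Python side. A quick pandas illustration of the same kind of multiplicity (my example, not Mount’s):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "y": ["a", "b"] * 50})

# Several overlapping ways to take a first look at the same data frame:
df.head()      # first five rows, the analogue of R's head()
df.info()      # column types and non-null counts, closest to str() / glimpse()
df.dtypes      # just the column types
df.describe()  # summary statistics for the numeric columns
```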

Comments closed

Production-Quality PowerShell Functions

Missy Januszko has some tips on turning those PowerShell scripts into reusable functions:

Breaking down your code may mean chopping apart your lengthy script into smaller pieces. As a best practice, a function should do only one thing. A retrieval cmdlet retrieves information and sends that information to the pipeline. Conversely, a functional cmdlet performs an act but not a retrieval act. It may take input from another cmdlet and act upon that input. It may or may not send output information to the pipeline. Lastly, output cmdlets format output in a desired display. As a result, this will allow us to use the pipeline more effectively to pass parameters between functions. In the above example, most of the function is a retrieval function. The exception is that it formats the output into a table with the last line. I will remove that line and let the user of the function decide how they want it formatted.

As a friendly warning to operations folks who are using more and more PowerShell: when you do it right, you end up being a developer. But we can keep that a secret, just between you and me.
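Her examples are PowerShell, but the retrieval/act/output split is language-agnostic. Here is a minimal sketch of the same discipline in Python, with an invented disk-usage example standing in for her cmdlets:

```python
import shutil

def get_disk_usage(paths):
    """Retrieval: gather data and emit it to the 'pipeline'; no formatting here."""
    for path in paths:
        usage = shutil.disk_usage(path)
        yield {"path": path, "total": usage.total, "used": usage.used}

def bytes_to_gb(rows):
    """Action: take input from another function, transform it, pass it along."""
    for row in rows:
        yield {k: (v // 2**30 if isinstance(v, int) else v) for k, v in row.items()}

def print_table(rows):
    """Output: formatting decisions live at the very end, chosen by the caller."""
    for row in rows:
        print(f"{row['path']:<12} {row['used']:>6} GB used of {row['total']:>6} GB")

# The caller composes the pipeline and picks the presentation:
print_table(bytes_to_gb(get_disk_usage(["/"])))  # use a drive letter on Windows
```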

Comments closed

JSON Data Sources In SSIS

Chris Koester shows how to read JSON data sources in SQL Server Integration Services:

Once the Script Component has been defined as a source, the output columns can be defined. For this post, the same USGS Earthquake data that was used in the “Download JSON data with PowerShell” post will serve as an example. Be careful to choose the correct data types here. This can be tedious because you have to choose the correct data types in the C# code as well, and ensure that they correspond with the SSIS types. It’s helpful to bookmark a SSIS data type translation table for reference.

It does involve creating a script component, but aside from the tedium that Chris mentions, it’s not too bad.
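The post’s plumbing is a C# script component, so what follows is only a Python sketch of the parsing step itself: pull the public USGS GeoJSON feed and coerce every field to an explicit type up front, the same discipline the C# code needs so its types line up with the SSIS columns.

```python
import json
from urllib.request import urlopen

# Public USGS GeoJSON summary feed (the same data family as in the post).
URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"

with urlopen(URL) as response:
    feed = json.load(response)

rows = []
for feature in feed["features"]:
    props = feature["properties"]
    rows.append({
        # Choose types deliberately; a silent str-vs-float mismatch here is
        # exactly the kind of tedium the script component runs into.
        "id":    str(feature["id"]),
        "place": str(props.get("place") or ""),
        "mag":   float(props["mag"]) if props.get("mag") is not None else None,
        "time":  int(props["time"]),  # milliseconds since the Unix epoch
    })

print(f"parsed {len(rows)} earthquakes")
```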

1 Comment

Sentiment Analysis In R

Stefan Feuerriegel and Nicolas Pröllochs have a new package in CRAN:

Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.

I’m not sure how it stacks up to external services, but it’s another option available to us.
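If you haven’t seen dictionary-based scoring before, the core idea is simple. A toy Python sketch, with made-up word lists standing in for QDAP or Loughran-McDonald (and none of the package’s LASSO-driven dictionary generation):

```python
# Invented stand-in word lists, not the real QDAP / Loughran-McDonald dictionaries.
POSITIVE = {"good", "great", "profit", "gain"}
NEGATIVE = {"bad", "loss", "risk", "decline"}

def sentiment(text):
    """Score = (positive hits - negative hits) / total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(sentiment("great quarter with record profit"))  #  0.4
print(sentiment("rising risk of further loss"))       # -0.4
```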

Comments closed

Hypervisor-Driven Wait Stats

Paul Randal explains that delays in the hypervisor layer could be responsible for SOS_SCHEDULER_YIELD waits in SQL Server:

Specifically, I was concerned about SOS_SCHEDULER_YIELD waits. This is a special wait type that occurs when a thread is able to run for 4ms of CPU time (called the thread quantum) without needing to get suspended waiting for an unavailable resource. In a nutshell, a thread must call into the SQLOS layer every so often to see whether it has exhausted its thread quantum, and if so it must voluntarily yield the processor. When that happens, a context switch occurs, and so a wait type must be registered: SOS_SCHEDULER_YIELD. A deeper explanation of this wait type is in my waits library here.

My theory was this: if a VM is prevented from running for a few milliseconds or more, that could mean that a thread that’s executing might exhaust its thread quantum without actually getting 4ms of CPU time, and so yield the processor causing an SOS_SCHEDULER_YIELD wait to be registered. If this happened a lot, it could produce a set of wait statistics for a virtualized workload that appears to have lots of SOS_SCHEDULER_YIELDs, when in fact it’s actually a VM performance problem and the SOS_SCHEDULER_YIELD waits are really ‘fake’.

Read on for more details, and definitely check out the link. It was an eye-opener when I learned that SOS_SCHEDULER_YIELD doesn’t necessarily mean “we need more (or more powerful) CPUs.”
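Paul’s theory reduces to a bit of bookkeeping: if the quantum check compares elapsed wall-clock time while the hypervisor silently pauses the vCPU, a thread can exhaust its 4ms without ever receiving 4ms of CPU. A toy Python model of that arithmetic (invented numbers, not SQLOS internals):

```python
QUANTUM_MS = 4.0

def yields_quantum(cpu_ms_received, hypervisor_pause_ms):
    """Toy model: the quantum is tracked in wall-clock terms, so time the
    hypervisor steals still counts against the thread's 4ms."""
    wall_clock_ms = cpu_ms_received + hypervisor_pause_ms
    return wall_clock_ms >= QUANTUM_MS

print(yields_quantum(4.0, 0.0))  # True: an honest SOS_SCHEDULER_YIELD
print(yields_quantum(1.0, 3.0))  # True: a 'fake' one -- only 1ms of real work done
print(yields_quantum(1.0, 0.0))  # False: on bare metal the thread keeps running
```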

Comments closed

Abusing The Uniquifier

Denis Gobo shows what happens when you run out of unique values available to the uniquifier:

You would get the following error… straight from the beast himself, apparently:

Msg 666, Level 16, State 2, Line 1
The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 72057594039173120. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.

I will be using DBCC PAGE and DBCC IND in this blog post; if you want to learn how to use these yourself, take a look at How to use DBCC PAGE.

One horror story I’ve heard along these lines involved a system where the developers would insert every new row with a clustered key value of 0 and then update the row to set the column to its correct value.  The update does not decrement the uniquifier, though, so eventually you hit the limit even if the table only ever holds a relatively small number of 0-valued rows at any given time.
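A toy model makes that failure mode concrete: the counter for a duplicate group only moves forward, so every row that merely passes through the value 0 permanently consumes one of the 2^31 − 1 available values (a sketch of the behavior described above, not SQL Server’s storage internals):

```python
INT_MAX = 2**31 - 1  # the uniquifier is a 4-byte integer per duplicate key value

class DuplicateGroup:
    """Toy model of the uniquifier for one clustered key value: values are
    handed out monotonically and never decremented or reused."""
    def __init__(self):
        self.next_value = 1

    def insert_row(self):
        if self.next_value > INT_MAX:
            raise RuntimeError("Msg 666: max unique value for duplicate group exceeded")
        value, self.next_value = self.next_value, self.next_value + 1
        return value

zero_group = DuplicateGroup()
for _ in range(3):
    zero_group.insert_row()    # INSERT ... with the clustered key = 0
    # the UPDATE that moves the row to its real key does NOT give the value back

print(zero_group.next_value)   # 4: three short-lived rows, three values gone forever
```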

Comments closed

Make Those Clustered Indexes Unique

Thomas Rushton shows what happens when your clustered index is not unique and you have a lot of time to kill:

The theory behind clustered indexes is that they are (usually) unique – after all, they define the logical layout of your table on disk. And if you have multiple records with the same clustering index key, then which order would they be in? If you don’t define the CI as unique, then SQL Server will add (behind the scenes) a so-called “Uniqueifier” (or maybe “uniquifier”) to fix that. Grant’s first post in the thread referenced above gives some information about how to see this Uniqu[e]ifier in the table structure itself.

Read the whole thing.

Comments closed