Fun With Random Walks

Emrah Mete simulates a random walk in R:

Let’s consider a game where a gambler is likely to win $1 with a probability of p and lose $1 with a probability of 1-p.

Now, let’s consider a game where a gambler is likely to win $1 and lose $1 with a probability of 1. The player starts the game with X dollars in hand. The player plays the game until the money in his hand reaches N (N> X) or he has no money left. What is the probability that the player will reach the target value? (We know that the player will not leave the game until he reaches the N value he wants to win.)

The problem of the story above is known in literature as Gambler’s Ruin or Random Walk. In this article, I will simulate this problem with R with different settings and examine how the game results change with different settings.

Click through for the script and analysis.  There’s a reason they call this game the gambler’s ruin.

The Business Value Of Upgrading To Hadoop 3

Roni Fontaine, Vinod Vavilapalli, and Saumitra Buragohain explain some of the business case for upgrading to Hadoop 3 from Hadoop 2:

Hadoop 2 doesn’t support GPUs. Hadoop 3 enables scheduling of additional resources, such as disks and GPUs for better integration with containers, deep learning & machine learning.  This feature provides the basis for supporting GPUs in Hadoop clusters, which enhances the performance of computations required for Data Science and AI use cases.

Hadoop 2 cannot accommodate intra-node disk balancing. Hadoop 3 has intra-node disk balancing. If you are repurposing or adding new storage to an existing server with older capacity drives, this leads to unevenly disks space in each server.   With intra-node disk balancing, the space in each disk is evenly distributed.

Hadoop 2 has only inter-queue preemption across queues. Hadoop 3 introduces intra-queue preemption which goes to the next level time by allowing preemption between application within a single queue. This means that you can prioritize jobs within the queue based on user limits and/or application priority

Read on for more examples.

DataExplorer

Boxuan Cui introduces DataExplorer, an R package dedicated to assist with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. Of all the resources out there, DataExplorer is one of them, with its sole mission to minimize the 80%, and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any data.frame-like objects. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in later part of the post.

For my money, that number is closer to 90%.  I will have to check this package out.

Using Power BI To Pass Parameterized Values To R

Stacia Varga shows how you can parameterize your R scripts within Power BI:

It’s not difficult, but the cool thing about Power BI is that I can use parameters to dynamically change the report visualization without opening up the script. To do this:

  • Open the Query Editor in Power BI

  • Click Manage Parameters, and then click New Parameter.

  • Set the parameter properties – Name, Type, and Current Value.The Name is how I will reference the parameter my R script, the Type is the data type, and Current Value is the initial value that I want to set (if any).

Click through for an example and more details.

Tracking T-SQL Code Progress

Louis Davidson has a couple of ways of tracking progress and success of T-SQL code:

For a process that runs for 10 seconds, this is no problem at all. But when you are doing 100s of loops, and they take time, you don’t want to wait. Eventually, data will start spurting out, but not immediately. We want immediately, even if it isn’t the optimum way.

The PRINT statement won’t output immediately, but its cousin, good old RAISERROR, does. Using a severity of 10 for the message, the message will be output just like a PRINT message. Then, adding WITH NOWAIT to the RAISERROR, the messages will no longer be queued for output, and will be returned immediately.

Both of the techniques Louis shares are useful for keeping track of progress in code.  I’d expect that as tooling gets more sophisticated with respect to live execution plans, we might eventually get to the point where there’s an overall expected progress indicator, something which would be quite useful when three levels of management are standing at your desk waiting for something to finish.

Quick Update Stats Job

Amy Herold has a Powershell script to generate a SQL Agent job which updates statistics for defined tables:

In an effort to rule out whether or not statistics are definitely a factor, I want to UPDATE STATS on all the tables in my query, and at a specific time – sometime the day before we expect our slowdown to occur. I also want to be able to easily see how long the process ran and the duration of the update for each table. I could write all this to a table, and maybe I will do this later, but viewing this from job history is what I want right now – it is easy to look at and easy for everyone else to find and see as well.

Creating a job with multiple steps can sometimes be a bit painful and tedious, if the job is going to have A LOT of steps. With PowerShell, a server name and a database, I can dynamically create this job.

One of many reasons to have at least a little bit of Powershell knowledge if you are a SQL Server DBA.

Annotations And ObjectTags In Biml

Cathrine Wilhelmsen explains how to use Annotations and ObjectTags to store metadata in Biml:

There are two types of annotations in the Biml language. The first type is SSIS Annotations. An SSIS annotation is a comment inside your SSIS package that is visible on the design surface. This is what most SSIS developers think of when they hear the word “annotation”.

The second type, and the type covered in this blog post, is Biml Annotations. Biml annotations are more than just comments inside a package. They can be used to store additional string metadata on any Biml object, and can also be used in auto-generated documentation.

There are four types of Biml annotations: CodeComment, Documentation, Description, and Tag. You can read more about them in the official AnnotationType documentation.

I know I’ve underused Annotations in my Biml work, so this was good to read.

Digging Into The In-Memory Columnstore Location

Niko Neugebauer does some investigation into where, exactly, memory-optimized columnstore data goes:

This is a rather simple blog post that is dedicated to the theme of the In-Memory Columnstore Indexes location. This has been a constant topic of discussion over a long period of time, even during the public events – and there is a need to clear out this topic.

I have assumed that the In-Memory Columnstore structures (Segments, Dictionaries, …) are located in the In-Memory, but there have been voices that I greatly respect, pointing that actually the Columnstore Object Pool is the exact location of any Columnstore structures, and there is nothing better than to take this feature for a ride and see what the SQL Server engine is actually doing.

Niko shows off a couple of useful DMVs along the way, too.

Categories

February 2018
MTWTFSS
« Jan  
 1234
567891011
12131415161718
19202122232425
262728