Press "Enter" to skip to content

Author: Kevin Feasel

stringr Cheat Sheet

David Smith points out a cheat sheet for dealing with strings in R:

The RStudio team has created another very useful cheat sheet for RWorking with Strings. This cheat sheet provides an example-laden menu of operations you can perform on strings (character verctors) in R using the stringr package. While base R provides a solid set of string manipulation functions, the stringr package functions are simpler, more consistent (making them easy to use with the pipe operator), and more like the Ruby or Python way of handling string operations.

Click through for a link to the PDF.

Comments closed

Partitioning Tables

Eleni Markou shows how to partition tables in Postgres, SQL Server, and Google’s BigQuery:

When it comes to Microsoft SQL Server things are a bit different as this database system does not support dynamic partitions and so partitioning the table manually can be a huge maintenance issue.

That being said, in order to create a partitioned table a similar procedure to the one previously presented must be followed. This time we will create a monthly partition.

Read on for scripts for each.

Comments closed

Cosmos DB Limitations

Vincent-Philippe Lauzon points out a few limitations with Cosmos DB:

The original DocumentDB SQL didn’t have any aggregation capacity.  But it did acquire those capacities along the way.

Traditionally, that isn’t the strong spot for document-oriented databases.  They tend to be more about find documents and manipulating the documents as oppose to aggregating metrics on a mass of documents.

Today, DocumentDB SQL implements the following aggregate functions:

  • COUNT

  • SUM

  • MIN

  • MAX

  • AVG

Read on for where the current aggregation limitation is, as well as more.

Comments closed

Is Azure SQL DW A Good Fit For You?

Melissa Coates has a nice choose-your-own-adventure story around Azure SQL Data Warehouse:

Q4: How large is your database?

It is difficult to pinpoint an exact number for the absolute minimum size recommended for Azure SQL DW. Many data professionals in the industry see the minimum “practical” data size for Azure SQL DW in the 1-4TB range. Microsoft documentation has recently stated as low as 250GB for a minimum size. Since Azure SQL DW is an MPP (massively parallel processing) system, you experience a significant performance penalty with small data sizes because of the overhead incurred to distribute and consolidate across the nodes (which are distributions in a “shared-nothing” architecture). We recommend Azure SQL DW for a data warehouse which is starting to approach 1TB and expected to continue growing.

Great advice here.  I’ve heard too often of people looking at the name “Azure SQL Data Warehouse” and figuring that because they have data warehouses on-prem, this is the appropriate analog.  Azure SQL DW is not a typical data warehousing environment; it’s more of a specialized tool than that, so click through to see if it fits your needs.

Comments closed

Using C# To Stream Data Into Power BI

Chris Koester shows us how to pass data from our .NET applications into a Power BI streaming dataset:

This post will demonstrate how to push data into Power BI Streaming Datasets with C#. For demo purposes I normally use LINQPad to run the code, but you could also create a .Net or .Net Core console application. LINQPad is an excellent, lightweight scratchpad for C# and other .Net languages.

Power BI Streaming Datasets are a very cool feature because dashboard tiles that use them update in real time. You don’t have to refresh the browser window to display new data. With this feature you can watch your data in near real-time. This could be compelling in scenarios involving sensors, IoT, website traffic, etc.

Click through for the demo script.  This shows how easy it can be to take your on-premises data and feed it into live Power BI dashboards.

Comments closed

Memory-Optimized Indexing Strategy

Kunal Karoth has some advice for creating memory-optimized indexes:

To further optimize your queries that do point lookups on your memory-optimized table using an equality operator (=), consider using a Hash index. Hash indexes also optimize insert operations and aid database recovery time. As mentioned earlier, a Hash index uses a hash table internally to keep track of the index keys. The hashing function uses all the key(s) specified at the index creation time to create a mapping between the index key column(s) and the corresponding hash bucket. Hence, it is important that you specify all the index key columns that are part of the hash index, during the lookup in your query.

As with any hashing technique, collisions will occur. A hash collision occurs when two or more index keys map to the same hash bucket. In general, having a limited number of collisions is expected and is fine. It is only when the number of collisions become excessive, it may noticeably begin to impact your query performance. Your aim should be to keep the number of collisions as low as possible. You may experience a higher number of collisions either because: –

  • There are many duplicate index key column values in the table and/or

  • The number of hash buckets for your Hash index are under provisioned.

There’s some helpful information here if you are new to In-Memory OLTP.

Comments closed

CXCONSUMER Waits And More From PASS Summit

Brent Ozar relays a couple exciting announcements from PASS Summit:

Microsoft’s Joe Sack & Pedro Lopes held a forward-looking session for performance tuners at the PASS Summit and dropped some awesome bombshells.

Pedro’s Big Deal: there’s a new CXPACKET wait in town: CXCONSUMER. In the past, when queries went parallel, we couldn’t differentiate harmless waits incurred by the consumer thread (coordinator, or teacher from my CXPACKET video) from painful waits incurred by the producers. Starting with SQL Server 2016 SP2 and 2017 CU3, we’ll have a new CXCONSUMER wait type to track the harmless ones. That means CXPACKET will really finally mean something.

Read on to see what Joe has for us.

Comments closed

Creating A Poekr AI In Python

Kevin Jacobs has a fairly simple framework for building poker-playing bots:

The bot uses Monte Carlo simulations running from a given state. Suppose you start with 2 high cards (two Kings for example), then the chances are high that you will win. The Monte Carlo simulation then simulates a given number of games from that point and evaluates which percentage of games you will win given these cards. If another King shows during the flop, then your chance of winning will increase. The Monte Carlo simulation starting at that point, will yield a higher winning probability since you will win more games on average.

If we run the simulations, you can see that the bot based on Monte Carlo simulations outperforms the always calling bot. If you start with a stack of $100,-, you will on average end with a stack of $120,- (when playing against the always-calling bot).

It’s a start, and an opening for more sophisticated logic and analysis.

Comments closed

The Importance Of Distributions

Jocelyn Barker explains distributions using role-playing games as an example:

We see that for the entire curve, our odds of success goes down when we add criticals and for most of the curve, it goes up for 3z8. Lets think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, where as it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.

It’s a nice spin on a classic statistics lesson.

Comments closed

BI Announcements At PASS Summit

Chris Webb points out some good news released at PASS Summit:

Power BI Report Server

There’s a new release of Power BI Report Server available, and you can read all about it here:

https://powerbi.microsoft.com/en-us/blog/new-version-of-power-bi-report-server-now-available/

The blog post highlights the fact that you can connect to SSRS shared datasets via OData – which is basically what I was talking about here.

This is the big one for me, but there are a few other product updates that Chris covers as well.

Comments closed