stringr Cheat Sheet

David Smith points out a cheat sheet for dealing with strings in R:

The RStudio team has created another very useful cheat sheet for R: Working with Strings. This cheat sheet provides an example-laden menu of operations you can perform on strings (character vectors) in R using the stringr package. While base R provides a solid set of string manipulation functions, the stringr package functions are simpler, more consistent (making them easy to use with the pipe operator), and more like the Ruby or Python way of handling string operations.

Click through for a link to the PDF.

Partitioning Tables

Eleni Markou shows how to partition tables in Postgres, SQL Server, and Google’s BigQuery:

When it comes to Microsoft SQL Server, things are a bit different: this database system does not support dynamic partitions, so partitioning the table manually can be a huge maintenance issue.

That being said, in order to create a partitioned table a similar procedure to the one previously presented must be followed. This time we will create a monthly partition.
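To give a feel for the manual upkeep a monthly scheme implies, here is a minimal Python sketch that lists the month-start boundary values someone would need to keep supplying. The function name `monthly_boundaries` and the sample dates are hypothetical and not taken from Eleni's scripts; the actual partition DDL is in the linked post.

```python
from datetime import date
from typing import List


def monthly_boundaries(start: date, months: int) -> List[date]:
    """Return the first day of `months` consecutive months, beginning with start's month."""
    boundaries = []
    year, month = start.year, start.month
    for _ in range(months):
        boundaries.append(date(year, month, 1))
        # Advance one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            month = 1
            year += 1
    return boundaries


# Hypothetical example: three monthly boundaries starting from mid-November 2017.
for boundary in monthly_boundaries(date(2017, 11, 15), 3):
    print(boundary.isoformat())
```

Each value produced this way would have to be added to the partitioning scheme before data for that month arrives, which is the maintenance burden the quote describes.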

Read on for scripts for each.

Cosmos DB Limitations

Vincent-Philippe Lauzon points out a few limitations with Cosmos DB:

The original DocumentDB SQL didn’t have any aggregation capacity.  But it did acquire those capacities along the way.

Traditionally, that isn’t the strong spot for document-oriented databases.  They tend to be more about finding documents and manipulating the documents, as opposed to aggregating metrics on a mass of documents.

Today, DocumentDB SQL implements the following aggregate functions:

  • COUNT

  • SUM

  • MIN

  • MAX

  • AVG

Read on for where the current aggregation limitation is, as well as more.

Is Azure SQL DW A Good Fit For You?

Melissa Coates has a nice choose-your-own-adventure story around Azure SQL Data Warehouse:

Q4: How large is your database?

It is difficult to pinpoint an exact number for the absolute minimum size recommended for Azure SQL DW. Many data professionals in the industry see the minimum “practical” data size for Azure SQL DW in the 1-4TB range. Microsoft documentation has recently stated as low as 250GB for a minimum size. Since Azure SQL DW is an MPP (massively parallel processing) system, you experience a significant performance penalty with small data sizes because of the overhead incurred to distribute and consolidate across the nodes (which are distributions in a “shared-nothing” architecture). We recommend Azure SQL DW for a data warehouse which is starting to approach 1TB and expected to continue growing.

Great advice here.  I’ve heard too often of people looking at the name “Azure SQL Data Warehouse” and figuring that because they have data warehouses on-prem, this is the appropriate analog.  Azure SQL DW is not a typical data warehousing environment; it’s more of a specialized tool than that, so click through to see if it fits your needs.

Using C# To Stream Data Into Power BI

Chris Koester shows us how to pass data from our .NET applications into a Power BI streaming dataset:

This post will demonstrate how to push data into Power BI Streaming Datasets with C#. For demo purposes I normally use LINQPad to run the code, but you could also create a .Net or .Net Core console application. LINQPad is an excellent, lightweight scratchpad for C# and other .Net languages.

Power BI Streaming Datasets are a very cool feature because dashboard tiles that use them update in real time. You don’t have to refresh the browser window to display new data. With this feature you can watch your data in near real-time. This could be compelling in scenarios involving sensors, IoT, website traffic, etc.
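Chris's demo is in C#, but the same idea can be sketched in a few lines of Python, assuming the streaming dataset exposes a push URL that accepts an HTTP POST whose body is a JSON array of row objects. The URL below is a placeholder, and the field names (`timestamp`, `sensor`, `value`) are hypothetical; they would need to match the dataset's actual schema.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Placeholder only: replace with the push URL shown for your own streaming dataset.
PUSH_URL = "https://example.invalid/your-streaming-dataset-push-url"


def build_push_request(url, rows):
    """Build a POST request whose body is the rows serialized as a JSON array."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push_rows(url, rows):
    """Send the rows and return the HTTP status code (requires a real push URL)."""
    with urllib.request.urlopen(build_push_request(url, rows)) as response:
        return response.status


# Hypothetical row; the keys must match the columns defined on the dataset.
rows = [
    {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sensor": "demo",
        "value": 21.5,
    }
]
request = build_push_request(PUSH_URL, rows)
print(request.get_method(), len(request.data), "bytes")
```

Calling `push_rows(PUSH_URL, rows)` in a loop, with a real URL, is what would make a dashboard tile update as new rows arrive.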

Click through for the demo script.  This shows how easy it can be to take your on-premises data and feed it into live Power BI dashboards.

Memory-Optimized Indexing Strategy

Kunal Karoth has some advice for creating memory-optimized indexes:

To further optimize your queries that do point lookups on your memory-optimized table using an equality operator (=), consider using a Hash index. Hash indexes also optimize insert operations and aid database recovery time. As mentioned earlier, a Hash index uses a hash table internally to keep track of the index keys. The hashing function uses all the key(s) specified at the index creation time to create a mapping between the index key column(s) and the corresponding hash bucket. Hence, it is important that you specify all the index key columns that are part of the hash index, during the lookup in your query.

As with any hashing technique, collisions will occur. A hash collision occurs when two or more index keys map to the same hash bucket. In general, having a limited number of collisions is expected and is fine. It is only when the number of collisions becomes excessive that it may noticeably begin to impact your query performance. Your aim should be to keep the number of collisions as low as possible. You may experience a higher number of collisions because:

  • There are many duplicate index key column values in the table and/or

  • The number of hash buckets for your Hash index is under-provisioned.
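Both causes can be illustrated with a toy model. The Python sketch below is not SQL Server's hashing function; it assumes a simple `hash(key) % bucket_count` mapping and a hypothetical helper, `bucket_stats`, to show how chain lengths grow when buckets are too few or keys are heavily duplicated.

```python
from collections import Counter


def bucket_stats(keys, bucket_count):
    """Map each key to hash(key) % bucket_count and summarize the resulting chains.

    Returns (longest_chain, average_chain_length_over_used_buckets).
    This is a toy model, not the hash function SQL Server uses.
    """
    chains = Counter(hash(key) % bucket_count for key in keys)
    longest = max(chains.values())
    average = len(keys) / len(chains)
    return longest, average


unique_keys = list(range(10_000))                    # every key value distinct
duplicate_keys = [i % 100 for i in range(10_000)]    # only 100 distinct key values

print("well provisioned: ", bucket_stats(unique_keys, 16_384))
print("under-provisioned:", bucket_stats(unique_keys, 1_024))
print("many duplicates:  ", bucket_stats(duplicate_keys, 16_384))
```

In this model the well-provisioned case keeps every chain at length 1, the under-provisioned case pushes chains to around 10 entries, and the duplicate-heavy case produces chains of 100 no matter how many buckets exist, which is why adding buckets does not help with duplicate key values.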

There’s some helpful information here if you are new to In-Memory OLTP.

CXCONSUMER Waits And More From PASS Summit

Brent Ozar relays a couple of exciting announcements from PASS Summit:

Microsoft’s Joe Sack & Pedro Lopes held a forward-looking session for performance tuners at the PASS Summit and dropped some awesome bombshells.

Pedro’s Big Deal: there’s a new CXPACKET wait in town: CXCONSUMER. In the past, when queries went parallel, we couldn’t differentiate harmless waits incurred by the consumer thread (coordinator, or teacher from my CXPACKET video) from painful waits incurred by the producers. Starting with SQL Server 2016 SP2 and 2017 CU3, we’ll have a new CXCONSUMER wait type to track the harmless ones. That means CXPACKET will really finally mean something.

Read on to see what Joe has for us.
