Pandas Multiindex and T-SQL

Kevin Feasel

2019-06-28

Python

Tomaz Kastrun explains why you should never cross the streams:

1. SQL Server and Python Pandas Indexes are two different worlds and should not be mixed.
2. SQL Server uses Index primarily for DML operations and to keep data ACID.
3. Python Pandas uses Index and MultiIndex for keeping data dimensionality when performing data wrangling and statistical analysis.
4. SQL Server Index and Python Pandas Index don’t know about each other’s existence, meaning if user want to propagate the T-SQL index to Python Pandas (in order to minimize the impact of duplicates, missing values or to impose the relational model), it needs to be introduced and created, once data enters “in the python world”.

Read on for additional conclusions and the demos which bring us here.

CQL: Category Theory-Based Querying Language

John Cook looks at a querying language based on category theory:

My interest in category theory waxes and wanes, and just as it was was at its thinnest crescent phase I ran across CQL, categorical query language. I haven’t had time to look very far into it, but it seems promising. The site’s modest prose relative to the revolutionary rhetoric of some category enthusiasts makes me have more confidence that the authors may be on to something useful.

I’m going through some lectures on category theory now and am in a big functional programming phase, so this is interesting but I won’t be giving up SQL anytime soon for it.

Checkpoints Under Simple Recovery

Max Vernon shows us how checkpointing works when your database is in the simple recovery model:

Even though the transaction has been rolled back, the log records will not be cleared until a checkpoint occurs. An automatic checkpoint could be triggered by other ongoing transactions being written to the log, or a manual CHECKPOINT statement could be executed. However, for a database that is not seeing frequent transactions, the log may stay nearly full for an extended period of time. This scenario might be seen often during development where there are a very limited number of transactions being generated. 

Read the whole thing. Just because you’re in simple recovery mode doesn’t mean the transaction log becomes any less useful.

When Power BI Publish to Web is the Wrong Choice

Treb Gatte takes us through Publish to Web and the importance of getting the correct licensing for Power BI:

Some folks, either to avoid the need to buy a Microsoft Power BI license or in trying to embed Microsoft Power BI content in an On-Premises site like Microsoft SharePoint 2013, published their content using this function.

The risk is that if the content is on a page that gets indexed by a major search engine, like Google, the embed code will likely live in Google’s index forever. Then anyone can search for your data.

Read on to see the right way to do this. Treb also notes that there are good use cases for Publish to Web; you just have to make sure yours is one of them.

Unicode Escaping Across Various Languages

Solomon Rutzky shows how to perform Unicode character escaping in a dozen places:

The purpose of this post is to correct the overall lack of examples. Everything shown below are actual working examples of creating both a Unicode-only BMP character (meaning a non-Supplementary Character that would require Unicode) and a Supplementary Character. Most examples include a link to an online demo, either on db<>fiddle (for database demos) or IDE One (for non-database demos), both very cool and handy sites.

This includes T-SQL and MySQL as well as .NET languages, PHP, JavaScript, and even Excel. It’s a handy reference page.

Pre-Filtering Power BI Slicers

Matt Allington takes us through a new feature in Power BI:

Now back to the point of this post.  As of June 2019, it is possible to pre-filter slicers as well.  It may seem weird, but this previously wasn’t possible – it seemed weird to me, anyway.  This is now fixed and it is possible to use the side filter pane in the same way as other visuals.  I can think of quite a few useful scenarios, including:
– Hiding the dreaded (Blank) in a slicer.  [Actually, please don’t do that, but instead fix your data model].
– Filtering out items not relevant (eg category managers may only want to see their own products)
– Hiding items with no sales

Read on for a few examples of how to use this.

Random Forest on Small Numbers of Observations

Neil Saunders takes us through an interesting problem:

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?

The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.

Neil decided to explore the problem further regardless and came to some interesting conclusions.

Monitoring Big Data Clusters

Mohammad Darab continues a series on Big Data Clusters:

There are many ways to view the health of your Big Data Cluster. As of CTP 3.0, there are kubectl commands, mssqlctl commands as well as dashboards. For the sake of this series, I will focus on the dashboards. I will blog about some of the useful kubectl and mssqlctl commands in later posts.

The first dashboard is the Microsoft Cluster Administration portal (see below snapshot). This is a view into the Big Data Cluster Controller. As you can see from the image below, the Overview pane shows the Controller, Master Instance and all the pools. On the left hand side you can see more details. If you click on the “Service Endpoint” option, you will see a list of endpoints that you can bookmark.

Something I appreciate is that Microsoft thought ahead on what the monitoring story should look like rather than waiting until the end and slapping something together.

Simplifying Columnstore

Monica Rathbun takes us through concepts behind columnstore indexes:

Now, I admit when these first were introduced in SQL Server 2012 I found them very intimidating (additionally, you couldn’t update them directly). For me, anytime you say columnstore, my mind tends to set off alarms saying wait stay away, this is too complicated. So, in this post I am going to try and simplify the feature for you.

To do that first you need to understand some terminology and the difference between a columnstore index and a row store index (the normal kind we all use). Let’s start with the terminology.

There are some interesting complications around columnstore indexes but for analytical or warehousing queries, they’re excellent.

More Testing of Inline Scalar UDFs

Erik Darling makes a FROIDian slip:

The idea behind FROID is that it removes some restrictions around scalar valued functions.

1. They can be inlined into the query, not run per-row returned
2. They don’t force serial execution, so you can get a parallel plan

If your functions already run pretty quickly over a small  number of rows, and the calling query doesn’t qualify for parallelism, you may not see a remarkable speedup.

Even in that case, Erik argues that you can still get some benefits from SQL Server 2019 bringing those scalar UDFs inline.

Categories

June 2019
MTWTFSS
« May Jul »
 12
3456789
10111213141516
17181920212223
24252627282930