January 2019 – Page 5

After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform function. This function is missing from PySpark but does exist as part of the Scala language already.
The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.

Things which make Python more of a functional language are fine by me. Even though I’d rather use Scala.

Comments closed

Highlighting Words In Powershell

Published 2019-01-24 by Kevin Feasel

Roman Gelman has a function which lets you highlight words in Powershell text:

Despite of the Write-HostHighlight function intended to work with a text (it highlights a substring within a string), it can operate with objects too. Generally, it accepts any input from any cmdlet. The output is not always looks nice, but it works! Let see some examples.

This looks a bit like newer versions of grep which highlight matched patterns. I like it.

Comments closed

Querying Cosmos DB Execution Metrics

Published 2019-01-24 by Kevin Feasel

Hasan Savran shows us how to retrieve execution metrics for a Cosmos DB call:

When I speak about CosmosDB, I always get questions like “How can I retrieve information about the execution plans?” or “Isn’t there a tool like SSMS which can show me what’s happening in the background?” Usually, questions like that comes from DBAs. If you have questions like that, I have good and bad news for you. Good news is, Yes you can get retrieve metrics from CosmosDB about execution plans. Bad news is, you need to know some programming to be able to do that because you need to use CosmosDB SDK.

The only way to access this information is from CosmosDB SDK 2.x. I couldn’t retrieve execution metrics by using SDK 3.x for custom queries. Here is the available metrics you can retrieve from CosmosDB queries.

I wonder if this is a “this is still new” thing, a “you don’t need these where you’re going” thing, or a “this is exactly how we envisioned implementation” thing. Especially around getting per-query metrics after the fact.

Comments closed

Visualization Failures

Published 2019-01-24 by Kevin Feasel

Stephanie Evergreen talks about two specific instances of self-inflicted visualization failure:

There’s a solid argument to be made that the scales in these charts shouldn’tstart at zero because we wouldn’t see any difference between the two years; all the lines would look flat. But there’s also a solid reason why they should start at zero—maybe I’m exaggerating the change if I don’t. Only the people who work closely with this data would know what kind of scale would fit best given the context of this foundation.
However, people on social media took notice of what they thought was a failure of mine and one commenter tweeted that “there’s no way [a dataviz Godfather] would approve this visual.” So, I got up the guts and sent the whole thing to the Godfather himself.
The Godfather wrote back: “To be honest, almost everything about your redesign is deceitful.” Ouch. I may have actually shed tears over this one. I was devastated.

There’s a good reminder here that failure is a critical part of learning.

Comments closed

Changing Server Collations En Masse

Published 2019-01-24 by Kevin Feasel

Hugo Kornelis has a script to change server collations across a large number of objects:

Problem is: the new instance was set up with a default collation that was different from the default collation of the old instance. And hence different from the database default collation. And when that resulted in errors, they responded by changing the default collation. Now everything ran and they were happy.
Fast forward two years. The database now has lots of “old” tables that use the original default collation for all string columns. But also lots of “new” tables that have their string columns using the new default collation. And the upgrade failed because somewhere in the new code is a comparison between two character columns that now have incompatible collation.

Click through for the script, as well as the standard disclaimer never blindly to run things in production.

Comments closed

Rotating Expired TDE Certificates

Published 2019-01-24 by Kevin Feasel

Chris Bell shows how you can quickly rotate TDE certificates, hopefully before they expire:

We have expired or expiring SQL TDE certificates! What now?
Well, the first thing we do is not panic. Even if our TDE certificate expires it won’t cause any issues. The SQL Server will continue to work normally. Even if we restore the DB elsewhere using the expired certificate we will just get a warning that the certificate is expired.
A warning is nice, and the system still working let’s us breathe a little easier, but we know that an updated certificate is a much better thing to have. In fact, setting up a regular key rotation schedule is even better and a recommended practice.

Good information, and Chris shares scripts to make it easy.

Comments closed

Impossible Joins And Terrible Execution Plans

Published 2019-01-24 by Kevin Feasel

Erik Darling shows us a case where the same query can be nice and fast, but change one parameter and suddenly performance goes out the window:

In the original plan, the TOP asked for rows, and quickly got them.
In the second plan, the TOP kept asking for rows, getting them from the Votes table, and then losing them on the join to Posts.
There was no parameter sniffing, there were no out of date stats, no blocking, or any other oddities. It’s just plain bad luck because of the data’s relationship.

Read the whole thing.

Comments closed

Computer Internals and the Buffer Pool

Published 2019-01-24 by Kevin Feasel

Randolph West starts a new series on the buffer pool in SQL Server:

Now that we’ve reminded ourselves of those fundamentals, let’s take a closer look at the buffer pool.
The buffer pool in SQL Server resides in the computer’s main memory (RAM). When the database engine requests a data page for reading or writing, it is assumed to be in the buffer pool. The buffer pool itself controls access between RAM and storage. If the data page that the database engine requests is not in RAM, a request is sent to the storage engine to retrieve that page. This may be storage directly attached to the system, or via a network interface card.

This first post is a high-level overview, but it looks like there’s a lot more in store from Randolph.

Comments closed

AMD vs Intel CPUs For Data Processing Jobs

Published 2019-01-23 by Kevin Feasel

Hariharan Iyer and Abhishek Srivastava run some tests against AWS’s new AMD-powered EC2 instances:

Our summary findings from TPCDS benchmarks are as follows:
– TPCDS queries are not as sensitive to local disk performance (and hence to EBS volume sizes)
– r5 (Intel) instances are consistently faster than r5a (AMD) instances. However, the speedup depends on the engine and the speedup for r5 (Intel) is lower for Spark (10%) than for Hive (25%).
– r5 instances are also either cheaper (by about 10% for Hive) or the same cost (for Spark)

At least for Hadoop and Spark work, Intel CPUs are a bit better, but there is some nuance in the story so check it out.

Comments closed

Running RStudio Server In Azure

Published 2019-01-23 by Kevin Feasel

David Smith notes that RStudio Server Pro is now available on Azure:

RStudio Server Pro is now available on the Azure Marketplace, the company announced on the RStudio Blog earlier this month. This means you can launch RStudio Server Pro on an virtual machine with the memory, disk, and CPU configuration of your choice, and pay by the minute for the VM instance plus a the RStudio software charge. Then, you can use a browser to access the remote RStudio Server (the interface is nigh-indistinguishable from the desktop version), with access to the commercial features of RStudio including support for multiple R version and concurrent R sessions, load-balancing and high availability instances, and enhanced security.

RStudio Server Pro and Microsoft R Server are both very nice for production-quality R servers. You can get away with the open source versions, but there are some good reasons to use the enterprise-grade products in an enterprise.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Month: January 2019

A Functional Approach To PySpark

Highlighting Words In Powershell

Querying Cosmos DB Execution Metrics

Visualization Failures

Changing Server Collations En Masse

Rotating Expired TDE Certificates

Impossible Joins And Terrible Execution Plans

Computer Internals and the Buffer Pool

AMD vs Intel CPUs For Data Processing Jobs

Running RStudio Server In Azure