Our summary findings from TPCDS benchmarks are as follows:
– TPCDS queries are not as sensitive to local disk performance (and hence to EBS volume sizes)
– r5 (Intel) instances are consistently faster than r5a (AMD) instances. However, the speedup depends on the engine and the speedup for r5 (Intel) is lower for Spark (10%) than for Hive (25%).
– r5 instances are also either cheaper (by about 10% for Hive) or the same cost (for Spark)
At least for Hadoop and Spark work, Intel CPUs are a bit better, but there is some nuance in the story so check it out.
RStudio Server Pro is now available on the Azure Marketplace, the company announced on the RStudio Blog earlier this month. This means you can launch RStudio Server Pro on an virtual machine with the memory, disk, and CPU configuration of your choice, and pay by the minute for the VM instance plus a the RStudio software charge. Then, you can use a browser to access the remote RStudio Server (the interface is nigh-indistinguishable from the desktop version), with access to the commercial features of RStudio including support for multiple R version and concurrent R sessions, load-balancing and high availability instances, and enhanced security.
RStudio Server Pro and Microsoft R Server are both very nice for production-quality R servers. You can get away with the open source versions, but there are some good reasons to use the enterprise-grade products in an enterprise.
In RDBMS, we sometimes use the term skew colloquially to mean the same thing as non-uniform distribution, i.e. a normal distribution would also be skewed. We simply mean that some values appear more often than others. Thus, I will put the term “skew” in double quotes in this article. While your RDBMS’s statistics contain this information once they are calculated, we can also detect such “skew” manually in ad-hoc queries using percentiles, which are defined in the SQL standard and supported in a variety of databases, as ordinary aggregate functions, including:
– SQL Server (regrettably, only as window functions)
As Lukas implies, SQL Server is a step behind in terms of calculating percentiles, and calculating several percentiles over a large data set will be slow. Very slow. Though batch mode processing in 2019 does help here.
Instead of filling the execution plan cache with plans that will never get reused, the optimize for ad hoc workloads option will cache a plan stub instead of the full plan. The plan stub is significantly smaller in size and is only replaced with the full execution plan when SQL Server recognizes that the same query has executed multiple times.
This reduces the amount of size one-time queries take up in t he cache, allowing more reusable plans to remain in the cache for longer periods of time.
I’ve run into several cases where this has helped SQL Server and don’t think I’ve found a scenario where it actively hurts.
It’s important to try and use an install set that is the same level of Service pack as your current install. Otherwise, you could end up installing multiple patches to get the SQL Launchpad service to work. Which is something discussed in a previous post here.
I know some companies have a central installer for SQL Server and then have all the updates in another location. Hence, if you are in such an environment be prepared to run multiple updates from that location after the install.
This is definitely one of the features which is easier to install from the beginning than to install after the fact.
Since, in spite of our best efforts since SQL Server 2000, we can’t effectively stop scalar UDFs from being used, wouldn’t it be great to make SQL Server simply handle them better?
SQL Server 2019 introduces a new feature called Scalar UDF Inlining. Instead of keeping the function separate, it is incorporated into the overall plan. This leads to a much better execution plan and, in turn, better runtime performance.
Read the whole thing. Especially the part about how this changes the way one of the DMV counters behaves.
If you don’t have an account in Azure or you don’t have a subscription that you can use for Azure Data Lake, No need to worry! You can still use Dataflow. The whole process of storing data into Azure Data Lake is internally managed through Dataflow. You won’t even need to login to the Azure portal or anywhere else. From your point of view, in the Power BI website, you create a dataflow, and that dataflow manages the whole storage configuration. You won’t need to have any other accounts or pay anything extra or more than what you are paying for Power BI subscription.
Click through for use cases and some tips.
Beginning of last year, we started to develop a chat bot demo. The idea was to integrate the chat bot into one of the big applications as a replacement to FAQ. Users can ask questions to bot thus avoiding obvious support tickets in the future.
Things went well. We got appreciation on the demo and started moving to production. About half way, things started turning south. The demo chat bot application used Bot SDK V3. It had voice recognition enabled which allow users to talk to it and get the response back in voice. During the demo we used Azure Bing Speech API. But later before the production, we got the notice that the service is obsolete and will be retired mid 2019. Another surprise was the introduction of Bot SDK V4 which is entirely different that Bot SDK V3. Something like AngularJS v/s Angular.
The major services tend to give you some time to switch over—in this case, they had 10 months to make a move. But when dealing with online services versus locally installed products, there’s always a risk that the service you’re calling won’t be there, and depending upon how critical that service is, it can have a major effect on your ability to function if it disappears one day. That’s definitely not a reason to ignore these services; it’s a reason to have a backup plan in place.