Press "Enter" to skip to content

Author: Kevin Feasel

Cloudera And Hortonworks Officially Merged

Arun Murthy gives the used-to-be-Hortonworks perspective on the now-official merger of Cloudera and Hortonworks:

Our merger did not arise out of the blue. Our respective missions were well aligned, and together the new Cloudera has the scale it needs to service the constantly changing needs of the world’s most demanding organizations and to grow even more dominant in the market.
New open-source standards such as Kubernetes, container technology and the growing adoption of cloud-native architectures are major parts of Cloudera’s strategy.  Our primary initiative out of the gate is to deliver a 100-percent open-source unified platform, which leverages the best features of Hortonworks Data Platform (HDP) 3.0 and Cloudera’s CDH 6.0. Cloud-native and built for any cloud – with a public cloud experience across all clouds – the unified platform embodies our shared “cloud everywhere” vision.

I’m more a fan of the Hortonworks tooling like Ambari than I am of Cloudera’s alternatives, so it will be interesting to see what happens going forward. The good news for recalcitrant types like me is that HDP will be around for a couple of years yet.

Comments closed

Installing Azure DevOps Server 2019

Kevin Chant takes a look at Azure DevOps Server 2019 RC1:

To clarify, Azure DevOps Server 2019 is the new name for TFS. Furthermore, it’s now rebranded and updated to look and feel like the Azure DevOps service. Which is the new name for Visual Studio team Services (VSTS) in Azure.
Therefore, I decided to do this because I am currently involved with Azure DevOps and using it with SQL Server deployments. Hence, I decided I wanted a local copy of my own to test things.

Click through for a couple of interesting findings, especially around required (and disallowed) versions of prerequisites like SQL Server versions.

Comments closed

Counting Virtual Log Files

Jamie Wick has a couple of scripts to help us monitor virtual log file counts for our transaction logs:

The LOGINFO command returns a row for each Virtual Log File in the Transaction Log. The FileSize column shows the size (KB) of each individual VLF. The amount of the transaction log that is currently in use can be determined by the number rows with a Status of 2 (0 = Inactive, 2 = Active & 4 = VLF exists on the Primary replica, but not on the Secondary replica).
One of the biggest problems with the LOGINFO command has been the difficulty of extrapolating the information to include all databases on the server, for use in reports, automated alerts and maintenance processes. Beginning with SQL Server 2016 SP2, there are 2 new Dynamic Management Views (DMVs) that can be used for monitoring the VLF information: sys.dm_db_log_info & sys.dm_db_log_stats.

A huge number of VLFs can be a hidden performance sink.

Comments closed

How Query Store Stores Date Data

Erin Stellato shows us a few nuances to the way that Query Store handles dates internally:

Query Store retains query performance data at the plan level.  This data is then broken out into intervals of time, determined by the INTERVAL_LENGTH_MINUTES setting.  The time intervals are found in the sys.query_store_runtime_stats_interval system view, and the start_time and end_time columns are of the DATETIMEOFFSET data type.  This means that the date is time-zone aware, and in Query Store the data is stored as UTC.  Now why does all of this matter?  Because handling dates in Query Store is important if you’re going to query the data directly.

Click through to see a query of what this looks like, as well as a few tips on parsing the data.

Comments closed

xp_cmdshell And Non-Sysadmin Accounts

Lucas Kartawidjaja shows us how you can grant a non-sysadmin user the right to run xp_cmdshell:

Once we run the above T-SQL query, any account that is part of the sysadmin role in the SQL Server instance has the ability to run the xp_cmdshell extended stored procedure. On the background, when the user with sysadmin privileges runs the xp_cmdshell, it will execute the Windows command shell using the SQL Server Service Account (So if you are executing xp_cmdshell to access certain resource on the network, for example, and you are having permission issue, you might want to make sure that the SQL Server Service Account has permission to that resource).
Now, what if you have a non-sysadmin account that needs to run xp_cmdshell? In order to do that, we would need to do some additional configuration.

Granting non-sysadmins rights to run xp_cmdshell definitely rates as well above-average in terms of risk. I don’t have any problem with xp_cmdshell being turned on—especially considering that by default, only sysadmin accounts get it and sysadmin accounts can turn it on if it’s disabled, meaning it’s effectively always on for sysadmin. But when you start granting non-sysadmin accounts the ability to shell out, you have to be even more careful of protecting that SQL Server instance.

Comments closed

Writing Vectorized Code In R

John Mount helps us understand writing R code like a native:

This sort of difference, scalar oriented C++ being so much faster than scalar oriented R, is often distorted into “R is slow.”
This is just not the case. If we adapt the algorithm to be vectorized we get an R algorithm with performance comparable to the C++implementation!
Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (xlin_fits_R()) is a bit complicated, but the vectorized version (xlin_fits_V()) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in R) to working over a small number of vectors (fast in R).

This is akin to writing set-based SQL instead of cursor-based SQL: you’re thinking in terms which make it easier for the interpreter (or optimizer, in the case of a database engine) to operate quickly over your inputs. It’s also one of a few reasons why I think learning R makes a lot of sense when you have a SQL background.

Comments closed

gganimate Now On CRAN

Thomas Lin Pedersen announces that gganimate is now available on CRAN:

While this commit was done in the autumn 2017, nothing further happened until I decided to make gganimate the center of my useR 2018 keynote, at which point I was forced (by myself) to have some sort of package ready by the summer of 2018.
A fair amount of users have shown displeasure in the breaking changes this history has resulted in. Many blog posts have already been written focusing on the old API, as well as code on numerous computers that will no longer work. I understand this frustration, of course, but both me and David agreed that doing it this way was for the best in the end. I’m positive that the new API has already greatly exceeded the mind-share of the old API and given a year the old API will be all but a distant memory…

Read on for information on these breaking changes, and how the changes will make life easier in the long run. And stay for the fireworks. H/T R-Bloggers

Comments closed

Practical AI Workshop Notebooks

David Smith has published a set of notebooks from the Practical AI for the Working Software Engineer workshop:

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, bu I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this Github repository (and feedback is welcome there, too). 

Read on for details about those notebooks and to get your own copies.

Comments closed

Finding Max Concurrent Operations With T-SQL

I have a post up showing how to calculate the maximum number of concurrent operations using T-SQL:

You can probably see by this point how the pieces are coming together:  each time frame has a starting point and an ending point.  If there were no overlap at all, we’d see in the fourth column a number followed by a NULL, followed by a number followed by a NULL, etc.  But we clearly don’t see that:  we see work item ordinals 3 and 4 share some overlap:  item 3 started at 3:06:15 PM and ended after item 4’s start of 3:07:20 PM.  This means that those two overlapped to some extent.  Then we see two NULL values, which means they both ended before 5 began.  So far so good for our developers!

Click through for a bunch of T-SQL scripts, images, and important advice about always having interns around to take the blame.

Comments closed

Visualizing Hash Match Joins

Bert Wagner continues his series on visualizing physical join operators:

Hash Match joins are the dependable workhorses of physical join operators.
While Nested Loops joins will fail if the data is too large to fit into memory, and Merge Joins require that the input data are sorted, a Hash Match will join any two data inputs you throw at it (as long as the join has an equality predicate and you have enough space in tempdb).  

Bert has some great animated GIFs too.

Comments closed