Basically speaking, the “http user” is used to authenticate through the HDInsight gateway, which protects the HDInsight clusters you create. This user accesses the Ambari UI, the YARN UI, and many other UI components.
The “ssh user” is used to access the cluster through secure shell. This user is actually a user in the Linux system on all the head nodes, worker nodes, edge nodes, etc., so you can use secure shell to access the remote clusters.
For Microsoft R Server on HDInsight clusters, it’s a bit more complex, because we put RStudio Server Community edition in HDInsight, which only accepts a Linux username and password as its login mechanism (it does not support passing tokens). So if you have created a new cluster and want to use RStudio, you need to first log in through the HDInsight gateway using the http user’s credentials, and then log in to RStudio using the ssh user’s credentials.
It’s a good read and also includes a sample Spark-R job.
Have you ever executed a query in SQL Server Management Studio, looked at the execution plan, and noticed that it was a different plan than what was generated on the server?
A potential reason for this could be different option settings. The options represent the SET values of the current session. SET options can affect how the query is executed, thus producing a different execution plan. You can find these options in two places within SSMS, including Tools -> Options -> Query Execution -> SQL Server -> Advanced.
Click through for lots of information including a script John provided to see which options are currently on.
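As a quick supplement (this is a built-in command, not John’s script): you can dump the current session’s SET options with DBCC USEROPTIONS, or decode the @@OPTIONS bitmask directly. ARITHABORT (bit 64) is the classic culprit, since SSMS turns it on by default while many client libraries leave it off.

```sql
-- Show the SET options in effect for the current session
DBCC USEROPTIONS;

-- ARITHABORT is bit 64 of @@OPTIONS; a mismatch between SSMS and the
-- application connection is a common cause of differing plans.
SELECT CASE WHEN @@OPTIONS & 64 = 64 THEN 'ON' ELSE 'OFF' END AS arithabort;
```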
I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years. I know enough R to be dangerous, but when I want to build a model I reach for my SAS Enterprise Miner (it could just as easily be SPSS, RapidMiner, or one of the other complete platforms).
The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.
The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.
I have almost exactly the opposite thought on the matter: that drag-and-drop development is intolerably slow; I can drag and drop and connect and click and click and click for a while, or I can write a few lines of code. Nevertheless, I think Bill’s post is well worth reading.
Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier. First, histograms:
Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.
We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (cut into multiple categories):
Shape and size
There are many ways to tweak the size of the points. Here’s the combination I settled on for this post:
There are some nice tricks here around transparency, color scheme, and gradients, making it a great series. As a quick note, the color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness. Using a color-blindness filter like Toptal’s is quite helpful in discovering these sorts of issues.
In my current project, I’ve got a database containing everything to perform these tests:
- Tables with identical structure to the ones in the staging area (plus two columns “TestSuiteName” and “TestName”)
- A table containing the mapping from test-input table to target database, schema and table
- A stored procedure to purge the DWH (all layers) in the test environment
- A stored procedure to insert the data for a specific test suite / test name
When preparing a specific test case (the “insert rows for test case” step from the diagram above), the rows needed for that case are copied into the DWH:
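A minimal sketch of what that copy step might look like; the table, column, and test names here are illustrative placeholders, not taken from the original post:

```sql
-- Hypothetical example: copy the rows for one test case from the
-- test-input table into the matching staging-area table. The real
-- implementation drives the target from the mapping table.
INSERT INTO Staging.dbo.Customer (CustomerId, CustomerName)
SELECT CustomerId, CustomerName
FROM   TestDb.dbo.Customer
WHERE  TestSuiteName = 'NightlyLoad'   -- placeholder suite
  AND  TestName      = 'DuplicateCustomer';  -- placeholder test
```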
Testing warehouses is certainly not a trivial exercise but given how complex warehouse ETL tends to be, having good tests reduces the number of 3 AM pages.
OptimizerStatsUsage is available in cached plans, so both the “estimated execution plan” and the “actual execution plan” will have this information.
In the above example, I see the ModificationCount is very high (almost as much as the table cardinality itself); on closer observation, it turns out the statistic had been updated with NORECOMPUTE.
And looking at the Seek itself, there is a large skew between estimated and actual rows. In this case, I now know a good course of action is to update statistics. Doing so produces this new result: the ModificationCount is back to zero and the estimates are now correct.
This will be a good addition to SQL Server 2017.
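The same investigation can be done today from the DMVs; a sketch of the check-and-fix described above (the DMV and columns are real, but the table name is a placeholder):

```sql
-- Inspect row counts, modification counters, and NORECOMPUTE flags
-- for all statistics on a table (requires SQL Server 2008 R2 SP2 /
-- 2012 SP1 or later for sys.dm_db_stats_properties).
SELECT s.name, sp.last_updated, sp.rows,
       sp.modification_counter, s.no_recompute
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.SomeTable');  -- placeholder table

-- If the modification counter is high, refresh the statistics manually:
UPDATE STATISTICS dbo.SomeTable WITH FULLSCAN;
```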
In my lab, I decided to play around with the automatic seeding functionality that is part of Availability Groups. This was sparked by my last post about putting SSISDB into an AG. I wanted to see how it would work for a regular database. When I attempted to do so, I received the following error:
Cannot alter the availability group ‘Group1’, because it does not exist or you do not have permission. (Microsoft SQL Server, Error: 15151)
Read on for the answer; it turns out automatic seeding itself was not the culprit.
This optimization provides a great boost with a sufficient number of rows. You can read more about its test results in the blog post “Optimized Nested Loops Joins” by Craig Freedman, an optimizer developer.
However, if the actual number of rows is less than the expected one, then the additional CPU cost of building this sort may outweigh its benefits, increasing CPU consumption and reducing performance.
Read the whole thing. I think the likelihood of using either this hint or the trace flag is near nil, but crazy things do come up.
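For reference, the hint and trace flag in question look like this; the query itself is a made-up placeholder:

```sql
-- Disable the optimized (sorted) nested loops prefetch for one query
-- via USE HINT (available from SQL Server 2016 SP1):
SELECT o.OrderId, c.CustomerName          -- placeholder query
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
OPTION (USE HINT ('DISABLE_OPTIMIZED_NESTED_LOOP'));

-- Or instance-wide with documented trace flag 2340 (use with care):
-- DBCC TRACEON (2340, -1);
```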
Erik Darling has started looking at interleaved execution of multi-statement table-valued functions in SQL Server 2017. First, he gives an intro:
In the first plan, the optimizer chooses the ColumnStore index over the nonclustered index that it chose in compat level 130.
This plan is back to where it was before, and I’m totally cool with that. Avoiding bad choices is just as good as making good choices.
I think. I never took an ethics class, so whatever.
In this case, the inline table valued function wiped the floor with the MSTVF, even with Interleaved Execution.
Obviously there’s overhead dumping that many rows into a table variable prior to performing the join, but hey, if you’re dumping enough rows in a MSTVF to care about enhanced cardinality estimation…
Just like Global Thermonuclear War, I believe the best way to win multi-statement versus inline TVFs is not to play at all.
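For readers who want the distinction concrete, here is a minimal sketch of the two function styles (the names and schema are invented for illustration):

```sql
-- Inline TVF: a single SELECT the optimizer can expand into the
-- calling query, so it gets full cardinality estimation.
CREATE FUNCTION dbo.GetOrders_Inline (@CustomerId int)
RETURNS TABLE
AS RETURN
    SELECT OrderId, OrderDate
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
GO

-- Multi-statement TVF: rows are materialized into a table variable
-- first. Interleaved execution in SQL Server 2017 improves the row
-- estimate for the result, but the materialization cost remains.
CREATE FUNCTION dbo.GetOrders_MSTVF (@CustomerId int)
RETURNS @Result TABLE (OrderId int, OrderDate date)
AS
BEGIN
    INSERT @Result (OrderId, OrderDate)
    SELECT OrderId, OrderDate
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
    RETURN;
END;
```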