Press "Enter" to skip to content

Day: December 29, 2016

Basic Non-Linear Regression In R

Renata Ghisloti Duarte de Souza gives an example of running a non-linear regression in R:

Now, suppose you were able to find a good function to model your data. With that, we are able to predict future values for our small dataset.

One important thing about the predict() function in R is that it expects a similar dataframe with the same column name and type as the one you used in your model.
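
To make that concrete, here is a minimal R sketch (invented data and column names, not Renata's example): nls() fits a non-linear model, and predict() only works if the new data frame uses the same column name the model was fit with.

# Toy data for an exponential relationship (made-up values)
set.seed(42)
df <- data.frame(x = 1:10)
df$y <- 2.5 * exp(0.3 * df$x) + rnorm(10)

# Fit a non-linear model with rough starting values
fit <- nls(y ~ a * exp(b * x), data = df, start = list(a = 2, b = 0.2))

# predict() expects a data frame whose column is named x, matching the model's data
new_data <- data.frame(x = 11:15)
predict(fit, newdata = new_data)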

Click through for several examples.


Logstash Filters

Nicolas Frankel explains how the grok and dissect filters work in Logstash:

The Grok filter gets the job done. But it seems to suffer from performance issues, especially if the pattern doesn’t match. An alternative is to use the dissect filter instead, which is based on separators.

Unfortunately, there’s no app for that – but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n
%{pid} %{}[%{thread}] %{class}:%{log}
(broken on 2 lines for better readability)
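
For orientation, that mapping plugs into a dissect filter in the pipeline configuration, roughly like this (a sketch rather than Nicolas's exact config; message is simply the default field holding the raw log line):

filter {
  dissect {
    mapping => {
      "message" => "%{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n%{pid} %{}[%{thread}] %{class}:%{log}"
    }
  }
}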

One of the big secrets to effective debugging of code is having good logging mechanisms in place.


HDInsight With Hive LLAP

Rashin Gupta explains some performance benefits of using Hive 2.0 (LLAP) on HDInsight:

With LLAP, we allow data scientists to query data interactively in the same storage location where data is prepared. This means that customers do not have to move their data from a Hadoop cluster to another analytic engine for data warehousing scenarios. Using ORC file format, queries can use advanced joins, aggregations and other advanced Hive optimizations against the same data that was created in the data preparation phase.

In addition, LLAP can also cache this data in its containers so that future queries can be served from memory rather than from disk. Caching brings Hadoop closer to other in-memory analytic engines and opens Hadoop up to many new scenarios where interactivity is a must, such as BI reporting and data analysis.
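
As a trivial sketch of the ORC point (table and column names here are invented, not from the post), the same ORC-backed tables serve both the preparation step and interactive joins and aggregations:

-- Hypothetical ORC tables created during data preparation
CREATE TABLE flights (flight_date DATE, carrier STRING, dep_delay INT) STORED AS ORC;
CREATE TABLE carriers (carrier STRING, carrier_name STRING) STORED AS ORC;

-- Interactive-style query against the same data: join plus aggregation
SELECT c.carrier_name, AVG(f.dep_delay) AS avg_delay
FROM flights f
JOIN carriers c ON f.carrier = c.carrier
GROUP BY c.carrier_name;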

Even with this, Hive is still more of a “warehousing” technology, but this moves it closer to real-time (or at least “not slow”) warehousing.


Polybase Execution Plan With Blob Storage

I look at an execution plan and packet capture of a Polybase query which reads from Azure Blob Storage:

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

The upshot is that Polybase behaves very similarly on Azure Blob Storage as it does with on-prem Hadoop for non-MapReduce queries.
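
For reference, the setup behind this kind of query looks roughly like the following (a sketch with placeholder names; a real definition needs a credential for non-public storage, the full column list, and a file format matching the file):

-- External data source pointing at the blob container (placeholder account and container)
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net'
);

-- Delimited text file format
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- External table over the file; columns abbreviated
CREATE EXTERNAL TABLE dbo.Flights2008
(
    FlightDate VARCHAR(10),
    DepDelay   INT
)
WITH (
    LOCATION = '/2008.csv.bz2',
    DATA_SOURCE = AzureBlob,
    FILE_FORMAT = CsvFormat
);

-- Without MapReduce pushdown, the predicate is applied only after the whole file
-- has been streamed into the Polybase temp table
SELECT COUNT(*)
FROM dbo.Flights2008
WHERE DepDelay > 60;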


Batch Mode Adaptive Query Memory Feedback

Niko Neugebauer talks about an upcoming performance improvement for batch mode operations with columnstore indexes:

This adjustment might take place based on the inadequacy of the estimated number of rows (which comes from the statistics available at the time of execution plan generation) relative to the actual number of rows the iterator/query processes.
The two possible adjustment scenarios are:
– when the estimated number of rows is too high and memory is granted to the query even though the query will not use it.
– when the estimated number of rows is too low and memory operations such as hashing or sorting do not have enough space to fit the complete data sets, making them spill to tempdb (temporarily storing the data while doing the work because the query lacks sufficient memory).

Read on for details, but one interesting caveat is that this doesn’t change anything for the first run; it only updates requests on subsequent runs, so it benefits most from consistent workloads with significant plan re-use.  That said, it looks extremely useful.
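
One way to watch the adjustment once it ships (a sketch; dbo.FactSales and its columnstore index are placeholders for whatever batch mode query you test with) is to run the same query a few times and compare the grant numbers:

-- Run a batch mode query more than once (GO 2 repeats the batch in SSMS)
SELECT CustomerKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY CustomerKey
ORDER BY TotalSales DESC;
GO 2

-- Compare grants across executions; after the first run the grant should shrink
-- for over-estimates or grow for queries that spilled
SELECT qs.execution_count,
       qs.min_grant_kb,
       qs.max_grant_kb,
       qs.last_grant_kb
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%FactSales%'
  AND st.text NOT LIKE '%dm_exec_query_stats%';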


Replication Publisher To Azure SQL DB

Jes Borland continues her series on transactional replication from on-prem SQL Server + Availability Groups into Azure SQL Database:

After initializing, check the Snapshot Agent and Log Reader Agent for success. (To do so, go to Replication, right-click the publication name, and select Snapshot Agent Status and Log Reader Agent Status.) I ran into problems with the Snapshot account not having high enough permissions in the databases (it needs db_owner), and then not having enough permissions on the snapshot folder (it needs Full). (This forum post, answered by Hilary Cotter, helped: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/899857db-e38e-4026-a34c-2a8c2628c6fc/access-denied-to-sql-replication-snapshot-folder?forum=sqlreplication.)
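
The database permission piece is a quick fix once you spot it (the login below is a placeholder for whatever Windows account the Snapshot Agent runs under; the snapshot folder permission itself is granted in Windows, not T-SQL):

-- Run in the publication database (and the distribution database as appropriate)
CREATE USER [DOMAIN\ReplSnapshotAgent] FOR LOGIN [DOMAIN\ReplSnapshotAgent];
ALTER ROLE db_owner ADD MEMBER [DOMAIN\ReplSnapshotAgent];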

Except for the final section, it’s pretty much the same as dealing with on-prem SQL Server sans Availability Groups.


STRING_AGG() Performance

Aaron Bertrand wants to know how the STRING_AGG() function performs:

We can see that our FORCESCAN hint really did make things worse – while we shifted the cost away from the clustered index seek, the sort was actually much worse, even though the estimated costs deemed them relatively equivalent. More importantly, we can see that STRING_AGG() does offer a performance benefit, whether or not the concatenated strings need to be ordered in a specific way. As with STRING_SPLIT(), which I looked at back in March, I am quite impressed that this function scales well prior to “v1.”

Given that the early releases tend to be “get the thing working” and later CTPs are around “make the thing faster,” it’s nice to see that STRING_AGG() is already ready for prime-time, and makes me wonder if they’ll make it even faster by RTM.
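
For anyone who has not played with it yet, the basic shape is simple (table and column names below are placeholders, not Aaron's test setup):

-- Unordered concatenation per group
SELECT o.CustomerID,
       STRING_AGG(o.OrderNumber, ', ') AS OrderList
FROM dbo.Orders AS o
GROUP BY o.CustomerID;

-- Ordered concatenation, the WITHIN GROUP case Aaron also benchmarks
SELECT o.CustomerID,
       STRING_AGG(o.OrderNumber, ', ') WITHIN GROUP (ORDER BY o.OrderDate) AS OrderList
FROM dbo.Orders AS o
GROUP BY o.CustomerID;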


Unencrypted Backups With TDE

Steve Jones shows what you need to do to take an unencrypted backup on a database with TDE configured:

When SQL Server goes to restore the file, it reads part of the header. In here, the process must detect the DEK and try to decrypt that key. However, since this new instance does not have the certificate, this doesn’t work and an error is thrown, despite not needing the key since the data isn’t encrypted.

The issue here is the DEK still exists in the source database.
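
To make the failure concrete, here is a rough sketch of the scenario (database names and paths are placeholders, the certificate and master key steps are omitted, and this is not Steve's exact script):

-- Source instance: TDE was enabled, encryption has since been turned off,
-- but the database encryption key (DEK) still exists
ALTER DATABASE TDEDemo SET ENCRYPTION OFF;
BACKUP DATABASE TDEDemo TO DISK = N'C:\Backups\TDEDemo.bak';

-- Second instance without the source certificate: the restore reads the DEK
-- reference in the header and fails, even though the data is not encrypted
RESTORE DATABASE TDEDemo FROM DISK = N'C:\Backups\TDEDemo.bak';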

Read the whole thing for the solution.


Windows Virtual Accounts

Wayne Sheffield describes virtual accounts and how SQL Server can make use of them:

SQL Server will use these groups in many places so that permissions are granted to the group instead of the actual service account. This simplifies things greatly if you change the service account – SQL Server Configuration Manager just changes the member of the group rather than hunting down every place that needs permissions for the service account. Using the group instead of the service account also helps with the specific permissions you grant on local resources (paths, registry, etc.): granted directly to the service account, they would all have to be redone after a change, but granted to the group, they carry over unchanged.

I consider virtual accounts—particularly when you stick to using the virtual account itself rather than a domain account—to be a really good security feature, as it prevents system administrators from getting lazy and using the same service account everywhere.  This in turn blocks an attacker from using a pass-the-hash strategy to pivot from one SQL Server instance to another.
