
Month: October 2018

Connecting To Elasticsearch With R

Jerod Johnson has a sample of connecting to Elasticsearch with R:

You will need the following information to connect to Elasticsearch as a JDBC data source:

  • Driver Class: Set this to cdata.jdbc.elasticsearch.ElasticsearchDriver.
  • Classpath: Set this to the location of the driver JAR. By default, this is the lib subfolder of the installation folder.

The DBI functions, such as dbConnect and dbSendQuery, provide a unified interface for writing data access code in R. Use the following line to initialize a DBI driver that can make JDBC requests to the CData JDBC Driver for Elasticsearch:

Read on for the full instructions.
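The post works through the R/DBI route; if you happen to be coming at the same driver from Python instead, the JayDeBeApi package can load it over JDBC in much the same way. This is only a rough sketch, not something from the post — the connection URL format, jar path, host/port, and table name are all assumptions, so check the driver's documentation for the exact property names.

```python
# Rough sketch: reaching the CData Elasticsearch JDBC driver from Python via JayDeBeApi.
# URL format, jar location, and the queried table are assumptions, not from the post.
import jaydebeapi

conn = jaydebeapi.connect(
    "cdata.jdbc.elasticsearch.ElasticsearchDriver",       # driver class named in the post
    "jdbc:elasticsearch:Server=127.0.0.1;Port=9200;",     # assumed URL format
    jars="/opt/cdata/lib/cdata.jdbc.elasticsearch.jar",   # assumed path to the lib subfolder jar
)
try:
    cur = conn.cursor()
    cur.execute("SELECT * FROM Orders LIMIT 10")          # hypothetical index/table name
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```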


APPROX_COUNT_DISTINCT

Niko Neugebauer is happy with a new function in SQL Server 2019:

A rather interesting result takes place if we scale our database to 100GB TPCH and run the very same queries – the total elapsed time jumps to 50% difference (from 30%), the CPU execution time difference is kept at 50%, but the memory grant gives the biggest difference ever – those 24.476 MB are still intact for the APPROX_DISTINCT_COUNT, while the COUNT(DISTINCT) asks for just a bit over 11GB ! Besides going through a completely different gateway on the bigger machines, running COUNT(DISTINCT) will bring your system to a full stop way before the same will take place with the APPROX_DISTINCT_COUNT.
Regarding the precision – in my tests I did not see the difference going over 1%.

Test before using this function, but if you don’t need the exact number and can make do with “close enough,” this can save a boatload of memory on larger tables.
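APPROX_COUNT_DISTINCT is based on HyperLogLog, which is why its memory appetite barely moves as cardinality grows. Purely for intuition — this is not SQL Server's internals — here is a small Python comparison using the third-party datasketch package: the exact count has to remember every distinct value it sees, while the sketch keeps a fixed block of registers.

```python
# A feel for the memory trade-off behind approximate distinct counting: an exact count
# keeps every value it has seen, while an HLL sketch keeps ~2**p one-byte registers no
# matter how many values stream past. Requires the third-party datasketch package.
import sys
from datasketch import HyperLogLog

values = (f"customer-{i % 400_000}" for i in range(1_000_000))

exact = set()
hll = HyperLogLog(p=14)              # 2**14 registers, roughly 16 KB in total
for v in values:
    exact.add(v)
    hll.update(v.encode("utf8"))

print(f"exact distinct:  {len(exact):,}")
print(f"approx distinct: {int(hll.count()):,}")   # usually within a percent or two here
print(f"set alone:       {sys.getsizeof(exact) / 1024 / 1024:.1f} MB (plus the strings it holds)")
```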


Word Counts In DAX

Philip Seamark shows us a way of splitting strings into words in DAX:

Here is a technique you might consider if you need to split text down to individual words.  This could be used to help count, rank or otherwise aggregate the words in some longer text.  The approach detailed here uses spaces as a delimiter and will not be tripped up if multiple spaces are used between words.

There is no SPLIT function in DAX, so this approach uses the MID function to help find words.

The PBIX file used for the blog can be downloaded here.

[Updated 14th Oct, 2018]
A slightly updated version that uses UNICHAR/UNICODE to preserve the case (“A” versus “a”) of each letter can be downloaded here. The reason for this is that DAX stores a dictionary of unique values for every column.  The first instance of any value is added to the dictionary and assigned a new ID.  Subsequent values that are considered the same (“A” and “a” are considered the same) are assigned the same ID.  Using the UNICHAR/UNICODE version helps preserve the original case of each letter.

It’s an interesting approach and reminded me a bit of using a tally table to split strings in T-SQL.
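The DAX itself is in Phil's post and download; purely as an illustration of the idea (not his code), here is the same position-scanning approach in Python: find each spot where a non-space character follows a space, then read forward to the next space, so runs of spaces never produce empty words. In Python you would normally just call text.split(), which already behaves this way.

```python
# Not the DAX from the post -- just the MID/tally-style position scan in Python, to show
# how splitting on spaces can shrug off repeated spaces between words.
from collections import Counter

def words_by_position(text: str) -> list[str]:
    """Find each word by locating positions where a non-space follows a space (or the start)."""
    words = []
    for i, ch in enumerate(text):
        if ch != " " and (i == 0 or text[i - 1] == " "):   # a word starts here
            j = i
            while j < len(text) and text[j] != " ":        # walk forward to the next space
                j += 1
            words.append(text[i:j])
    return words

sample = "the quick  brown   fox jumps over the lazy dog"
print(Counter(words_by_position(sample)).most_common(3))
```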


External Memory Pressure In SQL Server 2019 On Linux

Anthony Nocentino walks us through memory pressure in SQL Server on Linux:

Now in SQL Server 2017 with that 7GB program running would cause Linux to need to make room in physical memory for this process. Linux does this by swapping least recently used pages from memory out to disk. So under external memory pressure, let’s look at the SQL Server process’ memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 2.95GB. VmSwap has increased to 5.44GB. Wow, that’s a huge portion of the SQL Server process swapped to disk.

In SQL Server 2019, there’s a different outcome! In the data below we see our 16GB VmSize which won’t change much because of the virtual address space for the process. With that large external process running SQL Server reduced VmRSS from 7.9GB (from Table 1) to 2.8GB only placing about 4.68GB in the swap file. That doesn’t sound much better, does it? I thought SQL Server was going to react to the external memory pressure…let’s keep digging and ask SQL Server what it thinks about this.

Anthony is doing some great work digging into this.  This is an area where you do have to understand the differences between Windows and Linux.
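The VmSize, VmRSS, and VmSwap figures Anthony quotes come from /proc/<pid>/status on Linux. If you want to watch them yourself, here is a small sketch (not from his post); the process id is a placeholder you would look up with something like pgrep sqlservr.

```python
# Quick sketch for pulling the VmSize / VmRSS / VmSwap figures for a Linux process.
# The pid below is hypothetical -- substitute the sqlservr pid on your machine.
from pathlib import Path

def vm_stats(pid: int) -> dict:
    """Return the VmSize/VmRSS/VmSwap lines from /proc/<pid>/status as {name: kB}."""
    stats = {}
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith(("VmSize", "VmRSS", "VmSwap")):
            name, value = line.split(":", 1)
            stats[name] = int(value.strip().split()[0])   # values are reported in kB
    return stats

if __name__ == "__main__":
    pid = 4242   # hypothetical sqlservr process id
    for name, kb in vm_stats(pid).items():
        print(f"{name}: {kb / 1024 / 1024:.2f} GB")
```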


Voice Control For Shiny Apps

Over at Jumping Rivers, an example of using a JavaScript library to control a page using voice commands:

I have found that performance across all devices and browsers is definitely not equal. By far the best browser I have found for viewing the apps is Google Chrome. I have also tended to find that my Ubuntu machines don’t do as well as Microsoft machines in picking up words correctly. A chat I had with someone recently suggested this might be down to drivers under Ubuntu for the microphones but that is not my area of expertise. Voice recognition was also fine on both of my Blackberry phones (one running BB OS 10, the other running Android 7).

It is worth noting that this does require an internet connection to function, in Chrome the voice to text is performed in the cloud.

The other thing I have noticed is that annyang seems relatively sensitive to background noise. This isn’t so bad for functions called using specific phrases but does sometimes have a large effect on the multi-word splats. This is because the splats are greedy and the background noise makes the recognition engine think that you are still talking long after you finished which gives the appearance of the application hanging.

The solution is by no means perfect, but it does look quite interesting.


Security Improvements In Kafka And Confluent Platform

Vahid Fereydouny demonstrates a number of security improvements made to Apache Kafka 2.0 as well as Confluent Platform 5.0:

Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature set to help with:

  • Using standard and central directory services like Active Directory (AD)/Lightweight Directory Access Protocol (LDAP)
  • Simplifying the management of access control lists (ACLs)
  • Proactive management and monitoring of security configurations to address the gaps as soon as possible

The following new security features are available in both Confluent Platform 5.0 and Apache Kafka 2.0:

  • Support for ACL-prefixed wildcards to simplify the management of access control
  • Kafka Connect password protection with support for externalizing secrets (to “secrets stores,” etc., like Hashicorp Vault)

The following security features are available only in Confluent Platform 5.0:

  • AD/LDAP group support
  • Feature access controls in Confluent Control Center
  • Viewing of broker configurations in Confluent Control Center, including differences in security configurations between brokers

Let’s walk through each of these enhancements in detail.

Read on for examples.
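As one concrete illustration of the ACL-prefixed wildcards mentioned above, they can also be created programmatically. The sketch below uses confluent-kafka-python's AdminClient (its ACL admin API shipped in client versions newer than the 2018 releases discussed here); the broker address, principal, and "billing-" prefix are all assumptions.

```python
# Sketch: create a prefixed-wildcard ACL with confluent-kafka-python's AdminClient.
# Broker address, principal, and the "billing-" prefix are placeholders.
from confluent_kafka.admin import (
    AdminClient, AclBinding, ResourceType, ResourcePatternType,
    AclOperation, AclPermissionType,
)

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Allow User:app1 to read every topic whose name starts with "billing-"
binding = AclBinding(
    ResourceType.TOPIC, "billing-", ResourcePatternType.PREFIXED,
    "User:app1", "*", AclOperation.READ, AclPermissionType.ALLOW,
)

futures = admin.create_acls([binding])
for acl, fut in futures.items():
    fut.result()   # raises if the broker rejected the ACL
    print(f"created {acl}")
```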


SparkSession Versus SparkContext

Abhishek Baranwal explains the differences between the SparkSession object and the SparkContext object when writing Spark code:

Prior to spark 2.0, SparkContext was used as a channel to access all spark functionality. The spark driver program uses sparkContext to connect to the cluster through resource manager.

SparkConf is required to create the spark context object, which stores configuration parameters like appName (to identify your spark driver), number core and memory size of executor running on worker node.

In order to use API’s of SQL, Hive, and Streaming, separate context needs to be created.

Read on to see where SparkSession fits in.
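As a rough sketch of the difference (placeholder app names and a local master, not code from the linked post): before 2.0 you built a SparkConf, then a SparkContext, and created separate SQL/Hive/Streaming contexts on top of it; from 2.0 on, SparkSession is the single entry point and still exposes the underlying SparkContext.

```python
# Contrast of the pre-2.0 entry point (SparkConf + SparkContext) with SparkSession.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Spark 1.x style: SparkConf carries appName, master, executor settings; SparkContext is
# the channel to the cluster, and SQL/Hive/Streaming each needed their own context.
conf = SparkConf().setAppName("pre-2.0-style").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3]).count())
sc.stop()

# Spark 2.0+ style: SparkSession wraps the SparkContext and exposes SQL directly.
spark = (SparkSession.builder
         .appName("2.0-style")
         .master("local[*]")
         # .enableHiveSupport()   # optional; replaces the separate HiveContext
         .getOrCreate())
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) AS n FROM t").show()
print(spark.sparkContext.applicationId)   # the session still wraps a SparkContext
spark.stop()
```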


Simulating LAG And LEAD Prior To SQL Server 2012

Itzik Ben-Gan highlights a reader submission from his last post:

Last month I covered a Special Islands challenge. The task was to identify periods of activity for each service ID, tolerating a gap of up to an input number of seconds (@allowedgap). The caveat was that the solution had to be pre-2012 compatible, so you couldn’t use functions like LAG and LEAD, or aggregate window functions with a frame. I got a number of very interesting solutions posted in the comments by Toby Ovod-Everett, Peter Larsson, and Kamil Kosno. Make sure to go over their solutions since they’re all quite creative.

Curiously, a number of the solutions ran slower with the recommended index than without it. In this article I propose an explanation for this.

Even though all solutions were interesting, here I wanted to focus on the solution by Kamil Kosno, who’s an ETL developer with Zopa. In his solution, Kamil used a very creative technique to emulate LAG and LEAD without LAG and LEAD. You will probably find the technique handy if you need to perform LAG/LEAD-like calculations using code that is pre-2012 compatible.

Kamil’s solution was quite clever.


Actual Versus Estimated Rows In SSMS 18

David Alcock is happy with a feature in SQL Server Management Studio 18:

Last week Microsoft released SQL Server Management Studio 18.0 into public preview, here’s a link so you can read about the new and improved functionality it offers.

One significant change is the addition of actual vs estimated row counts onto the showplan operators in execution plans (only actual, not estimated…which kinda makes sense).

Here I’m running a very simple bit of code on some DMV’s (namely exec requests and sessions) to demonstrate this addition.

Read on for the example.
