A rather interesting result shows up if we scale the database to a 100GB TPC-H data set and run the very same queries: the difference in total elapsed time jumps from 30% to 50%, the difference in CPU execution time stays at around 50%, but the memory grant shows the biggest difference yet. Those 24.476 MB remain intact for APPROX_COUNT_DISTINCT, while COUNT(DISTINCT) asks for just over 11GB! Besides going through a completely different gateway on bigger machines, running COUNT(DISTINCT) will bring your system to a full stop well before APPROX_COUNT_DISTINCT does.
Regarding precision, in my tests I did not see the difference exceed 1%.
Test before using this function, but if you don't need the exact number and can make do with "close enough," it can save a boatload of memory on larger tables.
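If you want to try a comparison like this yourself, a minimal sketch might look like the following (the TPC-H-style table and column names are assumptions for illustration; compare the memory grant reported in the actual execution plans of the two statements, and note that APPROX_COUNT_DISTINCT requires SQL Server 2019 or later):

```sql
-- Exact distinct count: can request a very large memory grant on big tables.
SELECT COUNT(DISTINCT l_orderkey) AS distinct_orders
FROM dbo.lineitem;

-- Approximate distinct count: small, fixed-size memory grant,
-- at the cost of a small error (under 1% in the tests described above).
SELECT APPROX_COUNT_DISTINCT(l_orderkey) AS approx_distinct_orders
FROM dbo.lineitem;
```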
Here is a technique you might consider if you need to split text down to individual words. This could be used to help count, rank or otherwise aggregate the words in some longer text. The approach detailed here uses spaces as a delimiter and will not be tripped up if multiple spaces are used between words.
There is no SPLIT function in DAX, so this approach uses the MID function to help find words.
The PBIX file used for the blog can be downloaded here.
[Updated 14th Oct, 2018]
A slightly updated version that uses UNICHAR/UNICODE to preserve the case (“A” versus “a”) of each letter can be downloaded here. The reason for this is that DAX stores a dictionary of unique values for every column. The first instance of any value is added to the dictionary and assigned a new ID; subsequent values that are considered the same (“A” and “a” are treated as the same) are assigned the same ID. Using the UNICHAR/UNICODE version helps preserve the original case of each letter.
It’s an interesting approach and reminded me a bit of using a tally table to split strings in T-SQL.
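For comparison, here is a minimal tally-table word splitter in T-SQL. It illustrates the analogous technique only, not the DAX approach from the post, and it likewise tolerates runs of multiple spaces:

```sql
DECLARE @s nvarchar(max) = N'the quick  brown fox';

-- A small tally of positions 1..LEN(@s); for very long strings you would
-- cross join a couple of catalog views to generate enough rows.
WITH n AS (
    SELECT TOP (LEN(@s)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM sys.all_objects
)
SELECT SUBSTRING(@s, pos, CHARINDEX(N' ', @s + N' ', pos) - pos) AS word
FROM n
WHERE SUBSTRING(N' ' + @s, pos, 1) = N' '   -- previous character is a space (or start of string)
  AND SUBSTRING(@s, pos, 1) <> N' '         -- current character starts a word, so repeated spaces are skipped
ORDER BY pos;
```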
Now, in SQL Server 2017, running that 7GB program causes Linux to make room in physical memory for this process. Linux does this by swapping the least recently used pages from memory out to disk. So, under external memory pressure, let’s look at the SQL Server process’s memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 2.95GB. VmSwap has increased to 5.44GB. Wow, that’s a huge portion of the SQL Server process swapped to disk.
In SQL Server 2019, there’s a different outcome! In the data below we see our 16GB VmSize, which won’t change much because it represents the virtual address space for the process. With that large external process running, SQL Server reduced VmRSS from 7.9GB (from Table 1) to 2.8GB, placing only about 4.68GB in the swap file. That doesn’t sound much better, does it? I thought SQL Server was going to react to the external memory pressure…let’s keep digging and ask SQL Server what it thinks about this.
Anthony is doing some great work digging into this. This is an area where you do have to understand the differences between Windows and Linux.
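If you want to ask SQL Server itself how it sees memory on your own instance, queries along these lines (a sketch, not necessarily the ones used in the post) are a reasonable starting point:

```sql
-- How much physical memory the SQL Server process believes it is using,
-- and whether it has flagged low physical or virtual memory.
SELECT physical_memory_in_use_kb / 1024 AS physical_memory_in_use_mb,
       memory_utilization_percentage,
       process_physical_memory_low,
       process_virtual_memory_low
FROM sys.dm_os_process_memory;

-- Committed memory versus the target the memory manager is aiming for;
-- the target drops when SQL Server responds to memory pressure.
SELECT committed_kb / 1024 AS committed_mb,
       committed_target_kb / 1024 AS committed_target_mb
FROM sys.dm_os_sys_info;
```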
I have found that performance across all devices and browsers is definitely not equal. By far the best browser I have found for viewing the apps is Google Chrome. I have also tended to find that my Ubuntu machines don’t do as well as Microsoft machines in picking up words correctly. A chat I had with someone recently suggested this might be down to drivers under Ubuntu for the microphones but that is not my area of expertise. Voice recognition was also fine on both of my Blackberry phones (one running BB OS 10, the other running Android 7).
It is worth noting that this does require an internet connection to function; in Chrome, the voice-to-text is performed in the cloud.
The other thing I have noticed is that annyang seems relatively sensitive to background noise. This isn’t so bad for functions called using specific phrases but does sometimes have a large effect on the multi-word splats. This is because the splats are greedy, and the background noise makes the recognition engine think that you are still talking long after you have finished, which gives the appearance of the application hanging.
The solution is by no means perfect, but it does look quite interesting.
Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature set to help with:
- Using standard and central directory services like Active Directory (AD)/Lightweight Directory Access Protocol (LDAP)
- Simplifying the management of access control lists (ACLs)
- Proactive management and monitoring of security configurations to address the gaps as soon as possible
The following new security features are available in both Confluent Platform 5.0 and Apache Kafka 2.0:
- Support for ACL-prefixed wildcards to simplify the management of access control
- Kafka Connect password protection with support for externalizing secrets (to “secrets stores,” etc., like Hashicorp Vault)
The following security features are available only in Confluent Platform 5.0:
- AD/LDAP group support
- Feature access controls in Confluent Control Center
- Viewing of broker configurations in Confluent Control Center, including differences in security configurations between brokers
Let’s walk through each of these enhancements in detail.
Read on for examples.
Prior to Spark 2.0, SparkContext was used as the channel to access all Spark functionality. The Spark driver program uses SparkContext to connect to the cluster through a resource manager.
SparkConf is required to create the SparkContext object; it stores configuration parameters such as appName (to identify your Spark driver) and the number of cores and memory size of the executors running on the worker nodes.
In order to use the SQL, Hive, and Streaming APIs, separate contexts (SQLContext, HiveContext, and StreamingContext) had to be created.
Read on to see where SparkSession fits in.
Last month I covered a Special Islands challenge. The task was to identify periods of activity for each service ID, tolerating a gap of up to an input number of seconds (@allowedgap). The caveat was that the solution had to be pre-2012 compatible, so you couldn’t use functions like LAG and LEAD, or aggregate window functions with a frame. I got a number of very interesting solutions posted in the comments by Toby Ovod-Everett, Peter Larsson, and Kamil Kosno. Make sure to go over their solutions since they’re all quite creative.
Curiously, a number of the solutions ran slower with the recommended index than without it. In this article I propose an explanation for this.
Even though all solutions were interesting, here I wanted to focus on the solution by Kamil Kosno, who’s an ETL developer with Zopa. In his solution, Kamil used a very creative technique to emulate LAG and LEAD without LAG and LEAD. You will probably find the technique handy if you need to perform LAG/LEAD-like calculations using code that is pre-2012 compatible.
Kamil’s solution was quite clever.
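To give a flavor of the general problem (this is the classic ROW_NUMBER-plus-self-join workaround, not Kamil's specific technique, and the table and column names are made up for illustration), here is one way LAG and LEAD can be emulated in pre-2012-compatible code:

```sql
-- Emulating LAG/LEAD before SQL Server 2012 with ROW_NUMBER and self joins.
WITH numbered AS (
    SELECT serviceid, logtime,
           ROW_NUMBER() OVER (PARTITION BY serviceid ORDER BY logtime) AS rn
    FROM dbo.EventLog
)
SELECT cur.serviceid,
       cur.logtime,
       prev.logtime AS prev_logtime,   -- LAG equivalent
       nxt.logtime  AS next_logtime    -- LEAD equivalent
FROM numbered AS cur
LEFT JOIN numbered AS prev
       ON prev.serviceid = cur.serviceid AND prev.rn = cur.rn - 1
LEFT JOIN numbered AS nxt
       ON nxt.serviceid = cur.serviceid AND nxt.rn = cur.rn + 1;
```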
Last week Microsoft released SQL Server Management Studio 18.0 into public preview; here’s a link so you can read about the new and improved functionality it offers.
One significant change is the addition of actual vs. estimated row counts on the showplan operators in execution plans (only in actual execution plans, not estimated ones…which kinda makes sense).
Here I’m running a very simple bit of code against some DMVs (namely sys.dm_exec_requests and sys.dm_exec_sessions) to demonstrate this addition.
Read on for the example.
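For reference, a query roughly along those lines (a sketch, not the exact code from the post) is enough to see the new numbers; run it in SSMS 18 with “Include Actual Execution Plan” enabled and check the row counts displayed under each operator:

```sql
-- Join sessions to their currently executing requests.
SELECT s.session_id,
       s.login_name,
       r.status,
       r.command,
       r.wait_type
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
    ON r.session_id = s.session_id;
```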
Usage is pretty straightforward. You specify one or more computers and off you go. There is a default value for the resulting HTML file, but you’ll likely want to specify your own. Because the function is generating custom HTML on the fly, I also provided options for you to supply pre- and post-content HTML material, just as you might with ConvertTo-HTML. I also give you an option to specify a graphics file, which is displayed like a logo at the top of the report. The graphics file will be embedded in the HTML file. The CSS is also embedded in the HTML, making the entire file completely self-contained.
The one knock I have is the gradient color scheme: people with Protanopia or Deuteranopia will have trouble reading the free space indicator, as the colors blur into one another. Otherwise, this looks great.
What tool do you use for Analysis Services Tabular development? SSDT, right? What else? Here is a little secret: I almost don’t use SSDT anymore, except for limited tasks such as importing new tables and visualizing relationships. I switched to a great community tool, Tabular Editor, and you should too if you’re frustrated with the SSDT Tabular Designer. Back in 2012, Microsoft ported the Power Pivot designer to SSDT to let BI practitioners implement Tabular models. This is why you still get weird errors saying that Excel has encountered some error. Microsoft hasn’t made any “professional” optimizations despite all the attention that Tabular gets. As a result, developers face:
Performance issues – As your model grows in complexity, it gets progressively slower to make even simple changes, such as renaming columns. The problem, of course, is that any change results in a commit operation to the workspace database. SSDT requires a workspace database for the Data View, and that database slows down all tasks even if it doesn’t contain data. While the Data View is useful for data analysts, I’d personally rather sacrifice it to gain development speed.
The horrible measure grid – Enough said. To Microsoft’s credit, the Tabular Explorer helps somewhat, but it still doesn’t support the equivalent of the SSAS MD script editor.
No automation for repetitive tasks – It’s not unusual to create many measure variants, such as YTD and QTD. SSDT doesn’t help much with automating them.
It does look interesting.