The following code example contains an easy error in using the R function unique().

vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
#  "a" "b" "c"
Notice that none of the novel values from vec2 are present in the result. Our mistake was that we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice that no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was the “...” function signature feature.
John makes it clear that ... is not itself a bad thing, just that there is a time and a place for it, and misusing it can lead to hard-to-understand bugs.
Other additional capabilities include:
Scalability and availability with NameNode federation, allowing customers to scale to thousands of nodes and a billion files. Higher availability with multiple NameNodes and standby capabilities allows for undisrupted, continuous cluster operations if a NameNode goes down.
Lower total cost of ownership with erasure coding, providing a data protection method that up to this point has mostly been found in object stores. Hadoop 3 will no longer default to storing three full copies of each piece of data across its clusters. Instead of that 3x hit on storage, the erasure coding method in Hadoop 3 will incur an overhead of 1.5x while maintaining the same level of data recoverability from disk failure. The end result is a 50% reduction in storage overhead.
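The storage figures quoted above are easy to sanity-check with quick arithmetic. A minimal sketch, assuming a hypothetical raw-data size and a Reed-Solomon 6+3 erasure coding policy (Hadoop 3 ships several such policies):

```python
# Sanity-check of the replication vs. erasure coding storage figures.
raw_tb = 120  # hypothetical amount of raw data, in TB

replication_factor = 3.0  # classic HDFS: three full copies of every block
ec_factor = 9 / 6         # Reed-Solomon 6+3: 6 data + 3 parity blocks = 1.5x

stored_replication = raw_tb * replication_factor  # TB on disk with 3x copies
stored_ec = raw_tb * ec_factor                    # TB on disk with erasure coding

savings = 1 - stored_ec / stored_replication
print(f"storage saved vs. 3x replication: {savings:.0%}")  # 50%
```

The same 50% figure falls out regardless of the raw-data size, since both schemes scale linearly.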
Real-time database, delivering improved query optimization to process more data at a faster rate by eliminating the performance gap between low-latency and high-throughput workloads. Enabled via Apache Hive 3.0, HDP 3.0 offers the only unified SQL solution that can seamlessly combine real-time and historical data, making both available for deep SQL analytics. New features such as workload management enable fine-grained resource allocation, so there is no need to worry about resource competition. Materialized views pre-compute and cache intermediate tables, and the query optimizer automatically leverages the pre-computed cache, drastically improving performance. The end result is faster time to insight.
Data science performance improvements around Apache Spark and Apache Hive integration. HDP 3.0 provides seamless Spark integration to the cloud, and a containerized TensorFlow technical preview combined with GPU pooling delivers a deep learning framework that makes deep learning faster and easier.
Looks like it’s invite-only at the moment, but that should change pretty soon. It also looks like I’ve got a new weekend project…
The need to visualize real-time (or near-real-time) data has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities for visualizing streaming data, and this time I will tackle the issue using Python and the Python Dash package for building web applications and visualizations. Dash is built on top of Flask, React, and Plotly and gives a wide range of capabilities for creating interactive web applications, interfaces, and visualizations.
Tomaz’s example hits SQL Server every half-second to grab the latest changes and gives us an example of roll-your-own streaming.
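The half-second polling approach can be sketched without Dash at all. Below is a minimal roll-your-own polling loop, using SQLite in place of SQL Server; the table name and schema are hypothetical stand-ins:

```python
import sqlite3

def poll_changes(conn, last_id):
    """Fetch rows added since the last poll (hypothetical schema: id, value)."""
    rows = conn.execute(
        "SELECT id, value FROM readings WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    return rows, (rows[-1][0] if rows else last_id)

# Demo against an in-memory database standing in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO readings (value) VALUES (?)", [(1.0,), (2.5,)])

last_id = 0
rows, last_id = poll_changes(conn, last_id)  # first poll picks up both rows
# A dashboard would repeat this on a timer (e.g. sleep half a second between
# polls), which is essentially what a Dash dcc.Interval callback does for you.
```

Tracking the high-water mark (`last_id`) keeps each poll cheap: only new rows cross the wire.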
In previous articles, we discussed the importance of variables and how to optimize IF functions to reduce multiple evaluations of the same expression or measure. However, there are scenarios where the calculations executed in different branches of the same expression seem impossible to optimize; for example, consider a pattern where an IF chooses between two measures, A and B.
In cases like this, there does not seem to be any possible optimization. However, considering the nature of the two measures A and B, they might be different evaluations of the same base measure in different filter contexts.
Read on for a couple of examples.
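The underlying idea, hoisting a repeated evaluation into a variable so the engine computes it once, is language-neutral. Here it is sketched in Python as an analogy to DAX's VAR/RETURN; the measure function is a hypothetical stand-in:

```python
def base_measure(x):
    # Stand-in for a costly measure evaluation (hypothetical).
    return x * x - 10

def repeated_evaluation(x):
    # Anti-pattern: the same expression is evaluated both in the
    # condition and again in the branch that is taken.
    return base_measure(x) if base_measure(x) >= 0 else -base_measure(x)

def hoisted(x):
    # Hoist the evaluation into a variable, mirroring DAX's VAR/RETURN,
    # so the expression is computed exactly once.
    v = base_measure(x)
    return v if v >= 0 else -v
```

Both functions return the same result; the second simply avoids re-computing the measure, which is the gist of the variable-based optimization the article builds on.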
I’ve been having a little play around with AWS recently and was looking at S3 (AWS’ cloud storage) when I thought to myself, I wonder if it’s possible to back up an on-premises SQL Server database directly to S3?
When we want to back up directly to Azure, we can use the ‘TO URL’ clause in our backup statement. Although S3 buckets can also be accessed via a URL, I couldn’t find a way to back up directly to that URL. Most of the solutions on the web have you backing up your databases locally, and then a second step of the job uses PowerShell to copy those backups up to your S3 buckets. I didn’t really want to do it that way; I wanted to back up directly to S3 with no middle steps. We like to keep things as simple as possible here at SQL Undercover: the more moving parts you’ve got, the more chance for things to go wrong.
So I needed a way for SQL Server to be able to directly access my buckets. I started to wonder if it’s possible to map a bucket as a network drive. A little hunting around and I came across this lovely tool, TNTDrive. TNTDrive will let us do exactly that and with the bucket mapped as a local drive, it was simply a case of running the backup to that local drive.
Quite useful if your servers are in a disk crunch. In general, I’d probably lean toward keeping on-disk backups and creating a job to migrate those backups to S3.
In general, the errors SQL gives are highly useful. Of course every now and again you get one that’s just confounding. The other day I saw the following error in the log:
Login failed for user ”. Reason: An attempt to login using SQL authentication failed. Server is configured for Windows authentication only. [CLIENT: ]
This one confused me for a couple of reasons. First, the user ”. Why an empty user? That’s not really helpful. And second, “Server is configured for Windows authentication only.”
But Kenneth shows that the server is configured for SQL authentication as well as Windows authentication. Click through to see what gives.
Before I show you what I mean, we should probably define what’s not SARGable in general.
- Wrapping columns in functions: ISNULL, COALESCE, LEFT, RIGHT, YEAR, etc.
- Evaluating predicates against things indexes don’t track: DATEDIFF(YEAR, a_col, b_col), a_col + b_col, etc.
- Optional predicates: a_col = @a_variable or @a_variable IS NULL
- Applying some expression to a column: a_col * 1000 < some_value
Applying predicates like these shows that you don’t predi-care.
They will result in the “bad” kind of index scans that read the entire index, often poor cardinality estimates, and a bunch of other stuff — sometimes a filter operator if the predicate can’t be pushed down to the index access level of the plan.
Read on for an example.
This is a question I’ve gotten a few times in class: can you force a plan for a different query with Query Store?
Assume you have two similar queries, but they have different query_id values in Query Store. One of the queries has a plan that’s stable, and I want to force that plan for the other query. Query Store provides no ability to do this in the UI, but you can try it with the stored procedure. Let’s take a look…
I can see the potential benefit, but the downside risk is huge, so it makes sense not to allow this.