To build a robust BI system, you need to cater for errors and handle errors carefully. If you build a reporting solution that the refresh of that fails everytime an error occurs, it is not a robust system. Errors can happen by many reasons, In this post, I’ll show you a way to catch potential errors in Power Query and how to build an exception report page to visualize the error rows for further investigation. The method that you learn here, will save your model from failing at the time of refresh. Means you get the dataset updated, and you can catch any rows caused the error in an exception report page. To learn more about Power BI, read Power BI book from Rookie to Rock Star.
There’s a lot of work, but also a lot of value in doing that work.
Today a quick one that I came across while writing a different blog post that I will blog later. I know we have talked about it again and again but a good best practice is to remove any high carnality columns you don’t necessary need. This trick is not new and has been blogged about before in different places but I wanted to emphasize it again due to the importance.
I was finishing up my next blog post and wanted to upload the sample file. While doing that I noticed the file was 150MB large. That is rather large for such a simple file. The largest table has 500,000 rows and none of them are unique. What is going on?
There are some interesting DMVs in Kasper’s post, including one which shows cardinality by column.
If I look at the plan that is stored in Query Store, I’ll see the identical plan up above, including the PlanGuideDB and PlanGuideName properties.
So, let’s force the plan using the values returned from the query above:
1 EXEC sys.sp_query_store_force_plan 6,7;
Now, when we run the query, we’ll see both the plan guide in use and that the plan is forced (see this earlier blog post explaining this behavior). This is all expected behavior.
Check it out to see how SQL Server behaves.
This post is aimed to serve as an overview of functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to character vectors and character strings. In R documentation, references to
character string, refer to character vectors of length 1.
Also since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations trading off technical precision.
As much as I like the tidyverse for its data platform professional-friendly approach to R, it is good to know the base libraries (and other alternatives) as well. H/T R-Bloggers
Now R has some useful packages like ggmap, mapdata and ggplot2 which allow you to source you map satellite images directly from google maps, but this does require a free google API key to source from the cloud. These packages can also plot the map around the data as I am currently trimming the map to fit the data. But for a fair test I also used a simplistic pre-built map in R. This was from the package rworldmap, which allows plotting at a country level with defined borders. Axes can be scaled to act like a zoom function but without a higher resolutions map or raster satellite image map it is pointless to go past a country level.
There’s a lot more you can do with both languages, but when you just want a plot in a few lines of code, both are up to the task.
If you read Part Two, then you know these are the steps I used for anomaly detection with K-means:
Segmentation – the process of splitting your time series data into small segments with a horizontal translation.
Windowing – the action of multiplying your segmented data by a windowing function to truncate the dataset before and after the window. The term windowing gets its name from its functionality: it allows you to only see the data in the window range since everything before and after (or outside the window) is multiplied by zero. Windowing allows you to seamlessly stitch your reconstructed data together.
Clustering – the task of grouping similar windowed segments and finding the centroids in the clusters. A centroid is at the center of a cluster. Mathematically, it is defined by the arithmetic mean position of all the points in the cluster.
Reconstruction – the process of rebuilding your time series data. Essentially, you are matching your normal time series data to the closest centroid (the predicted centroid) and stitching those centroids together to produce the reconstructed data.
Normal Error – The purpose of the Reconstruction is to calculate the normal error associated with the output of your time series prediction.
Anomaly Detection – Since you know what the normal error for reconstruction is, you can now use it as a threshold for anomaly detection. Any reconstruction error above that normal error can be considered an anomaly.
Read the whole thing. This is a really cool use case of a set of technologies along with a venerable (if sometimes troublesome) algorithm.
With the caveat that this information is correct at the time of writing, the following points should be noted:
- The control plane and worker nodes should always be on the same release of Kubernetes.
- The control plane can only run on Linux
- The minimum version of Windows 2016 RTM is required for worker nodes, but version 1709 is preferred.
A full list of restriction can be found here.
But here is something particularly significant for anyone wishing to deploy highly available SQL Server infrastructures to Kubernetes via availability groups:
Read on for that particularly significant limitation; it’s a doozy.
There are several more use cases for a dataflow, but one that is very useful is the ability to share a dataset between apps. Previously we had to duplicate the dataset to each and every app that needed to use it, increasing the risk that one dataset was ignored, not refreshed properly or otherwise out of sync with reality. By using dataflows we can have several apps rely on the same dataflow (via a dataset), and thus it is quite possible to have a “master dataset”.
Click through for a walkthrough, as well as an understanding of the process’s limitations.
Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup. For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of “No Soup for you!” After placing an order, about 15 minutes later, I received a cancellation notice. I called customer service. They looked at the order and advised me the systemincorrectly cancelled the order. I was told to place the order again as they had resolved the problem. I created a new order, and just like the last time, I received the No Laptop for You cancellation email. I called back. This time I was told that the system thinks I am a fraud. Now I have no laptop and I have been insulted.
In all the talk of ML running the future, one thing that gets forgotten is that models, being simplifications of reality, necessarily make mistakes. Failing to have some sort of manual override means, in this case, throwing away money for no good reason.
The question is now how to perform a coarse estimation of the equijoin cardinality of the highlighted steps, using the information available.
The original cardinality estimator would have performed a fine-grained step-by-step histogram alignment using linear interpolation, assessed the join contribution of each step (much as we did for the minimum step value before), and summed each step contribution to acquire a full join estimate. While this procedure makes a lot of intuitive sense, practical experience was that this fine-grained approach added computational overhead and could produce results of variable quality.
The original estimator had another way to estimate join cardinality when histogram information was either not available, or heuristically assessed to be inferior. This is known as a frequency-based estimation[…]
It’s an interesting dive into one of the changes in 2014’s CE. The post is a little math-heavy but Paul does a great job keeping it interesting.