CSVs still use the concept of a shared disk but with an added layer of abstraction. Instead of having the shared disk be accessible to only one WSFC node at a time, all of the WSFC nodes have their own logical paths to it. This is made possible thru the Server Message Block (SMB) protocol. Since the WSFC nodes are connected to each other thru the heartbeat network (it’s another reason to have a dedicated network for inter-node communication,) CSV can take advantage of this route to the shared storage to send I/O commands.
Don’t be confused. While all of the WSFC nodes have both physical and logical paths to the shared disk, only the node that owns the SQL Server clustered resource can own it, just like when using traditional shared disks. In other words, the owner node dictates how to send the I/O commands – either directly thru its own access path or thru the heartbeat network. This is why you won’t see any direct dependency on the shared disk if you look at the Dependency Report. It also means that the failover process will be faster with CSVs.
If you’re using SQL Server 2014 or later and have built a cluster, check this feature out and see if it fits in your environment.
Even if your target variable is a numeric one, sometimes it’s better to use classification methods instead of approximation ones. For instance if you have mostly zero target values and just a few non-zero values. Change the latter to 1, in this case you’ll have two categories: 1 (positive value of your target variable ) and 0. You can also split numerical variable into multiple subgroups : apartment prices for low, medium and high by equal subset width and predict them using classification algorithms. This process is called discretization.
Both types of models are common in machine learning, so a good understanding of when to use which is important.
The execution plan is cut off, but you can be assured that there are six more similar plans below the ones pictured here. These metrics are misleading as each loop doesn’t seem too bad, right? Just 9% of the subtree cost or a few hundred reads doesn’t seem too wild, but add up all of these costs and it becomes clear that this won’t scale. What if we had thousands of rows to iterate through? For 5,000 rows, we would be looking at about 147,995,000 reads! Not to mention a very, very long execution plan that is certain to make Management Studio crawl as it renders five thousand execution plans.
Alternatively, we could cache all of the data in a temp table first, and then pull it row-by-row. This would result in significantly less reads on the underlying sales data, outperforming cursors by a mile, but would still involve iterating through the temp table over and over. For the scenario of 5,000 rows, we’d still have an inefficient slog through a smaller data set, rather than crawling through lots of data. Regardless of method, it’s still navigating quicksand either way, with varying amounts of quicksand.
This is a good article and includes STRING_SPLIT and OPENJSON methods in SQL Server 2016, but does not include building your own CLR methods. Check it out if you need to do any delimited string manipulation.
It’s not uncommon for tests to be written at the get-go and then forgotten about. Remember that as code changes or incorrect behavior is found, new tests need to be written or existing tests need to be modified. Possibly worse than having no tests is having a bunch of tests spitting out false positives. This is because humans are prone to habituation and desensitization. It’s easy to become habituated to false positives to the point where we no longer pay attention to them.
Temporarily disabling tests may be acceptable in the short term. A more strategic solution is to optimize your test writing. The easier it is to create and modify tests, the more likely they will be correct and continue to provide value. For my testing, I generally write code to automate a lot of wiring to verify results programmatically.
I started this article with almost no idea how to test R code. I still don’t…but this article does help. I recommend reading it if you want to write production-quality R code.
While getting bugs fixed quickly is great, I really hope that this new focus on tooling means that we get new BI-related features in SSMS and SSDT. SSMS has, in particular, been shockingly bad at supporting BI developers: for example, I find it unbelievable that we still don’t have a DAX query window in SSMS when the language has been in SSAS Tabular since SQL Server 2012 (I know the community have filled the gap with DAX Studio, but that’s not the point). Aaron Nelson recently announced a Trello board where ideas for new features in SSMS can be debated but there’s very little BI-related stuff there at the time of writing.
One can hope. The big problem is getting enough support to bump items up on the backlog, and for that, there’s a Trello board.
Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop. That’s hardly a fair setup for a distributed processing system. Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible. This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important. I’m using the same hardware in all three cases, so in that sense this is a fair comp.
Despite my hemming and hawing, Polybase still performed as well as Hive and kicked sand in the linked server’s face. I have several ideas for how to tune and want to continue down this track, showing various ways to optimize Polybase and Hive queries.
Currently, when utilizing the SQL Server images in the VM Gallery in Azure, any installations of SQL Server Analysis Services default to Multidimensional. Thus, if you want SSAS Tabular, you have additional work to perform.
I was just chatting with a Senior Program Manager on the SQL Server Analysis Services product team. They currently don’t have anything in their plans for providing SQL Server Gallery Images with SSAS Tabular instead of Multidimensional. We agreed that it is a good idea for that to happen. We also agreed that a Connect suggestion would be a great way to gauge broader community support/appetite for providing Gallery images with Tabular installed.
Perhaps the 2014 SQL Server is implicitly converting to float, using the nearest even prior to the explicit cast to Numeric. However, how the scale (number of decimal digits that will be stored to the right of the decimal point) would be determined in such a scenario is a conundrum. Either way, although the mapping is defined the same, the behavior demonstrated between the two versions of SQL Server is inconsistent.Research into ANSI and IEEE both boil down to truncation and/or rounding is implementation defined.
It’s an interesting issue. Read on for more details.
Since virtualization only solves #3, a lot more should be done to create a private cloud. Also, a cloud should also support Platform-as-a-service (PaaS) to allow for application innovation. Fortunately there are products to add the other characteristics to give you a private cloud, such as Microsoft’s Azure Stack. And of course you can always use a public cloud.
Read James’s post to get the full listing of what makes for a “cloud” offering.
I recently tried to use the SSISDB Upgrade Wizard to upgrade a restored SSISDB (backed up in an earlier version) to SQL Server 2016. It didn’t go well.
I decided to use SSIS Catalog Compare to generate the scripts and ISPAC files from the previous instance, and deploy them to the SSIS 2016 Catalog.
“You Can Do That?”
Yes. Yes you can. Here’s how…
This is a paid product, but if you want to perform this upgrade, it sounds like a good tool for the job.