Often we wish to work with such data aligned so each row in
d2has the same
idxvalue as the same row (by row order) as
d1. This is an important data wrangling task, so there are many ways to achieve it in R, such as
dplyr::left_join(), or by sorting both tables into the same order and then using
However if you wish to preserve the order of the first table (which may not be sorted), you need one more trick.
Click through to see that one additional trick.
At Stitch Fix, individual personalization is in our DNA: every client is unique and every piece of clothing we send is chosen to be just right. When we buy merchandise, we could choose to lump clients together; algorithms trained on lumped data would steer us toward that little black dress or those comfy leggings that delight a core, modal group of clients. Yet when we split clients into narrower segments and focus on the tails of the distribution, the algorithms have the chance to also tease out that sleek pinstripe blazer or that pair of distressed teal jeans that aren’t right for everyone, but just right for someone. As long as we don’t split our clients so finely that we’re in danger of overfitting, and as long as humans can still understand the algorithm’s recommendations, splitting is the way to go.
In other cases lumping can provide action-oriented clarity for human decision-makers. For example, we might lump clients into larger groups when reporting on business growth for a crisp understanding of holistic business health, even if our models forecast that growth at the level of finer client splits.
Read on and check out their useful chart for figuring out whether lumping or splitting is the better idea for you.
Closure tables are plain ordinary relational tables that are designed to work easily with relational operations. It is true that useful extensions are provided for SQL Server to deal with hierarchies. The HIERARCHYID data type and the common language runtime (CLR) SqlHierarchyId class are provided to support the Path Enumeration method of representing hierarchies and are intended to make tree structures represented by self-referencing tables more efficient, but they are likely to be appropriate for some but not all the practical real-life hierarchies or directories. As well as path enumerations, there are also the well-known design patterns of Nested Sets and Adjacency Lists. In this article, we’ll concentrate on closure tables.
A directed acyclic graph (DAG) is a more general version of a closure table. You can use a closure table for a tree structure where there is only one trunk, because a branch or leaf can only have one trunk. We just have a table that has the nodes (e.g. staff member or directory ‘folder’) and edges (the relationships). We are representing an acyclic (no loops allowed) connected graph where the edges must all be unique, and where there is reflexive closure. (each node has an edge pointing to itself)
Take the time to read this one carefully, as I think this model is applicable much more often than it’d appear at first blush.
As you may already know, when SQL Server optimizes a query, it evaluates multiple candidate plans, and eventually picks the one with the lowest estimated cost. The estimated plan cost is the sum of all the operators’ estimated costs. In turn, each operator’s estimated cost is the sum of the estimated I/O cost and estimated CPU cost. The cost unit is meaningless in its own right. Its relevance is in the comparison that the optimizer makes between candidate plans. That is, the costing formulas were designed with the goal that, between candidate plans, the one with the lowest cost will (hopefully) represent the one that will finish more quickly. A terribly complex task to do accurately!
The more the costing formulas adequately take into account the factors that truly affect the algorithm’s performance and scaling, the more accurate they are, and the more likely that given accurate cardinality estimates, the optimizer will choose the optimal plan. At any rate, if you want to understand why the optimizer chooses one algorithm versus another you need to understand two main things: one is how the algorithms work and scale, and another is SQL Server’s costing model.
So back to the plan in Figure 1; let’s try and understand how the costs are computed. As a policy, Microsoft will not reveal the internal costing formulas that they use. When I was a kid I was fascinated with taking things apart. Watches, radios, cassette tapes (yes, I’m that old), you name it. I wanted to know how things were made. Similarly, I see value in reverse engineering the formulas since if I manage to predict the cost reasonably accurately, it probably means that I understand the algorithm well. During the process you get to learn a lot.
Our query ingests 1,000,000 rows. Even with this number of rows, the I/O cost seems to be negligible compared to the CPU cost, so it is probably safe to ignore it.
As for the CPU cost, you want to try and figure out which factors affect it and in what way.
I give this my highest recommendation.
To create the VMs, you need to go through these four steps:
Basics to configure basic setting of the VM
Size to choose the VM machine size
Settings to configure the features. In this case, the default values are used. You just need to click the Next button to proceed further
Once it’s running, Prashanth shows how to connect via PuTTY and configure the service.
Our host for T-SQL Tuesday this month is Jens Vestergaard (b/t) and he has asked about our favorite SSMS tool. My initial thought was to talk about using solutions in SSMSbut I’d already written about that. My next thought was to write about sp_DBPermissions and sp_SrvPermissionswhich of course I’ve written about several times. No big surprise, I wrote them after all.
So what tool am I going to write about? Well, sp_DBPermissions and sp_SrvPermissions of course. I mean I did write them after all, and Jens did say we could brag about something we wrote :).
Read on for more information about these useful tools.
You can select any property from the output and get just the TotalSeconds, but I like this simple output for when I have to leave some work in progress and I need to come back and check some time in the future.
If you are confused by this code and want further explanations, keep reading!
That’s a lot simpler than the “classic” .NET way of setting up a StopWatch and tracking changes.
When I found this command I couldn’t have been more excited. My day-to-day job requires the care and watering of over 100 SQL Server instances of varying versions. Using this command you can get the current build of all your instances and then compare that to the most recent available. There are also parameters for how far you want to be from the latest version. Setting the -latest switch means just that, your server will only be seen as compliant if it’s on the latest release, passing in -1CU means that it can be no more than 1 cumulative update behind.
Read on for a few additional useful cmdlets. Out of a large number of useful cmdlets.