Shuffle is an expensive operation whether you do it with plain old MapReduce programs or with Spark. Shuffle is the process of bringing key-value pairs from different mappers (or tasks in Spark) together by key into a single reducer (task in Spark), so all key-value pairs with the same key end up in one task (node). We can then loop through the key-value pairs and do the needed aggregation.
Since production Spark jobs usually involve a lot of tasks, the movement of key-value pairs between nodes during a shuffle (from one task to another) can become a significant bottleneck. In some cases shuffle is unavoidable, but in many instances you can avoid it by structuring your data a little differently. Avoiding shuffle has a positive impact on performance.
Read the whole thing. Getting partitions right is critical to writing scalable Spark jobs.
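As a quick sketch of why this matters (my own illustration, not from the linked post), compare two PySpark aggregations: both trigger a shuffle, but reduceByKey pre-aggregates within each partition, so far fewer key-value pairs travel between nodes than with groupByKey.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
pairs = spark.sparkContext.parallelize(
    [("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# groupByKey ships every individual key-value pair across the
# network before any aggregation happens.
grouped_sums = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first, so only
# one partial sum per key per partition is shuffled.
reduced_sums = pairs.reduceByKey(lambda a, b: a + b)

print(reduced_sums.collect())  # e.g. [('a', 4), ('b', 6)]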
When working with R date formats, you’re generally going to be trying to accomplish one of two different but related goals:
1. Converting a character string like “Jan 30 1989” to a Date type
2. Getting an R Date object to print in a specific format for a graph or other output
You may need to handle both of these goals in the same analysis, but it’s best to think of them as two separate exercises. Knowing which goal you’re after is important because each one requires different functions. Let’s tackle them one at a time.
There are some good insights in the post. H/T R-bloggers
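The linked post covers the R functions; purely to show that the two goals are separate operations, here is the same distinction sketched with Python’s standard datetime module (an analogy of mine, not something from the post):

import datetime as dt

# Goal 1: parse a character string into a date object.
d = dt.datetime.strptime("Jan 30 1989", "%b %d %Y").date()

# Goal 2: format an existing date object for output.
print(d.strftime("%Y-%m-%d"))   # 1989-01-30
print(d.strftime("%B %d, %Y"))  # January 30, 1989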
For the SQL Server developer, matrices are probably most valuable for solving more complex string-searching problems using dynamic programming. Once you get into the mindset of this sort of technique, a number of seemingly intractable problems become easier. Here are fifty common data structure problems that can be solved using dynamic programming. Until SQL Server 2017, these were hard to do in SQL because of the lack of support for this style of programming. Memoization, one of the principles behind the technique, is easy to do in SQL, but it is very tricky to convert existing procedural algorithms to use table variables. It is usually easier and quicker to use strings as pseudo-variables, as I did with edit distance and the Levenshtein algorithm, the longest common subsequence, and the longest common substring. The problem with doing this is that the code to fetch the array values can be very difficult to decipher or debug. JSON can do it very easily with path array references.
The results aren’t fantastic but the code is easier at least.
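To make the memoization idea concrete outside of T-SQL, here is a minimal sketch of the Levenshtein edit distance with memoized subproblems in Python; the SQL versions the article describes have to simulate this caching with strings or JSON arrays.

from functools import lru_cache

def levenshtein(s: str, t: str) -> int:
    @lru_cache(maxsize=None)  # memoization: cache each (i, j) subproblem
    def d(i: int, j: int) -> int:
        if i == 0:
            return j  # insert all of t[:j]
        if j == 0:
            return i  # delete all of s[:i]
        cost = 0 if s[i - 1] == t[j - 1] else 1
        return min(d(i - 1, j) + 1,         # deletion
                   d(i, j - 1) + 1,         # insertion
                   d(i - 1, j - 1) + cost)  # substitution
    return d(len(s), len(t))

print(levenshtein("kitten", "sitting"))  # 3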
This is where stuff gets crazy. First, I created a table named #ValuePerCategory, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.
It’s also why John Madden never coached me.
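As a loose sketch of the sampling idea described above (the category names and numbers here are invented for illustration, not the actual values from the post), drawing a price from a per-category mean and standard deviation might look like this in Python:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-category parameters, mirroring #ValuePerCategory:
# a mean price and a standard deviation for each expense category.
value_per_category = {
    "Meals":  (25.0, 10.0),
    "Travel": (350.0, 120.0),
    "Office": (60.0, 30.0),
}

def random_price(category: str) -> float:
    mean, sd = value_per_category[category]
    # Draw from a normal distribution, flooring at a token minimum so
    # a wide standard deviation cannot produce a negative price.
    return round(max(rng.normal(mean, sd), 1.0), 2)

print([random_price("Meals") for _ in range(3)])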
Here’s yet another entry in the list of useful things I learned from Ehren von Lehe on the Power Query MSDN forum: Table.Buffer() does not buffer nested table, record or list values inside the cells of a table.
There are some important implications to this, so check out what Chris has to say.
This has always worked super, and has been a go-to query for me for years. But when looking at a SQL Server 2017 distributor with the distribution database in an Availability Group, that wasn’t working. All the publications had a publisher_id of 1, but server_id 1 was some random linked server, and definitely not the publisher. But replication was working great. Maybe replication was set up on the other AG replica, and server_id 1 came from there.
Nope. On the other replica, it was the same story. server_id 1 was a random linked server, and nothing to do with replication at all, let alone the publisher. But replication was working perfectly. A teammate fooling around with it in dev confirmed that if he updated the publisher_id to match the server_id we thought it should join to, replication stopped working. So, that publisher_id of 1 was correct. Or special. But also definitely different from what I’ve seen in prior versions of SQL Server.
Read on to see what Andy learned.
Let’s start with this example:
SELECT 4.4/CAST(.44 AS VARCHAR(5))
Ignoring for a moment that our denominator is of type VARCHAR, if we do some quick mental math or use a calculator, we can see that the answer should be 10:
Spoilers: it isn’t 10, and Bert explains why.
A common example of a slowly changing dimension would be a Person table. When storing information about a person, you’ll often store attributes such as name, address, marital status, date of birth, and social security number. Depending on the domain in which you’re working, you may even store extended information such as hair color, eye color, homeownership status, and income level.
Some of these attributes would certainly change over time. Since the average person moves 11.4 times in their life, you can expect that the address component of your Person table will need to be updated. The same applies to name attributes, since people often change their names after marriage or other life events. By definition, the load process for a slowly changing dimension table would update those attributes when they change.
I’m not as big a fan of the concept as Tim is (mostly because the date is a system time, not an application time, so fixing dates is not an easy problem), but it’s an interesting idea.
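For readers new to the idea, here is a rough sketch of one common flavor (a Type 2-style history update) in Python; this illustrates the general technique, not the implementation from the linked post:

from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional

# Type 2 slowly changing dimension: rather than overwriting a changed
# attribute, close out the current row and append a new one.

@dataclass
class PersonRow:
    person_id: int
    name: str
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current row

def apply_change(history: List[PersonRow], person_id: int,
                 new_address: str, as_of: date) -> None:
    current = next(r for r in history
                   if r.person_id == person_id and r.valid_to is None)
    current.valid_to = as_of  # expire the old row
    history.append(replace(current, address=new_address,
                           valid_from=as_of, valid_to=None))

rows = [PersonRow(1, "Ada", "12 Elm St", date(2015, 1, 1))]
apply_change(rows, 1, "99 Oak Ave", date(2018, 6, 1))
print(len(rows))  # 2: one expired row, one current row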