You’ll know an expert programmer by the quality of the code that they write. Experts have good communication skills, both sharing their own knowledge and soliciting input from others. They are self-aware, understanding the kinds of mistakes they can make, and reflective. They are also fast (but not at the expense of quality).
Experience should be measured not just on its quantity (i.e., number of years in the role), but on its quality. For example, working on a variety of different code bases, shipping significant amounts of code to production, and working on shared code bases. The knowledge of an expert is T-shaped with depth in the programming language and domain at hand, and a broad knowledge of algorithms, data structures, and programming paradigms.
Click through for the full review.
Well, I knew I wouldn’t be satisfied. The other day I shared a PowerShell prompt function that could display telemetry like information for a few remote servers. One of the drawbacks was the limited amount of information I could display. I’ve revised that function and have a new version that displays additional information via a few performance counters. I’ve also reorganized the function to make it a bit more efficient. Want to see it?
My jokey lede aside, this is really cool. Click through for details and to get a link to the code.
Today I woke up with an interesting question, about how to show a selection of months in a nice way, detecting contiguous selection. You can easily understand the desired solution from the following figure:
I enjoyed writing a quick solution, which is worth sharing. The code is somewhat verbose, but this is mainly for educational purposes (meaning I did not want to spend time optimizing it). I will likely write a full article on it, for now, just enjoy some DAX code:
I removed the image, but to get the gist (and get you to click through to see it in its beauty), it reads “January, March-April, August-December”
Click through for Alberto’s quick-and-dirty solution and then Chris Webb’s improvement.
The Bitmap operator is used to build a bitmap that, based on a hash, represents which values may be present in a data flow. Due to the chance of hash collisions in the hash function used, the Bitmap process can produce false positives but not false negatives – so a match based on a bitmap is not guaranteed to be a match to the actual data, but a non-match based on a bitmap is guaranteed to not be a match in the actual data.
The generated bitmap is typically used in other operators to remove rows for which there is no match in the bitmap, and hence guaranteed no match in the original set of data processed by the Bitmap operator. The use of Bitmap operators is most common in execution plans for star join queries in large data warehouses. An example can be seen here.
Click through for details on how it works and plenty of good information on it.
This first little tip comes from a painful experience. It is common sense to only try and create files in a directory that exists, but sometimes that directory has to be different on different systems. Then comes a little copy and paste of the last code used that worked. You think you are golden but forgot that one little tweak for the directory to be used. Oops.
Read on to see how SQL Server exposes that error, and then Jason shows us a different how-not-to with file targets.
The parallel processing holds two varieties of execution: Synchronous and Asynchronous.
In synchronous execution, once a process starts execution, it puts a lock over the main program until its get accomplished.
While the asynchronous execution doesn’t require locking, it performs a task quickly but the outcome can be in the rearranged order.
Click through for a few examples using Pool.
Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using
Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.
This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (
sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.
This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for input multiple data sets without the need for hacks.
I was really curious about the RETURNS NULL ON NULL INPUT function option so I decided to do some testing. I was very surprised to find out that it’s actually a form of scalar UDF optimization that has been in the product since at least SQL Server 2008 R2.
It turns out that if you know that a scalar UDF will always return a NULL result when a NULL input is provided then the UDF should ALWAYS be created with the RETURNS NULL ON NULL INPUT option, because then SQL Server doesn’t even run the function definition at all for any rows where the input is NULL – short-circuiting it in effect and avoiding the wasted execution of the function body.
The more often you pass in NULL to that function, the better your performance will be relative to the default case.
I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.
In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.
My motivation for doing this was to explore ways of quickly transferring data to MongoDB. to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky. The easy and obvious ways were slow.
As is usual for Phil, this article is done quite well.