
Curated SQL Posts

A Trace Flag (Generally) to Avoid

Erik Darling takes us through trace flag 3608:

According to the docs:

Prevents SQL Server from automatically starting and recovering any database except the master database. If activities that require TempDB are initiated, then model is recovered and TempDB is created. Other databases will be started and recovered when accessed. Some features, such as snapshot isolation and read committed snapshot, might not work. Use for Move System Databases and Move User Databases.

Note: Do not use during normal operation.

Scope: global only

But it turns out it can do quite a bit of harm: many things stop working while it’s in use, including automatic statistics creation.

Click through to see what kinds of things fail to work as a result of this trace flag.


Environments in Azure ML

Luis Valencia explains what environments are in Azure ML:

An Environment defines Python packages, environment variables, and Docker settings that are used in machine learning experiments, including in data preparation, training, and deployment to a web service. An Environment is managed and versioned in an Azure Machine Learning Workspace. You can update an existing environment and retrieve a version to reuse. Environments are exclusive to the workspace they are created in and can’t be used across different workspaces.

In basic terms for a developer, it’s a Docker image with all the needed dependencies (conda/pip packages) to run your experiment.

A friendly word of advice from some bad experiences: stick with the curated environments as much as you can. Those are easy and rarely fail. Building your own environments from Conda files is a possibility, but it’s an, err, probabilistic exercise as to whether your compute target will actually work or not.


Testing Stock Market Efficiency with Compression Algorithms

Holger von Jouanne-Diedrich has a clever test:

One of the most fiercely fought debates in quantitative finance is whether the stock market (or financial markets in general) is (are) efficient, i.e. whether you can find patterns in them that can be profitably used.

If you want to learn about an ingenious method (that is already present in anyone’s computer) to approach that question, read on!

As soon as I saw the post, my Eugene Fama senses were tingling. The results are not surprising (at least, to anyone who got my reference in the prior sentence), but I did enjoy the rather clever approach to the question.
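
To make the idea concrete, here is a minimal sketch of the general technique, not a reproduction of the linked post's code. It assumes that "compressible" stands in for "has exploitable patterns": each price move is encoded as an up or down byte and handed to `zlib`. Both series are synthetic, and the helper names `compression_ratio` and `up_down_bytes` are made up for this illustration.

```python
import random
import zlib


def compression_ratio(symbols: bytes) -> float:
    """Compressed size divided by original size; lower means more structure."""
    return len(zlib.compress(symbols, 9)) / len(symbols)


def up_down_bytes(prices):
    """Encode each price move as one byte: 1 for up, 0 for down or flat."""
    return bytes(1 if b > a else 0 for a, b in zip(prices, prices[1:]))


rng = random.Random(42)

# Hypothetical "efficient" series: a random walk with independent steps.
walk = [100.0]
for _ in range(20_000):
    walk.append(walk[-1] + rng.choice((-1.0, 1.0)))

# Hypothetical "patterned" series: a fixed cycle of three ups and one down.
cycle = [100.0]
for i in range(20_000):
    cycle.append(cycle[-1] + (-1.0 if i % 4 == 3 else 1.0))

random_ratio = compression_ratio(up_down_bytes(walk))
pattern_ratio = compression_ratio(up_down_bytes(cycle))
print(f"random walk: {random_ratio:.3f}, patterned: {pattern_ratio:.3f}")
```

The patterned series should shrink to a small fraction of its size, while the random walk cannot go much below one bit per move. With real data, one reasonable comparison (again, an assumption rather than the post's stated method) is the ratio for actual returns against the ratio for a shuffled copy of the same returns.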


Pipelined Functions in PowerShell

Robert Cain continues a series on functions in PowerShell:

In my previous post, I covered the use of PowerShell Advanced Functions. I highly suggest you read it if you haven’t; it provides some foundational knowledge that will be important to understand for this post.

In this post, we’ll see how to pipeline-enable your functions. Just like a cmdlet, you’ll be able to take input from the pipeline, work with it, then send it out of your function back into the pipeline.

Making your code pipeline-friendly is especially important if you want others to use your functions, as that’s one of the biggest benefits of PowerShell as a language.


Tools and Tips for Accessibility

Daron Yöndem shares insights:

Last week, as a new employee, I went through Microsoft’s internal employee learning portal and found the Accessibility 101 online course. To my surprise, the course did have a good amount of practical information and connected the concept of accessibility nicely to inclusion and diversity. In this post, I want to share a couple of the practical steps to help you step up your accessibility game. If you are where I was, I’m sure you will love these.

Click through for some easy ways to improve presentations and webpages. Most of this is a few minutes’ worth of effort but can pay dividends. On a side note, congrats to Daron for the Microsoft gig. I enjoyed working with him in the past and know he’ll do great there.


Measure Filters in Power BI

Marco Russo and Alberto Ferrari dive into a topic:

The first paragraph of this article needs to be a warning: the article itself is here for DAX and Power BI enthusiasts only. We are going to show a report that does not work, and then we explore how to fix the problem by performing a deep analysis of the queries generated by Power BI, finding the problem, and finally fixing it. The article contains a lot of references to advanced DAX concepts and the final solution is NOT a best practice. The value of the article is not in the specific solution. Rather, the important part is that a deep understanding of DAX and Power BI can help you obtain the right results, specifically when you have the feeling that you are faced with a bug because Power BI is acting strange. If you do not like DAX before reading this article, you will like it even less at the end. But if you love DAX, then chances are you will really enjoy the reading, even though it requires quite a lot of brain bandwidth. For sure, it took all of mine when I first encountered this behavior.

Break out the propeller hats before you dive in.


Order, Sort, Cluster, and Distribute in Hive

The Hadoop in Real World team give us three methods (and one synonym) to organize results in Hive:

Hive provides 3 options to order or sort the result of records – order by, sort by, cluster by and distribute by. Which option you choose has performance implications. So it is important to understand the difference between the options and choose the right one for the use case at hand.

Click through for a high-level overview of the techniques.
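
As a rough mental model, the four clauses can be mimicked in plain Python. This is a toy simulation rather than Hive code: the reducer count, the sample rows, the round-robin split used for `sort_by`, and the CRC-based hash are all assumptions chosen for illustration.

```python
import zlib

NUM_REDUCERS = 2  # hypothetical reducer count for the toy model

# Hypothetical rows of (key, value).
rows = [("b", 3), ("a", 1), ("c", 2), ("a", 4), ("b", 0), ("c", 5)]


def by_key(row):
    return row[0]


def order_by(rows, key):
    """ORDER BY: one total ordering over all rows (a single final reducer)."""
    return sorted(rows, key=key)


def sort_by(rows, key, n=NUM_REDUCERS):
    """SORT BY: each reducer sorts only the rows it happens to receive."""
    buckets = [rows[i::n] for i in range(n)]  # round-robin stand-in
    return [sorted(bucket, key=key) for bucket in buckets]


def distribute_by(rows, key, n=NUM_REDUCERS):
    """DISTRIBUTE BY: equal keys go to the same reducer, with no sorting."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        target = zlib.crc32(str(key(row)).encode()) % n
        buckets[target].append(row)
    return buckets


def cluster_by(rows, key, n=NUM_REDUCERS):
    """CLUSTER BY: DISTRIBUTE BY plus SORT BY on the same key."""
    return [sorted(bucket, key=key) for bucket in distribute_by(rows, key, n)]


print("order by:     ", order_by(rows, by_key))
print("sort by:      ", sort_by(rows, by_key))
print("distribute by:", distribute_by(rows, by_key))
print("cluster by:   ", cluster_by(rows, by_key))
```

The performance point falls out of the model: only `order_by` needs every row in one place, which is why a global sort is the expensive option on large tables, while the other three do their work per reducer.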


Scaling Hadoop Beyond 10,000 Nodes

Keqiu Hu, et al., take us through a problem of scale:

At LinkedIn, we use Hadoop as our backbone for big data analytics and machine learning. With an exponentially growing data volume, and the company heavily investing in machine learning and data science, we have been doubling our cluster size year over year to match the compute workload growth. Our largest cluster now has ~10,000 nodes, one of the largest (if not the largest) Hadoop clusters on the planet. Scaling Hadoop YARN has emerged as one of the most challenging tasks for our infrastructure over the years.

In this blog post, we will first discuss the YARN cluster slowdowns we observed as we approached 10,000 nodes and the fixes we developed for these slowdowns. Then, we will share the ways we proactively monitored for future performance degradations, including a now open-sourced tool we wrote called DynoYARN, which reliably forecasts performance for YARN clusters of arbitrary size. Finally, we will describe Robin, an in-house service which enables us to horizontally scale our clusters beyond 10,000 nodes.

Read on to learn about the problems they experienced and how they resolved them.
