Let’s say we have a deterministic (non-random) problem for which one of the solutions involves randomness. One very common example of such a problem is function minimization on a certain interval: it can be solved non-randomly (as in most methods of optim()), or randomly (the simplest approach being to generate a random set of points on the interval and to choose the one with the lowest function value).
What is a “clean” way of writing a function to solve the problem? The issue with using randomness directly inside a function is that it affects the state of the outer random number generator:
Click through for a solution which uses random numbers but doesn’t change the outside world’s random number generation after it’s done.
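The linked solution is in R; here is a minimal Python sketch of the same idea — snapshot the global RNG state, use randomness freely, then restore the state on the way out — applied to a simple random-search minimizer (the function name and sample count are illustrative, not from the post):

```python
import random

def min_via_random_search(f, lo, hi, n=1000):
    """Minimize f on [lo, hi] by random search, without
    disturbing the caller's global random state."""
    state = random.getstate()          # snapshot the outer RNG state
    try:
        xs = [random.uniform(lo, hi) for _ in range(n)]
        best = min(xs, key=f)
    finally:
        random.setstate(state)         # restore it, even on error
    return best
```

From the caller’s point of view, the global random sequence is exactly where it was before the call, even though the function drew a thousand random points internally.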
Today, we’re going to talk about the Databricks File System (DBFS) in Azure Databricks. If you haven’t read the previous posts in this series (Introduction, Cluster Creation, and Notebooks), they may provide some useful context. You can find the files from this post in our GitHub Repository. Let’s move on to the core of this post, DBFS.
As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. For this post, we’re going to talk about the storage layer underneath Azure Databricks: DBFS. Since Azure Databricks manages Spark clusters, it requires an underlying Hadoop Distributed File System (HDFS). This is exactly what DBFS is. Basically, HDFS is the low-cost, fault-tolerant, distributed file system that makes the entire Hadoop ecosystem work. We may dig deeper into HDFS in a later post. For now, you can read more about HDFS here and here.
Click through for more detail on DBFS.
Often when query tuning, I’ll try a change that I think makes sense, only to have it backfire.
It’s not that the query got slower, it’s that the results that came back were wrong.
Now, this can totally happen because of a bug in previously used logic, but that’s somewhat rare.
Different results make testers nervous. Especially in production.
This is where knowledge of abstract math and logic (like De Morgan’s Laws, both of which I’ve used to tune queries in the past because I’m a nerd) can pay off.
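For reference, De Morgan’s Laws are the identities such predicate rewrites rely on. A quick truth-table check in Python (note that SQL’s three-valued logic with NULLs adds wrinkles this two-valued check doesn’t capture):

```python
from itertools import product

# De Morgan's Laws:
#   NOT (A AND B)  ==  (NOT A) OR  (NOT B)
#   NOT (A OR  B)  ==  (NOT A) AND (NOT B)
# Verify both identities over every combination of truth values.
for a, b in product([True, False], repeat=2):
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))
```

In query-tuning terms, this is why `WHERE NOT (x = 1 AND y = 2)` can be rewritten as `WHERE x <> 1 OR y <> 2` — provided you account for NULLs.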
Let’s say you need to format the Sales Amount for all countries as a whole number, except for Australia.
See below for two ways you can do it.
Click through for the two methods.
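The post itself shows the DAX; as a language-neutral sketch of the conditional-formatting logic in Python (the two-decimal format for the Australia exception is an illustrative assumption, not taken from the post):

```python
def format_sales(amount, country):
    # Whole number everywhere except Australia, which (as an
    # illustrative assumption) gets two decimal places.
    if country == "Australia":
        return f"{amount:,.2f}"
    return f"{amount:,.0f}"

print(format_sales(1234.567, "Germany"))    # 1,235
print(format_sales(1234.567, "Australia"))  # 1,234.57
```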
There are two common pitfalls involving window functions, both of which are the result of counterintuitive implicit defaults that are imposed by the SQL standard. One pitfall has to do with calculations of running totals where you get a window frame with the implicit RANGE option. Another pitfall is somewhat related, but has more severe consequences, involving an implicit frame definition for the FIRST_VALUE and LAST_VALUE functions.
There’s a lot going on in these two examples, so read on.
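A small Python model of the first pitfall may help: with ties in the ordering column, the implicit RANGE frame includes every “peer” row that shares the current ORDER BY value, while an explicit ROWS frame stops at the current row (this is a sketch of the frame semantics, not SQL):

```python
# Rows ordered by a date key that contains a tie on "2024-01-02".
rows = [("2024-01-01", 10), ("2024-01-02", 20),
        ("2024-01-02", 30), ("2024-01-03", 40)]

def running_rows(data):
    """ROWS UNBOUNDED PRECEDING AND CURRENT ROW:
    sum strictly up to and including the current row."""
    total, out = 0, []
    for _, v in data:
        total += v
        out.append(total)
    return out

def running_range(data):
    """Implicit RANGE frame: also include peer rows that
    tie with the current row on the ORDER BY key."""
    return [sum(v for k2, v in data if k2 <= k) for k, _ in data]

print(running_rows(rows))   # [10, 30, 60, 100]
print(running_range(rows))  # [10, 60, 60, 100]
```

Note how both tied rows report 60 under the RANGE semantics — exactly the surprise people hit when they omit the frame clause on a running total.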
The fact that you know you have corruption in TempDB is good news: it shows that you are running CHECKDB against TempDB, something many people overlook.
The corrupt page in TempDB may cause some issues if it is not cleared up.
Since the TempDB files are recreated when SQL Server restarts, here is what I would recommend.
Read on for Steve’s advice.
Spools come in a variety of types, but most of them can be summarized as operators that store an intermediary result table in tempdb.
SQL Server often uses spools to process complex queries, transforming the data into a tempdb worktable to allow further data operations. The downside to this, however, is the need to write the data to disk in tempdb.
When I see a spool, I often first try to think whether there is a way to rewrite the query to avoid the spool in the first place.
Spools are my bête noire. I love what Erik Darling calls them: SQL Server passive-aggressively telling you that you need an index without actually saying that you need an index.
At the end of the post, Bert calls out Hugo Kornelis’s operator list. Definitely check that out too, as Hugo is putting together a masterpiece in the original meaning of the term.