Press "Enter" to skip to content

Month: June 2017

Limitations In sys.dm_exec_query_plan

Brent Ozar shows a limitation in the sys.dm_exec_query_plan DMV:

The query with six joins has a cached plan that you can click on – because it’s from sys.dm_exec_query_plan, a management object that returns execution plans as XML.

The query with seven joins does not. Its cached plan is too large or complex for that DMV. Microsoft quickly realized this problem when 2005 came out, so in 2005 Service Pack 2, they introduced sys.dm_exec_text_query_plan. That function returns bigger data – but it only comes back in text format.

There are a couple of workarounds, at least, but they aren’t ideal.
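
One of those workarounds lends itself to scripting: query sys.dm_exec_text_query_plan directly and write the result out as a .sqlplan file, which Management Studio will open as a graphical plan. Here is a minimal sketch of that idea in Python with pyodbc; the connection string and file names are placeholders.

import pyodbc

# Placeholder connection string -- adjust driver, server, and auth as needed.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
)

# sys.dm_exec_text_query_plan returns the plan as plain text (nvarchar(max)),
# so it can hand back plans that sys.dm_exec_query_plan's XML type rejects.
sql = """
SELECT TOP (10) tqp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_text_query_plan(
        qs.plan_handle,
        qs.statement_start_offset,
        qs.statement_end_offset) AS tqp
ORDER BY qs.total_worker_time DESC;
"""

for i, (plan_text,) in enumerate(conn.execute(sql)):
    if plan_text is None:
        continue
    # SSMS opens .sqlplan files graphically, sidestepping the DMV limit.
    with open(f"plan_{i}.sqlplan", "w", encoding="utf-8") as f:
        f.write(plan_text)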


Discovering Orphaned Users

Adrian Buckman troubleshoots an access scenario:

First let’s check that the user does actually exist; we know the server login exists, otherwise the user would be complaining that they cannot connect to the SQL Server instance.

Sure enough – there is the user ‘SQLUndercoverUser’. Let’s check out the permissions:

No problems there – the user has [db_datareader], [db_datawriter], and [db_owner], so we know there is not a permissions issue. Let’s test this login by connecting to SQL Server with the user’s credentials:

Connected to the server with no issues. Let’s open a new query against SQLUnderCoverDB:

Hmm – so despite having permissions to access the database, we are receiving this error. We know that the password is correct too, otherwise we wouldn’t be able to access the server at all…

Adrian does a nice job of walking through the troubleshooting process, moving from the simple explanations (does the user actually exist? does the user have permissions?) to the real cause, which was an orphaned SQL-authenticated user.  Read the whole thing.
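
For reference, both the detection and the fix are scriptable. Here is a rough sketch (Python with pyodbc; the connection string is a placeholder) that finds SQL-authenticated users whose SIDs match no server login and prints the standard ALTER USER repair statements:

import pyodbc

# Placeholder connection string -- point it at the affected database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=SQLUnderCoverDB;Trusted_Connection=yes;"
)

# A SQL-authenticated database user is orphaned when its SID no longer
# matches any login's SID at the server level.
orphans = conn.execute("""
    SELECT dp.name
    FROM sys.database_principals AS dp
    LEFT JOIN sys.server_principals AS sp ON dp.sid = sp.sid
    WHERE dp.type = 'S'                 -- SQL user
      AND dp.authentication_type = 1    -- instance (SQL) authentication
      AND sp.sid IS NULL;
""").fetchall()

for (name,) in orphans:
    # Re-map the user to the same-named login; review before running these.
    print(f"ALTER USER [{name}] WITH LOGIN = [{name}];")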


Multi-Database Query Store Data Retrieval

Tracy Boggiano wants to get Query Store data for a large number of databases on the same instance and has written a PowerShell script to that end:

In SQL Server 2016, Microsoft introduced to us the Query Store.  This has proven to be a very beneficial feature, especially at my job, where it saved the day on Christmas Eve. One of the limitations I see is when you have a server with several databases on it and are trying to find the one that has the problem. So until Microsoft gives us a better tool or someone builds one, I wrote a quick PowerShell function to query across all the databases and return data to a grid so you can sort and find that troublesome query.

To call it, simply provide the server, which metric you want, the top number of queries to return, how many hours to look back, and the minimum execution count.

This is one of two issues I have with Query Store, the other being that its data gets saved to the primary filegroup without any recourse.  Click through for the script.
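
As a rough illustration of what such a function does under the hood (Tracy’s real script is PowerShell and more complete), here’s a Python sketch that loops over every database with Query Store enabled and runs the same aggregation against its catalog views; the server name and thresholds are placeholders:

import pyodbc

# Hypothetical parameters mirroring the function's: server, top N,
# hours to look back, and minimum execution count.
SERVER, TOP_N, HOURS_BACK, MIN_EXECUTIONS = "localhost", 10, 24, 2

conn = pyodbc.connect(
    f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={SERVER};"
    "DATABASE=master;Trusted_Connection=yes;"
)

# Only databases that actually have Query Store turned on.
dbs = [row.name for row in conn.execute(
    "SELECT name FROM sys.databases WHERE is_query_store_on = 1;")]

query = """
SELECT TOP (?) qt.query_sql_text,
       SUM(rs.count_executions)      AS executions,
       AVG(rs.avg_duration) / 1000.0 AS avg_duration_ms
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p  ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
JOIN sys.query_store_runtime_stats_interval AS i
     ON rs.runtime_stats_interval_id = i.runtime_stats_interval_id
WHERE i.start_time >= DATEADD(HOUR, -?, SYSUTCDATETIME())
GROUP BY qt.query_sql_text
HAVING SUM(rs.count_executions) >= ?
ORDER BY avg_duration_ms DESC;
"""

for db in dbs:
    cur = conn.cursor()
    cur.execute(f"USE [{db}];")
    for row in cur.execute(query, TOP_N, HOURS_BACK, MIN_EXECUTIONS):
        print(db, row.executions, f"{row.avg_duration_ms:.1f} ms",
              row.query_sql_text[:60])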


Scraping Session Data

Amy Herold has scraped PASS Summit 2017 submissions using PowerShell:

Never having done a web scrape before, this was the perfect subject for my first time – grabbing all the sessions submitted to PASS Summit 2017…and doing it with PowerShell! Here is the script I used for this. I have accounted for the following:

  • Apostrophes (aka single quotes). They will break your insert unless you have two of them, and for some reason, people seem to use them all over the place.

  • Formatting the string data for insert. No, your data will not magically come out right in your insert with single quotes so you need to add them.

  • Additional ID and deleted fields.

  • Speaker URL and ID. Will be using this to scrape speaker details later.

  • Accurate lower and upper bounds. These were arrived at by trial and error (you’re welcome), as well as the clean up of the data I scraped. More on this later.

PowerShell probably wouldn’t be my first language for web scrapes (that’d be Python), but Amy shows how to get a scrape going.
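
To make that concrete (in Python, since I brought it up), a bare-bones version of such a scrape might look like this. The URL pattern, ID bounds, and target table are invented for the example; Amy’s real bounds came from trial and error:

import re
import requests

# Illustrative only: the real session-detail URL and ID range differ.
BASE = "http://www.pass.org/summit/2017/Sessions/Details.aspx?sid={}"
LOWER, UPPER = 65000, 66000  # hypothetical bounds

def quote_sql(value: str) -> str:
    """Double embedded apostrophes and wrap the value for a SQL INSERT."""
    return "'" + value.replace("'", "''") + "'"

for sid in range(LOWER, UPPER):
    resp = requests.get(BASE.format(sid), timeout=10)
    if resp.status_code != 200:
        continue  # deleted or missing session
    match = re.search(r"<h1[^>]*>(.*?)</h1>", resp.text, re.S)
    if match:
        title = match.group(1).strip()
        # dbo.Sessions is a made-up staging table for the example.
        print(f"INSERT INTO dbo.Sessions (SessionId, Title) "
              f"VALUES ({sid}, {quote_sql(title)});")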


Using OtterTune To Tune Databases

Dana Van Aken, Geoff Gordon, and Andy Pavlo show off OtterTune, which uses machine learning techniques to tune database management systems like MySQL and Postgres:

OtterTune, a new tool that’s being developed by students and researchers in the Carnegie Mellon Database Group, can automatically find good settings for a DBMS’s configuration knobs. The goal is to make it easier for anyone to deploy a DBMS, even those without any expertise in database administration.

OtterTune differs from other DBMS configuration tools because it leverages knowledge gained from tuning previous DBMS deployments to tune new ones. This significantly reduces the amount of time and resources needed to tune a new DBMS deployment. To do this, OtterTune maintains a repository of tuning data collected from previous tuning sessions. It uses this data to build machine learning (ML) models that capture how the DBMS responds to different configurations. OtterTune uses these models to guide experimentation for new applications, recommending settings that improve a target objective (for example, reducing latency or improving throughput).

In this post, we discuss each of the components in OtterTune’s ML pipeline, and show how they interact with each other to tune a DBMS’s configuration. Then, we evaluate OtterTune’s tuning efficacy on MySQL and Postgres by comparing the performance of its best configuration with configurations selected by database administrators (DBAs) and other automatic tuning tools.

This is potentially a very interesting technology and is not the only one of its kind—we’ve seen Microsoft enter this space as well for SQL Server index and tuning recommendations.


Fresh R Installation On Linux

Marcelo Perlin has a script to install R on Linux:

Since I formatted all three of my computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, you are required to install dependencies using the terminal. Each time an install.packages() call failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.

Another option might be to grab a Docker image.


S3 Versus HDFS For Spark Data Storage

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.
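
A quick back-of-the-envelope calculation shows where that nine-hour figure comes from: 99.9% availability already allows nearly nine hours of downtime a year, and anything lower allows more.

# Downtime per year implied by an availability percentage.
HOURS_PER_YEAR = 365.25 * 24

for label, availability in [("S3's stated availability", 0.9999),
                            ("typical Hadoop cluster", 0.999)]:
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{label}: {availability:.2%} up = "
          f"{downtime_hours:.1f} hours down per year")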

It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.


Understanding Random Forests

Manish Kumar Barnwal explains how random forest algorithms work:

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

  • At the row level: Each of these decision trees gets a random sample of the training data (say 10%), i.e. each tree will be trained independently on 100 randomly chosen rows out of the 1,000 rows of data. Because each tree sees its own random sample, the trees differ from one another in their predictions.
  • At the column level: The second level of randomness is introduced at the column level; not all of the columns are passed into training each decision tree. Say we want only 10% of columns to be sent to each tree: this means three randomly selected columns will be sent to each tree. So for the first decision tree, maybe columns C1, C2, and C4 were chosen; the next tree will have C4, C5, and C10, and so on. (Both levels of sampling are sketched below.)
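
Here’s a tiny numpy sketch of those two levels of sampling, following the article’s simplified 10% figures (real implementations typically bootstrap rows with replacement and re-pick candidate columns at every split):

import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols, n_trees = 1000, 30, 5
X = rng.normal(size=(n_rows, n_cols))  # stand-in for the 1,000 x 30 dataset

for tree in range(n_trees):
    # Row-level randomness: each tree trains on its own 10% sample of rows.
    row_idx = rng.choice(n_rows, size=n_rows // 10, replace=False)
    # Column-level randomness: each tree sees only 10% of columns (3 of 30).
    col_idx = rng.choice(n_cols, size=n_cols // 10, replace=False)
    sample = X[np.ix_(row_idx, col_idx)]
    print(f"tree {tree}: trains on {sample.shape[0]} rows, "
          f"columns {sorted(col_idx.tolist())}")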

This is a nice article, and it includes cases when not to use random forests.


Making Entity Framework Writes A Little Less Slow

Ilya Chumakov has some tips for making Entity Framework inserts and updates a lot faster:

When adding or modifying a large number of records (10³ and more), Entity Framework performance is far from perfect. The reasons are architectural peculiarities of the framework and the non-optimality of the generated SQL. Leaping ahead, I can reveal that saving data while bypassing the context significantly reduces execution time.
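
Ilya’s examples are C#, but the pattern shows up in any ORM. Here’s a rough Python/SQLAlchemy analogue of “bypassing the context”: pushing rows through the ORM’s unit of work versus issuing one bulk INSERT through the Core:

import time
from sqlalchemy import Column, Integer, String, create_engine, insert
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
rows = [{"name": f"item {i}"} for i in range(10_000)]

# ORM path: every object goes through the unit of work (the analogue of
# EF's change tracker), which is what slows down at 10^3+ rows.
with Session(engine) as session:
    t0 = time.perf_counter()
    session.add_all(Item(**r) for r in rows)
    session.commit()
    orm_secs = time.perf_counter() - t0

# Bypassing the context: one bulk INSERT through the Core, no tracking.
with engine.begin() as conn:
    t0 = time.perf_counter()
    conn.execute(insert(Item), rows)
    core_secs = time.perf_counter() - t0

print(f"ORM unit of work: {orm_secs:.2f}s, bulk insert: {core_secs:.2f}s")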

There’s some good advice in here, though it omits my favorite advice: don’t use Entity Framework.


Check Those Aliases

Erik Darling warns you about accidentally using the wrong alias in a query:

People will often tell you to clearly alias your tables, and they’re right. It will make your queries more readable and understandable to whoever has to read your code next, puzzling over the 52 self joins and a WHERE clause that starts off with 1 = 2. It can also help solve odd performance problems.

Take this query, for instance.

This isn’t just for subqueries; even simple joins can go haywire when you accidentally use the wrong alias and both tables happen to have the same column name.
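
The classic failure mode is a subquery that references a column which doesn’t exist in the inner table, so the name silently binds to the outer table and the filter matches everything. SQLite follows the same scoping rules as SQL Server here, so a self-contained Python demo (with made-up tables) shows it:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, buyer_id INTEGER);
    INSERT INTO users  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO orders VALUES (10, 1);
""")

# Intended: users who have placed an order. But user_id does not exist in
# orders, so it silently resolves to the *outer* users.user_id -- the
# subquery returns that value for every orders row, and everyone matches.
bad = conn.execute(
    "SELECT name FROM users WHERE user_id IN (SELECT user_id FROM orders)"
).fetchall()

# Qualifying the columns with explicit aliases forces the reference we meant.
good = conn.execute(
    "SELECT u.name FROM users u "
    "WHERE u.user_id IN (SELECT o.buyer_id FROM orders o)"
).fetchall()

print(bad)   # [('a',), ('b',), ('c',)] -- every user "matches"
print(good)  # [('a',)] -- only the actual buyer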
