Author: Kevin Feasel

Scraping Session Data

Published 2017-06-05 by Kevin Feasel

Amy Herold has scraped PASS Summit 2017 submissions using PowerShell:

Never having done a web scrape before, this was the perfect subject for my first time – grabbing all the sessions submitted to PASS Summit 2017…and doing it with PowerShell! Here is the script I used for this. I have accounted for the following:

Apostrophes (aka single quote). They will break your insert unless you have two of them, and for some reason, people seem to use them all over the place.
Formatting the string data for insert. No, your data will not magically come out right in your insert with single quotes so you need to add them.
Additional ID and deleted fields.
Speaker URL and ID. Will be using this to scrape speaker details later.
Accurate lower and upper bounds. These were arrived at by trial and error (you’re welcome), as well as the clean up of the data I scraped. More on this later.
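
The first two points (doubling apostrophes and wrapping string values in single quotes) can be sketched as follows. This is a minimal illustration in Python rather than Amy's PowerShell script; the `Sessions` table, its columns, and the sample row are hypothetical.

```python
def sql_quote(value: str) -> str:
    """Wrap a string in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_insert(table: str, row: dict) -> str:
    """Build an INSERT statement from a dict of column name -> string value."""
    columns = ", ".join(row)
    values = ", ".join(sql_quote(v) for v in row.values())
    return f"INSERT INTO {table} ({columns}) VALUES ({values});"


# Hypothetical session row; both values contain an apostrophe.
session = {"Title": "What's New in SQL Server", "Speaker": "Jane O'Neil"}
statement = build_insert("Sessions", session)
print(statement)
```

Hand-built statements like this are fine for a one-off load of scraped data; for anything that takes untrusted input, parameterized queries are the safer choice.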

PowerShell probably wouldn’t be my first language for web scrapes (that’d be Python), but Amy shows how to get a scrape going.

Using OtterTune To Tune Databases

Published 2017-06-05 by Kevin Feasel

Dana Van Aken, Geoff Gordon, and Andy Pavlo show off OtterTune, which uses machine learning techniques to tune database management systems like MySQL and Postgres:

OtterTune, a new tool that’s being developed by students and researchers in the Carnegie Mellon Database Group, can automatically find good settings for a DBMS’s configuration knobs. The goal is to make it easier for anyone to deploy a DBMS, even those without any expertise in database administration.

OtterTune differs from other DBMS configuration tools because it leverages knowledge gained from tuning previous DBMS deployments to tune new ones. This significantly reduces the amount of time and resources needed to tune a new DBMS deployment. To do this, OtterTune maintains a repository of tuning data collected from previous tuning sessions. It uses this data to build machine learning (ML) models that capture how the DBMS responds to different configurations. OtterTune uses these models to guide experimentation for new applications, recommending settings that improve a target objective (for example, reducing latency or improving throughput).

In this post, we discuss each of the components in OtterTune’s ML pipeline, and show how they interact with each other to tune a DBMS’s configuration. Then, we evaluate OtterTune’s tuning efficacy on MySQL and Postgres by comparing the performance of its best configuration with configurations selected by database administrators (DBAs) and other automatic tuning tools.

This is potentially a very interesting technology and is not the only one of its kind—we’ve seen Microsoft enter this space as well for SQL Server index and tuning recommendations.

Fresh R Installation On Linux

Published 2017-06-02 by Kevin Feasel

Marcelo Perlin has a script to install R on Linux:

Since I formatted all three of my computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, it is required to install dependencies using the terminal. Each time that an install.packages() call failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.

Another option might be to grab a Docker image.

S3 Versus HDFS For Spark Data Storage

Published 2017-06-02 by Kevin Feasel

Reynold Xin, Josh Rosen, and Kyle Pistor argue that you should use blob storage (S3, Azure Blob, etc.) instead of disk when building a cloud-based Spark cluster:

Based on our experience, S3’s availability has been fantastic. Only twice in the last six years have we experienced S3 downtime and we have never experienced data loss from S3.

Amazon claims 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. The official SLA from Amazon can be found here: Service Level Agreement – Amazon Simple Storage Service (S3).

For HDFS, in contrast, it is difficult to estimate availability and durability. One could theoretically compute the two SLA attributes based on EC2’s mean time between failures (MTTF), plus upgrade and maintenance downtimes. In reality, those are difficult to quantify. Our understanding working with customers is that the majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.
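
The downtime figure follows directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is just for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service is unavailable at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available: about {hours:.2f} hours of downtime per year")
```

At 99.9% this works out to about 8.76 hours per year, which is where the "9 hours" in the quote comes from; at 99.99% it is under an hour.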

It’s interesting how opinion has shifted; even a year ago, the recommendation would be different.

Understanding Random Forests

Published 2017-06-02 by Kevin Feasel

Manish Kumar Barnwal explains how random forest algorithms work:

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

At row level: Each of these decision trees gets a random sample of the training data (say 10%), i.e. each of these trees will be trained independently on 100 randomly chosen rows out of 1,000 rows of data. Keep in mind that each of these decision trees is getting trained on 100 randomly chosen rows from the dataset, i.e. they are different from each other in terms of predictions.

At column level: The second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of columns to be sent to each tree. This means 3 randomly selected columns will be sent to each tree. So for the first decision tree, maybe columns C1, C2 and C4 were chosen. The next DT will have C4, C5, C10 as chosen columns and so on.
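
The two levels of randomness can be sketched in a few lines of Python. This only shows the sampling step as the excerpt describes it (10% of rows and 10% of columns per tree, drawn without replacement); it does not train any trees, and the function name and seed are illustrative. Many library implementations differ in the details, for example by drawing rows with replacement or choosing columns at each split rather than once per tree.

```python
import random


def sample_for_tree(n_rows, n_cols, row_frac, col_frac, rng):
    """Pick the row and column indices that one tree would be trained on."""
    n_row_sample = max(1, round(n_rows * row_frac))
    n_col_sample = max(1, round(n_cols * col_frac))
    rows = rng.sample(range(n_rows), n_row_sample)          # row-level randomness
    cols = sorted(rng.sample(range(n_cols), n_col_sample))  # column-level randomness
    return rows, cols


rng = random.Random(42)  # fixed seed so reruns pick the same samples
for tree_id in range(3):
    rows, cols = sample_for_tree(1000, 30, 0.10, 0.10, rng)
    col_names = [f"C{c + 1}" for c in cols]
    print(f"tree {tree_id}: {len(rows)} rows, columns {col_names}")
```

Each tree sees 100 of the 1,000 rows and 3 of the 30 columns, so the trees end up different from one another.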

This is a nice article and includes cases when not to use random forests.

Making Entity Framework Writes A Little Less Slow

Published 2017-06-02 by Kevin Feasel

Ilya Chumakov has some tips for making Entity Framework inserts and updates a lot faster:

When adding or modifying a large number of records (10³ and more), the Entity Framework performance is far from perfect. The reasons are architectural peculiarities of the framework, and non-optimality of the generated SQL. Leaping ahead, I can reveal that saving data through a bypass of the context significantly minimizes the execution time.

There’s some good advice in here, though not my favorite advice, which is don’t use Entity Framework.

Check Those Aliases

Published 2017-06-02 by Kevin Feasel

Erik Darling warns you about accidentally using the wrong alias in a query:

People will often tell you to clearly alias your tables, and they’re right. It will make them more readable and understandable to whomever has to read your code next, puzzling over the 52 self joins and WHERE clause that starts off with 1 = 2. It can also help solve odd performance problems.

Take this query, for instance.

This isn’t just for subqueries; even simple joins can go haywire when you accidentally use the wrong alias and both tables happen to have the same column name.
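
As a hedged illustration of a closely related failure (this is not the query from Erik's post), here is a small sketch using Python's built-in sqlite3 module. The `users` and `posts` tables are hypothetical. Without aliases, a column name that does not exist in the subquery's table quietly resolves to the outer table; with aliases, the same mistake becomes an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (owner_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO posts VALUES (1, 'Hello');
    """
)

# No aliases: posts has no id column, so "id" inside the subquery silently
# resolves to users.id and every user matches.
unaliased = conn.execute(
    "SELECT name FROM users WHERE id IN (SELECT id FROM posts) ORDER BY name"
).fetchall()

# With aliases, the same mistake fails immediately instead of returning bad rows.
try:
    conn.execute(
        "SELECT u.name FROM users AS u WHERE u.id IN (SELECT p.id FROM posts AS p)"
    )
    alias_error = None
except sqlite3.OperationalError as exc:
    alias_error = str(exc)

# The intended query: only users who own a post.
intended = conn.execute(
    "SELECT u.name FROM users AS u "
    "WHERE u.id IN (SELECT p.owner_id FROM posts AS p) ORDER BY u.name"
).fetchall()

print("unaliased:", unaliased)
print("alias error:", alias_error)
print("intended:", intended)
```

The unaliased query returns all three users even though only one owns a post, which is the kind of wrong result that is easy to miss in review.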

Restoring A BACPAC File

Published 2017-06-02 by Kevin Feasel

Steve Jones shows how to restore a database saved in .bacpac format:

I needed to get the WideWorldImporters sample database for a project and noticed that there was a BACPAC available. I downloaded it and needed to restore this as a database. At least, that’s what many people would think.

However, if you go to the restore dialog, and select Device and then pick your location, there’s no filter for a .bacpac. In fact, if you choose one, it won’t restore. You’ll get the “no backupset selected” error.

Read on for a step-by-step guide showing how to do this.

Automating Azure SQL DB Maintenance

Published 2017-06-02 by Kevin Feasel

Tim Radney shows several methods for performing automated Azure SQL Database maintenance, including runbooks:

Once you create your account, you can then start creating runbooks. You can do just about anything with the runbooks. There are numerous existing runbooks that you can browse through and modify for your own use, including provisioning, monitoring, life cycle management, and more.

You can create the runbooks offline, or using the Azure Portal, and they’re built using PowerShell. In this example, we will reuse the code from the PowerShell demo and also demonstrate how we can use the built-in Azure Service scheduler to run our existing PowerShell code and not have to rely on an on-premises scheduler, task scheduler, or Azure VM to schedule a job.

Read the whole thing if you have Azure SQL Database instances in your environment.

Dimensional Design Tips

Published 2017-06-02 by Kevin Feasel

Koen Verbeeck provides some helpful hints when designing dimensions in SQL Server Analysis Services Multidimensional models:

Although traditional dimension modeling – as explained by Ralph Kimball – tries to avoid snowflaking, it might help the processing of larger dimensions. For example, suppose you have a large customer dimension with over 10 million members. One attribute is the customer country. Realistically, there should only be a bit over 200 countries, maximum. When SSAS processes the dimension, it sends SELECT DISTINCT commands to SQL Server. Such a query on top of a large dimension might take some time. However, if you snowflake (aka normalize) the country attribute into another dimension, the SELECT DISTINCT will run much faster. Here, you need to trade off performance against the simplicity of your design.

There are several good tips here.
