Press "Enter" to skip to content

Automating SpeedPASS Generation

Wayne Sheffield has a PowerShell script to generate SQL Saturday SpeedPASSes:

My good friend, Mr. Google, found this post by Kendal Van Dyke. This post has a PowerShell script that will download and merge all of the PDFs for selected attendees into one big PDF. This enables printing out all of the SpeedPASSes at once, instead of one-by-one. However, I have a couple of problems with this script in its current form. First, the instructions for how to get the information from the SQLSaturday admin site have changed (they did do a major web site change last year). Secondly, it downloads all of the PDFs one-by-one, and puts them into a temporary directory, where they are all merged together. In file-name order. Not alphabetically. This means that the manual sorting is still necessary. But hey – it’s PowerShell. Surely we can come up with a way to do this sorting for us!

So I decided to re-write this script to suit my needs. Kendal’s script downloads the SpeedPASS PDF files one-by-one. However, the admin site allows us to download all of them in one zip file. I like this approach better. I ended up making two major changes to the script. The first change requires pre-downloading and extracting all of the SpeedPASS files. The second change is to get them to merge alphabetically. Like Kendal’s script, this uses the PDFSharp assemblies. This requires using PowerShell 3.0 or higher.

Click through for the script, which is probably very helpful if you ever run a SQL Saturday event.
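The real script is PowerShell plus the PDFSharp assemblies, but the core idea (sort the extracted PDFs by name, then merge them in that order) is easy to sketch. Here is a rough Python equivalent, assuming the pypdf package and a folder of already-extracted SpeedPASS PDFs; treat it as an illustration rather than a drop-in replacement:

```python
# Rough sketch only: merge SpeedPASS PDFs in alphabetical (file name) order.
# Assumes the zip from the SQLSaturday admin site has already been extracted
# into speedpass_dir, and that the pypdf package is installed (pip install pypdf).
from pathlib import Path
from pypdf import PdfWriter

speedpass_dir = Path(r"C:\SpeedPASS\extracted")            # hypothetical path
output_file = speedpass_dir.parent / "AllSpeedPASSes.pdf"

writer = PdfWriter()
# Sorting by file name gives alphabetical order when the PDFs are named after
# attendees; adjust the sort key if your file names are structured differently.
for pdf in sorted(speedpass_dir.glob("*.pdf"), key=lambda p: p.name.lower()):
    writer.append(str(pdf))

with open(output_file, "wb") as f:
    writer.write(f)

print(f"Merged {output_file}")
```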

HDInsight Basics: Nodes

Abdullah Al Mahmood explains some of the basics of Azure HDInsight, including what Hadoop means by nodes:

HDInsight clusters consist of several virtual machines (nodes) serving different purposes. The most common architecture of an HDInsight cluster is two head nodes, one or more worker nodes, and three ZooKeeper nodes.

Head nodes: Hadoop services are installed and run on head nodes. There are two head nodes to ensure high availability by allowing master services and components to continue to run on the secondary node in the event of a failure on the primary. Both head nodes are active and running within the cluster simultaneously. Some services, such as HDFS or YARN, are only ‘active’ on one head node at any given time (and ‘standby’ on the other head node). Other services such as HiveServer2 or Hive Metastore are active on both head nodes at the same time. There are services like Application Timeline Server (ATS) and Job History Server (JHS) which are installed on both head nodes but should run only on the head node where Ambari server is running. If these components sound unfamiliar, please revisit the article on Hadoop ecosystem in HDInsight.

Read on to see the other classes of nodes HDInsight uses.
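As a quick mental model, the layout Abdullah describes boils down to something like this (illustrative only; it is just the architecture written out as data, not anything from the Azure SDK):

```python
# Illustrative summary of a typical HDInsight cluster layout, per the post above.
# Service names are shorthand, not an official Azure structure.
hdinsight_layout = {
    "head_nodes": 2,       # HA pair for master services
    "worker_nodes": "1+",  # sized to the workload
    "zookeeper_nodes": 3,  # coordination quorum
}

head_node_services = {
    "active_standby": ["HDFS NameNode", "YARN ResourceManager"],  # active on one head node at a time
    "active_active": ["HiveServer2", "Hive Metastore"],           # run on both head nodes
    "follow_ambari": ["Application Timeline Server", "Job History Server"],  # run where the Ambari server runs
}
```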

Interrogating A Stack Dump

Kendra Little looks at a SQL Server stack dump:

In the video, I show an example of a stack dump caused by running DBCC PAGE with format style 3 against a table with a filtered index in SQL Server 2014.

It looks like this bug is fixed in SQL Server 2016, at least by SP1.

Sample code to reproduce this against the AdventureWorks2012 database (which I had restored to SQL Server 2014) is in my gist here.

Click through to watch the video.

SSMS Templates

Jana Sattainathan shows some of the value of SQL Server Management Studio templates, along with an important warning:

If you do start creating your own templates, you are responsible for backing them up. To locate the folder where they are stored:

  1. Open DOS command prompt
  2. Run “echo %APPDATA%”
  3. Note the base path
  4. Navigate to %AppData%\Microsoft\Microsoft SQL Server\{SQL Server Version}\Tools\Shell\Templates\Sql\

(where %AppData% is the base path from step 2, and {SQL Server Version} = 90 for SQL 2005, 100 for SQL 2008, 110 for SQL 2012, 120 for SQL 2014, and 130 for SQL 2016)
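For instance, a short Python sketch can build that path directly from %APPDATA% (assuming SQL Server 2016, i.e. the 130 folder; adjust for your version):

```python
# Compute the SSMS template folder instead of navigating to it by hand.
# The version folder is an assumption; use 90/100/110/120/130 as appropriate.
import os

version = "130"  # SQL Server 2016
template_dir = os.path.join(
    os.environ["APPDATA"],
    "Microsoft", "Microsoft SQL Server", version,
    "Tools", "Shell", "Templates", "Sql",
)
print(template_dir)  # this is the folder you need to back up
```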

Templates are extremely useful for day-to-day development, and they give you a handy way of generating snippets of code, like estimating row counts without having to remember to join sys.indexes, sys.objects, and sys.dm_db_partition_stats.
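The row-count trick is roughly the query below; here it is wrapped in a small Python/pyodbc sketch so it runs end-to-end (the connection string and database name are placeholders):

```python
# Hedged sketch: estimate row counts from metadata rather than running COUNT(*).
# Joins sys.objects, sys.indexes, and sys.dm_db_partition_stats, as mentioned above.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=YourDatabase;Trusted_Connection=yes;"   # placeholder connection details
)

sql = """
SELECT  o.name AS table_name,
        SUM(ps.row_count) AS estimated_rows
FROM    sys.objects o
        JOIN sys.indexes i
            ON i.object_id = o.object_id
            AND i.index_id IN (0, 1)          -- heap or clustered index
        JOIN sys.dm_db_partition_stats ps
            ON ps.object_id = i.object_id
            AND ps.index_id = i.index_id
WHERE   o.type = 'U'                          -- user tables only
GROUP BY o.name
ORDER BY estimated_rows DESC;
"""

for table_name, estimated_rows in conn.cursor().execute(sql).fetchall():
    print(f"{table_name}: {estimated_rows}")
```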

Google Compute Engine Whitepapers

Brent Ozar Unlimited has a couple of whitepapers out about working with SQL Server in Google Compute Engine. First, Brent and Tara Kizer create an Availability Group:

In this white paper we built with Google, we’ll show you:

  • How to build your first Availability Group in Google Compute Engine

  • How to test your work with four failure simulations

  • How to tell whether your databases will work well in GCE

Erik Darling also has a whitepaper on performance tuning:

Relax. Have a drink. In this white paper we built with Google, we’ll show you:

  • How to measure your current SQL Server using data you’ve already got

  • How to size a SQL Server in Google Compute Engine to perform similarly

  • After migration to GCE, how to measure your server’s bottleneck

  • How to tweak your SQL Server based on the performance metrics you’re seeing

If you’re looking at GCE as a potential migratory spot, you’ve got some extra reading material.

Dr. Elephant: Where Does My Hadoop Cluster Hurt?

Carl Steinbach looks back at Dr. Elephant one year later:

What we needed to introduce to the job-tuning equation was a series of questions like those asked by a physician making a diagnosis: a step-by-step process that guides the user through the problem-solving process, while also educating them at the same time.

So we created Dr. Elephant, a system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, Dr. Elephant makes it easy to act on these insights by making job-level performance tuning accessible to users regardless of their previous skill level. In the process, Dr. Elephant has helped to ease the tension that previously existed between user productivity on one side and cluster efficiency on the other.

LinkedIn has made this project open source, so you can check it out in your environment.

TensorFlow With YARN

Wangda Tan and Vinod Kumar Vavilapalli show how to control TensorFlow jobs with YARN:

YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

With TensorFlow, one can get started with deep learning without much knowledge about advanced math models and optimization algorithms.

If you have GPU-equipped hardware and want to run TensorFlow, going through the process of setting up hardware, installing the bits, and optionally also dealing with faults, scaling the app up and down, etc., becomes cumbersome really fast. Instead, integrating TensorFlow with YARN allows us to seamlessly manage resources across machine learning / deep learning workloads and other YARN workloads like MapReduce, Spark, Hive, etc.

Read on for more details, including a demo video.
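The post is about scheduling rather than modeling code, but for context, here is roughly what a minimal TensorFlow training script looks like, i.e. the kind of per-node workload YARN would be managing. This uses the current TensorFlow 2.x Keras API and synthetic data, so it is an illustration rather than anything from the linked article:

```python
# Minimal, self-contained TensorFlow example (not from the linked post):
# a tiny model trained on synthetic data, just to show the kind of job
# that would be packaged up and scheduled on a YARN cluster.
import numpy as np
import tensorflow as tf

# Synthetic data: 1,000 samples, 20 features, binary labels.
rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 20)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32, verbose=2)
```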

Rolling Out An Analytics Project

Christina Prevalsky shares some considerations for implementing an analytics project:

The earlier you address data quality the better; the less time your end users spend on data wrangling, and the more they can focus on high value analytics. As your organization’s data infrastructure matures, migrating from spreadsheets to databases and data warehouses, data quality checks should be formally defined, documented, and automated. Exceptions should either be handled automatically during data intake using predefined business rules logic or require immediate user intervention to correct any errors.

Providing clean, centralized, and analytics-ready data to end users should not be a one-way process. By allowing end users to focus on high-value analytics, like data mining, network graphs, clustering, etc., they can uncover certain outliers and anomalies in the data. Effective data management should include a feedback loop to communicate these findings and, if necessary, incorporate any changes in the ETL processes, making centralized data management more dynamic and flexible.

The big question to ask is, “What problem are we trying to solve?” That will help determine the answer to many of the questions, including how you store the data, how you expose the data, and even which data you collect and keep.
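To make the “formally defined, documented, and automated” checks a bit more concrete, here is a small illustrative sketch in pandas; the column names, rules, and thresholds are all invented for the example:

```python
# Illustrative only: simple automated data-quality checks with business rules,
# splitting rows into auto-corrected records and exceptions needing human review.
import pandas as pd


def run_quality_checks(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (clean_rows, exception_rows) for downstream loading and review."""
    df = df.copy()

    # Rule 1: negative order amounts are treated as sign errors and auto-corrected.
    df.loc[df["order_amount"] < 0, "order_amount"] = df["order_amount"].abs()

    # Rule 2: a missing customer ID cannot be fixed automatically; route it to review.
    exceptions = df[df["customer_id"].isna()]
    clean = df[df["customer_id"].notna()]
    return clean, exceptions


orders = pd.DataFrame({
    "customer_id": [101, None, 103],
    "order_amount": [250.0, 75.0, -40.0],
})
clean, exceptions = run_quality_checks(orders)
print(clean)       # rows ready for the warehouse
print(exceptions)  # rows that need user intervention
```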
