Press "Enter" to skip to content

Author: Kevin Feasel

PowerShell ETL, Part 2

Max Trinidad has part 2 of his PowerShell ETL series:

If you notice, in the above cmdlet's where-clause I'm selecting on the Column1 property instead of a reasonable label. In my scenario the data in the CSV file contains variable columns for its different data types, such as Info, Error, and System. So it was easy to identify the total number of columns to be 15.

Now, using the cmdlet “Import-Csv” with the parameter “-Header”, you can define a list of columns when you build the $Logdata object. We create the $header variable with the column names separated by commas.
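
Max works in PowerShell, of course; purely as an illustration of the same idea in another language, here's a minimal Python sketch of importing a headerless CSV with your own column names (the file name and column names below are invented; his real file has 15 columns):

    import csv

    # Invented column names standing in for the 15 columns Max defines in $header.
    header = ["DateTime", "Severity", "Source", "Message", "Detail"]

    # Read a headerless CSV and apply our own column names, analogous to
    # Import-Csv -Header in PowerShell.
    with open("service_log.csv", newline="") as f:
        log_data = list(csv.DictReader(f, fieldnames=header))

    # Now we can filter on a meaningful name instead of a positional Column1-style label.
    errors = [row for row in log_data if row["Severity"] == "Error"]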

Keep an eye out for part 3.  In the meantime, check out part 1 if you haven’t already.


New XQueryPlanPath Update

Richie Lee has an update of XQueryPlanPath:

I’ve made a few changes to the XQueryPlanPath project. The project parses query plans into XML and then uses XPath to find the value of one or more nodes. This could then be used in testing to verify that any changes made to a query would retain a query plan that is considered optimal, and if any changes break the test you can verify whether the change has a sub-optimal effect on your query.

There was however one issue – query plans are like opinions; every SQL Server instance has one, and none of them think that theirs stinks. So running a test on a dev box will potentially produce a different query plan from the one on the build server, or the one in production, etc. This is broadly because of three reasons:
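
To get a feel for what the project automates, here's a bare-bones Python sketch of the underlying idea: a .sqlplan file is just XML in the showplan namespace, so you can pull values out of it with XPath-style queries (the file name and the Table Scan assertion are invented for illustration; XQueryPlanPath itself does considerably more than this):

    import xml.etree.ElementTree as ET

    # Saved query plans (.sqlplan files) are XML documents in the showplan namespace.
    ns = {"sp": "http://schemas.microsoft.com/sqlserver/2004/07/showplan"}

    tree = ET.parse("my_query.sqlplan")  # hypothetical saved plan file
    root = tree.getroot()

    # List every physical operator in the plan with its estimated row count.
    for rel_op in root.iterfind(".//sp:RelOp", ns):
        print(rel_op.get("PhysicalOp"), rel_op.get("EstimateRows"))

    # A test could then assert that the plan stays "optimal" by some definition,
    # for example that no operator is a Table Scan.
    assert all(op.get("PhysicalOp") != "Table Scan"
               for op in root.iterfind(".//sp:RelOp", ns))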

Check it out, especially if your XML parsing skills aren’t top-notch.


Hadoop 3

Alex Woodie covers some upcoming changes with Hadoop version 3:

Hadoop 3, as it currently stands (which is subject to change), won’t look significantly different from Hadoop 2, Ajisaka said. Made generally available in the fall of 2013, Hadoop 2 was a very big deal for the open source big data platform, as it introduced the YARN scheduler, which effectively decoupled the MapReduce processing framework from HDFS, and paved the way for other processing frameworks, such as Apache Spark, to process data on Hadoop simultaneously. That has been hugely successful for the entire Hadoop ecosystem.

It appears the list of new features in Hadoop 3 is slightly less ambitious than the Hadoop 2 undertaking. According to Ajisaka’s presentation, in addition to support for erasure coding and bug fixes, Hadoop 3 currently calls for new features like:

  • shell script rewrite;
  • task-level native optimization;
  • the capability to derive heap size or MapReduce memory automatically;
  • elimination of old features;
  • and support for more than two NameNodes.

The big benefit of erasure coding is that you can potentially cut data usage requirements in half, so that can help in very large environments.  Alex also notes that the first non-beta version of Hadoop 3 is expected to be released by the end of the year.
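
Back-of-the-envelope, that halving claim comes from comparing triple replication against a Reed-Solomon 6+3 scheme (the commonly cited configuration for HDFS erasure coding; treat the exact parameters as an assumption here):

    # Raw disk needed to store 1 TB of data under each scheme.
    data_tb = 1.0

    # Triple replication: every block is written three times.
    replicated_tb = data_tb * 3                      # 3.0 TB on disk

    # Reed-Solomon (6, 3): 6 data blocks plus 3 parity blocks per stripe.
    data_blocks, parity_blocks = 6, 3
    erasure_coded_tb = data_tb * (data_blocks + parity_blocks) / data_blocks   # 1.5 TB

    print(f"Replication: {replicated_tb} TB; erasure coding: {erasure_coded_tb} TB")
    # Half the raw capacity, while still tolerating the loss of any three blocks in a stripe.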


Screenshot Interview Questions

Brent Ozar has the latest in his interview series:

They didn’t give you parameter 26837 – I’m just giving you that so you can see an execution plan.

You don’t have to talk me through the query itself, or what you’d want to do to fix it. In fact, I want you to avoid that altogether.

Instead, tell me what things you need to know before you start tuning, and explain how you’re going to get them.

I think, based on the noise in the comments section, that this is a good question.  Good interview questions are separating in equilibrium (as opposed to pooling).  The question itself is straightforward, but people have such a tendency to jump the gun that they try to answer a question which isn’t being asked.  Then, among those who do read the question, the sets of steps and processes people describe are interesting because of how much they differ.

Bonus question:  take your interview answer (“I would do X and Y and Z and then A and B and C and maybe D.”) and apply it to the last time you had this scenario come up.  How many of [A-DX-Z] did you actually do?


New SQL Server Connector Preview

Rebecca Zhang notes that there is a new public preview of the SQL Server Connector:

For those not familiar with the SQL Server Connector, it enables SQL Server to use Azure Key Vault as an Extensible Key Management (EKM) Provider for its SQL encryption keys. This means that you can use your own encryption keys and protect them in Azure Key Vault, a cloud-based external key management system which offers central key management, leverages hardware security modules (HSMs), and allows separation of management of keys and data, for additional security. This is available for the SQL encryption keys used in Transparent Data Encryption (TDE), Column Level Encryption (CLE), and Backup encryption.

When using these SQL encryption technologies, your data is encrypted with a symmetric key (called the database encryption key) stored in the database. Traditionally (without Azure Key Vault), a certificate that SQL Server manages would protect this data encryption key (DEK). With Azure Key Vault integration for SQL Server through the SQL Server Connector, you can protect the DEK with an asymmetric key that is stored in Azure Key Vault. This way, you can assume control over the key management, and have it be in a separate key management service outside of SQL Server.

Check it out, as it might be a solution to some key management issues.


Data Frames

Saravanan Subramanian has an introduction to data frames in R:

The R data frame is a high-level data structure which is equivalent to a table in database systems.  It is highly useful for working with machine learning algorithms, and it’s very flexible and easy to use.

The standard definition describes data frames as “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.”

Data frames are a powerful abstraction and make R a lot easier for database professionals than for application developers, who are used to thinking iteratively and working with one object at a time.
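
For readers coming at this from the Python side rather than R, the closest analogue is the pandas DataFrame; here's a tiny sketch of the same table-like idea (column names and values invented):

    import pandas as pd

    # A data frame is essentially an in-memory table: named columns, one row per observation.
    patients = pd.DataFrame({
        "patient_id": [1, 2, 3],
        "age":        [34, 52, 47],
        "gender":     ["M", "F", "M"],
    })

    # Selection and filtering work set-wise, much as they do against a database table.
    print(patients.loc[patients["gender"] == "M", ["patient_id", "age"]])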


Warehouses Will Live On

Jesse Seymour argues that in-memory analysis solutions will not entirely replace data warehouses:

The big reason that dimensional modeling increases clarity is that the dimensional model seeks to flatten data as much as possible.  Let’s compare two examples.  Both of these examples are for a fictional health clinic.

The first example is that we want a report on how many male patients were treated with electric shock therapy by provider, grouped monthly and spanning a year-to-date range.

Those big Kimball-style warehouses do a great job of making it easier for people who are not database specialists to query data and get meaningful, consistent results to known business questions.  The trick to understanding data platforms is that they tend to be complements rather than substitutes:  introducing Spark-R in your environment does not replace your Kimball-style warehouse; it complements it by letting analysts find trends more easily.  Similarly, a Hadoop cluster potentially lets you complement an existing data warehouse in a few ways:  acting as a data aggregator (which allows you to push some ETL work off onto the cluster), a data collector (especially for information which is useful but doesn’t really fit in your conformed warehouse), and a data processor (particularly for those gigantic queries which are not time-sensitive).
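
To make the flattening point concrete, here is a rough sketch of the clinic report Jesse describes (male patients given a particular treatment, by provider and month) against an invented star schema; it's written in pandas to match the in-memory style the post contrasts with, and every table and column name is hypothetical:

    import pandas as pd

    # Hypothetical star schema: one fact table plus conformed dimensions.
    fact_treatment = pd.DataFrame({
        "patient_key":  [1, 2, 1],
        "provider_key": [10, 10, 11],
        "date_key":     [20160104, 20160212, 20160303],
        "treatment":    ["Electric shock", "Electric shock", "Electric shock"],
    })
    dim_patient  = pd.DataFrame({"patient_key": [1, 2], "gender": ["M", "F"]})
    dim_provider = pd.DataFrame({"provider_key": [10, 11],
                                 "provider_name": ["Provider A", "Provider B"]})
    dim_date     = pd.DataFrame({"date_key": [20160104, 20160212, 20160303],
                                 "month": ["2016-01", "2016-02", "2016-03"]})

    # The dimensional model flattens the question into a few joins and a group-by.
    report = (fact_treatment
              .merge(dim_patient, on="patient_key")
              .merge(dim_provider, on="provider_key")
              .merge(dim_date, on="date_key")
              .query("gender == 'M' and treatment == 'Electric shock'")
              .groupby(["provider_name", "month"])
              .size()
              .rename("treatment_count")
              .reset_index())
    print(report)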


Migrating SQL Server To Azure IaaS

James Serra has links and notes on migrating SQL Server from on-premises to Azure virtual machines:

  • Convert an on-premises physical machine to a Hyper-V VHD, upload to Azure Blob storage, and then deploy as a new VM using the uploaded VHD.  Use when bringing your own SQL Server license, when migrating a database that you will run on an older version of SQL Server, or when migrating system and user databases together as part of the migration of a database dependent on other user databases and/or system databases.  Use on SQL Server 2005 or greater to SQL Server 2005 or greater.

  • Ship a hard drive using the Windows Import/Export Service.  Use when the manual copy method is too slow, such as with very large databases.  Use on SQL Server 2005 or greater to SQL Server 2005 or greater.

If you’re looking for notes on where to get started, this is a good link.


Using Python 3.4 With EMR And Spark

Bruno Faria shows how to use Python 3.4 with Spark on Amazon’s Elastic MapReduce:

An EMR 4.6 cluster running Spark 1.6.1 will still use Python 2.7 as the default interpreter. If you want to change this, you will need to set the environment variable: PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying the configuration shown in the snippet below:
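
The configuration snippet itself didn't carry over here, so treat this as a reconstruction to confirm against the original post: the shape of it, expressed as a Python structure you could pass to the EMR configurations API (for example via boto3's run_job_flow Configurations parameter) or dump to JSON for the CLI, is roughly as follows.

    import json

    # Export PYSPARK_PYTHON=python34 into Spark's environment via the configurations API.
    spark_python3_config = [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": "python34"},
                }
            ],
        }
    ]

    # Serialize to JSON if you are supplying the configuration as a file to the CLI.
    print(json.dumps(spark_python3_config, indent=2))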

I’m more of a SQL and Scala guy, but if you like Python and are on the Python 3 side of the divide, here’s a solution for you.


More On Database Corruption

SQLWayne has a follow-up on database corruption:

You see the CHECKSUM on the backup along with the RESTORE VERIFYONLY.  The code was generated by right-clicking on the database, selecting Tasks, then Backup, plugging in the parameters, and selecting Script.  I put it in a new query window as I may back up several databases in the same job.  Sometimes I’ll just do a find/replace for the other databases since my backup.  The RESTORE VERIFYONLY gives you some confidence that your backup is recoverable: NEVER assume that just because your backup ran that the database is restorable!  The ONLY way to know is to actually restore it to another file!  You don’t want to accidentally clobber your production database, which probably has newer data in it.

Corruption is a serious event when your entire job revolves around protecting data.  Be prepared.
