Press "Enter" to skip to content

Curated SQL Posts

Kaggle Data Science Report For 2017

Mark McDonald rounds up a few notebooks covering a recent Kaggle survey:

In 2017 we conducted our first ever extra-large, industry-wide survey to capture the state of data science and machine learning.

As the data science field booms, so has our community. In 2017 we hit a new milestone, reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what are your favorite data science podcasts?” to “what barriers do you face at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 survey responses were submitted, representing more than six full months of aggregate time spent completing them (an average response time of more than 16 minutes).

Click through for a few reports.  Something interesting to me is that the top languages/tools were, in order, Python, R, and SQL.  For the particular market niche that Kaggle competitions fit, that makes a lot of sense:  I tend to like R more for data exploration and data cleansing, but much of that work is already done by the time you get the dataset.

Handling Late Arrivals In SSIS

Peter Schott shows us a pattern for dealing with late-arriving dimension members in SQL Server Integration Services ETL packages:

The general steps are:

  1. Set up your source query.

  2. Pass the data through a Lookup for your Dimension with the missing results routed to a “No Match” output.

  3. Insert those “No Match” rows into your Dimension using a SQL task, checking to make sure that this particular row hasn’t already been inserted (this is important; see the sketch after this list).

  4. Do another lookup using a “Partial Cache” to catch these newly-inserted members.

  5. Use a UNION ALL transform to bring the existing and late-arriving members together.
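
The existence check in step 3 is what keeps the same late-arriving member from being inserted more than once. A minimal sketch of that step, with hypothetical dimension and column names (this is not Peter's exact code):

-- @CustomerBusinessKey arrives via the Execute SQL Task's parameter mapping
IF NOT EXISTS
(
    SELECT 1
    FROM dbo.DimCustomer
    WHERE CustomerBusinessKey = @CustomerBusinessKey
)
BEGIN
    -- flag the row as inferred so a later dimension load can fill in its attributes
    INSERT INTO dbo.DimCustomer (CustomerBusinessKey, IsInferred)
    VALUES (@CustomerBusinessKey, 1);
END;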

Click through for more information and a helpful package diagram.

Error Handling On SQL With Linux

Anthony Nocentino explains Linux error codes and systemd behavior for SQL on Linux:

Now in the output above, you’ll notice a bolded line. There, you can see that systemd[1] receives a return code from SQL Server of status=1/FAILURE. Systemd[1] is the parent process of sqlservr; in fact, it’s the parent of all processes on our system. When it receives the exit code, systemd immediately initiates a restart of the service due to the configuration we have for our mssql-server systemd unit.
What’s interesting is that this happens even on a normal shutdown. But that simply doesn’t make sense: return values on clean exits should be 0, and it’s my understanding that the SHUTDOWN command causes the database engine to shut down cleanly.
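
The restart behavior comes down to the unit's Restart= setting. Here is a hypothetical drop-in override to illustrate (the shipped mssql-server.service may be configured differently; systemctl cat mssql-server shows the real thing):

# Hypothetical drop-in, e.g. created with "systemctl edit mssql-server"
[Service]
# "always" restarts the service on any exit, even a clean status=0/SUCCESS;
# "on-failure" would restart only on non-zero exit codes or abnormal signals
Restart=always
RestartSec=5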

On the development side, there aren’t many differences between SQL Server on Linux and on Windows (aside from features which haven’t yet made the move); on the administration side, there are some interesting differences.

Decrypting SSIS Passwords

Jason Brimhall shows how to decrypt your Integration Services package’s password if you have a SQL Agent job set to execute that package:

Take note here that I am only querying the msdb database. There is nothing exceedingly top secret here – yet. Most DBAs should be extremely familiar with these tables and functions that I am using here.

What does this show me, though? If I have a package that is being run via an Agent Job in msdb, then the sensitive information needs to be decrypted somehow. So, in order to do that decryption, the password needs to be passed to the package. As it turns out, the password will be stored in the msdb database following the “DECRYPT” switch for the dtutil utility. Since I happen to have a few of these packages already available, when I run this particular query, I will see something like the following in my results.
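
A minimal sketch of the kind of query involved (this is not Jason's script): Agent job step commands sit in plain text in msdb, so anything following a DECRYPT switch is readable by whoever can query these tables.

-- Find job steps whose command line carries a DECRYPT switch
SELECT j.name AS job_name,
       s.step_name,
       s.command
FROM msdb.dbo.sysjobs AS j
    INNER JOIN msdb.dbo.sysjobsteps AS s
        ON s.job_id = j.job_id
WHERE s.command LIKE N'%DECRYPT%';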

That’s a clever solution.  I get the feeling that I should be a bit perturbed by how simple this is, but I’m not; the really sensitive data is still secure.

Position Differences And Convolutional Neural Networks

Pete Warden shares his knowledge of how convolutional neural networks deal with position differences in images:

If you’re trying to recognize all images with the sun shape in them, how do you make sure that the model works even if the sun can be at any position in the image? It’s an interesting problem because there are really three stages of enlightenment in how you perceive it:

  • If you haven’t tried to program computers, it looks simple to solve because our eyes and brain have no problem dealing with the differences in positioning.

  • If you have tried to solve similar problems with traditional programming, your heart will probably sink because you’ll know both how hard dealing with input differences will be, and how tough it can be to explain to your clients why it’s so tricky.

  • As a certified Deep Learning Guru, you’ll sagely stroke your beard and smile, safe in the knowledge that your networks will take such trivial issues in their stride.

It’s a good read.

Avro And Streaming Data

Pat Patterson shows how to get the advantages of the Avro file format while streaming individual records:

Avro is a very efficient way of storing data in files, since the schema is written just once, at the beginning of the file, followed by any number of records (contrast this with JSON or XML, where each data element is tagged with metadata). Similarly, Avro is well suited to connection-oriented protocols, where participants can exchange schema data at the start of a session and exchange serialized records from that point on. Avro works less well in a message-oriented scenario since producers and consumers are loosely coupled and may read or write any number of records at a time. To ensure that the consumer has the correct schema, it must either be exchanged “out of band” or accompany every message. Unfortunately, sending the schema with every message imposes significant overhead — in many cases, the schema is as big as the data or even bigger!
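
For reference, an Avro schema is itself just JSON. In a data file, something like this hypothetical schema appears once in the header, followed by any number of compact binary records; shipping that same schema with every one-record message is exactly the overhead described above.

{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "value", "type": "double"}
  ]
}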

Read on to see how the Confluent Schema Registry can solve this problem.

ggplot2 Basics

Bharani Akella has an introduction to ggplot2:

Plot 10: Scatter plot

ggplot(data = mtcars, aes(x = mpg, y = hp, col = factor(cyl))) +
  geom_point()
  • mpg (miles per gallon) is assigned to the x-axis

  • hp (horsepower) is assigned to the y-axis

  • factor(cyl) (the number of cylinders) determines the color

  • The geometry used is a scatter plot, which we can create with the geom_point() function.

He has a number of similar examples showing several variations on bar, line, and scatterplot charts.
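
The same grammar extends to the other chart types he covers. For instance, this sketch (mine, not from the post) swaps in a bar geometry to count cars by number of cylinders:

library(ggplot2)  # mtcars ships with base R

# geom_bar() counts the rows in each cyl group for us
ggplot(data = mtcars, aes(x = factor(cyl))) +
  geom_bar()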

SSIS 2017 Scale-Out

Wolfgang Strasser has started a series on the new scale-out functionality in SQL Server Integration Services 2017.  First, his introduction:

In the past, SSIS package executions could only run on the server that hosted the Integration Services instance itself. With the rising number of package executions and their growing requirements, the resources on that server sometimes ran short. To address this resource shortage, custom scale-out functionality was implemented that allowed package executions to be transferred to other “worker” machines in order to distribute the execution load. With SQL Server 2017, this functionality is built in and shipped with SSIS 2017.

Before diving deeper into SSIS Scale Out, I would like to discuss some basic vocabulary in the field of scalability.

Then, he describes the scale-out architecture:

The master manages the available workers and all the work that is requested for execution in the scale-out topology.

  • The master manages a list of (active) workers

  • The master gets the instructions from clients

  • The master knows the current state of work (queued jobs, running jobs, finished jobs, ..)

If you’re familiar with other distributed computing systems, this follows a similar path.
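
From a client’s point of view, queuing work onto the master is a small variation on a normal catalog execution. A hedged sketch (the folder, project, and package names are hypothetical; @runinscaleout is the SSIS 2017 parameter that requests distributed execution):

DECLARE @execution_id BIGINT;

EXEC SSISDB.catalog.create_execution
    @folder_name = N'DemoFolder',          -- hypothetical folder
    @project_name = N'DemoProject',        -- hypothetical project
    @package_name = N'DemoPackage.dtsx',   -- hypothetical package
    @runinscaleout = 1,                    -- hand this run to the Scale Out master
    @execution_id = @execution_id OUTPUT;

EXEC SSISDB.catalog.start_execution @execution_id;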

Finding Objects Relating To A Schema

Jason Brimhall has a script to help you find which objects are tied to a particular schema:

I have run into this very issue, where there are far too many objects in the schema to be able to drop them one by one. Add to the problem that I am looking to do this via script. Due to the need to drop the schema and the (albeit self-imposed) requirement of doing it via script, I came up with the following, which will cover most cases that I have encountered.
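
As a starting point (this is not Jason's script, which handles more cases), the objects bound to a schema are visible through the catalog views:

-- List the objects that would block a DROP SCHEMA
-- (N'MySchema' is a hypothetical schema name)
SELECT o.name, o.type_desc
FROM sys.objects AS o
    INNER JOIN sys.schemas AS s
        ON s.schema_id = o.schema_id
WHERE s.name = N'MySchema';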

Click through for the script.
