Press "Enter" to skip to content

Curated SQL Posts

Running SQL Server On Linux In The Windows Subsystem For Linux

Anthony Nocentino troubleshoots an error when trying to run SQL Server on Linux using the Windows Subsystem For Linux:

The first thing I had to do was reproduce the issue. So on my Windows 10 test VM I installed the Windows Subsystem for Linux (steps to do so are here) and installed the Ubuntu app.

Then I fired up a bash shell using WSL and installed SQL Server on Linux for Ubuntu as documented here.

Now, I completed the installation of SQL Server on Linux using mssql-conf; when that program completes, it attempts to start SQL Server on Linux. BOOM! I’m able to reproduce the same error.

Looking at the error, I decided to see if I could run SQL Server on Linux from the shell as the user mssql. This would remove systemd and mssql-conf from the picture. Basically I wanted to see if I could get another, more descriptive, error to pop out.

Anthony digs out a very useful debugging tool in Linux, strace. Sadly, he’s not able to solve the problem at the moment, but at least he gets us a step in the right direction.


Creating A Linked Server To Amazon Athena

Maria Zakourdaev shows that you can create a linked server connection in SQL Server to query data using Amazon Athena:

I will show you today how you can use Management Studio or any stored procedure to query the data stored in a CSV file located on S3 storage. I am using the CSV file format as an example here; columnar Parquet gives much better performance.

I am going to:

1. Put a simple CSV file on S3 storage

2. Create External table in Athena service over the data file bucket

3. Create linked server to Athena inside SQL Server

4. Use OPENQUERY to query the data.

The Athena service is built on top of Presto, a distributed SQL engine, and also uses Apache Hive to create, alter, and drop tables. You can run ANSI SQL statements in the Athena query editor, launching it from the AWS web services UI. You can use complex joins, window functions, and many other great SQL language features. Using Athena eliminates the need for ETL because it projects your schema onto the data files at the time of the query.

Standard linked server warnings apply, but sometimes you need to bridge a couple of technologies.
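For illustration, here is a rough sketch of what steps 3 and 4 can look like on the SQL Server side, assuming an ODBC DSN for Athena already exists (the DSN name, catalog, and table below are placeholders, not Maria’s actual script):

-- MSDASQL is the OLE DB provider for ODBC; "AthenaDSN" is a hypothetical system DSN
-- built on Amazon's Athena ODBC driver.
EXEC master.dbo.sp_addlinkedserver
    @server = N'ATHENA',
    @srvproduct = N'Athena',
    @provider = N'MSDASQL',
    @datasrc = N'AthenaDSN';

-- OPENQUERY passes the statement through to Athena and returns the result set.
SELECT *
FROM OPENQUERY(ATHENA, 'SELECT * FROM sampledb.my_csv_table LIMIT 10');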


Watching In-Memory OLTP Files

Ned Otter observes the process when you create memory-optimized tables and insert data into those tables:

It’s clear that before we inserted any data, we had 20 files that were in the PRECREATED state. After inserting 10 rows, we now have 18 PRECREATED files and 2 UNDER CONSTRUCTION files, which means the In-Memory engine is populating these files and that they are “open” in terms of their CHECKPOINT status. If you don’t understand what these terms mean, please read Part 1.

But there’s one thing that doesn’t look right here: we’ve inserted data into the table, but sizeBytesUsed is still zero for the UNDER CONSTRUCTION files. Why is that?

Ned does assume that you’ve read part 1 of the series already.
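If you want to watch those state transitions yourself, the checkpoint file metadata is exposed through the sys.dm_db_xtp_checkpoint_files DMV. A minimal query along these lines should do it (treat the exact column names as a sketch; they have shifted a bit between SQL Server versions):

-- One row per checkpoint file pair component, with its current state and usage.
SELECT file_type_desc,
       state_desc,
       file_size_in_bytes,
       file_size_used_in_bytes
FROM sys.dm_db_xtp_checkpoint_files
ORDER BY state_desc, file_type_desc;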


Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a black box and hence do not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above-mentioned failure happens.

There’s some good information in here, so check it out.


Image Counts For Neural Network Training

Pete Warden shares his rule of thumb for how many images you need to train a neural network:

In the early days I would reply with the technically most correct, but also useless answer of “it depends”, but over the last couple of years I’ve realized that just having a very approximate rule of thumb is useful, so here it is for posterity:

You need 1,000 representative images for each class.

Like all models, this rule is wrong but sometimes useful. In the rest of this post I’ll cover where it came from, why it’s wrong, and what it’s still good for.

Read on to learn where the number 1000 came from and get some good hints, like flipping and rescaling images.


Error Handling In Scala

Manish Mishra gives a few examples of how to handle errors in Scala:

Try[T] is another construct to capture success or failure scenarios. It returns a value in both cases. Put any expression in Try and it will return Success[T] if the expression is successfully evaluated, and Failure[T] otherwise, meaning you are allowed to return the exception as a value. There is one restriction, however: in case of failure it will only return Throwable types:

import scala.util.Try
def validateZipCode(zipCode: String): Try[Int] = Try(zipCode.toInt)

Throwing an exception doesn’t make much sense here since it is not much of a calculation, but we can take this example to understand the use case. If the given string is not a number, it will be a failure. The value from the Try can be extracted the same way as from an Option; it can also be pattern matched.

As you write more complicated Spark operations, handling errors becomes critical.


An Introduction To seplyr

John Mount guest blogs on the Revolutions blog about seplyr:

seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks.

The core of seplyr is a re-skinning of dplyr‘s functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi).

Read on for a couple of examples of where seplyr can be easier to program with than dplyr.


Creating Dynamic Pivot Tables

Ben Richardson shows how to use dynamic SQL to create pivot tables with arbitrary numbers of pivot elements:

The headings of the columns are the individual values inside the city column. We specified these values inside the pivot operator in our query.

The most tedious part of creating pivot tables is specifying the values for the column headings manually. This is the part that is most prone to errors, particularly if the data in your online data source changes. We cannot be sure that the values we specified in the pivot operator will remain in the database until we create this pivot table next time.

For instance, in our script, we specified London, Liverpool, Leeds and Manchester as values for the headings of our pivot table. These values existed in the City column of the student table. What if somehow one or more of these values are deleted or updated? In such cases, null will be returned.

A better approach would be to create a dynamic query that will return a full set of values from the column from which you are trying to generate your pivot table.
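As a rough sketch of that dynamic approach (the dbo.Student table and its City and TotalScore columns here are hypothetical stand-ins for the article’s schema, and STRING_AGG requires SQL Server 2017 or later; older versions would build the list with FOR XML PATH instead):

-- Build the pivot column list from whatever city values exist right now.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

SELECT @cols = STRING_AGG(QUOTENAME(City), ', ')
FROM (SELECT DISTINCT City FROM dbo.Student) AS c;

-- Inject the column list into the PIVOT query and run it.
SET @sql = N'
SELECT ' + @cols + N'
FROM (SELECT City, TotalScore FROM dbo.Student) AS src
PIVOT (SUM(TotalScore) FOR City IN (' + @cols + N')) AS p;';

EXEC sys.sp_executesql @sql;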

Click through to see how to build this.
