Press "Enter" to skip to content

Curated SQL Posts

AWS Glue Now Supports Scala

Mehul Shah, et al, announce that AWS Glue officially supports Scala:

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

Functional languages tend to be very good for ETL tasks, and Scala is a great choice due to its relationship with Spark.

Comments closed

Dealing With String Parsing In T-SQL

Andy Mallon has written a T-SQL function to parse file paths from strings:

Writing & reading code is easier if you understand the logic before attacking the code. I find this to be particularly important when you anticipate complicated code. SQL Server sucks at parsing strings, so I anticipate complicated code.

How do you identify the directory from a file path? That’s just everything up to the last slash–and I like to include that final slash to make it clear it’s a directory.

How do you identify the file name from a file path? That’s just everything after the final slash.

The key here is going to be identifying that final slash, and grabbing the text on either side.

Read on for the function.

Comments closed

Automatic Partition Splitting

Marlon Ribunal has a script to split partitioned tables automatically:

So, let’s pretend it’s the month of April 2017 and this is the partition currently populated. Based on the query above, aside from the current partition bucket, we also have another available bucket month for May.

Say we want to maintain 3 available buckets at any given time. The next available bucket is May, so that means we need 2 more partitions to cover for June and July.

Read on for more, including some scripts that you can automate.

Comments closed

SQL Operations Studio January Release

Alan Yu announces a new release of SQL Operations Studio:

The January release includes several major repo updates and feature releases, including:

  • Enable the HotExit feature to automatically reopen unsaved files.

  • Add the ability to access saved connections from Connection Dialog.

  • Set the SQL editor tab color to match the Server Group color.

  • Fix the broken Run Current Query command.

  • Fix the broken pinned Windows Start Menu icon.

Click through for the download link.

Comments closed

A Story Of Database Corruption

Jason Brimhall tells a tale of a corrupt database:

Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:

Msg 823, Level 24, State 2, Line 1

The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?

Definitely worth a read.

Comments closed

Microsoft R Open 3.4.3

David Smith announces Microsoft R Open 3.4.3:

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.3 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R (version 3.4.3) and updates the bundled packages (specifically: checkpointcurldoParallelforeach, and iterators) to new versions.

MRO is 100% compatible with all R packages. MRO 3.4.3 points to a fixed CRAN snapshot taken on January 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint packageto access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).

That brings Microsoft up to speed with base R.

Comments closed

Set Operations In Spark

Fisseha Berhane compares SparkSQL, DataFrames, and classic RDDs when performing certain set-based operations:

In this fourth part, we will see set operators in Spark the RDD way, the DataFrame way and the SparkSQL way.
Also, check out my other recent blog posts on Spark on Analyzing the Bible and the Quran using Spark and Spark DataFrames: Exploring Chicago Crimes.

The data and the notebooks can be downloaded from my GitHub repository.
The three types of set operators in RDD, DataFrame and SQL approach are shown below.

This is where SparkSQL (and SQL in general) shines, although the DataFrame approach is also compact.

Comments closed

Visuals I Like

I continue my series on dashboard visualization:

This leads me to a little bit of advice for choosing bars versus columns.  You will want to choose a bar chart if the following are true:

  1. Category names are long, where by “long” I mean more than 2-3 characters.
  2. You have a lot of categories.
  3. You have relatively few periods—ideally, you’ll only have one period with a bar chart.

By contrast, you would choose a column chart if:

  1. Viewing across periods is important.  For example, I want to see the number of critic reviews fluctuate across the season for each of the TV shows.
  2. You have many periods with relatively few categories.  The more periods and the fewer categories, the more likely you are to want a column chart.
  3. Category names are short, by which I mean approximately 1-3 characters.

Some people will rotate text 90 degrees to try to turn a bar chart into a column chart.  I don’t like that because then people need to rotate the page or crane their necks.  In that case, just use the bar chart.

I like Cleveland dot plots, but they’re not implemented at all in Power BI and the two add-ons in the store aren’t that great either.  Also, there’s bonus material explaining why The Punisher season 1 was better than Daredevil season 1.

Comments closed

Aggregations And Always Encrypted

Monica Rathbun finds trouble with Always Encrypted:

The real challenges started when the client began to test their application code. The first thing we hit was triggers.

The table had several insert triggers associated with the columns that were now encrypted. Since the data was now encrypted the insert triggers would fail. Again, we lucked out and they were able to recode somethings in order to remove the triggers. Of course, since troubles always come in threes, this was no different. First the constraint problem, then the triggers, then we hit the biggest road block that halted our Always Encrypted implementation.

Read on for more information about the things you cannot do with Always Encrypted, including some limitations which will eventually go away.

Comments closed