Press "Enter" to skip to content

Curated SQL Posts

Test Connection With HDInsight

I have a post trying to test a connection using HDInsight:

WebHCat is a web-based REST API for HCatalog, a management layer for dealing with files in HDFS.  If you’re looking for configuration settings for WebHCat, you’ll want generally to look for “templeton” in config files, as Templeton was the project name before WebHCat.  In Ambari, you can go to the Hive configs and look at webhcat-site.xml for configuration settings.  For WebHCat, the default port in HDInsight is 30111, which you should find in the templeton.port configuration setting.

I don’t like the fact that WebHDFS is blocked, but at least WebHCat is functional.

Comments closed

Computing Holidays

Gerald Britton has a few ways of calculating holidays for date dimensions:

Notice also that I use CROSS APPLY as an expression evaluator.  This keeps the code a little DRY-er.  In OOP-speak, I’ve encapsulated what varies.

Now that I’ve got a basic date table, let’s see about updating those holiday columns.   I’ll do it with an UPDATE command here, though it is possible to make them computed also — it just gets a little messy since there are so many different holidays with different calculations.  Let’s start with Thanksgiving.  In Canada, Thanksgiving Day is the second Monday in October.  In the US, it’s the fourth Thursday in November.  I’ve seen some interesting approaches to solving the problem.  One (for US Thanksgiving) looks like this:

I like the Easter formula.  Click through for details.

Comments closed

Run And RunOnce Registry Key Limits

Denny Cherry runs into a limit in the Run and RunOnce registry value lengths:

Microsoft has had the registry keys for Run and RunOnce in the registry since the registry was introduced in Windows 95 and Windows NT 4.  But in the 20+ years that those keys have been there (and I’ve used them for a variety of things) I’ve never known that there was a limit on the length of the commands that you could put into those keys.

I found this while working on a client project when I needed to kick off some powershell automatically when the server restarted to get it added to the domain, as well as do a few other things.  But for some reason the key just wasn’t running.

The limit does seem a bit short, though at least it’s one longer than the max length of a file path.

Comments closed

Concatenation Using CONCAT

Neil Gelder has a look at the CONCAT() function in SQL Server 2012:

This function removes all need to convert values just to concatenate text and string values together, but for some reason whenever I see code posted on various websites, even from SQL Server MVP’s they seem never to utilise this function.

One excellent use for this is when you attempting to created the SQL Statement for a dynamic SQL execution.  Instead of the usual conversions and + signs joining the text and parameters together its far simpler.

CONCAT is one of my favorite T-SQL enhancements from 2012 (after window functions).

Comments closed

Using Spark MLlib For Categorization

Taras Matyashovskyy uses Apache Spark MLlib to categorize songs in different genres:

The roadmap for implementation was pretty straightforward:

  • Collect the raw data set of the lyrics (~65k sentences in total):

    • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
    • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)

  • Train logistic regression that is the obvious selection for the classification

This is a supervised learning problem, and is pretty fun to walk through.

Comments closed

SQLite With Powershell

Phil Factor combines SQLLite, Powershell, and SQL Server:

 Although I dearly love using SQL Server, I wouldn’t use it in every circumstance; there are times, for example, when just isn’t necessary to use a Server-based RDBMS for a data-driven application. The open-source SQLite is arguably the most popular and well-tried-and-tested database ever. It is probably in your phone, and used by your browser. Your iTunes will use it. Most single-user applications that need to handle data will use SQLite because it is so reliable and easy to install.

It is specifically designed as a zero-configuration, embedded, relational database with full ACID compliance, and a good simple dialect of SQL92. The SQLite library accesses its storage files directly, using a single library, written in C, which contains the entire database system. Creating a SQLite database instance is as easy as opening a simple cross-platform file that contains the entire database instance. It requires no administration.

There’s a lot going on in this interesting article; I recommend giving it a read.

Comments closed

Spark Clusters On Spot Pricing

Sameer Farooqui explains spot pricing with respect to AWS servers:

The idea behind Spot instances is to allow you to bid on spare Amazon EC2 compute capacity. You choose the max price you’re willing to pay per EC2 instance hour. If your bid meets or exceeds the Spot market price, you win the Spot instances. However, unlike traditional bidding, when your Spot instances start running, you pay the live Spot market price (not your bid amount). Spot prices fluctuate based on the supply and demand of available EC2 compute capacity and are specific to different regions and availability zones.

So, although you may have bid 0.55 cents per hour for a r3.2xlarge instance, you’ll end up paying only 0.10 cents an hour if that’s what the going rate is for the region and availability zone.

Databricks uses spot pricing for Community Edition clusters to control costs.  Click through for a very interesting discussion of spot pricing and how they take advantage of it.

Comments closed

Maximum Temperatures With Spark Languages

Praveen Sripati has a two-part series on getting aggregates by year in various Spark languages.  In part one, he looks at Python:

Hadoop – The Definitive Guide revolves around the example of finding the maximum temperature for a particular year from the weather data set. The code for the same is here and the data here. Below is the Spark code implemented in Python for the same.

In part 2, he looks at Spark SQL:

In the previous blog, we looked at how find out the maximum temperature of each year from the weather dataset. Below is the code for the same using Spark SQL which is a layer on top of Spark. SQL on Spark was supported using Shark which is being replaced by Spark SQL.Here is a nice blog from DataBricks on the future of SQL on Spark.

There’s no Scala example here, but it’s pretty straightforward as well.

Comments closed

Spark And .NET

Bharath Venkatesh shows how to make Spark calls using the .NET ODBC driver:

Prerequisite

Before you begin, you must have the following:

Check it out.  Using Spark on .NET is pretty easy.

Comments closed

Price Optimization Using Decision Trees

Bernard Antwi Adabankah uses a decision tree to model price changes:

The sample included N = 262 individual orders for Interlocking Hearts Design Cake Knife/Server set with OrderItemSKU as 2401 from the period ranging from 1st March 2014 to 20th April 2016 with an ecommerce company which sells on Amazon.co.uk

The Profit response variable is measured as the product sale price on amazon.co.uk which includes amazon.co.uk commission and any applicable postage costs less the purchase price of the Hearts Design Cake Knife/Server set from the supplier.

Read the comments for a couple good critiques of the article.

Comments closed