Press "Enter" to skip to content

Author: Kevin Feasel

SKLearn To Azure ML

David Crook shows how to build a model using Python’s SciKit library and then operationalize it in Azure ML:

Why Model Outside Azure ML?

Sometimes you run into things like various limitations, speed, data size or perhaps you just iterate better on your own workstation.  I find myself significantly faster on my workstation or in a jupyter notebook that lives on a big ol’ server doing my experiments.  Modelling outside Azure ML allows me to use the full capabilities of whatever infrastructure and framework I want for training.

So Why Operationalize with Azure ML?

AzureML has several benefits such as auto-scale, token generation, high speed python execution modules, api versioning, sharing, tight PaaS integration with things like Stream Analytics among many other things.  This really does make life easier for me.  Sure I can deploy a flask app via docker somewhere, but then, I need to worry about things like load balancing, and then security and I really just don’t want to do that.  I want to build a model, deploy it, and move to the next one.  My value is A.I. not web management, so the more time I spend delivering my value, the more impactful I can be.

Read the whole thing.

Comments closed

Renaming Servers

Steve Jones shows how to rename a server using Powershell:

I’m sure many people are experimenting with VMs and SQL Server. If you’re like me, many of you just default to installing Windows 7/10 or Windows Server xx Standard for your testing. Those systems work fine, but I’ve been trying to build slimmer systems, which means looking at Server Core. Installing Server Core is much the same as other versions, though you end up with only a command line. If you’re like me, using VMWare, you also might end up with a server name like “WIN-LKR3R4FfL5T”.

I want to change that. It’s a fine name if I’m working locally. It’s not to much fun connecting across a network. This post looks at how to rename that machine.

This is probably a good idea to do before installing any major software.  Renaming a server under SQL Server is possible, but there are a few extra steps to the process.

Comments closed

SQL Server Connectivity In VS Code

Max Trinidad looks at the VS Code plugin for SQL Server connectivity:

After installing, we need to customized their setting by creating connection(s) to our SQL Server. We do this by opening VS Code “User Preferences” and under “Default Settings.json” we search for the “vscode-mssql” settings to be copied over to our working folder “settings.json” file.

I played with this very early on and would like to see it continue to be developed, but it’s no replacement for Management Studio.

Comments closed

Git Introduction

Allison Tharp has an introduction to Git:

Git is a version control system (VCS), which is just what it sounds like: a system to help keep track of different versions of software.  Git isn’t the only VCS out there (others include CVS, SVN, and Fossil), but it is one of the more popular systems, particularly for open source projects.  You’ve certainly used software that was developed using Git (Firefox and Chrome are two big ones!).

Version control is really helpful when you are working with other people.  Without version control, if I send you a file I’m working on and you make changes to it, we would suddenly have two versions.  If I integrate your changes into my file, then we’d only have one file but no history!  Even when working alone, version control is really helpful for us to keep track of how the project is moving along.

Understanding at least one source control platform is vital for software development.  Git can be like pulling teeth (and then there are the times when it gets really painful), but if you are developing software (even personal scripts!) and don’t have source control in place, you’re walking a tightrope without a net.

Comments closed

Correlated Datetime Columns

Grant Fritchey covers a concept I’d never heard of:

Correlated Datetime Columns works. Clearly it’s not something you’re going to enable on all your databases. Probably most of your databases don’t have clustered indexes on datetime columns let alone enough tables with correlation between the data stored in them. However, when you do have that type of data correlation, enabling Correlated Datetime Columns and ensuring you have a clustered index on the datetime column is a viable tuning mechanism. Further, this is a mechanism that has been around since 2005. Just so you know, I did all my testing in SQL Server 2016, so this something that anyone in the right situation can take advantage of. Just remember that TANSTAAFL always applies. Maintaining the statistics needed for the Correlated Datetime Columns is done through materialized views that are automatically created through the optimization process. You can see the views in SSMS and any queries against the objects. You’ll need to take this into account during your statistics maintenance. However, if Correlated Datetime Columns is something you need, this is really going to help with this, fairly narrow, aspect of query tuning.

I don’t know that I’ll ever do this, but it’s worth filing away just in case.

Comments closed

Whither Hadoop?

Kashif Saiyed looks at recent trends in Hadoop:

  • 2016 and beyond – this is an interesting timing for “Big Data”. Cloudera’s valuation has dropped by 38%. Hortonwork’s valuation has dropped by almost 40%, forcing them to cut the professional services department. Pivotal has abandoned its Hadoop distribution, going to market jointly with Hortonworks. What happened and why? I think the main driver of this decline is enterprise customers that started adoption of technology in 2014-2015. After a couple of years playing around with “Big Data” they has finally understood that Hadoop is only an instrument for solving specific problems, it is not a turnkey solution to take over your competitors by leveraging the holy power of “Big Data”. Moreover, you don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale. All of this has caused a big wave of priorities re-evaluation by enterprises, shrinking their investments into “Big Data” and focusing on solving specific business problems.

There are some good points around product saturation and a general skills shortage, but even if you look at it pessimistically, this is a product with 30% market penetration, and which is currently making the move from being a large batch data processing product to a streaming + batch processing product.

Comments closed

Converting CSV To Parquet

Praveen Sripati shows how to use Spark Dataframes to convert a CSV file into a Parquet format:

In the previous blog, we looked at on converting the CSV format into Parquet format using Hive. It was a matter of creating a regular table, map it to the CSV data and finally move the data from the regular table to the Parquet table using the Insert Overwrite syntax. In this blog we will look at how to do the same thing with Spark using the dataframes feature.

Most of the code is basic setup; writing to Parquet is really a one-liner.

Comments closed

Comparing File Formats In Hadoop

Andrew Peterson points out performance comparisons for various Hadoop file formats:

According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity.  (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

There are several considerations around picking the correct file format, and it’s probably best to experiment with them in your specific environment.

Comments closed

Returning Defult Rows

Christopher Huntley wants to return a default record when there are no results:

Or if you’re ready to take it to the ╰[ ⁰﹏⁰ ]╯level then change the column to NChar and use the hex of your favorite emoji like:

DECLARE @testtable1 TABLE (
testid int identity (1,1),
testvalue nchar (255))

–use the below for the final query

SELECT
ISNULL((SELECT testvalue from @testtable1 where testvalue > 101), NCHAR(0xD83D)+ NCHAR(0xDE20) ) as testvaluethatworks

There are a few other alternatives, such as loading results into a temp table and inserting a default row if the temp table is empty.

Comments closed