Press "Enter" to skip to content

Author: Kevin Feasel

Azure ML To Python

Koos van Strien “graduates” from Azure ML into Python:

Python is often used in conjunction with the scikit-learn collection of libraries. The most important libraries used for ML in Python are grouped inside a distribution called Anaconda. This is the distribution that’s also used inside Azure ML. Besides Python and scikit-learn, Anaconda contains all kinds of Data Science-oriented packages. It’s a good idea to install Anaconda as a distribution and use Jupyter (formerly IPython) as your development environment: Anaconda gives you almost the same environment on your local machine as the one your code will run in once in Azure ML. Jupyter gives you a nice way to keep code (in Python) and documentation (in Markdown) together.

Anaconda can be downloaded from https://www.continuum.io/downloads.

If you’re going down this path, Anaconda is absolutely a great choice.
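If you do install it, a quick sanity check from PowerShell afterward might look like the following (a sketch, assuming Anaconda's Scripts directory ends up on your PATH):

conda list scikit-learn   # confirm scikit-learn is present in the distribution
python --version          # confirm which Python the shell resolves to
jupyter notebook          # launch the Jupyter environment in the browser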

PowerShell Failing On Error

Michael Bourgon has a helpful tip for CmdExec SQL Agent jobs running PowerShell scripts that won’t fail on error:

However, when run via SQL Agent, it succeeds.  GAH!
I tried 50 different variations: modifying the script, various TRY..CATCH blocks found on the internet. Nothing. Every single one of them succeeded.

Then I remembered that, by default, errors always continue ($ErrorActionPreference = "Continue"). So I added this line at the top:

Read on for the answer.
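In the meantime, here is the general shape of the pattern (my sketch, not necessarily Bourgon's exact script): make errors terminating and hand SQL Agent a nonzero exit code so the step actually fails.

$ErrorActionPreference = "Stop"   # make non-terminating errors terminating

try {
    # Placeholder for the real work the job step performs.
    Get-Content -Path "C:\jobs\input.txt" | Out-Null
}
catch {
    Write-Error $_
    exit 1   # a nonzero exit code is what makes the CmdExec step fail
}
exit 0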

Automatic Approval For Data Lake Analytics

Yan Li reports that Azure Data Lake Analytics no longer requires waiting for approval:

We’re happy to announce that we’ve made it much faster to get started with the Data Lake Store and Analytics services starting today. Before today, when you tried to sign up for these services you had to go through an approval process that introduced a delay of at least one hour.

Now, you no longer have to wait for approval, and you can simply create an account immediately.

Yan also has some “getting started” links to help you out, now that you don’t have to wait for an account.
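If you want to try it from PowerShell, a sketch using the AzureRM cmdlets of the era would look something like the following (resource group, account names, and region are placeholders):

Login-AzureRmAccount   # interactive sign-in

$rg     = "adla-demo-rg"     # placeholder resource group
$region = "East US 2"        # placeholder region

New-AzureRmResourceGroup -Name $rg -Location $region

# An Analytics account needs a default Data Lake Store account behind it.
New-AzureRmDataLakeStoreAccount -ResourceGroupName $rg -Name "adlsdemostore" -Location $region
New-AzureRmDataLakeAnalyticsAccount -ResourceGroupName $rg -Name "adlademoacct" `
    -Location $region -DefaultDataLakeStore "adlsdemostore"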

Counting Without Counts

Ewald Cress discusses the breferences member:

I’ll spare you my false starts, but I think I finally have it. The first observation is that, on the occasions breferences increments, it does not increment linearly, but instead has an exponential growth pattern. These increments take it through the sequence 0, 1, 3, 7, 15, 31, 63, 127, 255, etc. Or in binary: 0, 1, 11, 111, 1111, 11111, 111111, 1111111, 11111111…

Those numbers can be seen as off-by-one variations of powers of two. Forget the offset, and think of the number as simply doubling on each increment if it keeps your head clearer – instead of accuracy, we have an order-of-magnitude reference count.

I’d never heard of an algorithm like this, although that could be due to my having dealt with relatively little low-level structural code.  I’m glad Ewald sussed out the mechanics driving breferences.
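To make the growth pattern concrete, here is a toy order-of-magnitude counter (my own illustration, not the actual sqlmin internals): each recorded bump takes the stored value v to 2v + 1.

# Toy order-of-magnitude counter: each recorded bump doubles (plus one),
# walking through 0, 1, 3, 7, 15, 31, 63, 127, ...
function Step-MagnitudeCounter {
    param([long]$Current)
    (2 * $Current) + 1
}

$v = 0
1..8 | ForEach-Object {
    $v = Step-MagnitudeCounter -Current $v
    Write-Output $v   # in the real structure, most references never trigger a bump
}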

Immutable Servers

Diana Tkachenko describes a pattern for reducing “prod doesn’t look like stage” types of errors:

The immutable server pattern makes use of disposable components for everything that makes up an application that is not data. This means that once the application is deployed, nothing changes on the server – no scripts are run on it, no configuration is done on it. The packaged code and any deploy scripts are essentially baked into the server. No outside process is able to modify the contents after the server has been deployed. For example, if you were using Docker containers to deploy your code, everything the application needs would be in the Docker image, which you then use to create and run a container. You cannot modify the image once it’s been created, and if any changes do need to take place, you would create a new image and work with that one.

In our case, we use Amazon Machine Images (AMIs) to accomplish the same thing. We make heavy use of Amazon Linux machines, which are Red Hat-based, and thus package the code into RPMs. The RPMs define all the dependencies for running the application, the code itself, and any startup scripts to run on bootup. The RPM is then installed on a clean base image of Amazon Linux, and an image is taken, resulting in an AMI. This AMI is synonymous with “immutable server” – it cannot be changed once it is created. The AMI is then deployed into an Auto Scaling Group (ASG) and attached to the Elastic Load Balancer (ELB). In this post, I’ll guide you through a closer look at every step of this immutable server deploy pipeline. I’ll then go into how and why we embedded planned failures into this system. At the end, I’ll share the insights we’ve gained into the pros and cons of deploying in this way.

This is a very interesting concept.  I’ve heard of no-patch servers (where, instead of patching live servers, you spin up a new VM with the operating system updates and spin down the old one), but this takes the idea one step further.
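For flavor, the “bake an image” step might look roughly like this with the AWS Tools for PowerShell (the instance ID and names are placeholders, and this skips the RPM build and install entirely):

Import-Module AWSPowerShell

$instanceId = "i-0123456789abcdef0"   # placeholder: a freshly built box with the app RPM installed

# Bake the configured instance into an AMI; from here on, the image never changes.
$amiId = New-EC2Image -InstanceId $instanceId `
    -Name "myapp-1.0.0" `
    -Description "Immutable image for myapp 1.0.0"

Write-Output "Created AMI: $amiId"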

Data Masking And Row-Level Filtering In Hadoop

Syed Mahmood and Srikanth Venkat discuss two security features in Apache Ranger:

Dynamic data masking via Apache Ranger enables security administrators to ensure that only authorized users can see the data they are permitted to see, while for other users or groups the same data is masked or anonymized to protect sensitive content. The process of dynamic data masking does not physically alter the data or make a copy of it. The original sensitive data also does not leave the data store; rather, the data is obfuscated when presented to the user. Apache Ranger 0.6, included with HDP 2.5, introduces a new type of authorization policy called “Masking Policy” that can be used to define which specific data fields are masked and what the rules are for anonymizing or pseudonymizing the specific data. For example, a security administrator may choose to mask credit card numbers when displayed to customer service personnel, such that only the last four digits are rendered in the form of XXXX-XXXX-XXXX-0123. The same would be true of sensitive data such as social security numbers or email addresses that are masked to be rendered in different formats based on data masking rules.

This is part one of a two-part series; part two will dig into the technical details.  I have to wonder if Ranger’s dynamic data masking is as easy to circumvent as SQL Server’s.
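As a toy illustration of that credit card rule (my own sketch; Ranger defines these rules in masking policies, not in client code):

function Hide-CardNumber {
    param([string]$CardNumber)
    # Strip non-digits, keep only the last four, mask the rest.
    $digits = $CardNumber -replace '\D', ''
    "XXXX-XXXX-XXXX-$($digits.Substring($digits.Length - 4))"
}

Hide-CardNumber -CardNumber "4111-1111-1111-0123"   # -> XXXX-XXXX-XXXX-0123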

Collecting Registry Data

Slava Murygin shows how to use PowerShell to read the Registry:

Do you use third-party tools to document the state of your SQL Server?
If not, this script is for you!

First, you will know what your SQL Server is up to.
Second, it can be your baseline document, against which you can compare the current SQL Server state over time.
Third, it is a priceless piece of documentation!!! (I mean FREE!!!) which you can put in a folder and report to your boss.

Registry settings are a good part of a baseline, particularly a security baseline.
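Not Slava's script, but a minimal sketch of the idea, reading the setup keys for a default instance (the key paths below are the standard ones; adjust the instance name for your build):

$regRoot = "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server"

# Resolve the instance ID (e.g. MSSQL13.MSSQLSERVER) for the default instance...
$instanceId = (Get-ItemProperty "$regRoot\Instance Names\SQL").MSSQLSERVER

# ...then pull a few of its setup values for the baseline document.
Get-ItemProperty "$regRoot\$instanceId\Setup" |
    Select-Object SQLPath, SQLDataRoot, Version, Edition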

Finding Destinations In SSISDB

Bill Fellows has a script to figure out the name of that table throwing errors upon insertion:

There is a rich set of tables and views available in the SSISDB that operate as a flight recorder for SSIS packages as they execute. Markus Ehrenmüller (t) had a great question in Slack: in short, can you figure out what table is being used as a destination? I took a few minutes to slice through the tables to see if I could find it.

If it’s going to be anywhere, it looks like you can find it in catalog.event_message_context.

If someone is using an OLE DB Destination with the “Table or view” or “Table or view – fast load” settings, the name of the table will be in the event_message_context table. If they are using a variable name, then it’s going to be trickier.

Read on for the script.
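Not Bill's script, but a rough sketch of the sort of query involved (run here via Invoke-Sqlcmd), assuming the destination's table name surfaces in the component's OpenRowset property:

$query = @"
SELECT emc.package_path, emc.property_name, emc.property_value
FROM SSISDB.catalog.event_message_context AS emc
WHERE emc.property_name = N'OpenRowset';
"@

# Assumes the SQL Server PowerShell module; the server name is a placeholder.
Invoke-Sqlcmd -ServerInstance "localhost" -Query $query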

OR Filters With Excel Cube Functions

Chris Webb shows how to build an OR filter in Power Pivot:

Now, imagine that you want a report with the Sales Amount measure on columns and Years on rows, and you want to filter the data so that you only see values for Mondays in July or Wednesdays in September. Using the Fields, Items and Sets functionality you could filter the data to only show the day/month combinations you need for each year, but since you can’t put a named set into the Filter area of a PivotTable, you would have to use Excel formulas to sum up the combinations to get the totals you need.

I’ve never used Excel’s Cube Functions.  Looks like something to add to the to-learn pile.
