Month: February 2018

Visual Studio Code In Anaconda 5.1

George Leopold reports that Anaconda 5.1 will now include Visual Studio Code as an optional IDE:

Microsoft and Python data science platform vendor Anaconda have extended their partnership by adding the software giant’s code editor to the latest Anaconda distribution.

The addition of Microsoft’s Visual Studio Code (VS Code) expands its support for the latest release of the Python data science platform, Anaconda 5.1. The Python platform has attracted more than 4.5 million users running the programming language on Windows, Mac and Linux.

Along with editing and debugging features, the partners said the cross-platform code editor includes custom features for Anaconda users. For example, a Python extension customizes VS Code for the Python development environment.

Read on for more information.

Streaming ETL In Practice Using KSQL

Robin Moffatt builds an example of streaming ETL using Oracle, GoldenGate, and Kafka:

So in this post I’m going to show an example of what streaming ETL looks like in practice. I’m replacing batch extracts with event streams, and batch transformation with in-flight transformation of these event streams. We’ll take a stream of data from a transactional system built on Oracle, transform it, and stream the results into Elasticsearch, though your choice of datastore is up to you: with Kafka’s Connect API you can stream the data to almost anywhere! Using KSQL we’ll see how to filter streams of events in real-time from a database, how to join between events from two database tables, and how to create rolling aggregates on this data.

It’s a very useful example.
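
To make the three operations in the quote concrete, here is a minimal sketch in plain Python rather than KSQL. It only illustrates the ideas of filtering, joining, and windowed aggregation on a stream; the event shape, the function names, and the choice of a fixed (tumbling) window are all my assumptions and do not come from Robin's post.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp_seconds, customer_id, order_total)
Event = Tuple[int, str, float]


def filter_large_orders(events: Iterable[Event], minimum: float) -> Iterator[Event]:
    """Pass through only events whose total meets the threshold, one at a time."""
    for event in events:
        if event[2] >= minimum:
            yield event


def join_customer_names(
    events: Iterable[Event], customers: Dict[str, str]
) -> Iterator[Tuple[int, str, str, float]]:
    """Enrich each event with a name from a lookup table keyed by customer id."""
    for ts, customer_id, total in events:
        yield ts, customer_id, customers.get(customer_id, "unknown"), total


def tumbling_totals(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[int, str], float]:
    """Sum order totals per customer within fixed, non-overlapping time windows."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for ts, customer_id, total in events:
        window_start = ts - (ts % window_seconds)
        totals[(window_start, customer_id)] += total
    return dict(totals)
```

The generators process one event at a time, which is the in-flight part; a real streaming engine would also emit window results continuously instead of returning a dictionary at the end.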

Automating HDF Cluster Deployment

Ali Bajwa has a how-to guide for automating HDF 3.1 cluster deployment on AWS:

The release of HDF 3.1 brings a significant number of improvements: Apache NiFi 1.5, Kafka 1.0, plus the new NiFi Registry. In addition, there were improvements to the Storm, Streaming Analytics Manager, and Schema Registry components. This article shows how you can use the ambari-bootstrap project to easily generate a blueprint and deploy HDF clusters to either single-node or development/demo environments in 5 easy steps. To quickly set up a single-node cluster, a prebuilt AMI is available for AWS as well as a script that automates these steps, so you can deploy the cluster in a few commands.

Click through for the installation guide.

SSAS Query Analyzer

Chris Webb reviews Analysis Services Query Analyzer:

Last week a new, free tool for analysing the performance of MDX queries on SSAS Multidimensional was released: Analysis Services Query Analyzer. You can get all the details and download it here:

https://ssasqueryanalyzer.github.io/

…and here’s a post on LinkedIn by one of the authors, Francesco De Chirico, explaining why he decided to build it:

https://www.linkedin.com/pulse/asqa-10-released-francesco-de-chirico/

I’ve played around with it a bit and I’m very impressed – it’s a really sophisticated and powerful tool, and one I’m going to spend some time learning because I’m sure it will be very useful to me.

Read on for the rest of Chris’s review, including product screenshots.

Installing Jupyter Notebook Kernels

Nigel Meakins continues his Jupyter series by showing how to install various kernels:

Jupyter-Scala

This can be downloaded from here. Unzip and run the jupyter-scala.ps1 script on Windows using elevated permissions in order to install.

The kernel files will end up in <UserProfileDir>\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of ‘Scala (develop)’. You can of course change this in the respective kernel.json file.

Click through to see how to install a few other kernels with various levels of configuration.
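
As a rough sketch of the renaming step mentioned in the quote: a kernel's label in Jupyter comes from the `display_name` field of its kernel.json file, so changing it is a small JSON edit. The helper name `rename_kernel` is my own, and you would pass it the path to your own kernel.json.

```python
import json
from pathlib import Path


def rename_kernel(kernel_json: Path, new_name: str) -> str:
    """Set display_name in a kernel.json file and return the previous name."""
    spec = json.loads(kernel_json.read_text(encoding="utf-8"))
    old_name = spec.get("display_name", "")
    spec["display_name"] = new_name
    # Write the whole spec back so the other fields (argv, language) are preserved.
    kernel_json.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return old_name
```

Editing the file by hand works just as well; the function is only handy if you have several kernels to relabel.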

Changing Int To Bigint

Danny Kruge shows one way to change a table’s identity value from integer to bigint:

The table was around 500GB with over 900 million rows. Based on the average number of inserts a day on that table, I estimated that we had eight months before inserts on that table would grind to a halt. This was an order entry table, subject to round-the-clock inserts due to customer activity. Any downtime to make the conversion to BIGINT was going to have to be minimal.

This article describes how I planned and executed a change from an INT to a BIGINT data type, replicating the process I used in a step by step guide for the AdventureWorks database. The technique creates a new copy of the table, with a BIGINT datatype, on a separate SQL Server instance, then uses object level recovery to move it into the production database.

There’s a way to do this without any downtime, though the trigger logic gets a little more complex and it does take longer.
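
An estimate like the eight months in the quote comes down to headroom arithmetic: how many identity values remain below the INT ceiling of 2,147,483,647, divided by the daily insert rate. Here is a minimal sketch; the function name is mine, and the numbers in the usage note are invented for illustration, not Danny's actual figures.

```python
INT_MAX = 2**31 - 1  # largest value a SQL Server INT can hold: 2,147,483,647


def days_until_exhaustion(
    current_identity: int, inserts_per_day: int, increment: int = 1
) -> int:
    """Estimate whole days left before an INT identity column runs out of values."""
    remaining_values = (INT_MAX - current_identity) // increment
    return remaining_values // inserts_per_day
```

With a hypothetical current identity value of 1,900,000,000 and 1,000,000 inserts per day, `days_until_exhaustion(1_900_000_000, 1_000_000)` returns 247, or roughly eight months. Note that the current identity value can run well ahead of the row count if rows have been deleted or inserts rolled back, so check the identity itself.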

Looking Up Managers In AD Using PowerShell

Jana Sattainathan shows how to use PowerShell to look up a group of Active Directory users’ managers:

Today, I received a request to find the manager for a whole bunch of users. This was a list of names (not UserId’s) in an Excel worksheet.

It is not actually that complex to do it:

  1. Locate the AD user based on the name

  2. Check the Manager property

  3. Lookup AD again for Manager to get the name

Click through for the script. This does, of course, assume that the information is already in Active Directory somewhere.
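
The three steps in the quote amount to two lookups, because the manager attribute in Active Directory stores the manager's distinguished name rather than a display name. The sketch below shows that logic in Python against a hypothetical in-memory dictionary standing in for the directory. It does not talk to a real AD server, and every name and entry in it is invented.

```python
from typing import Dict, Optional

# Hypothetical stand-in for a directory: distinguished name -> attributes.
DIRECTORY: Dict[str, Dict[str, str]] = {
    "CN=Ann Lee,OU=Staff,DC=example,DC=com": {
        "name": "Ann Lee",
        "manager": "CN=Raj Patel,OU=Staff,DC=example,DC=com",
    },
    "CN=Raj Patel,OU=Staff,DC=example,DC=com": {"name": "Raj Patel"},
}


def find_dn_by_name(name: str) -> Optional[str]:
    """Step 1: locate the user's distinguished name from a display name."""
    for dn, attributes in DIRECTORY.items():
        if attributes["name"] == name:
            return dn
    return None


def manager_name(user_name: str) -> Optional[str]:
    """Steps 2 and 3: read the manager DN, then look it up to get a name."""
    user_dn = find_dn_by_name(user_name)
    if user_dn is None:
        return None  # no such user
    manager_dn = DIRECTORY[user_dn].get("manager")
    if manager_dn is None:
        return None  # user has no manager recorded
    manager_entry = DIRECTORY.get(manager_dn)
    return manager_entry["name"] if manager_entry else None
```

The two `None` branches matter in practice: names from a spreadsheet will not always match a directory entry, and not every account has a manager filled in.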

Changing The Default Filegroup

Kenneth Fisher shows how you can change the default filegroup:

You know you can have multiple filegroups, right? You might have a separate filegroup for the data (the clustered index & heaps) and another for the indexes (non-clustered indexes). Or maybe you want to separate your data tables from the system tables. There are any number of reasons why you might want to have multiple filegroups, however, there will always be a primary filegroup and it will always be the default if you don’t specify otherwise. Right? Wrong.

I’ve never seen a way to remove primary or to move the system objects in it. However, you can change the default filegroup.

Having a separate filegroup for your tables and another for indexes (or splitting things up some other way) can help get a database back online faster, as you can restore the system tables first and then restore filegroups as needed.

Using Group-Managed Service Accounts With SQL Server

Wayne Sheffield has a post on using gMSA with SQL Server:

A gMSA is an sMSA [standalone managed service account] that can be used across multiple devices, and where the Active Directory (AD) controls the password. PowerShell is used to configure a gMSA on the AD. The specific computers that it is allowed to be used on are configured using some more PowerShell commands. The AD will automatically update the password for the gMSA at the specified interval – without requiring a restart of the service! Because the AD automatically manages the password, nobody knows what the password is.

Not all services support a gMSA – but SQL Server does. During a SQL Server installation you can specify the gMSA account. The SQL Server Configuration Manager (SSCM) tool can be used to change an existing SQL Server instance to use a gMSA. After entering the gMSA account you simply do not enter a password. The server automatically retrieves the password from the AD.

This is a nice way of improving service account security in a scenario where, for example, you can’t or don’t want to use virtual service accounts.

Loops Versus Apply: Speed Comparison

Mike Spencer compares lapply (single core and its multi-core version) versus a for loop in R:

But how fast were they? Can we get faster? Thankfully R provides `system.time()` for timing code execution. In order to get faster, it makes sense to use all the processing power our machines have. The ‘parallel’ library has some great tools to help us run our jobs in parallel and take advantage of multicore processing. My favourite is `mclapply()`, because it is very very easy to take an `lapply` and make it multicore. Note that mclapply doesn’t work on Windows. The following script runs the `read_clean_write()` function in a for loop (boo, hiss), lapply and mclapply. I’ve run these as list elements to make life easier later on.

It’s interesting reading, particularly because I had expected lapply to do a little bit better. Also interesting is the relative overhead cost of mclapply in this scenario: going from 1 core to 4 cut the time to approximately 1/3, not 1/4.
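
One way to read the 1/3 versus 1/4 result is through Amdahl's law, which models speedup as 1 / ((1 - p) + p / n) for a parallelizable fraction p on n cores. This is a simplified model and my framing rather than Mike's; the real overhead may come from process forking or disk I/O instead of a strictly serial portion of the work. The sketch below solves the formula for p.

```python
def parallel_fraction(speedup: float, cores: int) -> float:
    """Solve Amdahl's law, speedup = 1 / ((1 - p) + p / cores), for p."""
    return (1 - 1 / speedup) / (1 - 1 / cores)


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / cores
```

A 3x speedup on 4 cores gives an efficiency of 0.75 and, under this model, a parallel fraction of 8/9, or about 89% of the job. If that fraction held, more cores would see diminishing returns, with the speedup capped at 9x.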
