Author: Kevin Feasel

HDFS Federation

Published 2017-11-24 by Kevin Feasel

Sangeeta Gulia explains what HDFS Federation is and how it differs from classic HDFS:

HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling a generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of an HDFS cluster to new implementations and use cases.

NameNodes are federated; that is, the NameNodes work independently and don’t require any coordination with each other.

It’s one way to reduce the number of potential single points of failure in a Hadoop environment.

Comments closed

Pandas Basics

Published 2017-11-24 by Kevin Feasel

Kevin Jacobs has a tutorial on Python’s Pandas library:

There are a few things worth mentioning. Often, Pandas is abbreviated as pd (like NumPy, which is often abbreviated as np). If you look at other code, you will see that DataFrames are often named df. Here, the DataFrame is constructed using data from a list of lists. The columns argument specifies the keys of the data.

This is a high-level intro, but helps you get your feet wet if you’ve not played with the library.
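
As a minimal sketch of what the excerpt describes (the sample rows and column names here are made up for illustration and are not taken from Kevin Jacobs's tutorial), building a DataFrame from a list of lists looks roughly like this:

```python
import pandas as pd

# Hypothetical sample data: each inner list is one row.
rows = [
    ["Alice", 34, "Raleigh"],
    ["Bob", 29, "Durham"],
    ["Carol", 41, "Cary"],
]

# The columns argument supplies the column labels (the "keys" of the data).
df = pd.DataFrame(rows, columns=["name", "age", "city"])

print(df)
print(df["age"].mean())  # average of the age column
```

Each inner list becomes a row, and the labels passed to columns line up positionally with the values in each row.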

A Look At Azure SQL Database Price By Tier

Published 2017-11-24 by Kevin Feasel

Arun Sirpal puts together a comparison of Azure SQL Database prices (in GBP) by service tier:

Hopefully this paints a picture for you. I will have my say though. Basic tier database is something that you should NOT be using for production workloads; it’s quite obvious with the 2GB limit but worth reinforcing the point. Standard tier is more for your common workloads and premium is designed for high transactional volume where I/O performance is much more important to you. My hunch is that maybe they are utilizing SSDs? I am not sure but the premium costs are much higher.

There is another service tier called Premium RS (in preview). My understanding is that performance is similar to that of Premium HOWEVER it is only useful for workloads that can tolerate up to 5 minutes of data loss due to service failures. I will probably not use this for production but then again it seems to be nearly half the cost of premium. Choices choices choices.

Two notes with this one. First, these are prices as of when Arun put together the notes; they will probably fluctuate over time. Second, there might be differences in prices by data center. At the very least, though, this gives you an idea of the relative price spread.

Comments closed

Getting Distinct Dimension Count Based On A Filtered Measure

Published 2017-11-24 by Kevin Feasel

Gogula Aryalingam shows us a neat trick in Power BI:

Enthusiastic as we were, one of the hardest nuts to crack, though it seemed so simple during requirements gathering, was to perform a distinct count of a dimension based on a filtered measure on a couple of the reports. To sketch it up with some context; you have products, several more dimensions, and a whole lot of measures including one called Fulfillment (which was a calculation based on a couple of measures from two separate tables). The requirement was to get a count of all those products (that were of course filtered by other slicers on the Power BI report) wherever Fulfillment was less than 100%, i.e. the number of products that had not reached their targets.

Simple as the requirements seemed, the hardest part in getting it done was our limited knowledge of DAX, specifically, knowing which function to use. We first tried building the data model itself, but our choice of DAX formulae, and the number of records we had (50 million+), soon saw us running out of memory in seconds on a 28GB box; not too good, given the rest of the model didn’t even utilize more than half the memory.

Click through for the answer.
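
To make the requirement concrete, here is a rough sketch of the same logic in pandas rather than DAX. This is not Gogula's solution; the table, the column names, and the simplification of Fulfillment to delivered divided by target from a single table are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical fact rows: several rows per product.
facts = pd.DataFrame(
    {
        "product": ["A", "A", "B", "B", "C"],
        "delivered": [50, 50, 30, 20, 10],
        "target": [40, 60, 40, 40, 10],
    }
)

# Evaluate the measure per product first (aggregate, then divide)...
per_product = facts.groupby("product")[["delivered", "target"]].sum()
per_product["fulfillment"] = per_product["delivered"] / per_product["target"]

# ...then count the products whose fulfillment falls below 100%.
below_target = per_product[per_product["fulfillment"] < 1.0]
count_below = len(below_target)

print(count_below)
```

The key point is the order of operations: the measure has to be evaluated once per product before the filter and the count are applied, rather than row by row.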

Improving Code Quality With SonarQube

Published 2017-11-24 by Kevin Feasel

Samir Behara has a quick look at SonarQube, an open source static analysis engine:

In my project, we have also integrated SonarQube with our TFS CI/CD build and have configured the Quality Gates.

For example – If I try to inject a security threat or a known coding issue — the TFS build will fail, the check in will get rejected, the quality gate fails and SonarQube points me to the exact issue – which I can rectify and do another check-in. So it will basically stop you from checking in code with potential issues.

Currently the only way to catch such issues is during manual code reviews. SonarQube will help in automating that process. You can write your own rules to look for known issues in the code and stop it before the code gets checked in to source control.

So overall you can ensure good quality code going to Production and fewer regression defects coming up at a later point in time.

Read on for an example where a SonarQube rule can find a SQL injection vulnerability and thereby fail the build.

Backups Are Faster With SQL Server 2017

Published 2017-11-24 by Kevin Feasel

Parikshit Savjani explains how the SQL Server team was able to use indirect checkpoints to improve backup performance:

In an RDBMS, whenever tables get larger, one of the techniques to tune and optimize the scans on the tables is partitioning them. With indirect checkpoints, we do the same.

In indirect checkpoint, for every database which has target_recovery_time set, a dirty page manager and dirty page list is created. The dirty page list is further partitioned by scheduler, allowing the dirty page tracking to scale further. This decouples the dirty page scan for a given database from the size of the buffer pool and allows the scan to scale and be much faster than the automatic checkpoint algorithm.

As Bob Dorr mentions in his blog here, a new database creation process in SQL Server 2016 requires only 250 buffers to scan as opposed to 500 million buffers with the former algorithm. This is the rationale for making indirect checkpoint the default: it is a much more scalable algorithm for tracking dirty pages in the buffer pool than automatic checkpoints.

Read on to see how this technology led to faster backups.
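
To illustrate why a per-database dirty page list scales better than a full buffer scan, here is a toy model in Python. It is only a sketch of the idea in the excerpt, not how SQL Server is implemented; the class, its methods, and the modulo rule for assigning a buffer to a scheduler partition are all invented for illustration.

```python
from collections import defaultdict


class ToyBufferPool:
    """Toy model: compares a full buffer scan with a per-database dirty list."""

    def __init__(self, buffer_count, scheduler_count):
        # dirty_owner[i] holds the database name if buffer i is dirty, else None.
        self.dirty_owner = [None] * buffer_count
        self.scheduler_count = scheduler_count
        # database -> scheduler id -> list of dirty buffer ids
        self.dirty_lists = defaultdict(lambda: defaultdict(list))

    def mark_dirty(self, buffer_id, database):
        if self.dirty_owner[buffer_id] is None:
            self.dirty_owner[buffer_id] = database
            scheduler = buffer_id % self.scheduler_count  # assumed partition rule
            self.dirty_lists[database][scheduler].append(buffer_id)

    def full_scan(self, database):
        """Visit every buffer in the pool; cost grows with pool size."""
        visited = 0
        dirty = []
        for buffer_id, owner in enumerate(self.dirty_owner):
            visited += 1
            if owner == database:
                dirty.append(buffer_id)
        return dirty, visited

    def dirty_list_scan(self, database):
        """Visit only the tracked dirty buffers for one database."""
        dirty = []
        for partition in self.dirty_lists[database].values():
            dirty.extend(partition)
        return sorted(dirty), len(dirty)


pool = ToyBufferPool(buffer_count=100_000, scheduler_count=4)
for buffer_id in (10, 11, 5000):
    pool.mark_dirty(buffer_id, "new_db")
pool.mark_dirty(42, "other_db")

full_dirty, full_visited = pool.full_scan("new_db")
list_dirty, list_visited = pool.dirty_list_scan("new_db")
print(full_visited, list_visited)
```

Both scans find the same dirty buffers, but the first one touches every buffer in the pool while the second touches only the ones tracked for that database.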

Generating Task Factory Dynamics CRM Loads With Biml

Published 2017-11-24 by Kevin Feasel

Meagan Longoria shows how to use Biml to generate SSIS packages which use the Task Factory Dynamics CRM source:

I recently worked on a project where a client wanted to use Biml to create SSIS packages to stage data from Dynamics 365 CRM. My first attempt using a script component had an error, which I think is related to a bug in the Biml engine with how it currently generates script components, so I had to find a different way to accomplish my goal. (If you have run into this issue with Biml, please comment so I know it’s not just me! I have yet to get Varigence to confirm it.) This client owned the Pragmatic Works Task Factory, so we used the Dynamics CRM source to retrieve data.

Meagan has the code as well as some important notes, so read the whole thing.

AG Failover From PowerShell

Published 2017-11-24 by Kevin Feasel

Frank Gill has written a script to perform an Availability Group failover using PowerShell:

The function takes a replica name as input and queries system tables for Availability Groups running as secondary that are online, healthy, and synchronous. For each AG found, the function generates an ALTER AVAILABILITY GROUP statement. If the -noexec parameter is set to 0, the command will be executed. If -noexec is set to 1, the command will be written out to a file.

When writing the function, I started out trying to use the native PowerShell Availability Group cmdlets. After several false starts, I found it easier to develop the T-SQL code in Management Studio and use Invoke-Sqlcmd to execute the code. The code is available below. I hope you can put it to use.

Click through for the script.
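
The generate-then-execute-or-write pattern that Frank describes can be sketched in a few lines. This is Python rather than his PowerShell, and it is not his script: the function names, the execute callback, and the output file name are hypothetical, and the step that queries for eligible Availability Groups is left out, so the AG names are simply passed in.

```python
def build_failover_statements(ag_names):
    """Return one ALTER AVAILABILITY GROUP ... FAILOVER statement per AG name."""
    statements = []
    for name in ag_names:
        # Escape closing brackets so the name is a valid bracketed identifier.
        quoted = "[" + name.replace("]", "]]") + "]"
        statements.append(f"ALTER AVAILABILITY GROUP {quoted} FAILOVER;")
    return statements


def fail_over(ag_names, noexec=True, execute=None, out_path="failover.sql"):
    """Write the statements to a file (noexec) or pass each one to execute."""
    statements = build_failover_statements(ag_names)
    if noexec:
        with open(out_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(statements) + "\n")
    else:
        if execute is None:
            raise ValueError("an execute callback is required when noexec is False")
        for statement in statements:
            execute(statement)
    return statements


# Dry run against a stand-in executor that just records the statements.
executed = []
fail_over(["SalesAG", "ReportingAG"], noexec=False, execute=executed.append)
print("\n".join(executed))
```

Defaulting to the write-to-file path means a failover only happens when it is asked for explicitly, and the generated file gives you something to review first.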

Happy Thanksgiving

Published 2017-11-23 by Kevin Feasel

Because today is Thanksgiving, there will be no curation. Curated SQL will return either tomorrow or Monday, depending upon when I wake up from my turkey coma.

An Apache Sqoop Tutorial

Published 2017-11-22 by Kevin Feasel

Subham Sinha has an introductory-level tutorial on Apache Sqoop:

For a Hadoop developer, the actual game starts after the data has been loaded into HDFS. They play around with this data in order to gain various insights hidden in the data stored in HDFS.

So, for this analysis, the data residing in the relational database management systems needs to be transferred to HDFS. The task of writing MapReduce code for importing and exporting data between a relational database and HDFS is uninteresting & tedious. This is where Apache Sqoop comes to the rescue and removes their pain. It automates the process of importing & exporting the data.

Sqoop makes the life of developers easy by providing a CLI for importing and exporting data. They just have to provide basic information like database authentication, source, destination, operations etc. It takes care of the rest.

Sqoop internally converts the command into MapReduce tasks, which are then executed over HDFS. It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.

In my experience, Sqoop does two things really well: first, it lets you move data from a relational database into HDFS (or Hive). Second, it lets you move data from HDFS (or Hive) into a staging table on a relational database. That can make Sqoop a useful part of an ETL process.
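
As a rough sketch of the "basic information" an import needs, the snippet below assembles a sqoop import command line in Python without running it. The flags are standard Sqoop 1 import arguments; the helper function, JDBC URL, user, paths, and table name are made-up placeholders.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, num_mappers=4):
    """Assemble (but do not run) a sqoop import command as an argument list."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # source database
        "--username", username,             # database authentication
        "--password-file", password_file,   # file holding the password
        "--table", table,                   # source table
        "--target-dir", target_dir,         # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


# Hypothetical connection details, for illustration only.
command = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    username="etl_user",
    password_file="/user/etl/.db_password",
    table="orders",
    target_dir="/data/raw/orders",
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if it is later handed to something like subprocess.run.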
