Press "Enter" to skip to content

Author: Kevin Feasel

Power BI Row-Level Security

Tomaz Kastrun shows us row-level security in Power BI:

Row -Level Security or managing roles in Power BI is not something new. But environments, where there is a need for securing read access for end-users based on their account name, are very frequent. Row Level Security is omitting and controlling access to a user or group (or distribution group in active directory) to rows on a single dataset (or table in SQL Server) and all the relationships to this dataset.

There is a performance cost to this, but if you need it, it’s there. Power BI row-level security can also work with Analysis Services row-level security and (to an extent, and this is new) SQL Server row-level security.

Comments closed

Installing Postgres

Mala Mahadevan has started learning a bit about PostgreSQL:

I had the opportunity of working on a project converting a postgres database to a sql server based one at work. I used this opportunity to learn more of this platform and decided to write some blog posts on it. I will be focusing quite a bit on how this compares with SQL Server as I go along and hope it will be useful.

PostGreSQL is an open source based relational database – it is easy to download and install from here. I chose the version available for Windows since mine involves a comparison with SQL Server on Windows and this is easier.

Click through for the installation steps.

Comments closed

Hive + LLAP Now Faster with ElasticMapReduce 6

Suthan Phillips has a benchmark for ElasticMapReduce 5 versus 6:

To evaluate the performance benefits of running Hive with Amazon EMR release 6.0.0, we’re using 70 TCP-DS queries with a 3 TB Apache Parquet dataset on a six-node c4.8xlarge EMR cluster to compare the total runtime and geometric mean with results from EMR release 5.29.0.

The results show that the TPC-DS queries run twice as fast in Amazon EMR 6.0.0 (Hive 3.1.2) compared to Amazon EMR 5.29.0 (Hive 2.3.6) with the default Amazon EMR Hive configuration.

The following graph shows performance improvements measured as total runtime for 70 TPC-DS queries. Amazon EMR 6.0.0 has the better (lower) runtime.

Click through for the measures and a bit more info on LLAP.

Comments closed

Handling Bad Records with Apache Spark

Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:

Most of the time writing ETL jobs becomes very expensive when it comes to handling corrupt records. And in such cases, ETL pipelines need a good solution to handle corrupted records. Because, larger the ETL pipeline is, the more complex it becomes to handle such bad records in between. Corrupt data includes:

– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.

This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.

Comments closed

Applying the Principles of Site Reliability Engineering

Sheldon Hull has an essay on site reliability engineering in practice:

I’ve always been focused on building resilient systems, sometimes to my own detriment velocity wise. Balancing the momentum of delivery features and improving reliability is always a tough issue to tackle. Automation isn’t free. It requires effort and time to do correctly. This investment can help scaling up what a team can handle, but requires slower velocity initially to do it right.

How do you balance automating and coding solutions to manual fixes, when you often can’t know the future changes in priority?

This is personal experience rather than prescriptive guidance. Very interesting personal experience.

Comments closed

VARCHAR Columns and Bytecode Version Mismatch in R

Dave Mason runs through a tricky problem with SQL Server Machine Learning Services:

During my testing, I’ve found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data is in the range from 0 to 127. This much is not surprising–those are the character codes for the ASCII table. Starting with character code 128, R begins having some trouble. 

Read on to see the problem. Dave’s advice at the end is sound (and frankly, my advice for any string data in SQL Server).

Comments closed

Checking JSON Structure with ADF

Rayis Imayev takes us through the solution of a tricky problem in Azure Data Factory:

Within my “ForEach” container I have also placed a Stored Procedure task and set 4 data elements from my incoming data stream as values for corresponding parameters.

However this approach will not work for all my incoming JSON events, it actually failed for the last one, since it didn’t have both “stop_time” and “last_update” data elements.

An easy way to fix this problem is to add missing data elements with empty values for the last event record, however, when we don’t have control over incoming data, we need to adjust our data processing steps.

Read on to see how Rayis solves this problem.

Comments closed

Tying Azure Data Factory to Source Control

Eddy Djaja explains why you really want to tie Azure Data Factory to your source control:

Azure Data Factory (ADF) is Microsoft’s ETL or more precise: ELT tool in the cloud. For more information of ADF, Microsoft puts the introduction of ADF in this link: https://docs.microsoft.com/en-us/azure/data-factory/introduction. As some have argued if ADF will replace or complement the “on-premise”  SSIS, it is uncertain and only time can tell what will happen in the future.
Unlike SSIS, the authoring of ADF does not use Visual Studio. ADF authoring uses a web browser to create ADF components, such as pipelines, activities, datasets, etc. The simplicity of authoring ADF may confuse the novice developers on how ADF components are saved, stored and published. When logging to ADF for the first time after creating an ADF, the authoring is in the ADF mode. How do we know?

Click through for the explanation and some resources on how to do it.

Comments closed

Tips for Moving from Pandas to Koalas

Haejoon Lee, et al, walk us through migrating existing code written for Pandas to use the Koalas library:

In particular, two types of users benefit the most from Koalas:

– pandas users who want to scale out using PySpark and potentially migrate codebase to PySpark. Koalas is scalable and makes learning PySpark much easier
– Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don’t have to build these functions themselves in PySpark

This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss the best practices of using Koalas; when you use Koalas as a drop-in replacement of pandas, how you can use PySpark to work around when the pandas APIs are not available in Koalas, and when you apply Koalas-specific APIs to improve productivity, etc. The example notebook in this blog can be found here.

Read on to learn more.

Comments closed

Image Caching with Docker

etash2901 at the Knoldus blog walks us through the way Docker caches images:

If the objects on the file system that Docker is about to produce have not changed between builds, reusing a cache of a previous build on the host is a great time-saver. It makes building a new container really, really fast. None of those file structures have to be created and written to disk this time — the reference to them is sufficient to locate and reuse the previously built structures.

This is an order of magnitude faster than a a fresh build. If you’re building many containers, this reduced build-time means getting that container into production costs less, as measured by compute time.

Click through for some advice on how to minimize the amount of time you spend waiting for image layers to download or process.

Comments closed