Month: April 2020

Downloading Files from Websites with Power BI

Imke Feldmann takes us through an interesting scenario:

When downloading data from the web, it’s often better to grab the data from APIs that are designed for machine-to-machine communication than from the site that’s actually visible on the screen. Not only is the download usually faster, but you also often get additional parameters that can be very useful. In this article I’m going to show you how to retrieve the relevant URLs for downloading files from webpages (without resorting to external tools like Fiddler) and how to tweak them to your needs.

Read on to see different techniques for finding a URL to give to end users.
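
As a rough illustration of what "tweaking" a download URL can mean, here is a minimal Python sketch that adds or replaces query parameters. The URL and the parameter names (`format`, `from`, `to`) are hypothetical and not taken from the article, which works in Power BI rather than Python.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url: str, **overrides: str) -> str:
    """Return url with the given query parameters added or replaced.

    Note: repeated parameter names collapse to their last value,
    which is acceptable for this simple sketch.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical download URL, for illustration only.
original = "https://example.com/export?format=html&from=2020-01-01"
tweaked = with_params(original, format="csv", to="2020-03-31")
print(tweaked)
```

Parsing the query string instead of editing the URL by hand keeps the encoding of special characters correct when values change.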

Getting a Substring with DAX

Reza Rad shows us how to build out a substring using DAX:

Substring is one of the most common functions in many languages. However, there is no function named Substring in DAX. There is a very simple way of doing it, which I am going to explain in this post. Substring means getting part of a string; for example, from “Reza Rad”, if I want to get the substring starting from index 2, for 4 characters, it should return “za R”, considering that the first character is index 0. Let’s see how this is possible.

The answer’s not as pretty as a SUBSTRING() function would be, but it’s also not too far off.
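
To make the index arithmetic concrete, here is a small Python sketch of a zero-based substring built on top of a one-based "mid" helper. The helper mirrors the shape of DAX's `MID(text, start_num, num_chars)`, which counts from 1; whether the linked post uses `MID` is an assumption here, and the function names `mid` and `substring` are illustrative.

```python
def mid(text: str, start_num: int, num_chars: int) -> str:
    """Mimic a 1-based MID(text, start_num, num_chars)."""
    if start_num < 1 or num_chars < 0:
        raise ValueError("start_num must be >= 1 and num_chars >= 0")
    begin = start_num - 1  # convert the 1-based position to a Python index
    return text[begin:begin + num_chars]


def substring(text: str, start_index: int, length: int) -> str:
    """Zero-based substring expressed through the 1-based helper."""
    return mid(text, start_index + 1, length)


result = substring("Reza Rad", 2, 4)
print(result)  # za R
```

The only real work is the off-by-one shift: index 2 in a zero-based scheme is position 3 in a one-based one.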

How VMware Resource Pools Affect SQL Server

David Klee walks us through the concept of resource pools in VMware:

Resource pools are used to hierarchically partition available CPU and memory resources, and are available for use at the VMware host cluster layer.

To better prioritize certain VMs over others, especially in a highly concurrent VM farm, I recommend leveraging three resource pools for SQL Server-on-VMware environments. Tier-1 can be created with a high value of resources assigned for CPU and memory; Tier-2 is normal; Tier-3 is low. Do not manually specify the number of shares for each, as this metric will become skewed if compute hardware is added to or removed from the host cluster.

Read on to understand why and how, as well as a few more tips around resource pools.
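
A rough sketch of how relative share levels could divide capacity when three pools compete for it. It assumes a 4:2:1 ratio for High, Normal, and Low, and it ignores reservations, limits, and the fact that shares only matter under contention; the capacity figures are hypothetical.

```python
# Assumed relative weights for the predefined share levels (4:2:1).
SHARE_RATIO = {"High": 4, "Normal": 2, "Low": 1}


def split_under_contention(capacity_mhz: float, pools: dict) -> dict:
    """Divide contended capacity among sibling pools in proportion to shares."""
    total = sum(SHARE_RATIO[level] for level in pools.values())
    return {
        name: capacity_mhz * SHARE_RATIO[level] / total
        for name, level in pools.items()
    }


pools = {"Tier-1": "High", "Tier-2": "Normal", "Tier-3": "Low"}

# Hypothetical cluster capacity before and after adding hosts.
before = split_under_contention(70_000, pools)
after = split_under_contention(140_000, pools)
print(before)
print(after)
```

In this simplified model, each tier's proportion of the cluster stays the same when capacity doubles, because the levels are relative weights rather than absolute amounts.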

Installing Postgres

Mala Mahadevan has started learning a bit about PostgreSQL:

I had the opportunity to work on a project converting a Postgres database to a SQL Server-based one at work. I used this opportunity to learn more about this platform and decided to write some blog posts on it. I will be focusing quite a bit on how this compares with SQL Server as I go along and hope it will be useful.

PostgreSQL is an open source relational database – it is easy to download and install from here. I chose the version available for Windows since mine involves a comparison with SQL Server on Windows and this is easier.

Click through for the installation steps.

Power BI Row-Level Security

Tomaz Kastrun shows us row-level security in Power BI:

Row-Level Security, or managing roles in Power BI, is not something new. But environments where read access must be secured for end users based on their account name are very common. Row-Level Security restricts and controls access for a user or group (or distribution group in Active Directory) to rows in a single dataset (or table in SQL Server) and all the relationships to this dataset.

There is a performance cost to this, but if you need it, it’s there. Power BI row-level security can also work with Analysis Services row-level security and (to an extent, and this is new) SQL Server row-level security.

Handling Bad Records with Apache Spark

Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:

Most of the time, writing ETL jobs becomes very expensive when it comes to handling corrupt records. In such cases, ETL pipelines need a good solution to handle corrupted records, because the larger the ETL pipeline is, the more complex it becomes to handle such bad records in between. Corrupt data includes:

– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.

This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.
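
To make "systematically handle corrupt records" concrete, here is a plain-Python sketch, not Spark code, that loosely mimics the three behaviors behind the `mode` option of Spark's file readers (`PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`). Whether these are the same three techniques the linked article covers is an assumption, and the field names and sample lines are hypothetical.

```python
import json

MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def parse_lines(lines, required, mode="PERMISSIVE"):
    """Parse JSON lines, routing bad records according to mode.

    PERMISSIVE    keeps bad records aside with the reason they failed.
    DROPMALFORMED silently discards bad records.
    FAILFAST      raises on the first bad record.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    good, bad = [], []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("not a JSON object")
            missing = required - set(record)
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            good.append(record)
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            if mode == "FAILFAST":
                raise
            if mode == "PERMISSIVE":
                bad.append({"raw": line, "error": str(err)})
    return good, bad


# Hypothetical input: one incomplete record and one truncated record.
lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2}',
    '{"id": 3, "name": ',
    '{"id": 4, "name": "d"}',
]
good, bad = parse_lines(lines, {"id", "name"})
print(len(good), len(bad))  # 2 2
```

Keeping the raw line alongside the error lets the bad records be inspected or replayed later instead of being lost.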

Hive + LLAP Now Faster with ElasticMapReduce 6

Suthan Phillips has a benchmark for ElasticMapReduce 5 versus 6:

To evaluate the performance benefits of running Hive with Amazon EMR release 6.0.0, we’re using 70 TPC-DS queries with a 3 TB Apache Parquet dataset on a six-node c4.8xlarge EMR cluster to compare the total runtime and geometric mean with results from EMR release 5.29.0.

The results show that the TPC-DS queries run twice as fast in Amazon EMR 6.0.0 (Hive 3.1.2) compared to Amazon EMR 5.29.0 (Hive 2.3.6) with the default Amazon EMR Hive configuration.

The following graph shows performance improvements measured as total runtime for 70 TPC-DS queries. Amazon EMR 6.0.0 has the better (lower) runtime.

Click through for the measures and a bit more info on LLAP.
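
The two metrics named in the benchmark, total runtime and geometric mean, can tell different stories, which this small sketch illustrates. The runtimes below are made up for illustration and are not figures from the benchmark.

```python
import math


def geometric_mean(values):
    """Geometric mean computed via logs to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-query runtimes in seconds: two short queries, one long.
old_release = [2.0, 8.0, 400.0]
new_release = [1.0, 4.0, 380.0]

total_speedup = sum(old_release) / sum(new_release)
geo_speedup = geometric_mean(old_release) / geometric_mean(new_release)

print(f"total runtime speedup:  {total_speedup:.2f}x")
print(f"geometric mean speedup: {geo_speedup:.2f}x")
```

Total runtime is dominated by the longest queries, while the geometric mean weights each query's relative improvement equally, so reporting both guards against either one being misleading.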

VARCHAR Columns and Bytecode Version Mismatch in R

Dave Mason runs through a tricky problem with SQL Server Machine Learning Services:

During my testing, I’ve found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data are in the range from 0 to 127. This much is not surprising: those are the character codes for the ASCII table. Starting with character code 128, R begins having some trouble.

Read on to see the problem. Dave’s advice at the end is sound (and frankly, my advice for any string data in SQL Server).
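
The sketch below does not reproduce the error from the article; it only illustrates why character code 128 is a natural boundary. It assumes the VARCHAR data is stored in a single-byte code page such as Windows-1252 (an assumption, since the excerpt does not name the collation), where bytes above 127 are not valid on their own when read as UTF-8.

```python
def non_ascii_positions(data: bytes) -> list:
    """Return the positions of bytes outside the 0-127 ASCII range."""
    return [i for i, b in enumerate(data) if b > 127]


# Assumed code page for illustration: Windows-1252.
ascii_only = "EUR 5".encode("cp1252")
extended = "€5".encode("cp1252")  # the euro sign is byte 0x80 (128) here

print(non_ascii_positions(ascii_only))  # []
print(non_ascii_positions(extended))    # [0]

# ASCII-range bytes mean the same thing in both encodings.
print(ascii_only.decode("utf-8"))

# A lone byte 0x80 is not valid UTF-8, so a consumer expecting UTF-8 fails.
try:
    extended.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```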

Applying the Principles of Site Reliability Engineering

Sheldon Hull has an essay on site reliability engineering in practice:

I’ve always been focused on building resilient systems, sometimes to my own detriment velocity-wise. Balancing the momentum of delivering features and improving reliability is always a tough issue to tackle. Automation isn’t free. It requires effort and time to do correctly. This investment can help scale up what a team can handle, but requires slower velocity initially to do it right.

How do you balance automating and coding solutions to manual fixes, when you often can’t know the future changes in priority?

This is personal experience rather than prescriptive guidance. Very interesting personal experience.
