
Day: May 21, 2020

Distributed Model Training with Cloudera ML

Zuling Kang and Anand Patil show us how to train models across several nodes using Cloudera Machine Learning:

Deep learning models are generally trained using the stochastic gradient descent (SGD) algorithm. For each iteration of SGD, we sample a mini-batch from the training set, feed it into the model, calculate the gradient of the loss function between the predicted values and the true values, and update the model parameters (or weights). Since SGD iterations have to be executed sequentially, it is not possible to speed up the training process by parallelizing iterations. However, because a single iteration for commonly used models trained on datasets like CIFAR-10 or ImageNet takes a long time, even on the most sophisticated GPU, we can still parallelize the feedforward computation as well as the gradient calculation within each iteration to speed up model training.

In practice, we split the mini-batch of training data into several parts, such as 4, 8, or 16 (in this article, we use the term sub-batch to refer to these split parts), and each training worker takes one sub-batch. The training workers then perform the feedforward pass, gradient computation, and model update on their own sub-batches, just as in the monolithic training mode. After these steps, a process called model averaging is invoked, averaging the model parameters of all the workers participating in the training so that the parameters are exactly the same when a new training iteration begins. The new round of the training iteration then starts again from the data sampling and splitting step.

Read on for the high-level explanation, followed by some Python code working in TensorFlow.
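
Not the authors' TensorFlow code, but a minimal NumPy sketch of the model-averaging step described above: each worker updates its own copy of the parameters on its sub-batch, and the copies are then averaged before the next iteration begins. The function and variable names here are mine, for illustration only.

```python
import numpy as np

def average_weights(worker_weights):
    """Average the parameter arrays produced by each training worker.

    worker_weights: one list of np.ndarray per worker (e.g. [W1, b1, W2, b2]),
    with all workers holding arrays of identical shapes.
    """
    n_workers = len(worker_weights)
    return [sum(layer) / n_workers for layer in zip(*worker_weights)]

# Toy example: two workers, each holding a weight matrix and a bias vector.
worker_0 = [np.ones((3, 2)), np.zeros(2)]
worker_1 = [np.full((3, 2), 3.0), np.full(2, 2.0)]

averaged = average_weights([worker_0, worker_1])
print(averaged[0])  # every entry is 2.0
print(averaged[1])  # every entry is 1.0
```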


Spark Application Execution Modes

Kundan Kumarr explains how the two execution modes differ with Apache Spark:

Whenever we submit a Spark application to the cluster, the driver (the Spark application master) is started first, and the driver then starts N workers. The Spark driver manages the SparkContext object, sharing data and coordinating with the workers and the cluster manager across the cluster. The cluster manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers are assigned tasks, and their results are consolidated and collected back at the driver. A Spark application is executed within the cluster in two different modes: cluster mode and client mode.

Click through for a comparison.
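
To make the two modes concrete, here is a small PySpark sketch of my own (not from the post) that reports which deploy mode an application was submitted in; the application name is only an example, and the spark-submit commands in the comments are the usual way of choosing the mode.

```python
from pyspark.sql import SparkSession

# Typical submissions (shell commands, shown as comments since this sketch is Python):
#   spark-submit --master yarn --deploy-mode client  my_app.py   # driver runs on the submitting machine
#   spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs on a node inside the cluster

spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()
sc = spark.sparkContext

# spark.submit.deployMode reports whether this application was submitted in client or cluster mode.
print("Deploy mode:", sc.getConf().get("spark.submit.deployMode", "client"))
print("Master:", sc.master)

spark.stop()
```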


Big-O Notation in .NET

Camilo Reyes takes us through a useful concept in computer science as applied to .NET Core:

Performance-sensitive code is often overlooked in business apps. This is because high-performance code might not affect outcomes. Concerns about execution times are easy to ignore if the code finishes in a reasonable time. Apps either meet expectations or they don't, and performance issues can go undetected. Devs, for the most part, care about business outcomes, and performance is the outlier. When response times cross an arbitrary line, everything flips to less than desirable or unacceptable.

Luckily, Big-O notation approaches this problem in a general way, focusing on both outcomes and the algorithm. Big-O notation attempts to conceptualize algorithmic complexity without laborious performance tuning.

This is a rather high-level take on the idea, as it doesn’t cover any of the O(n log n) or O(log n) algorithms out there. But if you are not familiar with the concept, it is good to know.
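
As a quick illustration of the idea (in Python rather than C#, to keep this digest's examples in one language): the same membership check is O(n) against a list but O(1) on average against a set, and the difference only matters once n gets large.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: the value is absent, so every element is scanned

# O(n): linear scan through the whole list before giving up.
list_time = timeit.timeit(lambda: missing in as_list, number=200)

# O(1) on average: a single hash lookup, independent of n.
set_time = timeit.timeit(lambda: missing in as_set, number=200)

print(f"list: {list_time:.4f}s   set: {set_time:.6f}s")
```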


Standardized DAX Separators in Power BI Desktop

Marco Russo goes over the ramifications of a recent change to Power BI Desktop:

Starting with the May 2020 version of Power BI Desktop, DAX always uses the standard separators by default, regardless of the Windows locale settings. This change does not affect most Power BI users around the world; but if you use the comma as a decimal separator, so that 10,000 is just ten and not ten thousand, then you are affected by this change.

First of all, you can restore the previous behavior as I describe in this post, but the recommendation and the default are now to use the standard DAX separators. I want to describe why this (I think good) choice was made and how I contributed to making this happen.

Read the whole thing.


May 2020 Release of Azure Data Studio

Alan Yu has some goodies for us:

The key highlights to cover this month include:

– Announcing Redgate SQL Prompt extension
– Announcing the new machine learning extension
– Added new Python dependencies wizard
– Added support for parameterization for Always Encrypted
– Improvements to the notebook markdown toolbar
– Bug fixes

For a complete list of updates, refer to the Azure Data Studio release notes.

I’ll have to check out the ML extension.


AMD Processor Recommendations for SQL Server

Glenn Berry has some thoughts on AMD’s EPYC line of processors:

Over the years, I have written many articles about the fine art of processor selection for SQL Server. This is an important topic, because it has a direct relationship to your SQL Server license costs. It also affects your performance and scalability. As new processor families are introduced, I do the required analytical work and update my recommendations. In this post, I will list my recommended AMD Processors for SQL Server.

I’m just happy that the answer isn’t a null set anymore.


Filtering Power BI Dimensions with List.Contains

Ed Hansberry gives us a second option for filtering dimension values:

I don’t like loading up a slicer with dozens or hundreds of items that have no corresponding records. The same would apply if there was no slicer, but the consumer wanted to filter using the Filter pane. So I’ll filter the customer table so it only includes what I would call “active customers” that are shown in the sales table.

The most straightforward way to do this is with an inner join between the tables, but there is another way, using the powerful List.Contains() function in Power Query. And what makes it so powerful is not just its utility, but that when you run it against data in SQL Server or a similar server, Power Query will fold the statement.

Let me walk you through both methods so it is clear.

Read on for the walkthrough.
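
Ed's walkthrough is in Power Query M; purely as a rough analogue of the idea in another language, here is the same "keep only dimension rows that appear in the fact table" filter in pandas (the tables and column names below are made up for illustration).

```python
import pandas as pd

# Hypothetical stand-ins for the customer dimension and the sales fact table.
customers = pd.DataFrame({"CustomerKey": [1, 2, 3, 4],
                          "Name": ["Alice", "Bob", "Carol", "Dave"]})
sales = pd.DataFrame({"CustomerKey": [2, 2, 4],
                      "Amount": [10.0, 5.0, 7.5]})

# Analogue of List.Contains: keep only customers whose key appears in the sales table.
active_customers = customers[customers["CustomerKey"].isin(sales["CustomerKey"].unique())]
print(active_customers)  # rows for Bob and Dave only
```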


Power BI Incremental Refresh Against Web API

Dustin Ryan shows how you can have Power BI perform incremental refresh against a .NET Web API source:

The customer is using Power BI to report on data from ServiceNow via its APIs. So the customer was able to connect Power BI to ServiceNow data and begin reporting on relevant datasets very quickly. The challenge, however, is that querying multiple years of data via the API was less than desirable for a variety of reasons.

The customer was interested in exploring the incremental refresh capabilities of Power BI, but was worried about using Power BI’s native incremental refresh capability since query folding (if you’re new to query folding, read this here) is not supported by Power BI’s web connector. So the customer team reached out to me with the challenge, and I worked up an example that I think will help them meet their requirement.

Click through for the solution.
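
Dustin's solution is built in Power Query, so the sketch below is only the general shape of the idea in Python, not his implementation: push the refresh window down to the API as query parameters so that each refresh pulls only the new slice of data. The endpoint and parameter names are placeholders.

```python
from datetime import date
import requests

def fetch_window(base_url, range_start, range_end):
    """Fetch only the rows whose modification date falls inside the refresh window.

    'updated_from' and 'updated_to' are placeholder parameter names; the real
    filter syntax depends on the API being called.
    """
    params = {"updated_from": range_start.isoformat(),
              "updated_to": range_end.isoformat()}
    response = requests.get(base_url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# Example: refresh only the latest partition instead of re-pulling years of history.
rows = fetch_window("https://example.com/api/incidents", date(2020, 5, 1), date(2020, 5, 21))
```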
