Press "Enter" to skip to content

June 2, 2021

Using Lobe for Training ML Models

Chris Webb reviews a free tool from Microsoft:

The most impressive thing about it is not what it does but how it does it: a lot of tools claim to make machine learning easy for non-technical users but Lobe really is easy to use. My AI/ML knowledge is very basic but I got up and running with it extremely quickly.

To test it out I downloaded lots of pictures of English churches and trained a model to detect whether the church had a tower or a spire. After I labelled the pictures appropriately:

Click through for Chris’s findings. Looks like the only thing it does today is image classification, but more functionality is forthcoming.
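
Lobe itself is point-and-click, but a trained model can be exported and consumed from code. As a rough sketch (Lobe can export a TensorFlow SavedModel; the export path, signature name, and input size below are assumptions that will vary by export), scoring an image might look like this:

```python
# Hypothetical sketch: classifying an image with a model exported from Lobe
# as a TensorFlow SavedModel. Path, signature, and input size are assumptions.
import numpy as np
import tensorflow as tf
from PIL import Image

model = tf.saved_model.load("lobe_exported_model")
infer = model.signatures["serving_default"]

img = Image.open("church.jpg").convert("RGB").resize((224, 224))
batch = np.expand_dims(np.asarray(img, dtype=np.float32) / 255.0, axis=0)

outputs = infer(tf.constant(batch))
print(outputs)  # e.g. confidences for the "tower" and "spire" labels
```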


Scaling HDFS to an Exabyte

Konstantin Shvachko, et al., explain some of the changes to the Hadoop Distributed File System needed to scale to one exabyte of data:

LinkedIn runs its big data analytics on Hadoop. During the last five years, the analytics infrastructure has experienced tremendous growth, almost doubling every year in data size, compute workloads, and in all other dimensions. It recently reached two important milestones.

1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters.

2. Our largest 10,000-node cluster stores 500 PB of data. It maintains 1 billion objects (directories, files, and blocks) on a single NameNode serving RPCs with an average latency under 10 milliseconds, making it one of the largest (if not the largest) Hadoop cluster in the industry.

Since the early days of LinkedIn, Apache Hadoop has been the basis of our analytics infrastructure. Many teams assisted in this effort to make Hadoop our canonical big data platform.

Read on for different techniques they’ve used, as well as code changes implemented in HDFS to support this data size.
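
As a quick back-of-envelope check on the figures in that excerpt:

```python
# Derived only from the numbers quoted above.
nodes = 10_000                # nodes in LinkedIn's largest cluster
data_pb = 500                 # petabytes stored on that cluster
objects = 1_000_000_000       # directories, files, and blocks on one NameNode

print(f"~{data_pb * 1000 / nodes:.0f} TB of data per node")    # ~50 TB/node
print(f"~{objects // nodes:,} namespace objects per node")     # ~100,000/node
```

The striking part is that all one billion of those namespace objects are tracked by a single NameNode, which holds the namespace in memory, so the sub-10-millisecond RPC latency is the number to appreciate.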


Comparing Azure Analysis Services Scaling to Power BI PPU

Gilbert Quevauvilliers continues a series on migrating from Azure Analysis Services to Power BI Premium Per User:

If you missed the first part of the series, here is the link: Query Performance – Part 1 Migrating Azure Analysis Services to Power BI Premium Per User – Reporting/Analytics Made easy with FourMoo and Power BI

In this blog post I am going to investigate how well PPU scales when compared to AAS.

When comparing AAS to PPU, I must find the AAS size that matches what we get with PPU.

Read on for Gilbert’s findings.


Designing and Managing Large Datasets in Power BI

Paul Turley continues a series on doing Power BI the right way:

I was just talking to a consulting client about the best approach to build a data model and he told me something very interesting about the way they were loading data into Power BI. He said “We don’t use facts and dimensions, we load all of our data into one huge table.” He said that their data model performs well and that it meets their reporting needs. It is a difficult point to argue, when something is working at the time although the design might not follow the accepted rules. Life is like that and there are plenty of analogies to make the point that a practice, even a real bad practice, might solve a problem for a period of time and under certain conditions. You can drive a car at excessive speed to get to your destination faster. You might not get caught by the police on that day and you might not crash but eventually, if you make it a habit, this practice will catch up to you. Data is like that. If you don’t play by the rules, you limit your options. Bending the rules lets you move faster and sometimes with less hassle. But, as the project scope expands – and after adding enough data or other complexities to the solution, it will not endure. The data model won’t perform well, won’t load the correct data or it just won’t be reliable.

This post will explore the realities of best practice design for large data models; some important considerations and trade-off decisions when working with both “big data” and “large data”.

Read on for Paul’s tips.
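
The facts-and-dimensions point is easy to demonstrate outside of Power BI. Here is a minimal sketch (hypothetical data, not from Paul’s post) of splitting “one huge table” into a dimension plus a fact table keyed against it:

```python
# Hypothetical example: normalizing a flat table into a small star schema.
import pandas as pd

flat = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["Ann", "Bob", "Ann", "Cat"],
    "region":   ["East", "West", "East", "East"],
    "amount":   [10.0, 25.0, 7.5, 12.0],
})

# Dimension: one row per distinct customer/region pair, with a surrogate key.
dim_customer = (flat[["customer", "region"]]
                .drop_duplicates()
                .reset_index(drop=True)
                .rename_axis("customer_key")
                .reset_index())

# Fact: keep keys and measures; descriptive attributes live in the dimension.
fact_sales = (flat.merge(dim_customer, on=["customer", "region"])
                  [["order_id", "customer_key", "amount"]])

print(dim_customer)
print(fact_sales)
```

A narrow fact table with small dimensions is also the shape that Power BI’s VertiPaq engine compresses and scans best, which is part of why the one-huge-table approach tends to fail only after the data grows.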


Moving Synapse Databases Across Subscriptions

Steve Hughes hits on one of the tricky administrative bits of Azure Synapse Analytics:

So you can copy Azure SQL Database using the Azure Portal, PowerShell, Azure CLI, and T-SQL. However, this functionality is limited to Azure SQL Database and does not work for Azure Synapse databases (a.k.a. SQL Pools). Early in 2021, the ability to use the copy functionality to copy databases between subscriptions was added as well, but it requires security work to make sure the permissions in the database servers and networking allow that to happen.

There’s a lot involved in the process, leaving me to provide the sage wisdom that it’s easier not to put it in the wrong subscription to begin with if you can avoid it.


Inlined Financial Functions

Erik Darling has some functions for us:

At just about every client site, I see a common set of financial functions being used to calculate various things. The code is all the same, too.

Some of it comes from published government guidelines, and some of it comes straight out of accounting 101 books.

The big problem is that all of these functions were written as scalar UDFs, and performance suffers badly as a result.

Recently, one of my clients was nice enough to agree to let me publish my rewrites of their functions as inline table valued functions.

Check them out on Erik’s GitHub repo.
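
To make the scenario concrete, here is the kind of calculation these functions typically wrap: a standard loan-payment formula, sketched in Python as a hypothetical stand-in. Erik’s actual rewrites are T-SQL inline table-valued functions, so see the repo for the real thing.

```python
# Hypothetical stand-in for the sort of accounting-101 function involved:
# the standard amortized loan-payment formula.
def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    r = annual_rate / 12.0
    if r == 0:
        return principal / months
    return principal * r / (1.0 - (1.0 + r) ** -months)

print(round(monthly_payment(200_000, 0.05, 360), 2))  # ~1073.64
```

The formula itself is cheap; the performance problem comes from packaging it as a scalar UDF, which SQL Server has historically executed row by row rather than inlining into the query plan.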


Working with Multi-Row Headers in Power Query

Ed Hansberry has the solution to a tricky problem:

It is fairly common for users to format Excel reports with headers that span two or more rows, rather than using a single cell with word wrap on. I’ve seen text files with similar issues as well. Consider the following example:

Getting this info into Power Query can sometimes be a challenge. I’m going to show you two ways to do it. The first way will be to do it mostly manually, using the Power Query user interface. The second way will be a custom function that will do all of the work for you. For simplicity’s sake, I’ll use Power Query in Excel for this since my data is in Excel already, but the same logic would hold if you were importing the data into Power BI. We do not want to consolidate the headers in Excel. You’ll just have to do it manually again the next time you get a file from someone. Remember – never pre-transform your data before you transform it in Power Query.

The nice thing is that Power Query makes this tricky problem fairly easy to solve.
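
Ed’s solution is written in Power Query’s M language. For readers on the Python side, the same idea sketches out roughly like this (the file name and layout are assumptions: two header rows, data starting on row three):

```python
# Hypothetical sketch: collapsing a two-row header into one row of column names.
import pandas as pd

raw = pd.read_excel("report.xlsx", header=None)

# The top row often has merged/spanned cells that arrive blank: fill across.
top = raw.iloc[0].ffill().fillna("")
bottom = raw.iloc[1].fillna("")
headers = [f"{t} {b}".strip() for t, b in zip(top, bottom)]

df = raw.iloc[2:].reset_index(drop=True)
df.columns = headers
print(df.head())
```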


15 tempdb Notes

Deepthi Goguri summarizes a detailed session from Bob Ward:

While I was preparing for my tempdb presentation, I learned many interesting facts about tempdb. Thanks so much to Bob Ward (t|g) for providing me with the resources to prepare for my presentation. Bob Ward presented an amazing three-hour session about tempdb at the PASS Summit a couple of years ago. This information is invaluable.

Read on for 15 notes of interest.
