
Day: July 15, 2020

Calculating Spark Application Resource Allocations

The Hadoop in Real World team walks us through resource allocation for Spark applications:

In this post we will look at how to calculate resource allocation for Spark applications. Figuring out how to allocate resources for a Spark application requires a good understanding of resource allocation properties in YARN and also resource related properties in Spark. Let’s look at both.

This post covers the properties you want to keep an eye on when running Spark applications.
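To give a feel for the arithmetic involved, here is a minimal sketch of one widely repeated rule of thumb for sizing executors on YARN. It is not necessarily the method in the linked post. The function name `plan_executors`, the `ExecutorPlan` type, the five-cores-per-executor default, the one core and 1 GB reserved per node, and the sample cluster are all assumptions for illustration. The 10% figure mirrors Spark's default executor memory overhead, which YARN adds on top of the heap when it sizes the container.

```python
import math
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors_per_node: int
    total_executors: int
    cores_per_executor: int
    heap_gb_per_executor: int


def plan_executors(nodes, cores_per_node, memory_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing; every default here is an assumption."""
    # Assumption: reserve one core and 1 GB per node for the OS and daemons.
    usable_cores = cores_per_node - 1
    usable_memory_gb = memory_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for a single executor")

    # Assumption: give up one executor slot for the YARN ApplicationMaster.
    total_executors = nodes * executors_per_node - 1

    # Each YARN container must hold heap + overhead, so
    # heap * (1 + overhead_fraction) has to fit in the per-executor share.
    container_gb = usable_memory_gb / executors_per_node
    heap_gb = math.floor(container_gb / (1 + overhead_fraction))

    return ExecutorPlan(executors_per_node, total_executors,
                        cores_per_executor, heap_gb)


# Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
plan = plan_executors(nodes=10, cores_per_node=16, memory_gb_per_node=64)
print(plan)
```

The three numbers map onto the `--num-executors`, `--executor-cores`, and `--executor-memory` options of `spark-submit`. Treat the result as a starting point to tune against your own workload rather than a final answer.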


Comparing Gradient Descent to the Normal Equation for Small Data Sets

Pushkara Sharma compares two techniques for regression:

In this article, we will see the actual difference between gradient descent and the normal equation in a practical approach. Most newcomers to machine learning learn about gradient descent while studying linear regression and move on without ever learning about the underrated Normal Equation, which is far less complex and provides very good results for small to medium-sized datasets.

If you are new to machine learning, or not familiar with a normal equation or gradient descent, don’t worry I’ll try my best to explain these in layman’s terms. So, I will start by explaining a little about the regression problem.

I was surprised by the results.
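For a quick side-by-side, here is a minimal sketch in plain Python rather than the code from the linked article. It fits a one-feature linear regression twice: once with the closed-form solution (the single-feature special case of the normal equation θ = (XᵀX)⁻¹Xᵀy) and once with batch gradient descent on mean squared error. The function names, the learning rate, the iteration count, and the five sample points are all made up for illustration.

```python
def normal_equation(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def gradient_descent(xs, ys, learning_rate=0.01, iterations=20000):
    """Batch gradient descent on mean squared error for the same model."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2 / n * sum(errors)
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Made-up sample data, roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

exact = normal_equation(xs, ys)
approx = gradient_descent(xs, ys)
print("normal equation: ", exact)
print("gradient descent:", approx)
```

On a data set this small, the closed form gets the answer in one pass with nothing to tune, while gradient descent needs a learning rate and many iterations to land on essentially the same coefficients. The trade-off shifts as the number of features grows, because the general normal equation requires solving a system whose size depends on the feature count.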


More Scraping Web Pages

Dave Mason continues scraping web pages for fun and profit:

In the last post, we looked at a way to scrape HTML table data from web pages, and save the data to a table in SQL Server. One of the drawbacks is the need to know the schema of the data that gets scraped–you need a SQL Server table to store the data, after all. Another shortcoming is if there are multiple HTML tables, you need to identify which one(s) you want to save.

For this post, we’ll revisit web scraping with Machine Learning Services and R. This time, we’ll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.

Click through to see how Dave manages this feat.
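To illustrate the schema-less idea outside of SQL Server, here is a minimal sketch using only the Python standard library. Dave's post uses R with Machine Learning Services, so this is a swapped-in technique, not his code. The `TableCollector` class, the `tables_to_json` function, and the placeholder HTML are hypothetical, and the sketch assumes simple, non-nested tables whose first row holds the column headers.

```python
import json
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def tables_to_json(html):
    """Return a JSON array with one array of row objects per HTML table."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    result = []
    for rows in parser.tables:
        if not rows:
            result.append([])
            continue
        header, *body = rows  # assumption: the first row names the columns
        result.append([dict(zip(header, row)) for row in body])
    return json.dumps(result)


# Placeholder markup standing in for a page with two grids.
sample_html = """
<table><tr><th>Player</th><th>No</th></tr>
<tr><td>Player A</td><td>1</td></tr></table>
<table><tr><th>Game</th><th>Result</th></tr>
<tr><td>1</td><td>W</td></tr></table>
"""
print(tables_to_json(sample_html))
```

Because the column names come from the page itself, nothing in the code needs to know the shape of either grid ahead of time, which is the appeal of returning JSON instead of loading a predefined table.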


Lessons Learned from a Non-Standard Default Database

Richard Swinbank tells a tale of woe:

Migration day went pretty smoothly – it even looked like we’d found and amended every connection string likely to disable a downstream system. The instance from which we were migrating was a bit of a food court, so before signing off I opened SSMS to check on some other system issue… and found I could no longer log in.

Read on to understand why, as well as what Richard did to fix this.


The Limits of LEN (or REPLICATE)

Pamela Mooney takes us through a quandary:

I was using LEN() to troubleshoot an issue I was having with a dynamically constructed string truncating while inserting into an NVARCHAR(MAX) column. Since I know that NVARCHAR(MAX) has a 2 GB limit (goodness only knows how many characters that is!), I couldn’t explain the truncation. A colleague suggested doing a test with another dynamically constructed string. Maybe then, I could find where the cutoff was occurring.

Great idea!

So, I came up with a plan.

Click through for the plan, but be sure to read Pamela’s comment at the bottom as there’s a bit more to the story.
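As for how many characters fit in 2 GB, the arithmetic is short. This is a back-of-the-envelope sketch and not part of Pamela's post: it assumes the documented maximum of 2^31 - 1 bytes for a MAX type and 2 bytes per UTF-16 code unit for NVARCHAR. Supplementary characters take two code units each, so the count of visible characters can be lower.

```python
# Maximum storage for a MAX data type, in bytes.
MAX_BYTES = 2**31 - 1
# NVARCHAR stores text as UTF-16: 2 bytes per code unit.
BYTES_PER_CODE_UNIT = 2

max_code_units = MAX_BYTES // BYTES_PER_CODE_UNIT
print(f"{max_code_units:,}")  # → 1,073,741,823
```

That is a little over a billion characters, so hitting the column limit itself is rarely the explanation for a truncated string.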


Adding Calculation Groups with the Tabular Object Model

Kasper de Jonge shows how you can add calculation groups in C# with the Tabular Object Model:

At the time of writing there are no tools built into Power BI to create them, but you can add them programmatically.

To add them I wanted to play around with the tabular object model but more on this project later 🙂 . Unfortunately, there was not much documentation available on how to add calculation groups using TOM. Luckily, I have short access lines to the devs and they helped me :). I wanted to share in the code snippet below how to add a calculation group to your model in TOM using C#. Make sure you add the SSAS NuGet packages to your project.

Click through for an example of what to do.


Replacing GUIDs with Surrogate Keys in Power BI

Matt Allington finds another place where GUIDs aren’t your best option:

I was doing some work for a customer this week – they had a performance issue with a Power BI report. The data in the workbook wasn’t overly large, about 400,000 rows, yet the file size was 110 megabytes and the performance of the model was relatively slow given the number of records. When I looked at the report I noted that the report was using GUIDs between the primary and foreign keys on a number of tables. Generally speaking, it is not good practice to use a GUID to join tables, as GUIDs do not compress well and have a negative effect on the efficiency of physical 1 to many relationships.

Read on to learn more as well as what you can do about it.
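The general shape of the fix is easy to sketch. The following is a minimal illustration in Python, not Matt's solution, which may well do the work in Power Query or upstream in the source. The function `build_surrogate_keys`, the column names, and the generated sample rows are all hypothetical.

```python
import uuid


def build_surrogate_keys(guids):
    """Map each distinct GUID to a small integer, in first-seen order."""
    mapping = {}
    for guid in guids:
        if guid not in mapping:
            mapping[guid] = len(mapping) + 1
    return mapping


# Hypothetical dimension and fact tables joined on a GUID.
customers = [
    {"customer_guid": str(uuid.uuid4()), "name": f"Customer {i}"}
    for i in range(3)
]
orders = [
    {"customer_guid": customers[i % 3]["customer_guid"], "amount": 10 * (i + 1)}
    for i in range(6)
]

# Build the mapping once from the dimension, then swap the key in both tables.
key_of = build_surrogate_keys(c["customer_guid"] for c in customers)
customers_sk = [
    {"customer_key": key_of[c["customer_guid"]], "name": c["name"]}
    for c in customers
]
orders_sk = [
    {"customer_key": key_of[o["customer_guid"]], "amount": o["amount"]}
    for o in orders
]
print(customers_sk)
print(orders_sk)
```

The relationship between the two tables is unchanged, but the join column is now a small integer instead of a 36-character string, and the GUID column no longer has to be loaded into the model at all.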


Methods to Run Scheduled Tasks in Azure

Joey D’Antoni has a roundup of several techniques you can use to run scheduled tasks against an Azure SQL Database:

If you’ve worked with Microsoft SQL Server for any period of time, you are familiar with the SQL Server Agent. The Agent, which remains mostly unchanged since I started working with it in 1999, is a fairly robust job scheduler that can also alert you in the event of job failures or system errors. I feel as though it’s a testament to the quality of the original architecture that the code hasn’t changed very much–it still meets the needs of about 90-95% of SQL Server workloads, based on an informal Twitter discussion I had a few months ago. There are some cases where an enterprise scheduling tool is needed, but for maintaining most SQL Servers and executing basic ETL, the Agent works fine. There’s one problem–the Agent is only available in SQL Server and Azure SQL Managed Instance.

Read on to learn about those options.
