Comparing Performance: HBase1 vs HBase2

Surbhi Kochhar takes us through performance improvements between HBase version 1 and HBase version 2:

We are loading the YCSB dataset with 1,000,000,000 records with each record 1KB in size, creating total 1TB of data. After loading, we wait for all compaction operations to finish before starting workload test. Each workload tested was run 3 times for 15min each and the throughput measured. The average number is taken from 3 tests to produce the final number.

The post argues that there’s an improvement, but when a majority of cases end up worse (even if just a little bit), I’m not sure it’s much of an improvement.

The Transaction Log in Delta Tables

Burak Yavuz, et al, explain how the transaction log works with Delta Tables in Apache Spark:

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order so that the next commit is written out as 000001.json, the following as 000002.json, and so on.

It’s interesting that they chose JSON instead of a binary transaction log like relational databases use.
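
To make the naming scheme concrete, here is a minimal sketch of an append-only log directory in which each commit is the next zero-padded JSON file. It is not Delta Lake's implementation: the helper names (`commit_name`, `list_commits`, `append_commit`) and the action dictionaries are hypothetical, and the six-digit width simply mirrors the excerpt (real Delta Lake logs use longer zero-padded names).

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def commit_name(version: int, width: int = 6) -> str:
    """Zero-padded file name for a commit version (width is an assumption)."""
    return f"{version:0{width}d}.json"


def list_commits(log_dir: Path) -> List[Path]:
    """Return commit files sorted by their numeric version."""
    commits = [p for p in log_dir.glob("*.json") if p.stem.isdigit()]
    return sorted(commits, key=lambda p: int(p.stem))


def append_commit(log_dir: Path, actions: List[Dict], width: int = 6) -> Path:
    """Write the next commit file, one JSON action per line."""
    log_dir.mkdir(parents=True, exist_ok=True)
    commits = list_commits(log_dir)
    next_version = int(commits[-1].stem) + 1 if commits else 0
    path = log_dir / commit_name(next_version, width)
    # Mode "x" raises FileExistsError if the file already exists, so two
    # writers on a local filesystem cannot both claim the same version.
    with open(path, "x", encoding="utf-8") as handle:
        for action in actions:
            handle.write(json.dumps(action) + "\n")
    return path


if __name__ == "__main__":
    log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
    # Placeholder actions, not the real Delta Lake action schema.
    append_commit(log_dir, [{"note": "create table"}])
    append_commit(log_dir, [{"note": "add rows"}])
    append_commit(log_dir, [{"note": "update rows"}])
    for commit in list_commits(log_dir):
        print(commit.name)
```

Replaying the files in ascending order reconstructs the table's history, which is part of why a human-readable format like JSON is workable here.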

Spark for .NET Developers

Ed Elliott has a long-form post covering spark-dotnet:

The .NET driver is made up of two parts, and the first part is a Java JAR file which is loaded by Spark and then runs the .NET application. The second part of the .NET driver runs in the process and acts as a proxy between the .NET code and .NET Java classes (from the JAR file) which then translate the requests into Java requests in the Java VM which hosts Spark.

The .NET driver is added to a .NET program using NuGet and ships both the .NET library as well as two Java jars. One jar is for Spark 2.3 and one for Spark 2.4, and you do need to use the correct one on your installed version of Scala.

As much as I’ve enjoyed his series, getting it in a single-post format is great.

Storytelling with Power BI: Consistency

Mark Lelijveld continues a series on storytelling with Power BI:

In the below report you can easily click on a country on the left side to move to another page. When it comes to interactivity it is all done right! On the right top you can also filter on order date. Let’s say we apply a filter to only see the sales up to the end of 2013. This results in a sales amount of nearly $ 319K.

Now, Germany gets my attention. I want to see more and decide to navigate to the other page by clicking on Germany. Ending up at the other page, I see that the sales amount changes back to $2.3M. In other words, my filter is gone!

Much of the difference between adequacy and excellence with visualization is in this kind of polish.

PowerShell Remoting in dbatools

Claudio Silva takes us through a change to several cmdlets in dbatools:

I wondered why and asked the Windows team if they could provide any insight. A colleague explained to me that I needed to change three things to make my remoting commands work on our network:

1. Use the FQDN on -ComputerName and/or -SqlInstance parameters
2. Use -UseSSL parameter on the New-PSSession command
3. Use -IncludePortInSPN parameter for the New-PSSessionOption command

Read the whole thing.

Azure Dedicated Hosts in Preview

Mine Tokus covers the benefit of Azure Dedicated Hosts:

Recently introduced Azure Dedicated Host Preview provides single-tenant physical servers that can host one or more virtual machines. With this new hosting model, physical server is dedicated to your organization and capacity isn’t shared with other customers. Physical server-level isolation helps to address security and compliance requirements, brings visibility and control over the server infrastructure and enables significant cost savings and licensing flexibility for SQL Server workloads on Azure VMs.

I think this might get some recalcitrant large companies to be willing to adopt cloud technologies.

SQL Server 2019 RC1

Amit Banerjee announces SQL Server 2019 Release Candidate 1:

Today we’re announcing the availability of the first public release candidate for SQL Server 2019, which is now available for download. SQL Server 2019 brings the industry-leading performance and security of SQL Server to Windows, Linux, and containers and can tackle any data workload from business intelligence to data warehousing to analytics and AI over all your data both structured and unstructured.

Amit’s update covers the span of what we’ve seen in all of the CTPs. I went through the release notes and did not find a huge amount of detail on what went into RC1 versus CTP 3.2. But the fact that they’re up to RCs means that SQL Server 2019 is getting close to release.

Connecting to Redshift from Azure Analysis Services

Gilbert Quevauvilliers shows how we can connect to Amazon Redshift from Azure Analysis Services:

I am busy working with a customer and had a challenge when using Azure Analysis Services to connect to Amazon Redshift via an ODBC connection.

The first issue that I encountered was the following error: OLE DB or ODBC error: [Microsoft][ODBC Driver Manager] The specified DSN contains an architecture mismatch between the Driver and Application; AWS PROD. This led me to a few websites and the one that got me to my solution was Tabular: Error while using ODBC data source for Importing Data

Below are the steps on how I installed, configured and got the connection and refresh working.

Read on for those steps.

MAPE and Its Flaws

Jan Fischer takes us through Mean Absolute Percentage Error as a measure of forecast quality:

Particular small actual values bias the MAPE.
If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and therefore bias the informativity of the MAPE (Hyndman & Koehler 2006). The following graph clarifies this point. Although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This issue implies that the MAPE should be used carefully if there are extremely small observations and directly motivates the last and often ignored the weakness of the MAPE.

Jan also points out a couple of things people criticize MAPE for incorrectly, but several things for which it is actually guilty. It’s not a bad measure if you can make certain data assumptions, but Jan has a few alternatives which tend to be better than MAPE.
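
A quick way to see the small-actuals problem is to compute MAPE by hand. The sketch below uses made-up numbers (not the data behind Jan's graph) and a hypothetical `mape` helper: both forecasts miss by exactly 10 units at every point, but one series contains a single very small actual value.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, expressed as a percentage."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Illustrative values only: every forecast is off by exactly 10 units.
    steady_actual = [100, 100, 100, 100]
    steady_forecast = [110, 110, 110, 110]

    # Same absolute errors, but the last actual value is close to zero.
    small_actual = [100, 100, 100, 1]
    small_forecast = [110, 110, 110, 11]

    print(f"steady series MAPE: {mape(steady_actual, steady_forecast):.1f}%")
    print(f"series with one small actual: {mape(small_actual, small_forecast):.1f}%")
```

The absolute errors are identical, yet the one near-zero observation contributes a 1000% error by itself and dominates the average.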

Debugging Spark Applications in Visual Studio

Ed Elliott continues a series on spark-dotnet:

There are two approaches, one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached – think apps which spawn other processes and they fail in the startup routine. You can add a Debugger.Launch() to your program then when spark executes it, a prompt will be displayed and you can attach Visual Studio to your program. (as an aside I used to do this manually a lot by writing an __asm int 3 into an app to get it to break at an appropriate point, great memories but we don’t need to do that anymore luckily :).

The second approach is to start the spark-dotnet driver in debug mode which instead of launching your app, it starts and listens for incoming requests – you can then run your program as normal (f5), set a breakpoint and your breakpoint will be hit.

Read on to see how it’s done, as well as a possibly-accidental benefit to this.
