Press "Enter" to skip to content

Day: August 5, 2019

Replicating ACID Tables in Hive

Ashutosh Bapat shows off some of the improvements in Apache Hive replication:

Transactional tables in Hive support ACID properties. Unlike with non-transactional tables, data read from transactional tables is transactionally consistent, irrespective of the state of the database. Of course, this imposes specific demands on replication of such tables, which is why Hive replication was designed with the following assumptions:

1. A replicated database may contain more than one transactional table with cross-table integrity constraints.
2. A target may host multiple databases, some replicated and some native to the target. The databases replicated from the same source may have transactional tables with cross-database integrity constraints.
3. A user should be able to run read-only workloads on the target and should be able to read transactionally consistent data.
4. Since in Hive a read-only transaction requires a new transaction-id, the transaction-ids on the source and the target of replication may differ. Thus, transaction-ids cannot be used for reading transactionally consistent data across the source and the replicated target.

Read on to learn why these assumptions are in place and what they mean for replication.
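For reference, this replication flow is driven by Hive's REPL DUMP and REPL LOAD commands. The snippet below is only a rough sketch, not from the linked post: the database name, staging path, and event ID are made up, and the exact syntax varies by Hive version.

-- On the source cluster: bootstrap dump of the replicated database
REPL DUMP sales_db;

-- On the target cluster: load the dump from the directory returned by REPL DUMP
REPL LOAD sales_db FROM '/user/hive/repl/sales_db/dump_1';

-- Later, incremental dumps pick up from the last replicated event ID
REPL DUMP sales_db FROM 1000;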


Naive Bayes Predictions with Analysis Services

Dinesh Asanka shows how you can use the Naive Bayes algorithm in an Analysis Services data mining project:

Microsoft Naive Bayes is a supervised classification algorithm. The data set can be bi-class, which means it has only two classes: whether the patient is suffering from dengue or not, or whether your customers are bike buyers or not, are examples of bi-class data sets. There can be multi-class data sets as well.

Let us take the example we discussed in the previous article, the AdventureWorks bike buyer example. In this example, we will use the vTargetMail view in the AdventureWorksDW database.

In the data mining algorithm wizard, the Microsoft Naive Bayes algorithm should be selected, as shown in the image below.

Of mild interest is that it’s a two-class classifier here, but it’s a multi-class classifier in the (much) later ML.NET.


Distributed Transactions on Linux

Tejas Shah and crew announce distributed transactions with SQL Server on Linux:

With SQL Server 2017, a new era was heralded, with SQL Server becoming available to deploy on Linux (and Linux-based container) systems. While all functionality of the SQL Server engine was brought over as-is to SQL Server on Linux, some functionality which depended on Windows system processes, such as distributed transactions (which rely on the MSDTC service), was not brought over immediately.

Well, now your wait is over.


Hearing Differences in Power BI

David Eldersveld shows us one technique for adding a sound component to your Power BI dashboard:

Data sonification uses variations in audio to hear differences in data values. From an accessibility standpoint, data sonification offers another potential avenue to enrich your reports beyond methods pertaining to data visualization.

There are a few pieces that need to be assembled to enable data sonification that works in both Power BI Desktop and the Power BI Service. While audio has been used in the Power BI Service for a while, Power BI Desktop has been silent until now. This post attempts to show how to produce audio tones in Power BI for greater accessibility. It also demonstrates how to blend data with a standard range of audio pitches.

It’s an interesting idea.


Finding Memory-Rich Queries

Matthew McGiffen wants to find the queries which demand the largest memory grants:

I had a server that looked like it had been suffering from memory contention. I wanted to see what queries were being run that had high memory requirements. The problem was that it wasn’t happening right now – I needed to be able to see what had happened over the last 24 hours.

Enter Query Store. The runtime stats captured by Query Store include details relating to memory.

Click through for a script which retrieves this data over a time frame.
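As a minimal sketch of the idea (not Matthew's exact query), the memory figures live in sys.query_store_runtime_stats and are reported in 8 KB pages, so something along these lines surfaces the biggest consumers over the last 24 hours:

-- Top memory consumers from Query Store over the last 24 hours
-- (the *_query_max_used_memory columns are in 8 KB pages, hence the * 8 for KB)
SELECT TOP (20)
       q.query_id,
       qt.query_sql_text,
       rs.count_executions,
       rs.avg_query_max_used_memory * 8 AS avg_max_used_memory_kb,
       rs.max_query_max_used_memory * 8 AS max_max_used_memory_kb
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_runtime_stats_interval AS rsi
    ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
JOIN sys.query_store_plan AS p
    ON rs.plan_id = p.plan_id
JOIN sys.query_store_query AS q
    ON p.query_id = q.query_id
JOIN sys.query_store_query_text AS qt
    ON q.query_text_id = qt.query_text_id
WHERE rsi.start_time >= DATEADD(HOUR, -24, SYSUTCDATETIME())
ORDER BY rs.avg_query_max_used_memory DESC;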


ClassNotFoundException and .NET Spark

Ed Elliott takes us through two causes for a ClassNotFoundException when running a Spark job with .NET Spark:

There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark.

To fix the issue, you need to use the new package name, which adds an extra dotnet near the end. Change:

spark-submit --class org.apache.spark.deploy.DotnetRunner

Click through to see what you should change this line of code to read. If that change doesn’t fix your problem, Ed has a broader solution.
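If you just need the short version: assuming version 0.4.0 or later of the .NET for Apache Spark package, the runner class moved into a dotnet package, so the corrected argument should look roughly like this (Ed's post has the full context, plus the second cause):

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner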


Adding Aggregates to Table.Profile

Chris Webb shows us how to add additional aggregates to Table.Profile in M:

A few years ago I blogged about the Table.Profile M function and how you could use it to create a table of descriptive statistics for your data:

https://blog.crossjoin.co.uk/2016/01/12/descriptive-statistics-in-power-bim-with-table-profile/

Since that post was written, a new optional second parameter called additionalAggregates has been added to the function, which allows you to add your own custom columns containing aggregate values to the output of Table.Profile, so I thought I'd write a follow-up on how to use it.

Click through for that follow-up.
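As a rough illustration of the shape of that parameter (the Source table and column names here are hypothetical, and Chris's post is the authoritative reference), each additional aggregate is a list containing a new column name, a type-check function, and an aggregation function:

Table.Profile(
    Source,
    {
        {"Median", each Type.Is(_, type number), List.Median},
        {"Mode", each Type.Is(_, type number), List.Mode}
    }
)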


Extended Filtering in DAX

Matt Allington continues a discussion on the FILTER() function in DAX:

The new formula follows the rule "don't filter a table if you can filter a column". But in this case the column and the table have the same cardinality, so there is little benefit there. Also, the new formula requires a second CALCULATE() and SUM() inside the FILTER() function.

This is required because the column Customers[YearlyIncome] is no longer in the same table that FILTER() is iterating. The FILTER() function is iterating a virtual, single-column table that contains all customer keys in the customer table. The column Customers[YearlyIncome] doesn't exist in this virtual table, it exists in the Customers table, so you must wrap the column in an aggregation function, SUM() in this case.

Further, as the FILTER() function iterates in a row context through the virtual table, the virtual relationship does not filter the connected tables UNLESS you specifically tell the formula to do so. Technically, to make the filter propagate from the new virtual table created by ALL(Customers[CustomerKey]), we need to convert the row context into an equivalent filter context via context transition. Context transition is triggered by the inner CALCULATE() inside the FILTER() function in this case.

Read on for several tips for efficient filtering.
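To make that concrete, here is a sketch of the kind of measure being described (the table, column, and measure names plus the 100,000 threshold are purely illustrative): the outer CALCULATE applies the filter, FILTER() iterates the virtual single-column table of customer keys, and the inner CALCULATE triggers context transition so SUM(Customers[YearlyIncome]) is evaluated per customer.

High Income Sales :=
CALCULATE (
    [Total Sales],
    FILTER (
        ALL ( Customers[CustomerKey] ),
        CALCULATE ( SUM ( Customers[YearlyIncome] ) ) >= 100000
    )
)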
