Day: December 21, 2020

TF-IDF in .NET for Spark, Updated

Ed Elliott has been busy:

Apache Spark has had a machine learning API for quite some time and this has been partially implemented in .NET for Apache Spark.

In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post; the first version was written before version 1 of .NET for Apache Spark, when a vital piece of the implementation was missing, which meant that although we could build the model in .NET, we couldn’t actually use it. The necessary functionality is now available and so I am updating the post. To see the previous version go to: https://the.agilesql.club/2020/07/tf-idf-in-.net-for-apache-spark-using-spark-ml/

Read on for more information, as well as a call to action.

Comments closed

Apache Spark Basics in Azure Synapse Analytics

Euan Garden shows off some Apache Spark functionality in Azure Synapse Analytics:

Apache Spark has been a long-time favorite tool amongst data engineers and data scientists; it is well known for handling large scale data processing and complex machine learning workloads.

Azure Synapse Analytics offers a fully managed and integrated Apache Spark experience. By leveraging Apache Spark in Azure Synapse, you can benefit from integrated security, fully managed provisioning, and tight coupling to other Azure services, such as SQL databases (dedicated and serverless), Azure Key Vault, ADLS Gen2, and Azure Blob Storage, as well as fast-starting, high-performance compute instances.

Click through for the demo.

The Importance of Composite Models

Paul Turley lays out the significance of composite models in Power BI:

There have been many attempts by Microsoft and other vendors to create a data modelling architecture that provides for fast access to cached data, direct access to live data and scaled-out connections to established data models. Remember ROLAP and HOLAP storage in multidimensional cubes? These were great concepts with significant trade-off limitations. No other vendor has anything like this. Way back in the day, Microsoft jumped on the Ralph Kimball bandwagon to promote the idea that a company should have a “one version of the truth” exposed through their data warehouse and cubes or semantic data models. They met customer demand and gave us a BI tool that, in order to bring data together from multiple sources, makes it easy to create a lot of data silos. Arguably, there are design patterns to minimize data duplication but to use governed datasets, self-service report designers are limited to connecting to large, central models that might only be authored and managed by IT. This new feature can restore balance to the force and bring us back to “one version of the truth” again.

Read on for Paul’s early thoughts on the feature.

Small Multiples in Power BI

Chris Webb takes us through a new feature in Power BI:

While the long-awaited small multiples feature that previewed in the December 2020 release is an obvious boost to Power BI’s data visualisation capabilities, did you know that you can use it to improve report performance too?

Earlier this year I wrote blog posts showing how you can improve report performance by showing the same amount of data in fewer visuals (for example by replacing several cards with a single table) and how the number of visuals on a page affects report performance even if they aren’t displaying any data; several other people have written similar posts too. Small multiples are just another way you can replace several visuals with a single visual that displays the same data.

I liked this feature for the visualization improvements, but if you can throw in performance improvements as well, I’m sold.

Correlated Subqueries which Don’t

Daniel Hutmacher gives us an eye test:

The developer wrote this pretty little query to show us which accounts are up for review (which in our case means they have a “30” flag).

SELECT account, balance, 'For review' AS [status]
FROM #accounts
WHERE account IN (SELECT account FROM #accountFlags WHERE flag=30)
ORDER BY account;

Did you spot it?

I did, but in fairness, I’ve been burned enough times by this that I check for it.
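
The excerpt doesn't include the table definitions, so here is a minimal sketch of the general pitfall, under the assumption that the flags table has no column called account. The table names, the acct column and the sample rows are all made up for illustration, and it runs on SQLite through Python rather than SQL Server, though both engines resolve an unmatched column name in a subquery against the outer query.

```python
import sqlite3

# Hypothetical schema: the flags table names its key "acct", not "account".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account INTEGER, balance NUMERIC);
    CREATE TABLE accountFlags (acct INTEGER, flag INTEGER);
    INSERT INTO accounts VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO accountFlags VALUES (2, 30);
""")

# "account" is not a column of accountFlags, so it silently binds to the
# outer accounts.account. The subquery becomes correlated and every row
# matches itself whenever at least one flag = 30 row exists.
unqualified = conn.execute("""
    SELECT account FROM accounts
    WHERE account IN (SELECT account FROM accountFlags WHERE flag = 30)
    ORDER BY account
""").fetchall()

# Qualifying every column with a table alias makes the intent explicit,
# and a wrong column name would raise an error instead of running.
qualified = conn.execute("""
    SELECT a.account FROM accounts AS a
    WHERE a.account IN (SELECT f.acct FROM accountFlags AS f WHERE f.flag = 30)
    ORDER BY a.account
""").fetchall()

print(unqualified)  # all three accounts come back
print(qualified)    # only the flagged account comes back
```

Aliasing the tables and prefixing every column is the habit that catches this: with the alias in place, a mistyped column is a compile-time error rather than a query that quietly returns everything.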

Sync Logins between Availability Group Replicas

Taryn Pratt has a process:

Always On Availability Groups can support up to nine availability replicas, and while we don’t use anywhere near that many replicas in each of our clusters, we do have 2 replicas per cluster (3 servers total), with the replicas being used as a readable secondary.

Since we use readable secondaries in our environments, the application needs to connect to both the primary and the secondary servers with the same login. The catch is, logins don’t automatically sync across replicas. If the logins don’t sync, the application won’t connect to a secondary, which results in login failures.

Read on for one way to solve the problem.

Foreign Keys and Updating the Parent

Hugo Kornelis concludes a mini-series on foreign key constraints:

Welcome to part fourteen of the plansplaining series, where I wrap up the mini-series on how simple foreign keys have huge effects on execution plans for data modifications.

We already looked at inserting data in the referencing (child) table, and at deleting data from the referenced (parent) table as well as updates in the child table. We did not and will not look at deleting from the child table or inserting in the parent table: those operations can by default never violate the foreign key constraint, so no additional logic is needed.

So that means there is only one thing left to explore: updating the parent. Perhaps surprisingly, this is actually quite complex, so it warrants an entire post of its own.

Read on to see why.
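
As a rough illustration of which operations can and cannot break the constraint, here is a small sketch. It uses SQLite through Python rather than SQL Server, so it says nothing about execution plans, and the parent and child tables and their rows are invented for the example.

```python
import sqlite3

# isolation_level=None keeps the connection in autocommit mode, so each
# statement succeeds or fails on its own.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id)
    );
    INSERT INTO parent VALUES (1), (2);
    INSERT INTO child VALUES (10, 1);
""")

def attempt(sql):
    """Run one statement and report whether the foreign key allowed it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "foreign key violation"

results = {
    # Inserting into the parent can never orphan a child row.
    "insert parent": attempt("INSERT INTO parent VALUES (3)"),
    # Changing a key that a child row still references must be rejected.
    "update referenced parent": attempt("UPDATE parent SET id = 99 WHERE id = 1"),
    # Changing a key that nothing references is fine.
    "update unreferenced parent": attempt("UPDATE parent SET id = 98 WHERE id = 2"),
    # Deleting from the child can never violate the constraint either.
    "delete child": attempt("DELETE FROM child WHERE id = 10"),
}

for operation, outcome in results.items():
    print(f"{operation}: {outcome}")
```

The parent update is the only case here where the engine has to go and look at the child table, which hints at why it is the complicated one.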

T-SQL Additions to Serverless SQL Pools

Jovan Popovic lays out some of the T-SQL syntax added to serverless SQL pools in Azure Synapse Analytics:

Serverless Synapse SQL pools in Azure Synapse Analytics have a new set of features that will enable you to analyze your Azure data more efficiently. The new Transact-SQL (T-SQL) language features that you can use in serverless SQL pools are STRING_AGG, OFFSET/FETCH, PIVOT/UNPIVOT, SESSION_CONTEXT, and CONTEXT_INFO.

Old T-SQL hands will likely know what all of this does, but click through if something looks unfamiliar. All of this is available in SQL Server 2017 and later: PIVOT/UNPIVOT and CONTEXT_INFO predate 2008, OFFSET/FETCH arrived in 2012, SESSION_CONTEXT in 2016, and STRING_AGG() in 2017.
