Press "Enter" to skip to content

Month: March 2023

Databricks Power Tools in VS Code

Gerhard Brueckl has some tools for us:

As you probably know, we at paiqo have developed our Databricks extension for VSCode over the last years and are constantly adding new features and improving user experience. The most notable features are probably the execution of local notebooks against a Databricks cluster, a nice UI to manage clusters, jobs, secrets, repos, etc. and last but not least also a browser for your workspace and DBFS to sync files locally.

In February 2023 Databricks also published its own official VSCode extension which was definitely long awaited by a lot of customers (blogextension). It allows you to run a local file on a Databricks cluster and display the results in VSCode again. Alternatively you can also run the code as a workflow. I am sure we can expect much more features in the near future and Databricks investing in local IDE support is already a great step forward!

As you can imagine, I am working very closely with the people at Databricks and we are happy to also announce the next major release of our Databricks VSCode extension 2.0 which now also integrates with the official Databricks extension! To avoid confusion between the two extensions we also renamed ours to Databricks Power Tools so from now on you will see two Databricks icons on the very left bar in VSCode.

Click through to read more in the announcement and some of the things which have changed as a result of version 2.0.

Comments closed

Tips for Scaling Cassandra Clusters

Mario Tavares wants more zoom:

When the use case aligns with the architectural limitations, Cassandra excels at storing and accessing datasets up to petabytes in volume, delivering impressive throughput. As the data or workload volume grows, we expand the cluster linearly, ensuring consistent performance.

However, even when we adhere to the documentation and best practices and create an effective data model, we might encounter underperforming nodes or unexpected challenges with throughput scaling after a cluster expansion—and it’s not always clear what causes the imbalance. Linear scalability relies on the assumption that workload and data are evenly distributed across all nodes in a cluster, and the cluster capacity relates directly to the number of nodes. Sometimes, these conditions aren’t met, affecting linear scalability. So, we strive for scalability and balance and are willing to fulfill the necessary conditions.

Read on for a few common performance issues and what you can do about them.

Comments closed

Finding SQL Server Columns with Defaults

Tom Collins sticks to the defaults:

Do you have a sql query to check every  sql server database  column and identify if a default value is applied to the column?  

Click through for a script which does just that. Tom’s query goes against system views and there’s a separate way to get those details from sys.default_constraints if you prefer to have a second option. If you’re on an older version of SQL Server where CONCAT_WS() doesn’t exist, concatenate it yourself.

SELECT
	CONCAT_WS('.', QUOTENAME(OBJECT_SCHEMA_NAME(c.object_id)), QUOTENAME(OBJECT_NAME(c.object_id))) AS TableName,
	c.name AS ColumnName,
	dc.name AS DefaultConstraintName,
	dc.definition AS DefaultConstraintDefinition
FROM sys.default_constraints dc
	INNER JOIN sys.columns c
		ON dc.parent_object_id = c.object_id
		AND dc.parent_column_id = c.column_id;
Comments closed

Best Practices Assessment for Azure Arc-Enabled SQL Server Instances

Ganapathi Varma Chekuri takes us through an assessment:

Best practices assessment provides a mechanism to evaluate the configuration of your SQL Server. Once the best practices assessment feature is enabled, your SQL Server instance and databases are scanned to provide recommendations for things like SQL Server and database configurations, index management, deprecated features, enabled or missing trace flags, statistics, etc. Assessment run time depends on your environment (number of databases, objects, and so on), with a duration from a few minutes, up to an hour.

If you’re familiar with the assessment on Azure VMs, this is quite similar, though it extends to on-premises machines or VMs running in other cloud providers. This does require installing the agent and paying for an Arc-Enabled SQL Server instance, so it’s not free.

Comments closed

Contrasting Azure IoT Hub and Event Hub

Brian Bønk lays out a quick comparison:

When working with Azure Data Explorer and loading data to the storage engine, you might have some streaming devices or services that should land in the engine.

Azure provides two out-of-the-box services:

  1. Azure IoT Hub
  2. Azure Event Hub

At first glance it seems like teh two services are doing the exact same thing – sending events through to other services in Azure. But there are some differences.

Read on to see what these differences are.

Comments closed

Applying Quality Assurance Practices to Data Science

Devin Partida bridges the gap:

The world runs on data. Data scientists organize and make sense of a barrage of information, synthesizing and translating it so people can understand it. They drive the innovation and decision-making process for many organizations. But the quality of the data they use can greatly influence the accuracy of their findings, which directly impacts business outcomes and operations. That’s why data scientists must follow strong quality assurance practices.

Read on for seven practices which can help data scientists achieve better outcomes.

Comments closed

The Legacy of Big Data

Adam Bellemare looks back:

Big Data was going to change the way everything worked. We were about to solve every financial, medical, scientific, and social problem known to humankind. All it would take was a great big pile of data and some way to process it all. 

But somewhere along the line, the big data revolution just sort of petered out, and today you barely hear anything about big data. 

Click through for Adam’s explanation, which is a more detailed form of “Some stuff worked out and became ubiquitous in other ways; others fell off the map.”

But I’m going to snag one more quotation here from Adam:

And finally, big data has shown us that no matter how hard we try, there’s simply no escaping from the inevitable convergence to a full SQL API.

Me: Laughs in Feasel’s Law.

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

1 Comment

Estimating and Managing Pod Spread in AKS

Joji Varghese talks pod distribution in Azure Kubernetes Service:

In Azure Kubernetes Service (AKS), the concept of pod spread is important to ensure that pods are distributed efficiently across nodes in a cluster. This helps to optimize resource utilization, increase application performance, and maintain high availability.

This article outlines a decision-making process for estimating the number of Pods running on an AKS cluster. We will look at pod distribution across designated node pools, distribution based on pod-to-pod dependencies and distribution where pod or node affinities are not specified. Finally, we explore the impact of pod spread on scaling using replicas and the role of the Horizontal Pod Autoscaler (HPA). We will close with a test run of all the above scenarios.

Read on for tips, as well as a few web tools, which you can use to estimate and control pod spread in AKS.

Comments closed

Tips for Using a Data Lakehouse

James Serra shares some advice:

As I mentioned in my Data Mesh, Data Fabric, Data Lakehouse presentation, the data lakehouse architecture, where you use a data lake with delta lake as a software layer and skip using a relational data warehouse, is becoming more and more popular. For some customers, I will recommend “Use a data lake until you can’t”. What I mean by this is to take the following steps when building a new data architecture in Azure with Azure Synapse Analytics:

Click through for six notes.

Comments closed

Parallelization in DirectQuery

Chris Webb shares some insight:

Recently we announced an important new optimisation for DirectQuery datasets: the ability to run (some) of the queries generated by a single DAX query in parallel. You can read the blog post here:

https://powerbi.microsoft.com/en-za/blog/query-parallelization-helps-to-boost-power-bi-dataset-performance-in-directquery-mode/

A few of us on the Power BI CAT team have tested this out with customers and seen some great results, so I thought I’d write a post illustrating the effect this optimisation can have and explaining when it can and can’t help.

Chris has examples of great success, as well as not-so-great success and utter failure, and explains the why behind each outcome.

Comments closed