Press "Enter" to skip to content

Day: December 6, 2021

Getting Around in Spark

Tomaz Kastrun continues a series on Apache Spark. Day 3 shows off the CLI and web UI:

In CLI we will type and run a simple Scala script and observe the behaviour in the WEB UI.

We will read the text file into RDD (Resilient Distributed Dataset). Spark engine resides on location:

/usr/local/Cellar/apache-spark/3.2.0 for MacOS and
C:\SparkApp\spark-3.2.0-bin-hadoop3.2 for Windows (based on the blogpost from Dec.1)

Day 4 compares local mode versus cluster mode:

Finding the best way to write Spark will be dependent of the language flavour. As we have mentioned, Spark runs both on Windows and Mac OS or Linux (both UNIX-like systems). And you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+. And the language flavour can also determine which IDE will be used.

Day 5 shows the setup of a Spark cluster:

Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment. If you decide to use Hadoop and YARN, there is usually the installation needed to install everything on nodes. Installing Java, JavaJDK, Hadoop and setting all the needed configuration. This installation is preferred when installing several nodes. A good example and explanation is available here. you will also be installing HDFS that comes with Hadoop.

Check out all three posts and get caught up on Spark.

Comments closed

Indexing and Window Functions

I continue a series on window functions in SQL Server:

If you’ve been around the block with window functions, you’ve probably heard of the POC indexing strategy: Partition by, Order by, Covering. In other words, with a query, focus on the columns in the PARTITION BY clause (in order!), then the ORDER BY clause (again, in order!), and finally other columns in the SELECT clause to make the index covering (not in order! though it doesn’t hurt!).

But do read on to understand why this is not sufficient.

Comments closed

Troubleshooting Timeouts with Import Refresh

Chris Webb begins a series on troubleshooting timeouts:

If you’re working with a large Power BI dataset and/or a slow data source in Import mode it can be very frustrating to run into timeout errors after you have already waited a long time for a refresh to finish. There are a number of different types of timeout that you might run into, and in this series I’ll look at a few of them and discuss some of the ways you can work around them.

In this post I’ll look at one of the most commonly-encountered timeouts: the limit on the maximum length of time an Import mode dataset refresh can take. 

Click through to see the limits and ways to (sort of) get around them.

Comments closed

Tracking Value Changes in SQL Server

Tomas Zika wants to track a change:

This time a colleague from work asked how to best find a culprit that has been changing a specific cell in a table. It could be an automated process, application logic, application user or even an ad-hoc statement – we didn’t know. The table has many different access patterns, some of which are frequent. Ideally, we don’t want to monitor everything and sift through it.

We wanted to learn the who, the how and then ask why? If you like to know the whole journey, read on. Otherwise, you can skip to section Eureka moment

Click through for the Eureka moment. It is important to embrace the power of “and.”

Comments closed

The User-Assigned Managed Identity in ADF

Asanka Padmakumara takes a look at the user defined managed identity:

If you are familiar with Managed Identity concepts in ADF, each ADF instance comes with own System Assigned Managed Identity (MI). We can use that MI to control ADF’s access to any data sources which support Azure AD based authentication. This is considered to be the most secured and recommended way of authenticating ADF with cloud systems. If not, you can use Azure Key vault to store credentials. Let’s take an example on to discuss how User Assigned Managed Identity helps for manage access within multiple ADF environment.

Click through to see how the user assigned managed identity makes life better.

Comments closed

SQL Server Backend for Django

Warren Chu announces a new version of the SQL Server 3rd Party Backend for Django:

We have released version 1.1 of the SQL Server 3rd Party Backend for Django. This release contains support for the upcoming release of Django 4.0, as well as a number of issue fixes.

Our plan is to time releases to coincide with major releases of Django and SQL Server, to ensure users of this project can keep up to date with Django while continuing to use SQL Server as a backend.

Read on to see what this entails.

Comments closed