Deploying Azure Databricks in a Custom VNET

Abhinav Garg and Anna Shrestinian explain how you can use VNET injection with Azure Databricks:

To make the above possible, we provide a Bring Your Own VNET (also called VNET Injection) feature, which allows customers to deploy the Azure Databricks clusters (data plane) in their own-managed VNETs. Such workspaces could be deployed using Azure Portal, or in an automated fashion using ARM Templates, which could be run using Azure CLI, Azure Powershell, Azure Python SDK, etc.

With this capability, the Databricks workspace NSG is also managed by the customer. We manage a set of inbound and outbound NSG rules using a Network Intent Policy, as those are required for secure, bidirectional communication with the control/management plane. 

This is a good article if the defaults won’t get past corporate security.

SQL Server Numerology

Denis Gobo has some fun with SQL Server numbers of importance:

I was troubleshooting a deadlock the other day and it got me thinking…. I know the number 1205 by heart and know it is associated to a deadlock.  What other numbers are there that you can associate to an event or object or limitation. For example 32767 will be known by a lot of people as the database id of the ResourceDb, master is 1, msdb is 4 etc etc.

So below is a list of numbers I thought of

Leave me a comment with any numbers that you know by heart

BTW I didn’t do the limits for int, smallint etc etc, those are the same in all programming languages…so not unique to SQL Server

Read on for Denis’s list.

Ways to Use Power BI Dataflows

Melissa Coates gives us three use patterns for Power BI Dataflows:

In this first option, Power BI handles everything. We use the web-based Power Query Online tool for structuring the data. Power BI handles scheduling the data refresh.

The underlying data behind the dataflow is stored in a data lake. However, since it’s fully managed, this data lake is not directly accessible or visible to the customer. As with most cloud-based implementations, the infrastructure is hidden under the covers. This is what is happening if your users are utilizing dataflows currently but you haven’t specified a data lake account in the Power BI admin center.

Melissa gives us a great summary of the three patterns, so read the whole thing.

Comparing Oracle RAC to SQL Server Availability Groups

Kellyn Pot’vin-Gorman explains the difference between Oracle RAC and SQL Server Availability Groups:

There is a constant rumble among Oracle DBAs- either all-in for Oracle Real Application Cluster, (RAC) or a desire to use it for the tool it was technically intended for. Oracle RAC can be very enticing- complex and feature rich, its the standard for engineered systems, such as Oracle Exadata and even the Oracle Data Appliance, (ODA). Newer implementation features, such as Oracle RAC One-Node offered even greater flexibility in the design of Oracle environments, but we need to also discuss what it isn’t- Oracle RAC is not a Disaster Recovery solution.

Click through for a good high-level contrast, as these are quite different products.

SSMS 18.0 RC1

Dinakar Nethi announces the first release candidate of SQL Server Management Studio 18.0:

As we get closer to the General Availability of SQL Server Management Studio (SSMS) 18, we have decided to have a quick release of the Release Candidate (RC) build.

You can download SSMS 18.0 RC1 today and for more details on what’s included, please see the Release Notes.

It’s been in preview for a while but things are now getting real. I’ll have to check later on if it fixes the bug I found with PolyBase on SQL Server.

Compacting R Libraries

Kevin Feasel

2019-03-28

R

Dirk Eddelbuettel shows how you can save a lot of space by stripping excess information from R packages:

Back in August of 2017, we wrote two posts #9: Compating your Share Libraries and #10: Compacting your Shared Libraries, After The Buildabout “stripping” shared libraries. This involves removing auxiliary information (such as debug symbols and more) from the shared libraries which can greatly reduce the installed size (on suitable platforms – it mostly matters where I work, i.e. on Linux).

There’s a pretty good space savings in the tidyverse package. H/T R-Bloggers.

Consistency Versus Availability with Kafka

Kevin Feasel

2019-03-28

Hadoop

Sourabh Verma lists some of the areas where you can make a conscious tradeoff between consistency and availability with Apache Kafka:

1. Cluster Size (N): Number of nodes/brokers in the Kafka cluster, we should have 2x+1, i.e. at least 3 nodes or more in an odd number.
2. Partitions: We write/publish data/event into a topic which is divided into partitions (by default 1), but we should have M times N, where can be any integer number, i.e. M >= 1, to achieve more parallelism and partitioning of data over the cluster.
3.Replication Factor: determines the number of copies (including the original/Leader) of each partition in the cluster. All replicas of a partition exist on separate node/broker, and we should never have R.F. > N, but at least 3. 
We recommend having 3 RF with 3 or 5 nodes cluster. This helps in having both availabilities as well as consistency.

Click through for several more tradeoff points.

Connecting to Jira with Powershell

Kevin Marquette has a new Powershell module:

JiraPS is a wonderful module that is great at a lot of things. There several great people that have put a lot of time into it for it to have such good feature coverage. It was designed to be approachable and do a lot of validation for you. JiraPS does not have a good user story around bulk operations with its issue commands and thats what I need from it the most.

It also has a large user base with hundreds of thousands of downloads and is used in a lot of organizations. Every change made to that module at this point needs to pay very close attention to backwards compatibility.

Now there are two modules depending on your use case.

Bring .NET Support to Spark

I have a request that you vote up a Spark issue:

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included in here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.

You an click through and read everything I have to say, but do go to the Spark ticket and vote for .NET support.

Against Surrogate Keys on Junction Tables

Lukas Eder explains the costs of surrogate keys on tables intended to join multiple tables together:

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Lukas makes a good argument here.

Categories

March 2019
MTWTFSS
« Feb Apr »
 123
45678910
11121314151617
18192021222324
25262728293031