Azure SQL Data Warehouse Date Dimensions

Meagan Longoria shows how to create a date dimension in Azure SQL Data Warehouse:

Most data warehouses and data marts require a date dimension or calendar table. Those of us that have been building data warehouses in SQL Server for a while have collected our favorite scripts to build out a date dimension. For a standard date dimension, I am a fan of Aaron  Bertrand’s script posted on MSSQLTips.com. But the current version (as of Aug 8, 2016) of Azure SQL Data Warehouse doesn’t support computed columns, which are used in Aaron’s script.

Click through for the script.

Feed The CPUs

SQL Sasquatch is starting a new series on optimizing disk write to maximize CPU throughput:

When I work with SQL Server batch-controlled workflows, I use the theory “feed the CPUs”.  That’s the simplest positive adaptation I could come up with of Kevin Closson’s paradigm “Everything is a CPU problem” 🙂

What I mean by “Feed the CPUs” is that memory and disk response times are primary factors determining the maximum rate for the CPUs to process the data.  Nuts & bolts of such a model for SQL Server are slightly different than a similar model for Oracle.  SQL Server access to persistent data is always through database cache, while Oracle uses shared access to database cache in SGA and private access to persistent data through direct access in PGA.

Click through for more details.

Amazon Machine Learning

Ujjwal Ratan uses patient readmission data to demonstrate Amazon Machine Learning:

The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.

Create a similar application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on traffic routed to your application. The following is a screenshot from the application:

I think that Azure ML is still ahead of Amazon’s ML solution, but I’m happy to see the competition.

Explaining Yarn Container Memory Allocations

Kevin Feasel

2016-08-08

Hadoop

Skumar T explains container sizes in Yarn:

So jobs on yarn cluster runs in individual containers which is allocated by Node Manager which in turn gets permissions from Resource Manager.

So few configuration parameters of node manager those are important in context of jobs running in the containers.

–>yarn.nodemanager.resource.memory-mb  8192(value)

Amount of physical memory, in MB, that can be allocated for containers.

–>yarn.nodemanager.pmem-check-enabled  true(value)

Whether physical memory limits will be enforced for containers.

The bottom half of the article goes into an extended example.

While Loops

Kevin Feasel

2016-08-08

Syntax

Lukas Eder discovers that Oracle’s PL/SQL has WHILE loops:

In SQL, everything is a table (see SQL trick #1 in this article), just like in relational algebra, everything is a set.

Now, PL/SQL is a useful procedural language that “builds around” the SQL language in the Oracle database. Some of the main reasons to do things in PL/SQL (rather than e.g. in Java) are:

  • Performance (the most important reason), e.g. when doing ETL or reporting
  • Logic needs to be “hidden” in the database (e.g. for security reasons)
  • Logic needs to be reused among different systems that all access the database

Much like Java’s foreach loop, PL/SQL has the ability to define implicit cursors (as opposed to explicit ones)

The WHILE loop is a little more helpful in the SQL Server world for doing things like deleting lots of rows in small batches, but I agree with Lukas’s sentiment:  if you start writing a WHILE loop, it’s best to sit back and think about whether this is the best decision.

Checkpoint Behavior Changes

Mike Ruthruff discusses changes in checkpoint behavior in SQL Server 2016:

The following are the primary changes which will impact behavior of checkpoint in SQL Server 2016.

  1. Indirect checkpoint is the default behavior for new databases created in SQL Server 2016. Databases which were upgraded in place or restored from a previous version of SQL Server will use the previous automatic checkpoint behavior unless explicitly altered to use indirect checkpoint.

  2. When performing a checkpoint SQL Server considers the response time of the I/O’s and adjusts the amount of outstanding I/O in response to response times exceeding a certain threshold. In versions prior to SQL Server 2016 this threshold was 20ms. In SQL Server 2016 the threshold is now 50ms. This means that SQL Server 2016 will wait longer before backing off the amount of outstanding I/O it is issuing.

  3. The SQL Server engine will consolidate modified pages into a single physical transfer if the data pages are contiguous at the physical level. In prior versions, the max size for a transfer was 256KB. Starting with SQL Server 2016 the max size of a physical transfer has been increased to 1MB potentially making the physical transfers more efficient. Keep in mind these are based on continuity of the pages and hence workload dependent.

Definitely read the whole thing.

Lambda Architecture Primer

James Serra explains the Lambda architecture:

A brief explanation of each layer:

Data Consumption: This is where you will import the data from all the various source systems, some of which may be streaming the data.  Others may only provide data once a day.

Stream Layer: It provides for incremental updating, making it the more complex layer.  It trades accuracy for low latency, looking at only recent data.  Data in here may be only seconds behind, but the trade-off is the data may not be clean.

Batch Layer: It looks at all the data at once and eventually corrects the data in the stream layer.  It is the single version of the truth, the trusted layer, where there is usually lots of ETL and a traditional data warehouse.  This layer is built using a predefined schedule, usually once or twice a day, including importing the data currently stored in the stream layer.

Presentation Layer: Think of it as the mediator, as it accepts queries and decides when to use the batch layer and when to use the speed layer.  Its preference would be the batch layer as that has the trusted data, but if you ask it for up-to-the-second data, it will pull from the stream layer.  So it’s a balance of retrieving what we trust versus what we want right now.

I hate the fact that this is named “lambda.”  That’s a term which is way too overloaded in computer science.  You have the architecture, lambda functions, and AWS lambda, all of which are utterly different and yet end up in the same conversation.  This ends up confusing people unless you very specifically say things like “We’re going to use the AWS lambda service to create lambda functions to feed data from sensors into our lambda architecture.”  And even then people still get confused.

Copying Data With Data Factory

Kevin Feasel

2016-08-08

Cloud, ETL

Ginger Grant shows how to copy data from an Azure SQL Database to Azure Blob Storage using Data Factory:

Because we need a connection to a database and a Azure Blob, two Linked Services are required, one for each different type. Prior to completing this step, create an Azure Blob storage account by clicking on Add on All Resources. Create the second Linked service, like the first. Click on New data store then select Azure Storage. Using the template for an Azure Blob Storage linked services, I have modified it below adding the “hubName” as it is required

There’s a lot of JSON to write here, if you’re into that sort of thing.

Using Registered Server Groups

Kevin Hill shows a good use case for registered server groups:

In my last post I hoped to convince you to pay attention to all of the various “Login Failed for user…” messages that you see in your SQL Server ERRORLOGS.   ALL of them.

Yes, some you can ignore based on the environment or the person.   Jim the web guy on a Dev box is just not that much of a security threat (unless you let him touch Prod, but that’s a different post).

Some of you have one or two servers, and reviewing ERRORLOGs is no big deal to do manually.  More of you have tens and tens of them.   Some of you have thousands (I’m looking at you in Managed Hosting environments such as Verizon, Rackspace, etc. where customers pay you to do this).

The next step up from there is Central Management Servers.

Managing Power BI Group Workspace Members

Melissa Coates shows how to mange Power BI groups with larger numbers of members:

Dozens or hundreds of users in a group is what is prompting me to write this post. Manually managing the members within the Power BI workspace is just fine for groups with a very small number of members – for instance, your team of 8 people can be managed easily. However, there are concerns with managing members of a large group for the following reasons:

  • Manual Maintenance. The additional administrative effort of managing a high number of users is a concern.
  • Risk of Error. Let’s say there is an Active Directory (A/D) group that already exists with all salespersons add to the group. System admins are quite accustomed to centrally managing user permissions via A/D groups. Errors and inconsistencies will undoubtedly result when changes in A/D are coordinated with other applications, but not replicated to the Power BI Group.Depending on how sensitive the data is, your auditors will also be unhappy.

To avoid the above two main concerns, I came up with an idea. It didn’t work unfortunately, but I’m sharing what I learned with you anyway to save you some time.

Even though Melissa’s plan didn’t work, it’s a good concept, so I recommend reading.

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031