Optimizing HBase In HDInsight

Ashish Thapliyal links to a 30-minute presentation on HBase optimization:

This session was presented by Nitin Verma (Sr. Software Engineer) and Pravin Mittal (Principal Engineering Manager) @ HBaseCon 2016. The session goes deeper into success story of enabling a big internal customer on HDInsight HBase.

HBase design is a totally different mindset from relational design, so you have to unlearn a lot of habits when moving over to it.

ODBC Driver 13.1

The SQL Server Blog reports that the Microsoft ODBC Driver for SQL Server has been updated to version 13.1:

Always Encrypted

You can now use Always Encrypted with the Microsoft ODBC Driver 13.1 for SQL Server. Always Encrypted is a new SQL Server 2016 and Azure SQL Database security feature that prevents sensitive data from being seen in plaintext in a SQL instance. You can now transparently encrypt the data in the application, so that SQL Server or Azure SQL Database will only handle the encrypted data and not plaintext values. If a SQL instance or host machine is compromised, an attacker can only access ciphertext of your sensitive data. Use the ODBC Driver 13.1 to encrypt plaintext data and store the encrypted data in SQL Server 2016 or Azure SQL Database. Likewise, use the driver to decrypt your encrypted data.

Check out the full list  of new features at the link above.

Reverse Engineering SSIS Packages

Ben Weissman shows how to use BimlOnline to reverse engineer an Integration Services package into its component Biml:

A few things to be aware of:

– Your file will be uploaded to and stored at BimlOnline so you may want to remove passwords etc.
– If you’re trying to figure out how to build a specific task in Biml but your file does way more that just that, consider creating (and uploading) a file that will only contain the task you’re looking for – this will keep the resulting Biml clean and easy to read.

This is extremely helpful for figuring out how to use third-party components with Biml.  If you want a local IDE, there’s always BimlStudio (which costs money).

Int To BigInt

Kendra Little walks through the process of expanding an int column into a bigint:

Sometimes you just can’t take the outage. In that case, you’ve got to proceed with your own wits, and your own code. This is tricky because changes are occurring to the table.

The solution typically looks like this:

  • Set up a way to track changes to the table – either triggers that duplicate off modifications or Change Data Capture (Enterprise Edition)

  • Create the new table with the new data type, set identity_insert on if needed

  • Insert data into the new table. This is typically done in small batches, so that you don’t overwhelm the log or impact performance too much. You may use a snapshot from the point at which you started tracking changes.

  • Start applying changed data to the new table

  • Make sure you’re cleaning up from the changed data you’re catching and not running out of space

  • Write scripts to compare data between the old and new tables to make sure you’re really in sync (possibly use a snapshot or a restored backup to compare a still point in time)

  • Cut over in a quick downtime at some point using renames, schema transfer, etc. If it’s an identity column, don’t forget to fix that up properly.

This method matches what I’ve done in zero downtime situations.

Also see Aaron Bertrand’s article on the same topic:

In part 3 of this series, I showed two workarounds to avoid widening an IDENTITY column – one that simply buys you time, and another that abandons IDENTITY altogether. The former prevents you from having to deal with external dependencies such as foreign keys, but the latter still doesn’t address that issue. In this post, I wanted to detail the approach I would take if I absolutely needed to move to bigint, needed to minimize downtime, and had plenty of time for planning.

Because of all of the potential blockers and the need for minimal disruption, the approach might be seen as a little complex, and it only becomes more so if additional exotic features are being used (say, partitioning, In-Memory OLTP, or replication).

At a very high level, the approach is to create a set of shadow tables, where all the inserts are directed to a new copy of the table (with the larger data type), and the existence of the two sets of tables is as transparent as possible to the application and its users.

Those are two good posts on this topic.

Billing Migration: Choosing A Database Product

Jyoti Shandil, et al, explain how they chose a database product for Netflix’s billing system:

AWS RDS MySQL: Ideally we would have gone with MySQL RDS as our backend, considering Amazon does a great job in managing and upgrading relational database as a service, providing multi-AZ support for high availability. However, the main drawback to RDS was the storage limit of 6TB. Our requirement at the time, was closer to 10TB.

AWS Aurora: AWS Aurora would have met the storage needs, but it was in beta at that time.

PostgreSQL: PostgreSQL is a powerful open source, object-relational database system, but we did not have much in house expertise using PostgreSQL. In the DC, our primary backend databases were Oracle and MySQL. Moreover, choosing PostgreSQL would have eliminated the option of a seamless migration to Aurora in future, as Aurora is based on the MySQL engine.

From there, they also explain some technical issues they found in migrating data.  Read the whole thing.  If you’re coming into this series blind, they also have part 1 and part 2 of the series, giving more of an architectural overview of their billing system.

DacFx Wrapper

Ed Elliott has a new Powershell module named DacFxed:

There is a solution? Well yes of course otherwise I wouldn’t have been writing this! DacFxed is a powershell module which:

  • 1. References the DacFx nuget package so updating to the latest version is simple
  • 2. Implements a hack (ooh) to allow contributors to be loaded from anywhere
  • 3. Is published to the powershell gallery so to use it you just do “Install-Module -Name DacFxed -Scope User -Force”
  • 4. Has a Publish-Database, New-PublishProfile and Get-DatabaseChanges CmdLets

Cool right, now a couple of things to mention. Firstly this is of course open source and available: https://github.com/GoEddie/DacFxed

This is a nice tool to deploy dacpac files using Powershell.  Check out the GitHub repo for more details.

Spatial Functions In Hive

Constantin Stanca has a couple of posts on using Hive to implement geospatial queries.  First, an overview:

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. This can be used as a standalone library, or combined with Spatial Framework for Hadoop to create a SQL like experience.

The Spatial Framework for Hadoop includes among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDF’s could be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase etc.

He follows that up with some pieces hive misses compared to SQL Server, Oracle, etc.:

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMS like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% missing functions is to contribute to ESRI’s open source repository: https://github.com/Esri/spatial-framework-for-hadoop. ESRI does not provide a commercial library for Hive including all spatial functions.

Be sure to check out that second link to get an understanding of exactly what’s missing.  Via Mark Herring.

U-SQL Updates

Michael Rys shows off what has been added to the August release for Azure Data Lake Analytics and U-SQL:

As part of the Azure Data Lake Analytics and U-SQL August 1st refresh, we released a couple of new, long awaited features, are fixing some confusing syntax and fixed several bugs

Read on for more details.

Deploying SSAS Cubes Using Powershell

Richie Lee has a Powershell script to deploy an Analysis Services cube:

Recently someone got in touch with me about one of my earliest posts. I realised I had made a mistake on that post and had updated. But I also realised that the script itself is actually part of a PowerShell module. So wanting to make things as easy as possible, and being the helpful guy that I am, I have altered that example into a script which I am posting here.

It’s worth stressing that the errors that output to the console are fairly generic, and that there is a log file that will contain all the information if there are any errors.

Sadly, you still need to write/tweak the big glob of XML, it seems.

DNS Aliases

Drew Furgiuele shows us how to use CNAME records to give us easy aliases for servers hosting SQL Server:

When you connect to a SQL Server instance, you’re most likely connecting directly to the host name of the server running that instance. So for example, if the host name of my instance is SQLSERVER-A, then in my SSMS connection (or application) I probably type in a host name offully qualified domain name (FQDN) of a server. If I want to move a database from one server to another, or stand up a new server and move everything over to it, from now on I’ll need to make sure I type in the new host name. And for a DBA, this is fine. We mostly identify our servers by the hosts they run on.

Developers and users, on the other hand, don’t always think like that. They each probably only care about one or two databases on a given instance, and depending on their release cycle even something as simple as changing a connection string might need to be a scheduled change. So when databases move or you build a new server you not only need to ensure as little downtime as possible from a system perspective but with as little impact to a user’s system too. And you can do that with the help of your network team and your local domain name system: DNS.

I’ve had great experiences with CNAME records masking actual server names.  This was most relevant in an environment in which devs just couldn’t remember which X-Men character was the production SQL Server and  which was QA.

Categories

July 2019
MTWTFSS
« Jun  
1234567
891011121314
15161718192021
22232425262728
293031