Press "Enter" to skip to content

Month: July 2019

Don’t Truncate Facts and Dimensions when Loading Data

Meagan Longoria explains why a truncate-and-reload strategy for data warehouses isn’t a good look:

Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me and I usually recommend to redesign the load. My thoughts below on why this is an anti-pattern are true for using the actual TRUNCATE TABLE statement as well as executing a DELETE statement with no WHERE clause.

Read on for some great advice, including an exception to the rule.
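For contrast, here is a minimal T-SQL sketch of the kind of incremental pattern to use instead (table and column names are hypothetical, not Meagan’s example): update the rows that changed, insert the rows that are new, and never empty the table.

-- Hypothetical staging and dimension tables; NULL-safe comparisons omitted for brevity.
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.City         = s.City
FROM   dbo.DimCustomer AS d
       INNER JOIN stg.Customer AS s
           ON s.CustomerKey = d.CustomerKey
WHERE  d.CustomerName <> s.CustomerName
   OR  d.City <> s.City;

-- Add only the new rows; existing rows (and any history) stay put.
INSERT dbo.DimCustomer (CustomerKey, CustomerName, City)
SELECT s.CustomerKey, s.CustomerName, s.City
FROM   stg.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimCustomer AS d
                   WHERE  d.CustomerKey = s.CustomerKey);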


Equidistant Points and Missing Data in Excel

Stephanie Evergreen shows how you can bring in missing data points in Excel to ensure the axis is accurate:

Excel automatically spaces your intervals and labels equidistant from one another but it is assuming that your intervals actually are equidistant. In this graph, that’s not the case. We are missing the months of March, April, July, and August, when either no one was enrolled in the study or we have some missing data. But we can’t just gloss over those months. It isn’t truthful and it distorts the data display.

Click through for the solution.
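The same gap problem can also be fixed upstream of Excel. If the series lives in SQL Server, one option (a sketch against a hypothetical dbo.Enrollment table) is to generate the full month range and left join to it, so the missing months surface explicitly as NULL:

DECLARE @Start date = '2019-01-01', @End date = '2019-12-01';

WITH Months AS
(
    SELECT @Start AS MonthStart
    UNION ALL
    SELECT DATEADD(MONTH, 1, MonthStart)
    FROM   Months
    WHERE  MonthStart < @End
)
SELECT m.MonthStart,
       e.EnrolledCount              -- NULL marks a month with no data
FROM   Months AS m
       LEFT JOIN dbo.Enrollment AS e
           ON e.MonthStart = m.MonthStart
ORDER BY m.MonthStart
OPTION (MAXRECURSION 120);          -- cap the recursive CTE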


IDEs and Cloudera Data Science Workbench

Bethann Noble walks us through some of the options available for IDEs operating against Cloudera Data Science Workbench:

Other coders on the team including ML and DevOps engineers often work in local IDEs such as PyCharm. These applications run locally on the user’s computer and connect to CDSW remotely over SSH for code completion and execution. They must be configured per user and are not associated at the project level in CDSW. The documentation provides sample instructions for the Professional Edition of PyCharm v2019.1.

They support both browser-based and local IDEs.


Why Transaction Logs are Zero-Initialized

Paul Randal explains why the transaction log needs to be zero-initialized before SQL Server starts up:

It’s all to do with crash recovery. SQL Server knows where crash recovery has to start for a database, but not where it ends – i.e. SQL Server does not persist the ‘most recent LSN’ for a database anywhere. This means it has to work out where the end of the log is (and by end, I mean the most recent log record persisted on disk, not the physical end of the log file).

Read on for the detailed explanation.
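If you want to watch the zeroing happen on a test instance, trace flags 3004 and 3605 (which Paul has written about before) echo file-initialization work to the error log. A quick sketch, using a throwaway database:

-- Test instances only: these trace flags write initialization details to the error log.
DBCC TRACEON (3004, 3605, -1);

CREATE DATABASE ZeroInitDemo;

-- Look for "Zeroing" entries against ZeroInitDemo's .ldf: the log file is
-- zero-initialized even when instant file initialization covers the data files.
EXEC sys.xp_readerrorlog 0, 1, N'Zeroing';

DBCC TRACEOFF (3004, 3605, -1);
DROP DATABASE ZeroInitDemo;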


Test Those Restores

Bob Pusateri wants you to test your restores, probably right now:

I had a client that was hit by a ransomware virus, encrypting several of their systems including the database server. Not to worry, though, they had “full backups” of all the affected machines, done by a third-party backup utility. After taking a day to cleanse their network, they restored these backups onto their servers. Now it was just a simple matter of bringing all the applications back online, right? Well, not exactly…

Just because a backup completes with no error code doesn’t mean it’s really a successful backup.
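A test restore doesn’t have to be fancy. Here’s a sketch of what it can look like in T-SQL (paths and logical file names are made up; RESTORE FILELISTONLY shows the real ones):

-- Verifying the media is a start, but it doesn't prove the database is usable.
RESTORE VERIFYONLY
FROM DISK = N'\\backups\YourDatabase_FULL.bak';

-- The real test: restore under a scratch name, then check integrity.
RESTORE DATABASE YourDatabase_RestoreTest
FROM DISK = N'\\backups\YourDatabase_FULL.bak'
WITH MOVE N'YourDatabase'     TO N'D:\Data\YourDatabase_RestoreTest.mdf',
     MOVE N'YourDatabase_log' TO N'L:\Log\YourDatabase_RestoreTest.ldf',
     STATS = 10;

DBCC CHECKDB (YourDatabase_RestoreTest) WITH NO_INFOMSGS;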


Getting the Largest Value Per Group

Erik Darling has a series on getting the highest value using CROSS APPLY. Part 1 covers the no-index route:

Let’s say you wanna get the highest thing. That’s easy enough as a concept.

Now let’s say you need to get the highest thing per user. That’s also easy enough to visualize.

There are a bunch of different ways to choose from to write it.

Part 2 covers the yes-index route:

In this round, row number had a tougher time than other ways to express the logic.

It just goes to show you, not every query is created equal in the eyes of the optimizer.

I don’t think I’m spoiling too much by saying that you really want a good index in place when using CROSS APPLY in this manner.
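The query shape (sketched here against something like the Stack Overflow database Erik uses; exact names may differ) is roughly this, with the index doing the heavy lifting:

-- The supporting index: seek to a user, read the top-scoring row, done.
CREATE INDEX IX_Posts_OwnerUserId_Score
    ON dbo.Posts (OwnerUserId, Score DESC)
    INCLUDE (Title);

SELECT u.Id, u.DisplayName, p.Score, p.Title
FROM   dbo.Users AS u
       CROSS APPLY
       (
           SELECT TOP (1) p.Score, p.Title
           FROM   dbo.Posts AS p
           WHERE  p.OwnerUserId = u.Id
           ORDER BY p.Score DESC
       ) AS p;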


SQL Server 2019 CTP 3.2

Asad Khan announces the release of SQL Server 2019 CTP 3.2:

With this release of SQL Server 2019 community technology preview 3.2, we are announcing the public preview of Big Data Clusters for SQL Server 2019. Big Data Clusters for SQL Server enables big data analytics within SQL Server. It brings HDFS and Apache Spark™ into SQL Server for scale out compute and storage.

Big data clusters allow you to deploy scalable clusters of SQL Server, Apache Spark™, and HDFS running on Kubernetes. It provides all the tools and systems to ingest, store, and prepare data for analysis as well as to train and operationalize machine learning models. It allows you to query external data sources through data virtualization and combine and analyze your high-value relational data with high-volume big data. You will also be able to build and deploy scalable and productive data-driven applications in big data clusters.

There’s a good bit in this release, and because we’re in the 3.x range, you should be able to upgrade directly.
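As a taste of the data virtualization piece, here’s a hedged sketch of querying a remote SQL Server through a PolyBase external table; every name and secret below is invented:

-- Requires the PolyBase feature to be installed and enabled.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me$trongPassword!';

CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCred
WITH IDENTITY = 'remote_login', SECRET = 'An0ther$ecret!';

CREATE EXTERNAL DATA SOURCE RemoteSql
WITH (LOCATION = 'sqlserver://remotehost:1433', CREDENTIAL = RemoteSqlCred);

CREATE EXTERNAL TABLE dbo.RemoteOrders
(
    OrderId int,
    Amount  decimal(18, 2)
)
WITH (LOCATION = 'SalesDb.dbo.Orders', DATA_SOURCE = RemoteSql);

-- The external table now behaves like a local table in queries and joins.
SELECT TOP (10) OrderId, Amount
FROM dbo.RemoteOrders
ORDER BY Amount DESC;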


Azul Java in SQL Server 2019

Travis Wright announces support for Azul Systems’ Java distribution in SQL Server 2019:

In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows for all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. That’s right – supported for free.

Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul Systems’ Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.

This is interesting. We’ll have to see whether the CTP 3.2 installation stops asking for JDK 1.8 and just installs the Azul Systems version.


MLflow 1.1 Released

Max Allen, et al., announce the release of MLflow 1.1:

We’re excited to announce today the release of MLflow 1.1. In this release, we’ve focused on fleshing out the tracking component of MLflow and improving visualization components in the UI.

Some of the major features include:
– Automatic logging from TensorFlow and Keras
– Parallel coordinate plots in the tracking UI
– Pandas DataFrame-based search API
– Java Fluent API
– Kubernetes execution backend for MLflow projects
– Search Pagination

Looks like they’re putting in a lot of work on this.
