Press "Enter" to skip to content

Author: Kevin Feasel

Lubridate Updates

Hadley Wickham reports on a Lubridate update:

  • Date time rounding (with round_date()floor_date() and ceiling_date()) now supports unit multipliers, like “3 days” or “2 months”:

    ceiling_date(ymd_hms("2016-09-12 17:10:00"), unit = "5 minutes")
    #> [1] "2016-09-12 17:10:00 UTC"

If you handle date and time data in R, Lubridate is a tremendous asset.

Comments closed

Notebook Practices

Jonathan Whitmore has good practices for managing Jupyter notebooks:

Here’s an example of how we use git and GitHub. One beautiful new feature of Github is that they now render Jupyter Notebooks automatically in repositories.

When we do our analysis, we do internal reviews of our code and our data science output. We do this with a traditional pull-request approach. When issuing pull-requests, however, looking at the differences between updated .ipynb files, the updates are not rendered in a helpful way. One solution people tend to recommend is to commit the conversion to .py instead. This is great for seeing the differences in the input code (while jettisoning the output), and is useful for seeing the changes. However, when reviewing data science work, it is also incredibly important to see the output itself.

So far, I’ve treated notebooks more as presentation media and used tools like R Studio for tinkering.  This shifts my priors a bit.

Comments closed

Restoring All Databases

Kevin Hill builds a script to reload all databases at once:

We are doing a major upgrade this weekend, so like any good DBA, I have planned for a full backup the night before and needed the ability to quickly restore if it goes sideways and needs to roll back.

The catch is that there are close to 300 identical databases involved.

This is easy enough to do if you know where info is stored in MSDB related to the last backup.

Click through for the script.

Comments closed

Azure SQL Database Size Quotas

Dimitri Furman discusses the MAXSIZE property on an Azure SQL Database:

Customers can use this ability to allow scaling down to a lower service objective, when otherwise scaling down wouldn’t be possible because the database is too large.

While this capability is useful for some customers, the fact that the actual size quota for the database may be different from the maximum size quota for the selected service objective can be unexpected, particularly for customers who are used to working with the traditional SQL Server, where there is no explicit size quota at the database level. Exceeding the unexpectedly low database size quota will prevent new space allocations within the database, which can be a serious problem for many types of applications.

One more thing to think about, I suppose.

Comments closed

Restoring An Azure SQL Database

Arun Sirpal discusses ways to restore a database within Azure SQL Database:

You won’t have the ability to use the same name of the restoring database and the database that you want to replace; if you try you get the screen shot below: To get around this I think you would need to drop the old one once the new one has restored then do a rename.

This is a big difference compared to the on-prem version, so be sure to practice this before you find yourself in a crisis.

Comments closed

Migrating To Azure SQL Database

Mike Fal discusses BACPACs, DACPACs, and migrating on-prem databases to Azure SQL Database:

SQL Server Data Tools(SSDT) have always had a process to extract your database. There are two types of extracts you can perform:

  • DACPAC – A binary file that contains the logical database schema and possibly the data. This file retains the platform version of the database (i.e. 2012, 2014, 2016).

  • BACPAC – A binary file that contains the logical database schema and the data as insert statements. This stores the platform version, but is not locked into it.

Mike also walks through SqlPackage.exe.

Comments closed

Change Data Capture With Apache NiFi

Satish Bomma uses Apache NiFi to perform change data capture on a MySQL database:

The main things to configure is DBCPConnection Pool and Maximum-value Columns

Please choose this to be the date-time stamp column that could be a cumulative change-management column

This is the only limitation with this processor as it is not a true CDC and relies on one column. If the data is reloaded into the column with older data the data will not be replicated into HDFS or any other destination.

This processor does not rely on Transactional logs or redo logs like Attunity or Oracle Goldengate. For a complete solution for CDC please use Attunity or Oracle Goldengate solutions.

That last paragraph in the snippet is key:  it’s not a true replacement for CDC-friendly products.  It is, however, a good example for showing how to use NiFi to connect to a relational database and pump data out of it.

Comments closed