Press "Enter" to skip to content

Author: Kevin Feasel

The Impact Of Auto-Growth Settings For Log Files

Jamie Wick has started a series on log growth, beginning with a look at auto-growth settings:

For the data file, the impact can be illustrated in the following chain of events:

  1. A new 1MB data file is created that contains no information. (i.e. a 1MB data file containing 0MB of data)
  2. Data is written to the data file until it reaches the file size. (i.e. the 1MB data file now contains 1MB of data)
  3. SQL Server suspends normal operations to the database while the data file is grown by 1MB. (i.e. the data file is now 2MB and contains 1MB of data) If Instant File Initialization (IFI) is enabled, the file is expanded and database operations resume. If IFI is not enabled, the expanded part of the data file must be zeroed before database operations resume, resulting in an additional delay.
  4. Once the data file has been grown successfully, the server resumes normal database processing. At this point the server loops back to Step 2.

The server will continue this run-pause-run-pause processing until the data file reaches its MAXSIZE or the disk becomes full. If the disk that the data file resides on has other files on it (e.g. the C drive, or a disk that is shared by several databases), there will be other disk-write events happening between the data file growth events. This may cause the data file's expansion segments to be non-contiguous, increasing file fragmentation and further decreasing database performance.

This is all to answer the question, “What’s the problem with missing a few log backups?”
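A quick way to see whether a database is at risk of this churn is to check its current file sizes and growth increments. This is a minimal sketch of my own, not from Jamie's post, and the database and logical file names are hypothetical:

    -- Inspect current size and auto-growth settings for the active database
    SELECT name,
           type_desc,
           size * 8 / 1024 AS size_mb,
           CASE WHEN is_percent_growth = 1
                THEN CAST(growth AS varchar(10)) + '%'
                ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
           END AS growth_setting
    FROM sys.database_files;

    -- Swap a tiny 1MB increment for a fixed, sensible one
    ALTER DATABASE [YourDatabase]
    MODIFY FILE (NAME = N'YourDatabase_Data', FILEGROWTH = 256MB);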


Gathering Info On Tables

Raul Gonzales has a script which provides useful information for tables and columns:

Useful information it provides at table level:

  • tableType, to identify HEAP tables
  • row_count, to identify tables with many rows or no rows at all
  • TotalSpaceMB, to identify tables that are large in size
  • LastUserAccess, to identify tables that are not used
  • TotalUserAccess, to identify tables that are heavily used
  • TableTriggers, to identify tables that have triggers

Useful information it provides at column level:

  • DataType-Size, to identify supersized, incorrect or deprecated data types
  • Identity, to identify identity columns
  • Mandatory-DefaultValue, to identify NULL/NOT NULL columns or columns with default constraints
  • PrimaryKey, to identify primary key columns
  • Collation, to identify columns that might have a different collation from the database
  • ForeignKey-ReferencedColumn, to identify foreign keys and the table.column they reference

Click through for the script.
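For a taste of the table-level portion, a stripped-down query against the catalog views looks something like this. It's my own approximation rather than Raul's code, which also pulls usage and trigger information:

    -- Rough table-level summary: heap vs. clustered, row count, total size
    SELECT s.name AS SchemaName,
           t.name AS TableName,
           CASE WHEN MIN(p.index_id) = 0 THEN 'HEAP' ELSE 'CLUSTERED' END AS TableType,
           SUM(CASE WHEN p.index_id IN (0, 1) THEN p.rows ELSE 0 END) AS row_count,
           CAST(SUM(a.total_pages) * 8 / 1024.0 AS decimal(18, 2)) AS TotalSpaceMB
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    JOIN sys.partitions AS p ON p.object_id = t.object_id
    JOIN sys.allocation_units AS a ON a.container_id = p.partition_id
    GROUP BY s.name, t.name
    ORDER BY TotalSpaceMB DESC;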


Installing SQL Server On Ubuntu 18.04

Max Trinidad shows us how to install SQL Server on Ubuntu 18.04, though he leads off with a warning:

This has been an issue for some time until now. I found the following link that helped me install SQL Server on the latest Ubuntu 18.04:

https://askubuntu.com/questions/1032532/how-do-i-install-ms-sql-for-ubuntu-18-04-lts

But there are a few missing steps which, once filled in, help ease the burden of errors. At the same time, the information is a little outdated.

Still, it works with the following adjustments.

Please Understand!!  This is NOT approved by Microsoft.  Use this method for Test Only!!

I’m waiting somewhat impatiently for Microsoft and Hortonworks to support Ubuntu 18.04.


Partitioning Data For Performance Improvement In R

John Mount shares a few examples of partitioning and parallelizing data operations in R:

In this note we will show how to speed up work in R by partitioning data and using process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

There were some interesting results.  I expected data.table to be fast, but did not expect dplyr to parallelize so well.


Sharing R Notebooks

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown:

RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF, and Word.

Click through for the demo.


Pipe-Friendly Functions In R

William Doane gives some tips on writing pipe-friendly functions in R:

Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator; it is most often pronounced as “then”. For example, “take the mtcars data.frame, THEN take the head of it, THEN…” and so on.

For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.

Click through for a couple of examples.  H/T R-Bloggers


Using Azure Logic Apps For Database Tasks

Arun Sirpal shows off a technique he has developed to run maintenance jobs against Azure SQL Database databases:

I have been using Azure Logic Apps recently to build some workflows to gather data from external sources, ultimately inserting it into a database for reporting. I then got thinking: how can this be useful for “DBA”-based tasks? Let’s take a step back for a minute: what are Logic Apps? It is a technology that helps integrate apps, data, systems, and services across enterprises. Key parts of a logic app solution are connectors, triggers and actions.

I decided that I wanted to execute a stored procedure every 6 hours to capture wait statistics for my Azure SQL Database and log the information in a table.

This is what my workflow looks like.

There are a few alternatives available, so it’s nice to see an example of one of them.
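The database side of such a workflow is plain T-SQL. As a rough sketch (mine, with hypothetical object names), a wait-statistics snapshot in Azure SQL Database can come from sys.dm_db_wait_stats, the database-scoped counterpart to sys.dm_os_wait_stats:

    -- Hypothetical logging table plus the procedure the Logic App would call
    CREATE TABLE dbo.WaitStatsLog
    (
        capture_time   datetime2    NOT NULL DEFAULT SYSUTCDATETIME(),
        wait_type      nvarchar(60) NOT NULL,
        waiting_tasks  bigint       NOT NULL,
        wait_time_ms   bigint       NOT NULL,
        signal_wait_ms bigint       NOT NULL
    );
    GO
    CREATE OR ALTER PROCEDURE dbo.CaptureWaitStats
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT dbo.WaitStatsLog (wait_type, waiting_tasks, wait_time_ms, signal_wait_ms)
        SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
        FROM sys.dm_db_wait_stats
        WHERE waiting_tasks_count > 0;
    END;

A recurrence trigger set to every six hours then calls the procedure through the SQL connector's stored-procedure action.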


Grouping And Aggregating In SQL, R, And Python

Dejan Sarka has a few examples of aggregation in different languages, including SQL, R, and Python:

The query calculates the coefficient of variation (defined as the standard deviation divided by the mean) for the following groups, in the order in which they are listed in the GROUPING SETS clause:

  • Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
  • Country only – expression (g.EnglishCountryRegionName)
  • Education only – expression (c.EnglishEducation)
  • Over the whole dataset – expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether a NULL in a cell comes from NULLs in the source data (a genuine group NULL), or appears because the cell belongs to a hyper-aggregate. For example, a NULL in the Education column where GROUPING(Education) equals 1 indicates that the row is aggregated in such a way that education makes no sense in the context (for example, aggregated over countries only, or over the whole dataset). I used ordering by NEWID() just to shuffle the results, and executed the query multiple times before I got the desired order, where all possibilities for the GROUPING() function output were included in the first few rows of the result set.

GROUPING SETS is an underappreciated bit of SQL syntax.
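For reference, the general shape of such a query, reconstructed against the AdventureWorksDW sample tables (my sketch, not Dejan's exact code; YearlyIncome as the measure is an assumption):

    -- Coefficient of variation across several grouping sets
    SELECT g.EnglishCountryRegionName AS Country,
           c.EnglishEducation AS Education,
           STDEV(c.YearlyIncome) / NULLIF(AVG(c.YearlyIncome), 0) AS CV,
           GROUPING(g.EnglishCountryRegionName) AS GroupingCountry,
           GROUPING(c.EnglishEducation) AS GroupingEducation
    FROM dbo.DimCustomer AS c
    INNER JOIN dbo.DimGeography AS g
        ON c.GeographyKey = g.GeographyKey
    GROUP BY GROUPING SETS
    (
        (g.EnglishCountryRegionName, c.EnglishEducation),
        (g.EnglishCountryRegionName),
        (c.EnglishEducation),
        ()
    )
    ORDER BY NEWID();  -- shuffle the rows, as in the original post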


Index That Column Or Include It?

Jeanne Combrinck lays out her recommendations on whether to make a particular column part of an index or have it be an included column:

The original question we wanted to answer was whether we would see a performance difference when a query used the index with all columns in the key, versus the index with most of the columns included in the leaf level. In our first set of tests there was no difference, but in our third and fourth tests there was. It ultimately depends on the query. We only looked at two variations – one had an additional predicate, the other had an ORDER BY – many more exist.

What developers and DBAs need to understand is that there are some great benefits to including columns in an index, but they will not always perform the same as indexes that have all columns in the key. It may be tempting to move columns that are not part of predicates and joins out of the key, and just include them, to reduce the overall size of the index. However, in some cases this requires more resources for query execution and may degrade performance. The degradation may be insignificant; it may not be…you will not know until you test. Therefore, when designing an index, it’s important to think about the columns after the leading one – and understand whether they need to be part of the key (e.g. because keeping the data ordered will provide benefit) or if they can serve their purpose as included columns.

Read the whole thing and be willing to test different approaches.
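In index-definition terms, the two designs under test look like this; the table and columns are hypothetical, just to make the variants concrete:

    -- Variant 1: every column in the key, so the entire key is sorted
    CREATE NONCLUSTERED INDEX IX_Orders_AllKey
        ON dbo.Orders (CustomerID, OrderDate, Status, TotalDue);

    -- Variant 2: one key column; the rest sit unsorted at the leaf level
    -- but still allow the index to cover the query
    CREATE NONCLUSTERED INDEX IX_Orders_Included
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate, Status, TotalDue);

Both can cover the same SELECT, but only the first can satisfy an ORDER BY CustomerID, OrderDate without a sort operator, which is exactly the kind of difference the third and fourth tests surfaced.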


The Best Of The Underground Toolbox

Adrian Buckman shares some of his favorite creations:

sp_AGreconfigure

This is a great go-to proc as an alternative to the Always On availability group GUI for changing failover mode, synchronous mode, or even readable options.

When you manage multiple servers with multiple availability groups, this stored procedure can save you a lot of time. Sometimes I find the GUI can take a long time to open, but equally it can take some time to execute the command.

sp_AGreconfigure can speed this process up for you. We tend to use it as our go-to for switching synchronous settings when patching or rebooting replicas, but I also tend to use it in @Checkonly = 1 mode to give the availability group settings a once-over.

Click through for this and several other useful tools.
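Under the hood, a helper like sp_AGreconfigure wraps the standard ALTER AVAILABILITY GROUP commands. For orientation, the raw T-SQL it saves you from typing looks roughly like this (availability group and replica names are hypothetical):

    -- Switch a replica to synchronous commit with automatic failover
    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (FAILOVER_MODE = AUTOMATIC);

    -- Open the secondary role to read-only connections
    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));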
