Press "Enter" to skip to content

Curated SQL Posts

Backing Up Cloudera Search Data

Eva Nahari explains different techniques to back up Cloudera Search data, as well as setting up disaster recovery:

If you have the raw data in HDFS (which most do, and which you should!), the most straightforward way to have a hot-warm disaster recovery setup is to use our Backup and Disaster Recovery tool. It allows you to set up regular incremental updates between two clusters. You then have the option of using MapReduce Indexer or Spark Indexer to regularly index the raw data in your recovery cluster and append to a running Solr service in that same recovery cluster. This way you can easily switch over from one Solr service to the backup Solr service if you experience downtime in the original cluster.

The lag would depend on the network between the clusters and how frequently you transfer data between them. To some extent it would also depend on how much time you need (i.e., how many resources you have available) to complete the MapReduce or Spark indexing workload and append it (using the Cloudera Search GoLive feature) into Solr active indexes on the recovery site.

Read on for several options.
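
The post covers the options at a conceptual level. As a rough sketch of the GoLive step it describes (the jar path, HDFS paths, ZooKeeper hosts, and collection name below are placeholders, not taken from the post), the MapReduceIndexerTool can build index shards from the raw data on the recovery cluster and merge them into the running Solr collection:

    # Hypothetical invocation: index raw HDFS data on the recovery cluster
    # and merge the result into the live Solr collection via --go-live.
    hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool \
      --morphline-file morphlines.conf \
      --output-dir hdfs://recovery-nn:8020/tmp/indexer-output \
      --zk-host recovery-zk1:2181/solr \
      --collection my_collection \
      --go-live \
      hdfs://recovery-nn:8020/data/raw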

Comments closed

Using COMBINEVALUES

Marco Russo explains why you might want to use the COMBINEVALUES operator in DAX:

When you import data in a Tabular model, relationships are optimized when they are based on a single column that does not have a high number of unique values. Columns that do have a high number of unique values are known as high cardinality columns. If two tables require two or more columns to define a relationship, it is possible to create a calculated column that concatenates the values of the columns used for the relationship, on both tables. This way, the relationship is based on one single column on each side. However, the resulting calculated column will have a higher cardinality than the original columns, which is not good for performance.

Most of the time, the presence of relationships based on multiple columns suggests that a better denormalization of the model is required to obtain an optimal star schema. Nevertheless, for smaller tables or when it is not possible to apply complex transformations, concatenating columns is the preferred way to obtain the relationships required. When this happens with data loaded in memory, the only concern is the cardinality of the resulting column. However, this is not the main concern in DirectQuery mode.

A calculated column in DirectQuery translates its expression into a native SQL expression. When this is used as a JOIN condition to express the relationship in a Tabular model, this could produce a non-optimal query plan, which could introduce unnecessary slowness in the query execution. For this reason, Microsoft introduced COMBINEVALUES, which is a function expressly designed to optimize relationships based on multiple columns in DirectQuery mode.

Read on for an example.
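
Marco's post has a full worked example; as a bare-bones sketch of the pattern (the table and column names below are made up), the same calculated column would be defined on both tables in the relationship:

    -- Hypothetical calculated column, added to both sides of the relationship.
    -- COMBINEVALUES concatenates the key columns with a delimiter in a way that
    -- DirectQuery can translate into an efficient join condition.
    OrderKey =
        COMBINEVALUES ( "|", Sales[OrderNumber], Sales[OrderLineNumber] )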

Comments closed

Building Control Charts With R

Kamal Kumar covers one of my favorite types of charts:

Control charts are used during the Control phase of DMAIC methodology. Control charts, also known as Shewhart charts or process-behavior charts, are a statistical process control tool used to determine if a manufacturing or business process is in a state of control. If analysis of the control chart indicates that the process is currently under control, then no corrections or changes to process control parameters are needed. Moreover, data from the method can be used to predict the future performance of the process. If the control chart indicates that the process is not in control, analysis of the chart can help determine the sources of variation, as this will result in degradation of process performance.

There are many packages in R that can be used for analysis related to Six Sigma. Here, we will go through the qcc package (an R package for statistical quality control charts) and learn how to create a control chart (to know whether the process is in control).

Control charts are great for telling if a process has changed in some important way—if your machine is boring holes outside of tolerances, if your busy web server is getting closer to the breaking point, etc.
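
If you want to try it yourself, a minimal sketch with made-up data (20 samples of 5 measurements each) using the qcc package might look like this:

    # Sketch only: generate fake subgrouped data and draw an x-bar control chart.
    library(qcc)
    set.seed(42)
    measurements <- matrix(rnorm(100, mean = 10, sd = 0.5), nrow = 20, ncol = 5)
    qcc(measurements, type = "xbar")  # points outside the control limits are flagged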

Comments closed

The Bitmap Operator

Hugo Kornelis explains what the Bitmap operator is in an execution plan:

As implied by its logical operation, “Create Bitmap”, the Bitmap operator creates a bitmap. (I assume that this is the only logical operation that the Bitmap operator supports, since I have never seen a Bitmap operator with a different logical operation). A bitmap is a structure that stores Boolean values for a consecutive range of values in a small amount of memory. E.g. the range from 1 to 8000 has 8000 possible values. These can be represented as 8000 bits in just 1000 bytes. For each value, the location of the corresponding bit can be computed by dividing the value by 8; the quotient is the location and the remainder determines which of the bits to use. The value of that specific bit can then be tested, or it can be set to false (zero) or true (one).

The bitmap is named Opt_Bitmap1005 in this case. This name is not exposed in the quick property popup, but you can find it in the full property sheet as shown here, in the Defined Values property. You will also note that this bitmap is not included in the Output List property. That’s because this bitmap is not created for each individual row; there is a single bitmap that accumulates information from all rows. Other operators in the execution plan can reference this bitmap by name, even though it is not passed to them in their input rows.

Hugo goes into great detail on the operator, so you’ll want to set aside some time to read this.
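
To make the bit arithmetic concrete (the value here is arbitrary), the byte offset comes from the integer quotient and the bit within that byte from the remainder:

    -- Illustrative only: locate the bit that represents an arbitrary value.
    DECLARE @value int = 6011;
    SELECT @value / 8 AS ByteOffset,   -- which byte of the bitmap holds the flag
           @value % 8 AS BitPosition;  -- which bit within that byte to test or set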

Comments closed

Open Source ML With Azure

David Smith shares his Build conference slides:

The topic for my talk at the Microsoft Build conference yesterday was “Migrating Existing Open Source Machine Learning to Azure”. The idea behind the talk was to show how you can take the open-source tools and workflows you already use for machine learning and data science, and easily transition them to the Azure cloud to take advantage of its capacity and scale. The theme for the talk was “no surprises”, and other than the Azure-specific elements I tried to stick to standard OSS tools rather than Microsoft-specific things, to make the process as familiar as possible.

Click through for the slides and additional resources.

Comments closed

Writing Extended Events To Blob Storage

Arun Sirpal shows how to write an Azure SQL Database extended events session to blob storage:

Last year I wrote about Azure SQL Database extended events (https://blobeater.blog/2017/02/06/using-extended-events-in-azure/) and gave an example where I was capturing deadlocks via the ring buffer. Ever since then I have wanted to do a follow-up post, but using Azure storage as the target for my XEL files.

This is more complicated than using the ring buffer as the target and requires a couple of things:

  • Azure storage account where you create a dedicated container for the files.

  • SAS key.

  • Database master key.

  • Database scoped credential.
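
Putting those pieces together, a rough T-SQL sketch might look like the following (the storage account, container, SAS token, password, and event session are all placeholders; Arun's post walks through the real details):

    -- 1. Database master key, needed to protect the credential.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    -- 2. Database scoped credential: the name is the container URL and the
    --    secret is the SAS token (without the leading '?').
    CREATE DATABASE SCOPED CREDENTIAL [https://mystorageacct.blob.core.windows.net/xe-files]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = '<SAS token>';

    -- 3. Event session writing its XEL files to that container.
    CREATE EVENT SESSION DeadlockCapture ON DATABASE
    ADD EVENT sqlserver.database_xml_deadlock_report
    ADD TARGET package0.event_file
        (SET filename = 'https://mystorageacct.blob.core.windows.net/xe-files/DeadlockCapture.xel');

    ALTER EVENT SESSION DeadlockCapture ON DATABASE STATE = START;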

Also, fantastic obfuscation in the post. Hello, Shane!

Comments closed

SQL Operations Studio May Release

Alan Yu announces the latest release of SQL Operations Studio:

Highlights for this build include the following.

  • Announcing Redgate SQL Search extension available in Extension Manager

  • Community Localization available for 10 languages: German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Simplified Chinese and Traditional Chinese!

  • GDPR-compliant build has reduced telemetry collection, improved opt-out experience and in-product links to Privacy Statement

  • Extension Manager has improved Marketplace experience to easily discover community extensions

  • SQL Agent extension Jobs and Job History view improvement

  • Updates for whoisactive and Server Reports extensions

  • Fix GitHub Issues

Click through for more details and be sure to grab the latest update.

Comments closed

ggplot2 Coordinate Systems

Lea Waniek walks us through coordinate systems in ggplot2:

The coordinate system can be manipulated by adding one of ggplot’s different coordinate systems. When you are imagining a coordinate system, you are most likely thinking of a Cartesian one. The Cartesian coordinate system combines the x and y dimensions orthogonally and is ggplot’s default (coord_cartesian).

There are also several variations of the familiar Cartesian coordinate system in ggplot, namely coord_fixed, coord_flip, and coord_trans. For all of them, the displayed section of the data can be specified by defining the maximal values depicted on the x (xlim =) and y (ylim =) axes. This allows you to “zoom in” or “zoom out” of a plot. It is a great advantage that all manipulations of the coordinate system only alter the depiction of the data, not the data itself.

I tend to avoid polar coordinates, but that’s mostly because I don’t work in a space which benefits from it.
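
For instance, a minimal sketch with the built-in mtcars data, zooming in without discarding any rows:

    # coord_cartesian only changes what is displayed; the underlying data is untouched.
    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +
      coord_cartesian(xlim = c(2, 4), ylim = c(15, 30))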

Comments closed

Limitations Of Mapping In Power BI

David Stelfox points out a limitation in Power BI and tries to circumvent it with R to some limited effect:

This results in a row per ride and visualises pretty well in SSMS. If you are familiar with the geography of London you can make out the river Thames toward the centre of the image and Regents Park towards the top left:

This could be overlaid on a shape file of London or a map from another provider such as Google Maps or Mapbox.

However, when you try to load the dataset into Power BI, you find that Power BI does not natively support Geography data types. There is an idea you can vote on here to get them supported: https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/12257955-support-sql-server-geometry-geography-data-types-i

Hit up that idea link if you want to see geography type support within Power BI.

Comments closed

Determining Compute Resources From Managed Instances

Dimitri Furman explains Azure SQL Database Managed Instance resource allocations:

In the current Azure SQL Database Managed Instance (MI) preview, when customers create a new instance, they can allocate a certain number of CPU vCores and a certain amount of disk storage space for the instance. However, there is no explicit configuration option for the amount of memory allocated to the instance, because on MI, memory allocation is proportional to the number of vCores used.

How can a customer determine the actual amount of memory their MI instance can use, in GB? The answer is less obvious than it may seem. Using the traditional SQL Server methods will not provide the right answer on MI. In this article, we will go over the technical details of CPU and memory allocation on MI, and describe the correct way to answer this question.

The information and behavior described in this article are as of the time of writing (April 2018). Some aspects of MI behavior, including the visibility of certain compute resource allocations, may be temporary and will likely change as MI progresses from the current preview to general availability and beyond. Nevertheless, customers using MI in preview will find that this article answers some of the common questions about MI resource allocation.

What you see in SQL Server Management Studio is true, but it is also not the whole picture.
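
For context, the sort of “traditional” check Dimitri warns against relying on looks like the sketch below; per the article, this kind of query will not give the right answer on MI:

    -- Traditional way to eyeball CPU and memory on a SQL Server instance;
    -- per the article above, this is not the right way to answer the question on MI.
    SELECT cpu_count,
           physical_memory_kb / 1024 AS physical_memory_mb,
           committed_target_kb / 1024 AS committed_target_mb
    FROM sys.dm_os_sys_info;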

Comments closed