Press "Enter" to skip to content

Day: May 9, 2018

Data Types In R

Ellen Talbot gives us an overview of the different data types in R:

Now here’s something we didn’t cover in the video and is especially helpful if something just WILL NOT work and you’ve spent all morning panic eating biscuits.

You can write checks to see if something is numeric, or an integer, with is.numeric() or is.integer().

The general “‘is.XXXXX()’” function will take many of the data types we cover here and more, and can be a real time/life saver.

We could also use class() here and inspect the result.^[You might recall that class(1) had the result of “numeric” – R was not by default considering 1 as an integer for the purpose of the class() function. ### Special numbers As well as i to denote imaginary numbers, there are some additional symbols you might encounter or want to use.

There’s a video as well as a full blog post.

Comments closed

Backing Up Cloudera Search Data

Eva Nahari explains different techniques to back up Cloudera Search data, as well as setting up disaster recovery:

If you have the raw data in HDFS (which most do, and which you should!), the most straightforward way to have a hot-warm disaster recovery setup is to use our Backup and Disaster Recovery tool. It allows you to set up regular incremental updates between two clusters. You then have the option of using MapReduce Indexer or Spark Indexer to regularly index the raw data in your recovery cluster and append to a running Solr service in that same recovery cluster. This way you can easily switch over from one Solr service to the backup Solr service if you experience downtime in the original cluster.

The lag would be depending on the network between the clusters and how frequent you transfer data between the clusters. To some extent it would also depend on how long time you need (i.e. how much resources you have available) to complete the MapReduce or Spark indexing workload and append it (using the Cloudera Search GoLive feature) into Solr active indexes on the recovery site.

Read on for several options.

Comments closed

Using COMBINEVALUES

Marco Russo explains why you might want to use the COMBINEVALUES operator in DAX:

When you import data in a Tabular model, relationships are optimized when they are based on a single column that does not have a high number of unique values. Columns that do have a high number of unique values are known as high cardinality columns. If two tables require two or more columns to define a relationship, it is possible to create a calculated column that concatenates the values of the columns used for the relationship, on both tables. This way, the relationship is based on one single column on each side. However, the resulting calculated column will have a higher cardinality than the original columns, which is not good for performance.

Most of the time, the presence of relationships based on multiple columns suggests that a better denormalization of the model is required to obtain an optimal star schema. Nevertheless, for smaller tables or when it is not possible to apply complex transformations, concatenating columns is the preferred way to obtain the relationships required. When this happens with data loaded in memory, the only concern is the cardinality of the resulting column. However, this is not the main concern in DirectQuery mode.

A calculated column in DirectQuery translates its expression into a native SQL expression. When this is used as a JOIN condition to express the relationship in a Tabular model, this could produce a non-optimal query plan, which could introduce unnecessary slowness in the query execution. For this reason, Microsoft introduced COMBINEVALUES, which is a function expressly designed to optimize relationships based on multiple columns in DirectQuery mode.

Read on for an example.

Comments closed

Building Control Charts With R

Kamal Kumar covers one of my favorite types of charts:

Control charts are used during the Control phase of DMAIC methodology. Control charts, also known as Shewhart charts or process-behavior charts, are a statistical process control tool used to determine if a manufacturing or business process is in a state of control. If analysis of the control chart indicates that the process is currently under control, then no corrections or changes to process control parameters are needed. Moreover, data from the method can be used to predict the future performance of the process. If the control chart indicates that the process is not in control, analysis of the chart can help determine the sources of variation, as this will result in degradation of process performance.

There are many packages in R, which can be used for analysis related to Six Sigma. Here, we will go through qcc package (R package for statistical quality control charts) and learn “How to create control chart (to know whether the process is in control)”.

Control charts are great for telling if a process has changed in some important way—if your machine is boring holes outside of tolerances, if your busy web server is getting closer to the breaking point, etc.

Comments closed

The Bitmap Operator

Hugo Kornelis explains what the Bitmap operator is in an execution plan:

As implied by its logical operation, “Create Bitmap”, the Bitmap operator creates a bitmap. (I assume that this is the only logical operation that the Bitmap operator supports, since I have never seen a Bitmap operator with a different logical operation). A bitmap is a structure that stores Boolean values for a consecutive range of values in a small amount of memory. E.g. the range from 1 to 8000 has 8000 possible values. These can be represented as 8000 bits in just 1000 bytes. For each value, the location of the corresponding bit can be computed by dividing the value by 8; the dividend is the location and the remainder determines which of the bits to use. The value of that specific bit can then be tested, or it can be set to false (zero) or true (one).

The bitmap is named Opt_Bitmap1005 in this case. This name is not exposed in the quick property popup, but you can find in in the full property sheet as shown here, in the Defined Values property. You will also note that this bitmap is not included in the Output List property. That’s because this bitmap is not created for each individual row; there is a single bitmap that accumulates information from all rows. Other operators in the execution plan can reference this bitmap by name, even though it is not passed to them in their input rows.

Hugo goes into great detail on the operator, so you’ll want to set aside some time to read this.

Comments closed

Open Source ML With Azure

David Smith shares his Build conference slides:

The topic for my talk at the Microsoft Build conference yesterday was “Migrating Existing Open Source Machine Learning to Azure”. The idea behind the talk was to show how you can take the open-source tools and workflows you already use for machine learning and data science, and easily transition them to the Azure cloud to take advantage of its capacity and scale. The theme for the talk was “no surprises”, and other than the Azure-specific elements I tried to stick to standard OSS tools rather than Microsoft-specific things, to make the process as familiar as possible.

Click through for the slides and additional resources.

Comments closed

Writing Extended Events To Blob Storage

Arun Sirpal shows how to write an Azure SQL Database extended events session to blob storage:

Last year I wrote about Azure SQL Database extended events (https://blobeater.blog/2017/02/06/using-extended-events-in-azure/)  and gave an example where I was capturing deadlocks via the ring buffer. Ever since then I wanted to do a follow-up post but using Azure storage as the target for my XEL files.

This is more complicated than using the ring buffer as the target and requires a couple of things:

  • Azure storage account where you create a dedicated container for the files.

  • SAS key.

  • Database master key.

  • Database scoped credential.

Also, fantastic obfuscation in the post.  Hello, Shane!

Comments closed

SQL Operations Studio May Release

Alan Yu announces the latest release of SQL Operations Studio:

Highlights for this build include the following.

  • Announcing Redgate SQL Search extension available in Extension Manager

  • Community Localization available for 10 languages: German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Simplified Chinese and Traditional Chinese!

  • GDPR-compliant build has reduced telemetry collection, improved opt-out experience and in-product links to Privacy Statement

  • Extension Manager has improved Marketplace experience to easily discover community extensions

  • SQL Agent extension Jobs and Job History view improvement

  • Updates for whoisactive and Server Reports extensions

  • Fix GitHub Issues

Click through for more details and be sure to grab the latest update.

Comments closed