
Author: Kevin Feasel

codecentric.ai Bootcamp

Shirin Glander announces a free German-language bootcamp:

This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision, and NLP. The course language is German only, but for every chapter I have done, you will find an English R version here on my blog (see below for links).

Right now, the course is in its beta phase, so we are happy about everyone who tests our content and leaves feedback. Also, the curriculum is not entirely finished yet; we will update and extend the course over the coming months. If there are specific topics you’d like us to cover, just let us know!

If you understand German and want to learn about data science, check this out and leave feedback.


Building A Calendar Table

I have a post up on building a calendar table:

Another thing to keep in mind here is that you’re only going to load your calendar table once, so if it takes two minutes to do, who really cares? The version I have should run reasonably fast: I calculated 726 years in 19 seconds on slow hardware and 11 seconds on fast hardware. I’m sure you can play code golf and get it done faster, but that’s probably not a good use of your time.

What you want to sweat instead is query time: how long is it taking to access this data?

Click through for a script.
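If you just want the shape of the idea before clicking through, here is a minimal sketch (not the script from the post; the table and column names are illustrative): generate one integer per day with a row-number tally, turn each integer into a date, and persist the result.

    DECLARE @StartDate date = '2000-01-01', @EndDate date = '2030-12-31';

    WITH Numbers AS
    (
        -- Generate one integer per day in the range.
        SELECT TOP (DATEDIFF(DAY, @StartDate, @EndDate) + 1)
            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
        FROM sys.all_objects AS a
            CROSS JOIN sys.all_objects AS b
    )
    SELECT
        DATEADD(DAY, n, @StartDate) AS TheDate,
        YEAR(DATEADD(DAY, n, @StartDate)) AS CalendarYear,
        MONTH(DATEADD(DAY, n, @StartDate)) AS CalendarMonth,
        DATEPART(WEEKDAY, DATEADD(DAY, n, @StartDate)) AS CalendarDayOfWeek
    INTO dbo.Calendar
    FROM Numbers;

Because the table loads once, the tuning that matters afterward is on the read side: a clustered primary key on TheDate, plus whatever nonclustered indexes your queries need.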


Dynamic Data Masking and Execution Plans

Arthur Daniels looks at what you can see in an execution plan with dynamic data masking:

I think Dynamic Data Masking is pretty cool. The idea is basically to provide a mask for certain users when they might see protected data. The documentation on this feature is actually pretty deep; it’s worth a look.

I just want to show you how you can see the masking in an execution plan. Let’s mask some data in StackOverflow2010! (Also, there’s an interesting side note at the end)

Click through for those notes.
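If you want a quick playground first, here is a small sketch (my own illustrative names, not Arthur’s StackOverflow2010 demo) that masks a column and queries it as a limited user:

    -- Mask an email column with the built-in email() masking function.
    CREATE TABLE dbo.Customers
    (
        CustomerID int IDENTITY(1,1) PRIMARY KEY,
        Email nvarchar(100) MASKED WITH (FUNCTION = 'email()') NOT NULL
    );
    INSERT INTO dbo.Customers(Email) VALUES (N'test@example.com');

    -- A user without the UNMASK permission sees tXXX@XXXX.com.
    CREATE USER MaskedReader WITHOUT LOGIN;
    GRANT SELECT ON dbo.Customers TO MaskedReader;

    EXECUTE AS USER = 'MaskedReader';
    SELECT CustomerID, Email FROM dbo.Customers;
    REVERT;

Run that SELECT with an actual execution plan enabled and you have a simple setup for spotting the masking behavior Arthur describes.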


Concerns With DISTINCT

Anvesh Patel does not like DISTINCT:

I am telling you personally that I hate the use of DISTINCT.
DISTINCT is used by people who are not sure about their data set, their SELECT statement, or their JOINs.
Whenever I get any query with DISTINCT, I immediately suggest removing it.

I agree with this sentiment about 85% of the time. There are cases where I know I am working with data at a finer grain than I need and the counts aren’t important. But just tossing a DISTINCT on a query to stop it from repeating rows is the wrong approach: figure out why that repetition happens and fix it.
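A quick illustration of what I mean, using hypothetical tables: a one-to-many join fans out the rows, and DISTINCT merely papers over the duplication.

    -- The join multiplies customer rows by their order count;
    -- DISTINCT hides the symptom instead of fixing the grain.
    SELECT DISTINCT c.CustomerID, c.CustomerName
    FROM dbo.Customers AS c
        INNER JOIN dbo.Orders AS o ON o.CustomerID = c.CustomerID;

    -- Asking the question at the right grain makes DISTINCT unnecessary:
    -- customers with at least one order.
    SELECT c.CustomerID, c.CustomerName
    FROM dbo.Customers AS c
    WHERE EXISTS (SELECT 1 FROM dbo.Orders AS o WHERE o.CustomerID = c.CustomerID);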


One More Data Gateway Is All You Need

Meagan Longoria explains when you might need data gateways when implementing an Azure BI architecture:

Let’s start with what services may require you to use a data gateway.

You will need a data gateway when you are using Power BI, Azure Analysis Services, PowerApps, Microsoft Flow, Azure Logic Apps, Azure Data Factory, or Azure ML with a data source/destination that is in a private network that isn’t connected to your Azure subscription with a VPN gateway. Note that a private network includes on-premises data sources and Azure Virtual Machines as well as Azure SQL Databases and Azure SQL Data Warehouses that require use of VNet service endpoints rather than public endpoints.  

There are a few of them so check out Meagan’s post and take notes.


Azure VM Boot Diagnostics

John Morehouse shows us how to enable and use boot diagnostics on Azure VMs to troubleshoot why that server isn’t coming up the way you’d expect:

The next blade will show you an active console of the virtual machine. From here, you are able to determine the current status of the virtual machine. You will also notice that you can gain access to the serial log (shown below), which will give you more detailed information about the boot process.
Once we click on Boot Diagnostics, we will then see the initial startup screens of the server.

This is useful if you have some huge misconfiguration and the server’s failing for some reason.
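John works through the portal; if you prefer scripting, the Azure CLI exposes the same feature. A sketch, assuming the CLI is installed, you’re logged in, and the resource group, VM, and storage account names are your own:

    # Enable boot diagnostics, writing screenshots and the serial log
    # to a storage account you control.
    az vm boot-diagnostics enable \
        --resource-group MyResourceGroup \
        --name MyVm \
        --storage https://mydiagstorage.blob.core.windows.net/

    # Pull the serial console log for the VM.
    az vm boot-diagnostics get-boot-log \
        --resource-group MyResourceGroup \
        --name MyVm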


Shredding Extended Event XML

Dave Mason shows us how you can use T-SQL to shred XML coming from extended events sessions:

Querying the data of an Extended Events session has never been easy. My XEvent sessions typically store event data in a target file, which means using sys.fn_xe_file_target_read_file. To get something of value, you need to shred the event data XML.

Doing this in T-SQL isn’t great. It’s probably better to shred in another language—F# would probably be my choice due to its type provider—and dump the results back into SQL. But if you want to stick to one language, Dave shows you how.
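For reference, a minimal sketch of the T-SQL approach (not Dave’s code; the session file name and the duration column are placeholders for whatever your session captures):

    SELECT
        x.event_data.value('(event/@name)[1]', 'nvarchar(128)') AS event_name,
        x.event_data.value('(event/@timestamp)[1]', 'datetime2') AS event_time,
        x.event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') AS duration
    FROM
    (
        -- fn_xe_file_target_read_file returns event_data as text;
        -- convert it to xml before shredding.
        SELECT CONVERT(xml, t.event_data) AS event_data
        FROM sys.fn_xe_file_target_read_file(N'MySession*.xel', NULL, NULL, NULL) AS t
    ) AS x;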


Sqoop From MySQL To Cloudera

Alan Choi and Laurel Hale show us how to use Sqoop to migrate data from MySQL into Impala:

The basic import steps described for tiny tables apply to importing bigger tables into Impala. The difference occurs when you construct your sqoop import command. For large tables, you want it to run fast, so setting parallelism to 1, which specifies one map task during the import, won’t work well. Instead, using the default parallelism setting, which is 4 map tasks to import in parallel, is a good place to start. So you don’t need to specify a value for the -m option unless you want to increase the number of parallel map tasks.
Another difference is that bigger tables usually have a primary key, which becomes a good candidate for splitting the data without skewing it. The tiny_table we imported earlier doesn’t have a primary key. Also note that the -e option for the sqoop import command, which instructs Sqoop to import the data returned for the specified SQL statement, doesn’t work if you split data on a string column. If string columns are used to split the data with the -e option, it generates incompatible SQL. So if you decide to split data on the primary key for your bigger table, make sure the primary key is on a column of a numeric data type, such as int, which works best with the -e option because it generates compatible SQL.

Read the whole thing. Sqoop has been around for a while because it does its job well.
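Distilled into a command, a big-table import might look like the following sketch (the host, database, and table names are made up; the flags are the standard Sqoop options the excerpt discusses):

    # Default parallelism (-m 4 is the default, shown here for clarity);
    # split on a numeric primary key to avoid skew.
    sqoop import \
      --connect jdbc:mysql://mysql-host/sales \
      --username sqoop_user -P \
      --table big_table \
      --split-by id \
      -m 4 \
      --target-dir /user/hive/warehouse/big_table

From there, the data can be exposed to Impala; the article walks through that part.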


Wait Stats And Missing Indexes

Arthur Daniels explains that missing indexes can cause high wait stat counts to appear:

At first, this statement might sound a bit confusing. Usually, we expect wait statistics to show us what a query is waiting on, whether it’s waiting on memory, loading pages from disk, or any of the other numerous wait types.
Once you start collecting wait statistics, you’ll have a lot of data to sort through. You might find waits like CX_PACKET, CX_CONSUMER, and PAGEIOLATCH. Surprisingly, these could mean that your databases aren’t well indexed.

This makes sense. At its core, wait stats tell you where SQL Server is hurting: where the bottleneck is. But just like a person at the doctor, SQL Server can only be so specific in how it relates this pain to you, and that specificity generally boils down to hardware components. The solution might be “get more hardware,” but as Arthur points out, writing better queries and using better indexes can mitigate those pains too.
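If you want to peek at your own top waits while reading, a standard DMV query (not one from Arthur’s post) does the trick:

    -- Top waits by total wait time; filter out benign background waits
    -- (this exclusion list is abbreviated for the example).
    SELECT TOP (10)
        ws.wait_type,
        ws.waiting_tasks_count,
        ws.wait_time_ms,
        ws.wait_time_ms - ws.signal_wait_time_ms AS resource_wait_ms
    FROM sys.dm_os_wait_stats AS ws
    WHERE ws.wait_type NOT IN
        (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'XE_TIMER_EVENT',
         N'BROKER_TASK_STOP', N'CHECKPOINT_QUEUE', N'WAITFOR')
    ORDER BY ws.wait_time_ms DESC;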


Pipelines Everywhere

John Mount explains the benefit of pipes and pipelines, and shows us an advanced pipe in R:

The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of each function’s arguments is primary, and the rest are parameters. For data science applications this is particularly common, so having convenient pipeline notation can be a plus. An example of a non-trivial data processing pipeline can be found here.

In this note we will discuss the advanced R pipeline operator “dot arrow pipe” and an S4 class (wrapr::UnaryFn) that makes working with pipeline notation much more powerful and much easier.

As you’d expect from John, there’s a lot of detail and it’s an interesting approach.
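To make the idea concrete, here is about the smallest possible example of wrapr’s dot arrow pipe (assuming the wrapr package is installed):

    library(wrapr)

    # Each stage's result is bound to the dot and fed to the next stage,
    # so this is equivalent to cos(sin(5)).
    5 %.>% sin(.) %.>% cos(.)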
