Press "Enter" to skip to content

Month: December 2017

Data Skip Techniques In Impala

Mostafa Mokhtar, et al, explain a few methods for skipping unneeded data in Impala queries:

Each Apache Parquet file contains a footer where metadata can be stored including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the min_value and max_value fields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated min and max fields.

The effectiveness of the Parquet min_value/max_value column statistics for data skipping can be increased by ordering (or clustering1) data when it is written by reducing the range of values that fall between the minimum and maximum value for any given file. It was for this reason that Impala 2.9 added the SORT BY clause to table DDL which directs Impala to sort data locally during an INSERT before writing the data to files.

Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).

Comments closed

Animated Dot Plots In R

John MacKintosh shows how to create dot plots in ggplot2, and then he uses gganimate to turn it into an animated plot:

If you want to follow along (go on) then you should head over to Neil’s site, download the excel file and take a look at the “how to” guide on the same page. Existing R users are already likely to be shuddering at all the manual manipulation required.

For the first attempt, I followed Neil’s approach pretty closely, resulting in a lot of code to sort and group, although ggplot2 made the actual plotting much simpler. I shared my very first attempt ( produce with barely any ggplot2 code) which was quite good, but there were a few issues – the ins/ outs being coloured blue instead of grey, and overplotting of several points.

Click through for code and explanation.  H/T R-Bloggers

Comments closed

Power BI Dashboard Sharing

Reza Rad covers one method of sharing Power BI content with users:

What dashboard sharing as the name of it explains is based on a dashboard. You can only share a dashboard with this method, not a report. Consider that you have a dashboard like below screenshot, and you want to share it. There is a share link at the top right corner of the dashboard.

Dashboard sharing have very few options to set and is very simple to configure. You just need to add the email address of people whom you want to share this report. You can also write a message for them to know that this report is shared with them.

Click through for more information.  Note that this is a paid feature.

Comments closed

Splitting Large Files Out In SQL Server

Tracy Boggiano has a script which splits out large files in a filegroup into a smaller set of files:

The solution I offer allows you to break your files into any size you want by rebuilding your indexes. You will need some disk space for it to create the new files while it runs the process then it will drop the large file.  This will also take up some space in your transaction log so if you not running your transaction log backups frequently enough you could have a lot of disk space taken up by that so watch out for that. All the code can be downloaded from my github repository here.

Read on for an explanation of the entire process.

Comments closed

Using DATEADD Instead Of DATEDIFF

Michael J. Swart points out a bit of trickery with DATEDIFF:

I assumed that the DATEDIFF function I wrote worked this way: Subtract the two dates to get a timespan value and then return the number of seconds (rounded somehow) in that timespan.

But that’s not how it works. The docs for DATEDIFF say:

“Returns the count (signed integer) of the specified datepart boundaries crossed between the specified startdate and enddate.”

There’s no rounding involved. It just counts the ticks on the clock that are heard during a given timespan.

Read the whole thing.

Comments closed

Logging Perfmon Data

Raul Gonzalez has started a new series around getting perfmon data into SQL Server.  First up is logging perfmon counters:

Here you have the template I have used to create my Data Collector, you just need to write it down to a XML file and change some of the counters which are related to SQL Server.

When I say MSSQL$MSSQL2016, that is because this counter refer to a named instance called MSSQL2016. If that was the default instance, it’d be just “SQL Server”.

Example: <Counter>\SQL Server:Buffer Manager\Page life expectancy</Counter>

Once you adjust it, you’re good to go.

Click through for a sample data collector set and some instructions on logging counter values.

Comments closed

Hierarchical Clustering

Chaitanya Sagar explains hierarchical clustering with examples in R:

Hope now you have a better understanding of clustering algorithms than what you started with. We discussed about Divisive and Agglomerative clustering techniques and four linkage methods namely, Single, Complete, Average and Ward’s method. Next, we implemented the discussed techniques in R using a numeric dataset. Note that we didn’t have any categorical variable in the dataset we used. You need to treat the categorical variables in order to incorporate them into a clustering algorithm. Lastly, we discussed a couple of plots to visualise the clusters/groups formed. Note here that we have assumed value of ‘k’ (number of clusters) is known. However, this is not always the case. There are a number of heuristics and rules-of-thumb for picking number of clusters. A given heuristic will work better on some datasets than others. It’s best to take advantage of domain knowledge to help set the number of clusters, if that’s possible. Otherwise, try a variety of heuristics, and perhaps a few different values of k.

There’s a lot to pick out of this post, but you’re able to walk through it step by step.  H/T R-Bloggers

Comments closed

The Triumph Of Functional Programming

Amanda LeClair and Michael Facemire have a new report on functional programming:

The customer-facing software development world is outgrowing stateful, object-oriented (OO) development. The bar for great, intuitive customer experience has been raised by ambient, conversation-driven user interfaces, like through Amazon Alexa. Functional programming allows enterprises to take better advantage of compute power to deliver those experiences at scale; better flexibility for delivering the right output; and a more efficient way of delivering customer value. FP also reduces regression defects in code, simplifies code creation and maintenance, and allows for greater code reuse.

Just as object-oriented programming (OOP) emerged as the solution to the limitations of procedural programming at the dawn of the internet boom in the mid-’90s, FP is emerging as the solution to the limitations of OOP today. The shift is already underway– 53% of global developers reported that at least some teams in their companies are practicing functional programming and are planning to expand their usage.

Alexey Sommer notes that functional programming has been sneaking into C# bit by bit for well over a decade:

Retrospective

C# 1.0 Visual Studio 2002

C# 1.1 Visual Studio 2003 – #line, pragma, xml doc comments

C# 2.0 Visual Studio 2005 – Generics, Anonymous methods, iterators/yield, static classes

C# 3.0 Visual Studio 2008 – LINQ, Lambda Expressions, Implicit typing, Extension methods

C# 4.0 Visual Studio 2010 – dynamic, Optional parameters and named arguments

C# 5.0 Visual Studio 2012 – async/await, Caller Information, some breaking changes

C# 6.0 Visual Studio 2015 – Null-conditional operators, String Interpolation

C# 7.0 Visual Studio 2017 – Tuples, Pattern matching, Local functions

I strongly believe that if you are a database developer and need to pick up a non-SQL programming language, functional languages will be a lot easier for you to get than object-oriented languages.  Many of the principles line up much smoother with functional languages, as you can most clearly see with the relationship between Scala and Spark.

Comments closed