Press "Enter" to skip to content

Day: September 4, 2019

Exploratory Data Analysis with ExPanDaR

Joachim Gassen walks us through the ExPanDaR package in R:

The ‘ExPanDaR’ package offers a toolbox for interactive exploratory data analysis (EDA). You can read more about it here. The ‘ExPanD’ shiny app allows you to customize your analysis to some extent but often you might want to continue and extend your analysis with additional models and visualizations that are not part of the ‘ExPanDaR’ package.

Thus, I am currently developing an option to export the ‘ExPanD’ data and analysis to an R Notebook. While it is not ready for CRAN yet, it seems to work reasonably well and I would love to see some people trying it and letting me know about any bugs or other issues that they encounter. Hence, this blog post.

Looks like an interesting package. H/T R-bloggers

Comments closed

Paired RDDs in Spark

Ramandeep Kaur explains how Paired Resilient Distributed Datasets (PairRDDs) differ from regular RDDs:

So, assuming that you have a fair idea about what Spark is and the basics of RDDs. Paired RDD is one of the kinds of RDDs. These RDDs contain the key/value pairs of data. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.

Paired RDDs bring us back to that key-value pair paradigm which Hadoop’s version of MapReduce brought to the forefront.

Comments closed

Saving Data in Docker Containers

Anthony Nocentino has a three-part series on persisting SQL Server data in Docker containers. Part 1 takes us through volumes:

Let’s talk about how we can use Docker Volumes and SQL Server to persist data. If we want to run SQL Server in a container we will want to decouple our data from the container itself. Doing so will enable us to delete the container, replace it and start up a new one pointing at our existing data. When running SQL Server in a container will store data in /var/opt/mssql by default. When the container starts up for the first time it will put the system databases in that location and any user databases created will also be placed at this location by default. 

Part 2 looks at how volumes differ between the Linux and Mac/Windows versions of Docker:

So in my previous post, we discussed Docker Volumes and how they have a lifecycle independent of the container enabling us to service the container image independent of the data inside the container. Now let’s dig into Volumes a little bit more and learn where Docker actually stores that data on the underlying operating system.  

Part 3 ties it in with SQL Server:

Makes sense…we changed where SQL Server is reading/writing data. macOS doesn’t support a file mode called O_DIRECT which allows for unbuffered read/write access to the file opened using the open system call.  O_DIRECT is used by systems that manage their own file caching, like relational database management systems (RDBMS). So as SQL starts up and tries to open the master database with O_DIRECT the files can’t be opened because the macOS kernel doesn’t support this mode. And this is the reason why we have to have that Linux VM around. That Linux VM will support O_DIRECT option on the file opened. See more about this at the GitHub issue here.

Definitely worth getting a handle on this if you’re interested in containers.

Comments closed

Azure Data Studio Server Groups

Dave Bland explains how we can get different tab colors in Azure Data Studio:

Have you ever been in an environment where you worked with both DEV\QA servers and Production servers?  Of course you have, we all have.  However, being in such an environment brings a certain level of risk.  We all try to be very careful to run any code we are asked to run on the correct server and in the correct environment.  Despite all of our checks and our Due diligence, we are still human and do make a mistake from time to time.  Azure Data Studio has a nice feature that will allows us to change the color of tabs based on the environment the server is in.  The image below is an example of what it would look like.

I really like this functionality because it gives you one additional warning that you’re about to run a dev script against a production server on accident.

Comments closed

When Indexes Collide

Andy Mallon gives us a case where it makes sense to have a non-clustered index which shares the same columns as your clustered index columns:

First off, let’s remember the difference between clustered & nonclustered indexes

The clustered index is organized by the key columns. It also includes every other column as part of the row structure (ie, it has the entire row).

The nonclustered index is also organized by the key columns. It implicitly includes the clustering key columns (if the table is clustered), or a pointer to the row (if the table’s a heap). If any INCLUDE columns are explicitly specified, they will also be included in the index structure (but these included columns don’t affect order).

I’ve seen other cases where it made sense on sufficiently large and wide tables even for seeks (where the page density difference is large enough that you have a 4-level clustered index but a 3-level non-clustered index), so I think there’s more than Andy’s one corner case. But I do agree that it generally doesn’t help.

Comments closed

Multi-Column Slicers in Power BI

Marco Russo shows us how we can filter on multiple columns in a single slicer in Power BI:

Power BI provides slicers for a single column, but there are scenarios where it could be useful to consolidate alternative filters for multiple columns in a single slicer. Technically, this is not possible in Power BI through the standard visualizations, but you can use a particular data modeling technique to obtain the desired result.

Consider the case of a Customer table with a geographical hierarchy with ContinentCountry, and State. The requirement is to enable a filter over California (State), France (Country), or Asia (Continent) using a single slicer

Marco takes us through the process and offers up a clever solution.

Comments closed

Power BI Desktop to SSAS Error with No Models

Shabnam Watson takes us through various cases where you might hit an error connecting your Power BI Desktop installation to an SSAS server:

I recently ran into an error while connecting live from Power BI Desktop to a SSAS server. Everything was on-premises.There was no cloud component involved. I had full admin rights to the SSAS server and could see all databases and models from SSMS and other tools but was getting this error from PBI Desktop.

The Server you’re trying to connect to doesn’t have any models or you don’t have permissions to access them.

If I explicitly specified the name of the database I wanted to connect to, then it would connect and show me the database contents but if I did not specify the database name, I would get this error. The question is why?

Read on for several possible answers.

Comments closed