Compacting R Libraries

Kevin Feasel

2019-03-28

R

Dirk Eddelbuettel shows how you can save a lot of space by stripping excess information from R packages:

Back in August of 2017, we wrote two posts #9: Compating your Share Libraries and #10: Compacting your Shared Libraries, After The Buildabout “stripping” shared libraries. This involves removing auxiliary information (such as debug symbols and more) from the shared libraries which can greatly reduce the installed size (on suitable platforms – it mostly matters where I work, i.e. on Linux).

There’s a pretty good space savings in the tidyverse package. H/T R-Bloggers.

Consistency Versus Availability with Kafka

Kevin Feasel

2019-03-28

Hadoop

Sourabh Verma lists some of the areas where you can make a conscious tradeoff between consistency and availability with Apache Kafka:

1. Cluster Size (N): Number of nodes/brokers in the Kafka cluster, we should have 2x+1, i.e. at least 3 nodes or more in an odd number.
2. Partitions: We write/publish data/event into a topic which is divided into partitions (by default 1), but we should have M times N, where can be any integer number, i.e. M >= 1, to achieve more parallelism and partitioning of data over the cluster.
3.Replication Factor: determines the number of copies (including the original/Leader) of each partition in the cluster. All replicas of a partition exist on separate node/broker, and we should never have R.F. > N, but at least 3. 
We recommend having 3 RF with 3 or 5 nodes cluster. This helps in having both availabilities as well as consistency.

Click through for several more tradeoff points.

Connecting to Jira with Powershell

Kevin Marquette has a new Powershell module:

JiraPS is a wonderful module that is great at a lot of things. There several great people that have put a lot of time into it for it to have such good feature coverage. It was designed to be approachable and do a lot of validation for you. JiraPS does not have a good user story around bulk operations with its issue commands and thats what I need from it the most.

It also has a large user base with hundreds of thousands of downloads and is used in a lot of organizations. Every change made to that module at this point needs to pay very close attention to backwards compatibility.

Now there are two modules depending on your use case.

Bring .NET Support to Spark

I have a request that you vote up a Spark issue:

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included in here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.

You an click through and read everything I have to say, but do go to the Spark ticket and vote for .NET support.

Against Surrogate Keys on Junction Tables

Lukas Eder explains the costs of surrogate keys on tables intended to join multiple tables together:

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Lukas makes a good argument here.

Date Dimensions and Large Power BI Files

Reza Rad explains why your Power BI data size might be abnormally large:

If you have a large *.pbix file, you can investigate what are the columns and tables that causing the highest storage consumption, using Power BI Helper. You can download Power BI Helper for free from here. I opened the file above in Power BI Helper, and in the Modeling Advise tab, this is what I see:

As you can see in the above output, the Date field in the Date table is the biggest column in this dataset. taking 150MB runtime memory! This is considering that we have only three distinct values in the column! Seems a bit strange, isn’t? let’s dig into the reason more in deep.

Read on for Reza’s explanation and what you can do to fix it.

Pacemaker Changes Affecting SQL on Linux

Randolph West has an important message if you’re running SQL Server on Linux:

Heads up for SQL Server on Linux folks using availability groups and Pacemaker. Pacemaker 1.1.18 has been out for a while now, but it’s worth mentioning that there was a behaviour change in how it fails-over a cluster. While the new behaviour is considered “correct”, it may affect you if you’ve configured availability groups on a previous version (specifically 1.1.16).

Click through for more details and what you can do about this.

Persisting Databases with Docker Compose

Rob Sewell tries out docker-compose to persist databases using named volumes:

With all things containers I refer to my good friend Andrew Pruski. Known as dbafromthecold on twitter he blogs at https://dbafromthecold.com

I was reading his latest blog post Using docker named volumes to persist databases in SQL Server and decided to give it a try.

His instructions worked perfectly and I thought I would try them using a docker-compose file as I like the ease of spinning up containers with them.

Read on for Rob’s travails, followed by great success. And never go into something named “the spooky basement;” that’s just good life advice.

Categories

March 2019
MTWTFSS
« Feb Apr »
 123
45678910
11121314151617
18192021222324
25262728293031