Press "Enter" to skip to content

Month: April 2018

R 3.5.0 Released

Tal Galili announces that R 3.5.0 is now available:

  • By default the (arbitrary) signs of the loadings from princomp() are chosen so the first element is non-negative.

  • If --default-packages is not used, then Rscript now checks the environment variable R_SCRIPT_DEFAULT_PACKAGES. If this is set, then it takes precedence over R_DEFAULT_PACKAGES. If default packages are not specified on the command line or by one of these environment variables, then Rscript now uses the same default packages as R. For now, the previous behavior of not including methods can be restored by setting the environment variable R_SCRIPT_LEGACY to yes.

  • When a package is found more than once, the warning from find.package(*, verbose=TRUE) lists all library locations.

  • POSIXt objects can now also be rounded or truncated to month or year.

Click through for the long, long list of changes.  H/T R-Bloggers


Natural Language Generation With Markov Chains

Abdul Majed Raja shows off Markovify, a Python package which builds sentences using Markov chains:

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state, e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first. Read more about how Markov chains work in this interactive article by Victor Powell.

Click through for a fun example of headline generation.
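
To give a sense of what the package looks like in practice, here is a minimal sketch of typical Markovify usage; the corpus file name is made up, and the class and method names are worth double-checking against the markovify documentation:

    import markovify

    # Read a plain-text corpus; "headlines.txt" is a hypothetical file with one headline per line.
    with open("headlines.txt", encoding="utf-8") as f:
        corpus = f.read()

    # Build a Markov chain model over the corpus.
    # NewlineText treats each line as a separate "sentence"; state_size is how many words form a state.
    model = markovify.NewlineText(corpus, state_size=2)

    # Generate a few candidate headlines; make_short_sentence caps the character length.
    for _ in range(5):
        headline = model.make_short_sentence(80)
        if headline is not None:
            print(headline)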


Putting The Distribution Database In An Availability Group

Sourabh Agarwal announces that you will soon be able to put the distribution database into an Availability Group:

Many enterprise customers have asked for the capability to combine SQL Server replication and Always On, so that they can place replication distribution databases within an Always On AG to achieve high availability for those databases. The expectation is that, when an AG failover happens, SQL Server replication will continue functioning seamlessly and correctly. While replication publication and subscription databases could already be configured to use Availability Groups, this support was lacking for replication distribution databases.

The SQL Server engineering team is excited to announce a new enhancement that allows the replication distribution database to be configured in an Availability Group. This enhancement will be available with SQL Server 2017 CU6 and will be ported to SQL Server 2016 in a subsequent CU for SP2.

For those using merge replication, note that it won’t be supported in this release.


Using Logical OR Conditions With DAX

Marco Russo shows how to create multiple slicers with logical OR (rather than the default AND) conditions in Power BI and Excel:

Usually, the logical AND condition is the expected behavior for scenarios with different slicers. However, there could be cases where there is a need for a logical OR condition between several slicers. In our example, the semantic would be, “show the quantity sold of products that are of brand Contoso or that have been sold to customers who have a Professional occupation”. The measures OR #1, OR #2, and OR #3 show the same result obtained with three different techniques.

The first two techniques are best practices and might provide different performance depending on the data distribution. The third technique is usually a bad idea because it implements a table filter rather than a multi-column filter. It is included for informational purposes, and as an example of what not to do in production code.

It’s not trivial, but it’s possible.
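
The DAX itself is in Marco’s article; purely to illustrate the OR-versus-AND semantics he describes (and not his DAX techniques), here is a small pandas sketch with made-up column names:

    import pandas as pd

    # Toy sales data; the column names (Brand, Occupation, Quantity) are invented for illustration.
    sales = pd.DataFrame({
        "Brand":      ["Contoso", "Litware", "Contoso", "Litware"],
        "Occupation": ["Clerical", "Professional", "Professional", "Clerical"],
        "Quantity":   [10, 20, 30, 40],
    })

    contoso = sales["Brand"] == "Contoso"
    professional = sales["Occupation"] == "Professional"

    # Default slicer behavior: logical AND between the two filters.
    qty_and = sales[contoso & professional]["Quantity"].sum()   # 30

    # The semantics Marco builds in DAX: logical OR between the two filters.
    qty_or = sales[contoso | professional]["Quantity"].sum()    # 10 + 20 + 30 = 60

    print(qty_and, qty_or)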


Automating Data Importation With dbatools

Bob Pusateri shows how to keep a dataset up to date using Powershell and dbatools:

Now for the magic – let’s load some data! The dbatools command that does all the heavy lifting here is called Import-DbaCsvToSql. It loads CSV files into a SQL Server table quickly and easily. As an added bonus, the entire import is within a transaction, so if an error occurs everything gets rolled back. I like to specify my tables and datatypes ahead of time, but if you want to load into a table that doesn’t exist yet, this script will create a table and do its best to guess the appropriate datatype. To use, simply point it at a CSV file and a SQL Server instance, database, and (optionally) a table. It will take care of the rest.

It’s a nice post covering one more method of automating data loads without a major amount of orchestration.
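
This is not dbatools, but for those on the Python side of the fence, a heavily simplified sketch of the same idea (transactional load, table created and datatypes guessed for you) might look like the following; the connection string, file name, and table name are all placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; adjust server, database, and ODBC driver to your environment.
    engine = create_engine(
        "mssql+pyodbc://user:password@MyServer/MyDatabase?driver=ODBC+Driver+17+for+SQL+Server"
    )

    # Read the CSV; pandas infers column datatypes, loosely mirroring Import-DbaCsvToSql's guessing.
    df = pd.read_csv("daily_extract.csv")

    # engine.begin() wraps the load in a transaction, so a failure rolls everything back.
    # to_sql creates the table if it does not already exist.
    with engine.begin() as conn:
        df.to_sql("StagingTable", conn, if_exists="append", index=False)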


Command Shell In SQL Agent Only Runs One Command

Kenneth Fisher makes an important discovery about SQL Agent jobs running in command shell mode:

Command shell steps in SQL Agent jobs only process the first command. (Yea, I know it’s the title of the post so this wasn’t much of a surprise. You’ll live.)

In order to prove this, I created a job with a series of steps, each one a different test. I’m using two very simple commands: whoami, which tells you who is running the shell (very useful if you aren’t certain), and dir, which I will assume everyone knows.

Kenneth ran through several tests to see what would trigger the “first command” rule and it turns out that pretty much everything does.


Row-By-Row Is Slow-By-Slow

Lukas Eder points out that row-by-row updates are a great way of slowing down your system:

The best way to find out is to benchmark. I’m doing two benchmarks for this:

  1. One that is run in PL/SQL, showing the performance difference between different approaches that are available to PL/SQL (namely looping, the FORALL syntax, and a single bulk UPDATE)

  2. One that is run in Java, doing JDBC calls, showing the performance difference between different approaches available to Java (namely looping, caching PreparedStatement but still looping, batching, and a single bulk UPDATE)

The results tend to be even more dramatic on SQL Server, where the row-by-row overhead is even greater.
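
As a very rough, database-agnostic illustration of the contrast (not Lukas’s PL/SQL or JDBC benchmarks), the sqlite3 sketch below issues one UPDATE per row and then a single set-based UPDATE; over a network to a real RDBMS the gap is far larger, since every per-row statement pays a round trip:

    import sqlite3
    import time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, archived INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO post (id, archived) VALUES (?, 0)",
        [(i,) for i in range(100_000)],
    )
    conn.commit()

    # Row-by-row ("slow-by-slow"): one UPDATE statement per row.
    start = time.perf_counter()
    for (post_id,) in conn.execute("SELECT id FROM post").fetchall():
        conn.execute("UPDATE post SET archived = 1 WHERE id = ?", (post_id,))
    conn.commit()
    print("row-by-row: ", time.perf_counter() - start)

    # Set-based: a single bulk UPDATE touching the same rows.
    start = time.perf_counter()
    conn.execute("UPDATE post SET archived = 0")
    conn.commit()
    print("bulk UPDATE:", time.perf_counter() - start)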


Async Processing With Azure Analysis Services

Teo Lachev notes that you can process Azure Analysis Services cubes without maintaining an HTTP connection:

AAS supports processing tasks asynchronously with REST APIs. The difference is that the service component (REST API) maintains the connectivity to the server – thus reducing the chances of HTTP disconnections from the external application. Microsoft has provided a RestAPISample console app to help you get started. As with any REST API invocation, you’d need to register the app in the Azure Portal so that you can authenticate successfully. Other than that, it’s simple to invoke the REST API and Microsoft has provided step-by-step instructions.

Another, although synchronous, option is to run a PowerShell script in the Azure Cloud Shell environment. You can upload the script as a file. The script can ask you to provide credentials interactively (the Get-Credential cmdlet) or you can hardcode the credentials. Here is an example of a PowerShell script that processes a specific table.

Click through to check out how to do this.
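
For a sense of what the asynchronous call looks like outside of Microsoft’s RestAPISample app, here is a hedged Python sketch; the region, server, model, and table names are placeholders, token acquisition is omitted, and the exact URL and payload shape should be verified against the Azure Analysis Services asynchronous refresh documentation:

    import requests

    region = "westus"                  # placeholder AAS rollout region
    server = "myaasserver"             # placeholder server name
    model = "AdventureWorks"           # placeholder model (database) name
    token = "<Azure AD bearer token>"  # acquire via MSAL / azure-identity; not shown here

    # The async refresh endpoint is roughly of this form (see the official docs for specifics).
    url = f"https://{region}.asazure.windows.net/servers/{server}/models/{model}/refreshes"

    body = {
        "Type": "Full",
        "CommitMode": "transactional",
        "Objects": [{"table": "FactSales"}],   # refresh a single placeholder table
    }

    resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()

    # A 202 Accepted comes back with a Location header identifying the refresh operation,
    # which can be polled later without keeping an HTTP connection open.
    print(resp.status_code, resp.headers.get("Location"))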


Azure Data Factory v2 And Decompression

Ben Jarvis notes a file naming bug with Azure Data Factory v2 when decompressing files:

ADF V2 natively supports decompression of files as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed, so 1234_567.csv.gz would become 1234_567.csv; however, I’ve noticed that this doesn’t happen in all cases.

In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

Click through for more details and be sure to vote on his Azure Feedback bug if this affects you.


Five Books For Learning Kafka

Data Flair has a guide to five books to help you learn Apache Kafka:

The book “Kafka: The Definitive Guide” is written by engineers from Confluent and LinkedIn who are responsible for developing Kafka.

They explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with the platform, with detailed examples throughout. You’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer. Even if you are new to Apache Kafka, whether as an application architect, developer, or production engineer, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

I haven’t read any of them yet, but a couple look interesting enough to add to my to-read list.
