That’s the core of our code. The main function instantiates a new Kafka producer and gloms onto the Flights topic. From there, we call the loadEntries function. The loadEntries function takes a topic and filename. It streams entries from the 2008.csv file and uses the ParallelSeq library to operate in parallel on data streaming in (one of the nice advantages of using functional code: writing thread-safe code is easy!). We filter out any records whose length is zero—there might be newlines somewhere in the file, and those aren’t helpful. We also want to throw away the header row (if it exists) and I know that that starts with “Year” whereas all other records simply include the numeric year value. Finally, once we throw away garbage rows, we want to call the publish function for each entry in the list. The publish function encodes our text as a UTF-8 bytestream and pushes the results onto our Kafka topic.
All this plus a bonus F# pitch.
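The original code is F# and runs in parallel via ParallelSeq; as a rough, sequential Python sketch of the same filtering steps, something like the following would do. The name load_entries mirrors the function described above, but the signature and the publish callback here are assumptions for illustration, standing in for whatever actually pushes bytes to the Kafka topic.

```python
from typing import Callable, Iterable


def load_entries(lines: Iterable[str], publish: Callable[[bytes], None]) -> int:
    """Filter raw CSV lines and publish each surviving one as UTF-8 bytes.

    `publish` is a hypothetical callback standing in for a Kafka producer.
    Returns the number of entries published.
    """
    published = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip zero-length records (stray newlines in the file).
        if len(line) == 0:
            continue
        # Skip the header row, which starts with "Year" rather than a numeric year.
        if line.startswith("Year"):
            continue
        publish(line.encode("utf-8"))
        published += 1
    return published
```

Passing a list's append method as publish is an easy way to see what would be sent without a Kafka broker on hand.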
Variables are the best thing to happen to DAX since, well, forever – they are so cool I’m almost ready to like DAX as much as I like MDX. There are already several good articles and blog posts out there describing how to use them (see here and here), but I was looking at a Profiler trace the other day and saw something I hadn’t yet realised about them: you can declare and use variables in the DEFINE clause of a DAX query. Since my series of posts on DAX queries still gets a fair amount of traffic, I thought it would be worth writing a brief post showing how this works.
There are some limitations, but Chris shows a way of getting around one of them.
Before we get to the numbers, an overview of the test environment, query set and data is in order. The Impala and Hive numbers were produced on the same 10-node d2.8xlarge EC2 VMs. To prepare the Impala environment the nodes were re-imaged and re-installed with Cloudera’s CDH version 5.8 using Cloudera Manager. The defaults from Cloudera Manager were used to set up and configure Impala 2.6.0. It is worth pointing out that Impala’s Runtime Filtering feature was enabled for all queries in this test.
Data: While Hive works best with ORCFile, Impala works best with Parquet, so Impala testing was done with all data in Parquet format, compressed with Snappy compression. Data was partitioned the same way for both systems, along the date_sk columns. This was done to benefit from Impala’s Runtime Filtering and from Hive’s Dynamic Partition Pruning.
I’m impressed with both of these projects.
When building up urls from different parameters in something like TeamCity, or Octopus, it’s simple enough to get double “//” in urls if the parameters are not consistent. So little helper functions are always useful to have imported to manage such things. Below is an example of such a thing!
Click through for the function.
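This is not the linked author's function, just a minimal Python sketch of the same idea: trim stray slashes from each piece before joining, so inconsistent parameters cannot produce a double "//" in the path. The name join_url is made up for illustration, and the sketch assumes the first part carries the scheme and host together (e.g. "https://example.com/").

```python
def join_url(*parts: str) -> str:
    """Join URL fragments with exactly one "/" between them.

    Leading and trailing slashes on each fragment are dropped, and empty
    fragments are skipped. The "//" after the scheme survives because it
    sits inside the first fragment, not at its edges.
    """
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "/".join(cleaned)
```

For example, join_url("https://example.com/", "/app/", "builds") gives "https://example.com/app/builds". Note that a trailing slash on the last fragment is dropped too, which may or may not be what your server expects.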
An Azure Data Lake Analytics Unit, or AU, is a unit of computation resources made available to your U-SQL job. Each AU gives your job access to a set of underlying resources like CPU and memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM. As we see how people want to use the service, we may change the definition of an AU or add more options for controlling CPU and memory usage.
How AUs are used during U-SQL Query Execution
When you submit a U-SQL script for execution, the U-SQL compiler parallelizes the U-SQL script into hundreds or even thousands of tasks called vertices. Each vertex is allocated to one AU. The AU is dynamically allocated to the task and released once that particular task is completed.
I appreciate the ADL team’s transparency in how they define a unit. It’s much nicer to be able to tell someone that an AU is 2 CPU cores + 6 GB of RAM, rather than saying it’s some fuzzy measure of CPU + memory + I/O which has no direct bearing on your operations.
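To make that arithmetic concrete, here is a small Python sketch using the figures quoted above (2 cores and 6 GB per AU, one vertex per AU at a time). The function names are hypothetical, and the constants reflect the definition at the time of the quoted post, which the ADL team says may change.

```python
CORES_PER_AU = 2   # per the quoted definition; subject to change
RAM_GB_PER_AU = 6  # per the quoted definition; subject to change


def au_resources(au_count: int) -> dict:
    """Total CPU cores and RAM (GB) behind a given number of AUs."""
    return {
        "cpu_cores": au_count * CORES_PER_AU,
        "ram_gb": au_count * RAM_GB_PER_AU,
    }


def max_concurrent_vertices(au_count: int, vertex_count: int) -> int:
    """Upper bound on vertices running at once: each vertex holds one AU."""
    return min(au_count, vertex_count)
```

So a job given 10 AUs has 20 cores and 60 GB of RAM to work with, and a script that compiles to 1,000 vertices can run at most 10 of them at any moment.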
The options we are interested in are OPERATION_CLEANUP_ENABLED and RETENTION_WINDOW. By default, RETENTION_WINDOW is 365 and OPERATION_CLEANUP_ENABLED is TRUE.
Since we want to set our retention window to 10 days, we need to update RETENTION_WINDOW to 10. We could do this with a simple update statement, but Microsoft provides us with a stored procedure that will do that for us. The benefit of the stored procedure over the UPDATE statement is that a vendor-provided stored procedure will typically encapsulate any additional steps required.
I do not at all like the idea of running SHRINKDATABASE and definitely wouldn’t have that plus a backup in the deletion loop, but if you get caught in a nasty situation with SSISDB, this can serve as the starting point for digging yourself out.
Normally, it is easy enough to set up a Linked Server on SQL Server to other data sources. Problems are usually caused by one of the usual culprits that have to be addressed:
SQL Logins simply do not work well when trying to do this type of setup
The Windows login has to have permissions to the file (on a drive or network share)
The appropriate drivers have to be set up (64 bit / 32 bit)
Read on for a few different errors and their solutions.
The DTU Calculator, a third-party service created by Justin Henriksen (a Microsoft employee), will calculate the DTU requirements for our on-premises database that we want to migrate to Azure, by firstly capturing a few performance monitor counters, and then performing a calculation on those results, to provide the recommended service tier for our database.
Justin provides a command-line application or PowerShell script to capture these performance counters:
Processor – % Processor Time
Logical Disk – Disk Reads/sec
Logical Disk – Disk Writes/sec
Database – Log Bytes Flushed/sec
For more details on DTUs, John Sterrett looks at the math.
I want to make a couple of final points. I realize 99 indexes is a lot. It’s to emphasize the differences. However, they were also fairly small indexes and this is a single table where a normal database might easily have hundreds. So take these results as an example. They aren’t going to match real life but will hopefully show you how all of this can play out.
Indexes are awesome but you want to be smart about adding them. My personal rule of thumb, with no scientific evidence behind it, is 5 indexes or fewer and I’m pretty easy. 5-10 indexes and you’ll have to convince me. I’m going to be reviewing the existing indexes and seeing what I can get rid of, or maybe I can combine something. Past 10 indexes and it had best be for a query that’s running 100+ times a minute or something for the CEO.
Read on for demo code and specific results.
While R is an open source language, there are a number of different versions of R and each handles memory a little differently. Knowing which version is being used is important, especially when the code is going to be migrated to a server. As part of a SQL Server implementation, there are three different versions of R which come into play. The first is standard open source R, commonly known as CRAN R. This is the standard open source version of R which runs code in memory and is single threaded. The next version which will be installed as part of a SQL Server Installation is Microsoft R Open. This version of R was written to take advantage of the Intel Math Kernel Library [MKL]. Using the library speeds up many statistical calculations which use matrix operations. It also adds multi-threading capability to R as the rewrite provides the ability to use all available cores and processors and process in parallel. More information on how it works and how much faster Microsoft R Open is compared to standard R is available here. Once Microsoft R Open is installed, RStudio should automatically start using it. To check which version of R is in use, within RStudio, go to Tools->Global Options and look at the R version.
If you’re concerned about R Services taking up too much server memory, you should look at Resource Governor.