Press "Enter" to skip to content

Month: March 2018

Apache Spark 2.3

The Databricks team has been busy.  They’ve recently announced Apache Spark 2.3 on Databricks:

Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.3 marks a major milestone for Structured Streaming by introducing low-latency continuous processing and stream-to-stream joins; boosts PySpark by improving performance with pandas UDFs; and runs on Kubernetes clusters by providing native support for Apache Spark applications.

In addition to extending new functionality to SparkR, Python, MLlib, and GraphX, the release focuses on usability, stability, and refinement, resolving over 1400 tickets. Other salient features from Spark contributors include:

Anirudh Ramanathan and Palak Bathia also get into Kubernetes support in Spark 2.3:

Starting with Spark 2.3, users can run Spark workloads in an existing Kubernetes 1.7+ cluster and take advantage of Apache Spark’s ability to manage distributed data processing tasks. Apache Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging. Best of all, it requires no changes or new installations on your Kubernetes cluster; simply create a container image and set up the right RBAC rolesfor your Spark Application and you’re all set.

Concretely, a native Spark Application in Kubernetes acts as a custom controller, which creates Kubernetes resources in response to requests made by the Spark scheduler. In contrast with deploying Apache Spark in Standalone Mode in Kubernetes, the native approach offers fine-grained management of Spark Applications, improved elasticity, and seamless integration with logging and monitoring solutions. The community is also exploring advanced use cases such as managing streaming workloads and leveraging service meshes like Istio.

Stream to stream joins looks particularly interesting.

Comments closed

For Loops And R

John Mount has a couple of tips around using for loops in R.  First up, pre-allocate lists to make certain types of iterative processing faster:

Another R tip. Use vector(mode = "list") to pre-allocate lists.

result <- vector(mode = "list", 3)
print(result)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL

The above used to be critical for writing performant R code (R seems to have greatly improved incremental list growth over the years). It remains a convenient thing to know.

Also, use loop indices when iterating through for loops:

Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a for()-loop.

d <- c(Sys.time(), Sys.time())
print(d)
#> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

for(di in d) {
  print(di)
}
#> [1] 1518977777
#> [1] 1518977777

Notice we printed numbers, not dates/times.

Very useful information.

Comments closed

Using Python In SQL Server 2017

Emma Stewart has a post covering setup and configuration of SQL Server 2017 Machine Learning Services and using Python within SQL Server:

One of the new features of SQL Server 2017 was the ability to execute Python Scripts within SQL Server. For anyone who hasn’t heard of Python, it is the language of choice for data analysis. It has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks and is also a much simpler language to learn than others.

The release of SQL Server 2016, saw the integration of the database engine with R Services, a data science language. By extending this support to Python, Microsoft have renamed R Services to ‘Machine Learning Services’ to include both R and Python.

The benefits of being able to run Python from SQL Server are that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL Stored Procedure and you can also deploy the solution using the familiar development tool, Visual Studio.

ML Services is a great addition to SQL Server.

Comments closed

Query Store UserVoice Requests

Erin Stellato has a compendium of Query Store UserVoice requests:

In early January Microsoft announced that Connect, the method for filing SQL Server bugs and feature requests, was being retired.  It was replaced by User Voice, and any bugs/requests were ported over.  Sadly, the votes from Connect did not come across to User Voice, so I went through and found all the Query Store requests, which are listed below.  If you could please take the time to up-vote them, that would be fantastic.  If you could also take time to write about why this would help your business, help you upgrade, or purchase more SQL Server licenses, that is even better.  It helps the product team immensely to understand how this feature/fix/functionality helps you and your company, so taking 5 minutes to write about that is important.

Check them out and upvote any which look interesting.

Comments closed

Accessibility And Power BI Reports

Meagan Longoria has some tips to make your Power BI reports easier for people to read:

Avoid using color as the only means of conveying information. Add text cues where possible. It’s very common to show KPIs with a background color or a box next to a metric that uses red/yellow/green to indicate status. Users who have difficulties seeing color need another way to understand the status of a key metric. This could mean that you use a text icon in addition to or instead of color to indicate a status. Power BI reports often include conditional formatting to change the background color or font color of items in a table to convey high/low or acceptable/unacceptable values. If that is important for your users to understand, you could add a field containing the values “high” and “low” to the table itself or to the tooltips. Tooltips are accessible to screen readers via the accessible Show Data table (Alt + Shift + F11).

These are good design principles in addition to providing accessibility benefits.

Comments closed

Trial And Error With Read-Only Replica Queries

Cody Konior stress tests Availability Group round-robin routing:

I’ve been hearing about round-robin read-only routing ever since SQL 2016 came out but whenever I tried to test if it’s working it never seemed to be. But now I know exactly how it works and there’s a few loopholes where it may not trigger, and they’re not the documented ones you’re thinking of.

To test the limits of it you’re going to need:

  • PowerShell 5.1
  • Pester 4 (Install-Module Pester -Force)
  • DbData (Install-Module DbData -Force)

I’ll explain any of the Pester and DbData bits along the way so don’t worry. They’re minor framework stuff.

There’s some good stuff here around connection pooling, so check it out.

Comments closed

Managing Multiple Power BI Accounts With Chrome

Ike Ellis has a quick tip for managing multiple Power BI accounts across different clients:

 As a consultant, I find it difficult to switch between accounts on PowerBI.com.

I have to log out of an existing account and log back in to a new account. The login process takes a long time. I have found a work around. I use google chrome to manage different chrome accounts, different themes, different cookies, and this allows me to stay logged in to multiple power bi accounts at the same time.

Great tip.

Comments closed

The Date Data Type

Randolph West continues his dates and times series:

QL Server 2008 introduced new data types to handle dates and times in a more intelligent way than the previous DATETIME and SMALLDATETIME types that we looked at previously.

The first one we look at this week is DATE. Whereas DATETIME uses eight bytes and SMALLDATETIME uses four bytes  to store their values, DATE only needs a slender three bytes to store any date value between 0001-01-01 and 9999-12-31inclusive.

The DATE data type was a fantastic addition to SQL Server 2008.

Comments closed

Rolling Calculations In R

Steph Locke explains some tricky behavior with window functions in R:

So looking at the code I wrote, you may have expectedc2 to hold NA, 3, 5, ... where it’s taking the current value and the prior value to make a window of width 2. Another reasonable alternative is that you may have expected c2 to hold NA, NA, 3, ... where it’s summing up the prior two values. But hey, it’s kinda working like cumsum() right so that’s ok! But wait, check out c3. I gave c3 a window of width 3 and it gave me NA, 6, 9, ... which looks like it’s summing the prior value, the current value, and the next value. …. That’s weird right?

It turns out the default behaviour for these rolling calculations is to center align the window, which means the window sits over the current value and tries it’s best to fit over the prior and next values equally. In the case of us giving it an even number it decided to put the window over the next values more than the prior values.

Knowing your window is critical when using a window function, and knowing that some functions have different default windows than others helps you be prepared.

Comments closed

Disagreement On Outliers

Antony Unwin reviews how various packages track outliers using the Overview of Outliers plot in R:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Interesting reading.

Comments closed