2018-03-12 – Curated SQL

Apache Spark 2.3

Published 2018-03-12 by Kevin Feasel

The Databricks team has been busy. They’ve recently announced Apache Spark 2.3 on Databricks:

Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.3 marks a major milestone for Structured Streaming by introducing low-latency continuous processing and stream-to-stream joins; boosts PySpark by improving performance with pandas UDFs; and runs on Kubernetes clusters by providing native support for Apache Spark applications.

In addition to extending new functionality to SparkR, Python, MLlib, and GraphX, the release focuses on usability, stability, and refinement, resolving over 1400 tickets. Other salient features from Spark contributors include:

DataSource v2 APIs [SPARK-15689, SPARK-20928]
Vectorized ORC reader [SPARK-16060]
Spark History Server v2 with K-V store [SPARK-18085]
Machine Learning Pipeline API model scoring with Structured Streaming [SPARK-13030, SPARK-22346, SPARK-23037]
MLlib Enhancements Highlights [SPARK-21866, SPARK-3181, SPARK-21087, SPARK-20199]
Spark SQL Enhancements [SPARK-21485, SPARK-21975, SPARK-20331, SPARK-22510, SPARK-20236]

Anirudh Ramanathan and Palak Bhatia also get into Kubernetes support in Spark 2.3:

Starting with Spark 2.3, users can run Spark workloads in an existing Kubernetes 1.7+ cluster and take advantage of Apache Spark’s ability to manage distributed data processing tasks. Apache Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging. Best of all, it requires no changes or new installations on your Kubernetes cluster; simply create a container image and set up the right RBAC roles for your Spark Application and you’re all set.

Concretely, a native Spark Application in Kubernetes acts as a custom controller, which creates Kubernetes resources in response to requests made by the Spark scheduler. In contrast with deploying Apache Spark in Standalone Mode in Kubernetes, the native approach offers fine-grained management of Spark Applications, improved elasticity, and seamless integration with logging and monitoring solutions. The community is also exploring advanced use cases such as managing streaming workloads and leveraging service meshes like Istio.

Stream-to-stream joins look particularly interesting.
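
To give a feel for what a stream-to-stream join involves, here is a minimal conceptual sketch in plain Python. It is not Spark's API or implementation; the Event type, the "left"/"right" stream labels, and the max_gap parameter are all hypothetical names chosen for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    side: str   # "left" or "right" (hypothetical stream labels)
    key: str    # join key shared by both streams
    ts: int     # event time in seconds
    value: str  # payload


def stream_join(events, max_gap):
    """Pair left and right events with the same key at most max_gap seconds apart.

    Assumes events arrive in timestamp order, so buffered state older than
    (current timestamp - max_gap) can never match again and is dropped.
    """
    buffers = {"left": defaultdict(list), "right": defaultdict(list)}
    matches = []
    for ev in events:
        other = "right" if ev.side == "left" else "left"
        # Probe the opposite side's buffered state for this key.
        for cand in buffers[other][ev.key]:
            if abs(cand.ts - ev.ts) <= max_gap:
                left, right = (ev, cand) if ev.side == "left" else (cand, ev)
                matches.append((ev.key, left.value, right.value))
        # Remember this event so later arrivals on the other side can match it.
        buffers[ev.side][ev.key].append(ev)
        # Evict state that is too old to match any future event.
        horizon = ev.ts - max_gap
        for side in buffers.values():
            for key in list(side):
                side[key] = [e for e in side[key] if e.ts >= horizon]
    return matches


events = [
    Event("left", "ad-1", 0, "impression"),
    Event("right", "ad-1", 5, "click"),
    Event("right", "ad-2", 6, "click"),
    Event("left", "ad-2", 30, "impression"),
]
print(stream_join(events, max_gap=10))
```

The point of the sketch is that both sides have to be buffered, and that a time bound is what keeps that buffered state from growing forever.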

For Loops And R

Published 2018-03-12 by Kevin Feasel

John Mount has a couple of tips around using for loops in R. First up, pre-allocate lists to make certain types of iterative processing faster:

Another R tip. Use vector(mode = "list") to pre-allocate lists.
result <- vector(mode = "list", 3)
print(result)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
The above used to be critical for writing performant R code (R seems to have greatly improved incremental list growth over the years). It remains a convenient thing to know.

Also, use loop indices when iterating through for loops:

Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a for()-loop.
d <- c(Sys.time(), Sys.time())
print(d)
#> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

for(di in d) {
  print(di)
}
#> [1] 1518977777
#> [1] 1518977777
Notice we printed numbers, not dates/times.

Very useful information.

Using Python In SQL Server 2017

Published 2018-03-12 by Kevin Feasel

Emma Stewart has a post covering setup and configuration of SQL Server 2017 Machine Learning Services and using Python within SQL Server:

One of the new features of SQL Server 2017 was the ability to execute Python Scripts within SQL Server. For anyone who hasn’t heard of Python, it is the language of choice for data analysis. It has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks and is also a much simpler language to learn than others.

The release of SQL Server 2016 saw the integration of the database engine with R Services, a data science language. By extending this support to Python, Microsoft have renamed R Services to ‘Machine Learning Services’ to include both R and Python.

The benefits of being able to run Python from SQL Server are that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL Stored Procedure and you can also deploy the solution using the familiar development tool, Visual Studio.

ML Services is a great addition to SQL Server.

Query Store UserVoice Requests

Published 2018-03-12 by Kevin Feasel

Erin Stellato has a compendium of Query Store UserVoice requests:

In early January Microsoft announced that Connect, the method for filing SQL Server bugs and feature requests, was being retired. It was replaced by User Voice, and any bugs/requests were ported over. Sadly, the votes from Connect did not come across to User Voice, so I went through and found all the Query Store requests, which are listed below. If you could please take the time to up-vote them, that would be fantastic. If you could also take time to write about why this would help your business, help you upgrade, or purchase more SQL Server licenses, that is even better. It helps the product team immensely to understand how this feature/fix/functionality helps you and your company, so taking 5 minutes to write about that is important.

Check them out and upvote any which look interesting.

Accessibility And Power BI Reports

Published 2018-03-12 by Kevin Feasel

Meagan Longoria has some tips to make your Power BI reports easier for people to read:

Avoid using color as the only means of conveying information. Add text cues where possible. It’s very common to show KPIs with a background color or a box next to a metric that uses red/yellow/green to indicate status. Users who have difficulties seeing color need another way to understand the status of a key metric. This could mean that you use a text icon in addition to or instead of color to indicate a status. Power BI reports often include conditional formatting to change the background color or font color of items in a table to convey high/low or acceptable/unacceptable values. If that is important for your users to understand, you could add a field containing the values “high” and “low” to the table itself or to the tooltips. Tooltips are accessible to screen readers via the accessible Show Data table (Alt + Shift + F11).

These are good design principles in addition to providing accessibility benefits.

Trial And Error With Read-Only Replica Queries

Published 2018-03-12 by Kevin Feasel

Cody Konior stress tests Availability Group round-robin routing:

I’ve been hearing about round-robin read-only routing ever since SQL 2016 came out, but whenever I tried to test whether it was working, it never seemed to be. But now I know exactly how it works, and there are a few loopholes where it may not trigger, and they’re not the documented ones you’re thinking of.

To test the limits of it you’re going to need:

PowerShell 5.1

Pester 4 (Install-Module Pester -Force)

DbData (Install-Module DbData -Force)

I’ll explain any of the Pester and DbData bits along the way so don’t worry. They’re minor framework stuff.

There’s some good stuff here around connection pooling, so check it out.

Managing Multiple Power BI Accounts With Chrome

Published 2018-03-12 by Kevin Feasel

Ike Ellis has a quick tip for managing multiple Power BI accounts across different clients:

As a consultant, I find it difficult to switch between accounts on PowerBI.com.

I have to log out of an existing account and log back in to a new account. The login process takes a long time. I have found a workaround. I use Google Chrome to manage different Chrome accounts, different themes, different cookies, and this allows me to stay logged in to multiple Power BI accounts at the same time.

Great tip.

The Date Data Type

Published 2018-03-12 by Kevin Feasel

Randolph West continues his dates and times series:

SQL Server 2008 introduced new data types to handle dates and times in a more intelligent way than the previous DATETIME and SMALLDATETIME types that we looked at previously.

The first one we look at this week is DATE. Whereas DATETIME uses eight bytes and SMALLDATETIME uses four bytes to store their values, DATE only needs a slender three bytes to store any date value between 0001-01-01 and 9999-12-31 inclusive.

The DATE data type was a fantastic addition to SQL Server 2008.
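
As a quick sanity check on the three-byte claim, the arithmetic can be sketched in Python. This assumes a date is stored as a count of days since 0001-01-01; the little-endian byte order and the encode_date / decode_date helpers are illustrative assumptions, not a statement of SQL Server's exact on-disk format.

```python
from datetime import date

MIN_DATE = date(1, 1, 1)
MAX_DATE = date(9999, 12, 31)

# Number of distinct calendar days in the supported range, inclusive.
distinct_days = (MAX_DATE - MIN_DATE).days + 1


def encode_date(d):
    """Pack a date into 3 bytes as days since 0001-01-01 (byte order is an assumption)."""
    return (d - MIN_DATE).days.to_bytes(3, "little")


def decode_date(raw):
    """Reverse encode_date: turn 3 bytes back into a date."""
    return date.fromordinal(int.from_bytes(raw, "little") + MIN_DATE.toordinal())


print(distinct_days)              # days that need a distinct code
print(2 ** 16, 2 ** 24)           # capacity of two bytes versus three bytes
print(decode_date(encode_date(date(2018, 3, 12))))
```

Two bytes can only distinguish 65,536 values, which is far short of the roughly 3.65 million days in the range, while three bytes offer over 16 million, so three is the smallest whole number of bytes that works.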

Day: March 12, 2018