Running PySpark In Visual Studio Code

Jenny Jiang shows how to run PySpark on HDInsight in VSCode:

We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. For PySpark developers who value productivity of Python language, VSCode HDInsight Tools offer you a quick Python editor with simple getting started experiences, and enable you to submit PySpark statements to HDInsight clusters with interactive responses. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights.

Click through to see how it’s done.

A/B Testing With R

Kevin Feasel

2017-11-27

R

Mira Celine Klein shows how to compare two versions of a feature (or advertising campaign or whatever) to determine if one is better than the other:

In comparison to other methods, conducting an A/B test does not require extensive statistical knowledge. Nevertheless, some caveats have to be taken into account.

When making a statistical decision, there are two possible errors (see also table 1): A Type I error means that we observe a significant result although there is no real difference between our groups. A Type II error means that we do not observe a significant result although there is in fact a difference. The Type I error can be controlled and set to a fixed number in advance, e.g., at 5%, often denoted as α or the significance level. The Type II error in contrast cannot be controlled directly. It decreases with the sample size and the magnitude of the actual effect. When, for example, one of the designs performs way better than the other one, it’s more likely that the difference is actually detected by the test in comparison to a situation where there is only a small difference with respect to the target metric. Therefore, the required sample size can be computed in advance, given α and the minimum effect size you want to be able to detect (statistical power analysis). Knowing the average traffic on the website you can get a rough idea of the time you have to wait for the test to complete. Setting the rule for the end of the test in advance is often called “fixed-horizon testing”.

Click through for more, including a sample with code.  H/T R-Bloggers

R Internals: Data Sizes With Nullable Columns

Niels Berglund digs into the Binary Exchange Langage (BXL) and notices something weird about data sizes:

When looking at the data sent, the size of the packages and “drilling” into the TCP packets we could deduct that: :

  • Each column has an over-head of 32 bytes (at least for non nullable data)

  • The size of the column in one row is the size of the data type for numeric types.

  • For decimal and numeric an extra byte is added to each column, where this byte indicates the precision.

  • Columns of alpha numeric type all had 2 bytes pre-pended to the bytes, except max types.

  • For char and nchar the storage size was 2 bytes plus the size the column was defined as.

  • For varchar and nvarchar the storage size was 2 bytes plus the size of the data stored.

  • For the varmax data types the number of bytes that were pre-pended varied dependent on the data size.

Read the whole thing.

TOP And UNIONs

Kevin Feasel

2017-11-27

T-SQL

Andy Levy points out that the TOP operator doesn’t always apply to each element in a UNION:

This system uses TOP every now and then trying to limit the number of records it gets back (and the developers always seem to use the arbitrary 301 – I’m guessing some degree of cargo cult programming going on). I’d prefer a well-constructed WHERE clause to limit the result set but beggars can’t be choosers when working with legacy vendor code.

What I found odd was that the Ledger1 table didn’t get a lot of traffic – with the WHEREclause in use (omitted here for brevity), you’d only get a handful of records, maybe a dozen at most.

Click through to see the rest of the problem, as well as Andy’s solution.

Thoughts On Data Sizing

Greg Low has some thoughts around data types and sizes:

I was recently at a site where they were changing all their bigint columns to uniqueidentifier columns (ie: GUID columns) because they were worried about running out of bigint values. In a word, that’s ridiculous. While it’s easy to say “64 bit integer”, I can assure you that understanding the size of one is out of our abilities. In 1992, I saw an article that said if you cleared the register of a 64 bit computer and put it in a loop just incrementing it (adding one), on the fastest machine available that day, you’d hit the top value in 350 years. Now machines are much faster now than back then, but that’s a crazy big number.

If you’re at risk of running out of bigints, you have a lot of data—that’s more than 18 quintillion rows that you can hold, and if my quick math is up to snuff, would be roughly 127 thousand petabytes of data just to store the IDs.

When AT TIME ZONE Is Too Slow

Robert Davis troubleshoots a performance problem relating to time zones:

Time Zones were definitely being a drag today. I got an email from one of the developers at work asking about the performance difference between 2 queries. The only difference between the 2 queries is that one of them uses the AT TIME ZONE clause that was added in SQL Server 2016. I have not played around with this particular clause, but we do store quite a bit of data in the datetimeoffset data type. In the table in the developer’s queries, the data is all stored in the Eastern time zone, but they are considering storing it in additional time zones and will want to be able to display it in the Eastern time zone even if not stored that way. Thus, AT TIME ZONE.

When the developer was testing the conversion function, he noticed that the query slowed waaaayyyyy down when he added AT TIME ZONE. Before adding AT TIME ZONE to the query, STATISTICS TIME for the query was: CPU time: 145549 ms, elapsed time: 21693 ms.. It returned 8,996 rows, but if I removed the DISTINCT, it returned over 72M rows. That’s a lot of clams … er, data.

Read on for the rest of the story, including Robert’s solution.  Also check out his Connect item related to this.

Roll Your Own SSAS Performance Monitoring Tool

Shabnam Watson has a great post on building an Analysis Services performance monitoring tool from scratch using Power BI:

In many cases, SSAS works efficiently with default settings right out of the box. However, when you have large databases, substantial number of concurrent users, insufficient resources on your server, or when best practices are not followed during SSAS database design, you can run into performance bottlenecks and problems. In these scenarios, you need to know what to measure and how to measure them, what’s normal for your environment (benchmark), and you need to have some amount of historical measurements to be able to see the events that lead to a certain bad performance/failure point. Once you have this data, you can improve your server’s performance by addressing the problem(s).

This is a tour de force of an article, absolutely worth reading if you plan on dealing with Analysis Services at some point.  Even if you don’t build your own tool, you’ll learn a lot about what drives SSAS performance and what indicates that there might be a problem.

Fun With Dynamic SQL: Implicit Casting Can Allow SQL Injection

Remus Rusanu shows an example where implicit casting from NVARCHAR to VARCHAR can introduce a SQL injection vulnerability that you otherwise wouldn’t expect:

In both examples above the SQL executed apparently should had been safe from SQL injection, but it isn’t. Neither REPLACE nor QUOTENAME were able to properly protect and the injected division by zero was executed. The problem is the Unicode MODIFIER LETTER APOSTROPHE(NCHAR(0x02bc)) character that I used in constructing the NVARCHAR value, which is then implicitly cast to VARCHAR. This cast is converting the special ‘modifier letter apostrophe’ character to a plain single quote. This introduces new single quotes in the value, after the supposedly safe escaping occurred. The result is plain old SQL Injection.

Click through for the script.  The upside of this is that it’s entirely under your control and you should be able to get it right by using NVARCHAR consistently.

Categories

November 2017
MTWTFSS
« Oct Dec »
 12345
6789101112
13141516171819
20212223242526
27282930