Neither service is truly “set and forget”; each requires a dedicated engineer to learn the service and maintain it. You can use various tools to automate many aspects of the operation, but someone will have to maintain the automation scripts and workflows.
That said, here are things that I’ve heard first-hand from talking to users:
The bottom line there is that Redshift is a bit more mature than BigQuery today, but keep an eye on both of them.
The important thing to understand about window functions is that there is a frame (I’m not sure what the best term to use is) at any point in time while the data is being scanned or processed.
Let’s look at the same data set Kathi used. For simplicity, I’ll use a few images of her dataset, but I’ll examine the SalesOrderID, which I think is easier to follow than looking at the amounts.
Here’s the base dataset for two customers, separated by CustomerID and ordered by the OrderDate. I’ve included amount, but it’s really not important.
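To make the frame idea concrete, here is a minimal sketch of the kind of query under discussion; the table and column names are assumptions based on the description above, not the actual code from the post:

```sql
-- A minimal sketch using assumed table/column names (not from the post).
-- The ROWS clause spells out the frame: for each row, the window runs from
-- that customer's first order up to the current order.
SELECT
    CustomerID,
    SalesOrderID,
    OrderDate,
    Amount,
    ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS OrderSequence,
    SUM(Amount) OVER (
        PARTITION BY CustomerID
        ORDER BY OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS RunningTotal
FROM dbo.SalesOrders
ORDER BY CustomerID, OrderDate;
```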
Steve goes into detail and explains what’s going on each step of the way. Window functions are extremely useful; check them out if you’re not already familiar with them.
An interesting thing about FIELD_TERMINATOR is that it can be multi-character; MSDN uses ~|~ as a potential delimiter. The reason you’d look at a multi-character delimiter is that not all file formats handle quoted identifiers very well (that is, putting quotation marks around strings that contain commas so that those commas are read as punctuation rather than field separators). For example, the default Hive SerDe (Serializer and Deserializer) does not handle quoted identifiers; you can easily grab a different SerDe which does offer them and use it instead, or you can make your delimiter something which is guaranteed not to show up in the file.
You can also set some defaults such as date format, string format, and data compression codec you’re using, but we don’t need those here. Read the MSDN doc above if you’re interested in digging into that a bit further.
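As a rough sketch of what that looks like in practice (the format name is invented), an external file format with a multi-character delimiter might be defined like this:

```sql
-- Sketch only: the format name is made up, but the options are standard
-- CREATE EXTERNAL FILE FORMAT settings for delimited text.
CREATE EXTERNAL FILE FORMAT TildePipeTildeFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = '~|~',
        USE_TYPE_DEFAULT = TRUE
    )
);
```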
It’s a bit of a read, but the end result is that we can retrieve data from a Hadoop cluster as though it were coming from a standard SQL Server table. This is easily my favorite feature in SQL Server 2016.
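For a sense of the end state, here is a minimal sketch of an external table definition; the data source, file format, and column names are placeholders, and the external data source itself is assumed to exist already:

```sql
-- Sketch with made-up names: MyHadoopCluster and TildePipeTildeFormat are
-- assumed to have been created beforehand with CREATE EXTERNAL DATA SOURCE
-- and CREATE EXTERNAL FILE FORMAT.
CREATE EXTERNAL TABLE dbo.HadoopSales
(
    SalesOrderID INT,
    CustomerID INT,
    OrderDate DATE,
    Amount DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/data/sales/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = TildePipeTildeFormat
);

-- Once created, it queries like any other table.
SELECT TOP(10) * FROM dbo.HadoopSales;
```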
Our prior copyright license said you couldn’t install this on servers you don’t own. We’d had a ton of problems with consultants and software vendors handing out outdated or broken versions of our scripts, and then coming to us for support.
Now, it’s a free-for-all! If you find the scripts useful, go ahead and use ’em. Include sp_Blitz, sp_BlitzCache, sp_BlitzIndex, etc. as part of your deployments for easier troubleshooting.
This is very good news.
A lot of this process involves designing and analyzing A/B tests, particularly about changing our targeting algorithms, ad design, and other factors to improve clickthrough rate (CTR). This process is more statistically interesting than I’d expected, in some cases letting me find new uses for methods I’d used to analyze biological experiments, and in other cases encouraging me to learn new statistical tools. In fact, much of my series on applying Bayesian methods to baseball batting statistics is actually a thinly-veiled version of methods I’ve used to analyze CTR across ad campaigns.
Sounds like a fun place to be.
Microsoft’s next recommendation is to make sure that predicate pushdown is enabled. To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there: cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text editor you want). Copy the value for yarn.application.classpath, which should be a pretty long string. Mine looks like:
Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml. Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.
This is part one of a series on using Polybase.
We see a different behaviour:
gather() has brought messy into a long data format with a warning by treating all columns as variables, while
melt() has treated trt as an id variable. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. Indeed, if
melt() does not receive any id.variables specification, then it will use the factor or character columns as id variables.
gather() requires the columns that need to be treated as ids; all the other columns are going to be used as key-value pairs.
Despite those last different results, we have seen that the two functions can be used to perform exactly the same operations on data frames, and only on data frames! Indeed,
gather() cannot handle matrices or arrays, while
melt() can, as shown below.
It seems that these two tools have some overlap, but each has its own point of focus: tidyr is simpler for data tidying, whereas reshape2 has functionality (like data aggregation) which tidyr does not include.
The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set. Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.
In this use case, we will configure a Time-partitioned File Set that stores data in Avro format by using TPFSAvro as the sink.
I like the fact that there’s a UI for this. Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.
“Sparsing” is the way SQL Server optimizes storage for NULL values at the cost of overhead for non-NULL values.
In other words, if you expect a column to contain more NULLs than non-NULLs, you can mark that column as SPARSE to optimize the space.
I’ve seen situations where a lot of columns in Data Mart tables were almost completely filled with NULLs, and I started wondering whether SPARSE could be a good tool to gain some space.
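For reference, SPARSE is just a column-level modifier; here is a minimal sketch with invented table and column names:

```sql
-- Invented names: SensorReading is expected to be NULL on most rows,
-- so it is declared SPARSE to avoid storing those NULLs.
CREATE TABLE dbo.SensorData
(
    SensorDataID   BIGINT IDENTITY(1, 1) PRIMARY KEY,
    CollectionTime DATETIME2(0) NOT NULL,
    SensorReading  DECIMAL(18, 4) SPARSE NULL
);
```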
Read the whole thing. I am not a fan of sparse columns because they prohibit things like page-level compression. Be sure to read the restrictions on using sparse columns before you give them a try; on net, I think they’re more trouble than they’re worth except in edge cases like extremely denormalized tables collecting thousands of data points from sensors.
In this example I just returned zero if I found an error, but you can return the error message if you like with [Revenue Per Item][ErrorMessage]. This is a great error handling method for when an error comes out of the blue in your data set. I always recommend using the try method to get rid of errors that might stop the whole solution from working properly.
I have to mention that the steps above are separated to show you what the output of the try expression looks like. In fact, you can combine both steps above into a single step with TRY OTHERWISE, as below (thanks to Maxim Zelensky for pointing this out):
The end result is code which is a bit more complex, but safely handles a number of edge cases.
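For readers who haven’t used the construct, here is a generic sketch of the try ... otherwise pattern in Power Query M; the step and column names are invented, and this is not the formula from the post:

```
// Generic sketch (invented names): add a "Revenue Per Item" column,
// falling back to 0 whenever the division raises an error.
AddedRevenuePerItem =
    Table.AddColumn(
        Source,
        "Revenue Per Item",
        each try [Revenue] / [Items] otherwise 0,
        type number
    )
```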