One of the more common dilemmas schema designers face, and I’m realizing it’s something that I should probably spend more time on in my presentations, is whether to use
nvarcharfor columns that will store string data. As part of my #EntryLevel challenge, I thought I’d start by writing a bit about this here.
In general, I come across two schools of thought on this:
varcharunless you know you need to support Unicode.
nvarcharunless you know you don’t.
My preference is to start with Unicode. 15 years ago, you could easily get away with using ASCII for most US-developed systems, but the likelihood that you will need to store data in multiple, varied languages is significantly higher today. And having already refactored one application to support Unicode after it became fairly large, I’d rather not do that again…
This is a real-world problem that I came across the other day. In a reporting scenario, I wanted to output a number of values in an easy, human-readable way for a report. But just making a long, comma-separated string of numbers doesn’t really make it very readable. This is particularly true when there are hundreds of values.
So here’s a powerful pattern to solve that task.
I really like this. It takes the gaps & islands problem and goes one step further.
ML studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Acaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.
I’m interested in the Microsoft R Open language support, as Azure ML’s still using a relatively older version of R (3.1.0).
Up until August 1st if you had 2 vNets in the same Azure region (USWest for example) you needed to create a site to site VPN between them in order for the VMs within each vNet to be able to see each other. I’m happy to report that this is no longer the case (it is still the default configuration). On August 1st, 2016 Microsoft released a new version of the Azure portal which allows you to enable vNet peering between vNets within an account.
Now this feature is in public preview (aka. Beta) so you have to turn it on, which is done through Azure PowerShell. Thankfully it uses the Register-AzureRmProviderFeature cmdlet so you don’t need to have the newest Azure PowerShell installed, just something fairly recent (I have 1.0.7 installed). To enable the feature just request to be included in the beta like so (don’t forget to login with add-AzureRmAccount and then select-AzureRmSubscription).
Read the whole thing for details on how to enroll in this feature and how to set it up.
In this case, we have to assume that
Event IDs may be coming from anywhere and, as such, may not arrive in order. Even though we’re largely appending to the table, we may not be appending in a strict order. Using a clustered index to support the table isn’t the best option in this case – data will be inserted somewhat randomly. We’ll spend maintenance cycles defragmenting this data.
Another downside to this approach is that data is largely queried by
Owner ID. These aren’t unique, and one
Owner IDcould have many events or only a few events. To support our querying pattern we need to create a multi-column clustering key or create an index to support querying patterns.
This result is not intuitive to me, and I recommend reading the whole thing.
Reason 3: Poor transparency for teammates
Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.
Matthew then gives us four alternative products.
This post is an extension of a previous one that appears here:https://drsimonj.svbtle.com/quick-plot-of-all-variables.
In that prior post, I explained a method for plotting the univariate distributions of many numeric variables in a data frame. This post does something very similar, but with a few tweaks that produce a very useful result. So, in general, I’ll skip over a few minor parts that appear in the previous post (e.g., how to use
purrr::keep()if you want only variables of a particular type).
Read on for code, including a good bit of tidyr.
Wow.. That sure looks like three auto-created, multi-column statistics! We have three stats: stats_ids 3, 4, and 5. The sys.stats_column table contains one row for each column that is in a statistic, so multiple rows for a single statistic (i.e., a single stats_id value), indicate multiple columns in that stat. Indeed, the column_id values indicate the table columns contained the stat. So stats_id 3 contains columns VersionMajor and ApplicationID (column_ids 3 and 1), stats_id 4 contains columns VersionMinor and ApplicationID (column_ids 4 and 1), and stats_id 5 contains columns VersionRevision and ApplicationID (column_ids 5 and 1). And, clearly, the auto_created flag is true, so these three stats were auto-created. What’s going on?
Read on for the answer.
Allows you to bind performance measures to 3 states.
The thresholds for these 3 stages can be set manually or be data driven.
It can visualize one measure while using a different one for the indicator value
This is a pretty small visualization, but it could be useful as part of a larger dashboard.
If you are familiar with traditional Spark streaming you may notice that the above example is lacking an explicit batch duration. In structured streaming the equivalent feature is a trigger. By default it will run batches as quickly as possible, starting the next batch as soon as more data is available and the previous batch is complete. You can also set a more traditional fixed batch interval for your trigger. In the future more flexible trigger options will be added.
A related consequence is that windows are no longer forced to be a multiple of the batch duration. Furthermore, windows needn’t be only on processing time anymore, we can rearrange events that may have been delayed or arrived out of order and window by event time. Suppose our input stream had a column event_time that we wanted to do windowed counts on. Then we could do something like the following to get counts of events in a 1 minute window:
Right now, there are some pretty strict limitations on this new streaming, but I imagine they’ll loosen up quite soon.