I thought it might be fun to use the geoms' aesthetics to see if we could cluster aesthetically similar geoms closer together. The heatmap below uses cosine similarity and hierarchical clustering to reorder the matrix so that similar geoms end up closer to one another (note that today I learned from “R for Data Science” about the seriation package [https://cran.r-project.org/web/packages/seriation/index.html], which may make this matrix-reordering task much easier).
It’s an interesting analysis of what’s available within ggplot2 and a detailed look at how different geoms fit together with respect to aesthetic options.
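As a rough sketch of the reordering idea (in Python rather than R, with a made-up geom-by-aesthetic matrix, and a greedy nearest-neighbour ordering standing in for full hierarchical clustering):

```python
import math

# Hypothetical 0/1 matrix: rows are geoms, columns are aesthetics
# (1 = the geom supports that aesthetic). Values are illustrative only.
geoms = ["geom_point", "geom_line", "geom_bar", "geom_tile"]
aes = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pairwise cosine similarity between geoms' aesthetic vectors.
sim = [[cosine(r, s) for s in aes] for r in aes]

# Greedy seriation: start at row 0 and repeatedly append the most similar
# unvisited row -- a simple stand-in for the hierarchical-clustering reorder.
order = [0]
remaining = list(range(1, len(geoms)))
while remaining:
    nxt = max(remaining, key=lambda j: sim[order[-1]][j])
    order.append(nxt)
    remaining.remove(nxt)

print([geoms[i] for i in order])
```

With this toy input, the two geoms that share identical aesthetic vectors come out adjacent in the reordered matrix, which is exactly the property the heatmap exploits.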
Susan Li has a series on multi-class text classification in Python. First up is analysis with PySpark:
Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.
When a new crime description comes in, we want to assign it to one of the 33 categories. The classifier assumes that each new crime description is assigned to one and only one category. This is a multi-class text classification problem.
* Input: Descript
* Example: “STOLEN AUTOMOBILE”
* Output: Category
* Example: VEHICLE THEFT
To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in Spark. Let’s get started!
Then, she looks at multi-class text classification with scikit-learn:
The classifiers and learning algorithms cannot directly process the text documents in their original form, as most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.
One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.
Specifically, for each term in our dataset, we will calculate a measure called Term Frequency-Inverse Document Frequency, abbreviated to tf-idf.
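As a minimal illustration of the tf-idf calculation (pure Python, a toy corpus standing in for the complaint narratives, and the classic tf * log(N/df) weighting rather than scikit-learn's smoothed variant):

```python
import math
from collections import Counter

# Toy documents standing in for crime/complaint narratives (illustrative only).
docs = [
    "stolen automobile",
    "stolen bicycle",
    "automobile vandalism",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter()
for toks in tokenized:
    for term in set(toks):
        df[term] += 1

def tfidf(term, toks):
    tf = toks.count(term) / len(toks)   # term frequency within the document
    idf = math.log(N / df[term])        # inverse document frequency
    return tf * idf

# "stolen" appears in 2 of 3 documents, "bicycle" in only 1, so "bicycle"
# gets the higher weight in the second document.
weights = {t: tfidf(t, tokenized[1]) for t in tokenized[1]}
print(weights)
```

The point of the weighting: a term common across the whole corpus tells you little about any one document, so its idf (and hence its weight) shrinks toward zero.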
This is a nice pair of articles on the topic. Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.
The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. Like all conventions, it is just a matter of teaching, learning, and repetition to make this seem natural, familiar, and legible.
My preference is to use dplyr + magrittr because I really do like that pipe operator. John’s point is well-taken, however: you don’t need to use the tidyverse to write clean R code, and there can be value in using the base functionality.
Next we need to create a stored procedure that will accept JSON text as a parameter and insert it into the table. Two important points here:
JSON text must use the NVARCHAR(MAX) data type in SQL Server in order to support the JSON functions.
The OPENJSON function is used to convert the JSON text into a rowset, which is then inserted into the previously created table.
The whole process is quite easy; check it out.
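OPENJSON itself is T-SQL, but the shredding step it performs is easy to picture in any language. Here is a hedged Python sketch of the same transformation (a hypothetical payload; column names invented for illustration): JSON text in, a rowset shaped like the target table out.

```python
import json

# JSON text as it might arrive at the stored procedure (illustrative payload).
json_text = '''
[
  {"id": 1, "name": "Widget", "price": 9.99},
  {"id": 2, "name": "Gadget", "price": 4.50}
]
'''

# Conceptually what OPENJSON does: turn the JSON array into a rowset
# whose columns line up with the target table, ready to insert.
rows = [(item["id"], item["name"], item["price"])
        for item in json.loads(json_text)]

for row in rows:
    print(row)
```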
An expression is a statement that evaluates to either true or false. In our case we are just checking if the $hour_of_day variable is less than 12 and greater than or equal to 5. When comparing numeric values, you use the same operators you are familiar with from PowerShell, such as -ne. When comparing string values, you have to use operators you might be more used to from other languages, such as !=, and both values must be enclosed in double quotes: [ "string1" == "string2" ].
With just the things Mark has shown so far, you can begin to build helpful scripts.
This week, we look at the DATETIME2 data type. I’m not the first person to think that this was probably not the best name for a data type, but here we are, a decade later. DATETIME2 is, at its heart, a combination of the DATE and TIME data types we covered in previous weeks. DATE is 3 bytes long, and TIME is between 3 and 5 bytes long depending on accuracy. This of course means that DATETIME2 can be anything from 6 to 8 bytes in length.
Nowadays, if you want to store a date plus time, this should be your default, not DATETIME.
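The 6-to-8-byte range falls out directly from the component sizes. A small Python sketch of the arithmetic, using the TIME storage sizes as documented for SQL Server (3 bytes up to 2 fractional digits, 4 bytes for 3-4, 5 bytes for 5-7):

```python
# DATE always takes 3 bytes; TIME's size depends on fractional-second
# precision. Their sum gives DATETIME2's documented 6-8 byte range.
DATE_BYTES = 3

def time_bytes(precision):
    """Storage size of TIME for a given fractional-second precision (0-7)."""
    if precision <= 2:
        return 3
    if precision <= 4:
        return 4
    return 5

for p in (0, 3, 7):
    print(f"DATETIME2({p}) -> {DATE_BYTES + time_bytes(p)} bytes")
```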
Some of my functions in the demo code were showing up just fine. I was really puzzled by that. I thought …
Maybe this is a bug with ‘CREATE OR ALTER’? A sign of some weird memory pressure? Something introduced in SQL Server 2017? A buggy side effect of implicit conversions in some of the functions? A problem with the queries I was using? A weird setting on the database? (Also: about 100 other things that didn’t turn out to be the case.)
I finally wrote up some simple demo code, tested it against a SQL Server 2008 R2 instance (omitting the Query Store components), compared it with SQL Server 2017, and found it to be consistent.
Click through to see which types of functions show up and which ones stay hidden.
One of the concepts I find people misunderstand frequently is the recovery interval, either for the server as a whole or the per-database setting that was introduced in SQL Server 2012 for indirect checkpoints.
There are two misconceptions here:
* The recovery interval equals how often a checkpoint will occur.
* SQL Server guarantees the recovery interval (i.e., crash recovery for the database will only take the amount of time specified in the recovery interval).
It’s good to keep this in mind.
Regardless of whether you like to use tabs or spaces, this is where you go to configure your settings. The first part of the screen controls the indenting options. If “None” is selected, then the next line will start at the beginning of the line. If you have selected “Block”, then it will align the next line with the previous line. And if you are using “Smart”, then the appropriate language will determine which indenting style to use.
The next section controls the tab size / indent size. This controls how many characters a tab takes. It also controls whether tabs are converted to spaces or kept as tabs.
You can read more about these options at this link: Manage Code Formatting.
I turn on the View Whitespace option that Wayne mentions because I’m a formatting pedant that way.