When attempting predictive modeling with real-world data, you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:
- Missing, invalid, or out-of-range values.
- Categorical variables with large sets of possible levels.
- Novel categorical levels discovered during test, cross-validation, or model application/deployment.
- Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time-consuming).
- Nested model bias, which poisons results in non-trivial data processing pipelines.
Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real-world projects encounter all of these issues, and they are often ignored, leading to degraded performance in production.
vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
That’s immediately going onto my learn-more list.
R 3.4.4 has been released, and binaries for Windows, Mac, and Linux are now available for download on CRAN. This update (codenamed “Someone to Lean On”, likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release and shouldn’t cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series.
Read on to see the change list.
What we want is a table with an extra column representing the average price of all products belonging to the same category as the product on the current row.
One approach to this problem is to calculate the average price per category using an aggregate function and then join the result with the initial table on the Product Type column. The resulting table makes it easy to see whether a product is more expensive than the average for its category.
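A minimal sketch of that aggregate-and-join approach (the Products table and its ProductName, ProductType, and Price columns are assumed for illustration):

```sql
-- Average price per category via GROUP BY, joined back to the base table
SELECT
    p.ProductName,
    p.ProductType,
    p.Price,
    a.AvgPrice
FROM Products AS p
INNER JOIN (
    SELECT ProductType, AVG(Price) AS AvgPrice
    FROM Products
    GROUP BY ProductType
) AS a
    ON a.ProductType = p.ProductType;
```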
Although this would definitely do the job, the query would be fairly long and complicated and may lack readability. An alternative approach is to use a window function, which removes the need to mess with subqueries and joins. A windowed function lets you retrieve aggregated and non-aggregated values at the same time, whereas GROUP BY collapses the results into a single output row per group.
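Using the same assumed table, the window-function version needs no subquery or join:

```sql
-- AVG() OVER (PARTITION BY ...) returns the per-category average
-- on every row, alongside the non-aggregated columns
SELECT
    ProductName,
    ProductType,
    Price,
    AVG(Price) OVER (PARTITION BY ProductType) AS AvgPrice
FROM Products;
```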
I ask questions about window (or windowing) functions whenever I interview someone for a job. They are extremely useful things, and I highly recommend Itzik Ben-Gan’s windowing functions book for SQL Server 2012 if you want to learn a lot more.
Now that there are no connections, we can move the database. Depending on the situation, it might be worth setting the database to read-only or single-user mode first. In my case, I had the application taken down, so I felt confident no connections would be coming in.
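If you do want that extra safety, a minimal T-SQL sketch (the database name is hypothetical) would be:

```sql
-- Force the database into single-user mode, rolling back open transactions
-- so no straggling connection can block the move
ALTER DATABASE [MyAppDb] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- After the move completes, return the copy to normal service
ALTER DATABASE [MyAppDb] SET MULTI_USER;
```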
With one line of code we can specify the source and destination servers and the database name, indicate that we want to use the backup-and-restore method, and provide the path to a file share that both instance service accounts have access to.
The whole process is just five lines of code, so it could hardly be easier.
I’d never seen ORIGINAL_DB_NAME until recently and I thought it would be interesting to highlight it, and in particular the difference between it and DB_NAME. I use DB_NAME and DB_ID fairly frequently in support queries (for example, what database context a query is running in, or which database given database files belong to). So let’s start with DB_NAME.
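As a quick hypothetical illustration of the difference (assuming a connection whose initial catalog is AdventureWorks):

```sql
USE tempdb;

SELECT
    DB_NAME()          AS current_context,  -- tempdb: follows the USE statement
    ORIGINAL_DB_NAME() AS original_catalog, -- AdventureWorks: from the connection string
    DB_ID()            AS current_db_id,    -- id of the current database context
    DB_NAME(2)         AS name_for_id_2;    -- name for a given database id (2 is tempdb)
```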
Click through to learn when to use each.
I recently got this error in Management Studio when trying to view a deadlock graph that was collected with an extended events session:
Failed to initialize deadlock control.
Key cannot be null.
Parameter name: key
I found this error in a session that included the xml_deadlock_report event.
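For context, a minimal sketch of an Extended Events session that captures this event (the session and file names are assumptions):

```sql
-- Write each xml_deadlock_report event to an event file target
CREATE EVENT SESSION [capture_deadlocks] ON SERVER
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.event_file (SET filename = N'capture_deadlocks.xel')
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION [capture_deadlocks] ON SERVER STATE = START;
```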
Read on for more information, and do check the comments where Lonny Niederstadt points out that even a victimless deadlocking scenario can have an ultimate victim: performance.