Press "Enter" to skip to content

Day: October 7, 2021

Word Stemming and Text Processing in R

Genrikh Ananiev takes us through some examples of text processing in R:

First, there are a lot of classes (in fact, how many products you have so many classes) And if in this process you have to work not only with the company’s products, but also competitors, the growth of such new classes can occur every day – therefore it becomes meaningless to teach one time Model to be repeatedly used to predict new products.

Secondly, the number of documents (different variations of the same product) in the classes are not very balanced: there may be one by one to class, and maybe more.

Click through for an example of the classical technique versus a classification-based technique.

Comments closed

Defect Detection with AWS Lookout and Sagemaker

Matthew Rhodes, et al, take us through an interesting case study:

According to a recent study, defective products cost industries over $2 billion from 2012–2017. Defect detection within manufacturing is an important business use case, especially in high-value product industries like the automotive industry. This allows for early diagnosis of anomalies to improve production line efficacy and product quality, and saves capital costs. Although advanced anomaly detection systems employ sensors as well as Internet of Things (IoT) devices to collect multimodal data to improve performance, computer vision continues to be a common approach. Detecting anomalies in automotive parts and components using computer vision can be done using normal images, and even X-Ray based images for structural damages. Recent advances in deep learning and computer vision have allowed scientists and manufacturers to develop enhanced anomaly detection systems, including surface defect detection on automotive body panels and dent detection in vehicles.

Read on for case notes.

Comments closed

Measuring File Latency in SQL Server

Anthony Nocentino has a script and some tips for us:

This post is a reference post for retrieving IO statistics for data and log files in SQL Server. We’ll look at where we can find IO statistics in SQL Server, query it to produce meaningful metrics, and discuss some key points when interpreting this data.

Click through for the script, and then a bulleted list of things to keep in mind as you’re reviewing the data.

Comments closed

New in SQL Server Big Data Clusters

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.

Comments closed

Wackiness with TrimEnd in Powershell

Shane O’Neill digs into TrimEnd:

A couple of days ago, I was running some unit tests across a piece of PowerShell code for work and a test was failing where I didn’t expect it to.

After realising that the issue was with the workings of TrimEnd and my thoughts on how TrimEnd works (versus how it actually works), I wondered if it was just me being a bit stupid.

So I put a poll up on Twitter, and I’m not alone! 60% of the people answering the poll had the wrong idea as well.

The way that works is…not what I would have expected.

Comments closed

Scaling Limitations with Site Reliability Engineering

Tyler Treat argues that the Site Reliability Engineering paradigm doesn’t scale

:We encounter a lot of organizations talking about or attempting to implement SRE as part of our consulting at Real Kinetic. We’ve even discussed and debated ourselves, ad nauseam, how we can apply it at our own product company, Witful. There’s a brief, unassuming section in the SRE book tucked away towards the tail end of chapter 32, “The Evolving SRE Engagement Model.” Between the SLIs and SLOs, the error budgets, alerting, and strategies for handling change management, it’s probably one of the most overlooked parts of the book. It’s also, in my opinion, one of the most important.

Read on for an explanation of this chapter and how it applies to organizations trying to implement SRE.

Comments closed

Data Type Casing in SQL Server

Aaron Bertrand finds a case where casing matters:

We all have coding conventions that we have learned and adopted over the years and, trust me, we can be stubborn about them once they’re part of our muscle memory. For a long time, I would always uppercase data type names, like INTVARCHAR, and DATETIME. Then I came across a scenario where this wasn’t possible anymore: a case-sensitive instance. In a recent post, Solomon Rutzky suggested:

As long as you are working with SQL Server 2008 or newer, all data type names, including sysname, are always case-insensitive, regardless of instance-level or database-level collations.

I have a counter-example that has led me to be much more careful about always matching the case found in sys.types.

Click through for that scenario.

Comments closed