Kevin Feasel – Page 621

Word Stemming and Text Processing in R

Published 2021-10-07 by Kevin Feasel

Genrikh Ananiev takes us through some examples of text processing in R:

First, there are a lot of classes (in fact, as many classes as you have products). And if this process has to cover not only the company’s products but also competitors’ products, new classes can appear every day, so it becomes pointless to train a model once and reuse it repeatedly to predict new products.
Secondly, the number of documents (different variations of the same product) per class is not well balanced: a class may have only one document, or it may have more.

Click through for an example of the classical technique versus a classification-based technique.


Reasons Why Slow Queries Might Not Use Much CPU

Published 2021-10-07 by Kevin Feasel

Erik Darling has a compendium for us:

No one ever says a broken record is right twice a day, perhaps because DJs are far more replaceable than clock makers.
I say that only to acknowledge that I may sound like a broken record when I say that when you’re tuning a query, it’s quite important to compare wall clock time and duration. Things you should note:

Click through for those things.


Measuring File Latency in SQL Server

Published 2021-10-07 by Kevin Feasel

Anthony Nocentino has a script and some tips for us:

This post is a reference post for retrieving IO statistics for data and log files in SQL Server. We’ll look at where we can find IO statistics in SQL Server, query it to produce meaningful metrics, and discuss some key points when interpreting this data.

Click through for the script, and then a bulleted list of things to keep in mind as you’re reviewing the data.
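
As a rough illustration of the kind of metric involved, here is a minimal Python sketch that turns two snapshots of cumulative IO counters into average latency per operation. The field names mirror columns in sys.dm_io_virtual_file_stats; the FileIoStats class, the helper functions, and the sample numbers are hypothetical, and taking a delta between two snapshots is an assumption here rather than a description of the linked script.

```python
from dataclasses import dataclass


@dataclass
class FileIoStats:
    """One snapshot of cumulative counters for a single file (hypothetical container)."""
    num_of_reads: int
    io_stall_read_ms: int
    num_of_writes: int
    io_stall_write_ms: int


def avg_latency_ms(stall_ms: int, operations: int) -> float:
    """Average stall per operation; 0.0 when no operations occurred."""
    return stall_ms / operations if operations else 0.0


def interval_latency(earlier: FileIoStats, later: FileIoStats) -> tuple:
    """Average read and write latency (ms) between two snapshots."""
    read_ms = avg_latency_ms(
        later.io_stall_read_ms - earlier.io_stall_read_ms,
        later.num_of_reads - earlier.num_of_reads,
    )
    write_ms = avg_latency_ms(
        later.io_stall_write_ms - earlier.io_stall_write_ms,
        later.num_of_writes - earlier.num_of_writes,
    )
    return read_ms, write_ms


# Made-up sample values, purely for illustration.
first = FileIoStats(1000, 5000, 500, 1000)
second = FileIoStats(1500, 9000, 700, 1400)
read_latency, write_latency = interval_latency(first, second)
```

Because the counters are cumulative, dividing the raw totals gives a long-run average, while the delta between two snapshots describes only the interval between them.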


Wackiness with TrimEnd in PowerShell

Published 2021-10-07 by Kevin Feasel

Shane O’Neill digs into TrimEnd:

A couple of days ago, I was running some unit tests across a piece of PowerShell code for work and a test was failing where I didn’t expect it to.
After realising that the issue was with the workings of TrimEnd and my thoughts on how TrimEnd works (versus how it actually works), I wondered if it was just me being a bit stupid.
So I put a poll up on Twitter, and I’m not alone! 60% of the people answering the poll had the wrong idea as well.

The way that works is…not what I would have expected.
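
For a sense of how this kind of surprise can arise: .NET's TrimEnd takes a set of characters to strip, not a suffix string, and Python's str.rstrip follows the same rule. The sketch below uses Python as a stand-in. The sample string and the remove_suffix helper are hypothetical, and it is an assumption that this is the exact behavior the linked post explores.

```python
def remove_suffix(text: str, suffix: str) -> str:
    """Strip one exact trailing suffix, if present (hypothetical helper)."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


name = "Shanes_sqlserver"  # illustrative value only

# rstrip treats its argument as a SET of characters and keeps removing
# trailing characters for as long as each one belongs to that set.
char_set_trim = name.rstrip("_sqlserver")

# Removing the literal suffix stops at the suffix boundary.
suffix_trim = remove_suffix(name, "_sqlserver")
```

Here char_set_trim ends up as "Shan" while suffix_trim is "Shanes", because the trailing "es" of the name also belongs to the character set being trimmed.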


New in SQL Server Big Data Clusters

Published 2021-10-07 by Kevin Feasel

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.
Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.


Data Type Casing in SQL Server

Published 2021-10-07 by Kevin Feasel

Aaron Bertrand finds a case where casing matters:

We all have coding conventions that we have learned and adopted over the years and, trust me, we can be stubborn about them once they’re part of our muscle memory. For a long time, I would always uppercase data type names, like INT, VARCHAR, and DATETIME. Then I came across a scenario where this wasn’t possible anymore: a case-sensitive instance. In a recent post, Solomon Rutzky suggested:
As long as you are working with SQL Server 2008 or newer, all data type names, including sysname, are always case-insensitive, regardless of instance-level or database-level collations.
I have a counter-example that has led me to be much more careful about always matching the case found in sys.types.

Click through for that scenario.


Scaling Limitations with Site Reliability Engineering

Published 2021-10-07 by Kevin Feasel

Tyler Treat argues that the Site Reliability Engineering paradigm doesn’t scale:

We encounter a lot of organizations talking about or attempting to implement SRE as part of our consulting at Real Kinetic. We’ve even discussed and debated ourselves, ad nauseam, how we can apply it at our own product company, Witful. There’s a brief, unassuming section in the SRE book tucked away towards the tail end of chapter 32, “The Evolving SRE Engagement Model.” Between the SLIs and SLOs, the error budgets, alerting, and strategies for handling change management, it’s probably one of the most overlooked parts of the book. It’s also, in my opinion, one of the most important.

Read on for an explanation of this chapter and how it applies to organizations trying to implement SRE.


foreach() vs foreachPartition() in Spark

Published 2021-10-06 by Kevin Feasel

The Hadoop in Real World team contrasts two functions:

foreach() and foreachPartition() are action functions, not transformation functions. Because both are actions, they do not return an RDD.

Read on for the big difference between the two.
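
To make the distinction concrete without a Spark cluster, here is a plain-Python model: partitions are ordinary lists, foreach hands the function one element at a time, and foreach_partition hands it an iterator over a whole partition. Everything here (FakeConnection, the helper functions, the sample data) is hypothetical and only imitates the shape of the Spark calls. That per-partition setup cost is the difference the linked post has in mind is an assumption.

```python
from typing import Callable, Iterator, List


class FakeConnection:
    """Stand-in for an expensive resource such as a database connection."""
    opened = 0  # counts how many connections were created

    def __init__(self) -> None:
        FakeConnection.opened += 1
        self.rows: List[int] = []

    def write(self, row: int) -> None:
        self.rows.append(row)


def foreach(partitions: List[List[int]], fn: Callable[[int], None]) -> None:
    """Model of foreach: the function is called once per element."""
    for partition in partitions:
        for row in partition:
            fn(row)


def foreach_partition(partitions: List[List[int]],
                      fn: Callable[[Iterator[int]], None]) -> None:
    """Model of foreachPartition: the function is called once per partition."""
    for partition in partitions:
        fn(iter(partition))


def write_row(row: int) -> None:
    conn = FakeConnection()  # setup cost paid for every element
    conn.write(row)


def write_partition(rows: Iterator[int]) -> None:
    conn = FakeConnection()  # setup cost paid once per partition
    for row in rows:
        conn.write(row)


partitions = [[1, 2, 3], [4, 5], [6]]

FakeConnection.opened = 0
foreach(partitions, write_row)
per_row_connections = FakeConnection.opened

FakeConnection.opened = 0
foreach_partition(partitions, write_partition)
per_partition_connections = FakeConnection.opened
```

With six elements spread over three partitions, the per-element version opens six connections and the per-partition version opens three. Neither helper returns anything, which matches the point that both are actions.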


pyspark.pandas in Apache Spark 3.2

Published 2021-10-06 by Kevin Feasel

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.
A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).
pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:

Click through to see more details on the change.


SQL Server Express Memory Limitations

Published 2021-10-06 by Kevin Feasel

Steve Stedman notes that the memory limitations on SQL Server Express Edition are not quite as stringent as you may first believe:

Looking at the memory limits and other limits on the SQL Server versions over time, we have seen things increase, but one limit that is still very low is the memory limit for SQL Express. Specifically the maximum memory for buffer pool per instance of SQL Server Database Engine for SQL 2019. The limit there is 1410 MB.
At first glance you may think that this limit is the total amount of memory that SQL Server will use, but let me show you a couple of screen shots for Database Health Monitor showing the memory utilization on two different SQL 2019 Express servers.

Read on to see what, exactly, the memory limitation is. Also, there are separate limits for things like In-Memory OLTP table sizes.



Author: Kevin Feasel