To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one row at a time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala and then invoke them from Python.
Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds: the ability to define low-overhead, high-performance UDFs entirely in Python.
This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.
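As a rough sketch of the distinction, here are the two UDF shapes side by side. This uses plain pandas so it runs standalone; the Spark registration calls are shown only as comments, since they need a live SparkSession, and the variable names there are my own.

```python
import pandas as pd

# Row-at-a-time UDF: in Spark, this is invoked once per row, paying
# serialization and invocation overhead for every single value.
def plus_one_scalar(x):
    return x + 1

# Vectorized (Pandas) UDF: invoked once per batch, receiving and returning
# a pandas Series; Arrow moves the whole batch between the JVM and Python.
def plus_one_vectorized(s: pd.Series) -> pd.Series:
    return s + 1

# In PySpark you would register these roughly as follows (sketch only):
#   from pyspark.sql.functions import udf, pandas_udf
#   slow = udf(plus_one_scalar, "long")
#   fast = pandas_udf(plus_one_vectorized, "long")
```

Both functions compute the same result; the win is that the vectorized form amortizes the Python-call and serialization cost over an entire batch.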
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically.
You just have to do two things: first, create your warehouse on the local filesystem, and second, set the default filesystem name to local by putting these properties inside your hive-site.xml.
This is a fairly short post; click through to see the changes you’d make to hive-site.xml.
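For flavor, the change is of this general shape. The property names are the standard Hive/Hadoop ones, but the paths are placeholder assumptions of mine, so treat this as a sketch of the kind of edit the post describes rather than a copy-paste configuration.

```xml
<!-- Sketch only: paths are placeholders; adjust for your system. -->
<property>
  <name>hive.exec.mode.local.auto</name>
  <value>true</value> <!-- let Hive run eligible jobs in local mode -->
</property>
<property>
  <name>fs.default.name</name>
  <value>file:///tmp</value> <!-- default filesystem on the local machine -->
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>file:///tmp/warehouse</value> <!-- warehouse on the local filesystem -->
</property>
```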
Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.
The series starts off with a quick description of some preprocessing steps and then moves on to building an LDA model to extract key terms from articles.
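A few of those normalization steps can be illustrated in a handful of lines. The stop word and contraction tables below are tiny, made-up subsets for demonstration, not what the series actually uses.

```python
import re

# Illustrative subsets only; real pipelines use much larger lists.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}
CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                       # case folding
    for short, full in CONTRACTIONS.items():  # contraction simplification
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)      # simple tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal
```

Stemming and lemmatization are usually handled by a library such as NLTK or spaCy rather than by hand, so they are omitted here.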
After the installation completed, the DBA enabled SQL Server 2017 Machine Learning Services, but as soon as I tried to run a simple R script, it stalled for about 30 seconds and then I got an error:
Msg 39012, Level 16, State 1, Line 0
Unable to communicate with the runtime for 'R' script. Please check the requirements of 'R' runtime.
STDERR message(s) from external script:
Error: could not find function "rxSqlUpdateLibPaths"
Click through for the solution.
But what happens if we try to use STRING_SPLIT with a multi-character separator, like this?
SELECT * FROM STRING_SPLIT(@GodawfulString, ',,')
Msg 214, Level 16, State 11, Line 67
Procedure expects parameter 'separator' of type 'nchar(1)/nvarchar(1)'.
Ah. That’s a no, then.
Click through for the solution.
The following query will return the hex and integer value for each row in the table (NOTE: Query Store must be enabled for the database to return values):

USE YourQueryStoreDatabase;
SELECT set_options, CONVERT(INT, set_options) AS IntSetOptions
FROM sys.query_context_settings;
The set_options value represents a bit mask, with each binary digit representing a specific set option. The full list of values can be found here. I created stored procedure ReturnSetOptions to take the IntSetOptions from the query above and return the set options represented. The code for the procedure is listed below.
Read on to get a script which breaks the bitmask field into human-readable results.
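The core of that decoding is just masking each documented bit. As a sketch, here is the same idea in Python; the bit-to-name entries below are a small subset I believe matches the documented set_options values, but verify them against the official list before relying on them.

```python
# Illustrative subset of the documented set_options bit values; the full
# mapping lives in the SQL Server documentation, and these entries are
# assumptions for the sketch, not an authoritative list.
SET_OPTION_BITS = {
    1: "ANSI_PADDING",
    16: "ANSI_WARNINGS",
    32: "ANSI_NULLS",
    64: "QUOTED_IDENTIFIER",
    4096: "ARITHABORT",
}

def decode_set_options(value: int) -> list[str]:
    """Return the names of the options whose bits are set in value."""
    return [name for bit, name in SET_OPTION_BITS.items() if value & bit]
```

For example, a set_options value of 96 has the 32 and 64 bits set, so it decodes to ANSI_NULLS and QUOTED_IDENTIFIER.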
Set-Content is one of those core PowerShell cmdlets that I can't do without. I still remember using VBScript before we could use PowerShell to write to a file. I remember always trying to recall what kind of object I needed to use and the method name. Was it FileSystemObject, FileObject, or what? It was a pain! Also, even when I did recall that the method name was CreateTextFile, I'd always forget to add True as the second argument.
Here’s an example of the monstrosity I’m talking about.
Click through to see how easy writing is with PowerShell.
This really didn’t make any sense. However, in one of the Discover Begin/End events, the same number appeared again: 8192 (this time explicitly marked as a locale identifier). Hmmm, I had problems with weird locales before. I dug into my system, and yes, the English (Belgium) locale was lingering around. I removed it from my system and, lo and behold, I could log into SSAS with SSMS again. Moral of the story: if you get weird errors, make sure you have a normal locale on your machine, because apparently the SQL Server client tools go bonkers otherwise.
Worth reading the whole thing. And also maybe just using en-US for all locales; at least that one gets tested…
Index usage and tuning metrics became available in SQL Server 2005 with Dynamic Management Views and Functions, which will be discussed later. However, the meanings and significance of index DMV/DMF metrics are still not well understood by many, despite only minor additions over the years. Specifically, the following list contains a synopsis of the topics that the author has observed to be the most salient index-related issues:
- Queries that need an index to function efficiently
- Which indices, if any, are used by a query (and how they are used, e.g., randomly or sequentially)
- Tables (and their indices) that merit evaluation and potential adjustment
- Indices that duplicate functionality of others
- Whether a new index is truly needed and what improvement can be anticipated
- Whether an index can be deleted without harming performance
- Whether an index should be adjusted or added despite the query not producing any missing index warnings
Understanding why having too many indices results in:
- Inserts and updates taking too long and/or creating blocking
- Suboptimal query plans being generated because there are too many index choices
Knowing the pros and cons of the Database Engine Tuning Advisor (DTA)
Jeff starts with the basics of indexes, followed by some general strategy. This promises to be the first of several posts on the topic.