Press "Enter" to skip to content


Implementing K Nearest Neighbors In Python

Atul Harsha gives us a demo on k nearest neighbors in Python:

In order to make any predictions, you have to calculate the distance between the new point and the existing points, as you will need the k closest points.

In this case, for calculating the distance, we will use the Euclidean distance. This is defined as the square root of the sum of the squared differences between the two arrays of numbers.

Specifically, we need only the first 4 attributes (features) for the distance calculation, as the last attribute is a class label. So one approach is to limit the Euclidean distance to a fixed length, thereby ignoring the final dimension.

Check it out.
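To make that distance step concrete, here is a minimal Python sketch (mine, not Atul’s exact code; the function name and iris-style rows are illustrative) of the Euclidean distance over the first four attributes:

import math

def euclidean_distance(row1, row2, length=4):
    # Distance over the first `length` attributes only,
    # skipping the trailing class label.
    return math.sqrt(sum((row1[i] - row2[i]) ** 2 for i in range(length)))

# Two iris-style rows: four numeric features plus a class label.
a = [5.1, 3.5, 1.4, 0.2, "Iris-setosa"]
b = [6.2, 2.9, 4.3, 1.3, "Iris-versicolor"]
print(euclidean_distance(a, b))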


The Basics Of RDDs In Apache Spark

Anmol Sarna walks us through some of the basics of Resilient Distributed Datasets in Apache Spark:

  • Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute missing or damaged partitions caused by node failures.

  • Distributed with data residing on multiple nodes in a cluster.

  • Dataset is a collection of partitioned data.

Now that we know what RDD stands for, let’s try to understand it.

It’s a nice intro to the topic.  And even though there are other data models which sit on top of RDDs to make life easier for developers, it’s still important to understand the core model in Spark.
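To see those three properties in action, here is a minimal PySpark sketch (a hypothetical local-mode example, not from Anmol’s post): transformations only record lineage, and an action triggers the distributed computation.

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-basics")

# Dataset: a collection of data split into 4 partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transformations build the lineage graph; nothing executes yet.
evens = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# An action runs the job; a lost partition would be recomputed
# from this lineage rather than restored from a replica.
print(evens.collect())             # [4, 16, 36, 64, 100]
print(numbers.getNumPartitions())  # 4

sc.stop()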


What Those Power BI Processes In Task Manager Mean

Kellyn Pot’vin-Gorman explains what you see in the Task Manager when looking at an instance of Power BI Desktop:

We can see that there are numerous threads, with a few taking considerably more memory than others.  The CefSharp.BrowserSubprocess entries can be a bit misleading: it’s Power BI using Chromium to render the visuals in the current Power BI Desktop session.  Chromium (CefSharp.BrowserSubprocess) subprocesses will always come in pairs, one for rendering and one for messaging.

In the Task Manager Details tab, we can see each of the PIDs that correspond to the process IDs listed in the logs.  By updating our viewable columns (right click, choose “Threads”, and click OK), you can now view how many threads are associated with a given PID.

Read on for more.
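Task Manager works fine for a quick look, but the same view is scriptable.  Here is a rough Python sketch using psutil (the process names are what typically shows up for a Power BI Desktop session; adjust them to match what your machine actually runs):

import psutil

# Process names as they typically appear for Power BI Desktop;
# change these to whatever your Task Manager actually shows.
TARGETS = {"PBIDesktop.exe", "CefSharp.BrowserSubprocess.exe", "msmdsrv.exe"}

for proc in psutil.process_iter(["pid", "name", "num_threads", "memory_info"]):
    if proc.info["name"] in TARGETS:
        rss_mb = proc.info["memory_info"].rss / 1024 ** 2
        print(f"{proc.info['pid']:>6}  {proc.info['name']:<34} "
              f"threads={proc.info['num_threads']:<4} rss={rss_mb:,.0f} MB")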


Breaking A Database File Into Multiple Files

Lori Brown shows us how to take a database with one database file and add new database files to it:

I occasionally come across some pretty good-sized databases that are set up with a single data file.  We have recently been working with a client to break up their single data file into multiple data files so that we can spread them over several different LUNs and so that they can take advantage of the improved performance of using the files in parallel.  The concept is much like setting up tempdb with one file per core (up to 8 files).

Since most people don’t think about using multiple files for databases until they have grown large enough to be a problem, I think that most don’t realize that breaking up a database can be done at any time; you just need to have enough space for the new files.  Here is a bit of a demo on how to do this.
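Lori’s post has the actual demo; as a rough sketch of the core command (run here from Python via pyodbc, with a hypothetical server, database, and file path), adding a data file looks like this:

import pyodbc

# Hypothetical connection details; ALTER DATABASE cannot run inside
# a multi-statement transaction, so connect with autocommit on.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)

# Add a second data file, placed on a different LUN.
conn.execute("""
    ALTER DATABASE SalesDB
    ADD FILE (
        NAME = SalesDB_Data2,
        FILENAME = 'E:\\Data\\SalesDB_Data2.ndf',
        SIZE = 10GB,
        FILEGROWTH = 1GB
    ) TO FILEGROUP [PRIMARY];
""")
conn.close()

Note that adding files doesn’t move any existing data by itself; redistributing pages across the new files is the more delicate part of the exercise, which is where Lori’s demo comes in.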

Do read Lori’s warning at the end, however, should you decide to do this in production.


Understanding sp_reset_connection

Greg Low explains what sp_reset_connection does and why it’s often a good thing:

Anyone who’s ever traced activity against a SQL Server will have no doubt seen a large number of commands where the procedure sp_reset_connection has been executed. Yet, this command won’t appear anywhere in the source code of the applications that are running.

As an example of why this occurs, one of the most common data access technologies used to connect applications to SQL Server is ADO.NET. It has a SqlConnection object that represents a connection that can be opened to a SQL Server instance. In the design of the SqlConnection class, its architects were grappling with two big issues:

  • They knew that opening and closing connections to SQL Server was a relatively expensive process.
  • They also knew that on a busy website, they didn’t want to spend the resources (and might not even have had them) to open a connection for each concurrent session on the website.

So they decided to make the connections to SQL Server able to be shared.

Read the whole thing.
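The same trade-off shows up outside ADO.NET.  As a loose illustration, this Python sketch (pyodbc with a hypothetical connection string) leans on ODBC connection pooling: closing a connection returns it to a pool, the next open reuses it, and the server runs sp_reset_connection to clear the previous session’s state.

import pyodbc

pyodbc.pooling = True  # the default; must be set before the first connect

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;")

# First open pays the full cost of a physical connection.
conn = pyodbc.connect(CONN_STR)
conn.execute("SET NOCOUNT ON;")  # some session-level state
conn.close()  # returned to the pool, not actually closed

# Second open reuses the pooled physical connection; a trace would
# show sp_reset_connection discarding the old session's state.
conn = pyodbc.connect(CONN_STR)
print(conn.execute("SELECT @@SPID;").fetchone()[0])  # often the same SPID
conn.close()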


When Using DBCC DROPCLEANBUFFERS

Dan Guzman shares words of wisdom with using DBCC DROPCLEANBUFFERS for testing query performance in SQL Server:

One can make the argument that DBCC DROPCLEANBUFFERS might not be particularly valuable for testing. First, the storage engine in SQL Server Enterprise Edition (or Developer Edition, which is often used when testing) behaves differently with a cold cache versus a warm one. With a warm cache, a page not already in cache (e.g. index seek by primary key) will be fetched from disk using a single 8K page IO request as one expects. However, when the cache isn’t fully warmed up (Buffer Manager’s Target Pages not yet met), the entire 64K extent (8 contiguous 8K pages) is read for the single page request regardless of whether the adjacent pages are actually needed by the query. This has the benefit of warming the cache much more quickly than would otherwise occur, but given that the normal steady state of a production SQL Server is a warm cache, testing with a cold cache isn’t a fair comparison of different plans. More data than normal will be transferred from storage so timings may not be indicative of actual performance.

I don’t think I agree 100% with that argument, but I am sympathetic to it.  Still, Dan has great advice in this post.
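If you do reach for it on a test box, the usual pattern is a CHECKPOINT first (so dirty pages get flushed and can be evicted), then DBCC DROPCLEANBUFFERS, then the timing run.  A rough Python harness (pyodbc; server and query are placeholders, and DBCC DROPCLEANBUFFERS requires sysadmin) might look like:

import time
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=testserver;DATABASE=testdb;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

QUERY = "SELECT COUNT(*) FROM dbo.SomeBigTable;"  # placeholder query

def run_timed(label):
    start = time.perf_counter()
    cursor.execute(QUERY).fetchall()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Flush dirty pages, then drop clean pages from the buffer pool.
cursor.execute("CHECKPOINT;")
cursor.execute("DBCC DROPCLEANBUFFERS;")
run_timed("cold cache")

# Second run hits the warm cache, which Dan argues is the fairer test.
run_timed("warm cache")

conn.close()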


Layout Images In Power BI

Meagan Longoria has some tips for using layout images in Power BI:

Using layout images in Power BI has become a popular design trend. When I say layout images, I’m referring to background images with shapes around areas where visuals are placed. This is different from the new wallpaper feature that became available in the July release, which can be used to format the grey area outside your report page and extend the main color of background images.

Layout images can help with spacing and alignment within a report and can help create consistency across reports. They can also help create affordances, using consistent layout and design to make it obvious how users should interact with our reports.

I use layout images in some of my reports, but I don’t think they are necessary on every report. There are a couple of things to consider when using layout images.

Read on for an example of a good layout image versus a bad one, as well as tips and tricks on how to create your own.


Bug In July Windows Updates Causing “TCP port is already in use” Errors

Jordon Riel warns us about a recent Windows update which can cause SQL Server’s database engine to fail to start up:

We have recently become aware of a regression in one of the TCP/IP functions that manages the TCP port pool which was introduced in the July 10, 2018 Windows updates for Windows 7/Server 2008 R2 and Windows 8.1/Server 2012 R2.

This regression may cause the restart of the SQL Server service to fail with the error, “TCP port is already in use”. We have also observed this issue preventing Availability Group listeners from coming online during both planned and unexpected failover events. When this occurs, you may observe errors similar to those below in the SQL ERRORLOGs:

Error: 26023, Severity: 16, State: 1.
Server TCP provider failed to listen on [ <IP ADDRESS> <ipv4> <PORT>]. Tcp port is already in use.
Error: 17182, Severity: 16, State: 1.
TDSSNIClient initialization failed with error 0x2740, status code 0xa. Reason: Unable to initialize the TCP/IP listener. Only one usage of each socket address (protocol/network address/port) is normally permitted. 
Error: 17182, Severity: 16, State: 1.
TDSSNIClient initialization failed with error 0x2740, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. Only one usage of each socket address (protocol/network address/port) is normally permitted. 
Error: 17826, Severity: 18, State: 3.
Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
Error: 17120, Severity: 16, State: 1.
SQL Server could not spawn FRunCommunicationsManager thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

Read on for the solution.


Using rquery On Databricks

Nina Zumel and John Mount talk about rquery, a relational data transformation engine for R which runs on Spark:

rquery is based on an appreciation of Codd’s relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Previous hierarchical databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).

Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:

  • select (row selection)
  • project (column selection/aggregation)
  • Cartesian product (table joins, row binding, and set difference)
  • extend (derived columns; the keyword comes from Tutorial D).

One of the earliest and still most common implementations of Codd’s algebra is SQL. Formally, Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.

rquery is another realization of the Codd algebra that implements the above operators, some higher-order operators, and emphasizes a right-to-left pipe notation. This gives the Spark user an additional way to work effectively.

They include a fairly lengthy example and give a great introduction to the tool.  It’s now officially on my list of stuff to try out.
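rquery itself is R, but the four operators map onto most relational tools.  As a loose illustration (this is PySpark DataFrame code, not rquery), here is each operator on a toy table:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("codd-ops").getOrCreate()

d = spark.createDataFrame(
    [(1, "a", 10), (2, "b", 20), (3, "a", 30)],
    ["id", "grp", "val"],
)

rows = d.filter(F.col("val") > 10)                      # select: row selection
totals = d.groupBy("grp").agg(F.sum("val").alias("t"))  # project: column selection/aggregation
joined = d.join(totals, on="grp")                       # Cartesian product family: a join
extended = d.withColumn("val_sq", F.col("val") ** 2)    # extend: derived column

extended.show()
spark.stop()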


Explaining Text Classification Models With LIME

Shirin Glander shows us how to use LIME to explain which words help us classify whether a user liked a particular item:

Okay, not a perfect score but good enough for me – right now, I’m more interested in the explanations of the model’s predictions. For this, we need to run the lime() function and give it

  • the text input that was used to construct the model
  • the trained model
  • the preprocessing function

explainer <- lime(clothing_reviews_train$text, 
                  xgb_model, 
                  preprocess = get_matrix)

With this, we could right away call the interactive explainer Shiny app, where we can type any text we want into the field on the left and see the explanation on the right: words underlined in green support the classification, while red words contradict it.

I hadn’t used LIME for this before, and it looks very interesting.  H/T R-Bloggers
