Kevin Feasel – Page 1028

dbatools 1.0 Available

Published 2019-06-20 by Kevin Feasel

Chrissy LeMaire and a cast of thousands have officially released dbatools 1.0:

We are so super excited to announce that after 5 long years, dbatools 1.0 is publicly available!
Our team had some lofty goals and met a vast majority of them . In the end, my personal goal for dbatools 1.0 was to have a tool that is not only useful and fun to use but trusted and stable as well. Mission accomplished: over the years, hundreds of thousands of people have used dbatools and dbatools is even recommended by Microsoft.

Go forth and update.

Comments closed

Docker Desktop Resource Limits

Published 2019-06-20 by Kevin Feasel

Andrew Pruski lays out the default resource limits for Linux and Windows containers using Docker Desktop on Windows:

Docker Desktop is a great product imho. The ability to run Windows and Linux containers locally is great for development and has allowed me to really dig into SQL Server on Linux. I also love the fact that I no longer need to install SQL 2016/2017, I can run it in Windows containers.
However there are some differences between how Windows and Linux containers run in Docker Desktop. One of those differences being the default resource limits that are set.

Read on to see how they differ and how to deal with Windows container resources.

Comments closed

Patterns for ML Models in Production

Published 2019-06-19 by Kevin Feasel

Jeff Fletcher shows four patterns for productionalizing Machine Learning models, as well as some things to take care of once you’re in production:

Operational Databases
This option is sometimes considered to be real-time as the information is provided “as its needed,” but it is still a batch method. Using our telco example, a batch process can be run at night that will make a prediction for each customer, and an operational database is updated with the most recent prediction. The call center agent software can then fetch this prediction for the customer when they call in, and the agent can take action accordingly.

Read on for more.

Comments closed

Sales Predictions with Pandas

Published 2019-06-19 by Kevin Feasel

Megan Quinn shows how you can use Pandas and linear regression to predict sales figures:

Pandas is an open-source Python package that provides users with high-performing and flexible data structures. These structures are designed to make analyzing relational or labeled data both easy and intuitive. Pandas is one of the most popular and quintessential tools leveraged by data scientists when developing a machine learning model. The most crucial step in the machine learning process is not simply fitting a model to a given data set. Most of the model development process takes place in the pre-processing and data exploration phase. An accurate model requires good predictors and, in order to acquire them, the user must understand the raw data. Through Pandas’ numerous data wrangling and analysis tools, this important step can easily be achieved. The goal of this blog is to highlight some of the central and most commonly used tools in Pandas while illustrating their significance in model development. The data set used for this demo consists of a supermarket chain’s sales across multiple stores in a variety of cities. The sales data is broken down by items within the stores. The goal is to predict a certain item’s sale.

Click through for an example of the process, including data cleansing and feature extraction, data analysis, and modeling.

Comments closed

The Banality of Containers

Published 2019-06-19 by Kevin Feasel

Grant Fritchey points out that containers hosting SQL Server are, in and of themselves, banal:

I’m only half joking when I say containers are boring and dull. They’re not. The technology is amazing. However, that technology doesn’t fundamentally change what you’re dealing with. It’s SQL Server. How do you capture detailed performance metrics in a container? Extended Events. How do you capture aggregated performance metrics and query plans in a container? Query Store. What’s the backup syntax for a database in a container? BACKUP DATABASE. We can keep going on this, but I won’t.

To a great extent, this is the same as SQL Server on Linux: once you have it installed, it works just like the Windows version (well, save for the things which aren’t there yet). All of the magical parts are in getting there.

Comments closed

Embedding Refresh Times in Power BI Reports

Published 2019-06-19 by Kevin Feasel

Marc Lelijveld shows how you can embed Power BI Dataflow refresh times in your Power BI reports:

But maybe you want to visualize this as part of your report as well. With a really simple piece of Power Query code you can easily generate a date/time at the moment that your dataset is processed. Kasper de Jonge wrote a blog post on that, so I’m not going to elaborate on that. However, when we add this as a separate entity to each dataflow, it results in a last successful refresh date/time for each dataflows.
Since each dataflow will be refreshed on it’s own, likewise as a dataset, the entity with your last date/time will always the last date/time for the whole dataflow, no matter how many entities are in there.

Read on to see how to combine and display these refresh times.

Comments closed

Trailing Spaces and String Comparisons

Published 2019-06-19 by Kevin Feasel

Bert Wagner shows how SQL Server handles trailing spaces when comparing two strings:

The LEN() function shows the number of characters in our string, while the DATALENGTH() function shows us the number of bytes used by that string.
In this case, DATALENGTH is equal to 10. This result is due to the padded spaces occurring after the character “a” in order to fill the defined CHAR length of 10. We can confirm this by converting the value to hexadecimal. We see the value 61 (“a” in hex) followed by nine “20” values (spaces).

Click through to see what happens and why it works the way it does.

Comments closed

Breaking Down the MAXDOP Guidance Change

Published 2019-06-19 by Kevin Feasel

Joe Obbish digs into Microsoft’s new guidance for maximum degree of parallelism:

I’ve heard some folks claim that keeping all parallel workers on a single hard NUMA nodes can be important for query performance. I’ve even seen some queries experience reduced performance when thread 0 is on a different hard NUMA node than parallel worker threads. I haven’t heard of anything about the importance of keeping all of a query’s worker threads on a single soft-NUMA node. It doesn’t really make sense to say that query performance will be improved if all worker threads are on the same soft-NUMA node. Soft-NUMA is a configuration setting. Suppose I have a 24 core hard NUMA node and my goal is to get all of a parallel query’s worker threads on a single soft-NUMA node. To accomplish that goal the best strategy is to disable auto soft-NUMA because that will give me a NUMA node size of 24 as opposed to 8. So disabling auto soft-NUMA will increase query performance?

Joe takes a careful look at the documentation and brings up some really good questions.

Comments closed

When Not to Use Spark

Published 2019-06-17 by Kevin Feasel

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark:

There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and data science to a large dataset, connecting BI/Visualization tools, etc.
But its no panacea, right?
Let’s consider the cases where using Spark would be no less than a nightmare.

No tool is perfect at everything. Click through for a few use cases where the Spark experience degrades quickly.

Comments closed

Linear Regression Assumptions

Published 2019-06-17 by Kevin Feasel

Stephanie Glen has a chart which explains the four key assumptions behind when Ordinary Least Squares is the Best Linear Unbiased Estimator:

If any of the main assumptions of linear regression are violated, any results or forecasts that you glean from your data will be extremely biased, inefficient or misleading. Navigating all of the different assumptions and recommendations to identify the assumption can be overwhelming (for example, normality has more than half a dozen options for testing).

Violating one of the assumptions isn’t the end of the world, though it can make understanding the model and generating accurate predictions harder.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

Author: Kevin Feasel