Press "Enter" to skip to content

December 2017

Data Science At A Small Tech Company

Julia Silge blogs about her first year as a data scientist at Stack Overflow:

In the fall I saw this post by Shanif Dhanani about being a data scientist at a small company, and it is entirely on point, the whole way through. So much of that post resonates with my own experience of being a data scientist at a small company. And yes, I do keep saying “small company”; Stack Overflow is likely smaller than you think it is, at 250 or so employees in total. I am the second data scientist here, joining David Robinson, who was the first data science hire, on a data team that is five in total.

I cannot emphasize enough how much of my day-to-day work is communicating, collaborating with others, and answering not-entirely-specified questions. Data science is highly technical work, but the value of my technical work would be much lower if I could not communicate what it means in clear and compelling ways. My definition of communication here is pretty broad, and includes speaking, writing, and data visualization.

If you’re interested in a career in data science, this is food for thought.


Backups And Distributed File Shares

Wayne Sheffield ran into a new oddity recently:

I was working on a client’s site today, setting up database backup routines, part of which is to perform a database backup and verify that everything went okay. I had Windows Explorer open to the location that the backup was going to. When the backup finished, I navigated over to Windows Explorer… and I have a missing database backup. There wasn’t a file in the directory for the backup that I had just performed.

After double and triple checking that I was looking at the same path that I had backed up the database to, I went in search of the network sysadmin to help me figure it out.

Read on for the solution.


Using JSON_MODIFY To Modify Existing JSON

Jovan Popovic shows off the JSON_MODIFY function in SQL Server:

Recently I found this question on Stack Overflow. The problem was in appending a new JSON object to the existing JSON array:

UPDATE TheTable
SET TheJSON = JSON_MODIFY(TheJSON, 'append $', N'{"id": 3, "name": "Three"}')
WHERE Condition = 1;

The JSON_MODIFY function should take the array value from the TheJSON column (the first argument), append the third argument to it, and write the updated array back to the TheJSON column.

However, the unexpected result in this case is that JSON_MODIFY didn’t append the JSON object {"id": 3, "name": "Three"} to the array. Instead, JSON_MODIFY appended a new JSON string literal "{\"id\": 3, \"name\": \"Three\"}" to the end of the array.

This might be a surprising result if you don’t know how the JSON_MODIFY function works.

Read on to see how JSON_MODIFY works and why this doesn’t quite do what the poster thought.
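
As a quick sketch of the documented workaround: wrapping the new value in JSON_QUERY marks it as valid JSON, so JSON_MODIFY appends it as an object rather than escaping it as a string:

UPDATE TheTable
-- JSON_QUERY with no path returns its argument as JSON, not as a string to be escaped
SET TheJSON = JSON_MODIFY(TheJSON, 'append $', JSON_QUERY(N'{"id": 3, "name": "Three"}'))
WHERE Condition = 1;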


ChartAccent Power BI Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the ChartAccent LineChart Custom Visual. This visual is a custom line chart that allows you to annotate individual data points, data series and ranges.

This visual runs the risk of getting very “noisy,” but that can be fine if you’re building a presentation and want a somewhat complicated, annotated visual.


Streaming Performance Counters Into Power BI

Chris Koester shows how to load Performance Counters (i.e., what Perfmon displays) into Power BI in near real time:

In the previous post I showed how you can Push Data into Power BI Streaming Datasets with C#. That example used dummy data. In this post I’ll show how to push performance counter data into a Power BI Streaming Dataset as a real world example. This scenario allows for monitoring a computer or application in near real time in the browser.

I won’t go through the steps of creating a Power BI Streaming Dataset. You can reference my previous post if you need instructions. I will note that the value names that you choose in the Streaming Dataset must match the C# property names for the script to work. This is noted in the code comments as well.

Check it out.


Associative Purchasing Behavior With DAX

Matt Allington solves a particular associative purchasing behavior problem:

The requirement is to be able to select a single customer from a list of customers and see what products they have purchased.  Then the next thing is to be able to compare sales of this same group of products across all customers while ignoring sales of all other products.  Stated another way, the report should show sales from all customers, but only for the products purchased by the originally selected customer.  Now there is a “cheat’s” solution to this problem using visual-level filters – more on that later.  The purpose of this article is to show a DAX-only solution.

Read Matt’s answer, but also check out the comments for an even better solution if you’re using a proper star schema.


Using Relog To Load Perfmon Counters Into SQL Server

Raul Gonzalez shows how to use the Relog executable to bring Perfmon counters into SQL Server:

Relog is an application, and as such it might be able to connect to a SQL Server. In this case, the way to connect is via an ODBC driver, hence we have to create a new DSN on the computer where we want to run Relog.exe, which does not necessarily have to be the same server where SQL is installed.

If we click Start and start typing ODBC, it’ll show the two versions available in modern OS’s, 32-bit and 64-bit; we need to choose the 64-bit version to make it work.

It seems that, relog.exe being a bit old, the only driver that actually works is the oldest, but whatever, it works! So we choose the SQL Server 10.00.xxx one.

He also provides a bonus script at the end, which builds a dynamic pivot table of your counter information.
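
If you want a feel for the end state, here’s a minimal sketch, assuming relog’s default table names (CounterData and CounterDetails); the log file path and DSN name are placeholders:

-- From a command prompt, load the Perfmon binary log through the ODBC DSN:
-- relog C:\PerfLogs\counters.blg -f SQL -o SQL:MyDSN!FirstLog

-- Then join the tables relog creates to see counter values over time:
SELECT d.ObjectName, d.CounterName, d.InstanceName, c.CounterDateTime, c.CounterValue
FROM dbo.CounterData AS c
JOIN dbo.CounterDetails AS d
ON d.CounterID = c.CounterID
ORDER BY c.CounterDateTime;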


Automatic Tuning In SQL Server

Grant Fritchey is a fan of SQL Server 2017’s automatic tuning feature:

The core of automatic tuning at this point in time (because I’m sure it’s going to evolve) is the ability of the query engine to spot when a query has generated a new plan and that new plan is causing performance to degrade. This is known as a regression in the plan. It comes from bad parameter sniffing, changes in statistics, cumulative updates, or the big notorious one, the cardinality estimator introduced in SQL Server 2014 (it’s been almost four years, I’m not calling it new any more). Spotting a plan regression prior to SQL Server 2016 and the introduction of the Query Store was a major pain. However, now it’s easy. You can spot them by reading the data collected. Further, Microsoft can spot them by reading the data collected, and it does.

If the engine sees that a plan is causing a regression (you have to have Query Store enabled for this), then it will write out a suggestion for fixing it to the new DMV, sys.dm_db_tuning_recommendations. If you further enable automatic tuning on your database, then SQL Server can automatically substitute the last good plan for you, fixing the problem. Further, SQL Server monitors this fix and if, over time, it’s clear that the forced plan is causing problems, it will unforce the plan, again automagically.

Click through for more information, including a query to read from the tuning recommendations DMV.
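
As a teaser, here’s a minimal sketch of reading the DMV, using JSON paths from the documentation (Grant’s version is more complete):

SELECT name,
    reason,
    score,
    JSON_VALUE(details, '$.implementationDetails.script') AS force_plan_script,
    JSON_VALUE(state, '$.currentValue') AS recommendation_state
FROM sys.dm_db_tuning_recommendations;

-- To let SQL Server 2017 apply (and revert) these fixes on its own:
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);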


Unit Testing Spark Streaming DStreams

Anuj Saxena gives an example of using StreamingSuiteBase to build unit tests for DStreams in Spark Streaming:

So what’s the problem? How to execute streaming logic in a test environment.

We can write integration test cases and provide the actual environment in the integration test. But for unit testing, we need a testing environment that does not depend on any external application.

Click through for the example.


The Need For Multiple Warehouse Architectures

James Serra argues in favor of a data lake approach and a traditional data warehouse:

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog post that is now a few years old, Hadoop and Data Warehouses, so this is a good time to provide an update.  I also touched on this topic in my blog posts Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, and in my presentation Big data architectures and the data lake.

Read on for James’s argument, which is good.  My argument is summed up as follows:  the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements.  The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.
