Press "Enter" to skip to content

Month: October 2018

Reading Excel Files In An Office-less World

Bill Fellows shows us how to read from an Excel file on a machine without Microsoft Office installed:

A common problem working with Excel data is Excel itself. Working with it programmatically requires an installation of Office, and the resulting license cost, and once everything is set, you’re still working with COM objects, which present their own set of challenges. If only there was a better way.

Enter, the better way – EPPlus. This is an open source library that wraps the OpenXml library which allows you to simply reference a DLL. No more installation hassles, no more licensing (LGPL) expense, just a simple reference you can package with your solutions.

Let’s look at an example.

Read on for the example.  A couple alternatives I like are readxl and XLConnect in R.
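
If you would rather stay inside SQL Server than reference a .NET library, a different route (not the one Bill covers) is the ACE OLE DB provider, which ships as the standalone Access Database Engine redistributable rather than as part of Office. A minimal sketch, assuming the provider is installed on the server and ad hoc distributed queries are enabled; the file path and sheet name are hypothetical:

-- Enable ad hoc distributed queries (one-time server configuration).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;

-- Read Sheet1 of a workbook through the ACE provider; HDR=YES treats the
-- first row as column headers.
SELECT *
FROM OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0;Database=C:\Data\Sales.xlsx;HDR=YES',
    'SELECT * FROM [Sheet1$]'
) AS xl;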

Hortonworks And Cloudera To Merge

Ashley Stirrup analyzes the merger of the two largest Hadoop vendors:

Overall, this is great news for customers, the Hadoop ecosystem and the future of the market. Both companies’ customers can now sleep at night knowing that the pace of innovation from Cloudera 2.0 will continue and accelerate. Combining the Cloudera and Hortonworks technologies means that instead of having to pick one stack or the other, customers can now have the best of both worlds. The statement from their press release, “From the Edge to AI,” really sums up how the investments Hortonworks made in IoT complement Cloudera’s investments in machine learning. From an ecosystem and innovation perspective, we’ll see fewer competing Apache projects with much stronger investments. This can only mean better experiences for any user of big data open source technologies.

At the same time, it’s no secret how much our world is changing with innovation coming in so many shapes and sizes.  This is the world that Cloudera 2.0 must navigate.  Today, winning in the cloud is quite simply a matter of survival.  That is just as true for the new Cloudera as it is for every single company in every industry in the world.  The difference is that Cloudera will be competing with a wide range of cloud-native companies both big and small that are experiencing explosive growth.  Carving out their place in this emerging world will be critical.

The company has so many of the right pieces, including connectivity, computing, and machine learning. Their challenge will be making all of it simple to adopt in the cloud while continuing to generate business outcomes. Today we are seeing strong growth from cloud data warehouses like Amazon Redshift, Snowflake, Azure SQL Data Warehouse, and Google BigQuery. Apache Spark and service players like Databricks and Qubole are also seeing strong growth. Cloudera now has decisions to make on how they approach this ecosystem, who they choose to compete with, and who they choose to complement.

Rob Bearden on the Hortonworks side:

Cloudera has a like-minded approach to next generation data management and analytics solutions for hybrid deployments. Like Hortonworks, Cloudera believes data can drive high velocity business model transformations, and has innovated in ways that benefit the market and create new revenue opportunities. We are confident that our combined company will be ideally positioned to redefine the future of data as we extend our leadership and expand our offerings.

This transformational event will create benefits and growth opportunities for our stakeholders. Together with Cloudera, we will accelerate market development, fuel innovation and produce substantial benefits for our customers, partners, employees and the community.

By merging Cloudera’s investments in data warehousing and machine learning with Hortonworks’ investments in end-to-end data management, we are generating a winning combination, which will establish the standard for hybrid cloud data management.

Mike Olson on the Cloudera side:

We’re announcing the combination today, but we don’t expect the deal to close for several months. We’ll undergo the normal regulatory review that any merger of scale involving public companies gets, and the shareholders from both companies will have to meet and approve the deal.

Between now and the close date, we remain independent companies. Our customers are running our respective products. Our sales teams are working separately from each other with current and new customers to win more business and to make those customers successful. We’ll both continue to do that.

Customers who are running CDH, HDP and HDF are getting a new promise. Those product lines will each be supported and maintained for at least three years from the date our merger closes. Any customer who chooses either can be sure of a long-term future for the platform selected.

Guy Shilo isn’t quite as pleased:

On the business side, the new company will be a de facto monopoly, as those two are the largest Hadoop vendors in terms of market share. Less competition often leads to a lack of incentive to innovate and to rising prices. Let’s hope the joint company will not go this way and will instead leverage its funds and power to improve its products and services.

On the technological side, it will be interesting to see which way CDH and HDP will go. Will they keep both products alive? Will they continue with only one? Which of them will it be? Will they take the Hortonworks approach that embraces the Hadoop open source community and its fast-changing versions, or Cloudera’s more conservative approach?

I am cautiously pessimistic about this.  Cloudera and Hortonworks combined for a huge amount of the Hadoop market (approximately 80% as of a couple of years ago).  There are several competitors in the broader market, but I thought that Cloudera and Hortonworks gave us two separate visions for different types of companies.

Analyzing Update Dates For R Packages

Tomaz Kastrun takes a look at CRAN package update dates:

So more updates come in the autumn months. But the result of the correlation:

cor(dd_ym2010)[2,3]

is still just 0.155, making it hard to draw any concrete conclusions. Adding year 2018 would skew the picture and add several outliers, since 2018 is still in progress (as of the writing of this blog post).

Read on for a descriptive analysis of this data set.

Missing Backup Directory When Trying To Upgrade SQL Server

Lori Brown walks us through the solution to an error she experienced:

I was recently performing an in-place upgrade of SQL 2008 R2 to SQL 2014 on one of my client’s servers.  I have done a ton of successful and uneventful in-place upgrades and was surprised when the upgrade failed with the error message:  “Failed to create a new folder ‘X:\SQLBackups’. The specified path is invalid (for example, it is on an unmapped drive).”  This client had over the years changed from using a local drive for all backups to having backups sent to a network share.  So, the X drive really was no longer in existence.

Read on for the solution.
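
A quick sanity check along these lines before an in-place upgrade (this is a general technique, not necessarily Lori’s fix) is to ask the instance where it thinks its default backup directory lives. A minimal sketch; xp_instance_regread resolves the instance-specific registry path for you:

-- Read the instance's default backup directory from the registry. If this
-- points at a drive that no longer exists, setup and maintenance tasks that
-- rely on the default path can fail.
DECLARE @BackupDirectory NVARCHAR(4000);
EXEC master.dbo.xp_instance_regread
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'BackupDirectory',
    @BackupDirectory OUTPUT;
SELECT @BackupDirectory AS DefaultBackupDirectory;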

Parameter Sniffing Issues With Table Variables

Milos Radivojevic points out the downside to table variable deferred compilation in SQL Server 2019:

Since the actual number of rows is significantly greater than in the call with the parameter ‘White’, you can see sort warnings here because this time the 1 MB memory grant was not sufficient for sorting. But the execution plan is exactly the same as for the first call.

Prior to SQL Server 2019, the execution plan for the second query in this stored procedure was always the same, regardless of the parameter used for the first invocation and thus plan generation. Since the table variable has a cardinality of 1, all estimations and the execution plan will be the same. We can say that using a table variable in this stored procedure and passing it to the second query neutralizes the parameter sniffing effect. That does not mean this is good or bad for all executions (you saw sort warnings, and they are always bad), but the plan was stable; it did not change even after a failover or clearing the cache. If you usually call this stored procedure with highly selective parameters, you can consider this plan a good plan.

In SQL Server 2019, since the table variable can have a different cardinality, this stored procedure is prone to parameter sniffing, and depending on the first execution’s parameter, you can get different execution plans:

This is a natural outcome and something we’d work with just like we would with a temp table or regular table in a stored procedure.
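
A rough sketch of the pattern Milos describes, with made-up table and column names: the table variable is loaded based on the procedure’s parameter and then joined in a second query, so under deferred compilation the row count from the first execution shapes the plan that gets cached.

-- Hypothetical schema: dbo.Products(ProductID, Color) and dbo.Sales(ProductID, Amount).
CREATE OR ALTER PROCEDURE dbo.GetSalesByColor
    @Color NVARCHAR(20)
AS
BEGIN
    DECLARE @Prod TABLE (ProductID INT PRIMARY KEY);

    INSERT INTO @Prod (ProductID)
    SELECT ProductID
    FROM dbo.Products
    WHERE Color = @Color;   -- row count now depends on @Color

    -- Under SQL Server 2019, this statement compiles with the actual row count
    -- from the first execution, so it is subject to parameter sniffing just
    -- like a temp table would be. OPTION (RECOMPILE) on this statement is one
    -- way to trade compile cost for a per-execution estimate.
    SELECT s.ProductID, SUM(s.Amount) AS TotalAmount
    FROM dbo.Sales AS s
    JOIN @Prod AS p
        ON p.ProductID = s.ProductID
    GROUP BY s.ProductID;
END;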

Creating Minesweeper In Power BI

Philip Seamark has fun with a classic Windows game reimplemented in Power BI:

The latest addition to my recent series of DAX based games is the classic Minesweeper game.  This is the game where you are presented with a 9 x 9 matrix of squares.  There are 10 hidden mines and you can either step on a square or place a flag where you think there might be a mine.  If you are lucky enough not to step on a square that contains a mine, you will get clues that help you identify where the mines are.

Click here if you would like to see the final publish to web version.

Click here if you would like to download the PBIX version to go through the code.

Still better than the Windows 10 version.

Taking Action With Wait Stats

Aaron Bertrand lays out a course of action (or inaction) when dealing with the most common wait types in SQL Server:

I started going a little further than this, mapping out some of the more common wait types, and noting some of the properties they shared. Translated into questions a tuner might have about a wait type they are experiencing:

  • Can the wait type be solved at the query level?
  • Is the core symptom of the wait likely to be affecting other queries?
  • Is it likely you will need more information outside the context of a single query and the wait types it experienced in order to “solve” the problem?

When I set out to write this post, my goal was just to group the most common wait types together, and then start jotting notes about them relating to the above questions. Jason pulled the most common ones from the library, and then I drew some chicken scratch on a whiteboard, which I later tidied up a bit. This initial research led to a talk that Jason gave on the most recent TechOutbound SQL Cruise in Alaska. I’m kind of embarrassed that he put a talk together months before I could finish this post, so let’s just get on with it. Here are the top waits we see (which largely match Paul’s survey from 2014), my answers to the above questions, and some commentary on each:

Read on for the top 10 list.
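
If you want to pull your own top waits to compare against the list, here is a minimal sketch against sys.dm_os_wait_stats; the exclusion list below is deliberately short, and production wait-stats scripts filter out many more benign wait types.

-- Top waits accumulated since the last restart (or since the stats were
-- cleared), with a handful of benign wait types filtered out. The percentage
-- is of the filtered total.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms / 1000.0 AS wait_time_s,
    signal_wait_time_ms / 1000.0 AS signal_wait_time_s,
    100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_total
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
      (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'CHECKPOINT_QUEUE',
       N'XE_TIMER_EVENT', N'BROKER_TASK_STOP', N'WAITFOR',
       N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;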

Table Variable Deferred Compilation: When It Works

Milos Radivojevic gives us a good example of when table variable deferred compilation is a good thing:

As mentioned in the previous article, SQL Server 2019 cardinality estimations for a table variable are based on actual table variable row counts. Therefore, in SQL Server 2019, we should expect better estimations and better plans for queries that use table variables.
Which queries will benefit from this improvement? Generally, queries that use table variables with a lot of rows in them, which are not tuned yet. For table variables with just a few rows, there will not be significant changes, and you should expect the same execution plan and almost the same execution parameters.

Queries whose execution was slow due to underestimation of table variables usually implement logical joins using the Nested Loop Join physical operator where a Hash or Merge Join operator would be more appropriate. In addition, underestimation of table variables participating in multiple joins could lead to insufficient memory grants, and thus data spilling to tempdb.

Click through for the example.  The next post in the series will be a case where it doesn’t work very well.
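
As a rough illustration of the kind of query that benefits (object names here are hypothetical, not Milos’ demo): a table variable loaded with a large number of rows and then joined to a big table. Before SQL Server 2019 the table variable is estimated at one row, which tends to buy a Nested Loop Join and a small memory grant; under compatibility level 150 the optimizer sees the real row count at first compilation.

-- Hypothetical example: stage a large set of order IDs in a table variable
-- and join back to a detail table.
DECLARE @Orders TABLE (OrderID INT PRIMARY KEY);

INSERT INTO @Orders (OrderID)
SELECT OrderID
FROM dbo.OrderHeaders
WHERE OrderDate >= '20180101';   -- assume this returns many rows

-- Compatibility level 140 and earlier: @Orders is estimated at 1 row.
-- Compatibility level 150: the estimate reflects the actual row count,
-- which can change the join strategy and the memory grant.
SELECT o.OrderID, d.ProductID, d.Quantity
FROM @Orders AS o
JOIN dbo.OrderDetails AS d
    ON d.OrderID = o.OrderID;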

Troubleshooting KSQL Executions

Robin Moffatt shows us some of the tools available for researching problems with KSQL queries executed against a server:

What does any self-respecting application need? Metrics! We need to know how many messages have been processed, when the last message was processed and so on.

The simplest option for gathering these metrics comes from within KSQL itself, using the same DESCRIBE EXTENDED command that we saw before:

ksql> DESCRIBE EXTENDED GOOD_RATINGS;
[...]
Local runtime statistics
------------------------
messages-per-sec:      1.10 total-messages:     2898 last-message: 9/17/18 1:48:47 PM UTC
 failed-messages:         0 failed-messages-per-sec:         0 last-failed: n/a
(Statistics of the local KSQL server interaction with the Kafka topic GOOD_RATINGS)
ksql>

You can get more details, including explain plans, from this. Robin also demonstrates external tools which let you track the streams over time.

Enhancements To Actual Query Plans In SSMS 18

Brent Ozar points out a big enhancement to the way SQL Server Management Studio views actual query plans:

You can see the estimated and actual number of rows right there on the query plan just like live query plans! You no longer have to waste hours of your life hovering over different parts of the query plan in order to see where the estimated row counts veer off from the actual row counts.

This doesn’t require SQL Server 2019, either.

Read on for Brent’s thoughts on the matter.
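
If you want the same estimated-versus-actual comparison outside of SSMS, the actual plan XML has always carried both numbers; a quick sketch using a placeholder query:

-- Returns the actual execution plan as XML alongside the results. Each RelOp
-- node carries EstimateRows, and its RunTimeInformation element carries the
-- actual row counts that SSMS 18 now surfaces directly on the plan.
SET STATISTICS XML ON;

SELECT TOP (100) name, object_id, type_desc
FROM sys.objects;

SET STATISTICS XML OFF;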
