Press "Enter" to skip to content

Month: October 2019

The Benefits of Delta Lake

Kaushik Nath explains what a Delta Lake is and why it is beneficial:

Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right? 

All this hype around data lakes has ignored their inherent drawbacks and limitations. Well, I’m Not Here to create a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into the data lake investment with eyes wide open. Otherwise it might lead to some serious complications.

Delta Lake is a concept intended to mitigate some of the issues with data lakes in general, turning them into data swamps.

Comments closed

Family History with SQL Graph

Mala Mahadevan takes us through family histories in a graph database:

I have been working a lot of SQL Graph related queries and applications of the graph data concept to the extent possible within SQL Server’s graph capabilities. Genealogy, or querying family trees is an important graph data application. A lot of us may not have work related applications that are genealogy related, necessarily. But conceptually, this can apply to many similar tree/hierarchy type structures. I was looking into some data to play with in this regard. Sometime ago – we were discussing novels by famed novelist James Michener. My friend Buck Woody made a tweet-remark that it would need a graph database to keep track of the characters and relationships in some of Michener’s novels. I am a big fan of Michener’s novels, and the most recent one I have read is ‘Hawaii’. It is based on history and evolution of the Hawaiian islands, and has a rather complex network of characters, with many ethnicities and several interwoven relationships. I decided to use the characters in Hawaii as my test data to understand how to query geneological data, stored in graph database format.

Read on to see Mala’s table and a procedure to retrieve this data.

Comments closed

Mapping Usernames within the Power BI Gateway

Jeff Pries takes us through a difficult scenario:

With some data sources, such as Analysis Services, you want to pass the username of the person running the report back to the server executing the query (such as in a Row Level Security configuration). Adam Saxton of Guy in a Cube does a great job of explaining how this works at a high level in the video linked here.

In the video, Adam mentions that if our Power BI login does not match a UPN in our local Active Directory, then the lookup will fail…which is a problem if you just don’t have the option of having your Power BI login match a UPN in your local Active Directory and don’t want to manage static user mappings over time. The following will cover a method of allowing the lookup to use a different Active Directory property, such as “mail” to perform the matching.

Read on for the steps.

Comments closed

PySpark DataFrame Joining

Monika Rathor shows the various ways you can join DataFrames with PySpark:

PySpark provides multiple ways to combine dataframes i.e. join, merge, union, SQL interface, etc. In this article, we will take a look at how the PySpark join function is similar to SQL join, where two or more tables or dataframes can be combined based on conditions. 

One join type you don’t directly get in SQL Server is the left anti join. We can build something quite similar with NOT EXISTS, though.

Comments closed

Preventing Lost Code in Azure Data Studio

Dave Bland shows how you can keep from losing code when you close Azure Data Studio:

After working on a query for a long time, we want to make sure that we save the changes we have made.  I have lost hours of work over the years because I didn’t save the changes.  Azure Data Studio has a few features that can help prevent this from happening.

Read on to learn how. With SSMS, this had historically been the domain of third-party plugins, but it’s built into VS Code and ADS.

Comments closed

Power BI Desktop Data Profiling

Angela Henry walks through some changes in Power BI:

The Data Profiling feature was first added to public preview just under a year ago in November 2018.  Then it went GA in May 2019 and just 5 months later, they’ve added more goodness.  That’s one of the great things about Power BI, the release cadence.  If you don’t like something or want more features, just wait a few months (or five in this case).

One of the big things that was lacking with the Data Profiling feature was the text length statistics.  This is a huge deal for me.  It’s one of the things that I’ve encounter most frequently, incorrectly sized string columns in data warehouses.  Well, the wait is over, text lengths are now available.  Unfortunately, it’s not intuitive on how to get them.

Sounds like there’s still room for further improvement, but this is a positive step.

Comments closed

Financial Time Series Analysis in Databricks

Ricardo Portilla shares a demo of financial time series analysis in Databricks:

We’ve shown a merging technique above, so now let’s focus on a standard aggregation, namely Volume-Weighted Average Price (VWAP), which is the average price weighted by volume. This metric is an indicator of the trend and value of the security throughout the day.  The vwap function within our wrapper class (in the attached notebook) shows where the VWAP falls above or below the trading price of the security. In particular, we can now identify the window during which the VWAP (in orange) falls below the trade price, showing that the stock is overbought.

Click through for the article, as well as a notebook you can try out.

Comments closed

Limiting Index Sizes in Cosmos DB

Hasan Savran explains why you might want to exclude columns from Cosmos DB indexes:

If everything is indexed already; Why do we want to exclude some of indexes? Indexes are saved on disk, you pay for the storage in Azure. If you keep indexing everything, your index file gets larger and you pay more for storage.

     Also; write operations to index file takes longer if index file is larger. By keeping only what you need in index file will improve the latency of write operations. If you will need to change your indexing policies, Rebuilding indexes will take less time.

This behavior is quite different from the way SQL Server behaves, where indexing is more of an opt-in philosophy.

Comments closed

Data Retrieval Bug Fixed for Columnstore Indexes

Dmitri Korotkevich takes us through an important bugfix in SQL Server:

The typical columnstore table is usually large and contains hundreds of millions or even billions of rows. Think about large fact tables in the data warehouses or huge transactional tables in OLTP systems. Those tables are usually partitioned. Besides usual reasons (Availability, Maintainability, etc), partitioning helps with the data load – it is easier to perform ETL in the staging table and import data through partition switch.

And here comes the problem. If you run OLTP query against partitioned clustered columnstore table and end up with the execution plan that uses index intersection of nonclustered B-Tree indexes, you may get incorrect results.

Getting the correct results in a query is pretty important (he says, with understatement), so this is an important bugfix; keep those SQL Server instances patched accordingly.

Comments closed