Press "Enter" to skip to content

Curated SQL Posts

Learning About Spatial Data In R

Steph Locke has a compendium of resources for people wishing to learn more about working with spatial data in R:

I recently met up with someone who does geospatial stuff but uses the more traditional GIS software to do it. I showed him a few things in R but not being a person who does a lot of geospatial analysis I thought I’d ask the lovely #rspatial crowd what they’d recommend. Here are the compiled recommendations. Happy learning spatial R!

Feel free to comment or tweet your recommendations to get them added to this list.

There’s a lot of reading, watching, and doing there, so thanks to Steph for putting it together.


Transforming Data: ELT Or ETL?

Artyom Keydunov argues that Extract-Load-Transform is a better model than Extract-Transform-Load:

ETL arose to solve the problem of providing businesses with clean, ready-to-analyze data. We remove dirty and irrelevant data and transform, enrich, and reshape the rest. An example of this is sessionization: the process of creating sessions out of raw pageviews and users’ events.

ETL is complicated, especially the transformation part. It requires at least several months for a small-sized (less than 500 employees) company to get up and running. Once you have the initial transform jobs implemented, never-ending changes and updates will begin because data always evolves with business.

The other problem with ETL is that during the transformation, we reshape data into some specific form. This form usually loses some of the data’s resolution and omits data that seems useless at the time or for that particular task. Often, “useless” data becomes “useful.” For example, if business users request daily data instead of weekly, then you will have to fix your transformation process, reshape the data, and reload it. That would take a few weeks more.

Read on for more, including his argument for why ELT is better.
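As a concrete illustration of the sessionization example above, here is a minimal ELT-style sketch in T-SQL. The raw.pageviews table, its columns, and the 30-minute inactivity threshold are all assumptions for illustration, not Keydunov’s schema:

    -- Assumed table raw.pageviews(user_id, event_time); a new session starts
    -- after 30 minutes of inactivity (hypothetical threshold)
    WITH flagged AS
    (
        SELECT user_id,
               event_time,
               CASE WHEN DATEDIFF(MINUTE,
                        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
                        event_time) > 30
                    THEN 1 ELSE 0 END AS new_session
        FROM raw.pageviews
    )
    SELECT user_id,
           event_time,
           SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time
                                  ROWS UNBOUNDED PRECEDING) AS session_id
    FROM flagged;

Because the raw pageviews stay in the warehouse under ELT, a change like moving from weekly to daily grain becomes an edit to a query like this rather than a rebuild-and-reload of the pipeline.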


Monitoring Performance Of Natively Compiled Stored Procedures

Jos de Bruijn announces a feature coming to the next version of SQL Server:

We just added new database-scoped configuration options that will help with monitoring performance of natively compiled stored procedures. The new options XTP_PROCEDURE_EXECUTION_STATISTICS and XTP_QUERY_EXECUTION_STATISTICS are available now in Azure SQL Database, and will be available in the next major release of SQL Server. These options will improve your monitoring and troubleshooting experience for databases leveraging In-Memory OLTP with natively compiled stored procedures.

After enabling these options, you can monitor the performance of natively compiled stored procedures using Query Store, as well as the DMVs sys.dm_exec_query_stats and sys.dm_exec_procedure_stats. Note that there is a performance impact to enabling execution statistics collection, thus we recommend disabling stats collection when it is not needed.

That last sentence is important:  there’s an observer effect which slows down execution of natively compiled stored procedures, and considering that you’re implementing them specifically for the speed, that’s fairly unwelcome.
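A minimal sketch of the workflow; the option names come from the announcement, while the columns queried are just one reasonable choice:

    -- Enable execution statistics collection (remember the observer effect)
    ALTER DATABASE SCOPED CONFIGURATION SET XTP_PROCEDURE_EXECUTION_STATISTICS = ON;
    ALTER DATABASE SCOPED CONFIGURATION SET XTP_QUERY_EXECUTION_STATISTICS = ON;

    -- Natively compiled procedures now show up in the usual DMVs
    SELECT OBJECT_NAME(ps.object_id) AS procedure_name,
           ps.execution_count,
           ps.total_worker_time,
           ps.total_elapsed_time
    FROM sys.dm_exec_procedure_stats AS ps
    WHERE ps.database_id = DB_ID();

    -- Turn collection back off once troubleshooting is done
    ALTER DATABASE SCOPED CONFIGURATION SET XTP_PROCEDURE_EXECUTION_STATISTICS = OFF;
    ALTER DATABASE SCOPED CONFIGURATION SET XTP_QUERY_EXECUTION_STATISTICS = OFF;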


Using SQLCMD’s Exit Command

Louis Davidson has a quick tip on using :EXIT in SQLCMD mode:

The first use I will cover will help stop you from running an entire file’s worth of SQL statements without meaning to. I use this mostly when doing demo code, but it certainly finds its way into some of my other code as well. The hotkeys in SSMS for hiding the results (Ctrl-R) and executing a query (Ctrl-E) are next to each other. In a demo, where you are showing code statement by statement, accidentally executing the entire file of queries can cause you to spend unwanted time recovering with many eyes staring back at you. This is only slightly better than accidentally running code in your office system and losing your job.

Read on for more.
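A minimal sketch of the demo-protection trick, assuming SQLCMD mode is enabled in SSMS (Query > SQLCMD Mode); the SELECT statements are placeholders:

    :EXIT
    -- In SQLCMD mode, an accidental Ctrl-E on the whole file stops here
    -- instead of running every statement below.

    SELECT 'demo step 1';   -- placeholder demo statements
    SELECT 'demo step 2';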


Query Tuning With The APPLY Operator

Daniel Janik walks through using the APPLY operator to tune a couple of queries:

Recently we were doing a project that heavily focused on query tuning and many tables had various outer joins. My co-worker pointed out that many of these could be converted to an apply rather than a join.

Apply gives you both CROSS and OUTER. Think of CROSS APPLY like an INNER JOIN and OUTER APPLY like an OUTER JOIN.

Let’s compare some code to see how APPLY stacks up.

I like the APPLY operator so much that I created an entire presentation on it.  It’s not a cure-all by any means, but if you understand the intent, you can find places where it improves your code significantly.
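As a quick illustration of the rewrite Daniel describes (the Customers and Orders tables here are hypothetical, not his schema), compare a LEFT OUTER JOIN against a derived table with its OUTER APPLY equivalent:

    -- LEFT OUTER JOIN against a derived table
    SELECT c.CustomerID, o.MaxOrderDate
    FROM dbo.Customers AS c
        LEFT OUTER JOIN
        (
            SELECT CustomerID, MAX(OrderDate) AS MaxOrderDate
            FROM dbo.Orders
            GROUP BY CustomerID
        ) AS o
            ON o.CustomerID = c.CustomerID;

    -- Equivalent OUTER APPLY: the inner query can reference c directly,
    -- which often lets the optimizer seek per customer instead of
    -- aggregating the entire Orders table first
    SELECT c.CustomerID, o.MaxOrderDate
    FROM dbo.Customers AS c
        OUTER APPLY
        (
            SELECT MAX(o2.OrderDate) AS MaxOrderDate
            FROM dbo.Orders AS o2
            WHERE o2.CustomerID = c.CustomerID
        ) AS o;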


Finding The Real Character Set: Unicode And SQL Server Identifiers

Solomon Rutzky wraps up his series on Unicode and regular identifiers:

The question that I’m trying to answer is: what are the valid “letters” and “decimal numbers” from other national scripts?

I tried using the online research tool “UnicodeSet”, but that gave slightly different results (using the “alphabetic” and “numeric_type = decimal” properties) compared to what I discovered SQL Server actually accepts.

I then loaded the actual Unicode 3.2 data files only to find that the number of characters having either the “alphabetic” or “numeric_type = decimal” properties was different than both the online search and what SQL Server actually accepts.

And so…

Click through to find the real Unicode killer.
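One way to spot-check a single character against SQL Server’s identifier rules, in the spirit of Solomon’s testing (the specific code points are just examples):

    -- NCHAR(224) is U+00E0 (à), an alphabetic character: valid in a regular identifier
    DECLARE @sql nvarchar(200) =
        N'DECLARE @x' + NCHAR(224) + N' int; SET @x' + NCHAR(224) + N' = 1;';
    EXEC (@sql);   -- succeeds

    -- NCHAR(8776) is U+2248 (≈), neither a letter nor a decimal number
    DECLARE @sql2 nvarchar(200) = N'DECLARE @x' + NCHAR(8776) + N' int;';
    EXEC (@sql2);  -- fails with a syntax error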


Wait Stats Related To Columnstore Indexes

Niko Neugebauer has some documentation on important wait stats around columnstore:

I split the known wait types into the following distinct groups:
– Batch Execution Mode (HTBUILD, HTDELETE, HTMEMO, HTREINIT, HTREPARTITION, PWAIT_QRY_BPMEMORY, BPSORT)
– Columnstore Indexes (ROWGROUP_VERSION, ROWGROUP_OP_STATS, SHARED_DELTASTORE_CREATION, COLUMNSTORE_BUILD_THROTTLE, COLUMNSTORE_MIGRATION_BACKGROUND_TASK, COLUMNSTORE_COLUMNDATASET_SESSION_LIST)

With Batch Execution Mode for rowstore indexes appearing in the next version of SQL Server, the first group of waits will suddenly become more important for every single SQL Server user, and mixing it together with the wait types specific to the internal structures and functions of the Columnstore Indexes makes no sense.

Read on to learn more about these important wait types.
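To see where your own instance stands on these waits, a simple sketch against sys.dm_os_wait_stats using Niko’s two groups:

    SELECT wait_type, waiting_tasks_count, wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type IN (N'HTBUILD', N'HTDELETE', N'HTMEMO', N'HTREINIT',
                        N'HTREPARTITION', N'PWAIT_QRY_BPMEMORY', N'BPSORT',
                        N'ROWGROUP_VERSION', N'ROWGROUP_OP_STATS',
                        N'SHARED_DELTASTORE_CREATION',
                        N'COLUMNSTORE_BUILD_THROTTLE',
                        N'COLUMNSTORE_MIGRATION_BACKGROUND_TASK',
                        N'COLUMNSTORE_COLUMNDATASET_SESSION_LIST')
    ORDER BY wait_time_ms DESC;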


Finding System-Generated Constraint Names

Michael J. Swart has a script which helps you find system-generated constraint names:

Names for constraints are optional, meaning that if you don’t provide a name when it’s created or cannot afford one, one will be appointed to you by the system.
These system-provided names are messy things, and I don’t think I have to discourage you from using them. Kenneth Fisher has already done that in Constraint names, Say NO to the default.

But how do you know whether you have any?

Check out the informative comments as well.
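This is not Michael’s script (click through for that), but the catalog views expose an is_system_named flag that makes a quick census easy:

    -- Constraints whose names were generated by the system
    SELECT OBJECT_NAME(parent_object_id) AS table_name,
           name AS constraint_name,
           'DEFAULT' AS constraint_type
    FROM sys.default_constraints
    WHERE is_system_named = 1
    UNION ALL
    SELECT OBJECT_NAME(parent_object_id), name, 'CHECK'
    FROM sys.check_constraints
    WHERE is_system_named = 1
    UNION ALL
    SELECT OBJECT_NAME(parent_object_id), name, type_desc
    FROM sys.key_constraints
    WHERE is_system_named = 1
    UNION ALL
    SELECT OBJECT_NAME(parent_object_id), name, 'FOREIGN KEY'
    FROM sys.foreign_keys
    WHERE is_system_named = 1;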


Using Hive: Tiered Or Decoupled Storage?

Brandon Wilson and Gopal Vijayaraghavan compare a series of Hive queries running against EC2 instances with persistent HDFS storage versus S3:

There are advantages and disadvantages to each approach. The tiered approach has the most flexibility for an operator to tune the performance of the cluster while trading off size of the hot data zone for better performance or smaller resource footprint. The downside of this approach is that, having data on HDFS, resizing the cluster is a slow and tedious process due to HDFS needing to be rebalanced to achieve performance and fault-tolerance expectations. Thus this architecture is generally only used for statically sized clusters with steady, well-known workloads.

The decoupled architecture, on the other hand, enables maximum flexibility for cluster growth and reduction. For example, a cluster could run at 100 nodes during the day to support analytics and reporting and then shrink to 24 nodes overnight to support smaller ETL workloads. Historically, the disadvantage to decoupling is that cloud storage is not local and therefore could drastically affect runtime of the analytical workloads (hence the hybrid approach of tiered storage). However, the advent of LLAP in Hive 2.0 has limited this overhead making the approach far more attractive. The dynamic cache within LLAP also means that we do not need to statically define what data is hot. It can be inferred at query time (i.e., as users access the data, that data will become hot). We will look closer at how LLAP closes the runtime gap in the next section.

Historically, the argument was that you should avoid S3 in part because it’s relatively flaky compared to disks (in terms of performance and in its eventual consistency model).  It looks like that’s no longer a pressing concern.
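In the decoupled model, the table definition is what points Hive at cloud storage rather than HDFS; a hypothetical example (the bucket name and schema are invented):

    -- Hypothetical external table whose data lives in S3 rather than HDFS
    CREATE EXTERNAL TABLE web_logs (
        event_time TIMESTAMP,
        host       STRING,
        url        STRING
    )
    STORED AS ORC
    LOCATION 's3a://example-bucket/warehouse/web_logs/';

With LLAP caching frequently accessed data on the compute nodes, queries against a table like this can approach local-disk performance without tying the data to the cluster’s lifetime.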


Filtering On Kafka Streams

Robin Moffatt has a new series showing how to use Kafka Streams for dealing with syslog data:

syslog is one of those ubiquitous standards on which much of modern computing runs. Built into operating systems such as Linux, it’s also commonplace in networking and IoT devices like IP cameras. It provides a way to stream log messages, along with metadata such as the source host, severity of the message, and so on. Sometimes the target is simply a local logfile, but more often it’s a centralised syslog server, which in turn may log or process the messages further.

As a high-performance, distributed streaming platform, Apache Kafka® is a great tool for centralised ingestion of syslog data. Since Apache Kafka also persists data and supports native stream processing we don’t need to land it elsewhere before we can utilise the data. You can stream syslog data into Kafka in a variety of ways, including through Kafka Connect for which there is a dedicated syslog plugin.

In this post, we’re going to see how KSQL can be used to process syslog messages as they arrive, in real time.

Check it out.
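A hedged sketch of the kind of KSQL involved; the topic name and columns here are invented, so check Robin’s post for the real stream definition:

    -- Register the raw syslog topic as a KSQL stream (schema is hypothetical)
    CREATE STREAM syslog (host VARCHAR, facility INT, severity INT, message VARCHAR)
      WITH (KAFKA_TOPIC='syslog', VALUE_FORMAT='JSON');

    -- A continuous query: only messages at severity "error" (3) or worse
    SELECT host, message
    FROM syslog
    WHERE severity <= 3;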
