User-Defined Functions In KSQL

Kai Waehner demonstrates building a user-defined function for Kafka Streams:

As you can see, the full implementation is just a few lines of Java code. In general, you need to implement the logic between receiving input and returning output of the UDF in the evaluate() method. You also need to implement exception handling (e.g. invalid input arguments) where applicable. The init() method is empty in this case, but could initialise any required object instances.

Note that this UDF has state: dateFormat can be null or already initialized. However, no worries. You do not have to manage the scope as Kafka Streams (and therefore KSQL) threads are independent of each other. So this won’t cause any issues.

Click through for the entire process.

Labels And Annotations In ggplot2

I have another post in my ggplot2 series:

Annotations are useful for marking out important comments in your visual.  For example, going back to our wealth and longevity chart, there was a group of Asian countries with extremely high GDP but relatively low average life expectancy.  I’d like to call out that section of the visual and will use an annotation to do so.  To do this, I use the annotate() function.  In this case, I’m going to create a text annotation as well as a rectangle annotation so you can see exactly the points I mean.

By this point, we’re getting closer and closer to high-quality graphics.

The Year Of The Data Engineer

Alex Woodie points out that data science also requires data engineers:

The shortage of data scientists – those triple-threat types who possess advanced statistics, business, and coding skills – has been well-documented over the years. But increasingly, businesses are facing a shortage of another key individual on the big data team who’s critical to achieving success – the data engineer.

Data engineers are experts in designing, building, and maintaining the data-based systems in support of an organization’s analytical and transactional operations. While they don’t boast the quantitative skills that a data scientist would use to, say, build a complex machine learning model, data engineers do much of the other work required to support that data science workload, such as:

  • Building data pipelines to collect data and move it into storage;

  • Preparing the data as part of an ETL or ELT process;

  • Stitching the data together with scripting languages;

  • Working with the DBA to construct data stores;

  • Ensuring the data is ready for use;

  • Using frameworks and microservices to serve data.

Read the whole thing.  My experience is that most shops looking to hire a data scientist really need to get data engineers first; otherwise, you’re wasting that high-priced data scientist’s time.  The plus side is that if you’re already a database developer, getting into data engineering is much easier than mastering statistics or neural networks.

Tupper’s Self-Referential Formula In Postgres

Lukas Eder has a fun post on Tupper’s self-referential formula:

Luckily, this syntax also happens to be SQL syntax, so we’re almost done. So, let’s try plotting this formula for the area of x BETWEEN 0 AND 105 and y BETWEEN k AND k + 16, where k is just some random large number, let’s say

96093937991895888497167296212785275471500433966012930665
15055192717028023952664246896428421743507181212671537827
70623355993237280874144307891325963941337723487857735749
82392662971551717371699516523289053822161240323885586618
40132355851360488286933379024914542292886670810961844960
91705183454067827731551705405381627380967602565625016981
48208341878316384911559022561000365235137034387446184837
87372381982248498634650331594100549747005931383392264972
49461751545728366702369745461014655997933798537483143786
841806593422227898388722980000748404719

Unfortunately, most SQL databases cannot handle such large numbers without any additional libraries, except for the awesome PostgreSQL, whose decimal / numeric types can handle up to 131072 digits before the decimal point and up to 16383 digits after the decimal point.

Yet again, unfortunately, even PostgreSQL by default can’t handle such precisions / scales, so we’re using a trick to expand the precision beyond what’s available by default.
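
To make the idea concrete outside of SQL, here is a minimal Python sketch of the same plot. This is not Lukas's query: it leans on Python's built-in arbitrary-precision integers instead of PostgreSQL's numeric type, and it uses the integer form of Tupper's inequality, 1/2 < floor(mod(floor(y/17) * 2^(-17*floor(x) - mod(floor(y), 17)), 2)), which for whole-number x and y reduces to testing a single bit. The function names tupper_bitmap and encode_bitmap are my own, and encode_bitmap is a hypothetical helper added only so the sketch can be tried without typing in the long constant.

```python
def tupper_bitmap(k, width=106, height=17):
    """Evaluate Tupper's inequality for 0 <= x < width and k <= y < k + height.

    Returns a list of strings, highest y first, with '#' where the
    inequality holds and ' ' where it does not.
    """
    rows = []
    for y in range(k + height - 1, k - 1, -1):
        cells = []
        for x in range(width):
            # For non-negative integers, the floor/mod expression is 1
            # exactly when bit (17*x + y mod 17) of floor(y/17) is set.
            bit = ((y // 17) >> (17 * x + y % 17)) & 1
            cells.append('#' if bit else ' ')
        rows.append(''.join(cells))
    return rows


def encode_bitmap(pixels):
    """Hypothetical helper: build a k whose plot lights up the given pixels.

    pixels is an iterable of (x, y_offset) pairs with 0 <= y_offset < 17.
    """
    n = 0
    for x, y_offset in pixels:
        n |= 1 << (17 * x + y_offset)
    # Multiplying by 17 makes floor(k/17) equal n for the whole 17-row band.
    return 17 * n


if __name__ == "__main__":
    # Small demonstration with a hand-made k rather than the quoted constant.
    demo_k = encode_bitmap([(0, 0), (1, 1), (2, 2), (3, 3)])
    for line in tupper_bitmap(demo_k, width=8):
        print(line)
```

Passing the quoted constant as k (with its digits joined into a single integer) should reproduce the picture of the formula itself, possibly mirrored depending on which way you draw the axes.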

Check it out.

Displaying Items Not Selected In A Power BI Slicer

Matt Allington tries to solve the converse of an easy problem:

My idea was that I would load school photos and also the reunion photos onto the one page.  The user can then click on a slicer with someone’s name (or any other information about people) and “see” those people highlighted in the photo.  I started thinking that I could use the excellent Synoptic Panel from The Italians for this.  The only problem I could foresee was that Synoptic Panel is designed to provide shading over an image based on what was selected.  I wanted to shade/hide those people that were NOT selected.  Anyhow, I love a challenge.

Read on for Matt’s solution.

Maintain MSDB

Lori Brown points out that there are some SQL Server service tables which can bloat your msdb database:

I recently received a panicked call from a client who had a SQL instance go down because the server’s C drive was full. As the guy looked he found that the msdb database file was 31 GB and was consuming all of the free space on the OS drive causing SQL to shut down. He cleaned up some other old files so that SQL would work again but did not know what to do about msdb.

As we looked at it together I found that the sysmaintplan_logdetail table was taking all the space in the database. The SQL Agent had been set to only keep about 10000 rows of history but for some unknown reason the table never removed history. After consulting MSDN I found this code did the trick for truncating this table.

Lori’s focus here is on SQL Agent history, but don’t forget about things like backup history as well.

Installing Docker On Linux

Mark Broadbent shows how to install Docker on Linux Mint:

A web search will almost certainly point you to lots of similar posts, most (if not all) of which start by instructing you to add unofficial or unrecognized sources, keys etc. Therefore my intention with this post is not to replace official documentation, but to make the process as simple as possible, whilst still pointing to all the official documentation so that you can be confident you are not breaking security or other such things!

You can head over to the following Docker page Get Docker CE for Ubuntu for the initial setup and updates, but for simplicity, you can follow along below.

The installation instructions will also work for Ubuntu and other related variants.

The Links That Tie Row To LOB

Steve Stedman shows how to use DBCC PAGE and DBCC IND to piece together where LOB data is stored for a particular row:

The question came up as to how to find a link from blob storage that is corrupt back to the table and row that contains that data.

There is no link from the blob storage back to the table and row, but there is a link from the data page containing the table and row off to the blob data.

Read the whole thing.

Making Power BI Reports Screen Reader Accessible

Meagan Longoria has a couple of posts on designing Power BI reports to make them accessible to people who use screen readers.  First up is a list of good tips:

  • Avoid auto-playing video or audio as that conflicts with the screen reader. If you must use video or audio, provide it in a way that requires the user to start it rather than stop it.

  • Be sure to format numbers appropriately so screen readers don’t read out a long series of insignificant digits.

  • Avoid the use of lots of decorative shapes and images within your report page that do not relay information to users. The screen reader reads each one. When using shapes and images to call out data points, use the alt text to explain what is being called out.

Meagan also has a follow-up blog post with more detail:

The data in the accessible Show Data table will render in the order it is shown in the visual, so you can control that in your design. One exception to this is when the data is rendered in a matrix rather than a table: the total in the accessible Show Data table is positioned at the top rather than the bottom where we see it visually. This is a purposeful design decision to help the user understand the total and then the breakdown of subtotals. Another good thing about the accessible Show Data tables is that tooltips are included, just like when we use the See Data feature.

Another nice feature (not sure if this is built in to JAWS or something the Power BI team added) is that if you have a report page that takes a while to load, JAWS will say “Alert: Visual are loading” so it’s obvious to a blind/low vision user that they need to wait to get the full report page.

There is still a bit of work to be done to make Power BI truly accessible to screen readers.

Both posts are definitely worth the read.
