Press "Enter" to skip to content

Month: September 2016

Pulsar

Alex Woodie reports that Yahoo is open sourcing Pulsar:

Pulsar uses Apache BookKeeper (committed by Yahoo to open source in 2011) as its durable storage mechanism. “With BookKeeper, applications can create many independent logs, called ledgers,” Pulsar’s project page on GitHub says. “A ledger is an append-only data structure with a single writer that is assigned to multiple storage nodes (or bookies) and whose entries are replicated to multiple of these nodes.”

Pulsar uses brokers to serve topics. Each topic is assigned to a broker, and an individual broker can serve thousands of topics, Yahoo says. “The broker accepts messages from writers, commits them to a durable store, and dispatches them to readers,” Yahoo says.
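
To make that concrete, here is a minimal produce-and-consume sketch against a standalone broker, assuming the Python pulsar-client package; the service URL, topic name, and subscription name are made up for illustration.

    import pulsar

    # Connect to a broker (a hypothetical local standalone instance).
    client = pulsar.Client('pulsar://localhost:6650')

    # The broker accepts messages from writers and commits them to the durable store...
    producer = client.create_producer('my-topic')
    producer.send(b'hello pulsar')

    # ...then dispatches them to readers subscribed to the topic.
    consumer = client.subscribe('my-topic', 'my-subscription')
    msg = consumer.receive()
    print(msg.data())
    consumer.acknowledge(msg)

    client.close()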

My biases would lead me to still go with Kafka over Pulsar, but it’d be interesting to see a good comp between the two.


Power BI And Impala

Justin Kestelyn describes the Impala Connector for Power BI Desktop:

Note that the connector currently only supports Import mode, which requires downloading the query output data to the local data model. In future updates, we will enhance the connector with DirectQuery capabilities, as well as with support for refresh scenarios via the Power BI Gateway. [Ed. Note: As of the August 2016 update, the Impala Connector also supports DirectQuery mode, which means you are always viewing the most up-to-date data. The functionality for both periodic refreshes and DirectQuery mode requires the Power BI Gateway running either on-premises or in Microsoft Azure.]

Enabling Power BI connectivity to Impala has been a very frequently requested capability from our customers. We encourage you to give it a try and share with us any feedback or issues that you encounter via the “Send a Frown” feature in Power BI Desktop.

Good stuff.


Instance Segregation

Kennie Pontoppidan discusses why you don’t want your transactional system on the same instance as your warehouse:

Mixing OLTP and data warehouse workload types in the same SQL Server instance is not a good idea. Why? Well, because the workload types don’t mix. As described above, an ideal buffer cache for an OLTP system contains pages used in searches as well as pages for the most used customer/items etc. Running a few data warehouse queries will change the buffer cache to contain (mostly) all dimension data. This will make OLTP operations slow. On the other hand, the ideal buffer cache in a data warehouse will contain all dimension data, which makes aggregations fast because all dimension lookups will be done in memory. OLTP operations might flush out parts of this dimension data from the cache, hurting the performance for data warehouse queries. The two workload types will keep on fighting about the content of the buffer cache, making both systems suffer.
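
If you want to watch that fight directly, one rough way is to ask sys.dm_os_buffer_descriptors which databases currently hold buffer pool pages. A quick sketch, assuming pyodbc and an illustrative connection string:

    import pyodbc

    # Connection string is illustrative; point it at the instance you want to inspect.
    conn = pyodbc.connect(
        'DRIVER={ODBC Driver 13 for SQL Server};SERVER=localhost;Trusted_Connection=yes'
    )

    # sys.dm_os_buffer_descriptors exposes one row per 8 KB page in the buffer pool,
    # so grouping by database shows who currently owns the cache.
    query = """
    SELECT DB_NAME(database_id) AS database_name,
           COUNT(*) * 8 / 1024  AS buffer_mb
    FROM sys.dm_os_buffer_descriptors
    GROUP BY database_id
    ORDER BY buffer_mb DESC;
    """

    for database_name, buffer_mb in conn.cursor().execute(query):
        print(database_name, buffer_mb, 'MB')

    conn.close()

Run it before and after a big warehouse query on a shared instance and you can watch the mix shift.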

Except in very small systems (where it doesn’t much matter), this is sound advice.


Looking For SQL Saturday Sessions

Wayne Sheffield has a PowerShell script to find sessions matching SQL Saturday topics:

The output to the screen is the Event #, Speaker, Session Title and URL for the presentation.

The generated export file will also include the event name and session abstract. Additionally, the URL will be encased with the Excel HYPERLINK() function. When the export file is opened up with Excel, clicking the URL will open your browser to the session information, where the presentation material can be downloaded.

I hope that you get a lot of use out of this script.
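
The HYPERLINK() trick is worth stealing for any export you want clickable in Excel. This is not Wayne’s script, just a small Python sketch of that one technique, with obviously fake session data:

    import csv

    # Illustrative only: wrap each URL in Excel's HYPERLINK() function so that,
    # when the CSV is opened in Excel, the cell renders as a clickable link.
    sessions = [
        {'Event': 'SQLSaturday #999', 'Speaker': 'Jane Doe',
         'Title': 'Sample Session', 'Url': 'https://www.sqlsaturday.com/999/Sessions'},
    ]

    with open('sessions.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Event', 'Speaker', 'Title', 'Link'])
        for s in sessions:
            link = '=HYPERLINK("{0}","{1}")'.format(s['Url'], s['Title'])
            writer.writerow([s['Event'], s['Speaker'], s['Title'], link])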

Click through for the script and more details.


Ambari And Active Directory

Jon Morisi documents his efforts in getting Ambari to play nicely with Active Directory over Kerberos:

You then need to trust the certificate on all of the Linux hosts.
From the IBM article:

  1. Create ‘/etc/pki/ca-trust/source/anchors/activedirectory.pem’ and paste in the certificate contents.

  2. Trust the CA certificate: sudo update-ca-trust enable; sudo update-ca-trust extract; sudo update-ca-trust check

  3. Trust the CA certificate in Java (the assignment and the keytool call are two separate commands):

     mycert=/etc/pki/ca-trust/source/anchors/activedirectory.pem
     sudo keytool -importcert -noprompt -storepass changeit -file ${mycert} -alias ad -keystore /etc/pki/java/cacerts

  4. Finally, make sure every node in your cluster has access to the AD host.

LDAP support is a key part of setting up a production Hadoop cluster.


Chaos Sloth

Erik Darling has created a script to make your servers go slow:

It randomly generates values and changes some important configuration settings.

  • Max Degree of Parallelism
  • Cost Threshold
  • Max Memory
  • Database compatibility level

This was written for SQL Server 2016, on a box that had 384 GB of RAM. If your specs don’t line up, you may have to change the seed values here. I’m not putting any more development into this thing to automatically detect SQL version or memory in the server, because this was a one-off joke script to see how bad things could get.

How bad did they get? The server crashed multiple times.

Not for production purposes.  Or maybe any purposes…
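
For the curious, the mechanics boil down to sp_configure and ALTER DATABASE calls fed with random values. Here is a very rough Python sketch of that shape, not Erik’s script; the connection string, value ranges, and database name are invented, and the same warning applies: don’t point this at anything you care about.

    import random
    import pyodbc

    # NOT for production -- this deliberately degrades instance-level settings.
    # Connection string and value ranges are illustrative.
    conn = pyodbc.connect(
        'DRIVER={ODBC Driver 13 for SQL Server};SERVER=localhost;Trusted_Connection=yes',
        autocommit=True)
    cursor = conn.cursor()

    # Randomly chosen values for the settings the script plays with.
    settings = {
        'max degree of parallelism': random.randint(1, 64),
        'cost threshold for parallelism': random.randint(0, 200),
        'max server memory (MB)': random.randint(1024, 393216),
    }

    cursor.execute("EXEC sp_configure 'show advanced options', 1;")
    cursor.execute("RECONFIGURE;")
    for option, value in settings.items():
        cursor.execute("EXEC sp_configure ?, ?;", option, value)
        cursor.execute("RECONFIGURE;")

    # Compatibility level gets the same treatment (130 = SQL Server 2016).
    # [YourDatabase] is a placeholder database name.
    level = random.choice([100, 110, 120, 130])
    cursor.execute("ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = {0}".format(level))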


Try-Catch Blocks

Dave Mason discusses a major problem with try-catch blocks:

We didn’t just lose one error. We lost them all! DBCC CHECKDB on a corrupt database just came back “clean” and the CATCH block never executed. We’ve moved into dangerous territory now. This is no longer an inconvenience or an annoyance. This could lead to data loss, financial harm to your company, maybe even some risk to your career.

Dave promises a better approach in his next post.


Table Heatmaps

Devin Knight continues his Power BI custom visuals series:

  • In the Table Heatmap the color of the boxes is determined by the value in your measure.

  • Only 1 category field can be used, which will dynamically generate the number of columns based on the number of distinct values your field has.

  • The number and types of colors can be changed using some of the settings we’ll discuss below.

I can see the table heatmap being a good visual for calendars.


Spark At Scale

Sital Kedia, Shuojie Wang, and Avery Ching have an example of how Facebook uses (and has improved) Spark for their ranking system:

Debugging at full scale can be slow, challenging, and resource intensive. We started off by converting the most resource intensive part of the Hive-based pipeline: stage two. We started with a sample of 50 GB of compressed input, then gradually scaled up to 300 GB, 1 TB, and then 20 TB. At each size increment, we resolved performance and stability issues, but experimenting with 20 TB is where we found our largest opportunity for improvement.

While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks. Three out of 10 hours of job runtime were spent moving files from the staging directory to the final directory in HDFS. Initially, we considered two options: Either improve batch renaming in HDFS to support our use case, or configure Spark to generate fewer output files (difficult due to the large number of tasks — 70,000 — in this stage). We stepped back from the problem and considered a third alternative. Since the tmp_table2 table we generate in step two of the pipeline is temporary and used only to store the pipeline’s intermediate output, we were essentially compressing, serializing, and replicating three copies for a single read workload with terabytes of data. Instead, we went a step further: Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.
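
The shape of that change is easy to sketch in PySpark: skip the intermediate tables, keep the stages in one job, and control the number of output files by repartitioning before the final write. This is only an illustration of the idea, not Facebook’s code; the table, columns, path, and partition count are invented.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("combined-ranking-job").getOrCreate()

    # Read the source once and do the work of all three stages in a single job,
    # rather than writing tmp_table-style intermediate output between stages.
    features = spark.table("feature_source")      # hypothetical input table
    ranked = (features
              .groupBy("entity_id")
              .agg({"score": "sum"})
              .withColumnRenamed("sum(score)", "total_score"))

    # Repartition before writing so the task count -- and therefore the number of
    # output files landing in HDFS -- stays manageable.
    (ranked
     .repartition(4096, "entity_id")
     .sortWithinPartitions("total_score")
     .write
     .mode("overwrite")
     .parquet("/warehouse/ranking_output"))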

Maybe it’s just a mindset thing, but the part that impressed me was the number of pull requests for system improvements (and the number which were accepted).
