At Databricks, we used Databricks Notebooks and cluster management to set up a reproducible benchmarking harness that compares the performance of Apache Spark’s Structured Streaming, running on Databricks Unified Analytics Platform, against other open source streaming systems such as Apache Kafka Streams and Apache Flink. In particular, we used the following systems and versions in our benchmarks:
The Yahoo Streaming Benchmark is a well-known benchmark used in industry to evaluate streaming systems. When setting up our benchmark, we wanted to push each streaming system to its absolute limits, yet keep the business logic the same as in the Yahoo Streaming Benchmark. We shared some of the results from these benchmarks during the Spark Summit West 2017 keynote, showing that Spark can reach 5x or higher throughput over other popular streaming systems. In this blog, we discuss in more detail how we performed this benchmark and how you can reproduce the results yourselves.
Standard vendor-based metric warnings aside, you can get the benchmark details at their GitHub repo.
Linear Discriminant Analysis takes a data set of cases (also known as observations) as input. For each case, you need to have a categorical variable to define the class and several predictor variables (which are numeric). We often visualize this input data as a matrix, such as shown below, with each case being a row and each variable a column. In this example, the categorical variable is called “class” and the predictive variables (which are numeric) are the other columns.
Following this is a clear example of how to use LDA. This post is also the second time this week somebody has suggested The Elements of Statistical Learning, so I probably should make time to look at the book.
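The input layout described in the excerpt can be sketched in a few lines of Python. This is only an illustration of the data shape, not the linked post's code: the values, the class labels "a" and "b", and the helper names split_cases and class_means are all made up for the example. It also computes per-class means of the predictors, which are among the quantities LDA estimates from data in this shape.

```python
from collections import defaultdict

# Hypothetical data set: one row per case. The first column is the
# categorical "class" variable; the remaining columns are numeric predictors.
cases = [
    ("a", 1.0, 2.0),
    ("a", 2.0, 3.0),
    ("b", 6.0, 5.0),
    ("b", 7.0, 8.0),
]


def split_cases(rows):
    """Separate the class column from the numeric predictor columns."""
    labels = [row[0] for row in rows]
    predictors = [list(row[1:]) for row in rows]
    return labels, predictors


def class_means(labels, predictors):
    """Return the mean of each predictor column, computed per class."""
    groups = defaultdict(list)
    for label, x in zip(labels, predictors):
        groups[label].append(x)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


labels, predictors = split_cases(cases)
means = class_means(labels, predictors)
```

A full LDA fit would also need a pooled covariance estimate; that step is left out here to keep the focus on the input format.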
Bayesian Nonparametrics are a class of models for which the number of parameters grows with data. A simple example is non-parametric K-means clustering. Instead of fixing the number of clusters K, we let data determine the best number of clusters. By letting the number of model parameters (cluster means and covariances) grow with data, we are better able to describe the data as well as generate new data given our model.
Of course, to avoid over-fitting, we penalize the number of clusters K via a regularization parameter which controls the rate at which new clusters are created.
This is an interesting discussion of the Dirichlet process, particularly as applied to K-means clustering. It helps you figure out your best choice for K, no small task.
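To make the idea of a penalty that controls cluster creation concrete, here is a minimal sketch of one well-known formulation, usually called DP-means. It is an assumption that this matches what the linked post does; the function name dp_means, the parameter lam, and the sample points are all hypothetical. A point opens a new cluster only when its squared distance to every existing centroid exceeds lam, so a larger lam yields fewer clusters.

```python
def dp_means(points, lam, max_iter=100):
    """DP-means-style clustering sketch: K is not fixed in advance."""
    # Start with a single cluster at the global mean.
    centroids = [tuple(sum(c) / len(points) for c in zip(*points))]
    assignments = [0] * len(points)

    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] > lam:
                # Too far from every centroid: open a new cluster here.
                centroids.append(tuple(p))
                k = len(centroids) - 1
            if assignments[i] != k:
                assignments[i] = k
                changed = True

        # Recompute each non-empty cluster's centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))

        if not changed:
            break

    # Drop clusters that ended up empty and renumber the rest.
    used = sorted(set(assignments))
    remap = {old: new for new, old in enumerate(used)}
    return [centroids[k] for k in used], [remap[a] for a in assignments]


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = dp_means(points, lam=4)
one_cluster, _ = dp_means(points, lam=1000)
```

With a moderate lam the sketch recovers the two groups; with a very large lam everything stays in a single cluster, which shows how the penalty trades fit against the number of clusters.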
I recently read something that said using the RESTORE WITH REPLACE command could be faster than dropping a database and then performing a RESTORE, because the shell of the file could be used and therefore skip file initialization. I did not think that was the case, but Books Online wasn’t clear about the situation, so I went ahead and built a quick test case, using ProcMon from Sysinternals. If you aren’t familiar with the Sysinternals tools, you should be; they are a good way to get under the hood of your Windows Server to see what’s going on, and if you’re old like me, you probably used PSEXEC to “telnet” into a Windows server to restart a service before RDP was a thing.
Read on to see how the processes compare.
If your user is a database owner (i.e., is a member of the db_owner role or has CONTROL permissions on the database), the default schema will always be dbo. This is something you can’t change.
So if your legacy application needs quasi-administrative privileges in the database, you can’t make it a database owner, but you can grant those permissions on the schema instead (which is actually a better idea anyway).
What Daniel is doing is akin to the pre-2005 concept of user spaces, where Bob had a schema and Mary had a schema and Jill had a schema and so forth.
This post is another in the continuing theme of “making things consistent.” We were voluntold to help another team get their staging environment set up. Piece of cake, SQL Compare made it trivial to snap the tables over.
Oh, we don’t want these tables in the Custom schema, we want them in dbo. No problem, SQL Compare again and change owner mappings and bam, out come all the tables.
Oh, can we get this in near real-time? Say every 15 minutes. … Transactional replication to the rescue!
Oh, we don’t know what data we need yet so could you keep it all, forever? … Temporal tables to the rescue?
Yes, temporal tables are perfect. But don’t put the history table in the same schema as the table, put it in this one. And put all of that in its own filegroup.
Click through for a helpful script, and tune in next time, when the other team has Bill move their furniture around. Maybe move the couch just a hair to the right…no, a little more, oops, too much…
Doug Kline has a new series on window functions. First, he looks at differences between RANK, DENSE_RANK, and ROW_NUMBER:
-- Quick! What’s the difference between RANK, DENSE_RANK, and ROW_NUMBER?
-- in short, they are only different when there are ties…
-- here’s a table that will help show the difference
-- between the ranking functions
-- note the [Score] column,
-- it will be the basis of the ranking
-- here’s a simple SELECT statement from the Products table
ORDER BY UnitPrice DESC
-- this shows that the highest priced product is Cote de Blaye, productID 38
-- but sometimes the *relative* price is more important than the actual price
-- in other words, we want to know how products *rank*, based on price
Doug’s posts consist entirely of T-SQL scripts, along with embedded videos.
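The point about ties is easy to see with a small runnable sketch. This is not Doug's script: it uses Python's built-in sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or later), and the product names, prices, and the rank_products helper are made up for illustration. The three functions themselves follow the same standard SQL semantics in both engines.

```python
import sqlite3


def rank_products(rows):
    """Rank (name, price) rows by price with RANK, DENSE_RANK, and ROW_NUMBER."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Products (ProductName TEXT, UnitPrice REAL)")
    conn.executemany("INSERT INTO Products VALUES (?, ?)", rows)
    query = """
        SELECT ProductName,
               UnitPrice,
               RANK()       OVER (ORDER BY UnitPrice DESC) AS rnk,
               DENSE_RANK() OVER (ORDER BY UnitPrice DESC) AS dense_rnk,
               -- ProductName is added as a tiebreaker so the numbering is deterministic
               ROW_NUMBER() OVER (ORDER BY UnitPrice DESC, ProductName) AS row_num
        FROM Products
        ORDER BY UnitPrice DESC, ProductName
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical products: B and C tie on price.
ranked = rank_products([("A", 30.0), ("B", 20.0), ("C", 20.0), ("D", 10.0)])
for row in ranked:
    print(row)
```

For the tied pair, RANK and DENSE_RANK give both rows the same value while ROW_NUMBER still numbers them separately. After the tie, RANK skips a value and DENSE_RANK does not.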
This month’s T-SQL Tuesday was all about Big Data. See what the community has to say about Big Data with this collection of articles ranging from deep technical walk-throughs to musings about Big Data’s impact on our industry and the data professional.
Click through to see the participants.