The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out and once again there are big changes in the leaderboard. Say what you will about our profession but as a platform developer you certainly can’t rest on your laurels. Some traditional leaders have fallen (SAS, KNIME, H2Oai, IBM) and some challengers have risen (Alteryx, TIBCO, RapidMiner).
Databricks is making a big push and there’s more movement than usual in this year’s chart. Check it out.
This Python script ran on a single machine, and is from the early days of the company. However, this script didn’t scale since it cannot run in a distributed manner. As a result, this Python job ends up flapping—crashing and restarting regularly in production depending on the load it needs to process.
Second, the Python script puts read pressure on MongoDB and Cassandra, because it has to query the databases for each batch of walk-ins and Zenreach Messages. MongoDB and Cassandra are our primary databases for serving customer read queries. So we wanted to remove the additional read pressure added by this job, which currently competes for resources with our customers.
For these reasons, we wanted to move to a streaming solution—specifically, Kafka Streams. We already switched to Kafka Streams for walk-in detection, which my teammate Eugen Feller explained in a previous post.
Click through for a review of the architecture and some tips if you want to do this yourself.
Recently I have reached interesting problem in Databricks Non delta. I tried to read data from the the table (table on the top of file) slightly transform it and write it back to the same location that i have been reading from. Attempt to execute code like that would manifest with exception:“org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from”.
The lets try to answer the question How to write into a table(dataframe) that we are reading from as this might be a common use case?
This problem is trivial but it is very confusing, if we do not understand how queries are processed in spark.
Click through for the answer. I’m a little squeamish about doing this because my expectation is for data to flow from one source to another source; feeding the data back to the initial source feels strange, like running a load of clothes through the washer and dryer and then dumping them back into the hamper with the remainder of the dirty clothes.
The rows between September 2009 and December 2009 should not be visible. The goal here is to display a blank value in these out-of-range, “future” months.
A similar issue exists for the year-over-year calculation (YOY). Even though the measure tries to show a blank value in case of missing values in current or previous year, the amounts for August 2009 and for CY 2009 might be considered wrong.
The answer is certainly not trivial but it does make for a much nicer display.
Anybody that has interviewed for a job has most likely run into the trick question. Some interviewers like to throw out multiple trick questions all in an effort to trip up the candidate and get the candidate to doubt him/her self. Sure, there can be some benefit to throwing out a trick question or four. One such benefit would be to see how the candidate performs under pressure (see them squirm).
The downside to throwing out trick questions, in my opinion, would be that you can turn a serious candidate into an uninterested candidate. So, when throwing out the tricks, tread carefully.
Incidentally, I don’t think his example question was that tricky, in that there are good reasons to do what he shows. I have one question I like to ask during phone screens which is of a similar vein. I won’t share the question for obvious reasons, but answering it requires a reasonable amount of knowledge of the product and a little bit of cleverness.
On the whole, my interview philosophy is to ask questions which directly relate to the job at hand. If the job involves doing a lot of work with warehousing and ETL with SSIS, ask questions around columnstore indexes, tuning SSIS packages, and some of the types of red flags when looking at packages. I’ve found that people who really don’t know what they’re doing sort themselves out easily enough if you ask relevant questions.
dbaTools have announced that a number of functions have been deprecated and will be removed before the release of version 1.
To future proof the Catalogue against these breaking changes, all deprecated functions have been removed from the Interrogation Powershell script.
Click through for the full change log.
If you manage a lot of SQL Server instances, you likely run into failed login attempts quite often. Perhaps you’re even wondering what client machine is causing all those failures. Since most environments run over TCP/IP; SQL Server helpfully logs the IP address of the client machine that made these failed login attempts to the SQL Server Error Log.
This solution is in T-SQL but shells out to cmd. It might be better suited for Powershell, but it does the trick.
By the 4th
Invoke-DbaQuery, I found myself thinking “this repetitive typing kind of sucks.” Then I remembered Chrissy LeMaire’s segment in the first PSPowerHour where she talked about default values, and her accompanying dbatools blog post. Most of the blog posts and demos of this feature focus on using it from the command line, so I had overlooked the fact that I could use it from within a script as well, and even change the values when looping.
As it turns out – it works inside scripts and functions as well, and can make them a lot easier to read. And you’re not limited to the default parameters every time you call a given function; you can override the defaults by specifying the parameters when you call it.
Andy gives us an example with default values for a SQL Server instance and database, but also shows us how to override that.