Press "Enter" to skip to content

Querying Serverless SQL Pools from Spark Notebooks in Scala

Jovan Popovic shows off one integration point between the data services in Azure Synapse Analytics:

Azure Synapse Analytics provides multiple query runtimes that you can use to query in-database or external data. You have the choice to run T-SQL queries using a serverless Synapse SQL pool or to use notebooks in Apache Spark for Azure Synapse to analyze your data.

You can also connect these runtimes and run the queries from Spark notebooks on a dedicated SQL pool.

In this post, you will see how to create Scala code in a Spark notebook that executes a T-SQL query on a serverless SQL pool.

Read on to see how. You can also query Spark pool and dedicated SQL pool tables from serverless SQL pools.
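
To give a sense of the shape of it, here is a minimal sketch of the JDBC route from a Synapse Spark notebook in Scala. This is my own illustration rather than Jovan's code: the workspace endpoint, database, storage path, and credentials are all hypothetical placeholders, and I'm assuming the SQL Server JDBC driver that ships with Synapse Spark pools.

```scala
// Minimal, hypothetical sketch: run a T-SQL query on a serverless SQL pool
// from a Synapse Spark notebook over JDBC. The endpoint, database, path,
// and credentials below are placeholders, not values from the post.
// (`spark` is the SparkSession predefined in Synapse notebooks.)
val jdbcUrl =
  "jdbc:sqlserver://myworkspace-ondemand.sql.azuresynapse.net:1433;database=demo"

// Serverless SQL pools typically read external files with OPENROWSET.
val query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/files/data/*.parquet',
    FORMAT = 'PARQUET'
) AS rows
"""

// Spark sends the query text to the SQL endpoint and reads the result set
// back as a DataFrame.
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("query", query)
  .option("user", "sqladminuser")     // placeholder; AAD auth is another option
  .option("password", "<password>")   // placeholder
  .load()

df.show()
```

For the dedicated SQL pool direction, Synapse also provides a purpose-built connector (the synapsesql method in Scala), which is generally preferable to plain JDBC for large reads and writes.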

4 Comments

  1. George Walkey · 2021-04-20

    Kevin, just curious: which is faster, Spark or SQL?
    (Leave the question open-ended, Kevin won’t notice)

    • Kevin Feasel · 2021-04-20

      If you leave the question open-ended, you get an open-ended answer: it depends.

      The best part is, it’s true! Even if you narrow the question to “Which is faster, Spark in Scala with RDDs or Spark SQL?”, the answer still depends on the optimization approaches available to you: sometimes Spark SQL can be faster than writing the code yourself using the “lower-level” constructs.
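
      To make that concrete, here’s an illustrative sketch of my own (not from the discussion): the same grouped sum written against the RDD API and against the DataFrame/Spark SQL API. The declarative version hands its plan to the Catalyst optimizer, which can apply tricks like whole-stage code generation that the hand-written RDD version doesn’t get for free.

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder.appName("rdd-vs-sql").getOrCreate()
      import spark.implicits._

      case class Sale(region: String, amount: Double)
      val sales = Seq(Sale("east", 10.0), Sale("west", 25.0), Sale("east", 5.0))

      // "Lower-level" RDD version: manual key-value pairing and reduction.
      val rddTotals = spark.sparkContext
        .parallelize(sales)
        .map(s => (s.region, s.amount))
        .reduceByKey(_ + _)

      // Spark SQL / DataFrame version: declarative; Catalyst plans the execution.
      val sqlTotals = sales.toDF().groupBy("region").sum("amount")
      ```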

      And when it comes to Spark versus something like SQL Server, you bet your bippy it’s an “it depends.” For small to moderate data set sizes, SQL Server will almost definitely be faster due to the quicker startup time for queries. Where Spark can dominate is in cases in which you can partition large amounts of data horizontally and arbitrarily–that is, when one row is pretty well independent of the next. In that case, scale-out solutions give you something approaching a linear function: with N nodes, it takes approximately 1/N time units to process, and with N+1 nodes, it’s approximately 1/(N+1). And, naturally, this all depends on the hardware you’re willing to throw at the problem, the configuration of each system, the skill level of the developer, etc.

      And it even depends on whether you’re looking at this from a formal competition perspective or if you’re thinking of a “general business” answer. In a competition, you might do things with CPU affinity, strange partitioning schemes, and arcane SQL Server settings which would rightfully be frowned upon in most circumstances because you’d be tuning to one specific workload + problem and leaving yourself open to performance issues with other workloads + problems.

  2. George Walkey · 2021-04-20

    Have you read the Polaris whitepaper?
    Guess we’ll just have to see why they killed Databricks with Synapse

    • Kevin Feasel · 2021-04-20

      Yeah, I did see the Polaris whitepaper. As for whether it’s faster than Spark, we’ll have to wait and see. I haven’t seen any proper performance comparisons yet and don’t have enough experience on the serverless SQL pool side to prognosticate on performance differences.

      As for Databricks, as of today, their version of Spark is way faster than what’s in Azure Synapse Analytics Spark pools, as Databricks has a bunch of proprietary work done around performance enhancements that aren’t in the open source product. I’m pretty optimistic about the future of Spark in Azure Synapse Analytics, but if I were to develop a Spark-driven solution in Azure today, I’d still pick Databricks. Azure Synapse Analytics definitely has nice integration points and I’m pretty happy with performance on the dedicated and serverless SQL pool sides, but Spark pools are the laggard.
