Looking at the first 5 records of the RDD
This output is difficult to read. This is because we are asking PySpark to show us data that is in the RDD format. PySpark has a DataFrame functionality. If the Python version is 2.7 or higher, you can utilize the pandas package. However, pandas doesn’t work on Python versions 2.6, so we use the Spark SQL functionality to create DataFrames for exploration.
The full example is a fairly simple k-means clustering process, which is a great introduction to PySpark.
The main problem with an over-engineered solution is that it takes longer to ship than is necessary. By definition, we are doing more than is necessary, and that will take longer to ship. There’s almost never a reason to prefer longer ship-times over shorter ones, all things being equal.
The more serious problem with over-engineering is the carry cost.
A carrying cost is a cost the team bears for having to maintain software and infrastructure. Each feature requires tests, monitoring, and maintenance. Each new feature is made in the context of those that came before it. This is why a feature that might’ve taken one week when the project was new requires a month to make in more mature project.
Read the whole thing and simplify your solutions.
Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as Yourkit, JProfiler and Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.
In my own testing, the profiler data indeed revealed some contention particularly related to
HdfsUpdateLoglocks, leading to long thread wait time. Although promisingly, this result corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.
I like these sorts of case studies because example is the school of mankind. In this particular case, I really like the methodical approach, using available information to search for a root cause. Some of the things Michael calls “false starts” I would consider to be initial steps: checking OS, filesystem, and garbage collection metrics are important even in a case like this in which they did not lead to the culprit, as they help you eliminate suspects.
Generates a call that returns 20 results, rather than the default 10:
Obviously these options make it easier to construct urls and the code is much clearer, but there are also other benefits to using these options which I’ll cover in another blog post soon.
This makes for a more maintainable, dynamic URL generation. Think about an internal product dashboard, where you might need to make API calls to pull in data by product (or maybe you want to send people to an external link for each product). This can help you parameterize your URLs quite easily.
The second issue is that even if you do install SSMS 2016, SQL Agent won’t recognize and give you access to the new module if you use a PowerShell job step. When you create a PowerShell job step, the script in that job step runs within a specific context. It’s hidden from you, but whenever that script runs the first thing that happens is SQL Server launchessqlps.exe.
Check out the links Mike provides to Connect items and the Trello board if you want to see the issues he brought up fixed.
Parameter sniffing fixes are based on your career progression with databases, and they go like this:
1. Reboot the server! – Junior folks panic and freak out, and just restart the server. Sure enough, that erases all cached execution plans. As soon as the box comes back up, they run rpt_Sales for China because that’s the one that was having problems. Because it’s called first, it gets a great plan for big data – and the junior admin believes they’ve fixed the problem.
2. Restart the SQL Server instance – Eventually, as these folks’ careers progress, they realize they can’t go rebooting Windows all the time, so they try this instead. It has the same effect.
If a reboot can’t fix the problem, I’m out of ideas…
By the way, I second Brent’s recommendation of Erland’s query plan article. Erland doesn’t publish frequently, but when he does it’s worth the wait.
Let’s just get this out there right now – I suck at JSON. I suck at XML. The idea of querying a non-normalized document to get the data is not very endearing to me. It is for that reason that I have written utilities or scripts to help generate my XML shredding scripts – as can be seen here.
Knowing that I have this allergy to features similar to XML, I need to build up some resistance to the allergy through a little learning and a little practice. Based on that, my plan is pretty simple:
Read up on JSON
Find some tutorials on JSON
Practice using the feature
Potentially do something destructive with JSON
I’m not particularly excited about JSON support in SQL Server 2016 but the fact that it is there, combined with the fact that so many developers love JSON means that it’s a good idea to learn how to integrate, if only to figure out when it’s a bad idea to parse JSON within your very expensive SQL Server instances.
When referring to what can be done in iOS, Apple often say that there is an “app” for that. Likewise, when R developers refer to what can be done in R, we often say that there is a “package” for that. For instance:
· If one needs to scrap data from the web there are packages for that (rvest, rcurl, and others)
· If one needs to make complicated transformations to their data there are packages for that (dplyr, tidyr, lubrdiate, stringr, and others)
I like the F#-ness of M, but I admit that I’m happy there’s some fairly close R integration within Power BI, as that means there’s one fewer language I need to learn right now…
PHYSICAL_ONLY looks at page structures and makes sure something isn’t obviously wrong. If you have the database’s page verification option set to CHECKSUM (the default since SQL Server 2005, and a very good thing), it runs a checksum on the page and compares it to a checksum recorded in the header when the page was last modified.
When you use the PHYSICAL_ONLY option, you’re telling SQL Server to skip logical checks. It won’t do things like make sure the data in a clustered and nonclustered index on a table is consistent with one another.
I like the analogy, as well as the how-to guide on messing up your database. But try not to do that to a production database…
In my example my linked server has an IP addressed name “192.168.1.103” and pointed to the same SQL Server host.
If we want to redirect that linked server to another SQL Server host we can do following:
April Fool’s Day joke, anybody?