Previously, we set up a Scala application in order to execute a simple word count on Hadoop.
What comes next is uploading our application to HDInsight. So, we shall proceed to create a Hadoop cluster on HDInsight.
Read the whole thing, but the upshot is that Scala apps build jar files just like Java would, so there’s nothing special about running them.
This post describes one way that you can read the top N rows from large text files with C#. This is very useful when working with giant files that are too big to open, but you need to view a portion of them to determine the schema, data types, etc.
I’ve used PowerShell many times to do this with large CSV files, but in this example we’re going to use C# and look at the Wikipedia XML dump of pages and articles. The 2017-03-01 dump is very large and comes in at 59.5 GB.
I’ve had to write something similar before on Windows machines where I didn’t have access to more/less. It’s really helpful for perusing the first few lines of gigantic log files.
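For a rough idea of the technique, here is a minimal Python sketch (not the C# code from the linked post). The function name head and its parameters are my own, and it assumes the file is line-oriented text in a known encoding. The point is that the file is read lazily, so only the first N lines ever get pulled off disk.

```python
from itertools import islice


def head(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    # errors="replace" keeps a stray bad byte from aborting the peek.
    with open(path, "r", encoding=encoding, errors="replace") as f:
        # A file object yields lines lazily, so islice stops after n lines
        # and the remainder of the file is never loaded.
        return [line.rstrip("\r\n") for line in islice(f, n)]
```

Calling head("some-huge-file.xml", 20) should come back almost immediately regardless of file size, since the cost depends on N rather than on the length of the file.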
There is graph support in the next version of SQL Server. The private preview page states:
SQL Graph adds graph processing capabilities to SQL Server, which will help you link different pieces of connected data to help gather powerful insights and increase operational agility. Graphs are well suited for applications where relationships are important, such as fraud detection, risk management, social networks, recommendation engines, predictive analysis, dependence analysis, IoT suites, etc.
Initially, SQL Server will support CRUD graph operations and multi-hop graph navigation, and the following functionality will be available in the private preview:
- Create graph objects, that is, nodes to represent entities and edges to represent relationships between any 2 given nodes. Both Nodes and Edges can have properties associated to them.
- SQL language extensions to support join free, pattern matching queries for multi-hop navigation
Kennie Pontoppidan wrote a great blog post on where to find more information.
Click through for more links to interesting resources.
1. Clean up formatting
The overall format of your code is what makes it possible to quickly navigate to areas of interest. Consistent indentation, line breaks, and patterns help programmers skim large chunks of code. Take the following sloppily formatted code for example:
Read on for the rest. This has analogues in every language: the goal is to create simple, concise, easily scannable, and human-readable code which also correctly solves the relevant business problem.
The reason we all love Java is that we can be careless about allocating memory, and the work of cleaning up the mess is performed by the JVM. At a high level, Java heap memory is divided into two spaces:
1) Young (eden) space
The eden space is where newly created objects go. There are various algorithms for garbage collection, but all of them first try to free memory from the young space, and long-lived objects are moved to the old space.
One common issue that can be noticed when running MapReduce applications is the "GC overhead limit exceeded" error.
Read on for more, including where you can find GC logs.
Step 2: Check out the GitHub project page to see what’s in development.
Next, you should visit the project issues page. Here, you’ll find a list of all the features requested, in development and completed on the project. You can also filter the pages to look at current bugs or requested enhancements. Once you see what’s what, if you think of something you want to work on or help with, make a note of it. You should also look at examples of things in development and things that have been completed so you get an idea of the creative and technical process that goes into the project.
Step 3: Speak up!
Head on back to the Slack channel and let everyone know you want to help out. Someone (probably Chrissy) will add your GitHub account to the project as a contributor so you can have things assigned to you. Congrats, you’re now on the hook!
I’m happy that the dbatools community has sprung up and hope it’s a gateway to further open source development in the SQL Server community.
This reminds me of an old saying: If you’re the smartest person in the room, then you’re in the wrong room.*
Now, this is not a commentary on my current team. I work with some really smart people, and I’m very grateful for that. But while my teammate may be one of the best PHP or Node.js coders I know, that doesn’t necessarily translate to an expertise with the .NET Framework. The true test is this – no matter how smart they are, if they’re not catching my mistakes, then I’m not being held accountable.
There is some good advice here on threading (yes, definitely use the newer threading libraries), but also good advice on surrounding yourself with intelligent people who can catch your mistakes.
My personal website is a static site: 100% HTML, JS, and CSS files with no server-side processing. I have custom code that pulls data from a variety of sources and builds updated versions of the files from templates, which are then deployed to the host. I do this to move the CPU cost of building the pages to my time, instead of charging it to visitors on each page hit. While I have a host, a strategy like this means I could also choose to host for free via GitHub or similar services.
So there’s a great benefit to the reader and our wallet, but no server-side execution makes things like contact forms trickier. Luckily, Azure Functions or AWS Lambda can be used as a webhook to receive the form post and process it, costing next to nothing to use (AWS and Azure both offer a free tier of 1M requests/month and 400,000 GB-seconds of compute time).
Eli has a working example in the post, which I recommend checking out.
The timings in this post came from combining 8 csv files with 13 columns and a combined total of 9.2 million rows.
I first tried combining the files with the PowerShell technique described here. It was painfully slow and took an hour and a half! This is likely because it is deserializing and then serializing every bit of data in the files, which adds a lot of unnecessary overhead.
Next I tried the C# script below using LINQPad. When reading from and writing to a network share, it took 3 minutes and 56 seconds. Much better! Next I tried it on a local SSD drive and it took just 44 seconds.
Read on for the script itself. The ReadAllLines method works fine as long as the file isn’t larger than your working memory.
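If the files are too big for that, a streaming approach sidesteps the memory limit. Here is a minimal Python sketch of the idea (it is not the C# script from the linked post). The function name combine_csv is hypothetical, and it assumes every input file shares the same single-line header. Each file is copied line by line, and the header is written only once.

```python
def combine_csv(paths, out_path, encoding="utf-8"):
    """Concatenate CSV files that share a header, keeping the header once.

    Hypothetical helper for illustration; assumes a one-line header per file.
    """
    header_written = False
    # newline="" leaves line endings exactly as they appear in the inputs.
    with open(out_path, "w", encoding=encoding, newline="") as out:
        for path in paths:
            with open(path, "r", encoding=encoding, newline="") as src:
                for line_no, line in enumerate(src):
                    if line_no == 0:
                        # Skip the header of every file after the first.
                        if header_written:
                            continue
                        header_written = True
                    # Guard against a missing final newline so the next
                    # file's rows do not run into this file's last row.
                    out.write(line if line.endswith("\n") else line + "\n")
```

Because rows are copied as raw text rather than parsed into objects, this avoids the deserialize-then-serialize overhead mentioned above and holds only one line in memory at a time.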
Since I’ve started to play with (and rave about) functional programming (FP), a lot of people have asked me how to get started.
Instead of writing the same email multiple times, I decided to create a blog post I can refer them to. Also, it’s a central place to put all my notes about the topic.
Here’s a small collection of all the resources I’ve accumulated on my adventure on learning functional programming.
I think the functional paradigm fits relational database development extremely well, better than the object-oriented paradigm.