While developers don’t think they need them, DBAs are still needed for governance to make it easier to analyze data.
DBAs have gone from managing databases to being data engineers across multiple systems. They focus on how data moves from one database to another, the consumption of data, and the tuning of the data. Management of the data process across the data landscape is critical until it is distributed and executed automatically.
DBAs have moved from being focused on individual products like SQL Server and Oracle to having to deal with bringing companies’ big data implementations to life.
There are a lot of points here. I agree with many, disagree with a few, and think that some of them are quite context-sensitive. But all are worth thinking about.
As we are approaching the end of 2017, many people have resolutions or goals for the new year. How about a goal to get organized…in your data lake?
The most important aspect of organizing a data lake is optimal data retrieval.
Click through for a great visual showing the various zones in a data lake.
With tabulizer, if the data is relatively well formatted in a PDF, you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!
We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.
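To give a feel for what "wrangling to reconstruct the tables" can involve, here is a minimal sketch in Python rather than R. It is not Steph's script. The sample text, the column names, and the helper `split_columns` are all made up for illustration, and it assumes columns in the extracted text are separated by runs of two or more spaces.

```python
import re

def split_columns(line):
    """Split one line of extracted text into cells on runs of 2+ spaces."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]

# Hypothetical sample, NOT taken from the PASS budget.
sample_text = """\
Item                 2016        2017
Venue hire           1,200       1,350
Speaker travel       3,400       2,900
"""

# Drop blank lines, then treat the first row as the header.
rows = [split_columns(line) for line in sample_text.splitlines() if line.strip()]
header, *body = rows

# Rebuild each data row as a dict keyed by the header cells.
records = [dict(zip(header, row)) for row in body]

for record in records:
    print(record)
```

Real PDF text is rarely this tidy: tables that span pages repeat their headers, and cells containing double spaces would break this split, which is why the real job takes a lot more work.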
Steph shows her work, so click through to see the scripts.
Ginger Grant continues her Machine Learning Services series with a couple more posts. First up is on memory allocation:
Enabling Machine Learning Services on SQL Server, which I discussed in a previous blog post, requires you to enable external scripts. Machine Learning Services are run as external processes to SQLPAL. This means that when you are running Python or R code, you are running it outside of the managed processes of SQL Server and SQLPAL. This design means that the resources used to run Machine Learning Services will run outside of the resources allocated for SQL Server. If you are planning on using Machine Learning Services, you will want to review the server memory options, such as max server memory, which you may have set for SQL Server. For example, if your server has 16 GB of RAM, and you have allocated 8 GB to SQL Server and you estimate that the operating system will use an additional 4 GB, that means that Machine Learning Services will have 4 GB remaining which it can use.
By design, Machine Learning Services will not starve out all of the memory for SQL Server because it doesn’t use it. This means DBAs do not have to worry about SQL Server processes not running because some R program is using all the memory, as it does not use the memory SQL Server has allocated. You do have to worry about the amount of memory allocated to Machine Learning Services: by default, using our previous example where there was 4 GB which Machine Learning Services can use, it will only use 20% of the available memory, or about 819 MB. That is not a lot of memory. Most likely, if you are doing a lot of Machine Learning Services work, you will want to use more memory, which means you will want to change the default memory allocation for external services.
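The arithmetic in that example is easy to check. Below is a minimal sketch in Python. The function name and parameters are hypothetical, the 20% default is taken from the quoted text, and the operating system's 4 GB is the estimate used in the example rather than a measured value.

```python
def external_pool_memory_mb(total_gb, sql_server_gb, os_gb, max_memory_percent=20):
    """Estimate memory (in MB) available to external scripts by default.

    Whatever is left after SQL Server and the operating system take their
    share is the pool; external scripts get max_memory_percent of it.
    """
    remaining_mb = (total_gb - sql_server_gb - os_gb) * 1024
    return remaining_mb * max_memory_percent / 100

# The example from the quote: 16 GB total, 8 GB for SQL Server, 4 GB for the OS.
default_mb = external_pool_memory_mb(16, 8, 4)
print(f"Default allocation: {default_mb:.1f} MB")

# Raising the percentage (50 here is an arbitrary illustration) raises the cap.
raised_mb = external_pool_memory_mb(16, 8, 4, max_memory_percent=50)
print(f"At 50 percent: {raised_mb:.1f} MB")
```

With 4 GB left over, 20% comes to roughly 819 MB, which is why the default is likely to feel tight for anything beyond small workloads.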
Ginger also talks about the Launchpad service:
When calling external processes, internally SQL Server uses User IDs to call the Launchpad service, which is installed as part of Machine Learning Services and must be running for SQL Server to be able to execute code written in R or Python. The number of users is set by default. To change the number of users, open up SQL Server Configuration Manager by typing SQLServerManager14.msc at the run prompt. For some unknowable reason Microsoft decided to hide this application, which was previously available by looking at the installed programs on the server. Now for some reason they think everyone should memorize this obscure command. Once you have the SQL Server Configuration Manager open, right-click on the SQL Server Launchpad service and select the properties, which will show the window, as shown below. You will notice I am running an instance called SQLServer2017, which is listed in parentheses in the window name.
Both are worth reading.
In my lab, I’m running Windows Server 2012. Let’s use PowerShell to install the File Server Resource Manager, which will allow us to create a quota for the relevant folder:
Add-WindowsFeature -Name FS-Resource-Manager -IncludeManagementTools
After installing the Windows feature we can set the quota for the folder, but we shouldn’t enable it just yet, because first we have to verify the current size of the folder.
On my server, I created a quota of 1.5GB, and then enabled it.
Now let’s INSERT rows into the table, in batches of 1000, until we reach the limit (the INSERT script is listed in Part 2; I’m trying to keep this post from getting too long).
Click through to see what happens. It’s not exactly a swath of carnage, but it’s also something you really don’t want to happen.
Access WANem from either SQL Server virtual machine using a case-sensitive URL that includes WANem’s IP address. In this example, the URL is http://126.96.36.199/WANem. Inside the SQL Server virtual machines, I set the browser’s start page to the WANem home page. Create a delay of 1000 msec and retest SQL Server to SQL Server connectivity.
It looks like a good way of proving out whether your setup can handle extreme latency before you build it for real.
Admittedly, using Cross-Database Ownership Chaining and/or Impersonation and/or TRUSTWORTHY are quicker and easier to implement than Module Signing. However, the relative simplicity in understanding and implementing these options comes at a cost: the security of your system.
- Cross-DB Ownership Chaining:
  - security risk (can spoof User / DB-level)
  - db_owner users can create objects for other owners
  - Users with CREATE DATABASE permission can create new databases and attach existing databases
- Impersonation:
  - IMPERSONATE permission is required:
    - can be used any time
    - No granular control over permissions
  - Cross-DB operations need
  - Need to use
  - Elevated permissions last until process / sub-process ends or
- TRUSTWORTHY:
  - Bigger security risk
  - can also spoof Logins, such as “sa”!
  - If using SQLCLR Assemblies, no per-Assembly control of ability to be marked as either UNSAFE; all Assemblies are eligible to be marked as either of those elevated permission sets.
The common theme across all three areas is no control, within a Database, over who or what can make use of the feature / option, or when it can be used.
Read the whole thing.
Goofy outliers are an issue, but the larger the dataset the smaller the issue. If Bill Gates walks into a bar, the average wealth in the bar goes up by a billion. If he walks into a football stadium, everyone gets a million dollar raise.
One way of looking at the issue is to compare the median to the mean. The median is the salary smack dab in the middle, whereas mean is what we normally think of when we think of average.
The median doesn’t care where Bill Gates is, but the mean is sensitive to outliers. If we compare the two, that should give us an idea if we have too much skew in either direction.
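The Bill Gates effect is easy to reproduce. Here is a minimal sketch in Python using only the standard library. The bar patrons' figures and the 90 billion dollar net worth are invented round numbers for illustration, not data from Eugene's post.

```python
from statistics import mean, median

# Nine hypothetical patrons, each with 50,000 dollars to their name.
bar = [50_000] * 9

# The same bar after one very wealthy visitor walks in (assumed figure).
with_gates = bar + [90_000_000_000]

print("Before:", mean(bar), median(bar))
print("After: ", mean(with_gates), median(with_gates))

# A mean far above the median is a quick signal of right skew.
skew_ratio = mean(with_gates) / median(with_gates)
print("Mean / median ratio:", skew_ratio)
```

The median stays at 50,000 while the mean jumps to about 9 billion, so a large gap between the two is a cheap first check for outliers before any deeper analysis.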
If you’re not well-versed in descriptive statistics, Eugene has a good, methodical process and explains each step well.