Wow.. That sure looks like three auto-created, multi-column statistics! We have three stats: stats_ids 3, 4, and 5. The sys.stats_column table contains one row for each column that is in a statistic, so multiple rows for a single statistic (i.e., a single stats_id value), indicate multiple columns in that stat. Indeed, the column_id values indicate the table columns contained the stat. So stats_id 3 contains columns VersionMajor and ApplicationID (column_ids 3 and 1), stats_id 4 contains columns VersionMinor and ApplicationID (column_ids 4 and 1), and stats_id 5 contains columns VersionRevision and ApplicationID (column_ids 5 and 1). And, clearly, the auto_created flag is true, so these three stats were auto-created. What’s going on?
Read on for the answer.
Allows you to bind performance measures to 3 states.
The thresholds for these 3 stages can be set manually or be data driven.
It can visualize one measure while using a different one for the indicator value
This is a pretty small visualization, but it could be useful as part of a larger dashboard.
If you are familiar with traditional Spark streaming you may notice that the above example is lacking an explicit batch duration. In structured streaming the equivalent feature is a trigger. By default it will run batches as quickly as possible, starting the next batch as soon as more data is available and the previous batch is complete. You can also set a more traditional fixed batch interval for your trigger. In the future more flexible trigger options will be added.
A related consequence is that windows are no longer forced to be a multiple of the batch duration. Furthermore, windows needn’t be only on processing time anymore, we can rearrange events that may have been delayed or arrived out of order and window by event time. Suppose our input stream had a column event_time that we wanted to do windowed counts on. Then we could do something like the following to get counts of events in a 1 minute window:
Right now, there are some pretty strict limitations on this new streaming, but I imagine they’ll loosen up quite soon.
My co-worker had set up a dynamic connection (see Rafael Salas or Hari Bagra for details on how to do this), but something weird was happening: the package was trying to push everything to the same server. I confirmed that if all relevant customers loaded were for the same server, that the process would work correctly, and that I could run each server load one at a time, so there weren’t any problems connecting to particular servers or parameters overriding this choice. It’s like the connection was “sticky,” connecting successfully to the first server and then ignoring the later changes.
RetainSameConnection allows for certain benefits, but has specific limitations. Click through to see those details.
In the following script, I have tried to accomplish just that – a single script to create the entire XML parser for me, for you, and for anybody wishing to use it. I don’t want to have to remember the subtle nuances of how to parse each of the events each time I need to parse the session data. I want something that is quick, easy, and repeatable.
With all of that said, here is the script that I now use to parse my session data. You should notice that it has been simplified and is more extensive now.
Jason also has sample usage. Check this out for sure.
A developer wants to be able to truncate a table.
This isn’t an unreasonable request right? She’s writing a piece of code that loads a bunch of data into a staging table. She want’s to be able to truncate the table once the load is over. Unfortunately the permission required to do this is ALTER on the table. That’s not just going to let her truncate the table, it’s going to let her change the structure of the table. Not acceptable in a production environment. A development environment sure. Not a production one. So what do we do?
We use impersonation.
Check out the post to see how to do this.
Don’t go getting all excited. I recognize that these two plans look similar, but they are different. First, let me point out that we have more reads with 1546 and an increase in duration to 273ms. This comes from two places. First, we’re creating statistics on the data in the temporary table where none exist on the table variable. Second, because I want to run this script over and over, I’m including the DROP TABLE statement, which is adding overhead that I wouldn’t see if I treated it like the table variable (which I could, but not here). However, breaking down the to the statement level, I get 250ms duration, just like with the table variable, but, I see 924 reads.
What’s going on?
There certainly are differences between the three. Read on for more details.
First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.
One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.
This is definitely something to keep in mind. I haven’t dealt with data with newlines within attributes, so I haven’t run into this yet, but don’t let it bite you.
Forwarded records are rows in a heap that have been moved from the original page to a new page, leaving behind a forwarding record pointer on the original page to point at the new page. This occurs when you update a column that increases the size of the column and can no longer fit on the page. UPDATEs can cause forwarded records if the updated data does not fit on the page. Forwarding pointers are used to keep track of where the data is.
The comments are also worth reading. Except for the terrible puns.
Existing platform based Hadoop architectures make several implicit assumptions on how users interact with the platform such as developmental research versus production applications. While this was perfectly good in a research mode, as we move to a modern data application architecture we need to bring back modern application concepts to the Hadoop ecosystem. For example, existing Hadoop architectures tightly couple the user interface with the source of data. This is done for good reasons that apply in a data discovery research context, but cause significant issues in developing and maintaining a production application. We see this in some of the popular user interfaces such as Kibana, Banana, Grafana, etc. Each user interface is directly tied to a specific type of data lake and imposes schema choices on that data.
Read the whole thing. Also, “Securing the data plane” sounds like a terrible ’90s action film.