Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:
– User stages, which each user will have by default
– Table stages, which each table will have by default
– Internal named stages, meaning staged within Snowflake
Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.
Read on to see what comes next.
Now, we all know that the consultant answer to “which should I use?” is It Depends ™ 🙂 But what does it depend on?
To me, it boils down to a few key questions you need to ask:
– What is the task or problem you are trying to solve?
– Where and how will you use the output?
– Which tool are you most comfortable using?
Read on to see how they both work.
Back when I worked a normal job, I had two calendars: Office 365 for work and Google for home. Now that I work for myself, that’s a lot more complicated. Sometimes a customer will create an account for me in their network. Sometimes I’ll partner with other consultants and work as part of their team. And of course, I’ve got my own work email at [email protected].
I need all of these calendars to consolidate to one place. My natural inclination and personal preference is to put it all into Google. Now, there are sync apps available, but this sort of problem is a perfect use case. A calendar event is created in outlook, a flow is triggered, and that information is transferred to Google.
I was just complaining about this yesterday and then I see the post this morning. I’m pleased though simultaneously concerned that Eugene has bugged my hotel room.
Clean the data and optionally aggregate it as it sits in source system. The tool used for this would depend on the source system that stores the data (i.e. if SQL Server, you would use stored procedures). The only benefit with this option is if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (i.e. Hadoop, Databricks) to clean it. Strongly advise against this option
Read on for additional options and James’s recommendations.
A user recently asked me a question on my previous blog post (Setting Variables in Azure Data Factory Pipelines) about possibility extracting the first element of a variable if this variable is set of elements (array).
So as a spoiler alert, before writing a blog post and adding a bit more clarity to the existing Microsoft ADF documentation, here is a quick answer to this question.
You’ll have to click through even for the quick answer.
Now let’s go to the construction of the sample application. In the example, we will first send the data from our Linux file system to the data storage unit of the Hadoop ecosystem (HDFS) (for example, Extraction). Then we will read the data we have written here with Spark and then we will apply a simple Transformation and write to Hive (Load). Hive is a substructure that allows us to query the data in the hadoop ecosystem, which is stored in this environment. With this infrastructure, we can easily query the data in our big data environment using SQL language.
Most of the things relational database professionals do are pretty much the same things that you do with Spark and Hive. There are differences in implementation and level of programming familiarity, but they’re pretty similar.
A work table is a table defined in a nearby data location; either a schema in the source or target database or in a database on the same instance. I take a constraint-driven approach to work table location selection. Closer – a schema in the same database – is often better for performance.
I write this knowing some folks will frown at the suggestion of polluting a data source or target database with additional schemas and tables. Best practices exist for a reason. It’s helpful to maintain a list of best practices and to include in this list the reasons each practice exists. This could be a case where violating one or more best practices is justified.
Andy throws out a few ideas as alternatives but states his preference for using work tables to solve this problem.
The Write-DbaDbTableData cmdlet is pretty neat because it can create automatically the destination table if it doesn’t exists, truncate the table if it exists (or append, your choice), keep the identity values and nulls if necessary and everything is done via a bulk insert with a configurable batch size.
Click through for the script. It’s not a replacement for a real ETL process but if you just need something fast, it will do the job.
File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about sourcing data systems and target destination platform, I’m ready to design and build a data integration solution between two or more connecting points. Then, a historical file management process becomes a necessity or a need to log and remove some of the incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean) such files would be helpful.
Click through to see how to do this.
Obviously, you have to have the module installed, and a copy of AdventureWorksDW2017 db restored to a SQL Server. After that, all you have to do is loop through the tables, ‘query’ them with the Read-SqlTableData cmdlet, and pipe the results to the Export-Excel cmdlet.
I did some trial and error with this yesterday. I settled on exporting all of the Dimension tables to separate Worksheets within the same Excel file, and exporting all of the Fact tables to their own individual files (since they tend to be much larger).
Click through for Aaron’s script.