Press "Enter" to skip to content

Category: Python

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow:

Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.

Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.

With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric.  For example, calling CrossValidator.fit() will log one parent run.  Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric.  Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.
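To give a sense of what gets tracked, here is a minimal sketch of the kind of PySpark tuning code the feature applies to, with a hypothetical training DataFrame. On Databricks Runtime 5.3 ML and above, the fit() call below would log to MLflow with no extra code:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# train_df is a hypothetical DataFrame with 'features' and 'label' columns.
# With automatic tracking, this logs one parent MLflow run plus one child
# run per hyperparameter combination (six here), each with its metric.
cv_model = cv.fit(train_df)
```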

Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.

Linear Regression With Python In Power BI

Emanuele Meazzo builds a linear regression in Power BI using a Python visual:

As a prerequisite, of course, you’ll need to have Python installed on your machine. I recommend using an external IDE like Visual Studio Code to write your Python code, as the Power BI window offers zero coding assistance.

You can follow this article to configure Python correctly for Power BI.

Step 2 is to add a Python Visual to the page, and let the magic happen.

Click through for the step-by-step instructions, including quite a bit of Python code and a few warnings and limitations.
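To give a flavor of what that code looks like, here is a minimal sketch of a Python visual script (not Emanuele’s exact code). Power BI hands the visual’s fields to the script as a pandas DataFrame named dataset; the column names below are hypothetical:

```python
# Inside a Power BI Python visual, `dataset` is supplied automatically.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

df = dataset.dropna()                 # hypothetical columns: 'x' and 'y'
X = df[["x"]].values
y = df["y"].values

lm = LinearRegression().fit(X, y)
xs = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)

plt.scatter(X, y, alpha=0.5)
plt.plot(xs, lm.predict(xs), color="red")
plt.title(f"y = {lm.coef_[0]:.2f}x + {lm.intercept_:.2f}")
plt.show()                            # Power BI renders the matplotlib figure
```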

Learning R Versus Python

Andy Kirk shares the results of a rather informal Twitter poll:

Yesterday I ran a simple Twitter poll about the relative ease of learning R vs. Python. Although a correct answer to this query will ALWAYS depend on nuances like pre-existing skills and the scope of need, it originates from people telling me they encounter job or career profiles that list a need for R and/or Python. If they don’t have either and prioritised the pursuit of just one, with which would it be possible to develop a degree of competency more easily, more quickly, and more efficiently?

Andy has also created a Twitter moment from the responses.

My thought, based only on the question itself, is that R would be better than Python because the hypothetical person has no additional programming skills. For someone with additional programming skills, the breakdown for me starts like this: if your background is statistics, database development, or functional programming, you probably want R; if your background is object-oriented development or imperative programming, you probably want Python. And then it gets nuanced.

Linear Programming in Python

Francisco Alvarez shows us an example of linear programming in Python:

The first two constraints, x1 ≥ 0 and x2 ≥ 0, are called nonnegativity constraints. The other constraints are then called the main constraints. The function to be maximized (or minimized) is called the objective function. Here, the objective function is x1 + x2.

Two classes of problems, called here the standard maximum problem and the standard minimum problem, play a special role. In these problems, all variables are constrained to be nonnegative, and all main constraints are inequalities.
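As a concrete illustration of a standard maximum problem, here is a small sketch using SciPy’s linprog, with made-up main constraints since the post’s exact numbers aren’t reproduced here. Note that linprog minimizes, so we negate the objective to maximize x1 + x2:

```python
from scipy.optimize import linprog

c = [-1, -1]                     # minimize -(x1 + x2), i.e., maximize x1 + x2
A_ub = [[1, 2],                  # hypothetical main constraints:
        [4, 2]]                  #   x1 + 2*x2 <= 4
b_ub = [4, 12]                   #   4*x1 + 2*x2 <= 12
bounds = [(0, None), (0, None)]  # nonnegativity constraints

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)           # optimal (x1, x2) and objective value
```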

That post spurred me on to look up LINGO and see that it’s actually still around.

From pandas to Spark with koalas

Achilleus tries out Koalas:

Python is a widely used programming language for data science workloads, and it has plenty of libraries to back that up. Most data scientists are familiar with Python and pandas. But the main issue with pandas is that it works great for small and medium datasets but not so well on big-data workloads. The challenge then becomes converting existing pandas code to PySpark code. That is not straightforward and can carry a lot of performance hits if Python UDFs are used without much care.

Koalas tries to address the first problem, i.e., lessening the friction of learning a different API to port existing pandas code to PySpark. With Koalas, we can directly replace existing pandas code with Koalas. As far as performance goes, there are no numbers yet, as the project is still in its initial phase, but it definitely looks promising.
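The pitch is that the port can be almost mechanical. A minimal sketch, assuming a Spark-backed environment and a hypothetical CSV path; the only change from pandas is the import:

```python
import databricks.koalas as ks               # instead of: import pandas as pd

kdf = ks.read_csv("/data/reviews.csv")       # hypothetical path
kdf["rating_pct"] = kdf["rating"] / 5 * 100  # familiar pandas-style syntax
print(kdf.head())                            # executed by Spark under the hood
```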

Read on for some initial thoughts on the product, including a few gotchas.

Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Sentiment Analysis with Python

Bruno Stecanella shows us how to use MonkeyLearn to perform sentiment analysis in Python:

Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text (in more academic circles, a document) written in natural language and extracts the opinions present in the text.

In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g., positive, neutral, and negative.

For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely’ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting’ labeled as Negative.
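For reference, the MonkeyLearn Python client makes this a short script. A hedged sketch, with the API key and model id as placeholders you would replace with your own:

```python
from monkeylearn import MonkeyLearn

ml = MonkeyLearn("your-api-key-here")        # placeholder API key
model_id = "cl_pi3C7JiL"                     # placeholder sentiment model id
data = ["The staff were lovely",
        "The shared bathroom was absolutely disgusting"]

result = ml.classifiers.classify(model_id, data)
print(result.body)                           # labels plus confidence scores
```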

Click through for a demo.

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_
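In code, that rearrangement is a one-liner once the model is fit. A minimal sketch with synthetic growth data and a hypothetical 500 GB drive:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for daily database size samples.
days = np.arange(1, 91).reshape(-1, 1)
size_gb = 100 + 1.5 * days.ravel() + np.random.normal(0, 2, 90)

lm = LinearRegression().fit(days, size_gb)
m, b = lm.coef_[0], lm.intercept_

capacity_gb = 500                    # y: hypothetical drive capacity
day_full = (capacity_gb - b) / m     # solve y = m*x + b for x
print(f"Projected to reach {capacity_gb} GB around day {day_full:.0f}")
```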

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.
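Two of those concepts, stop words and embedding spaces, are easy to sketch with plain PySpark ML; the Qubole and H2O specifics are in the linked tutorial. Column names here are assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, Tokenizer, Word2Vec

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="review", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    Word2Vec(inputCol="filtered", outputCol="embedding", vectorSize=100),
])

# reviews_df is a hypothetical DataFrame with a 'review' string column.
model = pipeline.fit(reviews_df)
features = model.transform(reviews_df)   # adds a 100-dimensional embedding
```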

Click through for the demo.

Running Spark MLlib to Feed Power BI

Brad Llewellyn shows how you can take Spark MLlib results and feed them into Power BI:

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX.  It is a machine learning framework built from the ground up to be massively scalable and operate within Spark.  This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data.  You can read more about Spark MLlib here.

In order to leverage Spark MLlib, we obviously need a way to execute Spark code.  In our minds, there’s no better tool for this than Azure Databricks.  In the previous post, we covered the creation of an Azure Databricks environment.  We’re going to reuse that environment for this post as well.  We’ll also use the same dataset that we’ve been using, which contains information about individual customers.  This dataset was originally designed to predict Income based on a number of factors.  However, we left the income out of this dataset a few posts back for reasons that were important then.  So, we’re actually going to use this dataset to predict “Hours Per Week” instead.
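The shape of that MLlib code is roughly this: assemble features, fit a regression with hours per week as the label, and score the data. A hedged sketch (column names assumed, not Brad’s exact code):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["age", "education_num", "capital_gain"],
                            outputCol="features")
train = assembler.transform(customers_df)   # customers_df: the census-style data

lr_model = LinearRegression(labelCol="hours_per_week").fit(train)
predictions = lr_model.transform(train)     # these predictions feed Power BI
```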

Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.
