Learning Path: Python: Effective Data Analysis Using Python
Data Science with Python (Learning Path) | Python for Data Science: Analyzing data is far more important than transferring tons of it. The data itself is useless until it is analyzed and presented with visualization tools. In this course, learn how to use Python to do exactly that.
- Self-paced with Lifetime Access
- Certificate on Completion
- Access on Android and iOS App
Data Science with Python (Learning Path) | Python for Data Science
Use Python’s tools and libraries effectively for extracting data from the web and creating attractive and informative visualizations.
Over the years, almost every organization has understood the importance of analyzing data.
In fact, it would not be an overstatement to say that “No organization will be able to survive today’s cut-throat competition if it does not analyze data.”
Data analysis as we know it is the process of taking the source data, refining it to get useful information, and then making useful predictions from it.
In this Learning Path, we will learn how to analyze data using the powerful toolset provided by Python.
Packt’s Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it.
Python features numerous numerical and mathematical toolkits such as NumPy, SciPy, and scikit-learn, all used for data analysis and machine learning. With the aid of all of these, Python has become the language of choice for data scientists for data analysis, visualization, and machine learning.
We will have a general look at data analysis and then discuss web scraping tools and techniques in detail. We will show a rich collection of recipes that come in handy when you are scraping a website using Python, addressing both the usual and the unusual problems you will encounter by diving deep into the capabilities of Python's web scraping tools such as Selenium, BeautifulSoup, and urllib2.
We will then discuss visualization best practices. Effective visualization helps you get better insights from your data and make better, more informed business decisions.
After completing this Learning Path, you will be well-equipped to extract data even from dynamic and complex websites by using Python web scraping tools, and get a better understanding of the data visualization concepts. You will also learn how to apply these concepts and overcome any challenge while implementing them.
To ensure that you get the best of the learning experience, in this Learning Path we combine the works of some of the leading authors in the business.
Benjamin Hoff
- Ben spent 3 years working as a software engineer and team leader doing graphics processing, desktop application development, and scientific facility simulation using a mixture of C++ and Python. This sparked a passion for software development and led him to explore state-of-the-art projects in natural language processing, facial detection/recognition, and machine learning.
Charles Clayton
- Charles Clayton is the sole proprietor of crclayton technologies co and an independent web developer. He is an experienced developer and a specialist in Python web scraping solutions and tools such as Selenium, BeautifulSoup, and urllib2. He has also worked as a Reliability Engineer with West frazweer.
Dimitry Foures
- Dimitry is a data scientist with a background in applied mathematics and theoretical physics. After completing his undergraduate studies in physics at ENS Lyon (France), he studied fluid mechanics at École Polytechnique in Paris, where he obtained a first-class Master's degree. He holds a PhD in applied mathematics from the University of Cambridge. He currently works as a data scientist for a smart energy startup in Cambridge, in close collaboration with the university.
Giuseppe Vettigli
- Giuseppe Vettigli is a data scientist who has worked in the research industry and academia for many years. His work is focused on the development of machine learning models and applications to use information from structured and unstructured data. He also writes about scientific computing and data visualization in Python in his blogs.
Igor Milovanović
- Igor Milovanović is an experienced developer with a strong background in Linux systems and a software engineering education. He is skilled in building scalable, data-driven distributed systems.
- Anyone opting for this course should be well-versed with the basics of Python
- Scrape the Twitter stream to collect real-time data
- Apply predictive methods to forecast future trends based on current data
- Use the Selenium module and scrape with Selenium
- Discover how to perform parsing with BeautifulSoup
- Make 3D visualizations mainly using mplot3d
This video provides an overview of the entire course.
The aim of this video is to introduce us to Python.
- Cover the basic installation and setup for the course
We will learn how to collect and store the data.
- Introduce the Twitter API
- Create a SQLAlchemy engine to connect
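For readers who want to try this step out, here is a minimal sketch of creating a SQLAlchemy engine for storage; the SQLite file name tweets.db is an illustrative assumption, not a value from the course.

```python
# Minimal sketch: create a SQLAlchemy engine backed by a local SQLite file.
# The file name "tweets.db" is a placeholder chosen for this example.
from sqlalchemy import create_engine

engine = create_engine("sqlite:///tweets.db")

# Open a connection to confirm the engine is usable, then release it
with engine.connect() as conn:
    print("connected:", not conn.closed)
```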
We will explore how to collect and store Twitter tweets.
- Learn how to scrape tweets and metadata
- Explore how to store tweets in a database
We will talk about database design.
- Look into database design in greater detail
- Look at what level of information one should gather from databases
We will explore pandas and how it works with databases.
- Learn how to read a table as a pandas DataFrame
- Explore how to query a table using pandas
- Write a pandas DataFrame as a table
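Roughly following the steps above, the sketch below writes a DataFrame to a SQL table, reads it back, and queries it with pandas; the table name, columns, and SQLite file are assumptions for illustration.

```python
# Illustrative sketch: round-trip a DataFrame through a SQL table with pandas.
# The table name "tweets" and its columns are placeholders, not course data.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///tweets.db")

df = pd.DataFrame({"user": ["alice", "bob"], "retweets": [3, 7]})
df.to_sql("tweets", engine, if_exists="replace", index=False)

# Read the whole table back as a DataFrame
tweets = pd.read_sql_table("tweets", engine)

# Or query the table directly through pandas
popular = pd.read_sql_query("SELECT user FROM tweets WHERE retweets > 5", engine)
print(tweets, popular, sep="\n")
```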
We will explore the concepts of pandas Series, DataFrames, and columnar operations.
- Introduce pandas Series and DataFrames
- Explore columnar operations using the apply function
- Work with missing values
We will take a look at grouping operations and how to work with columns.
- Explore grouping operations
- Work with Date columns
We will explore how to merge and combine operations and learn how to export data to JSON/CSV.
- Take a further look at combining operations
- Learn how to export various data
We will take a look at what exactly arrays are, their different types, and histogram functions.
- Introduce array features
- Explore bucketing arrays with the digitize and histogram functions
See exactly what simple aggregations are.
- Introduce aggregations
- Explore more about simple aggregations
We will explore the concept of linear algebra.
- Visually explain linear algebra
- Explore the various functions of linear algebra
We will learn how to present stories via simple visualizations and representations.
- Explore the functions of PyQT
- Take a more in-depth look at Matplotlib
We will learn the different types of graphical representations.
- Pass custom colors to our pie charts
- Create a bar graph widget
We will learn how to create Simple XY plots and axis scales.
- Generate a simple XY plot and all its elements
- Create a legend with multiple sets of data
We will learn how to handle text data.
- Look at what exactly the NLTK package consists of in greater detail
- Start building our data application
We will find out exactly what we mean by "bag of words".
- Apply features to our all Tokens class
- Create a main function
We will learn how to classify words.
- Employ the steam switch tab widget
- Insert data into our map
We will take a look at stemming of words.
- Run our graphical user interface
- Map numerous data points
We will perform simple sentiment analysis using scraped tweets.
- Explain the functions in the package
- Call the items command
- Populate our sentiment analysis
We will learn how to group dimensions and also take a look at the different types of data that are generated.
- Categorize tweeters/users
- Group users by dimensionality
We will derive new metrics and dimensions to uncover hidden insights.
- Look at various aspects of trend analysis
- Learn how to derive new metrics
We will take a look at the concept of correlation analysis.
- Talk about correlations and what exactly they are
- Look at the algorithms that employ correlation
We will briefly go over what we covered in the course and also take a glimpse at what the future holds for us.
- Recap the technologies learnt
- Look at the future scope and potential of the application
This video provides an overview of the entire course.
This video aims to explain the course’s expected prerequisite knowledge and system requirements, then introduce the concept of web scraping, situations in which you may want to use it, and why it is a valuable skill to know.
- Understand the scope of the course.
- Understand the applications of web scraping.
- Understand how it will help you in your day-to-day life.
Without understanding the foundations of web development, it is challenging to write efficient and robust web scraping scripts, so we will cover how a website is structured and how to locate data with precision.
- Understanding the basics of HTML tags and properties.
- Understanding CSS and how it relates to our goal.
- Applying CSS selectors to cleanly identify our desired data.
In order to query a website to scrape data from it, we need to see how the website is structured in its underlying code. We also need an application that will let us test our queries. To do this, we will learn about the element explorer and console of the Chrome Developer Tools.
- Open your browser’s developer tools.
- Use the Element Explorer to identify our desired data.
- Use the Console to test and debug our selectors.
Now we know how to create CSS selectors and use the Chrome developer tools to look at HTML and construct a query, but how do we turn this into a Python script? We use the selenium module and a web driver.
- Download the ChromeDriver WebDriver.
- Install the Selenium module for Python.
- Use these to write Python code to automate the browser.
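As a rough, hedged illustration of these steps, the sketch below drives Chrome with Selenium (Selenium 4 syntax); the URL and CSS selector are placeholders rather than the course's examples, and ChromeDriver is assumed to be installed and on PATH.

```python
# Hedged sketch: drive Chrome with Selenium and read an element located by a
# CSS selector. Assumes ChromeDriver is available on PATH.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")   # placeholder URL

# Find the first <h1> element using a CSS selector and print its text
heading = driver.find_element(By.CSS_SELECTOR, "h1")
print(heading.text)

driver.quit()
```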
Now that we know how to web scrape with Python, we need to be aware of the ethical and legal ramifications associated with web scraping. Mainly, the solution is to be considerate and use common sense.
- Understand your script’s effect on a website.
- Read a website’s terms and conditions.
- Take steps to mitigate the stress your script imposes.
BeautifulSoup cannot work alone. Although it’s a great tool for parsing and organizing a website’s HTML, it doesn’t get the HTML for us, so we have to figure out another method to request a website’s HTML.
- Make sure to confirm the website is rendered server-side.
- Use the Requests or urllib module to make HTTP requests.
- Download the HTML locally in order to develop your script.
So, now we have some HTML strings loaded in Python, but how can we use BeautifulSoup to intelligently start selecting important data from it?
- First, figure out which parser to use
- Then, experiment with the BeautifulSoup methods and objects
- Now, use the CSS selectors we have already mastered!
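Here is a minimal sketch of that workflow, assuming an illustrative URL and selector (not ones used in the course): fetch the HTML with requests, pick a parser, and select data with BeautifulSoup.

```python
# Sketch: fetch a page's HTML with requests and select data with BeautifulSoup.
# The URL and the CSS selector are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text

# Pick a parser explicitly; "html.parser" ships with the standard library
soup = BeautifulSoup(html, "html.parser")

# Reuse the CSS selectors we practiced in the browser console
for link in soup.select("a[href]"):
    print(link.get_text(strip=True), "->", link["href"])
```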
The aim of this video is to show an example of how to parse a webpage, for example, Wikipedia.
- First, we will walk through a complete example of parsing some inconsistent information on Wikipedia
- Then, we will see BeautifulSoup in action using CSS selectors
- Finally, we will use Python and BeautifulSoup methods to extract the data we want
Is writing a web-scraping script always the right method, or are there better alternative solutions?
- Always look for the source of the data
- See if there is an API available
- Don’t reinvent the wheel where unnecessary!
If not through web scraping, how can we get the information using an API with Python?
- Understand how to structure the API request.
- Understand how to interpret the API response.
- Submit your requests just like before
Some APIs require authentication and they require multiple parameters. How do we integrate these into our script?
- Request or purchase an API key from the developer.
- Include your parameters as a dictionary type.
- Know when to web scrape and when to use APIs.
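As a sketch only, the snippet below shows the general pattern of passing an API key and other parameters as a dictionary; the endpoint and parameter names are hypothetical, not from any specific API covered in the course.

```python
# Sketch only: the endpoint and parameter names below are hypothetical, shown to
# illustrate passing an API key and query parameters as a dictionary.
import requests

params = {
    "q": "python",          # hypothetical search-term parameter
    "limit": 10,            # hypothetical page-size parameter
    "api_key": "YOUR_KEY",  # some APIs take the key as a parameter or a header
}

response = requests.get("https://api.example.com/search", params=params)
response.raise_for_status()     # fail loudly on HTTP errors
print(response.json())
```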
This section gives an overview of the entire course.
Importing data from a CSV file into Python can be a bit tricky. It needs careful inspection and the appropriate functions. Let's see how we can do that.
- Import the csv module
- Use the csv.reader() method
- Inspect the file and check the output
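A minimal sketch of these steps, assuming a placeholder file name (data.csv) with a header row:

```python
# Minimal sketch: read a CSV file with the standard library csv module.
# "data.csv" is a placeholder; the first row is assumed to hold column names.
import csv

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)            # inspect the column names first
    for row in reader:
        print(dict(zip(header, row)))
```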
When we are automating a data pipeline for many files, we are not in a position to convert each Excel file into CSV and then import it. This video shows us how to import data directly from an Excel file.
- Import the xlrd module
- Load workbook and access sheets
- Inspect the cell to load the Python date object
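A hedged sketch of the same steps with xlrd follows; the file name and cell position are assumptions, and note that recent xlrd releases only read legacy .xls files.

```python
# Sketch: load a legacy .xls workbook with xlrd and inspect a date cell.
# The file name "records.xls" and the cell position are placeholders.
import xlrd

wb = xlrd.open_workbook("records.xls")
sheet = wb.sheet_by_index(0)

value = sheet.cell_value(1, 0)
if sheet.cell_type(1, 0) == xlrd.XL_CELL_DATE:
    # Convert Excel's serial date into a (year, month, day, h, m, s) tuple
    print(xlrd.xldate_as_tuple(value, wb.datemode))
else:
    print(value)
```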
We've learned how to import data from CSV and Excel. But how do we do that with a file that has fixed-width data? Let's explore.
- Use the struct and string modules
- Use a format mask
- Read and extract the file
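For illustration, here is a small sketch of parsing one fixed-width record with a struct format mask; the field widths chosen are assumptions, not the course's data layout.

```python
# Sketch: parse one fixed-width record with a struct format mask.
# Assumed layout: 10-char name, 4-char year, 2 skipped spaces, 6-char amount.
import struct

line = "Smith     2021  123.45"
mask = "10s4s2x6s"   # "s" = bytes field of given width, "x" = pad byte to skip

name, year, amount = struct.unpack(mask, line.encode())
print(name.decode().strip(), int(year), float(amount))
```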
Although the tab-delimited format is as simple to read as CSV files, we need to ensure that certain parameters are set to keep the reading process accurate. Let's explore how we can do that.
- Reuse the csv data-import code
- Instantiate the csv reader object
- Ensure that the data is not 'dirty'
Let's explore how we can import data from a JSON resource like GitHub, and how to get it and process it later.
- Use the requests module to get content from GitHub
- Process the data after getting the JSON object
- Parse float as decimal in JSON
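A small sketch of those steps, using an example repository chosen for illustration:

```python
# Sketch: fetch JSON from the GitHub API and parse floating-point numbers as
# Decimal. The repository queried here is just an example.
import json
from decimal import Decimal

import requests

resp = requests.get("https://api.github.com/repos/python/cpython")
resp.raise_for_status()

# json.loads lets us control how floats are parsed via parse_float
data = json.loads(resp.text, parse_float=Decimal)
print(data["full_name"], data["stargazers_count"])
```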
Modern applications often hold different datasets inside relational databases (or other databases like MongoDB), and we have to use these databases to produce beautiful graphs. This video will show us how to use SQL drivers from Python to access data.
- Install the SQLite library and connect to the database engine
- Run a query against the selected tables
- Read the result returned from the database engine
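Here is a minimal sketch with Python's built-in sqlite3 driver; the in-memory database and the table are created on the fly purely for illustration.

```python
# Minimal sketch using the built-in sqlite3 driver; the in-memory database and
# the "prices" table exist only for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE prices (day TEXT, value REAL)")
cur.executemany("INSERT INTO prices VALUES (?, ?)",
                [("2020-01-01", 10.5), ("2020-01-02", 11.2)])
conn.commit()

# Run a query and read back the rows the database engine returns
for row in cur.execute("SELECT day, value FROM prices WHERE value > 11"):
    print(row)

conn.close()
```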
Data coming from the real world needs cleaning before processing or even visualization. It's not fully automated and we need to understand outliers in order to clean the data. Let's see how we can do that.
- Use median absolute deviation (MAD) to detect outliers
- Create scatter plots
- Identify how to deal with a dataset having missing values
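As a hedged sketch of MAD-based outlier detection (the 0.6745 scale factor and the 3.5 cut-off are common conventions, not values quoted from the course):

```python
# Sketch: flag outliers with a modified z-score built on the median absolute
# deviation (MAD). The sample data is invented for illustration.
import numpy as np

data = np.array([1.2, 1.1, 0.9, 1.0, 1.3, 8.7, 1.05])

median = np.median(data)
mad = np.median(np.abs(data - median))

# Modified z-score based on the MAD; large values indicate outliers
modified_z = 0.6745 * (data - median) / mad
print(data[np.abs(modified_z) > 3.5])   # flags the 8.7 spike
```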
In scientific computing, images are often represented as NumPy array data structures. We can import images using various techniques. In this video, we will take a look at using image processing in Python, mainly related to scientific processing and less on the artistic side of image manipulation.
- Read and display an image
- Convert the image into a one-channel ndarray, and zoom in using array slicing
- Use numpy.memmap for large images
In this video, we will see different ways of generating random number sequences and word sequences. Some of the examples use standard Python modules, and others use NumPy/SciPy functions.
- Use the random module
- Generate a time series plot of fictional price growth data
- Use different distributions
Data that comes from different real-life sensors is not smooth; it contains some noise that we don't want to show on diagrams and plots. In this video, we introduce a few advanced algorithms to help with cleaning of data coming from real-world sources.
- Average out data over a sample and plot it for that sample
- Use the SciPy library and apply the convolution principle
- Use the Median Filter
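A short sketch of both approaches follows, with window sizes chosen only for illustration.

```python
# Sketch: smooth a noisy signal with a moving average (via convolution) and
# with SciPy's median filter.
import numpy as np
from scipy.signal import medfilt

x = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(x) + np.random.normal(scale=0.3, size=x.size)

# Moving average: convolve with a normalized window over the sample
window = np.ones(9) / 9.0
averaged = np.convolve(noisy, window, mode="same")

# Median filter: robust against isolated spikes in sensor data
filtered = medfilt(noisy, kernel_size=9)
print(averaged[:3], filtered[:3])
```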
There are different plots used for representing data differently. In this video, we'll compare them and understand advanced concepts in data visualization. We will also create sine and cosine plots and customize them.
- Create a basic plot in matplotlib and add a new value range
- Compare various kinds of plots, plotting the same dataset
- Plot sin(x) and cos(x) and then customize the plot
Now that we've learned the concepts of basic plotting and customizing, this video will show us a variety of useful axis properties that we can configure in matplotlib to define axis lengths and limits.
- Fire up IPython and import the plotting functions
- Use the axis() function
- Set xmin, xmax, ymin, and ymax
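A minimal sketch of querying and setting those limits with axis():

```python
# Sketch: query and set axis limits with the axis() function.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))

print(plt.axis())               # current [xmin, xmax, ymin, ymax]
plt.axis([0, 10, -1.5, 1.5])    # set xmin, xmax, ymin, and ymax explicitly
plt.show()
```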
There are different kinds of audiences to whom the data is presented. Having lines that are distinct enough for target audiences (for example, vivid colors for a young audience) leaves a great impact on the viewer. This video shows how we can change various line properties such as styles, colors, or width.
- Pass keyword parameters to plot()
- Use a set of setter methods
- Use the setp() function
As we now know how to change various line properties such as styles, colors, and width, this video will guide us with adding more data to our figure and charts by setting axis and line properties.
- Get the current axis
- Use locator_params()
- Convert dates between different representations
Legends and annotations explain data plots clearly and in context. By assigning each plot a short description about what data it represents, we enable an easier model for the viewer. This video will show how to annotate specific points on our figures and how to create and position data legends.
- Generate different normal distributions
- Use legend() to generate a legend box
- Annotate an important value
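For illustration, the sketch below plots two bell-shaped curves standing in for the distributions mentioned above, adds a legend box, and annotates a point; the labels and coordinates are assumptions.

```python
# Sketch: plot two bell-shaped curves, add a legend, and annotate a point.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 200)
plt.plot(x, np.exp(-x**2 / 2), label="sigma = 1")
plt.plot(x, np.exp(-x**2 / 8), label="sigma = 2")

plt.legend(loc="upper right")   # generate the legend box from the labels

# Annotate the peak of the narrower curve with an arrow
plt.annotate("peak", xy=(0, 1.0), xytext=(2, 0.8),
             arrowprops=dict(arrowstyle="->"))
plt.show()
```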
Spines define data area boundaries; they connect the axis tick marks. There are four spines. We can place them wherever we want. As they are placed on the border of the axis, we see a box around our data plot. This video will demonstrate how to move spines to the center.
- Remove two spines
- Move the bottom and left spine to 0,0
- Move the ticks' positions
Histograms are often used in image manipulation software as a way to visualize image properties such as distribution of light in a particular color channel. This video will help us create histograms in 2D.
- Create different histograms using matplotlib.pyplot.hist()
- Set labels and a title for the plot
To visualize the uncertainty of measurement in our dataset or to indicate the error, we can use error bars. Error bars can easily give an idea of how error-free the dataset is. In this video, we will see how to create bar charts and how to draw error bars.
- Generate a number of measurements
- Add error samples from a standard normal distribution
- Draw and show an error bar
The matplotlib library allows us to fill areas in between and under the curves with color so that we can display the value of that area to the viewer. In this video, we will learn how to fill the area under a curve or in between two different curves.
- Measure two different signals
- Plot signals and fill the gap using fill_between()
- Take a look at fill_betweenx() and fill()
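A small sketch of filling the gap between two signals with fill_between(); the signals themselves are invented for illustration.

```python
# Sketch: shade the area between two signals with fill_between().
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
signal_a = np.sin(x)
signal_b = np.sin(x) + 0.5

plt.plot(x, signal_a, label="signal A")
plt.plot(x, signal_b, label="signal B")

# Fill the gap between the two curves
plt.fill_between(x, signal_a, signal_b, color="grey", alpha=0.3)
plt.legend()
plt.show()
```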
If you have two variables and want to spot the correlation between those, a scatter plot may be the solution to spot patterns. This type of plot is also very useful as a start for more advanced visualizations of multidimensional data. Let's see how to create a scatter plot.
- Generate x values
- Create random as well as correlated measurements
- Create a scatterplot using parameters like color, marker, edgecolors and labels
To be able to distinguish one particular plot line in the figure, we need to add a shadow effect.
- Create the figure and axes
- Obtain a signal and plot it
- Make the shadow transformation and plot the shadow
Adding a data table beside our chart helps to visualize information.
- Create a table of cells and add it to the current axes
In this video, you will create custom subplot configurations on your plots.
- Create a figure
- Add various subplot layouts using subplot2grid
- Reconfigure the tick label size
To spot differences in patterns and compare plots visually in the figure, we need to customize our grids.
- Load data from the sample data directory
- Slice the data
- Plot the data
To display isolines, we create contour plots.
- Implement a function to act as a mock signal processor
- Generate data and transform it into suitable matrices
- Plot contour lines, add contour line labels, and show the plot
To distinguish clearly between two different plots, we fill the areas with different patterns.
- Create two sinusoidal functions that overlap at certain points
- Create two subplots to compare the two variations that render filled regions
When the information is radial in nature, we need a polar plot to display information.
- Create a square figure and add the polar axes to it
- Generate random values for a set of angles and a set of polar distances
- Plot the values
You will learn how to visualize a real-world task in this video.
- Pick up the command-line arguments
- Build a list of dictionaries and calculate the size
- Pass them to a function draw
You must be curious to plot 3D data after getting your hands on 2D. Python provides a toolkit called mplot3d in matplotlib for this. Let's go ahead and explore how it works!
- Specify the backend and generate random data for 4 years
- Specify Z values to be the same for the 3D axis
- Associate each Z-order collection of xs, ys pairs
Similar to 3D bars, you might want to create 3D histograms since these are useful for easily spotting correlations between three independent variables. Let us now dive into it!
- Import the NumPy package
- Generate x and y from normal distributions
- Plot the scatter plot of the same dataset
This video will walk you through graphics rendering with OpenGL. So let's go ahead and do it!
- Generate a dataset and create functions for plot3d
- Import the mlab_source object
- Use points and scalar features to set particular values
Images can be used to highlight the strengths of your visualization in addition to pure data values. It maps deeper into the viewer's mental model, thereby helping the viewer to remember the visualizations better and for a longer time. Let's see how we could use them in Python!
- Create a figure and load the data using the csv module
- Instantiate the csv reader object, and iterate over the data
- Compute the zoom coefficient to scale the size of the image
This video will walk you through how you can make simple yet effective usage of the Python matplotlib library to process image channels and display the per-channel histogram of an external image.
- Load the image and separate the RGB channels from the image matrix
- Instantiate the ImageViewer class and configure the figure and axes
- Plot channel histograms and the image
The best geospatial visualizations are done by overlaying data on the map. This video will show you how to project data on a map using matplotlib's Basemap toolkit. Let's dive into it!
- Instantiate Basemap, defining the projection to be used
- Set up the Basemap instance map
- Instruct the Basemap instance map to draw meridians and parallels
This video will take you through the generation of random images to tell humans and computers apart. Let's do it!
- Define size, text, font size, background color, and CAPTCHA length
- Add some noise in the form of lines and arcs
- Return the image object to the caller together with the CAPTCHA challenge
With the logarithmic scale, the ratio of consecutive values is constant. This is important when we are trying to read log plots. Let us step ahead and see how to perform it!
- Generate two datasets: y, which is exponential/logarithmic in nature, and z, which is linear in nature
- Create subplots containing the y dataset in logarithmic scale and linear scale
- Create other subplots containing the z dataset in logarithmic and linear scale
In this video, we will discuss how to create a stem plot, which displays data as lines extending from a baseline along the x-axis.
- Use matplotlib to plot stem plots using the stem() function
In this video we will visualize wind patterns or liquid flow, and we will use uniform representation of the vector field for this. So, let's go ahead and do it!
- Create data vectors and print intermediate values
- Plot the streamplot
- Show the figure with streamlines visualizing our vectors.
Color-coding the data can have great impact on how your visualizations are perceived by the viewer, as they come with assumptions about colors and what colors represent. This video will walk you through the steps showing the use of colormaps!
- Use the Color Brewer website to get divergent colormap color values
- Apply customization to the scatter plot functions of matplotlib
- Tweak scatter marker, line color and width.
If we want to take a quick look at the data and see if there is any correlation, we would draw a quick scatter plot. In this video, you will understand scatter plots.
- Use a cleaned dataset of Google Trend search volume
- Use a random normal distribution
- Create a figure containing four subplots.
If you have two different datasets from two different observations, you want to know if those two event sets are correlated. You want to cross-correlate them and see if they match in any way. This video will let you achieve this goal!
- Import the matplotlib.pyplot module and the numpy package
- Plot the datasets and cross correlation diagram
- Tighten the layout and add labels and grids.
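A hedged sketch of cross-correlating two signals with matplotlib's xcorr(); here the second signal is simply a shifted, noisier copy of the first, invented for illustration.

```python
# Sketch: cross-correlate two related signals with matplotlib's xcorr().
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
a = np.random.normal(size=200)
b = np.roll(a, 10) + np.random.normal(scale=0.2, size=200)  # shifted copy of a

fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(a, label="a")
ax1.plot(b, label="b")
ax1.legend()
ax1.grid(True)

# The cross-correlation peaks near the lag at which the two signals line up
ax2.xcorr(a, b, maxlags=50)
ax2.grid(True)

plt.tight_layout()
plt.show()
```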
How could you predict the growth of stock dividends? In this video, we will dive into some interesting steps that will let you understand the importance of autocorrelation for this prediction!
- Use a cleaned dataset of Google search volume for a year
- Generate the same-length random dataset using NumPy
- Plot the random dataset and its autocorrelation diagram.
Let's look into how to visualize two-dimensional vector quantities such as speed and direction of wind!
- Generate a grid of coordinates to simulate observations
- Simulate observational values for wind speed
- Plot barb diagrams
- Plot quivers to demonstrate different appearances
How will you visually compare several similar data series? This video will walk you through making a box-and-whisker plot which achieves this goal!
- Read data/labels from the PROCESSES dictionary into DATA/LABELS respectively
- Render the box-and-whisker plot using matplotlib.pyplot.boxplot
- Remove some chart junk from the figure and add axes labels
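For illustration, here is a minimal sketch of that recipe; the PROCESSES dictionary below is a stand-in for the course's sample data, filled with random draws.

```python
# Sketch: compare several similar series with a box-and-whisker plot.
# PROCESSES is invented placeholder data, not the course's dataset.
import matplotlib.pyplot as plt
import numpy as np

PROCESSES = {
    "A": np.random.normal(10, 2, 100),
    "B": np.random.normal(12, 3, 100),
    "C": np.random.normal(9, 1, 100),
}
LABELS, DATA = zip(*PROCESSES.items())

fig, ax = plt.subplots()
ax.boxplot(DATA, labels=LABELS)   # labels sets the tick labels under each box
ax.set_ylabel("duration")

# Remove a little chart junk before showing the figure
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.show()
```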
One form of very widely used visualization of time-based data is a Gantt chart. Let us see how to work with it!
- Load TEST_DATA and instantiate the Gantt class with TEST_DATA
- Process all tasks by plotting horizontal bars on the axes
- Format the x and y axes for the data
- Tighten the layout
Error bars are useful to display the dispersion of data on a plot. So, let's explore their use in Python for data visualization.
- Use some sample data that consists of four sets of observations
- Compute the mean value and 95% confidence interval for observations
- Render bars with vertical symmetrical error bars
This video will let you explore more features of text manipulation in matplotlib, giving a powerful toolkit for even advanced typesetting needs. Let's dive into it.
- List all the possible properties we want to vary on the font
- Iterate over the first and second set of variations
- Render text samples for these iterations, and print the variation combination
- Remove axes from the figure, as they serve no purpose
This video will explain some of the programming interfaces in matplotlib and make a comparison of pyplot and object-oriented API. Let us now explore it!
- Instantiate the matplotlib Path object for custom drawing
- Construct the vertices of the object and path's command codes
- Create a patch and add it to the Axes instance of figure function
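A small sketch of that object-oriented workflow follows; the square drawn here is an arbitrary example shape, not the course's figure.

```python
# Sketch: draw a custom shape with matplotlib's Path and a PathPatch, using
# the object-oriented API.
import matplotlib.pyplot as plt
from matplotlib.patches import PathPatch
from matplotlib.path import Path

# Vertices of the shape and the drawing commands (codes) that connect them
vertices = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY]

path = Path(vertices, codes)
patch = PathPatch(path, facecolor="lightblue", edgecolor="black")

fig, ax = plt.subplots()
ax.add_patch(patch)             # add the patch to the Axes instance
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
plt.show()
```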