Pandas tips and tricks to help you get started with data analysis

Photo by Anastasiia Chepinska on Unsplash

A MultiIndex (also known as a hierarchical index) DataFrame allows you to have multiple columns acting as a row identifier and multiple rows acting as a header identifier. With MultiIndex, you can do some sophisticated data analysis, especially for working with higher dimensional data. Accessing data is the first step when working on a MultiIndex DataFrame.

In this article, you’ll learn how to access data in a MultiIndex DataFrame. This article is structured as follows:

  1. Selecting data via multi-level index
  2. Select a range of data using slice
  3. Selecting all content using slice(None)
  4. Using…
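
As a rough illustration of the first three topics in the list, here is a minimal sketch. The city and date labels, the column names, and the values are all made up for this example and are not taken from the article.

```python
import pandas as pd

# Hypothetical two-level row index (city, date); all values are invented.
index = pd.MultiIndex.from_product(
    [["Cambridge", "London"], ["2019-07-01", "2019-07-02"]],
    names=["city", "date"],
)
df = pd.DataFrame(
    {"max_temp": [26, 24, 28, 25], "min_temp": [13, 12, 15, 14]},
    index=index,
)

# Select every row under one first-level label; the city level is dropped.
london = df.loc["London"]

# Select a single row by giving one label per level as a tuple.
one_day = df.loc[("Cambridge", "2019-07-02")]

# Select a range of rows between two tuples (both ends are included).
# Label slicing like this needs a sorted index, which this one is.
date_range = df.loc[("Cambridge", "2019-07-02"):("London", "2019-07-01")]

# slice(None) means "everything on this level": all cities for one date.
all_cities = df.loc[(slice(None), "2019-07-01"), :]
```

Keeping the index sorted (for example with df.sort_index()) is what makes the tuple-to-tuple slice safe to use.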


Pandas tips and tricks to help you get started with Data Analysis

Photo by Sanah Suvarna on Unsplash

When doing data analysis, it is important to ensure correct data types. Otherwise, you may get unexpected results or errors. Datetime is a common data type in data science projects and the data is often saved as numbers or strings. During data analysis, you will likely need to explicitly convert them to a datetime type.

This article will discuss how to convert numbers and strings to a datetime type. More specifically, you will learn how to use the Pandas built-in methods to_datetime() and astype() to deal with the following common problems:

  1. Converting strings to datetime
  2. Handling…
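
A minimal sketch of the two methods mentioned above, assuming a small made-up DataFrame (the column names date_str and epoch_s are hypothetical):

```python
import pandas as pd

# Made-up data: dates stored as strings and as Unix timestamps in seconds.
df = pd.DataFrame({
    "date_str": ["2020-03-25", "2020-03-26", "not a date"],
    "epoch_s": [1585094400, 1585180800, 1585267200],
})

# Strings to datetime; errors="coerce" turns unparseable values into NaT
# instead of raising an exception.
df["date"] = pd.to_datetime(df["date_str"], format="%Y-%m-%d", errors="coerce")

# Numbers to datetime; unit="s" reads them as seconds since the Unix epoch.
df["from_epoch"] = pd.to_datetime(df["epoch_s"], unit="s")

# astype() also works, but only when every value can be parsed.
clean = pd.Series(["2020-03-25", "2020-03-26"]).astype("datetime64[ns]")
```

to_datetime() is usually the more forgiving choice because of its errors and format parameters, while astype() fails on the first bad value.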


Pandas tips and tricks to help you get started with Data Analysis

Photo by Ross Findon on Unsplash

When doing data analysis, it is important to ensure correct data types. Otherwise, you may get unexpected results or errors. In the case of Pandas, it will correctly infer data types in many cases and you can move on with your analysis without any further thought on the topic.

Despite how well pandas works, at some point in your data analysis process you will likely need to explicitly convert data from one type to another. This article will discuss how to change data to a numeric type. …
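
The teaser does not say which methods the article uses, so the following is only a sketch of two common options, astype() and pd.to_numeric(), on made-up data:

```python
import pandas as pd

# Made-up data: numbers that were read in as strings.
df = pd.DataFrame({"price": ["1.5", "2", "n/a"], "qty": ["3", "4", "5"]})

# astype() is enough when every value is a valid number.
df["qty"] = df["qty"].astype(int)

# to_numeric() with errors="coerce" converts what it can and
# replaces anything it cannot parse with NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```

Calling astype(float) on the price column would raise a ValueError because of the "n/a" entry, which is why the coercing variant is used there.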


Pandas tips and tricks to help you get started with data analysis

Photo by Susan Q Yin on Unsplash

In data preprocessing and analysis, you will often need to figure out whether you have duplicate data and how to deal with it.

In this article, you’ll learn the two methods, duplicated() and drop_duplicates(), for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs. This article is structured as follows:

  1. Counting duplicate and non-duplicate rows
  2. Extracting duplicate rows with loc
  3. Determining which duplicates to mark with keep
  4. Dropping duplicate rows

For demonstration, we will use a subset from the Titanic dataset available on Kaggle.

import pandas as pd
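
The four topics listed above can be sketched on a tiny made-up DataFrame. This is not the Titanic subset the article uses; the names and values here are invented.

```python
import pandas as pd

# Made-up rows; rows 2 and 4 repeat rows 0 and 1.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cat", "Bob"],
    "pclass": [1, 3, 1, 2, 3],
})

# duplicated() returns a boolean Series; by default the first
# occurrence is kept and later copies are marked True.
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())

# Extract the duplicate rows with loc; keep=False marks every copy.
all_copies = df.loc[df.duplicated(keep=False)]

# Drop duplicates, keeping the last occurrence of each row.
deduped = df.drop_duplicates(keep="last")
```

Both methods also accept a subset argument to compare only some columns.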


Pandas tips and tricks to help you get started with data analysis

Photo by Clay Banks on Unsplash

When it comes to selecting data from a DataFrame, Pandas loc and iloc are two top favorites. They are fast, easy to read, and sometimes interchangeable.

In this article, we’ll explore the differences between loc and iloc, take a look at their similarities, and see how to perform data selection with them. We will go over the following topics:

  1. Selecting via a single value
  2. Selecting via a list of values
  3. Selecting a range of data via slice
  4. Selecting via conditions and callable
  5. loc and iloc are interchangeable when labels are 0-based integers

Please check…
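
A minimal sketch of the difference between the two, using a made-up DataFrame with string row labels (the day names and readings are invented):

```python
import pandas as pd

# Made-up data with string row labels.
df = pd.DataFrame(
    {"temp": [12, 15, 9, 11], "humidity": [80, 65, 90, 70]},
    index=["Mon", "Tue", "Wed", "Thu"],
)

# Single value: loc takes labels, iloc takes integer positions.
by_label = df.loc["Tue", "temp"]
by_position = df.iloc[1, 0]

# Slices: loc includes the end label, iloc excludes the end position.
label_slice = df.loc["Mon":"Wed"]   # Mon, Tue, Wed
position_slice = df.iloc[0:2]       # Mon, Tue

# Conditions and callables work with loc.
warm = df.loc[df["temp"] > 10, ["temp"]]
dry = df.loc[lambda d: d["humidity"] < 75]
```

With the default 0-based integer index, a label and a position coincide, which is why the two accessors can look interchangeable on such frames.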


Pandas tips and tricks to help you get started with data analysis

Photo by AbsolutVision on Unsplash

In exploratory data analysis, we often want to analyze data by category. In SQL, the GROUP BY statement groups rows that have the same category values into summary rows. In Pandas, SQL’s GROUP BY operation is performed using the similarly named groupby() method. Pandas’ groupby() allows us to split data into separate groups and perform computations on each group.
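
As a rough sketch of the groupby() call described above, assuming a made-up DataFrame with hypothetical category and sales columns:

```python
import pandas as pd

# Made-up data: one categorical column and one numeric column.
df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "sales": [10, 20, 30, 40, 50],
})

# Split the rows by category, apply sum() to each group,
# and combine the results into one Series indexed by category.
# Roughly: SELECT category, SUM(sales) FROM df GROUP BY category
totals = df.groupby("category")["sales"].sum()

# Several aggregations at once give one column per function.
summary = df.groupby("category")["sales"].agg(["mean", "max"])
```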

In this article, you’ll learn the “group by” process (split-apply-combine) and how to use Pandas’ groupby() method to group data and perform operations. This article is structured as follows:


Pandas tips and tricks to help you get started with Data Analysis

Photo by Chris Lawton on Unsplash

In data analysis, we may work on a dataset that has no column names, or whose column names contain unwanted characters (e.g. spaces), or we may just want to give the columns better names. All of these require us to rename columns in a Pandas DataFrame.

In this article, you’ll learn 5 different approaches to do that. This article is structured as follows:

  1. Using rename() function
  2. Renaming columns while reading a CSV file
  3. Using columns.str.replace() method
  4. Renaming columns via set_axis()

For demonstration, we will use a subset from the Titanic dataset available…
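
The four approaches listed above can be sketched as follows. The column names and the inline CSV text are made up for this example and stand in for the Titanic subset.

```python
import io

import pandas as pd

# Made-up frame with one awkward column name.
df = pd.DataFrame({"Passenger Id": [1, 2], "Pclass": [3, 1]})

# 1. rename() with a mapping from old names to new names.
renamed = df.rename(columns={"Pclass": "ticket_class"})

# 2. Rename while reading a CSV: header=0 discards the file's own
#    header row and names= supplies the replacements.
csv_text = "Passenger Id,Pclass\n1,3\n2,1\n"
from_csv = pd.read_csv(io.StringIO(csv_text), header=0, names=["id", "class"])

# 3. columns.str.replace() to clean up characters in every name.
no_spaces = df.copy()
no_spaces.columns = no_spaces.columns.str.replace(" ", "_", regex=False)

# 4. set_axis() to replace all column labels at once.
relabeled = df.set_axis(["id", "class"], axis=1)
```

rename() is the most targeted of the four, since it only touches the columns named in the mapping.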


Some of the most useful Pandas tricks

All Pandas json_normalize() you should know for flattening JSON (Image by Author using canva.com)

Reading data is the first step in any data science project. As a machine learning practitioner or a data scientist, you have surely come across JSON (JavaScript Object Notation) data. JSON is a widely used format for storing and exchanging data. For example, NoSQL databases like MongoDB store data in a JSON-like format, and REST API responses are mostly available in JSON.

Although this format works well for storing and exchanging data, it needs to be converted into a tabular form for further analysis. You are likely to deal with 2 types of JSON structure, a JSON object or…
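
A minimal sketch of flattening with pd.json_normalize(), assuming made-up records (the field names and values below are invented for illustration):

```python
import pandas as pd

# A list of JSON objects with a nested "contact" object.
records = [
    {"id": 1, "name": "Ann", "contact": {"email": "ann@example.com", "city": "Oxford"}},
    {"id": 2, "name": "Bob", "contact": {"email": "bob@example.com"}},
]

# Nested keys become dotted column names such as "contact.email";
# keys missing from a record are filled with NaN.
flat = pd.json_normalize(records)

# A single JSON object holding a nested list: record_path selects the
# list to expand, and meta copies parent fields onto every row.
school = {"school": "ABC", "students": [{"name": "Tom"}, {"name": "Amy"}]}
students = pd.json_normalize(school, record_path="students", meta=["school"])
```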


All Pandas cut() you should know for transforming numerical data into categorical data (Image by author using canva.com)

Numerical data is common in data analysis. Often you have numerical data that is continuous, spans a very large scale, or is highly skewed. In such cases, it can be easier to bin values into discrete intervals. Dividing values into meaningful categories makes descriptive statistics easier to compute and interpret. For example, we can divide ages into Toddler, Child, Adult, and Elder.

Pandas’ built-in cut() function is a great way to transform numerical data into categorical data. In this article, you’ll learn how to use it to deal with the following common tasks.

  1. Adding custom bins
  2. Adding…
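
As a sketch of the age example above: the bin edges below are assumptions chosen for illustration, not values from the article.

```python
import pandas as pd

# Made-up ages.
ages = pd.Series([2, 10, 35, 70])

# Custom bins with labels. Each bin includes its right edge by default,
# so these edges give (0, 4], (4, 17], (17, 65], (65, 120].
groups = pd.cut(
    ages,
    bins=[0, 4, 17, 65, 120],
    labels=["Toddler", "Child", "Adult", "Elder"],
)
```

The result is a categorical Series, so groups.value_counts() gives a count per age group.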


All you need to know about Pandas Series — the basic building blocks of a DataFrame.

A practical introduction to Pandas Series (Image by Author using canva.com)

DataFrame and Series are two core data structures in Pandas. A DataFrame is a 2-dimensional labeled data structure with rows and columns. It is like a spreadsheet or SQL table. A Series is a 1-dimensional labeled array. It is sort of like a more powerful version of the Python list. Understanding Series is very important, not only because it is one of the core data structures, but also because it is the building block of a DataFrame.

In this article, you’ll learn the most commonly used data operations with Pandas Series, which should help you get started with Pandas. …
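
A few basic Series operations, sketched on made-up data (the names and scores are invented; the teaser does not list which operations the article covers):

```python
import pandas as pd

# A Series is a 1-dimensional array of values with an index of labels.
scores = pd.Series([90, 75, 60], index=["ann", "bob", "cat"], name="score")

# Access by label or by position.
bob_score = scores["bob"]
first_score = scores.iloc[0]

# Arithmetic and comparisons apply to every element at once.
curved = scores + 5
passed = scores[scores >= 70]

# Several Series sharing an index form the columns of a DataFrame,
# and selecting one column returns a Series again.
df = pd.DataFrame({"score": scores, "passed": scores >= 70})
score_column = df["score"]
```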

B. Chen

Machine Learning practitioner | Formerly health informatics at University of Oxford | Ph.D.
