A MultiIndex (also known as a hierarchical index) DataFrame allows you to have multiple columns acting as a row identifier and multiple rows acting as a header identifier. With a MultiIndex, you can do some sophisticated data analysis, especially when working with higher-dimensional data. Accessing data is the first step when working with a MultiIndex DataFrame.
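As a rough illustration of the idea, the sketch below builds a small DataFrame with a two-level row index and a two-level column header, then reads data from it. The city, day, and measurement names are made up for the example and are not taken from the article.

```python
import pandas as pd

# Hypothetical data: rows are identified by (city, day),
# columns by (measure, period).
rows = pd.MultiIndex.from_product(
    [["London", "Oxford"], ["Mon", "Tue"]], names=["city", "day"]
)
cols = pd.MultiIndex.from_product(
    [["temp", "wind"], ["am", "pm"]], names=["measure", "period"]
)
df = pd.DataFrame(
    [[10, 14, 5, 7], [11, 15, 6, 8], [9, 13, 4, 6], [8, 12, 3, 5]],
    index=rows,
    columns=cols,
)

# Select every row under one outer row label.
london = df.loc["London"]

# Select a single cell with a full row key and a full column key.
cell = df.loc[("Oxford", "Tue"), ("temp", "pm")]

# Select one inner column level across all outer column labels.
mornings = df.xs("am", axis=1, level="period")
```

Tuples address one label per level, while `xs()` is convenient when the label of interest sits on an inner level.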
In this article, you’ll learn how to access data in a MultiIndex DataFrame. This article is structured as follows:
When doing data analysis, it is important to ensure correct data types. Otherwise, you may get unexpected results or errors. Datetime is a common data type in data science projects and the data is often saved as numbers or strings. During data analysis, you will likely need to explicitly convert them to a datetime type.
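A minimal sketch of such a conversion, assuming dates stored as ISO-style strings and as `YYYYMMDD` integers (both columns are invented for the example):

```python
import pandas as pd

# Hypothetical raw data: dates saved as strings and as numbers.
df = pd.DataFrame({
    "date_str": ["2021-01-03", "2021-02-14", "not a date"],
    "date_num": [20210103, 20210214, 20210301],
})

# Strings: errors="coerce" turns unparseable values into NaT
# instead of raising an exception.
df["from_str"] = pd.to_datetime(df["date_str"], errors="coerce")

# Numbers in YYYYMMDD form: convert to text and pass an explicit
# format so they are not interpreted as offsets from the epoch.
df["from_num"] = pd.to_datetime(df["date_num"].astype(str), format="%Y%m%d")
```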
This article will discuss how to convert numbers and strings to a datetime type. More specifically, you will learn how to use the Pandas built-in methods to_datetime() and astype() to deal with the following common problems:
When doing data analysis, it is important to ensure correct data types. Otherwise, you may get unexpected results or errors. In the case of Pandas, it will correctly infer data types in many cases and you can move on with your analysis without any further thought on the topic.
Despite how well Pandas works, at some point in your data analysis process you will likely need to explicitly convert data from one type to another. This article will discuss how to change data to a numeric type. …
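As a small sketch of an explicit numeric conversion, assuming numbers that arrived as strings (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: numeric values stored as strings.
df = pd.DataFrame({
    "price": ["1.50", "2.75", "3.00"],
    "qty": ["4", "five", "6"],
})

# astype() works when every value is a clean numeric string.
df["price"] = df["price"].astype(float)

# to_numeric() with errors="coerce" replaces values it cannot
# parse with NaN instead of raising an exception.
df["qty"] = pd.to_numeric(df["qty"], errors="coerce")
```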
In data preprocessing and analysis, you will often need to figure out whether you have duplicate data and how to deal with it.
In this article, you’ll learn the two methods, duplicated() and drop_duplicates(), for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs. This article is structured as follows:
For demonstration, we will use a subset from the Titanic dataset available on Kaggle.
import pandas as pd…
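As a minimal sketch of finding and removing duplicate rows, the example below uses a tiny made-up DataFrame rather than the Titanic subset; the names and cabin values are placeholders.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "cabin": ["C1", "B2", "C1", "C1"],
})

# duplicated() flags rows that repeat an earlier row
# (keep="first" is the default).
mask = df.duplicated()

# drop_duplicates() removes the flagged rows.
unique_rows = df.drop_duplicates()

# subset limits the comparison to the given columns;
# keep="last" retains the final occurrence instead of the first.
unique_cabins = df.drop_duplicates(subset=["cabin"], keep="last")
```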
When it comes to selecting data from a DataFrame, Pandas loc and iloc are two top favorites. They are fast, easy to read, and sometimes interchangeable.
In this article, we’ll explore the differences between loc and iloc, take a look at their similarities, and check how to perform data selection with them. We will go over the following topics:
loc and iloc are interchangeable when labels are 0-based integers
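The interchangeability with 0-based integer labels can be sketched as follows. The `temp` column and the day labels are invented for the example.

```python
import pandas as pd

# Default RangeIndex, so the row labels are 0, 1, 2.
df = pd.DataFrame({"temp": [10, 12, 9]})

# Label 1 and position 1 refer to the same row here.
same = df.loc[1, "temp"] == df.iloc[1, 0]

# Slices still differ: loc includes the end label,
# iloc excludes the end position.
by_label = df.loc[0:1]
by_pos = df.iloc[0:1]

# With non-integer labels, loc needs the label and iloc the position.
named = df.set_index(pd.Index(["Mon", "Tue", "Wed"]))
tue_by_label = named.loc["Tue", "temp"]
tue_by_pos = named.iloc[1, 0]
```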
In exploratory data analysis, we often would like to analyze data by some categories. In SQL, the GROUP BY statement groups rows that have the same category values into summary rows. In Pandas, SQL’s GROUP BY operation is performed using the similarly named groupby() method. Pandas’ groupby() allows us to split data into separate groups to perform computations for better analysis.
In this article, you’ll learn the “group by” process (split-apply-combine) and how to use Pandas’ groupby() function to group data and perform operations. This article is structured as follows:
groupby() and how to access groups information?
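A minimal sketch of split-apply-combine and of inspecting the groups, using a made-up `category`/`value` table:

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Split: one group per distinct category value.
grouped = df.groupby("category")

# Group information: a mapping from group label to row labels,
# and the rows that belong to a single group.
group_rows = grouped.groups
group_a = grouped.get_group("a")

# Apply and combine: sum each group and collect the results
# into one Series indexed by category.
totals = grouped["value"].sum()
```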
In data analysis, we may work on a dataset that has no column names, or whose column names contain unwanted characters (e.g. spaces), or maybe we just want to give the columns better names. All of these require us to rename columns in a Pandas DataFrame.
In this article, you’ll learn 5 different approaches to do that. This article is structured as follows:
For demonstration, we will use a subset from the Titanic dataset available…
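As a short sketch, here are two common ways to rename columns. They are not necessarily among the approaches the article covers, and the column names are invented rather than taken from the Titanic subset.

```python
import pandas as pd

# Hypothetical columns: one contains a space and mixed case.
df = pd.DataFrame({"Passenger Id": [1, 2], "pclass": [3, 1]})

# rename() with a mapping changes only the listed columns
# and returns a new DataFrame.
renamed = df.rename(columns={"Passenger Id": "passenger_id"})

# Assigning to .columns replaces every name at once; the string
# methods here lowercase the names and replace spaces.
cleaned = df.copy()
cleaned.columns = cleaned.columns.str.lower().str.replace(" ", "_", regex=False)
```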
Although this format works well for storing and exchanging data, it needs to be converted into a tabular form for further analysis. You are likely to deal with 2 types of JSON structure, a JSON object or…
Numerical data is common in data analysis. Often you have numerical data that is continuous, spans a very large scale, or is highly skewed. Sometimes, it can be easier to bin values into discrete intervals. This is helpful for descriptive statistics, because the values are divided into meaningful categories. For example, we can divide age into Toddler, Child, Adult, and Elder.
The cut() function is a great way to transform numerical data into categorical data. In this article, you’ll learn how to use it to deal with the following common tasks.
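The age example can be sketched like this. The bin edges are an assumption chosen for illustration, since the text does not say where each category starts and ends.

```python
import pandas as pd

ages = pd.Series([2, 10, 35, 70])

# Assumed bin edges; intervals are right-inclusive by default,
# i.e. (0, 4], (4, 17], (17, 64], (64, 120].
bins = [0, 4, 17, 64, 120]
labels = ["Toddler", "Child", "Adult", "Elder"]

# cut() returns a categorical Series with one label per value.
age_group = pd.cut(ages, bins=bins, labels=labels)
```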
DataFrame and Series are two core data structures in Pandas. A DataFrame is a 2-dimensional labeled data structure with rows and columns. It is like a spreadsheet or SQL table. A Series is a 1-dimensional labeled array. It is sort of like a more powerful version of the Python list. Understanding Series is very important, not only because it is one of the core data structures, but also because it is the building block of a DataFrame.
In this article, you’ll learn the most commonly used data operations with Pandas Series, which should help you get started with Pandas. …
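A few basic Series operations, as a minimal sketch with made-up values; the last lines show a Series serving as a DataFrame column.

```python
import pandas as pd

# A labeled 1-dimensional array.
s = pd.Series([3, 1, 2], index=["a", "b", "c"], name="score")

by_label = s["b"]      # access by label
by_pos = s.iloc[0]     # access by position
doubled = s * 2        # element-wise arithmetic
above_one = s[s > 1]   # boolean filtering

# Each column of a DataFrame is itself a Series.
df = pd.DataFrame({"score": s, "bonus": doubled})
col = df["score"]
```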
Machine Learning practitioner | Formerly health informatics at University of Oxford | Ph.D.