Pandas¶
Data Types¶
dtypes¶
DatetimeTZDtype, CategoricalDtype, PeriodDtype, SparseDtype, IntervalDtype, Int64Dtype, StringDtype, BooleanDtype
Get Dataframe dtypes for each column and value counter
1 2 | |
defaults¶
By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit)
astype¶
astype() method may convert dtypes from one to another. These will by default return a copy, even if the dtype was unchanged (pass copy=False to change this behavior).
Convert a subset of columns to a specified type using astype()
1 | |
Convert certain columns to a specific dtype by passing a dict to astype()
1 | |
note: loc() method tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the dtype from the right hand side.
object conversion¶
pandas has two ways to store strings.
- object dtype, which can hold any Python object, including strings.
- StringDtype, which is dedicated to strings.
The infer_objects() methods can be used to soft convert to the correct type.
1 | |
- to_numeric() (conversion to numeric dtypes)
- to_datetime() (conversion to datetime objects)
- to_timedelta() (conversion to timedelta objects)
Convert Series datatype to numeric, getting rid of any non-numeric values
1 2 | |
Convert Series datatype to numeric, changing non-numeric values to NaN
1 | |
The to_numeric() provides another argument downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory
1 | |
As these methods apply only to one-dimensional arrays, lists or scalars; they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can apply the function over each column efficiently:
1 | |
Selecting columns based on dtype¶
select_dtypes() has two parameters include and exclude that allow you to say “give me the columns with these dtypes” (include) and/or “give the columns without these dtypes” (exclude).
1 | |
Exploring and Finding Data¶
Get a report of all duplicate records in a DataFrame, based on specific columns
1 2 | |
List unique values in a DataFrame column
1 | |
For each unique value in a DataFrame column, get a frequency count
1 | |
Grab DataFrame rows where column = a specific value
1 | |
Grab DataFrame rows where column value is present in a list
1 2 3 4 | |
Grab DataFrame rows where column value is not present in a list
1 2 3 4 | |
Select from DataFrame using criteria from multiple columns
(use | instead of & to do an OR)
1 | |
Loop through rows in a DataFrame
(if you must)
1 2 | |
Much faster way to loop through DataFrame rows if you can work with tuples
1 2 | |
Get top n for each group of columns in a sorted DataFrame
(make sure DataFrame is sorted first)
1 2 3 | |
Grab DataFrame rows where specific column is null/notnull
1 | |
Select from DataFrame using multiple keys of a hierarchical index
1 2 3 | |
Slice values in a DataFrame column (aka Series)
1 | |
Get quick count of rows in a DataFrame
1 | |
Get length of data in a DataFrame column
1 | |
Updating and Cleaning Data¶
Delete column from DataFrame
1 | |
Rename several DataFrame columns
1 2 3 4 5 | |
Lower-case all DataFrame column names
1 | |
Even more fancy DataFrame column re-naming
lower-case all DataFrame column names (for example)
1 | |
Lower-case everything in a DataFrame column
1 | |
Sort DataFrame by multiple columns
1 2 | |
Change all NaNs to None (useful before loading to a db)
1 | |
More pre-db insert cleanup…make a pass through the DataFrame, stripping whitespace from strings and changing any empty values to None
(not especially recommended but including here b/c I had to do this in real life once)
1 | |
Get rid of non-numeric values throughout a DataFrame:
1 2 3 4 5 6 7 8 9 | |
Clean up missing values in multiple DataFrame columns
1 2 3 4 5 6 7 8 | |
Doing calculations with DataFrame columns that have missing values. In example below, swap in 0 for df[‘col1’] cells that contain null.
1 2 | |
Split delimited values in a DataFrame column into two new columns
1 2 3 | |
Collapse hierarchical column indexes
1 | |
Reshaping, Concatenating, and Merging Data¶
Pivot data (with flexibility about what what becomes a column and what stays a row).
1 2 3 4 | |
Concatenate two DataFrame columns into a new, single column
(useful when dealing with composite keys, for example)
1 | |
Display and formatting¶
Set up formatting so larger numbers aren’t displayed in scientific notation
1 | |
To display with commas and no decimals
1 | |
Creating DataFrames¶
Create a DataFrame from a Python dictionary
1 | |
List unique values in a DataFrame column
1 | |
Grab DataFrame rows where column has certain values
1 2 | |
Grab DataFrame rows where column doesn’t have certain values
1 2 | |
Delete column from DataFrame
1 | |
Select from DataFrame using criteria from multiple columns
(use | instead of & to do an OR)
1 | |
Rename several DataFrame columns
1 2 3 4 5 | |
Lower-case all DataFrame column names
1 | |
Even more fancy DataFrame column re-naming lower-case all DataFrame column names (for example)
1 | |
Loop through rows in a DataFrame (if you must)
1 2 | |
Much faster way to loop through DataFrame rows if you can work with tuples (h/t hughamacmullaniv)
1 2 | |
Next few examples show how to work with text data in Pandas. Full list of .str functions Slice values in a DataFrame column (aka Series)
1 | |
Lower-case everything in a DataFrame column
1 | |
Get length of data in a DataFrame column
1 | |
Sort dataframe by multiple columns
1 | |
Get top n for each group of columns in a sorted dataframe (make sure dataframe is sorted first)
1 | |
Grab DataFrame rows where specific column is null/notnull
1 | |
Select from DataFrame using multiple keys of a hierarchical index
1 | |
Change all NaNs to None (useful before loading to a db)
1 | |
More pre-db insert cleanup…make a pass through the dataframe, stripping whitespace from strings and changing any empty values to None (not especially recommended but including here b/c I had to do this in real life one time)
1 | |
Get quick count of rows in a DataFrame
1 | |
Pivot data (with flexibility about what what becomes a column and what stays a row). Syntax works on Pandas >= .14
1 2 3 4 | |
Change data type of DataFrame column
1 | |
Get rid of non-numeric values throughout a DataFrame:
1 2 | |
Set DataFrame column values based on other column values (h/t: @mlevkov)
1 | |
Clean up missing values in multiple DataFrame columns
1 2 3 4 5 6 7 8 | |
Concatenate two DataFrame columns into a new, single column (useful when dealing with composite keys, for example)
1 | |
Doing calculations with DataFrame columns that have missing values In example below, swap in 0 for df[‘col1’] cells that contain null
1 | |
Split delimited values in a DataFrame column into two new columns
1 | |
Collapse hierarchical column indexes
1 | |
Create a DataFrame from a Python dictionary
1 | |
Get a report of all duplicate records in a dataframe, based on specific columns
1 | |
Set up formatting so larger numbers aren’t displayed in scientific notation (h/t @thecapacity)
1 | |
Pandas Code example¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | |