
From Beginner to Pro: Pandas Hacks for Streamlined Data Processing
1. Data Loading and Handling
- Reading data:
pd.read_csv()
: Read CSV files.pd.read_excel()
: Read Excel files.pd.read_json()
: Read JSON files.- Specify data types for faster reading:
dtype = {'column_name': 'category'}
- Converting to DataFrames:
pd.DataFrame()
: Create DataFrames from lists, dictionaries, or arrays.
2. Data Type Conversion
df.astype({'column_name': 'new_type'})
: Convert column data types (e.g., to numeric or category).pd.to_numeric(errors='coerce')
: Convert to numeric, handling errors gracefully.
3. Datetime Handling
pd.to_datetime()
: Convert columns or DataFrames to datetime format.
4. Aggregation and Grouping
df.groupby(grouping_column)[column_to_aggregate].agg(aggregation_function)
: Group data and apply aggregations.df.describe()
: Get summary statistics for numerical columns.
5. Indexing and Selection
df.set_index(column_name)
: Set a column as the index.df.loc[index_label, column_name]
: Select data by label-based indexing.df.iloc[row_number, column_number]
: Select data by position-based indexing.df.query('condition')
: Select rows based on boolean conditions.
6. Filtering and Cleaning
df.drop(labels, axis = 'rows or columns')
: Drop rows or columns.df.dropna(thresh=threshold)
: Drop rows with a certain number of missing values.df.fillna(value)
: Fill missing values with a specified value.
7. Renaming and Ordering
df.rename(columns={'old_name': 'new_name'})
: Rename columns.df.sort_values(by='column_name')
: Sort DataFrame by a column.df.rate.cat.reorder_categories(['good', 'very good', 'excellent'])
: Order a categorical column.
8. Merging and Joining
pd.merge(left_df, right_df, on='common_column')
: Merge DataFrames based on a common column.pd.concat([df1, df2], axis=0)
: Concatenate DataFrames vertically.
9. Working with Missing Values
df.isna().sum()
: Count missing values in each column.df.interpolate()
: Fill missing values using interpolation.
10. Memory Usage
df.info(memory_usage='deep')
: Get detailed memory usage information.
11. String Manipulation
df.column_name.str.split('_')
: Split strings in a column.df.column_name.str.get(index)
: Extract elements from split strings.
12. Data Transformation
df.pivot_table()
: Create pivot tables to summarize data.df.melt()
: Melt DataFrames from wide to long format.
13. Exporting Data
df.to_csv('filename.csv')
: Save DataFrame as CSV.df.to_excel('filename.xlsx')
: Save DataFrame as Excel.df.to_json('filename.json')
: Save DataFrame as JSON.
14. Additional Useful Functions
df.sample(frac=0.1)
: Get a random sample of rows.df.describe(include='number')
: Describe numerical columns.df.select_dtypes(include=['number', 'category', 'datetime'])
: Select columns by data type.df.prefix("X_")
: Add a prefix to column names.df.suffix("_Y")
: Add a suffix to column names.