From Beginner to Pro: Pandas Hacks for Streamlined Data Processing
1. Data Loading and Handling
- Reading data:
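pd.read_csv(): Read CSV files.
pd.read_excel(): Read Excel files.
pd.read_json(): Read JSON files.
- Specify data types for faster reading:
dtype={'column_name': 'category'}: Pass this mapping as the dtype argument of pd.read_csv() so pandas skips type inference for that column.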
- Converting to DataFrames:
pd.DataFrame(): Create DataFrames from lists, dictionaries, or arrays.
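The loading calls above can be tried without any files on disk. This is a minimal sketch, assuming a small in-memory CSV; the column names `city` and `sales` and the sample values are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = "city,sales\nParis,10\nLima,20\nParis,30\n"

# Declaring dtypes up front means pandas does not have to infer them.
df = pd.read_csv(StringIO(csv_text), dtype={"city": "category", "sales": "int32"})

# pd.DataFrame() accepts a dict of columns or a list of row dicts.
from_dict = pd.DataFrame({"city": ["Paris", "Lima"], "sales": [10, 20]})
from_rows = pd.DataFrame([{"city": "Paris", "sales": 10}, {"city": "Lima", "sales": 20}])
```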
2. Data Type Conversion
df.astype({'column_name': 'new_type'}): Convert column data types (e.g., to numeric or category).
pd.to_numeric(df['column_name'], errors='coerce'): Convert to numeric, replacing values that cannot be parsed with NaN instead of raising an error.
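A short sketch of both conversions, assuming a hypothetical `price` column that contains one unparseable entry and a `grade` column with repeated labels:

```python
import pandas as pd

df = pd.DataFrame({"price": ["1.5", "2", "n/a"], "grade": ["a", "b", "a"]})

# errors="coerce" turns values that cannot be parsed into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# astype takes a mapping of column name to target dtype.
df = df.astype({"grade": "category"})
```

Without `errors="coerce"`, the `"n/a"` entry would make `pd.to_numeric` raise.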
3. Datetime Handling
pd.to_datetime(): Convert a column of date strings or numbers to datetime, or assemble datetimes from a DataFrame with year, month, and day columns.
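As an illustration, the sketch below parses a hypothetical `when` column and then reads a component back through the `.dt` accessor. The `errors="coerce"` argument is an addition here, chosen so the bad entry becomes NaT rather than raising.

```python
import pandas as pd

df = pd.DataFrame({"when": ["2024-01-15", "2024-02-20", "not a date"]})

# errors="coerce" maps strings that cannot be parsed to NaT.
df["when"] = pd.to_datetime(df["when"], errors="coerce")

# The .dt accessor exposes components of a datetime column.
df["month"] = df["when"].dt.month
```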
4. Aggregation and Grouping
df.groupby(grouping_column)[column_to_aggregate].agg(aggregation_function): Group data and apply aggregations.
df.describe(): Get summary statistics for numerical columns.
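The grouping pattern can be sketched on a toy table; the `team` and `score` columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 3, 10]})

# One aggregation per group; the result is a Series indexed by team.
mean_by_team = df.groupby("team")["score"].agg("mean")

# A list of functions produces one output column per function.
stats = df.groupby("team")["score"].agg(["min", "max"])

# Summary statistics (count, mean, std, quartiles) for numeric columns.
summary = df.describe()
```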
5. Indexing and Selection
df.set_index(column_name): Set a column as the index.
df.loc[index_label, column_name]: Select data by label-based indexing.
df.iloc[row_number, column_number]: Select data by position-based indexing.
df.query('condition'): Select rows that satisfy a boolean condition.
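The difference between label-based and position-based selection is easiest to see side by side. A minimal sketch, with made-up names and ages:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["ann", "bob", "cy"], "age": [31, 25, 40], "city": ["Oslo", "Rome", "Oslo"]}
)

# set_index returns a new DataFrame, so the result is assigned back.
df = df.set_index("name")

by_label = df.loc["bob", "age"]   # row label "bob", column "age"
by_position = df.iloc[0, 1]       # first row, second column ("city")
older = df.query("age > 30")      # rows where the expression is true
```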
6. Filtering and Cleaning
df.drop(labels, axis='index') or df.drop(labels, axis='columns'): Drop rows or columns by label.
df.dropna(thresh=threshold): Keep only rows with at least threshold non-missing values and drop the rest.
df.fillna(value): Fill missing values with a specified value.
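The `thresh` argument is easy to misread: it counts the non-missing values a row needs in order to be kept. A small sketch with hypothetical columns `a`, `b`, and `c`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [1.0, 2.0, np.nan], "c": [1.0, 2.0, 3.0]}
)

no_c = df.drop(columns="c")      # equivalent to df.drop("c", axis="columns")
no_first_row = df.drop(index=0)  # equivalent to df.drop(0, axis="index")

# thresh=2 keeps rows that have at least two non-missing values.
kept = df.dropna(thresh=2)

# Replace every remaining missing value with 0.
filled = df.fillna(0)
```

Here the last row has only one non-missing value, so `dropna(thresh=2)` removes it.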
7. Renaming and Ordering
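df.rename(columns={'old_name': 'new_name'}): Rename columns.
df.sort_values(by='column_name'): Sort DataFrame by a column.
df.rate.cat.reorder_categories(['good', 'very good', 'excellent']): Set the order of the categories in a categorical column (here, a column named rate). The list must contain exactly the existing categories.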
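A sketch of how category order affects sorting, reusing the `rate` column from the example above. The `n` and `votes` column names are hypothetical, and `ordered=True` is an optional extra that also enables comparisons between categories.

```python
import pandas as pd

df = pd.DataFrame({"rate": ["very good", "good", "excellent"], "n": [2, 1, 3]})

df = df.rename(columns={"n": "votes"})

# The column must be categorical before the .cat accessor is available.
df["rate"] = df["rate"].astype("category")
df["rate"] = df["rate"].cat.reorder_categories(
    ["good", "very good", "excellent"], ordered=True
)

# Sorting now follows the category order rather than alphabetical order.
by_rate = df.sort_values(by="rate")
```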
8. Merging and Joining
pd.merge(left_df, right_df, on='common_column'): Merge DataFrames based on a common column.
pd.concat([df1, df2], axis=0): Concatenate DataFrames vertically.
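The sketch below shows the default inner merge next to a left join, using two small hypothetical tables that share an `id` column. The `how` and `ignore_index` arguments are not mentioned above and are included only to show the common variations.

```python
import pandas as pd

left_df = pd.DataFrame({"id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
right_df = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# The default how="inner" keeps only ids present in both frames.
merged = pd.merge(left_df, right_df, on="id")

# how="left" keeps every left row and fills unmatched scores with NaN.
left_joined = pd.merge(left_df, right_df, on="id", how="left")

# Stack rows vertically; ignore_index=True renumbers the result from 0.
stacked = pd.concat([left_df, left_df], axis=0, ignore_index=True)
```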
9. Working with Missing Values
df.isna().sum(): Count missing values in each column.
df.interpolate(): Fill missing values using interpolation.
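A minimal sketch of counting and interpolating, assuming two hypothetical numeric columns. With default settings, interpolation is linear and fills forward only, so gaps at the start of a column stay missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 5.0]})

# Number of missing values per column.
missing_per_column = df.isna().sum()

# Linear interpolation between known neighbours.
interpolated = df.interpolate()
```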
10. Memory Usage
df.info(memory_usage='deep'): Get detailed memory usage information.
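To see why memory reporting matters, the sketch below compares a repetitive text column with its categorical equivalent. The `color` column and its values are made up; `memory_usage(deep=True)` is used to get the byte counts as numbers.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"] * 25_000})

# Prints per-column dtypes and the total memory footprint.
df.info(memory_usage="deep")

# Bytes used by the column as text versus as a category.
as_text = df["color"].memory_usage(deep=True)
as_category = df["color"].astype("category").memory_usage(deep=True)
```

For columns with few distinct values, the categorical version is typically much smaller.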
11. String Manipulation
df.column_name.str.split('_'): Split strings in a column.
df.column_name.str.get(index): Extract elements from split strings.
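The two string calls are usually chained. A sketch with a hypothetical `code` column; the `expand=True` variant is an extra not mentioned above.

```python
import pandas as pd

df = pd.DataFrame({"code": ["us_east_1", "eu_west_2"]})

# Each value becomes a list of pieces.
parts = df["code"].str.split("_")

# .str.get(i) picks the i-th piece from every list.
df["region"] = parts.str.get(0)

# expand=True spreads the pieces into separate columns instead.
wide = df["code"].str.split("_", expand=True)
```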
12. Data Transformation
df.pivot_table(): Create pivot tables to summarize data.
df.melt(): Melt DataFrames from wide to long format.
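Pivoting and melting are roughly inverse operations. The sketch below goes from long to wide and back, assuming hypothetical `store`, `month`, and `sales` columns.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b"],
        "month": ["jan", "feb", "jan", "feb"],
        "sales": [10, 20, 30, 40],
    }
)

# Long to wide: one row per store, one column per month.
wide = long_df.pivot_table(index="store", columns="month", values="sales", aggfunc="sum")

# Wide to long: month columns are folded back into rows.
back_to_long = wide.reset_index().melt(
    id_vars="store", var_name="month", value_name="sales"
)
```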
13. Exporting Data
df.to_csv('filename.csv'): Save DataFrame as CSV.
df.to_excel('filename.xlsx'): Save DataFrame as Excel.
df.to_json('filename.json'): Save DataFrame as JSON.
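A round-trip sketch for the CSV and JSON writers, using a temporary directory so nothing is left behind. The `index=False` and `orient="records"` arguments are choices made for this example. Excel is left out because `to_excel` needs a separate engine such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "filename.csv")
    json_path = os.path.join(tmp, "filename.json")

    # index=False leaves the row index out of the CSV file.
    df.to_csv(csv_path, index=False)

    # orient="records" writes a JSON list with one object per row.
    df.to_json(json_path, orient="records")

    # Read both files back while the directory still exists.
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)
```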
14. Additional Useful Functions
df.sample(frac=0.1): Get a random sample of 10% of the rows.
df.describe(include='number'): Describe numerical columns.
df.select_dtypes(include=['number', 'category', 'datetime']): Select columns by data type.
df.add_prefix("X_"): Add a prefix to column names.
df.add_suffix("_Y"): Add a suffix to column names.
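A quick sketch of these helpers on a toy table. It assumes the prefix and suffix helpers meant here are `DataFrame.add_prefix` and `DataFrame.add_suffix`, which are the pandas methods that rename columns this way. The column names are hypothetical, and `random_state` is added only to make the sample repeatable.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "n": range(10),
        "label": pd.Categorical(list("ababababab")),
        "text": ["t"] * 10,
    }
)

# Half of the rows, chosen at random but reproducibly.
sampled = df.sample(frac=0.5, random_state=0)

# Keep only the numeric and categorical columns.
numeric_or_cat = df.select_dtypes(include=["number", "category"])

# Both methods return a new DataFrame with renamed columns.
prefixed = df.add_prefix("X_")
suffixed = df.add_suffix("_Y")
```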