From Beginner to Pro: Pandas Hacks for Streamlined Data Processing


1. Data Loading and Handling

  • Reading data:
    • pd.read_csv(): Read CSV files.
    • pd.read_excel(): Read Excel files.
    • pd.read_json(): Read JSON files.
    • Specify data types for faster reading: dtype = {'column_name': 'category'}
  • Converting to DataFrames:
    • pd.DataFrame(): Create DataFrames from lists, dictionaries, or arrays.

2. Data Type Conversion

  • df.astype({'column_name': 'new_type'}): Convert column data types (e.g., to numeric or category).
  • pd.to_numeric(errors='coerce'): Convert to numeric, handling errors gracefully.

3. Datetime Handling

  • pd.to_datetime(): Convert columns or DataFrames to datetime format.

4. Aggregation and Grouping

  • df.groupby(grouping_column)[column_to_aggregate].agg(aggregation_function): Group data and apply aggregations.
  • df.describe(): Get summary statistics for numerical columns.

5. Indexing and Selection

  • df.set_index(column_name): Set a column as the index.
  • df.loc[index_label, column_name]: Select data by label-based indexing.
  • df.iloc[row_number, column_number]: Select data by position-based indexing.
  • df.query('condition'): Select rows based on boolean conditions.

6. Filtering and Cleaning

  • df.drop(labels, axis = 'rows or columns'): Drop rows or columns.
  • df.dropna(thresh=threshold): Drop rows with a certain number of missing values.
  • df.fillna(value): Fill missing values with a specified value.

7. Renaming and Ordering

  • df.rename(columns={'old_name': 'new_name'}): Rename columns.
  • df.sort_values(by='column_name'): Sort DataFrame by a column.
  • df.rate.cat.reorder_categories(['good', 'very good', 'excellent']): Order a categorical column.

8. Merging and Joining

  • pd.merge(left_df, right_df, on='common_column'): Merge DataFrames based on a common column.
  • pd.concat([df1, df2], axis=0): Concatenate DataFrames vertically.

9. Working with Missing Values

  • df.isna().sum(): Count missing values in each column.
  • df.interpolate(): Fill missing values using interpolation.

10. Memory Usage

  • df.info(memory_usage='deep'): Get detailed memory usage information.

11. String Manipulation

  • df.column_name.str.split('_'): Split strings in a column.
  • df.column_name.str.get(index): Extract elements from split strings.

12. Data Transformation

  • df.pivot_table(): Create pivot tables to summarize data.
  • df.melt(): Melt DataFrames from wide to long format.

13. Exporting Data

  • df.to_csv('filename.csv'): Save DataFrame as CSV.
  • df.to_excel('filename.xlsx'): Save DataFrame as Excel.
  • df.to_json('filename.json'): Save DataFrame as JSON.

14. Additional Useful Functions

  • df.sample(frac=0.1): Get a random sample of rows.
  • df.describe(include='number'): Describe numerical columns.
  • df.select_dtypes(include=['number', 'category', 'datetime']): Select columns by data type.
  • df.prefix("X_"): Add a prefix to column names.
  • df.suffix("_Y"): Add a suffix to column names.