Level Up Your Python: Essential Techniques for Efficient Data Manipulation
In this blog post, I share some of the most effective tips and tricks I’ve recently discovered in Python, especially in the realm of data manipulation. These insights are gleaned from my personal experiences and are intended to help both beginners and seasoned programmers alike. Whether it’s finding more efficient ways to handle complex datasets, streamlining your coding process, or learning new approaches to common Python challenges, this article is a culmination of practical knowledge meant to enhance your Python skill set. Join me in exploring these techniques, which I’ve found incredibly useful and am excited to share with you!
1. **kwargs inside a function
def print_kwargs(**kwargs):
    """Prints all keyword arguments passed to the function."""
    print(kwargs)
print_kwargs(goku=9001, krillin=500, picolo=2500)
def extract_kwargs(**kwargs):
    """Prints the keyword argument names and values as two lists."""
    print([*kwargs]) # Prints only the keys (argument names)
    print([*kwargs.values()]) # Prints only the values
extract_kwargs(goku=9001, krillin=500, picolo=2500)
Explanation:
- **kwargs allows you to accept an arbitrary number of keyword arguments in a function.
- It collects them into a dictionary.
- You can access and manipulate them like any other dictionary.
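Because kwargs is an ordinary dictionary, the same ** syntax also works in the other direction: it can unpack an existing dictionary into keyword arguments at the call site. Here is a minimal sketch of that idea; describe_fighter and stats_by_name are hypothetical names made up for illustration.

```python
def describe_fighter(name, **kwargs):
    """Return a one-line summary built from a required name and optional keyword arguments."""
    # kwargs is a plain dict, so sorted() and .items() work on it
    details = ", ".join(f"{key}={value}" for key, value in sorted(kwargs.items()))
    return f"{name}: {details}" if details else name


# Hypothetical dict; ** unpacks it into keyword arguments at the call site
stats_by_name = {"power": 9001, "team": "Z"}
summary = describe_fighter("goku", **stats_by_name)
print(summary)
```

Calling describe_fighter("krillin") with no extra keywords simply returns the name, since kwargs is then an empty dictionary.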
2. *args: collect whatever positional arguments are passed to a function
def add_all(*args):
    """Prints and returns the sum of all arguments passed to the function."""
    print(args) # Prints a tuple of arguments
    return sum(args)
result = add_all(1, 2, 3, 4, 5)
print(result) # Output: 15
def print_args(*args):
    """Prints all arguments passed to the function."""
    print(args)
print_args('Ahmed', 'is', 'Amazzzing')
Explanation:
- *args allows you to accept an arbitrary number of positional arguments in a function.
- It collects them into a tuple.
- You can access and manipulate them like any other tuple.
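The * operator also works at the call site, where it spreads a list or tuple into separate positional arguments. A small sketch of both directions follows; scale_all is a hypothetical helper, not something from a library.

```python
def scale_all(factor, *args):
    """Multiply every extra positional argument by factor and return the results as a list."""
    # args is a tuple, so it supports len(), indexing, and iteration
    return [factor * value for value in args]


values = [1, 2, 3]
# The * operator at the call site unpacks the list into separate positional arguments
scaled = scale_all(10, *values)
print(scaled)
```

If no extra arguments are passed, args is an empty tuple and the function returns an empty list.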
3. all(): returns True if all values are True; any(): returns True if any value is True
import numpy as np
power = np.array([100, 4000, 150, 9001, 1500])
# Check if all elements in the array are greater than 9000
print(all(power > 9000)) # Output: False
# Check if any element in the array is greater than 9000
print(any(power > 9000)) # Output: True
Explanation:
- all(iterable) returns True only if all elements in the iterable are True.
- any(iterable) returns True if any element in the iterable is True.
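Both functions also accept generator expressions, which lets them stop at the first element that decides the answer. The sketch below uses a plain list instead of a NumPy array, and also shows the edge case of an empty iterable.

```python
power = [100, 4000, 150, 9001, 1500]

# Generator expressions let all() and any() stop at the first decisive element
all_over_9000 = all(p > 9000 for p in power)
any_over_9000 = any(p > 9000 for p in power)
print(all_over_9000, any_over_9000)

# Edge case: with no elements, all() is True and any() is False
empty_all = all([])
empty_any = any([])
print(empty_all, empty_any)
```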
4. Zip function
log = [4, 5, 7, 8]
lat = [1, 9, 7, 5]
# Iterate through both lists simultaneously and add corresponding elements
for a, b in zip(log, lat):
    print(a + b) # Output: 5, 14, 14, 13 (one value per line)
# Create a list of pairs from two lists
zipped = [*zip(log, lat)]
print(zipped) # Output: [(4, 1), (5, 9), (7, 7), (8, 5)]
# Handle uneven lists with `zip_longest` (it lives in itertools, so import it first)
from itertools import zip_longest
short = [4, 5, 7, 8]
long = [1, 9, 7, 5, 1, 9]
# Use `fillvalue` to fill in missing elements
zipped_longest = [*zip_longest(short, long, fillvalue=None)]
print(zipped_longest) # Output: [(4, 1), (5, 9), (7, 7), (8, 5), (None, 1), (None, 9)]
Explanation:
- zip(iterable1, iterable2) creates an iterator that pairs corresponding elements from two iterables.
- You can use the * operator to unpack the iterator into a list.
- zip_longest allows handling lists of different lengths by padding with a fillvalue.
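Two more patterns that build on the same pairing idea are sketched here: turning two lists into a dictionary, and "unzipping" a list of pairs back into separate sequences. The names and levels lists are illustrative values only.

```python
from itertools import zip_longest

names = ["goku", "krillin", "picolo"]
levels = [9001, 500, 2500]

# Pair names with levels and build a dictionary from the pairs
level_by_name = dict(zip(names, levels))

# "Unzip": the * operator feeds each pair to zip, which regroups them by position
pairs = list(zip(names, levels))
unzipped_names, unzipped_levels = zip(*pairs)

# zip stops at the shortest input; zip_longest pads the shorter one instead
truncated = list(zip([1, 2, 3], "ab"))
padded = list(zip_longest([1, 2, 3], "ab", fillvalue="?"))
print(level_by_name, truncated, padded)
```

Note that plain zip silently drops the unmatched 3 in the last example, which is easy to miss when the inputs are supposed to have equal lengths.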
5. How to create data
import numpy as np
import pandas as pd
from scipy import stats
# Set a random seed for reproducibility
np.random.seed(42)
# List of possible voter races
races = ["asian", "black", "hispanic", "other", "white"]
# Generate random voter races with probabilities
voter_race = np.random.choice(a=races, p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000)
# Generate random voter ages from a Poisson distribution (mu=30) shifted by loc=18, so the minimum age is 18
voter_age = stats.poisson.rvs(loc=18, mu=30, size=1000)
# Create a DataFrame with voter data
df = pd.DataFrame({"race": voter_race, "age": voter_age})
Explanation:
- np.random.choice can be used to randomly select elements from a list with specified probabilities.
- stats.poisson.rvs generates random values from a Poisson distribution.
- pd.DataFrame is used to create a structured data table from lists or dictionaries.
6. ANOVA test
# Select data by race
asian = df.loc[df.race == "asian", "age"]
black = df.loc[df.race == "black", "age"]
hispanic = df.loc[df.race == "hispanic", "age"]
other = df.loc[df.race == "other", "age"]
white = df.loc[df.race == "white", "age"]
# Perform ANOVA test to compare means across races (returns the F statistic and the p-value)
print(stats.f_oneway(asian, black, hispanic, other, white))
Explanation:
- ANOVA (Analysis of Variance) tests whether the means of several groups are statistically different.
- This code selects data for each race and then performs an ANOVA test on the age variable.
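To make the F statistic less of a black box, here is a rough pure-Python sketch of how it is computed: the variance between group means divided by the variance within groups. The one_way_f function is a hypothetical helper written for illustration, it uses toy numbers rather than the voter data, and unlike stats.f_oneway it does not compute a p-value.

```python
from statistics import mean


def one_way_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups of numbers."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data, not the voter data from the post
f_stat = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(round(f_stat, 2))
```

A large F value means the group means differ by more than the within-group noise would suggest; the p-value reported by stats.f_oneway tells you how surprising that F would be if all groups shared one mean.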
7. Tukey’s test
# Import Tukey's HSD test from statsmodels, plus matplotlib for the plot
import matplotlib.pyplot as plt
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Perform Tukey's HSD test for pairwise comparisons
tukeys = pairwise_tukeyhsd(endog=voter_age, groups=voter_race, alpha=0.05)
# Plot the simultaneous comparisons
tukeys.plot_simultaneous()
# Add a vertical line to highlight a specific comparison (optional)
plt.vlines(x=49.66, ymin=-0.6, ymax=4.5, color='red', linestyle='dashdot')
# Print summary of comparisons
print(tukeys.summary())
Explanation:
- Tukey’s HSD test is a post-hoc test used to compare the means of every pair of groups after a significant ANOVA result.
- This code performs Tukey’s test on the voter age data grouped by race and generates a plot and summary table of the comparisons.
8. Handle / Parse Dates
# Example date string
date = "15 July 2015"
# Parse date string to DateTime object
parsed_date = pd.to_datetime(date)
print(parsed_date) # Output: 2015-07-15 00:00:00
# Example date string with time
date_time = "12:30:15 15:07:2015"
# Parsing without format will fail
try:
    pd.to_datetime(date_time)
except ValueError:
    print("Unable to parse date with default format")
# Use specific format string for date-time
parsed_date_time = pd.to_datetime(date_time, format="%H:%M:%S %d:%m:%Y")
print(parsed_date_time) # Output: 2015-07-15 12:30:15
# Consult documentation for different format specifiers
Explanation:
- pd.to_datetime converts a string to a pandas Timestamp (a datetime object).
- You need to specify the correct format string if the date format is not one that pandas can infer on its own.
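The format specifiers (%H, %d, %Y and so on) are the same ones used by Python's built-in datetime module, so you can experiment with them without pandas. This is a small sketch using the standard library only; it is a stand-in for illustration, not a replacement for pd.to_datetime.

```python
from datetime import datetime

date_time = "12:30:15 15:07:2015"
fmt = "%H:%M:%S %d:%m:%Y"  # hour:minute:second day:month:year

# strptime parses text into a datetime; strftime formats it back into text
parsed = datetime.strptime(date_time, fmt)
iso_text = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(iso_text)

# A string that does not match the format raises ValueError
try:
    datetime.strptime("15 July 2015", fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```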
9. Describe string variables
# Describe only string columns
description = df.select_dtypes(include='object').describe()
print(description)
# Same result in one call: include='object' restricts describe() to string columns
description = df.describe(include='object')
print(description)
Explanation:
- describe() generates summary statistics for numerical columns by default.
- You can use include='object' to get descriptive statistics (count, unique, top, freq) for string columns instead, or include='all' to cover every column.
10. Convert to categorical
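# Note: df here is assumed to be a different DataFrame with `Survived` (two values) and `Pclass` (three values) columns
# Convert `Survived` column to categorical variable with custom labels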
df['Survived'] = pd.Categorical(df['Survived']).rename_categories(['Died', 'Survived'])
# Convert `Pclass` to ordered categorical variable with custom labels
df['Pclass'] = pd.Categorical(df['Pclass'], ordered=True).rename_categories(['Class1', 'Class2', 'Class3'])
# Ordered categorical data allows comparisons like "<" or ">"
print(df['Pclass'] < df['Pclass'].max())
Explanation:
- pd.Categorical converts a column to a categorical data type.
- You can set custom labels and order for categories.
- Ordered categorical variables allow comparisons and calculations based on their order.
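Here is a self-contained sketch of the ordered case, so you can run it without loading a dataset. The pclass Series is a hypothetical stand-in with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the Pclass column (1 = first class, 3 = third class)
pclass = pd.Series([3, 1, 2, 3])

ordered = pd.Categorical(pclass, ordered=True).rename_categories(
    ["Class1", "Class2", "Class3"]
)
print(list(ordered.categories))

# The renamed labels keep the original sort order, so Class1 < Class2 < Class3
below_third = ordered < "Class3"
print(below_third.tolist())
```

The new labels are assigned by position, so the list passed to rename_categories must have one label per existing category, in the categories' sorted order.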
11. Merge data
# Merge two DataFrames based on a common column
df1_merged = pd.merge(df1, df2, how='inner', on='id')
# Different join types:
# `inner`: Keep rows where both DataFrames have matching values in the join column.
# `left`: Keep all rows from left DataFrame and matching rows from right DataFrame.
# `right`: Keep all rows from right DataFrame and matching rows from left DataFrame.
# `outer`: Keep all rows from both DataFrames, regardless of matching values.
# You can also specify multiple join columns using a list.
Explanation:
- pd.merge combines two DataFrames based on a shared column.
- Different join options determine how rows are kept based on matching values.
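The difference between the join types is easiest to see by counting rows. The sketch below builds two small, hypothetical tables that share an id column (ids 2 and 3 appear in both) and runs each join type on them.

```python
import pandas as pd

# Hypothetical tables that share an "id" column
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

# Count the rows each join type keeps
rows = {how: len(pd.merge(df1, df2, how=how, on="id"))
        for how in ["inner", "left", "right", "outer"]}
print(rows)

# indicator=True adds a _merge column showing where each row came from
outer = pd.merge(df1, df2, how="outer", on="id", indicator=True)
print(outer[["id", "_merge"]])
```

The _merge column is a quick way to audit a join: rows marked left_only or right_only are the ones that had no match in the other table.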
12. Working directories
import os
# Get current working directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")
# Change working directory
os.chdir('C:\\Users\\AhMeD DaWooD\\Desktop')
new_dir = os.getcwd()
print(f"New directory: {new_dir}")
# List files and directories in current directory
files = os.listdir()
print(f"Files and directories: {files}")
Explanation:
- The os module provides tools for interacting with the operating system.
- os.getcwd() returns the current working directory.
- os.chdir(path) changes the current working directory.
- os.listdir() lists the files and directories in the current directory.
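Since os.chdir changes the directory for the whole process, it is a good habit to switch back when you are done. The following sketch uses a temporary directory (so it runs on any machine) and a try/finally block to restore the starting directory; the notes.txt file name is just an example.

```python
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    try:
        # Create a file so os.listdir() has something to report
        with open("notes.txt", "w") as handle:
            handle.write("hello")
        listing = os.listdir()
    finally:
        # Always return to the starting directory, even if something fails
        os.chdir(original_dir)

print(listing)
```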