Pandas library for Data Science

Explore the pandas library for Python, a powerful tool for data manipulation and analysis with high-level data structures and functions.

Saartje Ly

Data Engineering Intern

April 3, 2024

Introduction

The pandas library is a powerful and widely-used Python library for data manipulation and analysis. It provides high-level data structures and functions designed to make working with structured data fast, easy, and expressive.

Installing the pandas library

Windows:

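1. Open Command Prompt (cmd).

2. Type python -m pip install pandas and press Enter.

Linux or macOS:

1. Open a terminal.

2. Type pip install pandas and press Enter. If pip is not on your PATH, python3 -m pip install pandas does the same thing.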

Load in pandas library using an alias

import pandas as pd
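A quick way to confirm the installation worked is to import the library and print its version. This is only a minimal sketch; the version string you see depends on what pip installed.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version is a plain string.
installed_version = pd.__version__
print(installed_version)
```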

Reading in data

pd.read_csv('data.txt', delimiter=…)

Here 'data.txt' is the file path to your data, passed as a string. A bare filename is resolved relative to the current working directory, so it works when you run your script from the folder that contains the data.

The separator is ',' (comma) by default, but you can pass delimiter (or its equivalent, sep) if your values are separated by something else, such as a tab or a semicolon.
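As a rough illustration of the delimiter argument, the sketch below reads semicolon-separated values. The sample data and the name scores are made up for this example, and an in-memory buffer stands in for a real file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample data; io.StringIO stands in for a file on disk.
raw = io.StringIO("name;score\nAda;90\nLin;85\n")

# The values are separated by semicolons, so the delimiter is passed explicitly.
scores = pd.read_csv(raw, delimiter=";")
print(scores)
```

With a real file you would pass its path as a string in place of the buffer.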

Turn data into a pandas DataFrame

data=…

df = pd.DataFrame(data)

Access headers of a pd.DataFrame

df.columns
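One common way to build a DataFrame is from a dictionary of lists, where each key becomes a column header. The sketch below assumes that shape; the fruit and price values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column header.
data = {"fruit": ["apple", "pear"], "price": [1.2, 0.8]}
df = pd.DataFrame(data)

# df.columns is an Index object; list() turns it into a plain Python list.
headers = list(df.columns)
print(headers)
```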

Selecting columns

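df['column_name'] selects a single column and returns it as a Series.

df[['column_name1', 'column_name2', 'column_name3']] selects three columns and returns a DataFrame. Note the double brackets: the column names are passed as a list.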
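The difference between single and double brackets can be seen in a small sketch. The column names a, b, and c are placeholders chosen for this example.

```python
import pandas as pd

# Hypothetical three-column frame.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one = df["a"]              # single brackets: a Series
several = df[["a", "c"]]   # double brackets: a DataFrame with two columns

print(type(one).__name__)
print(type(several).__name__)
```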

Summary statistics

df.describe() gives the count, mean, std (standard deviation), min, 25th, 50th, and 75th percentiles, and max for each numeric column in df.
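To make the shape of the result concrete, here is a minimal sketch on invented data. It assumes a frame with one text column and one numeric column; by default only the numeric column appears in the summary.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df = pd.DataFrame({"label": ["x", "y", "z"], "value": [1.0, 2.0, 3.0]})

# The result is itself a DataFrame: statistics as rows, numeric columns as columns.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
mean_value = summary.loc["mean", "value"]
```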

Sorting values

df.sort_values('Column') returns a copy of df with its rows sorted by 'Column' in ascending order. Pass ascending=False to sort in descending order.
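A short sketch of sorting in both directions, using made-up names and scores. It also shows that the original frame keeps its order, since sort_values returns a new object by default.

```python
import pandas as pd

# Hypothetical unsorted data.
df = pd.DataFrame({"name": ["c", "a", "b"], "score": [30, 10, 20]})

by_score = df.sort_values("score")                        # ascending (default)
by_score_desc = df.sort_values("score", ascending=False)  # descending

print(by_score)
print(by_score_desc)
```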

Dropping columns

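df.drop(columns=['column_name1', 'column_name2']) returns a new DataFrame without the listed columns. df itself is unchanged unless you assign the result back to it.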
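The following sketch drops two columns from an invented frame and checks that the original is left alone. The names df and trimmed are placeholders for this example.

```python
import pandas as pd

# Hypothetical frame with three columns.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# drop returns a new DataFrame; the original keeps all of its columns.
trimmed = df.drop(columns=["b", "c"])

print(list(trimmed.columns))
print(list(df.columns))
```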

Output a CSV

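df.to_csv('output_name') writes the contents of df to a CSV file called 'output_name'. pandas uses the name exactly as given, so include the .csv extension if you want one. The row index is written as the first column by default; pass index=False to leave it out.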
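As a hedged example of a write followed by a read, the sketch below saves an invented frame to a temporary folder and loads it back. The file name output.csv and the temporary directory are assumptions made so the snippet does not touch your working folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data to write out.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write into a temporary directory; index=False leaves out the row index.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
df.to_csv(out_path, index=False)

# Reading the file back should give the same columns and values.
round_trip = pd.read_csv(out_path)
print(round_trip)
```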

Integer location based indexing

Use df.iloc[row_index, column_index] to access data in a DataFrame by its numerical position.

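row_index and column_index may each be an integer position or a slice. Positions start at 0.

df.iloc[1] gives everything in the second row (position 1); df.iloc[0] gives the first row.

df.iloc[1:4] gives the rows at positions 1, 2, and 3; the end of the slice is excluded.

df.iloc[:, :].sum(axis=1) sums the values across each row (horizontally), returning one total per row.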
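Because positions are zero-based and slices exclude their end, a small worked sketch can help. The frame below is invented, with columns x and y chosen so each result is easy to check by eye.

```python
import pandas as pd

# Hypothetical five-row frame.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})

second_row = df.iloc[1]     # position 1 is the second row: x=2, y=20
middle = df.iloc[1:4]       # positions 1, 2, 3 (position 4 is excluded)
cell = df.iloc[0, 1]        # first row, second column: 10

# Sum across the columns of each row, giving one total per row.
row_totals = df.iloc[:, :].sum(axis=1)
print(row_totals)
```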

Label based indexing

Use df.loc[row_index, column_index] to access data in a DataFrame by its labels.

df.loc['A', :] gives the row (or rows) with index label 'A' and all columns.

df.loc[df['column1'] == "hi"] gives all rows where column1 == "hi"
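The sketch below shows label lookup and boolean filtering side by side. The index labels A, B, and C and the column values are invented. Note that, unlike iloc, a label slice with loc includes its end label.

```python
import pandas as pd

# Hypothetical frame with string labels as the row index.
df = pd.DataFrame(
    {"column1": ["hi", "bye", "hi"], "column2": [1, 2, 3]},
    index=["A", "B", "C"],
)

row_a = df.loc["A", :]                  # the row labelled 'A', all columns
first_two = df.loc["A":"B"]             # label slices include the end label
matches = df.loc[df["column1"] == "hi"]  # rows where the condition is True

print(matches)
```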

Grouping data

df.groupby() splits a df into groups based on the values in one or more columns, so that a function can then be applied to each group.

import pandas as pd

data = { 'Category': ['A', 'A', 'B', 'B', 'A'], 'Value': [10, 20, 30, 40, 50] }
df = pd.DataFrame(data)
# Sum 'Value' within each 'Category': A -> 80, B -> 70
grouped_df = df.groupby('Category').sum()
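Other aggregations work the same way as sum. The sketch below reuses the Category and Value data from this section and selects the Value column before aggregating, which returns a Series indexed by category. The names totals and means are placeholders.

```python
import pandas as pd

data = {"Category": ["A", "A", "B", "B", "A"], "Value": [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Select the column to aggregate, then apply a function per group.
totals = df.groupby("Category")["Value"].sum()   # A: 10 + 20 + 50, B: 30 + 40
means = df.groupby("Category")["Value"].mean()   # average Value per category

print(totals)
print(means)
```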
