Pandas library for Data Science
Explore the pandas library for Python, a powerful tool for data manipulation and analysis with high-level data structures and functions.
Saartje Ly
Data Engineering Intern
April 3, 2024
Introduction
The pandas library is a powerful and widely-used Python library for data manipulation and analysis. It provides high-level data structures and functions designed to make working with structured data fast, easy, and expressive.
Installing the pandas library
Windows:
1. Open Command Prompt (cmd)
2. Type python -m pip install pandas
Linux or macOS:
1. Open a terminal
2. Type pip install pandas
Import the pandas library using an alias
import pandas as pd
Reading in data
pd.read_csv('data.txt', delimiter=…)
Where 'data.txt' is the file path to your data, passed as a string. If the file is in your current working directory (often the same folder as your .py file), the filename alone is enough.
The delimiter is ',' (comma) by default, but you can specify a different one if the values in your file are separated by something else.
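As a minimal sketch of reading delimited data, the example below parses semicolon-separated text. The data and the column names are made up for illustration, and io.StringIO stands in for a file on disk so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data held in memory.
# In practice you would pass a file path string instead of a StringIO object.
raw_text = "name;age\nAda;36\nLinus;54\n"

# delimiter tells read_csv which character separates the values.
df = pd.read_csv(io.StringIO(raw_text), delimiter=";")
print(df)
```

The first line of the text is used as the column headers, so df ends up with two columns and two rows.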
Turn data into a pandas DataFrame
data=…
df = pd.DataFrame(data)
Access headers of a pd.DataFrame
df.columns
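A short sketch of building a DataFrame and reading its headers, assuming the data is a dictionary of equal-length lists. The column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column header.
data = {"name": ["x", "y", "z"], "score": [1, 2, 3]}
df = pd.DataFrame(data)

# df.columns is an Index object; convert it to a plain list if you need one.
headers = list(df.columns)
print(headers)
```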
Selecting columns
df['column_name']
to select a specific column
df[['column_name1', 'column_name2', 'column_name3']]
to select 3 specific columns
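The difference between the two forms can be sketched as follows, using a made-up DataFrame: single brackets with one name return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame used only to illustrate column selection.
df = pd.DataFrame({"name": ["a", "b"], "x": [1, 2], "y": [3, 4]})

one_column = df["x"]          # one column name -> pandas Series
two_columns = df[["x", "y"]]  # list of column names -> pandas DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```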
Summary statistics
df.describe()
gives count, mean, std, min, 25%, 50%, 75%, and max for each numerical column in df
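As an illustration with made-up values, the sketch below shows that describe() summarises only the numerical column by default and that individual statistics can be read from the result by label.

```python
import pandas as pd

# Hypothetical DataFrame with one text column and one numerical column.
df = pd.DataFrame({"label": ["a", "b", "c", "d"], "value": [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()
print(summary)

# The summary is itself a DataFrame: statistics are rows, columns are columns.
mean_value = summary.loc["mean", "value"]
```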
Sorting values
df.sort_values('Column')
returns a copy of df with its rows sorted by the values of 'Column' in ascending order. df itself is not changed.
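A minimal sketch of sorting, using a hypothetical DataFrame. It also shows the ascending=False option for descending order.

```python
import pandas as pd

# Hypothetical DataFrame with an unsorted 'Column'.
df = pd.DataFrame({"Column": [3, 1, 2], "other": ["c", "a", "b"]})

ascending_df = df.sort_values("Column")                    # 1, 2, 3
descending_df = df.sort_values("Column", ascending=False)  # 3, 2, 1

print(ascending_df)
print(descending_df)
```

Both calls return new DataFrames, so assign the result to a variable if you want to keep it.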
Dropping columns
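df.drop(columns=['column_name1', 'column_name2'])
returns a copy of df without the listed columns. The list may contain one or more column names, and df itself is not changed.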
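For illustration, the sketch below drops two columns from a made-up DataFrame and shows that the original is left untouched.

```python
import pandas as pd

# Hypothetical DataFrame with three columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame; assign it to keep the result.
trimmed = df.drop(columns=["b", "c"])

print(list(trimmed.columns))
print(list(df.columns))
```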
Output a CSV
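df.to_csv('output_name')
to write the contents of df to a file called 'output_name'. No extension is added for you, so include .csv in the name if you want it. The row index is written as the first column by default; pass index=False to leave it out.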
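A small sketch of the CSV output, using a hypothetical DataFrame. When to_csv is called without a file path it returns the CSV text as a string, which makes it easy to see what would be written.

```python
import pandas as pd

# Hypothetical DataFrame used only to show the CSV layout.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with_index = df.to_csv()                # row index written as the first column
without_index = df.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)
```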
Integer location based indexing
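Use df.iloc[row_index, column_index]
to access data in a DataFrame by its numerical position.
row_index and column_index may each be an integer position or a slice. Positions start at 0.
df.iloc[1]
gives everything in the second row (position 1)
df.iloc[1:4]
gives the rows at positions 1, 2 and 3 (the end of the slice is excluded)
df.iloc[:, :].sum(axis=1)
sums the values across the columns of each row, giving one total per row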
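The positional rules can be sketched with a small made-up DataFrame. Positions are counted from 0, and the end of a slice is not included.

```python
import pandas as pd

# Hypothetical DataFrame with five rows and two numerical columns.
df = pd.DataFrame({"a": [10, 20, 30, 40, 50], "b": [1, 2, 3, 4, 5]})

second_row = df.iloc[1]                 # row at position 1: a=20, b=2
middle_rows = df.iloc[1:4]              # rows at positions 1, 2 and 3
single_value = df.iloc[0, 1]            # first row, second column: 1
row_totals = df.iloc[:, :].sum(axis=1)  # one total per row: 11, 22, 33, 44, 55

print(middle_rows)
print(row_totals)
```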
Label based indexing
Use df.loc[row_index, column_index]
to access data in a DataFrame by its labels.
df.loc['A', :]
gives rows with label 'A' and all columns.
df.loc[df['column1'] == "hi"]
gives all rows where column1 == "hi"
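A minimal sketch of label-based access, assuming a DataFrame whose row labels are the hypothetical strings 'A', 'B' and 'C'. Unlike positional slices, label slices include both endpoints.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels.
df = pd.DataFrame(
    {"column1": ["hi", "bye", "hi"], "column2": [1, 2, 3]},
    index=["A", "B", "C"],
)

row_a = df.loc["A", :]                   # the row labelled 'A', all columns
hi_rows = df.loc[df["column1"] == "hi"]  # rows where the condition is True
a_to_b = df.loc["A":"B"]                 # label slice: includes both 'A' and 'B'

print(hi_rows)
```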
Grouping data
df.groupby()
is used to split a df into groups then apply a function to each group.
data = { 'Category': ['A', 'A', 'B', 'B', 'A'], 'Value': [10, 20, 30, 40, 50] }
df = pd.DataFrame(data)
grouped_df = df.groupby('Category').sum()
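The grouping example above can be run as a complete script like this. The mean calculation at the end is an extra illustration that is not part of the original example.

```python
import pandas as pd

data = {"Category": ["A", "A", "B", "B", "A"], "Value": [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Split the rows by 'Category', then sum 'Value' within each group.
grouped_df = df.groupby("Category").sum()
print(grouped_df)  # A -> 80 (10 + 20 + 50), B -> 70 (30 + 40)

# Other aggregations work the same way, here the mean of one column.
mean_by_category = df.groupby("Category")["Value"].mean()
print(mean_by_category)
```

The group labels become the index of the result, so grouped_df.loc['A', 'Value'] reads the total for category 'A'.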