18.Pandas
Why Pandas? - Easy to visualize - Easy to present - Easy to manipulate data as SQL - High-performance containers for data analysis
What will you learn ? - How to work with pandas - Exploratory data analysis - statistical & graphical methods - Using pandas to deal with time series data - Time indices, resampling
Store Data in DataFrame
- method_1
import pandas as pd
df = pd.DataFrame({
'names': ['Bob','Jessica','Mary','John','Mel'],
'births': [968, 155, 77, 578, 973]
})
- method_2
import pandas as pd
# The inital set of baby names and bith rates
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel', 'Jim']
births = [968, 155, 77, 578, 973, 968 ]
BabyDataSet = list(zip(names, births))
df = pd.DataFrame(BabyDataSet, columns=['Names', 'Births'])
# 將資料儲存成 csv
df.to_csv("birth_data.csv", index = True)
# 讀取 csv
df = pd.read_csv("birth_data.csv")
## 觀察前五列
df.head()
Work With CSV
Check point of CSV file - header - row index - encoding
Work With CSV - check header
import pandas as pd
df = pd.read_csv('birth_data.csv', names=['births', 'names'])
Work With CSV - check row index
import pandas as pd
df = pd.read_csv('birth_data.csv', index_col=0)
Partial Data
Instead of viewing the whole data, sometimes we just need to check the brief information.
df.head(n=5) # top n rows
df.tail(n=5) # last n rows
df[1:5] # 1 to 4 rows
df.sample(5) # random pick n rows
Make a Dataset
import numpy as np
import pandas as pd
# # set seed
np.random.seed(111)
# Function to generate test data
def CreateDataSet(Number=1):
Output = []
for i in range(Number):
# Create a weekly (mondays) date range
rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON') #random number generator
# Create random data
data = np.random.randint(low=25, high=1000, size=len(rng))
# Status pool
status = [1,2,3]
# Make a random list of statuses
random_status = [status[np.random.randint(low=0, high=len(status))] for i in range(len(rng))]
# State pool
states = ['GA','FL','fl','NY','NJ','TX']
# Make a random list of states
random_states = [states[np.random.randint(low=0, high=len(states))] for i in range(len(rng))]
Output.extend(zip(random_states, random_status, data, rng))
#extend& append 差異
return Output
```
```py
dataset = CreateDataSet(4)
dataset
len(dataset) #check shape
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.head(10)
Data Observation
Exploring data with useful method
df['State'].unique() # get the set of unique value
df['State'].value_counts() # get the counts of each value
df.info() # get the brief info
Index and Select Data
- Square brackets
- Advanced methods
- loc
- iloc
Column Access
# DataFrame this would be a DataFrame
df[["CustomerCount"]]
df[["State","Status"]][1:3]
# Series this would be a series
df["CustomerCount"]
df["StatusDate"][100:105]
Index and Select Data - Row Access
df[2:9] →會印出index為2 到 index為8 的值
Slicing with Pandas
Pandas slicing - DataFrame.loc (label-based) - DataFrame.iloc (integer position-based)
slicing with pandas - loc
# select specific row(0 1 2 3)
print(df.loc[0:3])
# subset all row and specific columns
print(df.loc[:,["State", "Status"]].head())
# 只取State欄位
print(df.loc[0:5,"State"])
Slicing with Pandas -iloc
# row access(切片所以只有0 1 2)
print(df.iloc[0:3])
# subset all row and specific columns
print(df.iloc[1:5,[0,1]])
# 只取State欄位
print(df.iloc[:,1].head())
print(df.iloc[:,1].head().values)
Conditional Data
Sometimes, we need to estimate data behavior with given statement
df.loc[df.State == 'FL']
df.loc[(df.State == 'FL') & (df['Status'] == 3)]
Columns Observation
Intuitively, a column of values is a vector
print(df['Status'])
print(df['Status'].values)
Generate a vector - numerical operation
We know we can create a new column by assigning new vector.How about basic numerical operation?
print(df['Status'] + 1)
Generate a vector by function
But we want more than numerical operation. Try functional programming !
# apply func to each column
print(df.apply(len, axis=0))
# apply func to each row
print(df.apply(len, axis=1))
Generate a vector by function
Design our own function and do more !
def text(x):
status = x['Status']
return 'correct' if status == 1 else 'error'
df['StatusText'] = df.apply(text, axis=1)
Get the summary data by group
We just learned how to get the data with condition! e.g. I want to know the total count in each state
df2 = df.loc[df['State'] == 'NJ']
total = 0
for count in df2['CustomerCount']:
total += count
print(total)
# could we do calculation only once?
How to grouping data?
# group the data by given column name
g_state = df.groupby(['State'])
g_state.size()
Get the grouped dataframe
# group the data by given column name
g_state = df.groupby(['State'])
# get the group data
g_state.get_group('NJ')
Get the grouped data summary
# group the data by given column name
g_state = df.groupby(['State'])
# sum all the numerical data
g_state.sum()
Group by multiple indexes and hierarchical
# group the data by given column name
g_state = df.groupby(
['StatusDate', 'State'],
sort=True).sum()
Combine DataFrame
When we try to combine similar Dataframe, we can easily use
pd.concat([df1, df2])
Concept of Joint Dataframe
When the row and column do not fully match, we have to consider the strategy to merge Dataframe. - LEFT join - RIGHT join - OUTER join - INNER join
Left join
pd.merge(left, right, on=['key1', 'key2'],
how='left')
Right join
pd.merge(left, right, on=['key1', 'key2'],
how='right')
Full Outer join
pd.merge(left, right, on=['key1', 'key2'],
how='outer')
Inner join
pd.merge(left, right, on=['key1', 'key2'],
how='inner')
Pandas 欄位名稱排序
data = pd.read_csv('./data.csv')
newData=data[sorted(data) ]