18.Pandas
Why Pandas? - Easy to visualize - Easy to present - Easy to manipulate data as SQL - High-performance containers for data analysis
What will you learn ? - How to work with pandas - Exploratory data analysis - statistical & graphical methods - Using pandas to deal with time series data - Time indices, resampling
Store Data in DataFrame
- method_1
import pandas as pd df = pd.DataFrame({ 'names': ['Bob','Jessica','Mary','John','Mel'], 'births': [968, 155, 77, 578, 973] })
- method_2
import pandas as pd # The inital set of baby names and bith rates names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel', 'Jim'] births = [968, 155, 77, 578, 973, 968 ] BabyDataSet = list(zip(names, births)) df = pd.DataFrame(BabyDataSet, columns=['Names', 'Births']) # 將資料儲存成 csv df.to_csv("birth_data.csv", index = True) # 讀取 csv df = pd.read_csv("birth_data.csv") ## 觀察前五列 df.head()
Work With CSV
Check point of CSV file - header - row index - encoding
Work With CSV - check header
import pandas as pd df = pd.read_csv('birth_data.csv', names=['births', 'names'])
Work With CSV - check row index
import pandas as pd df = pd.read_csv('birth_data.csv', index_col=0)
Partial Data
Instead of viewing the whole data, sometimes we just need to check the brief information.
df.head(n=5) # top n rows df.tail(n=5) # last n rows df[1:5] # 1 to 4 rows df.sample(5) # random pick n rows
Make a Dataset
import numpy as np import pandas as pd # # set seed np.random.seed(111) # Function to generate test data def CreateDataSet(Number=1): Output = [] for i in range(Number): # Create a weekly (mondays) date range rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON') #random number generator # Create random data data = np.random.randint(low=25, high=1000, size=len(rng)) # Status pool status = [1,2,3] # Make a random list of statuses random_status = [status[np.random.randint(low=0, high=len(status))] for i in range(len(rng))] # State pool states = ['GA','FL','fl','NY','NJ','TX'] # Make a random list of states random_states = [states[np.random.randint(low=0, high=len(states))] for i in range(len(rng))] Output.extend(zip(random_states, random_status, data, rng)) #extend& append 差異 return Output ``` ```py dataset = CreateDataSet(4) dataset len(dataset) #check shape df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate']) df.head(10)
Data Observation
Exploring data with useful method
df['State'].unique() # get the set of unique value df['State'].value_counts() # get the counts of each value df.info() # get the brief info
Index and Select Data
- Square brackets
- Advanced methods
- loc
- iloc
Column Access
# DataFrame this would be a DataFrame df[["CustomerCount"]] df[["State","Status"]][1:3] # Series this would be a series df["CustomerCount"] df["StatusDate"][100:105]
Index and Select Data - Row Access
df[2:9] →會印出index為2 到 index為8 的值
Slicing with Pandas
Pandas slicing - DataFrame.loc (label-based) - DataFrame.iloc (integer position-based)
slicing with pandas - loc
# select specific row(0 1 2 3) print(df.loc[0:3]) # subset all row and specific columns print(df.loc[:,["State", "Status"]].head()) # 只取State欄位 print(df.loc[0:5,"State"])
Slicing with Pandas -iloc
# row access(切片所以只有0 1 2) print(df.iloc[0:3]) # subset all row and specific columns print(df.iloc[1:5,[0,1]]) # 只取State欄位 print(df.iloc[:,1].head()) print(df.iloc[:,1].head().values)
Conditional Data
Sometimes, we need to estimate data behavior with given statement
df.loc[df.State == 'FL'] df.loc[(df.State == 'FL') & (df['Status'] == 3)]
Columns Observation
Intuitively, a column of values is a vector
print(df['Status']) print(df['Status'].values)
Generate a vector - numerical operation
We know we can create a new column by assigning new vector.How about basic numerical operation?
print(df['Status'] + 1)
Generate a vector by function
But we want more than numerical operation. Try functional programming !
# apply func to each column print(df.apply(len, axis=0)) # apply func to each row print(df.apply(len, axis=1))
Generate a vector by function
Design our own function and do more !
def text(x): status = x['Status'] return 'correct' if status == 1 else 'error' df['StatusText'] = df.apply(text, axis=1)
Get the summary data by group
We just learned how to get the data with condition! e.g. I want to know the total count in each state
df2 = df.loc[df['State'] == 'NJ'] total = 0 for count in df2['CustomerCount']: total += count print(total) # could we do calculation only once?
How to grouping data?
# group the data by given column name g_state = df.groupby(['State']) g_state.size()
Get the grouped dataframe
# group the data by given column name g_state = df.groupby(['State']) # get the group data g_state.get_group('NJ')
Get the grouped data summary
# group the data by given column name g_state = df.groupby(['State']) # sum all the numerical data g_state.sum()
Group by multiple indexes and hierarchical
# group the data by given column name g_state = df.groupby( ['StatusDate', 'State'], sort=True).sum()
Combine DataFrame
When we try to combine similar Dataframe, we can easily use
pd.concat([df1, df2])
Concept of Joint Dataframe
When the row and column do not fully match, we have to consider the strategy to merge Dataframe. - LEFT join - RIGHT join - OUTER join - INNER join
Left join
pd.merge(left, right, on=['key1', 'key2'], how='left')
Right join
pd.merge(left, right, on=['key1', 'key2'], how='right')
Full Outer join
pd.merge(left, right, on=['key1', 'key2'], how='outer')
Inner join
pd.merge(left, right, on=['key1', 'key2'], how='inner')
Pandas 欄位名稱排序
data = pd.read_csv('./data.csv') newData=data[sorted(data) ]