跳轉到

18.Pandas

Why Pandas? - Easy to visualize - Easy to present - Easy to manipulate data as SQL - High-performance containers for data analysis

What will you learn ? - How to work with pandas - Exploratory data analysis - statistical & graphical methods - Using pandas to deal with time series data - Time indices, resampling

Store Data in DataFrame

  • method_1
import pandas as pd
df = pd.DataFrame({
'names': ['Bob','Jessica','Mary','John','Mel'],
'births': [968, 155, 77, 578, 973]
})
  • method_2
import pandas as pd
# The inital set of baby names and bith rates
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel', 'Jim']
births = [968, 155, 77, 578, 973, 968 ]
BabyDataSet = list(zip(names, births))
df = pd.DataFrame(BabyDataSet, columns=['Names', 'Births'])
# 將資料儲存成 csv
df.to_csv("birth_data.csv", index = True)
# 讀取 csv
df = pd.read_csv("birth_data.csv")
## 觀察前五列
df.head()

Work With CSV

Check point of CSV file - header - row index - encoding

Work With CSV - check header

import pandas as pd
df = pd.read_csv('birth_data.csv', names=['births', 'names'])

Work With CSV - check row index

import pandas as pd
df = pd.read_csv('birth_data.csv', index_col=0)

Partial Data

Instead of viewing the whole data, sometimes we just need to check the brief information.

df.head(n=5) # top n rows
df.tail(n=5) # last n rows
df[1:5] # 1 to 4 rows
df.sample(5) # random pick n rows

Make a Dataset

import numpy as np
import pandas as pd
# # set seed
np.random.seed(111)

# Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a weekly (mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON') #random number generator

        # Create random data
        data = np.random.randint(low=25, high=1000, size=len(rng))

        # Status pool
        status = [1,2,3]

        # Make a random list of statuses
        random_status = [status[np.random.randint(low=0, high=len(status))] for i in range(len(rng))]

        # State pool
        states = ['GA','FL','fl','NY','NJ','TX']

        # Make a random list of states 
        random_states = [states[np.random.randint(low=0, high=len(states))] for i in range(len(rng))]

        Output.extend(zip(random_states, random_status, data, rng))
        #extend& append 差異

    return Output
  ```

  ```py
dataset = CreateDataSet(4)
dataset
len(dataset) #check shape
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.head(10)

Data Observation

Exploring data with useful method

df['State'].unique() # get the set of unique value
df['State'].value_counts() # get the counts of each value
df.info() # get the brief info

Index and Select Data

  • Square brackets
  • Advanced methods
  • loc
  • iloc

Column Access

# DataFrame this would be a DataFrame
df[["CustomerCount"]]
df[["State","Status"]][1:3]
# Series this would be a series
df["CustomerCount"]
df["StatusDate"][100:105]

Index and Select Data - Row Access

df[2:9] →會印出index為2 到 index為8 的值

Slicing with Pandas

Pandas slicing - DataFrame.loc (label-based) - DataFrame.iloc (integer position-based)

slicing with pandas - loc

# select specific row(0 1 2 3)
print(df.loc[0:3])
# subset all row and specific columns
print(df.loc[:,["State", "Status"]].head())
# 只取State欄位
print(df.loc[0:5,"State"])

Slicing with Pandas -iloc

# row access(切片所以只有0 1 2)
print(df.iloc[0:3])
# subset all row and specific columns
print(df.iloc[1:5,[0,1]])
# 只取State欄位
print(df.iloc[:,1].head())
print(df.iloc[:,1].head().values)

Conditional Data

Sometimes, we need to estimate data behavior with given statement

df.loc[df.State == 'FL']

df.loc[(df.State == 'FL') & (df['Status'] == 3)]

Columns Observation

Intuitively, a column of values is a vector

print(df['Status'])
print(df['Status'].values)

Generate a vector - numerical operation

We know we can create a new column by assigning new vector.How about basic numerical operation?

print(df['Status'] + 1)

Generate a vector by function

But we want more than numerical operation. Try functional programming !

# apply func to each column
print(df.apply(len, axis=0))

# apply func to each row
print(df.apply(len, axis=1))

Generate a vector by function

Design our own function and do more !

def text(x):
  status = x['Status']
  return 'correct' if status == 1 else 'error'
df['StatusText'] = df.apply(text, axis=1)

Get the summary data by group

We just learned how to get the data with condition! e.g. I want to know the total count in each state

df2 = df.loc[df['State'] == 'NJ']
total = 0
for count in df2['CustomerCount']:
total += count
print(total)
# could we do calculation only once?

How to grouping data?

# group the data by given column name
g_state = df.groupby(['State'])
g_state.size()

Get the grouped dataframe

# group the data by given column name
g_state = df.groupby(['State'])

# get the group data
g_state.get_group('NJ')

Get the grouped data summary

# group the data by given column name
g_state = df.groupby(['State'])

# sum all the numerical data
g_state.sum()

Group by multiple indexes and hierarchical

# group the data by given column name
g_state = df.groupby(
['StatusDate', 'State'],
sort=True).sum()

Combine DataFrame

When we try to combine similar Dataframe, we can easily use

pd.concat([df1, df2])

Concept of Joint Dataframe

When the row and column do not fully match, we have to consider the strategy to merge Dataframe. - LEFT join - RIGHT join - OUTER join - INNER join

Left join

pd.merge(left, right, on=['key1', 'key2'],
how='left')

Right join

pd.merge(left, right, on=['key1', 'key2'],
how='right')

Full Outer join

pd.merge(left, right, on=['key1', 'key2'],
how='outer')

Inner join

pd.merge(left, right, on=['key1', 'key2'],
how='inner')

Pandas 欄位名稱排序

data = pd.read_csv('./data.csv')
newData=data[sorted(data) ]