python - Grouping and filtering data -
dataframe:
protein peptide mean intensity a1 aab 4,54 a1 abb 5,56 a1 abb 4,67 a1 aab 5,67 a1 abc 5,67 a2 abb 4,64 a2 aab 4,54 a2 abb 5,56 a2 abc 4,67 a2 abc 5,67
but need find every protein top 2 (most frequent) peptides output a1 :
protein peptide mean intensity a1 aab 4,54 + 5.67 / 2 abb 5.56 + 4.67 / 2 a2 abb 7,42 abc 5,17
so problem needs stay dataframe.
first, can perform groupby/apply operation obtain protein/peptide pairs 2 largest peptide counts each protein:
counts = (df.groupby(['protein'])['peptide'] .apply(lambda x: x.value_counts().nlargest(2))) counts = counts[counts >= 2] counts = counts.to_frame() # counts # protein peptide # a1 aab 2 # abb 2 # a2 abb 2 # abc 2
now can merge original dataframe, df
counts
, joining on columns of df
, index of counts
. using inner join guarantees protein/peptide pairs present in both df
, counts
show in result
:
result = pd.merge(df, counts, left_on=['protein', 'peptide'], right_index=true, how='inner') # protein peptide mean intensity counts # 0 a1 aab 4.54 2 # 3 a1 aab 5.67 2 # 1 a1 abb 5.56 2 # 2 a1 abb 4.67 2 # 5 a2 abb 4.64 2 # 7 a2 abb 5.56 2 # 8 a2 abc 4.67 2 # 9 a2 abc 5.67 2
now easy perform desired groupby/mean
operation:
result = result.groupby(['protein', 'peptide'])['mean intensity'].mean()
so putting together,
import pandas pd df = pd.read_table('data', sep='\s{2,}') counts = (df.groupby(['protein'])['peptide'] .apply(lambda x: x.value_counts().nlargest(2))) counts = counts[counts >= 2] counts = counts.to_frame() result = pd.merge(df, counts, left_on=['protein', 'peptide'], right_index=true, how='inner') result = result.groupby(['protein', 'peptide'])['mean intensity'].mean() result = result.reset_index() print(result)
yields
protein peptide mean intensity 0 a1 aab 5.105 1 a1 abb 5.115 2 a2 abb 5.100 3 a2 abc 5.170
Comments
Post a Comment