python - Grouping and filtering data -


dataframe:

protein   peptide   mean intensity   a1        aab       4,54              a1        abb       5,56              a1        abb       4,67                        a1        aab       5,67              a1        abc       5,67             a2        abb       4,64              a2        aab       4,54              a2        abb       5,56              a2        abc       4,67                         a2        abc       5,67             

but need find every protein top 2 (most frequent) peptides output a1 :

protein    peptide   mean intensity    a1         aab       4,54 + 5.67 / 2            abb       5.56 + 4.67 / 2 a2         abb       7,42            abc       5,17 

so problem needs stay dataframe.

first, can perform groupby/apply operation obtain protein/peptide pairs 2 largest peptide counts each protein:

counts = (df.groupby(['protein'])['peptide']             .apply(lambda x: x.value_counts().nlargest(2))) counts = counts[counts >= 2] counts = counts.to_frame() #                  counts # protein peptide         # a1      aab           2 #         abb           2 # a2      abb           2 #         abc           2 

now can merge original dataframe, df counts, joining on columns of df , index of counts. using inner join guarantees protein/peptide pairs present in both df , counts show in result:

result = pd.merge(df, counts, left_on=['protein', 'peptide'], right_index=true,                   how='inner')  #   protein peptide  mean intensity  counts # 0      a1     aab            4.54       2 # 3      a1     aab            5.67       2 # 1      a1     abb            5.56       2 # 2      a1     abb            4.67       2 # 5      a2     abb            4.64       2 # 7      a2     abb            5.56       2 # 8      a2     abc            4.67       2 # 9      a2     abc            5.67       2 

now easy perform desired groupby/mean operation:

result = result.groupby(['protein', 'peptide'])['mean intensity'].mean() 

so putting together,

import pandas pd df = pd.read_table('data', sep='\s{2,}')  counts = (df.groupby(['protein'])['peptide']             .apply(lambda x: x.value_counts().nlargest(2))) counts = counts[counts >= 2] counts = counts.to_frame() result = pd.merge(df, counts, left_on=['protein', 'peptide'], right_index=true,                   how='inner') result = result.groupby(['protein', 'peptide'])['mean intensity'].mean() result = result.reset_index() print(result) 

yields

  protein peptide  mean intensity 0      a1     aab           5.105 1      a1     abb           5.115 2      a2     abb           5.100 3      a2     abc           5.170 

Comments

Popular posts from this blog

css - SVG using textPath a symbol not rendering in Firefox -

Java 8 + Maven Javadoc plugin: Error fetching URL -

datatable - Matlab struct computations -