python - How to speed up complex/difficult data filtering in pandas
I have a large data set with the indices and column headers shown below.
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
|      | count: interaction_eis | count: interaction_eis_reply | count: interaction_match | count: interaction_single_message_ | count: interaction_single_message_1 | count: interaction_yes | dc(uid): interaction_eis | dc(uid): interaction_eis_reply | dc(uid): interaction_match | dc(uid): interaction_single_message_ | dc(uid): interaction_single_message_1 | dc(uid): interaction_yes |
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
| uid  |                        |                              |                          |                                    |                                     |                        |                          |                                |                            |                                      |                                       |                          |
| 38   | 36                     | 0                            | 0                        | 14                                 | 0                                   | 163                    | 1                        | 0                              | 0                          | 1                                    | 0                                     | 1                        |
| 66   | 63                     | 0                            | 0                        | 0                                  | 0                                   | 0                      | 1                        | 0                              | 0                          | 0                                    | 0                                     | 0                        |
| 1466 | 0                      | 0                            | 0                        | 0                                  | 0                                   | 1                      | 0                        | 0                              | 0                          | 0                                    | 0                                     | 1                        |
| 1709 | 51                     | 0                            | 0                        | 1                                  | 0                                   | 9                      | 1                        | 0                              | 0                          | 1                                    | 0                                     | 1                        |
| 1844 | 66                     | 0                            | 1                        | 3                                  | 1                                   | 17                     | 1                        | 0                              | 1                          | 1                                    | 1                                     | 1                        |
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
I am attempting to group uids by the types of interaction received, so that a user with one particular set of interaction types is grouped with the other users who have that same specific set of interaction types.
To get started, I took all of the dc(uid) columns, which have a 1 ("hit") per interaction type and a 0 if that interaction type never happened, and aggregated them into groups row by row like this:
cols = [i for i in list(all_f_rm.columns) if i[0] == "d"]

results = {}

def aggregate(row):
    # Build a key like "001101" from the row's dc(uid) flags.
    key = ""
    for i in cols:
        key += str(row[i])
    if key not in results:
        results[key] = []
    results[key].append(row.name)

all_f_rm.apply(aggregate, axis=1)
results.keys() consists of all the potential interaction type combinations (35 of them), and the value for each key is every index (uid) that belongs to that combination. It looks like this:

{'001101': [141168, 153845, 172598, 254401, 448276,...
Next, I made a function to filter out all of the non-matching rows for each combination/key:
def tablefor(key):
    return all_f_rm[all_f_rm.apply(lambda row: row.name in results[key], axis=1)]
and tablefor('001101') displays the exact DataFrame I want.
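For scale, note that every tablefor call makes a full Python-level pass over all_f_rm just to test list membership for each row. Since results[key] already holds the index labels for that key, the same selection can be written as a direct label lookup; a minimal sketch, assuming results has been built as above:

def tablefor(key):
    # Select rows by index label directly instead of testing every row.
    return all_f_rm.loc[results[key]]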
My problem is that I wrote a list comprehension to loop through all 35 combinations, [tablefor(x) for x in results.keys()], and it is taking forever (1+ hrs and it hasn't finished), and I need to perform this on 5 more data sets. Is there a more efficient way to accomplish what I'm trying to do?
IIUC, you can do what you want with a groupby. Constructing a toy DataFrame like yours:
import numpy as np
import pandas as pd

df = pd.DataFrame({"uid": np.arange(10**6)})
for col in range(6):
    df["dc{}".format(col)] = np.random.randint(0, 2, len(df))
We can group on the columns of interest and get the associated id numbers rapidly:
>>> dcs = [col for col in df.columns if col.startswith("dc")]
>>> df.groupby(dcs)["uid"].unique()
dc0  dc1  dc2  dc3  dc4  dc5
0    0    0    0    0    0    [302, 357, 383, 474, 526, 614, 802, 812, 865, ...
                         1    [7, 96, 190, 220, 405, 453, 534, 598, 606, 866...
                    1    0    [16, 209, 289, 355, 430, 620, 634, 736, 780, 7...
                         1    [9, 79, 166, 268, 408, 434, 435, 447, 572, 749...
               1    0    0    [60, 120, 196, 222, 238, 346, 426, 486, 536, 5...
                         1    [2, 53, 228, 264, 315, 517, 557, 621, 626, 630...
                    1    0    [42, 124, 287, 292, 300, 338, 341, 350, 500, 5...
                         1    [33, 95, 140, 192, 225, 282, 328, 339, 365, 44...
          1    0    0    0    [1, 59, 108, 134, 506, 551, 781, 823, 836, 861...
                         1    [149, 215, 380, 394, 436, 482, 570, 600, 631, ...
                    1    0    [77, 133, 247, 333, 374, 782, 809, 892, 1096, ...
                         1    [14, 275, 312, 326, 343, 444, 569, 692, 770, 7...
               1    0    0    [69, 104, 143, 404, 431, 468, 636, 639, 657, 7...
                         1    [178, 224, 367, 402, 664, 666, 739, 807, 871, ...
[...]
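If you'd rather keep the same '001101'-style string keys as your results dictionary, the grouped Series converts directly; a minimal sketch, reusing the toy df and dcs from above (and assuming the dc columns hold 0/1 integers):

grouped = df.groupby(dcs)["uid"].unique()
# Join each key tuple, e.g. (0, 0, 1, 1, 0, 1), into a string like "001101".
results = {"".join(map(str, key)): list(uids) for key, uids in grouped.items()}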
If you'd prefer the associated groups themselves, you can collect them in a list or dictionary as well, rather than pulling out the indices:
>>> groups = list(df.groupby(dcs, as_index=False))
>>> print(groups[0][0])
(0, 0, 0, 0, 0, 0)
>>> print(groups[0][1])
           uid  dc0  dc1  dc2  dc3  dc4  dc5
302        302    0    0    0    0    0    0
357        357    0    0    0    0    0    0
383        383    0    0    0    0    0    0
[...]
999730  999730    0    0    0    0    0    0
999945  999945    0    0    0    0    0    0
999971  999971    0    0    0    0    0    0

[15357 rows x 7 columns]
And so on.
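Materializing the groups into a dictionary keyed by the group tuples makes each per-combination lookup constant-time, replacing the slow tablefor entirely; a minimal sketch:

# Map each key tuple to its sub-DataFrame; tablefor(key) becomes a dict lookup.
tables = {key: group for key, group in df.groupby(dcs)}
sub = tables[(0, 0, 0, 0, 0, 0)]  # the first group printed above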