python - How to speed up complex/difficult data filtering in pandas
I have a large data set with the indices and column headers shown below.
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
|      | count: interaction_eis | count: interaction_eis_reply | count: interaction_match | count: interaction_single_message_ | count: interaction_single_message_1 | count: interaction_yes | dc(uid): interaction_eis | dc(uid): interaction_eis_reply | dc(uid): interaction_match | dc(uid): interaction_single_message_ | dc(uid): interaction_single_message_1 | dc(uid): interaction_yes |
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
| uid  |                        |                              |                          |                                    |                                     |                        |                          |                                |                            |                                      |                                       |                          |
| 38   | 36                     | 0                            | 0                        | 14                                 | 0                                   | 163                    | 1                        | 0                              | 0                          | 1                                    | 0                                     | 1                        |
| 66   | 63                     | 0                            | 0                        | 0                                  | 0                                   | 0                      | 1                        | 0                              | 0                          | 0                                    | 0                                     | 0                        |
| 1466 | 0                      | 0                            | 0                        | 0                                  | 0                                   | 1                      | 0                        | 0                              | 0                          | 0                                    | 0                                     | 1                        |
| 1709 | 51                     | 0                            | 0                        | 1                                  | 0                                   | 9                      | 1                        | 0                              | 0                          | 1                                    | 0                                     | 1                        |
| 1844 | 66                     | 0                            | 1                        | 3                                  | 1                                   | 17                     | 1                        | 0                              | 1                          | 1                                    | 1                                     | 1                        |
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
I am attempting to group uids by the types of interaction received, so that a user with one particular set of interaction types is grouped with the other users who have that same specific set of interaction types.
To get started, I took all of the dc(uid) columns, which have a 1 ("hit") per interaction type and a 0 if that interaction type never happened, and aggregated them into groups row by row like this:
cols = [i for i in list(all_f_rm.columns) if i[0] == "d"]

results = {}

def aggregate(row):
    # Build a key like "001101" from the row's dc(uid) flags.
    key = ""
    for i in cols:
        key += str(row[i])
    if key not in results:
        results[key] = []
    results[key].append(row.name)

all_f_rm.apply(aggregate, axis=1)
results.keys() consists of all the potential interaction type combinations (35 of them), and the value for each key is every index (uid) that belongs to that combination. It looks like this:

{'001101': [141168, 153845, 172598, 254401, 448276,...
Next, I made a function to filter out all of the non-matching rows for each combination/key:
def tablefor(key):
    return all_f_rm[all_f_rm.apply(lambda row: row.name in results[key], axis=1)]
and tablefor('001101') displays the exact DataFrame I want.
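For scale, note that every tablefor call makes a full Python-level pass over all_f_rm just to test list membership for each row. Since results[key] already holds the index labels for that key, the same selection can be written as a direct label lookup; a minimal sketch, assuming results has been built as above:

def tablefor(key):
    # Select rows by index label directly instead of testing every row.
    return all_f_rm.loc[results[key]]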
My problem is that I wrote a list comprehension to loop through all 35 combinations, [tablefor(x) for x in results.keys()], and it is taking forever (1+ hrs and it hasn't finished), and I need to perform this on 5 more data sets. Is there a more efficient way to accomplish what I'm trying to do?
IIUC, you can do what you want with a groupby. Constructing a toy DataFrame like yours:
import numpy as np
import pandas as pd

df = pd.DataFrame({"uid": np.arange(10**6)})
for col in range(6):
    df["dc{}".format(col)] = np.random.randint(0, 2, len(df))
We can group on the columns of interest and get the associated id numbers rapidly:
>>> dcs = [col for col in df.columns if col.startswith("dc")]
>>> df.groupby(dcs)["uid"].unique()
dc0  dc1  dc2  dc3  dc4  dc5
0    0    0    0    0    0    [302, 357, 383, 474, 526, 614, 802, 812, 865, ...
                         1    [7, 96, 190, 220, 405, 453, 534, 598, 606, 866...
                    1    0    [16, 209, 289, 355, 430, 620, 634, 736, 780, 7...
                         1    [9, 79, 166, 268, 408, 434, 435, 447, 572, 749...
               1    0    0    [60, 120, 196, 222, 238, 346, 426, 486, 536, 5...
                         1    [2, 53, 228, 264, 315, 517, 557, 621, 626, 630...
                    1    0    [42, 124, 287, 292, 300, 338, 341, 350, 500, 5...
                         1    [33, 95, 140, 192, 225, 282, 328, 339, 365, 44...
          1    0    0    0    [1, 59, 108, 134, 506, 551, 781, 823, 836, 861...
                         1    [149, 215, 380, 394, 436, 482, 570, 600, 631, ...
                    1    0    [77, 133, 247, 333, 374, 782, 809, 892, 1096, ...
                         1    [14, 275, 312, 326, 343, 444, 569, 692, 770, 7...
               1    0    0    [69, 104, 143, 404, 431, 468, 636, 639, 657, 7...
                         1    [178, 224, 367, 402, 664, 666, 739, 807, 871, ...
[...]
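If you'd rather keep the same '001101'-style string keys as your results dictionary, the grouped Series converts directly; a minimal sketch, reusing the toy df and dcs from above (and assuming the dc columns hold 0/1 integers):

grouped = df.groupby(dcs)["uid"].unique()
# Join each key tuple, e.g. (0, 0, 1, 1, 0, 1), into a string like "001101".
results = {"".join(map(str, key)): list(uids) for key, uids in grouped.items()}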
If you'd prefer the associated groups themselves, you can collect them in a list or dictionary as well, rather than pulling out the indices:
>>> groups = list(df.groupby(dcs, as_index=False))
>>> print(groups[0][0])
(0, 0, 0, 0, 0, 0)
>>> print(groups[0][1])
           uid  dc0  dc1  dc2  dc3  dc4  dc5
302        302    0    0    0    0    0    0
357        357    0    0    0    0    0    0
383        383    0    0    0    0    0    0
[...]
999730  999730    0    0    0    0    0    0
999945  999945    0    0    0    0    0    0
999971  999971    0    0    0    0    0    0

[15357 rows x 7 columns]
And so on.
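Materializing the groups into a dictionary keyed by the group tuples makes each per-combination lookup constant-time, replacing the slow tablefor entirely; a minimal sketch:

# Map each key tuple to its sub-DataFrame; tablefor(key) becomes a dict lookup.
tables = {key: group for key, group in df.groupby(dcs)}
sub = tables[(0, 0, 0, 0, 0, 0)]  # the first group printed above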