python - NumPy: apply function to groups in structured array
Starting off with a structured numpy array that has 4 fields, I am trying to return an array with the latest dates, by id, containing the same 4 fields. I found a solution using itertools.groupby that almost works here: Numpy Mean Structured Array.
The problem is that I don't understand how to adapt it when there are 4 fields instead of 2. I want to get the whole 'row' back, but only the rows with the latest dates for each id. I understand that this kind of thing is simpler using pandas, but this is a small piece of a larger process, and I can't add pandas as a dependency.
data = np.array([('2005-02-01', 1, 3, 8), ('2005-02-02', 1, 4, 9), ('2005-02-01', 2, 5, 10), ('2005-02-02', 2, 6, 11), ('2005-02-03', 2, 7, 12)], dtype=[('dt', 'datetime64[D]'), ('id', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
For this example array, the desired output would be:
np.array([(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)], dtype=[('dt', '<M8[D]'), ('id', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
This is what I've tried:
latest = np.array([(k, np.array(list(g), dtype=data.dtype).view(np.recarray)['dt'].argmax()) for k, g in groupby(np.sort(data, order='id').view(np.recarray), itemgetter('id'))], dtype=data.dtype)
I get this error:
ValueError: size of tuple must match number of fields.
I think this is because the tuple has 2 fields but the array has 4. When I drop 'f3' and 'f4' from the array, it works correctly.
How can I get it to return all 4 fields?
Let's figure out the error by peeling off one layer:
In [38]: from operator import itemgetter
In [39]: from itertools import groupby
In [41]: [(k, np.array(list(g), dtype=data.dtype).view(np.recarray)['dt'].argmax()) for k, g in groupby(np.sort(data, order='id').view(np.recarray), itemgetter('id'))]
Out[41]: [(1, 1), (2, 2)]
What is this list of tuples supposed to represent? It isn't rows of data. And since each tuple has only 2 items, it can't be mapped onto a data.dtype array. Hence the ValueError.
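The mismatch can be reproduced in isolation. The following is a minimal sketch, assuming the same four-field dtype as in the question. The names `ok`, `raised`, and `message` are only illustrative, and the exact wording of the error message may differ between NumPy versions.

```python
import numpy as np

dtype = [('dt', 'datetime64[D]'), ('id', '<i4'), ('f3', '<i4'), ('f4', '<i4')]

# A 4-tuple lines up with the 4 fields, so this builds a 1-element array.
ok = np.array([('2005-02-02', 1, 4, 9)], dtype=dtype)

# A 2-tuple such as (group key, argmax index) cannot fill 4 fields.
try:
    np.array([(1, 1)], dtype=dtype)
    raised = False
    message = ''
except (ValueError, TypeError) as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

So the comprehension itself is fine. It is the final conversion of the (key, index) pairs with dtype=data.dtype that fails.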
After playing around with it a bit, I think [(1, 1), (2, 2)] means: for id==1, use the [1] item of the group; for id==2, use the [2] item of the group. Those are the records we want:
[(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)]
You have found the maximum dates, but you still have to translate them, either into indexes in data, or by selecting the items from the groups.
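The first option, translating to indexes in data, can be sketched without groupby at all. This is a swapped-in alternative rather than the approach used in this answer. It assumes it is acceptable to sort on both 'id' and 'dt', and the names `order`, `is_last`, and `latest_idx` are only illustrative.

```python
import numpy as np

data = np.array(
    [('2005-02-01', 1, 3, 8), ('2005-02-02', 1, 4, 9),
     ('2005-02-01', 2, 5, 10), ('2005-02-02', 2, 6, 11),
     ('2005-02-03', 2, 7, 12)],
    dtype=[('dt', 'datetime64[D]'), ('id', '<i4'), ('f3', '<i4'), ('f4', '<i4')])

# Indexes that sort the rows by id, then by date within each id.
order = np.argsort(data, order=['id', 'dt'])
sorted_ids = data['id'][order]

# A row is the last (latest) one of its group when the next id differs,
# or when it is the final row overall.
is_last = np.append(sorted_ids[1:] != sorted_ids[:-1], True)

latest_idx = order[is_last]   # positions in the original, unsorted data
latest = data[latest_idx]     # whole rows, all 4 fields
print(latest)
```

Because `latest_idx` indexes the original array, the full records come back with data.dtype unchanged.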
In [91]: groups = groupby(np.sort(data, order='id'), itemgetter('id'))  # don't need recarray
In [92]: G = [(k, list(g)) for k, g in groups]
In [93]: G
Out[93]: [(1, [(datetime.date(2005, 2, 1), 1, 3, 8), (datetime.date(2005, 2, 2), 1, 4, 9)]), (2, [(datetime.date(2005, 2, 1), 2, 5, 10), (datetime.date(2005, 2, 2), 2, 6, 11), (datetime.date(2005, 2, 3), 2, 7, 12)])]
In [107]: I = [(1, 1), (2, 2)]
In [108]: [g[1][i[1]] for g, i in zip(G, I)]
Out[108]: [(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)]
OK, this selection from the list of groups is clumsy, but it's a start.
If I define a simple function to pull the record with the latest date from a group, the processing is a lot simpler.
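def maxdate_record(agroup):
    # collect the group's records into a structured array
    an_array = np.array(list(agroup))
    # index of the latest date within this group
    i = np.argmax(an_array['dt'])
    return an_array[i]

groups = groupby(np.sort(data, order='id'), itemgetter('id'))
np.array([maxdate_record(g) for k, g in groups])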
producing:
array([(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)], dtype=[('dt', '<M8[D]'), ('id', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
I don't need to specify the dtype when I convert the list of records to an array, since the records have their own dtype.
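For convenience, here is the same approach as a self-contained script, with the imports and sample data included and a check that the dtype carries over. It is a sketch under the assumptions of the question. The variable names `first` and `latest` are only illustrative.

```python
import numpy as np
from itertools import groupby
from operator import itemgetter

data = np.array(
    [('2005-02-01', 1, 3, 8), ('2005-02-02', 1, 4, 9),
     ('2005-02-01', 2, 5, 10), ('2005-02-02', 2, 6, 11),
     ('2005-02-03', 2, 7, 12)],
    dtype=[('dt', 'datetime64[D]'), ('id', '<i4'), ('f3', '<i4'), ('f4', '<i4')])

# Each element of a structured array is a record that carries the full dtype.
first = data[0]
print(first.dtype == data.dtype)

def maxdate_record(agroup):
    """Return the record with the latest 'dt' from an iterable of records."""
    an_array = np.array(list(agroup))   # dtype is taken from the records
    return an_array[np.argmax(an_array['dt'])]

# groupby only merges adjacent items, so sort by id first.
groups = groupby(np.sort(data, order='id'), itemgetter('id'))
latest = np.array([maxdate_record(g) for _, g in groups])
print(latest)
print(latest.dtype == data.dtype)
```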