python - Any way to speed up this pandas comparison?
I have a Python script slurping up odd log files and putting them into a pandas.DataFrame so I can do stat analysis. Since the logs are snapshots of processes at 5-minute intervals, when I read each file I check the new lines against the data entered from the last file to see if they are the same process as before (in which case I just update the time on the existing record). It works okay, but can be surprisingly slow when the individual logs get over 100,000 lines.
When I profile the performance, there are few stand-outs, but it does show a lot of time spent in this simple function, which compares a Series against the rows carried over from the previous log:
    def carryover(s, df, ids):
        # see if pd.Series (s) matches any rows in pd.DataFrame (df)
        # at the given positional indices (ids)
        for id in ids:
            r = df.iloc[id]
            if (r['a'] == s['a'] and
                    r['b'] == s['b'] and
                    r['c'] == s['c'] and
                    r['d'] == s['d'] and
                    r['e'] == s['e'] and
                    r['f'] == s['f']):
                return id
        return None
I'd figure this is pretty efficient, since the `and`s are short-circuiting and all... but is there maybe a better way?
Otherwise, are there other things I can do to help this run faster? The resulting DataFrame should fit in RAM just fine, but I don't know if there are things I should be setting to ensure caching, etc. are optimal. Thanks, all!
It's quite slow to iterate and look up rows like this (even though it will short-circuit); the speed depends on how likely you are to hit s early...
A more "numpy" way would be to do this calculation on the entire array:
    cols = ['a', 'b', 'c', 'd', 'e', 'f']
    # df.loc selects by index label; use df.iloc[ids][cols] if ids are positions
    equals_s = df.loc[ids, cols] == s.loc[cols]
    row_equals_s = equals_s.all(axis=1)
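Then the first index where this is True is given by idxmax:

    row_equals_s.idxmax()

Note that idxmax also returns the first index when no row matches, so check row_equals_s.any() before trusting the result.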
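Putting the pieces above together, a drop-in replacement for the question's function might look like the following. This is only a sketch under some assumptions: the name `carryover_vectorized` and the `COLS` constant are made up for illustration, and `ids` is assumed to hold positional indices, as the `df.iloc[id]` call in the question suggests.

```python
import pandas as pd

# Columns compared in the question's function.
COLS = ['a', 'b', 'c', 'd', 'e', 'f']

def carryover_vectorized(s, df, ids, cols=COLS):
    """Return the first entry of ids whose row matches s on cols, else None."""
    if len(ids) == 0:
        return None
    # Select the candidate rows by position, like df.iloc[id] in the question.
    candidates = df.iloc[ids][cols]
    # Compare all candidate rows against s at once; s[cols] lines up with the columns.
    matches = (candidates == s[cols]).all(axis=1).to_numpy()
    # Guard against the no-match case before asking for the first True.
    if not matches.any():
        return None
    # argmax on a boolean array gives the position of the first True.
    return ids[int(matches.argmax())]
```

Unlike the loop, this always compares every candidate row, so it trades short-circuiting for a single vectorized pass.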
If speed is crucial and short-circuiting is important, then it could be an idea to rewrite your function in Cython, where you can iterate fast over numpy arrays.
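Short of going to Cython, some of the benefit of short-circuiting can be had in plain NumPy by narrowing the candidate set one column at a time. The sketch below is not the Cython rewrite suggested above, just a stand-in technique; `carryover_narrowing` is a hypothetical name, and `ids` is again assumed to hold positional indices.

```python
import numpy as np
import pandas as pd

def carryover_narrowing(s, df, ids, cols=('a', 'b', 'c', 'd', 'e', 'f')):
    """Return the first position in ids whose row matches s on cols, else None."""
    remaining = np.asarray(ids, dtype=np.intp)
    for col in cols:
        if remaining.size == 0:
            # No candidate survived, so the remaining columns need no checking.
            return None
        values = df[col].to_numpy()
        # Keep only the candidates that still match on this column.
        remaining = remaining[values[remaining] == s[col]]
    # Boolean masking preserves the order of ids, so this is the first match.
    return int(remaining[0]) if remaining.size else None
```

If the first column already rules out most rows, later comparisons only touch the few survivors.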