pandas dataframe fill missing rows with date column -
uuid site ts visit 0 +cw99 1124 2013-06-24 2 1 +cw99 1124 2013-06-26 1 2 +cw99 1124 2013-06-27 1 3 +cw99 1124 2013-06-20 1 4 +cw99 1124 2013-06-21 1 5 +cw99 1124 2013-06-24 2 6 +cw9w 956 2013-06-21 4 7 +cw9w 956 2013-06-22 2 8 +cw9w 956 2013-06-23 3 9 +cw9w 956 2013-06-24 4
i have dataframe passed in looks that. don't know how many distinct ts period there. want find min , max of ts expands rows not have ts (daily) 0 fill visit column while copying rest of fields
uuid site ts visit 0 +cw99 1124 2013-06-20 1 1 +cw99 1124 2013-06-21 1 2 +cw99 1124 2013-06-22 1 3 +cw99 1124 2013-06-23 0 4 +cw99 1124 2013-06-24 2 5 +cw99 1124 2013-06-25 0 6 +cw99 1124 2013-06-26 1 7 +cw99 1124 2013-06-27 1 8 +cw9w 956 2013-06-20 0 9 +cw9w 956 2013-06-21 4 10 +cw9w 956 2013-06-22 2 11 +cw9w 956 2013-06-23 3 12 +cw9w 956 2013-06-24 4 13 +cw9w 956 2013-06-25 0 14 +cw9w 956 2013-06-26 0 15 +cw9w 956 2013-06-27 0
thanks.
this non-trivial. i'll explain why below
prelim, read in original data frame , make sure ts
column dtype of datetime64[ns]
# may need correct dtype df['ts'] = df['ts'].to_datetime(df['ts']) in [107]: df out[107]: uuid site ts visit 0 +cw99 1124 2013-06-24 00:00:00 2 1 +cw99 1124 2013-06-26 00:00:00 1 2 +cw99 1124 2013-06-27 00:00:00 1 3 +cw99 1124 2013-06-20 00:00:00 1 4 +cw99 1124 2013-06-21 00:00:00 1 5 +cw99 1124 2013-06-24 00:00:00 2 6 +cw9w 956 2013-06-21 00:00:00 4 7 +cw9w 956 2013-06-22 00:00:00 2 8 +cw9w 956 2013-06-23 00:00:00 3 9 +cw9w 956 2013-06-24 00:00:00 4 in [106]: df.dtypes out[106]: uuid object site int64 ts datetime64[ns] visit int64 dtype: object
create master times between min , max
in [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max()) in [111]: all_ts out[111]: <class 'pandas.tseries.index.datetimeindex'> [2013-06-20 00:00:00, ..., 2013-06-27 00:00:00] length: 8, freq: d, timezone: none
define function this
in [103]: def f(x): # want of ``ts`` column not in master time series .....: adf = dataframe(dict(ts = all_ts-index(x['ts'])),columns=df.columns) # should have visit of 0 .....: adf['visit'] = 0 # first add them frame (x), ignoring index # sort ts column # fillforward missing values .....: return x.append(adf,ignore_index=true).sort_index(by='ts').ffill() .....:
apply function (you group uuid,site if want well)
in [116]: df.groupby('uuid').apply(f) out[116]: uuid site ts visit uuid +cw99 3 +cw99 1124 2013-06-20 00:00:00 1 4 +cw99 1124 2013-06-21 00:00:00 1 0 +cw99 1124 2013-06-24 00:00:00 2 5 +cw99 1124 2013-06-24 00:00:00 2 6 +cw99 1124 2013-06-25 00:00:00 0 1 +cw99 1124 2013-06-26 00:00:00 1 2 +cw99 1124 2013-06-27 00:00:00 1 +cw9w 0 +cw9w 956 2013-06-21 00:00:00 4 1 +cw9w 956 2013-06-22 00:00:00 2 2 +cw9w 956 2013-06-23 00:00:00 3 3 +cw9w 956 2013-06-24 00:00:00 4 4 +cw9w 956 2013-06-25 00:00:00 0
note: have duplicate in posted frame. not sure if intential or not preserved that. easier problem if don't have duplicates (in ts
column)
here no dup way
in [207]: def f(x): .....: x = x.set_index('ts').reindex(all_ts).reset_index() .....: x['visit'] = x['visit'].fillna(0) .....: return x.ffill() .....: in [208]: df_no_dups.groupby('uuid').apply(f) out[208]: index uuid site visit uuid +cw99 0 2013-06-20 00:00:00 +cw99 1124 1 1 2013-06-21 00:00:00 +cw99 1124 1 2 2013-06-22 00:00:00 +cw99 1124 0 3 2013-06-23 00:00:00 +cw99 1124 0 4 2013-06-24 00:00:00 +cw99 1124 2 5 2013-06-25 00:00:00 +cw99 1124 0 6 2013-06-26 00:00:00 +cw99 1124 1 7 2013-06-27 00:00:00 +cw99 1124 1 +cw9w 0 2013-06-20 00:00:00 nan nan 0 1 2013-06-21 00:00:00 +cw9w 956 4 2 2013-06-22 00:00:00 +cw9w 956 2 3 2013-06-23 00:00:00 +cw9w 956 3 4 2013-06-24 00:00:00 +cw9w 956 4 5 2013-06-25 00:00:00 +cw9w 956 0 6 2013-06-26 00:00:00 +cw9w 956 0 7 2013-06-27 00:00:00 +cw9w 956 0
this forces elements there (note nan
becuase no way ffill
on first element). drop these if want.
Comments
Post a Comment