pandas dataframe fill missing rows with date column -

- February 15, 2013

    uuid  site          ts  visit 0  +cw99  1124  2013-06-24     2 1  +cw99  1124  2013-06-26     1 2  +cw99  1124  2013-06-27     1 3  +cw99  1124  2013-06-20     1 4  +cw99  1124  2013-06-21     1 5  +cw99  1124  2013-06-24     2 6  +cw9w   956  2013-06-21     4 7  +cw9w   956  2013-06-22     2 8  +cw9w   956  2013-06-23     3 9  +cw9w   956  2013-06-24     4

i have dataframe passed in looks that. don't know how many distinct ts period there. want find min , max of ts expands rows not have ts (daily) 0 fill visit column while copying rest of fields

        uuid  site          ts  visit     0  +cw99  1124  2013-06-20     1     1  +cw99  1124  2013-06-21     1     2  +cw99  1124  2013-06-22     1     3  +cw99  1124  2013-06-23     0     4  +cw99  1124  2013-06-24     2     5  +cw99  1124  2013-06-25     0     6  +cw99  1124  2013-06-26     1     7  +cw99  1124  2013-06-27     1     8  +cw9w   956  2013-06-20     0     9  +cw9w   956  2013-06-21     4    10  +cw9w   956  2013-06-22     2    11  +cw9w   956  2013-06-23     3    12  +cw9w   956  2013-06-24     4    13  +cw9w   956  2013-06-25     0    14  +cw9w   956  2013-06-26     0    15  +cw9w   956  2013-06-27     0

thanks.

this non-trivial. i'll explain why below

prelim, read in original data frame , make sure ts column dtype of datetime64[ns]

# may need correct dtype    df['ts'] = df['ts'].to_datetime(df['ts'])  in [107]: df out[107]:      uuid  site                  ts  visit 0  +cw99  1124 2013-06-24 00:00:00      2 1  +cw99  1124 2013-06-26 00:00:00      1 2  +cw99  1124 2013-06-27 00:00:00      1 3  +cw99  1124 2013-06-20 00:00:00      1 4  +cw99  1124 2013-06-21 00:00:00      1 5  +cw99  1124 2013-06-24 00:00:00      2 6  +cw9w   956 2013-06-21 00:00:00      4 7  +cw9w   956 2013-06-22 00:00:00      2 8  +cw9w   956 2013-06-23 00:00:00      3 9  +cw9w   956 2013-06-24 00:00:00      4  in [106]: df.dtypes out[106]:  uuid             object site              int64 ts       datetime64[ns] visit             int64 dtype: object

create master times between min , max

in [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())  in [111]: all_ts out[111]:  <class 'pandas.tseries.index.datetimeindex'> [2013-06-20 00:00:00, ..., 2013-06-27 00:00:00] length: 8, freq: d, timezone: none

define function this

in [103]: def f(x):                # want of ``ts`` column not in master time series    .....:     adf = dataframe(dict(ts = all_ts-index(x['ts'])),columns=df.columns)                # should have visit of 0    .....:     adf['visit'] = 0                # first add them frame (x), ignoring index               # sort ts column               # fillforward missing values    .....:     return x.append(adf,ignore_index=true).sort_index(by='ts').ffill()    .....:

apply function (you group uuid,site if want well)

in [116]: df.groupby('uuid').apply(f) out[116]:            uuid  site                  ts  visit uuid                                            +cw99 3  +cw99  1124 2013-06-20 00:00:00      1       4  +cw99  1124 2013-06-21 00:00:00      1       0  +cw99  1124 2013-06-24 00:00:00      2       5  +cw99  1124 2013-06-24 00:00:00      2       6  +cw99  1124 2013-06-25 00:00:00      0       1  +cw99  1124 2013-06-26 00:00:00      1       2  +cw99  1124 2013-06-27 00:00:00      1 +cw9w 0  +cw9w   956 2013-06-21 00:00:00      4       1  +cw9w   956 2013-06-22 00:00:00      2       2  +cw9w   956 2013-06-23 00:00:00      3       3  +cw9w   956 2013-06-24 00:00:00      4       4  +cw9w   956 2013-06-25 00:00:00      0

note: have duplicate in posted frame. not sure if intential or not preserved that. easier problem if don't have duplicates (in ts column)

here no dup way

in [207]: def f(x):    .....:     x = x.set_index('ts').reindex(all_ts).reset_index()    .....:     x['visit'] = x['visit'].fillna(0)    .....:     return x.ffill()    .....:   in [208]: df_no_dups.groupby('uuid').apply(f) out[208]:                        index   uuid  site  visit uuid                                            +cw99 0 2013-06-20 00:00:00  +cw99  1124      1       1 2013-06-21 00:00:00  +cw99  1124      1       2 2013-06-22 00:00:00  +cw99  1124      0       3 2013-06-23 00:00:00  +cw99  1124      0       4 2013-06-24 00:00:00  +cw99  1124      2       5 2013-06-25 00:00:00  +cw99  1124      0       6 2013-06-26 00:00:00  +cw99  1124      1       7 2013-06-27 00:00:00  +cw99  1124      1 +cw9w 0 2013-06-20 00:00:00    nan   nan      0       1 2013-06-21 00:00:00  +cw9w   956      4       2 2013-06-22 00:00:00  +cw9w   956      2       3 2013-06-23 00:00:00  +cw9w   956      3       4 2013-06-24 00:00:00  +cw9w   956      4       5 2013-06-25 00:00:00  +cw9w   956      0       6 2013-06-26 00:00:00  +cw9w   956      0       7 2013-06-27 00:00:00  +cw9w   956      0

this forces elements there (note nan becuase no way ffill on first element). drop these if want.

Search This Blog

IO

pandas dataframe fill missing rows with date column -

Comments

Post a Comment

Popular posts from this blog

javascript - DIV "hiding" when changing dropdown value -

html - Accumulated Depreciation of Assets on php -

node.js - Node - Passport Auth - Authed Post Route hangs on form submission -