python - Understanding indexing issues in Pandas 0.8.1 (and 0.11)

Here is an example IPython session in which some straightforward indexing assignments on a pandas DataFrame work, and others don't work even though they seem just as straightforward:

In [652]: dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['a', 'b', 'c'])

In [653]: dfrm
Out[653]:
          a         b         c
0  0.777147  0.558404  0.424222
1  0.906354  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  0.746045  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  0.969879  0.043160  0.891143
8  0.527701  0.992965  0.073797
9  0.553854  0.969303  0.523098

In [654]: dfrm['a'][dfrm.a > 0.5] = [1,2,3,4,5,6]

In [655]: dfrm
Out[655]:
          a         b         c
0  1.000000  0.558404  0.424222
1  2.000000  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  3.000000  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  4.000000  0.043160  0.891143
8  5.000000  0.992965  0.073797
9  6.000000  0.969303  0.523098

In [656]: dfrm[['b','c']][dfrm.a > 0.5] = 100*np.random.rand(6,2)

In [657]: dfrm
Out[657]:
          a         b         c
0  1.000000  0.558404  0.424222
1  2.000000  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  3.000000  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  4.000000  0.043160  0.891143
8  5.000000  0.992965  0.073797
9  6.000000  0.969303  0.523098

In [658]: dfrm[dfrm.a > 0.5] = 100*np.random.rand(6,3)

In [659]: dfrm
Out[659]:
           a          b          c
0  27.738118  18.812116  46.369840
1  35.335223  58.365611   7.773464
2   0.011354   0.468661   0.056303
3   0.118818   0.117526   0.649210
4  97.439481  98.621074  69.816171
5   0.374871   0.285712   0.868599
6   0.223596   0.963223   0.012154
7  53.609637  30.952762  81.379502
8  68.473117  16.261694  91.092718
9  82.253724  94.979991  72.571951

In [660]: dfrm[dfrm.a > 0.5] = 0.5*dfrm[dfrm.a > 0.5]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-660-35fb8e212806> in <module>()
----> 1 dfrm[dfrm.a > 0.5] = 0.5*dfrm[dfrm.a > 0.5]

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   1707             self._boolean_set(key, value)
   1708         elif isinstance(key, (np.ndarray, list)):
-> 1709             return self._set_item_multiple(key, value)
   1710         else:
   1711             # set column

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item_multiple(self, keys, value)
   1728     def _set_item_multiple(self, keys, value):
   1729         if isinstance(value, DataFrame):
-> 1730             assert(len(value.columns) == len(keys))
   1731             for k1, k2 in zip(keys, value.columns):
   1732                 self[k1] = value[k2]

AssertionError:

Can anyone explain why some (but not all) of these work, and why the final one induces an error?

Update:

We have pandas 0.11 installed, but it's not the default version for development; it's a sandbox sort of thing for me right now. When I repeat the example in 0.11, I see the same assignment problems, except that the last example works correctly with no error. But the muddled-ness of the conventions for how to invoke the original DataFrame's __setitem__ is still there:

Python 2.7.3 |EPD 7.3-2 (64-bit)| (default, Apr 11 2012, 17:52:16)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "credits", "demo" or "enthought" for more information.
hello
>>> import pandas
>>> pandas.__version__
'0.11.0'
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['a', 'b', 'c'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'np' is not defined
>>> import numpy as np
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['a', 'b', 'c'])
>>> dfrm
          a         b         c
0  0.745516  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  0.820178  0.946806  0.687971
3  0.771971  0.934799  0.633633
4  0.828249  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  0.663891  0.753134  0.849269
7  0.647054  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  0.994135  0.990536  0.436903
>>> dfrm[dfrm.a > 0.5]
          a         b         c
0  0.745516  0.062613  0.147684
2  0.820178  0.946806  0.687971
3  0.771971  0.934799  0.633633
4  0.828249  0.065587  0.848788
6  0.663891  0.753134  0.849269
7  0.647054  0.962267  0.453865
9  0.994135  0.990536  0.436903
>>> len(dfrm[dfrm.a > 0.5])
7
>>> dfrm['a'][dfrm.a > 0.5] = [1,2,3,4,5,6,7]
>>> dfrm
          a         b         c
0  1.000000  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  2.000000  0.946806  0.687971
3  3.000000  0.934799  0.633633
4  4.000000  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  5.000000  0.753134  0.849269
7  6.000000  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  7.000000  0.990536  0.436903
>>> dfrm[['b','c']][dfrm.a > 0.5] = 100*np.random.rand(7,2)
>>> dfrm
          a         b         c
0  1.000000  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  2.000000  0.946806  0.687971
3  3.000000  0.934799  0.633633
4  4.000000  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  5.000000  0.753134  0.849269
7  6.000000  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  7.000000  0.990536  0.436903
>>> dfrm[dfrm.a > 0.5] = 0.5*dfrm[dfrm.a > 0.5]
>>> dfrm
          a         b         c
0  0.500000  0.031306  0.073842
1  0.369141  0.447022  0.114963
2  1.000000  0.473403  0.343985
3  1.500000  0.467400  0.316816
4  2.000000  0.032794  0.424394
5  0.433796  0.740885  0.160140
6  2.500000  0.376567  0.424635
7  3.000000  0.481133  0.226933
8  0.345706  0.030634  0.058697
9  3.500000  0.495268  0.218452
>>>

Second update:

Here's some super unexpected behavior:

In [681]: id(dfrm.a)
Out[681]: 298480536

In [682]: id(dfrm.a)
Out[682]: 298480536

In [683]: id(dfrm.a)
Out[683]: 298480536

In [684]: id(dfrm['a'])
Out[684]: 298480536

In [685]: id(dfrm['a'])
Out[685]: 298480536

In [686]: id(dfrm['a'])
Out[686]: 298480536

In [687]: id(dfrm[['a']])
Out[687]: 281536912

In [688]: id(dfrm[['a']])
Out[688]: 281535824

In [689]: id(dfrm[['a']])
Out[689]: 281536336

Assigning through two or more getitems/slices (chaining) may or may not work, depending on the situation...
You should avoid doing it! Rewrite each assignment so that it happens in one pass.
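A minimal sketch of why the chained form can silently do nothing, assuming a reasonably recent pandas. The names `mask` and `subset` and the toy values are illustrative, not taken from the session above. The explicit `.copy()` is only there to spell out what selecting a list of columns already does: it hands back a separate DataFrame, so the second indexing step assigns into a temporary object.

```python
import numpy as np
import pandas as pd

# Toy frame with fixed values so the outcome is easy to follow.
dfrm = pd.DataFrame({'a': [0.9, 0.1, 0.7],
                     'b': [0.2, 0.3, 0.4],
                     'c': [0.5, 0.6, 0.8]})
mask = dfrm['a'] > 0.5

# Chained form, spelled out step by step:
# step 1 builds a separate DataFrame (made explicit with .copy()) ...
subset = dfrm[['b', 'c']].copy()
# ... and step 2 assigns into that temporary object only.
subset.loc[mask, :] = -1.0
chained_changed_original = bool((dfrm[['b', 'c']] == -1.0).any().any())

# Single-pass form: one .loc call addresses rows and columns of dfrm itself.
dfrm.loc[mask, ['b', 'c']] = -1.0
single_pass_changed_original = bool(
    (dfrm.loc[mask, ['b', 'c']] == -1.0).all().all()
)
```

Here `chained_changed_original` stays false while `single_pass_changed_original` is true, which mirrors the `dfrm[['b','c']][dfrm.a > 0.5] = ...` line in the question leaving the frame untouched.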

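There was quite a substantial amount of work in 0.11 (and possibly before) to clear up this behaviour. pandas overloads these assignments so that you do not have to care whether you are dealing with a view or a copy, provided you do the assignment in one pass, which is what you should be doing in general.
For example: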

dfrm.loc[dfrm.a > 0.5, 'a'] = [1, 2, 3, 4, 5, 6]

dfrm.loc[dfrm.a > 0.5, ['b','c']] = 100 * np.random.rand(6, 2)
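The single-pass form can be tried end to end with something like the following sketch. It assumes a pandas version that has `.loc` (0.11 or later); the fixed seed and the helper names `mask`, `n`, `new_a`, `new_bc` and `original` are illustrative additions. The number of selected rows is computed from the mask instead of being hard-coded as 6, since it changes with the random data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, only for repeatability
dfrm = pd.DataFrame(rng.rand(10, 3), columns=['a', 'b', 'c'])
original = dfrm.copy()  # kept only to compare against afterwards

# Build the row selector once and count how many rows it picks.
mask = dfrm['a'] > 0.5
n = int(mask.sum())

# Right-hand sides sized to match the selection.
new_a = np.arange(1, n + 1, dtype=float)
new_bc = 100 * rng.rand(n, 2)

# One .loc call per assignment: rows and columns are given together.
dfrm.loc[mask, 'a'] = new_a
dfrm.loc[mask, ['b', 'c']] = new_bc
```

Rows where the mask is false are left exactly as they were, and both assignments reach the original frame because no intermediate object sits between `dfrm` and the assignment.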

Also, it's good practice to specify that you are indexing by label (with loc):

dfrm.loc[dfrm.a > 0.5] = 100 * np.random.rand(6, 3)

You could consider rewriting:

dfrm.loc[dfrm.a > 0.5] = 0.5 * dfrm.loc[dfrm.a > 0.5]

as:

dfrm.loc[dfrm.a > 0.5] *= 0.5

This is a surprising error in 0.8.1 (but it seems to be fixed in later versions). Perhaps a workaround (if the above doesn't work) is to set the fancy index first (df_a_gt_half = dfrm.a > 0.5) and then do the assignment using that. In 0.8.1 you may also be forced to use ix rather than loc.
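A rough sketch of that workaround, with the boolean index stored in a variable before any assignment. It is written with `.loc` on the assumption of a current pandas, because `.ix` was removed in later releases; on 0.8.1 the same idea would have to go through `.ix`, which is not shown or tested here. The names `a_gt_half`, `explicit`, `inplace` and `original` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)  # fixed seed, only for repeatability
original = pd.DataFrame(rng.rand(10, 3), columns=['a', 'b', 'c'])

# Compute the boolean index once, before anything is modified.
a_gt_half = original['a'] > 0.5

# Form 1: explicit right-hand side, using the stored mask on both sides.
explicit = original.copy()
explicit.loc[a_gt_half] = 0.5 * explicit.loc[a_gt_half]

# Form 2: in-place multiplication through .loc with the same mask.
inplace = original.copy()
inplace.loc[a_gt_half] *= 0.5
```

Storing the mask first also avoids a subtle trap: after halving, some values in column `a` drop to 0.5 or below, so re-evaluating `dfrm.a > 0.5` later would select a different set of rows.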
