python - Understanding indexing issues in Pandas 0.8.1 (and 0.11)
Here is an example IPython session in which indexing and assignments on a pandas DataFrame sometimes work and sometimes don't, even though they all seem straightforward:
In [652]: dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['a', 'b', 'c'])

In [653]: dfrm
Out[653]:
          a         b         c
0  0.777147  0.558404  0.424222
1  0.906354  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  0.746045  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  0.969879  0.043160  0.891143
8  0.527701  0.992965  0.073797
9  0.553854  0.969303  0.523098

In [654]: dfrm['a'][dfrm.a > 0.5] = [1,2,3,4,5,6]

In [655]: dfrm
Out[655]:
          a         b         c
0  1.000000  0.558404  0.424222
1  2.000000  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  3.000000  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  4.000000  0.043160  0.891143
8  5.000000  0.992965  0.073797
9  6.000000  0.969303  0.523098

In [656]: dfrm[['b','c']][dfrm.a > 0.5] = 100*np.random.rand(6,2)

In [657]: dfrm
Out[657]:
          a         b         c
0  1.000000  0.558404  0.424222
1  2.000000  0.111197  0.492625
2  0.011354  0.468661  0.056303
3  0.118818  0.117526  0.649210
4  3.000000  0.583369  0.962173
5  0.374871  0.285712  0.868599
6  0.223596  0.963223  0.012154
7  4.000000  0.043160  0.891143
8  5.000000  0.992965  0.073797
9  6.000000  0.969303  0.523098

In [658]: dfrm[dfrm.a > 0.5] = 100*np.random.rand(6,3)

In [659]: dfrm
Out[659]:
           a          b          c
0  27.738118  18.812116  46.369840
1  35.335223  58.365611   7.773464
2   0.011354   0.468661   0.056303
3   0.118818   0.117526   0.649210
4  97.439481  98.621074  69.816171
5   0.374871   0.285712   0.868599
6   0.223596   0.963223   0.012154
7  53.609637  30.952762  81.379502
8  68.473117  16.261694  91.092718
9  82.253724  94.979991  72.571951

In [660]: dfrm[dfrm.a > 0.5] = 0.5*dfrm[dfrm.a > 0.5]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-660-35fb8e212806> in <module>()
----> 1 dfrm[dfrm.a > 0.5] = 0.5*dfrm[dfrm.a > 0.5]

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   1707             self._boolean_set(key, value)
   1708         elif isinstance(key, (np.ndarray, list)):
-> 1709             return self._set_item_multiple(key, value)
   1710         else:
   1711             # set column

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item_multiple(self, keys, value)
   1728     def _set_item_multiple(self, keys, value):
   1729         if isinstance(value, DataFrame):
-> 1730             assert(len(value.columns) == len(keys))
   1731             for k1, k2 in zip(keys, value.columns):
   1732                 self[k1] = value[k2]

AssertionError:
Can anyone explain why some (but not all) of these work, and why the final one induces an error?
Update:
We have pandas 0.11 installed, but it's not the default version for development; it's a sandbox sort of thing for me right now. When I repeat the example in 0.11, I see the same assignment problems, except that the last example works correctly with no error. The muddled-ness of the conventions for how to invoke the original DataFrame's __setitem__ is still there:
Python 2.7.3 |EPD 7.3-2 (64-bit)| (default, Apr 11 2012, 17:52:16)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "credits", "demo" or "enthought" for more information.
hello
>>> import pandas
>>> pandas.__version__
'0.11.0'
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['a', 'b', 'c'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'np' is not defined
>>> import numpy as np
>>> dfrm = pandas.DataFrame(np.random.rand(10,3), columns=['a', 'b', 'c'])
>>> dfrm
          a         b         c
0  0.745516  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  0.820178  0.946806  0.687971
3  0.771971  0.934799  0.633633
4  0.828249  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  0.663891  0.753134  0.849269
7  0.647054  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  0.994135  0.990536  0.436903
>>> dfrm[dfrm.a > 0.5]
          a         b         c
0  0.745516  0.062613  0.147684
2  0.820178  0.946806  0.687971
3  0.771971  0.934799  0.633633
4  0.828249  0.065587  0.848788
6  0.663891  0.753134  0.849269
7  0.647054  0.962267  0.453865
9  0.994135  0.990536  0.436903
>>> len(dfrm[dfrm.a > 0.5])
7
>>> dfrm['a'][dfrm.a > 0.5] = [1,2,3,4,5,6,7]
>>> dfrm
          a         b         c
0  1.000000  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  2.000000  0.946806  0.687971
3  3.000000  0.934799  0.633633
4  4.000000  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  5.000000  0.753134  0.849269
7  6.000000  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  7.000000  0.990536  0.436903
>>> dfrm[['b','c']][dfrm.a > 0.5] = 100*np.random.rand(7,2)
>>> dfrm
          a         b         c
0  1.000000  0.062613  0.147684
1  0.369141  0.447022  0.114963
2  2.000000  0.946806  0.687971
3  3.000000  0.934799  0.633633
4  4.000000  0.065587  0.848788
5  0.433796  0.740885  0.160140
6  5.000000  0.753134  0.849269
7  6.000000  0.962267  0.453865
8  0.345706  0.030634  0.058697
9  7.000000  0.990536  0.436903
>>> dfrm[dfrm.a > 0.5] = 0.5*dfrm[dfrm.a > 0.5]
>>> dfrm
          a         b         c
0  0.500000  0.031306  0.073842
1  0.369141  0.447022  0.114963
2  1.000000  0.473403  0.343985
3  1.500000  0.467400  0.316816
4  2.000000  0.032794  0.424394
5  0.433796  0.740885  0.160140
6  2.500000  0.376567  0.424635
7  3.000000  0.481133  0.226933
8  0.345706  0.030634  0.058697
9  3.500000  0.495268  0.218452
>>>
Second update:
Here's some super unexpected behavior:
In [681]: id(dfrm.a)
Out[681]: 298480536

In [682]: id(dfrm.a)
Out[682]: 298480536

In [683]: id(dfrm.a)
Out[683]: 298480536

In [684]: id(dfrm['a'])
Out[684]: 298480536

In [685]: id(dfrm['a'])
Out[685]: 298480536

In [686]: id(dfrm['a'])
Out[686]: 298480536

In [687]: id(dfrm[['a']])
Out[687]: 281536912

In [688]: id(dfrm[['a']])
Out[688]: 281535824

In [689]: id(dfrm[['a']])
Out[689]: 281536336
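One way to read the changing ids is that selecting with a list of columns builds a new DataFrame on every call. The sketch below checks only that part; the names `first` and `second` are illustrative, and whether `dfrm['a']` hands back the same object each time seems to vary between pandas versions, so that half is left out.

```python
import numpy as np
import pandas as pd

dfrm = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])

# Keep both results alive so their ids cannot be recycled by the interpreter.
first = dfrm[['a']]
second = dfrm[['a']]

same_object = first is second       # are they one and the same object?
same_values = first.equals(second)  # do they hold identical data?
print(same_object, same_values)
```

Holding both references matters here: comparing `id()` of two temporaries that have already been freed can give misleading matches.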
Assigning through two or more getitems/slices (chaining) may or may not work, depending on the situation...
You should avoid doing it!! Rewrite each assignment so that it happens in one pass.
There was quite a substantial amount of work in 0.11 (and possibly before) to clear up this behaviour... pandas overloads these assignments so that it does not matter whether the selection is a view or a copy, as long as you do it in one pass, which is what you should be doing in general.
Example:
dfrm.loc[dfrm.a > 0.5, 'a'] = [1, 2, 3, 4, 5, 6]
dfrm.loc[dfrm.a > 0.5, ['b','c']] = 100 * np.random.rand(6, 2)
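To see why the one-pass form matters, here is a minimal sketch contrasting the chained and one-pass versions of the `['b','c']` assignment. It assumes a reasonably recent pandas; the scalar `-1.0`, the `mask` and `before` names, and the forced value in row 0 are illustrative choices, and warnings are silenced only because the chained line is expected to trigger one.

```python
import warnings
import numpy as np
import pandas as pd

dfrm = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
dfrm.loc[0, 'a'] = 0.9          # make sure at least one row matches the mask
mask = dfrm['a'] > 0.5
before = dfrm.copy()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Chained: dfrm[['b', 'c']] is a temporary copy, so the write lands on it.
    dfrm[['b', 'c']][mask] = -1.0

chained_changed_original = not dfrm.equals(before)

# One pass: a single .loc call selects the rows and the columns together.
dfrm.loc[mask, ['b', 'c']] = -1.0
one_pass_changed_original = bool((dfrm.loc[mask, ['b', 'c']] == -1.0).all().all())

print(chained_changed_original, one_pass_changed_original)
```

The chained line leaves `dfrm` untouched, which matches what the sessions in the question show for `dfrm[['b','c']][dfrm.a > 0.5] = ...`.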
Also, it's good practice to specify that you are indexing by label (with loc):
dfrm.loc[dfrm.a > 0.5] = 100 * np.random.rand(6, 3)
You could consider rewriting:
dfrm.loc[dfrm.a > 0.5] = 0.5 * dfrm.loc[dfrm.a > 0.5]
to
dfrm.loc[dfrm.a > 0.5] *= 0.5
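As a quick check that the two spellings agree, the sketch below applies each one to its own copy of the same frame and compares the results. The names `base`, `explicit`, `in_place`, and `mask` are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
mask = base['a'] > 0.5

# Explicit form: read the masked rows, scale them, write them back.
explicit = base.copy()
explicit.loc[mask] = 0.5 * explicit.loc[mask]

# Augmented form: the same read, scale, and write in one statement.
in_place = base.copy()
in_place.loc[mask] *= 0.5

same_result = np.allclose(explicit.values, in_place.values)
untouched_rows_kept = np.allclose(in_place.loc[~mask].values,
                                  base.loc[~mask].values)
print(same_result, untouched_rows_kept)
```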
The AssertionError from the last assignment is a surprising error in 0.8.1 (though it seems to be fixed in later versions). A possible workaround (if the above doesn't work) is to set the fancy index first (df_a_gt_half = dfrm.a > 0.5) and then do the assignment using that... and you may be forced to use ix rather than loc.
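A minimal sketch of the precomputed-mask idea follows. It uses `loc` rather than `ix`, since `ix` was removed in pandas 1.0; the names `n` and `new_bc` and the forced value in row 0 are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

dfrm = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
dfrm.loc[0, 'a'] = 0.9                  # make sure at least one row matches

df_a_gt_half = dfrm['a'] > 0.5          # evaluate the condition once
n = int(df_a_gt_half.sum())             # number of selected rows

# After this write the selected rows no longer satisfy a > 0.5 ...
dfrm.loc[df_a_gt_half, 'a'] = 0.0

# ... but the saved mask still points at the rows chosen originally.
new_bc = 100 * np.random.rand(n, 2)
dfrm.loc[df_a_gt_half, ['b', 'c']] = new_bc
```

Saving the mask also avoids a subtle trap: once column `a` has been overwritten, re-evaluating `dfrm.a > 0.5` may select a different set of rows than the first assignment did.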