python - Is it worth using IPython parallel with scipy's eig?
I'm writing code that has to compute large numbers of eigenvalue problems (typical matrix dimension is a few hundred). I was wondering whether it is possible to speed up the process using the IPython.parallel module. As a former MATLAB user and Python newbie I was looking for something similar to MATLAB's parfor
...
Following some tutorials online I wrote a simple code to check whether it speeds up the computation at all, and found out that it doesn't, and often slows it down (it is case dependent). I think I might be missing a point here; maybe scipy.linalg.eig is implemented in such a way that it already uses all the cores available, and by trying to parallelise it I interrupt the engine management.
Here is the 'parallel' code:
import numpy as np
from scipy.linalg import eig
from IPython import parallel

# create matrices
matrix_size = 300
matrices = {}

for i in range(100):
    matrices[i] = np.random.rand(matrix_size, matrix_size)

rc = parallel.Client()
lview = rc.load_balanced_view()
results = {}

# compute eigenvalues
for i in range(len(matrices)):
    asyncresult = lview.apply(eig, matrices[i], right=False)
    results[i] = asyncresult

for i, asyncresult in results.iteritems():
    results[i] = asyncresult.get()
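As an aside, the same distribution can be written more compactly with the view's map_sync; this is only a sketch reusing the rc and lview objects from above, with functools.partial baking in the keyword argument so that only the matrices are shipped to the engines:

from functools import partial

# Sketch: the same work as the loop above, expressed with map_sync.
# partial() fixes right=False so eig_novec only needs the matrix argument.
eig_novec = partial(eig, right=False)
matrix_list = [matrices[i] for i in range(len(matrices))]
results_map = lview.map_sync(eig_novec, matrix_list)  # blocks until all engines are done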
The non-parallelised variant:
# no parallel
for i in range(len(matrices)):
    results[i] = eig(matrices[i], right=False)
The difference in CPU time between the two is subtle. If, on top of the eigenvalue problem, the parallelised function has to do some more matrix operations it starts to last forever, i.e. at least 5 times longer than the non-parallelised variant.
Am I right that eigenvalue problems are not suited to this kind of parallelisation, or am I missing the whole point?

Many thanks!
Edited 29 Jul 2013; 12:20 BST
Following moarningsun's suggestion I tried running eig while fixing the number of threads with mkl.set_num_threads. For a 500-by-500 matrix the minimum times of 50 repetitions are set out below:
No. of threads    Minimum time (timeit)    CPU usage (Task Manager)
=================================================================
 1                0.4513775764796151       12-13%
 2                0.36869288559927327      25-27%
 3                0.34014644287680085      38-41%
 4                0.3380558903450037       49-53%
 5                0.33508234276183657      49-53%
 6                0.3379019065051807       49-53%
 7                0.33858615048501406      49-53%
 8                0.34488405094054997      49-53%
 9                0.33380300334101776      49-53%
10                0.3288481198342197       49-53%
11                0.3512653110685733       49-53%
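For reference, a timing loop along these lines could look roughly like the following. This is only a sketch, not the exact script used; it assumes the mkl module that provides set_num_threads (as used in the answer's code further down) is importable:

from __future__ import print_function
import timeit
import numpy as np
from scipy.linalg import eig
import mkl                            # assumed available: provides set_num_threads

a = np.random.rand(500, 500)          # 500-by-500 test matrix, as above

for n_threads in range(1, 12):
    mkl.set_num_threads(n_threads)    # fix the number of MKL threads
    # minimum of 50 single-call repetitions, as in the table above
    t = min(timeit.repeat(lambda: eig(a, right=False), repeat=50, number=1))
    print(n_threads, t)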
Apart from the 1-thread case there is no substantial difference (maybe 50 samples is a bit small...). I still think I'm missing the point and that a lot could be done to improve the performance, but I'm not sure how. These were run on a 4-core machine with hyperthreading enabled, giving 4 virtual cores.

Thanks for any input!
Interesting problem. Because I think it should be possible to achieve better scaling, I investigated the performance with a small "benchmark". With this test I compared the performance of single- and multi-threaded eig (multi-threading being delivered through the MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what difference it makes, I varied the view type, the number of engines and MKL threading, as well as the method of distributing the matrices over the engines.
Here are the results on an old AMD dual-core system:
 m_size=300, n_mat=64, repeat=3
+------------------------------------+----------------------+
|              settings              |    speedup factor    |
+--------+------+------+-------------+-----------+----------+
|  func  | neng | nmkl |  view type  | vs single | vs multi |
+--------+------+------+-------------+-----------+----------+
| ip_map |  2   |  1   | direct_view |   1.67    |   1.62   |
| ip_map |  2   |  1   |  loadb_view |   1.60    |   1.55   |
| ip_map |  2   |  2   | direct_view |   1.59    |   1.54   |
| ip_map |  2   |  2   |  loadb_view |   0.94    |   0.91   |
| ip_map |  4   |  1   | direct_view |   1.69    |   1.64   |
| ip_map |  4   |  1   |  loadb_view |   1.61    |   1.57   |
| ip_map |  4   |  2   | direct_view |   1.15    |   1.12   |
| ip_map |  4   |  2   |  loadb_view |   0.88    |   0.85   |
| parfor |  2   |  1   | direct_view |   0.81    |   0.79   |
| parfor |  2   |  1   |  loadb_view |   1.61    |   1.56   |
| parfor |  2   |  2   | direct_view |   0.71    |   0.69   |
| parfor |  2   |  2   |  loadb_view |   0.94    |   0.92   |
| parfor |  4   |  1   | direct_view |   0.41    |   0.40   |
| parfor |  4   |  1   |  loadb_view |   1.62    |   1.58   |
| parfor |  4   |  2   | direct_view |   0.34    |   0.33   |
| parfor |  4   |  2   |  loadb_view |   0.90    |   0.88   |
+--------+------+------+-------------+-----------+----------+
As you can see, the performance gain varies over the different settings used, with a maximum of 1.64 times that of the regular multi-threaded eig. In these results the parfor function you used performs badly unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).
Varying the matrix size also gives a noteworthy difference. The speedup of using ip_map on a direct_view with 4 engines and MKL threading disabled, vs the regular multi-threaded eig:
 n_mat=32, repeat=3
+--------+----------+
| m_size | vs multi |
+--------+----------+
|   50   |   0.78   |
|  100   |   1.44   |
|  150   |   1.71   |
|  200   |   1.75   |
|  300   |   1.68   |
|  400   |   1.60   |
|  500   |   1.57   |
+--------+----------+
Apparently for relatively small matrices there is a performance penalty, for intermediate sizes the speedup is largest, and for larger matrices the speedup decreases again. In my opinion you could achieve a performance gain of 1.75, which would make using IPython.parallel worthwhile.
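Put together, the best-performing configuration from the tables above (ip_map on a direct_view with MKL threading disabled on the engines) would be set up roughly like this. This is only a sketch: it assumes an ipcluster with 4 engines is already running, that the mkl module is importable on the engines, and the matrix shape is just an illustrative choice:

from functools import partial
import numpy as np
from IPython import parallel
from scipy.linalg import eig
from mkl import set_num_threads

eig_novec = partial(eig, right=False)        # compute eigenvalues only

rc = parallel.Client()
dview = rc.direct_view(range(4))             # direct view over 4 engines
dview.apply_sync(set_num_threads, 1)         # single-threaded MKL on each engine

matrices = np.random.rand(32, 200, 200)      # e.g. n_mat=32, m_size=200
eigenvalues = dview.map_sync(eig_novec, matrices)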
I did some tests earlier on an Intel dual-core laptop as well, but I got some funny results, apparently because the laptop was overheating. On that system the speedups were generally a little lower, around 1.5-1.6 max.
Now I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things I'm perhaps not aware of. And, last but not least, whether it's worth it depends on how much of a performance gain you think is worthwhile.
The code I used:
from __future__ import print_function
from numpy.random import rand
from IPython.parallel import Client
from mkl import set_num_threads
from timeit import default_timer as clock
from scipy.linalg import eig
from functools import partial
from itertools import product

eig = partial(eig, right=False)  # make the desired keyword arg standard

class Bench(object):
    def __init__(self, m_size, n_mat, repeat=3):
        self.n_mat = n_mat
        self.matrix = rand(n_mat, m_size, m_size)
        self.repeat = repeat
        self.rc = Client()

    def map(self):
        results = map(eig, self.matrix)

    def ip_map(self):
        results = self.view.map_sync(eig, self.matrix)

    def parfor(self):
        results = {}
        for i in range(self.n_mat):
            results[i] = self.view.apply_async(eig, self.matrix[i,:,:])
        for i in range(self.n_mat):
            results[i] = results[i].get()

    def timer(self, func):
        t = clock()
        func()
        return clock() - t

    def run(self, func, n_engines, n_mkl, view_method):
        self.view = view_method(range(n_engines))
        self.view.apply_sync(set_num_threads, n_mkl)
        set_num_threads(n_mkl)
        return min(self.timer(func) for _ in range(self.repeat))

    def run_all(self):
        funcs = self.ip_map, self.parfor
        n_engines = 2, 4
        n_mkls = 1, 2
        views = self.rc.direct_view, self.rc.load_balanced_view

        times = []
        for n_mkl in n_mkls:
            args = self.map, 0, n_mkl, views[0]
            times.append(self.run(*args))
        for args in product(funcs, n_engines, n_mkls, views):
            times.append(self.run(*args))
        return times
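A hypothetical way of driving this class, using the settings from the first table (m_size=300, n_mat=64, repeat=3); it assumes an ipcluster with at least 4 engines is already running:

# Hypothetical usage: run the full benchmark grid and collect the raw timings
bench = Bench(m_size=300, n_mat=64, repeat=3)
times = bench.run_all()
print(times)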
I don't know if it matters, but to start the 4 IPython parallel engines I typed at the command line:
ipcluster start -n 4
Hope it helps :)