python - What's the best way to insert over a hundred million rows into a SQLite database? -
i have load of data in csv format. need able index data based on single text field (the primary key), i'm thinking of entering database. i'm familiar sqlite previous projects, i've decided use engine.
after experimentation, realized that storing hundred million records in 1 table won't work well: indexing step slows crawl pretty quickly. come 2 solutions problem:
- partition data several tables
- partition data several databases
i went second solution (it yields several large files instead of 1 huge file). partition method @ first 2 characters of primary key: each partition has approximately 2 million records, , there approximately 50 partitions.
i'm doing in python sqlite3 module. keep 50 open database connections , open cursors entire duration of process. each row, @ first 2 characters of primary key, fetch right cursor via dictionary lookup, , perform single insert statement (via calling execute on cursor).
unfortunately, insert speed still decreases unbearable level after while (approx. 10 million total processed records). can around this? there better way i'm doing?
i think problem have once processing cannot use in-memory buffers hard disk head jumping randomly between 50 locations , dog slow.
something can try processing 1 subset @ time:
seen = {} # key prefixes processed while true: k0 = none # current prefix l in all_the_data: k = l[0][:2] if k not in seen: if k0 none: k0 = k if k0 == k: store_into_database(l) if k0 none: break seen.append(k0)
this n+1
passes on data (where n
number of prefixes) access 2 disk locations (one reading , 1 writing). should work better if you've separate physical devices.
ps: really sure sql database best solution problem?
Comments
Post a Comment