Go to file
J. Fernando Sánchez 0bebed2ff7 First commit
2018-08-25 12:23:46 +02:00
.gitignore First commit 2018-08-25 12:23:46 +02:00
convert.py First commit 2018-08-25 12:23:46 +02:00
main.go First commit 2018-08-25 12:23:46 +02:00
README.md First commit 2018-08-25 12:23:46 +02:00
sqlite-twitter First commit 2018-08-25 12:23:46 +02:00
sqlite.sh First commit 2018-08-25 12:23:46 +02:00

This is a quick comparison of three different approaches to load a big CSV/TSV file into a sqlite database. As input, I used a sample of ~125MB from the Twitter Social Graph 2009 dataset.

TL;DR sqlite has an .import function that works wonderfully.

Results

Python

  time python2 convert.py sample-twitter_rv.net        
Done: 9999999 lines. 131128444 / 131128444 bytes (100.0 %)
python2 convert.py sample-twitter_rv.net  63.96s user 27.51s system 63% cpu 2:23.53 total

This is the output of cProfile:

python2 -m cProfile convert.py sample-twitter_rv.net 
Done: 9999999 lines. 131128444 / 131128444 bytes (100.0 %)
         50006080 function calls (50006073 primitive calls) in 241.581 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  241.581  241.581 convert.py:1(<module>)
        1   12.686   12.686  241.576  241.576 convert.py:18(main)
 10000000    5.114    0.000  157.256    0.000 convert.py:7(addusers)
 10000004  152.164    0.000  152.164    0.000 {method 'execute' of 'sqlite3.Connection' objects}
     1001   66.972    0.067   66.972    0.067 {method 'commit' of 'sqlite3.Connection' objects}
 10000002    2.471    0.000    2.471    0.000 {method 'split' of 'str' objects}
 10000000    1.103    0.000    1.103    0.000 {method 'strip' of 'str' objects}
 10000000    1.015    0.000    1.015    0.000 {len}
     1001    0.032    0.000    0.046    0.000 convert.py:12(update_progress)
     1001    0.011    0.000    0.011    0.000 {method 'format' of 'str' objects}
     1000    0.005    0.000    0.005    0.000 {method 'tell' of 'file' objects}
        1    0.000    0.000    0.005    0.005 __init__.py:24(<module>)
        1    0.002    0.002    0.004    0.004 dbapi2.py:24(<module>)
     2002    0.003    0.000    0.003    0.000 {method 'write' of 'file' objects}
        1    0.001    0.001    0.002    0.002 collections.py:11(<module>)

Golang

With Transactions

  time ./twitter-sqlite sample-twitter_rv.net  
 131128444/131128444  Bytes (100.000%) 10000000 lines - 0.00 Bps (avg. 1656955.51 Bps)
./sqlite-twitter sample-twitter_rv.net  71.16s user 24.15s system 120% cpu 1:19.15 total

Using the pprof module, I could also extract some profiling information:

  go tool pprof sqlite-twitter tx
File: sqlite-twitter
Build ID: 708e90eba7948cb0851dfbf3bb6170ccaa418eff
Type: cpu
Time: Aug 24, 2018 at 4:26pm (CEST)
Duration: 1.26mins, Total samples = 1.47mins (117.03%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) cum
(pprof) top20
Showing nodes accounting for 65.26s, 73.82% of 88.40s total
Dropped 222 nodes (cum <= 0.44s)
Showing top 20 nodes out of 70
      flat  flat%   sum%        cum   cum%
     0.09s   0.1%   0.1%     58.96s 66.70%  main.main
     0.44s   0.5%   0.6%     53.39s 60.40%  database/sql.(*Tx).StmtContext
     0.11s  0.12%  0.72%     48.59s 54.97%  database/sql.asString
     0.06s 0.068%  0.79%     47.55s 53.79%  github.com/mattn/go-sqlite3.(*SQLiteConn).Prepare
    43.32s 49.00% 49.80%     44.27s 50.08%  runtime.c128hash
     0.02s 0.023% 49.82%     40.16s 45.43%  net/url.Values.Encode
     0.02s 0.023% 49.84%     39.79s 45.01%  github.com/mattn/go-sqlite3.(*SQLiteConn).lastError.func3
     0.24s  0.27% 50.11%     23.36s 26.43%  runtime.findrunnable
     0.03s 0.034% 50.15%     23.24s 26.29%  runtime.casgstatus.func3
     0.44s   0.5% 50.64%     16.84s 19.05%  runtime.forEachP
     0.17s  0.19% 50.84%     15.17s 17.16%  internal/poll.runtime_pollSetDeadline
    14.98s 16.95% 67.78%     14.98s 16.95%  runtime.duffcopy
     0.41s  0.46% 68.25%     10.67s 12.07%  runtime.startm
     0.17s  0.19% 68.44%      8.62s  9.75%  runtime.panicdottypeI
     0.11s  0.12% 68.56%      7.25s  8.20%  runtime.needm
     0.31s  0.35% 68.91%      6.97s  7.88%  runtime.panicdottypeE
     0.33s  0.37% 69.29%      6.90s  7.81%  github.com/mattn/go-sqlite3.(*SQLiteDriver).Open
     1.81s  2.05% 71.33%      6.38s  7.22%  runtime.newm1
     0.09s   0.1% 71.44%      4.75s  5.37%  runtime.startlockedm
     2.11s  2.39% 73.82%      4.66s  5.27%  runtime.schedtrace

Raw statements and fmt.Sprintf

Plain awful. I didn't even complete the test. I suspect Exec() starts a new transaction, even though some sources claim it does not if you use a single string (e.g. with fmt.Sprint).

Just one transaction

  time ./twitter-sqlite sample-twitter_rv.net  
./sqlite-twitter sample-twitter_rv.net  67.94s user 20.34s system 129% cpu 1:08.10 total
  go tool pprof sqlite-twitter tx
File: sqlite-twitter
Build ID: 7ec752e835de12b94418fffb45515e1b0f89e89f
Type: cpu
Time: Aug 24, 2018 at 4:57pm (CEST)
Duration: 1.25mins, Total samples = 1.46mins (117.52%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) cum
(pprof) top20
Showing nodes accounting for 64.61s, 73.59% of 87.80s total
Dropped 207 nodes (cum <= 0.44s)
Showing top 20 nodes out of 70
      flat  flat%   sum%        cum   cum%
     0.08s 0.091% 0.091%     55.82s 63.58%  main.updateStatus
     0.49s  0.56%  0.65%     51.68s 58.86%  database/sql.(*Tx).StmtContext
     0.12s  0.14%  0.79%     46.56s 53.03%  database/sql.asString
     0.11s  0.13%  0.91%     45.52s 51.85%  github.com/mattn/go-sqlite3.(*SQLiteConn).Prepare
    41.12s 46.83% 47.74%     41.97s 47.80%  runtime.c128hash
     0.01s 0.011% 47.76%     38.64s 44.01%  github.com/mattn/go-sqlite3.(*SQLiteConn).lastError.func3
     0.05s 0.057% 47.81%     38.64s 44.01%  net/url.Values.Encode
     0.38s  0.43% 48.25%     25.40s 28.93%  runtime.findrunnable
     0.04s 0.046% 48.29%     25.25s 28.76%  runtime.casgstatus.func3
     0.51s  0.58% 48.87%     17.72s 20.18%  runtime.forEachP
     0.30s  0.34% 49.21%     15.93s 18.14%  internal/poll.runtime_pollSetDeadline
    15.61s 17.78% 66.99%     15.61s 17.78%  runtime.duffcopy
     0.46s  0.52% 67.52%     11.50s 13.10%  runtime.startm
     0.15s  0.17% 67.69%      9.12s 10.39%  runtime.panicdottypeI
     1.71s  1.95% 69.64%      7.39s  8.42%  runtime.newm1
     0.06s 0.068% 69.70%      7.34s  8.36%  runtime.needm
     0.36s  0.41% 70.11%      7.28s  8.29%  runtime.panicdottypeE
     0.45s  0.51% 70.63%      5.93s  6.75%  github.com/mattn/go-sqlite3.(*SQLiteDriver).Open
     2.48s  2.82% 73.45%      5.72s  6.51%  runtime.schedtrace
     0.12s  0.14% 73.59%      4.77s  5.43%  runtime.startlockedm

CLI

Indexing first

 time sh sqlite.sh sample-twitter_rv.net
sh sqlite.sh sample-twitter_rv.net  25.18s user 6.67s system 91% cpu 34.900 total

Indexing afterwards

 time sh sqlite.sh sample-twitter_rv.net
sh sqlite.sh sample-twitter_rv.net  14.91s user 1.30s system 84% cpu 19.279 total

Comments

There are way too many knobs to fiddle with, and I know very little about sqlite or SQL in general. This is a very specific use-case, and I've tried to tune the settings accordingly.

Python was the easiest one to try. It is the language I'm more familiar with, and sqlite3 is included in the standard library, so only this file is needed.

In Go, I tried compiling with go build in my machine and copying the binary to a remote host. I couldn't run it, apparently due to a mismatched glibc version or LDPATH. Instead, I had to use: go build -ldflags "-linkmode external -extldflags -static" . . It raises a warning, but I had no issue in my tests.

In the end, the sqlite command line was the fastest of the three, and very easy to set up.

If the file you are working with is sorted and without duplicates, the best option is to create the indexes after all the data has been loaded. You will also have to start over if the import fails or is interrupted. It is harder to remove duplicates afterwards in such a big dataset, and the cleanest solution is to simply copy all the unique entries to a new table and delete the old one.