- performance recommendations;
- possible pitfalls when using DataFrames;
- useful packages, currently FreqTables and DataFramesMeta.
All three sections would benefit from the experience of the community, so if you have a comment, please open an issue or PR at https://github.com/bkamins/Julia-DataFrames-Tutorial.
As for useful packages, I have tried to stick to raw DataFrames to reduce the dependencies of my code, but FreqTables and DataFramesMeta have helped me a lot without adding much mental overhead in terms of things to remember.
I would especially like to recommend FreqTables: a small package, but a really useful one. I would say that it deserves much more attention from the community than it gets (judging by the number of stars). So let me write a bit about it.
There are three reasons I like it:
- I make contingency tables almost all the time; previously I used countmap from StatsBase a lot, but it returns a dictionary, which is not very handy; freqtable returns a much nicer result (e.g. it is sorted if possible) and allows for more than one dimension;
- freqtable accepts vectors or works on data frames, handles missing values nicely, and allows for weighting;
- freqtable is faster than countmap (I was surprised when I learned this, maybe not a critical thing but a nice plus).
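To make the first two points concrete, here is a minimal sketch on toy data. The vectors v, w and vm are made up for illustration, and wrapping the weights in Weights from StatsBase is an assumption on my side that should work across FreqTables versions.

```julia
using FreqTables, StatsBase

v = [3, 1, 2, 1, 3, 1]           # toy data, not taken from the benchmark

cm = countmap(v)                 # a Dict; iteration order is not guaranteed
ft = freqtable(v)                # a named vector with sorted names: 1, 2, 3

# Weighted counts: each observation contributes its weight instead of 1
w = [0.5, 1.0, 1.0, 1.0, 0.5, 2.0]
fw = freqtable(v, weights = Weights(w))

# Missing values get their own entry unless skipmissing = true is passed
vm = [1, missing, 1, 2]
fm = freqtable(vm)
fs = freqtable(vm, skipmissing = true)
```

Note that indexing the result with an integer, as in ft[1], is positional; here positions and names happen to coincide because the names are 1, 2, 3.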
So what does the output from freqtable look like? Here is a sample:
julia> using DataFrames, FreqTables
julia> srand(1); df = DataFrame(rand(1:3, 10, 2))
10×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
│ 1 │ 3 │ 2 │
│ 2 │ 3 │ 1 │
│ 3 │ 3 │ 1 │
│ 4 │ 3 │ 2 │
│ 5 │ 1 │ 2 │
│ 6 │ 1 │ 2 │
│ 7 │ 1 │ 3 │
│ 8 │ 2 │ 3 │
│ 9 │ 1 │ 1 │
│ 10 │ 1 │ 2 │
julia> freqtable(df, :x1, :x2)
3×3 Named Array{Int64,2}
x1 ╲ x2 │ 1 2 3
1 │ 1 3 1
2 │ 0 0 1
3 │ 2 2 0
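Random number streams differ between Julia versions, so the seeded call above may not give the same data frame on your machine. A small sketch that sidesteps this is to type in the ten rows shown above directly; the variable name tbl is mine.

```julia
using DataFrames, FreqTables

# The same ten rows as printed in the session above, entered by hand
df = DataFrame(x1 = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1],
               x2 = [2, 1, 1, 2, 2, 2, 3, 3, 1, 2])

tbl = freqtable(df, :x1, :x2)    # rows are x1 levels, columns are x2 levels

tbl[1, 2]                        # count of rows with x1 == 1 and x2 == 2
sum(tbl) == nrow(df)             # every row is counted exactly once
```

This should reproduce the 3×3 table printed above regardless of the random number generator.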
And now a simple benchmark against countmap:
julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools
julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);
julia> @benchmark freqtable($x)
memory estimate: 25.89 KiB
allocs estimate: 83
minimum time: 24.246 ms (0.00% GC)
median time: 24.672 ms (0.00% GC)
mean time: 25.425 ms (0.00% GC)
maximum time: 39.739 ms (0.00% GC)
samples: 197
evals/sample: 1
julia> @benchmark countmap($x)
memory estimate: 6.61 KiB
allocs estimate: 10
minimum time: 42.230 ms (0.00% GC)
median time: 42.813 ms (0.00% GC)
mean time: 43.110 ms (0.00% GC)
maximum time: 46.244 ms (0.00% GC)
samples: 116
evals/sample: 1
julia> @benchmark freqtable($y)
memory estimate: 10.16 KiB
allocs estimate: 76
minimum time: 1.064 ms (0.00% GC)
median time: 1.112 ms (0.00% GC)
mean time: 1.129 ms (0.09% GC)
maximum time: 3.485 ms (66.72% GC)
samples: 4403
evals/sample: 1
julia> @benchmark countmap($y)
memory estimate: 6.61 KiB
allocs estimate: 10
minimum time: 87.141 ms (0.00% GC)
median time: 88.167 ms (0.00% GC)
mean time: 88.510 ms (0.00% GC)
maximum time: 92.177 ms (0.00% GC)
samples: 57
evals/sample: 1
julia> @benchmark freqtable($z)
memory estimate: 45.81 MiB
allocs estimate: 2000285
minimum time: 75.712 ms (3.94% GC)
median time: 77.057 ms (3.94% GC)
mean time: 77.346 ms (4.16% GC)
maximum time: 83.298 ms (3.35% GC)
samples: 65
evals/sample: 1
julia> @benchmark countmap($z)
memory estimate: 6.61 KiB
allocs estimate: 10
minimum time: 81.931 ms (0.00% GC)
median time: 83.128 ms (0.00% GC)
mean time: 83.472 ms (0.00% GC)
maximum time: 89.977 ms (0.00% GC)
samples: 60
evals/sample: 1
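As you can see, freqtable does a really good job on different types of inputs: it is faster than countmap for integers, faster by a wide margin for categorical vectors (about 1.1 ms against 88 ms), and roughly on par for strings, although there it allocates far more memory.
There is also a third way to do a similar thing: using by from DataFrames, which is also quite fast, but messier. freqtable is more specialized; it does one job, but does it well.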
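The session above was recorded on an older Julia. If you want to rerun the comparison on a current setup, a sketch could look as follows. It assumes Julia 1.x, where srand is replaced by Random.seed!, and it loads CategoricalArrays explicitly for categorical. I use @btime instead of @benchmark only to keep the output short. The timings you get will differ from the ones quoted here.

```julia
using BenchmarkTools, CategoricalArrays, FreqTables, Random, StatsBase

Random.seed!(1)              # Random.seed! replaces srand in Julia 1.x
x = rand(1:100, 10^6)        # integers, same size as in the session above
y = categorical(x)           # categorical vector
z = string.(x)               # strings

# The same six measurements as above; $ interpolates the global variables
@btime freqtable($x);
@btime countmap($x);
@btime freqtable($y);
@btime countmap($y);
@btime freqtable($z);
@btime countmap($z);
```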
Here are the benchmarks of by:
julia> @benchmark by(DataFrame(x = $x), :x, nrow)
memory estimate: 38.91 MiB
allocs estimate: 5986
minimum time: 28.946 ms (1.56% GC)
median time: 34.440 ms (14.82% GC)
mean time: 34.291 ms (14.80% GC)
maximum time: 41.079 ms (20.70% GC)
samples: 146
evals/sample: 1
julia> @benchmark by(DataFrame(x = $y), :x, nrow)
memory estimate: 38.92 MiB
allocs estimate: 6198
minimum time: 44.810 ms (3.52% GC)
median time: 50.244 ms (10.53% GC)
mean time: 49.715 ms (10.38% GC)
maximum time: 56.052 ms (17.67% GC)
samples: 101
evals/sample: 1
julia> @benchmark by(DataFrame(x = $z), :x, nrow)
memory estimate: 38.91 MiB
allocs estimate: 5986
minimum time: 46.891 ms (0.93% GC)
median time: 53.657 ms (10.12% GC)
mean time: 52.539 ms (9.68% GC)
maximum time: 60.736 ms (16.24% GC)
samples: 96
evals/sample: 1
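The by function was later removed from DataFrames. Assuming a recent release (1.x), the same per-group counts can be sketched with groupby and combine; the column name n and the smaller input size are my choices for illustration.

```julia
using DataFrames, FreqTables, Random

Random.seed!(1)                   # Julia 1.x replacement for srand
x = rand(1:100, 10^4)             # smaller than in the benchmark, for a quick check

# Modern spelling of by(DataFrame(x = x), :x, nrow)
counts_df = combine(groupby(DataFrame(x = x), :x), nrow => :n)

# freqtable sorts its names, so sort the data frame before comparing
ft = freqtable(x)
sorted_counts = sort(counts_df, :x)
same = sorted_counts.n == [ft[i] for i in 1:length(ft)]
```

Here same should be true, since both approaches count the same values and differ only in the container they return.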