Julia snippets: An update of my tutorial on DataFrames

Since my last post about my tutorial to DataFrames I had switched to use them a lot more in my daily Julia workflow. Based on the experience I have added three sections to it:

performance recommendations;
possible pitfalls when using DataFrames;
useful packages, currently FreqTables and DataFramesMeta.

Actually all three sections would benefit from the experience of the community, so if you have a comment please make an issue or PR on https://github.com/bkamins/Julia-DataFrames-Tutorial.

As for useful packages I have tried to use raw DataFrames to reduce dependencies of my code, but actually FreqTables and DataFramesMeta helped me a lot and did not give too much mental overhead of things to remember.

I would like to especially recommend FreqTables - a small package, but really useful. I would say that it deserves much more attention from the community than it gets (looking at the number of stars). So let me write a bit about it.

There are three reasons I like it:

simply I make contingency tables almost all the time; previously I have used countmap from StatsBase a lot, but it returns a dictionary which is not very handy; freqtable returns a much nicer result (e.g. if possible it is sorted) and allows for more than one dimension;
with freqtable I can use vectors or work on data frames, it nicely handles missings and allows for weighting;
freqtable is faster than countmap (I was surprised when I learned this, maybe not a critical thing but a nice plus).

So how does the output from freqtable look? Here is a sampler:

julia> using DataFrames, FreqTables

julia> srand(1); df = DataFrame(rand(1:3, 10, 2))

10×2 DataFrames.DataFrame

│ Row │ x1 │ x2 │

├─────┼────┼────┤

│ 1 │ 3 │ 2 │

│ 2 │ 3 │ 1 │

│ 3 │ 3 │ 1 │

│ 4 │ 3 │ 2 │

│ 5 │ 1 │ 2 │

│ 6 │ 1 │ 2 │

│ 7 │ 1 │ 3 │

│ 8 │ 2 │ 3 │

│ 9 │ 1 │ 1 │

│ 10 │ 1 │ 2 │

julia> freqtable(df, :x1, :x2)

3×3 Named Array{Int64,2}

x1 ╲ x2 │ 1 2 3

────────┼────────

1 │ 1 3 1

2 │ 0 0 1

3 │ 2 2 0

And now a simple benchmark against countmap :

julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools

julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);

julia> @benchmark freqtable($x)

BenchmarkTools.Trial:

memory estimate: 25.89 KiB

allocs estimate: 83

--------------

minimum time: 24.246 ms (0.00% GC)

median time: 24.672 ms (0.00% GC)

mean time: 25.425 ms (0.00% GC)

maximum time: 39.739 ms (0.00% GC)

--------------

samples: 197

evals/sample: 1

julia> @benchmark countmap($x)

BenchmarkTools.Trial:

memory estimate: 6.61 KiB

allocs estimate: 10

--------------

minimum time: 42.230 ms (0.00% GC)

median time: 42.813 ms (0.00% GC)

mean time: 43.110 ms (0.00% GC)

maximum time: 46.244 ms (0.00% GC)

--------------

samples: 116

evals/sample: 1

julia> @benchmark freqtable($y)

BenchmarkTools.Trial:

memory estimate: 10.16 KiB

allocs estimate: 76

--------------

minimum time: 1.064 ms (0.00% GC)

median time: 1.112 ms (0.00% GC)

mean time: 1.129 ms (0.09% GC)

maximum time: 3.485 ms (66.72% GC)

--------------

samples: 4403

evals/sample: 1

julia> @benchmark countmap($y)

BenchmarkTools.Trial:

memory estimate: 6.61 KiB

allocs estimate: 10

--------------

minimum time: 87.141 ms (0.00% GC)

median time: 88.167 ms (0.00% GC)

mean time: 88.510 ms (0.00% GC)

maximum time: 92.177 ms (0.00% GC)

--------------

samples: 57

evals/sample: 1

julia> @benchmark freqtable($z)

BenchmarkTools.Trial:

memory estimate: 45.81 MiB

allocs estimate: 2000285

--------------

minimum time: 75.712 ms (3.94% GC)

median time: 77.057 ms (3.94% GC)

mean time: 77.346 ms (4.16% GC)

maximum time: 83.298 ms (3.35% GC)

--------------

samples: 65

evals/sample: 1

julia> @benchmark countmap($z)

BenchmarkTools.Trial:

memory estimate: 6.61 KiB

allocs estimate: 10

--------------

minimum time: 81.931 ms (0.00% GC)

median time: 83.128 ms (0.00% GC)

mean time: 83.472 ms (0.00% GC)

maximum time: 89.977 ms (0.00% GC)

--------------

samples: 60

evals/sample: 1

As you can see freqtable does really a good job on different types of inputs.

Actually there is a third way to do a similar using by form DataFrames which is also quite fast but it is more messy. freqtable is more specialized - does one job, but does it well.

Here are the benchmarks of by:

julia> @benchmark by(DataFrame(x = $x), :x, nrow)

BenchmarkTools.Trial:

memory estimate: 38.91 MiB

allocs estimate: 5986

--------------

minimum time: 28.946 ms (1.56% GC)

median time: 34.440 ms (14.82% GC)

mean time: 34.291 ms (14.80% GC)

maximum time: 41.079 ms (20.70% GC)

--------------

samples: 146

evals/sample: 1

julia> @benchmark by(DataFrame(x = $y), :x, nrow)

BenchmarkTools.Trial:

memory estimate: 38.92 MiB

allocs estimate: 6198

--------------

minimum time: 44.810 ms (3.52% GC)

median time: 50.244 ms (10.53% GC)

mean time: 49.715 ms (10.38% GC)

maximum time: 56.052 ms (17.67% GC)

--------------

samples: 101

evals/sample: 1

julia> @benchmark by(DataFrame(x = $z), :x, nrow)

BenchmarkTools.Trial:

memory estimate: 38.91 MiB

allocs estimate: 5986

--------------

minimum time: 46.891 ms (0.93% GC)

median time: 53.657 ms (10.12% GC)

mean time: 52.539 ms (9.68% GC)

maximum time: 60.736 ms (16.24% GC)

--------------

samples: 96

evals/sample: 1

7 comments:

Tamas K PappDecember 20, 2017 at 11:12 AM
Thanks for mentioning FreqTables, I did not know about it but it turns out to be very useful.
Milan Bouchet-ValatDecember 21, 2017 at 10:34 PM
Thanks for the publicity! These functions should ideally live in StatsBase, but that's not possible without introducing a dependency on NamedArrays, which is blocked in particular by the competition with AxisArrays.

FWIW, StatsBase.countmap() should be faster in recent StatsBase releases, normally even faster than freqtable() with bits types which can work with radix sort (at the expense of allocating a copy of the input).
xiaodaiJanuary 1, 2018 at 12:01 AM
Yes from my own testing `countmap` is faster; my `Pkg.status("StatsBase")` gives `# - StatsBase 0.19.4`.

Also you may want to consider the unpublished (as of 1Jan2018) `fastby` from [FastGroupBy.jl](https://github.com/xiaodaigh/FastGroupBy.jl/blob/master/README.md). It allows you to do more than frequency counts for each by-group.

(Shameless self-promotion aside) Eventually many of `FastGroupBy.jl` functionalities will pushed to `DataFrame` and `StatsBase` etc too.
xiaodaiJanuary 1, 2018 at 10:50 AM
yes there is support for categorical array and pooledarray but only in the function sumby. will add more. actually the faster countmap in StatsBase was contributed by me and i will continue to contribute more these fastby function become mature

Note: Only a member of this blog may post a comment.

Wednesday, December 20, 2017

An update of my tutorial on DataFrames

7 comments: