This document contains recipes to create specific kinds of plots. It’s aimed mainly at people who are new to the grammar of graphics. It can also be seen as the equivalent of a plot gallery.
Note that this Org file is actually a “literate programming document”, which uses ntangle. If you want to generate the recipes locally, just run:
ntangle recipes.org
Note that the recipes directory of the repo already contains these files.
The document contains two kinds of recipes.
- First the “Get To The Point” (GTTP) kind of recipe. This is a minimal example to produce a specific plot, without any fancy options, interesting data etc.
- And secondly the “Tell Me A Story” (TMAS) recipes, which try to explain every important step, typically introduce some interesting data set which we’ll investigate to finally produce a plot of a certain kind. Here also alternative ways might be presented. For some people maybe the more interesting read. :)
The recipes in this section present the simplest way to achieve the desired plot, without any superfluous talking or complicating examples.
Just a line plot of some data. Simple as that.
Some basic imports:
import ggplotnim, sequtils, seqmath
Create some data so that we have something to plot:
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
Build a dataframe from it:
let df = toDf(x, y)
and create the plot:
ggplot(df, aes("x", "y")) + # x and y are the identifiers given above as strings
geom_line() +
ggsave("media/recipes/rSimpleLinePlot.png")
This results in the following plot:
Creating a plot with a specific output size and saving it as a different filetype is very easy. It’s basically exactly the same as the plot above, with the only change being in the `ggsave` call:
import ggplotnim, sequtils, seqmath
const
width = 720
height = 480
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) +
geom_line() +
ggsave("media/recipes/rLinePlotSize.png", width = width, height = height)
You can see that `ggsave` accepts a `width` and a `height` argument. The desired output filetype is deduced from the filename extension. Supported are:
- .pdf
- .svg
- .png
The above generates the following plot:
Often one has a certain type of data for several different discrete classes. In these cases a stacked histogram may be useful. Considering the `mpg` dataset, this is done via:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", fill = "class")) +
geom_histogram() +
ggsave("media/recipes/rStackedMpgHistogram.png")
This generates:
NOTE: the first lines should actually go to 0 again. That’s a bug currently.
The same as above can also be represented with lines using `geom_freqpoly`:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", color = "class")) +
geom_freqpoly() +
ggsave("media/recipes/rStackedMpgFreqpoly.png")
This generates:
Most geoms (those which allow a `fillColor` argument / being classified by `fill`) also allow an `alpha` field since version v0.2.17. This allows changing the alpha of the fill without affecting the actual color or the normal `color` field.
For instance we can use it to make the above plot a little more pretty:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", fill = "class")) +
geom_freqpoly(alpha = 0.3) +
ggsave("media/recipes/rFreqPolyWithAlpha.png")
Notice how, if a fill color is set, the lines are always drawn down to 0! For instance `subcompact` and `pickup` reach down to the x axis.
This generates:
Given a continuous data column we may want to calculate a histogram with N bins:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy")) +
geom_histogram(bins = 20) + # by default 30 bins are used
ggsave("media/recipes/rMpgHistoNumBins.png")
This generates:
We can also set a specific bin width instead of a number of bins. E.g. we want to bin by 1.5 mpg, which can be done using the `binWidth` argument:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy")) +
geom_histogram(binWidth = 1.5) +
ggsave("media/recipes/rMpgHistoBinWidth.png")
This generates:
Or sometimes one has specific bin edges in mind one wants to investigate. This can be done via the `breaks` argument. NOTE: the given breaks are interpreted as left bin edges plus the rightmost edge of the last bin! So the example below starts at 0 on the left side and the last bin ends at 40 on the right side.
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy")) +
geom_histogram(breaks = @[0'f64, 10, 15, 19, 23, 25, 40]) +
ggsave("media/recipes/rMpgHistoCustomBreaks.png")
This generates:
While this is a bad example, there are use cases in which, aside from the bins of the histogram, points should be used to indicate some data that is binned in the same way, for instance when comparing simulations with experimental data in particle physics. In this case here we’ll just bin the point data in the same way as the histogram itself and draw the points at the bin centers.
import ggplotnim
let df = readCsv("data/mpg.csv")
let breaks = @[0'f64, 10, 15, 19, 23, 25, 40]
ggplot(df, aes("cty")) +
geom_histogram(breaks = breaks) +
geom_point(stat="bin", breaks = breaks, binPosition = "center") +
ggsave("media/recipes/rMpgHistoPlusPoints.png")
Note the two additional arguments to the `geom_point` call. `stat = "bin"` tells ggplotnim that the user wants to perform binning of the given x column (“cty”). With that setting the `breaks` argument won’t be ignored and will be applied in the same way as for the `geom_histogram` call. Finally, `binPosition = "center"` is used to draw the points not where the binned data is located (read: left bin edge), but rather in the center of the bins.
This generates:
Sometimes the black points (or the color of some other non-classified geom) might be boring. That’s why the `geom_*` procs also take a `color` argument. This just takes a `chroma.Color` object. Let’s color our points monokai pink:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("displ", "cty")) +
geom_point(color = "#F92672") +
ggsave("media/recipes/rMpgCustomColorPoint.png")
Gives:
When dealing with histograms it’s quite likely that the user has already computed the bin contents with a custom binning beforehand and simply wants to plot that information. This can also be done easily by building a DF from the prebinned data and using `stat = "identity"` as the histogram argument:
import ggplotnim
let bins = @[0, 2, 5, 9, 15]
let counts = @[0.1, 0.8, 0.3, 0.05, 0.0] # <- last element is dummy
let df = toDf({"bin_edges" : bins, "counts" : counts})
ggplot(df, aes("bin_edges", "counts")) +
geom_histogram(stat = "identity") +
ggsave("media/recipes/rPrebinnedHisto.png")
There are a couple of important things to mention here. In the case above the user hands `x` as the bin edges (!), starting from the left edge of the first bin to the right edge of the last bin. However, this means we only have N - 1 actual bins. Yet the DF requires all columns to have the same number of entries (Note: technically that’s not true, it’ll be filled up with `VNull` values, yet that’s not a nice solution and will cause other issues).

There are several ways to deal with this. Either one hands one additional dummy value in the `counts` sequence, which will be ignored for the bin content. Alternatively, one may hand only the left edges of the bins (and thus as many `bins` elements as real `counts` values) and let ggplotnim determine the bin width. This works fine as long as the bin width of the second to last bin is the same as the bin width of the last bin. So if that is the case for your data, feel free to only hand N - 1 elements.
Which generates the following:
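The alternative mentioned above (handing only left edges) can be sketched as follows; this is a minimal sketch assuming all bins have equal width, so that deducing the last bin’s width is valid, and the output filename is hypothetical:

```nim
import ggplotnim
# Hand only the N left bin edges and exactly N counts; ggplotnim deduces
# the width of the last bin from the second to last one. All bins here
# are 2 units wide, so the deduction is valid.
let bins = @[0, 2, 4, 6, 8]              # left edges only, no final right edge
let counts = @[0.1, 0.8, 0.3, 0.05, 0.2] # one real count per bin
let df = toDf({"bin_edges" : bins, "counts" : counts})
ggplot(df, aes("bin_edges", "counts")) +
  geom_histogram(stat = "identity") +
  ggsave("media/recipes/rPrebinnedHistoLeftEdges.png")
```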
Often one deals with categorical data with N classes, e.g. the different types of cars listed in the `mpg` dataset, and wishes to count the number of elements in each class. For this we can use a `geom_bar` plot. NOTE: For the moment the only function supported is the number of counts. `FormulaNode` and other Nim function support will probably be added at some point (or when desired by someone):
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class")) +
geom_bar() +
ggsave("media/recipes/rMpgSimpleBarPlot.png")
Giving us the following result:
Consider a case in which we have a datafile for N sensors, where each row corresponds to one measurement per minute. In that case we might want to plot the mean value measured for all channels in a bar plot. To achieve this, we can do the following.
We start by putting the table into long form and parsing the channel numbers to integers. Then we compute the mean by grouping by channel and finally we create the plot.
Note: Parsing the channel numbers and thus having to set the x scale as `dcDiscrete` is of course not required. But this way we combine two examples into one. :)
import ggplotnim, sequtils, seqmath, strutils
let cols = toSeq(0 .. 7).mapIt($it)
# make `parseInt` work on Values, so we can parse the long form
# `Channel` column
#liftScalarStringProc(parseInt)
let df = readCsv("data/szinti_channel_counts.txt",
sep = '\t',
colNames = cols)
.gather(cols, key = "Channel", value = "Count")
.mutate(f{string -> int: "Channel" ~ parseInt( df["Channel"][idx] )})
let dfMean = df.group_by("Channel").summarize(f{float: "Mean counts / min" << mean( c"Count" )})
# calculate mean for each channel
ggplot(dfMean, aes("Channel", "Mean counts / min")) +
geom_bar(stat = "identity", position = "identity") +
scale_x_discrete(name = "Channel number") +
ggtitle("Mean counts per channel") +
ggsave("media/recipes/rBarPlotCompStats.png")
Gives the following plot:
Sometimes the above classes may be part of some other class though, which is supposed to be split by as well. For instance whether cars of a specific class are 4WD, RWD or FWD:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class", fill = "drv")) +
geom_bar() +
ggsave("media/recipes/rMpgStackedBarPlot.png")
Results in:
Sometimes a scatter plot is supposed to highlight some continuous value of the points. For instance we can color a `geom_point` plot of the `mpg` displacement versus the highway efficiency by the city efficiency to get an even fuller picture:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("displ", "hwy", color = "cty")) +
geom_point() +
ggsave("media/recipes/rMpgContinuousColorPoints.png")
See:
Instead of using bars to show some count data classified by two columns, we can show the same thing using points instead. The same plot as above is then:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class", color = "drv")) +
geom_point(stat = "count") +
geom_line(stat = "count") +
ggsave("media/recipes/rMpgStackedPointPlot.png")
Results in:
It’s also just possible to draw a point (or line, see below) with an x axis that contains discrete data while providing a continuous y axis.
import ggplotnim
let df = readCsv("data/mpg.csv")
# coloring by class is of course not required to make this work :)
ggplot(df, aes("cyl", "hwy", color = "class")) +
geom_point() +
ggsave("media/recipes/rMpgDiscreteXScale.png")
Results in:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cyl")) +
geom_point() +
ggsave("media/recipes/rDiscreteYAxis.png")
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class", "cyl", color = "class")) +
geom_point() +
ggsave("media/recipes/rBothDiscreteAxes.png")
Inspired by #40.
A plot with a discrete X scale (using string values) and continuous y scale.
import ggplotnim, seqmath
import random
const paths = 10
const dates = 80
proc gaussian*(rnd: var Rand, mu = 0.0, sigma = 1.0): float =
var
s = 0.0
u = 0.0
v = 0.0
while s >= 1.0 or s <= 0.0:
u = 2.0 * rnd.rand(0.0..1.0) - 1.0
v = 2.0 * rnd.rand(0.0..1.0) - 1.0
s = (u * u) + (v * v)
let x = u * sqrt(-2.0 * ln(s) / s)
return (mu + (sigma * x))
proc createDataFrame(): DataFrame =
const sigma = 0.10
var rnd = initRand(124325)
var pathNames = newSeq[string](dates * paths)
var pathVals = newSeq[float](dates * paths)
var tenors = newSeq[int](dates * paths)
for j in 0 ..< paths:
pathVals[j * dates] = 100.0
pathNames[j * dates] = "path" & $(j + 1)
tenors[j * dates] = 0
for i in 1 ..< dates:
let idx = j * dates + i
pathNames[idx] = "path" & $(j + 1)
pathVals[idx] = (pathVals[idx - 1] * exp(-0.5 * sigma * sigma + sigma * gaussian(rnd)))
tenors[idx] = i
result = toDf({ "tenors" : tenors,
"pathNames" : pathNames,
"pathValues" : pathVals })
let df = createDataFrame()
ggplot(df, aes("tenors", "pathValues", color = "pathNames")) +
geom_line() +
ggsave("media/recipes/rDiscreteXLine.png")
Results in:
In many practical cases we may end up with some data `Y1` and `Y2`, which is equivalent in the phase space sense and was measured against the same variable, but is available only in the format of:

# X    Y1    Y2
x1   y1_1  y2_1
x2   y1_2  y2_2
...

One case such as this would be having two different sensors for the same property, which both took data at the same time. In those cases one probably wants a plot of X against Y1 and X against Y2 in one plot with two different colors.
The naive way to do this is the following:
import ggplotnim, sequtils, seqmath
let df = readCsv("data/50-18004.CSV")
ggplot(df) +
geom_line(aes(x = "in_s", y = "C1_in_V", color = "C1")) +
geom_line(aes(x = "in_s", y = "C2_in_V", color = "C2")) +
ggsave("media/recipes/rTwoSensorsBadStyle.png")
This generates the following:
This means that we specify the `x` and `y` aesthetics only in the two geoms and give each a color according to a string which does not represent a column of the `df`. In that way the `color` is simply set to “C1” and “C2”. While this works, it’s maybe not the nicest way to handle this, since the `gather` proc is specifically there to convert a table from this short form of X, Y1, Y2 to a long format dataframe of the type X, Key, Value, where `Key` stores the name of the previous column (or a custom name) and `Value` the value corresponding to `X` of that column. This simplifies the plotting to a single call to `geom_line` by specifying the `Channel` column as the discrete color scale:
import ggplotnim, sequtils, seqmath
let df = readCsv("data/50-18004.CSV")
let dfnew = df.gather(["C1_in_V", "C2_in_V"], key = "Channel", value = "V")
# Plotting via `df` directly causes scale problems!
ggplot(dfNew, aes("in_s", "V", color = "Channel")) +
geom_line() +
ggsave("media/recipes/rTwoSensorsGoodStyle.png")
Which then results in the following nicer plot (note that the legend now says something more useful as its title):
When dealing with discrete data which contains a large number of labels, a typical problem is that there’s not enough space for all labels. In this case we want to rotate the labels and right-align them. Consider some fake dataset of people who did some shifts, listed by year: we want a bar plot of the number of shifts each person did per year. We also need a much wider plot for this dataset:
import ggplotnim
let df = readCsv("data/fake_shifter_data.txt")
ggplot(df, aes("Shifters", fill = "Year")) +
geom_bar() +
xlab(rotate = -45.0, margin = 1.75, alignTo = "right") +
ggtitle("Number of shifts done by each shifter by year") +
ggsave("media/recipes/rBarPlotRotatedLabels.png", width = 5000, height = 1000)
NOTE: It’s recommended to generate a plot as this as a vector graphic (svg, pdf) instead of a png. However, since these recipe plots are part of the CI, we generate PNGs for all of them, since conversion from svg, pdf yields bad results on travis.
Gives us the following plot. Probably better to look at the original instead of the embedded plot, since it’s so wide.
By using `stat = "identity"` combined with `geom_bar` or `geom_histogram`, one can also plot bars with negative height.
This recipe is taken from: #64
import ggplotnim
let trials = @["A", "B", "C", "D", "E"]
let values = @[1.0, 0.5, 0, -0.5, -1.0]
let df = toDf({ "Trial" : trials,
"Value" : values })
ggplot(df, aes(x="Trial", y="Value")) +
geom_bar(stat="identity", position="identity") +
ggsave("media/recipes/rNegativeBarPlot.png")
Gives us the following plot:
Putting custom annotations onto the plot is supported. However, the styling of the annotations is still somewhat limited.

Annotations can be placed either via relative coordinates of the plot area (left, bottom) or via data coordinates (x, y).

Note that the annotation is drawn before the data!

By default a white rectangular background is drawn behind the annotation. This can be modified using the `backgroundColor` argument.
An example is shown below where we print the largest 5 values as an annotation onto the plot.
import ggplotnim
import algorithm, sequtils, strformat, strutils
# get the data from one of the other recipes
let df = readCsv("data/50-18004.CSV")
let dfnew = df.gather(["C1_in_V", "C2_in_V"], key = "Channel", value = "V")
# assume we want to create an annotation that prints the largest 5 values of
# Channel 2; get largest values, sorted by time (`in_s`)
let dfChMax = dfNew.filter(f{c"Channel" == "C2_in_V"})
.arrange("V", SortOrder.Descending)
.head(5)
.arrange("in_s") # sort again by x axis
# build an annotation:
var annot: string
let idxs = toSeq({'A'..'E'})
for j, id in idxs:
let xVal = alignLeft(formatFloat(dfChMax["in_s", j, float], precision = 2), 9)
let yVal = formatFloat(dfChMax["V", j, float], precision = 4)
annot.add &"{id}: (x: {xVal}, y: {yVal})"
if j < idxs.high:
annot.add "\n"
# create a font to use using the `ggplotnim.font` helper
let font = font(12.0, family = "monospace")
# now create the plot and put the annotation where we want it
ggplot(dfNew, aes("in_s", "V", color = "Channel")) +
geom_line() +
# either for instance in relative coordinates of the plot viewport
# Values smaller 0.0 or larger 1.0 work too. Puts the annotation outside
# of the plot
annotate(annot,
left = 0.5,
bottom = 1.0,
font = font) +
# or in data coordinates using `(x, y)`
annotate(annot,
x = -2e-6,
y = 0.06,
font = font,
backgroundColor = parseHex("FFEBB7")) +
ggsave("media/recipes/rCustomAnnotations.png")
Gives us this (somewhat ugly, but that’s not the point) plot:
It’s also possible to limit / enlarge the plotting range using `xlim` and `ylim`. The behavior of points which possibly lie outside the plotting range is determined by the `outsideRange` argument, which can take the values:
- “clip” (default): clip them to the maximum range of the limit
- “drop”: drop those points from the plot
- “none”: leave them as they are. Will potentially show them outside the plot area.

This becomes a little more complicated in combination with the `xMargin` and `yMargin` procs. See below for an example.
First of all we can use it to enlarge the x range:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cty")) +
geom_point() +
xlim(10.0, 60.0) +
ggsave("media/recipes/rEnlargeXRange.png")
which gives us:
On the other hand we can also use it to limit the range of a plot:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cty")) +
geom_point() +
xlim(10.0, 30.0) +
ggsave("media/recipes/rLimitXRange.png")
Notice how all values larger than 30.0 (compare with the plot above) are clipped to 30.0.
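To see one of the other `outsideRange` behaviors in action, here is a minimal sketch, assuming (as the description above suggests) that `outsideRange` is handed to `xlim`; the output filename is hypothetical:

```nim
import ggplotnim
let df = readCsv("data/mpg.csv")
# Same plot as above, but points outside the x range are dropped
# entirely instead of being clipped to the limit.
ggplot(df, aes("hwy", "cty")) +
  geom_point() +
  xlim(10.0, 30.0, outsideRange = "drop") +
  ggsave("media/recipes/rLimitXRangeDrop.png")
```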
Sometimes it might be nice to have an explicit area in the plot which is used to designate data points that lie outside a desired data range (or are `Inf`). In this case the `xMargin` or `yMargin` procs can be used (possibly in combination with `xlim`, `ylim` above).

Assume we want to create the same plot as above, but for some reason are only interested in y values up to 25, while still wanting to be aware of all points larger than (but not equal to!) that value. Let’s add a margin in y to achieve that.
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cty")) +
geom_point() +
ylim(5.0, 25.0) +
yMargin(0.1) +
ggsave("media/recipes/rCreateMarginBuffer.png")
Notice how all values larger than 25 appear at the top of the plot, while values smaller than (and equal to) 25 appear where they belong.
If we wish to highlight certain points of a plot with a specific geom / style we can do this in the following way.
It is based on first filtering an additional dataset down to the values to be highlighted and then using the optional `color` etc. arguments of the `geom_*` procs to set a certain style.

For instance let’s highlight the min and max values of the second channel from the “Plotting column Y1 and Y2 against X” example above.
import ggplotnim, algorithm
# base this on one of the above examples
let df = readCsv("data/50-18004.CSV")
.gather(["C1_in_V", "C2_in_V"], key = "Channel", value = "V")
# filter to Channel 2 and sort by voltage
let dfSorted = df.filter(f{c"Channel" == "C2_in_V"})
.arrange("V", SortOrder.Descending)
# get min and max
let dfMax = dfSorted.head(1)
let dfMin = dfSorted.tail(1)
ggplot(df, aes("in_s", "V", color = "Channel")) +
geom_line() + # the actual data
# add additional geom with `data =` arg and set styles.
geom_point(data = dfMax,
color = "#FF0000", # named colors (e.g. "red") are also possible!
size = 5.0,
marker = mkCross) +
geom_point(data = dfMin,
color = "#0000FF",
size = 5.0) +
ggsave("media/recipes/rHighlightMinMax.png")
Results in the following plot:
In some cases the data frame one has does not contain exactly the data we actually want to plot. Take for instance the mpg dataset, where the fuel economy is (as the name implies) given in miles per gallon. People who use a sensible unit system will probably want the fuel economy in liters per 100 km.

There are two ways to plot L/100km instead of mpg.
Either (as shown as an example in the documentation) by mutating the data frame we have, using `transmute` or `mutate` to create a new, transformed column. The other way to achieve this is to provide a `FormulaNode` to the `aes` call, like so:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes(f{235 / c"cty"}, "displ")) +
geom_point() +
xlab("cty [L / 100km]") +
ggsave("media/recipes/rFormulaAesthetic.png")
Which gives us:
This approach can be used for (almost) arbitrary computations on (even more than one) column. Note that if you wish to apply a proc to a column, make sure it’s lifted and corresponds to one of the types explained in the README.
Starting from version v0.2.15 plotting of error bars is supported. This is done via `geom_errorbar`.

Error bars are handled by new fields of the `Aesthetics` object, namely `xMin`, `xMax`, `yMin` and `yMax`. The important thing to keep in mind is that these fields require absolute values. So if you have an error of 0.1, you don’t set `yMin` to -0.1 and `yMax` to 0.1, but rather you add to and subtract from `y` using a formula. See the example below.
The example below assumes asymmetric, but constant errors of 0.03 down and 0.05 up. Note that `xMin` etc. are completely normal aesthetic fields. You can also assign a column of your DF with precomputed min and max values, or even use more complex functionality provided by aesthetics via formulas, e.g. compute the square root error via yMax = f{`y` + sqrt(`y`)} for instance.
import ggplotnim, seqmath, sequtils
# create some polynomial data
let x = linspace(0, 1.0, 10)
let y = x.mapIt(0.5 * it - 1.2 * it * it + 1.1 * it * it * it)
let df = toDf(x, y)
# let's assume we have asymmetric errors, 0.03 down and 0.05 up
ggplot(df, aes("x", "y")) +
geom_point() +
# define errors as a formula, which references our "y" scale
geom_errorbar(aes(yMin = f{`y` - 0.03}, yMax = f{`y` + 0.05})) +
ggsave("media/recipes/rErrorBar.png")
Which results in the following plot:
If a plot contains multiple aesthetic scales which require a legend, ggplotnim will attempt to draw them above one another. However, as of version v0.2.18 they are not made smaller so that they fit. If too many elements are shown, they won’t fit the plot.

An example below in which we classify by color and size:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", "displ", size = "cyl", color = "cty")) +
geom_point() +
ggsave("media/recipes/rMultipleLegends.png")
For a simple tile plot, let’s generate some data for a 28x28 tiling.

By default tiling will assume that the bin width of each tile is 1 in each dimension. You can provide custom bin widths by using the `width` and `height` aesthetics in the `aes` call. As with `geom_errorbar`, these can either be a full column containing widths for each tile or a constant value via a formula, e.g.

aes(x = ..., ..., width = f{0.95}, height = f{0.95})

to get tiles which only have a width (and height) of 0.95.
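As a concrete sketch of the constant-formula variant, here is a small tiling with shrunken tiles; the output filename is hypothetical:

```nim
import ggplotnim, random
# Shrink every tile to 0.95 of a unit via constant `width` / `height`
# formulas, leaving a small gap between neighboring tiles.
var
  xs = newSeq[float]()
  ys = newSeq[float]()
  zs = newSeq[float]()
  rnd = initRand(42)
for x in 0 ..< 10:
  for y in 0 ..< 10:
    xs.add x.float
    ys.add y.float
    zs.add rnd.rand(1.0)
let df = toDf(xs, ys, zs)
ggplot(df, aes("xs", "ys", fill = "zs",
               width = f{0.95}, height = f{0.95})) +
  geom_tile() +
  ggsave("media/recipes/rTileCustomWidth.png")
```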
Also for tiles it may be (more so than for other geoms) desirable to have each axis be considered discrete despite the data being continuous-like. In that case use `scale_x_discrete` / `scale_y_discrete` as shown (commented out) below.
import ggplotnim, random
var
xs = newSeq[float]()
ys = newSeq[float]()
zs = newSeq[float]()
rnd = initRand(42)
for x in 0 ..< 28:
for y in 0 ..< 28:
xs.add x.float
ys.add y.float
zs.add rnd.rand(1.0)
let df = toDf(xs, ys, zs)
ggplot(df, aes("xs", "ys", fill = "zs")) +
geom_tile() +
#scale_x_discrete() +
#scale_y_discrete() +
ggsave("media/recipes/rSimpleTile.png")
This gives us the following plot:
Note: to get rid of the spacing on the upper and right side (or to get even spacing), use `xlim` and `ylim`.
When dealing with the above case (Simple tile plot) this approach quickly becomes unwieldy for a large number of tiles. That is because with `geom_tile` each tile is drawn separately. Especially when storing the result as a vector graphic this can result in bad performance and huge file sizes.

`geom_raster` instead draws a single bitmap for the whole data (a cairo PNG surface is filled pixel-wise). This comes at the (arbitrary, but done for simplicity’s sake) limitation that each tile has the same size. The `width` and `height` fields are available, but they are redundant.
We’ll modify the example from above a bit to include more elements and move the location to a non trivial position.
import ggplotnim, random
var
xs = newSeq[float]()
ys = newSeq[float]()
zs = newSeq[float]()
rnd = initRand(42)
for x in countup(-256, 254, 2):
for y in 0 ..< 256:
xs.add x.float
ys.add y.float
zs.add rnd.rand(1.0)
let df = toDf(xs, ys, zs)
ggplot(df, aes("xs", "ys", fill = "zs")) +
geom_raster() +
ggsave("media/recipes/rSimpleRaster.png")
And for a slight modification of two facet-wrapped heatmaps (mainly as a regression test):
import ggplotnim, random
var
xs = newSeq[float]()
ys = newSeq[float]()
zs1 = newSeq[float]()
zs2 = newSeq[float]()
rnd = initRand(42)
for x in 0 ..< 256:
for y in 0 ..< 256:
xs.add x.float
ys.add y.float
zs1.add rnd.rand(1.0)
zs2.add rnd.rand(1.0)
let df = toDf(xs, ys, zs1, zs2)
.gather(["zs1", "zs2"], key = "Map", value = "vals")
ggplot(df, aes("xs", "ys", fill = "vals")) +
facet_wrap("Map") +
xlim(0, 256) + ylim(0, 256) +
geom_raster() +
ggsave("media/recipes/rFacetRaster.png", width = 920)
Since version v0.5.0 we provide 4 default colormaps to choose from:
- Viridis
- Magma
- Inferno
- Plasma

(These are the colormaps introduced in matplotlib version 2. They are more or less perceptually uniform in color, which is a nice property to avoid being biased by a perceptually large shift in color representing a small change in actual values.)

In addition to those predefined scales you can either modify any of the existing colormaps or simply hand your own.
First let’s look at the existing colormaps by drawing gradients of the scales:
import ggplotnim, sequtils
# 1000 points to use
let xs = linspace(0.0, 1.0, 1000)
var plts: seq[GgPlot]
# generate data to show a gradient for (currently a single element in one
# axis isn't supported for `geom_raster`, as it's essentially discrete).
var df = newDataFrame()
for i in 0 ..< 5:
df.add toDf({"x" : xs, "y" : i })
for scale in [viridis(), magma(), plasma(), inferno()]:
plts.add ggplot(df, aes("x", "y", fill = "x"), backend = bkCairo, fType = fkPng) +
scale_fill_gradient(scale) + # assign the correct scale
geom_raster() +
ylim(0, 5) + # due to our weird data (height deduced as 1, set correct limits)
theme_void() + # no scales
hideLegend() + # no legend
margin(top = 1, bottom = 0.3, right = 0.3) + # set small margin left / bottom
ggtitle("Colorscale: " & $scale.name)
# now create a multi plot of all of them
ggmulti(plts, "media/recipes/rColormaps.png", widths = @[600], heights = @[150, 150, 150, 150],
width = 600, height = 600)
This gives us the following color comparison:
Any of the functions viridis(), magma(), plasma() and inferno() returns a `ColorScale` object:
type
ColorScale* = object
name*: string # name of the used color scale
colors*: seq[uint32] # the used color scale as colors encoded in uint32
The `name` is optional and can simply be used, as we do in the example above, to give the name in the title of each subplot. The `colors` are stored as `uint32` values, which means:
let color = 0xAA_BB_CC_DD
# ^ ^ ^ ^--- blue value as hex byte
# | | |--- green value as hex byte
# | |--- red value as hex byte
# |--- alpha value of the color as hex byte
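The byte layout above can be made concrete with a few shifts and masks (plain Nim, no ggplotnim needed):

```nim
# Extract the four channels from the layout shown above via shift + mask.
let color = 0xAA_BB_CC_DD'u32
doAssert ((color shr 24) and 0xFF) == 0xAA  # alpha byte
doAssert ((color shr 16) and 0xFF) == 0xBB  # red byte
doAssert ((color shr  8) and 0xFF) == 0xCC  # green byte
doAssert ( color         and 0xFF) == 0xDD  # blue byte
```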
This is mainly done because the most performance-sensitive use of colorscales is `geom_raster`, in which case we have to use `uint32` colors to color each pixel (e.g. in Cairo). As such it’s a good idea to already use `uint32` to avoid having to convert the colorscale from some arbitrary color type to `uint32`.
As such, one can create a custom colorscale either by taking such a `ColorScale` object and assigning a sequence of `uint32` colors (any number of colors; ggplotnim will squeeze all input data into the corresponding number of colors), or by directly handing a `seq[uint32]` to the `scale_fill_gradient` (or related `scale_color_gradient`) procedure.
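The second option, handing a raw `seq[uint32]`, can be sketched as follows; the data and output filename are placeholders:

```nim
import ggplotnim
# A hand-rolled four-step grayscale map handed directly as `seq[uint32]`
# (alpha in the top byte, fully opaque). Any number of colors works;
# ggplotnim squeezes the input data into them.
let myScale = @[0xFF_00_00_00'u32, 0xFF_55_55_55'u32,
                0xFF_AA_AA_AA'u32, 0xFF_FF_FF_FF'u32]
let df = toDf({ "x" : @[0, 1, 0, 1],
                "y" : @[0, 0, 1, 1],
                "z" : @[0.0, 0.33, 0.66, 1.0] })
ggplot(df, aes("x", "y", fill = "z")) +
  geom_tile() +
  scale_fill_gradient(myScale) +
  ggsave("media/recipes/rCustomUint32Scale.png")
```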
Let’s look at an example of modifying a color scale to assign a transparent color to exact 0 values. We’ll plot a 2D gaussian:
import ggplotnim
import seqmath # for gauss
# generate some gaussian 2D data
var
xs = newSeq[float]()
ys = newSeq[float]()
zs = newSeq[float]()
let coords = linspace(-1.0, 1.0, 200)
for y in coords:
for x in coords:
xs.add x
ys.add y
zs.add gauss(sqrt(x*x + y*y), mean = 0.0, sigma = 0.2) # small sigma to cover multiple σ
# get the Inferno colormap
var customInferno = inferno()
customInferno.name = "InfernoWithTransparent"
# and assign the 0th element to exact transparent
customInferno.colors[0] = 0 # transparent
ggplot(toDf(xs, ys, zs), aes("xs", "ys", fill = "zs")) +
geom_raster() +
xlim(-1, 1) + ylim(-1, 1) +
scale_fill_gradient(customInferno) +
ggtitle("Modified Inferno colormap with 0 set to transparent") +
ggsave("media/recipes/rCustomColormap.png")
Note: we can see some substructure due to the small number of points we actually sample (200×200 points for 256 color values).
`geom_text` can be used to represent an additional scale on the plot via text, e.g. to write a classification as a string onto the plot, overlay numbers onto points etc.
Take note that by default the text will be centered on the position given by the `x` and `y` scales. You can change the alignment using the `alignKind` argument of `geom_text` or by providing the optional `font` argument, which has an `alignKind` field.
There are many ways it can be useful. However, `geom_text` is a completely valid geom, which means we can replace e.g. a `geom_point` by `geom_text` and the resulting plot works as expected (although it may be messy):
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "displ")) +
geom_text(aes(text = "manufacturer")) +
ggsave("media/recipes/rSimpleGeomText.png")
This generates:
We can take things even further by also applying additional scales to the plot, which change the color and size of the shown text. That way we can end up showing 5 different scales in a single plot! Again, the following may not be the most reasonable example, but well…
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "displ", color = "class", size = "cyl")) +
geom_text(aes(text = "manufacturer")) +
ggsave("media/recipes/rClassifiedGeomText.png")
This generates:
In more practical terms we might want to annotate certain points with text in a plot. This can also be done using `geom_text`. To avoid the text being drawn on top of the point, we can modify the `x` or `y` scale of the aesthetic for `geom_text` to nudge it to the side.

Let’s start from a simple combination of `geom_point` and `geom_text`:
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "displ")) +
geom_point() +
geom_text(aes(x = f{c"hwy" + 0.3},
text = "manufacturer"),
alignKind = taLeft,
# font = font(10.0, ...) <- you can also change the font
) +
ggsave("media/recipes/rAnnotateUsingGeomText.png")
This generates:
This is still pretty messy and usually not what one might use geom_text for. Instead let’s consider a similar example, where we want to annotate the car model with the best fuel economy:
import ggplotnim
let df = readCsv("data/mpg.csv")
let dfMax = df.mutate(f{"mpgMean" ~ (`cty` + `hwy`) / 2.0})
.arrange("mpgMean")
.tail(1)
ggplot(df, aes("hwy", "displ")) +
geom_point(aes(color = "cty")) + # set point specific color mapping
# Add the annotation for the car model below the point
geom_text(data = dfMax,
aes = aes(y = f{c"displ" - 0.2},
text = "model")) +
# and add another annotation of the mean mpg above the point
geom_text(data = dfMax,
aes = aes(y = f{c"displ" + 0.2},
text = "mpgMean")) +
ggsave("media/recipes/rAnnotateMaxValues.png")
This generates:
geom_text and geom_tile can be nicely combined to create annotated heatmaps.
Let’s calculate the mean of the highway fuel economy for each pair of (car class, number of cylinders) in the mpg dataset and create an annotated heatmap from the combination.
import ggplotnim, math
let df = readCsv("data/mpg.csv")
let dfRed = df.group_by(["class", "cyl"]).summarize(f{float: "meanHwy" << mean( c"hwy" )})
# stringification of formula is default name
let meanHwyCol = "meanHwy"
# fill only applies to `tile`, but not text. `color` would apply to text!
ggplot(dfRed, aes("class", "cyl", fill = meanHwyCol)) +
geom_tile() +
geom_text(aes(text = meanHwyCol)) +
scale_y_discrete() +
ggsave("media/recipes/rAnnotatedHeatmap.png")
Which results in:
In certain domains one often ends up with the desire to create a plot from multiple subplots. This is supported already, but requires explicit use of ginger functionality at the moment.
See below for an example, inspired by https://staff.fnwi.uva.nl/r.vandenboomgaard/SP20162017/SystemsSignals/plottingsignals.html.
The major point is to create two plots (but not draw them) using ggcreate with custom defined width and height. Then create a viewport, which will hold the two plots, with width and height such that the two plots fit. Then create a layout of rows / columns (theoretically custom sizes can be set, but for equal sized subplots this is not required) and embed the plots.
Note: From version v0.4.10 on a backend must be given manually to the ggplot calls. This is because work started to make ginger less backend dependent. Previously the Cairo backend was simply assumed. This is only required when not directly saving a plot (using ggsave), i.e. when calling ggcreate manually.
import ggplotnim, seqmath, math, sequtils, complex, ginger
let t = linspace(-0.02, 0.05, 1000)
let y1 = t.mapIt(exp(im(2'f64) * Pi * 50 * it).re)
let y2 = t.mapIt(exp(im(2'f64) * Pi * 50 * it).im)
let df = toDf({ "t" : t,
"Re x(t)" : y1,
"Im x(t)" : y2 })
let plt1 = ggcreate(
ggplot(df, aes("t", "Re x(t)"),
backend = bkCairo, fType = fkPng) + # tell `ggsave` we wish to create the plot on the cairo backend and as a PNG
geom_line() +
xlim(-0.02, 0.05) +
ggtitle("Real part of x(t)=e^{j 100 π t}"),
width = 800, height = 300
)
let plt2 = ggcreate(
ggplot(df, aes("t", "Im x(t)"),
backend = bkCairo, fType = fkPng) +
geom_line() +
xlim(-0.02, 0.05) +
ggtitle("Imaginary part of x(t)=e^{j 100 π t}"),
width = 800, height = 300
)
# combine both into a single viewport to draw as one image
var plt = initViewport(wImg = 800, hImg = 600, backend = bkCairo) # set backend of this viewport too
plt.layout(1, rows = 2)
# embed the finished plots into the new viewport
plt.embedAt(0, plt1.view)
plt.embedAt(1, plt2.view)
plt.draw("media/recipes/rMultiSubplots.png")
Which gives us:
These two examples are essentially the following gist: https://gist.github.com/Vindaar/9c32c0676ffddec9078e4c0917861fcd
which I wrote for @voltist on IRC, who wanted to create neural raster spike plots.
Just a disclaimer: I know nothing about these plots, so if they’re wrong in some way (axes, labels, whatever), please open an issue on how to make them correct!
I googled and found this: https://pythontic.com/visualization/charts/spikerasterplot
which is easy to do in ggplotnim
using geom_linerange
.
I’ll show two ways to do it. One “elegant” way (in terms of what’s considered typical usage of ggplot) and one rather weird way, which allows using custom color codes.
The first version relies on creating a long format data frame, with one column for spike numbers, one for the time axis, containing when a neuron spiked and finally a line size column for (what I assume is) the amplitude of the spike.
# first start with auto selection of colors
import ggplotnim, sequtils
const numx = 50
const numy = 8
const lineSizes = [0.4, 0.3, 0.2, 0.8, 0.5, 0.6, 0.7, 0.9]
# NOTE: The creation of the data here could surely be done in a nicer
# way...
var spikes = newSeq[float]()
var sizes = newSeq[float]()
for y in 0 ..< numy:
for x in 0 ..< numx:
spikes.add y.float
sizes.add lineSizes[y]
var df = newDataFrame()
df["spikes"] = toColumn spikes
df["neurons"] = toColumn randomTensor(numx * numy, 1.0)
df["lineSize"] = toColumn sizes
ggplot(df, aes("neurons", "spikes", color = factor("lineSize"))) +
geom_linerange(aes(ymin = f{c"spikes" - c"lineSize" / 2.0},
ymax = f{c"spikes" + c"lineSize" / 2.0})) +
scale_y_continuous() + # make sure y is considered cont.
ylim(-1, 8) + # at the moment ymin, ymax are not considered for the plot range (that's a bug)
ggtitle("Spike raster plot") +
ggsave("media/recipes/rAutoColoredNeuralSpikes.png")
This gives us the following plot:
In contrast, this version only has a data frame with one column for each neuron, containing the times when they spiked. The spike number, line size and color are all constant for each neuron.
import ggplotnim, sequtils
const numx = 50
const numy = 8
const lineSizes = [0.4, 0.3, 0.2, 0.8, 0.5, 0.6, 0.7, 0.9]
# alternatively using fixed colors and one geom_linerange for each color
let colorCodes = @[color(0, 0, 0),
color(1, 0, 0),
color(0, 1, 0),
color(0, 0, 1),
color(1 , 1, 0),
color(1, 0, 1),
color(0, 1, 1),
color(1, 0, 1)]
var df = newDataFrame()
for nr in 0 ..< numy:
df["neuron " & $nr] = toColumn randomTensor(numx, 1.0)
var plt = ggplot(df)
for nr in 0 ..< numy:
# could combine with above loop, but for clarity
plt = plt + geom_linerange(aes(x = ("neuron " & $nr),
y = nr,
ymin = nr.float - lineSizes[nr] / 2.0,
ymax = nr.float + lineSizes[nr] / 2.0),
color = colorCodes[nr])
# finally add scales, title and plot
plt + scale_y_continuous() + # make sure y is considered cont.
ylim(-1, 8) + # at the moment ymin, ymax are not considered for the plot range (that's a bug)
xlab("Neurons") +
ylab("Spikes") +
ggtitle("Spike raster plot, manual colors") +
ggsave("media/recipes/rCustomColoredNeuralSpikes.png")
This results in this plot:
Facet wraps are useful to display subplots consisting of different labels of a discrete variable. Essentially instead of e.g. coloring each geom based on each label an individual subplot is created. Of course those can be combined as well.
The only important thing is that the variables by which facetting is done have to be discrete. For continuous data the data is still interpreted as discrete, so this might result in many subplots!
By default the scales of each subplot are fixed to the same scale, which is the one determined from the x and y aesthetics (below displ and hwy). This can be overridden by the scales argument to facet_wrap, which can be one of the following:
- "fixed": default, all subplots share the same scale
- "free_x": the x axis is free and will be determined based on the data of each label
- "free_y": the y axis is free and will be determined based on the data of each label
- "free": both axes are free and will be determined based on the data of each label
Facetting can be done by as many variables as one likes, but again, this will result in a combinatorial explosion.
import ggplotnim
let mpg = readCsv("data/mpg.csv")
ggplot(mpg, aes("displ", "hwy")) +
geom_point(aes(color = "manufacturer")) +
facet_wrap(["drv", "cyl"]) +
ggsave("media/recipes/rSimpleFacet.png")
This gives us:
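As a small variation on the recipe above, the scales argument can be combined with facetting directly. The following is a minimal sketch assuming facet_wrap accepts scales together with a list of columns as shown in the other recipes; the output path is hypothetical:

```nim
import ggplotnim
let mpg = readCsv("data/mpg.csv")
ggplot(mpg, aes("displ", "hwy")) +
  geom_point() +
  # each subplot determines its own y range from the data of its label
  facet_wrap(["drv"], scales = "free_y") +
  ggsave("/tmp/rFacetFreeY.png") # hypothetical output path
```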
Facet wraps can also be (ab)used to display data that is not quite the same data just for different labels, but rather different data sets, which may be somehow related and one wishes to get a glimpse of some geom for all those data sets.
Consider the example below. It is based on background data from my https://github.com/Vindaar/TimepixAnalysis of a gaseous detector with a pixelized readout (256x256 pixels) consisting mostly of cosmic muons, which typically look something like this:
For every event such as this clusters are identified and then (geometric) properties of each event are calculated. The ones we are going to look at here:
- hits: number of active pixels in a cluster
- pos_x: center position along x of the cluster
- pos_y: center position along y of the cluster
- ecc: eccentricity of the cluster (essentially ratio of long to short axis)
- rms_trans: RMS of the cluster along the short axis
The other day I had a weird issue with one dataset, so I wanted to get an overview of the distributions of these properties. Thanks to facet_wrap that’s a few lines of code:
import ggplotnim
let df = readCsv("data/run_305_tpa_data.csv")
# gather all columns to a long format df
echo df
let dfLong = df.gather(getKeys(df), key = "Property", value = "Value")
ggplot(dfLong, aes("Value")) +
facet_wrap("Property",
scales = "free") + # each property has very different data ranges, Leave free
geom_histogram(bins = 100, position = "identity",
binBy = "subset") + # `binBy subset` means the histogram will be calculated
# in the data range of each properties data range
ggsave("media/recipes/rFacetTpa.png", width = 800, height = 600)
Which gives us the following plot:
Notice how in this facetting example each subplot has its own x and y axes and tick labels. That is because we set the scales to "free".
Also note that if we didn’t set the binBy = "subset" argument, all subplots would be binned to 100 bins in the same data range, which is the encompassing data range of all subplots. Especially thanks to having the hits property in there, this would lead to useless binning.
NOTE: Starting from version v0.4.4 this approach is not recommended for dates, but only to customize other numerical data. In this approach the actual placement of the ticks does not change, only the label does. See “Adjusting tick labels according to custom date time information” below for the best approach now.
Starting from version v0.3.7 (thanks to @cooldome!) it’s possible to hand a callback to define custom tick labels.
This example shows how to get tick labels showing dates (especially useful, since ggplotnim does not natively support DateTime objects so far!).
The callback is given to the scale_x/y_discrete/continuous procs via the labels parameter. For continuous labels the signature is:
proc(x: float): string
while for discrete labels it is:
proc(x: Value): string
where the Value corresponds to each discrete label on the discrete scale.
import ggplotnim, strutils, sequtils, seqmath, times
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)
# helper template to get a reproducible `DateTime` for CI!
template nowTmpl(): untyped = initDateTime(15, mMay, 2020, 00, 00, 00, 00, utc())
ggplot(df, aes("x", "y")) +
geom_line() +
scale_y_continuous(labels = proc(x: float): string =
x.formatFloat(ffDecimal, 1)) +
scale_x_continuous(labels = proc(x: float): string =
getDateStr(nowTmpl() - int(x).months)) +
xlab(label = " ") +
ggsave("media/recipes/rFormatDatesPlot.png")
This gives us:
In the same vein as the above, it might sometimes be useful to have control over the precision of the tick labels shown on the plot. This can also be done using the labels callback approach:
import ggplotnim, strutils, sequtils, seqmath
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) +
geom_line() +
scale_x_continuous(labels = proc(x: float): string =
x.formatFloat(ffDecimal, 2)) +
xlab(label = " ") +
ggsave("media/recipes/rFormatDecimalsPlot.png")
This gives us:
Another sometimes desired feature is changing the number of ticks or placing ticks at predefined positions. Both can be achieved by using the breaks argument to the scale_x/y_* procedures. A modified example of the rSimpleLinePlot.nim recipe shows both possibilities:
import ggplotnim, sequtils, seqmath
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) + # x and y are the identifiers given above as strings
geom_line() +
scale_x_continuous(breaks = @[0.0, 1.0, 2.0, 12.0]) + # set custom ticks along x
scale_y_continuous(breaks = 50) + # set a custom number of ticks along y
ggsave("media/recipes/rCustomBreaks.png")
Note in particular that if one uses an integer, the given number is only a “desired” number of ticks. Internally we still look for the closest number that yields “nice” looking tick labels.
Since version v0.3.9 it’s finally possible to give weights to the elements being binned when using geom_histogram (or any other geom with stat = "bin").
Note that for the time being the default label for the weighted axis is still “count”!
If each element to be binned has a corresponding weight, this is simply done by using the weight aes pointing to said column:
import ggplotnim
let df = readCsv("data/diamonds.csv")
ggplot(df, aes("carat", weight = "price")) +
geom_histogram() +
ylab("Binned carat weighted by price") +
ggsave("media/recipes/rWeightedHistogram.png")
Which gives the following:
Feel free to compare this result to the same plot with a weight of 1 for each element!
Another addition for binned geoms in version v0.3.9 is the support for density calculations. Instead of associating to each bin the number of counts, the density of each bin is returned instead.
Don’t confuse this with the (soon to be added™) geom_density, which calculates a kernel density estimation given a set of samples and returns a smooth estimation of the underlying distribution.
import ggplotnim
let df = readCsv("data/diamonds.csv")
ggplot(df, aes("carat")) +
geom_histogram(density = true) +
ggsave("media/recipes/rHistogramDensity.png")
With version v0.3.17 it is finally possible to provide custom mappings for discrete scales, using scale_color/size/fill_manual.
Courtesy of @hffqyd, consider the following tile map, in which we override the default ggplot fill colors by a custom mapping:
import ggplotnim, sequtils, seqmath, chroma, tables
let
pos = [1, 2, 3, 1, 2, 3]
name = ["a", "a", "a", "b", "b", "b"]
n = [0, 1, 4, 4, 2, 3]
df = toDf(pos, name, n)
ggplot(df, aes("pos", "name")) +
geom_tile(aes(fill = "n")) +
geom_text(aes(text = "n"), size = 25.0) +
scale_x_discrete() +
scale_y_discrete() +
scale_fill_manual({ 0 : color(1.0, 0.0, 0.0),
1 : color(0.0, 1.0, 0.0),
2 : color(0.0, 0.0, 1.0),
3 : color(1.0, 1.0, 0.0),
4 : color(1.0, 0.0, 1.0) }.toTable) +
ggsave("media/recipes/rCustomFill.png")
Which gives us the following tiles:
While ggplotnim tries to be reasonably smart about the margins a plot requires, there are still many cases in which user defined margins are required.
One example is very long labels, as in the following case derived from the rCustomFill recipe above by @hffqyd (ref: issue #89).
The margin procedure can be used to set the corresponding fields of a Theme object. By default it interprets the given quantities in centimeter. Inch, point and even relative values are also supported using its unit argument.
import ggplotnim
let
pos = [1, 2, 3, 1, 2, 3]
name = ["a very long long label", "a very long long label", "a very long long label", "b", "b", "b"]
n = [0, 1, 4, 4, 2, 3]
df = toDf(pos, name, n)
ggplot(df, aes("pos", "name")) +
geom_tile(aes(fill = "n")) +
geom_text(aes(text = "n"), size = 25.0) +
scale_x_discrete() +
scale_y_discrete() +
margin(left = 6.0) +
ggsave("media/recipes/rCustomMargins.png")
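The unit argument mentioned above can be illustrated with a minimal sketch. It assumes ginger’s ukInch unit kind is a valid value for unit; the data and output path are made up purely for illustration:

```nim
import ggplotnim, ginger
let df = toDf({ "x" : @[1, 2, 3], "y" : @[2.0, 4.0, 9.0] })
ggplot(df, aes("x", "y")) +
  geom_point() +
  # interpret the margin in inches instead of the default centimeter
  # (assumes `ukInch` from ginger is accepted here)
  margin(left = 1.0, unit = ukInch) +
  ggsave("/tmp/rMarginInch.png") # hypothetical output path
```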
In case of super long titles the title will automatically be wrapped.
However, by default the space in the top part of the plot will not be large enough to accommodate more than 1 line of text using the default font size. That’s why we set the margin at the top of the plot here to 2 cm.
If the user provides a manual wrapping (i.e. the title contains at least one \n) no automatic wrapping will be performed.
import ggplotnim, sequtils, seqmath
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) +
geom_line() +
margin(top = 2) +
ggtitle("This is a very long title which gets cropped on the right side as it's longer than the image width.") +
ggsave("media/recipes/rLongTitleMultiline.png")
v0.3.21 adds an additional drawing option for histograms. Before, histograms were always drawn as individual bars.
This can lead to problems:
- we can have aliasing effects where, depending on the size the resulting plot is viewed at, the background may be visible in small lines between the bars, even though the bars should be touching perfectly (or even overlap slightly). See “Simple histogram with N bins” for an example where this effect is visible.
- when plotting multiple histograms using position = "identity" (i.e. on top / behind one another), the edges of each bar can be very distracting visually, because even if one sets an alpha for the histograms, the alphas of the bin edges might overlap, making them stand out again.
By drawing histograms as outlines both of these problems are solved.
Let’s look at how to plot such a histogram and see what it looks like. We will plot a histogram of purely the outline (so no fill) to better visualize the idea, but be assured that issue 1 is remedied in case of filled histograms.
We will draw the example from “Change alpha of the fill color” as outlined histograms.
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", color = "class")) +
geom_histogram(lineWidth = 2.0,
alpha = 0.0, # make transparent (only fill)
hdKind = hdOutline) + # draw as outline
ggsave("media/recipes/rHistogramOutline.png")
In case of dealing with noisy data, one might want to see the trend of the data. geom_smooth is the tool for the job.
It provides (currently) two different smoothing methods to apply to data:
- A Savitzky-Golay filter (often called “LOESS” or local regression in certain bubbles)
- A polynomial fit of arbitrary order N.
(The latter can of course be used to perform a linear fit to the data.)
Note: both the Savitzky-Golay filter and the polynomial fit use LAPACK to solve a least squares problem, which means you need a working LAPACK installation.
Let us look at the commits per day to all Nimble repositories (thanks to @haxscramper for the dataset!).
In this example we leave the x labels as a unix timestamp. Combine it with the plot below (“Adjusting tick labels according to custom date time information”) to see how to improve on that.
import ggplotnim
let df = readCsv("data/commits_nimble.csv")
ggplot(df, aes("days", "count")) +
geom_line() + # plot the raw data as a line
geom_smooth() + # draw a default smoother. This is a Savitzky-Golay filter of
# order 5 and a window `span` of 70%
geom_smooth(smoother = "poly", # add a polynomial smoother using the full range
polyOrder = 7, # of order 7
color = "#FF0000", # and red line (named colors e.g. "red" also allowed)
size = 1.0) + # that is not as thick
ggsave("media/recipes/rGeomSmooth.png")
As we can see, in this case both algorithms give more or less good results. The Savitzky-Golay filter represents more of the variation of the actual change in the data, whereas the polynomial fit is just a smooth interpolation of the data. The latter shows some artifacts in places, e.g. on the left is an unphysical bump (c/f Runge’s phenomenon), which make it less suited for some datasets and polynomial order combinations.
In case of the Savitzky-Golay filter it’s possible to keep more of the local information by choosing a smaller span size (argument span). That results in a more wiggly result, which may be preferable in some use cases.
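For instance, the smaller span can be sketched on the same commit data as above. This is a minimal variation of the recipe; the output path is hypothetical:

```nim
import ggplotnim
let df = readCsv("data/commits_nimble.csv")
ggplot(df, aes("days", "count")) +
  geom_line() +
  # a smaller window than the default 70% keeps more local variation
  geom_smooth(span = 0.3) +
  ggsave("/tmp/rGeomSmoothSmallSpan.png") # hypothetical output path
```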
In the above example, “Smoothing noisy data with geom_smooth”, the x axis contains unix timestamps as the date format. This is often enough for quickly checking things, but once a plot should be presentable to others, or for a better understanding of specific dates (as unix timestamps are hard to map to actual calendar dates), assigning labels that are human readable is very useful.
Starting from ggplotnim version v0.4.4 one can achieve this with the added scale_x/y_date, which assigns a date scale to either the x or y axis.
It requires 3 of 4 arguments:
- isTimestamp: bool: tells the procedure whether the associated axis (x or y) contains time data as a unix timestamp (as integer or float).
- parseDate: if isTimestamp is false, we require a procedure of signature proc(x: string): DateTime, which simply performs the necessary conversion of a string storing date time information to a DateTime object (we don’t use a format string here as it is more restrictive in what is possible. It may be added as an additional argument in the future though).
- formatString: string: the format string used to convert a DateTime object to a tick label. This defines what the tick label will look like later.
As these 3 arguments alone are not enough to determine where and how many ticks to place, a last argument is required:
- dateSpacing: Duration: it simply gives a Duration object that stores the duration required between two ticks. Be careful when dealing with multiple years, as Duration does not support years and a full year is not exactly 52 weeks, but 52 weeks + 1 day (when ignoring leap years etc.).
Let’s use this information to turn the above plot into one with a sensible x axis:
import ggplotnim, times
var df = readCsv("data/commits_nimble.csv")
ggplot(df, aes("days", "count")) +
geom_line() +
scale_x_date(isTimestamp = true, # x is unix timestamp
formatString = "MMM-yyyy", # format as 'Jan-1970'
dateSpacing = initDuration(days = 1, # one year is roughly
weeks = 52)) + # 365.25 days, but we use 365
geom_smooth(span = 0.64) +
ylab("Commit count") + xlab("Date") +
ggtitle("Daily commit counts in all nimble repositories") +
ggsave("media/recipes/rScaleXDate.png")
As we can see, we now get nice yearly labels of the month. The label is placed exactly at the start of the month of that year!
Note: In principle using weeks = 52 is enough, as we internally first filter to all unique tick labels that are possible for the format string. Then we check the distance between the last acceptable tick and the next ones to find the one closest to being dateSpacing away from it. The first tick label is computed from the date range on the axis. With 52 weeks this would already always fall on the next exact year, as we only keep month + year information in the format string.
In addition to the used arguments there are 2 more arguments of interest to the scale_x/y_date procedures:
- dateAlgo: allows the selection of an alternative algorithm for the determination of the tick positions. The default dtaFilter behaves as described in the note above. dtaAddDuration simply adds the given dateSpacing onto the first valid tick and re-formats the result with formatString for a correct placement and text. The latter is useful for sparse input data along the time column.
- breaks: allows to hand custom timestamps (unix) at which to place ticks. Uses formatString to format the timestamps. Ignores the dateAlgo and dateSpacing.
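The dtaAddDuration algorithm can be sketched on the same commit data as the recipe above. This is a minimal sketch based on the argument descriptions; the format string and output path are chosen purely for illustration:

```nim
import ggplotnim, times
let df = readCsv("data/commits_nimble.csv")
ggplot(df, aes("days", "count")) +
  geom_line() +
  scale_x_date(isTimestamp = true,       # x is a unix timestamp
               formatString = "yyyy",    # format ticks as plain years
               dateSpacing = initDuration(weeks = 52),
               # add `dateSpacing` onto the first valid tick instead of
               # filtering candidate labels (the default `dtaFilter`)
               dateAlgo = dtaAddDuration) +
  ggsave("/tmp/rDateAlgoAddDuration.png") # hypothetical output path
```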
geom_smooth currently provides two ways to “smooth” data: either using a Savitzky-Golay filter (to smooth using a convolution in a fixed window with specific polynomial coefficients) or regular polynomial fits to the data.
One commonly used special case of polynomial fits is fitting a polynomial of order 1. In other words, fitting a line.
Let’s showcase this by fitting independent lines to multiple classes of a dataset. Namely, we will take our classy mpg dataset and plot the engine displacement displ against the highway fuel efficiency hwy. In addition we will color these by the class of car (the class column) each car model is in.
Then we’ll fit a linear model to all classes that tells us the trend in engine displacement vs. fuel economy for different classes of cars.
All we need to do to accomplish this is add one geom_smooth layer, in which we use the "poly" smoother with polyOrder = 1 (a line):
import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("displ", "hwy", color = "class")) +
geom_point() +
geom_smooth(smoother = "poly", polyOrder = 1) +
ggsave("media/recipes/rLinearFit.png")
This gives us the following pretty plot:
There are many reasons why one might want to produce a TeX file as the target for a plot. Two major ones:
- matching fonts & font sizes for plots in a LaTeX document
- type setting arbitrary text (maths, physics, chemistry equations & formulae, …)
From ggplotnim version v0.4.4 the TikZ backend allows one to do just that. In principle it generates a basic .tex file. Different options allow to define the document class (article or standalone), purely emitting the TikZ commands (\begin{tikzpicture} ... \end{tikzpicture}) or giving a full TeX template into which the TikZ image will be inserted (requires a strutils.% compatible string with a single $# argument).
Alternatively, it is possible to directly compile the generated TeX files to PDFs using a local TeX compiler. By default we use xelatex (for better unicode support).
Note: Placing text on the TikZ backend comes with some quirks:
- Text placement may be slightly different than on the Cairo backend, as we currently use a hack to determine string widths / heights based on font size alone. ginger needs an overhaul to handle embedding of coordinates into viewports to keep string width / height information until the locations are written to the output file (so that we can make use of text size information straight from TeX).
- Text is placed into TikZ node elements. These have some quirky behavior for more complex LaTeX constructs. E.g. it is not really possible to use an equation environment in them (leads to “Missing $ inserted” errors).
- Because of the hacky string width / height determination, placing a non transparent background for annotations leads to background rectangles that are too small. Keep the background color transparent for the time being.
- Do not include line breaks \n in your annotations if you wish to let LaTeX handle line breaks for you. Any manual line break \n will be handled by ginger. Due to the string height hack, this can give somewhat ugly results.
Let’s create a plot in which we draw a Landau distribution and annotate the text with the correct mathematical formula to showcase math typesetting.
import ggplotnim, math, sequtils, latexdsl, strutils, ginger
proc landauApprox(x: float): float =
result = 1.0 / sqrt(2 * math.PI) * exp(- (x + exp(-x)) / 2 )
proc annotateText(): string =
# pure math in raw string due to too many invalid nim constructs for `latex`
# macro (would result in ugly mix of strings and identifiers)
# Need manual math mode via `$`!
let eq = r"$p(x) = \frac{1}{2πi}\int_{a-i∞}^{a+i∞} e^{s \log(s) + xs}\, ds$"
let eqApprox = r"$p(x) \approx \frac{1}{\sqrt{2π}} \exp\left(- \frac{x + e^{-x}}{2} \right)$"
# align text with math using 2 line breaks for better separation (single line break is too
# squished. Cannot use `equation` or similar in a TikZ node afaiu :/)
result = r"The Landau distribution\\ \\ " &
eq & r"\\ \\" &
r"reduces to the approximation:\\ \\ " &
eqApprox & r"\\ \\ " &
r"for $μ = 0, c = 1$"
let x = linspace(-5.0, 15.0, 1000)
let y = x.mapIt(landauApprox(it))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) +
geom_line() + # draw our Landau data as a line
annotate(annotateText(), # add our text annotation
x = 5.0, y = 0.1, # at this location in 'data space'
backgroundColor = transparent) + # transparent background as we do manual TeX line breaks
ggtitle(r"Approximation of Landau distribution: " &
r"$p(x) \approx \frac{1}{\sqrt{2π}} \exp\left(- \frac{x + e^{-x}}{2} \right)$",
titleFont = font(12.0)) +
ggsave("media/recipes/rTikZLandau.tex", standalone = true) # standalone to get TeX file w/ only plot
# use:
# ggsave("media/recipes/rTikZLandau.pdf", useTex = true, standalone = true)
# to directly compile to standalone PDF (using local xelatex)
which results in a (document class ‘standalone’) TeX file. After
compilation with xelatex
it yields the following plot (converted to
a PNG to show here):
The generated TeX file can be found here:
Ridgeline plots (sometimes also called joyplots, referencing the album cover of “Unknown Pleasures” by Joy Division, which shows a time series plot of the first discovered radio pulsar CP 1919) can sometimes be useful to see the trends visible in multiple histograms / density estimations of different times / classes etc. For few histograms it’s fine to plot them in one window, but for more than 4 or 5 histograms it can quickly become too visually busy (drawing a histogram as an outline can help to an extent).
Ridgeline plots solve this by stacking the individual plots in multiple “ridges”, sometimes done including an overlap of the different panes.
We’re going to look at an example here, showing the distribution of the “gaussianity” of a point-cloud over different runs of a detector.
We will start with a regular plot, where we will highlight both the mean value of each distribution in red and the mean of all datasets in black.
The input file gaussSigma_runs.csv contains 3 columns: bins, counts and Run. The first two are simply the binned data of the distribution and Run designates which run the data corresponds to.
import ggplotnim
proc getMean(bins, counts: Tensor[float]): float =
## we define a helper proc to compute each mean of each ridge
## individually for each `Run`
for i in 0 ..< counts.size:
result += bins[i] * counts[i]
result /= counts.sum
let df = readCsv("data/gaussSigma_runs.csv")
echo df
let mean = getMean(df["bins", float], df["counts", float])
let ymax = df["counts", float].max
ggplot(df, aes("bins", "counts", fill = "Run")) +
ggridges("Run", overlap = 3.0) + # use a large overlap
geom_freqpoly(stat = "identity", # do not perform binning
color = "white", # white outline
size = 1.5) + # make the lines a bit thicker
geom_linerange(aes = aes(yMin = 0, # draw black line of common mean
yMax = ymax,
x = mean),
color = "black") +
geom_linerange(aes = aes(
fill = "Run", yMin = 0, yMax = ymax, # draw red line for each mean
x = f{float -> float: getMean(`bins`, `counts`)}), # compute the mean of the current Run
color = "#FF0000") + # color the line red
margin(top = 2) + # increase top margin due to large overlap
xlab("gaussSigma") + ylab("Counts") +
ggsave("media/recipes/rRidgeLineGauss.png", width = 1200, height = 1200)
This is a nice plot:
But of course it begs the question, can we make it pretty? ;)
Let’s remove those distracting lines for the mean, turn off the legend and paint it black!
import ggplotnim
let df = readCsv("data/gaussSigma_runs.csv")
ggplot(df, aes("bins", "counts", fill = "Run")) +
ggridges("Run", overlap = 3.0) +
geom_freqpoly(stat = "identity", color = "white",
size = 3.0) +
xlab("gaussSigma") + ylab("Counts") +
margin(top = 3, right = 2.5, bottom = 2) +
theme_void("black") + hideLegend() +
ggsave("media/recipes/rRidgeLineGaussBlack.png", width = 1200, height = 1200)
That’s what I call pretty!
But well, some people might prefer to go full Joy Division (just remove the classification by “Run”):
import ggplotnim
let df = readCsv("data/gaussSigma_runs.csv")
ggplot(df, aes("bins", "counts")) +
  ggridges("Run", overlap = 3.0) +
  geom_freqpoly(stat = "identity", color = "white",
                size = 2.0) +
  margin(top = 3, right = 2.5, bottom = 2) +
  theme_void("black") + hideLegend() +
  ggsave("media/recipes/rJoyplot.png", width = 1200, height = 1200)
Are we joyplots yet? Well, this may not be the prettiest joyplot ever, but given more classifying fields and a larger overlap, this would look like the original. ;)
From version v0.3.21 on, most plots also work using Vega-Lite (with some quirks). This excludes plots which use facetting or subplots of any kind (including ggridges).
Usage is very simple and just requires two things: import ggplotnim/ggplot_vega and replace the ggsave call by ggvega. A simple example:
import ggplotnim
import ggplotnim/ggplot_vega
let mpg = readCsv("data/mpg.csv")
ggplot(mpg, aes(x = "displ", y = "cty", color = "class")) +
  geom_point() +
  ggtitle("ggplotnim in Vega-Lite!") +
  ggvega("media/recipes/rSimpleVegaLite.html") # w/o arg creates a `/tmp/vega_lite_plot.html`
The behavior of the ggvega call can be adjusted with multiple arguments. By default, without any arguments, it generates an HTML file in the temp directory and shows it using the webview backend. The backend can be switched to the default browser with the backend argument. The temporary file is removed afterwards, unless removeFile is set to false.
If the given filename is a .json file, only such a file will be generated and no plot will be shown, which is useful to generate multiple Vega-Lite plots in one program. Pure HTML files can also be generated without showing the plots if the show argument is set to false.
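To make this concrete, here is a sketch of the variants described above. Only the argument names mentioned in the text (show, removeFile) are taken from the description; the filenames are made up, so treat the exact signature as an assumption and consult the ggplot_vega module for the precise API:

```nim
import ggplotnim
import ggplotnim/ggplot_vega

let mpg = readCsv("data/mpg.csv")

# `.json` target: only the Vega-Lite spec is written, no plot is shown
ggplot(mpg, aes("displ", "cty")) +
  geom_point() +
  ggvega("myplot.json")

# `.html` target without showing it, keeping the file on disk
ggplot(mpg, aes("displ", "cty")) +
  geom_point() +
  ggvega("myplot.html", show = false, removeFile = false)
```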
This recipe gives us the following plot:
and here as an interactive chart in the vega browser.
See TMAS section below for now.
See TMAS section below for now.
See TMAS section below for now.
NOTE: starting from version v0.2.15, this can also be achieved using geom_linerange!
This section goes for a more cohesive, explanatory and hopefully more fun introduction to different kinds of plots. Also possible alternatives might be discussed.
This example ports the idea from the plotnine documentation.
import ggplotnim, sequtils, seqmath, strutils
##
## This is a straight up adaptation from the genius `plotnine` example
## here:
## https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html
##
var elements = readCsv("data/elements.csv")
  .mutate(f{Value -> int: "group" ~ (
    if `group` == "-": -1
    else: `group`.toInt)})
echo elements.pretty(5)
# split the lanthanides and actinides from the rest
var top = elements.filter(f{`group` != -1})
  .rename(f{"x" <- "group"},
          f{"y" <- "period"})
var bottom = elements.filter(f{`group` == -1})
echo top["x"]
echo top["y"]
const nrows = 2
const hshift = 3.5
const vshift = 3.0
bottom["x"] = toColumn cycle(arange(0, bottom.len div nrows), nrows).mapIt(it.float + hshift)
bottom = bottom.mutate(f{"y" ~ `period` + vshift})
const tile_width = 0.95
const tile_height = 0.95
# replace `elements` by stacked top and bottom
elements = bind_rows([top, bottom])
let splitDf = toDf({
  "y": @[6, 7],
  "metal": @["lanthanoid", "actinoid"]
})
func cycle[T](s: openArray[T]; nums: seq[int]): seq[T] =
  ## repeats each element `s[i]` exactly `nums[i]` times
  ## (not to be confused with `sequtils.cycle` used above,
  ## which repeats the full sequence)
  result = newSeq[T](nums.foldl(a + b))
  var idx = 0
  for i in 0 ..< nums.len:
    for j in 0 ..< nums[i]:
      result[idx] = s[i]
      inc idx
# finally define rows and cols
let groupdf = toDf({
  "group": arange(1, 19),
  "y": cycle(@[1, 2, 4, 2, 1], @[1, 1, 10, 5, 1])})
let periodDf = toDf({
  "period": arange(1, 8),
  "x": cycle(@[0.5], @[7])})
ggplot(elements, aes("x", "y", fill = "metal")) +
  geom_tile(aes = aes(width = tileWidth,
                      height = tileHeight)) +
  geom_tile(data = splitDf,
            aes = aes(x = 3 - tileWidth/4.0 + 0.25,
                      width = tileWidth / 2.0,
                      height = tileHeight)) +
  scale_y_continuous() +
  geom_text(aes(x = f{`x` + 0.15},
                y = f{`y` + 0.15},
                text = "atomic number"),
            font = font(6.0)) +
  geom_text(aes(x = f{`x` + 0.5},
                y = f{`y` + 0.4},
                text = "symbol"),
            font = font(9.0)) +
  geom_text(aes(x = f{`x` + 0.5},
                y = f{`y` + 0.6},
                text = "name"),
            font = font(4.5)) +
  geom_text(aes(x = f{`x` + 0.5},
                y = f{`y` + 0.8},
                text = "atomic mass"),
            font = font(4.5)) +
  geom_text(data = groupdf,
            aes = aes(x = f{`group` + 0.5},
                      y = f{`y` - 0.2},
                      text = "group"),
            font = font(9.0, color = color(0.5, 0.5, 0.5))) +
  geom_text(data = periodDf,
            aes = aes(x = f{`x` + 0.3},
                      y = f{`period` + 0.5},
                      text = "period"),
            font = font(9.0, color = color(0.5, 0.5, 0.5))) +
  legendPosition(0.82, 0.1) +
  theme_void() +
  margin(right = 5, bottom = 2) + # adjust margin as `theme_void` implies tight margins
  scale_y_reverse() +
  scale_x_continuous() +
  ggsave("media/recipes/rPeriodicTable.png",
         width = 1000,
         height = 500)
which gives the following amazing result (better to look at a PDF!):
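As an aside, note that two different cycle procs are at work in the recipe above: sequtils.cycle (used for the x positions of the bottom block) repeats the whole sequence, while the custom overload repeats each element a given number of times. A standalone comparison:

```nim
import sequtils

func cycle[T](s: openArray[T]; nums: seq[int]): seq[T] =
  ## the custom overload from the recipe: repeat `s[i]` exactly `nums[i]` times
  result = newSeq[T](nums.foldl(a + b))
  var idx = 0
  for i in 0 ..< nums.len:
    for j in 0 ..< nums[i]:
      result[idx] = s[i]
      inc idx

assert cycle(@[1, 2, 3], 2) == @[1, 2, 3, 1, 2, 3]          # sequtils.cycle
assert cycle(@[1, 2, 3], @[1, 2, 3]) == @[1, 2, 2, 3, 3, 3] # custom overload
```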
This came up recently in a discussion with a colleague. The question was how to determine whether a point in 2D space lies within one or more (not necessarily disjoint) polygons.
I thought it would be fun to implement this with a simple algorithm and use ggplotnim to show that it works.
Let’s start by defining a few data types to make life easier:
import ggplotnim, sequtils, chroma, options, sugar, random
type
  Point = object
    x, y: float
  Vertex {.borrow: `.`.} = distinct Point
  Polygon = object
    vertices: seq[Vertex]
Not that it really matters for a toy example, but I actually wanted Point to be a generic, but then {.borrow.} is broken: nim-lang/Nim#14449
The Polygon type in this example could of course also just be an alias for seq[Vertex]. But maybe for a more complete example the polygon would store additional information.
Given our types, we can define a few helpers to access information more easily and create a Polygon from a set of points (not using Point due to the issue mentioned above…):
func len(p: Polygon): int = p.vertices.len
func `[]`(p: Polygon, idx: int): Vertex = p.vertices[idx]
proc initPolygon[T](vs: varargs[tuple[x, y: T]]): Polygon =
  result.vertices = newSeq[Vertex](vs.len)
  for i, v in vs:
    result.vertices[i] = Point(x: v[0].float, y: v[1].float).Vertex
On the other hand, we later want to visualize the result with ggplotnim, so we also need something to convert the data back into a form from which we can create a data frame (obviously the code as written here is super inefficient, since we use mapIt for clarity and thus loop multiple times!):
proc flatten(p: Polygon): (seq[float], seq[float]) =
  result = (p.vertices.mapIt(it.x), p.vertices.mapIt(it.y))
  # add first point to get proper drawn polygon with geom line!
  result[0].add p.vertices[0].x
  result[1].add p.vertices[0].y
where we appended the first point to the result again, because we have to “abuse” geom_line to draw a polygon, which doesn’t close the line by default.
Now comes the major part of the code, namely the check whether a given point lies within a certain polygon. This is based on the code here: https://wrf.ecse.rpi.edu/Research/Short_Notes/pnpoly.html
proc inPolygon(p: Point, poly: Polygon): bool =
  # based on: https://wrf.ecse.rpi.edu/Research/Short_Notes/pnpoly.html
  var j = poly.len - 1
  for i in 0 ..< poly.vertices.len:
    if ((poly[i].y <= p.y and p.y < poly[j].y) or
        (poly[j].y <= p.y and p.y < poly[i].y)) and
       (p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) / (poly[j].y - poly[i].y) + poly[i].x):
      result = not result
    j = i
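Before plotting anything, we can convince ourselves that the ray-crossing test behaves sensibly with a minimal standalone check on a unit square (plain Point objects here, i.e. dropping the Vertex indirection used above):

```nim
type
  Point = object
    x, y: float
  Polygon = object
    vertices: seq[Point]

func len(p: Polygon): int = p.vertices.len
func `[]`(p: Polygon, idx: int): Point = p.vertices[idx]

proc inPolygon(p: Point, poly: Polygon): bool =
  ## same ray-crossing test as above
  var j = poly.len - 1
  for i in 0 ..< poly.len:
    if ((poly[i].y <= p.y and p.y < poly[j].y) or
        (poly[j].y <= p.y and p.y < poly[i].y)) and
       (p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) / (poly[j].y - poly[i].y) + poly[i].x):
      result = not result
    j = i

let square = Polygon(vertices: @[Point(x: 0, y: 0), Point(x: 1, y: 0),
                                 Point(x: 1, y: 1), Point(x: 0, y: 1)])
assert Point(x: 0.5, y: 0.5).inPolygon(square)     # inside
assert not Point(x: 1.5, y: 0.5).inPolygon(square) # outside
```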
With this defined we can quickly define a proc that checks whether a point is in a sequence of polygons:
proc inAnyPolygon(p: Point, polys: seq[Polygon]): bool =
  for poly in polys:
    if p.inPolygon(poly): return true
Now we are essentially done. Let’s create a couple of polygons:
let p1 = initPolygon((0, 1), (6, 0), (5, 2), (4, 1), (2, 4))
let p2 = initPolygon((5, 4), (8, 10), (10, 2), (7, 4))
and create a data frame, which essentially stores the columns:
| x | y | Num |
where Num indicates which polygon a vertex is part of. This layout allows us to use ggplotnim's built-in functionality to draw them as separate polygons.
let df1 = toDf({ "x" : p1.flatten[0],
                 "y" : p1.flatten[1] })
let df2 = toDf({ "x" : p2.flatten[0],
                 "y" : p2.flatten[1] })
let df = bind_rows(("Polygon 1", df1), ("Polygon 2", df2), "Num")
All that is left is to draw a bunch of points and check whether they are in any polygon or not. This is stored as a column of bools, which we can then classify by in the same way as we will for the Num column of the polygons:
var rnd = initRand(42)
# now sample a bunch of points in (0, 10) plane and plot it
let points = collect(newSeq):
  for i in 0 ..< 300:
    Point(x: rnd.rand(10.0), y: rnd.rand(10.0))
let inPoly = points.mapIt(it.inAnyPolygon(@[p1, p2]))
let dfPoints = toDf({ "x" : points.mapIt(it.x),
                      "y" : points.mapIt(it.y),
                      "InPoly" : inPoly })
The ggplot call is reasonably simple. The main aes only contains x and y, because only these two columns are shared between the two data frames.
The fillColor argument for the geom_line call is just to override the coloring that the additional fill aesthetic would otherwise provide (which we need to get separate polygons). The problem is that the colors would be the same as for the points (the outline color still shows that!).
For the points we hand the additional data frame and define the classification via the InPoly column.
# TODO: results in vertical line at start of polygon
ggplot(df, aes(x, y)) +
  geom_line(aes = aes(fill = "Num"), fillColor = "#ebba34") +
  geom_point(data = dfPoints, aes = aes(color = "InPoly")) +
  ggsave("./media/recipes/rPointInPolygons.png")
Which gives us the following plot that shows us the algorithm works as expected:
Assume we have some mathematical function we want to plot. While this library is no gnuplot, this is still very simple (even if the example goes on to be not so simple…). Let’s pretend we want to plot the gravitational acceleration of a point mass according to Newton. The analytical description is

g(r) = G m r / R³   for r < R
g(r) = G m / r²     for r ≥ R

where r is the radial distance, R the radius of Earth, m the mass of Earth and G the gravitational constant. It shows both the case inside a massive body and outside.
Let’s plot this for Earth in a range from Earth’s center to X km!
First import the stuff we need:
import ggplotnim
import seqmath # for linspace, pow
import sequtils # for mapIt
Now we define the func, which returns the acceleration of Earth depending on the radial distance r:
func newtonAcceleration(r: float): float =
  ## returns the gravitational acceleration experienced by a test mass
  ## at different distances from Earth (or inside Earth).
  ## `r` is the radial distance given in `m`
  const R = 6371 * 1000.0 # mean radius of Earth in m
  const m_E = 5.972e24 # kg
  const G = 6.674e-11 # m^3 kg^-1 s^-2
  if r < R:
    result = G * m_E * r / pow(R, 3.0)
  else:
    result = G * m_E / (r * r)
We have to define the range we actually want to look at. Let’s consider Earth’s center up to roughly the geostationary orbit at ~ 35,000 km.
let radii = linspace(0.0, 35_000_000, 1000) # up to geostationary orbit
# and the corresponding accelerations
let a = radii.mapIt(newtonAcceleration(it))
This gives us two seq[float], but we need a DataFrame. So we combine the two:
var df = toDf({ "r / m" : radii,
                "g(r) / m s¯²" : a})
which gives us a data frame with two columns. The names are, as one can guess, the given strings. (Note that in practice one might not want to use unicode superscripts etc. It’s just to show that it’s possible and allows us to have it in the y label without setting the y label manually).
However, we might want to plot it against km instead of m, so let’s transmute the data frame:
df = df.transmute(f{"r / km" ~ c"r / m" / 1000.0}, f{"g(r) / m s¯²"})
The transmute function takes a variable number of elements. Only those columns that appear here (on the LHS of the ~) will be part of the resulting data frame.
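A tiny, hypothetical example makes the difference to mutate clear (the columns a, b, c are made up for illustration):

```nim
import ggplotnim
let d = toDf({ "a" : @[1, 2], "b" : @[3, 4] })
# mutate keeps the existing columns and adds `c`
echo d.mutate(f{"c" ~ `a` + `b`})    # columns: a, b, c
# transmute only keeps what appears in its formulas: just `c`
echo d.transmute(f{"c" ~ `a` + `b`}) # column: c
```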
And finally create the plot of the dependency:
ggplot(df, aes("r / km", "g(r) / m s¯²")) +
  geom_line() +
  ggtitle("Gravitational acceleration of Earth depending on radial distance") +
  ggsave("media/recipes/rNewtonAcceleration.png")
The resulting plot is the following:
and shows what we expect: a linear increase in acceleration up to the surface and a 1/r² drop from there.
At this point we might ask “Do we recover the known 9.81 m/s² at the surface?”. Let’s see. There are many different ways to go about this. We’ll use summarize:
let maxG = df.summarize(f{float: "g_max" << max(col("g(r) / m s¯²"))})
An alternative way would be to access the data column directly, like so:
let maxG_alt = df["g(r) / m s¯²"].toTensor(float).max
where access via [] returns the underlying Column. To apply procs like max to it, we convert it to a Tensor[float] via toTensor. (In older versions access returned a PersistentVector[Value]; copying the values to a seq[Value] required vToSeq — not just toSeq, because that breaks things: https://github.com/nim-lang/Nim/issues/7322…)
Let’s see what we have:
echo "Max acceleration:\n ", maxG
which should give roughly 9.8 m/s². The deviation comes from the fact that we didn’t actually look at the value at the surface exactly, but took a rough grid from 0 - 35,000 km with 1000 points. Evaluating the proc at the radius exactly might give a better result:
echo "At surface = ", newtonAcceleration(6371000)
except now we see a value slightly too large (~ 9.82). Because now we’d have to include the rotation of Earth to account for the centripetal force… But since this isn’t a physics lesson (going down this rabbit hole is a lot of fun though, I promise!), we’ll stop here!
Creating a log-log plot is as easy as done in the makePlot proc at the bottom. And yes, there’ll be explanations soon here for the curious of you! :)
import sequtils, seqmath, ggplotnim
proc effPhotonMass(ne: float): float =
  ## returns the effective photon mass for a given electron number density
  const alpha = 1.0 / 137.0
  const me = 511e3 # 511 keV
  # note the 1.97e-7 cubed to account for the length scale in `ne`
  result = sqrt( pow(1.97e-7, 3) * 4 * PI * alpha * ne / me )

proc numDensity(c: float): float =
  ## converts a molar concentration in mol / m^3 to a number density
  const Z = 2 # number of electrons in a helium atom
  result = Z * 6.022e23 * c

proc molarAmount(p, vol, temp: float): float =
  ## calculates the molar amount of gas given a certain pressure,
  ## volume and temperature. The pressure is assumed in mbar.
  const gasConstant = 8.314 # joule K^-1 mol^-1
  let pressure = p * 1e2 # pressure in Pa
  result = pressure * vol / (gasConstant * temp)

proc babyIaxoEffMass(p: float): float =
  ## calculates the effective photon (and thus axion mass) for BabyIAXO given
  ## a certain helium pressure in the BabyIAXO magnet
  const vol = 10.0 * (PI * pow(0.3, 2)) # 10m length, bore radius 30 cm
  # UPDATE: IAXO will be run at 4.2 K instead of 1.7 K
  # const temp = 1.7 # assume 1.7 K same as CAST
  const temp = 4.2
  once:
    echo "BabyIAXO magnet volume is ", vol, " m^3"
    echo "BabyIAXO magnet temperature is ", temp, " K"
  let amountMol = molarAmount(p, vol, temp) # amount of gas in mol
  let numPerMol = numDensity(amountMol / vol) # number of electrons per m^3
  result = effPhotonMass(numPerMol)

proc logspace(start, stop: float, num: int, base = 10.0): seq[float] =
  ## generates evenly spaced points between start and stop in log space
  let linear = linspace(start, stop, num)
  result = linear.mapIt(pow(base, it))

proc makePlot(pstart, pstop: float, fname: string) =
  let pressures = logspace(pstart, pstop, 1000) # 1000 log-spaced pressure values
  let masses = pressures.mapIt(babyIaxoEffMass(it)) # corresponding masses
  # convert both seqs to a dataframe
  let df = toDf({"P / mbar" : pressures, "m_a / eV" : masses})
  ggplot(df, aes("P / mbar", "m_a / eV")) +
    geom_line() +
    scale_x_log10() +
    scale_y_log10() +
    ggtitle("Sensitive axion mass in eV depending on helium pressure in mbar") +
    ggsave(fname)

makePlot(-6.0, 2.0, "media/recipes/rAxionMassesLogLog.png")
This creates the following plot:
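As an aside, the logspace helper can be checked in isolation (here with linspace inlined, so the snippet only needs the standard library):

```nim
import math, sequtils

proc logspace(start, stop: float, num: int, base = 10.0): seq[float] =
  ## same idea as above: a linear grid in the exponent, then `base^x`
  let step = (stop - start) / float(num - 1)
  result = toSeq(0 ..< num).mapIt(pow(base, start + float(it) * step))

let vals = logspace(0.0, 2.0, 3)
assert vals.mapIt(round(it)) == @[1.0, 10.0, 100.0]
```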
A typical problem is comparing some measured data from experiment with a theoretical model for which an analytical description may exist, so that the analytical model may be represented by O(1000) elements, while the measured data only consists of N << 1000 elements. This can be done as shown below. It creates a plot of the mass attenuation function of X-rays in the energy range between 0-10 keV (very soft X-rays) for ⁴He.
import sequtils, seqmath, ggplotnim
proc logMassAttenuation(e: float): float =
  ## calculates the logarithm of the mass attenuation coefficient for a given
  ## energy `e` in `keV` and the result in `cm^2/g`
  result = -1.5832 + 5.9195 * exp(-0.353808 * e) + 4.03598 * exp(-0.970557 * e)

let energies = linspace(0.0, 10.0, 1000) # from 0 to 10 keV
let logMuOverRho = energies.mapIt(logMassAttenuation(it))
# now the non-log values
let muOverRho = logMuOverRho.mapIt(exp(it))
const massAttenuationFile = "data/mass_attenuation_nist_data.txt"
# skip one line after header, second header line
var dfMuRhoTab = readCsv(massAttenuationFile, header = "#",
                         sep = ' ')
  # convert MeV energy to keV
  .mutate(f{"Energy" ~ c"Energy" * 1000.0})
  .filter(f{float: c"Energy" >= energies.min and c"Energy" <= energies.max})
# create df of interpolated values
let dfMuRhoInterp = toDf({ "E / keV" : energies,
                           "mu/rho" : muOverRho })
# rename the columns of the tabulated values df and create a log column
dfMuRhoTab = dfMuRhoTab.rename(f{"E / keV" <- "Energy"})
# build combined DF
let dfMuRho = bind_rows([("Interpolation", dfMuRhoInterp),
                         ("NIST", dfMuRhoTab)],
                        id = "type")
# and the plot of the raw mu/rho values
ggplot(dfMuRho, aes("E / keV", "mu/rho", color = "type")) +
  geom_line(data = dfMuRho.filter(f{`type` == "Interpolation"})) +
  geom_point(data = dfMuRho.filter(f{`type` == "NIST"})) +
  scale_y_log10() +
  ggtitle("Mass attenuation coefficient interpolation and data") +
  ggsave("media/recipes/rMassAttenuationFunction.png")
Gives:
See TMAS section below for now.
import sequtils, seqmath, ggplotnim
proc logspace(start, stop: float, num: int, base = 10.0): seq[float] =
  ## generates evenly spaced points between start and stop in log space
  let linear = linspace(start, stop, num)
  result = linear.mapIt(pow(base, it))

proc density(p: float, temp = 4.2): float =
  ## returns the density of the gas for the given pressure.
  ## The pressure is assumed in `mbar` and the temperature in `K`.
  ## The default temperature corresponds to the BabyIAXO aim.
  ## Returns the density in `g / cm^3`
  const gasConstant = 8.314 # joule K^-1 mol^-1
  const M = 4.002602 # g / mol
  let pressure = p * 1e2 # pressure in Pa
  # factor 1000 for conversion of M in g / mol to kg / mol
  result = pressure * M / (gasConstant * temp * 1000.0)
  # convert the result to g / cm^3 for use with mass attenuations
  result = result / 1000.0

proc numDensity(c: float): float =
  ## converts a molar concentration in mol / m^3 to a number density
  const Z = 2 # number of electrons in a helium atom
  result = Z * 6.022e23 * c

proc effPhotonMass(ne: float): float =
  ## returns the effective photon mass for a given electron number density
  const alpha = 1.0 / 137.0
  const me = 511e3 # 511 keV
  # note the 1.97e-7 cubed to account for the length scale in `ne`
  result = sqrt( pow(1.97e-7, 3) * 4 * PI * alpha * ne / me )

proc pressureGivenEffPhotonMass(m_gamma: float, T = 4.2): float =
  ## calculates the inverse of `babyIaxoEffMass`, i.e. the pressure
  ## from a given effective photon mass in BabyIAXO
  result = m_gamma * m_gamma * T / 0.01988

proc molarAmount(p, vol, temp: float): float =
  ## calculates the molar amount of gas given a certain pressure,
  ## volume and temperature. The pressure is assumed in mbar.
  const gasConstant = 8.314 # joule K^-1 mol^-1
  let pressure = p * 1e2 # pressure in Pa
  result = pressure * vol / (gasConstant * temp)

proc babyIaxoEffMass(p: float): float =
  ## calculates the effective photon (and thus axion mass) for BabyIAXO given
  ## a certain helium pressure in the BabyIAXO magnet
  const vol = 10.0 * (PI * pow(0.3, 2)) # 10m length, bore radius 30 cm
  const temp = 4.2
  let amountMol = molarAmount(p, vol, temp) # amount of gas in mol
  let numPerMol = numDensity(amountMol / vol) # number of electrons per m^3
  result = effPhotonMass(numPerMol)

proc vacuumMassLimit(energy: float, magnetLength = 10.0): float =
  ## given an axion energy in keV, calculate the limit of coherence
  ## for the vacuum case in BabyIAXO
  let babyIaxoLen = magnetLength / 1.97e-7 # length in "eV"
  result = sqrt(PI * 2 * energy * 1e3 / babyIaxoLen) # convert energy to eV

const babyIaxoVacuumMassLimit = vacuumMassLimit(4.2)

proc m_a_vs_density(pstart, pstop: float) =
  let pressures = logspace(pstart.log10, pstop.log10, 1000)
  let densities = pressures.mapIt(density(it, 4.2))
  let masses = pressures.mapIt(babyIaxoEffMass(it)) # corresponding masses
  # reference lines at fixed densities, drawn from 1e-2 to 1 eV
  let ref1Bar = density(1000, 293.15)
  let df1bar = toDf({"ρ / g/cm^3" : @[ref1Bar, ref1Bar], "m_a / eV" : @[1e-2, 1.0]})
  let ref3Bar = density(3000, 293.15)
  let df3bar = toDf({"ρ / g/cm^3" : @[ref3Bar, ref3Bar], "m_a / eV" : @[1e-2, 1.0]})
  let refVacLimit = density(pressureGivenEffPhotonMass(babyIaxoVacuumMassLimit))
  let dfVacLimit = toDf({"ρ / g/cm^3" : @[refVacLimit, refVacLimit], "m_a / eV" : @[1e-2, 1.0]})
  let df = toDf({"ρ / g/cm^3" : densities, "m_a / eV" : masses})
  let dfComb = bind_rows([("ma vs ρ", df),
                          ("1 bar @ 293 K", df1bar),
                          ("3 bar @ 293 K", df3bar),
                          ("Vacuum limit", dfVacLimit)],
                         id = "Legend")
  ggplot(dfComb, aes("ρ / g/cm^3", "m_a / eV", color = "Legend")) +
    geom_line() +
    scale_x_log10() +
    scale_y_log10() +
    ggtitle("Sensitive axion mass in eV depending on helium density in g / cm^3") +
    ggsave("media/recipes/rAxionMassVsDensity.png")

m_a_vs_density(pressureGivenEffPhotonMass(babyIaxoVacuumMassLimit) * 0.9,
               pressureGivenEffPhotonMass(0.4521) * 1.1)
Which gives us the following annotated plot:
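As a quick sanity check of the density helper used above: the ideal gas law gives roughly 0.164 kg/m³ (i.e. 1.64e-4 g/cm³) for helium at 1 bar and room temperature, which the proc reproduces:

```nim
proc density(p: float, temp = 4.2): float =
  ## identical to the proc in the recipe above: mbar and K in, g/cm³ out
  const gasConstant = 8.314 # joule K^-1 mol^-1
  const M = 4.002602 # g / mol
  let pressure = p * 1e2 # pressure in Pa
  result = pressure * M / (gasConstant * temp * 1000.0)
  result = result / 1000.0

# helium at 1 bar (1000 mbar) and 293.15 K: ideal gas gives ~1.64e-4 g/cm³
assert abs(density(1000.0, 293.15) - 1.642e-4) < 1e-6
```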
While the following document was mainly written for myself, it might be a nice example as to how one might use ggplotnim to explore some calculation, generate a bunch of plots in a literate environment (almost relieving us of our desire for a working repl…) etc. It showcases a variety of plots one might want to create. At some point those will be part of the recipes above… In fact the plots found in the document below correspond to the TODO items in the GTTP section.
It contains calculations (and a lot of plots) for the sensitive axion mass ranges achievable in the BabyIAXO experiment, a prototype for the IAXO experiment.
- the original Org file (do yourself a favor and view it in emacs): https://github.com/Vindaar/TimepixAnalysis/blob/refactorRawManipulation/Doc/other/axionWithGasPhase.org
- the generated PDF: https://github.com/Vindaar/TimepixAnalysis/blob/refactorRawManipulation/Doc/other/axionWithGasPhase.pdf