Aggregating

The ‘Grouping’ pane references the source dataset and lets users define any of the four major columns: Date, Group, Feature, Value. Each column has a specific effect on the resulting plot. A standard column, # observations, is automatically created to report the number of observations in each group. By default, no columns (and thus no groups) are defined, and # observations represents the number of rows in the source dataset.

Date Column

When Date Column is specified, it will be parsed into date-time (POSIXct) format according to the format specified in Date Format. An attempt is made to automatically specify the proper format. The lubridate package is used for date parsing. Many of the abbreviations match lubridate’s function names, such as ymd (“year-month-day”). If a format match is not found, the format * - Guess - * is selected, which employs the anytime package. If any date fails to parse using the specified format, the column specification is ignored.
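The parsing rules above can be sketched in a few lines. This is only an illustration in Python, not the app's lubridate-based implementation: the FORMATS table, the hyphen separators, and the function names are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical mapping from lubridate-style abbreviations to strptime patterns.
# The "-" separators are an assumption of this sketch.
FORMATS = {
    "ymd": "%Y-%m-%d",
    "mdy": "%m-%d-%Y",
    "dmy": "%d-%m-%Y",
}

def parse_column(values, fmt_name):
    """Parse every value with one format; return None if any value fails."""
    pattern = FORMATS[fmt_name]
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, pattern))
        except ValueError:
            return None  # a single failure means the column specification is ignored
    return parsed

def detect_format(values):
    """Return the first format name that parses all values, else None."""
    for name in FORMATS:
        if parse_column(values, name) is not None:
            return name
    return None  # at this point the app would fall back to its "Guess" option

dates = ["2021-03-01", "2021-03-15"]
detected = detect_format(dates)
parsed = parse_column(dates, detected)
```

Returning None for the whole column, rather than dropping the rows that failed, mirrors the all-or-nothing behavior described above.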

The (optional) values provided in Date Range will filter the dataset to include only dates that fall in the given range. Note that all date formats, even simply ‘Year’, are parsed to datetime format before being passed to the Date Range filter. Also, if the Date Column is already in date/POSIXct/numeric format, it may be best to apply the column filter in the Source Dataset pane.

Date Transformation is initially set to the format specified in Date Format. If you clear this field, the grouped dataset will show the unformatted POSIXct datetime. Users can transform the date into a variety of formats. Note that there can be value in transforming from a more specific to a less specific format (e.g. Year-Month-Day -> Year), but the opposite provides no meaningful insight (e.g. Year -> Year-Month-Day).
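To see why only one direction of transformation is informative, consider the following sketch. It uses Python's strftime patterns as a stand-in for the app's format options, and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

timestamps = [datetime(2020, 1, 15), datetime(2020, 7, 4), datetime(2021, 2, 28)]

# More specific -> less specific: rows collapse into fewer date groups.
by_year = Counter(ts.strftime("%Y") for ts in timestamps)

# Less specific -> more specific: a date parsed from "Year" alone carries no
# month or day, so every expanded value lands on January 1 of that year.
year_only = [datetime.strptime(y, "%Y") for y in ["2020", "2020", "2021"]]
expanded = [ts.strftime("%Y-%m-%d") for ts in year_only]
```

Here by_year groups three observations into two years, while expanded contains only January 1 dates, which adds no information beyond the year itself.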

Lastly, there is an option to set Included Days of Week to include or exclude specific weekdays from the analysis.
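The two date filters can be pictured as a simple row filter. The sketch below is an assumption-laden illustration: filter_dates is a hypothetical helper, the range is treated as inclusive on both ends, and weekdays follow Python's convention (Monday is 0, Sunday is 6), none of which is confirmed by the app itself.

```python
from datetime import datetime

def filter_dates(rows, start=None, end=None, weekdays=None):
    """Keep rows whose 'date' lies in [start, end] and falls on an included weekday."""
    kept = []
    for row in rows:
        d = row["date"]
        if start is not None and d < start:
            continue
        if end is not None and d > end:
            continue
        if weekdays is not None and d.weekday() not in weekdays:
            continue
        kept.append(row)
    return kept

rows = [
    {"date": datetime(2021, 3, 5), "value": 10},  # Friday
    {"date": datetime(2021, 3, 6), "value": 20},  # Saturday
    {"date": datetime(2021, 3, 8), "value": 30},  # Monday
    {"date": datetime(2021, 4, 1), "value": 40},  # Thursday, outside the range
]

# Keep March 2021, Monday through Friday only.
weekday_rows = filter_dates(
    rows,
    start=datetime(2021, 3, 1),
    end=datetime(2021, 3, 31),
    weekdays={0, 1, 2, 3, 4},
)
kept_values = [row["value"] for row in weekday_rows]
```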

Group Column

With regard to the ‘Grouped Dataset’, the specified Group Column serves as another grouping variable. With regard to the plot output, however, the Group Column defines how to facet the resulting plot.

Ideally, observations should belong to distinct groups (if any), but this is not a requirement. However, if the group column is not specified (or between-groups aggregation is performed), and a within-groups aggregation is specified, the resulting numbers can lose their meaning.

For example, consider three different restaurants. If counts of all employees at these restaurants are aggregated together, one can obtain the total count of employees at all three restaurants. Of course, this assumes the employees cannot be employed at more than one restaurant.

However, suppose these three restaurants logged the number of distinct customers each month. Customers can visit multiple restaurants within the month, so one cannot accurately obtain the number of distinct customers across all three restaurants by aggregating these counts.

The moral of the story is, if groups exist in the dataset, the group column should be specified, else the resulting insights could be meaningless.
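The restaurant example can be made concrete with a few made-up customer names. The data below is purely hypothetical and only demonstrates why summing per-group distinct counts overstates the true total.

```python
# Hypothetical customers seen at each restaurant during one month.
visits = {
    "Restaurant A": {"ann", "bob", "cara"},
    "Restaurant B": {"bob", "dan"},
    "Restaurant C": {"ann", "dan", "eve"},
}

# Summing the per-restaurant distinct counts double-counts shared customers.
sum_of_counts = sum(len(customers) for customers in visits.values())

# The true number of distinct customers requires the underlying identities.
true_distinct = len(set().union(*visits.values()))
```

With these values, sum_of_counts is 8 while true_distinct is 5. Once only the counts are stored, the second number can no longer be recovered.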

Top Groups enables the user to select the top n groups. The top n groups are determined by summing over all other columns and sorting the value column in descending order. When no value column is specified, # observations is used.
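A minimal sketch of this ranking follows. The top_groups helper and the sample rows are hypothetical, and the way ties are broken here (first group encountered wins) is an assumption rather than documented behavior.

```python
from collections import defaultdict

def top_groups(rows, n, group_key="group", value_key=None):
    """Rank groups by their total value (or row count) and return the top n."""
    totals = defaultdict(float)
    for row in rows:
        # With no value column, each row counts as one observation.
        totals[row[group_key]] += row[value_key] if value_key else 1
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:n]

rows = [
    {"date": "2021-01", "group": "A", "feature": "x", "value": 5},
    {"date": "2021-01", "group": "B", "feature": "x", "value": 20},
    {"date": "2021-02", "group": "A", "feature": "y", "value": 7},
    {"date": "2021-02", "group": "C", "feature": "x", "value": 3},
    {"date": "2021-02", "group": "B", "feature": "y", "value": 1},
    {"date": "2021-03", "group": "A", "feature": "x", "value": 1},
]

by_value = top_groups(rows, 2, value_key="value")  # totals: A=13, B=21, C=3
by_count = top_groups(rows, 2)                     # counts: A=3, B=2, C=1
```

Note that the ranking can change depending on whether a value column is defined: B leads by total value, while A leads by number of observations.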

Feature Column

The Feature Column serves as another grouping variable. The geoms of the plot are colored by the defined feature column. The Top Features slider functions exactly as the slider for Top Groups.

Value Column

The Value Column defines the values to be plotted, so it must be numeric; only numeric columns are offered as options. As noted above, if this column is omitted, # observations is plotted instead. If the value column contains values for more than one feature (as occurs after melting multiple columns), it is crucial to also define the feature column; otherwise there is no way to know what the values represent.

After defining a value column, it may be useful to define an aggregation function, which will be detailed next.

Aggregation

  • Within-Groups: after defining a value column, it may be useful to aggregate the values by the specified aggregation function. Within-Groups will aggregate values within the groups defined by the date, group, and feature columns. The available aggregation functions are: sum, mean, median, min, and max. Plots that quantify dispersion (e.g. histogram, density) will provide more insight when the values are not aggregated. However, plots that require distinct values per group (e.g. bar, area) will benefit from aggregation. In these cases, if no aggregation function is specified, values will automatically be summed, as indicated by the y-axis label.

  • Between-Groups: marginalizes the group column, which is equivalent to removing the group column altogether.
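The two aggregation levels can be sketched as the same group-and-reduce step applied to different key columns. This is an illustration only: the aggregate helper and sample rows are hypothetical, and using sum for the between-groups step is an assumption of the example.

```python
from collections import defaultdict
from statistics import mean, median

# The five aggregation functions listed above.
AGG_FUNCS = {"sum": sum, "mean": mean, "median": median, "min": min, "max": max}

def aggregate(rows, keys, func="sum"):
    """Group rows by the given key columns and aggregate 'value' in each group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row["value"])
    return {key: AGG_FUNCS[func](values) for key, values in buckets.items()}

rows = [
    {"date": "2021-01", "group": "A", "feature": "x", "value": 4},
    {"date": "2021-01", "group": "A", "feature": "x", "value": 6},
    {"date": "2021-01", "group": "B", "feature": "x", "value": 10},
    {"date": "2021-02", "group": "A", "feature": "x", "value": 2},
]

# Within-groups: one value per (date, group, feature) combination.
within = aggregate(rows, ["date", "group", "feature"], "mean")

# Between-groups: the group column is dropped from the keys, so values
# from groups A and B are combined for each (date, feature) pair.
between = aggregate(rows, ["date", "feature"], "sum")
```

In this sketch, within keeps groups A and B separate for 2021-01, whereas between merges them into a single total of 20 for that month.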

Unaggregated:

Aggregated: