1 Motivation

1.1 Density Curves: Useful, but Often Mysterious

Density curves can be a great tool for conveying information about the distribution of a variable. For example, the density curve below, from a decision support system for insurance underwriters, addresses some pretty complex information needs while requiring very little screen space (a full discussion of this use case is available here).



However, if you’re like me before I had the chance to take a data mining course, you’ve probably sketched a density curve into one of your designs without stopping to think about how that curve will be generated when your interface is implemented in software. This might seem like a trivial detail, but it can really matter. That is because a density curve is not an innate property of the data we’re representing, but rather an approximation that can be derived using several different statistical techniques known as density estimation approaches. You and I could start with the same data but end up putting very different density curves in front of our users, depending on the techniques that we use and how we adjust different settings of those techniques.


“You and I could start with the same data but end up putting very different density curves in front of our users, depending on the techniques that we use and how we adjust different settings of those techniques.”


1.2 What You’ll Learn

This tutorial introduces several of the most common techniques for producing density estimates and discusses how the properties of each technique can influence the density curves that users see in a finished display.

While there are plenty of detailed tutorials for each of these techniques available elsewhere, they tend to focus on the mathematics involved in each technique. Here, our focus is less on mathematics and more on the design implications of each density estimation approach.

By the end of this tutorial, you should have a solid conceptual understanding of the most common density estimation approaches, including:

  • The general logic behind each approach
  • The assumptions, parameters, and fitting procedures associated with each approach
  • How applying each approach with different settings and parameter values can influence the final density curve


1.3 Prerequisites

This tutorial is purposefully equation-free. However, I do assume familiarity with basic statistics and linear modeling similar to what you would get in a college-level introductory statistics course. If you’re a bit rusty on these topics, see here for a good refresher.

This tutorial includes snippets of R code detailing how to implement each of the density estimation approaches we discuss. Although looking at the code could be useful if you want to implement these procedures yourself, all of the content here is specifically designed to be understandable and useful for professional designers. No programming background required.


1.4 Example Data

For the practical examples in this tutorial, we will use the “NBA Draft Combine Measurements” dataset, which contains anthropometric and performance measurements from recruits at the NBA draft combine in the years 2012-2016. This dataset was originally collected by Andrew Chou (https://data.world/achou), and it is freely available online at: https://data.world/achou/nba-draft-combine-measurements

To explore how density estimation approaches handle different types of distribution shapes (namely, distributions with outliers, bimodal distributions, and skewed distributions), we will also use some manually generated example data. The details on how I generated these data are available in the companion R script for this tutorial.

If you would like to follow along and implement the density estimation procedures we discuss here for yourself, the R code that I used to generate the calculations and figures in this tutorial is available here.

Without further ado, let’s load in our data and get started!

# Load Data ---------------------------------------------------------------

# read_csv() comes from the readr package, and the pipe (%>%), rename(), and
# tibble() used below come from dplyr and tibble; all are loaded with the tidyverse
library(tidyverse)

url <- "https://query.data.world/s/3cx3sheimvijxdcnfzba6pvouobnm4"
nba <- read_csv(url)



# Clean Data --------------------------------------------------------------

nba <- nba %>%
  # Make some of the variable names R-Friendly
  rename(
    "Index"               = "X1",
    "PickNumber"          = "Draft pick",
    "HeightNoShoes"       = "Height (No Shoes)",
    "HeightWithShoes"     = "Height (With Shoes)",
    "StandingReach"       = "Standing reach",
    "VerticalMax"         = "Vertical (Max)",
    "VerticalMaxReach"    = "Vertical (Max Reach)",
    "VerticalNoStep"      = "Vertical (No Step)",
    "VerticalNoStepReach" = "Vertical (No Step Reach)",
    "BodyFat"             = "Body Fat",
    "HandLength"          = "Hand (Length)",
    "HandWidth"           = "Hand (Width)"
  )



2 Density Estimation 101

2.1 What is Density Estimation?

The goal of density estimation is to use the observations of a variable that are available in a dataset to create a way of estimating the probability that any given value of that variable will occur. This probability estimate comes in the form of a probability density function (PDF). At a high level, you can think of a PDF as a machine that takes in the value of a variable and spits out the probability of that value occurring. Once you have the PDF for a variable, you can estimate the probability that any value occurs by plugging it into the PDF.
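
For instance, R's built-in dnorm() function is the PDF "machine" for a normal distribution. The mean, standard deviation, and input values below are arbitrary choices, just to illustrate the idea of plugging a value into a PDF and getting a probability estimate back:

# Feed a value into the PDF "machine" and get back the estimated
# probability of that value occurring (arbitrary illustration values)
dnorm(72, mean = 75, sd = 3) # a value one standard deviation below the mean
dnorm(75, mean = 75, sd = 3) # a value at the center of the distribution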

PDFs are usually plotted like the example above, with the different values for the variable you’re interested in on the x-axis and the probability of each value occurring on the y-axis. This type of curve should look familiar, and for good reason: when we display a density curve for a variable, we’re actually displaying an estimate of the PDF for that variable.

“When we display a density curve for a variable, we’re actually displaying an estimate of the PDF for that variable.”

Density estimation (DE) describes a set of statistical approaches for estimating the PDF for a variable using only the limited set of observations available in a given dataset. All DE techniques share the same goal: Find the PDF “machine” that produces the “best” possible estimates of the probability that each value occurs. However, each technique estimates this optimal PDF in a different way.


2.2 Density Estimates Are Statistical Models

At their core, density estimates are statistical models, or functions that relate the value of one or more “predictor” variables to the value of a “response” variable. In general, you can think of a model as a machine that takes in values of one or more predictors and spits out a predicted value of some response variable. In density estimation, our model is a PDF, which takes in the value of a single “predictor” variable (the variable of interest) and spits out a probability estimate for that value of the predictor variable.

A common aphorism in statistics is that “All models are wrong, but some are useful”. This saying refers to the fact that all statistical models are approximations. Some approximations can be better or worse than others, depending on how well their predictions align with what we observe in the real world. However, there will always be some amount of variability in the outcome of interest that even the best models cannot account for.

Since density estimates are models, we can always expect density estimates to be at least a little bit “wrong”. This is ok, as long as they are still “useful”. However, this means that it is important for us to understand the factors that can influence how wrong (and useful) density estimates can be in different circumstances, across the different approaches that can be used to generate them.

“Since density estimates are models, we can always expect density estimates to be at least a little bit ‘wrong’.”

To explore these factors for each density estimation approach, we first need to understand three important properties of statistical models and how they relate to density estimation. In the section below, we discuss these properties using the familiar example of linear models.


2.3 Important Properties of Statistical Models

2.3.1 Assumptions

Different types of models make different assumptions about the general pattern of the relationship between predictor variable(s) and the response variable. These assumptions define the “family” or type of a model. Selecting a model family is analogous to selecting the type of machine we’re going to use to process our predictor variable(s) into a predicted value of the response variable. Different machines perform different operations on the predictor variable(s) to produce predictions about the response.

For example, linear models assume that there is a linear relationship between the outcome variable (y) and the predictor variable (x), such that for every one-unit increase in x, y will increase or decrease by a constant amount (see an example below). As you probably know, this assumption produces a model that is shaped like a line. Once we decide to use this model family, we then must find the line that gives the best approximations of our outcome variable (y).


2.3.2 Parameters

Just like different members of a human family can have different values of specific properties (e.g. height, weight, or eye color), different instances of the same model family can have different values of properties known as parameters. The values of these parameters determine how the general shape of a model adapts to capture the unique pattern in a particular dataset. If a model is a machine, you can think of parameters as control knobs that can be tuned to adjust the settings of the machine to adapt to different conditions.

You probably recall that simple linear models can have different values on two parameters: the slope and the intercept. These parameters control the position and orientation of the line that we use to predict values for the outcome variable. By changing the values of these two parameters, we can adapt a linear model to capture different trends in the data.

The examples below show linear model fits that use the player’s height (x-axis) to predict the values of 3 different response variables (y-axis). As you can see, each of these models is still a member of the linear model “family”, but we have changed the values of the model parameters (slope and intercept) to adapt each model to the specific circumstances of each variable relationship.


2.3.3 Fitting procedures

Once we’ve selected a type of model to use, we usually face the task of finding the value(s) for that model’s parameter(s) that enable(s) that model to produce the best possible approximation of our data. Depending on the type of model we decide to use, and the parameters associated with that model type, this can be accomplished in different ways. For some parameters, we can use a mathematical procedure to find the parameter values that best adapt our model to match our observed data. For other parameters, we may need to manually select values using more subjective criteria.

In the case of a simple linear model, we can mathematically determine the optimal slope and intercept values for a given dataset using the method of least squares. Practically, this approach selects the slope and intercept values that produce a line that minimizes the squared distance (along the y-axis) between each datapoint and the prediction line. The closer the datapoints are to the line along the y-axis, the better the fit.
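
As a quick illustration using the NBA data we loaded earlier (the choice of response variable here is just for the example), a simple linear model can be fit with least squares in R like this:

# Fit a simple linear model with least squares:
# predict standing reach from height (no shoes)
height_reach_lm <- lm(StandingReach ~ HeightNoShoes, data = nba)

# The fitted parameter values: the intercept and the slope
coef(height_reach_lm)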


2.4 Tying it All Together

So what do these properties of statistical models have to do with density estimation? When we think about density estimation procedures as modeling approaches, and the estimated PDF as a statistical model, we can describe each density estimation approach in terms of how it varies on each of the three properties that we just described.

  • Assumptions: Some DE approaches make specific assumptions about the general pattern of the PDF, while others do not.
  • Parameters: Each DE approach has a different set of parameters that can influence the final shape of the estimated PDF.
  • Fitting procedures: In some cases, parameter values are determined using different mathematical approaches. In other cases, they may be set manually by a human analyst.

As we discuss different density estimation approaches, we will describe how each of them stacks up on these properties and how this can impact the density curve that users will see in your final product.



3 Parametric Density Estimation

3.1 Overview

Parametric density estimation (PDE) takes advantage of the fact that certain probability distribution families can be described completely by a relatively small set of parameters. For example, if we define the mean and the standard deviation of a normal distribution (a.k.a. a gaussian distribution), then we can easily calculate the probability of any value along that distribution.

Parametric density estimation involves the following steps:

  1. Choose a type (i.e., “family”) of probability distribution, and assume that the PDF for your variable of interest belongs to this family of models.
  2. Find the parameter values for your chosen distribution that cause it to match your available data as well as possible.


3.2 Assumptions

As described above, in parametric density estimation we assume that the PDF for our variable of interest is a specific type of distribution, such as a normal distribution.


3.3 Settings and Parameters

3.3.1 Distribution Type

The choice of distribution type significantly influences the final density curve by constraining the shape of that curve. If we select a normal distribution, then our density estimate will be a normal curve. If we select an exponential distribution, then our density estimate will be an exponential curve.

For simplicity, this tutorial will illustrate the strengths and weaknesses of parametric density estimation using the familiar example of a normal curve. See here for an extensive list of other probability distribution types.
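
In code, this choice usually amounts to a single argument. For example, with the fitdistrplus package (used in the fitting example below), the distr argument names the assumed distribution family. This is just a sketch of how the assumption is expressed, not a recommendation of which family to pick; the log-normal alternative is purely for illustration:

# The distr argument encodes our distributional assumption
fit_norm  <- fitdistrplus::fitdist(nba$HeightNoShoes, distr = "norm")  # assume a normal PDF
fit_lnorm <- fitdistrplus::fitdist(nba$HeightNoShoes, distr = "lnorm") # assume a log-normal PDF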


3.3.2 Distribution-Specific Parameters

The “shape” of a given type of distribution is defined by the values of one or more parameters, and each distribution has its own set of parameters.

For example, the shape of a normal distribution is defined by the values of two parameters: the mean and the standard deviation. As you may recall, the mean determines where the distribution is centered along the x-axis (i.e., the distribution’s center), while the standard deviation determines how wide the distribution is (i.e., the distribution’s spread).

By changing the values of these two parameters, we can generate any possible normal curve:
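
For example, the short sketch below (with arbitrary parameter values chosen for illustration) evaluates the normal PDF over a grid of heights under two different mean/standard deviation settings. Plotting each result against the grid gives two different members of the normal family:

# A grid of heights at which to evaluate the PDFs
heights <- seq(60, 90, by = 0.1)

# Same model family (normal), different parameter values
curve_narrow <- dnorm(heights, mean = 75, sd = 3) # centered at 75, narrow spread
curve_wide   <- dnorm(heights, mean = 78, sd = 5) # centered at 78, wide spread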


3.4 Fitting Procedures

Once we have chosen a model family to use, a mathematical approach known as maximum likelihood estimation (MLE) can be used to estimate the parameter values that produce a PDF shape that best approximates our observed data. The details of MLE are beyond the scope of the current tutorial. However, conceptually, this approach defines the “best” estimated PDF as the one that makes the values in our dataset most likely to have occurred. In other words, for a PDF to be good, values that occur frequently in our dataset should have higher estimated probabilities, and values that occur less frequently in our dataset (or perhaps not at all) should have lower estimated probabilities.

The code below demonstrates an implementation of parametric density estimation with a normal distribution, and the resulting density estimate curve. In this and subsequent examples, we will be estimating a density curve for the height of the players in the NBA Combine dataset.

# fitdist() below comes from the fitdistrplus package
library(fitdistrplus)

# Fit distribution using maximum likelihood estimation
gaussian_de_height <- fitdist(
  data = nba$HeightNoShoes,
  distr = "norm",
  method = "mle"
)

# Extract mean and SD of the fitted distribution
gaussian_de_height_mean <- gaussian_de_height$estimate[1]
gaussian_de_height_sd   <- gaussian_de_height$estimate[2]
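
If you would like to draw the fitted curve yourself, one option (a small sketch reusing the values extracted above) is to evaluate the fitted normal PDF over a grid of heights:

# Evaluate the fitted normal PDF over a grid of heights for plotting
height_grid <- seq(min(nba$HeightNoShoes), max(nba$HeightNoShoes), by = 0.1)

gaussian_de_curve <- tibble(
  Height  = height_grid,
  Density = dnorm(height_grid, mean = gaussian_de_height_mean, sd = gaussian_de_height_sd)
)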

The resulting estimated PDF is a distribution from the selected model family (in this case, a normal curve) that has been modified to provide the best fit to the observed data. To examine how well this density estimate fits our data, the plots below compare the final density curve to a histogram of the observations in our dataset, plotted with different bin widths.


3.5 Strengths and Limitations

From the plots above, we can see that while parametric DE with a normal distribution seems to have captured much of the data well, it did not capture the two small peaks that seem to exist near the middle of the distribution. Because this pattern does not conform to the shape of a normal curve, we cannot capture it with a gaussian parametric density estimate. This is a major limitation of parametric density estimation: if a pattern in our distribution does not match the shape of the distribution type that we assume, then the final density estimate will not pick up on that pattern.

This limitation can become more of a problem when patterns incompatible with our assumed distribution type are more extreme. As you can see in the examples below, a gaussian parametric density estimate fails to adequately capture the distribution of data with a strong bimodal distribution, data with outliers, and skewed data, all of which fail to conform to the shape of a normal distribution. If we could only see the density curves without the histograms for reference, we could make very incorrect conclusions about each of these distributions.

However, parametric density estimation does generate very smooth density curves, and it can work well when patterns in the data do conform to the assumed distribution shape. This approach also requires less data to generate a reliable density estimate than some of the other DE approaches.



4 Kernel Density Estimation

4.1 Overview

Kernel density estimation (KDE) is a non-parametric approach to density estimation. Rather than assuming a distribution type for the PDF at the outset, KDE uses only the available data to determine the shape of the density estimate. In short, KDE predicts the density at a given value based on the weighted number of observations that are “near” that value. The weights for this calculation are constructed to ensure that observations that are nearer to a given value count more toward estimating the density at that value than observations that are farther from that value.

Conceptually, the process for KDE can be understood as follows:

  1. Select a kernel function to use for the estimate. We can interpret this function as describing how the presence of an observation at a given value of a variable influences the density estimate at adjacent values of that variable.
  2. Position a copy of the kernel function directly over each observation, and scale each of these kernels based on the number of observations in the dataset. When there are more observations in the dataset, the kernel function copies will be smaller.
  3. For any given value of the variable, add up the heights of all of the kernel copies at that value to produce the density estimate for that value (the code sketch below walks through these steps).
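
For intuition, the sketch below implements these three steps by hand with a gaussian kernel. This is purely illustrative (in practice you would use a dedicated function such as ks::kde, shown later), and the bandwidth value in the example call is arbitrary:

# A "by hand" gaussian KDE, for intuition only
# x_values:  values at which to estimate the density
# data:      the observed values
# bandwidth: the width of each kernel copy
manual_kde <- function(x_values, data, bandwidth) {
  sapply(x_values, function(x) {
    # Steps 1-2: a gaussian kernel (dnorm) is centered over each observation
    # and scaled by the number of observations (the mean() call)
    # Step 3: the scaled kernel values are combined to give the density at x
    mean(dnorm((x - data) / bandwidth)) / bandwidth
  })
}

# Example: density estimates for player heights at a few values
manual_kde(c(70, 75, 80), data = nba$HeightNoShoes, bandwidth = 1)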



4.2 Assumptions

Unlike parametric density estimation, KDE does not make any assumptions about the general pattern of the PDF.


4.3 Settings and Parameters

4.3.1 Kernel function

As we introduced above, we can think of the kernel function in KDE as defining how the presence of an observation at a given value of a variable influences the density estimate at adjacent values of that variable. Kernel functions can take several different shapes, with each shape corresponding to a different pattern of impact that an observation at a given point can have on the density estimates at the surrounding points.

The plots below show examples of some common kernel function shapes and how they can impact the final density curve. Gaussian kernels are the most commonly used, but other kernel shapes can have useful properties. See here for an extensive list of potential kernel shapes.
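
One easy way to experiment with this choice is base R’s density() function, which exposes the kernel shape through its kernel argument. This is just a sketch for exploration; the main examples below use the ks package instead:

# Same data, different kernel shapes (bandwidth left at the function's default)
kde_gaussian     <- density(nba$HeightNoShoes, kernel = "gaussian")
kde_epanechnikov <- density(nba$HeightNoShoes, kernel = "epanechnikov")
kde_rectangular  <- density(nba$HeightNoShoes, kernel = "rectangular")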



4.3.2 Bandwidth

The bandwidth of the kernel function is the most important parameter for determining the shape of the final density curve. This parameter determines how wide or narrow the kernel is. This, in turn, determines how broad of an impact an observation at a given value of a variable has on the density estimates produced for other values.

Practically, bandwidth controls the amount of smoothing in a kernel density estimate, with higher bandwidths resulting in smoother density curves and lower bandwidths resulting in bumpier density curves. When bandwidth is low, an observation only influences density estimates at values very close to it. This produces bumpier density curves that better capture local trends, but at a cost to their ability to capture global trends in the data. Conversely, as bandwidth increases, an observation influences density estimates at values farther away from it. This produces smoother density curves that better capture overall trends, but at a cost to their ability to capture local trends in the data.

The plots below illustrate this pattern by showing kernel density estimates for our player height data, fitted using a gaussian kernel and several different bandwidths. You can clearly see that as the bandwidth increases, the density curve becomes smoother and captures fewer of the fine-grained patterns in the data.
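
If you want to reproduce this kind of comparison yourself, here is a minimal sketch using ks::kde (introduced in the fitting code below); the three bandwidth values are arbitrary choices for illustration:

# Fit KDEs with progressively wider gaussian kernels (arbitrary example bandwidths)
kde_narrow <- ks::kde(nba$HeightNoShoes, h = 0.25)
kde_medium <- ks::kde(nba$HeightNoShoes, h = 1)
kde_wide   <- ks::kde(nba$HeightNoShoes, h = 4)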

4.4 Fitting Procedures

Typically, the kernel function is selected manually by the programmer or data analyst. As we noted above, while gaussian kernels are most commonly used, other types of kernel function are available.

Bandwidth can be selected manually, often with the goal of finding a curve that is not too bumpy and not too smooth. However, bandwidth can also be selected automatically by software using statistical techniques based on plug-in and cross-validation approaches. While these techniques are beyond the scope of the current tutorial, plenty of information about different bandwidth selection approaches is available online.

The code below demonstrates an implementation of KDE and the resulting density estimates for our NBA player data.

# hscv(), hpi(), and kde() below come from the ks package
library(ks)

# Estimate the optimal kernel bandwidth using cross validation
(nba_bw <- hscv(nba$HeightNoShoes))
## [1] 0.9811668

# Alternatively, we could use a plug-in bandwidth selector,
# which gives a slightly different result
hpi(nba$HeightNoShoes)
## [1] 0.972888

# Use the CV bandwidth and a gaussian kernel to calculate the KDE
nba_kde <- kde(nba$HeightNoShoes, h = nba_bw)

# Extract the evaluation points and probability estimates
nba_kde_output <- tibble(
  EvalPoint = nba_kde$eval.points,
  DensityEstimate = nba_kde$estimate
)


4.5 Strengths and Limitations

Importantly, KDE does not make any assumptions about the general pattern of the PDF. As a result, this approach can capture more complex patterns that may be missed by parametric density estimation.

As you can see below, KDE is able to capture many of the more unusual patterns in our example data that we were unable to capture using parametric density estimation. In particular, notice how this approach is able to register the presence of outliers in the density curve, while curves generated using gaussian PDE failed to capture the presence of these points.

However, as we noted above, the specific form of the density estimate produced by KDE can vary as a function of the kernel shape and the bandwidth value used to generate that estimate. These parameters should be selected carefully, and possibly with an eye to the intended use of the final density estimate. We discuss this more and provide some suggestions in the final section of the tutorial.



5 Mixture Density Estimation

5.1 Overview

As we introduced earlier, a primary limitation of simple parametric density estimation procedures is that they are unable to capture more complex patterns that do not conform to any one distribution type (e.g. a normal distribution). Mixture density estimation (MDE) accounts for these more complex patterns by combining multiple simple parametric density estimates to produce one final density estimate. You can think of MDE as an intermediate approach that incorporates some elements from both PDE and KDE.

Recall that in KDE, we place one identical copy of the kernel over each observation in our dataset, and these kernel copies are all weighted equally based on the number of observations in our dataset. MDE relies on a similar approach, but with a few modifications:

  • MDE uses only a set number of kernels
  • In MDE, the parameters of each kernel (e.g. the mean and SD for a gaussian kernel) can vary independently from each other
  • In MDE, each kernel can have a different weight


5.2 Assumptions

As we hinted above, rather than assuming that the PDF of a variable follows the pattern associated with a particular distribution type, mixture models instead assume that the PDF can be captured by a weighted combination of multiple models of a given type. For example, we might assume that the PDF for a variable can be approximated by a weighted combination of two normal distributions.


5.3 Settings and Parameters

5.3.1 Number and Type of Component Distributions

To generate a mixture density estimate, we need to select the number and type(s) of component distributions to include. Including more component distributions typically makes the density curve less smooth, but better able to capture complex patterns in the data. Including fewer component distributions makes the curve smoother, but less able to capture complex patterns in the data.

Practically, choosing the number of component distributions to include in an MDE is a more graded version of the choice between PDE and KDE. In fact, we can think of PDE as a special case of MDE that only uses one component distribution. At the other end of the spectrum, we can think of KDE as a special case of MDE where the number of component distributions is equal to the number of observations in our dataset (and we include some other restrictions).
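
In code, this choice is typically a single argument. For example, mixtools::normalmixEM (used for the fitting example below) sets the number of gaussian components through its k argument; the three-component fit here is just for illustration:

# EM uses random starting values, so set a seed for reproducibility
set.seed(1990)

# A three-component gaussian mixture (k sets the number of components)
nba_mix_3 <- mixtools::normalmixEM(x = nba$HeightNoShoes, k = 3)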


5.3.2 Distribution-Specific Parameters

Just as in simple parametric density estimation, the shape of a specific distribution is determined by the value of one or more parameters, and each type of distribution has its own set of parameters. In MDE, the only difference is that we now need to set these parameters individually for each of our component distributions. For example, if we perform MDE with two normal curves as our component distributions, then we will need to define a mean and a standard deviation for each of these curves.


5.3.3 Mixing Parameters

In addition to the parameters for each component distribution, MDE also requires us to specify one or more mixing parameters that determine how strongly each of the component distributions contributes to the final density estimate. Depending on the values of these parameters, all of the component distributions may contribute similarly to the final density estimate, or one distribution may dominate the others.


The plots below show how changing the values of distribution-specific and mixing parameters changes the shape of the density curve produced by MDE. For this example, the component distributions are two normal curves.


5.4 Fitting Procedures

The number and type of component distributions is usually set manually. Once these settings are defined, the optimal values for the other parameters can be calculated using a statistical approach called the expectation-maximization (EM) algorithm. The mathematical details of this fitting procedure are beyond the scope of this tutorial. However, it is worth knowing that this approach can land on a solution that is “locally” optimal, but is not necessarily “globally” optimal. Practically, this means that while the EM algorithm will produce a good fit, it is not guaranteed to produce the best possible fit to our data. A summary of the EM algorithm is available here.

The code below provides an example of how to implement MDE for our player height data. For this example, we use two normal curves as our component distributions.

# Helper function for generating gaussian mixture PDF ---------------------

# Arguments:
  # means: Vector containing the means for the first (index 1) and second (index 2) components
  # sds: Vector containing the standard deviations for the first (index 1) and second (index 2) components
  # weight:   Value to use for the relative weighting of the two components
  # x_values: A vector of x values for which to estimate the density

gaussian_mixture_pdf <- function(means, sds, weight, x_values){

  # Get probability densities from each distribution
  component_1 <- dnorm(x_values, mean = means[1], sd = sds[1])
  component_2 <- dnorm(x_values, mean = means[2], sd = sds[2])

  # Mix the two distributions
  component_mix <- component_1 * weight + component_2 * (1-weight)

  # Return the result
  return(component_mix)

}

# Estimate optimal parameter values using EM
set.seed(1990)
nba_mix <- mixtools::normalmixEM(x = nba$HeightNoShoes)
## number of iterations= 997
# Combine data to plot
nba_mix_density <- tibble(
  XVal = seq(min(nba$HeightNoShoes), max(nba$HeightNoShoes), by = 0.1),
  Density = gaussian_mixture_pdf(
    means = nba_mix$mu,
    sds = nba_mix$sigma,
    weight = nba_mix$lambda[1],
    x_values = seq(min(nba$HeightNoShoes), max(nba$HeightNoShoes), by = 0.1)
  )
)


5.5 Strengths and Limitations

As we discussed above, we can think of mixture density estimates as representing a midpoint between the two extremes of parametric density estimation and kernel density estimation. Compared to parametric density estimates, mixture density estimates tend to capture complex patterns in the data more accurately (in statistical parlance, they exhibit less “bias”). On the other hand, compared to kernel density estimates, mixture density estimates tend to be less sensitive to the impacts of small changes in the data (i.e., they exhibit less “variance”).

This is evident from looking at how a two-component gaussian MDE handles our three difficult example cases:

As you can see, density curves generated using two-component gaussian MDE capture these patterns more effectively than the curves generated using PDE, but somewhat less effectively than the curves generated using KDE. Again, this highlights the tradeoff between the smoothness of a density curve and the level of detail that it can capture. Compared to KDE, density estimates produced using a mixture model can be more intuitive and easier to interpret. However, this increased simplicity can come with a cost, since a density curve generated using MDE may still not accurately capture some of the more complex patterns in the data.

MDE handled the bimodal data in our example case very well. This is not surprising, since I happen to know that these example data were actually generated from a two-component gaussian mixture model. However, it is important to note that we should expect gaussian MDE to perform more poorly on bimodal data that have a distribution shape that is more difficult to approximate with a weighted sum of normal curves.

Also note how, unlike the curve generated using PDE, the MDE curve was able to capture outliers. Specifically, you can see that the EM algorithm was able to capture the data on the left side of the plot well using only one normal curve, leaving the second curve available to capture the outliers.

However, it is important to note that if the non-outlier observations had exhibited a more complex structure (rather than being drawn from a normal distribution, as they were here), then the MDE curve may have needed to “spend” both component distributions to capture those patterns. If this were the case, then the MDE curve would likely be unable to capture the outliers very well, since it would not have an extra component distribution available to separately model those outliers. Including a larger number of component distributions could increase the ability of MDE curves to capture outliers in our data.


6 Takeaways for Design

As we’ve discussed the processes used to generate density curves, you’ve probably already started to sniff out several of the design questions that can lurk beneath the apparent simplicity of these smooth plotted curves. To wrap up, I’d like to distill some of the density curve-related design questions that I’ve encountered so far and provide some suggestions for addressing them.


6.1 Choosing a Density Estimation Approach

Perhaps the most obvious question is “Which density estimation approach should I use in my display?” Based on everything we’ve discussed, I would suggest that kernel density estimation may be your safest bet for most applications. With KDE, there is much less risk of missing complex features of a distribution that do not conform to an assumed distribution shape. With PDE (and to some extent, MDE), you may run the risk of presenting an overly-simplistic view of the true distribution to your users if the observed data cannot be captured by a stereotyped distribution shape.

However, PDE and MDE do offer benefits that could be worth this risk in some circumstances. Compared to KDE, density estimates derived using PDE or MDE can be smoother and easier to interpret. These benefits could be especially valuable for interfaces that value aesthetics over true-to-life accuracy. If you’re facing this type of design challenge and you can be confident that the variable your density curve is summarizing can be accurately captured by a combination of one or more parametric density curves, then it may be worth considering PDE or MDE.


6.2 Choosing Settings and Parameters

For many density estimation approaches, there are mathematical techniques for estimating the “optimal” parameter values that produce the best possible fit to your observed data. However, settings and parameter values can only be “optimal” to the extent that they produce a density curve that addresses the information needs inherent in the work domain for which you are designing. Therefore, while automatic parameter selection approaches are a good start, I also recommend considering the practical tradeoffs inherent in selecting a value for each parameter as you make your decisions.

For example, when specifying the bandwidth for KDE, you may want to consider the intended use of the final density estimate. If the purpose of this display element is to enable users to detect outliers or unusual events, then a narrower kernel may be preferable, since it could be more likely to capture outliers. On the other hand, if the primary concern is the general shape of the distribution, then the smoother curve produced with a wider kernel may be preferable.


The figure below summarizes some recommendations for addressing these questions in your designs.



That’s it for now! The next time you sketch a density curve into one of your designs, I hope you’ll join me in doing a quick “gut check” to consider the processes that could be used to produce that curve in your final display and how they can impact what users will see in your final product.



7 TLDR

  • Density curves are models, and “all models are wrong”.
    • Using a different density estimation approach, or the same approach with different parameter values, can give you a different density curve.
  • Each density estimation approach makes different assumptions that can influence the final density curve.
    • Parametric DE: Assumes a distribution shape
    • Kernel DE: Does not assume a distribution shape
    • Mixture DE: Assumes a distribution shape for each of multiple component distributions
  • Each density estimation approach has distinct settings that can have different influences on the final density curve.
    • Parametric DE: Distribution type; Distribution-specific parameters
    • Kernel DE: Kernel Shape; Bandwidth
    • Mixture DE: Type and number of component distributions; Distribution-specific parameters; Mixing parameters
  • When specifying a DE approach to use in a display and setting the parameters for that approach, consider the tradeoffs between the approaches in the context of what you need the distribution to do in your display.
    • See above for recommendations on when to consider using each approach in your designs.



8 References

8.1 Content

  • Silverman, B. W. (1986). Density estimation for statistics and data analysis (Vol. 26). CRC press.
  • Course notes and lectures from SYS 6018 (Data Mining), taught by Prof. Michael Porter at the University of Virginia, Spring 2021.
  • Lecture notes from a statistical learning course taught by Nuno Vasconcelos. Available online from the Statistical Visual Computing Lab at the University of California, San Diego (http://www.svcl.ucsd.edu/)


8.2 Figures

Two of the elements in the figures above were adapted from figures published by other authors on Wikipedia. These elements are denoted with an attribution to the original authors in the bottom right corner of each image.


All other figures included in this tutorial are original work created by the author.

8.3 Data

The NBA Draft Combine Measurements dataset used in this tutorial was originally collected by Andrew Chou (https://data.world/achou). This dataset is available online at: https://data.world/achou/nba-draft-combine-measurements