Dealing with Missing Data for Visualization

Visual Reality
Apr 6, 2021

This post discusses the challenges of missing data and their impact on our visualizations. As explained in our post on data description, around half of the values in our data table are missing, which complicates both analysis and visualization. This post does not cover the general theory of missingness mechanisms or the full range of methods for handling missing data. Instead, it gives an overview of which variables have the most missing values, the likely reasons behind the missingness, and how we plan to proceed.

Variable-wise missingness: As a first step, we checked the percentage of missing values for each variable. The percentage varies widely across variables. Some, like the change in general energy consumption, have only around 2% missing data, whereas others, like the change in biofuel consumption, have almost 95%. Overall, the more specific a measurement is, the more missing values it tends to have.
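The per-variable check can be sketched in a few lines of pandas. This is a minimal illustration on a made-up table, not our actual dataset: the column names and the helper `missing_share` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the column names are illustrative only.
df = pd.DataFrame({
    "country": ["A", "A", "A", "A"],
    "year": [1990, 2000, 2010, 2020],
    "energy_cons_change_pct": [1.2, 0.8, np.nan, 2.1],
    "biofuel_cons_change_pct": [np.nan, np.nan, np.nan, 4.0],
})


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    # isna() gives a boolean table; the column mean is the missing fraction.
    return (frame.isna().mean() * 100).sort_values(ascending=False)


missing_pct = missing_share(df)
print(missing_pct.round(1))
```

Sorting the result puts the sparsest variables at the top, which makes it easy to see which ones need special treatment.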

Time- and geography-dependent missingness: Another not-so-surprising observation is that more recent data has fewer missing values than older data. This is mainly because the range of energy types in use has grown over time, so variables for newer energy types have nothing to report for earlier years. For instance, there are no observations for solar or biofuel energy from the 1960s. Similarly, when the data is grouped by country, the countries with more missing data are those with less diverse energy sources.
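One way to look at this pattern is to compute the share of missing cells per decade and per country. The sketch below uses a small invented table; the columns `solar` and `coal` and the helper `missing_by` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, three years, two energy variables.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1965, 1995, 2015, 1965, 1995, 2015],
    "solar": [np.nan, np.nan, 3.0, np.nan, np.nan, np.nan],
    "coal": [10.0, 12.0, 9.0, np.nan, 5.0, 6.0],
})
value_cols = ["solar", "coal"]


def missing_by(frame: pd.DataFrame, key: pd.Series, cols: list) -> pd.Series:
    """Share of missing cells (in %) across `cols` for each group of `key`."""
    # Fraction of missing cells in each row, then averaged within each group.
    row_missing = frame[cols].isna().mean(axis=1)
    return row_missing.groupby(key).mean() * 100


decade = (df["year"] // 10) * 10  # 1965 -> 1960, 2015 -> 2010
by_decade = missing_by(df, decade, value_cols)
by_country = missing_by(df, df["country"], value_cols)

print(by_decade)
print(by_country)
```

In this toy table the missing share falls from the 1960s to the 2010s, and country B, which reports no solar data at all, has more missingness than country A.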

Visualization with missing data: Having understood the main reasons behind the missingness in our dataset, the first question that comes up is: “What will visualizations look like with missing data?” Since we have a large dataset covering a wide range of years, skipping some years over the course of decades would not damage the overall interpretation. Also, for most countries the more specific variables, such as the biofuel-related ones, only start in certain years. Therefore, comparing specific variables within restricted time periods might be a better approach for visualization. Two simple graphs of an artificial dataset are given below: the left one shows the whole picture of the data, and the right one zooms in on the complete cases.

Left: Plot with missing data, resulting in shorter lines. Right: Zooming in on complete cases

Methods to deal with missing data: Methods fall mainly into two categories: complete case analysis and imputation. Because so much of our dataset is missing, complete cases make up only a very small part of the data, so complete case analysis is out of the question. Imputation methods (mean, median, regression, etc.) might be a good option in some cases. However, it would not be feasible to impute 90% of a variable's values from the 10% that are observed. For most of the variables, imputation would not be meaningful either. For example, if a country built its first nuclear plant in 1995, then extrapolating nuclear energy production to the years before 1995 would not add any value to our analysis. If, however, data are missing only for a brief period between observed values, then interpolation would be suitable.
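The distinction between interpolating short gaps and extrapolating before the first observation can be sketched as follows. The helper `fill_short_gaps`, the `max_gap` threshold of two years, and the numbers are all assumptions made for this example; linear interpolation is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly series for one country: nothing before 1995,
# a one-year gap (1996), a three-year gap (1998-2000), nothing after 2001.
series = pd.Series(
    [np.nan, np.nan, 5.0, np.nan, 7.0, np.nan, np.nan, np.nan, 15.0, np.nan],
    index=range(1993, 2003),
)


def fill_short_gaps(s: pd.Series, max_gap: int = 2) -> pd.Series:
    """Linearly interpolate interior gaps of at most `max_gap` values."""
    is_na = s.isna()
    # Give every run of consecutive missing / non-missing values its own id.
    run_id = (is_na != is_na.shift()).cumsum()
    run_len = run_id.map(run_id.value_counts())
    # limit_area="inside" never fills before the first or after the last
    # observed value, so nothing is extrapolated.
    filled = s.interpolate(method="linear", limit_area="inside")
    short_gap = is_na & (run_len <= max_gap)
    # Take interpolated values only inside short gaps; keep the rest as is.
    return s.where(~short_gap, filled)


result = fill_short_gaps(series, max_gap=2)
print(result)
```

Here only 1996 is filled (with 6.0). The years before 1995 stay missing because there is nothing to interpolate between, and the three-year gap is left alone because it exceeds `max_gap`.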

What should we do? We do not have to visualize the entire dataset in a single dashboard, and it would not be feasible to do so. This is also in line with some of the feedback we have received. The data can be cut into sub-sections based on specific contexts (e.g., renewable energy trends in South America over the last decade). Within each sub-section, the percentage of missing data would be lower than in the dataset as a whole. So, overall, we are going to use available case analysis as our main method for handling missing data.
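As a rough sketch of this idea, the snippet below restricts an invented table to one region and period, compares the missing share before and after, and then keeps, for each plot, only the rows where the variables that plot needs are observed. The column names, the year range, and the helper `overall_missing_pct` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and values are illustrative only.
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Chile", "Chile", "France", "France"],
    "region": ["South America"] * 4 + ["Europe"] * 2,
    "year": [1970, 2015, 2015, 2018, 1970, 2015],
    "solar_share": [np.nan, 1.0, 2.0, 4.0, np.nan, 1.5],
    "wind_share": [np.nan, 3.5, np.nan, 3.0, np.nan, 3.9],
})
cols = ["solar_share", "wind_share"]


def overall_missing_pct(frame: pd.DataFrame, columns: list) -> float:
    """Percentage of missing cells across the given columns."""
    return float(frame[columns].isna().to_numpy().mean() * 100)


# Cut the data down to one context: South America, 2011-2020.
subset = df[(df["region"] == "South America") & df["year"].between(2011, 2020)]

# Available case analysis: each plot uses every row in which the
# variables it needs are observed, rather than fully complete rows only.
solar_rows = subset.dropna(subset=["solar_share"])
pair_rows = subset.dropna(subset=["solar_share", "wind_share"])

print(overall_missing_pct(df, cols), overall_missing_pct(subset, cols))
print(len(solar_rows), len(pair_rows))
```

A solar-only chart can use all three rows of the subset, while a chart comparing solar and wind uses the two rows where both are present.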

Another option (also suggested in the feedback) would be to aggregate data according to certain measures. For instance, variables related to clean energy (solar, wind, biofuel, etc.) can be grouped and labeled as “clean energy”. The clean energy group will then have less missingness than its individual components. Also, sudden jumps in the visualizations may reflect an abrupt increase that can be interpreted as the opening of a new energy facility (e.g., the installation of a new wind turbine).
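A minimal sketch of such an aggregation is shown below, on invented numbers. It assumes that a missing component can be treated as contributing nothing to the total, which is only reasonable when "missing" means the source was not yet in use; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table for one country; values are illustrative only.
df = pd.DataFrame({
    "year": [1990, 2005, 2020],
    "solar": [np.nan, np.nan, 2.0],
    "wind": [np.nan, 1.0, 3.0],
    "biofuel": [np.nan, np.nan, np.nan],
})
clean_cols = ["solar", "wind", "biofuel"]

# Sum the components row by row, skipping missing values.
# min_count=1 keeps the total missing when every component is missing,
# instead of reporting a misleading 0.
df["clean_energy"] = df[clean_cols].sum(axis=1, min_count=1)
clean = df["clean_energy"]

print(df[clean_cols + ["clean_energy"]].isna().mean() * 100)
```

The aggregated column is missing only where all of its components are missing, so its missing share can never exceed that of the most complete component.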

In conclusion, we discussed how to analyze missing data and how to visually interpret missingness. In our next blog, we will show our preliminary visualizations.
