“How many dimensions can you show on a scatterplot or a chart?” is a question that comes up in many settings, for example in data scientist interviews. It also came up at my workplace, where my colleagues were of the opinion that five dimensions would be a bit too much for a two-dimensional scatterplot. Here, I write some simple R code to plot five dimensions in a two-dimensional scatterplot. Along with the five-dimensional scatterplot, I will also make a few adjustments to the figure produced by ggplot for better (?) visuals.
First, let us load the libraries needed for the script.
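The rest of the script only calls functions from ggplot2 and base R, so a minimal version of this step, assuming no other packages were intended, would be:

```r
# Assumption: ggplot2 is the only non-base package the script needs.
# Install it once with install.packages("ggplot2") if it is missing.
library(ggplot2)
```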
Next, we create a five-dimensional data frame. We will use two continuous variables for the two axes and three categorical variables to display different classes in the data. Since we use the rnorm() function to generate random points, we call set.seed() first so that we get exactly the same sequence of numbers on every run. This is important for the reproducibility of the code.
set.seed(123)
dat <- data.frame(
  status = rep(c("Single", "Married"), each = 10),  # categorical variable: marital status
  height = 41:60 + rnorm(20, sd = 3),               # variable for the x-axis
  weight = 41:60 + rnorm(20, sd = 3),               # variable for the y-axis
  Education = rep(4:8, 4),                          # numerical but categorical; determines the size of points
  gender = rep(c("Male", "Female"), each = 2)       # gender to display in the plot
)
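Two details of the data frame above are easy to check in isolation. The sketch below uses only base R; the names `first`, `second`, and `demo` are made up for illustration. It shows that resetting the seed reproduces the same draws, and that a length-4 vector such as the gender column is recycled by data.frame() to fill 20 rows.

```r
# Same seed, same random draws
set.seed(123)
first <- rnorm(5)
set.seed(123)
second <- rnorm(5)
identical(first, second)  # TRUE

# The gender vector has only 4 elements; data.frame() recycles it
# a whole number of times to match the 20 rows of the other column
gender <- rep(c("Male", "Female"), each = 2)
demo <- data.frame(id = 1:20, gender = gender)
table(demo$gender)        # 10 Female, 10 Male
```

Recycling only works because 20 is a multiple of 4; with a row count that is not, data.frame() stops with an error instead.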
Now we use the ggplot() function from the ggplot2 package. Its most important argument is the mapping created by aes(). The name is short for aesthetics: aes() maps variables in the data to visual properties of the plot, such as the x-axis, the y-axis, and other properties like shape, color, and size.
dat.plot <- ggplot(dat, aes(x = height,         # height as x-axis
                            y = weight,         # weight as y-axis
                            shape = status,     # shape denotes marital status
                            color = gender,     # color denotes the gender
                            size = Education))  # size denotes the education
The aes() function only defines the aesthetic mapping; to draw anything, we need to add a layer of points that uses it. The function geom_point() does exactly that.
dat.plot <- dat.plot + geom_point()
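Note that assigning a plot to a variable does not draw it. At the console a ggplot object is drawn when it is printed, and inside a script or function it has to be printed or saved explicitly. A minimal sketch, using a made-up data frame `toy` and a made-up file name rather than the data above:

```r
library(ggplot2)

# Hypothetical toy data, not the article's data frame
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))
p <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Draw the plot on the active graphics device
print(p)

# Save the plot to a PNG file in the session's temporary directory
out_file <- file.path(tempdir(), "toy_scatter.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)
```

The same two calls work on dat.plot at any stage of the script.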
We can change the x-axis label, the y-axis label, and the title using:
dat.plot <- dat.plot + xlab("Height")                           # add x-axis label to plot
dat.plot <- dat.plot + ylab("Weight")                           # add y-axis label to plot
dat.plot <- dat.plot + ggtitle("Five Dimensional Scatterplot")  # add title to image
This creates a plot like Figure 1 above. Figure 1 shows five dimensions in a two-dimensional scatterplot: on top of the regular X and Y axes, we have used size, shape, and color to add three further dimensions. However, we can put the legends inside the plot, as the left-hand side of the plot is free. We can do this with the theme() function of ggplot2, which controls the non-data elements of the plot.
# Put bottom-left corner of legend box in bottom-left corner of graph
dat.plot <- dat.plot + theme(legend.position = c(0, 0),        # position of the legend
                             legend.justification = c(0, 0))   # justification of the legend
# bottom-left is 0,0; top-right is 1,1
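In ggplot2 3.5.0 and later, passing a numeric vector to legend.position is deprecated and produces a warning; the coordinates go into legend.position.inside instead. The sketch below, on a made-up data frame `toy`, picks the form that matches the installed version:

```r
library(ggplot2)

# Hypothetical toy data with one grouping variable
toy <- data.frame(x = 1:6, y = c(2, 4, 3, 5, 1, 6), g = rep(c("a", "b"), 3))
p <- ggplot(toy, aes(x = x, y = y, color = g)) + geom_point()

if (packageVersion("ggplot2") >= "3.5.0") {
  # Newer syntax: request an inside legend, then give its coordinates
  p <- p + theme(legend.position = "inside",
                 legend.position.inside = c(0, 0),
                 legend.justification = c(0, 0))
} else {
  # Older syntax, as used in this article
  p <- p + theme(legend.position = c(0, 0),
                 legend.justification = c(0, 0))
}
```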
This creates the plot shown in Figure 2. However, Edward Tufte would still not be happy, because the gray background that ggplot draws by default takes some ink, so the data-to-ink ratio is decreased (actually, it is my personal preference that plots have a white background). Let us also improve the readability of the plot with larger font sizes for the title and the axis labels.
dat.plot <- dat.plot + theme(
  legend.position = c(0, 0.3),
  legend.justification = c(0, 0),
  panel.background = element_rect(fill = "white",       # background color of the plot
                                  colour = "black",     # color of the rectangle around the plot
                                  size = 1,             # line width of the rectangle around the figure
                                  linetype = "solid"),  # line type of the rectangle around the figure
  axis.title = element_text(size = 18, face = "bold"),  # font face and font size of both X and Y labels
  axis.text = element_text(size = 14),                  # tick labels of both X and Y axes
  plot.title = element_text(size = rel(2),              # size of the plot title
                            face = "bold",
                            colour = "black"))          # color and font face of the title
dat.plot <- dat.plot + scale_color_discrete(name = "Gender", labels = c("Female", "Male"))
dat.plot <- dat.plot + scale_shape_discrete(name = "Marital\nStatus", labels = c("Married", "Single"))
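One more version note on the theme() call above: since ggplot2 3.4.0, the width of lines and borders is set with linewidth, and using size for it in element_rect() gives a deprecation warning. A minimal sketch on a made-up data frame `toy`, choosing the argument by installed version:

```r
library(ggplot2)

# Hypothetical toy data, not the article's data frame
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))
p <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# White panel with a solid black border of width 1
border <- if (packageVersion("ggplot2") >= "3.4.0") {
  element_rect(fill = "white", colour = "black", linewidth = 1, linetype = "solid")
} else {
  element_rect(fill = "white", colour = "black", size = 1, linetype = "solid")
}

p <- p + theme(panel.background = border)
```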
The final plot is shown in Figure 3, which seems better than the first one. The code is available from my GitHub page.