Why you should use Swarmplots for Data Visualization

Debanjali Basu
3 min read · Feb 21, 2022
Photo by City Church Christchurch on Unsplash

Data visualization is an important step in exploratory data analysis (EDA). Two of the most commonly used data visualization libraries in Python are matplotlib and seaborn. We can create some wonderful plots and diagrams using the seaborn library. It also has several less frequently used plots with great potential to tell insightful stories about the data, and one such plot is the swarmplot.

A swarmplot works best in a classification problem with many independent features. It is similar to a scatterplot with one categorical axis, but the data points are adjusted so that they do not overlap, which gives a clear view of how the classes are distributed across each feature.

I am going to use the Breast Cancer data to demonstrate the swarmplot. It is a classification problem where the objective is to correctly identify whether a person has breast cancer. Here I will show how to use a swarmplot to identify the most useful features for classification. You can get the data from here.

First, import the libraries:
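
A minimal sketch of the imports, assuming scikit-learn handles the train/test split and the standardization:

```python
import matplotlib.pyplot as plt   # figure and axes handling
import numpy as np                # numerical helpers
import pandas as pd               # loading and reshaping the data
import seaborn as sns             # provides swarmplot

# Assumption: scikit-learn is used for splitting and scaling.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```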

Load the data and print the first 5 rows.
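
A sketch of this step. So that it runs on its own, it reads a tiny made-up stand-in instead of the real file; the values are illustrative only, and the column names (including the empty `Unnamed: 32` column) are an assumption about how the downloaded CSV is laid out. With the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded CSV (illustrative values, not real records).
# With the real file: df = pd.read_csv("data.csv"), where "data.csv" is an
# assumed file name.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean,Unnamed: 32\n"
    "1,M,18.0,10.5,\n"
    "2,M,20.5,17.5,\n"
    "3,B,12.5,15.0,\n"
    "4,B,11.0,21.0,\n"
    "5,M,19.5,14.0,\n"
    "6,B,13.0,18.5,\n"
)

df = pd.read_csv(sample_csv)

print(df.head())   # first 5 rows
print(df.shape)    # (rows, columns)
```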

There are 569 rows and 33 columns in total. I will drop the unwanted columns and split the remaining columns into x (features) and y (target) variables.
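
A sketch of the drop-and-split step on a small made-up frame. Which columns count as "unwanted" is an assumption here: an `id` column and an empty trailing `Unnamed: 32` column, neither of which carries information about the diagnosis.

```python
import numpy as np
import pandas as pd

# Made-up stand-in frame (illustrative values only).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [18.0, 12.5, 11.0, 20.5],
    "texture_mean": [10.5, 15.0, 21.0, 17.5],
    "Unnamed: 32": [np.nan] * 4,
})

# Assumption: these are the columns to discard.
unwanted = ["id", "Unnamed: 32"]
df = df.drop(columns=unwanted, errors="ignore")

y = df["diagnosis"]               # target: the diagnosis class
x = df.drop(columns="diagnosis")  # features: everything else

print(x.columns.tolist())
print(y.value_counts())
```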

There are no null or duplicate values here, so I will directly split the data into train and test sets. I will use the train data for the visualization, but first I have to standardize the features to bring them onto the same scale. Standardization puts all features on one comparable axis, which helps when judging their importance for the classification problem. (This can be an important step in feature selection.)
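
A sketch of the checks, the split, and the standardization, assuming scikit-learn's `train_test_split` and `StandardScaler`. The data is randomly generated for illustration, and the 80/20 split and `random_state` are arbitrary choices, not values taken from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in features on very different scales (illustrative only).
rng = np.random.default_rng(0)
x = pd.DataFrame({
    "radius_mean": rng.normal(14, 3, 40),
    "area_mean": rng.normal(650, 350, 40),
})
y = pd.Series(np.where(x["radius_mean"] > 14, "M", "B"), name="diagnosis")

# Quick checks for missing and duplicated rows.
print(x.isnull().sum().sum(), x.duplicated().sum())

# Assumed split ratio: 80% train, 20% test.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Fit the scaler on the train data only, then reuse it for the test data.
scaler = StandardScaler()
x_train_std = pd.DataFrame(
    scaler.fit_transform(x_train), columns=x_train.columns, index=x_train.index
)
x_test_std = pd.DataFrame(
    scaler.transform(x_test), columns=x_test.columns, index=x_test.index
)

print(x_train_std.describe().loc[["mean", "std"]])
```

Fitting the scaler on the train set alone keeps information from the test set out of the preprocessing step.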

Let's visualize the data using a swarmplot. Here, I am using the melt function of the pandas library, where 'diagnosis' is the identifier variable and the rest of the features are the measured variables. If you want to read more on the melt function, you can find it here.
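
A sketch of the reshape and the plot on randomly generated, already-standardized stand-in data. The column names `features` and `value` for the melted frame, the figure size, and the label rotation are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data (illustrative only): one feature that
# separates the classes and one that does not.
rng = np.random.default_rng(0)
n = 30
data = pd.DataFrame({
    "diagnosis": ["M"] * n + ["B"] * n,
    "radius_mean": np.concatenate([rng.normal(1, 0.5, n), rng.normal(-1, 0.5, n)]),
    "texture_mean": rng.normal(0, 1, 2 * n),
})

# Wide to long: 'diagnosis' stays as the identifier, every other column
# becomes a (features, value) pair.
long_df = pd.melt(data, id_vars="diagnosis", var_name="features", value_name="value")

fig, ax = plt.subplots(figsize=(8, 5))
sns.swarmplot(data=long_df, x="features", y="value", hue="diagnosis", ax=ax)
ax.tick_params(axis="x", rotation=90)  # keeps many feature names readable
fig.tight_layout()
# Call plt.show() when running as a script; notebooks render the figure automatically.
```

In this toy example the two classes form separate clusters for `radius_mean` and are mixed for `texture_mean`, which is the pattern to look for in the real plot.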

Swarmplot

From the swarmplot above, we can see that variables such as radius_mean, perimeter_mean, area_mean, concavity_mean, concave_points_mean, perimeter_se, area_se, radius_worst, perimeter_worst, area_worst, concavity_worst, and concave points_worst clearly separate the two diagnosis classes. These features are therefore likely to be more helpful in building the model and to contribute towards better performance/accuracy.

For other features, such as texture_mean and smoothness_mean, it is difficult to separate the data by diagnosis class, since the data points from both classes are mixed together.

The swarmplot can also be very useful for spotting outliers in the data. We can see that area_se, concavity_se, and fractal_dimension_se have outliers.
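
What the plot suggests visually can be cross-checked numerically. The sketch below uses the common 1.5 × IQR rule, which is a swapped-in technique and not something the swarmplot itself computes; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Made-up values standing in for one feature column (illustrative only).
area_se = pd.Series(
    [20.0, 22.0, 24.0, 25.0, 27.0, 30.0, 31.0, 33.0, 250.0], name="area_se"
)

print(iqr_outliers(area_se))
```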

Therefore, swarmplots can be very insightful in showing the importance of features for the problem, their distributions, and the outliers in the dataset.

Hopefully, this was helpful in understanding the importance of swarmplots in classification problems. I will post another article on the same dataset with detailed work on EDA and visualization.

Debanjali Basu

Data Scientist | Data Lover ❤ I like to use Data Science to answer Financial and Socio-Economic issues.