A beginner-friendly, step-by-step guide to EDA and Visualization — Breast Cancer Data

Debanjali Basu
5 min read · Feb 22, 2022
Photo by Angiola Harry on Unsplash

Data Science is one of the fastest-growing fields in the job market. More digitalization means more data, which means more demand for data science. Every day more people are switching to data science, and it can be a little overwhelming at the beginning to learn so many concepts and use them to analyze real-life data.

EDA (exploratory data analysis) is one of the most vital steps in any data science project. Here, I will show a step-by-step guide to EDA and visualization on the Breast Cancer data. You can download the data from here.

Let’s first import the libraries:
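A typical import cell for this walkthrough might look like the following (the article names pandas and seaborn explicitly; the rest of the set is my assumption about the usual companions):

```python
import numpy as np                # numerical helpers
import pandas as pd               # tabular data handling
import matplotlib.pyplot as plt   # base plotting
import seaborn as sns             # statistical visualization

sns.set_theme(style="whitegrid")  # a clean default style for all plots
```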

Read the data and check the first five rows to get a rough understanding of the data.

The dataset has 569 rows and 33 columns. You can check this with the data.shape attribute.
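A sketch of these first steps. The article does not give the CSV's file name, so `"data.csv"` below is a hypothetical path; to keep the snippet runnable I fall back on scikit-learn's built-in copy of the same Wisconsin Diagnostic data and relabel its numeric target as B/M:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# With the downloaded Kaggle CSV you would simply do:
#   data = pd.read_csv("data.csv")   # hypothetical file name
# Runnable stand-in: scikit-learn ships the same Wisconsin Diagnostic data.
# (The CSV additionally carries an 'id' column and an empty 'Unnamed: 32'
# column, which is why it has 33 columns instead of the 31 built here.)
raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})  # 0 = malignant, 1 = benign

print(data.shape)    # (569, 31) here; the Kaggle CSV gives (569, 33)
print(data.head())   # first five rows for a rough feel of the data
```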

The target variable is ‘diagnosis’, which has two classes: B (if the tumor is benign) or M (if the tumor is malignant). The independent features describe properties of the breast tumor.

I will assign ‘diagnosis’ to the y variable and the rest of the features to the x variable after dropping the unwanted columns.
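This step might look as follows. The "unwanted columns" in the Kaggle CSV are typically the 'id' column and the empty 'Unnamed: 32' column (my assumption; the article does not name them); the runnable stand-in below has neither, so only 'diagnosis' needs separating out:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in for pd.read_csv("data.csv"), as in the earlier step
raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})

# With the Kaggle CSV you would first drop the unwanted columns, e.g.:
#   data = data.drop(columns=["id", "Unnamed: 32"])
y = data["diagnosis"]
x = data.drop(columns=["diagnosis"])

print(x.shape, y.shape)   # (569, 30) (569,)
```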

EDA and Feature Engineering

First, check for any missing values and duplicate rows in the dataset. There are no missing values or duplicate observations here. But in case you face missing values in any dataset, there are a few ways to deal with them. Here’s a good article on handling missing data.
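The check itself is two one-liners (again using the scikit-learn stand-in for the CSV so the snippet runs as-is):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})

missing = data.isnull().sum().sum()      # total missing cells across all columns
duplicates = data.duplicated().sum()     # rows that repeat an earlier row exactly
print(missing, duplicates)               # both 0 for this dataset
```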

One more important step here is to check the class balance of the dataset. If the dataset is imbalanced, the model's predictions will be biased toward the majority class, so it is important to address the imbalance before progressing further. I am using countplot from the seaborn library to visualize the total number of records in each class. The dataset is not severely imbalanced, so I will use it as it is. If you want to read more on balancing datasets you can check this article.
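A sketch of the countplot, with the class counts printed alongside so the balance is visible even without the figure:

```python
import matplotlib
matplotlib.use("Agg")               # non-interactive backend, safe without a display
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})

ax = sns.countplot(x="diagnosis", data=data)   # one bar per class
counts = data["diagnosis"].value_counts()
print(counts)   # B: 357, M: 212 -- moderately imbalanced, but workable
```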

The next step is to check the descriptive statistics of the dataset. This gives us an understanding of the range, the measures of central tendency (mean, median, mode) and the measures of dispersion (standard deviation).
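In pandas this is a single describe() call; transposing makes the 30 features easier to scan row by row:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
x = data.drop(columns=["diagnosis"])

stats = x.describe()   # count, mean, std, min, quartiles, max per feature
# One row per feature: central tendency (mean, median) and dispersion (std, range)
print(stats.T[["mean", "50%", "std", "min", "max"]].head())
```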

Train Test Split

We should split the dataset into train and test sets before applying any feature engineering techniques, to avoid data leakage.

The training set has 455 records and the test set has 114.
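An 80/20 split reproduces those counts. The random_state and stratify arguments are my choices, not stated in the article; stratifying keeps the B/M ratio the same in both splits:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
y = data["diagnosis"]
x = data.drop(columns=["diagnosis"])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)
print(len(x_train), len(x_test))   # 455 114
```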

Feature Scaling

Since the columns are in different units, it is difficult to compare them directly. So, before visualizing and building a model, we have to bring all the independent variables onto a common scale. There are two main methods of feature scaling: standardization and normalization. Here, I will use standardization, which rescales each independent feature to zero mean and unit variance.
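With scikit-learn's StandardScaler, the key point is to fit on the training data only and reuse those parameters on the test data, so no test-set information leaks into the scaling:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
y = data["diagnosis"]
x = data.drop(columns=["diagnosis"])
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)  # split choices are mine

scaler = StandardScaler()
# fit_transform on train; transform (only) on test, reusing the train statistics
x_train_scaled = pd.DataFrame(scaler.fit_transform(x_train),
                              columns=x_train.columns, index=x_train.index)
x_test_scaled = pd.DataFrame(scaler.transform(x_test),
                             columns=x_test.columns, index=x_test.index)

# Each scaled training column now has mean ~0 and (population) std 1
print(x_train_scaled["mean radius"].mean(), x_train_scaled["mean radius"].std(ddof=0))
```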

After scaling, I will join x_train and y_train for visualization purposes using the concat function from the pandas library.
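The join is one concat along the columns; keeping the original index on the scaled frame is what makes the rows line up correctly:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled training features from the earlier steps
# (split choices random_state=42 / stratify are my assumptions)
raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
y = data["diagnosis"]
x = data.drop(columns=["diagnosis"])
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
x_train_scaled = pd.DataFrame(scaler.fit_transform(x_train),
                              columns=x_train.columns, index=x_train.index)

# Column-wise concat: matching indices pair each scaled row with its label
train_df = pd.concat([x_train_scaled, y_train], axis=1)
print(train_df.shape)   # (455, 31): 30 scaled features plus 'diagnosis'
```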

Data Visualization

I will be using violin plots and swarmplots to visualize the data. These two plot types are less commonly used, but both are very insightful.

Violin Plot

Violin plots are similar to box plots, but they also show the probability density of the data at different values.
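One way to build such a plot: melt the scaled training frame into long form, then draw one split violin per feature. The article does not say which features go into each figure, so plotting the first ten is my choice:

```python
import matplotlib
matplotlib.use("Agg")   # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled training frame (split choices are my assumptions)
raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
y = data["diagnosis"]
x = data.drop(columns=["diagnosis"])
x_train, _, y_train, _ = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)
x_train_scaled = pd.DataFrame(StandardScaler().fit_transform(x_train),
                              columns=x_train.columns, index=x_train.index)
train_df = pd.concat([x_train_scaled, y_train], axis=1)

# Long form: one row per (sample, feature) pair, first ten features only
melted = pd.melt(train_df, id_vars="diagnosis",
                 value_vars=list(train_df.columns[:10]),
                 var_name="feature", value_name="value")

plt.figure(figsize=(12, 6))
ax = sns.violinplot(data=melted, x="feature", y="value",
                    hue="diagnosis", split=True, inner="quart")
plt.xticks(rotation=45)
plt.tight_layout()
```

With split=True, each violin shows the B half and the M half back to back, so well-separated classes are visible at a glance.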

Observations:

  1. radius_se, perimeter_se, area_se, smoothness_se, concavity_se and fractal_dimension_se have a lot of outliers.
  2. radius_mean, perimeter_mean, area_mean, concavity_mean, concave points_mean, perimeter_worst, area_worst, concavity_worst and concave points_worst are the variables that can easily differentiate between the two diagnosis classes.

Swarmplot

It is similar to a scatterplot, but here the data points do not overlap, so we get a clearer view of how the classes are distributed across each feature.
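The construction mirrors the violin plot: melt to long form, then swarm. Swarmplots get slow and crowded with many points, so I restrict this sketch to the first five features (my choice, not the article's):

```python
import matplotlib
matplotlib.use("Agg")   # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled training frame (split choices are my assumptions)
raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
y = data["diagnosis"]
x = data.drop(columns=["diagnosis"])
x_train, _, y_train, _ = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)
x_train_scaled = pd.DataFrame(StandardScaler().fit_transform(x_train),
                              columns=x_train.columns, index=x_train.index)
train_df = pd.concat([x_train_scaled, y_train], axis=1)

# Long form over the first five features keeps the point count manageable
sub = pd.melt(train_df, id_vars="diagnosis",
              value_vars=list(train_df.columns[:5]),
              var_name="feature", value_name="value")

plt.figure(figsize=(12, 6))
ax = sns.swarmplot(data=sub, x="feature", y="value",
                   hue="diagnosis", size=2)   # small markers reduce crowding
plt.xticks(rotation=45)
plt.tight_layout()
```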

Observations :

  1. From the swarmplot above, we can see that variables like radius_mean, perimeter_mean, area_mean, concavity_mean, concave points_mean, perimeter_se, area_se, radius_worst, perimeter_worst, area_worst, concavity_worst and concave points_worst can easily differentiate between the two diagnosis classes. These features will be more helpful in building the model and will contribute towards better performance/accuracy.
  2. In other features like texture_mean, smoothness_mean etc., it is difficult to segregate the data on the basis of the diagnosis classes, since the data points from both classes are mixed up.
  3. The swarmplot can also be very useful for checking outliers in the data. We can see that area_se, concavity_se and fractal_dimension_se have outliers present.

Correlation Analysis

Correlation measures the relationship between two variables. The value ranges from -1 to +1: values close to +1 indicate a strong positive correlation, and values close to -1 a strong negative one. If there are strong relationships among the independent variables (multicollinearity), the model's coefficient estimates become unstable and hard to interpret. I will generate a heatmap to see if there is any significant correlation present.
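The heatmap is corr() plus one seaborn call. As a sanity check, the snippet also prints one pair that is bound to be strongly correlated, since a nucleus's perimeter grows with its radius:

```python
import matplotlib
matplotlib.use("Agg")   # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer(as_frame=True)
data = raw.frame.rename(columns={"target": "diagnosis"})
data["diagnosis"] = data["diagnosis"].map({0: "M", 1: "B"})
x = data.drop(columns=["diagnosis"])

corr = x.corr()                      # 30x30 pairwise Pearson correlations
plt.figure(figsize=(12, 10))
ax = sns.heatmap(corr, cmap="viridis")   # lighter cells = stronger positive correlation
plt.tight_layout()

# Size-related features move together almost perfectly:
print(corr.loc["mean radius", "mean perimeter"])
```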

The lighter the cell, the stronger the correlation. We can see there are many variables with high correlation. In the next article I will show how to remove multicollinearity using VIF and some other feature selection methods.

I hope you have learned something from this article. If you found it helpful, please give it a clap and share it with someone who might find it useful.

