This tidy data set contains 1,599 red wines with 11 variables describing the chemical properties of each wine, plus a quality rating. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (very excellent). The data set covers the red variant of the Portuguese wine “Vinho Verde”.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] "fixed.acidity -> 0"
## [1] "volatile.acidity -> 0"
## [1] "citric.acid -> 0"
## [1] "residual.sugar -> 0"
## [1] "chlorides -> 0"
## [1] "free.sulfur.dioxide -> 0"
## [1] "total.sulfur.dioxide -> 0"
## [1] "density -> 0"
## [1] "pH -> 0"
## [1] "sulphates -> 0"
## [1] "alcohol -> 0"
## [1] "quality -> 0"
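The per-column check above reports zero missing values for every variable. A minimal sketch of how such a count could be produced is shown here; `wine_demo` is a small made-up data frame (with one `NA` planted on purpose), not the actual wine data, and the loop is only one plausible way to get the `name -> count` lines.

```r
# Small stand-in data frame; the real analysis would use the full wine data frame.
wine_demo <- data.frame(
  fixed.acidity = c(7.4, 7.8, NA, 11.2),   # one NA planted for illustration
  pH            = c(3.51, 3.20, 3.26, 3.16),
  quality       = c(5L, 5L, 5L, 6L)
)

# Count missing values per column.
na_counts <- colSums(is.na(wine_demo))

# Print one "name -> count" line per column.
for (name in names(na_counts)) {
  print(paste(name, "->", na_counts[[name]]))
}
```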
Each variable in the dataset is explored individually by plotting histograms to visualize the distribution of the data.
## pH density fixed.acidity volatile.acidity
## Min. :2.740 Min. :0.9901 Min. : 4.60 Min. :0.1200
## 1st Qu.:3.210 1st Qu.:0.9956 1st Qu.: 7.10 1st Qu.:0.3900
## Median :3.310 Median :0.9968 Median : 7.90 Median :0.5200
## Mean :3.311 Mean :0.9967 Mean : 8.32 Mean :0.5278
## 3rd Qu.:3.400 3rd Qu.:0.9978 3rd Qu.: 9.20 3rd Qu.:0.6400
## Max. :4.010 Max. :1.0037 Max. :15.90 Max. :1.5800
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## citric.acid residual.sugar chlorides sulphates
## Min. :0.000 Min. : 0.900 Min. :0.01200 Min. :0.3300
## 1st Qu.:0.090 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.:0.5500
## Median :0.260 Median : 2.200 Median :0.07900 Median :0.6200
## Mean :0.271 Mean : 2.539 Mean :0.08747 Mean :0.6581
## 3rd Qu.:0.420 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:0.7300
## Max. :1.000 Max. :15.500 Max. :0.61100 Max. :2.0000
## free.sulfur.dioxide total.sulfur.dioxide
## Min. : 1.00 Min. : 6.00
## 1st Qu.: 7.00 1st Qu.: 22.00
## Median :14.00 Median : 38.00
## Mean :15.87 Mean : 46.47
## 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :72.00 Max. :289.00
Most of the variables are approximately normally distributed, except quality, which is discrete and takes 6 levels (3 to 8). The skewed variables are transformed to a log scale so that their distributions are closer to normal.
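To illustrate why a log scale helps, the sketch below compares the skewness of a right-skewed sample before and after a `log10` transform. The data are synthetic (log-normal draws), and `skewness` is a hypothetical helper defined here with the simple moment formula; neither comes from the original analysis.

```r
set.seed(42)

# Synthetic right-skewed values standing in for a skewed chemical measurement.
x <- rlnorm(1000, meanlog = 1, sdlog = 0.5)

# Moment-based sample skewness (hypothetical helper, assumed for this sketch).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)         # skewness on the original scale
skew_log <- skewness(log10(x))  # skewness after the log transform

cat("raw skew:", round(skew_raw, 2), " log10 skew:", round(skew_log, 2), "\n")
```

On data like this, the transformed values are close to symmetric, which is the effect the log scale is meant to achieve in the histograms.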
These pairs of variables seem to have a meaningful correlation (|r| > 0.5).
Meaningful Correlations
The strongest correlation is between pH and fixed acidity (-0.683), followed by citric acid and fixed acidity (0.672) and by fixed acidity and density (0.668). Since pH is a measure of acidity, the correlations involving pH are expected.
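A screen for pairs above the 0.5 threshold could be sketched as follows. The data frame `demo` is synthetic and only borrows three column names from the wine data; the threshold and the use of the upper triangle (to list each pair once) are choices made for this illustration.

```r
set.seed(2)
n <- 400

# Synthetic stand-ins: pH is built to fall as fixed acidity rises.
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
pH            <- 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.1)
alcohol       <- rnorm(n, mean = 10.4, sd = 1)
demo <- data.frame(fixed.acidity, pH, alcohol)

# Pearson correlation matrix of all numeric columns.
r <- cor(demo, method = "pearson")

# Keep each pair once (upper triangle) where |r| exceeds 0.5.
idx <- which(abs(r) > 0.5 & upper.tri(r), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(strong_pairs)
```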
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
##
## Pearson's product-moment correlation
##
## data: citric.acid and pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5756337 -0.5063336
## sample estimates:
## cor
## -0.5419041
##
## Pearson's product-moment correlation
##
## data: residual.sugar and chlorides
## t = 9.8152, df = 1451, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2006699 0.2971338
## sample estimates:
## cor
## 0.2495207
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
##
## Pearson's product-moment correlation
##
## data: density and fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
##
## Pearson's product-moment correlation
##
## data: density and fixed.acidity
## t = 34.697, df = 1433, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6465499 0.7028622
## sample estimates:
## cor
## 0.6756906
##
## Pearson's product-moment correlation
##
## data: density and fixed.acidity
## t = 34.989, df = 1436, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6494617 0.7053386
## sample estimates:
## cor
## 0.6783799
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and total.sulfur.dioxide - free.sulfur.dioxide
## t = 18.771, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3841336 0.4644895
## sample estimates:
## cor
## 0.4251489
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
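The Pearson tests reported above have the shape of output from R's `cor.test`. A minimal sketch on synthetic data is given here; the two vectors only mimic the direction of the fixed acidity and pH relationship and are not the wine measurements.

```r
set.seed(1)
n <- 200

# Synthetic stand-ins: acidity pushes pH down, plus noise.
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
pH            <- 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.1)

res <- cor.test(fixed.acidity, pH, method = "pearson")

res$estimate   # sample correlation coefficient
res$conf.int   # 95 percent confidence interval
res$parameter  # degrees of freedom, equal to n - 2
res$p.value    # p-value for H0: true correlation is 0
```

Because the degrees of freedom equal n - 2, the tests with df = 1597 use all 1,599 wines, while the tests with df = 1433 and df = 1436 were run on subsets of the data.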
## # A tibble: 6 × 3
## quality Skew Ku
## <int> <dbl> <dbl>
## 1 3 -0.40890877 -0.9932596
## 2 4 0.61327279 -0.2320630
## 3 5 1.83042317 5.2544218
## 4 6 0.54442666 -0.1579873
## 5 7 0.01033155 -0.4677010
## 6 8 -0.19693316 -0.9776767
## # A tibble: 6 × 3
## quality Skew Ku
## <int> <dbl> <dbl>
## 1 3 0.6270191 -0.54054899
## 2 4 0.1488264 -0.81529662
## 3 5 0.5911007 1.36339622
## 4 6 0.4327494 0.08843808
## 5 7 0.9428877 0.76022793
## 6 8 1.4455109 1.62440497
## # A tibble: 6 × 3
## quality Skew Ku
## <int> <dbl> <dbl>
## 1 3 0.8850968 -1.0898173
## 2 4 1.6198326 3.3427853
## 3 5 0.5222718 -0.5163235
## 4 6 0.2213311 -0.9880063
## 5 7 -0.3755223 -0.4595533
## 6 8 -0.3260505 -0.9180710
## # A tibble: 6 × 3
## quality Skew Ku
## <int> <dbl> <dbl>
## 1 3 -0.009144773 -1.28854080
## 2 4 -0.293059304 2.35443949
## 3 5 0.053054942 -0.01995589
## 4 6 0.301382195 1.44202134
## 5 7 0.377166503 0.58981835
## 6 8 0.354035873 -0.13122625
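The tables above report skewness (`Skew`) and kurtosis (`Ku`) of a variable within each quality level. Since several `Ku` values are negative, they are excess kurtosis (kurtosis minus 3). The base R sketch below computes the same kind of per-group summary on synthetic data; the helper functions use plain moment formulas, which may differ slightly from the estimator used for the original tables.

```r
set.seed(7)

# Synthetic stand-in: one numeric variable and an integer quality score.
demo <- data.frame(
  quality = rep(3:8, each = 50),
  value   = rnorm(300, mean = 10, sd = 1)
)

# Moment-based helpers (assumed for this sketch).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}
excess_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

# One row per quality level, mirroring the Skew / Ku layout above.
by_quality <- data.frame(
  quality = sort(unique(demo$quality)),
  Skew    = as.numeric(tapply(demo$value, demo$quality, skewness)),
  Ku      = as.numeric(tapply(demo$value, demo$quality, excess_kurtosis))
)
print(by_quality)
```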
Chlorides vary independently of total sulphur dioxide across all the wine qualities and can be used as an independent parameter in the predictive model.
Most of the outliers in the chlorides vs. density / alcohol / total sulphur dioxide plots seem to come from low and medium quality wines (5, 6). This could be due to the higher sampling rate and availability of medium quality wines, as well as inconsistency in the preparation process. Higher quality wines go through a more stringent preparation protocol, which results in greater consistency. Moreover, chlorides seem to be almost absent in high quality wines (7, 8).
Although the predictive model includes the outcome for all quality levels, we show only a pruned version due to space constraints. The model clearly shows the significance of alcohol content in the initial partitioning of the tree diagram.
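The source does not say which library built the tree, so the following is only a sketch under assumptions: it uses the `rpart` package, a synthetic data set in which a two-level `quality` label depends mainly on `alcohol`, and arbitrary complexity parameters. It shows how a full tree can be grown and then pruned for display, and how to read off the variable used in the first split.

```r
library(rpart)  # assumption: the original report may have used a different package

set.seed(3)
n <- 600

# Synthetic stand-ins; quality here is a toy two-level label driven by alcohol.
alcohol   <- runif(n, 8.4, 14.9)
sulphates <- runif(n, 0.33, 2)
quality   <- factor(ifelse(alcohol + rnorm(n, sd = 0.5) > 10.5, "high", "low"))
demo <- data.frame(alcohol, sulphates, quality)

# Grow a deliberately large classification tree.
full_tree <- rpart(quality ~ alcohol + sulphates, data = demo,
                   method = "class",
                   control = rpart.control(cp = 0.001))

# Prune back to a compact tree that is easier to show.
pruned <- prune(full_tree, cp = 0.05)
print(pruned)

# Variable used at the root (first) split.
root_var <- as.character(pruned$frame$var[1])
```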
The negative correlation of fixed acidity with pH follows from how the pH scale is defined: pH falls as the hydrogen ion concentration rises.
Fixed acidity has a strong negative correlation with pH, which is expected since pH is a measure of acidity. The more intriguing point is the extent to which fixed acidity (tartaric acid) influences pH when other acids are also present in the wine. One probable reason for this high correlation is that fixed acidity is also a measure of the non-volatile acid content.
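The sign of this correlation can be seen directly from the usual definition of pH as the negative base-10 logarithm of the hydrogen ion concentration. The short sketch below uses made-up concentrations (in mol/L) and a hypothetical helper `ph_from_h`; it only illustrates the direction of the scale, not wine chemistry.

```r
# pH is approximately -log10 of the hydrogen ion concentration (mol/L).
ph_from_h <- function(h_conc) -log10(h_conc)

# Made-up, increasing concentrations: more acid means more hydrogen ions.
h  <- c(1e-4, 2e-4, 1e-3)
ph <- ph_from_h(h)

print(data.frame(h_conc = h, pH = ph))
```

Each increase in concentration lowers the pH, so any variable that adds acid to the wine is expected to correlate negatively with pH.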
## # A tibble: 6 × 3
## quality Skew Ku
## <int> <dbl> <dbl>
## 1 3 0.8850968 -1.0898173
## 2 4 1.6198326 3.3427853
## 3 5 0.5222718 -0.5163235
## 4 6 0.2213311 -0.9880063
## 5 7 -0.3755223 -0.4595533
## 6 8 -0.3260505 -0.9180710
The red wine dataset contains 1,599 observations of 12 variables. I started exploring the dataset by looking at the distribution of individual variables and then probed the relationships through two-variable and multivariable plots.
The relationships between most of the variables are not apparent until the data are filtered based on another independent variable. It was also not obvious to me at first that quality was an output variable that depends on the other, continuous variables in the dataset.
The lack of ‘NA’ values in the dataset made it much easier to handle the data without using many filters.
Several variables are correlated with one another and show multicollinearity. Some of these variables have to be dropped while building a predictive model to reduce the redundancy.
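One common way to quantify this redundancy is the variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing each predictor on the others. The report does not state that VIFs were used, so this is an illustrative sketch on synthetic data in which `citric.acid` is deliberately built from `fixed.acidity`.

```r
set.seed(11)
n <- 300

# Synthetic stand-ins: citric acid is constructed to track fixed acidity.
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
citric.acid   <- 0.27 + 0.08 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.1)
alcohol       <- rnorm(n, mean = 10.4, sd = 1)
predictors <- data.frame(fixed.acidity, citric.acid, alcohol)

# VIF for each predictor: regress it on all the others and use 1 / (1 - R^2).
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

Predictors with a VIF close to 1 carry little redundant information, while clearly larger values flag candidates for removal.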
Another struggle was to visualize the large data set based on three variables other than quality.
A visualization based on three continuous variables could identify interesting trends. Another idea would be to cut off the data in both tails of each distribution and look at the correlation between variables in the middle range of values.
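The tail-trimming idea could be sketched as below. The 5% and 95% cut-offs, the helper `keep_middle`, and the synthetic `x` and `y` vectors are all assumptions made for illustration.

```r
set.seed(5)
n <- 500

# Synthetic, positively related stand-in variables.
x <- rnorm(n)
y <- 0.7 * x + rnorm(n, sd = 0.7)

# TRUE for values between the chosen lower and upper quantiles (hypothetical helper).
keep_middle <- function(v, lower = 0.05, upper = 0.95) {
  q <- quantile(v, probs = c(lower, upper))
  v >= q[1] & v <= q[2]
}

# Keep only observations that sit in the middle range of both variables.
in_middle <- keep_middle(x) & keep_middle(y)

r_all <- cor(x, y)                        # correlation on all observations
r_mid <- cor(x[in_middle], y[in_middle])  # correlation after trimming both tails

cat("kept", sum(in_middle), "of", n, "rows\n")
cat("r (all):", round(r_all, 3), " r (middle):", round(r_mid, 3), "\n")
```

Comparing the two coefficients shows how much of a relationship is carried by extreme values rather than by the bulk of the data.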