self picture of CY
CY Wu

This is me who never ever stop learning.

kaggle spanish wine price prediction

Introduction

該數據集包含來自西班牙的 7500 種不同類型的紅葡萄酒,具有 11 個特徵,描述了它們的價格、評級,甚至一些風味描述,目的為預測價格或品質。[1]

數據包含11個類型:
1. winery: 酒莊名稱
2. wine: 葡萄酒名稱
3. year: 葡萄收穫年分
4. rating: 葡萄酒的平均評分
5. num_reviews: 評論數量。
6. country: 原產地,即西班牙
7. region: 葡萄酒產區
8. price: 價格
9. type: 葡萄酒品種。
10. body: 酒體評分,代表口中葡萄酒的豐富度和重量。
11. acidity: 酸度評分,代表葡萄酒的「皺褶」或酸味。

cover

Price and type relationship

Data

data在kaggle上Spanish Wine Quality Dataset page上取得。

Importing package
Data cleaning & Check data
                    
                        Index: 6070 entries, 0 to 7499
                        Data columns (total 11 columns):
                         #   Column       Non-Null Count  Dtype  
                        ---  ------       --------------  -----  
                         0   winery       6070 non-null   object 
                         1   wine         6070 non-null   object 
                         2   year         6070 non-null   int64  
                         3   rating       6070 non-null   float64
                         4   num_reviews  6070 non-null   int64  
                         5   country      6070 non-null   object 
                         6   region       6070 non-null   object 
                         7   price        6070 non-null   float64
                         8   type         6070 non-null   object 
                         9   body         6070 non-null   float64
                         10  acidity      6070 non-null   float64
                        dtypes: float64(4), int64(2), object(5)
                        memory usage: 569.1+ KB
                    
                
winery wine year rating num_reviews region price type body acidity
0 Teso La Monja Tinto 2013 4.9 58 Toro 995.00 Toro Red 5.0 3.0
1 Artadi Vina El Pison 2018 4.9 31 Vino de Espana 313.50 Tempranillo 4.0 2.0
2 Vega Sicilia Unico 2009 4.8 1793 Ribera del Duero 324.95 Ribera Del Duero Red 5.0 3.0
3 Vega Sicilia Unico 1999 4.8 1705 Ribera del Duero 692.96 Ribera Del Duero Red 5.0 3.0
4 Vega Sicilia Unico 1996 4.8 1309 Ribera del Duero 778.06 Ribera Del Duero Red 5.0 3.0
Vistualization: Heat map[2]
heat map-wine

summary: strong relation between price and rating, and specific type of wine will affect price.

relationship between price and wine type[2、3]
bar chart scatter-humidity & temperature

summary: spefic wine have high price, and next step will be finding the model to predict the price.

searching the model[3]
                    
                        Index: 6070 entries, 0 to 7499
                        Data columns (total 10 columns):
                         #   Column       Non-Null Count  Dtype  
                        ---  ------       --------------  -----  
                         0   winery       6070 non-null   int32  
                         1   wine         6070 non-null   int32  
                         2   year         6070 non-null   int64  
                         3   rating       6070 non-null   float64
                         4   num_reviews  6070 non-null   int64  
                         5   region       6070 non-null   int32  
                         6   price        6070 non-null   float64
                         7   type         6070 non-null   int32  
                         8   body         6070 non-null   float64
                         9   acidity      6070 non-null   float64
                        dtypes: float64(4), int32(4), int64(2)
                        memory usage: 426.8 KB
                    
                
                    {'LinearRegression': 0.3997098329164317,
                     'Lasso': -0.0002825894551166108,
                     'Ridge': 0.3997079007354649,
                     'BayesianRidge': 0.3996679858749155,
                     'DecisionTreeRegressor': 0.3963314565148496,
                     'LinearSVR': 0.2089817504376078,
                     'KNeighborsRegressor': 0.5949966180649249,
                     'RandomForestRegressor': 0.7376611282475856}
                

summary: KNeighborsRegressor and RandomForestRegressor have higher r2 factor, hence we use these two models to generate std price vs. index plots.

KNeighborsRegressor model to predict std price
                          index  y_prediction  y_actual
                    0      1764     -0.295041 -0.369379
                    1      1821     -0.178520 -0.353428
                    2      1835     -0.256382 -0.348233
                    3      1595     -0.226673 -0.346843
                    4      1731     -0.208753 -0.345574
                    ...     ...           ...       ...
                    1209     42      5.391731  7.383273
                    1210    100      7.117485  7.882926
                    1211    595      3.849200  7.892109
                    1212    199      8.146660  9.494925
                    1213    332      6.879295  9.622104
                    
                    [1214 rows x 3 columns]
                
KNeighborsRegressor model
RandomForestRegressor model to predict std price
                          index  y_prediction   y_actual
                    0      1764     -0.251923  -0.369379
                    1      1947     -0.257671  -0.364968
                    2      1029     -0.267733  -0.359531
                    3      1805     -0.183921  -0.355422
                    4      1821     -0.184619  -0.353428
                    ...     ...           ...        ...
                    1209    595      0.269403   7.892109
                    1210     94     13.041085   7.957844
                    1211     97      4.603715   8.655424
                    1212    188      6.941651  10.383123
                    1213    343      7.828997  16.207611
                    
                    [1214 rows x 3 columns]
                
RandomForestRegressor model
Conclusion

首先經由sort_values(by='y_actual')和reset_index()兩指令使得x軸呈現標準化的low price到high price,y軸則為deviation。可見KNeighborsRegressor model 和 RandomForestRegressor model在index 600下都能有效、正確的預測標準化後的wine price。當index超過800甚至到1000以上時兩model得到的prediction value與actual value相差很大,說明此時model無法預測high price下的actual price。其背後的原因為:從box chart中能發現大部分的wine都屬於low price,當資料量夠大時能得到較準確預測pricce的模型。而high price的wine數量非常少,不容易生成準確的model
因此,我們能用這兩個model來預測特定type的wine的價格(特別在價格不高時)。此練習使用了各式的model,其各自的模型原理還需要再細看。

Source

[1] Spanish Wine Quality Dataset
[2] Wine Price EDA + Regression Analysis
[3] Wine Price Regression