[Python] Machine Learning: Common Datasets: Introduction to the Diabetes Dataset, Data Visualization, and Usage Examples

Posted on 2024-9-5 17:47:35
Introduction to the diabetes dataset

diabetes is a dataset about diabetes. It contains physiological data for 442 patients together with their disease progression one year later. The dataset has 442 records and 10 features:

age: age
sex: sex
bmi (body mass index): a key indicator of obesity and healthy weight. BMI = weight (kg) ÷ height squared (m²); the ideal range is 18.5 to 23.9
bp (blood pressure): average blood pressure
s1, s2, s3, s4, s5, s6: six blood serum measurements
s1: tc, total serum cholesterol
s2: ldl, low-density lipoproteins
s3: hdl, high-density lipoproteins
s4: tch, total cholesterol / HDL ratio
s5: ltg, possibly the log of serum triglycerides level
s6: glu, blood sugar level

The description that ships with the dataset reads:

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL: Diabetes Data

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

Loading the diabetes dataset and viewing the data

Reference: sklearn.datasets.load_diabetes (scikit-learn 1.4.0 documentation)
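As a first look at the loaded data, the scaling note in the dataset description can be checked numerically. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names `col_means` and `col_sum_sq` are chosen here for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load the features (X) and target (y) as NumPy arrays.
X, y = load_diabetes(return_X_y=True)

# The description says each feature is mean centered and scaled so that
# the sum of squares of each column is 1. Compute both quantities.
col_means = X.mean(axis=0)
col_sum_sq = (X ** 2).sum(axis=0)

print("shape of X:", X.shape)
print("largest absolute column mean:", np.abs(col_means).max())
print("column sums of squares:", np.round(col_sum_sq, 6))
```

If the note holds, the column means should be zero up to floating-point error and every sum of squares should be close to 1. This is also why individual feature values look like small numbers around zero rather than real ages or BMI values.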
from sklearn.datasets import load_diabetes

diabete_datas = load_diabetes()

# First five rows of the feature matrix, and its shape
print(diabete_datas.data[0:5])
print(diabete_datas.data.shape)

# First five target values, and the target shape
print(diabete_datas.target[0:5])
print(diabete_datas.target.shape)

# Names of the 10 features
print(diabete_datas.feature_names)

Analyzing the dataset with linear regression

Reference: Linear Regression Example (scikit-learn 1.4.0 documentation)

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
print('diabetes_X:\n', diabetes_X[0:5])
print('diabetes_y:\n', diabetes_y[0:5])

# Use only one feature (bmi, column index 2), kept as a 2-D column
diabetes_X = diabetes_X[:, np.newaxis, 2]
print('diabetes_X:\n', diabetes_X[0:5])

# Split the data into training/testing sets (last 20 samples for testing)
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients (slope)
print("Coefficients (slope):\n", regr.coef_)
# The intercept
print("Intercept:\n", regr.intercept_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print("R2 score (coefficient of determination): %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs: test points in black, fitted line in blue
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.show()

Finding the most important features with cross-validated ridge regression

References: sklearn.linear_model.RidgeCV (scikit-learn 1.4.0 documentation); Model-based and sequential feature selection (scikit-learn 1.4.0 documentation)

To get an idea of the importance of the features, we are going to use the RidgeCV estimator. The features with the highest absolute coef_ value are considered the most important. We can observe the coefficients directly without needing to scale them (or scale the data) because, from the description above, we know that the features were already standardized.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Fit ridge regression, choosing alpha by cross-validation from 5 candidates
ridge = RidgeCV(alphas=np.logspace(-6, 6, num=5)).fit(X, y)

# Use the absolute value of each coefficient as the feature's importance
importance = np.abs(ridge.coef_)
feature_names = np.array(diabetes.feature_names)

plt.bar(height=importance, x=feature_names)
plt.title("Feature importances via coefficients")
plt.show()

The resulting bar chart shows that features s1 and s5 have the highest importance, followed by bmi.
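The ranking read off the bar chart can also be printed as text, which is handy when no plotting backend is available. This is a minimal sketch that assumes the same RidgeCV setup as in the post; the names `order` and `ranking` are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Same estimator and alpha grid as in the post.
ridge = RidgeCV(alphas=np.logspace(-6, 6, num=5)).fit(X, y)
importance = np.abs(ridge.coef_)

# Sort feature indices from largest to smallest absolute coefficient.
order = np.argsort(importance)[::-1]
ranking = [(diabetes.feature_names[i], float(importance[i])) for i in order]

print("alpha chosen by cross-validation:", ridge.alpha_)
for name, value in ranking:
    print(f"{name:>4s}  {value:10.2f}")
```

According to the post, s1 and s5 should come out on top, followed by bmi. Printing `ridge.alpha_` alongside the ranking is a deliberate choice: the coefficients depend on the selected regularization strength, so the ranking is only meaningful together with that value.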