{"id":2003,"date":"2019-04-12T06:38:35","date_gmt":"2019-04-12T06:38:35","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/04\/12\/cross-validation-concept-and-example-in-r\/"},"modified":"2019-04-12T06:38:35","modified_gmt":"2019-04-12T06:38:35","slug":"cross-validation-concept-and-example-in-r","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/04\/12\/cross-validation-concept-and-example-in-r\/","title":{"rendered":"Cross-Validation: Concept and Example in R"},"content":{"rendered":"<p>Author: Andrea Manero-Bastin<\/p>\n<div>\n<p><em><span>This article was written by <a href=\"https:\/\/sondosatwi.wordpress.com\/author\/sondosatwi\/\" target=\"_blank\" rel=\"noopener noreferrer\">Sondos Atwi<\/a><\/span><\/em><em><span>.<\/span><\/em><\/p>\n<p><span style=\"font-size: 14pt;\"><strong>What is Cross-Validation?<\/strong><\/span><\/p>\n<p><span>In Machine Learning,\u00a0Cross-validation\u00a0is a resampling method used for model evaluation to avoid testing a model on the same dataset on which it was trained. This is a common mistake, especially that a separate testing dataset is not always available. However, this usually leads to inaccurate performance measures (as the model will have an almost perfect score since it is being tested on the same data it was trained on). To avoid this kind of mistakes, cross validation is usually preferred.<\/span><\/p>\n<p><span>The concept of\u00a0cross-validation\u00a0is actually simple: Instead of using the whole dataset to train and then test\u00a0on same data, we could randomly divide our\u00a0data into training and testing datasets.<\/span><\/p>\n<p><span>There are several types of\u00a0cross-validation\u00a0methods (LOOCV \u2013 Leave-one-out cross validation,\u00a0the holdout method,\u00a0k-fold\u00a0cross validation). 
Here, I\u2019ll discuss the k-fold cross-validation method.<br \/><\/span> <span>K-fold cross-validation consists of the following steps:<\/span><\/p>\n<ol>\n<li><span>Randomly split the data into k subsets, also called folds.<\/span><\/li>\n<li><span>Fit the model on k-1 of the folds (the training data).<\/span><\/li>\n<li><span>Use the remaining fold as a test set to validate the model (usually, the accuracy or test error of the model is measured in this step).<\/span><\/li>\n<li><span>Repeat the procedure k times, holding out a different fold each time.<\/span><\/li>\n<\/ol>\n<p><span><a href=\"https:\/\/sondosatwi.files.wordpress.com\/2017\/03\/k-fold_cross_validation_en.jpg\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/sondosatwi.files.wordpress.com\/2017\/03\/k-fold_cross_validation_en.jpg?profile=RESIZE_710x\" class=\"align-center\"><\/a><\/span><\/p>\n<p><span style=\"font-size: 14pt;\"><strong>How can it be done with R?<\/strong><\/span><\/p>\n<p><span>In the exercise below, I use logistic regression to predict whether a passenger in the famous <em>Titanic<\/em> dataset survived. The goal is to find an optimal threshold on the predicted probabilities for deciding whether to classify each result as 1 or 0.<\/span><\/p>\n<p><em><span>Threshold example: <\/span><\/em><span>Suppose the model has predicted the following values for two passengers: p1 = 0.7 and p2 = 0.4. If the threshold is 0.5, then p1 > threshold, so passenger 1 falls in the survived category, whereas p2 < threshold, so passenger 2 falls in the not-survived category.<\/span><\/p>\n<p><span>However, depending on our data, the \u2018default\u2019 threshold of 0.5 will not always yield the maximum number of correct classifications. 
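As a concrete illustration of the k-fold procedure and the idea of searching for a better threshold than 0.5, here is a minimal sketch in base R. It is not the article's actual code: the Titanic data is not loaded here, so a small synthetic dataset stands in for it, and the column names (age, fare, survived) and the candidate threshold grid are assumptions made for the example.

```r
# Sketch: k-fold cross-validation with a per-fold threshold search, base R only.
# Synthetic stand-in for the Titanic data (column names are made up).
set.seed(42)
n <- 500
df <- data.frame(
  age  = rnorm(n, 30, 10),
  fare = rexp(n, 1 / 30)
)
# Synthetic 0/1 outcome loosely dependent on the predictors
df$survived <- rbinom(n, 1, plogis(-1 + 0.03 * df$fare - 0.02 * df$age))

k <- 5
folds <- sample(rep(1:k, length.out = n))      # random fold assignment
thresholds <- seq(0.1, 0.9, by = 0.05)         # candidate cutoffs
results <- data.frame(fold = integer(), threshold = numeric(), accuracy = numeric())

for (i in 1:k) {
  train <- df[folds != i, ]                    # fit on k-1 folds
  valid <- df[folds == i, ]                    # validate on the held-out fold
  fit   <- glm(survived ~ age + fare, data = train, family = binomial)
  p     <- predict(fit, newdata = valid, type = "response")
  # Accuracy of each candidate threshold on the validation fold
  acc  <- sapply(thresholds, function(t) mean((p > t) == valid$survived))
  best <- which.max(acc)
  results <- rbind(results,
                   data.frame(fold = i, threshold = thresholds[best], accuracy = acc[best]))
}

results                                         # best threshold and accuracy per fold
best_threshold <- results$threshold[which.max(results$accuracy)]
```

The per-fold search simply scores every candidate cutoff on the validation fold and keeps the most accurate one; the overall winner can then be applied to unseen data.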
In this context, we can use cross-validation to determine the best threshold for each fold, based on the results of running the model on the validation set.<\/span><\/p>\n<p><span>In my implementation, I followed these steps:<\/span><\/p>\n<ol>\n<li><span>Split the data randomly into 80% <em>(train and validation)<\/em> and 20% <em>(test with unseen data)<\/em>.<\/span><\/li>\n<li><span>Run cross-validation on the 80% portion, which is used to train and validate the model.<\/span><\/li>\n<li><span>At each fold iteration, run the model on the validation dataset and find the threshold that gives the best accuracy.<\/span><\/li>\n<li><span>Store the best accuracy and the optimal threshold from each fold iteration in a dataframe.<\/span><\/li>\n<li><span>Find the best overall threshold (the one with the highest accuracy) and use it as the cutoff when testing the model against the test dataset.<\/span><\/li>\n<\/ol>\n<p><em><span>To read the full article, click <a href=\"https:\/\/sondosatwi.wordpress.com\/2017\/03\/03\/cross-validation-concept-and-example-in-r\/\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.<\/span><\/em><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:797176\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Andrea Manero-Bastin This article was written by Sondos Atwi. What is Cross-Validation? 
In Machine Learning,\u00a0Cross-validation\u00a0is a resampling method used for model evaluation to avoid [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/04\/12\/cross-validation-concept-and-example-in-r\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":474,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2003"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2003"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2003\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/456"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2003"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2003"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2003"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}