{"id":443,"date":"2017-01-15T16:22:16","date_gmt":"2017-01-16T00:22:16","guid":{"rendered":"http:\/\/konukoii.com\/blog\/?p=443"},"modified":"2017-01-15T23:44:12","modified_gmt":"2017-01-16T07:44:12","slug":"5-min-tutorial-k-means-clustering-in-python","status":"publish","type":"post","link":"https:\/\/konukoii.com\/blog\/2017\/01\/15\/5-min-tutorial-k-means-clustering-in-python\/","title":{"rendered":"5-Min Tutorial: K-Means Clustering In Python"},"content":{"rendered":"<span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 4<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span><p style=\"text-align: justify;\">K-Means Clustering is a common machine learning tool that allows to separate data into \"clusters\" (groups). Intuitively, you can imagine plotting each datapoint into a field (could be 2-D,3-D, or n-D field) and then looking at which points are close to which, trying to distinguish groups.<\/p>\n<p style=\"text-align: justify;\">In one of my previous\u00a0projects in which we analyzed\u00a0<a href=\"http:\/\/konukoii.com\/blog\/2017\/01\/11\/soundcloud-spam-analysis\/\">Spam in Soundcloud<\/a> we actually used K-Means to group together similar types of users. This allowed us to more clearly separate and see groups of musicians, fans, and spam.<\/p>\n<h5><span style=\"text-decoration: underline;\"><strong>Understanding the algorithm<\/strong><\/span><\/h5>\n<p style=\"text-align: justify;\">The K-Means algorithm is actually surprisingly intuitive and simple, yet very powerful.<\/p>\n<p style=\"text-align: justify;\">Take a look at the animation below (sorce:\u00a0<a href=\"https:\/\/www.projectrhea.org\/rhea\/index.php\/SlectureDavidRunyanCS662Spring14\">David Runyan<\/a>), as you read through the algorithmic steps.<\/p>\n<ol>\n<li style=\"text-align: justify;\">Select number of clusters (refereed to as K) and randomly assign the center points of this clusters anywhere near your dataset points.<\/li>\n<li style=\"text-align: justify;\">Update Cluster Assignments: For each point in the dataset, find which is the closest cluster and assign that point to it.<\/li>\n<li style=\"text-align: justify;\">Update Cluster Center: For each point in a given cluster, find the average center between them all and choose that to be the new cluster center.<\/li>\n<li style=\"text-align: justify;\">Repeat step 2 and 3 until the cluster centers are no longer changing. Convergence tends to be surprisingly quick for most datasets.<\/li>\n<\/ol>\n<figure id=\"attachment_446\" aria-describedby=\"caption-attachment-446\" style=\"width: 480px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-446 size-full\" src=\"http:\/\/konukoii.com\/blog\/wp-content\/uploads\/2017\/01\/RunyanKmeans.gif\" alt=\"K-Means Clustering Visualized - Credit: David Runyan (https:\/\/www.projectrhea.org\/rhea\/index.php\/SlectureDavidRunyanCS662Spring14)\" width=\"480\" height=\"480\" \/><figcaption id=\"caption-attachment-446\" class=\"wp-caption-text\">K-Means Clustering Visualized - Credit: David Runyan<\/figcaption><\/figure>\n<h5><span style=\"text-decoration: underline;\"><strong>Coding the algorithm<\/strong><\/span><\/h5>\n<p style=\"text-align: justify;\">The first thing we will do is define a distance function. In our case a simple Euclidean Distance function will be more than enough. Notice that we want to allow for any number of dimensions.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#Euclidian Distance between two d-dimensional points\r\ndef eucldist(p0,p1):\r\n    dist = 0.0\r\n    for i in range(0,len(p0)):\r\n        dist += (p0&#x5B;i] - p1&#x5B;i])**2\r\n    return math.sqrt(dist)\r\n<\/pre>\n<p style=\"text-align: justify;\">Next, we want to code the actual K-Means algorithm. Initially we will designate an max # of iterations and randomly choose centers for the clusters. I like choosing a random datapoint as my center, as it makes things simpler.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#K-Means Algorithm\r\ndef kmeans(k,datapoints):\r\n\r\n    # d - Dimensionality of Datapoints\r\n    d = len(datapoints&#x5B;0]) \r\n    \r\n    #Limit our iterations\r\n    Max_Iterations = 1000\r\n    i = 0\r\n    \r\n    cluster = &#x5B;0] * len(datapoints)\r\n    prev_cluster = &#x5B;-1] * len(datapoints)\r\n    \r\n    #Randomly Choose Centers for the Clusters\r\n    cluster_centers = &#x5B;]\r\n    for i in range(0,k):\r\n        new_cluster = &#x5B;]\r\n        #for i in range(0,d):\r\n        #    new_cluster += &#x5B;random.randint(0,10)]\r\n        cluster_centers += &#x5B;random.choice(datapoints)]\r\n\r\n        #Sometimes The Random points are chosen poorly and so there ends up being empty clusters\r\n        #In this particular implementation we want to force K exact clusters.\r\n        #To take this feature off, simply take away &quot;force_recalculation&quot; from the while conditional.\r\n        force_recalculation = False\r\n<\/pre>\n<p>We will now start the clustering loop, which will only end if we reach Max_iterations or when our clusters stop changing.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n    while (cluster != prev_cluster) or (i &gt; Max_Iterations) or (force_recalculation) :\r\n        \r\n        prev_cluster = list(cluster)\r\n        force_recalculation = False\r\n        i += 1\r\n<\/pre>\n<p>Next, for every point in the field we will find which is the closest cluster center to it and have it align itself with it.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n        #Update Point's Cluster Alligiance\r\n        for p in range(0,len(datapoints)):\r\n            min_dist = float(&quot;inf&quot;)\r\n            \r\n            #Check min_distance against all centers\r\n            for c in range(0,len(cluster_centers)):\r\n                \r\n                dist = eucldist(datapoints&#x5B;p],cluster_centers&#x5B;c])\r\n                \r\n                if (dist &lt; min_dist):\r\n                    min_dist = dist  \r\n                    cluster&#x5B;p] = c   # Reassign Point to new Cluster\r\n<\/pre>\n<p style=\"text-align: justify;\">Then we will update the cluster's center position. This is done by taking the average position of all the points in the given cluster. Notice that if we find a cluster which has 0 members, we force the center to randomly move to a new spot. (This is done because we want to enforce the number of centers to be k, and once a cluster has no members, there is no way to add new members to it.)<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n        #Update Cluster's Position\r\n        for k in range(0,len(cluster_centers)):\r\n            new_center = &#x5B;0] * d\r\n            members = 0\r\n            for p in range(0,len(datapoints)):\r\n                if (cluster&#x5B;p] == k): #If this point belongs to the cluster\r\n                    for j in range(0,d):\r\n                        new_center&#x5B;j] += datapoints&#x5B;p]&#x5B;j]\r\n                    members += 1\r\n            \r\n            for j in range(0,d):\r\n                if members != 0:\r\n                    new_center&#x5B;j] = new_center&#x5B;j] \/ float(members) \r\n                \r\n                #This means that our initial random assignment was poorly chosen\r\n                #Change it to a new datapoint to actually force k clusters\r\n                else: \r\n                    new_center = random.choice(datapoints)\r\n                    force_recalculation = True\r\n                    print &quot;Forced Recalculation...&quot;\r\n                    \r\n            \r\n            cluster_centers&#x5B;k] = new_center\r\n<\/pre>\n<p style=\"text-align: justify;\">Finally we can now use our program by simply feeding it a list of datapoints. Notice that these datapoints can be tuples of whatever size one would want. Also you can enforce the number of k clusters too.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n#TESTING THE PROGRAM#\r\nif __name__ == &quot;__main__&quot;:\r\n    #2D - Datapoints List of n d-dimensional vectors. (For this example I already set up 2D Tuples)\r\n    #Feel free to change to whatever size tuples you want...\r\n    datapoints = &#x5B;(3,2),(2,2),(1,2),(0,1),(1,0),(1,1),(5,6),(7,7),(9,10),(11,13),(12,12),(12,13),(13,13)]\r\n\r\n    k = 2 # K - Number of Clusters\r\n      \r\n    kmeans(k,datapoints) \r\n<\/pre>\n<h5><strong><span style=\"text-decoration: underline;\">Complete Code<\/span><\/strong><\/h5>\n<p style=\"text-align: justify;\">This was a fun little script that was surprisingly useful for\u00a0a couple of homeworks and project that I did last quarter and I thought I would share with you guys. Hopefully you have found this small write-up useful, and as always feel free to contact me with any questions. Cheers!<\/p>\n<p><script src=\"https:\/\/gist.github.com\/pmsosa\/5454ade527adbee105dd51066ef30c5f.js\"><\/script><\/p>\n","protected":false},"excerpt":{"rendered":"<p>K-Means Clustering is a common machine learning tool that allows to separate data into \"clusters\"&#8230;<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"https:\/\/konukoii.com\/blog\/2017\/01\/15\/5-min-tutorial-k-means-clustering-in-python\/\">Read the post<span class=\"screen-reader-text\">5-Min Tutorial: K-Means Clustering In Python<\/span><\/a><\/div>\n","protected":false},"author":1,"featured_media":452,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,10,32],"tags":[33,85,88,77,38,26],"class_list":["post-443","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-compsci","category-theory","category-tutorials","tag-5-min-tutorials","tag-clustering","tag-k-means","tag-machine-learning","tag-python","tag-tutorial","excerpt","zoom","full-without-featured","even","excerpt-0"],"_links":{"self":[{"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/posts\/443","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/comments?post=443"}],"version-history":[{"count":9,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/posts\/443\/revisions"}],"predecessor-version":[{"id":456,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/posts\/443\/revisions\/456"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/media\/452"}],"wp:attachment":[{"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/media?parent=443"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/categories?post=443"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/konukoii.com\/blog\/wp-json\/wp\/v2\/tags?post=443"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}