# What are clusters and how to do clustering?

For a certain range of values on X and Y I have these people, and for another range of values on X and Y I have those. So why is there this gap between these people on these two dimensions? What causes the gap? Why are the data points not continuously scattered in my mathematical space? Why are they coming together? What are they gravitating towards, some centroids? If I explore that and find out, it will produce rich information for me. But the problem is, I don't know how many clusters there are. I can't see them. I have 50 dimensions and thousands of records, maybe millions of records.

How do you know how many clusters? Unless and until I find the tightest and the farthest clusters, I will not be able to get any meaningful information out of the clustering, and all the effort will go to waste. Or worse, it might give you misleading information and take you in completely the wrong direction. So this particular technique has to be implemented with great care, and it might require a lot of iterations. Now, these formulas: don't get hassled by these formulas.

These formulas we already know. This is variance: (x_i minus x-bar) raised to the power two. In the variance formula, what is variance? (x_i minus x-bar) raised to the power two, and if you want the average variance, divide by n. Same thing here. This C_k is the number of data points in the cluster. Okay? So what is this particular formula telling us? It is telling us the variance. If you take cluster one, what is the variance in that cluster? If you take cluster k equal to two, what is the variance in that cluster? That's what it's telling you. x_i minus x-bar: how does it find x-bar? Simple. If these are the data points in this cluster, project them onto the axis. Whatever the values are here, sum them all and divide by the number of points. You get the x-bar. This is my centroid on this axis.

So you find out (x_i minus x-bar) raised to the power two: that is the variance on this axis. Similarly, find the variance on this axis. Add them, and that gives you the cluster variance. So j equal to one to p, where p is the parameters, the dimensions. On every dimension, find out the variance and sum it up. That gives you the cluster variance. So the optimization problem for us is this. This is the variance within a cluster. I have K clusters. So for k equal to one to K, I want to minimize the variance within the clusters. I want to form my clusters in such a way that the sum of all the variances across the clusters is minimized. That is the optimization problem.
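The formula being read out is not visible in the transcript. One reconstruction consistent with the spoken description (squared deviation from the cluster mean, summed over the p dimensions, divided by the number of points in the cluster) is sketched below. The symbols `W(C_k)`, `x_{ij}` and `\bar{x}_{kj}` are assumed notation, not taken from the slide.

```latex
% Within-cluster variance of cluster C_k over p dimensions (assumed notation)
\[
W(C_k) = \frac{1}{|C_k|} \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2,
\qquad
\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}
\]

% Optimization problem: choose the K clusters that minimize the summed variance
\[
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} W(C_k)
\]
```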

This is the driving force behind clustering. So K-means clustering, what it does is it iterates through your mathematical space: builds clusters, moves them, builds clusters, moves them, until it reaches the point where this is minimized. That is the metric to decide the quality of the model. But what is the learning algorithm? In linear regression it is gradient descent, the gradient descent algorithm in linear models. Similarly, in this, the driving force is to minimize, to optimize, this. However, there's a problem here. I'll tell you what the problem is. We'll first see one or two more concepts.
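The build, move, repeat loop described above can be sketched in plain Python. This is a minimal illustration, not the lecturer's code: the function names (`kmeans`, `within_cluster_ss`), the toy points, and the choice to pass the starting centroids in explicitly (so the run is reproducible) are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid of a non-empty list of points: the mean on every dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from every point to its own centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate 'assign points' and 'move centroids' until nothing changes."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Build clusters: each point joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point switched cluster, so the loop has converged
        labels = new_labels
        # Move: each centroid goes to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical toy data: two tight groups that are far apart.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
w = within_cluster_ss(points, labels, centroids)
print(labels, round(w, 4))
```

The quantity printed at the end is the one the loop is trying to drive down. It is computed here as a plain sum of squares rather than an average, which is a simplification for the sketch.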

Take the variance of all the data points. In this class, you are all data points. Okay? You're not going to move, you're going to remain where you are. So I can calculate the variance on the X axis and the Y axis. I can project you onto the X axis, all of you. I can project all of you onto the Y axis. I can find out x_i minus x-bar, whatever I have got over there. I can calculate the variance on the X axis and the variance on the Y axis. So the variance of the entire class remains fixed. That is a property of the mathematical space, given the data set. That variance is called T, total variance. Let's call it total variance. Okay. Now, within-cluster variance.

We know that if I break my data, this mathematical space, into two clusters, one cluster here, one cluster there, I can find out the variance within clusters, right? So that is called variance within cluster, W. Then variance between clusters: the center of gravity of this cluster, the center of gravity of that cluster, (x_i minus x-bar) raised to the power two, divided by n, but n is the number of clusters. That is your between-cluster variance. Now, can you understand that the total variance will be the sum of the within-cluster variance and the between-cluster variance? Within is all the clusters put together. In the between, each cluster is one data point.
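The decomposition stated above can be written out as follows. This is a sketch in assumed notation (`T`, `W`, `B`, with `\bar{x}_j` the overall mean on dimension j and `\bar{x}_{kj}` the mean of cluster k). It is written as plain sums of squares, with no division by cluster size, and with each centroid weighted by the number of points in its cluster, since that is the form in which the identity is exact. The weighting is not spelled out in the lecture.

```latex
\[
\underbrace{\sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{j} \right)^2}_{T \text{ (total)}}
=
\underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2}_{W \text{ (within)}}
+
\underbrace{\sum_{k=1}^{K} |C_k| \sum_{j=1}^{p} \left( \bar{x}_{kj} - \bar{x}_{j} \right)^2}_{B \text{ (between)}}
\]
```

Dividing every term by n turns the sums of squares into variances without changing the identity.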

This is one data point, this is one data point, this is one data point. Okay? The data point is nothing but the center of gravity of those clusters, the centroids. So find out their x-bars, (x_i minus x-bar) raised to the power two, and you get the between-cluster variance. Do you get it? Okay, I heard a voice here. Yes? No? Okay, let me repeat that again. Suppose we're going for two clusters or three clusters. Okay? Cluster one, cluster two, cluster three. There is a centroid for every cluster. Project the centroids onto the X axis. Project the centroids onto the Y axis.

Calculate the variance between the centroids on the X axis and the Y axis. That is the between-cluster variance. Okay? All right. Why am I telling you all this? Once again, our objective is to minimize W(C), the within-cluster variance. We want those clusters to be as tight as possible. So if you look at the formula, to minimize this: T is a constant. You can't do anything about it. T is a constant, it is the variance within your data set. So to minimize this, I have to maximize this. Magnitude-wise, as B(C) increases, W(C) will go down.
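The argument above (T is fixed, so a smaller W means a larger B) can be checked numerically. The sketch below is an illustration under assumptions: the function names and toy points are hypothetical, the quantities are sums of squares rather than averages, and each centroid is weighted by its cluster size so that T = W + B holds exactly.

```python
def mean_point(points):
    """Mean on every dimension of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sum_sq(points, center):
    """Sum of squared distances from each point to a given center."""
    return sum(sum((x - m) ** 2 for x, m in zip(p, center)) for p in points)


def decompose(points, labels):
    """Return (T, W, B) as sums of squares for one cluster assignment."""
    grand_mean = mean_point(points)
    total = sum_sq(points, grand_mean)
    within = 0.0
    between = 0.0
    for k in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == k]
        centroid = mean_point(members)
        within += sum_sq(members, centroid)
        # Each centroid stands in for all of its member points.
        between += len(members) * sum_sq([centroid], grand_mean)
    return total, within, between


# Hypothetical toy data: two tight groups that are far apart.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
tight_labels = [0, 0, 0, 1, 1, 1]   # follows the two visible groups
mixed_labels = [0, 1, 0, 1, 0, 1]   # deliberately poor assignment

t1, w1, b1 = decompose(points, tight_labels)
t2, w2, b2 = decompose(points, mixed_labels)
print("tight:", round(t1, 3), round(w1, 3), round(b1, 3))
print("mixed:", round(t2, 3), round(w2, 3), round(b2, 3))
```

Both assignments report the same T, because T depends only on the data. The tighter assignment has the smaller W and therefore the larger B.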

Either you look at tightening the clusters or at maximizing the distance between the clusters; both of them lead you to the same thing. Unfortunately, whether we want to find the tightest clusters or the farthest clusters, both these approaches don't have a well-defined efficient algorithm. They belong to a class of problems in mathematics called NP-hard problems, where NP stands for non-deterministic polynomial time. For these problems no known algorithm is guaranteed to find the best solution in polynomial time.

Which means, if I want to find out the three best clusters, let's say I suspect there are three clusters in my data set based on domain knowledge, there is no practical algorithm which is guaranteed to find the best three clusters. K-means clustering belongs to the NP-hard family of problems, which don't have an efficient exact solution. So K-means clusterings usually end up in local minima, and that's what causes the problem.
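The local-minimum behaviour mentioned above can be seen on a tiny hand-made example. The sketch below is an illustration under assumptions: `run_kmeans` is a hypothetical compact K-means loop, and the four points and both sets of starting centroids are chosen on purpose so that one start reaches the tight solution and the other gets stuck.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Mean on every dimension of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def run_kmeans(points, start, max_iter=100):
    """Compact K-means loop; returns (labels, within-cluster sum of squares)."""
    centroids = list(start)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    w = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, w


# Four corners of a wide rectangle: the tight split is left pair vs right pair.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_labels, good_w = run_kmeans(points, start=[(0.0, 0.0), (4.0, 0.0)])
stuck_labels, stuck_w = run_kmeans(points, start=[(0.0, 0.0), (0.0, 1.0)])

print(good_labels, good_w)    # [0, 0, 1, 1] 1.0
print(stuck_labels, stuck_w)  # [0, 1, 0, 1] 16.0
```

The second run converges, in the sense that no point wants to switch cluster, yet its within-cluster sum of squares is sixteen times worse. A common mitigation is to run from several starting points and keep the result with the lowest value, which reduces the risk but does not guarantee the global optimum.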

If you're not aware of this and you go on and do your clustering, you might go completely in the wrong direction, and all your ultimate hypotheses lose validity. So this technique has to be used with great care, and a lot of thought process has to go into it: which dimensions should I include in my clustering, right? We'll see how to attack all these problems, but nothing is guaranteed, so I might end up with suboptimal solutions. The biggest problem is this. If I give you 30 dimensions and 1,000 records, or 10,000 records, whatever it is, the first problem is: how do I know how many clusters to look for? Once I know that, then I can talk about where those clusters are.