FIZ228 - Numerical Analysis
Dr. Emre S. Tasci, Hacettepe University
+++
_This lecture benefits heavily from (is almost fully copied from) [David MacKay's "Information Theory, Inference, and Learning Algorithms"](http://www.inference.org.uk/mackay/itila/book.html)._
Suppose that we are given the following list, and asked to single out the odd one:
- Frog
- Parsley
- Broccoli
- Spinach
- Cauliflower
Depending on our background, we will pick one of the following two:

- Frog, because the rest are vegetables, or,
- Cauliflower, because the rest are green.
We spend our lives clustering and classifying, because otherwise we wouldn't be able to process the world happening around us.
+++
A biologist encounters a new green thing he has not seen before. Even so, based on his experience, he can assume/deduce that:
- it is unlikely to jump on him and eat him
- if he touches it, he might get grazed/stung
- if he eats it, he might feel sick
These deductions are a case of "mixture density modelling". How well the model is working is quantified by the information content of the data
$\left(h(x)=\log{\frac{1}{P(x)}}\right)$ .
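As a quick numerical illustration (a minimal sketch added here, not from MacKay's text): the rarer an observation, the larger its information content.

```python
## Information content ("surprisal") of an observation x:
## the less probable it is, the more information it carries.
import numpy as np

for P in [0.5, 0.1, 0.01]:
    h = np.log2(1/P)  ## here in bits (log base 2)
    print("P(x) = {:>4} -> h(x) = {:5.2f} bits".format(P, h))
```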
+++
"Go to third tree on the right, then take a right turn."
Third tree? Large green thing with red berries, then past the large green thing with thorns, then...
- lossy: e.g., image compression (jpg); sound compression (mp3) | (vector quantization)
- lossless: image (png, raw); sound (flac)
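As a small demonstration (using Python's standard zlib module; an addition, not part of the lecture), lossless compression must give back the original data exactly:

```python
## Lossless compression round trip with zlib:
## repetitive (predictable) data compresses well and decompresses exactly.
import zlib

data = b"spinach broccoli cauliflower " * 100
packed = zlib.compress(data)
print(len(data), "->", len(packed), "bytes")
assert zlib.decompress(packed) == data  ## lossless: exact reconstruction
```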
+++
If there is an anomaly --> it's a good thing!

Examples:

- The incompressible patches in a satellite image of the ocean --> ships/islands
- A sudden peak in the input signal --> some kind of communication signal
- If the green thing runs away --> a misfit with the model (something new, something interesting)!
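A hedged sketch of the first example's idea: using compressibility itself as a crude anomaly score (zlib again stands in for a real model):

```python
## Predictable ("expected") data compresses well; an incompressible
## patch resists compression and thereby flags an anomaly.
import zlib, os

structured = b"wave " * 200    ## a regular, predictable signal
anomalous = os.urandom(1000)   ## an incompressible patch
for name, blob in [("structured", structured), ("anomalous", anomalous)]:
    ratio = len(zlib.compress(blob)) / len(blob)
    print("{:>10s}: compressed to {:4.0%} of the original".format(name, ratio))
```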
+++
Putting it formally:

Data points: $\left\{\vec{x}^{(n)}\right\}$, $n = 1, \dots, N$.

Each vector $\vec{x}$ has $I$ components $x_i$.

The metric that defines distances between points:

$$d(\vec{x},\vec{y}) = \frac{1}{2}\sum_{i}\left(x_i - y_i\right)^2$$

(this metric assumes that every component has the same importance)
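In code, the metric reads as follows (a direct transcription of the formula above, including the conventional 1/2 factor):

```python
## The distance metric d(x,y) = (1/2) * sum_i (x_i - y_i)^2
import numpy as np

def d(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 0.5 * np.sum((x - y)**2)

print(d([0, 0], [3, 4]))  ## 12.5
```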
+++
- Initialization step: Set $K$ means $\left\{\vec{m}^{(k)}\right\}$ to random values.

Two-step algorithm:

- Assignment step: Each data point $n$ is assigned to the nearest mean.
- Update step: Means are adjusted to match the sample means of the data points that they are responsible for.

=> The k-means algorithm always converges to a fixed point (each step can only decrease the total squared distance, and there are finitely many possible assignments)!
+++
- Initialization step: Set $K$ means $\left\{\vec{m}^{(k)}\right\}$ to random values.

Two-step algorithm:

- Assignment step: Each data point $n$ is assigned to the nearest mean. We denote our guess for the cluster $k^{(n)}$ that the point $\vec{x}^{(n)}$ belongs to by $\hat{k}^{(n)}$:

  $$\hat{k}^{(n)} = \underset{k}{\operatorname{argmin}}\left\{d\left(\vec{m}^{(k)},\vec{x}^{(n)}\right)\right\}$$

  Alternative approach: responsibilities as indicator variables:

  $$r_k^{(n)} = \begin{cases}1, & \hat{k}^{(n)} = k\\0, & \hat{k}^{(n)} \ne k\end{cases}$$

  (if there is a tie, set $\hat{k}^{(n)}$ to the smallest winning $k$)

- Update step: Means are adjusted to match the sample means of the data points that they are responsible for:

  $$\vec{m}^{(k)} = \frac{\sum_n r_k^{(n)}\,\vec{x}^{(n)}}{R^{(k)}}, \qquad R^{(k)} = \sum_n r_k^{(n)}$$

  If $R^{(k)} = 0$ (i.e., the mean is responsible for no points), it is left where it is.

- Repeat the assignment & update steps until the assignments do not change any more.
+++
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(228)
colors = ["b","g","c","m","y","k","b","g","c","m","y","k"]
N = 10 ## Total number of points
K = 3 ## Total number of means
## World Limits
max_x = 10
max_y = 5
## Randomly generate the points
xy = np.random.rand(N,2)
xy[:,0]*=max_x
xy[:,1]*=max_y
print(xy)
plt.plot(xy[:,0],xy[:,1],"ob")
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("{:d} Randomly Distributed Points".format(N))
plt.show()
## Step 0: Initialize the means at random points
m = np.random.rand(K,2)
m[:,0] *= max_x
m[:,1] *= max_y
print(m)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.plot(xy[:,0],xy[:,1],"ob")
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
plt.plot(m[:,0],m[:,1],"*r")
for k in range(K):
    plt.text(m[k,0],m[k,1],"["+str(k)+"]",color="red",fontsize=12)
plt.title("Here we randomly initialize the {:d} means".format(K))
plt.show()
## Step 1: Assign the responsibilities of each point to the nearest mean
r = np.zeros((N,K))
## Calculate the distance of each point to each mean
def distance(i,k):
    ## Calculates the distance of the ith point to the kth mean
    d = np.linalg.norm((xy[i,0]-m[k,0],xy[i,1]-m[k,1]))
    return d

distances = np.empty((N,K))
for i in range(N):
    for k in range(K):
        distances[i,k] = distance(i,k)
    filter = distances[i,:] == min(distances[i,:])
    # print(filter)
    # print(np.arange(K)[filter])
    ## Assign the responsibilities
    r[i,np.arange(K)[filter]] = 1
print(r)
for k in range(K):
    this_k_set_filter = (r[:,k] == 1)
    print(k,this_k_set_filter)
    plt.plot(xy[this_k_set_filter][:,0],xy[this_k_set_filter][:,1],"o"+colors[k])
    plt.plot(m[k,0],m[k,1],"*"+colors[k])
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
for k in range(K):
    plt.text(m[k,0],m[k,1],"["+str(k)+"]",color=colors[k],fontsize=12)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("Each point is assigned to the closest mean")
plt.show()
## Step 2: Update the means' locations so that
## they reside in the center of mass of their points
for k in range(K):
    this_k_set_filter = (r[:,k] == 1)
    # print(xy[this_k_set_filter])
    # print(np.sum(this_k_set_filter))
    mean_k = np.sum(xy[this_k_set_filter],axis=0)/np.sum(this_k_set_filter)
    print(mean_k)
    m[k,:] = mean_k
    plt.plot(xy[this_k_set_filter][:,0],xy[this_k_set_filter][:,1],"o"+colors[k])
    plt.plot(m[k,0],m[k,1],"*"+colors[k])
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
for k in range(K):
    plt.text(m[k,0],m[k,1],"["+str(k)+"]",color=colors[k],fontsize=12)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("The means are moved to the COM of their set of points")
plt.show()
## Step 3: Repeat steps 1 & 2 until there is no change
m_prev = np.zeros((K,2))
step = 1
while(not np.array_equal(m_prev,m)):
    m_prev = m.copy()
    ## Step 1: Assign the responsibilities of each point to the nearest mean
    r = np.zeros((N,K))
    distances = np.empty((N,K))
    for i in range(N):
        for k in range(K):
            distances[i,k] = distance(i,k)
        filter = distances[i,:] == min(distances[i,:])
        ## Assign the responsibilities
        r[i,np.arange(K)[filter]] = 1
    ## Step 2: Update the means' locations so that
    ## they reside in the center of mass of their points
    for k in range(K):
        this_k_set_filter = (r[:,k] == 1)
        mean_k = np.sum(xy[this_k_set_filter],axis=0)/np.sum(this_k_set_filter)
        m[k,:] = mean_k
        plt.plot(xy[this_k_set_filter][:,0],xy[this_k_set_filter][:,1],"o"+colors[k])
        plt.plot(m[k,0],m[k,1],"*"+colors[k])
    for i in range(N):
        plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
    for k in range(K):
        plt.text(m[k,0],m[k,1],"["+str(k)+"]",color=colors[k],fontsize=12)
    plt.xlim([0,max_x])
    plt.ylim([0,max_y])
    plt.title("Step: {:03d}".format(step))
    plt.show()
    step += 1
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(228)
colors = ["b","g","c","m","y","k","b","g","c","m","y","k"]
N = 10 ## Total number of points
K = 3 ## Total number of means
## World Limits
max_x = 10
max_y = 5
## Randomly generate the points
xy = np.random.rand(N,2)
xy[:,0]*=max_x
xy[:,1]*=max_y
plt.plot(xy[:,0],xy[:,1],"ob")
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("{:d} Randomly Distributed Points".format(N))
plt.show()
## -------------------------------------------------------
## Step 0: Initialize the means at random points
m = np.random.rand(K,2)
m[:,0] *= max_x
m[:,1] *= max_y
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.plot(xy[:,0],xy[:,1],"ob")
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
plt.plot(m[:,0],m[:,1],"*r")
for k in range(K):
    plt.text(m[k,0],m[k,1],"["+str(k)+"]",color="red",fontsize=12)
plt.title("Here we randomly initialize the {:d} means".format(K))
plt.show()
## -------------------------------------------------------
## Calculate the distance of each point to each mean
def distance(i,k):
    ## Calculates the distance of the ith point to the kth mean
    d = np.linalg.norm((xy[i,0]-m[k,0],xy[i,1]-m[k,1]))
    return d
## -------------------------------------------------------
## Step 3: Repeat steps 1 & 2 until there is no change
m_prev = np.zeros((K,2))
step = 0
while(not np.array_equal(m_prev,m)):
    m_prev = m.copy()
    ## Step 1: Assign the responsibilities of each point to the nearest mean
    r = np.zeros((N,K))
    distances = np.empty((N,K))
    for i in range(N):
        for k in range(K):
            distances[i,k] = distance(i,k)
        filter = distances[i,:] == min(distances[i,:])
        ## Assign the responsibilities
        r[i,np.arange(K)[filter]] = 1
    ## Step 2: Update the means' locations so that
    ## they reside in the center of mass of their points
    for k in range(K):
        this_k_set_filter = (r[:,k] == 1)
        mean_k = np.sum(xy[this_k_set_filter],axis=0)/np.sum(this_k_set_filter)
        m[k,:] = mean_k
        plt.plot(xy[this_k_set_filter][:,0],xy[this_k_set_filter][:,1],"o"+colors[k])
        plt.plot(m[k,0],m[k,1],"*"+colors[k])
    for i in range(N):
        plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
    for k in range(K):
        plt.text(m[k,0],m[k,1],"["+str(k)+"]",color=colors[k],fontsize=12)
    plt.xlim([0,max_x])
    plt.ylim([0,max_y])
    plt.title("Step: {:03d}".format(step))
    plt.show()
    step += 1
## Let's also connect the points to their assigned means for a nice view:
for k in range(K):
    this_k_set_filter = (r[:,k] == 1)
    mean_k = np.sum(xy[this_k_set_filter],axis=0)/np.sum(this_k_set_filter)
    m[k,:] = mean_k
    for i in range(np.sum(this_k_set_filter)):
        plt.plot(xy[this_k_set_filter][i,0],xy[this_k_set_filter][i,1],"o"+colors[k])
        plt.plot(np.array([xy[this_k_set_filter][i,0],m[k,0]]),np.array([xy[this_k_set_filter][i,1],m[k,1]]),"-"+colors[k])
    plt.plot(m[k,0],m[k,1],"*"+colors[k])
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("Connected...")
plt.show()
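As a cross-check (assuming scikit-learn is available; it is not otherwise used in this lecture), a library implementation of hard k-means should settle on essentially the same centers, possibly in a different order:

```python
## Hedged cross-check with scikit-learn's KMeans (an assumption:
## the library is not part of this lecture's toolbox).
from sklearn.cluster import KMeans

km = KMeans(n_clusters=K, n_init=10, random_state=228).fit(xy)
print(km.cluster_centers_)  ## compare with our means m
print(km.labels_)           ## compare with our assignments r
```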
To overcome the bad cases, we also have to deal with the following related issues:

- Assignment step issue: A point is strictly owned by only one mean, no matter whether it sits right next to that mean or at the very border of the mean's influence.
- Update step issue: Every point owned by a mean has exactly the same weight in determining the mean's new position. However, "everyday logic" (physics(!)) suggests that the further the point, the less influence it should have.

These two issues are actually different manifestations of the same problem. The hardness of the k-means algorithm can be softened by allowing partial ownership, hence the soft k-means clustering:
1. Assignment step: Responsibility vector (the responsibility of cluster $k$ for point $n$):

   $$r_k^{(n)} = \frac{\exp\left(-\beta\, d\left(\vec{m}^{(k)},\vec{x}^{(n)}\right)\right)}{\sum_{k'}\exp\left(-\beta\, d\left(\vec{m}^{(k')},\vec{x}^{(n)}\right)\right)}$$

   where $\beta$ is the softness parameter: the larger $\beta$, the closer the behaviour to hard k-means.

2. Update step: It is identical to the hard k-means algorithm, except that the responsibilities are now fractional:

   $$\vec{m}^{(k)} = \frac{\sum_n r_k^{(n)}\,\vec{x}^{(n)}}{R^{(k)}}, \qquad R^{(k)} = \sum_n r_k^{(n)}$$
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(228)
colors = ["b","g","c","m","y","k","b","g","c","m","y","k"]
beta = 1.5 ## Softness parameter
NN = 500 ## Max step number
N = 10 ## Total number of points
K = 3 ## Total number of means
## World Limits
max_x = 10
max_y = 5
## Randomly generate the points
xy = np.random.rand(N,2)
xy[:,0]*=max_x
xy[:,1]*=max_y
plt.plot(xy[:,0],xy[:,1],"ob")
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("{:d} Randomly Distributed Points".format(N))
plt.show()
## -------------------------------------------------------
## Step 0: Initialize the means at random points
m = np.random.rand(K,2)
m[:,0] *= max_x
m[:,1] *= max_y
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.plot(xy[:,0],xy[:,1],"ob")
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
plt.plot(m[:,0],m[:,1],"*r")
for k in range(K):
    plt.text(m[k,0],m[k,1],"["+str(k)+"]",color="red",fontsize=12)
plt.title("Here we randomly initialize the {:d} means".format(K))
plt.show()
## -------------------------------------------------------
## Calculate the distance of each point to each mean
def distance(i,k):
    ## Calculates the distance of the ith point to the kth mean
    d = np.linalg.norm((xy[i,0]-m[k,0],xy[i,1]-m[k,1]))
    return d
## -------------------------------------------------------
## Step 3: Repeat steps 1 & 2 until there is no change
m_prev = np.zeros((K,2))
step = 0
while(not(np.all(np.abs(m-m_prev)<1E-5)) and step<(NN+1)):
    m_prev = m.copy()
    ## Step 1: Calculate each mean's (soft) responsibility for each point
    r = np.zeros((N,K))
    distances = np.empty((N,K))
    for i in range(N):
        for k in range(K):
            #distances[i,k] = distance(i,k) ## Hard k-means
            distances[i,k] = np.exp(-beta*distance(i,k))
        ## Assign the responsibilities (normalized so that they sum to 1)
        r[i,:] = distances[i,:] / np.sum(distances[i,:])
    ## Step 2: Update the means' locations so that
    ## they reside in the center of mass of their points
    for k in range(K):
        r_k = r[:,k]
        mean_k = np.multiply(r_k.reshape(N,1),xy)
        mean_k = np.sum(mean_k,axis=0)
        mean_k /= np.sum(r_k)
        m[k,:] = mean_k
        if(not(step%10)):
            plt.plot(xy[:,0],xy[:,1],"o"+colors[0])
            plt.plot(m[k,0],m[k,1],"*"+colors[k])
    if(not(step%10)):
        for i in range(N):
            plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
        for k in range(K):
            plt.text(m[k,0],m[k,1],"["+str(k)+"]",color=colors[k],fontsize=12)
        plt.xlim([0,max_x])
        plt.ylim([0,max_y])
        plt.title("Step: {:03d}".format(step))
        plt.show()
    step += 1
plt.plot(xy[:,0],xy[:,1],"o"+colors[0])
for i in range(N):
    plt.text(xy[i,0],xy[i,1],str(i),fontsize=12)
for k in range(K):
    plt.plot(m[k,0],m[k,1],"*"+colors[k])
    plt.text(m[k,0],m[k,1],"["+str(k)+"]",color=colors[k],fontsize=12)
plt.xlim([0,max_x])
plt.ylim([0,max_y])
plt.title("Step: {:03d} (Final)".format(step))
plt.show()
#print(r)
print("\n\n{:^44}".format("Responsibility Distributions"))
print("{:^10s}|{:^10s}|{:^10s}|{:^10s}|".format("Point #","mean 0","mean 1","mean 2"))
print("-"*44)
for i in range(r.shape[0]):
    print("{:^10d}|{:^10.6f}|{:^10.6f}|{:^10.6f}|".format(i,r[i,0],r[i,1],r[i,2]))
print("-"*44)
Reference: David MacKay, "Information Theory, Inference, and Learning Algorithms", Cambridge University Press (2003). http://www.inference.org.uk/mackay/itila/book.html