Machine Learning: About k-NN
The k-NN algorithm assumes that similar data points tend to have similar labels. As k increases, mislabelled "islands" in the training data get washed out and the decision boundary becomes smoother. But increasing k keeps reducing model complexity, and at some point the classifier starts misclassifying significant groups of data points. A basic principle of machine learning is to not treat the world as unpredictable and arbitrary, so as the number of neighbors grows the model attributes more and more of the variation in the data to noise; variance drops, but at the expense of bias, and the decision boundary becomes too smooth for truly complex data sets.
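As a minimal sketch of this trade-off (assuming scikit-learn and a synthetic data set standing in for real training data), cross-validated accuracy can be compared across a few values of k:

```python
# Sketch: comparing k values by cross-validation (assumes scikit-learn;
# the synthetic data set is only a stand-in for real training data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for k in (1, 3, 5, 15, 51):
    clf = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"k={k:>2}  mean CV accuracy={score:.3f}")
```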
Simple remedies include lowering k and using data augmentation, which is especially useful for image data, to make the training set larger and denser.
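For instance, a rough sketch of augmentation for image data, assuming a hypothetical array of training images, could simply add mirrored copies:

```python
# Sketch: densifying an image training set with horizontal flips.
# `images` and `labels` are hypothetical stand-ins for a real data set.
import numpy as np

images = np.random.rand(100, 28, 28)   # stand-in for (n, height, width) images
labels = np.random.randint(0, 10, 100)

flipped = images[:, :, ::-1]           # mirror each image left-right
aug_images = np.concatenate([images, flipped])
aug_labels = np.concatenate([labels, labels])
print(aug_images.shape)                # (200, 28, 28): twice as many samples
```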
Combining local mean based k-nearest neighbor (LMKNN) with distance-weighted k-nearest neighbor (DWKNN) can also help increase classification accuracy.
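Scikit-learn does not implement LMKNN, but the distance-weighting part of this idea can be sketched with its built-in "distance" voting scheme:

```python
# Sketch: distance-weighted voting (the DWKNN idea) via scikit-learn's
# weights="distance"; LMKNN itself would need a custom implementation.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=7, weights="distance").fit(X, y)
# closer neighbors now get proportionally larger votes than distant ones
print(clf.predict(X[:3]))
```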
k-NN assumes that the user has a way to compute distances that reflect meaningful dissimilarities. The distance function is the underlying principle of the k-NN algorithm, and the metric has to be application-specific. The most common measure is Euclidean distance; other popular metrics are the Minkowski distance and the Manhattan distance (taxicab geometry).
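For illustration, these three metrics can be written out directly with NumPy (a sketch, not tied to any particular k-NN library):

```python
# Sketch: the three metrics mentioned above, written out with NumPy.
import numpy as np

def euclidean(a, b):          # L2: square root of summed squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):          # L1 / taxicab: summed absolute differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):     # generalizes both (p=2 -> Euclidean, p=1 -> Manhattan)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b))
```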
Since k-NN assumes that "similar inputs have similar labels" and the hyperparameter k is the number of neighbors, prediction accuracy depends directly on the distances to those neighbors and on their labels. So how we measure distance is critical to both accuracy and performance. When features are normalized, using the L1 distance instead of the more common L2 distance sometimes yields a better classifier.
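One way to sketch this, assuming scikit-learn, is to scale the features and switch the metric to L1; whether it actually helps should be checked by cross-validation:

```python
# Sketch: normalize features, then use the L1 (Manhattan) metric for k-NN.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

knn_l1 = make_pipeline(
    MinMaxScaler(),                                  # scale each feature to [0, 1]
    KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
)
```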
Computing time increases with the number of data points, because the number of evaluations of the application-specific distance metric also increases. More training points mean more distances to calculate and more candidates to scan when finding the nearest neighbors for a label. It can also lead to more tie situations, and therefore more adjustments of k to find an optimal value.
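A brute-force query makes this cost explicit, as in the sketch below (X_train and x_query are hypothetical stand-ins):

```python
# Sketch: a brute-force k-NN query computes one distance per training point,
# so each prediction costs O(n) distance evaluations.
import numpy as np

X_train = np.random.rand(10000, 8)
x_query = np.random.rand(8)

dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # 10,000 distances
k = 5
nearest = np.argsort(dists)[:k]                          # indices of the k closest points
print(nearest, dists[nearest])
```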
The most useful technique for improving the efficiency of k-NN in this scenario is a k-d tree or ball tree structure. These structures group neighboring data points into a box; a single distance calculation from the query point to the nearest face of the box can let the algorithm discard every point inside it, because they are all farther away than a neighbor already found. This dramatically reduces the number of distance calculations, and with it the total time and computing power needed.
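With scikit-learn, for example, the index structure can be requested explicitly; this is only a sketch of the API choice, not a benchmark:

```python
# Sketch: asking scikit-learn to build a k-d tree or ball tree index so whole
# boxes of points can be pruned without computing every distance.
from sklearn.neighbors import KNeighborsClassifier

knn_kd   = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
# algorithm="auto" (the default) picks a structure based on the data itself
```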
Increasing the number of dimensions of each data point in a data set increases the dissimilarity between neighboring points. For example, when points in a two-dimensional space are spread out along a third dimension, their pairwise distances grow. This is the curse of dimensionality. Because it works against the main assumption of the k-NN algorithm, "similar inputs have similar labels," k-NN breaks down in high-dimensional space. For example, if we assign labels based on height only, many points will look similar; adding age as a second feature significantly reduces that similarity, and adding further features such as blood pressure and heart rate spreads the points out even more. As dimensionality increases, our nearest neighbors are no longer near the test point, which eventually makes the k-NN algorithm ineffective.
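A quick illustrative sketch with random uniform data shows how the contrast between the nearest and farthest neighbor shrinks as dimensionality grows:

```python
# Sketch: in higher dimensions the gap between the nearest and farthest
# neighbor shrinks relative to the distances themselves (random uniform data).
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))                 # 1000 random points in d dimensions
    q = rng.random(d)                         # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:>4}  (max-min)/min = {(dists.max() - dists.min()) / dists.min():.3f}")
```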
Generally, if features are correlated, such as blood pressure and age, k-NN can still work in higher dimensions. It may also work if a few features dominate the rest (applying Principal Component Analysis can help identify them). Otherwise, the k-NN algorithm may not be suitable for the problem.
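A common sketch of this idea, assuming scikit-learn, is to standardize, project onto a few principal components, and then run k-NN; the number of components kept here is an arbitrary illustrative choice:

```python
# Sketch: project correlated features onto a few principal components before k-NN.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn_pca = make_pipeline(
    StandardScaler(),            # PCA is scale-sensitive, so standardize first
    PCA(n_components=3),         # keep the few dominating directions
    KNeighborsClassifier(n_neighbors=5),
)
```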