The k-NN algorithm is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
- In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbour.
- In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbours.
For example, suppose we have training data for some fruit with measurements of the size, colour, acidity and a label stating the type of fruit (orange, lemon, lime). We can use k-NN to "predict" the class of an unknown fruit based on its measurements of size, colour and acidity. Alternatively, we could "predict" the value of acidity based on the fruit's colour, size and type. The former case is classification, the latter is regression.
For time-series analysis, this is generally just treated as a regression in 1-dimension, i.e. predicting a value of price from the"k" nearest time samples. This works very well for smoothing the data, but unfortunately breaks down somewhat at the right-hand edge, since the k nearest samples will always be to the left of it. In fact, the result is somewhat similar to a simple moving average.
The following graph shows some sample time series data, with a k-NN (k = 3) and a simple moving average (length = 3) for comparison.
The 1 dimensional k-NN does a great job of smoothing the data, but is not particularly good for prediction of the future (or at least no better than a moving average).
Another approach is to use a multi-dimensional k-NN. For example, suppose we use a feature vector "v"of 10 dimensions (denoted as v[0] to v[9]). Now, let v[0] be the price at time t0, v[1] be the price at time t1, v[2] be the price at time t2, etc.
We can use the k-NN algorithm to find the k nearest neighbours in our training data to this target vector. In this sense, it is essentially operating as a pattern matching algorithm. The output could be something like the average price-move of our k nearest matches, or even a buy/sell classification.
However, there are a number of issues:
- We need some kind of price invariance built into the system. For example, a particular pattern based around 1.100 should be matched with an identical pattern that happens to be based around 1.200 or 1.300, etc.
- Very large data sets are needed, and unlike the 1-dimensional case where we only had to search the last k points, we have to search the whole training set for our matches. This presents challenges for storage and performance.
One way to achieve the price invariance can be built in by looking at % change. For example, simply re-stating the feature vector so that v[0] = 0, and v[1] = (price at time t1 - price at time t0) / (price at time t0), etc. Similarly, one could use "returns" or "log returns".
Tackling the large amount of data requires a number of innovations:
- Reducing which data is used. For example, suppose we only care about price patterns that occur when the price is "over-bought" or "over-sold" then there is no need to store data for all the other cases.
- Using data structures like k-d trees to store the data more efficiently can drastically reduce the search time.
No comments:
Post a Comment