# K-Means Clustering Algorithm For Pair Selection In Python – Part II

Contributor:
QuantInsti
Visit: QuantInsti

In the first installment, Lamarcus explained K-Means Clustering. In today’s article, the author will try to implement statistical arbitrage without using K-Means first.

Life Without K-Means

To gain an understanding of why we may want to use K-Means to solve the problem of pair selection we will attempt to implement a Statistical Arbitrage as if there was no K-Means. This means that we will attempt to develop a brute force solution to our pair selection problem and then apply that solution within our Statistical Arbitrage strategy.

Let’s take a moment to think about why K-Means could be used for trading. What’s the benefit of using K-Means to form subgroups of possible pairs? I mean couldn’t we just come up with the pairs ourselves?

This is a great question and one undoubtedly you may have wondered about. To better understand the strength of using a technique like K-Means for Statistical Arbitrage, we’ll do a walk-through of trading a Statistical Arbitrage strategy if there was no K-Means. I’ll be your ghost of trading past so to speak.

First, let’s identify the key components of any Statistical Arbitrage trading strategy.

1. We must identify assets that have a tradable relationship
2. We must calculate the Z-Score of the spread of these assets, as well as the hedge ratio for position sizing
3. We generate buy and sell decisions when the Z-Score exceeds some upper or lower bound

To begin we need some pairs to trade. But we can’t trade Statistical Arbitrage without knowing whether or not the pairs we select are cointegrated. Cointegration simply means that the statistical properties between our two assets are stable. Even if the two assets move randomly, we can count on the relationship between them to be constant, or at least most of the time.

Traditionally, when solving the problem of pair selection, in a world with no K-Means, we must find pairs by brute force, or trial and error. This was usually done by grouping stocks together that were merely in the same sector or industry. The idea was that if these stocks were of companies in similar industries, thus having similarities in their operations, their stocks should move similarly as well. But, as we shall see, this is not necessarily the case.

The first step is to think of some pairs of stocks that should yield a trading relationship. We’ll use stocks in the S&P 500 but this process could be applied to any stocks within any index. Hmm, how about Walmart and Target. They both are retailers and direct competitors. Surely they should be cointegrated and thus would allow us to trade them in a Statistical Arbitrage Strategy.

Let’s begin by importing the necessary libraries as well as the data that we will need. We will use 2014-2016 as our analysis period.

#importing necessary libraries
#data analysis/manipulation
import numpy as np
import pandas as pd
#importing pandas datareader to get our data
#importing the Augmented Dickey Fuller Test to check for cointegration