
Abstract

The paper Cyclic Arbitrage in Decentralized Exchange Markets [1] showed that cyclic arbitrages occur far more frequently in decentralised exchanges than in centralised ones. In their work, the authors mainly focus on analysing these cycles in terms of length, token distribution, daily patterns and profitability.

However, the factors driving their appearance have not been studied yet. To this end, we propose to extend the work of [1] on Uniswap data. Moreover, we also plan to study the predictive power of these factors in a binary classification setting. This would allow determining whether or not a cycle can actually be implemented and generate a positive return, a question with inherent market value.

Goal

The goal of this project is to study exploited cyclic arbitrages in decentralised exchanges. We already have access to the Cyclic transaction dataset, which contains cyclic arbitrages that were exploited. We intend to extract features out of the events (trade rates, trade volumes, liquidity) preceding the arbitrages.

These features could potentially be high-dimensional (depending on the length of the time series), so we will need to use dimensionality reduction techniques to create an embedding that provides a relevant set of features for our machine learning models.

Then, we will cluster the arbitrages based on the computed features. Ideally, we would like to observe meaningful clusterings: profitable cycles get clustered together, cycles with similar duration (how long the cycle stays profitable) also end up in the same cluster, etc. Once meaningful clusters are obtained, it becomes interesting to use the same features in a prediction model with the profitability of the arbitrage as a target.

Data Acquisition

Data used in the study come from two different sources: information about exploited cycles comes from the dataset used in the arxiv paper, while data concerning the rates preceding the cycles are fetched through bitquery.

Dataset from the paper

We have already downloaded the dataset used in the arxiv paper. This dataset consists of arbitrage cycles that were exploited in the past. Each of these cycles is described by a path (the token swaps), a cost (gas fees), a profit, etc. It consists of a single JSON file and the downloading process is straightforward. The cyclic transaction dataset contains cycles of various lengths (number of tokens involved). The following figure displays the distribution of these lengths :

Note: this figure was replicated from the arxiv paper.

Moreover, to compute the embedding (next step) it is more convenient to work with fixed-length cycles. Thus, cycles whose length differs from 3 are filtered out. The obtained data is called filtered_cycles_data. While filtering, a new indexing system is created to identify cycles through an incremental cycle_id.
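A minimal pandas sketch of this filtering step (the file name and the JSON field names, such as path, are assumptions about the dataset layout):

```python
import json
import pandas as pd

# Load the cycles from the arxiv paper's dataset (file and field names are assumptions).
with open("cycles_data.json") as f:
    cycles = json.load(f)

# Keep only cycles of length 3 and assign an incremental cycle_id.
filtered = [c for c in cycles if len(c["path"]) == 3]
filtered_cycles_data = pd.DataFrame(filtered).reset_index(drop=True)
filtered_cycles_data["cycle_id"] = filtered_cycles_data.index
```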

Custom extended dataset (Uniswap data)

As a second step, we construct an extended version of the cyclic transaction dataset as follows. For each cycle in filtered_cycles_data:

  1. We fetch from https://bitquery.io/ the exchange rates and gas prices of the k transactions preceding each swap present in the cycle path. This downloading process is more complex and needs to address some challenges. The free version of the bitquery API used for this project only allows a limited number of queries (10/min). To work around this limit, we used an EPFL cluster machine to query the API for a few weeks with a time delay between each call. To increase the throughput we used 2 API keys. For each cycle in filtered_cycles_data, 3 calls are needed, one per edge of the cycle (a sketch of this throttled download loop is given after this list).
  2. The fetched data is saved in multiple JSON files named uniswap_raw_data_start_end, where start and end designate the starting and ending cycle_id of the cycles included in the file. We chose to use multiple files to avoid big losses in case of failure; these files are copied to safe storage as a backup. The files are then combined into a single pandas DataFrame saved as uniswap_data.csv. We posted the dataset (split files) on Kaggle
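The following sketch, continuing from the previous one, illustrates the throttled download loop with periodic checkpointing. The fetch_edge helper is a hypothetical placeholder for the actual bitquery GraphQL request, and the path field listing the 3 tokens of a cycle is an assumption:

```python
import json
import time

API_KEYS = ["KEY_1", "KEY_2"]   # two free-tier bitquery keys
DELAY = 6.5                     # seconds between calls, to stay under ~10 queries/min per key

def fetch_edge(token_in, token_out, api_key):
    """Hypothetical placeholder: the real implementation posts a GraphQL query
    to the bitquery API asking for the swaps of one token pair."""
    return {"token1": token_in, "token2": token_out, "trades": []}

batch, call_count = [], 0
for cycle in filtered_cycles_data.to_dict("records"):
    path = cycle["path"]
    edges = list(zip(path, path[1:] + path[:1]))           # 3 edges per length-3 cycle
    for token_in, token_out in edges:
        key = API_KEYS[call_count % len(API_KEYS)]          # alternate between the 2 keys
        batch.append(fetch_edge(token_in, token_out, key))
        call_count += 1
        time.sleep(DELAY)
    if cycle["cycle_id"] % 100 == 99:                       # checkpoint every 100 cycles
        start, end = cycle["cycle_id"] - 99, cycle["cycle_id"]
        with open(f"uniswap_raw_data_{start}_{end}.json", "w") as f:
            json.dump(batch, f)
        batch = []
```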

Each row contains information about a single swap :

cycle_id, token1, token2, baseAmount, quoteAmount, quotePrice, gasPrice, gasValue, time.

In the following sections, we will only work on this extended dataset, which is therefore simply referred to as the dataset.

Data Wrangling

Data Exploration

Before developing any machine learning model, we need to grasp a basic understanding of the dataset and its statistical properties.

As a first step, some basic descriptive statistics are shown.

  baseAmount quoteAmount quotePrice gasPrice gasValue
count 15000000.0 15000000.0 15000000.0 15000000.0 15000000.0
mean 617285.7691774105 713846.7959430087 2242173912899.245 220.7801204099542 0.039095987647290976
std 65344245.49678698 70486989.66812174 1692048320610068.8 358.3729580470268 0.07014017303962658
min -15.142535104 -12.159705765907386 -2524.8558484026958 4.0 0.000995174389907456
25% 0.8133318661470393 0.9036593505321668 0.004853244571226323 38.999998464 0.0077283002647838715
50% 12.397615 13.492995324544614 1.00714360456757 112.99999744 0.017265654713286657
75% 452.1615419051607 437.5007715050853 210.47100813227757 337.500012544 0.05276144843830067
max 38617566009.5172 38617566009.5172 2.8893209691893514e+18 100000.00037683199 13.081600120162616

Since we are dealing with financial time series, the following features:

baseAmount, quoteAmount, quotePrice, gasPrice, gasValue

will probably exhibit heavy-tailed distributions to some extent. This is what we check first.

Note: in this section, no distinction is made between cycles. In other words, all data available is treated as a single feature, aka a single distribution. Indeed, understanding each cycle separately is a cumbersome process. Furthermore, it does not help in getting a global understanding of the dataset.

Note that only quotePrice and gasPrice are used as features for the embedding (in fact quotePrice = quoteAmount / baseAmount). The distribution of quotePrice is shown below:

We observe that both features are extremely heavy-tailed. This is likely to cause issues when they are used as features for machine learning models. As shown in the plot, applying a logarithmic transformation makes the distributions more Gaussian (the desired behaviour).

As a result, we computed the basic descriptive statistics again, but this time on a log scale. The values are on a much more reasonable scale and the results are far more interpretable.

  baseAmount quoteAmount quotePrice gasPrice gasValue
count 14999899.00 14999852.00 14999751.00 15000000.00 15000000.00
mean 2.91 2.95 0.04 4.74 -3.94
std 4.02 3.99 5.70 1.17 1.17
min -41.45 -41.45 -40.91 1.39 -6.91
25% -0.21 -0.10 -5.33 3.66 -4.86
50% 2.52 2.60 0.01 4.73 -4.06
75% 6.11 6.08 5.35 5.82 -2.94
max 24.38 24.38 42.51 11.51 2.57

We also would like to draw the reader's attention to the fact that this global heavy-tail phenomenon also appears at the token pair scale, although with more variability in the distributions. We took a liquid token pool (lots of transactions within a small time frame) to compute the following graph.

Furthermore, some tokens are partially illiquid, which negatively affects the number of transactions available in the dataset. To better understand this phenomenon, we plotted the distribution of transactions per token pair.

Furthermore, it is likely that for illiquid Uniswap pools the time between the first and the last reported transactions differs quite significantly. In contrast, for very liquid pools, the time span can be in the order of milliseconds. This discrepancy in the dataset could create some issues for the analysis performed in this study.

We provide more detailed information on the time-span distribution in the following plot.

A large disparity is observed in the time-span distribution. Furthermore, some transactions are more than 100 days apart for the same token pair, which characterises the illiquidity of the underlying tokens. However, the observed median time span is still fairly reasonable: 1 day 21:52:20.

After a quick data exploration on these illiquid tokens, we realised they have a very different feature distribution from the others. They can be considered outliers and would therefore negatively impact the training of our machine learning models. Furthermore, one might argue that arbitrages on illiquid tokens are hard to realise in practice, so there is less incentive to study them.

At this stage, we propose to develop two sub-datasets for the rest of the tasks, one qualified as full since it contains all cycles and another qualified as liquid containing only those that are considered liquid enough. Liquid tokens must satisfy the following conditions:

  1. The token has at least half of the transactions that should have been fetched (i.e. 300 out of 600).
  2. The token is in the 80% quantile of the time-span distribution defined above (i.e. span less than 15 days).

Then a cycle is considered liquid if at least 2 of its 3 constituent tokens are themselves liquid.
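A hedged pandas sketch of this liquidity filter, interpreting the per-token thresholds at the token-pair (edge) level and assuming uniswap_data.csv with the columns listed earlier:

```python
import pandas as pd

uniswap_data = pd.read_csv("uniswap_data.csv", parse_dates=["time"])

grouped = uniswap_data.groupby(["token1", "token2"])
tx_count = grouped.size()                                      # transactions fetched per pair
time_span = grouped["time"].agg(lambda t: t.max() - t.min())   # span between first and last swap

# Condition 1: at least 300 of the 600 expected transactions.
# Condition 2: within the 80% quantile of the time-span distribution.
liquid_pairs = set(tx_count[(tx_count >= 300) &
                            (time_span <= time_span.quantile(0.80))].index)

def is_liquid_cycle(cycle_df: pd.DataFrame) -> bool:
    """A cycle is liquid if at least 2 of its 3 pairs are liquid."""
    pairs = set(zip(cycle_df["token1"], cycle_df["token2"]))
    return sum(p in liquid_pairs for p in pairs) >= 2

liquid_ids = [cid for cid, g in uniswap_data.groupby("cycle_id") if is_liquid_cycle(g)]
```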

As we will see later, the full dataset is harder to train on, and at some point we decided to only use the liquid one.

Data preprocessing

As it was previously said, the focus of this study is on cycles of length 3. We propose to model each cycle as a list of 3 nodes which are themselves represented as 2-dimensional time series: quote prices and gas prices. These time series have length at most P = 600 (transaction data fetched from bitquery).

Logarithmic transformation

As shown in the Data Exploration section, it is probably a good idea to apply the element-wise logarithm to the quotePrice and gasPrice features as a first step.
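A minimal sketch of this step; non-positive values (a handful of rows, as reflected in the reduced counts of the log-scale statistics above) are mapped to NaN and handled later:

```python
import numpy as np

# Element-wise logarithm of the two features used for the embedding.
for col in ["quotePrice", "gasPrice"]:
    uniswap_data[f"log_{col}"] = np.log(uniswap_data[col].where(uniswap_data[col] > 0))
```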

Train/Test split

In order to test our models with the least amount of bias possible, we randomly split the datasets into a training set and a test set (20% of the data).

Scaling the features

To avoid the effect of the scales of the features on our results, we have to re-scale them. To do so, we consider two different approaches.

After taking the log, the global feature distributions shown in the Data Exploration section look roughly Gaussian. Hence, the first approach is the standard scaler.

Moreover, token pairs can have very different scales in terms of quote and gas prices. Therefore, it might be worthwhile to define a custom scaling mechanism for each of these pairs. We refer to this as a TokenStandardScaler in the code.

As expected, the latter approach yields better performance in the following machine learning tasks. However, it requires more processing power than the first approach. We had to restrict the number of rows in the dataset to be able to leverage it.
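A minimal sketch of the TokenStandardScaler idea (standardising each feature within each token pair; in practice the training-set statistics should be reused for the test set):

```python
import pandas as pd

FEATURES = ["log_quotePrice", "log_gasPrice"]

def token_standard_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise each feature within each (token1, token2) group."""
    out = df.copy()
    grouped = out.groupby(["token1", "token2"])[FEATURES]
    out[FEATURES] = grouped.transform(lambda s: (s - s.mean()) / s.std())
    return out
```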

Zero-Padding

In this study, each cycle is represented as a tensor with the following 3 dimensions:

Name Size Description
D 3 the length of the cycle, aka the number of tokens
P 600 the length of the time series of swap transactions
K 2 the amount of time series/features (quotePrice and gasPrice)

However, due to a lack of liquidity in the Uniswap pools of a given token pair, there may be fewer than P = 600 swap transactions fetched from bitquery. Since our machine learning models require a fixed input size, we need to pad the shorter time series. To this end, the fixed length P = 600 was chosen, along with zero-padding, for simplicity. An extensive backtesting process could be performed later on (e.g. resampling at a fixed or mean frequency between two cycles). However, due to time constraints, we decided not to explore further in this direction.

Note: in both train and test splits the zero-padding artificial values account for roughly 17-18% of the full dataset and 10-11% of the liquid dataset. In the further steps section, we propose an alternative to reduce this overhead.

Building the feature tensor

The dataset is not directly shaped for building the feature tensor of the machine learning models. Indeed, it simply contains a list of swap transactions in CSV format. We need to massage it with multiple operations to group transactions associated with the same cycle together and pad each time series independently.

Note: hardware capacities of the cluster only allowed us to process 10 000 000 swap transactions.
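A hedged sketch of how the grouping and zero-padding could be assembled into the final feature tensor (column names follow the earlier sketches; the exact implementation used in the project may differ):

```python
import numpy as np
import pandas as pd

D, P, K = 3, 600, 2
FEATURES = ["log_quotePrice", "log_gasPrice"]

def build_cycle_tensor(cycle_df: pd.DataFrame) -> np.ndarray:
    """Turn the swaps of one cycle into a D x P x K array, zero-padding
    time series shorter than P. cycle_df is assumed to be sorted by time."""
    tensor = np.zeros((D, P, K), dtype=np.float32)
    for d, (_, edge_df) in enumerate(cycle_df.groupby(["token1", "token2"], sort=False)):
        values = np.nan_to_num(edge_df[FEATURES].to_numpy())[-P:]   # keep at most P rows
        tensor[d, : len(values), :] = values
    return tensor

# Stack one tensor per cycle into the final N x D x P x K feature array.
X = np.stack([build_cycle_tensor(g) for _, g in uniswap_data.groupby("cycle_id")])
```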

Cycle embedding

Motivation

In section Data preprocessing, we modelled each cycle by a D x P x K = 3 x 600 x 2 tensor. Hence, each of them is represented by 3600 parameters.

However, one could argue that 3600 dimensions is too high for encoding cycles, especially given that we are dealing with noisy financial time series. Building a latent representation of cycles in which distances are more meaningful could be of great use in further studies.

In fact, later on in this article, we are going to cluster cycles using the KMeans algorithm, which heavily relies on meaningful Euclidean distances and can therefore greatly benefit from dimensionality reduction.

To create this embedding, multiple approaches can be considered. We propose the following:

  1. Principal component analysis (PCA)
  2. Autoencoder

PCA is referred to in the literature as a simple linear projection operation. We plan to use it as a baseline comparison with the more complex autoencoder model.

The task of an autoencoder is summarised in the following figure.

Diagram encoder

The dimension Q of the latent space determines the reduction factor obtained through this process.

For example: with Q = 100, the reduction factor is 3600/100 = 36x, which is non-negligible.

However, the more we lower Q, the more the signal is compressed which increases the amount of error in the decoding phase.

Using the PCA approach, we can easily understand how much is lost when Q varies.
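For instance, the cumulative explained variance ratio of a fitted PCA gives a direct view of how much signal is retained for a given Q (a sketch, assuming X_train holds the training-set cycle tensors built earlier):

```python
import numpy as np
from sklearn.decomposition import PCA

X_flat = X_train.reshape(len(X_train), -1)        # N x 3600
pca = PCA(n_components=100).fit(X_flat)

# Fraction of the variance retained when keeping the first Q = 100 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"variance retained with Q = 100: {cumulative[-1]:.3f}")
```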

In the Cycles profitability prediction task, we will be able to measure the gain of the embedding compared to the raw features (base model).

Autoencoder : different architectures

In this section, we tried multiple models (mainly focusing on autoencoders) for embedding computation through dimensionality reduction. As a first step, we trained the described models on the entire dataset. However, we realized that some data points were not liquid at all: they were represented by rates spanning hundreds of days. The models performed poorly on this set, so we decided to focus on liquid data only (see the Data Exploration section). Multiple optimizers (Adam, SGD, NAdam, Adamax, RMSprop) were tested for training. It appears that Adamax is the one that works best on these tasks. This is not guaranteed in general (no free lunch theorem), but on the liquid data Adamax was faster than the other optimizers by a factor of 10.
Manually defining and testing architectures was a tedious and time-consuming task; multiple days of the study were dedicated to this section.

PCA

The first model chosen for the embedding representation is a PCA with a latent space of 100 dimensions. Note that dim = 100 will be used for all following models. PCA only considers linear transformations. However, it is fast to train (around 30 s in our case) and acts as a baseline comparison for the other models.

Linear autoencoder

Secondly, we trained an autoencoder with linear activations only. The purpose of this choice is to compare the autoencoder architecture with PCA. Indeed, when using linear activations on all layers, the model should perform similarly to PCA, so we expect comparable performance. The model consists of 3 fully connected linear layers, one of them being the bottleneck (dim = 100), and is trained using stochastic gradient descent.
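A minimal Keras sketch of this linear architecture (the hidden-layer size is an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM, Q = 3 * 600 * 2, 100     # flattened cycle tensor, bottleneck size

linear_ae = keras.Sequential([
    layers.Input(shape=(INPUT_DIM,)),
    layers.Dense(600, activation="linear"),
    layers.Dense(Q, activation="linear"),           # bottleneck / embedding
    layers.Dense(INPUT_DIM, activation="linear"),
])
linear_ae.compile(optimizer=keras.optimizers.SGD(), loss="mse")
```

The training and validation losses obtained for this model are shown in the following figure :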

Multilayer autoencoder

Let’s go deep! In this section, the number of layers is increased and the activation functions are changed to be non-linear (elu, relu, selu, …). The neural network used here has 2 fully connected layers of 600 neurons each, symmetric to the bottleneck layer, and uses elu activations. Multiple activation functions were tested on this architecture and elu was found to be the best one (based on the test MSE). This model is named fully_connected_3L and the obtained losses are:

Surprisingly, the best loss obtained by this more complex model does not beat the PCA model. This is why training was run for up to 500 epochs: we wanted to see how the loss behaves and whether it would drop later on, but it did not. This model has the same number of layers as the linear one, but it has more neurons and more complex activations. Since the model is more flexible, we expected it to outperform the linear and PCA models, at least on the training loss.
Note that more variants of neural network architectures will be trained and tested later on using the Talos library.
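For reference, a minimal Keras sketch of the fully_connected_3L architecture described above (the exact layer ordering is an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM, Q = 3 * 600 * 2, 100

fully_connected_3L = keras.Sequential([
    layers.Input(shape=(INPUT_DIM,)),
    layers.Dense(600, activation="elu"),
    layers.Dense(Q, activation="elu"),              # bottleneck / embedding
    layers.Dense(600, activation="elu"),
    layers.Dense(INPUT_DIM, activation="linear"),
])
fully_connected_3L.compile(optimizer="adamax", loss="mse")
```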

Convolutional autoencoder

To better capture the structure of cycles, we propose an alternative to the fully dense model of the previous section: a convolutional autoencoder. The motivation for introducing this more complex architecture is that when a cyclic arbitrage is implemented, the first transaction could affect the price/gas of the second token pair, and similarly for the other transactions. Convolution operations could extract these neighbouring relationships between tokens to build a better latent representation of cycles. We hope that convolutional layers will allow us to leverage this structural bias. First, we tried to train a simple CNN, but it did not perform well. CNNs are in some sense more constrained than fully connected networks: the limited number of shared parameters might bias the reconstruction. So we decided to add complexity to this model :

In addition to the convolutional layers of the network, we added 2 dense layers (in blue) of 300 neurons symmetrically connected to the bottleneck (Green). 4 Convolution layers (in red) are added. They consist of a 2D convolution layer followed by a max-pooling operation on the encoding side and an upsampling operation on the decoding side:

Diagram encoder
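A hedged Keras sketch of such a convolutional autoencoder (filter counts, kernel sizes and pooling factors are assumptions chosen so that the time axis divides evenly, not the exact architecture used in the project):

```python
from tensorflow import keras
from tensorflow.keras import layers

D, P, K, Q = 3, 600, 2, 100
FILTERS = 16                      # number of conv filters (assumption)

inputs = keras.Input(shape=(D, P, K))
x = inputs
# Encoder: 4 conv blocks, pooling along the time axis only
# (600 -> 300 -> 150 -> 75 -> 15).
for pool in [(1, 2), (1, 2), (1, 2), (1, 5)]:
    x = layers.Conv2D(FILTERS, (2, 3), activation="elu", padding="same")(x)
    x = layers.MaxPooling2D(pool_size=pool)(x)
x = layers.Flatten()(x)
x = layers.Dense(300, activation="elu")(x)
code = layers.Dense(Q, activation="elu", name="bottleneck")(x)

# Decoder: mirror of the encoder, with upsampling instead of pooling.
x = layers.Dense(300, activation="elu")(code)
x = layers.Dense(D * 15 * FILTERS, activation="elu")(x)
x = layers.Reshape((D, 15, FILTERS))(x)
for up in [(1, 5), (1, 2), (1, 2), (1, 2)]:
    x = layers.UpSampling2D(size=up)(x)
    x = layers.Conv2D(FILTERS, (2, 3), activation="elu", padding="same")(x)
outputs = layers.Conv2D(K, (2, 3), activation="linear", padding="same")(x)

cnn_fully_connected = keras.Model(inputs, outputs)
cnn_fully_connected.compile(optimizer="adamax", loss="mse")
```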

This model is named CNN_fully_connected and produces the following losses :

Performance Analysis

The following figure illustrates the losses obtained by each described model :

As expected, the linear model’s performance is close to that of PCA. However, the results obtained for the more complex models do not meet our expectations. The fully connected network and the CNN perform poorly compared to PCA. They should perform better, at least on the training data, because they are more flexible: PCA is restricted to linear transformations, which is not the case for CNN_fully_connected and fully_connected_3L. This poor performance might come from the choice of network architecture: number of layers/neurons, activation functions, etc. To find the optimal architecture we tune these parameters in the following section.

Hyper-parameter optimisation

Now that we have selected the main architecture for our autoencoder, namely a fully connected MLP, let’s optimise its loss. There are multiple hyperparameters that can be tuned for this model. We selected the following :

Parameter name Option 1 Option 2 Option 3
activation function selu elu  
# dense layers 3 5  
# neurons in first/last layer 100 300 600
dropout factor 0% 25% 50%
optimizer adam Adamax  
batch_size 16 32  
# epochs 150    

In order to perform this parameter search, we used the Talos Python library, which makes it fairly easy to test hundreds of parameter combinations in a few lines of code. Note that the training of the fully connected models tested previously converged around epoch 100, and Talos keeps the best obtained loss and the corresponding training epoch. Thus we only chose one value for the # epochs parameter.
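A hedged sketch of how such a scan can be set up with Talos (the parametric model builder and the exact Talos API details are assumptions and may differ slightly from the code used in the project):

```python
import talos
from tensorflow import keras
from tensorflow.keras import layers

p = {
    "activation": ["selu", "elu"],
    "dense_layers": [3, 5],
    "first_neuron": [100, 300, 600],
    "dropout": [0.0, 0.25, 0.5],
    "optimizer": ["adam", "adamax"],
    "batch_size": [16, 32],
    "epochs": [150],
}

def ae_model(x_train, y_train, x_val, y_val, params):
    """Build and train one autoencoder candidate for a given parameter combination."""
    model = keras.Sequential()
    model.add(layers.Input(shape=(x_train.shape[1],)))
    for _ in range(params["dense_layers"] // 2):
        model.add(layers.Dense(params["first_neuron"], activation=params["activation"]))
        model.add(layers.Dropout(params["dropout"]))
    model.add(layers.Dense(100, activation=params["activation"]))      # bottleneck
    for _ in range(params["dense_layers"] // 2):
        model.add(layers.Dense(params["first_neuron"], activation=params["activation"]))
    model.add(layers.Dense(x_train.shape[1], activation="linear"))
    model.compile(optimizer=params["optimizer"], loss="mse")
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=params["batch_size"],
                        epochs=params["epochs"], verbose=0)
    return history, model

scan = talos.Scan(x=X_train_flat, y=X_train_flat, params=p,
                  model=ae_model, experiment_name="ae_tuning", val_split=0.2)
```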

After running not less than 12 hours on the IZAR EPFL cluster, we got the following results:

round_epochs loss val_loss activation batch_size dense_layers dropout epochs first_neuron optimizer
112 0.075 0.09 selu 16 3 0.0 150 600 Adamax
109 0.08 0.093 selu 16 3 0.0 150 300 Adamax
75 0.098 0.107 selu 16 3 0.0 150 300 adam
69 0.101 0.108 selu 16 3 0.0 150 600 adam
143 0.104 0.111 selu 16 5 0.0 150 300 Adamax
104 0.104 0.112 selu 16 5 0.0 150 600 Adamax
150 0.128 0.112 selu 16 3 0.25 150 600 Adamax
115 0.104 0.113 selu 16 3 0.0 150 100 Adamax
150 0.105 0.113 selu 16 5 0.0 150 100 Adamax
150 0.143 0.117 selu 16 3 0.25 150 300 Adamax
86 0.116 0.12 selu 16 3 0.0 150 100 adam
94 0.117 0.122 selu 16 5 0.0 150 600 adam
65 0.115 0.122 selu 16 5 0.0 150 300 adam
65 0.116 0.126 selu 16 5 0.0 150 100 adam
114 0.137 0.128 selu 16 3 0.25 150 300 adam
150 0.144 0.133 selu 16 5 0.25 150 600 Adamax
150 0.159 0.135 selu 16 3 0.25 150 100 Adamax
150 0.15 0.137 selu 16 5 0.25 150 300 Adamax
89 0.169 0.138 selu 16 3 0.25 150 600 adam
150 0.173 0.146 selu 16 5 0.25 150 100 Adamax
65 0.165 0.147 selu 16 3 0.25 150 100 adam
101 0.158 0.149 selu 16 5 0.25 150 300 adam
150 0.203 0.15 selu 16 3 0.5 150 600 Adamax
107 0.176 0.153 selu 16 5 0.25 150 100 adam
46 0.164 0.154 selu 16 5 0.25 150 600 adam
150 0.211 0.154 selu 16 3 0.5 150 300 Adamax
131 0.239 0.159 selu 16 3 0.5 150 600 adam
111 0.211 0.16 selu 16 3 0.5 150 300 adam
92 0.206 0.163 selu 16 3 0.5 150 100 adam
150 0.211 0.165 selu 16 3 0.5 150 100 Adamax
150 0.237 0.202 selu 16 5 0.5 150 300 Adamax
150 0.275 0.212 selu 16 5 0.5 150 100 Adamax
94 0.308 0.246 selu 16 5 0.5 150 100 adam
78 0.301 0.268 selu 16 5 0.5 150 300 adam

Talos had not finished running at the time of writing, but we can already see that Adamax works better than Adam on this particular task. Dropout seems to harm performance. Model complexity (# neurons) is also a driver of performance: the higher the complexity, the better the results. We also realise that 150 epochs is probably too few, since a few models do not seem to have converged after 150 epochs.

Note: Talos does not support K-fold cross-validation yet, so we had to use a single validation set (20%) for this task.

Cycles profitability prediction

Motivation

The goal of the project also consists of testing the predictability of the cycles’ profitability. The return of a given cycle is defined as its revenue minus its cost (fees). Profitability is a Boolean value indicating whether the corresponding cycle has a positive return, and is used as the target/label for the classification tasks. 95% of the cycles have a positive return. This imbalance can badly affect the training process: a model that always outputs true would obtain a precision of about 95% despite being meaningless. We propose to deal with the imbalance through class weights in the loss function. The idea is to reweight the samples during training to obtain a 1 to 1 balance between positive and negative data points.
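This reweighting corresponds to the standard "balanced" class weights; a minimal sketch, assuming y_train holds the Boolean labels:

```python
import numpy as np

# Weight each class inversely to its frequency so that positive and negative
# cycles contribute equally to the loss (equivalent to class_weight="balanced").
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {c: len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}
```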

Method and features

Naturally, the models first make use of the embeddings produced by the encoding part of the autoencoders as features. Multiple embeddings are tested (AE, PCA).

In our initial dataset, we also have access to the names of the 3 tokens participating in the cyclic arbitrage, which could potentially be used as extra features! However, machine learning models usually don’t like string features, so let’s encode them. Since we are dealing with a fixed (categorical) set of non-ordered features, a one-hot encoding is probably a good way to go.

For instance, imagine we only have 3 tokens in our dataset :

ETH, DAI and AAVE

Then one could use the following one-hot encoding to represent them. We have 3 tokens, so the encoding is 3-dimensional:

Token Name | Dim 1 | Dim 2 | Dim 3
ETH | 1 | 0 | 0
DAI | 0 | 1 | 0
AAVE | 0 | 0 | 1

From a linear algebra perspective, we observe that all rows have the same norm and are linearly independent, which is what makes this one-hot encoding an excellent choice for our purposes.
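In practice this can be done with pandas; a minimal sketch (the column names holding the 3 token names of each cycle, and the embeddings matrix of shape N x 100, are assumptions):

```python
import pandas as pd

# One-hot encode the 3 token names participating in each cycle.
token_features = pd.get_dummies(
    filtered_cycles_data[["token_a", "token_b", "token_c"]].astype(str),
    prefix=["token_a", "token_b", "token_c"],
)

# Concatenate with the cycle embeddings to build the second feature set.
features = pd.concat([pd.DataFrame(embeddings), token_features], axis=1)
```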

We draw the reader's attention to the fact that these extra features should not be added as an input to the convolutional autoencoder: there is no translation bias to exploit there. In order to ease the performance comparison between the different embedding models, we decided not to use them in any of the embedding-related tasks.

To sum up, two different sets of features are used as input for prediction:

  1. Embeddings.
  2. Embeddings + tokens.

These two types of features are used and the scores are compared to see if the names of the involved tokens bring relevant information to the prediction.

Finally, we need to select a performance metric to compare our models. For binary classification tasks, accuracy is usually chosen. However, it is probably not a good choice here, since we have a strong class imbalance: with accuracy, the trivial model that always outputs true would score highly, which is not what we want. Hence, we will investigate the differences in terms of precision and recall, in particular through the f1-score, which is the harmonic mean of precision and recall.

Different models

Logistic regression

The first model consists of logistic regression. It is fitted on the standardized embeddings using a grid search cross-validation process to tune the hyperparameter C (regularizer). The following confusion matrices (one per type of features) are obtained on the test set :

Embeddings confusion matrix :

/ True(pred) False(pred)
True(real) 657 530
False(real) 25 29

Embeddings + tokens confusion matrix :

/ True(pred) False(pred)
True(real) 829 358
False(real) 36 18

Corresponding f1 scores :

/ Embeddings Embeddings + tokens
f1 score 0.703 0.808
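For reference, the grid-search pipeline for this classifier could be set up along the following lines (the C grid is a hypothetical example):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(class_weight="balanced", max_iter=1000))
grid = GridSearchCV(pipe,
                    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10, 100]},
                    scoring="f1", cv=5)
grid.fit(X_train_features, y_train)     # X_train_features: embeddings (+ token encoding)
print(grid.best_params_)
```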

Support vector machine (SVM)

The second model is a support vector machine trained on the standardized embeddings to find the optimal boundary between profitable and non-profitable cycles. Again, cross-validation is used to tune the hyperparameters. Namely: the kernel of the SVM (linear, rbf, or poly) and the regularizer (C). The selected model produces the following confusion matrix on the test set :

Embeddings confusion matrix :

/ True(pred) False(pred)
True(real) 1192 0
False(real) 49 0

Embeddings + tokens confusion matrix :

/ True(pred) False(pred)
True(real) 1155 37
False(real) 46 3

Corresponding f1 scores :

/ Embeddings Embeddings + tokens
f1 score 0.9799 0.9653

Despite rebalancing the classes, the SVM trained on embeddings alone always predicts true and reaches an f1 score of 0.9799. Even though this model reaches the best score, it is a trivial model. Surprisingly, including the token encoding does not improve the performance when using the SVM as a classifier: the f1 score decreases, but the model no longer behaves trivially (always outputting True).

Global Interpretation

Given the results above, we can conclude that the one-hot encoding of the tokens carries some predictive information, since the f1-score rose from 0.70 to 0.81 in the AE case. We also want to stress that all models experience difficulties when it comes to predicting non-profitable cycles. Please note that the models were trained on a subset of the data because of limited computational capabilities; using the entire dataset would likely improve the scores.

Investigation of the different embeddings performance

To evaluate the performance of our cycle embedding (autoencoder), we propose to study the impact of different embeddings features on the output of a binary classification task, precisely the profitability of a cycle.

The emphasis is not put on finding the best overall model here. The idea is to study the difference in the confusion matrix and metrics (accuracy, f1-score, recall, precision) that occur when the input features change.

If we observe better performance metrics for our autoencoder embedding, this experiment will have provided evidence that our embedding somehow captures the underlying structure of cyclic arbitrages.

Method

To this end, we propose to study 3 different types of embedding :

  1. The AE embedding (base hypothesis)
  2. The PCA embedding (alternative hypothesis 1)
  3. A rule-based embedding using technical indicators (alternative hypothesis 2)

While the first two options have been described in detail in earlier steps of this project, the third one has not.

The following rolling indicators are used :

  1. SMA
  2. Rolling volatility

using two different rolling windows (5 and 20).

These indicators are applied on the following underlying time-series data (for each cycle):

  1. Quote price
  2. Gas price
  3. Log-returns of Quote price
  4. Log-returns of Gas price

In other words, for each cycle, we construct 4 * 2 * 2 = 16 features.

After zero-padding, the resulting tensor is of shape N x 3 x 600 x 16.

Note: the NaNs introduced by the rolling computations are filled with 0 to keep the shapes comparable with the AE features.
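A minimal pandas sketch of these rolling features for one token pair (pair_df is assumed to be sorted by time and to contain the quotePrice and gasPrice columns):

```python
import numpy as np
import pandas as pd

def rolling_features(pair_df: pd.DataFrame) -> pd.DataFrame:
    """Rule-based features for one token pair: SMA and rolling volatility
    (windows 5 and 20) of quotePrice, gasPrice and their log-returns."""
    out = pd.DataFrame(index=pair_df.index)
    series = {
        "quotePrice": pair_df["quotePrice"],
        "gasPrice": pair_df["gasPrice"],
        "quotePrice_logret": np.log(pair_df["quotePrice"]).diff(),
        "gasPrice_logret": np.log(pair_df["gasPrice"]).diff(),
    }
    for name, s in series.items():
        for w in (5, 20):
            out[f"{name}_sma_{w}"] = s.rolling(w).mean()   # simple moving average
            out[f"{name}_vol_{w}"] = s.rolling(w).std()    # rolling volatility
    # 4 series x 2 indicators x 2 windows = 16 features; NaNs are filled with 0 as noted above.
    return out.fillna(0.0)
```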

Again, since we have a massive class imbalance, the accuracy metric needs to be avoided. We will investigate the differences in terms of precision and recall, in particular through the f1-score.

Results

For the standard AE embedding, the confusion matrix is available in section Logistic regression.

The PCA confusion matrix is shown below:

/ True(pred) False(pred)
True(real) 814 373
False(real) 34 20

with an f1-score of 0.800

For the rule-based encoding, we have

/ True(pred) False(pred)
True(real) 977 210
False(real) 42 12

with an f1-score of 0.886

Unfortunately, the conclusions drawn in the Performance Analysis section of the AE training also apply here: the AE is not able to reach the same level of f1-score as the PCA and rule-based embeddings.

We would like to draw the reader's attention to the fact that even though the MSE loss was higher for the AE than for PCA, it is not obvious that the embeddings themselves compare in the same way. Indeed, the AE was trained to reconstruct the input data, not to construct a relevant embedding in the latent dimensions.

Cycle clustering

Motivation and method

Cycles clustering can be understood as an unsupervised method to classify cycles into a given (k) number of categories. The clustering assignments provide a natural way of grouping cycles together. Statistical analysis can be conducted on the clusters to understand the general structural properties of the dataset.

To this end, we will start by studying the output of a standard clustering algorithm named K-means.

Which embedding ?

Given the poor results of the AE embedding in the previous tasks, we first tried to use the PCA embedding in this section. However, for almost all values of k, most of the data points end up clustered together (see the example below for k = 22).

Furthermore, we observed little persistence of the patterns between the training and test sets (especially as k increases). In other words, a KMeans model trained on the training set will probably not hold enough predictive power on the test set, which makes it almost useless in practice. This is mostly true for the k-1 small clusters.

However, and surprisingly, using the AE embedding the performance improves: clusters are much more balanced and we observe more consistent behaviour between the training and test sets. Therefore, we conduct the following analysis using the AE embedding.

Choosing the right value for k

The starting point of the analysis is to understand which values of k (the number of clusters) lead to a relevant clustering assignment. Silhouette and SSE (sum of squared errors) plots are the standard way to go.
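Both quantities can be computed directly with scikit-learn; a sketch, where Z_train stands for the training-set embeddings:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sse, silhouette = {}, {}
for k in range(2, 31):
    km = KMeans(n_clusters=k, random_state=0).fit(Z_train)
    sse[k] = km.inertia_                                  # sum of squared distances to centroids
    silhouette[k] = silhouette_score(Z_train, km.labels_)
```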

Usually, we would look for a clear maximum in the silhouette plot. This is not the case here: the curve has a clear growing trend, but there is no clear reason, at least for our analysis, why we should go above 20 clusters. Hence the silhouette plot is not particularly useful for choosing the best k.

However, the elbow method applied to the SSE graph seems to indicate that the steepest slope is in the range $[0, 22]$. For $k > 22$ there is less evidence that increasing k improves the quality of the clustering.

Individual area silhouette scores are also worth looking at. They can be found in the next plot, sorted and grouped by clusters for convenience.

Since the elbow method did not allow us to exclude any value of k below 22, we need another criterion to narrow down the choice.

Therefore, we propose to further investigate k=4, 9, 16, 22 which may be fair tradeoffs between the goodness of the fit and the number of clusters. In the following plot, we will investigate the quality of the individual cluster to choose our final value.

Alt text

Guidelines to interpret the plot:

  1. Silhouette scores can be interpreted as follows :
    • 1 indicates that the sample is far away from the neighbouring clusters
    • 0 indicates that the sample is on or very close to the decision boundary between two neighbouring clusters
    • <0 indicates that those samples might have been assigned to the wrong cluster.
  2. On the y-axis, the width of each cluster's band is an indicator of the number of data points assigned to the cluster.
  3. The red line indicates the average silhouette score. Bad clusters are those that fail to reach that threshold for all data points assigned to them.

Key observations from the plot

  1. For k = 4:
    • The below-average ratio is 20% (number of below-average clusters divided by k)
    • Only cluster 0 is below average (red dotted line), but it is the largest cluster.
    • The largest clusters (0 and 3) contain negative silhouette scores
  2. For k = 9:
    • The below-average ratio is 33%
    • Negative silhouette scores occur in 33% of the clusters
    • Clusters 4, 5, 8 look terrible (lots of negative values and below-average scores) and contain lots of data points
    • The other clusters are fairly shaped and look decent
  3. For k = 16:
    • The below-average ratio is 25%
    • Negative silhouette scores occur in 31% of the clusters
    • Clusters 1 and 14 are bad
    • Clusters 4 and 7 almost reach the average target
    • All other clusters look good
  4. For k = 22:
    • The below-average ratio is 27%
    • Negative silhouette scores occur in 27% of the clusters
    • In this case only cluster 5 contains very negative values
    • However, most clusters that are below average fail to meet the required threshold by a large margin
    • Clusters contain, on average, fewer data points than for k = 16

Overall, we observe high variability in the silhouette scores for all values of k. There is no clear answer to which value of k is best. However, in our opinion, the plot suggests that k = 16 may be slightly better.

Clustering validation

In this section, we aim to provide the reader with evidence that the clustering contains useful information, i.e. that the cluster assignments are neither random nor trivial (e.g. all cycles ending up in the same cluster).

In other words, using relevant metrics, we need to demonstrate dissimilarities across clusters that are also persistent on the test set.

For instance, we could consider the following :

  1. Number of cycles per cluster: a sanity check to understand the overall quality of the clustering. Indeed, if 99% of cycles are clustered together, we won’t be able to extract meaningful information out of the clustering.
  2. Profit per cluster: we would expect to observe clusters more profitable than others
  3. Profitability per cluster: this metric is related to the risk associated with a cluster. Indeed, some clusters could be less profitable than others (on average) but yield a higher probability of making a profit in the end. This is the type of analysis that we would like to conduct with this metric.
  4. Token distribution per cluster: one desirable property of an interesting clustering would be to observe important differences in terms of token distribution across clusters. For example, computing the median of the distribution would allow us to understand whether only a few (very profitable or unprofitable) tokens are used. Furthermore, the entropy of the distribution can be compared to that of a random clustering.

These metrics, computed on the training set, are shown below.

At first sight, it already looks quite promising. We can group clusters together in terms of behaviour. There are two main trends :

  1. Profitable clusters (group A): 1, 2, 3, 4, 9, 10, 12, 13, 14
  2. Less profitable clusters (group B): 5, 6, 7, 8, 11 and 15

Let’s dive into the details :

  1. Group A appears to be the one generating the largest amount of profits, with slightly better profitability but nothing astonishing.
  2. On the contrary, group B is below average in terms of profits.
  3. There are no outstanding differences in terms of profitability across clusters. However, one should note that the global average is already at 95%, which shows that most of the cycles are profitable anyway.
  4. Clusters 2 and 12 (which belong to group A) contain far more data points than the others.
  5. When it comes to the token distribution entropy, group A is more random than the rest. However, there is no clear difference in terms of the median of the distribution across groups.
  6. The full token distribution plots show that clusters grouped together (A and B) follow a similar distribution. For instance, for group B, there is no bulk at the very left but rather a smaller bulk around mid-right. Since we are dealing with a less profitable group, by hovering over the plot we may be able to identify tokens that do not have a very good overall performance.

In the following set of plots, the same metrics are recomputed, but this time on the test set. Interestingly, the conclusions drawn for the training set extend to the test set, demonstrating some degree of predictability/persistence.

Conclusion

The acquisition of data was time-consuming, though we managed to create a large dataset relative to the one used in the arxiv paper. Data exploration demonstrated the diversity of the data in terms of distribution as well as liquidity.

To build effective features for our machine learning models, we had to perform numerous data wrangling steps. Among others, a token-based scaling was applied and missing data were handled through zero-padding.

The autoencoder gave us a tough ride. We had to test many different training/test sets and architectures to construct a relatively sound model. In the end, only the most liquid tokens are used to compute 100-dimensional embeddings using autoencoders and PCA. Unfortunately, the autoencoders did not reach the expected performance in terms of reconstruction MSE: we did not manage to outperform PCA.

Then, we proposed further tests of the quality of the autoencoder. In particular, we built a binary classifier for the profitability of the cycles. Multiple models (logistic regression, SVM) were contrasted to better understand what performance we can expect on this task. We also noticed that adding the involved tokens has a positive impact on the prediction. However, in the end, the results lean towards the hypothesis that the PCA embedding yields better results than our AE embedding.

Finally, we used clustering methods to analyse the structural properties of cyclic arbitrages. As expected, the clustering based on the embeddings of liquid cycles produced clusters in which profitable cycles were grouped together. In contrast to the previous observations, the most satisfactory clustering assignments were obtained using the AE embedding.

The conducted study can still be improved: the models used for embedding extraction are not satisfying (MSE-wise). Possible areas of improvement are described in the next section.

Further steps

We concluded the project, but there are still many opportunities for improvement. Some of them are described below.

Embedding improvement

Attention Learning

In section Data preprocessing, zero-padding was introduced to standardise the length of each time series. However, the choice of 0s is rather arbitrary and can introduce problems when training the autoencoder (as well as when scaling the data). Indeed, a small computation shows that this padding technique adds 7 500 000 000 zeros, which corresponds to roughly 17% of the training set entries. This means that the autoencoder can achieve a decent loss simply by reconstructing the 0s in the training set.

Moreover, if we keep increasing the number of padded 0s, we can make the MSE arbitrarily close to 0 (a seemingly perfect model).

These undesired behaviours could be addressed by introducing a special token PAD which has no intrinsic value.

A classical way to solve this problem is to use attention/masking mechanisms, as is done in Natural Language Processing models such as the BERT Transformer.

The idea is to define, for each data point, a mask containing 0 and 1 entries that specifies which parts of the data point shall be considered valid by the neural network layers. This mask is then passed along each layer, up to the final loss function, which is only evaluated on the valid (non-padded) entries.

Keras offers two built-in constructs for mask-generating layers: Embedding and Masking.

Using them should increase the quality of the embedding. However, due to time constraints, we did not consider it in this study.
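A minimal sketch of the masking idea in Keras (a mask-consuming recurrent layer is used here purely for illustration; the dense architectures above would require applying the mask manually in the loss):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Time series of one token pair: up to 600 steps of (quotePrice, gasPrice).
inputs = keras.Input(shape=(600, 2))

# Masking flags time steps whose features all equal mask_value (the padding),
# so downstream mask-aware layers ignore them.
masked = layers.Masking(mask_value=0.0)(inputs)
encoded = layers.LSTM(100)(masked)     # the mask is propagated to this layer

encoder = keras.Model(inputs, encoded)
```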

Circular Convolutions

In the sections above, we defined a fairly complex convolutional architecture for the autoencoder, using standard zero-padded convolutions. However, due to the cyclic nature of the arbitrages, a possible improvement would be to use circular convolutions instead. Since no default implementation is available in Keras, a custom layer needs to be defined.

Since this process is time-consuming and error-prone, we decided to stick with the basic Keras implementation, but circular convolution is definitely a step worth exploring.
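A sketch of what such a custom layer could look like (wrap-around padding followed by a 'valid' convolution, applied along the cycle dimension):

```python
import tensorflow as tf
from tensorflow.keras import layers

class CircularConv1D(layers.Layer):
    """1D convolution with circular (wrap-around) padding, e.g. along the
    cycle dimension of length D = 3."""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.kernel_size = kernel_size
        self.conv = layers.Conv1D(filters, kernel_size, padding="valid")

    def call(self, x):
        # Wrap the first (kernel_size - 1) steps around to the end so that a
        # 'valid' convolution keeps the original sequence length.
        pad = self.kernel_size - 1
        x = tf.concat([x, x[:, :pad, :]], axis=1)
        return self.conv(x)
```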

In-depth comparison

The performance of the encoder could be compared with more complex embedding techniques than PCA, such as Fourier/wavelet transforms or the signature transform.

Clustering improvement

If we increase the quality of the embedding, the clustering quality should increase as well. Furthermore, we could back-test the clustering algorithm by comparing KMeans with DBSCAN, for instance.

Resources

  1. Attention learning in Keras
  2. Selecting the number of clusters with silhouette analysis
  3. F1-score
  4. Autoencoders