CLeaR: An Adaptive Continual Learning Framework for Regression Tasks

Catastrophic forgetting means that a trained neural network model gradually forgets the previously learned tasks when being retrained on new tasks. Overcoming the forgetting problem is a major problem in machine learning. Numerous continual learning algorithms are very successful in incremental learning of classification tasks, where new samples with their labels appear frequently. However, there is currently no research that addresses the catastrophic forgetting problem in regression tasks as far as we know. This problem has emerged as one of the primary constraints in some applications, such as renewable energy forecasts. This article clarifies problem-related definitions and proposes a new methodological framework that can forecast targets and update itself by means of continual learning. The framework consists of forecasting neural networks and buffers, which store newly collected data from a non-stationary data stream in an application. The changed probability distribution of the data stream, which the framework has identified, will be learned sequentially. The framework is called CLeaR (Continual Learning for Regression Tasks), where components can be flexibly customized for a specific application scenario. We design two sets of experiments to evaluate the CLeaR framework concerning fitting error (training), prediction error (test), and forgetting ratio. The first one is based on an artificial time series to explore how hyperparameters affect the CLeaR framework. The second one is designed with data collected from European wind farms to evaluate the CLeaR framework's performance in a real-world application. The experimental results demonstrate that the CLeaR framework can continually acquire knowledge in the data stream and improve the prediction accuracy. The article concludes with further research issues arising from requirements to extend the framework.


Introduction
In the late 1980s, McCloskey and Cohen [1] and Ratcliff [2] observed a phenomenon where the well-learned knowledge of connectionist models is erased by new knowledge under specific conditions when the models learn new tasks successively. It is referred to as Catastrophic Forgetting or Catastrophic Interference. The challenge is supposed to be a general problem existing in different kinds of neural networks, e.g., backpropagation neural networks and unsupervised neural networks. In a fully-connected neural network, each neuron of a layer is connected to all neurons of the next layer. Weights control the strength of the connection between two neurons. A weight vector can be expressed as a column vector $w = (w_1, w_2, \ldots, w_N)^T$ with N values. Some weights have a significant influence on more than one task. For example, $w_1, w_2, w_3$ are important for task 1 and $w_1, w_4, w_5$ are important for task 2. In this case, the overlapped weight $w_1$ could be adjusted while task 2 is learned sequentially, which is one of the main reasons for Catastrophic Forgetting.
* Correspondence: yujiang.he@uni-kassel.de. Intelligent Embedded Systems (IES) Group, University of Kassel, Wilhelmshöher Allee 71-73, Kassel, Germany. Full list of author information is available at the end of the article.
Overcoming the forgetting problem is a crucial step in implementing real intelligence. Models require plasticity for learning and integrating new knowledge as well as stability for consolidating what models have learned previously. Excessive plasticity can cause the acquired knowledge to be erased while learning new tasks. On the other hand, successively learning new tasks can become more challenging due to extreme stability. It is the so-called stability-plasticity dilemma [3].
Many researchers have proposed continual learning (CL) algorithms to solve the problem, such as in [4], [5], and [6]. A three-way categorization for the most common CL strategies is described in [6]: (1) regularization strategies, (2) rehearsal strategies, and (3) architectural strategies. Similarly, CL strategies are grouped as (1) prior-focused approaches, (2) likelihood-focused approaches, and (3) dynamic architectures in [7]. These categorizations standardize terminologies and outline a distinct research direction for the CL community. CL algorithms have been proven successful in supervised learning and reinforcement learning to train several tasks sequentially without forgetting the acquired ones. Application scenarios cover handwriting recognition [8,9,10], image classification [5], sequentially learning to play games in a reinforcement learning setting [4] and much more.
To the best of our knowledge, some of the standard CL benchmarks are reconstructions of well-known datasets, such as permuted MNIST [11,12], where the CL tasks are obtained by scrambling the pixel positions in the MNIST dataset [13]. Moreover, some datasets are explicitly generated to evaluate CL algorithms, for example, CORe50 [14] for continuous object recognition.
Most of the past research focused mainly on classification tasks rather than regression tasks, where the Catastrophic Forgetting problem usually occurs as well. For example, establishing regional smart grids requires power generation and consumption forecasts. In [15] and [16], neural networks are used to forecast renewable energy generation with weather prediction data. Note that weather data is non-stationary, where the probability distribution changes over time, e.g., from summer to winter. In this situation, training neural networks has to be delayed until sufficient data is collected. Otherwise, the networks could be overfitted to the limited training dataset. Furthermore, unseen situations, e.g., the extreme weather conditions, updating/damaging/ageing of generators, and climate changes, can be called special events. The neural networks have to update themselves by continually learning these situations when they appear in the application. Besides, power consumption regarding a household or a factory is also easily affected by unpredictable things, such as purchasing new equipment or hiring more employees. These factors can change the obtained mapping between inputs and outputs. Under these conditions where historical data may be private, unrecorded, or too cumbersome to be retrained, the trained models have to learn new knowledge and consolidate the previously-stored internal representations without the help of old data.
The main contribution of this article is to propose a framework called CLeaR (Continual Learning for Regression Tasks) for continually learning the identified changes in non-stationary data streams. Moreover, the framework is tested in two sets of experiments to assess its performance and analyze its hyperparameters' effects. The CLeaR framework consists of neural networks for prediction and buffers for storing new data. We calculate an error between the prediction and the corresponding true value at each point in time. The new data is labeled by comparing the error to a dynamically adjustable threshold. If the error is larger than the threshold, the data is labeled as a novelty and stored in a finite novelty buffer; otherwise, it is stored as a familiarity in an infinite familiarity buffer. When the novelty buffer is full, updating will be triggered. The network will be retrained on the dataset in the novelty buffer using CL. The retrained network will then be tested on the familiarity dataset to evaluate how much old knowledge is retained. After updating, the threshold needs to be re-estimated for the following learning step. The re-estimation depends on the performance of the updated network on the datasets of both buffers. Afterwards, the buffers will be emptied. Updating will be repeated whenever the novelty buffer is filled again.
The remainder of the article will review the literature regarding CL algorithms and applications. Then we will outline the proposed framework's fundamental structure and give an insight into the experimental details. Furthermore, we will analyze the experimental results. This article will conclude with our findings and provide an outlook for future research.

Related work
This section will start with a brief overview of the recent academic literature regarding approaches and experimental setups designed for CL classification applications.
The changes in data or goals can be defined as new tasks in the CL community. For example, a model is expected to learn new instances of the same class while retaining its knowledge about the previous instances, or to learn new instances of different classes without losing accuracy on previous classes, or to learn new instances of the known and unknown classes. These are defined as different CL scenarios in [14]. In both [6] and [7], CL algorithms are categorized into three groups in a similar way:
• Prior-focused approaches denote that the posterior probability of N tasks is proportional to the product of the likelihood of the Nth task and the posterior probability of the first N − 1 tasks, as
$$P(\Theta \mid D_{1:N}) \propto P(D_N \mid \Theta)\, P(\Theta \mid D_{1:N-1}), \tag{1}$$
where $D_N$ denotes the dataset of the Nth task and $D_{1:N-1}$ the datasets of the first N − 1 tasks. As a regularization, the posterior probability of the first N − 1 tasks is added to the loss function to avoid changing the weights that are important for previous tasks. Well-known prior-focused algorithms include Elastic Weight Consolidation (EWC) [4], Synaptic Intelligence (SI) [5], and Variational Continual Learning [8]. In [9], a generalization of EWC++ and SI was proposed, which is referred to as the RWalk algorithm.
• Likelihood-focused approaches require a subset of randomly selected samples from the original dataset of the previous N − 1 tasks, or a dataset generated by a generative network that has learned the tasks, see [17], [7], and [10].
• Dynamic architectures enable neural networks to learn CL tasks sequentially by adjusting the networks' architecture for specific applications. Progressive Networks, Learning Without Forgetting (LWF), and Less-Forgetting Learning have been introduced in [18,19,20], respectively.
The above algorithms have been evaluated in multitask scenarios with the reshaped versions of famous datasets, e.g., MNIST and CIFAR-10/CIFAR-100. In these scenarios, a model learns a new, isolated task in a sequence while remembering how to solve the learned tasks. However, there are no class overlaps among the different tasks. For example, in [5] the MNIST dataset is split into five tasks, each of which contains two labels (two digits). The model can classify data to the correct group only if the information regarding the current task is given. In this case, the model learns how to solve a series of discrete tasks rather than keep learning knowledge to address incremental problems. The experimental setups and the datasets do not allow for a fair comparison among the CL algorithms. Lomonaco et al. [14] create CORe50 specifically for single-incremental-task scenarios, which can be seen as a test benchmark for continuous object recognition. Similarly, the iCubWorld benchmark [21] is designed for robotic vision challenges, where comparison among various CL approaches is feasible.
Besides, continual learning should be considered in regression as well. In [22], He et al. propose two CL application scenarios for establishing regional smart grids: the task-domain incremental scenario and the data-domain incremental scenario. The scenarios are applicable for forecasting power, including renewable energy generation and power consumption in the middle-/low-voltage grid. Moreover, performances of four CL algorithms (EWC, Online-EWC, SI, and LWF) are evaluated concerning accuracy, forgetting ratio, and training time in the two scenarios. However, prior knowledge about new tasks is given in their experimental setup, which means that models know when new tasks will occur without novelty detection. Therefore, this setup is incompatible with the real world.
In [23], Farquhar et al. conclude that an inappropriate experimental design could misrepresent the performances of the well-known CL approaches. Therefore, they suggest five requirements for evaluating CL algorithms and demonstrate their necessities. The five requirements are: (1) cross-task resemblances; (2) shared output head; (3) no test-time assumed task labels; (4) no unconstrained retraining on old tasks; and (5) more than two tasks.
These suggestions are worth being considered in our experimental setup and inspire us to design the CLeaR framework. (1) Most novelties are due to changes of data P (X) or targets P (Y |X, Θ). The dataset of every full novelty buffer can be viewed as a new task that resembles the previous tasks. (2) The neural network outputs the power value prediction, and the new tasks will not require a change of the network's architecture. (3) The prior knowledge regarding the new task, such as when the new task appears or what the distribution of the new task is, is unknown in the application. Updating is triggered automatically only when the finite novelty buffer is filled in our experimental setup. (4) Considering that privacy laws might prohibit the long-term storage of historical datasets, we retrain the neural network only on the dataset newly collected in applications and delete it after updating. (5) More tasks will appear as the probability distribution or the mapping between inputs and outputs changes over time.

Power forecasts using deep neural networks
At the beginning of this section, we list the chosen mathematical notations in Table 1. It can help readers to understand the following mathematical expressions. Besides, we use superscript T to denote the transpose of a matrix or a vector and T to denote the number of tasks.
Deep neural networks are a class of machine learning models inspired by biological neural networks that model nonlinear dependencies in high-dimensional data. Compared with traditional dimensionality reduction techniques for high-dimensional data, such as principal component analysis (PCA), the multiple deep layers of a neural network can extract representations efficiently from massive data to provide predictive performance gains.
The rest of this section will start with formulating the problem. Moreover, this section will illustrate the probability distribution changes in the experimental dataset over time and clarify how the CLeaR framework works in a general power forecasting workflow.

Problem formulation
Table 1 Notations.
Symbol                         Definition
$X$                            The N × M matrix with N feature samples
$Y$                            The matrix with N measurements associated with X
$\hat{Y}$                      The matrix with N predictions associated with X
$x_n$                          The nth column feature vector, $x_n^T = X_{n,:}$
$y_n$                          The nth measurement, $y_n = Y_{n,:}$
$\hat{y}_n$                    The nth prediction, $\hat{y}_n = \hat{Y}_{n,:}$
$\Theta_l$                     The weight matrix of the lth layer
$f_l$                          The activation function of the lth layer
$f_\Theta$                     The neural network with given weight matrix Θ
$z_l$                          The output column vector of the lth layer
$\mathcal{L}$                  Loss function
$P$                            Probability density
$\mathcal{N}(\mu, \sigma^2)$   Gaussian distribution with mean µ and variance σ²
$F_t$                          The Fisher information matrix of the tth task
$F_t^i$                        The ith diagonal element of $F_t$

Deep neural networks can output a prediction $\hat{y}_n$ given a high-dimensional input $x_n$. The goal of training is to find a mapping between $x_n$ and $y_n$. A general deep neural network consists of a series of hidden layers, which can be formulated as:
$$z_l = f_l\left(\Theta_l^T z_{l-1}\right), \tag{2}$$
with an output column vector $z_{l-1}$ of the (l − 1)th layer, where $z_0 = x_n$. Here $\Theta_l^T$ denotes the transpose of the weight matrix $\Theta_l$, whose dimension is the dimension of the (l − 1)th layer by the dimension of the lth layer. A prediction $\hat{y}_n$ of the deep neural network is then
$$\hat{y}_n = z_L = f_L\left(\Theta_L^T z_{L-1}\right), \tag{3}$$
where L is the number of layers. Training a neural network is to minimize the defined loss function. In this section, we explain the process with an example of Mean Square Error (MSE), i.e.,
$$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2. \tag{4}$$
In order to avoid overfitting, a regularization term is usually added to Eq. 4, as
$$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2 + \lambda R(\Theta), \tag{5}$$
where λ ∈ (0, ∞) is a hyperparameter that weights the contribution of the penalty term R(Θ). Different choices for R(Θ) can result in different solutions. From a probabilistic perspective, according to chapter 9 in [24], we can assume that we are given an input $x_n$ and the corresponding noisy observation $y_n = \hat{y}_n + \epsilon$. More specifically, we assume further that the noise $\epsilon$ is independent and identically distributed, following a Gaussian distribution with zero mean and variance σ². Therefore, the regression problem can be considered with a likelihood function:
$$P(y_n \mid x_n, \Theta) = \mathcal{N}\left(y_n \mid \hat{y}_n, \sigma^2\right). \tag{6}$$
When we are given the datasets X and Y, Eq. 6 can be expressed as
$$P(Y \mid X, \Theta) = \prod_{n=1}^{N} P(y_n \mid x_n, \Theta) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid \hat{y}_n, \sigma^2\right), \tag{7}$$
where it is assumed that $y_i$ and $y_j$ are conditionally independent given their feature vectors $x_i$ and $x_j$. To avoid overfitting during training, we seek parameters Θ that maximize the posterior distribution P(Θ|X, Y) instead of the likelihood. We can obtain the posterior distribution by applying Bayes' theorem as
$$P(\Theta \mid X, Y) = \frac{P(Y \mid X, \Theta)\, P(\Theta)}{P(Y \mid X)}. \tag{8}$$
Note that the posterior distribution depends on the given X and Y. If the statistical properties of the distributions change, such as the mean or the variance, new parameter values will become optimal.
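The link between the Gaussian likelihood and the MSE loss can be made explicit with one worked step. The following is a standard derivation rather than a formula taken from the article; it assumes that the network output $f_\Theta(x_n) = \hat{y}_n$ is the mean of the Gaussian likelihood and that σ² is fixed:

```latex
\begin{align*}
-\log P(Y \mid X, \Theta)
  &= -\sum_{n=1}^{N} \log \mathcal{N}\!\left(y_n \mid f_\Theta(x_n), \sigma^2\right) \\
  &= \frac{N}{2}\log\!\left(2\pi\sigma^2\right)
     + \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(y_n - f_\Theta(x_n)\right)^2 .
\end{align*}
```

The first term does not depend on Θ, so maximizing the likelihood is equivalent to minimizing the sum of squared errors. Maximizing the posterior instead adds the term −log P(Θ), which plays the role of the regularizer; for example, a zero-mean Gaussian prior yields an L2 penalty.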
Change in probability distribution
A learning task can be described as changes in distributions of data P(X) or targets P(Y|X, Θ). In our experimental setup, meteorological features and the power measurements are viewed as the inputs and outputs of the neural network, respectively. Some meteorological features, e.g., temperature and wind direction, fluctuate with a yearly period. Also, renewable energy generation depends on meteorological conditions. Therefore, the joint probability distribution P(X, Y) changes period by period. Mathematically, the change can be expressed as follows:
$$P_{t_0}(X, Y) \neq P_{t_1}(X, Y), \tag{9}$$
where X and Y represent a batch of features and power measurements respectively, which are measured from a non-stationary data stream. If two measurement periods $t_0$ and $t_1$ are far apart, both distributions are different due to concept drift. The formula:
$$P(X, Y) = P(Y \mid X)\, P(X) \tag{10}$$
demonstrates that P(X, Y) is affected by the change in the probability distribution of the inputs and the obtained mapping. Figs. 1 and 2 illustrate the change regarding P(X) and P(Y|X) based on data from a European wind farm dataset. A sample of the dataset contains seven-dimensional meteorological features and a scalar power value. The first two principal components are extracted from the seven features using PCA and labeled as X1 and X2. 64.19% of the overall variance is explained by the two components. We split 10000 samples from the dataset into four sections sequentially over time, each of which has 2500 samples. The distribution of each section regarding X1 and X2 is plotted in Fig. 1. Similarly, the distributions of four sections regarding X1 and the power Y are shown in Fig. 2. The distributions of P(X1, X2) and P(X1, Y) with the same 10000 samples are plotted in Fig. 1(e) and in Fig. 2, respectively. The two-dimensional probability distribution of P(X1, X2) with 10000 samples shows a circular shape, as shown in Fig. 1(e). From Figs. 1(a) to 1(d), the center of the contour line regarding each section moves clockwise along the circle.
This movement illustrates the periodic change in the P (X) over time. Similarly, we can observe that the center of the distribution P (Y |X1) moves from right to left over time, as shown in Fig. 2.
Combined with Eq. 9 and the observations, we conclude that a periodic change exists in the dataset. The model needs to be updated in applications if it was pre-trained only on a limited dataset.
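The PCA step and the sequential split described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `pca_explained_variance` is hypothetical, and the random matrix `X` is only a stand-in for the 10000 seven-dimensional wind farm feature vectors, which are not reproduced here.

```python
import numpy as np

def pca_explained_variance(X, n_components=2):
    """Project X onto its leading principal components.

    Returns the projected data and the fraction of the total variance
    explained by each retained component.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    # SVD of the centered data: the rows of Vt are the principal axes
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)                  # variance along each axis
    ratio = var / var.sum()
    Z = Xc @ Vt[:n_components].T                 # component scores (X1, X2)
    return Z, ratio[:n_components]

rng = np.random.default_rng(0)
# Hypothetical stand-in for 10000 seven-dimensional feature vectors
X = rng.normal(size=(10000, 7)) @ rng.normal(size=(7, 7))
Z, ratio = pca_explained_variance(X)

# Four sequential sections of 2500 samples each, as in the text
sections = np.split(Z, 4)
centers = [section.mean(axis=0) for section in sections]
```

On the real data, tracking `centers` from section to section would show the movement of the distribution that the figures visualize with contour lines.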
We define an exception as an outcome of an unknown process that deviates from the expected model. For example, wind power generators are automatically shut down for protection under extreme weather conditions, such as typhoons or storms. Exceptions have to be identified first as the data arrives. If they exist in the data stream, they need to be reported and processed manually. Although the dataset used here has been cleaned in the preprocessing phase, these exceptions might appear in applications. Labeling and learning exceptions is one of the research challenges in the field of active learning, which is beyond the scope of this article and can be researched further in the future.
The CLeaR framework contains the models for prediction and the buffers for storage. Once the measurements Y are available, the corresponding data is labeled and stored into one of the two buffers depending on the preset threshold and the error. We choose the MSE here for supervised regression tasks, but the method can be customized in other scenarios, e.g., probabilistic forecasts. The data in the novelty buffer covers the change of probability distribution detected in the data stream. It is used for retraining the models when the update is triggered. The data in the familiarity buffer has information that the models are familiar with. It can be used for testing whether the models still retain the old knowledge after updating. Updating models can be considered as an accumulation of knowledge for improving prediction accuracy.

CLeaR
The details of the CLeaR framework will be explained through the instance used in our experiments, as shown in Fig. 4. In this instance, the block Models contains an autoencoder and a fully-connected neural network for detecting the changes of the distributions P (X) and P (Y |X) and for forecasting power values. Data is labeled as novelty or familiarity by comparing the MSE to the threshold. Besides, we adopt Online-EWC to update the models and adjust the threshold dynamically after each update. We suggest that the components of the CLeaR framework should be selected flexibly depending on the specific application scenario.

Models
An autoencoder is a neural network that usually consists of two symmetric parts with a bottleneck between them. In an undercomplete autoencoder, the bottleneck has a smaller dimension than the input layer, which helps to extract latent representations z from the input. An autoencoder can reconstruct the input at the output, rather than simply copy the input [25]. The encoder and the decoder can be formulated as $z_n = f_\Theta(x_n)$ and $\hat{x}_n = g_\Phi(z_n)$, where Θ and Φ are the parameter matrices. The optimization goal is to minimize the loss function,
$$\mathcal{L}\left(x_n, g_\Phi(f_\Theta(x_n))\right), \tag{11}$$
by penalizing the reconstruction for being different from the input. The change of distribution P(X), as shown in Fig. 1, can be detected by the reconstruction error of the autoencoder.
At the next step, the extracted representation z n is fed into the predictor. As explained in Eqs. 4 and 5, optimizing the network in the general supervised setting is to minimize the MSE. Thus a true measurement y n is required. The predictor, a fully-connected neural network, can be replaced by other networks, e.g., LSTM. We can also drop the predictor in applications where only the reconstruction is needed, as we will introduce in experiment 1.

Buffers
Every neural network that needs to be updated during its application owns a limited novelty buffer and an unlimited familiarity buffer. When true target values are provided, the MSE can be used as a criterion to be compared to the preset threshold. The samples with a small MSE are stored in the familiarity buffer because the trained network has learned to cope with them before. The samples with a large MSE are stored in the novelty buffer. The network needs to be retrained based on these novelties to learn new knowledge. After updating, a validation error can be calculated using the familiarities to estimate whether the network can still retain the old knowledge acquired previously. How to deal with a poor update result remains an open question and needs to be further discussed. We empty both buffers after finishing an update.
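The routing of samples into the two buffers can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `Buffers`, its method names, and the use of a per-sample squared error for a scalar target are all hypothetical choices.

```python
class Buffers:
    """A finite novelty buffer and an unbounded familiarity buffer."""

    def __init__(self, novelty_size, threshold):
        self.novelty_size = novelty_size   # capacity that triggers an update
        self.threshold = threshold         # preset error threshold
        self.novelty = []
        self.familiarity = []

    def add(self, x, y, y_pred):
        """Route one sample by its squared error.

        Returns True when the novelty buffer is full, i.e. an update is due.
        """
        error = (y - y_pred) ** 2
        if error > self.threshold:
            self.novelty.append((x, y))
        else:
            self.familiarity.append((x, y))
        return len(self.novelty) >= self.novelty_size

    def clear(self):
        """Empty both buffers after an update has finished."""
        self.novelty.clear()
        self.familiarity.clear()


# Toy stream of (input, measurement, prediction) triples
buf = Buffers(novelty_size=2, threshold=0.1)
stream = [(0, 1.0, 0.95), (1, 1.0, 0.2), (2, 0.0, 0.9)]
triggers = [buf.add(x, y, y_pred) for x, y, y_pred in stream]
```

In this toy stream the first sample is a familiarity and the next two are novelties, so the update is triggered on the third sample. After retraining on `buf.novelty` and validating on `buf.familiarity`, `buf.clear()` would be called.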

Threshold
Each model that needs to be updated owns a threshold. As shown in Fig. 4, Threshold a is for the autoencoder and Threshold p is for the predictor. The value of the threshold determines how new samples are classified. The smaller the threshold, the more samples are likely to be labeled as novelties, so more well-learned knowledge will be re-learned, leading to unnecessary updates. A larger threshold could cause inefficient updates because too many novelties are misclassified as familiarities. We suggest that the threshold value should be adjusted dynamically depending on the training results. Fig. 5 illustrates a distribution of errors for all samples after learning. $\mathrm{MSE}_{\min}$ refers to the minimum MSE obtained by minimizing a loss function, which can also be viewed as the mean of the distribution. A distribution with a lower mean and a lower variance indicates that the model has learned the given dataset better. In this article, we adjust the threshold by
$$\text{threshold} = \alpha \cdot \mathrm{MSE}_{\min}, \tag{12}$$
where α is a fixed threshold factor and $\mathrm{MSE}_{\min}$ is re-estimated after each update.
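The threshold re-estimation can be sketched as a small helper. This is a hedged illustration: it assumes that the threshold is the product of the factor α and MSE_min, and that MSE_min is estimated as the mean per-sample squared error of the updated model over the samples of both buffers. The function name `reestimate_threshold` and its arguments are hypothetical.

```python
import numpy as np

def reestimate_threshold(errors_novelty, errors_familiarity, alpha):
    """Return alpha * MSE_min after an update.

    Assumption: MSE_min is taken as the mean squared error of the
    updated model over the samples of both buffers.
    """
    errors = np.concatenate([errors_novelty, errors_familiarity])
    mse_min = errors.mean()
    return alpha * mse_min

# Per-sample squared errors of the updated model (toy values)
new_threshold = reestimate_threshold(
    np.array([0.2, 0.4]), np.array([0.1, 0.1]), alpha=1.5
)
```

With these toy values the mean error is 0.2, so the new threshold is 0.3. A larger α makes the framework more tolerant before a sample counts as a novelty.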

Update
The updating method and the trigger condition play a crucial role in the CLeaR framework. As mentioned in the Section about related work, retraining on old tasks should be constrained due to reasons, such as privacy protection or data storage overhead. Therefore, we adopt Online-EWC, which penalizes the loss when the overlapped significant weights are changed while learning new tasks. We adapt the notation given in [26] to explain EWC and Online-EWC in the context of the CLeaR framework.

EWC
The goal of EWC is to approximate Bayesian posteriors over model parameters given tasks. In CLeaR, data is always split according to two kinds of tasks, the known task ($\mathcal{T}_k$) and the unknown task ($\mathcal{T}_u$). $\mathcal{T}_k$ refers to what the neural network has already learned. The corresponding dataset $D_{k,T-1}$ is a combination of the datasets that were stored in the novelty and the familiarity buffers during the (T − 1)th update. Note that novelties in $D_{k,T-1}$ have already been learned before the Tth update. Therefore, they belong to the known task. Comparably, $\mathcal{T}_u$ refers to what the network will learn at the Tth update. The corresponding $D_{u,T}$ refers only to the dataset stored in the novelty buffer for triggering the Tth update. Chronologically, the (T − 1)th update occurs before the Tth update. From Eq. 1, we can get the posterior after learning $\mathcal{T}_u$, as:
$$P(\Theta \mid D_{u,T}, D_{k,T-1}) = \frac{P(D_{u,T} \mid \Theta)\, P(\Theta \mid D_{k,T-1})}{P(D_{u,T} \mid D_{k,T-1})}. \tag{13}$$
By converting Eq. 13 into a logarithm, we will get
$$\log P(\Theta \mid D_{u,T}, D_{k,T-1}) = \log P(D_{u,T} \mid \Theta) + \log P(\Theta \mid D_{k,T-1}) - \log P(D_{u,T} \mid D_{k,T-1}). \tag{14}$$
The term $\log P(D_{u,T} \mid \Theta)$ is generally tractable through minimizing an MSE loss with respect to Θ and the dataset $D_{u,T}$. In the case where the trained network can perform very well on the known task $\mathcal{T}_k$, we have obtained the optimal parameters $\Theta^*_{k,T-1}$, which make the gradient of $-\log P(\Theta \mid D_{k,T-1})$ with respect to Θ equal to 0. Therefore, $-\log P(\Theta \mid D_{k,T-1})$ can be estimated using a 2nd-order Taylor series around $\Theta^*_{k,T-1}$:
$$-\log P(\Theta \mid D_{k,T-1}) \approx \frac{1}{2}\,\Delta(\Theta)^T H(\Theta^*_{k,T-1})\,\Delta(\Theta) + \text{const}, \tag{15}$$
where $\Delta(\Theta) = \Theta - \Theta^*_{k,T-1}$. $H(\Theta^*_{k,T-1})$ is the Hessian of $-\log P(\Theta \mid D_{k,T-1})$ with respect to Θ, evaluated at the optimum $\Theta^*_{k,T-1}$. Furthermore, we can approximate the Hessian as:
$$H(\Theta^*_{k,T-1}) \approx N_k\, F(\Theta^*_{k,T-1}) + H_{prior}(\Theta^*_{k,T-1}), \tag{16}$$
where $N_k$ is the number of samples in $D_{k,T-1}$, $F(\Theta^*_{k,T-1})$ is the empirical Fisher information matrix on the known dataset $D_{k,T-1}$, and $H_{prior}(\Theta^*_{k,T-1})$ is the Hessian of the negative log prior with respect to Θ. EWC estimates the Fisher information matrix in a high-dimensional space as a diagonal matrix, i.e., the non-diagonal elements are zero [4].
The diagonal Fisher information values are denoted by $F^i_k$ concerning the ith parameter in the network. Thus the formula of EWC is
$$\mathcal{L}(\Theta) = -\log P(D_{u,T} \mid \Theta) + \frac{1}{2}\sum_i \left(\sum_{t=1}^{T-1} \lambda_t F^i_{k,t} + \lambda_{prior}\right)\left(\theta^i - \theta^{*,i}_{k,T-1}\right)^2, \tag{17}$$
where the update has been done T − 1 times already. $D_{u,T}$ is the dataset stored in the novelty buffer for the unknown task at the Tth update. $F^i_{k,t}$ is the diagonal Fisher information with respect to the parameter i, estimated on the dataset $D_{k,t}$ stored in both buffers after the tth update. $\theta^{*,i}_{k,T-1}$ is the optimal parameter i learned from the dataset $D_{k,T-1}$ after the (T − 1)th update. $\lambda_t$ and $\lambda_{prior}$ are hyperparameters for EWC. Note that all of the diagonal Fisher information $F_{k,t}$ in Eq. 17, obtained after every update, must be stored.
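The quadratic EWC penalty can be illustrated on a model small enough to write out by hand. The sketch below is an assumption-laden toy, not the article's network: it uses a linear-Gaussian model y ~ N(Xθ, σ²) so that the per-sample gradient is available in closed form, it replaces the negative log-likelihood of the novelty data with the MSE (the scale difference is absorbed by the λ values), and the names `diagonal_fisher` and `ewc_loss` are hypothetical.

```python
import numpy as np

def diagonal_fisher(X, y, theta, sigma2=1.0):
    """Empirical diagonal Fisher for y ~ N(X @ theta, sigma2).

    Averages the squared per-sample gradient of the log-likelihood.
    """
    residual = y - X @ theta                      # shape (N,)
    grads = (residual[:, None] * X) / sigma2      # d log p / d theta, (N, M)
    return (grads ** 2).mean(axis=0)

def ewc_loss(theta, X_new, y_new, fishers, lambdas, theta_star, lambda_prior=0.0):
    """MSE on the novelty data plus the quadratic EWC penalty."""
    mse = np.mean((y_new - X_new @ theta) ** 2)
    # One stored Fisher vector per previous update, each with its own weight
    weight = sum(lam * F for lam, F in zip(lambdas, fishers)) + lambda_prior
    penalty = 0.5 * np.sum(weight * (theta - theta_star) ** 2)
    return mse + penalty

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])           # optimum of the known task
X_old = rng.normal(size=(200, 3))
y_old = X_old @ theta_star + 0.1 * rng.normal(size=200)
F = diagonal_fisher(X_old, y_old, theta_star)

X_new = rng.normal(size=(50, 3))                  # novelty buffer content
y_new = X_new @ np.array([1.0, -1.0, 0.5])
loss_at_star = ewc_loss(theta_star, X_new, y_new, [F], [1.0], theta_star)
```

At `theta_star` the penalty vanishes, and it grows quadratically as parameters with large Fisher values move away from the previous optimum. The list `fishers` makes the storage cost of plain EWC visible: it grows by one vector per update.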

Online-EWC
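The regularization term in EWC can be replaced by one Gaussian approximation to the whole posterior of all previous tasks, as proven in [26]. Based on this derivation, Online-EWC is proposed in [27], as:
$$\mathcal{L}(\Theta) = -\log P(D_{u,T} \mid \Theta) + \frac{1}{2}\sum_i \left(\lambda \tilde{F}^i_{k,T-1} + \lambda_{prior}\right)\left(\theta^i - \theta^{*,i}_{k,T-1}\right)^2, \tag{18}$$
with regularization strength λ, where $\tilde{F}^i_{k,T} = \gamma \tilde{F}^i_{k,T-1} + F^i_{k,T}$ and $\tilde{F}^i_{k,1} = F^i_{k,1}$. γ is a hyperparameter that governs the contribution of previous tasks and is not larger than 1. Therefore, $\tilde{F}^i_{k,T-1}$ contains the diagonal Fisher information of all updates that precede the Tth update.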
By contrast with EWC, Online-EWC requires less space to store diagonal Fisher information and optima of each previous task. Another similar algorithm is EWC++ introduced in [9], where the diagonal Fisher information is defined as: F i k,T −1 = α F i k,T −1 + (1 − α)F i k,T with 0 < α < 1. In our experiments, Online-EWC is selected for updates within the CLeaR framework.
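The storage advantage of Online-EWC comes from keeping a single running Fisher vector instead of one vector per update. A minimal sketch of that accumulation, reading the recursion above as "decay the running estimate by γ, then add the newest estimate", could look as follows; the function name `online_ewc_fisher` and the toy Fisher vectors are hypothetical.

```python
import numpy as np

def online_ewc_fisher(running, new, gamma):
    """Decay the running diagonal Fisher by gamma, then add the new estimate.

    For the first update there is no running estimate yet, so the new
    estimate is used as is.
    """
    return new if running is None else gamma * running + new

# Toy diagonal Fisher estimates from three consecutive updates
fishers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]

running = None
for F in fishers:
    running = online_ewc_fisher(running, F, gamma=0.5)
```

With γ = 0.5 the contribution of the first update to the first parameter has shrunk to 0.25 after three updates, which illustrates the gradual decay of older Fisher information when γ is smaller than 1.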

Trigger condition
The trigger condition for updating the model is another customizable parameter. In our realization of the framework, the update will be triggered when the finite novelty buffer is filled. The smaller the size of this novelty buffer, the more easily this buffer gets full. On the one hand, fewer data in a full buffer is available for updating, which might lead to a failed update. On the other hand, more updates will result in increasing computational overhead. Moreover, note that the hyperparameter γ of Online-EWC is not larger than 1, which causes a gradual decay of the previous Fisher information when the number of updates increases. If the size of the buffer is too large, updating the network will be delayed until the buffer is full. The delay also results in the fact that the threshold can not be adjusted frequently.

Experiments
In this section, two sets of experiments will be performed and analyzed. An unlabeled artificial dataset is generated for experiment 1, where we analyze the effects of two hyperparameters of CLeaR, the novelty buffer size and the threshold factor. We implement experiment 2 on a labeled dataset to predict wind power generation measured in 10 European wind farms [28] using supervised learning. In this experiment, we assess the CLeaR framework's performance in wind power prediction applications. Both experiments will adopt three evaluation metrics: fitting error, prediction accuracy, and forgetting ratio. The remainder of the section will introduce the datasets, the experimental setup, the evaluation criteria, and analyze the results.

Dataset
The artificial dataset is a seven-dimensional periodic unsupervised dataset, i.e., $D = [x_1, \ldots, x_7]$. The underlying generation model is shown as follows:
$$x_n(t) \sim \mathcal{N}\left(\mu_n(t), \sigma_n^2(t)\right), \quad \mu_n(t) = A_m \left|\sin\left(\frac{2\pi t}{T} + p_n\right)\right|, \quad \sigma_n^2(t) = A_v \left|\sin\left(\frac{2\pi t}{T} + p_n\right)\right|, \tag{19}$$
where $x_n(t)$ denotes the tth value of the sequence $x_n$ and is sampled from a Gaussian distribution with a time-dependent mean and variance. The mean and variance are given by the absolute value of a sinusoidal function at the tth point in time with a period T and a phase $p_n$. The phase $p_n$ is randomly generated from a standard normal distribution. Hyperparameters $A_m$ and $A_v$ refer to the amplitudes of the sinusoidal function for the mean and the variance, respectively.
In order to generate a dataset containing daily and yearly periodical changes, we adopt two generation models, x d n (t) and x y n (t), with two periods T d and T y respectively. The period T d is 24 to simulate the fluctuation within one day (24 hours), and the period T y of x y n (t) is 8760 to mimic the periodic changes over a year (8760 hours). The resulting time series data is a sum of the two models, i.e., x n (t) = x d n (t) + x y n (t). Fig. 6 displays the sequence x 1 (marked in red) and the corresponding generating process (marked in black) in three different time slots. Note that the first 48 samples of Fig. 6(b) are displayed in Fig. 6(a). Similarly, the first 720 samples of Fig. 6(c) are also shown in Fig. 6(b). The periodical statistic fluctuation displayed in Fig. 6 can simulate features of non-stationary data, such as temperature, which usually peaks at noon and falls at night.
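A generator in the spirit of this description can be sketched with NumPy. Several details below are assumptions rather than facts from the article: the sinusoid argument 2πt/T, the reuse of one phase per dimension for both the daily and the yearly component, and the amplitude values `a_mean` and `a_var`. The function names are hypothetical.

```python
import numpy as np

def periodic_component(t, period, phase, a_mean, a_var, rng):
    """Gaussian samples whose mean and variance follow |sin| with a given period."""
    envelope = np.abs(np.sin(2.0 * np.pi * t / period + phase))
    mean = a_mean * envelope
    std = np.sqrt(a_var * envelope)               # variance -> standard deviation
    return rng.normal(mean, std)

def make_dataset(n_steps, n_dims=7, a_mean=1.0, a_var=0.1, seed=0):
    """Sum of a daily (24 h) and a yearly (8760 h) component per dimension."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    columns = []
    for _ in range(n_dims):
        phase = rng.standard_normal()             # phase p_n ~ N(0, 1)
        daily = periodic_component(t, 24, phase, a_mean, a_var, rng)
        yearly = periodic_component(t, 8760, phase, a_mean, a_var, rng)
        columns.append(daily + yearly)
    return np.stack(columns, axis=1)              # shape (n_steps, n_dims)

D = make_dataset(17520)                           # two years of hourly samples
```

Plotting the first column of `D` over 48, 720, and 17520 steps gives the kind of nested daily and yearly fluctuation that the text describes.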
Wind power dataset [28] contains seven meteorological features and hourly averaged wind power generation data measured at European wind farms over two consecutive years. The features are 24-hour-ahead meteorological forecasts from the European Centre for Medium-Range Weather Forecasts model [29], including (1) wind speed at 100 m height, (2) wind speed at 10 m height, (3) wind direction (zonal) at 100 m height, (4) wind direction (meridional) at 100 m height, (5) air pressure, (6) air temperature, and (7) humidity.
The power generation time series are normalized by the respective rated capacity of the wind farm for easy, scale-free comparison. All features are normalized to the range between 0 and 1. Time points within periods in which no power was generated for longer than 24 hours are removed in the pre-processing phase. The dropped time points can be viewed as exceptions that need to be processed manually in real-world applications.
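The pre-processing steps above can be sketched as follows. The function names are hypothetical, and two details are assumptions: features are scaled with plain min-max normalization, and "longer than 24 hours" is read as more than 24 consecutive hourly samples with exactly zero power.

```python
def normalize_power(power, rated_capacity):
    """Scale power measurements by the rated capacity of the wind farm."""
    return [p / rated_capacity for p in power]


def min_max_normalize(values):
    """Scale one feature to the range [0, 1] (ASSUMPTION: min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def drop_long_zero_runs(power, max_run=24):
    """Return the indices to keep after removing zero-power runs longer than max_run."""
    keep = []
    start = 0
    n = len(power)
    while start < n:
        if power[start] != 0.0:
            keep.append(start)
            start += 1
            continue
        # Find the end of the current run of zeros.
        end = start
        while end < n and power[end] == 0.0:
            end += 1
        # Short runs are ordinary calm periods and are kept.
        if end - start <= max_run:
            keep.extend(range(start, end))
        start = end
    return keep


# Usage: a 30-hour outage is dropped, a 3-hour calm period is kept.
power = [0.5] + [0.0] * 30 + [0.2] + [0.0] * 3 + [0.1]
kept_indices = drop_long_zero_runs(power)
```

The same index list would be applied to the feature rows so that inputs and targets stay aligned.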
Both datasets are available on our department website [28] or by contacting the corresponding author. Researchers can re-implement our experiments with the datasets.

Experimental setup
Figure 6 The three subfigures illustrate the fluctuation of the first-dimension sequence x_1 over 48 hours (2 days), 720 hours (30 days), and 17520 hours (2 years), respectively. The underlying generation model is marked in black, and the generated data is in red. The generation model of x_1 is a sum of two sub-models, x^d_1 and x^y_1, which have different periods T_d (24 hours) and T_y (8760 hours), respectively.
The artificial dataset and the wind power dataset are split into three subsets according to three phases: a warm-up phase, an update phase, and an evaluation phase. In the warm-up phase, the model is pre-trained on the first 1000 samples, which contain only partial information that can describe the current task but not future tasks. In the update phase, the model monitors a data stream with the following 10000 samples, simulating a theoretically infinite data stream in a real application scenario. Here, we assume that 10000 samples can provide enough information to describe the real distribution of samples because 10000 samples span 10000 hours, covering a complete yearly period (8760 hours). Once updating is triggered, the model is retrained on the novelty buffer and validated on the familiarity buffer. The updated model is finally evaluated in the evaluation phase with the following 1000 samples.
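The three-phase split can be written down directly. The helper below is a hypothetical sketch that only slices a chronologically ordered sample list; it does not model the buffers or the update trigger.

```python
def split_phases(samples, warm_up=1000, update=10000, evaluation=1000):
    """Split an ordered sample sequence into warm-up, update, and evaluation parts."""
    total = warm_up + update + evaluation
    if len(samples) < total:
        raise ValueError(f"need at least {total} samples, got {len(samples)}")
    warm_up_set = samples[:warm_up]
    update_set = samples[warm_up:warm_up + update]
    evaluation_set = samples[warm_up + update:total]
    return warm_up_set, update_set, evaluation_set


# Usage with placeholder samples (integers stand in for feature vectors).
stream = list(range(12000))
warm_up_set, update_set, evaluation_set = split_phases(stream)
```

The order of the samples is preserved, so the evaluation set always lies after all data seen during warm-up and updating.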
We implement the following three CLeaR instances and a baseline model to compare their performances in experiments:
• Instance A: Its models are trained on the warm-up dataset in supervised learning mode and then directly evaluated in the evaluation phase without updating. The experimental results can reflect the disadvantages due to the lack of samples. Instance A is viewed as a lower bound of the models' performance.
• Instance B: Its models are pre-trained in the warm-up phase, as in Instance A. The models are then updated in the update phase by fine-tuning. Fine-tuning allows the pre-trained models to learn a new dataset, which they were not originally trained on, by slightly adjusting all unfrozen parameters. The re-training process is monitored by early stopping with a patience of 30 epochs to avoid overfitting, i.e., re-training stops if the loss has not decreased for 30 epochs. The optimal models are the ones with the lowest loss before stopping. Likewise, the updated Instance B is evaluated in the evaluation phase after the update phase.
• Instance C: It is similar to Instance B, but it is updated with Online EWC and is not monitored by early stopping. It is evaluated in the evaluation phase like the other models.
• Baseline model: It is a common deep neural network model with the same structure as the neural network models in the above CLeaR instances. The baseline model is trained conventionally with 11000 samples (i.e., the samples of the warm-up phase and the update phase). Dropout layers with a dropout rate of 0.2 are used during training to avoid overfitting. It is also evaluated in the evaluation phase.
Two sets of experiments are conducted on the artificial dataset and the wind power dataset, respectively.
• Experiment 1 is based on the artificial dataset to analyze the correlation between the CLeaR framework's performance and two framework-related hyperparameters, i.e., the novelty buffer size and the threshold factor. This unsupervised experiment aims to extract latent representations and reconstruct the inputs. Instance C (with only an autoencoder involved) and the baseline model (an autoencoder) are implemented. The grid search range of the framework's hyperparameters is shown in Table 2, including 56 available pairs of parameter values. The experiment with each pair of parameters is repeated 20 times with different initializations of the autoencoder. Likewise, the baseline model is trained 20 times. The architecture of the autoencoder and the training setting for experiment 1 are shown in Tables 3 and 4.
• Experiment 2 aims to evaluate all three CLeaR instances and the baseline model in the real-world application of wind power generation forecasting based on the datasets of 10 European wind farms. The adopted model consists of an autoencoder and a deep neural network, as proposed in Fig. 4. Its architecture parameters are empirically selected and kept identical for a fair comparison, see Table 5. The parameters for the framework and the Online EWC algorithm are selected using grid search from the range in Table 6. The training setting for experiment 2 is shown in Table 7.
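The early-stopping rule used when re-training Instance B (patience of 30 epochs, keeping the model with the lowest loss) can be illustrated without any deep learning library. The function below is a hypothetical sketch that works on a list of per-epoch losses; in a real training loop the loss would be computed epoch by epoch and the weights of the best epoch would be restored.

```python
def early_stopping_epoch(losses, patience=30):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses.

    Training stops at the first epoch at which the loss has not improved
    for `patience` consecutive epochs; the best epoch is the one to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # The loss list ended before the patience was exhausted.
    return best_epoch, len(losses) - 1


# Usage: the loss improves until epoch 2 and then stagnates.
losses = [1.0, 0.5, 0.4] + [0.45] * 40
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=30)
```

In this example the model from epoch 2 would be kept, and training would end 30 epochs later.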

Metrics
Models are evaluated in terms of fitting error, prediction error, and forgetting ratio.
Fitting error indicates how well the instance fits all seen samples after the update phase. Updating a CLeaR instance with mini-batch data might lead to a local minimum during the updating process, so that the instance eventually fits only a specific subset rather than all seen data. Such effects are measured by calculating the MSE on the 11000 samples, 1000 of which are from the warm-up phase and the remaining 10000 from the update phase. Therefore, a lower fitting error reflects that more knowledge has been accumulated.
Prediction error reflects the ability of CLeaR instances to perform predictions on previously unseen data. It is calculated with the 1000 samples in the evaluation phase. Owing to the powerful learning ability of neural networks, overfitting often occurs, where the test error is much larger than the training error. Regularization techniques, such as dropout and early stopping, are used to avoid overfitting during training. The prediction error can tell us whether the instance has fallen into a local optimum after several updates. Also, the comparison between Instance A and Instance C in experiment 2 shows how important accumulating data and updating the models are for improving prediction accuracy, especially under the condition of a limited pre-training dataset.
Forgetting ratio measures how much old knowledge a model forgets after learning new tasks. In [22], He et al. compared the forgetting ratio to the average test error and demonstrated that the forgetting ratio could reflect the severity of the forgetting problem. It is computed from two quantities: L^1_{warm up}, the MSE on the warm-up dataset at the end of the warm-up phase, and L^2_{warm up}, the error on the same dataset at the end of the update phase. The function max(x_1, x_2) used in the formula returns the larger of x_1 and x_2. The increment of the error for the same task describes the model's forgetfulness after learning new tasks. The comparison is performed only between Instance B and Instance C in experiment 2 because Instance A and the baseline model are not updated.
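As an illustration of how the two warm-up errors can be combined, the sketch below computes a forgetting ratio in Python. The exact formula is not reproduced in this text, so the form used here (the increase of the warm-up MSE, clipped at zero by max and divided by the initial warm-up MSE) is an assumption and may differ from the definition in [22].

```python
def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


def forgetting_ratio(loss_before, loss_after):
    """Relative increase of the warm-up error after the update phase.

    ASSUMPTION: max(0, L2 - L1) / L1, where L1 is the warm-up MSE at the end
    of the warm-up phase and L2 the MSE on the same data after updating.
    """
    return max(0.0, loss_after - loss_before) / loss_before


# Usage: the warm-up error rises from 0.02 to 0.03 after updating.
ratio_forgetting = forgetting_ratio(0.02, 0.03)
# The warm-up error falls, so no forgetting is reported.
ratio_no_forgetting = forgetting_ratio(0.02, 0.01)
```

Under this assumed form, a value of 0 means that the error on the warm-up data did not increase, and larger values indicate more severe forgetting.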
Results of experiment 1
Fig. 7 illustrates the correlation between the framework's hyperparameters (novelty buffer size and threshold factor) and the evaluation metrics (fitting error, prediction error, and forgetting ratio) based on 1120 results obtained in experiment 1. The lower the evaluation metric is, the better the model performs. Note that the binary logarithms of the fitting error and the prediction error are plotted in Fig. 7. Moreover, the binary logarithms of the best fitting error and the best prediction error of the baseline model are -4.82 and -6.58, respectively.
Regarding the metric of fitting error, 1089 of the 1120 CLeaR models (97.2%) perform better than the best baseline model. This result indicates that the CLeaR framework can accumulate knowledge continually and effectively. Besides, it shows that the CLeaR framework's continual fitting ability is relatively robust to both framework-related hyperparameters.
Regarding the prediction error, 360 of the 1120 CLeaR models (32.1%) outperform the best baseline model. 253 of the 360 models have a threshold factor greater than or equal to 0.95. 159 of the 253 models have a novelty buffer size lower than or equal to 1000. Furthermore, 455 CLeaR models (40.6%) perform worse than the worst baseline model, whose prediction error (binary logarithm) is -5.84. 267 of the 455 models have a threshold factor lower than or equal to 0.95. 203 of the 267 models have a novelty buffer size greater than or equal to 1200. On the one hand, the experimental results are in accordance with our expectation, i.e., continual accumulation of meaningful knowledge can improve neural networks' prediction abilities. To a certain extent, the CLeaR framework can even predict non-stationary data more accurately than a neural network trained with sufficient historical samples. On the other hand, our findings indicate that the CLeaR framework's prediction ability is sensitive to its hyperparameter values. In the update phase, the 360 models that outperform the best baseline model are updated 10.2 times on average, while the 455 models that perform worse than the worst baseline model are updated 6.87 times on average. A smaller novelty buffer is filled more quickly, so that updating is triggered more frequently. Besides, a higher threshold means that only high-entropy data is stored in the novelty buffer. We conclude that timely learning of high-entropy data can effectively improve the prediction accuracy of neural networks for non-stationary data.
We only analyze the forgetting ratio of the CLeaR models because the forgetting problem does not occur with the baseline model. 174 of the 1120 CLeaR models (15.5%) obtain a forgetting ratio greater than 0.1. 172 of the 174 models have a novelty buffer size lower than or equal to 1000. Combined with the conclusion concerning the prediction error, we find that a smaller novelty buffer triggers updating more often, which can decrease the prediction error but also increase the forgetting ratio. The main reason may be that the hyperparameter γ of Online EWC is set to 0.9 (see Table 4), which leads to a decay of the Fisher information regarding the previous tasks after each update. Therefore, how to adjust the hyperparameters and how to supervise the updating process will be one of the key points in our further research. Otherwise, the framework will always face a trade-off between prediction error and forgetting ratio. Fig. 7 is plotted using HiPlot [30]. The raw data of the results is available by contacting the corresponding author.
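The decay effect of γ can be made concrete with a small numerical sketch. It assumes the commonly used Online EWC accumulation rule, in which the running Fisher information is multiplied by γ before the Fisher information of the newest task is added; the function names and the toy values are hypothetical.

```python
def update_fisher(running_fisher, new_fisher, gamma=0.9):
    """Accumulate per-parameter Fisher information with decay factor gamma.

    ASSUMPTION: running <- gamma * running + new, applied after each update.
    """
    return [gamma * old + new for old, new in zip(running_fisher, new_fisher)]


def remaining_weight(gamma, num_later_updates):
    """Fraction of a task's Fisher information left after later updates."""
    return gamma ** num_later_updates


# Usage: two parameters, one accumulation step.
fisher = update_fisher([1.0, 2.0], [0.5, 0.5], gamma=0.9)

# With gamma = 0.9, about 35% of the warm-up task's Fisher information
# remains after 10 further updates.
weight_after_10 = remaining_weight(0.9, 10)
```

Since frequent updates shrink the contribution of early tasks geometrically, a small novelty buffer combined with γ < 1 weakens the protection of the weights that matter for the warm-up data.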
Results of experiment 2
First, we analyze the fitting error results of the three CLeaR instances and the baseline model for the 11000 samples in the first two phases, see Tables 8 and 9. Table 8 shows the fitting errors regarding the weather-to-weather data, i.e., the outputs of the autoencoders, and Table 9 presents the corresponding results of the weather-to-power data, i.e., the outputs of the predictors. WF is the abbreviation of wind farm. The values in the last row of the tables are the average results of the 10 wind farms. The best results among the three CLeaR instances are marked in bold. Compared to the baseline model results, Instance C can decrease the fitting error in terms of both input reconstruction and power prediction. This is similar to our observation in experiment 1. We can also observe that the average fitting error of Instance B is slightly lower than that of Instance C in Table 8, but clearly higher in Table 9. We infer that Catastrophic Forgetting occurs in Instance B because it applies fine-tuning, which can only adapt the model to new situations. The high forgetting ratio in Table 11 also supports the conclusion that the forgetting problem exists. Table 10 shows the results of power prediction, where Instance C clearly outperforms the other two instances.
Figure 7 The parallel coordinates plot illustrates correlations between the framework's hyperparameters (novelty buffer size and threshold factor) and the evaluation metrics (fitting error, prediction error, and forgetting ratio) based on 1120 results obtained in experiment 1. 56 pairs of hyperparameters are repeated 20 times. The best fitting error and the best prediction error of the baseline model in 20 repeated experiments are -4.82 and -6.58.
The baseline model results indicate that sufficient training data can enable neural network models to obtain as much meaningful knowledge as possible, which helps to improve the models' predictive ability. Fig. 8 illustrates the errors of the three instances and the baseline model over the 12000 weather-to-power samples of one wind farm. The errors are calculated after finishing the update phase. We split the samples into 12 sections, and each point in Fig. 8 refers to the MSE of 1000 samples. The average value of the points from 1000 to 11000 corresponds to the fitting error, and the value at a coordinate of 12000 is the prediction error. Compared to the other three lines, the red line is less volatile. Moreover, we can observe that the lines start to rise at a coordinate of 5000 and drop back to the original level at the end. On the one hand, this reflects that the periodic changes can lead to these fluctuating curve shapes and influence the mapping between weather and power generation. On the other hand, Tables 9 and 10 and Fig. 8, where Instance C outperforms the other instances, indicate that updating is important for predicting a non-stationary data stream, especially when the model is trained only on a limited dataset. Table 11 presents the forgetting ratio values of Instance B and Instance C calculated after the update phase. According to the average results, we can conclude that Instance C outperforms Instance B here, for both the weather-to-weather task and the weather-to-power task. Moreover, consistent with the conclusion drawn from Table 9, the forgetting problem leads to an increase in the fitting error.
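The relation between the points in Fig. 8 and the two error metrics can be sketched as follows. The helper is hypothetical and takes a list of per-sample squared errors; the toy values in the usage lines are made up for illustration and are not experimental results.

```python
def sectionwise_mse(squared_errors, section_size=1000):
    """Average the per-sample squared errors over consecutive sections."""
    sections = []
    for start in range(0, len(squared_errors), section_size):
        chunk = squared_errors[start:start + section_size]
        sections.append(sum(chunk) / len(chunk))
    return sections


# Toy data: 11000 seen samples followed by 1000 evaluation samples.
squared_errors = [0.01] * 11000 + [0.04] * 1000
sections = sectionwise_mse(squared_errors)

# The first 11 sections cover the warm-up and update phases (fitting error);
# the 12th section covers the evaluation phase (prediction error).
fitting_error = sum(sections[:11]) / 11
prediction_error = sections[11]
```

Because all sections have the same size, averaging the first 11 section values gives the same result as the MSE over all 11000 seen samples.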

Conclusions
We believe that continual learning will be the key to future machine intelligence. The non-stationary world requires future artificial intelligence to be updated smoothly by taking into account the different data distributions while still retaining previous useful knowledge. Therefore, in this article, the proposed CLeaR describes the prototype structure of a continual-learning-based framework. It can be applied to many real-world projects, such as power predictions for smart grids, where prediction models have to mimic humans' ability to acquire and transfer knowledge incrementally from new data throughout their lifespan. The framework still needs to be improved in our future research. For example, although the exceptions removed during pre-processing are not considered here, it is necessary to define and process such exceptions in a real application. Besides, we only use MSE to calculate the difference between predictions and measurements. However, this method might not work when the values are unavailable in unsupervised settings. Therefore, we suggest estimating the uncertainty of new predictions to detect changes in distributions, for example, using Monte Carlo dropout [31]. If the uncertainty exceeds a threshold, the data can be labeled as novelties. Furthermore, hyperparameters are found by grid search in the context of the current design. It is worth researching whether hyperparameters can be found by transferring relevant knowledge from a similar task dataset. In addition, we suppose that Online EWC can be replaced by (or combined with) other CL algorithms to improve the framework. Novel analysis methods and evaluation metrics for the updated models will also be one of our main research focuses in the future.
In a nutshell, we expect that the framework can be designed as a modular tool like LEGO toys. Each component of the framework is flexible and can be added, removed, replaced, or expanded. It should be possible for researchers and users to adapt the framework to a specific application scenario for achieving their own goals.