 Research article
 Open Access
CLeaR: An adaptive continual learning framework for regression tasks
AI Perspectives volume 3, Article number: 2 (2021)
Abstract
Catastrophic forgetting means that a trained neural network model gradually forgets the previously learned tasks when being retrained on new tasks. Overcoming the forgetting problem is a major challenge in machine learning. Numerous continual learning algorithms are very successful in incremental learning of classification tasks, where new samples with their labels appear frequently. However, as far as we know, there is currently no research that addresses the catastrophic forgetting problem in regression tasks. This problem has emerged as one of the primary constraints in some applications, such as renewable energy forecasts. This article clarifies problem-related definitions and proposes a new methodological framework that can forecast targets and update itself by means of continual learning. The framework consists of forecasting neural networks and buffers, which store newly collected data from a nonstationary data stream in an application. Changes in the probability distribution of the data stream, once identified by the framework, are learned sequentially. The framework is called CLeaR (Continual Learning for Regression Tasks), and its components can be flexibly customized for a specific application scenario. We design two sets of experiments to evaluate the CLeaR framework concerning fitting error (training), prediction error (test), and forgetting ratio. The first one is based on an artificial time series to explore how hyperparameters affect the CLeaR framework. The second one is designed with data collected from European wind farms to evaluate the CLeaR framework’s performance in a real-world application. The experimental results demonstrate that the CLeaR framework can continually acquire knowledge in the data stream and improve the prediction accuracy. The article concludes with further research issues arising from requirements to extend the framework.
Introduction
In the late 1980s, McCloskey and Cohen [1] and Ratcliff [2] observed a phenomenon where the well-learned knowledge of connectionist models is erased by new knowledge under specific conditions when the models learn new tasks successively. It is referred to as Catastrophic Forgetting or Catastrophic Interference. The challenge is supposed to be a general problem existing in different kinds of neural networks, e.g., backpropagation neural networks and unsupervised neural networks. In a fully connected neural network, each neuron of a layer is connected to all neurons of the next layer. Weights control the strength of the connection between two neurons. A weight vector can be expressed as a column vector w=(w_{1},w_{2},...,w_{N})^{T} with N values. Some weights have a significant influence on more than one task. For example, w_{1},w_{2},w_{3} are important for task 1 and w_{1},w_{4},w_{5} are important for task 2. In this case, the shared weight w_{1} may be adjusted while task 2 is learned after task 1, which is one of the main reasons for Catastrophic Forgetting.
Overcoming the forgetting problem is a crucial step in implementing real intelligence. Models require plasticity for learning and integrating new knowledge as well as stability for consolidating what models have learned previously. Excessive plasticity can cause the acquired knowledge to be erased while learning new tasks. On the other hand, successively learning new tasks can become more challenging due to extreme stability. It is the so-called stability-plasticity dilemma [3].
Many researchers have proposed continual learning (CL) algorithms to solve the problem, such as in [4–6]. A three-way categorization for the most common CL strategies is described in [6]: (1) regularization strategies, (2) rehearsal strategies, and (3) architectural strategies. Similarly, CL strategies are grouped as (1) prior-focused approaches, (2) likelihood-focused approaches, and (3) dynamic architectures in [7]. These categorizations standardize terminologies and outline a distinct research direction for the CL community. CL algorithms have been proven successful in supervised learning and reinforcement learning to train several tasks sequentially without forgetting the acquired ones. Application scenarios cover handwriting recognition [8–10], image classification [5], sequentially learning to play games in a reinforcement learning setting [4] and much more.
To the best of our knowledge, some of the standard CL benchmarks are reconstructions of well-known datasets, such as permuted MNIST [11, 12], where the CL tasks are obtained by scrambling the pixel positions in the MNIST dataset [13]. Moreover, some datasets are explicitly generated to evaluate CL algorithms, for example, CORe50 [14] for continuous object recognition.
Most of the past research focused mainly on classification tasks rather than regression tasks, where the Catastrophic Forgetting problem usually occurs as well. For example, establishing regional smart grids requires power generation and consumption forecasts. In [15, 16], neural networks are used to forecast renewable energy generation with weather prediction data. Note that weather data is nonstationary, i.e., its probability distribution changes over time, e.g., from summer to winter. In this situation, training neural networks has to be delayed until sufficient data is collected. Otherwise, the networks could be overfitted to the limited training dataset. Furthermore, unseen situations, e.g., extreme weather conditions, updating/damaging/ageing of generators, and climate change, can be called special events. The neural networks have to update themselves by continually learning these situations when they appear in the application. Besides, the power consumption of a household or a factory is also easily affected by unpredictable events, such as purchasing new equipment or hiring more employees. These factors can change the obtained mapping between inputs and outputs. Under these conditions, where historical data may be private, unrecorded, or too cumbersome to retrain on, the trained models have to learn new knowledge and consolidate the previously stored internal representations without the help of old data.
The main contribution of this article is to propose a framework called CLeaR (Continual Learning for Regression Tasks) for continually learning the identified changes in nonstationary data streams. Moreover, the framework is tested in two sets of experiments to assess its performance and analyze its hyperparameters’ effects. The CLeaR framework consists of neural networks for prediction and buffers for storing new data. We calculate an error between the prediction and the corresponding true value at each point in time. The new data is labeled by comparing the error to a dynamically adjustable threshold. If the error is larger than the threshold, the data is labeled as a novelty and stored in a finite novelty buffer; otherwise, it is stored as a familiarity in an infinite familiarity buffer. When the novelty buffer is full, an update is triggered. The network is retrained on the dataset in the novelty buffer using CL. The retrained network is then tested on the familiarity dataset to evaluate how much old knowledge is retained. After updating, the threshold needs to be reestimated for the following learning step. The reestimation depends on the performance of the updated network on the datasets of both buffers. Afterwards, the buffers are emptied. An update is repeated each time the novelty buffer is filled again.
The remainder of the article will review the literature regarding CL algorithms and applications. Then we will outline the proposed framework’s fundamental structure and give an insight into the experimental details. Furthermore, we will analyze the experimental results. This article will conclude with our findings and provide an outlook for future research.
Related work
This section will start with a brief overview of the recent academic literature regarding approaches and experimental setups designed for CL classification applications.
The changes in data or goals can be defined as new tasks in the CL community. For example, a model is expected to learn new instances of the same class while retaining its knowledge about the previous instances, or to learn new instances of different classes without losing accuracy on previous classes, or to learn new instances of the known and unknown classes. These are defined as different CL scenarios in [14]. In both [6, 7], CL algorithms are categorized into three groups in a similar way:

Prior-focused approaches exploit the fact that the posterior probability of N tasks is proportional to the product of the likelihood of the Nth task and the posterior probability of the first N−1 tasks, as
$$ P\left(\Theta \mid D_{1},\dots,D_{N}\right)=\frac{P\left(D_{N} \mid \Theta\right)P\left(\Theta \mid D_{1},\dots,D_{N-1}\right)}{P\left(D_{N} \mid D_{1},\dots,D_{N-1}\right)}. $$(1)
As a regularization, the posterior probability of the first N−1 tasks is added to the loss function to avoid changing the weights that are important for previous tasks. Well-known prior-focused algorithms include Elastic Weight Consolidation (EWC) [4], Synaptic Intelligence (SI) [5], and Variational Continual Learning [8]. In [9], a generalization of EWC++ and SI was proposed, which is referred to as the RWalk algorithm.

Likelihood-focused approaches require a subset of randomly selected samples from the original dataset of the previous N−1 tasks, or a dataset generated by a generative network that has learned the tasks, see [7, 10, 17].

Dynamic architectures enable neural networks to learn CL tasks sequentially by adjusting the networks’ architecture for specific applications. Progressive Networks, Learning Without Forgetting (LWF), and Less-Forgetting Learning have been introduced in [18–20], respectively.
The above algorithms have been evaluated in multi-task scenarios with reshaped versions of famous datasets, e.g., MNIST and CIFAR-10/CIFAR-100. In these scenarios, a model learns a new, isolated task in a sequence while remembering how to solve the learned tasks. However, there are no class overlaps among the different tasks. For example, in [5] the MNIST dataset is split into five tasks, each of which contains two labels (two digits). The model can classify data to the correct group only if the information regarding the current task is given. In this case, the model learns how to solve a series of discrete tasks rather than continually acquiring knowledge to address incremental problems. The experimental setups and the datasets do not allow for a fair comparison among the CL algorithms. Lomonaco et al. [14] create CORe50 specifically for single-incremental-task scenarios, which can be seen as a test benchmark for continuous object recognition. Similarly, the iCubWorld benchmark [21] is designed for robotic vision challenges, where comparison among various CL approaches is feasible.
Besides, continual learning should be considered in regression as well. In [22], He et al. propose two CL application scenarios for establishing regional smart grids: the task-domain incremental scenario and the data-domain incremental scenario. The scenarios are applicable for forecasting power, including renewable energy generation and power consumption in the medium/low-voltage grid. Moreover, the performances of four CL algorithms (EWC, Online-EWC, SI, and LWF) are evaluated concerning accuracy, forgetting ratio, and training time in the two scenarios. However, prior knowledge about new tasks is given in their experimental setup, which means that models know when new tasks will occur without novelty detection. Therefore, this setup does not reflect real-world conditions.
In [23], Farquhar et al. conclude that an inappropriate experimental design could misrepresent the performances of the well-known CL approaches. Therefore, they suggest five requirements for evaluating CL algorithms and demonstrate their necessity. The five requirements are: (1) cross-task resemblances; (2) shared output head; (3) no test-time assumed task labels; (4) no unconstrained retraining on old tasks; and (5) more than two tasks.
These suggestions are worth considering in our experimental setup and inspire the design of the CLeaR framework. (1) Most novelties are due to changes of data P(X) or targets P(Y|X,Θ). The dataset of every full novelty buffer can be viewed as a new task that resembles the previous tasks. (2) The neural network outputs the power value prediction, and the new tasks will not require a change of the network’s architecture. (3) The prior knowledge regarding the new task, such as when the new task appears or what the distribution of the new task is, is unknown in the application. Updating is triggered automatically only when the finite novelty buffer is filled in our experimental setup. (4) Considering that privacy laws might prohibit the long-term storage of historical datasets, we retrain the neural network only on the dataset newly collected in applications and delete it after updating. (5) More tasks will appear as the probability distribution or the mapping between inputs and outputs changes over time.
Power forecasts using deep neural networks
At the beginning of this section, we list the chosen mathematical notations in Table 1. It can help readers to understand the following mathematical expressions. Besides, we use superscript ^{T} to denote the transpose of a matrix or a vector and T to denote the number of tasks.
Deep neural networks are machine learning models inspired by biological neural networks that model nonlinear dependencies in high-dimensional data. Compared with traditional dimensionality reduction techniques, such as principal component analysis (PCA), the multiple deep layers of a neural network can extract representations efficiently from massive data to provide predictive performance gains.
The rest of this section will start with formulating the problem. Moreover, this section will illustrate the probability distribution changes in the experimental dataset over time and clarify how the CLeaR framework works in a general power forecasting workflow.
Problem formulation
Deep neural networks can output a prediction \(\hat {y}_{n}\) given a high-dimensional input x_{n}. The goal of training is to find a mapping between x_{n} and y_{n}. A general deep neural network consists of a series of hidden layers, which can be formulated as:
$$ \mathbf{z}_{l}=f_{l}\left(\Theta_{l}^{\mathrm{T}}\mathbf{z}_{l-1}\right) $$(2)
with an output column vector z_{l−1} of the l−1th layer, where z_{0}=x_{n}, and an activation function f_{l}. Here \(\Theta _{l}^{\mathrm {T}}\) denotes the transpose of the weight matrix Θ_{l}, whose dimension is the dimension of the l−1th layer by the dimension of the lth layer. A prediction \(\hat {y}_{n}\) of the deep neural network is then
$$ \hat{y}_{n}=f_{L}\left(\Theta_{L}^{\mathrm{T}}\mathbf{z}_{L-1}\right), $$(3)
where L is the number of layers.
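Training a neural network means minimizing a defined loss function. In this section, we explain the process with the example of the Mean Square Error (MSE), i.e.,
$$ \mathcal{L}_{\text{MSE}}(\Theta)=\frac{1}{N}\sum_{n=1}^{N}\left(y_{n}-\hat{y}_{n}\right)^{2}, $$(4)
where N is the number of training samples. In order to avoid overfitting, a regularization term is usually added to Eq. 4, as
$$ \mathcal{L}(\Theta)=\frac{1}{N}\sum_{n=1}^{N}\left(y_{n}-\hat{y}_{n}\right)^{2}+\lambda R(\Theta), $$(5)
where λ∈(0,∞) is a hyperparameter that weights the contribution of the penalty term R(Θ). Different choices for R(Θ) can result in different solutions.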
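The regularized loss can be illustrated with a few lines of Python. This is a minimal sketch, not the implementation used in the experiments: the function names are hypothetical, and the squared L2 norm is only one possible choice for R(Θ).

```python
def mse(targets, predictions):
    """Mean squared error between two equally long sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)


def l2_penalty(weights):
    """Illustrative choice of R(Theta): the sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(targets, predictions, weights, lam):
    """MSE plus lam * R(Theta), with R chosen as the L2 penalty above."""
    return mse(targets, predictions) + lam * l2_penalty(weights)


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
targets = [1.0, 2.0, 3.0]
predictions = [1.0, 2.5, 2.0]
weights = [0.5, -1.0]
loss = regularized_loss(targets, predictions, weights, lam=0.1)
```

A larger `lam` shifts the optimum towards small weights, while `lam` close to zero recovers the plain MSE.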
From a probabilistic perspective, according to chapter 9 in [24], we can assume that we are given an input x_{n} and the corresponding noisy observation \(y_{n} = \hat {y}_{n} + \epsilon \). More specifically, we further assume that the noise ε is independent and identically distributed according to a Gaussian distribution with zero mean and variance σ^{2}. Therefore, the regression problem can be considered with a likelihood function:
$$ P\left(y_{n} \mid \mathbf{x}_{n},\Theta\right)=\mathcal{N}\left(y_{n} \mid \hat{y}_{n},\sigma^{2}\right). $$(6)
When we are given the datasets X and Y, Eq. 6 can be expressed as
$$ P\left(Y \mid X,\Theta\right)=\prod_{n=1}^{N}P\left(y_{n} \mid \mathbf{x}_{n},\Theta\right), $$(7)
where we assume that y_{i} and y_{j} are conditionally independent given their feature vectors x_{i} and x_{j}. To avoid overfitting during training, we seek parameters Θ that maximize the posterior distribution P(Θ|X,Y) instead of the likelihood. We can obtain the posterior distribution by applying Bayes’ theorem as
$$ P\left(\Theta \mid X,Y\right)=\frac{P\left(Y \mid X,\Theta\right)P\left(\Theta\right)}{P\left(Y \mid X\right)}. $$(8)
Note that the posterior distribution depends on the given X and Y. If the statistical properties of the distributions change, such as the mean or the variance, new parameter values will become optimal.
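The link between maximizing the posterior and minimizing a regularized MSE can be made explicit. The following derivation is a standard sketch under the Gaussian noise model stated above and, as an additional assumption that the text does not make, a zero-mean isotropic Gaussian prior on Θ with variance τ^{2}:

```latex
\begin{aligned}
-\log P(\Theta \mid X, Y)
  &= -\log P(Y \mid X, \Theta) - \log P(\Theta) + \log P(Y \mid X) \\
  &= \frac{1}{2\sigma^{2}} \sum_{n=1}^{N} \left(y_{n} - \hat{y}_{n}\right)^{2}
     + \frac{1}{2\tau^{2}} \lVert \Theta \rVert_{2}^{2} + \text{const}.
\end{aligned}
```

Multiplying by 2σ^{2}/N leaves the minimizer unchanged and gives the MSE plus λ‖Θ‖^{2} with λ=σ^{2}/(Nτ^{2}). Under these assumptions, the maximum a posteriori estimate coincides with the minimizer of an L2-regularized MSE loss.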
Change in probability distribution
A learning task can be described as changes in distributions of data P(X) or targets P(Y|X,Θ). In our experimental setup, meteorological features and the power measurements are viewed as the inputs and outputs of the neural network, respectively. Some meteorological features, e.g., temperature and wind direction, fluctuate with a yearly period. Also, renewable energy generation depends on meteorological conditions. Therefore, the joint probability distribution P(X,Y) changes period by period. Mathematically, the change can be expressed as follows:
$$ P_{t_{0}}\left(X,Y\right)\neq P_{t_{1}}\left(X,Y\right), $$(9)
where X and Y represent a batch of features and power measurements respectively, which are measured from a nonstationary data stream. If two measurement periods t_{0} and t_{1} are far apart, the two distributions differ due to concept drift. The formula:
$$ P\left(X,Y\right)=P\left(Y \mid X\right)P\left(X\right) $$(10)
demonstrates that P(X,Y) is affected by the change in the probability distribution of the inputs and the obtained mapping.
Figures 1 and 2 illustrate the change regarding P(X) and P(Y|X) based on data from a European wind farm dataset. A sample of the dataset contains seven-dimensional meteorological features and a scalar power value. The first two principal components are extracted from the seven features using PCA and labeled as X1 and X2. 64.19% of the overall variance is explained by the two components. We split 10000 samples from the dataset into four sections sequentially over time, each of which has 2500 samples. The distribution of each section regarding X1 and X2 is plotted in Fig. 1. Similarly, the distributions of four sections regarding X1 and the power Y are shown in Fig. 2. The distributions of P(X1,X2) and P(X1,Y) with the same 10000 samples are plotted in Figs. 1(e) and 2(e) in comparison to other subfigures.
The two-dimensional probability distribution of P(X1,X2) with 10000 samples shows a circular shape, as shown in Fig. 1(e). From Fig. 1(a) to (d), the center of the contour line regarding each section moves clockwise along the circle. This movement illustrates the periodic change in P(X) over time. Similarly, we can observe that the center of the distribution P(Y|X1) moves from right to left over time, as shown in Fig. 2.
Combined with Eq. 9 and the observations, we conclude that a periodic change exists in the dataset. The model needs to be updated in applications if it was pretrained only on a limited dataset.
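The sectioning used for the figures can be mimicked numerically. The sketch below is not the analysis behind Figs. 1 and 2: it replaces the wind farm features with a hypothetical synthetic signal (a slow sinusoid) and simply compares the mean of four consecutive sections of 2500 samples each.

```python
import math
import statistics


def section_means(values, n_sections):
    """Split a sequence into equally sized consecutive sections and
    return the mean of each section."""
    size = len(values) // n_sections
    return [
        statistics.fmean(values[i * size:(i + 1) * size])
        for i in range(n_sections)
    ]


# Hypothetical stand-in for one feature: one full period over 10000 samples.
stream = [math.sin(2 * math.pi * t / 10000) for t in range(10000)]
means = section_means(stream, 4)
```

The section means differ clearly between the first and the second half of the period, which is the kind of shift a model pretrained on only one section would not have seen.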
Power forecasting workflow
We suggest that a general power forecasting workflow should comprise four phases: (1) reporting exceptions, (2) predicting targets, (3) storing data, and (4) updating models, as shown in Fig. 3.
We define an exception as an outcome of an unknown process that deviates from the expected model. For example, wind power generators are automatically shut down for protection under extreme weather conditions, such as typhoons or storms. Exceptions have to be identified first as the data arrives. If exceptions exist in the data stream, they need to be reported and processed manually. Although the dataset used here has been cleaned in the preprocessing phase, such exceptions might appear in applications. Labeling and learning exceptions is one of the research challenges in the field of active learning, which is beyond the scope of this article and can be investigated in future work.
The CLeaR framework contains the models for prediction and the buffers for storage. Once the measurements Y are available, the corresponding data is labeled and stored into the two buffers depending on the preset threshold and the error. We choose the MSE here for supervised regression tasks, but the method can be customized in other scenarios, e.g., probabilistic forecasts. The data in the novelty buffer covers the change of probability distribution detected in the data stream. It is used for retraining the models when the update is triggered. The data in the familiarity buffer has information that the models are familiar with. It can be used for testing whether the models still retain the old knowledge after updating. Updating models can be considered as accumulations of knowledge for improving prediction accuracy.
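The labeling step can be sketched as follows. The function name, the threshold value, and the toy stream are hypothetical; the squared error of a single sample stands in for the per-sample MSE.

```python
def label_sample(y_true, y_pred, threshold):
    """Return 'novelty' if the squared error exceeds the threshold,
    otherwise 'familiarity'."""
    error = (y_true - y_pred) ** 2
    return "novelty" if error > threshold else "familiarity"


novelty_buffer, familiarity_buffer = [], []
threshold = 0.01  # hypothetical preset threshold

# Each entry: (features, measured power, predicted power); toy values.
stream = [((0.2,), 0.50, 0.52), ((0.9,), 0.10, 0.45)]
for x, y_true, y_pred in stream:
    if label_sample(y_true, y_pred, threshold) == "novelty":
        novelty_buffer.append((x, y_true))
    else:
        familiarity_buffer.append((x, y_true))
```

The first sample is predicted well and ends up in the familiarity buffer, while the second one is stored as a novelty.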
CLeaR
The details of the CLeaR framework will be explained through the instance used in our experiments, as shown in Fig. 4. In this instance, the block Models contains an autoencoder and a fully-connected neural network for detecting the changes of the distributions P(X) and P(Y|X) and for forecasting power values. Data is labeled as novelty or familiarity by comparing the MSE to the threshold. Besides, we adopt Online-EWC to update the models and adjust the threshold dynamically after each update. We suggest that the components of the CLeaR framework should be selected flexibly depending on the specific application scenario.
Models
An autoencoder is a neural network that usually consists of two symmetric parts with a bottleneck between them. In an undercomplete autoencoder, the bottleneck has a smaller dimension than the input layer, which helps to extract latent representations z from the input. An autoencoder thus learns to reconstruct the input at the output rather than simply copying it [25]. The encoder and the decoder can be formulated as z_{n}=f_{Θ}(x_{n}) and \(\hat {\mathbf {x}}_{n}=g_{\Phi }\left (\mathbf {z}_{n}\right)\), where Θ and Φ are the parameter matrices. The optimization goal is to minimize the loss function,
$$ \mathcal{L}_{\text{AE}}(\Theta,\Phi)=\frac{1}{N}\sum_{n=1}^{N}\left\|\mathbf{x}_{n}-g_{\Phi}\left(f_{\Theta}\left(\mathbf{x}_{n}\right)\right)\right\|^{2}, $$(11)
by penalizing the reconstruction being different from the input. The change of distribution P(X), as shown in Fig. 1, can be detected by the reconstruction error of the autoencoder.
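How a reconstruction error reveals a change in P(X) can be seen in a deliberately tiny example. The sketch below is an assumption-laden stand-in for the autoencoder: the encoder and decoder are fixed linear maps onto a single hypothetical direction `u` (a one-dimensional bottleneck) instead of trained networks.

```python
import math


def encode(x, u):
    """Encoder f: project x onto the unit direction u (1-D bottleneck)."""
    return sum(xi * ui for xi, ui in zip(x, u))


def decode(z, u):
    """Decoder g: map the latent value back to the input space."""
    return [z * ui for ui in u]


def reconstruction_error(x, u):
    """Mean squared difference between x and its reconstruction g(f(x))."""
    x_hat = decode(encode(x, u), u)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat)) / len(x)


u = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # hypothetical learned direction
familiar = [0.4, 0.4]    # lies on the learned subspace
shifted = [0.4, -0.4]    # orthogonal to it, i.e. from a changed P(X)

err_familiar = reconstruction_error(familiar, u)
err_shifted = reconstruction_error(shifted, u)
```

Inputs that resemble the data seen so far are reconstructed almost perfectly, whereas inputs from a shifted distribution produce a large error that can be compared to the threshold.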
At the next step, the extracted representation z_{n} is fed into the predictor. As explained in Eqs. 4 and 5, optimizing the network in the general supervised setting is to minimize the MSE. Thus a true measurement y_{n} is required. The predictor, a fully-connected neural network, can be replaced by other networks, e.g., LSTM. We can also drop the predictor in applications where only the reconstruction is needed, as we will introduce in experiment 1.
Buffers
Every neural network that needs to be updated during its application owns a limited novelty buffer and an unlimited familiarity buffer. When true target values are provided, the MSE can be used as a criterion to be compared to the preset threshold. The samples with a small MSE are stored in the familiarity buffer because the trained network has learned to cope with them before. The samples with a large MSE are stored in the novelty buffer. The network needs to be retrained based on these novelties to learn new knowledge. After updating, a validation error can be calculated using the familiarities to estimate whether the network can still retain the old knowledge acquired previously. How to deal with a poor update result remains an open question and needs to be further discussed. We empty both buffers after finishing an update.
Threshold
Each model that needs to be updated owns a threshold. As shown in Fig. 4, Threshold_a is for the autoencoder and Threshold_p is for the predictor. The value of the threshold determines how new samples are classified. The smaller the threshold, the more samples are likely to be labeled as novelties, so that more well-learned knowledge is relearned, which leads to unnecessary updates. A larger threshold could cause inefficient updates because too many novelties are misclassified as familiarities. We suggest that the threshold value should be adjusted dynamically depending on the training results.
Figure 5 illustrates a distribution of errors for all samples after learning. MSE_{min} refers to the minimum MSE obtained by minimizing a loss function, which can also be viewed as the mean of the distribution. A distribution with a lower mean and a lower variance indicates that the model has learned the given dataset better. In this article, we adjust the threshold by
$$ \text{threshold}=\alpha\cdot\text{MSE}_{min}, $$(12)
where α is a fixed threshold factor and the MSE_{min} is reestimated after each update.
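A re-estimation step of this kind might look as follows. The sketch rests on two assumptions that should be checked against the actual implementation: the threshold is taken as the product of α and the re-estimated MSE_{min}, and MSE_{min} is approximated by the mean per-sample squared error of the updated network over the samples of both buffers. The function name is hypothetical.

```python
from statistics import fmean


def reestimate_threshold(errors_novelty, errors_familiarity, alpha):
    """Return alpha times the mean squared error over both buffers
    (assumed multiplicative rule, see the lead-in)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    mse_min = fmean(list(errors_novelty) + list(errors_familiarity))
    return alpha * mse_min


# Hypothetical per-sample squared errors after an update.
new_threshold = reestimate_threshold([0.02, 0.04], [0.01, 0.01], alpha=1.5)
```

With a factor above 1, samples whose error is only slightly above the average are still treated as familiar, which limits unnecessary updates.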
Update
The updating method and the trigger condition play a crucial role in the CLeaR framework. As mentioned in the section on related work, retraining on old tasks should be constrained for reasons such as privacy protection or data storage overhead. Therefore, we adopt Online-EWC, which penalizes the loss when weights that are significant for previous tasks are changed while learning new tasks. We adapt the notation given in [26] to explain EWC and Online-EWC in the context of the CLeaR framework.
EWC
The goal of EWC is to approximate Bayesian posteriors over model parameters given tasks. In CLeaR, data is always split according to two kinds of tasks, the known task (T_{k}) and the unknown task (T_{u}). T_{k} refers to what the neural network has already learned. The corresponding dataset D_{k,T−1} is a combination of the datasets that were stored in the novelty and the familiarity buffers during the T−1th update. Note that the novelties in D_{k,T−1} have already been learned before the Tth update. Therefore, they belong to the known task. Comparably, T_{u} refers to what the network will learn at the Tth update. The corresponding D_{u,T} refers only to the dataset stored in the novelty buffer for triggering the Tth update. Chronologically, the T−1th update occurs before the Tth update. From Eq. 1, we can get the posterior after learning T_{u}, as:
$$ P\left(\Theta \mid D_{k,T-1},D_{u,T}\right)=\frac{P\left(D_{u,T} \mid \Theta\right)P\left(\Theta \mid D_{k,T-1}\right)}{P\left(D_{u,T} \mid D_{k,T-1}\right)}. $$(13)
By taking the logarithm of Eq. 13, we get
$$ \log P\left(\Theta \mid D_{k,T-1},D_{u,T}\right)=\log P\left(D_{u,T} \mid \Theta\right)+\log P\left(\Theta \mid D_{k,T-1}\right)-\log P\left(D_{u,T} \mid D_{k,T-1}\right). $$(14)
The term log P(D_{u,T}|Θ) is generally tractable through minimizing an MSE loss with respect to Θ on the dataset D_{u,T}. In the case where the trained network performs very well on the known task T_{k}, we have obtained the optimal parameters \(\Theta ^{\ast }_{k,T-1}\), which make the gradient of −log P(Θ|D_{k,T−1}) with respect to Θ equal to 0. Therefore, −log P(Θ|D_{k,T−1}) can be estimated using a 2nd order Taylor series around \(\Theta ^{\ast }_{k,T-1}\):
$$ -\log P\left(\Theta \mid D_{k,T-1}\right)\approx\frac{1}{2}\Delta\left(\Theta\right)^{\mathrm{T}}H\left(\Theta^{\ast}_{k,T-1}\right)\Delta\left(\Theta\right)+\text{const}, $$(15)
where \(\Delta \left (\Theta \right)=\Theta - \Theta ^{\ast }_{k, T-1}\). \(H(\Theta ^{\ast }_{k, T-1})\) is the Hessian of −log P(Θ|D_{k,T−1}) with respect to Θ, evaluated at the optimum \(\Theta ^{\ast }_{k,T-1}\). Furthermore, we can approximate the Hessian as:
$$ H\left(\Theta^{\ast}_{k,T-1}\right)\approx N_{k}\mathbf{F}\left(\Theta^{\ast}_{k,T-1}\right)+H_{prior}\left(\Theta^{\ast}_{k,T-1}\right), $$(16)
where N_{k} is the number of samples in D_{k,T−1}, \(\mathbf {F}(\Theta ^{\ast }_{k,T-1})\) is the empirical Fisher information matrix on the known dataset D_{k,T−1}, and \(H_{{prior}}(\Theta ^{\ast }_{k,T-1})\) is the Hessian of the negative log prior with respect to Θ. EWC estimates the Fisher information matrix in a high-dimensional space as a diagonal matrix, i.e., the non-diagonal elements are zero [4]. The diagonal Fisher information values are denoted by \(F_{k}^{i}\) concerning the ith parameter in the network. Thus the formula of EWC is
$$ \Theta^{\ast}_{k,T}=\underset{\Theta}{\arg\min}\left\{-\log P\left(D_{u,T} \mid \Theta\right)+\frac{1}{2}\sum_{i}\left(\sum_{t=1}^{T-1}\lambda_{t}F_{k,t}^{i}+\lambda_{prior}\right)\left(\theta^{i}-\theta_{k,T-1}^{\ast,i}\right)^{2}\right\}, $$(17)
where the update has been done T−1 times already. D_{u,T} is the dataset stored in the novelty buffer for the unknown task at the Tth update. \(F_{k,t}^{i}\) is the diagonal Fisher information with respect to the parameter i, estimated on the dataset D_{k,t} stored in both buffers after the tth update. \(\theta _{k, T-1}^{\ast,i}\) is the optimal parameter i learned from the dataset D_{k,T−1} after the T−1th update. λ_{t} and λ_{prior} are hyperparameters for EWC. Note that all of the diagonal Fisher information F_{k,t} in Eq. 17, obtained after every update, must be stored.
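The effect of the quadratic EWC penalty can be sketched in a few lines. The code assumes the diagonal form ½ Σ_i (Σ_t λ_t F_{k,t}^{i} + λ_{prior})(θ^{i} − θ^{∗,i})^{2}, which is an interpretation of the description above rather than a verified copy of the authors' implementation; all names and numbers are hypothetical.

```python
def ewc_penalty(theta, theta_star, fishers, lambdas, lambda_prior):
    """Quadratic penalty with a diagonal Fisher approximation.

    theta, theta_star: current and previously optimal parameters.
    fishers: one list of diagonal Fisher values per earlier update.
    lambdas: one weight per earlier update.
    """
    penalty = 0.0
    for i, (th, th_star) in enumerate(zip(theta, theta_star)):
        precision = lambda_prior + sum(
            lam * fisher[i] for lam, fisher in zip(lambdas, fishers)
        )
        penalty += 0.5 * precision * (th - th_star) ** 2
    return penalty


theta_star = [1.0, -2.0]
fishers = [[4.0, 0.0], [2.0, 0.1]]  # parameter 0 mattered in both updates
lambdas = [1.0, 1.0]

# Move each parameter by the same amount and compare the cost.
p_important = ewc_penalty([1.5, -2.0], theta_star, fishers, lambdas, 0.0)
p_unimportant = ewc_penalty([1.0, -1.5], theta_star, fishers, lambdas, 0.0)
```

Changing a parameter with large accumulated Fisher information is penalized far more strongly than changing one that earlier data hardly depended on, which is how the regularizer protects old knowledge.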
Online-EWC
The regularization term in EWC can be replaced by one Gaussian approximation to the whole posterior of all previous tasks, as proven in [26]. Based on this derivation, Online-EWC is proposed in [27], as:
$$ \Theta^{\ast}_{k,T}=\underset{\Theta}{\arg\min}\left\{-\log P\left(D_{u,T} \mid \Theta\right)+\frac{1}{2}\sum_{i}\widetilde{F}_{k,T-1}^{i}\left(\theta^{i}-\theta_{k,T-1}^{\ast,i}\right)^{2}\right\}, $$(18)
where \(\widetilde {F}_{k,T}^{i} = \gamma \widetilde {F}_{k,T-1}^{i} + F_{k,T}^{i}\) and \(\widetilde {F}_{k,1}^{i}=F_{k,1}^{i}\). γ is a hyperparameter that governs the contribution of previous tasks and is not larger than 1. Therefore, \(\widetilde {F}_{k,T-1}^{i}\) contains the diagonal Fisher information of all updates which precede the Tth update.
In contrast to EWC, Online-EWC requires less space because it does not have to store the diagonal Fisher information and the optima of each previous task separately. Another similar algorithm is EWC++ introduced in [9], where the diagonal Fisher information is defined as: \(\widetilde {F}_{k,T}^{i}=\alpha \widetilde {F}_{k, T-1}^{i} + (1-\alpha)F_{k,T}^{i}\) with 0<α<1. In our experiments, Online-EWC is selected for updates within the CLeaR framework.
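The two running estimates of the diagonal Fisher information can be written down directly from the recursions above. The function names and the toy Fisher values are hypothetical.

```python
def online_ewc_fisher(f_running, f_new, gamma):
    """Online-EWC recursion: gamma * running estimate + new estimate."""
    return [gamma * r + n for r, n in zip(f_running, f_new)]


def ewc_pp_fisher(f_running, f_new, alpha):
    """EWC++ recursion: moving average with 0 < alpha < 1."""
    return [alpha * r + (1 - alpha) * n for r, n in zip(f_running, f_new)]


# Diagonal Fisher estimates from three consecutive updates (toy values).
per_update = [[1.0, 0.5], [0.2, 0.4], [0.3, 0.1]]

running = per_update[0]  # the first running estimate equals the first Fisher
for fisher in per_update[1:]:
    running = online_ewc_fisher(running, fisher, gamma=0.5)

averaged = ewc_pp_fisher(per_update[0], per_update[1], alpha=0.9)
```

Only one vector per parameter set has to be kept in memory in both variants, independent of the number of updates.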
Trigger condition
The trigger condition for updating the model is another customizable parameter. In our realization of the framework, the update is triggered when the finite novelty buffer is filled. The smaller the size of this novelty buffer, the more easily the buffer gets full. On the one hand, less data is then available in a full buffer for updating, which might lead to a failed update. On the other hand, more updates result in increasing computational overhead. Moreover, note that the hyperparameter γ of Online-EWC is not larger than 1, which causes a gradual decay of the previous Fisher information as the number of updates increases. If the size of the buffer is too large, updating the network is delayed until the buffer is full. The delay also means that the threshold cannot be adjusted frequently.
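The interplay of buffers, trigger condition, and threshold can be put together in one loop. This is a minimal sketch under stated assumptions: `MeanModel` is a hypothetical stand-in for the forecasting network whose `update` merely refits a constant (no continual learning method is applied), and the threshold is re-estimated as α times the mean squared error over both buffers.

```python
from statistics import fmean


class MeanModel:
    """Stand-in for the forecasting network: predicts one constant value."""

    def __init__(self, value):
        self.value = value

    def predict(self, x):
        return self.value

    def update(self, samples):
        # Placeholder for a continual learning update such as Online-EWC.
        self.value = fmean([y for _, y in samples])


def run_stream(model, stream, threshold, alpha, buffer_size):
    """Process (x, y) pairs; update whenever the novelty buffer is full."""
    novelty, familiarity, n_updates = [], [], 0
    for x, y in stream:
        error = (y - model.predict(x)) ** 2
        (novelty if error > threshold else familiarity).append((x, y))
        if len(novelty) == buffer_size:  # trigger condition
            model.update(novelty)
            errors = [
                (yy - model.predict(xx)) ** 2 for xx, yy in novelty + familiarity
            ]
            threshold = alpha * fmean(errors)  # assumed re-estimation rule
            novelty, familiarity = [], []      # empty both buffers
            n_updates += 1
    return n_updates, threshold


model = MeanModel(0.0)
stream = [((t,), 0.0) for t in range(5)] + [((t,), 1.0) for t in range(5, 9)]
n_updates, threshold = run_stream(
    model, stream, threshold=0.01, alpha=2.0, buffer_size=3
)
```

Because the stand-in model simply overwrites its constant, its error on the familiarity buffer jumps after the update. This is the forgetting that a regularized update is meant to limit.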
Experiments
In this section, two sets of experiments will be performed and analyzed. An unlabeled artificial dataset is generated for experiment 1, where we analyze the effects of two hyperparameters of CLeaR, the novelty buffer size and the threshold factor. We implement experiment 2 on a labeled dataset to predict wind power generation measured in 10 European wind farms [28] using supervised learning. In this experiment, we assess the CLeaR framework’s performance in wind power prediction applications. Both experiments will adopt three evaluation metrics: fitting error, prediction accuracy, and forgetting ratio. The remainder of the section will introduce the datasets, the experimental setup, the evaluation criteria, and analyze the results.
Dataset
The artificial dataset is a seven-dimensional, periodic, unlabeled dataset, i.e., \(D=\left [ \mathbf {x}_{1},\dots,\mathbf {x}_{7} \right ]\). The underlying generation model is shown as follows:
$$ x_{n}(t)\sim\mathcal{N}\left(\mu_{n}(t),\sigma_{n}^{2}(t)\right),\quad \mu_{n}(t)=A_{m}\left|\sin\left(\frac{2\pi t}{T}+p_{n}\right)\right|,\quad \sigma_{n}^{2}(t)=A_{v}\left|\sin\left(\frac{2\pi t}{T}+p_{n}\right)\right|, $$(19)
where x_{n}(t) denotes the tth value of the sequence x_{n} and is sampled from a Gaussian distribution with a time-dependent mean and variance. The mean and variance are taken from the absolute value of a sinusoidal function at the tth point in time with a period T and a phase p_{n}. The phase p_{n} is randomly drawn from a standard normal distribution. The hyperparameters A_{m} and A_{v} refer to the amplitudes of the sinusoidal function for the mean and the variance, respectively.
In order to generate a dataset containing daily and yearly periodical changes, we adopt two generation models, \(x_{n}^{d}(t)\) and \(x_{n}^{y}(t)\), with two periods T^{d} and T^{y} respectively. The period T^{d} is 24 to simulate the fluctuation within one day (24 hours), and the period T^{y} of \(x_{n}^{y}(t)\) is 8760 to mimic the periodic changes over a year (8760 hours). The resulting time series data is a sum of the two models, i.e., \(x_{n}(t) = x_{n}^{d}(t)+x_{n}^{y}(t)\).
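A generator along these lines could be sketched as follows. Several details are assumptions made for illustration only: the sinusoid argument 2πt/T + p_{n}, the amplitude values, the random seed, and the use of the same phase for the daily and the yearly component of a sequence.

```python
import math
import random


def periodic_gaussian(t, period, phase, amp_mean, amp_var, rng):
    """Draw one value whose mean and variance follow |sin| with the given period."""
    s = abs(math.sin(2 * math.pi * t / period + phase))
    return rng.gauss(amp_mean * s, math.sqrt(amp_var * s))


def generate_dataset(n_steps, n_features=7, amp_mean=1.0, amp_var=0.05, seed=0):
    """Return n_features sequences, each the sum of a daily (period 24)
    and a yearly (period 8760) component."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_features):
        phase = rng.gauss(0.0, 1.0)  # phase drawn from a standard normal
        dataset.append([
            periodic_gaussian(t, 24, phase, amp_mean, amp_var, rng)
            + periodic_gaussian(t, 8760, phase, amp_mean, amp_var, rng)
            for t in range(n_steps)
        ])
    return dataset


data = generate_dataset(48, seed=1)
```

Fixing the seed makes the generated sequences reproducible, which is convenient when hyperparameter settings are compared on the same stream.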
Figure 6 displays the sequence x_{1} (marked in red) and the corresponding generating process (marked in black) in three different time slots. Note that the first 48 samples of Fig. 6(b) are displayed in Fig. 6(a). Similarly, the first 720 samples of Fig. 6(c) are also shown in Fig. 6(b). The periodic statistical fluctuation displayed in Fig. 6 can simulate features of nonstationary data, such as temperature, which usually peaks at noon and falls at night.
The wind power dataset [28] contains seven meteorological features and hourly averaged wind power generation data measured at European wind farms for two consecutive years. The features are 24-hour-ahead meteorological forecasts from the European Centre for Medium-Range Weather Forecasts model [29], including (1) wind speed at 100 m height, (2) wind speed at 10 m height, (3) wind direction (zonal) at 100 m height, (4) wind direction (meridional) at 100 m height, (5) air pressure, (6) air temperature, and (7) humidity.
The power generation time series are normalized by the respective rated capacity of the wind farm for easy, scalefree comparison. All features are normalized to the range between 0 and 1. When no power has been generated longer than 24 hours, time points are removed in the preprocessing phase. The dropped time points can be viewed as exceptions that need to be processed manually in realworld applications.
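The preprocessing steps above can be sketched as follows. The function and argument names (`preprocess`, `power_kw`, `rated_capacity_kw`) are hypothetical, and two details are assumptions rather than statements from the article: "no power" is interpreted as an exact zero, and the features are scaled after the exceptional time points have been dropped.

```python
import numpy as np

def min_max_normalize(features):
    """Scale each feature column to the range [0, 1]."""
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant columns
    return (features - lo) / span

def long_zero_run_mask(power, max_run=24):
    """Return a boolean mask that is False inside zero-power runs longer than max_run."""
    keep = np.ones(len(power), dtype=bool)
    run_start = None
    # The appended non-zero sentinel closes a run that reaches the end of the series.
    for i, value in enumerate(np.append(power, 1.0)):
        if value == 0.0 and run_start is None:
            run_start = i
        elif value != 0.0 and run_start is not None:
            if i - run_start > max_run:
                keep[run_start:i] = False
            run_start = None
    return keep

def preprocess(features, power_kw, rated_capacity_kw):
    """Normalize power by rated capacity, drop long outages, scale features."""
    power = power_kw / rated_capacity_kw
    keep = long_zero_run_mask(power)
    return min_max_normalize(features[keep]), power[keep]
```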
Both datasets are available on our department website [28] or by contacting the corresponding author. Researchers can reimplement our experiments with the datasets.
Experimental setup
The artificial dataset and the wind power dataset are each split into three subsets according to three phases: a warm-up phase, an update phase, and an evaluation phase. In the warm-up phase, the model is pre-trained on the first 1000 samples, which contain only partial information that can describe the current task but not future tasks. In the update phase, the model monitors a data stream consisting of the following 10000 samples, simulating a theoretically infinite data stream in a real application scenario. Here, we assume that 10000 samples provide enough information to describe the real distribution of the samples because they span 10000 hours and thus cover a complete yearly period (8760 hours). Once updating is triggered, the model is retrained on the novelty buffer and validated on the familiarity buffer. The updated model is finally evaluated in the evaluation phase on the following 1000 samples.
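As a small illustration of this chronological split, the helper below (a hypothetical name, not part of the published framework) slices a time-ordered array into the three phases with the sample counts given above.

```python
import numpy as np

def split_phases(samples, n_warm_up=1000, n_update=10000, n_eval=1000):
    """Split a time-ordered array into warm-up, update, and evaluation subsets."""
    total = n_warm_up + n_update + n_eval
    if len(samples) < total:
        raise ValueError(f"need at least {total} samples, got {len(samples)}")
    warm_up = samples[:n_warm_up]
    update = samples[n_warm_up:n_warm_up + n_update]
    evaluation = samples[n_warm_up + n_update:total]
    return warm_up, update, evaluation

# Toy stream with 12000 hourly samples and 7 features.
stream = np.zeros((12000, 7))
warm_up, update, evaluation = split_phases(stream)
print(len(warm_up), len(update), len(evaluation))
```

No shuffling is applied, so the order of the samples in time is preserved across the three phases.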
We implement the following three CLeaR instances and a baseline model to compare their performances in experiments:

Instance _{A}: Its models are trained on the warm-up dataset in supervised learning mode and then directly evaluated in the evaluation phase without updating. The experimental results reflect the disadvantages caused by the lack of samples. Instance _{A} is viewed as a lower bound of the models’ performance.

Instance _{B}: Its models are pre-trained in the warm-up phase, as in Instance _{A}. The models are then updated in the update phase by fine-tuning. Fine-tuning allows the pre-trained models to learn a new dataset, which they were not originally trained on, by slightly adjusting all unfrozen parameters. The retraining process is monitored by early stopping with a patience of 30 epochs to avoid overfitting, i.e., retraining stops if the loss has not decreased for 30 consecutive epochs. The optimal models are the ones with the lowest loss before stopping. Likewise, the updated Instance _{B} is evaluated in the evaluation phase after the update phase.

Instance _{C}: It is similar to Instance _{B} but is updated with Online EWC and is not monitored by early stopping. Like the other models, it is evaluated in the evaluation phase.

Baseline model: It is a common deep neural network model with the same structure as the neural network models in the above CLeaR instances. The baseline model is trained in the traditional way on 11000 samples (i.e., the samples of the warm-up phase and the update phase). Dropout layers with a dropout rate of 0.2 are used during training to avoid overfitting. It is also evaluated in the evaluation phase.
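The early-stopping rule used for Instance _{B} above can be sketched framework-independently. The class below is a hypothetical helper, not the authors' implementation: it tracks the lowest loss seen so far, keeps a copy of the corresponding model state, and signals a stop after 30 epochs without improvement. Which loss is monitored (for example, the loss on the familiarity buffer) is left open here.

```python
import copy

class EarlyStopping:
    """Stop training when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.epochs_without_improvement = 0

    def step(self, loss, state):
        """Record one epoch and return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_state = copy.deepcopy(state)   # snapshot of the best model so far
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Toy loss curve: improves for 10 epochs, then stays at a worse plateau.
losses = [1.0 / (epoch + 1) for epoch in range(10)] + [0.2] * 100
stopper = EarlyStopping(patience=30)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss, state={"epoch": epoch}):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_state)
```

In a real training loop, `state` would be the model parameters, and the best state would be restored after stopping.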
Two sets of experiments are conducted on the artificial dataset and the wind power dataset, respectively.

Experiment 1 is based on the artificial dataset and analyzes the correlation between the CLeaR framework’s performance and two framework-related hyperparameters, i.e., the novelty buffer size and the threshold factor. This unsupervised experiment aims to extract latent representations and reconstruct the inputs. Instance _{C} (with only an autoencoder involved) and the baseline model (an autoencoder) are implemented.
The grid search range of the framework’s hyperparameters is shown in Table 2, including 56 available pairs of parameter values. The experiment with each pair of parameters is repeated 20 times with different initializations of the autoencoder. Likewise, the training of the baseline model is repeated 20 times. The architecture of the autoencoder and the training setting for experiment 1 are shown in Tables 3 and 4.

Experiment 2 aims to evaluate all three CLeaR instances and the baseline model in the real-world application of wind power generation forecasting based on datasets from 10 European wind farms. The adopted model consists of an autoencoder and a deep neural network, as proposed in Fig. 4. Its architecture parameters are empirically selected and kept identical for a fair comparison, see Table 5. The parameters for the framework and the Online EWC algorithm are selected using grid search from the range in Table 6. The training setting for experiment 2 is shown in Table 7.
Metrics
Models are evaluated in terms of fitting error, prediction error, and forgetting ratio.
Fitting error indicates how well the instance fits all seen samples after the update phase. Updating a CLeaR instance with mini-batch data might lead to a local minimum during the updating process, so that the instance eventually fits only a specific subset rather than all seen data. Such effects are measured by calculating the MSE on the 11000 seen samples, 1000 of which are from the warm-up phase and the remaining 10000 from the update phase. Therefore, a lower fitting error reflects that more knowledge has been accumulated.
Prediction error reflects the ability of CLeaR instances to perform predictions on previously unseen data. It is calculated on the 1000 samples of the evaluation phase. Owing to the powerful learning ability of neural networks, overfitting, where the test error is much larger than the training error, often occurs. Regularization techniques, such as dropout and early stopping, are used to avoid overfitting during training. The prediction error can tell us whether the instance has fallen into a local optimum after several updates. Also, the comparison between Instance _{A} and Instance _{C} in experiment 2 shows how important accumulating data and updating the models are for improving prediction accuracy, especially when the pre-training dataset is limited.
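The two error metrics differ only in the samples they are computed on, which a short sketch makes explicit. The function names are hypothetical; the inputs are assumed to be time-ordered targets and model outputs over all 12000 samples, computed after the update phase.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all elements."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def fitting_and_prediction_error(y_true, y_pred, n_seen=11000, n_eval=1000):
    """Fitting error on the seen samples, prediction error on the evaluation samples."""
    fitting = mse(y_true[:n_seen], y_pred[:n_seen])
    prediction = mse(y_true[n_seen:n_seen + n_eval], y_pred[n_seen:n_seen + n_eval])
    return fitting, prediction

# Toy example: constant error of 1 on seen data and 2 on unseen data.
y_true = np.zeros(12000)
y_pred = np.concatenate([np.ones(11000), 2.0 * np.ones(1000)])
fitting, prediction = fitting_and_prediction_error(y_true, y_pred)
print(fitting, prediction)
```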
Forgetting ratio measures how much old knowledge a model forgets after learning new tasks. In [22], He et al. compared the forgetting ratio to average test error and demonstrated that the forgetting ratio could reflect the severity of the forgetting problem. The formula is
where \(L_{warm\_up}^{1}\) indicates the MSE on the warm-up dataset at the end of the warm-up phase, \(L_{warm\_up}^{2}\) indicates the MSE on the same dataset at the end of the update phase, and max(x_{1},x_{2}) returns the larger of x_{1} and x_{2}. The increase of the error on the same task describes the model’s forgetfulness after learning new tasks. The comparison is performed only between Instance _{B} and Instance _{C} in experiment 2 because Instance _{A} and the baseline model are not updated.
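To make the role of the max operator concrete, the sketch below implements one plausible form of such a metric: the relative increase of the warm-up error, clipped at zero so that an improved error does not count as negative forgetting. This exact form is an assumption for illustration and should be checked against the definition in [22].

```python
def forgetting_ratio(loss_before, loss_after):
    """Relative increase of the warm-up MSE after the update phase, clipped at zero.

    Assumed form (not taken from the article): max(0, L2 - L1) / L1, where
    loss_before is L1 (end of warm-up) and loss_after is L2 (end of update).
    """
    if loss_before <= 0:
        raise ValueError("loss_before must be positive")
    return max(0.0, loss_after - loss_before) / loss_before

print(forgetting_ratio(0.02, 0.03))   # the warm-up error grew after updating
print(forgetting_ratio(0.02, 0.01))   # the warm-up error shrank, so nothing was forgotten
```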
Results of the experiment 1
Figure 7 illustrates the correlation between the framework’s hyperparameters (novelty buffer size and threshold factor) and the evaluation metrics (fitting error, prediction error, and forgetting ratio) based on the 1120 results obtained in experiment 1 (56 hyperparameter pairs, each repeated 20 times). The lower the evaluation metric is, the better the model performs. Note that the binary logarithms of the fitting error and the prediction error are plotted in Fig. 7. Moreover, the binary logarithms of the best fitting error and the best prediction error of the baseline model are −4.82 and −6.58, respectively.
Regarding the fitting error, 1089 of the 1120 CLeaR models (97.2%) perform better than the best baseline model. This result indicates that the CLeaR framework can accumulate knowledge continually and effectively. Besides, it shows that the CLeaR framework’s continual fitting ability is relatively robust to both framework-related hyperparameters.
Regarding the prediction error, 360 of the 1120 CLeaR models (32.1%) outperform the best baseline model. Of these 360 models, 253 have a threshold factor greater than or equal to 0.95, and 159 of those 253 have a novelty buffer size lower than or equal to 1000. Furthermore, 455 CLeaR models (40.6%) perform worse than the worst baseline model, whose binary-logarithm prediction error is −5.84. Of these 455 models, 267 have a threshold factor lower than or equal to 0.95, and 203 of those 267 have a novelty buffer size greater than or equal to 1200. On the one hand, the experimental results are in accordance with our expectation, i.e., continual accumulation of meaningful knowledge can improve neural networks’ prediction abilities. To a certain extent, the CLeaR framework can even predict non-stationary data more accurately than a neural network trained with sufficient historical samples. On the other hand, our findings indicate that the CLeaR framework’s prediction ability is sensitive to its hyperparameter values. In the update phase, the 360 models that outperform the best baseline model are updated 10.2 times on average, while the 455 models that perform worse than the worst baseline model are updated 6.87 times on average. A smaller novelty buffer fills up more quickly, so updating is triggered more frequently. Besides, a higher threshold means that only high-entropy data is stored in the novelty buffer. We conclude that timely learning of high-entropy data can effectively improve the prediction accuracy of neural networks for non-stationary data.
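The interplay between the novelty buffer size, the threshold, and the update frequency can be illustrated with a deliberately simplified simulation. It assumes that a sample counts as a novelty when its error exceeds a fixed threshold and that the buffer is emptied after each update; how the threshold is derived from the threshold factor is not modeled, and the errors do not react to updates as they would in the real framework.

```python
def count_updates(errors, threshold, novelty_buffer_size):
    """Count how often a full novelty buffer would trigger an update."""
    buffer_fill = 0
    updates = 0
    for err in errors:
        if err > threshold:                 # sample is treated as a novelty
            buffer_fill += 1
        if buffer_fill >= novelty_buffer_size:
            updates += 1                    # retraining would happen here
            buffer_fill = 0                 # buffer is emptied after the update
    return updates

# Toy stream of 10000 per-sample errors alternating between 0.0 and 1.0.
errors = [0.0, 1.0] * 5000
small_buffer = count_updates(errors, threshold=0.5, novelty_buffer_size=500)
large_buffer = count_updates(errors, threshold=0.5, novelty_buffer_size=1000)
high_threshold = count_updates(errors, threshold=1.5, novelty_buffer_size=500)
print(small_buffer, large_buffer, high_threshold)
```

With the same stream, halving the buffer size doubles the number of updates, while a threshold above all observed errors prevents updating altogether.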
We only analyze the forgetting ratio of the CLeaR models because the forgetting problem does not occur with the baseline model. 174 of the 1120 CLeaR models (15.5%) obtain a forgetting ratio greater than 0.1, and 172 of these 174 models have a novelty buffer size lower than or equal to 1000. Combined with the conclusion concerning the prediction error, we find that a smaller novelty buffer triggers updating more often, which can decrease the prediction error but also increase the forgetting ratio. The main reason may be that the hyperparameter γ of Online EWC is set to 0.9 (see Table 4), which leads to a decay of the Fisher information regarding the previous tasks after each update. Therefore, how to adjust the hyperparameters and how to supervise the updating process will be one of the key points in our further research. Otherwise, the framework will always face a trade-off between prediction error and forgetting ratio.
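The decay effect can be made concrete with a small numerical sketch. It assumes the running Fisher update of Online EWC as described in [27], in which the accumulated Fisher information is multiplied by γ before the estimate for the newest task is added; the Fisher values themselves are toy numbers.

```python
import numpy as np

def update_running_fisher(running_fisher, new_fisher, gamma=0.9):
    """One Online EWC step: decay the accumulated Fisher, then add the new estimate."""
    return gamma * running_fisher + new_fisher

gamma = 0.9
n_params = 4
running = np.ones(n_params)          # toy Fisher estimate after the warm-up task
first_task_share = []
for k in range(1, 11):               # ten updates during the update phase
    running = update_running_fisher(running, np.ones(n_params), gamma)
    # Weight with which the warm-up Fisher still contributes after k updates.
    first_task_share.append(gamma ** k)

print(round(first_task_share[-1], 3))
```

After ten updates, the warm-up task contributes with a weight of only 0.9^10, roughly 0.35, so frequent updating weakens the protection of the oldest knowledge.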
Figure 7 is plotted using HiPlot [30]. The raw data of the results is available by contacting the corresponding author.
Results of the experiment 2
First, we analyze the fitting errors of the three CLeaR instances and the baseline model on the 11000 samples of the first two phases, see Tables 8 and 9. Table 8 shows the fitting errors for the weather-to-weather data, i.e., the outputs of the autoencoders, and Table 9 presents the corresponding results for the weather-to-power data, i.e., the outputs of the predictors. WF is the abbreviation of wind farm. The values in the last row of the tables are the averages over the 10 wind farms. The best results among the three CLeaR instances are marked in bold. Compared to the baseline model, Instance _{C} decreases the fitting error in terms of both input reconstruction and power prediction, which is similar to our observation in experiment 1. We can also observe that the average fitting error of Instance _{B} is slightly lower than that of Instance _{C} in Table 8 but clearly higher in Table 9. We infer that catastrophic forgetting happens in Instance _{B} because it applies fine-tuning, which only adapts the model to new situations. The high forgetting ratio in Table 11 also supports the existence of the forgetting problem.
Table 10 shows the results of power prediction, where the Instance _{C} clearly outperforms the other two instances. The baseline model results indicate that sufficient training data can enable neural network models to obtain as much meaningful knowledge as possible, which helps to improve the models’ predictive ability.
Figure 8 illustrates the errors of the three instances and the baseline model over the 12000 weather-to-power samples of one wind farm. The errors are calculated after finishing the update phase. We split the samples into 12 sections, and each point in Fig. 8 refers to the MSE of 1000 samples. The average value of the points from 1000 to 11000 equals the fitting error, and the value at a coordinate of 12000 is the prediction error. Compared to the other three lines, the red line is less volatile. Moreover, we can observe that the lines start to rise at a coordinate of 5000 and drop back to the original level at the end. On the one hand, this reflects that the periodic changes lead to these fluctuating curve shapes and influence the mapping between weather and power generation. On the other hand, Tables 9 and 10 and Fig. 8, where Instance _{C} outperforms the other instances, indicate that updating is important for predicting a non-stationary data stream, especially when the model is trained only on a limited dataset.
Table 11 presents the forgetting ratios of Instance _{B} and Instance _{C} calculated after the update phase. According to the average results, we can conclude that Instance _{C} outperforms Instance _{B} here, for both the weather-to-weather task and the weather-to-power task. Moreover, as concluded from Table 9, the existence of the forgetting problem leads to an increase of the fitting error.
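The section-wise evaluation behind Fig. 8 can be sketched as follows. The function name is hypothetical and the data are random toy values; the point of the example is that, with equally sized sections, the mean of the first eleven section errors coincides with the fitting error on the 11000 seen samples.

```python
import numpy as np

def sectionwise_mse(y_true, y_pred, section_size=1000):
    """Return one MSE value per consecutive section of `section_size` samples."""
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    n_sections = len(sq_err) // section_size
    trimmed = sq_err[: n_sections * section_size]
    # Works for 1-D targets and for multi-dimensional outputs alike.
    return trimmed.reshape(n_sections, section_size, -1).mean(axis=(1, 2))

rng = np.random.default_rng(0)
y_true = rng.random(12000)                                 # toy normalized power values
y_pred = y_true + 0.05 * rng.standard_normal(12000)        # toy predictions

sections = sectionwise_mse(y_true, y_pred)
fitting_error = sections[:11].mean()       # points at coordinates 1000 to 11000
prediction_error = sections[11]            # point at coordinate 12000
print(len(sections))
```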
Conclusions
We believe that continual learning will be the key to future machine intelligence. The non-stationary world requires future artificial intelligence to be updated smoothly by taking into account different data distributions while still retaining previously acquired useful knowledge. Therefore, in this article, the proposed CLeaR describes the prototype structure of a continual-learning-based framework. It can be applied to many real-world projects, such as power predictions for smart grids, where prediction models have to mimic humans’ ability to acquire and transfer knowledge incrementally from new data throughout their lifespan.
The framework still needs to be improved in our future research. For example, although the cleaned exceptions are not considered here, it is necessary to define and process such exceptions in a real application. Besides, we only use the MSE to calculate the difference between predictions and measurements. However, this method might not work when the measured values are unavailable, as in unsupervised settings. Therefore, we suggest estimating the uncertainty of new predictions to detect changes in distributions, for example, using Monte Carlo dropout [31]. If the uncertainty is over a threshold, the data can be labeled as novelties. Furthermore, hyperparameters are found by grid search in the context of the current design. It is worth researching whether hyperparameters can be found by transferring relevant knowledge from a similar task dataset. In addition, we suppose that Online EWC can be replaced by (or combined with) other CL algorithms to improve the framework. Novel analysis methods and evaluation metrics for the updated models will also be one of our main research focuses in the future.
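The suggested uncertainty-based novelty detection could look roughly like the following NumPy sketch of Monte Carlo dropout [31]. Everything in it is illustrative: the tiny two-layer network uses random weights instead of a trained model, and `uncertainty_threshold` is a hypothetical value that would have to be tuned for a real application.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, rng, n_passes=100, drop_rate=0.2):
    """Run stochastic forward passes with dropout kept active at inference time."""
    preds = []
    for _ in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)                  # ReLU hidden layer
        mask = rng.random(hidden.shape) >= drop_rate      # keep units with prob. 1 - drop_rate
        hidden = hidden * mask / (1.0 - drop_rate)        # inverted dropout scaling
        preds.append(hidden @ w2)
    preds = np.stack(preds)                               # shape (n_passes, n_samples, 1)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(0)
w1 = rng.standard_normal((7, 32))     # random stand-in for trained weights
w2 = rng.standard_normal((32, 1))
x = rng.random((5, 7))                # five samples with seven normalized features

mean, std = mc_dropout_predict(x, w1, w2, rng)
uncertainty_threshold = 1.0           # hypothetical value
is_novelty = std[:, 0] > uncertainty_threshold
print(is_novelty)
```

The standard deviation over the stochastic passes serves as the uncertainty estimate, so no measured target values are needed to decide whether a sample should enter the novelty buffer.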
In a nutshell, we expect that the framework can be designed as a modular tool like LEGO toys. Each component of the framework is flexible and can be added, removed, replaced, or expanded. It should be possible for researchers and users to adapt the framework to a specific application scenario for achieving their own goals.
Availability of data and materials
The datasets generated and analyzed during the current study are available from https://www.unikassel.de/eecs/ies/downloads or the corresponding author on reasonable request.
Abbreviations
CL: Continual Learning
MSE: Mean Square Error
EWC: Elastic Weight Consolidation
SI: Synaptic Intelligence
LWF: Learning Without Forgetting
OnlineEWC: Online Elastic Weight Consolidation
PCA: Principal Component Analysis
AE: Autoencoder
WF: Wind farm
References
 1
McCloskey M, Cohen NJ. Catastrophic interference in connectionist networks: The sequential learning problem. Psychol Learn Motiv. 1989; 24:109–65.
 2
Ratcliff R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol Rev. 1990; 97(2):285.
 3
Mermillod M, Bugaiska A, Bonin P. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Front Psychol. 2013; 4:504.
 4
Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci. 2017; 114(13):3521–6.
 5
Zenke F, Poole B, Ganguli S. Continual learning through synaptic intelligence. Proc Mach Learn Res. 2017; 70:3987.
 6
Maltoni D, Lomonaco V. Continuous learning in single-incremental-task scenarios. Neural Netw. 2019; 116:56–73.
 7
Farquhar S, Gal Y. A unifying bayesian view of continual learning. arXiv eprints. 2019:arXiv–1902.
 8
Nguyen CV, Li Y, Bui TD, Turner RE. Variational continual learning. In: International Conference on Learning Representations.2018. https://openreview.net/forum?id=BkQqqOgRb.
 9
Chaudhry A, Dokania PK, Ajanthan T, Torr PH. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In: Proceedings of the European Conference on Computer Vision (ECCV): 2018. p. 532–47.
 10
van de Ven GM, Tolias AS. Generative replay with feedback connections as a general strategy for continual learning. arXiv eprints. 2018::arXiv–1809.
 11
Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv eprints. 2013::arXiv–1312.
 12
Srivastava RK, Masci J, Kazerounian S, Gomez F, Schmidhuber J. Compete to compute. In: Advances in Neural Information Processing Systems: 2013. p. 2310–8.
 13
LeCun Y, Cortes C, Burges C. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist. 2010; 2.
 14
Lomonaco V, Maltoni D. Core50: a new dataset and benchmark for continuous object recognition. In: Conference on Robot Learning. PMLR: 2017. p. 17–26.
 15
Gensler A, Henze J, Sick B, Raabe N. Deep learning for solar power forecasting – an approach using autoencoder and lstm neural networks. In: 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC): 2016. p. 002858–65. https://doi.org/10.1109/SMC.2016.7844673.
 16
He Y, Henze J, Sick B. Forecasting power grid states for regional energy markets with deep neural networks. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE: 2020. p. 1–8.
 17
Shin H, Lee JK, Kim J, Kim J. Continual learning with deep generative replay. In: Advances in Neural Information Processing Systems: 2017. p. 2990–9.
 18
Rusu AA, Rabinowitz NC, Desjardins G, Soyer H, Kirkpatrick J, Kavukcuoglu K, Pascanu R, Hadsell R. Progressive neural networks. arXiv eprints. 2016::arXiv–1606.
 19
Li Z, Hoiem D. Learning without forgetting. IEEE Trans Pattern Anal Mach Intell. 2017; 40(12):2935–47.
 20
Jung H, Ju J, Jung M, Kim J. Less-forgetting learning in deep neural networks. arXiv eprints. 2016::arXiv–1607.
 21
Pasquale G, Ciliberto C, Rosasco L, Natale L. Object identification from few examples by improving the invariance of a deep convolutional neural network. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE: 2016. p. 4904–11.
 22
He Y, Henze J, Sick B. Continuous learning of deep neural networks to improve forecasts for regional energy markets. IFAC-PapersOnLine. 2020; 53(2):12175–82.
 23
Farquhar S, Gal Y. Towards robust evaluations of continual learning. arXiv eprints. 2018::arXiv–1805.
 24
Deisenroth MP, Faisal AA, Ong CS. Mathematics for Machine Learning: Cambridge University Press. https://mmlbook.com.
 25
Goodfellow I, Bengio Y, Courville A. Deep Learning: MIT Press. http://www.deeplearningbook.org.
 26
Huszár F. On quadratic penalties in elastic weight consolidation. arXiv eprints. 2017::arXiv–1712.
 27
Schwarz J, Luketina J, Czarnecki WM, Grabska-Barwinska A, Teh YW, Pascanu R, Hadsell R. Progress & compress: A scalable framework for continual learning. In: International Conference on Machine Learning. PMLR: 2018. p. 4528–37.
 28
Gensler A. EuropeWindFarm Data Set. https://www.unikassel.de/eecs/ies/downloads. Accessed 07 July 2021.
 29
ECMWF homepage. https://www.ecmwf.int/. Accessed 07 July 2021.
 30
Haziza D, Rapin J, Synnaeve G. Hiplot, interactive high-dimensionality plots. GitHub. 2020.
 31
Gal Y, Ghahramani Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: International Conference on Machine Learning: 2016. p. 1050–9.
Acknowledgements
Thanks to our colleagues from the Intelligent Embedded Systems group, particularly Mohammad Wazed Ali, Florian Heidecker, and Chandana Priya Nivarthi, for their helpful comments and suggestions.
Funding
This work was supported within the C/sells RegioFlexMarkt Nordhessen (03SIN119) project and the DigitalTwinSolar (03EI6024E) project, funded by BMWi: Deutsches Bundesministerium für Wirtschaft und Energie/German Federal Ministry for Economic Affairs and Energy. Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors’ contributions
YH wrote the majority of the manuscript and was responsible for implementing CLeaR and the experiments. BS suggested the artificial data and the first experiment and was responsible for double-checking the manuscript. All authors read and approved the final manuscript.
Authors’ information
Not applicable.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
He, Y., Sick, B. CLeaR: An adaptive continual learning framework for regression tasks. AI Perspect 3, 2 (2021). https://doi.org/10.1186/s42467021000098
Keywords
 Continual learning
 Renewable energy forecasts
 Regression tasks
 Deep neural networks