If you see this, something is wrong

Collapse and expand sections

To get acquainted with the document, the best thing to do is to select the "Collapse all sections" item from the "View" menu. This will leave visible only the titles of the top-level sections.

Clicking on a section title toggles the visibility of the section content. If you have collapsed all of the sections, this will let you discover the document progressively, from the top-level sections to the lower-level ones.

Cross-references and related material

Generally speaking, anything that is blue is clickable.

Clicking on a reference link (like an equation number, for instance) will display the reference as close as possible, without breaking the layout. Clicking on the displayed content or on the reference link hides the content. This is recursive: if the content includes a reference, clicking on it will have the same effect. These "links" are not necessarily numbers, as it is possible in LaTeX2Web to use full text for a reference.

Clicking on a bibliographical reference (i.e., a number within brackets) will display the reference.

Speech bubbles indicate a footnote. Click on the bubble to reveal the footnote (there is no page in a web document, so footnotes are placed inside the text flow). Acronyms work the same way as footnotes, except that you have the acronym instead of the speech bubble.

Discussions

By default, discussions are open in a document. Click on the discussion button below to reveal the discussion thread. However, you must be registered to participate in the discussion.

If a thread has been initialized, you can reply to it. Any modification to any comment, or a reply to it, in the discussion is signified by email to the owner of the document and to the author of the comment.

First published on Monday, Nov 25, 2024 and last modified on Thursday, Apr 10, 2025 by François Chaplais.

Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation

arXiv
Published version: 10.48550/arXiv.2403.14905

Chengxi Li Division of Information Science and Engineering, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 10044 Stockholm, Sweden. Email

Ming Xiao Division of Information Science and Engineering, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 10044 Stockholm, Sweden. Email

Mikael Skoglund Division of Information Science and Engineering, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 10044 Stockholm, Sweden. Email

Keywords: Adaptive processing, coded computing, federated learning, mutual information differential privacy, stragglers.

Abstract

In this article, we address the problem of federated learning in the presence of stragglers. For this problem, a coded federated learning framework has been proposed, where the central server aggregates gradients received from the non-stragglers and gradient computed from a privacy-preservation global coded dataset to mitigate the negative impact of the stragglers. However, when aggregating these gradients, fixed weights are consistently applied across iterations, neglecting the generation process of the global coded dataset and the dynamic nature of the trained model over iterations. This oversight may result in diminished learning performance. To overcome this drawback, we propose a new method named adaptive coded federated learning (ACFL). In ACFL, before the training, each device uploads a coded local dataset with additive noise to the central server to generate a global coded dataset under privacy preservation requirements. During each iteration of the training, the central server aggregates the gradients received from the non-stragglers and the gradient computed from the global coded dataset, where an adaptive policy for varying the aggregation weights is designed. Under this policy, we optimize the performance in terms of privacy and learning, where the learning performance is analyzed through convergence analysis and the privacy performance is characterized via mutual information differential privacy. Finally, we perform simulations to demonstrate the superiority of ACFL compared with the non-adaptive methods.

1 Introduction

In intelligent systems, edge devices are generating an increasing amount of data, which can be utilized to train machine learning models for various tasks across a wide range of applications, such as smart cities [1], autonomous driving [2], and intelligent healthcare [3]. Traditionally, these data are uploaded to a central server for centralized machine learning. However, uploading datasets directly to the central server without privacy-preservation measures raises severe privacy concerns for the edge devices [4]. As an alternative to centralized machine learning, federated learning (FL) has emerged as a highly effective tool [5, 6, 7]. Under the FL framework, during each iteration of the training process, local devices compute gradients with their local datasets and send these gradients to a central server. The central server then aggregates the received gradients to update the global model [8]. In this manner, the edge devices only share locally computed gradients with the central server, significantly enhancing the privacy of raw datasets.

In FL, the training process may be hindered by devices known as stragglers, which are significantly slower than others or completely unresponsive due to various incidents [9, 10]. Additionally, in FL, the local datasets on different devices can be highly heterogeneous, meaning they are not independent and identically distributed (non-i.i.d.) across devices [11, 12]. In this case, the presence of stragglers can bias the trained global model due to the absence of the contributions of some devices in each iteration. In other words, stragglers can negatively impact the learning performance of FL, potentially leading to poor convergence. In traditional distributed learning scenarios, the challenges posed by stragglers can be addressed through well-known gradient coding (GC) strategies [13, 14, 15, 16, 17, 18, 19]. The core concept of GC is replicating the training data samples and distributing them redundantly across the devices prior to training. During each training iteration, non-straggler devices encode the locally computed gradients and transmit the encoded vectors to the central server. Upon receiving these vectors, the central server is able to reconstruct the global gradient, effectively mitigating the impact of stragglers. However, in FL scenarios, privacy concerns make it impractical to adopt GC techniques and to allocate training data to devices redundantly in a carefully planned manner, as is done in traditional distributed learning scenarios. This is because devices are reluctant to share their local datasets directly with others to create data redundancy considering that their local datasets may contain sensitive or private information [20].

For FL problems, to evade the negative impact of stragglers, a new paradigm named coded federated learning (CFL) has been investigated [21, 22, 23, 24]. In CFL, each device uploads a coded version of its dataset to the central server to form a global coded dataset before training begins. During the training process, the central server aggregates gradients received from the non-stragglers and gradient computed from the global coded dataset to mitigate the negative impact of the stragglers. Compared with sharing raw datasets as in GC, by carefully designing the way of encoding local datasets in CFL, privacy concerns can be largely reduced, and one can even achieve perfect privacy. However, it is intuitive that encoding the local datasets induces information loss and this may negatively impacts the learning performance. Based on that, the key issue in CFL is how to generate the global coded dataset before training, as well as how to aggregate the gradients at the central server during the training process, in order to achieve better performance in terms of learning and privacy.

There has been some recent progress under the framework of CFL. To be more specific, CFL was firstly investigated in [21] for linear regression problems under the FL setting, where local coded datasets are constructed at the devices by applying random linear projections. In [22], the CFL scheme was combined with distributed kernel embedding to make CFL applicable for nonlinear FL problems. Later, to further enhance the privacy performance of CFL, a differentially private CFL method was proposed by adding noise to the local coded datasets, where the differential privacy guarantee was proved [23]. Very recently, a stochastic coded federated learning (SCFL) method has been proposed [24]. In SCFL, before the training starts, the devices generate and upload privacy-preservation coded datasets to the central server by adding noise to the projected local datasets, which are used by the central server to generate the global coded dataset. In each iteration of the training process, the central server aggregates gradients received from the non-stragglers and gradient computed from the global coded dataset with fixed aggregation weights. Compared with previous methods, the advantages of SCFL lie in ensuring that the global gradient applied by the central server to update the global model is an unbiased estimation of the true gradient, and providing the trade off between the privacy performance and the learning performance analytically. However, in SCFL, when the central server aggregates the gradients, it consistently adopts pre-specified and fixed weights across iterations, which neglects the generation process of the global coded dataset and the dynamic nature of the trained model over iterations. This oversight results in an unnecessary performance loss, preventing SCFL from achieving the optimal performance in terms of privacy and learning.

For FL under requirements of privacy preservation and straggler mitigation, to overcome the drawbacks of existing CFL methods as mentioned above, we propose a new method named adaptive coded federated learning (ACFL). In ACFL, before training, each device uploads a local coded dataset, which is a transformed version of its local dataset with added random noise. Upon receiving local coded datasets from the devices, the central server generates the global coded dataset. Then, during each training iteration, each non-straggler device computes local gradient based on the current global model and its local dataset, subsequently transmitting this gradient to the central server. After receiving local gradients from the non-straggler devices, the central server estimates the global gradient by aggregating the received gradients and the gradient computed from the global coded dataset, applying aggregation weights determined by an adaptive policy. Under this policy, the performance in terms of privacy and learning is optimized, where the learning performance is described by convergence analysis and the privacy performance is measured by mutual information differential privacy (MI-DP). Finally, in order to verify the superiority of the proposed method, simulations results are provided, where we compare the performance of ACFL and the non-adaptive methods.

Our contributions are summarized as follows:

We propose a new ACFL method designed for FL problems under requirements of straggler mitigation and privacy preservation. This method develops an adaptive policy for adjusting the aggregation weights at the central server during gradient aggregation, aiming to achieve the optimal performance in terms of privacy and learning.
We analytically analyze the trade off between learning and privacy performance of the proposed method.
Through simulations, we demonstrate that the proposed method surpasses the non-adaptive methods, which attains better learning performance while maintaining the same level of privacy concerns.

The rest of this paper is structured as follows. In Section 2, we formulate the considered problem. In Section 3, we propose our ACFL method. In Section 4, we analyze the performance of ACFL theoretically and describes the adaptive policy adopted by the proposed method. Simulation results are provided in Section 5 to verify the superiority of the proposed method. Finally, we conclude this paper in Section 6.

2 Problem Model

In the FL system, there is a central server and \(N\) devices, where each device \(i\) owns a local dataset consisting of \({{\mathbf{X}}^{(i)}} \in \mathbb{R}^{m_i \times d}\) and \({{\mathbf{Y}}^{(i)}} \in \mathbb{R}^{m_i \times o}\), \(\forall i\). Here, \({{\mathbf{X}}^{(i)}}\) contains \(m_i\) data samples of dimension \(d\), and \({{\mathbf{Y}}^{(i)}}\) represents the corresponding labels of dimension \(o\). We also assume that, for each local dataset, the number of data samples \(m_i\) is greater than \(d\) and \({{\mathbf{X}}^{(i)}} \in \mathbb{R}^{m_i \times d}\) is full column rank [25]. Under the coordination of the central server, the aim is to minimize the total training loss \( f\left( \mathbf{W} \right)\) by finding the solution to the following linear regression problem [24]:

\[ \begin{align} {{\mathbf{W}}^*} = \arg \min_{\mathbf{W} \in \mathbb{R}^{d \times o}} f\left( \mathbf{W} \right) = \sum_{i = 1}^N \frac{1}{2}\left\| {{\mathbf{X}}^{(i)}\mathbf{W} - {{\mathbf{Y}}^{(i)}}} \right\|_F^2, \end{align} \]

(1)

where \(\mathbf{W} \in \mathbb{R}^{d \times o}\) is the model parameter matrix, and \(\left\| \cdot \right\|_F\) denotes the Frobenius norm. Here, we focus on linear regression problems under the FL setting for two reasons. First, linear regression is a very important machine learning model that has attracted significant attention recently due to its ability to make scientific and reliable predictions. As a statistical procedure established a long time ago, its properties are well understood, enabling the rapid training of linear regression models in different applications [26, 27, 28, 29]. Second, nonlinear problems can be transformed into linear regression problems through kernel embedding methods, demonstrating the broad applicability of linear regression models, as indicated by [30].

For the problem introduced above under the FL setting, the training process involves multiple iterations. In each iteration, each device computes local gradient based on its local dataset and transmits it to the central server. After aggregating the received gradients, the central server updates the global model and broadcasts the model parameters to all devices. Over the iterations, due to various incidents, some devices may become stragglers, which are temporarily unresponsive, fail to implement computations, and do not send any messages to the central server, unlike the normally functioning devices [13]. It is assumed that the probability of each device being a straggler is \( p\) in each iteration, and the straggler behavior of the devices is independent across iterations and among different devices [19]. It is obvious that the presence of stragglers may degrade the learning performance or even leads to very poor convergence.

To deal with the stragglers in FL, the CFL paradigm has been studied. In CFL, the local dataset \(\left( {\mathbf{X}^{(i)},\mathbf{Y}^{(i)}} \right)\) is encoded into \(\left( {\mathbf{H}_X^{(i)},\mathbf{H}_Y^{(i)}} \right)\) and transmitted to the central server by device \(i\) before training begins, \(\mathbf{H}_X^{(i)} \in \mathbb{R}^{n_X^{(i)} \times d}\), \(\mathbf{H}_Y^{(i)} \in \mathbb{R}^{n_Y^{(i)} \times o}\), \( \forall i\) . The central server utilizes the local coded datasets to generate a global coded dataset, denoted as \(\left( \mathbf{\tilde H}_X, \mathbf{\tilde H}_Y \right)\), instead of retaining all the received coded datasets to save its storage [24]. Subsequently, the training process begins, involving multiple iterations. In each iteration, each non-straggler device computes the gradient based on its local dataset and transmits it to the central server. Upon receiving these messages from the non-stragglers, the central server updates the current model parameter matrix using both the received gradients and the gradient computed from the global coded dataset [24]. At the end of each iteration, the central server broadcasts the updated model parameters to all devices. In CFL, the core issue is to design the method of generating global coded dataset before training and aggregating gradients at the central server during the training process, so that we can achieve a better performance in terms of privacy and learning. A recently proposed method within the CFL framework is known as SCFL [24], which generates local coded datasets by adding noise and random projections. The global dataset is then assembled as the sum of these local coded datasets. Throughout each training iteration, the central server employs a weighted sum with fixed and pre-specified weights across all iterations to aggregate the gradients. However, in SCFL, consistently using fixed and pre-specified weights for aggregating gradients without considering the generation process of the global coded dataset and the dynamic nature of the trained model over iterations results in an unnecessary performance loss.

To overcome the limitations of the existing work, our aim is to attain a better performance in terms of privacy and learning for FL problems under the requirements of privacy preservation and straggler mitigation.

3 The Proposed Method

In this section, we propose the ACFL method to address the problem outlined in Section 2. ACFL is implemented in two distinct stages. Initially, in the first stage prior to the commencement of training, each device transmits a coded dataset to the central server, which is derived by transforming its local dataset and adding noise. The central server then uses these local coded datasets to generate a global coded dataset. Subsequently, in the second stage, the training process involves multiple iterations. In each iteration, the central server aggregates gradient computed from the global coded dataset and the received gradients from non-straggler devices, where an adaptive policy to determine the aggregation weights is applied.

3.1 The first stage: before training

Before the training starts, each device uploads a coded dataset to the central server, which is a common practice under the framework of CFL. In the literature, methods commonly used to generate local coded dataset include random matrix projection and noise injection [21, 24, 25]. Random matrix projection reduces the volume of data transmission by lowering the dimension of the projected data compared to the raw dataset. Meanwhile, noise injection enables privacy enhancement at adjustable levels. Inspired by this, in ACFL, the devices encode the local datasets by adding noise to a transformed version of the local dataset. To be more specific, device \( i\) uploads coded dataset \(\left( {\mathbf{H}_X^{(i)},\mathbf{H}_Y^{(i)}} \right)\) given as

\[ \begin{align} \mathbf{H}_X^{(i)} = {{\mathbf{X}}^{\left( i \right)T}}{{\mathbf{X}}^{\left( i \right)}} + {\mathbf{N}}_1^{\left( i \right)}, \end{align} \]

(2)

and

\[ \begin{align} \mathbf{H}_Y^{(i)} = {{\mathbf{X}}^{\left( i \right)T}}{{\mathbf{Y}}^{\left( i \right)}} + {\mathbf{N}}_2^{\left( i \right)}, \end{align} \]

(3)

to the central server, where \( {\mathbf{N}}_1^{\left( i \right)} \in {\mathbb{R}^{d \times d}}\) and \( {\mathbf{N}}_2^{\left( i \right)} \in {\mathbb{R}^{d \times o}}\) denote the additive noise. In \( {\mathbf{N}}_1^{\left( i \right)}\) , all the elements are independent and identically distributed (i.i.d.) random variables drawn from Gaussian distribution \(\mathcal{N}\left( {0,\sigma _1^2} \right)\). Similarly, all the elements in \( {\mathbf{N}}_2^{\left( i \right)}\) are i.i.d. random variables following distribution \(\mathcal{N}\left( {0,\sigma _2^2} \right)\). Differing from existing methods that adopt random matrix projection [21, 24, 25], ACFL projects the local dataset \( \mathbf{X}^{(i)}\) using its own transpose. This approach still reduces the volume of transmitted data compared to sending the raw dataset, since \( d < m_i\) always holds. Furthermore, as will be detailed in Section 3.2, it can be noted that \( \mathbf{X}^{(i)T}\mathbf{X}^{(i)}\) and \( \mathbf{X}^{(i)T}\mathbf{Y}^{(i)}\) contain information equivalent to that in the raw datasets for computing gradients in linear regression problems. This implies the efficiency of projecting the local dataset as described in (2) and (3). Upon receiving local coded datasets, the central server generates a global coded dataset as

\[ \begin{align} {{{\mathbf{\tilde H}}}_X} = \sum\limits_{i = 1}^N {{\mathbf{H}}_X^{(i)}}, {{{\mathbf{\tilde H}}}_Y} = \sum\limits_{i = 1}^N {{\mathbf{H}}_Y^{(i)}}. \end{align} \]

(4)

In this way, the storage at the central server can be saved significantly compared with storing all received coded datasets directly.

3.2 The second stage: training iterations

In the \( t\) -th iteration, let us denote the current model parameter matrix as \( {{\mathbf{W}}_t}\) . The following random variables are adopted to represent the straggler behaviour of the devices:

\[ \begin{align} I_t^{\left( i \right)} = \left\{ \begin{array}{ll} 1, & \text{if device } i\text{ is not a straggler}, \\ 0, & \text{if device } i\text{ is a straggler}. \end{array}\right. \end{align} \]

(5)

According to (5), \(\left\{ {I_t^{\left( i \right)},\forall t,\forall i} \right\}\) are i.i.d. Bernoulli random variables, whose probability mass function can be expressed as

\[ \begin{align} \Pr \left( {I_t^{\left( i \right)} = 1} \right) = 1 - p, \Pr \left( {I_t^{\left( i \right)} = 0} \right) = p. \end{align} \]

(6)

If device \( i\) is not a straggler in the current iteration, it computes the local gradient as follows:

\[ \begin{align} {\mathbf{G}}_t^{\left( i \right)} = {{\mathbf{X}}^{\left( i \right)T}}\left[ {{{\mathbf{X}}^{\left( i \right)}}{{\mathbf{W}}_t} - {{\mathbf{Y}}^{\left( i \right)}}} \right], \end{align} \]

(7)

and then transmits \( {\mathbf{G}}_t^{\left( i \right)}\) to the central server. At the same time, the central server computes a gradient based on the global coded dataset to obtain:

\[ \begin{align} {\mathbf{G}}_t^S = {{{\mathbf{\tilde H}}}_X}{{\mathbf{W}}_t} - {{{\mathbf{\tilde H}}}_Y}. \end{align} \]

(8)

Subsequently, with the received gradients as shown in (7) and the gradient computed as (8), the server derives the global gradient as

\[ \begin{align} {\mathbf{G}}_t^{All} = {\alpha _t}{\mathbf{G}}_t^S + \frac{{1 - {\alpha _t}}}{{1 - p}}\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}I_t^{\left( i \right)}}, \end{align} \]

(9)

where \( \alpha _t \in [0,1]\) represents the adaptive aggregation weights. Here, in the proposed method, we introduce an adaptive policy for adjusting the values of \( \alpha _t\) over iterations to attain the optimal performance in terms of privacy and learning, which will be specified after conducting performance analysis in the next section. With the global gradient provided in (9), the central server updates the model parameter matrix as

\[ \begin{align} {{\mathbf{W}}_{t + 1}} = {{\mathbf{W}}_t} - \eta_t {\mathbf{G}}_t^{All}, \end{align} \]

(10)

and broadcasts \( {{\mathbf{W}}_{t + 1}}\) to all devices, which ends the \( t\) -th iteration. In (10), \( \eta_t\) denotes the learning rate.

Remark 1

For the proposed method, the overall communication overhead during \( T\) iterations, measured by the number of uploaded bits from the devices to the central server, can be expressed as

\[ \begin{align} {\psi ^{(T)}} = \psi _1^{(T)} + \psi _2^{(T)}, \end{align} \]

(11)

where \( \psi _1^{(T)}\) is the communication overhead in the first stage induced by transmission of local coded datasets and \( \psi _2^{(T)}\) is the communication overhead in the second stage resulted from uploading gradients from the devices to the central server. Based on that, we can derive \( \psi _1^{(T)}\) and \( \psi _2^{(T)}\) as follows:

\[ \begin{align} \psi _1^{(T)} &= \phi\left( {{d^2} + od} \right)N, \end{align} \]

(12)

\[ \begin{align} \psi _2^{(T)} &= \phi(Tod)N, \\\end{align} \]

(13)

where \( \phi\) denotes the number of bits required for the transmission of a real number. In many practical applications, the FL process involves a very large number of iterations with \(T \gg d\), which indicates \( \psi _2^{(T)} \gg \psi _1^{(T)}\) . Hence, it holds that \( {\psi ^{(T)}} \approx \psi _2^{(T)}\) . In other words, compared with conventional FL only involving training iterations [8], the extra communication overhead induced by the first stage of the proposed method is negligible.

4 Performance Analysis and The Adaptive Policy

In this section, we first analyze the performance of the proposed method with arbitrary aggregation weights from two aspects: privacy performance and learning performance. Based on that, to attain the optimal performance in terms of privacy and learning, the adaptive policy of determining the value of \( \alpha_t\) would be specified accordingly. First, let us provide some useful assumptions, which have been widely and reasonably adopted in the literature.

Assumption 1

It is assumed that the absolute values of all elements in \({{\mathbf{X}}^{(i)}}\) and \({{\mathbf{Y}}^{(i)}}\) are no larger than 1, \(\forall i\) [25].

Assumption 2

The local gradients computed as (7) are all bounded as [24]

\[ \begin{align} \left\| {{\mathbf{G}}_t^{\left( i \right)}} \right\|_F^2 \leq {\beta ^2},\forall i. \end{align} \]

(14)

Assumption 3

The parameter matrix is bounded as follows [24]:

\[ \begin{align} \left\| {{{\mathbf{W}}_t}} \right\|_F^2 \leq {C^2}. \end{align} \]

(15)

4.1 Privacy performance of ACFL

To evaluate the privacy performance of the proposed method, the metric of MI-DP is adopted, which is stronger than the conventional differential privacy metric [31, 32]. Let us provide the definition of MI-DP as follows [25]:

Definition 1 (\( \epsilon\) -MI-DP)

The matrix \( q\left( {\mathbf{Z}} \right) = {\mathbf{\tilde Z}} \in {\mathbb{R}^{{a_2} \times {b_2}}}\) encoded from \( {\mathbf{Z}} \in {\mathbb{R}^{{a_1} \times {b_1}}}\) with randomized function \( q(\cdot)\) satisfies \( \epsilon\) -MI-DP if

\[ \begin{equation} \sup_{j,k,p({\mathbf{Z}})} I(Z_{j,k};q({\mathbf{Z}})\|\mathbf{Z}^{-j,k}) \leq \epsilon, \end{equation} \]

(16)

where \( Z_{j,k}\) denotes the \( (j,k)\) -th element in \( {\mathbf{Z}}\) , \( \mathbf{Z}^{-j,k}\) is the set of all elements in \( {\mathbf{Z}}\) excluding \( Z_{j,k}\) , and \( p({\mathbf{Z}})\) denotes the distribution of \( {\mathbf{Z}}\) . A decreasing value of \( \epsilon\) indicates enhanced privacy performance.

Based on Definition 1, the privacy performance of ACFL is characterized in the following Theorem.

Theorem 1 (Privacy performance of ACFL)

With respect to device \( i\) , \( \forall i\) , the local coded dataset satisfies \( \epsilon\) -MI-DP, where

\[ \begin{align} \epsilon=\left( {d - \frac{1}{2}} \right)\log \frac{{1 + \sigma _1^2}}{{\sigma _1^2}} + \frac{o}{2}\log \frac{{1 + \sigma _2^2}}{{\sigma _2^2}}. \end{align} \]

(17)

Proof

In the following proof, for the sake of brevity of denotation, we will omit the index \( (i)\) . Noting that

\[ \begin{align} & {\sup }\limits_{j,k,p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} I\left( {\left. {X_{j,k}^{};{\mathbf{H}}_X^{},{\mathbf{H}}_Y^{}} \right|{{\mathbf{X}}^{ - j,k}},{\mathbf{Y}}} \right)\nonumber \end{align} \]

(18)

\[ \begin{align} =& {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} I\left( {\left. {X_{1,1}^{};{\mathbf{H}}_X^{},{\mathbf{H}}_Y^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right), \\\end{align} \]

(19)

it is sufficient to show

\[ \begin{align} {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} I\left( {\left. {X_{1,1}^{};{\mathbf{H}}_X^{},{\mathbf{H}}_Y^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \leq \epsilon, \end{align} \]

(20)

in order to prove Theorem 1 without loss of generality. To this end, we can derive

\[ \begin{align} & {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} I\left( {\left. {X_{1,1}^{};{\mathbf{H}}_X^{},{\mathbf{H}}_Y^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \nonumber \end{align} \]

(21)

\[ \begin{align} = & {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} I\left( {\left. {X_{1,1}^{};{{\mathbf{X}}^T}{\mathbf{X}} + {\mathbf{N}}_1^{},{{\mathbf{X}}^T}{\mathbf{Y}} + {\mathbf{N}}_2^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \\ = & {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} h\left( {\left. {{{\mathbf{X}}^T}{\mathbf{X}} + {\mathbf{N}}_1^{},{{\mathbf{X}}^T}{\mathbf{Y}} + {\mathbf{N}}_2^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \nonumber \\ &- h\left( {\left. {{{\mathbf{X}}^T}{\mathbf{X}} + {\mathbf{N}}_1^{},{{\mathbf{X}}^T}{\mathbf{Y}} + {\mathbf{N}}_2^{}} \right|{\mathbf{X}},{\mathbf{Y}}} \right) \nonumber \\ = & {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} h\left( {\left. {{{\mathbf{X}}^T}{\mathbf{X}} + {\mathbf{N}}_1^{},{{\mathbf{X}}^T}{\mathbf{Y}} + {\mathbf{N}}_2^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \nonumber \\ & - h\left( {{\mathbf{N}}_1^{},{\mathbf{N}}_2^{}} \right), \\\end{align} \]

where \(h\left( {\cdot} \right)\) denotes the entropy, the first equality is derived by substituting (2) and (3) into (21) and the second equality holds according to the definition of mutual information. In (21), for the first term, we have (22) based on the independence between the additive noise and the raw data.

\[ \begin{align} & {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} h\left( {\left. {{{\mathbf{X}}^T}{\mathbf{X}} + {\mathbf{N}}_1^{},{{\mathbf{X}}^T}{\mathbf{Y}} + {\mathbf{N}}_2^{}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \nonumber \end{align} \]

(22)

\[ \begin{align} =& {\sup }\limits_{p\left( {{\mathbf{X}},{\mathbf{Y}}} \right)} h\left( {\left. {\left\{ {{X_{1,j}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{j,1}},j \ne 1} \right\},\left\{ {{X_{1,k}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{1,k}},k \ne 1} \right\},X_{1,1}^2 + {{\left( {{N_1}} \right)}_{1,1}},\left\{ {{X_{1,1}}{Y_{1,k}} + {{\left( {{N_2}} \right)}_{1,k}},\forall k} \right\}} \right|{{\mathbf{X}}^{ - 1,1}},{\mathbf{Y}}} \right) \\ &+ h\left( {\left\{ {{{\left( {{{N}}_1^{}} \right)}_{j,k}},j \ne 1,k \ne 1} \right\}} \right) + h\left( {\left\{ {{{\left( {{{N}}_2^{}} \right)}_{j,k}},j \ne 1,\forall k} \right\}} \right) \nonumber \\ \leq& {\sup }\limits_{p\left( {\mathbf{X}} \right)} \sum\limits_{j \ne 1} {h\left( {\left. {{X_{1,j}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{j,1}}} \right|{{\mathbf{X}}^{ - 1,1}}} \right)} + {\sup }\limits_{p\left( {\mathbf{X}} \right)} \sum\limits_{k \ne 1} {h\left( {\left. {{X_{1,k}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{1,k}}} \right|{{\mathbf{X}}^{ - 1,1}}} \right)} + {\sup }\limits_{p\left( {{X_{1,1}}} \right)} h\left( {X_{1,1}^2 + {{\left( {{N_1}} \right)}_{1,1}}} \right) \nonumber \\ &+ {\sup }\limits_{p\left( {\mathbf{Y}} \right)} \sum\limits_k {h\left( {\left. {{X_{1,1}}{Y_{1,k}} + {{\left( {{N_2}} \right)}_{1,k}}} \right|{\mathbf{Y}}} \right)} + h\left( {\left\{ {{{\left( {{{N}}_1^{}} \right)}_{j,k}},j \ne 1,k \ne 1} \right\}} \right) + h\left( {\left\{ {{{\left( {{{N}}_2^{}} \right)}_{j,k}},j \ne 1,\forall k} \right\}} \right) \nonumber \\ \leq& {\sup }\limits_{p\left( {\mathbf{X}} \right)} \sum\limits_{j \ne 1} {h\left( {{X_{1,j}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{j,1}}} \right)} + {\sup }\limits_{p\left( {\mathbf{X}} \right)} \sum\limits_{k \ne 1} {h\left( {{X_{1,k}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{1,k}}} \right)}+ {\sup }\limits_{p\left( {{X_{1,1}}} \right)} h\left( {X_{1,1}^2 + {{\left( {{N_1}} \right)}_{1,1}}} \right) \nonumber \\ & + {\sup }\limits_{p\left( {\mathbf{Y}} \right)} \sum\limits_k {h\left( {{X_{1,1}}{Y_{1,k}} + {{\left( {{N_2}} \right)}_{1,k}}} \right)} +h\left( {\left\{ {{{\left( {{{N}}_1^{}} \right)}_{j,k}},j \ne 1,k \ne 1} \right\}} \right) + h\left( {\left\{ {{{\left( {{{N}}_2^{}} \right)}_{j,k}},j \ne 1,\forall k} \right\}} \right). \\\end{align} \]

Substituting (22) into (21) yields

(23)

\[ \begin{align} \leq& {\sup }\limits_{p\left( {\mathbf{X}} \right)} \sum\limits_{j \ne 1} {h\left( {{X_{1,j}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{j,1}}} \right)} \\ & + {\sup }\limits_{p\left( {\mathbf{X}} \right)} \sum\limits_{k \ne 1} {h\left( {{X_{1,k}}{X_{1,1}} + {{\left( {{N_1}} \right)}_{1,k}}} \right)} \nonumber \\ & + {\sup }\limits_{p\left( {{X_{1,1}}} \right)} h\left( {X_{1,1}^2 + {{\left( {{N_1}} \right)}_{1,1}}} \right) \nonumber \\ & + {\sup }\limits_{p\left( {\mathbf{Y}} \right)} \sum\limits_k {h\left( {{X_{1,1}}{Y_{1,k}} + {{\left( {{N_2}} \right)}_{1,k}}} \right)} \nonumber \\ & - h\left( {\left\{ {{{\left( {{{N}}_1^{}} \right)}_{j,k}},j = 1,k \ne 1} \right\}} \right) \nonumber\\ &- h\left( {\left\{ {{{\left( {{{N}}_1^{}} \right)}_{j,k}},j \ne 1,k = 1} \right\}} \right) \nonumber \\ & - h\left[ {{{\left( {{{N}}_1^{}} \right)}_{1,1}}} \right] - h\left( {\left\{ {{{\left( {{{N}}_2^{}} \right)}_{j,k}},j = 1,\forall k} \right\}} \right) \nonumber \\ \leq& \left( {d - \frac{1}{2}} \right)\log \left[ {2\pi e\left( {1 + \sigma _1^2} \right)} \right] + \frac{o}{2}\log \left[ {2\pi e\left( {1 + \sigma _2^2} \right)} \right] \nonumber \\ & - \left( {d - \frac{1}{2}} \right)\log \left( {2\pi e\sigma _1^2} \right) - \frac{o}{2}\log \left( {2\pi e\sigma _2^2} \right) \nonumber \\ = &\left( {d - \frac{1}{2}} \right)\log \frac{{1 + \sigma _1^2}}{{\sigma _1^2}} + \frac{o}{2}\log \frac{{1 + \sigma _2^2}}{{\sigma _2^2}}, \\\end{align} \]

based on Assumption 1 and the maximum entropy bound [33]. From (23), we can easily derive Theorem 1.

Remark 2

From Theorem 1, we observe that \(\epsilon\) is a decreasing function with respect to \(\sigma_1^2\) and \(\sigma_2^2\). By Definition 1, the privacy performance of ACFL is enhanced by increasing the strength of the additive noise, that is, by increasing the values of \(\sigma_1^2\) and \(\sigma_2^2\), aligning with our intuition. Furthermore, it holds that

\[ \begin{align} \lim_{\sigma_1^2, \sigma_2^2 \to \infty} \epsilon = 0, \end{align} \]

(24)

which signifies that perfect privacy can be asymptotically achieved when the variances of the additive noise approach infinity.

4.2 Learning performance of ACFL

Let us first provide two lemmas, which facilitate the subsequent derivations of Theorem 2 describing the learning performance of ACFL under arbitrary values of aggregation weights.

Lemma 1

The global gradient applied by the central server denoted as (9) is an unbiased estimation of the true gradient given as

\[ \begin{align} {\mathbf{G}}_t^{True} = \sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}}, \end{align} \]

(25)

where \({{\mathbf{G}}_t^{\left( i \right)}}\) is provided as (7).

Proof

According to (2), (3), (4) and (8), we have

\[ \begin{align} \mathbb{E}\left( {\left. {{\mathbf{G}}_t^S} \right|{{\mathbf{W}}_t}} \right) = \mathbb{E}\left( {{{{\mathbf{\tilde H}}}_X}} \right){{\mathbf{W}}_t} - \mathbb{E}\left( {{{{\mathbf{\tilde H}}}_Y}} \right) = {\mathbf{G}}_t^{True}. \end{align} \]

(26)

Further, based on (6), (9) and (26), we have

\[ \begin{align} \mathbb{E}\left( {\left. {{\mathbf{G}}_t^{All}} \right|{{\mathbf{W}}_t}} \right) = \mathbb{E}\left[ {\left. {{\alpha _t}{\mathbf{G}}_t^S + \frac{{1 - {\alpha _t}}}{{1 - p}}\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}I_t^{\left( i \right)}} } \right|{{\mathbf{W}}_t}} \right] \nonumber \end{align} \]

(27)

\[ \begin{align} = {\alpha _t}{\mathbf{G}}_t^{True} + \left( {1 - {\alpha _t}} \right)\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}} = {\mathbf{G}}_t^{True}, \\\end{align} \]

(28)

which completes the proof.

Lemma 2

The global gradient in (9) is bounded as

\[ \begin{align} &\mathbb{E}\left[ {\left\| {{\mathbf{G}}_t^{All}} \right\|_F^2} \right] \nonumber \end{align} \]

(29)

\[ \begin{align} \leq& \left[ {\alpha _t^2p + \left( {1 - p} \right){{\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}} \right)}^2} + N - 1} \right]N{\beta ^2} \\ &+ \alpha _t^2Nd\sigma _1^2{C^2} + \alpha _t^2N\sigma _2^2od. \\\end{align} \]

Proof

We can derive that

\[ \begin{align} &\mathbb{E}\left[ {\left. {\left\| {{\mathbf{G}}_t^{All}} \right\|_F^2} \right|{{\mathbf{W}}_t}} \right]\nonumber \end{align} \]

(30)

\[ \begin{align} = &\mathbb{E}\left[ {\left. {\left\| {{\alpha _t}{\mathbf{G}}_t^S + \frac{{1 - {\alpha _t}}}{{1 - p}}\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}I_t^{\left( i \right)}} } \right\|_F^2} \right|{{\mathbf{W}}_t}} \right] \\ =& \mathbb{E}\left[ {\left\| {\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}\left[ {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}I_t^{\left( i \right)}} \right]} } \right.} \right. \nonumber \\ &+ \left. {\left. {\left. {{\alpha _t}\sum\limits_{i = 1}^N {{\mathbf{N}}_1^{\left( i \right)}} {{\mathbf{W}}_t} - {\alpha _t}\sum\limits_{i = 1}^N {{\mathbf{N}}_2^{\left( i \right)}} } \right\|_F^2} \right|{{\mathbf{W}}_t}} \right] \nonumber \\ = &\mathbb{E}\left[ {\left. {\left\| {\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}\left[ {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}I_t^{\left( i \right)}} \right]} } \right\|_F^2} \right|{{\mathbf{W}}_t}} \right] \nonumber\\ &+ \mathbb{E}\left[ {\left. {\left\| {{\alpha _t}\sum\limits_{i = 1}^N {{\mathbf{N}}_1^{\left( i \right)}} {{\mathbf{W}}_t}} \right\|_F^2} \right|{{\mathbf{W}}_t}} \right]\nonumber\\ & + \mathbb{E}\left( {\left\| {{\alpha _t}\sum\limits_{i = 1}^N {{\mathbf{N}}_2^{\left( i \right)}} } \right\|_F^2} \right), \\\end{align} \]

based on (2), (3), (4), (8), (9) and the fact that the additive noise is independent from the straggler behaviour of the devices. For the first term in (30), we have

\[ \begin{align} &\mathbb{E}\left[ {\left. {\left\| {\sum\limits_{i = 1}^N {{\mathbf{G}}_t^{\left( i \right)}\left[ {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}I_t^{\left( i \right)}} \right]} } \right\|_F^2} \right|{{\mathbf{W}}_t}} \right] \nonumber \end{align} \]

(31)

\[ \begin{align} =& \sum\limits_{{i_1} = 1}^N {\sum\limits_{{i_2} = 1}^N {\left\langle {{\mathbf{G}}_t^{\left( {{i_1}} \right)},{\mathbf{G}}_t^{\left( {{i_2}} \right)}} \right\rangle } } \\ & \times \mathbb{E}\left[ {\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}I_t^{\left( {{i_1}} \right)}} \right)\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}I_t^{\left( {{i_2}} \right)}} \right)} \right] \nonumber \\ = &\sum\limits_{{i_1} = 1}^N {\sum\limits_{{i_2} = 1}^N {\left\langle {{\mathbf{G}}_t^{\left( {{i_1}} \right)},{\mathbf{G}}_t^{\left( {{i_2}} \right)}} \right\rangle } } - \sum\limits_{i = 1}^N {\left\langle {{\mathbf{G}}_t^{\left( i \right)},{\mathbf{G}}_t^{\left( i \right)}} \right\rangle } \nonumber \\ &+ \sum\limits_{i = 1}^N {\left\langle {{\mathbf{G}}_t^{\left( i \right)},{\mathbf{G}}_t^{\left( i \right)}} \right\rangle \left[ {\alpha _t^2p + \left( {1 - p} \right){{\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}} \right)}^2}} \right]} \nonumber \\ \leq& \left[ {\alpha _t^2p + \left( {1 - p} \right){{\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}} \right)}^2} + N - 1} \right]\sum\limits_{i = 1}^N {\left\| {{\mathbf{G}}_t^{\left( i \right)}} \right\|_F^2} \nonumber \\ \leq& \left[ {\alpha _t^2p + \left( {1 - p} \right){{\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}} \right)}^2} + N - 1} \right]N{\beta ^2}, \\\end{align} \]

where \(\left\langle {{\mathbf{A}},{\mathbf{B}}} \right\rangle \) denotes the Hadamard Product of two matrices, the first inequality holds due to the following basic inequality:

\[ \begin{align} \left\| {\sum\limits_{i = 1}^n {{{\mathbf{A}}_i}} } \right\|_F^2 \leq n\sum\limits_{i = 1}^n {\left\| {{{\mathbf{A}}_i}} \right\|_F^2} ,{\text{ }}\forall {{\mathbf{A}}_i} \in {\mathbb{R}^{a \times b}}, \end{align} \]

(32)

and the second inequality holds due to Assumption 2. For the second term in (30), we have

\[ \begin{align} \mathbb{E}\left[ {\left. {\left\| {{\alpha _t}\sum\limits_{i = 1}^N {{\mathbf{N}}_1^{\left( i \right)}} {{\mathbf{W}}_t}} \right\|_F^2} \right|{{\mathbf{W}}_t}} \right] &= \alpha _t^2\sum\limits_{i = 1}^N {\mathbb{E}\left[ {\left. {\left\| {{\mathbf{N}}_1^{\left( i \right)}{{\mathbf{W}}_t}} \right\|_F^2} \right|{{\mathbf{W}}_t}} \right]} \nonumber \end{align} \]

(33)

\[ \begin{align} &= \alpha _t^2Nd\sigma _1^2\left\| {{{\mathbf{W}}_t}} \right\|_F^2 \\ &\leq \alpha _t^2Nd\sigma _1^2{C^2}, \\\end{align} \]

based on the independence of the additive noise on different local datasets and Assumption 3. For the third term in (30), we have

\[ \begin{align} \mathbb{E}\left( {\left\| {{\alpha _t}\sum\limits_{i = 1}^N {{\mathbf{N}}_2^{\left( i \right)}} } \right\|_F^2} \right) = \alpha _t^2N\sigma _2^2od. \end{align} \]

(34)

Substituting (31), (33) and (34) into (30) yields Lemma 2.

Theorem 2 (Convergence performance of ACFL under arbitrary values of aggregation weights)

Under Assumptions 1-3, if \( \sum\limits_{i = 1}^N {{{\mathbf{X}}^{\left( i \right)T}}{{\mathbf{X}}^{\left( i \right)}}} \geq \lambda {\mathbf{I}}\) with \( \lambda>0\) , by setting \( {\eta _t} = {1 \left/ {\vphantom {1 {\left( {\lambda t} \right)}}} \right. - {\left( {\lambda t} \right)}}\) , it holds that

\[ \begin{align} \mathbb{E}\left( {\left\| {{{\mathbf{W}}_T} - {{\mathbf{W}}^*}} \right\|_F^2} \right) \leq {\sup }\limits_{0 \leq t \leq T} 4\frac{u\left( {{\sigma _1^2,\sigma _2^2,{\alpha _t}}} \right)}{{{\lambda ^2}T}}, \end{align} \]

(35)

where

\[ \begin{align} & u\left( {{\sigma _1^2,\sigma _2^2,{\alpha _t}}} \right)\nonumber \end{align} \]

(36)

\[ \begin{align} \triangleq& \left[ {\alpha _t^2p + \left( {1 - p} \right){{\left( {{\alpha _t} + \frac{{1 - {\alpha _t}}}{{1 - p}}} \right)}^2} + N - 1} \right]N{\beta ^2}\\ &+ \alpha _t^2Nd\sigma _1^2{C^2} + \alpha _t^2N\sigma _2^2od. \\\end{align} \]

Proof

When \( \sum\limits_{i = 1}^N {{{\mathbf{X}}^{\left( i \right)T}}{{\mathbf{X}}^{\left( i \right)}}} \geq \lambda {\mathbf{I}}\) with \( \lambda>0\) , the linear regression problem is \( \lambda\) -strongly convex. Based on that, Theorem 2 can be easily proved by combining Lemma 1, Lemma 2, and Lemma 1 in [34].

Remark 3

Note that the condition \(\sum_{i = 1}^N \mathbf{X}^{(i)T}\mathbf{X}^{(i)} \geq \lambda \mathbf{I}\) with \(\lambda>0\) in Theorem 2 naturally holds for the considered problem formulated in Section 2. This is because \(\mathbf{X}^{(i)} \in \mathbb{R}^{m_i \times d}\) is full column rank, and \(\mathbf{X}^{(i)T}\mathbf{X}^{(i)}\) is positive definite. In this case, \(\lambda>0\) can be any value that satisfies \(0<\lambda<\sum_{i = 1}^N \text{eig}_{\min} \left[ \mathbf{X}^{(i)T}\mathbf{X}^{(i)} \right]\), where \(\text{eig}_{\min}(\cdot)\) denotes the smallest eigenvalue of a matrix.

4.3 The adaptive policy

Based on Section 4.1 and Section 4.2, let us now consider how to adaptively determine the value of \( \alpha_t\) for ACFL to attain a better performance in terms of privacy and learning. According to Theorem 1 and Theorem 2, the problem of determining \( \alpha _t\) in ACFL can be equivalently expressed as the following optimization problem:

\[ \begin{align} {\min }\limits_{{\alpha _t}} u\left( {\sigma _1^2,\sigma _2^2,{\alpha _t}} \right), s.t. 0\leq{\alpha _t}\leq1, \end{align} \]

(37)

which aims to expedite the convergence of the learning process under certain privacy level with fixed variances of additive noise. Based on the optimization problem defined in (37), we detail the adaptive policy for selecting the value of \(\alpha_t\) and analyze the trade off between privacy and learning performance under this policy in Theorem 3 as follows.

Theorem 3

For the proposed ACFL method, by setting

\[ \begin{align} \alpha_t= \frac{{pN{\beta ^2}\frac{1}{{1 - p}}}}{{pN{\beta ^2}\frac{1}{{1 - p}} + Nd\sigma _1^2{C^2} + N\sigma _2^2od}}, \end{align} \]

(38)

the optimal performance in terms of privacy and learning can be attained. In this case, under Assumptions 1-3, if \( \sum\limits_{i = 1}^N {{{\mathbf{X}}^{\left( i \right)}}^T{{\mathbf{X}}^{\left( i \right)}}} \geq \lambda {\mathbf{I}}\) with \( \lambda>0\) , by setting \( {\eta _t} = {1 \left/ {\vphantom {1 {\left( {\lambda t} \right)}}} \right. - {\left( {\lambda t} \right)}}\) , it holds that

\[ \begin{align} \mathbb{E}\left( {\left\| {{{\mathbf{W}}_T} - {{\mathbf{W}}^*}} \right\|_F^2} \right) \leqslant 4\frac{{\tilde u\left( {\sigma _1^2,\sigma _2^2} \right)}}{{{\lambda ^2}T}}, \end{align} \]

(39)

where

\[ \begin{align} & \tilde u\left( {\sigma _1^2,\sigma _2^2} \right) \nonumber \end{align} \]

(40)

\[ \begin{align} =& - \frac{{{{\left( {pN{\beta ^2}\frac{1}{{1 - p}}} \right)}^2}}}{{N{\beta ^2}p\frac{1}{{1 - p}} + Nd\sigma _1^2{C^2} + N\sigma _2^2od}} \\ &+ N{\beta ^2}\frac{1}{{1 - p}} + N{\beta ^2}\left( {N - 1} \right), \\\end{align} \]

With the learning performance provided as (39) and the privacy performance described in Theorem 1, the performance in terms of privacy and learning attained by the proposed method is characterized.

Proof

Problem (37) can be equivalently expressed as the following optimization problem:

\[ \begin{align} {\min }\limits_{{\alpha _t}} \hat u\left( {{\alpha _t}} \right),s.t.0 \leqslant {\alpha _t} \leqslant 1, \end{align} \]

(41)

where

\[ \begin{align} \hat u\left( {{\alpha _t}} \right) \triangleq& \alpha _t^2\left( {N{\beta ^2}p\frac{1}{{1 - p}} + Nd\sigma _1^2{C^2} + N\sigma _2^2od} \right) \nonumber \end{align} \]

(42)

\[ \begin{align} &- \left( {2pN{\beta ^2}\frac{1}{{1 - p}}} \right){\alpha _t}. \\\end{align} \]

(43)

Noting that \( \frac{{pN{\beta ^2}\frac{1}{{1 - p}}}}{{\left( {pN{\beta ^2}\frac{1}{{1 - p}} + Nd\sigma _1^2{C^2} + N\sigma _2^2od} \right)}} < 1\) always holds, which indicates that

\[ \begin{align} & {\arg \min }\lim_{\alpha _t} \hat u\left( {{\alpha _t}} \right) \nonumber \end{align} \]

(44)

\[ \begin{align} &= \frac{{pN{\beta ^2}\frac{1}{{1 - p}}}}{{pN{\beta ^2}\frac{1}{{1 - p}} + Nd\sigma _1^2{C^2} + N\sigma _2^2od}}. \\\end{align} \]

(45)

Substituting (44) into (36) yields (39), which completes the proof.

Remark 4

According to Theorem 3, the learning performance of ACFL improves as the variances of the additive noise decrease, whereas the privacy performance benefits from an increase in these variances. This highlights the intrinsic trade off between privacy and learning performance. Practically, the variances of the additive noise can be meticulously calibrated to achieve a desired trade off, aligning with specific practical requirements.

To verify that ACFL with the adaptive policy achieves a better performance in terms of privacy and learning, Fig. 1 illustrates the relationship between the upper bound derived in (35) and \(\epsilon\) specified in (17) with non-adaptive and fixed values of aggregation weights as well as adaptive aggregation weights determined by the designed adaptive policy, where we set \( d=100, \sigma_1^2=\sigma^2_2, o=10, p=0.1, N=5, \beta=10, c=1, T=1000\) and \( \lambda=1\) . Here, a lower value of the upper bound implies superior learning performance, whereas a lower \(\epsilon\) value signifies enhanced privacy performance. The results clearly show that under equivalent privacy levels, the upper bound in (35) reaches its minimum with the adaptive policy. This outcome indicates that ACFL, when implemented with the adaptive policy, attains the optimal performance in terms of privacy and learning.

In Theorem 3, we develop the adaptive policy for setting the aggregation weights in ACFL, which relies on the bounds specified in Assumptions 2 and 3. Notably, obtaining the values of \(C\) and \(\beta\) beforehand can be challenging in certain scenarios. In such cases, a practical implementation of ACFL involves estimating \(C\) and \(\beta\) over the iterations by the central server. Specifically, \(C\) and \(\beta\) are approximated by \(\hat{C}\) and \(\hat{\beta}\), respectively, with their expressions provided as follows:

\[ \begin{align} \hat{C}^2 = \left\| {{{\mathbf{W}}_t}} \right\|_F^2, \hat{\beta}^2 = \frac{{\sum\limits_{i = 1}^N {\left\| {{\mathbf{G}}_t^{\left( i \right)}} \right\|_F^2I_t^{\left( i \right)}} }}{{\sum\limits_{i = 1}^N {I_t^{\left( i \right)}} }}. \end{align} \]

(46)

Based on (46), the aggregation weights can be determined adaptively as

\[ \begin{align} {{\overset{\lower0.5em \( \smash{\scriptscriptstyle\smile}\) }{\alpha } }_t} = \frac{{p{{\hat \beta }^2}}}{{p{{\hat \beta }^2} + d\sigma _1^2{{\hat C}^2}\left( {1 - p} \right) + \sigma _2^2od\left( {1 - p} \right)}}. \end{align} \]

(47)

With (47), the implementation of ACFL is presented in Algorithm 1 and shown as Fig. 4.

Algorithm 1 ACFL

1.Input: Initial parameter matrix \({\mathbf{W}}_0\), learning rates \( \eta_t\)

2.The first stage: Device \( i\) uploads coded dataset \(\left( {\mathbf{H}_X^{(i)},\mathbf{H}_Y^{(i)}} \right)\) to the central server as (2) and (3), \( \forall i\) .

3.The central server generates a global coded dataset \(\left( {{{{\mathbf{\tilde H}}}_X},{{{\mathbf{\tilde H}}}_Y}} \right)\) as (4).

4.The second stage:

5.Initialize: \( t = 0\)

6.while \(t \leq T \) do

7.In parallel for all devices \(i \in \{1,2,...,N\}\):

8.if \(I_t^{\left( i \right)} = 1\) then

9.Compute \( {\mathbf{G}}_t^{\left( i \right)}\) as (7);

10.Transmits \( {\mathbf{G}}_t^{\left( i \right)}\) to the central server;

11.end if

12.The central server:

13.Computes the gradient based on the global coded dataset to obtain \( {\mathbf{G}}_t^S\) as (8);

14.Aggregates gradients to obtain \( {\mathbf{G}}_t^{All}\) as (9) with aggregation weights computed as (43);

15.Updates the model parameter matrix to obtain \({\mathbf{W}}_{t+1}\) as (10);

16.end while

5 Simulations

In this section, we evaluate the performance of ACFL through simulations on a linear regression problem, as detailed in Section 2. We consider an FL setting with \( N=100\) devices, each possessing a local dataset with \( m_i = 100\) samples for all \( i\) , with \( d=10\) and \( o=10\) . The elements of each local dataset \( {{\mathbf{X}}^{(i)}}\) are i.i.d., drawn from a uniform distribution over the interval \( [-1, 1]\) . A true parameter matrix \( {{\mathbf{W}}^{True}} \in {\mathbb{R}^{d \times o}}\) is generated with elements uniformly distributed within \( [0, 1/30]\) . The labels \( {{\mathbf{Y}}^{(i)}}\) for each device are subsequently generated according to \( {{\mathbf{Y}}^{(i)}} = {{\mathbf{X}}^{(i)}}{{\mathbf{W}}^{True}}\) , for all \( i\) . The initial parameter matrix \( {\mathbf{W}}_0\) is randomly generated with elements drawn from a uniform distribution in the interval \( [0, 1/30]\) , independent of the true parameter matrix generation.

To evaluate the effectiveness of the adaptive policy in ACFL, we conduct a comparative analysis between ACFL and its non-adaptive counterpart (denoted as NA), which employs a fixed aggregation weight \( \alpha_t = 0.5\) , as suggested by [24]. In Fig. 5, we depict the training loss as a function of number of iterations for both ACFL and NA under different variances of additive noise, where the learning rate is set as \( \eta_t=0.0001/t\) over the iterations and we consider two distinct scenarios characterized by varying straggler probabilities. As anticipated, an increase in noise variances, aimed at enhancing privacy, inversely affects the learning performance of both methods. This empirical observation aligns with our theoretical analysis, underscoring the intrinsic trade off between privacy preservation and learning performance. Notably, the performance deterioration in ACFL remains relatively modest in comparison to NA upon increasing the noise, while there is a pronounced decline in the learning performance of NA. This discrepancy underscores the superior capability of ACFL to leverage gradients from both the central server and devices more effectively with the adaptive policy and validates the advantage of employing the adaptive policy in ACFL for mitigating the adverse effects of privacy-enhancing measures on learning performance.

To further validate the efficacy of the proposed method, we compare its performance with the state-of-the-art method SCFL, which generates local coded datasets by adding noise and implementing random projections. Note that the difference between SCFL and NA lies in the way of encoding local datasets, and both of them use non-adaptive aggregation weights at the central server. In Fig. 8, we plot the training loss as a function of number of iterations for both methods under various privacy levels, i.e., under varying values of \( \epsilon\) using the metric of \( \epsilon\) -MI-DP. The variances of the noise in both methods are calculated according to Theorem 1 from this paper and Theorem 2 in [24], ensuring required value of \( \epsilon\) for both methods. Here, we fix \( p=0.2\) and set the learning rate as \( \eta_t=0.0001/t\) throughout the iterations. For an equitable comparison, we conduct full-batch gradients for both methods . Furthermore, the local coded datasets in both methods are of the same size. From Fig. 8, it is evident that ACFL achieves superior learning performance under equivalent privacy level. Moreover, the learning performance of ACFL exhibits greater resilience and suffers less from increasing the noise variances to heighten the privacy level. This advantage stems from the adoption of an adaptive policy in ACFL for determining optimal aggregation weights, thereby facilitating better performance of privacy and learning.

Figure 8. Training loss as a function of number of iterations of ACFL and SCFL under varying privacy variances.

6 Conclusions

In this article, we addressed the problem of FL in the presence of stragglers. For this issue, the CFL paradigm was explored in the literature. However, the aggregation of gradients at the central server in existing CFL methods, which employ fixed weights across iterations, leads to diminished learning performance. To overcome this limitation, we proposed a new method: ACFL. In ACFL, each device uploads a coded local dataset with additive noise to the central server before training, enabling the central server to generate a global coded dataset under privacy preservation requirements. During each iteration of the training, the central server aggregates the received gradients as well as the gradient computed from the global coded dataset by using an adaptive policy to determine the aggregation weights. This approach optimizes the performance in terms of privacy and learning through convergence analysis and MI-DP. Finally, we provided simulation results to demonstrate the superiority of ACFL compared to non-adaptive methods. It is worth noting that in this work, we have focused on the privacy preservation of raw datasets of edge devices. In the future, we plan to extend the proposed method to more complex scenarios, where private information could also be revealed through the analysis of gradients transmitted from the devices [35, 36, 37].

References

[1] Shahab S Band and Sina Ardabili and Mehdi Sookhak and Anthony Theodore Chronopoulos and Said Elnaffar and Massoud Moslehpour and Mako Csaba and Bernat Torok and Hao-Ting Pai and Amir Mosavi When smart cities get smarter via machine learning: An in-depth literature review IEEE Access 2022 10 60985–61015

[2] Salvador V Balkus and Honggang Wang and Brian D Cornet and Chinmay Mahabal and Hieu Ngo and Hua Fang A survey of collaborative machine learning using 5G vehicular communications IEEE Communications Surveys & Tutorials 2022 24 2 1280–1303

[3] Karrar Hameed Abdulkareem and Mazin Abed Mohammed and Ahmad Salim and Muhammad Arif and Oana Geman and Deepak Gupta and Ashish Khanna Realizing an effective COVID-19 diagnosis system based on machine learning and IOT in smart hospital environment IEEE Internet of things journal 2021 8 21 15919–15928

[4] Peiying Zhang and Ning Chen and Shibao Li and Kim-Kwang Raymond Choo and Chunxiao Jiang and Sheng Wu Multi-domain virtual network embedding algorithm based on horizontal federated learning IEEE Transactions on Information Forensics and Security 2023

[5] Latif U Khan and Walid Saad and Zhu Han and Ekram Hossain and Choong Seon Hong Federated learning for internet of things: Recent advances, taxonomy, and open challenges IEEE Communications Surveys & Tutorials 2021 23 3 1759–1799

[6] Mang Ye and Xiuwen Fang and Bo Du and Pong C Yuen and Dacheng Tao Heterogeneous federated learning: State-of-the-art and research challenges ACM Computing Surveys 2023 56 3 1–44

[7] Huikai Shao and Chengcheng Liu and Xiaojiang Li and Dexing Zhong Privacy Preserving Palmprint Recognition via Federated Metric Learning IEEE Transactions on Information Forensics and Security 2023

[8] Brendan McMahan and Eider Moore and Daniel Ramage and Seth Hampson and Blaise Aguera y Arcas Communication-efficient learning of deep networks from decentralized data Artificial intelligence and statistics 2017 1273–1282 PMLR

[9] Can Karakus and Yifan Sun and Suhas Diggavi and Wotao Yin Straggler mitigation in distributed optimization through data encoding Advances in Neural Information Processing Systems 2017 30

[10] Maximilian Egger and Marvin Xhemrishi and Antonia Wachter-Zeh and Rawad Bitar Sparse and private distributed matrix multiplication with straggler tolerance 2023 IEEE International Symposium on Information Theory (ISIT) 2023 2535–2540 IEEE

[11] Saeed Vahidian and Mahdi Morafah and Mubarak Shah and Bill Lin Rethinking data heterogeneity in federated learning: Introducing a new notion and standard benchmarks IEEE Transactions on Artificial Intelligence 2023

[12] Ahmed M Abdelmoniem and Chen-Yu Ho and Pantelis Papageorgiou and Marco Canini A Comprehensive Empirical Study of Heterogeneity in Federated Learning IEEE Internet of Things Journal 2023

[13] Rashish Tandon and Qi Lei and Alexandros G Dimakis and Nikos Karampatziakis Gradient coding: Avoiding stragglers in distributed learning International Conference on Machine Learning 2017 3368–3376 PMLR

[14] Chengxi Li and Mikael Skoglund Distributed Learning based on 1-Bit Gradient Coding in the Presence of Stragglers accepted by IEEE Transactions on Communications 2024

[15] Hongyi Wang and Zachary Charles and Dimitris Papailiopoulos Erasurehead: Distributed gradient descent without delays using approximate gradient coding arXiv preprint arXiv:1901.09671 2019

[16] Emre Ozfatura and Deniz Gündüz and Sennur Ulukus Gradient coding with clustering and multi-message communication 2019 IEEE Data Science Workshop (DSW) 2019 42–46 IEEE

[17] Baturalp Buyukates and Emre Ozfatura and Sennur Ulukus and Deniz Gündüz Gradient coding with dynamic clustering for straggler-tolerant distributed learning IEEE Transactions on Communications 2022

[18] Margalit Glasgow and Mary Wootters Approximate gradient coding with optimal decoding IEEE Journal on Selected Areas in Information Theory 2021 2 3 855–866

[19] Rawad Bitar and Mary Wootters and Salim El Rouayheb Stochastic gradient coding for straggler mitigation in distributed learning IEEE Journal on Selected Areas in Information Theory 2020 1 1 277–291

[20] Qinbin Li and Zeyi Wen and Zhaomin Wu and Sixu Hu and Naibo Wang and Yuan Li and Xu Liu and Bingsheng He A survey on federated learning systems: Vision, hype and reality for data privacy and protection IEEE Transactions on Knowledge and Data Engineering 2021

[21] Sagar Dhakal and Saurav Prakash and Yair Yona and Shilpa Talwar and Nageen Himayat Coded federated learning 2019 IEEE Globecom Workshops (GC Wkshps) 2019 1–6 IEEE

[22] Saurav Prakash and Sagar Dhakal and Mustafa Riza Akdeniz and Yair Yona and Shilpa Talwar and Salman Avestimehr and Nageen Himayat Coded computing for low-latency federated learning over wireless edge networks IEEE Journal on Selected Areas in Communications 2020 39 1 233–250

[23] Arjun Anand and Sagar Dhakal and Mustafa Akdeniz and Brandon Edwards and Nageen Himayat Differentially private coded federated linear regression 2021 IEEE Data Science and Learning Workshop (DSLW) 2021 1–6 IEEE

[24] Yuchang Sun and Jiawei Shao and Yuyi Mao and Songze Li and Jun Zhang Stochastic coded federated learning: Theoretical analysis and incentive mechanism design IEEE Transactions on Wireless Communications 2023

[25] Mehrdad Showkatbakhsh and Can Karakus and Suhas Diggavi Privacy-utility trade-off of linear regression under random projections and additive noise 2018 IEEE International Symposium on Information Theory (ISIT) 2018 186–190 IEEE

[26] Dastan Maulud and Adnan M Abdulazeez A review on linear regression comprehensive in machine learning Journal of Applied Science and Technology Trends 2020 1 4 140–147

[27] Douglas C Montgomery and Elizabeth A Peck and G Geoffrey Vining Introduction to linear regression analysis John Wiley & Sons 2021

[28] Jialin Wen and Benjamin Zi Hao Zhao and Minhui Xue and Alina Oprea and Haifeng Qian With great dispersion comes greater resilience: Efficient poisoning attacks and defenses for linear regression models IEEE Transactions on Information Forensics and Security 2021 16 3709–3723

[29] Weidong Liu and Xiaojun Mao and Xiaofei Zhang and Xin Zhang Efficient Sparse Least Absolute Deviation Regression with Differential Privacy IEEE Transactions on Information Forensics and Security 2024

[30] Ali Rahimi and Benjamin Recht Random features for large-scale kernel machines Advances in neural information processing systems 2007 20

[31] Paul Cuff and Lanqing Yu Differential privacy as a mutual information constraint Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security 2016 43–54

[32] Qiongxiu Li and Jaron Skovsted Gundersen and Richard Heusdens and Mads Græsbøll Christensen Privacy-preserving distributed processing: metrics, bounds and algorithms IEEE Transactions on Information Forensics and Security 2021 16 2090–2103

[33] Thomas M Cover Elements of information theory John Wiley & Sons 1999

[34] Alexander Rakhlin and Ohad Shamir and Karthik Sridharan Making gradient descent optimal for strongly convex stochastic optimization arXiv preprint arXiv:1109.5647 2011

[35] Kang Wei and Jun Li and Ming Ding and Chuan Ma and Howard H Yang and Farhad Farokhi and Shi Jin and Tony QS Quek and H Vincent Poor Federated learning with differential privacy: Algorithms and performance analysis IEEE Transactions on Information Forensics and Security 2020 15 3454–3469

[36] Kang Wei and Jun Li and Chuan Ma and Ming Ding and Wen Chen and Jun Wu and Meixia Tao and H Vincent Poor Personalized federated learning with differential privacy and convergence guarantee IEEE Transactions on Information Forensics and Security 2023

[37] Xinchen Lyu and Xinyun Hou and Chenshan Ren and Xin Ge and Penglin Yang and Qimei Cui and Xiaofeng Tao Secure and Efficient Federated Learning with Provable Performance Guarantees via Stochastic Quantization IEEE Transactions on Information Forensics and Security 2024

Collapse and expand sections

Cross-references and related material

Discussions

Table of contents

Abstract

1 Introduction

2 Problem Model

3 The Proposed Method

3.1 The first stage: before training

3.2 The second stage: training iterations

4 Performance Analysis and The Adaptive Policy

4.1 Privacy performance of ACFL

4.2 Learning performance of ACFL

4.3 The adaptive policy

5 Simulations

6 Conclusions

References

Discussion: create topic login to participate.

Dynamic display of documents.

Collapse and expand sections

Cross-references and related material

Discussions

Table of contents

Abstract

1 Introduction

2 Problem Model

3 The Proposed Method

3.1 The first stage: before training

3.2 The second stage: training iterations

4 Performance Analysis and The Adaptive Policy

4.1 Privacy performance of ACFL

4.2 Learning performance of ACFL

4.3 The adaptive policy

5 Simulations

6 Conclusions

References

Discussion: create topic login to participate.