:::important 🌏 Source Available at: https://arxiv.org/abs/2009.04177. Source code at: SuSir1996/MU-GAN. :::

Task: **facial attribute editing**, which has two main objectives:

- To translate an image from a source domain to a target one
- To change only the facial regions related to a target attribute while preserving the attribute-excluding details

The **Multi-attention U-Net-based Generative Adversarial Network (MU-GAN)**:

- "Classic convolutional encoder-decoder" --> "Symmetric U-Net-like structure in a generator": Apply an additive attention mechanism to build attention-based U-Net connections for adaptively transferring encoder representations to complement a decoder with attribute-excluding detail and enhance attribute editing ability.
- A self-attention mechanism is incorporated into convolutional layers for modeling long-range and multi-level dependencies across image regions.

CNN-based GANs are good at editing local attributes and synthesizing images with few geometric constraints. However, they have difficulty in editing images with geometric or structural patterns (such as facial attribute editing). It is also known that there are complex coupling relationships among facial attributes (e.g.: gender and beard).

Thus, a desired model needs to have the ability to decouple attributes in order to meet the requirements of the target labels. To solve these problems, we construct a new generator with a novel encoder-decoder architecture and propose a **Multi-attention U-Net-based GAN (MU-GAN) model.**

- We construct a symmetric U-Net-like architecture generator based on additive attention mechanism, which effectively enhances our method's detail preservation and attribute manipulation abilities.
- We take a self-attention mechanism into the existing encoder-decoder architecture thus effectively enforcing geometric constraints on generated results.
- We introduce a multi-attention mechanism to help attribute decoupling, i.e., it only changes the attributes that need to be changed.

To address detail retention and blurry images, the paper replaces the original asymmetric CNN-based encoder-decoder with a symmetrical Attention U-Net architecture. Besides, instead of directly connecting an encoder to a decoder via skip-connections, **the paper presents attention U-Net connections (AUCs) to selectively transfer attribute-irrelevant representations from an encoder.** Then, AUCs concatenate encoder representations with decoder ones to improve image quality and detail preservation.

**With an attention mechanism**, AUCs are capable of **filtering out representations related to original attributes** while **preserving attribute-irrelevant details**. It can **promote the image fidelity** without **weakening attribute manipulation ability**.

Self-attention mechanism.

real / fake: two sub-networks.

In MU-GAN, the generator $G$ consists of two sub-networks: $G_{enc}$ + $G_{dec}$.

Loss is calculated as three parts:

- Adversarial loss
- Attribute classification
- Reconstruction loss

Overall objective:

Compare `MU-GAN` with `AttGAN` and `STGAN`.

Multi-attention mechanism: AUCs and self-attention mechanisms → into a symmetrical U-Net-like architecture.

The Fourier transform is a commonly used technique in many different fields. In mathematics, people use the Fourier transform to solve differential equations, and in signal processing it is used to study the seasonality of time series. To prevent misuse, we had better understand the theoretical background of the Fourier transform before diving into applications. Prepare some paper and pencils: our adventure begins with the heat equation.

In the field of **partial differential equations (PDEs)**, heat equation is defined as follows

$$ {\partial u\over\partial t}=k\nabla^2u $$

where $\nabla^2=\nabla\cdot\nabla$ is the Laplacian operator. To prevent too much digression, we only want to focus on its one-dimensional form:

$$ {\partial\over\partial t}u(x,t)=k{\partial^2\over\partial x^2}u(x,t)\tag1 $$

with the initial condition and the boundary condition being $u(x,0)=f(x)$ and $u(0,t)=u(L,t)=0$

We will not go into too much detail about this equation since this is not the topic of this blog.

Since the conditions and (1) are linear and homogeneous, we are allowed to perform separation of variables. That is, we set

$$ u(x,t)=X(x)T(t) $$

Plugging this new definition into (1), we get

$$ {T'\over T}(t)=k{X''\over X}(x) $$

The left-hand side depends only on $t$ and the right-hand side only on $x$, so differentiating the equation with respect to either variable gives zero; both sides must therefore equal a constant. In order for this constant to be *physically meaningful*, we set it negative:

$$ \frac1k{T'\over T}(t)={X''\over X}(x)=-\lambda^2 $$

which allows us to separate this PDE into two **ordinary differential equations (ODEs)**:

$$ \begin{cases} T'(t)&=-\lambda^2kT(t) \\ X''(x)&=-\lambda^2X(x) \end{cases} $$

and using techniques learned from calculus, we obtain a special solution for $u(x,t)$:

$$ u(x,t)=Ae^{-\lambda^2kt}\sin\left(\lambda x+\varphi\right) $$

Now, if we were to plug in the boundary conditions of this problem, we get

$$ u_n(x,t)=Ae^{-\lambda_n^2kt}\sin(\lambda_nx) $$

with $\lambda_n$ being $n\pi/L$. Finally, by the linearity of (1), we get its general solution:

$$ u(x,t)=\sum_{n=1}^\infty u_n(x,t)=\sum_{n=1}^\infty A_ne^{-\lambda_n^2kt}\sin(\lambda_nx) $$

We are almost there. Plugging in the initial condition gives

$$ f(x)=\sum_{n=1}^\infty A_n\sin(\lambda_nx)\tag2 $$

All that is left is to determine the coefficients. To begin with, we consider the following integral

$$ I_{m,n}=\int_0^L\sin(\lambda_mx)\sin(\lambda_nx)\mathrm dx $$

For $m=n$, we have

$$ \begin{aligned} I_{m,m} &=\int_0^L\sin^2(\lambda_mx)\mathrm dx \\ &=\int_0^L{1-\cos(2\lambda_mx)\over2}\mathrm dx \\ &=\frac L2 \end{aligned} $$

but when $m\ne n$, we can use the fact that

$$ \sin\alpha\sin\beta={\cos(\alpha-\beta)-\cos(\alpha+\beta)\over2} $$

to get

$$ I_{m,n}=\frac12\int_0^L[\cos(\lambda_{m-n}x)-\cos(\lambda_{m+n}x)]\mathrm dx=0 $$

Combining these results, we get

$$ \int_0^L\sin(\lambda_mx)\sin(\lambda_nx)\mathrm dx=\frac L2\delta_{mn} $$

which, applied to (2), gives a formula to determine $A_n$:

$$ A_n=\frac2L\int_0^Lf(x)\sin(\lambda_nx)\mathrm dx\tag3 $$
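As a numerical sanity check of (2) and (3), the following sketch approximates $A_n$ with a midpoint rule and evaluates the truncated series solution. The initial profile $f(x)=x(L-x)$, the values $L=1$ and $k=1$, and the function names are assumptions chosen purely for illustration.

```python
import math

L = 1.0  # rod length (assumed value for illustration)
K = 1.0  # diffusivity k in equation (1) (assumed value)

def f(x):
    """Hypothetical initial temperature profile with f(0) = f(L) = 0."""
    return x * (L - x)

def sine_coefficient(n, samples=2000):
    """Approximate A_n = (2/L) * int_0^L f(x) sin(n pi x / L) dx (midpoint rule)."""
    h = L / samples
    total = 0.0
    for i in range(samples):
        x = (i + 0.5) * h
        total += f(x) * math.sin(n * math.pi * x / L)
    return 2.0 / L * total * h

def heat_solution(x, t, terms=50):
    """Truncated series sum_n A_n exp(-lambda_n^2 k t) sin(lambda_n x)."""
    u = 0.0
    for n in range(1, terms + 1):
        lam = n * math.pi / L
        u += sine_coefficient(n) * math.exp(-lam * lam * K * t) * math.sin(lam * x)
    return u
```

At $t=0$ the truncated series should reproduce $f(x)$ up to truncation error, and for $t>0$ the high-frequency terms decay fastest because of the factor $e^{-\lambda_n^2kt}$.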

As shown in (2), the solution to the heat equation requires us to determine the coefficients of a trigonometric series, and this type of series is known as the **Fourier series** to acknowledge Joseph Fourier for his contribution in the study of heat equation. Particularly, Fourier hypothesizes all functions defined on an arbitrary $[T_0,T_0+T]$ can be represented as superposition of sinusoids:

$$ f(x)=A_0+\sum_{n=1}^\infty A_n\sin(\lambda_nx+\varphi_n)\tag4 $$

with $\lambda_n=2\pi n/T$. However, (4) is not a good version of Fourier series to work with, so we may consider applying angle-sum identity of sine function:

$$ \begin{aligned} f(x) &=A_0+\sum_{n=1}^\infty A_n[\sin\varphi_n\cos(\lambda_nx)+\cos\varphi_n\sin(\lambda_nx)] \\ &=A_0+\sum_{n=1}^\infty[a_n\cos(\lambda_nx)+b_n\sin(\lambda_nx)] \end{aligned} $$

Hence, we are able to develop similar means in (3) to get

$$ \begin{aligned} A_0&=\frac1T\int_{T_0}^{T_0+T}f(x)\mathrm dx \\ a_n&=\frac2T\int_{T_0}^{T_0+T}f(x)\cos(\lambda_nx)\mathrm dx \\ b_n&=\frac2T\int_{T_0}^{T_0+T}f(x)\sin(\lambda_nx)\mathrm dx \end{aligned} $$

Because the formulas for $A_0$ and $a_n$ are so similar, some texts define the Fourier series in the following fashion:

$$ f(x)={a_0\over2}+\sum_{n=1}^\infty\left[a_n\cos\left(2\pi nx\over T\right)+b_n\sin\left(2\pi nx\over T\right)\right]\tag5 $$

Oftentimes, the Fourier expansion of a function helps us evaluate certain series. For instance, let's set $T_0=0$, $T=2\pi$, and $f(x)=x^2$; we have

$$ {a_0\over2}={1\over2\pi}\int_0^{2\pi} x^2\mathrm dx={4\pi^2\over3} $$

and for $n\ge1$:

$$ \begin{aligned} a_n &=\frac1\pi\int_0^{2\pi} x^2\cos(nx)\mathrm dx \\ &=\frac1\pi\left[\frac1nx^2\sin(nx)|_0^{2\pi}-\frac1n\int_0^{2\pi}2x\sin(nx)\mathrm dx\right] \\ &=\frac1\pi\left[{1\over n^2}2x\cos(nx)|_0^{2\pi}-{2\over n^2}\int_0^{2\pi}\cos(nx)\mathrm dx\right] \\ &={4\over n^2} \end{aligned} $$

which gives

$$ x^2={4\pi^2\over3}+\sum_{n=1}^\infty\left[{4\over n^2}\cos(nx)+b_n\sin(nx)\right] $$

Setting $x=\pi$, where every sine term vanishes and $\cos(n\pi)=(-1)^n$, we have

$$ \pi^2={4\pi^2\over3}+4\sum_{n=1}^\infty{(-1)^n\over n^2} $$

resulting in

$$ \sum_{n=1}^\infty{(-1)^n\over n^2}=-{\pi^2\over12} $$

which is an equivalent form of the **Basel problem**.
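The value above is easy to check numerically. The sketch below sums the alternating series directly; the function name and the number of terms are arbitrary choices for illustration.

```python
import math

def alternating_basel(terms):
    """Partial sum of sum_{n>=1} (-1)^n / n^2."""
    return sum((-1) ** n / n ** 2 for n in range(1, terms + 1))

# The truncation error of an alternating series is bounded by the first omitted term.
approx = alternating_basel(10000)
exact = -math.pi ** 2 / 12
```

With 10000 terms the two values should agree to roughly eight decimal places.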

Equation (5) still looks a bit complicated, but **Euler's formula** for trigonometric functions can help us simplify:

$$ \begin{aligned} \cos\theta&={e^{i\theta}+e^{-i\theta}\over2} \\ \sin\theta&={e^{i\theta}-e^{-i\theta}\over2i} \end{aligned} $$

As a result, we obtain the following series

$$ f(x)=\sum_{n\in\mathbb Z}c_ne^{2\pi inx/T}\tag6 $$

and by similar means in previous sections, we acquire the following formula for $c_n$:

$$ c_n=\frac1T\int_{T_0}^{T_0+T}f(x)e^{-2\pi inx/T}\mathrm dx $$

Since sinusoids are periodic, Fourier series essentially serve to produce trigonometric representations of periodic functions. Nonetheless, most functions we study in mathematics are not periodic, indicating that a stronger tool is needed.

If a function is aperiodic, then why not consider its period is the entire real line $(-\infty,\infty)$? In our case, set $T_0=-T/2$, so that if we take the limit $T\to\infty$ the Fourier series can express $f(x)$ entirely.

In addition, define

$$ \hat f\left(\xi\right)=Tc_n=\int_{-T/2}^{T/2}f(x)e^{-2\pi ix\xi}\mathrm dx\tag7 $$

so that

$$ f(x)=\lim_{M\to\infty}\frac1T\sum_{-MT/2\le n\le MT/2}\hat f\left(\frac nT\right)e^{2\pi ix(n/T)} $$

Now, let's consider dividing the summation region into subintervals:

$$ -M\le\cdots<\xi_0<\xi_1<\cdots\le M $$

and require the length of each subinterval to be $1/T$; then we have

$$ f(x)=\lim_{M\to\infty}\sum_n\hat f(\xi_n)e^{2\pi ix\xi_n}\Delta\xi_n\tag8 $$

and if we take $T\to\infty$, then (8) and (7) become

$$ f(x)=\int_{-\infty}^\infty\hat f(\xi)e^{2\pi ix\xi}\mathrm d\xi $$

$$ \hat f(\xi)=\int_{-\infty}^\infty f(x)e^{-2\pi ix\xi}\mathrm dx $$

wherein $\hat f(\xi)$ is the **Fourier transform** of $f(x)$ and $f(x)$ is called the inverse Fourier transform of $\hat f(\xi)$. Moreover, because Fourier transform is regarded as continuous analog of Fourier series, the variable $\xi$ is usually called the **frequency**, and $\hat f(\xi)$ gives the amplitude of the sinusoid with frequency $\xi$.

For these reasons, in the field of signal processing, $f(t)$ is often called the **time domain** representation and $\hat f(\xi)$ the **frequency domain** representation, so the Fourier transform becomes a tool to connect them.
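To make the transform pair concrete, here is a small numerical sketch that approximates $\hat f(\xi)$ with a midpoint rule. It uses the Gaussian $f(x)=e^{-\pi x^2}$, whose transform under this convention is known to be $e^{-\pi\xi^2}$. The function names, the truncation of the real line to $[-10,10]$, and the sample count are assumptions made for illustration.

```python
import cmath
import math

def gaussian(x):
    """Test function f(x) = exp(-pi x^2)."""
    return math.exp(-math.pi * x * x)

def fourier_transform(f, xi, half_width=10.0, samples=20000):
    """Midpoint-rule approximation of int f(x) exp(-2 pi i x xi) dx over [-half_width, half_width]."""
    h = 2.0 * half_width / samples
    total = 0j
    for i in range(samples):
        x = -half_width + (i + 0.5) * h
        total += f(x) * cmath.exp(-2j * math.pi * x * xi)
    return total * h

value = fourier_transform(gaussian, 0.5)
```

Here `value` should be close to $e^{-\pi/4}$ with a negligible imaginary part, since the Gaussian is even.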

Combining the two identities above gives us the **Fourier inversion theorem**:

$$ f(t)=\int_{-\infty}^\infty e^{2\pi i\xi t}\int_{-\infty}^\infty f(x)e^{-2\pi i\xi x}\mathrm dx\mathrm d\xi $$

For simplicity in the exponential terms, we define the **angular frequency** $\omega=2\pi\xi$, resulting in an alternative version of Fourier inversion formula:

$$ f(t)={1\over2\pi}\int_{-\infty}^\infty e^{i\omega t}\int_{-\infty}^\infty e^{-i\omega x}f(x)\mathrm dx\mathrm d\omega $$

This implies a pair of Fourier transform equations using angular frequencies:

$$ F(\omega)=\int_{-\infty}^\infty e^{-i\omega t}f(t)\mathrm dt\tag9 $$

$$ f(t)={1\over2\pi}\int_{-\infty}^\infty e^{i\omega t}F(\omega)\mathrm d\omega\tag{10} $$

In order for $F(\omega)$ to exist, we ensure that $f(t)$ is square-integrable (i.e. this integral $\int|f|^2$ must converge).

In quantum mechanics, the Schrödinger equation for particular problems can often be simplified into **Airy's equation**:

$$ y''-xy=0\tag{11} $$

If we differentiate (9) with respect to $\omega$, we get

$$ F'(\omega)=-i\int_{-\infty}^\infty e^{-i\omega t}tf(t)\mathrm dt $$

Moreover, if we differentiate (10) with respect to $t$ twice, we get

$$ f''(t)={1\over2\pi}\int_{-\infty}^\infty e^{i\omega t}[-\omega^2F(\omega)]\mathrm d\omega $$

As a result, if we were to denote $Y$ as the Fourier transform of $y$, then the Fourier transform on (11) results in

$$ -\omega^2Y-iY'=0 $$

Using elementary techniques, we obtain the solution as follows:

$$ Y(\omega)=Y_0e^{i\omega^3/3} $$

which gives us the spectrum of the square-integrable special solution to (11). Subsequently, taking the inverse Fourier transform of both sides gives

$$ y={Y_0\over2\pi}\int_{-\infty}^\infty e^{i\omega x}e^{i\omega^3/3}\mathrm d\omega $$

Because the complex exponential makes the solution look too *intimidating*, we may consider splitting the interval of integration to get a more friendly version:

$$ \begin{aligned} y &={Y_0\over2\pi}\left[\underbrace{\int_{-\infty}^0 e^{i(\omega x+\omega^3/3)}\mathrm d\omega}_{\omega\mapsto-\omega}+\int_0^\infty e^{i(\omega x+\omega^3/3)}\mathrm d\omega\right] \\ &={Y_0\over\pi}\int_0^\infty{e^{i(\omega x+\omega^3/3)}+e^{-i(\omega x+\omega^3/3)}\over2}\mathrm d\omega \\ &={Y_0\over\pi}\int_0^\infty\cos\left(\omega x+{\omega^3\over3}\right)\mathrm d\omega \end{aligned} $$

and if we specify $Y_0=1$, we get the Airy's function of the first kind:

$$ \operatorname{Ai}(x)=\frac1\pi\int_0^\infty\cos\left(\omega x+{\omega^3\over3}\right)\mathrm d\omega $$

As (11) suggests, Airy's equation is a second order ODE, meaning there exists another branch of solutions that are linearly independent to $Y_0\operatorname{Ai}(x)$. Consequently, Fourier transform only gives the square-integrable branch of the general solution.

Although integrals look beautiful, it is not easy for computers to evaluate integrals, especially when it is integrating over the entire real line. As a result, we may consider discretizing the problem for computer use.

In reality, we do not capture signals as a continuous flow but as a discrete sequence, so discretizing the time domain will help us apply the Fourier transform to discrete signals. In particular, if we discretize the time domain with sampling period $T$, we get

$$ f_T(t)=f(t)\sum_{n=-\infty}^\infty\delta(t-nT) $$

The summation of delta functions is often called the **Dirac comb**.

Consequently, its Fourier transform becomes

$$ \begin{aligned} F(\omega) &=\int_{-\infty}^\infty f_T(t)e^{-i\omega t}\mathrm dt \\ &=\int_{-\infty}^\infty f(t)\sum_{n=-\infty}^\infty\delta(t-nT)e^{-i\omega t}\mathrm dt \\ &=\sum_{n=-\infty}^\infty\int_{-\infty}^\infty\delta(t-nT)f(t)e^{-i\omega t}\mathrm dt \end{aligned} $$

Now, using the property of **Dirac delta function** that

$$ \int_{-\infty}^\infty\delta(t-a)g(t)\mathrm dt=g(a) $$

we deduce the formula for the **discrete-time Fourier transform (DTFT)**:

$$ F(\omega)=\sum_{n=-\infty}^\infty f(nT)e^{-i\omega nT} $$

wherein the inversion formula is the same as (10).

Thanks to the delta function, we turned integrals into summations. However, the inversion formula for the DTFT is still in terms of integrals, so we need to manipulate this expression to fit real-life use. Because in reality we process finite-sized signals, discrete signals are only finite sequences. As a result, let's consider the DTFT of a discrete signal with length $N$:

$$ f(nT)= \begin{cases} x_n & 0\le n\le N-1 \\ 0 & \text{otherwise} \end{cases} $$

Then using this definition, we obtain the DTFT for discrete and finite signals:

$$ F(\omega)=\sum_{n=0}^{N-1}x_ne^{-i\omega nT}\tag{12} $$

Since discrete signal only contains discrete periods, we only need to consider discrete frequencies. According to the definition of the sequence, its period is $NT$, as a result its discretized frequency can be written as

$$ \omega={2\pi k\over NT} $$

which gives us

$$ F\left(2\pi k\over NT\right)=\sum_{n=0}^{N-1}x_ne^{-2\pi ikn/N} $$

By the periodicity of the complex exponential, it suffices to take $k\in\{0,1,\dots,N-1\}$. To recover the original signal, we apply (10) with the new discretized version of the frequency domain:

$$ \begin{aligned} x_n &=f(nT) \\ &={1\over2\pi}\int_{-\infty}^\infty e^{i\omega nT}\sum_{k=0}^{N-1}F(\omega)\delta\left({N\omega\over2\pi}-{k\over T}\right)\mathrm d\omega \\ &=\frac1N\sum_{k=0}^{N-1}e^{2\pi ikn/N}F\left(2\pi k\over NT\right) \end{aligned} $$

Now, if we define the **Fourier coefficients** $X_k$ as the amplitudes at the discrete frequencies, $X_k=F(2\pi k/NT)$, then we get the following equations for the **discrete Fourier transform (DFT)** and its inversion:

$$ X_k=\sum_{n=0}^{N-1}x_ne^{-2\pi ink/N}\tag{13} $$

$$ x_n=\frac1N\sum_{k=0}^{N-1}X_ke^{2\pi ikn/N}\tag{14} $$

It can be verified that (13) and (14) conform to the definition of the DFT in numpy and scipy.
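The pair (13) and (14) can be transcribed almost literally. The pure-Python sketch below (the names `dft` and `idft` and the sample signal are illustrative assumptions) checks that applying (14) after (13) recovers the input; `numpy.fft.fft` and `numpy.fft.ifft` use the same sign and $1/N$ conventions.

```python
import cmath

def dft(x):
    """Equation (13): X_k = sum_n x_n exp(-2 pi i n k / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Equation (14): x_n = (1/N) sum_k X_k exp(2 pi i k n / N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

signal = [1.0, 2.0, 0.0, -1.0]
spectrum = dft(signal)      # approximately [2, 1-3j, 0, 1+3j]
recovered = idft(spectrum)  # approximately the original signal
```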

It is well-known that the DFT is frequently used in signal processing. Now let's try analyzing the spectrum of the greatest common divisor function, i.e. the length-$n$ signal $x_m=\gcd(n,m)$:

$$ \begin{aligned} X_k &=\sum_{m=0}^{n-1}\gcd(n,m)e^{-2\pi imk/n} \\ &=\sum_{r=1}^n\gcd(n,r)e^{-2\pi irk/n} \end{aligned} $$

Since $\gcd(n,r)$ divides $n$, we can consider summing the summands using the value of $\gcd(n,r)$:

$$ \begin{aligned} X_k &=\sum_{d|n}d\sum_{r=jd\atop\gcd(j,n/d)=1}e^{-2\pi ijdk/n} \\ &=\sum_{d|n}d\sum_{1\le j\le n/d\atop\gcd(j,n/d)=1}e^{-2\pi ijk/(n/d)} \\ &\triangleq\sum_{d|n}d\cdot c_{n/d}(k) \end{aligned} $$

Now, let's turn our focus to $c_m(k)$:

$$ \begin{aligned} c_m(k) &=\sum_{1\le j\le m\atop\gcd(j,m)=1}e^{-2\pi ijk/m} \\ &=\sum_{j=1}^me^{-2\pi ijk/m}\left\lfloor1\over\gcd(j,m)\right\rfloor \end{aligned} $$

Now, apply the Möbius identity $\sum_{d|N}\mu(d)=\lfloor1/N\rfloor$ with $N=\gcd(j,m)$ to the floor function, so we get

$$ \begin{aligned} c_m(k) &=\sum_{j=1}^me^{-2\pi ijk/m}\sum_{d|j,d|m}\mu(d) \\ &=\sum_{d|m}\mu(d)\sum_{j=qd}e^{-2\pi ijk/m} \\ &=\sum_{d|m}\mu(d)\sum_{q=1}^{m/d}e^{-2\pi iqk/(m/d)} \\ &=\sum_{d|m}\mu\left(\frac md\right)\sum_{q=1}^de^{-2\pi iqk/d} \end{aligned} $$

If $d|k$, the inner summation becomes $d$, but when $d\nmid k$ we have

$$ \sum_{q=1}^de^{-2\pi iqk/d}=e^{-2\pi ik/d}\cdot{e^{-2\pi ik}-1\over e^{-2\pi ik/d}-1}=0 $$

As a result, we obtain a nice expression for $c_m(k)$:

$$ c_m(k)=\sum_{d|m,d|k}\mu\left(\frac md\right)d $$

In particular, if $k=1$ this expression becomes the Möbius function $\mu(m)$. Accordingly, plugging $k=1$ into the $X_k$ expression gives

$$ X_1=\sum_{d|n}\mu\left(\frac nd\right)d=c_n(0)=\sum_{1\le j\le n\atop\gcd(j,n)=1}1=\varphi(n) $$

Eventually, we discover that the Fourier transform of the greatest common divisor function gives us the **Euler's totient function**. In addition, because $\varphi(n)$ is real, we can take the real component on the expression of $X_1$, yielding

$$ \varphi(n)=\sum_{r=1}^n\gcd(n,r)\cos\left(2\pi r\over n\right) $$
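Both the closed form for $c_m(k)$ and the cosine formula for $\varphi(n)$ can be checked by brute force. The sketch below is only a verification aid; all function names are hypothetical and the Möbius function is computed by simple trial division.

```python
import cmath
import math

def mobius(n):
    """Möbius function mu(n), computed by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # n has a squared prime factor
            result = -result
        p += 1
    if n > 1:  # one remaining prime factor
        result = -result
    return result

def ramanujan_direct(m, k):
    """c_m(k) as a sum over 1 <= j <= m with gcd(j, m) = 1."""
    return sum(cmath.exp(-2j * cmath.pi * j * k / m)
               for j in range(1, m + 1) if math.gcd(j, m) == 1)

def ramanujan_mobius(m, k):
    """c_m(k) as the sum of mu(m/d) * d over common divisors d of m and k."""
    return sum(mobius(m // d) * d
               for d in range(1, m + 1) if m % d == 0 and k % d == 0)

def totient_direct(n):
    """phi(n) by counting residues coprime to n."""
    return sum(1 for j in range(1, n + 1) if math.gcd(j, n) == 1)

def totient_fourier(n):
    """phi(n) as sum_r gcd(n, r) cos(2 pi r / n)."""
    return sum(math.gcd(n, r) * math.cos(2 * math.pi * r / n)
               for r in range(1, n + 1))
```

Up to floating-point rounding, `ramanujan_direct` and `ramanujan_mobius` should agree for every pair $(m,k)$, and `totient_fourier(n)` should match `totient_direct(n)`.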

In this article, we began our exploration with a concept from physics. After deriving the continuous form of the Fourier transform, we applied it to solving a differential equation. In addition, we discovered an application of the discrete Fourier transform in number theory. Hence, we can see that the Fourier transform has applications in different areas of mathematics, and our derivation brought together several of them: mathematical physics, mathematical analysis, and number theory!

If you have no prior knowledge in MVC and find it difficult to understand the maths behind the DFT, you can always refer to Jason's implementation of the DFT below.

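```c
#include <complex.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

// Naive O(N^2) DFT, computed in place on X.
void dft(double complex *X, const size_t N) {
    double complex tmp[N];  // scratch buffer (C99 variable-length array)
    for (size_t i = 0; i < N; ++i) {
        tmp[i] = 0;
        for (size_t j = 0; j < N; ++j) {
            tmp[i] += X[j] * cexp(-2.0 * M_PI * I * j * i / N);
        }
    }
    memcpy(X, tmp, N * sizeof(*X));
}
```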

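```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

using complex = std::complex<double>;

// Naive O(N^2) DFT over the range [X, last), written back in place.
template <typename Iter>
void dft(Iter X, Iter last) {
    const auto N = last - X;
    std::vector<complex> tmp(N);  // value-initialized to zero
    for (auto i = 0; i < N; ++i) {
        for (auto j = 0; j < N; ++j) {
            tmp[i] += X[j] * exp(complex(0, -2.0 * M_PI * i * j / N));
        }
    }
    std::copy(std::begin(tmp), std::end(tmp), X);
}
```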

```python
from cmath import exp, pi

def dft(X):
    """Naive O(N^2) DFT of the sequence X."""
    N = len(X)
    temp = [0] * N
    for i in range(N):
        for k in range(N):
            temp[i] += X[k] * exp(-2.0j * pi * i * k / N)
    return temp
```

In this function, we define `n` to be a set of integers from $0$ to $N-1$ and arrange them in a column. We then set `k` to be the same thing, but in a row. This means that when we multiply them together, we get a matrix, but not just any matrix! This matrix is the heart of the transformation itself!

```
M = [1.0+0.0im 1.0+0.0im 1.0+0.0im 1.0+0.0im;
1.0+0.0im 6.12323e-17-1.0im -1.0-1.22465e-16im -1.83697e-16+1.0im;
1.0+0.0im -1.0-1.22465e-16im 1.0+2.44929e-16im -1.0-3.67394e-16im;
1.0+0.0im -1.83697e-16+1.0im -1.0-3.67394e-16im 5.51091e-16-1.0im]
```
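The matrix described above can be sketched in plain Python as follows. The names `dft_matrix` and `dft_by_matrix` are hypothetical; rows are indexed by `n` and columns by `k`, so entry `(n, k)` is $e^{-2\pi ink/N}$.

```python
import cmath

def dft_matrix(N):
    """N x N matrix whose (n, k) entry is exp(-2 pi i n k / N)."""
    return [[cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N)]
            for n in range(N)]

def dft_by_matrix(x):
    """DFT written as a matrix-vector product M @ x."""
    M = dft_matrix(len(x))
    return [sum(row[n] * x[n] for n in range(len(x))) for row in M]

M = dft_matrix(4)  # entries are 1, -1j, -1 and 1j up to rounding error
```

Written this way, the quadratic cost is visible: each of the $N$ outputs is a dot product of length $N$.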

It was amazing to us when we saw the transform for what it truly was: an actual transformation matrix! That said, the Discrete Fourier Transform is slow -- primarily because matrix multiplication is slow, and as mentioned before, slow code is not particularly useful.

So what was the trick that everyone used to go from a Discrete Fourier Transform to a **Fast** Fourier Transform?

Recursion! More about that next week.

$$ \zeta(s)=\sum_{n=1}^\infty{1\over n^s}\tag1 $$

To determine its convergence, let's consider Riemann-Stieltjes integration:

$$ \begin{aligned} \zeta_N(s) &=\sum_{n=1}^N{1\over n^s}=\int_{1-\varepsilon}^N{\mathrm d\lfloor t\rfloor\over t^s} \\ &=\left.{\lfloor t\rfloor\over t^s}\right|_{1-\varepsilon}^N+s\int_1^N{\lfloor t\rfloor\over t^{s+1}}\mathrm dt \\ &=N^{1-s}+s\int_1^N{\lfloor t\rfloor\over t^{s+1}}\mathrm dt \end{aligned} $$

It can be easily verified that, as $N\to\infty$, this expression converges absolutely and uniformly when $\Re(s)>1$, which allows us to make some manipulations with it. Let's have a look.
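The summation-by-parts identity for $\zeta_N(s)$ can be checked numerically. On each interval $[k,k+1)$ the floor function equals $k$, so $s\int_k^{k+1}kt^{-s-1}\mathrm dt=k\left[k^{-s}-(k+1)^{-s}\right]$. The sketch below uses this closed form; the function names are illustrative assumptions.

```python
def zeta_partial_sum(s, N):
    """zeta_N(s) = sum_{n=1}^N n^(-s), computed directly."""
    return sum(n ** (-s) for n in range(1, N + 1))

def zeta_partial_integral_form(s, N):
    """N^(1-s) + s * int_1^N floor(t) t^(-s-1) dt, integrated piecewise in closed form."""
    integral_times_s = sum(k * (k ** (-s) - (k + 1) ** (-s)) for k in range(1, N))
    return N ** (1 - s) + integral_times_s
```

Both functions should return the same value for any $N\ge1$, including complex $s$.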

Define

$$ \psi(x)=\sum_{n=1}^\infty e^{-n^2\pi x} $$

Then by **Poisson's summation formula** we have

$$ 2\psi(x)+1=\sum_{n\in\mathbb Z}e^{-n^2\pi x}={1\over\sqrt x}\sum_{n\in\mathbb Z}e^{-n^2\pi/x} $$

which leads to

$$ \psi(x)={1\over\sqrt x}\psi\left(\frac1x\right)+{1\over2\sqrt x}-\frac12\tag2 $$
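Identity (2) lends itself to a quick numerical check, because the series for $\psi$ converges extremely fast. The following sketch (function names and the number of terms are arbitrary choices) evaluates both sides.

```python
import math

def psi(x, terms=200):
    """psi(x) = sum_{n>=1} exp(-n^2 pi x), truncated after `terms` terms."""
    return sum(math.exp(-n * n * math.pi * x) for n in range(1, terms + 1))

def psi_via_inversion(x, terms=200):
    """Right-hand side of (2): psi(1/x)/sqrt(x) + 1/(2 sqrt(x)) - 1/2."""
    return psi(1.0 / x, terms) / math.sqrt(x) + 1.0 / (2.0 * math.sqrt(x)) - 0.5
```

For small $x$ the left-hand side needs many terms while the right-hand side needs almost none, which is why this identity is useful in practice.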

Let's perform a Mellin transform on this function so that

$$ \begin{aligned} \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx &=\int_0^\infty x^{s/2-1}\sum_{n=1}^\infty e^{-n^2\pi x}\mathrm dx \\ &=\pi^{-s/2}\Gamma\left(\frac s2\right)\sum_{n=1}^\infty{1\over n^s} \end{aligned} $$

Now, by (1) we obtain this identity:

$$ \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx=\pi^{-s/2}\Gamma\left(\frac s2\right)\zeta(s)\tag3 $$

As a result, we can study the properties of the Riemann zeta function by digging deeper into the integral on the left hand side.

First, let's split the integral into two parts

$$ \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx=\int_0^1x^{s/2-1}\psi(x)\mathrm dx+\int_1^\infty x^{s/2-1}\psi(x)\mathrm dx $$

Then, applying (2) to $\int_0^1$ side gives

$$ \begin{aligned} \int_0^1x^{s/2-1}\psi(x)\mathrm dx &=\int_0^1x^{s/2-1}\left[{1\over\sqrt x}\psi\left(\frac1x\right)+{1\over2\sqrt x}-\frac12\right]\mathrm dx \\ &=\underbrace{\int_0^1x^{(s-1)/2-1}\psi\left(\frac1x\right)\mathrm dx}_{t=x^{-1}} \\ &+\frac12\int_0^1x^{(s-1)/2-1}\mathrm dx-\frac12\int_0^1x^{s/2-1}\mathrm dx \\ &=\int_1^\infty t^{(1-s)/2-1}\psi(t)\mathrm dt+{1\over s(s-1)} \end{aligned} $$

Now, plugging this result to the original equation gives

$$ \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx={1\over s(s-1)}+\int_1^\infty[x^{s/2}+x^{(1-s)/2}]\psi(x){\mathrm dx\over x} $$

We can observe that the right hand side does not change when we replace $s$ with $1-s$. Hence, by (3) we have

$$ \pi^{-s/2}\Gamma\left(\frac s2\right)\zeta(s) =\pi^{(s-1)/2}\Gamma\left({1-s\over2}\right)\zeta(1-s) $$

Now, to simplify, we multiply both sides by $\pi^{s/2}\Gamma\left(1-\frac s2\right)$ and note that $1-\frac s2={1-s\over2}+\frac12$:

$$ \Gamma\left(\frac s2\right)\Gamma\left(1-\frac s2\right)\zeta(s)=\pi^{s-1/2}\Gamma\left({1-s\over2}\right)\Gamma\left({1-s\over2}+\frac12\right)\zeta(1-s) $$

By **Euler's reflection formula**, we have

$$ \Gamma\left(\frac s2\right)\Gamma\left(1-\frac s2\right)=\pi\csc\left(\pi s\over2\right) $$

and by **Legendre's Duplication formula**, we deduce

$$ \Gamma\left({1-s\over2}\right)\Gamma\left({1-s\over2}+\frac12\right)=2^s\pi^{1/2}\Gamma(1-s) $$

Plugging these results back gives us

$$ \pi\csc\left(\pi s\over2\right)\zeta(s)=2^s\pi^s\Gamma(1-s)\zeta(1-s) $$

Now, if we were to perform more manipulations, we get

$$ \zeta(s)=2^s\pi^{s-1}\sin\left(\pi s\over2\right)\Gamma(1-s)\zeta(1-s) $$

which is known as the **functional equation** for $\zeta(s)$.
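As a sanity check, the functional equation can be evaluated at a few negative real $s$, where $\zeta(1-s)$ is given by the convergent series (1). The sketch below is illustrative only: the function names and the truncation length are assumptions, and `zeta_series` is only meaningful for real arguments greater than $1$.

```python
import math

def zeta_series(s, terms=200000):
    """Truncated Dirichlet series for zeta(s), intended for real s > 1."""
    return sum(n ** (-s) for n in range(1, terms + 1))

def zeta_reflected(s):
    """zeta(s) for real s < 0 via the functional equation."""
    return (2.0 ** s * math.pi ** (s - 1.0) * math.sin(math.pi * s / 2.0)
            * math.gamma(1.0 - s) * zeta_series(1.0 - s))
```

For example, $s=-1$ gives $2^{-1}\pi^{-2}\cdot(-1)\cdot\Gamma(2)\cdot\zeta(2)=-\frac1{12}$, and $s=-2$ gives zero because the sine factor vanishes.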

In this blog post, we begin with the Dirichlet series definition of $\zeta(s)$, and then we try to connect zeta function with an integral representation. Subsequently, we use Poisson's summation formula to obtain its analytic continuation. However, this analytic continuation has other impacts. If we look back to the equation

$$ \zeta(s)=2^s\pi^{s-1}\sin\left(\pi s\over2\right)\Gamma(1-s)\zeta(1-s), $$

we can observe that for $\Re(s)<0$ the right hand side becomes zero whenever $s=-2k$ for a positive integer $k$, because the sine factor vanishes there. Hence, we call such $s$ the **trivial zeros** of $\zeta(s)$. However, there are also other points at which $\zeta(s)$ is zero. Accordingly, we call that kind of zeros the **nontrivial zeros** of $\zeta(s)$.

Having gone through these definitions, we now have a good basic grasp of the Riemann hypothesis:

**Riemann hypothesis**: All nontrivial zeros of $\zeta(s)$ lie on the line $\Re(s)=\frac12$.

The Fast Gradient Sign Method, with perturbations limited by the $\ell_\infty$ or the $\ell_2$ norm.

- FGSM original definition
- FGSM as a maximum allowable attack problem
- FGSM with other norms
- References

The **Fast Gradient Sign Method (FGSM)** by Goodfellow et al. (ICLR 2015) is designed to attack deep neural networks. The idea is to maximize a certain loss function $\mathcal{J}(x; w)$ subject to an upper bound on the perturbation, for instance: $|x-x_0|_\infty \leq \eta$.

Formally, we define FGSM as follows. Given a loss function $\mathcal{J}(x; w)$, the FGSM creates an attack $x$ with:

$$x=x_0+\eta\cdot \text{sign}(\nabla_x\mathcal{J}(x_0;w))$$

Here the gradient is taken with respect to the input $x$, not the parameter $w$. Therefore, $\nabla_x\mathcal{J}(x_0;w)$ should be interpreted as the gradient of the loss $\mathcal{J}$ with respect to $x$, evaluated at $x_0$. Because the perturbation magnitude is bounded in the $\ell_\infty$ norm, the perturbation direction is the sign of the gradient.
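The definition can be illustrated on a toy model. The sketch below assumes a linear classifier with a logistic loss $\mathcal J(x;w)=\log(1+e^{-y\,w^Tx})$, which is only a stand-in for a deep network; the loss, the data, and all function names are hypothetical.

```python
import math

def sign(v):
    """Sign of a scalar: -1, 0 or 1."""
    return (v > 0) - (v < 0)

def logistic_loss(x, w, y):
    """J(x; w) = log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def loss_gradient_wrt_input(x, w, y):
    """Gradient of the logistic loss with respect to the input x (not w)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]

def fgsm(x0, w, y, eta):
    """x = x0 + eta * sign(grad_x J(x0; w))."""
    grad = loss_gradient_wrt_input(x0, w, y)
    return [xi + eta * sign(gi) for xi, gi in zip(x0, grad)]

x0 = [0.5, -1.0, 2.0]
w = [1.0, -2.0, 0.5]
x_adv = fgsm(x0, w, y=1, eta=0.1)
```

Every coordinate of `x_adv` differs from `x0` by at most `eta`, and the loss at `x_adv` is larger than at `x0`.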

Given a loss function $\mathcal{J}(x; w)$, if we can find an optimization problem whose solution is the FGSM attack, then we can generalize FGSM to a broader class of attacks. To this end, we notice that for general (possibly nonlinear) loss functions, FGSM can be interpreted as applying a first-order approximation:

$$\mathcal{J}(x;w)=\mathcal{J}(x_0+r;w)\approx\mathcal{J}(x_0;w)+\nabla_x\mathcal{J}(x_0;w)^Tr$$

where $x=x_0+r$. Therefore, finding $r$ to maximize $\mathcal{J}(x_0+r;w)$ is approximately equivalent to finding $r$ which maximizes $\nabla_x\mathcal{J}(x_0;w)^Tr$. Hence FGSM is the solution to:

$$\underset{r}{\textrm{maximize}}\quad\mathcal{J}(x_0;w)+\nabla_x\mathcal{J}(x_0;w)^Tr\quad\textrm{subject to}\quad|r|_\infty\leq\eta$$

which can be rewritten as:

$$\underset{r}{\textrm{minimize}}\quad-\nabla_x\mathcal{J}(x_0;w)^Tr-\mathcal{J}(x_0;w)\quad\textrm{subject to}\quad|r|_\infty\leq\eta$$

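where we flipped the maximization to a minimization by negating the objective. To lighten the notation, from here on we overload the symbol $w$ to denote $-\nabla_x\mathcal{J}(x_0;w)$ and let $w_0$ denote $-\mathcal{J}(x_0;w)$, so that the problem becomes: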

$$\underset{r}{\textrm{minimize}}\quad w^Tr+w_0\quad\textrm{subject to}\quad|r|_\infty\leq\eta$$

:::note Holder's Inequality Let $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^d$. For any $p, q\in[1,\infty]$ with $1/p+1/q=1$, we have the following inequality: $-|x|_p|y|_q\leq x^Ty\leq|x|_p|y|_q$. :::

We consider the Holder's inequality (the negative side), and so, we can show that:

$$w^Tr\geq-|r|_\infty|w|_1\geq-\eta|w|_1,\ \textrm{where}\ p=1, q=\infty$$

and because we have:

$$-\eta|w|_1=-\eta\sum_{i=1}^d|w_i|=-\eta\sum_{i=1}^dw_i\cdot\textrm{sign}(w_i)=-\eta \cdot w^T\cdot\textrm{sign}(w)$$

Thus, the lower bound of $w^Tr$ is attained when $r=-\eta\cdot\textrm{sign}(w)$. Since $w$ stands for $-\nabla_x\mathcal{J}(x_0;w)$ in the simplified problem, the solution to the original FGSM optimization problem is:

$$r=\eta\cdot\textrm{sign}(\nabla_x\mathcal{J}(x_0;w))$$

And hence the perturbed data is $x=x_0+\eta\cdot\textrm{sign}(\nabla_x\mathcal{J}(x_0 ;w))$.

Considering FGSM as a maximum allowable attack problem, we can easily generalize the attack to other $\ell_p$ norms. Consider the $\ell_2$ norm. In this case, the Holder's inequality equation becomes:

$$w^Tr\geq-|r|_2|w|_2\geq-\eta|w|_2,\ \textrm{where}\ p=2, q=2$$

and thus, $w^Tr$ is minimized when $r=-\eta\cdot w / |w|_2$. As a result, the perturbed data becomes:

$$x=x_0+\eta\cdot\frac{\nabla_x\mathcal{J}(x_0 ;w)}{|\nabla_x\mathcal{J}(x_0 ;w)|_2}$$
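The two perturbations can be compared side by side for a given gradient vector. In the sketch below, `grad` stands for $\nabla_x\mathcal J(x_0;w)$ and is simply a made-up vector; the function names are hypothetical.

```python
import math

def linf_perturbation(grad, eta):
    """r = eta * sign(grad), the maximizer of grad.r subject to |r|_inf <= eta."""
    return [eta * ((g > 0) - (g < 0)) for g in grad]

def l2_perturbation(grad, eta):
    """r = eta * grad / |grad|_2, the maximizer of grad.r subject to |r|_2 <= eta."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [eta * g / norm for g in grad]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

grad = [3.0, -4.0]  # made-up gradient for illustration
r_inf = linf_perturbation(grad, 0.5)  # [0.5, -0.5]
r_2 = l2_perturbation(grad, 0.5)      # approximately [0.3, -0.4]
```

In both cases the linearized gain $\nabla_x\mathcal J^Tr$ attains the Holder bound: $\eta|\nabla_x\mathcal J|_1=3.5$ for the $\ell_\infty$ attack and $\eta|\nabla_x\mathcal J|_2=2.5$ for the $\ell_2$ attack.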

It is well-known that factorial is defined by the following recursive relation, $$ n!=n(n-1)! $$

with $0!=1$. However, it is possible to generalize this operation to complex numbers. Let's begin our journey of generalizations!

It can be easily shown that the following integral satisfies

$$ \int_0^\infty e^{-\lambda t}\mathrm dt=\frac1\lambda $$

Differentiating both sides with respect to $\lambda$ for $n-1$ times gives

$$ \int_0^\infty t^{n-1}e^{-\lambda t}\mathrm dt={(n-1)!\over\lambda^n} $$

Setting $\lambda=1$ gives us the first generalization of factorial: the Gamma function.

$$ (s-1)!=\Gamma(s)=\int_0^\infty t^{s-1}e^{-t}\mathrm dt $$

In fact, by integration by parts we can show that the Gamma function satisfies the recursive relationship:

$$ \Gamma(s+1)=s\Gamma(s)\tag1 $$
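The integral definition and the recursion (1) can be checked numerically for real $s\ge1$. The sketch below uses a plain midpoint rule on a truncated interval; the function name, the cutoff at $t=60$, and the sample count are assumptions made for illustration.

```python
import math

def gamma_integral(s, upper=60.0, samples=200000):
    """Midpoint-rule approximation of int_0^upper t^(s-1) exp(-t) dt for real s >= 1."""
    h = upper / samples
    total = 0.0
    for i in range(samples):
        t = (i + 0.5) * h
        total += t ** (s - 1.0) * math.exp(-t)
    return total * h
```

With these settings, `gamma_integral(5.0)` should be close to $4!=24$, and `gamma_integral(3.5)` should be close to `2.5 * gamma_integral(2.5)`, in line with (1).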

Furthermore, we have the relation

$$ \int_0^\infty t^{s-1}e^{-\lambda t}\mathrm dt={\Gamma(s)\over\lambda^s} $$

For convenience in our derivation, let's define the Euler integral of the first kind, the Beta function $B(x,y)$:

$$ B(x,y)=\int_0^1\tau^{x-1}(1-\tau)^{y-1}\mathrm d\tau\tag2 $$

which can be seen as a convolution between two power functions:

$$ B(x,y;t)=\int_0^t\tau^{x-1}(t-\tau)^{y-1}\mathrm d\tau $$

If we perform a Laplace transform with respect to $t$ on both sides, the convolution theorem gives

$$ \mathscr L\{B(x,y;t)\}(\lambda)={\Gamma(x)\Gamma(y)\over\lambda^{x+y}} $$

If we juxtapose this fact with the relation

$$ \mathscr L\{t^{x+y-1}\}(\lambda)={\Gamma(x+y)\over\lambda^{x+y}} $$

then we see that

$$ B(x,y)=B(x,y;1)={\Gamma(x)\Gamma(y)\over\Gamma(x+y)} $$

which will be extremely useful for us to generalize factorial even further.
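The relation between the Beta and Gamma functions can be verified numerically for real arguments. The sketch below compares a midpoint-rule approximation of (2) with the Gamma quotient computed by `math.gamma`; the function names and sample count are illustrative assumptions, and the quadrature is only intended for $x,y\ge1$.

```python
import math

def beta_integral(x, y, samples=100000):
    """Midpoint-rule approximation of int_0^1 tau^(x-1) (1-tau)^(y-1) dtau for x, y >= 1."""
    h = 1.0 / samples
    total = 0.0
    for i in range(samples):
        tau = (i + 0.5) * h
        total += tau ** (x - 1.0) * (1.0 - tau) ** (y - 1.0)
    return total * h

def beta_from_gamma(x, y):
    """Gamma(x) Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)
```

For instance, $B(2,3)=\frac{1!\,2!}{4!}=\frac1{12}$, which both functions should reproduce.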

Taking absolute values, we see that the Gamma integral converges whenever $\Re(s)>0$: $$ |\Gamma(s)|\le\int_0^\infty t^{\Re(s)-1}e^{-t}\mathrm dt $$

which means that this improper integral converges absolutely on the right half plane. Hence, a different representation is needed for us to extend it to the entire complex plane.

Let's consider a sequence of functions:

$$ f_n(t,s)= \begin{cases} t^{s-1}\left(1-\frac tn\right)^n & 0\le t\le n \\ 0 & t>n \end{cases} $$

Then by the exponential limit we see that

$$ f_n(t,s)\to t^{s-1}e^{-t} $$

in a **pointwise** sense. However, it is possible to show that this sequence converges **uniformly** for $t\in[0,+\infty)$. Let's set $T>0$ such that for all $t>T$ we have

$$ |t^{s-1}e^{-t}|<\varepsilon $$

Then, let's consider the interval $[0,T]$:

$$ \begin{aligned} |t^{s-1}e^{-t}-f_n(t,s)| &=t^{\Re(s)-1}\left|e^{-t}-\left(1-\frac tn\right)^n\right| \\ &=t^{\Re(s)-1}\left|e^{-t}-e^{-b_n(t)}\right| \end{aligned} $$

where we define $b_n(t)$ as

$$ b_n(t)=-n\log\left(1-\frac tn\right) $$

By the **uniform continuity** of $e^{-t}$ on $[0,T]$, all we need is to prove that $b_n(t)$ converges uniformly to $t$. In fact, for all $n>t$ we can use the Taylor expansion of logarithm to obtain

$$ \begin{aligned} |b_n(t)-t| &=\left|-n\log\left(1-\frac tn\right)-t\right| \\ &=\left|-n\left[-\frac tn+\mathcal O\left(1\over n^2\right)\right]-t\right| \\ &=\mathcal O\left(\frac1n\right) \end{aligned} $$

As a result, we conclude that $f_n(t,s)$ converges uniformly to $t^{s-1}e^{-t}$, which allows us to interchange the limit operation and integral to obtain

$$ \lim_{n\to\infty}\int_0^n t^{s-1}\left(1-\frac tn\right)^n\mathrm dt=\Gamma(s)\tag3 $$

In the following procedure, we are going to expand the left hand side limit in a subtle sense so that the right hand side can be analytically continued to the left half plane.

Performing a change of variables on (3) gives

$$ \Gamma_n(s)\triangleq\underbrace{\int_0^nt^{s-1}\left(1-\frac tn\right)^n\mathrm dt}_{t=nr}=n^s\int_0^1r^{s-1}(1-r)^n\mathrm dr $$

In fact, the right hand side can be expressed by the Beta function through the Beta-Gamma identity above, which yields

$$ \Gamma_n(s)=n^sB(s,n+1)=n^s{\Gamma(s)\Gamma(n+1)\over\Gamma(s+n+1)} $$

Repeated application of (1) gives us

$$ \begin{aligned} \Gamma_n(s) &={\Gamma(s)n^sn!\over\Gamma(s+n+1)} \\ &={n^sn!\over s(s+1)(s+2)\cdots(s+n)} \\ &={n^s\over s}\prod_{k=1}^n{k\over s+k} \end{aligned} $$

which transforms the Gamma function into a product representation. However, this expression still looks ugly, so why not go deeper?

First, let's turn this equation upside down to obtain

$$ \begin{aligned} {1\over\Gamma_n(s)} &=sn^{-s}\prod_{k=1}^n\left(1+\frac sk\right) \\ &=se^{-s\log n}\prod_{k=1}^n\left(1+\frac sk\right) \end{aligned} $$

Then, employing the fact that $H_n\triangleq\sum_{k=1}^n\frac1k=\log n+\gamma+\mathcal O(1/n)$ we can replace the logarithm with harmonic numbers:

$$ \begin{aligned} {1\over\Gamma_n(s)} &=se^{-s[H_n-\gamma+\mathcal O(1/n)]}\prod_{k=1}^n\left(1+\frac sk\right) \\ &=se^{s\gamma+\mathcal O(1/n)}\prod_{k=1}^n\left(1+\frac sk\right)e^{-s/k} \end{aligned} $$

Now, if we take logarithm on the product, we have

$$ \begin{aligned} \log\left|\prod_{k=1}^n\left(1+\frac sk\right)e^{-s/k}\right| &\le|s^2|\sum_{k=1}^n{1\over k^2}<{|s|^2\pi^2\over6} \end{aligned} $$

As a result, the product converges absolutely for all $s\in\mathbb C$, giving us the Weierstrass product representation of Gamma function:

$$ {1\over\Gamma(s)}=se^{\gamma s}\prod_{k=1}^\infty\left(1+\frac sk\right)e^{-s/k} $$

which allows us to analytically continue $\Gamma(s)$ to the entire complex plane except for the nonpositive integers:

$$ \Gamma(s)={e^{-\gamma s}\over s}\prod_{k=1}^\infty\left(1+\frac sk\right)^{-1}e^{s/k}\tag4 $$

Remark: (4) also reveals $\Gamma(s)$ is non-zero for all $s\in\mathbb C$
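As a quick numerical sanity check of the product representation, the sketch below evaluates the finite product $\Gamma_n(s)={n^s\over s}\prod_{k=1}^n{k\over s+k}$ for real $s$ and compares it with Python's built-in `math.gamma`. The function name `gamma_n` and the sample points are illustrative assumptions, not part of the derivation.

```python
import math

def gamma_n(s, n):
    """Finite product Gamma_n(s) = (n^s / s) * prod_{k=1}^n k / (s + k), real s."""
    value = n ** s / s
    for k in range(1, n + 1):
        value *= k / (s + k)
    return value

# The product also makes sense left of the imaginary axis, away from the poles.
for s in (0.5, 3.0, -0.5):
    print(s, gamma_n(s, 100_000), math.gamma(s))
```

The convergence is slow (the relative error shrinks roughly like $1/n$), so this is a check of the formula rather than a practical way to compute $\Gamma(s)$.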

Now, we successfully expanded factorial to the entire complex plane except at negative integers, but Gamma function has some other brilliant properties. Let's have a look at some of them:

From the last article, we know that Euler-Mascheroni constant is defined by

$$ \gamma=\lim_{n\to\infty}(H_n-\log n)=1-\int_1^\infty{\{x\}\over x^2}\mathrm dx $$

However, it is possible for us to create a new definition for $\gamma$ by using $\Gamma(s)$.

Remark: I hypothesize this explains why they name the function $\Gamma$ and the constant $\gamma$ since they are highly correlated.

To begin with, we take logarithm on (4) to get

$$ \log\Gamma(s)=-\log s-\gamma s+\sum_{k=1}^\infty\left\{\frac sk-\log\left(1+\frac sk\right)\right\} $$

If we were to define **Digamma function** $\psi(s)$ as the logarithmic derivative of $\Gamma(s)$, then

$$ \psi(s)=-\gamma-\frac1s+\sum_{k=1}^\infty\left\{\frac1k-{1\over s+k}\right\} $$

If we were to move the $\frac1s$ term into the summation, we deduce the standard definition for Digamma function.

$$ \psi(s)=-\gamma+\sum_{m=0}^\infty\left\{{1\over m+1}-{1\over m+s}\right\}\tag5 $$

Plugging $s=1$ gives $\psi(1)=\Gamma'(1)/\Gamma(1)=-\gamma$. Because $\Gamma(1)=1$, we also know that $\Gamma'(1)=-\gamma$. Combining this with the original integral definition for $\Gamma(s)$ gives this elegant integral identity to represent Euler-Mascheroni constant:

$$ \int_0^\infty e^{-t}\log(t)\mathrm dt=-\gamma\tag6 $$
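The series (5) is easy to test numerically. The sketch below truncates it after a fixed number of terms and compares the result with two values: $\psi(2)=1-\gamma$, which follows from (5) because the sum telescopes, and the known special value $\psi(1/2)=-\gamma-2\log2$. The helper name `digamma_series`, the truncation length and the hard-coded value of $\gamma$ are assumptions made for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, hard-coded

def digamma_series(s, terms=200_000):
    """Partial sum of (5): -gamma + sum_{m=0}^{terms-1} (1/(m+1) - 1/(m+s))."""
    total = sum(1.0 / (m + 1) - 1.0 / (m + s) for m in range(terms))
    return -EULER_GAMMA + total

print(digamma_series(2.0), 1 - EULER_GAMMA)
print(digamma_series(0.5), -EULER_GAMMA - 2 * math.log(2))
```

The truncation error behaves like $(s-1)/M$ for $M$ terms, so only a few digits are recovered.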

While deriving (5) in the previous section, we introduced the Digamma function, so why don't we do some calculus on it:

$$ \begin{aligned} \psi(s) &=-\gamma+\sum_{m=0}^\infty\int_0^1(x^m-x^{m+s-1})\mathrm dx \\ &=-\gamma+\int_0^1(1-x^{s-1})\sum_{m=0}^\infty x^m\mathrm dx \\ &=-\gamma+\int_0^1{1-x^{s-1}\over1-x}\mathrm dx \end{aligned} $$

Well, the constant lying outside is not *beautiful*, so why not use (6):

$$ \begin{aligned} -\gamma &=\underbrace{\int_0^\infty e^{-t}\log(t)\mathrm dt}_{x=e^{-t}} \\ &=-\int_1^0\log(-\log x)\mathrm dx \\ &=\int_0^1\log\log\left(\frac1x\right)\mathrm dx \end{aligned} $$

Combining all these gives us the ultimate integral definition for $\psi(s)$:

$$ \psi(s)=\int_0^1\left\{\log\log\left(\frac1x\right)+{x^{s-1}-1\over x-1}\right\}\mathrm dx\tag7 $$

In this blog, we first use the technique of differentiation under the integral sign to deduce an integral representation for the factorial, which introduces the Gamma function $\Gamma(s)$. Then, by using an identity connecting $B(x,y)$ and $\Gamma(s)$, we obtain a product formula that turns $\Gamma(s)$ into a **meromorphic** function on $\mathbb C$. Using this newly obtained product formula, we are also able to discover some new identities. In fact, the Gamma function often appears in analytic number theory, and we will begin future investigation based on the following identity (which you could try to prove yourself):

$$ \sum_{n=1}^\infty{1\over n^s}={1\over\Gamma(s)}\int_0^\infty{x^{s-1}\over e^x-1}\mathrm dx $$

:::important 🌏 Source 🔬 Downloadable at: https://arxiv.org/abs/2003.08757. CVPR 2020. :::

Adversarial Camouflage (AdvCam) transfers large adversarial perturbations into customized styles, which are then “hidden” on the target object or in the off-target background. It focuses on physical-world scenarios, and the resulting adversarial examples are well camouflaged and highly stealthy while remaining effective in fooling state-of-the-art DNN image classifiers.

- Propose a flexible adversarial camouflage approach, **AdvCam**, to craft and camouflage adversarial examples.
- **AdvCam** allows the generation of **large perturbations, customizable attack regions and camouflage styles.**
- It is very flexible and useful for vulnerability evaluation of DNNs against large perturbations for physical-world attacks.

- Digital settings.
- Physical-world settings.

- Adversarial strength: represents the ability to fool DNNs.
- Adversarial stealthiness: describes whether the adversarial perturbations can be detected by human observers.
- Camouflage flexibility: the degree to which the attacker can control the appearance of adversarial image.

Normal attacks try to find an adversarial example for a given image by solving the following optimization problem:

$$\textrm{minimize}\ \mathcal{D}(x,x')+\lambda\cdot\mathcal{L}_{adv}(x')\ \textrm{such that}\ x'\in[0,255]^m$$

Here $\mathcal{D}(x,x')$ measures the "stealthiness" of the adversarial example and $\mathcal{L}_{adv}$ is the adversarial loss. This means that **there is a trade-off between stealthiness and adversarial strength.**

We use style transfer techniques to achieve the goal of camouflage and adversarial attack techniques to achieve adversarial strength. In order to do so, the loss function, which we call the **adversarial camouflage loss**, becomes:

$$\mathcal{L}=(\mathcal{L}_{s}+\mathcal{L}_{c}+\mathcal{L}_{m})+\lambda\cdot\mathcal{L}_{adv}$$

The final overview of the AdvCam approach:

Camouflaged adversarial images crafted by AdvCam attack and their original versions.

As the title suggests, today we are going to do some integral calculus. First, let's recall the definition of the Riemann integral:

Traditionally, the integral of some function $f(x)$ over some interval $[a,b]$ is defined as the signed area under the curve of $f(x)$:

$$ \int_a^bf(x)\mathrm dx\triangleq\text{Signed area under $f(x)$ where $x\in[a,b]$} $$

and Riemann integral is one way to define it rigorously. Particularly, we define it as the sum of the areas of tiny rectangles:

$$ \int_a^bf(x)\mathrm dx\approx\sum_{k=1}^Nf(\xi_k)(x_k-x_{k-1}) $$

where $\xi_k$ are sampled in $[x_{k-1},x_k]$ and the increasing sequence $\{x_k\}$ is called a partition of the interval $[a,b]$:

$$ a=x_0\le x_1\le x_2\le\cdots\le x_N=b $$

To formalize the Riemann integral, let's define $\operatorname{mesh}\{x_n\}$ as the length of the longest subinterval in a partition $\{x_n\}$:

$$ \operatorname{mesh}\{x_n\}\triangleq\max_{1\le n\le N}(x_n-x_{n-1}) $$

Then we say that some function $f(x)$ is **Riemann integrable** if for all $\varepsilon>0$, there exists $\delta>0$ such that when $\operatorname{mesh}\{x_n\}\le\delta$ we always have

$$ \left|\sum_{k=1}^Nf(\xi_k)(x_k-x_{k-1})-\int_a^bf(x)\mathrm dx\right|<\varepsilon $$

Alternatively, if some function $f(x)$ is Riemann-integrable on $[a,b]$, then the limit

$$ \lim_{\operatorname{mesh}\{x_n\}\to0}\sum_{k=1}^Nf(\xi_k)(x_k-x_{k-1}) $$

exists and converges to $\int_a^bf(x)\mathrm dx$.

Although the Riemann integral appears to be sufficient to integrate functions, it is not convenient for handling sums and step functions. This is where the Riemann-Stieltjes integral comes in. Let's first look at its definition:

$$ \int_a^bf(x)\mathrm dg(x) =\lim_{\operatorname{mesh}\{x_n\}\to0}I(x_n,\xi_n)\triangleq\lim_{\operatorname{mesh}\{x_n\}\to0}\sum_{k=1}^Nf(\xi_k)[g(x_k)-g(x_{k-1})] $$

In order for this limit to exist, we need to set up constraints on $f(x)$ and $g(x)$:

**Theorem 1**: *The Riemann-Stieltjes integral $\int_a^bf\mathrm dg$ exists if $f$ is continuous on $[a,b]$ and $g$ is of bounded variation on $[a,b]$*

When we say $g$ is of bounded variation, we mean that the following quantity exists:

$$ \operatorname{Var}_{[a,b]}(g)\triangleq\sup\sum_k|g(x_k)-g(x_{k-1})| $$

*Proof.* Let's define $\{y_n\}=\{y_0,y_1,\dots,y_M\}$ as another partition of $[a,b]$ such that $\{x_n\}$ is its subsequence, and let $\eta_m$ be the sampled abscissa in $[y_{m-1},y_m]$. If we designate $P_k$ to be the set of $m$ such that $y_m$ is contained in the interval $(x_{k-1},x_k]$:

$$ P_k=\{m\mid x_{k-1}<y_m\le x_k\} $$

then we have

$$ \sum_{m\in P_k}[g(y_m)-g(y_{m-1})]=g(x_k)-g(x_{k-1}) $$

which implies

$$ I(x_n,\xi_n)=\sum_{k=1}^N\sum_{m\in P_k}f(\xi_k)[g(y_m)-g(y_{m-1})]\tag1 $$ $$ I(y_n,\eta_n)=\sum_{k=1}^N\sum_{m\in P_k}f(\eta_m)[g(y_m)-g(y_{m-1})]\tag2 $$

Because $f(x)$ is uniformly continuous on $[a,b]$, we know that for every $\varepsilon>0$ there exists $\delta>0$ such that $|f(s)-f(t)|<\varepsilon$ whenever $s,t\in[a,b]$ satisfy $|s-t|\le\delta$. Accordingly, if $\operatorname{mesh}\{x_n\}\le\delta$ and we take the absolute value of the difference of (1) and (2), then

$$ \begin{aligned} |I(x_n,\xi_n)-I(y_n,\eta_n)| &\le\sum_{k=1}^N\sum_{m\in P_k}|f(\eta_m)-f(\xi_k)||g(y_m)-g(y_{m-1})| \\ &\le\varepsilon\sum_{m=1}^M|g(y_m)-g(y_{m-1})| \le\varepsilon\operatorname{Var}_{[a,b]}(g) \end{aligned} $$

Now, let $\{z_n\}$ be another partition of $[a,b]$ whose mesh is also at most $\delta$, $\zeta_n$ be its corresponding sampled abscissae and $\{y_n\}$ be the union of both partitions, then

$$ \begin{aligned} |I(x_n,\xi_n)-I(z_n,\zeta_n)| &=|I(x_n,\xi_n)-I(y_n,\eta_n)-[I(z_n,\zeta_n)-I(y_n,\eta_n)]| \\ &\le|I(x_n,\xi_n)-I(y_n,\eta_n)|+|I(z_n,\zeta_n)-I(y_n,\eta_n)| \\ &\le2\varepsilon\operatorname{Var}_{[a,b]}(g) \end{aligned} $$

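Hence any two sufficiently fine Riemann-Stieltjes sums differ by an arbitrarily small amount, so the limit exists by the Cauchy criterion, which proves the theorem. $\square$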

In addition to the constraint for the Riemann-Stieltjes integral to exist, we can also transform it into a Riemann integral at specific occasions:

**Theorem 2**: *If $g'(x)$ exists and is continuous on $[a,b]$ then*

$$ \int_a^bf(x)\mathrm dg(x)=\int_a^bf(x)g'(x)\mathrm dx\tag3 $$

*Proof.* Since $g(x)$ is differentiable, we can use the mean value theorem to guarantee the existence of $\eta_k\in[x_{k-1},x_k]$ such that $g(x_k)-g(x_{k-1})=g'(\eta_k)(x_k-x_{k-1})$, so

$$ \begin{aligned} \sum_{k=1}^N|g(x_k)-g(x_{k-1})| &=\sum_{k=1}^N|g'(\eta_k)|(x_k-x_{k-1}) \\ &\to\int_a^b|g'(x)|\mathrm dx\quad(\operatorname{mesh}\{x_n\}\to0) \end{aligned} $$

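As a result, $g(x)$ is of bounded variation, implying the existence of the left hand side of (3). Next, we apply the same mean value identity to the Riemann-Stieltjes sum $S(x_n,\xi_n)=\sum_{k=1}^Nf(\xi_k)[g(x_k)-g(x_{k-1})]$ and split it into two parts:

$$
\begin{aligned}
S(x_n,\xi_n)
&=\sum_{k=1}^Nf(\xi_k)g'(\eta_k)(x_k-x_{k-1}) \\
&=\underbrace{\sum_{k=1}^Nf(\xi_k)g'(\xi_k)(x_k-x_{k-1})}_{T_1}
+\underbrace{\sum_{k=1}^Nf(\xi_k)[g'(\eta_k)-g'(\xi_k)](x_k-x_{k-1})}_{T_2}
\end{aligned}
$$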

By the uniform continuity of $g'(x)$, we know that for all $\varepsilon>0$ there exists $\delta>0$ such that $|g'(s)-g'(t)|<\varepsilon$ whenever $|s-t|<\delta$, indicating $T_1\to\int_a^bfg'$ and $T_2\to0$ as $\operatorname{mesh}\{x_n\}\to0$. Accordingly, we arrive at the conclusion that

$$ \int_a^bf(x)\mathrm dg(x)=\int_a^bf(x)g'(x)\mathrm dx $$ thus completing the proof. $\square$

In addition, we can also apply integration by parts on Riemann-Stieltjes integrals. Particularly, if we assume $f$ has a continuous derivative and $g$ is of bounded variation on $[a,b]$, then

$$ \int_a^bf(x)\mathrm dg(x)=f(x)g(x)|_a^b-\int_a^bg(x)f'(x)\mathrm dx $$

Let $r(n)$ be some arithmetic function and $R(x)$ be its summatory function

$$ R(x)=\sum_{n\le x}r(n)\tag4 $$

Let $f(t)$ have continuous derivative on $[0,\infty)$ and $b>a>0$ then we have

$$ \int_a^bf(x)\mathrm dR(x)=\sum_{k=1}^Nf(\xi_k)[R(x_k)-R(x_{k-1})] $$

where we require that $\operatorname{mesh}\{x_n\}<1$, so that each subinterval contains at most one integer. Recalling (4), we observe that $R(x)$ is a step function that only jumps at integer values, so we only need to sum over the indices $k_n$ such that $n\in(x_{k_n-1},x_{k_n}]$. Hence, this integral becomes a summation over the integers in $(a,b]$:

$$ \begin{aligned} \int_a^bf(x)\mathrm dR(x) &=\sum_{a<n\le b}f(\xi_{k_n})r(n) \\ &=\sum_{a<n\le b}f(n)r(n)+\sum_{a<n\le b}[f(\xi_{k_n})-f(n)]r(n) \end{aligned} $$

Since $f$ is uniformly continuous on $[a,b]$, we have $|f(x)-f(y)|<\delta$ whenever $|x-y|<\varepsilon$, thus the second summation is $\mathcal O(\delta)$ and vanishes as the mesh tends to zero, leaving us

$$ \int_a^bf(x)\mathrm dR(x)=\sum_{a<n\le b}f(n)r(n)\tag5 $$
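To see (5) in action, the sketch below evaluates a Riemann-Stieltjes sum on a fine uniform partition and compares it with the corresponding finite sum. The choices here are assumptions made for illustration: $r(n)=1$ (so that $R(x)=\lfloor x\rfloor$), $f(x)=1/x$, the interval $(0.5,10]$, and sample points taken at the right endpoints.

```python
import math

def riemann_stieltjes_sum(f, g, a, b, steps):
    """Sum of f(xi_k) * [g(x_k) - g(x_{k-1})] on a uniform partition, xi_k = x_k."""
    # The last node is set to b exactly so that a jump at b is not lost to rounding.
    xs = [a + (b - a) * k / steps for k in range(steps)] + [b]
    return sum(f(x1) * (g(x1) - g(x0)) for x0, x1 in zip(xs, xs[1:]))

# Hypothetical example: r(n) = 1, hence R(x) = floor(x), and f(x) = 1/x.
approx = riemann_stieltjes_sum(lambda x: 1 / x, math.floor, 0.5, 10, 100_000)
exact = sum(1 / n for n in range(1, 11))
print(approx, exact)
```

Only the subintervals containing an integer contribute, which is exactly the observation used in the derivation above.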

Employing (5) in different situations can give us plentiful brilliant results. Let's have a look at some of them:

It is well known that the harmonic series $1+\frac12+\frac13+\cdots$ diverges, and we can provide a formal proof using the Riemann-Stieltjes integral:

$$
\begin{aligned}
\sum_{n\le N}\frac1n
&=\int_{1-\varepsilon}^N{\mathrm d\lfloor x\rfloor\over x} \\
&=\left.{\lfloor x\rfloor\over x}\right|_{1-\varepsilon}^N+\int_{1-\varepsilon}^N{\lfloor x\rfloor\over x^2}\mathrm dx \\
&=\int_{1-\varepsilon}^N{\mathrm dx\over x}+1-\int_{1-\varepsilon}^N{\{x\}\over x^2}\mathrm dx \\
&=\int_1^N{\mathrm dx\over x}+1-\int_1^N{\{x\}\over x^2}\mathrm dx \\
&=\log N+1-\int_1^\infty{\{x\}\over x^2}\mathrm dx+\int_N^\infty{\{x\}\over x^2}\mathrm dx \\
&=\log N+\gamma+\mathcal O\left(\frac1N\right)
\end{aligned}
$$

Since $\log N\to\infty$ as $N\to\infty$, we conclude that the harmonic series diverges.
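The asymptotic formula $H_N=\log N+\gamma+\mathcal O(1/N)$ can be checked numerically. The sketch below prints the remainder $H_N-\log N-\gamma$ and the remainder multiplied by $N$, which should stay bounded. The helper name `harmonic` and the hard-coded value of $\gamma$ are assumptions for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, hard-coded

def harmonic(N):
    """N-th harmonic number H_N, summed with math.fsum to limit rounding error."""
    return math.fsum(1 / n for n in range(1, N + 1))

for N in (10, 1_000, 100_000):
    remainder = harmonic(N) - math.log(N) - EULER_GAMMA
    print(N, remainder, remainder * N)
```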

$$ \sum_{n=1}^\infty{(-1)^n\log n\over n} $$

Now, let's first consider the finite case:

$$ \begin{aligned} \sum_{n=1}^N{(-1)^n\log n\over n} &=2\sum_{n\le N/2}{\log(2n)\over2n}-\sum_{n\le N}{\log n\over n}+\mathcal O\left(\log N\over N\right) \\ &=\sum_{n\le N/2}{\log2+\log n\over n}-\sum_{n\le N}{\log n\over n}+\mathcal O\left(\log N\over N\right) \\ &=\log2\sum_{n\le N/2}\frac1n-\sum_{N/2<n\le N}{\log n\over n}+\mathcal O\left(\log N\over N\right) \end{aligned} $$

In fact, using Riemann-Stieltjes integral, we can show

$$ \begin{aligned} \sum_{N/2<n\le N}{\log n\over n} &=\int_{N/2}^N{\log x\over x}\mathrm d\lfloor x\rfloor \\ &={N\log(N)-N\log(N/2)\over N}-\int_{N/2}^N[x-\{x\}]\mathrm d\left(\log x\over x\right)+\mathcal O\left(\log N\over N\right) \\ &=\log2-\int_{N/2}^N\left({1-\log x\over x}\right)\mathrm dx+\mathcal O\left(\log N\over N\right) \\ &=\int_{N/2}^N{\log x\over x}\mathrm dx+\mathcal O\left(\log N\over N\right) \\ &=\frac12[\log^2N-\log^2(N/2)]+\mathcal O\left(\log N\over N\right) \\ &=\frac12[\log N+\log(N/2)][\log N-\log(N/2)]+\mathcal O\left(\log N\over N\right) \\ &=\frac12\log2[2\log N-\log2]+\mathcal O\left(\log N\over N\right) \\ &=\log2\log N-\frac12\log^22+\mathcal O\left(\log N\over N\right) \end{aligned} $$

Now, employing this obtained identity and the asymptotic formula for harmonic series yields:

$$ \begin{aligned} \sum_{n\le N}{(-1)^n\log n\over n} &=\log2\left(\log\frac N2+\gamma\right)-\log2\log N+\frac12\log^22+\mathcal O\left(\log N\over N\right) \\ &=\gamma\log2-\frac12\log^22+\mathcal O\left(\log N\over N\right) \end{aligned} $$

Now, taking the limit $N\to\infty$ on both sides gives

$$ \sum_{n\ge1}{(-1)^n\log n\over n}=\gamma\log2-\frac12\log^22 $$
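Since the sign of the $\log^22$ term is easy to get wrong, a numerical check is reassuring. The sketch below compares partial sums of the alternating series with the closed form $\gamma\log2-\frac12\log^22$. Averaging two consecutive partial sums is a standard trick to damp the oscillation of an alternating series; the helper names and the hard-coded value of $\gamma$ are assumptions for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, hard-coded

def partial_sum(N):
    """Sum of (-1)^n log(n) / n for 1 <= n <= N."""
    return math.fsum((-1) ** n * math.log(n) / n for n in range(1, N + 1))

def averaged_sum(N):
    """Mean of two consecutive partial sums, which converges much faster."""
    return 0.5 * (partial_sum(N) + partial_sum(N + 1))

closed_form = EULER_GAMMA * math.log(2) - 0.5 * math.log(2) ** 2
print(averaged_sum(20_000), closed_form)
```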

If we were to define the prime indicator function

$$ \mathbf1_p(n)= \begin{cases} 1 & \text{$n$ is a prime} \\ 0 & \text{otherwise} \end{cases} $$

Then the prime counting function can be defined as

$$ \pi(x)=\sum_{p\le x}1=\sum_{n\le x}\mathbf1_p(n) $$

Now, let's also define Chebyshev's function $\vartheta(x)$:

$$ \vartheta(x)=\sum_{p\le x}\log p=\sum_{n\le x}\mathbf1_p(n)\log n $$

Hence, we have

$$ \begin{aligned} \pi(x) &=\sum_{n\le x}{1\over\log n}\cdot\mathbf1_p(n)\log n \\ &=\int_{2-\varepsilon}^x{\mathrm d\vartheta(t)\over\log t} \\ &={\vartheta(x)\over\log x}+\int_2^x{\vartheta(t)\over t\log^2t}\mathrm dt \end{aligned} $$

It is known that $\vartheta(x)\sim x$, so plugging it into the above equation gives

$$ \pi(x)\sim{x\over\log x} $$

which is the prime number theorem.
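The identity $\pi(x)={\vartheta(x)\over\log x}+\int_2^x{\vartheta(t)\over t\log^2t}\mathrm dt$ is exact, not just asymptotic, and this can be verified numerically. Because $\vartheta$ is constant between consecutive primes and $\int{\mathrm dt\over t\log^2t}=-{1\over\log t}$, the integral can be evaluated piece by piece in closed form. The sketch below does so with a simple sieve; the function names are assumptions for illustration.

```python
import math

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes p <= n (for n >= 1)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(n) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(n + 1) if sieve[i]]

def pi_via_theta(x):
    """Evaluate theta(x)/log(x) + integral_2^x theta(t) / (t log^2 t) dt."""
    ps = primes_up_to(int(x))
    theta = 0.0
    integral = 0.0
    # theta is constant on [p, next), where the integrand integrates exactly.
    for p, nxt in zip(ps, ps[1:] + [x]):
        theta += math.log(p)
        integral += theta * (1 / math.log(p) - 1 / math.log(nxt))
    return theta / math.log(x) + integral

print(pi_via_theta(1000), len(primes_up_to(1000)))
```

Both printed values should agree up to floating-point rounding, since the right hand side is just Abel summation applied to the prime counting sum.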

To sum up, in this blog, we first define and explore the Riemann-Stieltjes integral, then use this integration technique to solve problems via asymptotic expansion. Lastly, we show how the prime number theorem follows from the known asymptotic $\vartheta(x)\sim x$.

:::important 🌏 Source Downloadable at: Open Access - CVPR 2018. Source code is available at: GitHub - richzhang/PerceptualSimilarity. :::

The paper argues that widely used image quality metrics like SSIM and PSNR are *simple and shallow* functions that may fail to account for many nuances of human perception. The paper introduces a new dataset of human perceptual similarity judgments to systematically evaluate deep features across different architectures and tasks and compare them with classic metrics.

The findings of this paper suggest that *perceptual similarity is an emergent property shared across deep visual representations.*

In this paper, the authors hypothesize that perceptual similarity is not a special function all of its own, but rather a *consequence* of visual representations tuned to be predictive about important structure in the world.

- To test this hypothesis, the paper introduces a large scale, highly varied perceptual similarity dataset containing 484k human judgments.
- The paper shows that deep features trained on supervised, self-supervised, and unsupervised objectives alike, model low-level perceptual similarity surprisingly well, outperforming previous, widely-used metrics.
- The paper also demonstrates that network architecture alone doesn't account for the performance: untrained networks achieve much lower performance.

The paper suggests that with this data, we can improve performance by *calibrating* feature responses from a pre-trained network.

This content is less related to my interests. I'll cover them briefly.

- Traditional distortions: photometric distortions, random noise, blurring, spatial shifts, corruptions.
- CNN-based distortions: input corruptions (white noise, color removal, downsampling), generator networks, discriminators, loss/learning.
- Distorted image patches.
- Superresolution.
- Frame interpolation.
- Video deblurring.
- Colorization.

**2AFC similarity judgments.** Two-alternative forced choice (2AFC) is a method for measuring the subjective experience of a person or animal through their pattern of choices and response times. The subject is presented with two alternative options, only one of which contains the target stimulus, and is forced to choose which one was the correct option. Both options can be presented concurrently or sequentially in two intervals (also known as two-interval forced choice, 2IFC).

**Just noticeable differences.** Just-noticeable difference or JND is the amount something must be changed in order for a difference to be noticeable, detectable at least half the time (absolute threshold). This limen is also known as the difference limen, difference threshold, or least perceptible difference.

The distance between reference and distorted patches $x$ and $x_0$ is calculated with a network $\mathcal{F}$ using the equation below. The paper extracts a feature stack from $L$ layers and unit-normalizes it in the channel dimension. It then scales the activations channel-wise by a vector $w_l$ and computes the squared $\ell_2$ distance, averaged spatially and summed over layers.

$$d(x,x_0)=\sum_l\frac{1}{H_lW_l}\sum_{h,w}\|w_l\odot(\hat{y}^l_{hw}-\hat{y}^l_{0hw})\|_2^2$$
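The distance formula can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the official implementation: feature maps are assumed to be given as nested lists indexed `[h][w][channel]`, one per layer, and the names `unit_normalize` and `lpips_distance`, as well as the small `eps` used to avoid division by zero, are assumptions.

```python
import math

def unit_normalize(v, eps=1e-10):
    """Scale a channel vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]

def lpips_distance(feats_x, feats_x0, weights):
    """Sum over layers of the spatial mean of ||w_l * (y_hw - y0_hw)||_2^2."""
    total = 0.0
    for fx, f0, w in zip(feats_x, feats_x0, weights):
        height, width = len(fx), len(fx[0])
        layer_sum = 0.0
        for h in range(height):
            for col in range(width):
                a = unit_normalize(fx[h][col])
                b = unit_normalize(f0[h][col])
                layer_sum += sum((wc * (ac - bc)) ** 2 for wc, ac, bc in zip(w, a, b))
        total += layer_sum / (height * width)
    return total

# Toy input: one layer, a 1x1 spatial grid, two channels.
print(lpips_distance([[[[1.0, 0.0]]]], [[[[0.0, 1.0]]]], [[1.0, 1.0]]))
```

In practice the feature stacks would come from a pre-trained network and the per-channel weights `weights[l]` would be learned, as described in the variants below.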

The paper considers the following variants:

- *lin*: the paper keeps the pre-trained network weights fixed and learns linear weights $w$ on top.
- *tune*: the paper initializes from a pre-trained classification model and allows all the weights of network $\mathcal{F}$ to be fine-tuned.
- *scratch*: the paper initializes the network from random Gaussian weights and trains it entirely on the authors' judgments.

Finally, the paper refers to these as variants of the proposed **Learned Perceptual Image Patch Similarity (LPIPS)**.

Figure 4 shows the performance of various low-level metrics (in red), deep networks, and human ceiling (in black).

The 2AFC distortion preference test has high correlation to JND: $\rho = .928$ when averaging the results across distortion types. This indicates that 2AFC generalizes to another perceptual test and is giving us signal regarding human judgments.

Pairs which BiGAN perceives to be far but SSIM to be close generally contain some blur. BiGAN tends to perceive correlated noise patterns to be a smaller distortion than SSIM.

The stronger a feature set is at classification and detection, the stronger it is as a model of perceptual similarity judgments.

Features that are good at **semantic tasks**, are also good at **self-supervised and unsupervised tasks**, and also provide **good models of both human perceptual behavior and macaque neural activity.**

The CW attack algorithm is a very typical adversarial attack, which utilizes two separate losses:

- An adversarial loss to make the generated image actually adversarial, i.e., capable of fooling image classifiers.
- An image distance loss to constrain the quality of the adversarial examples so as not to make the perturbation too obvious to the naked eye.

This paradigm makes CW attack and its variants capable of being integrated with many other image quality metrics like the PSNR or the SSIM.

When adversarial examples were first discovered in 2013, the optimization problem to craft adversarial examples was formulated as:

$$\begin{aligned}\text{minimize}:&\ \mathcal{D}(x,x+\delta)\\ \text{such that}:&\ \mathcal{C}(x+\delta)=t&&\text{Constraint 1}\\&\ x+\delta\in[0,1]^n&&\text{Constraint 2}\end{aligned}$$

Where:

- $x$ is the input image, $\delta$ is the perturbation, $n$ is the dimension of the image and $t$ is the target class.
- Function $\mathcal{D}$ serves as the distance metric between the adversarial and the real image, and function $\mathcal{C}$ is the classifier function.

The traditional way to solve such an optimization problem is to define an objective function and perform gradient descent on it, which eventually guides us to an optimal point of the function. However, the formulation above is difficult to solve because the constraint $\mathcal{C}(x+\delta)=t$ is highly non-linear (the classifier is not a straightforward linear function).

In CW, we express Constraint 1 in a different form, as an objective function $f$ such that $\mathcal{C}(x+\delta)=t$ is satisfied if and only if $f(x+\delta) \leq 0$.

Conceptually, the objective function tells us how close we are getting to being classified as $t$. One simple but not a very good choice for function $f$ is:

$$f(x+\delta)=1-\mathcal{C}(x+\delta)_t$$

Here $\mathcal{C}(x+\delta)_t$ is the probability of $x+\delta$ being classified as $t$. If the probability is low, the value of $f$ is close to 1, whereas when the input is confidently classified as $t$, $f$ approaches 0. This illustrates how the objective function works, but clearly we can't use this in real-world implementations.

In the original paper, seven different objective functions are assessed, and the best among them is given by:

$$f(x')=\max(\max\{Z(x')_i:i\neq t\}-Z(x')_t, -k)$$

Where:

- $Z(x')$ is the vector of logits (the unnormalized raw predictions of the model for each class, before the softmax) when the input is an adversarial example $x'$.
- $\max\{Z(x')_i:i\neq t\}$ is the largest logit among all classes other than the target, i.e. the non-target class the model currently favors most, while $Z(x')_t$ is the logit of the target class.
- So, $\max\{Z(x')_i:i\neq t\}-Z(x')_t$ is the difference between what the model thinks the current image most probably is (apart from $t$) and what we want it to think.

The above term is the difference of two logits, so when we specify another term $-k$ and take a max, we are setting a lower limit on the value of the loss. Hence, by controlling the parameter $k$ we can specify how confidently we want our adversarial example to be classified as the target.

We then reformulate the original optimization problem by moving the difficult constraint into the minimization objective.

$$\begin{aligned}\text{minimize}:&\ \mathcal{D}(x,x+\delta)+c\cdot f(x+\delta)\\ \text{such that}:&\ x+\delta\in [0,1]^n&&\text{Constraint 2}\end{aligned}$$

Here we introduce a constant $c$ to formulate our final loss function, and by doing so we are left with only one of the prior two constraints. The constant $c$ is best found by binary search, where the most common lower bound is $1\times 10^{-4}$ and the upper bound is $+\infty$.

:::important In my personal experiments, the best constant often lies between 1 and 2. :::

After formulating our final loss function, we are presented with this final constraint:

$$x+\delta\in[0,1]^n$$

This constraint is expressed in this particular form known as the "box constraint", which means that there is an upper bound and a lower bound set to this constraint. In order to solve this, we will need to apply a method called "change of variable", in which we optimize over $w$ instead of the original variable $\delta$, where $w$ is given by:

$$\begin{aligned}\delta &= \frac{1}{2}\cdot (\tanh(w)+1)-x\\ \text{or,}\quad x+\delta&= \frac{1}{2}\cdot (\tanh(w)+1)&&\text{Constraint 2}\end{aligned}$$

Here $\tanh$ is the hyperbolic tangent function, so when $\tanh(w)$ varies from -1 to 1, $x+\delta$ varies from 0 to 1.

Therefore, our final optimization problem is:

$$\begin{aligned}\text{minimize}:&\ \mathcal{D}\left(\frac{1}{2}\cdot (\tanh(w)+1), x\right)+c\cdot f\left(\frac{1}{2}\cdot (\tanh(w)+1)\right)\\ \text{such that}:&\ \tanh(w)\in [-1,1]\\ \text{where}:&\ f(x')=\max(\max\{Z(x')_i:i\neq t\}-Z(x')_t, -k)\end{aligned}$$

The CW attack is the solution to the optimization problem (optimized over $w$) given above using Adam optimizer. To avoid gradient descent getting stuck, we use multiple starting point gradient descent in the solver.
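The pieces of this objective can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: images are flat lists of floats, the distance $\mathcal{D}$ is taken to be the squared $\ell_2$ distance, `logits_fn` is a hypothetical stand-in for the classifier's logit function $Z$, and the optimizer itself (Adam over $w$) is omitted.

```python
import math

def cw_objective(logits, target, k=0.0):
    """f(x') = max(max{Z_i : i != t} - Z_t, -k), computed from a list of logits."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -k)

def to_image(w):
    """Change of variable: maps any real w into the box [0, 1]."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]

def cw_loss(w, x, logits_fn, target, c=1.0, k=0.0):
    """D(x', x) + c * f(x'), with D assumed to be the squared L2 distance."""
    x_adv = to_image(w)
    distance = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return distance + c * cw_objective(logits_fn(x_adv), target, k)

# f is positive while another class still beats the target, and is clipped at -k.
print(cw_objective([1.0, 3.0], target=0))
print(cw_objective([2.0, 1.0, 0.5], target=0, k=0.5))
```

Because `to_image` always lands in $[0,1]$, the box constraint never has to be enforced explicitly during optimization.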

In number theory, Möbius inversion is a common technique to study the properties of arithmetic functions (i.e. those that map $\mathbb N^+$ to $\mathbb C$), and all of these brilliant things are derived from the following formula:

$$ \sum_{d|n}\mu(d)=\left\lfloor\frac1n\right\rfloor= \begin{cases} 1 & n=1 \\ 0 & \text{otherwise} \end{cases} \tag1 $$

To understand this formula, let's first understand what it says:

The first part of (1) is the summation symbol: $\sum_{d|n}$. Instead of summing over all positive integers up to $n$, it sums over the divisors of $n$. For instance, if $n=10$, then $\sum_{d|n}$ sums over $d=1,2,5,10$. Explicitly, we write

$$ \sum_{d|10}f(d)=f(1)+f(2)+f(5)+f(10) $$

To enhance your understanding of this operator, try these exercises:

- Calculate $\sum_{d|15}d^2$
- Explain the meaning of $\sum_{d|n}1$

Usually, the Möbius function is defined as

$$ \mu(n)= \begin{cases} 1 & \text{$n$ is square-free with an even number of prime factors} \\ -1 & \text{$n$ is square-free with an odd number of prime factors} \\ 0 & \text{$n$ is not square-free} \end{cases} $$

This standard definition may appear to be strange: why would people care whether some number is square-free or not? To address this, let's turn to a totally different perspective: to expand the following product that runs over all prime numbers:

$$ F(s)=\prod_{p\text{ prime}}(1-p^{-s}) $$

By some combinatoric skills, we are able to expand it like this:

$$ \begin{aligned} F(s) &=(1-2^{-s})(1-3^{-s})\cdots \\ &=1-{1\over2^s}-{1\over3^s}-{1\over5^s}+{1\over2^s\cdot3^s}-{1\over7^s} \\ &+\cdots-{1\over3^s\cdot5^s\cdot7^s}+\cdots \\ &\triangleq\sum_{n=1}^\infty{a(n)\over n^s} \end{aligned} $$

By observing the expansion, we discover that $a(n)$ must satisfy

- when $n$ is a product of an odd number of distinct primes, $a(n)=-1$
- when $n$ is a product of an even number of distinct primes, $a(n)=1$
- because each prime appears only once in the product, $a(n)=0$ for all $n$ that contain square factors

As a result, $a(n)$ is the Möbius function $\mu(n)$.

With our understanding of the symbols, we are now able to understand the implication of (1). That is, (1) declares that the summation of the Möbius function over the divisors of a number $n$ is one for $n=1$ and zero for all $n\ne1$. Obviously, when $n=1$, we have

$$ \sum_{d|1}\mu(d)=\mu(1)=1 $$

Did you understand the definition of $\mu(n)$? Try these problems!

- $\mu(20)$
- $\mu(5)\mu(2)$
- $\mu(10)$

Generally, we say an arithmetic function $f(n)$ is **multiplicative** when $f(ab)=f(a)f(b)$ for all coprime positive integers $a$ and $b$. Now, let's show the following fact about $\mu(n)$:

**Theorem**: *$\mu(n)$ is multiplicative*

*Proof.* For coprime positive integers $a$ and $b$, we may divide this proof into two situations:

- If $a$ and/or $b$ contains square factors, their product $ab$ also has square factors. As a result, $\mu(a)\mu(b)=\mu(ab)=0$
- If $a$ and $b$ are both square-free, let $r_n$ denote the number of prime factors of $n$. Since $a$ and $b$ share no prime factor, $r_{ab}=r_a+r_b$, so we have

$$ \mu(a)\mu(b)=(-1)^{r_a+r_b}=\mu(ab) $$

which completes the proof. $\square$

With these tools being prepared, we can delve into proving (1).

Let $n=ab$ where $a$ and $b$ are coprime positive integers, then

$$ \sum_{d|n}\mu(d)=\sum_{d_1|a,d_2|b}\mu(d_1d_2) =\sum_{d_1|a}\mu(d_1)\sum_{d_2|b}\mu(d_2) $$

Now, let's plug prime powers $n=p^k$ into (1), so we have

$$ \sum_{d|p^k}\mu(d)=\sum_{r=0}^k\mu(p^r)=\mu(1)+\mu(p)=1-1=0 $$

Due to the fact that $\sum_{d|n}\mu(d)$ is multiplicative, we conclude (1) is true.
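Identity (1) is also easy to confirm by brute force. The sketch below computes $\mu(n)$ by trial division and sums it over the divisors of $n$; the function names are assumptions for illustration, and the naive divisor enumeration is only meant for small $n$.

```python
def mobius(n):
    """Mobius function mu(n), computed by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:  # p^2 divides the original n, so n is not square-free
                return 0
            result = -result
        p += 1
    if n > 1:  # one remaining prime factor
        result = -result
    return result

def divisors(n):
    """All positive divisors of n (naive enumeration)."""
    return [d for d in range(1, n + 1) if n % d == 0]

def mobius_divisor_sum(n):
    """Left hand side of (1): the sum of mu(d) over d | n."""
    return sum(mobius(d) for d in divisors(n))

print([mobius_divisor_sum(n) for n in range(1, 13)])
```

The printed list should start with a single one followed by zeros, in agreement with (1).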

In fact, (1) can help us find a formula for Euler's totient function $\varphi(n)$, i.e. the number of positive integers not exceeding $n$ that are coprime to $n$.

First, we write down $\varphi(n)$ in terms of summation:

$$ \varphi(n)=\sum_{k\le n\atop\gcd(k,n)=1}1 $$

Then, using the identity given by $(1)$, we have

$$ \begin{aligned} \varphi(n) &=\sum_{k\le n}\sum_{d|\gcd(k,n)}\mu(d)=\sum_{k\le n}\sum_{d|n}[d|k]\mu(d) \\ &=\sum_{d|n}\mu(d)\sum_{k\le n,k=qd}1=\sum_{d|n}\mu(d)\sum_{q\le n/d}1 \end{aligned} $$

Eventually, we obtain the formula for Euler's totient function:

$$ \varphi(n)=\sum_{d|n}\mu(d)\frac nd\tag2 $$
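Formula (2) can be compared against the definition of $\varphi(n)$ directly. The sketch below counts the integers coprime to $n$ with `math.gcd` and evaluates the Möbius sum for the same $n$; the function names are assumptions for illustration.

```python
from math import gcd

def mobius(n):
    """Mobius function mu(n), computed by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result

def phi_by_count(n):
    """Definition: the number of 1 <= k <= n with gcd(k, n) = 1."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def phi_by_mobius(n):
    """Formula (2): the sum over d | n of mu(d) * n / d."""
    return sum(mobius(d) * (n // d) for d in range(1, n + 1) if n % d == 0)

print([(n, phi_by_count(n), phi_by_mobius(n)) for n in (1, 10, 20, 36)])
```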

As we can see, (1) actually helps us give definition to other arithmetic functions. Now, it is your job to discover the properties of this function:

Show that $\varphi(n)$ is multiplicative and, in addition, $\varphi(n)$ can be expressed by

$$\displaystyle{\varphi(n)=n\prod_{p\text{ prime}\atop p|n}\left(1-\frac1p\right)}$$

If we were to sum $\varphi(n)$ over divisors of $n$, we could magically obtain $n$:

$$ \begin{aligned} \sum_{d|n}\varphi(d) &=\sum_{d|n}\varphi\left(\frac nd\right) =\sum_{d|n}\sum_{k|n/d}k\mu\left(n\over dk\right) \\ &=\sum_{dk|n}k\mu\left(n\over dk\right) =\sum_{k|n}k\sum_{d|n/k}\mu(d) \\ &=\sum_{k|n}k\left\lfloor\frac kn\right\rfloor=n \end{aligned} $$

This identity can also be seen by listing fractions. For instance let's consider the case for $n=20$

$$ {1\over20},{2\over20},{3\over20},{4\over20},\dots, {18\over20},{19\over20},{20\over20} $$

In total, there are $n$ fractions. If we were to simplify these fractions, we get

$$ {1\over20},{1\over10},{3\over20},{1\over5},\dots, {9\over10},{19\over20},\frac11 $$

Particularly, the denominators in these simplified fractions are always the divisor of $n$. Moreover, for each $d|n$ there are exactly $\varphi(d)$ simplified fractions with denominator $d$. As a result, we can also observe that $$ \sum_{d|n}\varphi(d)=n\tag3 $$

If we juxtapose (2) and (3), we can see that $\varphi(n)$ and $n$ are closely related. In particular, if we define **Dirichlet convolution** as in (4), then $\varphi(n)$ is obtained by convolving the Möbius function with $n$. Similarly, $n$ is obtained by convolving $\varphi(n)$ with $1$.

$$
(f*g)(n)\triangleq\sum_{d|n}f(d)g\left(\frac nd\right)
=\sum_{d|n}g(d)f\left(\frac nd\right)\tag4
$$
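Definition (4) translates almost literally into code. The sketch below is illustrative only: `dirichlet`, `one` and `identity` are hypothetical names, and divisors are found by brute force.

```python
from math import gcd


def divisors(n: int) -> list:
    """All positive divisors of n, by brute force."""
    return [d for d in range(1, n + 1) if n % d == 0]


def dirichlet(f, g):
    """Return the function n -> sum over d | n of f(d) * g(n / d)."""
    return lambda n: sum(f(d) * g(n // d) for d in divisors(n))


def phi(n: int) -> int:
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def one(n: int) -> int:
    return 1


def identity(n: int) -> int:
    return n


# Identity (3): convolving phi with the constant function 1 gives n.
phi_star_one = dirichlet(phi, one)
assert all(phi_star_one(n) == n for n in range(1, 100))

# Swapping the two arguments does not change the result.
a, b = dirichlet(phi, identity), dirichlet(identity, phi)
assert all(a(n) == b(n) for n in range(1, 100))
```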

Commutativity and associativity: This operation is clearly commutative. Moreover, it can be verified that Dirichlet convolution is also associative, using techniques similar to those used to prove (3).

Identity element: There is also an identity function for Dirichlet convolution. That is, Dirichlet convolution between any arithmetic function and $\lfloor1/n\rfloor$ gives the original function:

$$ \sum_{d|n}f(d)\left\lfloor\frac dn\right\rfloor=f(n) $$

Dirichlet inverse: If $g(n)$ and $f(n)$ are Dirichlet inverses of each other, then

$$ \sum_{d|n}f(d)g\left(\frac nd\right)=\left\lfloor\frac1n\right\rfloor $$

Although $\mu(n)$ and $1$ are inverses of each other, not every arithmetic function has a Dirichlet inverse. Fortunately, the following theorem tells us whether an arithmetic function is Dirichlet-invertible:

**Theorem**: *An arithmetic function $f(n)$ has a Dirichlet inverse if and only if $f(1)\ne0$*

*Proof.* For convenience, suppose $f(n)$ has a Dirichlet inverse $g(n)$, so we need to ensure

$$ \sum_{d|n}f(d)g\left(\frac nd\right)=\left\lfloor\frac1n\right\rfloor $$

For $n=1$, we have $g(1)=1/f(1)$, so $f(1)$ must be non-zero in order for its Dirichlet inverse to exist. In addition, for $n>1$ we have

$$ \begin{aligned} 0&=\sum_{d|n}f(d)g\left(\frac nd\right) \\ 0&=\sum_{d|n,d>1}f(d)g\left(\frac nd\right)+f(1)g(n) \\ g(n)&=-{1\over f(1)}\sum_{d|n,d>1}f(d)g\left(\frac nd\right) \end{aligned} $$

Since every $n/d$ with $d>1$ is smaller than $n$, this recursion determines $g(n)$ uniquely whenever $f(1)\ne0$, which proves the theorem. $\square$
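The proof is constructive: once the $d=1$ term $f(1)g(n)$ is isolated, $g(n)$ is expressed through values of $g$ at smaller arguments. Here is a minimal sketch of that recursion. The name `dirichlet_inverse` is hypothetical, and exact rational arithmetic via `fractions.Fraction` is a choice made for this example.

```python
from fractions import Fraction
from functools import lru_cache


def dirichlet_inverse(f):
    """Return g with (f * g)(n) = 1 if n == 1 else 0, assuming f(1) != 0."""
    if f(1) == 0:
        raise ValueError("f(1) must be non-zero for the inverse to exist")

    @lru_cache(maxsize=None)
    def g(n: int) -> Fraction:
        if n == 1:
            return Fraction(1) / f(1)
        # All divisors d > 1 of n; each n // d is strictly smaller than n.
        total = sum(f(d) * g(n // d) for d in range(2, n + 1) if n % d == 0)
        return -total / f(1)

    return g


# The inverse of the constant function 1 should be the Möbius function.
g = dirichlet_inverse(lambda n: 1)
print([int(g(n)) for n in range(1, 11)])
# [1, -1, -1, 0, -1, 1, -1, 0, 0, 1]
```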

Let $G$ be the set containing all multiplicative functions and $*$ be the Dirichlet convolution operator, then we can verify that

- For all $f,g\in G$, we have $f*g=g*f\in G$
- For all $f,g,h\in G$, $(f*g)*h=f*(g*h)$
- For all $f\in G$, $(f*\lfloor1/n\rfloor)(n)=f(n)$
- For all $f\in G$, there exists $g\in G$ such that $(f*g)(n)=\lfloor1/n\rfloor$

In order for the last condition to hold, readers could consider proving that every multiplicative function $f(n)$ satisfies $f(1)=1\ne0$.

As a result, we conclude that all multiplicative functions form an **abelian group** under Dirichlet convolution.

In a nutshell, we began our discussion with the statement and proof of the Möbius inversion formula as in (1), and then presented Dirichlet convolution, a generalization of summing over divisors. Finally, we discovered an algebraic property of multiplicative functions: they form an abelian group under Dirichlet convolution.

In mathematics, a norm is a function from a vector space over the real or complex numbers to the nonnegative real numbers, that satisfies certain properties pertaining to scalability and additivity and takes the value zero only if the input vector is zero. A pseudonorm or seminorm satisfies the same properties, except that it may have a zero value for some nonzero vectors. [^1]

Recently, during our research on adversarial attacks, we needed to quantitatively measure the "perturbation size" between adversarial images and their corresponding benign images. In machine learning, adversarial examples and ordinary images alike can essentially be represented as vectors, stored and computed as NumPy arrays. This article briefly describes some of the $\ell_p$ norms and how to compute them.

:::note What are adversarial examples? Adversarial examples are a type of vulnerability in neural network models. For an image classification model, adversarial examples are produced by adding imperceptible perturbations to the input images so that the model misclassifies their contents. :::

The $\ell_p$ norm is actually a "set of norms" in a vector space [^2]. In my line of research, $\ell_p$ norms are also often used to measure the magnitude of the "perturbation" of an adversarial example. We define the $\ell_p$ norms as,

$$ \ell_p: L_p(\vec{x})=\left(\sum_{i=1}^n |x_i|^p\right)^{1/p}, $$

in which $p$ is commonly $1$, $2$, or $\infty$ (the last understood as the limit $p\to\infty$). They are naturally called the $\ell_1$ norm, the $\ell_2$ norm and the $\ell_{\infty}$ norm.

Technically, the $\ell_0$ norm is not a norm (because, by definition, $p$ in $1/p$ cannot be 0). Still, it is used to denote **the number of non-zero elements in the vector**. In the context of adversarial attacks, it is the number of non-zero elements in the "perturbation" vector.

The $\ell_1$ norm represents **the sum of the absolute values of all the components of a vector**. A more intuitive description: to walk from the start of the vector to its end while moving only along the coordinate axes, the distance you travel is the $\ell_1$ norm of the vector.

As shown above, $\ell_1$ can be calculated according to the following equation,

$$ \ell_1(\vec{v})=|\vec{v}|_1=|a|+|b| $$

much the way a New York taxicab travels along its route. Therefore, the $\ell_1$ norm is also known as the Taxicab norm or Manhattan norm. Generally, it is formulated as,
$$
\ell_1(\vec{x})=|\vec{x}|_1= \sum^n_{i=1} |x_i|.
$$

The $\ell_2$ norm is one of the more commonly used measures of vector size in the field of machine learning. $\ell_2$ norm, also known as the Euclidean norm, represents the shortest distance required to travel from one point to another.

In the example shown above, the $\ell_2$ norm is calculated according to the following equation,

$$ \ell_2(\vec{v})=|\vec{v}|_2=\sqrt{(|a|^2+|b|^2)} $$

More generally, it is formulated as,

$$ \ell_2(\vec{x})=|\vec{x}|_2=\sqrt{(|x_1|^2+|x_2|^2+...+|x_n|^2)}. $$

The $\ell_\infty$ norm is the easiest to understand: it is the largest absolute value among the elements of the vector.

$$ \ell_\infty(\vec{v})=|\vec{v}|_\infty=\max(|a|,|b|) $$

For example, given a vector $\vec{v}=[-10,3,5]$, the $\ell_\infty$ norm of the vector is $10$.
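To connect the general definition with the special cases, here is a small sketch that evaluates the $\ell_p$ formula directly for the same vector. The helper name `lp_norm` is made up for this example, and it only handles finite $p>0$. As $p$ grows, the value approaches the largest absolute entry, which is why the $\ell_\infty$ norm is defined as that maximum.

```python
import numpy as np


def lp_norm(x, p: float) -> float:
    """l_p norm computed directly from the definition (finite p > 0)."""
    # Cast to float so that large powers do not overflow integer types.
    x = np.abs(np.asarray(x, dtype=np.float64))
    return float(np.sum(x ** p) ** (1.0 / p))


v = [-10, 3, 5]
print(lp_norm(v, 1))             # 18.0
print(lp_norm(v, 2))             # sqrt(134), about 11.58
print(lp_norm(v, 50))            # very close to 10
print(max(abs(t) for t in v))    # 10, the l_inf norm
```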

In my research, I tend to use the $\ell_p$ norm to measure the perturbation size of adversarial examples. Unfortunately, instead of relying on an existing framework, we implemented our attack from scratch, which means nothing outputs the $\ell_p$ distances automatically, so I need to calculate them with NumPy instead.

For an image `img` and its adversarial example `adv`, we can easily compute the perturbation `perturb`.

```
# perturb is a numpy array
perturb = adv - img
```

Then, we can use NumPy to compute the $\ell_p$ norm of the perturbation `perturb`.

```
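# import numpy and relevant libraries
import numpy as np
from numpy.linalg import norm
# flatten first: with an explicit order, norm() treats 2-D input as a matrix
# (matrix norms) and rejects arrays with more than two dimensions
flat = perturb.ravel()
# L0 (number of non-zero entries)
_l0 = norm(flat, 0)
# L1
_l1 = norm(flat, 1)
# L2
_l2 = norm(flat)
# L∞
_linf = norm(flat, np.inf)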
```

In fact, the implementation of `numpy.linalg.norm` is just a vector operation that uses the definition of $\ell_p$. For example, $\ell_1(x)$ is just,

`_l1_x = np.sum(np.abs(x))`

and $\ell_\infty(x)$ is just,

`_linf_x = np.max(np.abs(x))`

and so on.
However, in production, unless there is a special case, we should use `numpy.linalg.norm` directly. On top of that, NumPy's official documentation also describes how $\ell_p$ norms are calculated for matrices and for vectors.
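When several images are measured at once, the `axis` argument of `numpy.linalg.norm` gives one norm per image. The sketch below rests on assumptions that are not from the original attack code: the function name `perturbation_norms` is hypothetical, and images are assumed to be stacked along the first axis. Casting to float before subtracting also avoids the wraparound that unsigned integer arrays such as `uint8` would otherwise produce.

```python
import numpy as np


def perturbation_norms(img_batch: np.ndarray, adv_batch: np.ndarray) -> dict:
    """Per-image l_p norms of adv - img for batches shaped (N, ...)."""
    # Cast before subtracting so uint8 inputs cannot wrap around.
    perturb = adv_batch.astype(np.float64) - img_batch.astype(np.float64)
    # One row per image, so each row is treated as a plain vector.
    flat = perturb.reshape(len(perturb), -1)
    return {
        "l0": np.linalg.norm(flat, ord=0, axis=1),
        "l1": np.linalg.norm(flat, ord=1, axis=1),
        "l2": np.linalg.norm(flat, ord=2, axis=1),
        "linf": np.linalg.norm(flat, ord=np.inf, axis=1),
    }


# Toy batch: two 2x2 single-channel "images".
img = np.zeros((2, 2, 2, 1), dtype=np.uint8)
adv = img.copy()
adv[0, 0, 0, 0] = 3
adv[0, 1, 1, 0] = 4
norms = perturbation_norms(img, adv)
print(norms["l2"])  # first image: 5.0, second image: 0.0
```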

- NumPy Documentation - `numpy.linalg.norm`
- Medium - L0 Norm, L1 Norm, L2 Norm & L-Infinity Norm
- Gentle Introduction to Vector Norms in Machine Learning

[^1]: Wikipedia - Norm
[^2]: Wikipedia - $\ell_p$ Space

This is the index of a OneDrive public folder, hosted on Cloudflare Workers.

Live demo: 📚 Toby's OneDrive Index

Using Cloudflare servers as download proxies, in order to get higher speeds in mainland China.

To cache OneDrive files using Cloudflare CDN. There are two caching methods:

- Full: files are first transferred to Cloudflare servers and then to clients, but large files may encounter a timeout error due to Cloudflare Workers execution limits.
- Chunk: streaming and caching, but Content-Length can't be returned correctly.

The cache feature is enabled in the configuration, where the caching method and the paths to cache can also be set.

The latest features are listed below. We'll try to keep the list up-to-date.

- File icon rendered according to file type. Emoji as folder icon when available (if the first character of the folder name is an emoji).
- Use Font Awesome icons instead of material design icons (For better design consistency).
- **Add breadcrumbs for better directory navigation.**
- Support file previewing:
  - Images: `.png`, `.jpg`, `.gif`.
  - Plain text: `.txt`.
  - Markdown: `.md`, `.mdown`, `.markdown`.
  - Code: `.js`, `.py`, `.c`, `.json`...
  - **PDF: Lazy loading, loading progress and built-in PDF viewer**.
  - **Music / Audio:** `.mp3`, `.aac`, `.wav`, `.oga`.
  - **Videos:** `.mp4`, `.flv`, `.webm`, `.m3u8`.
  - ...
- Code syntax highlight in GitHub style. (With PrismJS.)
- Image preview supports Medium style zoom effect.
- Token cached and refreshed with Google Firebase Realtime Database. (~~For those who can't afford Cloudflare Workers KV storage.~~ 😢)
- Route lazy loading with the help of Turbolinks®. (Somewhat buggy when going from `folder` to `file preview`, but not user-experience degrading.)
- ...

- CSS animations all the way.
- Package source code with wrangler and webpack.
- Convert all CDN assets to load with jsDelivr.
- No external JS scripts, **all scripts are loaded with webpack!** (Other than some libraries.)
- ...

- Cloudflare Workers
- Google Firebase
- OneDrive developer platform
- Microsoft Azure App Service documentation
- turbolinks/turbolinks
- francoischalifour/medium-zoom
- PrismJS/prism
- sindresorhus/github-markdown-css

📁 LIB-HFI © Jason Liu. Released under the MIT License.

Authored and maintained by Jason Liu.

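Over this past year, the code in the repository became an application, and then became a legend.

The contemplation and the sacrifice, even the joy of the eureka moments, have all been washed out by the torrent of time.

No, it is better this way; most of it need not be remembered.

For it is not abstract; having lost its precision, it has become something ambiguous in the depths of thought,

and one day, who we are now may also become an abstraction in someone's hands in the future.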

The difference between the `apt` command and `apt-get`. Several commonly used `apt` commands are also listed, together with the corresponding `apt-get` commands they replace.
:::
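First, let's go back to 2016. Back then, Tsai Ing-wen had just been sworn in as president and Park Geun-hye had been removed from the presidency; Kim Jong-un was still setting off fireworks and the UK was still a European country; the top-spec iPhone cost only 6,988 yuan, and a Formula One race was something you could switch off after the first two laps; Frank Lee was still a middle-school student, and Donald Trump was still selling real estate... The most noteworthy event, though, was the release of Ubuntu 16.04, in which the package manager changed from `apt-get` to `apt`. In fact, `apt` was introduced back in 2014, but people only began to notice it after Ubuntu 16.04 was released in 2016.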

Nowadays, `apt install <package>` is seen more often than `apt-get install <package>`. As a result, many other Debian-derived Linux distributions have followed Ubuntu's lead and encourage users to use `apt` rather than `apt-get`.

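At this point, some may wonder: "What is the difference between `apt` and `apt-get`?", "If there was already a package manager, why make a new one?", "In what ways is `apt` better than `apt-get`?", "Which one is better in practice?" This article answers these questions.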

`apt` vs `apt-get`

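:::warning About `apt` in Linux Mint

A few years ago, Linux Mint created a Python module called `apt`. It actually uses `apt-get` under the hood, but adds some handy options. The `apt` discussed in this article is not the same as the one in Linux Mint.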
:::

Before exploring the differences between `apt` and `apt-get`, let's go over some background on these commands.

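`apt`?

Debian can be considered the ancestor of distributions such as Ubuntu, Linux Mint, and elementary OS. It has a robust packaging system, and every piece of software in the system is a package in binary or source form. To manage this system, Debian uses a toolset called APT (Advanced Package Tool). Please don't confuse it with the `apt` command; the two are not the same thing.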

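On Debian-derived Linux distributions, there are various tools that call APT to install, remove, and update packages. The `apt` and `apt-get` commands mentioned above are CLI tools of this kind. Aptitude is another such tool, and it offers both a CLI and a GUI.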

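If you browse the wiki, you may also come across commands such as `apt-cache`. And this is where the problem lies.

You see, these commands are rather low-level, and if you are an ordinary Linux user, there are features among them that you may never use. On top of that, most of the commonly used package-management commands are scattered across `apt-get` and `apt-cache`.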

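The arrival of `apt` solved this problem. `apt` includes the commonly used features of `apt-get` and `apt-cache` while leaving aside some of their obscure and rarely used ones. In addition, `apt` can manage the `apt.conf` file.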

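With `apt`, users no longer have to jump between `apt-get` and `apt-cache` to manage packages. `apt` is more rigorously structured, and it provides nearly every package-management command you are likely to need.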

*Bottom line:* `apt` = the most commonly used options of `apt-get` and `apt-cache`.

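What is the difference between `apt` and `apt-get`?

With `apt`, users have a unified, curated set of essential tools at hand instead of getting lost in a pile of options. The goal of `apt` is precisely to give end users a pleasant package-management experience.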

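For example, unlike with `apt-get`, users see a progress bar when installing or removing software with `apt`.

As another example, when users refresh the package repositories, `apt` also tells them how many packages can be upgraded.

Of course, `apt-get` can do something similar when given extra options. In `apt`, however, these features are enabled by default.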

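What is the difference between the commands of `apt` and `apt-get`?

Although the options of `apt` and `apt-get` are similar in many ways, this does not mean that `apt` is backward compatible with `apt-get`. In other words, simply replacing "apt-get" with "apt" in an `apt-get` command will not necessarily work.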

Let's first look at which commonly used `apt-get` and `apt-cache` commands the `apt` commands replace.

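| `apt` command | `apt-get` / `apt-cache` command it replaces | Function |
|---|---|---|
| `apt install` | `apt-get install` | Install a package |
| `apt remove` | `apt-get remove` | Remove a package |
| `apt purge` | `apt-get purge` | Remove a package together with its configuration files |
| `apt update` | `apt-get update` | Refresh the repository index |
| `apt upgrade` | `apt-get upgrade` | Upgrade all upgradable packages |
| `apt autoremove` | `apt-get autoremove` | Remove dependencies that are no longer needed |
| `apt full-upgrade` | `apt-get dist-upgrade` | Upgrade all upgradable packages, automatically handling (removing) dependencies |
| `apt search` | `apt-cache search` | Search for a package |
| `apt show` | `apt-cache show` | Show detailed information about a package |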
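Because the mapping is not a plain textual substitution (`dist-upgrade` becomes `full-upgrade`, and the `apt-cache` subcommands move under `apt`), the table can be encoded as a small lookup. This is only an illustrative sketch: the names `APT_EQUIVALENTS` and `to_apt` are invented here, the mapping covers only the rows of the table, and options placed before the subcommand (such as `apt-get -y install`) are deliberately not handled.

```python
# (old tool, old subcommand) -> apt subcommand, taken row by row from the table.
APT_EQUIVALENTS = {
    ("apt-get", "install"): "install",
    ("apt-get", "remove"): "remove",
    ("apt-get", "purge"): "purge",
    ("apt-get", "update"): "update",
    ("apt-get", "upgrade"): "upgrade",
    ("apt-get", "autoremove"): "autoremove",
    ("apt-get", "dist-upgrade"): "full-upgrade",
    ("apt-cache", "search"): "search",
    ("apt-cache", "show"): "show",
}


def to_apt(argv):
    """Rewrite e.g. ['apt-get', 'dist-upgrade'] in apt form.

    Returns None when the table lists no apt equivalent.
    """
    if len(argv) < 2:
        return None
    tool, subcommand, *rest = argv
    new_subcommand = APT_EQUIVALENTS.get((tool, subcommand))
    if new_subcommand is None:
        return None
    return ["apt", new_subcommand, *rest]


print(to_apt(["apt-get", "dist-upgrade"]))     # ['apt', 'full-upgrade']
print(to_apt(["apt-cache", "search", "vim"]))  # ['apt', 'search', 'vim']
print(to_apt(["apt-get", "source", "vim"]))    # None (not in the table)
```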

`apt` also has a few commands of its own.

| New `apt` command | Function |
|---|---|
| `apt list` | List packages along with their status (installed, upgradable, etc.) |
| `apt edit-sources` | Edit the sources list |

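It is worth pointing out that `apt` is under long-term maintenance, so new options can be expected in the future.

I have not found any news that `apt-get` will be discontinued, and that seems unlikely. After all, `apt-get` still offers more functionality.

For lower-level work, such as scripts, `apt-get` is still commonly used.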

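By now you may be wondering whether to use `apt` or `apt-get`. As an ordinary Linux user, I would choose `apt` (~~pacman~~🤦).

`apt` is the command endorsed and recommended by most Debian-derived Linux distributions, and it provides all the options I need to manage packages. Most importantly, its options are simpler.

Unless I need the extra features of `apt-get`, I see no reason to keep using it.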

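I hope this article has made the differences between `apt` and `apt-get` clear. To summarize:

- The `apt` command is a subset of the union of the `apt-get` and `apt-cache` commands, containing the options needed for everyday package management;
- Although `apt-get` is not deprecated, this article recommends `apt` over `apt-get` for ordinary users.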

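- First, the operating system partition is cloned. (This uses copy-on-write, so it takes up no extra space.)
- Next, the update is applied to the cloned copy of the operating system. (This is what happens while the "Preparing Update" progress bar is loading.)
- Then the system verifies this partition piece by piece: every file in it is checked with MD5 to make sure nothing went wrong during the update. (This is also why that progress bar takes so long.)
- The phone then restarts and boots from the updated operating system partition.
- Only after the system has booted successfully and all checks have passed (the second progress bar) is the old operating system partition deleted.

This sequence of steps makes it, in theory, almost impossible for an OTA system update to brick the device. Even if the update is interrupted, the device can still boot from the old operating system partition, and we can simply try the update again.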

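However, Ruby is a scripting language after all, and its speed is a concern. So the French developer Pierre Peltier, taking colorls as a reference, used Rust, a lower-level statically typed language, to write a replacement with the speed of ls and the pretty colors and icons of colorls: lsd.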

=> Installation

*P.S. You can add `alias ls='lsd'` to your shell's configuration file so that typing `ls` invokes lsd.*