$$ \zeta(s)=\sum_{n=1}^\infty{1\over n^s}\tag1 $$

To determine its convergence, let's consider Riemann-Stieltjes integration:

$$ \begin{aligned} \zeta_N(s) &=\sum_{n=1}^N{1\over n^s}=\int_{1-\varepsilon}^N{\mathrm d\lfloor t\rfloor\over t^s} \\ &=\left.{\lfloor t\rfloor\over t^s}\right|_{1-\varepsilon}^N+s\int_1^N{\lfloor t\rfloor\over t^{s+1}}\mathrm dt \\ &=N^{1-s}+s\int_1^N{\lfloor t\rfloor\over t^{s+1}}\mathrm dt \end{aligned} $$

It can be easily verified that this expression converges absolutely and uniformly on compact sets when $\Re(s)>1$, which allows us to manipulate it freely. Let's have a look.

Define

$$ \psi(x)=\sum_{n=1}^\infty e^{-n^2\pi x} $$

Then by **Poisson's summation formula** we have

$$ 2\psi(x)+1=\sum_{n\in\mathbb Z}e^{-n^2\pi x}={1\over\sqrt x}\sum_{n\in\mathbb Z}e^{-n^2\pi/x} $$

which leads to

$$ \psi(x)={1\over\sqrt x}\psi\left(\frac1x\right)+{1\over2\sqrt x}-\frac12\tag2 $$
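Identity (2) can be checked numerically; the sketch below truncates the rapidly convergent series for $\psi(x)$ (the truncation point and the test value of $x$ are arbitrary choices):

```python
import math

# Numerically check psi(x) = psi(1/x)/sqrt(x) + 1/(2 sqrt(x)) - 1/2.
# The series for psi converges extremely fast, so a short partial sum suffices.
def psi(x, terms=200):
    return sum(math.exp(-n * n * math.pi * x) for n in range(1, terms + 1))

x = 0.5
lhs = psi(x)
rhs = psi(1 / x) / math.sqrt(x) + 1 / (2 * math.sqrt(x)) - 0.5
assert abs(lhs - rhs) < 1e-9
```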

Let's perform a Mellin transform on this function so that

$$ \begin{aligned} \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx &=\int_0^\infty x^{s/2-1}\sum_{n=1}^\infty e^{-n^2\pi x}\mathrm dx \\ &=\pi^{-s/2}\Gamma\left(\frac s2\right)\sum_{n=1}^\infty{1\over n^s} \end{aligned} $$

Now, by (1) we obtain this identity:

$$ \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx=\pi^{-s/2}\Gamma\left(\frac s2\right)\zeta(s)\tag3 $$

As a result, we can study the properties of the Riemann zeta function by digging deeper into the integral on the left hand side.

First, let's split the integral into two parts

$$ \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx=\int_0^1x^{s/2-1}\psi(x)\mathrm dx+\int_1^\infty x^{s/2-1}\psi(x)\mathrm dx $$

Then, applying (2) to the $\int_0^1$ part gives

$$ \begin{aligned} \int_0^1x^{s/2-1}\psi(x)\mathrm dx &=\int_0^1x^{s/2-1}\left[{1\over\sqrt x}\psi\left(\frac1x\right)+{1\over2\sqrt x}-\frac12\right]\mathrm dx \\ &=\underbrace{\int_0^1x^{(s-1)/2-1}\psi\left(\frac1x\right)\mathrm dx}_{t=x^{-1}} \\ &\quad+\frac12\int_0^1x^{(s-1)/2-1}\mathrm dx-\frac12\int_0^1x^{s/2-1}\mathrm dx \\ &=\int_1^\infty t^{(1-s)/2-1}\psi(t)\mathrm dt+{1\over s(s-1)} \end{aligned} $$

Now, plugging this result into the original equation gives

$$ \int_0^\infty x^{s/2-1}\psi(x)\mathrm dx={1\over s(s-1)}+\int_1^\infty[x^{s/2}+x^{(1-s)/2}]\psi(x){\mathrm dx\over x} $$

Observe that the right hand side does not change when we replace $s$ with $1-s$. Hence, by (3) we have

$$ \pi^{-s/2}\Gamma\left(\frac s2\right)\zeta(s) =\pi^{(s-1)/2}\Gamma\left({1-s\over2}\right)\zeta(1-s) $$

Now, in order to achieve optimal simplicity, we multiply both sides by $\pi^{s/2}\Gamma\left(1-\frac s2\right)$:

$$ \Gamma\left(\frac s2\right)\Gamma\left(1-\frac s2\right)\zeta(s)=\pi^{s-1/2}\Gamma\left({1-s\over2}\right)\Gamma\left({1-s\over2}+\frac12\right)\zeta(1-s) $$

By **Euler's reflection formula**, we have

$$ \Gamma\left(\frac s2\right)\Gamma\left(1-\frac s2\right)=\pi\csc\left(\pi s\over2\right) $$

and by **Legendre's Duplication formula**, we deduce

$$ \Gamma\left({1-s\over2}\right)\Gamma\left({1-s\over2}+\frac12\right)=2^s\pi^{1/2}\Gamma(1-s) $$

Plugging these results back gives us

$$ \pi\csc\left(\pi s\over2\right)\zeta(s)=2^s\pi^s\Gamma(1-s)\zeta(1-s) $$

Now, dividing both sides by $\pi\csc\left(\pi s\over2\right)$, we get

$$ \zeta(s)=2^s\pi^{s-1}\sin\left(\pi s\over2\right)\Gamma(1-s)\zeta(1-s) $$

which is known as the **functional equation** for $\zeta(s)$.
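As a quick sanity check, we can evaluate the right hand side of the functional equation at $s=-1$, where the classical values $\zeta(2)=\pi^2/6$ and $\Gamma(2)=1$ make everything elementary; the result should be the known value $\zeta(-1)=-\frac1{12}$:

```python
import math

# Functional equation at s = -1: RHS = 2^s pi^(s-1) sin(pi s/2) Gamma(1-s) zeta(1-s),
# with zeta(2) = pi^2/6 plugged in; it should equal zeta(-1) = -1/12.
s = -1
rhs = 2**s * math.pi**(s - 1) * math.sin(math.pi * s / 2) \
      * math.gamma(1 - s) * (math.pi**2 / 6)
assert abs(rhs - (-1 / 12)) < 1e-12
```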

In this blog post, we began with the Dirichlet series definition of $\zeta(s)$, connected the zeta function with an integral representation, and then used Poisson's summation formula to obtain its analytic continuation. This analytic continuation has further consequences. If we look back at the equation

$$ \zeta(s)=2^s\pi^{s-1}\sin\left(\pi s\over2\right)\Gamma(1-s)\zeta(1-s), $$

we can observe that for $\Re(s)<0$ the factor $\sin\left(\pi s\over2\right)$ makes the right hand side vanish whenever $s=-2k$ for a positive integer $k$. Hence, such $s$ are called the **trivial zeros** of $\zeta(s)$. However, there are also other occasions on which the right hand side is zero; we call those the **nontrivial zeros** of $\zeta(s)$.

Having gone through these definitions, we can now get a good basic grasp of the Riemann hypothesis:


**Riemann hypothesis**: All nontrivial zeros of $\zeta(s)$ lie on the line $\Re(s)=\frac12$.

The Fast Gradient Sign Method, with perturbations limited by the $\ell_\infty$ or the $\ell_2$ norm.

- FGSM original definition
- FGSM as a maximum allowable attack problem
- FGSM with other norms
- References

The **Fast Gradient Sign Method (FGSM)** by Goodfellow et al. (NIPS 2014) is designed to attack deep neural networks. The idea is to maximize a certain loss function $\mathcal{J}(x; w)$ subject to an upper bound on the perturbation, for instance: $|x-x_0|_\infty \leq \eta$.

Formally, we define FGSM as follows. Given a loss function $\mathcal{J}(x; w)$, the FGSM creates an attack $x$ with:

$$x=x_0+\eta\cdot \text{sign}(\nabla_x\mathcal{J}(x_0;w))$$

Here the gradient is taken with respect to the input $x$, not the parameters $w$; that is, $\nabla_x\mathcal{J}(x_0;w)$ should be interpreted as the gradient of $\mathcal{J}$ with respect to $x$, evaluated at $x_0$. Because of the $\ell_\infty$ bound on the perturbation magnitude, the perturbation direction is the sign of the gradient.

Given a loss function $\mathcal{J}(x; w)$, if we can find an optimization problem whose solution is the FGSM attack, then we can generalize FGSM to a broader class of attacks. To this end, we notice that for general (possibly nonlinear) loss functions, FGSM can be interpreted through a first-order approximation:

$$\mathcal{J}(x;w)=\mathcal{J}(x_0+r;w)\approx\mathcal{J}(x_0;w)+\nabla_x\mathcal{J}(x_0;w)^Tr$$

where $x=x_0+r$. Therefore, finding $r$ to maximize $\mathcal{J}(x_0+r;w)$ is approximately equivalent to finding $r$ which maximizes $\nabla_x\mathcal{J}(x_0;w)^Tr$. Hence FGSM is the solution to:

$$\underset{r}{\textrm{maximize}}\quad\mathcal{J}(x_0;w)+\nabla_x\mathcal{J}(x_0;w)^Tr\quad\textrm{subject to}\quad|r|_\infty\leq\eta$$

which can be simplified to:

$$\underset{r}{\textrm{minimize}}\quad-\nabla_x\mathcal{J}(x_0;w)^Tr-\mathcal{J}(x_0;w)\quad\textrm{subject to}\quad|r|_\infty\leq\eta$$

where we flipped the maximization into a minimization by negating the objective. Writing $w\triangleq-\nabla_x\mathcal{J}(x_0;w)$ and $w_0\triangleq-\mathcal{J}(x_0;w)$ (abusing the symbol $w$), this optimization problem takes the generic form:

$$\underset{r}{\textrm{minimize}}\quad w^Tr+w_0\quad\textrm{subject to}\quad|r|_\infty\leq\eta$$

:::note Hölder's Inequality
Let $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^d$. For any $p,q\in[1,\infty]$ with $1/p+1/q=1$, we have the inequality $-|x|_p|y|_q\leq x^Ty\leq|x|_p|y|_q$.
:::

Applying the negative side of Hölder's inequality (pairing $|r|_\infty$ with $|w|_1$), we can show that:

$$w^Tr\geq-|r|_\infty|w|_1\geq-\eta|w|_1,\ \textrm{where}\ p=1, q=\infty$$

and because we have:

$$-\eta|w|_1=-\eta\sum_{i=1}^d|w_i|=-\eta\sum_{i=1}^dw_i\cdot\textrm{sign}(w_i)=-\eta \cdot w^T\cdot\textrm{sign}(w)$$

Thus, the lower bound of $w^Tr$ is attained when $r=-\eta\cdot\textrm{sign}(w)$. Therefore, the solution to the original FGSM optimization problem is:

$$r=\eta\cdot\textrm{sign}(\nabla_x\mathcal{J}(x_0;w))$$

And hence the perturbed data is $x=x_0+\eta\cdot\textrm{sign}(\nabla_x\mathcal{J}(x_0 ;w))$.
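As a concrete illustration, here is a minimal FGSM sketch on logistic regression, where the input-gradient of the loss has a closed form, $\nabla_x\mathcal J=(\sigma(w^Tx+b)-y)\,w$. The model, data, and budget below are all made up for illustration, not taken from the paper:

```python
import math, random

# Minimal l_inf FGSM sketch on a logistic-regression loss; all numbers illustrative.
random.seed(0)
d = 5
w = [random.gauss(0, 1) for _ in range(d)]    # fixed "trained" weights
b = 0.1
x0 = [random.gauss(0, 1) for _ in range(d)]   # clean input
y = 1.0                                       # its label
eta = 0.1                                     # l_inf perturbation budget

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(x):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p) if y == 1.0 else -math.log(1 - p)

p0 = sigmoid(sum(wi * xi for wi, xi in zip(w, x0)) + b)
grad = [(p0 - y) * wi for wi in w]            # gradient of the loss w.r.t. x at x0
sign = lambda v: (v > 0) - (v < 0)
x_adv = [xi + eta * sign(gi) for xi, gi in zip(x0, grad)]   # FGSM step

assert all(abs(a - c) <= eta + 1e-12 for a, c in zip(x_adv, x0))  # within budget
assert loss(x_adv) > loss(x0)                 # the perturbation raises the loss
```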

Considering FGSM as a maximum allowable attack problem, we can easily generalize the attack to other $\ell_p$ norms. Consider the $\ell_2$ norm. In this case, the Holder's inequality equation becomes:

$$w^Tr\geq-|r|_2|w|_2\geq-\eta|w|_2,\ \textrm{where}\ p=2, q=2$$

and thus, $w^Tr$ is minimized when $r=-\eta\cdot w / |w|_2$. As a result, the perturbed data becomes:

$$x=x_0+\eta\cdot\frac{\nabla_x\mathcal{J}(x_0 ;w)}{|\nabla_x\mathcal{J}(x_0 ;w)|_2}$$
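The $\ell_2$ step can be sketched the same way; the stand-in gradient below is purely illustrative:

```python
import math

# l_2 FGSM variant: normalize the gradient instead of taking its sign.
grad = [0.3, -1.2, 0.5]                       # a stand-in gradient
eta = 0.1
norm = math.sqrt(sum(g * g for g in grad))
r = [eta * g / norm for g in grad]
assert abs(math.sqrt(sum(v * v for v in r)) - eta) < 1e-12   # |r|_2 = eta
```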

It is well known that the factorial is defined by the following recursive relation, $$ n!=n(n-1)! $$

with $0!=1$. However, it is possible to generalize this operation to complex numbers. Let's begin our journey of generalization!

It can be easily shown that the following integral satisfies

$$ \int_0^\infty e^{-\lambda t}\mathrm dt=\frac1\lambda $$

Differentiating both sides with respect to $\lambda$ $n-1$ times gives

$$ \int_0^\infty t^{n-1}e^{-\lambda t}\mathrm dt={(n-1)!\over\lambda^n} $$

Setting $\lambda=1$ gives us the first generalization of factorial: the Gamma function.

$$ (s-1)!=\Gamma(s)=\int_0^\infty t^{s-1}e^{-t}\mathrm dt $$

In fact, by integration by parts we can show that the Gamma function satisfies the recursive relationship:

$$ \Gamma(s+1)=s\Gamma(s)\tag1 $$
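We can spot-check both the recursion and the factorial property with the standard library's Gamma implementation:

```python
import math

# Gamma(s+1) = s * Gamma(s), and Gamma(n) = (n-1)! at positive integers.
for s in (0.5, 1.7, 3.2):
    assert abs(math.gamma(s + 1) - s * math.gamma(s)) < 1e-12
assert abs(math.gamma(5) - 24.0) < 1e-9   # Gamma(5) = 4! = 24
```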

Furthermore, we have the relation

$$ \int_0^\infty t^{s-1}e^{-\lambda t}\mathrm dt={\Gamma(s)\over\lambda^s} $$

To streamline our derivation, let's define the Euler integral of the first kind, the Beta function $B(x,y)$:

$$ B(x,y)=\int_0^1\tau^{x-1}(1-\tau)^{y-1}\mathrm d\tau\tag2 $$

which can be seen as a convolution between two power functions:

$$ B(x,y;t)=\int_0^t\tau^{x-1}(t-\tau)^{y-1}\mathrm d\tau $$

If we perform a Laplace transform on both sides, we get

$$ \mathscr L\{B(x,y;t)\}(\lambda)={\Gamma(x)\Gamma(y)\over\lambda^{x+y}} $$

If we juxtapose this fact with the relation

$$ \mathscr L\{t^{x+y-1}\}(\lambda)={\Gamma(x+y)\over\lambda^{x+y}} $$

then we see that

$$ B(x,y)=B(x,y;1)={\Gamma(x)\Gamma(y)\over\Gamma(x+y)} $$

which will be extremely useful for us to generalize factorial even further.
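For instance, we can verify $B(2,3)=\Gamma(2)\Gamma(3)/\Gamma(5)=\frac1{12}$ numerically with a plain midpoint rule on the defining integral (2); the grid size is an arbitrary choice:

```python
import math

# Midpoint-rule check of B(x, y) = Gamma(x)Gamma(y)/Gamma(x+y) at x = 2, y = 3.
x, y, N = 2.0, 3.0, 10000
h = 1.0 / N
beta = sum(((k + 0.5) * h) ** (x - 1) * (1 - (k + 0.5) * h) ** (y - 1)
           for k in range(N)) * h
exact = math.gamma(x) * math.gamma(y) / math.gamma(x + y)   # = 1/12
assert abs(beta - exact) < 1e-6
```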

By taking absolute values, we see that the Gamma integral converges whenever $\Re(s)>0$: $$ |\Gamma(s)|\le\int_0^\infty t^{\Re(s)-1}e^{-t}\mathrm dt $$

which means that this improper integral converges absolutely on the right half plane. Hence, a stronger definition is needed to extend it to the entire complex plane.

Let's consider a sequence of functions:

$$ f_n(t,s)= \begin{cases} t^{s-1}\left(1-\frac tn\right)^n & 0\le t\le n \\ 0 & t>n \end{cases} $$

Then by the exponential limit we see that

$$ f_n(t,s)\to t^{s-1}e^{-t} $$

in a **pointwise** sense. However, it is possible to show that this sequence converges **uniformly** for $t\in[0,+\infty)$. Let's set $T>0$ such that for all $t>T$ we have

$$ |t^{s-1}e^{-t}|<\varepsilon $$

Then, let's consider the interval $[0,T]$:

$$ \begin{aligned} |t^{s-1}e^{-t}-f_n(t,s)| &=t^{\Re(s)-1}\left|e^{-t}-\left(1-\frac tn\right)^n\right| \\ &=t^{\Re(s)-1}\left|e^{-t}-e^{-b_n(t)}\right| \end{aligned} $$

where we define $b_n(t)$ as

$$ b_n(t)=-n\log\left(1-\frac tn\right) $$

By the **uniform continuity** of $e^{-t}$ on $[0,T]$, all we need is to prove that $b_n(t)$ converges uniformly to $t$. In fact, for all $n>t$ we can use the Taylor expansion of logarithm to obtain

$$ \begin{aligned} |b_n(t)-t| &=\left|-n\log\left(1-\frac tn\right)-t\right| \\ &=\left|n\left[\frac tn+\mathcal O\left(1\over n^2\right)\right]-t\right| \\ &=\mathcal O\left(\frac1n\right) \end{aligned} $$

As a result, we conclude that $f_n(t,s)$ converges uniformly to $t^{s-1}e^{-t}$, which allows us to interchange the limit and the integral to obtain

$$ \lim_{n\to\infty}\int_0^n t^{s-1}\left(1-\frac tn\right)^n\mathrm dt=\Gamma(s)\tag3 $$

In the following procedure, we are going to evaluate the left hand side limit in closed form so that the right hand side can be analytically continued to the left half plane.

Performing a change of variables on (3) gives

$$ \Gamma_n(s)\triangleq\underbrace{\int_0^nt^{s-1}\left(1-\frac tn\right)^n\mathrm dt}_{t=nr}=n^s\int_0^1r^{s-1}(1-r)^n\mathrm dr $$

In fact, the right hand side can be expressed via the Beta function as in (2), which yields

$$ \Gamma_n(s)=n^sB(s,n+1)=n^s{\Gamma(s)\Gamma(n+1)\over\Gamma(s+n+1)} $$

Repeated application of (1) gives us

$$ \begin{aligned} \Gamma_n(s) &={\Gamma(s)n^sn!\over\Gamma(s+n+1)} \\ &={n^sn!\over s(s+1)(s+2)\cdots(s+n)} \\ &={n^s\over s}\prod_{k=1}^n{k\over s+k} \end{aligned} $$

which turns the Gamma function into a product representation. However, this expression still looks unwieldy, so why not go deeper?

First, let's take the reciprocal of this equation to obtain

$$ \begin{aligned} {1\over\Gamma_n(s)} &=sn^{-s}\prod_{k=1}^n\left(1+\frac sk\right) \\ &=se^{-s\log n}\prod_{k=1}^n\left(1+\frac sk\right) \end{aligned} $$

Then, employing the fact that $H_n\triangleq\sum_{k=1}^n\frac1k=\log n+\gamma+\mathcal O(1/n)$ we can replace the logarithm with harmonic numbers:

$$ \begin{aligned} {1\over\Gamma_n(s)} &=se^{-s[H_n-\gamma+\mathcal O(1/n)]}\prod_{k=1}^n\left(1+\frac sk\right) \\ &=se^{s\gamma+\mathcal O(1/n)}\prod_{k=1}^n\left(1+\frac sk\right)e^{-s/k} \end{aligned} $$

Now, if we take the logarithm of the product, we have

$$ \begin{aligned} \log\left|\prod_{k=1}^n\left(1+\frac sk\right)e^{-s/k}\right| &\le|s|^2\sum_{k=1}^n{1\over k^2}<{|s|^2\pi^2\over6} \end{aligned} $$

As a result, the product converges absolutely for all $s\in\mathbb C$, giving us the Weierstrass product representation of Gamma function:

$$ {1\over\Gamma(s)}=se^{\gamma s}\prod_{k=1}^\infty\left(1+\frac sk\right)e^{-s/k} $$

which allows us to analytically continue $\Gamma(s)$ to the entire complex plane except for the nonpositive integers:

$$ \Gamma(s)={e^{-\gamma s}\over s}\prod_{k=1}^\infty\left(1+\frac sk\right)^{-1}e^{s/k}\tag4 $$

Remark: (4) also reveals that $\Gamma(s)$ is non-zero for all $s\in\mathbb C$.
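Truncating the product gives a quick numerical check of (4); $\Gamma\left(\frac12\right)=\sqrt\pi$ is recovered to within the truncation error (the cutoff $K$ is an arbitrary choice):

```python
import math

GAMMA_EM = 0.5772156649015329   # Euler-Mascheroni constant

def gamma_product(s, K=100000):
    # Truncated Weierstrass product for 1/Gamma(s), then inverted.
    inv = s * math.exp(GAMMA_EM * s)
    for k in range(1, K + 1):
        inv *= (1 + s / k) * math.exp(-s / k)
    return 1 / inv

assert abs(gamma_product(0.5) - math.sqrt(math.pi)) < 1e-4
```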

We have now successfully extended the factorial to the entire complex plane except at the nonpositive integers, but the Gamma function has some other brilliant properties. Let's have a look at some of them:

From the last article, we know that the Euler-Mascheroni constant is defined by

$$ \gamma=\lim_{n\to\infty}(H_n-\log n)=1-\int_1^\infty{\{x\}\over x^2}\mathrm dx $$

However, it is possible for us to create a new definition for $\gamma$ by using $\Gamma(s)$.

Remark: I hypothesize that this is why the function is named $\Gamma$ and the constant $\gamma$, since they are so closely related.

To begin with, we take the logarithm of (4) to get

$$ \log\Gamma(s)=-\log s-\gamma s+\sum_{k=1}^\infty\left\{\frac sk-\log\left(1+\frac sk\right)\right\} $$

If we were to define **Digamma function** $\psi(s)$ as the logarithmic derivative of $\Gamma(s)$, then

$$ \psi(s)=-\gamma-\frac1s+\sum_{k=1}^\infty\left\{\frac1k-{1\over s+k}\right\} $$

If we were to move the $\frac1s$ term into the summation, we deduce the standard definition for Digamma function.

$$ \psi(s)=-\gamma+\sum_{m=0}^\infty\left\{{1\over m+1}-{1\over m+s}\right\}\tag5 $$

Plugging $s=1$ gives $\psi(1)=\Gamma'(1)/\Gamma(1)=-\gamma$. Because $\Gamma(1)=1$, we also know that $\Gamma'(1)=-\gamma$. Combining this with the original integral definition for $\Gamma(s)$ gives this elegant integral identity to represent Euler-Mascheroni constant:

$$ \int_0^\infty e^{-t}\log(t)\mathrm dt=-\gamma\tag6 $$
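Identity (6) is easy to confirm numerically; a midpoint rule handles the integrable logarithmic singularity at $0$, and the tail beyond $t=30$ is negligible (grid and cutoff are arbitrary choices):

```python
import math

# Check that the integral of exp(-t) * log(t) over (0, inf) is about -gamma.
GAMMA_EM = 0.5772156649015329
h, T = 1e-4, 30.0
n = int(T / h)
integral = sum(math.exp(-(k + 0.5) * h) * math.log((k + 0.5) * h)
               for k in range(n)) * h
assert abs(integral - (-GAMMA_EM)) < 1e-3
```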

While deriving (5) in the previous section, we introduced the Digamma function, so let's do some calculus with it:

$$ \begin{aligned} \psi(s) &=-\gamma+\sum_{m=0}^\infty\int_0^1(x^m-x^{m+s-1})\mathrm dx \\ &=-\gamma+\int_0^1(1-x^{s-1})\sum_{m=0}^\infty x^m\mathrm dx \\ &=-\gamma+\int_0^1{1-x^{s-1}\over1-x}\mathrm dx \end{aligned} $$

Well, the constant lying outside is not *beautiful*, so why not use (6):

$$ \begin{aligned} -\gamma &=\underbrace{\int_0^\infty e^{-t}\log(t)\mathrm dt}_{x=e^{-t}} \\ &=-\int_1^0\log(-\log x)\mathrm dx \\ &=\int_0^1\log\log\left(\frac1x\right)\mathrm dx \end{aligned} $$

Combining all these gives us the ultimate integral definition for $\psi(s)$:

$$ \psi(s)=\int_0^1\left\{\log\log\left(\frac1x\right)+{x^{s-1}-1\over x-1}\right\}\mathrm dx\tag7 $$

In this blog, we first used the technique of differentiation under the integral sign to deduce an integral representation for the factorial, which introduces the Gamma function $\Gamma(s)$. Then, using an identity connecting $B(x,y)$ and $\Gamma(s)$, we obtained a product formula that turns $\Gamma(s)$ into a **meromorphic** function on $\mathbb C$. Using this newly obtained product formula, we were also able to discover some new identities. In fact, the Gamma function appears frequently in analytic number theory, and we will base future investigations on the following identity (which you could try to prove yourself):

$$ \sum_{n=1}^\infty{1\over n^s}={1\over\Gamma(s)}\int_0^\infty{x^{s-1}\over e^x-1}\mathrm dx $$

:::important
🌏 Source 🔬 Downloadable at: https://arxiv.org/abs/2003.08757. CVPR 2020.
:::

Adversarial Camouflage (AdvCam) transfers large adversarial perturbations into customized styles, which are then “hidden” on the target object or in the off-target background. It focuses on physical-world scenarios: the resulting attacks are well camouflaged and highly stealthy, while remaining effective in fooling state-of-the-art DNN image classifiers.

- Propose a flexible adversarial camouflage approach, **AdvCam**, to craft and camouflage adversarial examples. **AdvCam** allows the generation of **large perturbations, customizable attack regions, and camouflage styles.**
- It is very flexible and useful for vulnerability evaluation of DNNs against large perturbations in physical-world attacks.

- Digital settings.
- Physical-world settings.

- Adversarial strength: represents the ability to fool DNNs.
- Adversarial stealthiness: describes whether the adversarial perturbations can be detected by human observers.
- Camouflage flexibility: the degree to which the attacker can control the appearance of adversarial image.

Normal attacks try to find an adversarial example for a given image by solving the following optimization problem:

$$\textrm{minimize}\ \mathcal{D}(x,x')+\lambda\cdot\mathcal{L}_{adv}(x')\ \textrm{such that}\ x'\in[0,255]^m$$

where $\mathcal{D}(x,x')$ captures the "stealthiness" of the adversarial example, and $\mathcal{L}_{adv}$ is the adversarial loss. This means that **there is a trade-off between stealthiness and adversarial strength.**

We use style transfer techniques to achieve the goal of camouflage and adversarial attack techniques to achieve adversarial strength. In order to do so, the loss function, which we call the **adversarial camouflage loss**, becomes:

$$\mathcal{L}=(\mathcal{L}_{s}+\mathcal{L}_{c}+\mathcal{L}_{m})+\lambda\cdot\mathcal{L}_{adv}$$

The final overview of the AdvCam approach:

Camouflaged adversarial images crafted by AdvCam attack and their original versions.

As the title suggests, today we are going to do some integral calculus. First, let's recall the definition of Riemann integral:

Traditionally, the integral of some function $f(x)$ over some interval $[a,b]$ is defined as the signed area under the curve of $f(x)$:

$$ \int_a^bf(x)\mathrm dx\triangleq\text{Signed area under $f(x)$ where $x\in[a,b]$} $$

and Riemann integral is one way to define it rigorously. Particularly, we define it as the sum of the areas of tiny rectangles:

$$ \int_a^bf(x)\mathrm dx\approx\sum_{k=1}^Nf(\xi_k)(x_k-x_{k-1}) $$

where $\xi_k$ are sampled in $[x_{k-1},x_k]$ and the increasing sequence $\{x_k\}$ is called a partition of the interval $[a,b]$:

$$ a=x_0\le x_1\le x_2\le\cdots\le x_N=b $$

To formalize the Riemann integral, let's define $\operatorname{mesh}\{x_n\}$ as the length of the largest interval in a partition $\{x_n\}$:

$$ \operatorname{mesh}\{x_n\}\triangleq\max_{1\le n\le N}(x_n-x_{n-1}) $$

Then we say that some function $f(x)$ is **Riemann integrable** if for all $\varepsilon>0$, there exists $\delta>0$ such that when $\operatorname{mesh}\{x_n\}\le\delta$ we always have

$$ \left|\sum_{k=1}^Nf(\xi_k)(x_k-x_{k-1})-\int_a^bf(x)\mathrm dx\right|<\varepsilon $$

Equivalently, if some function $f(x)$ is Riemann-integrable on $[a,b]$, then the limit

$$ \lim_{\operatorname{mesh}\{x_n\}\to0}\sum_{k=1}^Nf(\xi_k)(x_k-x_{k-1}) $$

exists and converges to $\int_a^bf(x)\mathrm dx$.

Although the Riemann integral appears sufficient for integrating functions, it does not interact gracefully with step functions and discrete sums. The Riemann-Stieltjes integral generalizes it; let's first look at its definition:

$$ \int_a^bf(x)\mathrm dg(x) =\lim_{\operatorname{mesh}\{x_n\}\to0}I(x_n,\xi_n)\triangleq\lim_{\operatorname{mesh}\{x_n\}\to0}\sum_{k=1}^Nf(\xi_k)[g(x_k)-g(x_{k-1})] $$

In order for this limit to exist, we need to set up constraints on $f(x)$ and $g(x)$:

**Theorem 1**: *The Riemann-Stieltjes integral $\int_a^bf\mathrm dg$ exists if $f$ is continuous on $[a,b]$ and $g$ is of bounded variation on $[a,b]$*

When we say $g$ is of bounded variation, we mean that the following quantity (with the supremum taken over all partitions of $[a,b]$) is finite:

$$ \operatorname{Var}_{[a,b]}(g)\triangleq\sup\sum_k|g(x_k)-g(x_{k-1})| $$

*Proof.* Let's define $\{y_n\}=\{y_1,y_2,\dots,y_M\}$ as another partition of $[a,b]$ such that $\{x_n\}$ is a subsequence of it, and let $\eta_m$ be the sampled abscissa in $[y_{m-1},y_m]$. If we designate $P_k$ to be the set of $m$ such that the $y_m$'s are contained in the interval $(x_{k-1},x_k]$:

$$ P_k=\{m\mid x_{k-1}<y_m\le x_k\} $$

then we have

$$ \sum_{m\in P_k}[g(y_m)-g(y_{m-1})]=g(x_k)-g(x_{k-1}) $$

which implies

$$ I(x_n,\xi_n)=\sum_{k=1}^N\sum_{m\in P_k}f(\xi_k)[g(y_m)-g(y_{m-1})]\tag1 $$ $$ I(y_n,\eta_n)=\sum_{k=1}^N\sum_{m\in P_k}f(\eta_m)[g(y_m)-g(y_{m-1})]\tag2 $$

Because $f(x)$ is uniformly continuous on $[a,b]$, we know that for every $\varepsilon>0$ there exists $\delta>0$ such that when $s,t\in[a,b]$ satisfy $|s-t|\le\operatorname{mesh}\{x_n\}\le\delta$, then $|f(s)-f(t)|<\varepsilon$. Accordingly, if we subtract (2) from (1) and take absolute values, then

$$ \begin{aligned} |I(x_n,\xi_n)-I(y_n,\eta_n)| &\le\sum_{k=1}^N\sum_{m\in P_k}|f(\eta_m)-f(\xi_k)||g(y_m)-g(y_{m-1})| \\ &\le\varepsilon\sum_{m=1}^M|g(y_m)-g(y_{m-1})| \le\varepsilon\operatorname{Var}_{[a,b]}(g) \end{aligned} $$

Now, let $\{z_n\}$ be another partition of $[a,b]$, $\zeta_n$ be its corresponding sampled abscissas, and $\{y_n\}$ be the union of both partitions; then

$$ \begin{aligned} |I(x_n,\xi_n)-I(z_n,\zeta_n)| &=|I(x_n,\xi_n)-I(y_n,\eta_n)-[I(z_n,\zeta_n)-I(y_n,\eta_n)]| \\ &\le|I(x_n,\xi_n)-I(y_n,\eta_n)|+|I(z_n,\zeta_n)-I(y_n,\eta_n)| \\ &\le2\varepsilon\operatorname{Var}_{[a,b]}(g) \end{aligned} $$

which shows the Stieltjes sums satisfy the Cauchy criterion, establishing the theorem. $\square$

In addition to the constraint for the Riemann-Stieltjes integral to exist, we can also transform it into a Riemann integral on specific occasions:

**Theorem 2**: *If $g'(x)$ exists and is continuous on $[a,b]$ then*

$$ \int_a^bf(x)\mathrm dg(x)=\int_a^bf(x)g'(x)\mathrm dx\tag3 $$

*Proof.* Since $g(x)$ is differentiable, we can use the mean value theorem to guarantee the existence of $\eta_k\in[x_{k-1},x_k]$ such that $g(x_k)-g(x_{k-1})=g'(\eta_k)(x_k-x_{k-1})$, so

$$ \begin{aligned} \sum_{k=1}^N|g(x_k)-g(x_{k-1})| &=\sum_{k=1}^N|g'(\eta_k)|(x_k-x_{k-1}) \\ &\to\int_a^b|g'(x)|\mathrm dx\quad(\operatorname{mesh}\{x_n\}\to0) \end{aligned} $$

As a result, $g(x)$ is of bounded variation, implying the existence of the left hand side of (3).

Now, consider the Stieltjes sum; using the mean value theorem step above,

$$
\begin{aligned}
S(x_n,\xi_n)
&=\sum_{k=1}^Nf(\xi_k)g'(\eta_k)(x_k-x_{k-1}) \\
&=\underbrace{\sum_{k=1}^Nf(\xi_k)g'(\xi_k)(x_k-x_{k-1})}_{T_1}
+\underbrace{\sum_{k=1}^Nf(\xi_k)[g'(\eta_k)-g'(\xi_k)](x_k-x_{k-1})}_{T_2}
\end{aligned}
$$

By the uniform continuity of $g'(x)$, we know that for all $\varepsilon>0$ there exists $\delta>0$ such that $|g'(s)-g'(t)|<\varepsilon$ whenever $|s-t|<\delta$, indicating $T_1\to\int_a^bfg'$ and $T_2\to0$ as $\operatorname{mesh}\{x_n\}\to0$. Accordingly, we arrive at the conclusion that

$$ \int_a^bf(x)\mathrm dg(x)=\int_a^bf(x)g'(x)\mathrm dx $$ thus completing the proof. $\square$

In addition, we can also apply integration by parts on Riemann-Stieltjes integrals. Particularly, if we assume $f$ has a continuous derivative and $g$ is of bounded variation on $[a,b]$, then

$$ \int_a^bf(x)\mathrm dg(x)=f(x)g(x)|_a^b-\int_a^bg(x)f'(x)\mathrm dx $$

Let $r(n)$ be some arithmetic function and $R(x)$ be its summatory function

$$ R(x)=\sum_{n\le x}r(n)\tag4 $$

Let $f(t)$ have a continuous derivative on $[0,\infty)$ and let $b>a>0$; then we have

$$ \int_a^bf(x)\mathrm dR(x)=\sum_{k=1}^Nf(\xi_k)[R(x_k)-R(x_{k-1})] $$

where we require that $\operatorname{mesh}\{x_n\}<1$. Recalling (4), we observe that $R(x)$ is a step function that only jumps at integer values, so we only need to sum over the $k_n$'s such that $n\in(x_{k_n-1},x_{k_n}]$. Hence, this integral becomes a summation over the integer values in $(a,b]$:

$$ \begin{aligned} \int_a^bf(x)\mathrm dR(x) &=\sum_{a<n\le b}f(\xi_{k_n})r(n) \\ &=\sum_{a<n\le b}f(n)r(n)+\sum_{a<n\le b}[f(\xi_{k_n})-f(n)]r(n) \end{aligned} $$

Since $f$ is uniformly continuous on $[a,b]$, we have $|f(x)-f(y)|<\varepsilon$ whenever $|x-y|<\delta$; thus, as the mesh shrinks, the second summation is $\mathcal O(\varepsilon)$, leaving us

$$ \int_a^bf(x)\mathrm dR(x)=\sum_{a<n\le b}f(n)r(n)\tag5 $$
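Identity (5) with $r(n)=1$ and $R(x)=\lfloor x\rfloor$ says a Stieltjes integral against $\mathrm d\lfloor x\rfloor$ collapses to a plain sum over integers; the sketch below checks this with a fine partition (the partition size and test function are arbitrary choices):

```python
import math

# Stieltjes sum for f against d(floor(x)) over (a, b], sampling right endpoints.
def stieltjes_floor(f, a, b, N=200000):
    h = (b - a) / N
    total = 0.0
    for k in range(N):
        x0, x1 = a + k * h, a + (k + 1) * h
        total += f(x1) * (math.floor(x1) - math.floor(x0))
    return total

f = lambda x: 1 / x
approx = stieltjes_floor(f, 1.5, 10.5)
exact = sum(1 / n for n in range(2, 11))   # integers in (1.5, 10.5]
assert abs(approx - exact) < 1e-4
```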

Employing (5) in different situations gives us plenty of brilliant results. Let's have a look at some of them:

It is well known that the harmonic series $1+\frac12+\frac13+\cdots$ diverges, and we can provide a formal proof using the Riemann-Stieltjes integral:

$$
\begin{aligned}
\sum_{n\le N}\frac1n
&=\int_{1-\varepsilon}^N{\mathrm d\lfloor x\rfloor\over x} \\
&=\left.{\lfloor x\rfloor\over x}\right|_{1-\varepsilon}^N+\int_{1-\varepsilon}^N{\lfloor x\rfloor\over x^2}\mathrm dx \\
&=\int_{1-\varepsilon}^N{\mathrm dx\over x}+1-\int_{1-\varepsilon}^N{\{x\}\over x^2}\mathrm dx \\
&=\int_1^N{\mathrm dx\over x}+1-\int_1^N{\{x\}\over x^2}\mathrm dx \\
&=\log N+1-\int_1^\infty{\{x\}\over x^2}\mathrm dx+\int_N^\infty{\{x\}\over x^2}\mathrm dx \\
&=\log N+\gamma+\mathcal O\left(\frac1N\right)
\end{aligned}
$$

Since $\log N\to\infty$ as $N\to\infty$, we conclude that the harmonic series diverges.

Next, let's use this machinery to evaluate the alternating series

$$ \sum_{n=1}^\infty{(-1)^n\log n\over n} $$

Now, let's first consider the finite case:

$$ \begin{aligned} \sum_{n=1}^N{(-1)^n\log n\over n} &=2\sum_{n\le N/2}{\log(2n)\over2n}-\sum_{n\le N}{\log n\over n}+\mathcal O\left(\log N\over N\right) \\ &=\sum_{n\le N/2}{\log2+\log n\over n}-\sum_{n\le N}{\log n\over n}+\mathcal O\left(\log N\over N\right) \\ &=\log2\sum_{n\le N/2}\frac1n-\sum_{N/2<n\le N}{\log n\over n}+\mathcal O\left(\log N\over N\right) \end{aligned} $$

In fact, using Riemann-Stieltjes integral, we can show

$$ \begin{aligned} \sum_{N/2<n\le N}{\log n\over n} &=\int_{N/2}^N{\log x\over x}\mathrm d\lfloor x\rfloor \\ &={N\log N-N\log(N/2)\over N}-\int_{N/2}^N[x-\{x\}]\mathrm d\left(\log x\over x\right)+\mathcal O\left(\log N\over N\right) \\ &=\log2-\int_{N/2}^N\left({1-\log x\over x}\right)\mathrm dx+\mathcal O\left(\log N\over N\right) \\ &=\int_{N/2}^N{\log x\over x}\mathrm dx+\mathcal O\left(\log N\over N\right) \\ &=\frac12[\log^2N-\log^2(N/2)]+\mathcal O\left(\log N\over N\right) \\ &=\frac12[\log N+\log(N/2)][\log N-\log(N/2)]+\mathcal O\left(\log N\over N\right) \\ &=\frac12\log2[2\log N-\log2]+\mathcal O\left(\log N\over N\right) \\ &=\log2\log N-\frac12\log^22+\mathcal O\left(\log N\over N\right) \end{aligned} $$

Now, employing this obtained identity and the asymptotic formula for harmonic series yields:

$$ \begin{aligned} \sum_{n\le N}{(-1)^n\log n\over n} &=\log2\left[\log\frac N2+\gamma\right]-\log2\log N+\frac12\log^22+\mathcal O\left(\log N\over N\right) \\ &=\gamma\log2-\frac12\log^22+\mathcal O\left(\log N\over N\right) \end{aligned} $$

Now, taking the limit $N\to\infty$ on both sides gives

$$ \sum_{n\ge1}{(-1)^n\log n\over n}=\gamma\log2-\frac12\log^22 $$
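We can check this value numerically. Since the series alternates with (eventually) decreasing terms, its limit lies between consecutive partial sums, so averaging two of them gives an accurate estimate (the cutoff $N$ is an arbitrary choice):

```python
import math

GAMMA_EM = 0.5772156649015329   # Euler-Mascheroni constant
N = 200000
s = 0.0
sign = 1                        # sign of (-1)^n, starting at n = 2
for n in range(2, N + 1):
    s += sign * math.log(n) / n
    sign = -sign
s_next = s + sign * math.log(N + 1) / (N + 1)   # one more term
approx = (s + s_next) / 2       # midpoint of two consecutive partial sums

exact = GAMMA_EM * math.log(2) - 0.5 * math.log(2) ** 2
assert abs(approx - exact) < 1e-4
```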

If we were to define the prime indicator function

$$ \mathbf1_p(n)= \begin{cases} 1 & \text{$n$ is a prime} \\ 0 & \text{otherwise} \end{cases} $$

Then the prime counting function can be defined as

$$ \pi(x)=\sum_{p\le x}1=\sum_{n\le x}\mathbf1_p(n) $$

Now, let's also define Chebyshev's function $\vartheta(x)$:

$$ \vartheta(x)=\sum_{p\le x}\log p=\sum_{n\le x}\mathbf1_p(n)\log n $$

Hence, we have

$$ \begin{aligned} \pi(x) &=\sum_{n\le x}{1\over\log n}\cdot\mathbf1_p(n)\log n \\ &=\int_{2-\varepsilon}^x{\mathrm d\vartheta(t)\over\log t} \\ &={\vartheta(x)\over\log x}+\int_2^x{\vartheta(t)\over t\log^2t}\mathrm dt \end{aligned} $$

It is known that $\vartheta(x)\sim x$, so plugging it into the above equation gives

$$ \pi(x)\sim{x\over\log x} $$

which is the prime number theorem.
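A small sieve makes the statement tangible: at $x=10^6$ the ratio $\pi(x)\big/\frac{x}{\log x}$ is already around $1.08$, drifting toward $1$ very slowly:

```python
import math

# Sieve of Eratosthenes up to 10^6, then compare pi(x) with x / log(x).
X = 10**6
sieve = bytearray([1]) * (X + 1)
sieve[0] = sieve[1] = 0
for p in range(2, int(X ** 0.5) + 1):
    if sieve[p]:
        sieve[p * p::p] = bytearray(len(range(p * p, X + 1, p)))
pi_x = sum(sieve)               # pi(10^6) = 78498
ratio = pi_x / (X / math.log(X))
assert pi_x == 78498
assert 1.0 < ratio < 1.1
```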

To sum up, in this blog, we first defined and explored the Riemann-Stieltjes integral, then used this integration technique to solve problems via asymptotic expansion. Lastly, we sketched how the prime number theorem follows from $\vartheta(x)\sim x$.

:::important
🌏 Source Downloadable at: Open Access - CVPR 2018. Source code is available at: GitHub - richzhang/PerceptualSimilarity.
:::

The paper argues that widely used image quality metrics like SSIM and PSNR are *simple and shallow* functions that may fail to account for many nuances of human perception. The paper introduces a new dataset of human perceptual similarity judgments to systematically evaluate deep features across different architectures and tasks and compare them with classic metrics.

The findings of this paper suggest that *perceptual similarity is an emergent property shared across deep visual representations.*

In this paper, the authors hypothesize that perceptual similarity is not a special function all of its own, but rather a *consequence* of visual representations tuned to be predictive about important structure in the world.

- To test this hypothesis, the paper introduces a large-scale, highly varied perceptual similarity dataset containing 484k human judgments.
- The paper shows that deep features trained on supervised, self-supervised, and unsupervised objectives alike, model low-level perceptual similarity surprisingly well, outperforming previous, widely-used metrics.
- The paper also demonstrates that network architecture alone doesn't account for the performance: untrained networks achieve much lower performance.

The paper suggests that with this data, we can improve performance by *calibrating* feature responses from a pre-trained network.

This content is less related to my interests, so I'll cover it briefly.

- Traditional distortions: photometric distortions, random noise, blurring, spatial shifts, corruptions.
- CNN-based distortions: input corruptions (white noise, color removal, downsampling), generator networks, discriminators, loss/learning.
- Distorted image patches.
- Superresolution.
- Frame interpolation.
- Video deblurring.
- Colorization.

**2AFC similarity judgments.** Two-alternative forced choice (2AFC) is a method for measuring the subjective experience of a person or animal through their pattern of choices and response times. The subject is presented with two alternative options, only one of which contains the target stimulus, and is forced to choose which one was the correct option. Both options can be presented concurrently or sequentially in two intervals (also known as two-interval forced choice, 2IFC).

**Just noticeable differences.** Just-noticeable difference or JND is the amount something must be changed in order for a difference to be noticeable, detectable at least half the time (absolute threshold). This limen is also known as the difference limen, difference threshold, or least perceptible difference.

The distance between reference and distorted patches $x$ and $x_0$ is calculated using this workflow and the equation below with a network $\mathcal{F}$. The paper extracts feature stacks from $L$ layers and unit-normalizes them in the channel dimension, then scales the activations channel-wise and computes the $\ell_2$ distance.

$$d(x,x_0)=\sum_l\frac{1}{H_lW_l}\sum_{h,w}|w_l\odot(\hat{y}^l_{hw}-\hat{y}^l_{0hw})|_2^2$$
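The distance can be sketched directly from the equation above. In the snippet below, random arrays stand in for a real network's feature activations, and the channel weights are uniform; shapes and values are illustrative only:

```python
import numpy as np

# LPIPS-style distance: unit-normalize channels, scale by w_l, average over
# spatial positions, and sum over layers.
rng = np.random.default_rng(0)
feats_x = [rng.normal(size=(8, 4, 4)), rng.normal(size=(16, 2, 2))]   # (C, H, W)
feats_x0 = [a + 0.1 * rng.normal(size=a.shape) for a in feats_x]
weights = [np.ones(8), np.ones(16)]        # "learned" channel weights (uniform here)

def unit_norm(a):
    return a / (np.linalg.norm(a, axis=0, keepdims=True) + 1e-10)

def lpips(fx, fx0, ws):
    d = 0.0
    for a, b, w in zip(fx, fx0, ws):
        diff = w[:, None, None] * (unit_norm(a) - unit_norm(b))
        d += np.mean(np.sum(diff ** 2, axis=0))  # sum over C, average over H*W
    return float(d)

assert lpips(feats_x, feats_x, weights) == 0.0   # identical patches -> zero
assert lpips(feats_x, feats_x0, weights) > 0.0
```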

The paper considers the following variants:

- *lin*: the paper keeps pre-trained network weights fixed and learns linear weights $w$ on top.
- *tune*: the paper initializes from a pre-trained classification model and allows all the weights for network $\mathcal{F}$ to be fine-tuned.
- *scratch*: the paper initializes the network from random Gaussian weights and trains it entirely on the authors' judgments.

Finally, the paper refers to these as variants of the proposed **Learned Perceptual Image Patch Similarity (LPIPS)** metric.

Figure 4 shows the performance of various low-level metrics (in red), deep networks, and human ceiling (in black).

The 2AFC distortion preference test has high correlation to JND: $\rho = .928$ when averaging the results across distortion types. This indicates that 2AFC generalizes to another perceptual test and is giving us signal regarding human judgments.

Pairs which BiGAN perceives to be far but SSIM to be close generally contain some blur. BiGAN tends to perceive correlated noise patterns to be a smaller distortion than SSIM.

The stronger a feature set is at classification and detection, the stronger it is as a model of perceptual similarity judgments.

Features that are good at **semantic tasks**, are also good at **self-supervised and unsupervised tasks**, and also provide **good models of both human perceptual behavior and macaque neural activity.**

The CW attack algorithm is a very typical adversarial attack, which utilizes two separate losses:

- An adversarial loss to make the generated image actually adversarial, i.e., is capable of fooling image classifiers.
- An image distance loss to constrain the quality of the adversarial examples, so as not to make the perturbation too obvious to the naked eye.

This paradigm makes CW attack and its variants capable of being integrated with many other image quality metrics like the PSNR or the SSIM.

When adversarial examples were first discovered in 2013, the optimization problem to craft adversarial examples was formulated as:

$$\begin{aligned}\text{minimize}:&\ \mathcal{D}(x,x+\delta)\\ \text{such that}:&\ \mathcal{C}(x+\delta)=t&&\text{Constraint 1}\\ &\ x+\delta\in[0,1]^n&&\text{Constraint 2}\end{aligned}$$

Where:

- $x$ is the input image, $\delta$ is the perturbation, $n$ is the dimension of the image and $t$ is the target class.
- Function $\mathcal{D}$ serves as the distance metric between the adversarial and the real image, and function $\mathcal{C}$ is the classifier function.

The traditional, well-known way to solve such an optimization problem is to define an objective function and perform gradient descent on it, which eventually guides us to an optimal point of the function. However, the formulation above is difficult to solve because $\mathcal{C}(x+\delta)=t$ is highly non-linear (the classifier is not a straightforward linear function).

In CW, we express Constraint 1 in a different form as an objective function $f$ such that $\mathcal{C}(x+\delta)=t$ is satisfied if and only if $f(x+\delta) \leq 0$.

Conceptually, the objective function tells us how close we are getting to being classified as $t$. One simple, though not very good, choice for the function $f$ is:

$$f(x+\delta)=1-\mathcal{C}(x+\delta)_t$$

Where $\mathcal{C}(x+\delta)_t$ is the probability of $x+\delta$ being classified as $t$. If the probability is low, then the value of $f$ is close to 1, whereas when the input is classified as $t$ with full confidence, $f$ is equal to 0. This is how the objective function works, but clearly we can't use this in real-world implementations.

In the original paper, seven different objective functions are assessed, and the best among them is given by:

$$f(x')=\max\left(\max\{Z(x')_i:i\neq t\}-Z(x')_t,\ -k\right)$$

Where:

- $Z(x')$ is the vector of logits (the model's unnormalized raw outputs for each class, before the softmax) when the input is an adversarial $x'$.
- $\max\{Z(x')_i:i\neq t\}$ is the largest logit among all classes other than the target, i.e. the class the model currently finds most likely apart from $t$.
- So, $\max\{Z(x')_i:i\neq t\}-Z(x')_t$ is the difference between what the model thinks the current image most probably is and what we want it to think.

The above term is essentially the difference of two logit values, so when we introduce another term $-k$ and take a max, we are setting a lower limit on the value of the loss. Hence, by controlling the parameter $k$ we can specify how confidently we want our adversarial example to be classified as the target.
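This objective translates directly into code; the logit values below are made-up for illustration:

```python
import numpy as np

def f(logits, t, k=0.0):
    # largest non-target logit minus the target logit, floored at -k
    other = np.max(np.delete(logits, t))
    return max(other - logits[t], -k)

logits = np.array([1.0, 3.0, 2.0])  # hypothetical logits of a 3-class model
```

With target $t=2$, the best non-target logit (3.0) leads the target (2.0) by 1.0, so $f=1$; with $t=1$ the input is already classified as the target, and $f$ is clamped at $-k$.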

We then reformulate the original optimization problem by moving the difficult constraint into the minimization objective.

$$\begin{aligned}\text{minimize}:&\ \mathcal{D}(x,x+\delta)+c\cdot f(x+\delta)\\ \text{such that}:&\ x+\delta\in [0,1]^n&&\text{Constraint 2}\end{aligned}$$

Here we introduce a constant $c$ to formulate our final loss function, and by doing so we are left with only one of the prior two constraints. Constant $c$ is best found by binary search, where the lower bound is typically $1\times 10^{-4}$ and the upper bound is $+\infty$.

:::important The best constant
In my personal experiments, the best constant often lies between 1 and 2.
:::

After formulating our final loss function, we are presented with this final constraint:

$$x+\delta\in[0,1]^n$$

This constraint is expressed in this particular form known as the "box constraint", which means that there is an upper bound and a lower bound set to this constraint. In order to solve this, we will need to apply a method called "change of variable", in which we optimize over $w$ instead of the original variable $\delta$, where $w$ is given by:

$$\begin{aligned}\delta &= \frac{1}{2}\cdot (\tanh(w)+1)-x\\ \text{or,}\quad x+\delta&= \frac{1}{2}\cdot (\tanh(w)+1)&&\text{Constraint 2}\end{aligned}$$

Where $\tanh$ is the hyperbolic tangent function; as $\tanh(w)$ varies from $-1$ to $1$, $x+\delta$ varies from $0$ to $1$.
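A quick sketch of why this change of variable removes the box constraint: any real-valued $w$ lands inside $[0,1]$, so the optimizer never needs clipping:

```python
import numpy as np

# arbitrary unconstrained values of w, including extreme ones
w = np.array([-40.0, -1.0, 0.0, 2.5, 40.0])
x_plus_delta = 0.5 * (np.tanh(w) + 1)  # always inside the box [0, 1]
```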

Therefore, our final optimization problem is:

$$\begin{aligned}\text{minimize}:&\ \mathcal{D}\left(\frac{1}{2}\cdot (\tanh(w)+1), x\right)+c\cdot f\left(\frac{1}{2}\cdot (\tanh(w)+1)\right)\\ \text{such that}:&\ \tanh(w)\in [-1,1]\\ \text{where}:&\ f(x')=\max\left(\max\{Z(x')_i:i\neq t\}-Z(x')_t,\ -k\right)\end{aligned}$$

The CW attack is the solution to the optimization problem above (optimized over $w$) using the Adam optimizer. To avoid gradient descent getting stuck, the solver uses gradient descent from multiple starting points.
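As an illustration only (not the paper's setup), the whole pipeline can be sketched on a toy linear "classifier" with a finite-difference gradient in place of Adam and a single starting point; the model, input, and hyperparameters below are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier": logits Z(x) = W @ x (a made-up stand-in for a network)
W = rng.normal(size=(3, 4))
x = rng.uniform(0.2, 0.8, size=4)   # benign input inside [0,1]^4
t = int(np.argmin(W @ x))           # target the currently least-likely class
c, k, lr = 5.0, 0.0, 0.02

def f(xp):
    Z = W @ xp
    other = np.max(np.delete(Z, t))
    return max(other - Z[t], -k)

def loss(wv):
    xv = 0.5 * (np.tanh(wv) + 1)          # change of variable keeps xv in the box
    return np.sum((xv - x) ** 2) + c * f(xv)

w = np.arctanh(2 * x - 1)                 # start from the benign image
eps = 1e-6
for _ in range(2000):
    # finite-difference gradient: fine in 4 dimensions; real attacks use autograd
    g = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                  for e in np.eye(len(w))])
    w -= lr * g

adv = 0.5 * (np.tanh(w) + 1)
```

The box constraint holds by construction, and the adversarial objective $f$ decreases relative to the benign input.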

In number theory, Möbius inversion is a common technique to study the properties of arithmetic functions (i.e. those that map $\mathbb N^+$ to $\mathbb C$), and all of these brilliant things are derived from the following formula:

$$ \sum_{d|n}\mu(d)=\left\lfloor\frac1n\right\rfloor= \begin{cases} 1 & n=1 \\ 0 & \text{otherwise} \end{cases} \tag1 $$

To understand this formula, let's first understand what it says:

The first part of (1) is the summation symbol $\sum_{d|n}$. Instead of summing over all positive integers up to $n$, it sums over the divisors of $n$. For instance, if $n=10$, then $\sum_{d|n}$ sums over $d=1,2,5,10$. Concretely, we write $$ \sum_{d|10}f(d)=f(1)+f(2)+f(5)+f(10) $$

To enhance your understanding of this operator, try these exercises:

- Calculate $\sum_{d|15}d^2$
- Explain the meaning of $\sum_{d|n}1$

Usually, the Möbius function is defined as

$$ \mu(n)= \begin{cases} 1 & \text{$n$ is square-free with an even number of prime factors} \\ -1 & \text{$n$ is square-free with an odd number of prime factors} \\ 0 & \text{$n$ is not square-free} \end{cases} $$

This standard definition may appear to be strange: why would people care whether some number is square-free or not? To address this, let's turn to a totally different perspective: to expand the following product that runs over all prime numbers:

$$ F(s)=\prod_{p\text{ prime}}(1-p^{-s}) $$

With some combinatorial skill, we are able to expand it like this:

$$ \begin{aligned} F(s) &=(1-2^{-s})(1-3^{-s})\cdots \\ &=1-{1\over2^s}-{1\over3^s}-{1\over5^s}+{1\over2^s\cdot3^s}-{1\over7^s}+\cdots-{1\over3^s\cdot5^s\cdot7^s}+\cdots \\ &\triangleq\sum_{n=1}^\infty{a(n)\over n^s} \end{aligned} $$

By observing the expansion, we discover that $a(n)$ must satisfy

- when $n$ is composed of an odd number of distinct prime factors, $a(n)=-1$
- when $n$ is composed of an even number of distinct prime factors, $a(n)=1$
- because each prime appears only once in the product, $a(n)=0$ for all $n$ that contain square factors

As a result, $a(n)$ is the Möbius function $\mu(n)$.

With our understanding of the symbols, we are now able to understand the implication of (1). That is, (1) declares that the sum of the Möbius function over the divisors of some number $n$ is one for $n=1$ and zero for all $n\ne1$. Obviously, when $n=1$, we have

$$ \sum_{d|1}\mu(d)=\mu(1)=1 $$

Did you understand the definition of $\mu(n)$? Try these problems!

- $\mu(20)$
- $\mu(5)\mu(2)$
- $\mu(10)$
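For checking answers to exercises like these, here is a small Python sketch of $\mu(n)$ via trial-division factorization (a straightforward, unoptimized implementation):

```python
def mobius(n):
    # trial-division factorization; returns 0 as soon as a square factor appears
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # p divides n twice: not square-free
            result = -result      # one more distinct prime factor
        p += 1
    if n > 1:                     # leftover prime factor
        result = -result
    return result
```

For instance, `mobius(20)` is 0 (since $4\mid20$), while `mobius(10)` is 1 (two distinct primes).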

Generally, we say an arithmetic function $f(n)$ is **multiplicative** when for all coprime positive integers $a$ and $b$, $f(ab)=f(a)f(b)$. Now, let's show the following fact about $\mu(n)$:

**Theorem**: *$\mu(n)$ is multiplicative*

*Proof.* For coprime positive integers $a$ and $b$, we may divide this proof
into two situations:

- If $a$ and/or $b$ contains square factors, their product $ab$ also has square factors. As a result, $\mu(a)\mu(b)=\mu(ab)=0$
- For $a$ and $b$ both square-free, let $r_n$ denote the number of prime factors of $n$; since $a$ and $b$ share no prime factors, $ab$ is square-free with $r_a+r_b$ prime factors, so we have

$$ \mu(a)\mu(b)=(-1)^{r_a+r_b}=\mu(ab) $$

which completes the proof. $\square$

With these tools being prepared, we can delve into proving (1).

Let $n=ab$ where $a$ and $b$ are coprime positive integers; every divisor $d$ of $n$ then factors uniquely as $d=d_1d_2$ with $d_1|a$ and $d_2|b$, so

$$ \sum_{d|n}\mu(d)=\sum_{d_1|a,d_2|b}\mu(d_1d_2) =\sum_{d_1|a}\mu(d_1)\sum_{d_2|b}\mu(d_2) $$

Now, let's plug prime powers $n=p^k$ (with $k\ge1$) into (1), so we have

$$ \sum_{d|p^k}\mu(d)=\sum_{r=0}^k\mu(p^r)=\mu(1)+\mu(p)=1-1=0 $$

Due to the fact that $\sum_{d|n}\mu(d)$ is multiplicative, we conclude (1) is true.
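Identity (1) is also easy to verify numerically; the sketch below sums $\mu(d)$ over the divisors of each $n$ (with a simple trial-division `mobius` helper):

```python
def mobius(n):
    # simple trial-division Mobius function
    r, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            r = -r
        p += 1
    return -r if n > 1 else r

def mu_divisor_sum(n):
    # left-hand side of (1): sum of mu(d) over all divisors d of n
    return sum(mobius(d) for d in range(1, n + 1) if n % d == 0)
```

The sum is 1 at $n=1$ and 0 for every larger $n$ we test.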

In fact, (1) can help us find a formula for Euler's totient function $\varphi(n)$, i.e. the number of positive integers up to $n$ that are coprime to $n$:

First, we write down $\varphi(n)$ in terms of summation:

$$ \varphi(n)=\sum_{k\le n\atop\gcd(k,n)=1}1 $$

Then, using the identity given by $(1)$, we have

$$ \begin{aligned} \varphi(n) &=\sum_{k\le n}\sum_{d|\gcd(k,n)}\mu(d)=\sum_{k\le n}\sum_{d|n}[d|k]\mu(d) \\ &=\sum_{d|n}\mu(d)\sum_{k\le n\atop d|k}1=\sum_{d|n}\mu(d)\sum_{q\le n/d}1 \end{aligned} $$

Eventually, we obtain the formula for Euler's totient function:

$$ \varphi(n)=\sum_{d|n}\mu(d)\frac nd\tag2 $$

As we can see, (1) actually helps us give definition to other arithmetic functions. Now, it is your job to discover the properties of this function:

Show that $\varphi(n)$ is multiplicative and, in addition, $\varphi(n)$ can be expressed by

$$\displaystyle{\varphi(n)=n\prod_{p\text{ prime}\atop p|n}\left(1-\frac1p\right)}$$
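Formula (2) can be checked against the counting definition of $\varphi(n)$; a small sketch (again with a simple trial-division $\mu$):

```python
from math import gcd

def mobius(n):
    r, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            r = -r
        p += 1
    return -r if n > 1 else r

def phi_by_mu(n):
    # formula (2): phi(n) = sum over d|n of mu(d) * n/d
    return sum(mobius(d) * (n // d) for d in range(1, n + 1) if n % d == 0)

def phi_by_count(n):
    # the counting definition of Euler's totient
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)
```

The two agree for every $n$ we test, e.g. $\varphi(20)=8$.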

If we were to sum $\varphi(n)$ over divisors of $n$, we could magically obtain $n$:

$$ \begin{aligned} \sum_{d|n}\varphi(d) &=\sum_{d|n}\varphi\left(\frac nd\right) =\sum_{d|n}\sum_{k|n/d}k\,\mu\left(n\over dk\right) \\ &=\sum_{dk|n}k\,\mu\left(n\over dk\right) =\sum_{k|n}k\sum_{d|n/k}\mu(d) \\ &=\sum_{k|n}k\left\lfloor\frac kn\right\rfloor=n \end{aligned} $$

This identity can also be seen by listing fractions. For instance let's consider the case for $n=20$

$$ {1\over20},{2\over20},{3\over20},{4\over20},\dots, {18\over20},{19\over20},{20\over20} $$

In total, there are $n$ fractions. If we were to simplify these fractions, we get

$$ {1\over20},{1\over10},{3\over20},{1\over5},\dots, {9\over10},{19\over20},\frac11 $$

Particularly, the denominators in these simplified fractions are always divisors of $n$. Moreover, for each $d|n$ there are exactly $\varphi(d)$ simplified fractions with denominator $d$. As a result, we can also observe that $$ \sum_{d|n}\varphi(d)=n\tag3 $$

If we juxtapose (2) and (3), we can see that $\varphi(n)$ and $n$ are closely related to each other. In particular, if we define the **Dirichlet convolution** as in (4), then we can say that $\varphi(n)$ can be obtained by convolving the Möbius function with $n$. Similarly, $n$ can be obtained by convolving $\varphi(n)$ with $1$.
$$
(f*g)(n)\triangleq\sum_{d|n}f(d)g\left(\frac nd\right)
=\sum_{d|n}g(d)f\left(\frac nd\right)\tag4
$$
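Definition (4) translates directly into code; this sketch (with a simple trial-division $\mu$ and counting $\varphi$) verifies both $\varphi=\mu*\mathrm{Id}$ and $\mathrm{Id}=\varphi*1$ for small $n$:

```python
from math import gcd

def conv(f, g, n):
    # (f * g)(n) = sum over d|n of f(d) g(n/d)
    return sum(f(d) * g(n // d) for d in range(1, n + 1) if n % d == 0)

def mobius(n):
    r, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            r = -r
        p += 1
    return -r if n > 1 else r

def phi(n):
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

one = lambda n: 1     # the constant function 1
ident = lambda n: n   # the identity function Id(n) = n
```

`conv(mobius, ident, n)` reproduces (2), and `conv(phi, one, n)` reproduces (3).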

Commutativity and associativity: Obviously this operation is commutative. Moreover, it can easily be verified that Dirichlet convolution is also associative, using techniques similar to the proof of (3).

Identity element: There is also an identity function for Dirichlet convolution. That is, Dirichlet convolution between any arithmetic function and $\lfloor1/n\rfloor$ gives the original function:

$$ \sum_{d|n}f(d)\left\lfloor\frac dn\right\rfloor=f(n) $$

Dirichlet inverse: If $g(n)$ and $f(n)$ are Dirichlet inverses of each other, then

$$ \sum_{d|n}f(d)g\left(\frac nd\right)=\left\lfloor\frac1n\right\rfloor $$

Although $\mu(n)$ and $1$ are inverses of each other, not every arithmetic function has a Dirichlet inverse. Fortunately, the following theorem helps us determine whether an arithmetic function is Dirichlet-invertible:

**Theorem**: *An arithmetic function $f(n)$ has a Dirichlet inverse if and only if $f(1)\ne0$*

*Proof.* Suppose $f(n)$ has a Dirichlet inverse $g(n)$; then we need to ensure

$$ \sum_{d|n}f(d)g\left(\frac nd\right)=\left\lfloor\frac1n\right\rfloor $$

For $n=1$, we have $g(1)=1/f(1)$, so $f(1)$ must be non-zero in order for its Dirichlet inverse to exist. In addition, for $n>1$ we have

$$ \begin{aligned} 0&=\sum_{d|n}f(d)g\left(\frac nd\right) \\ 0&=\sum_{d|n,\,d>1}f(d)g\left(\frac nd\right)+f(1)g(n) \\ g(n)&=-{1\over f(1)}\sum_{d|n,\,d>1}f(d)g\left(\frac nd\right) \end{aligned} $$

which defines $g(n)$ recursively whenever $f(1)\ne0$, and thus also implies the theorem. $\square$
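The recursion in the proof translates directly into code; this sketch computes the Dirichlet inverse of $f\equiv1$ and verifies that convolving the two gives the identity $\lfloor1/n\rfloor$ (exact arithmetic via `fractions.Fraction`):

```python
from fractions import Fraction

def dirichlet_inverse(f, N):
    # builds g(1..N) via g(n) = -(1/f(1)) * sum over d|n, d>1 of f(d) g(n/d)
    g = {1: Fraction(1, f(1))}
    for n in range(2, N + 1):
        s = sum(f(d) * g[n // d] for d in range(2, n + 1) if n % d == 0)
        g[n] = -s * g[1]
    return g

g = dirichlet_inverse(lambda n: 1, 60)  # inverse of the constant function 1
```

As expected from the text, the inverse of $1$ turns out to be $\mu$: e.g. $g(2)=-1$, $g(4)=0$, $g(6)=1$.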

Let $G$ be the set containing all multiplicative functions and $*$ be the Dirichlet convolution operator, then we can verify that

- For all $f,g\in G$, we have $f*g=g*f\in G$
- For all $f,g,h\in G$, $(f*g)*h=f*(g*h)$
- For all $f\in G$, $(f*\lfloor1/n\rfloor)(n)=f(n)$
- For all $f\in G$, there exists $g\in G$ such that $(f*g)(n)=\lfloor1/n\rfloor$

In order for the last condition to hold, readers could consider proving that every multiplicative function $f(n)$ satisfies $f(1)=1\ne0$.

As a result, we conclude that all multiplicative functions form an **abelian group** under Dirichlet convolution.

In a nutshell, we began our discussion with the explanation and proof of the Möbius-function identity (1), and then presented Dirichlet convolution, a generalization of the sum-over-divisors operation. At last, we discovered an algebraic property of multiplicative functions: all multiplicative functions form an abelian group under Dirichlet convolution.

In mathematics, a norm is a function from a vector space over the real or complex numbers to the nonnegative real numbers, that satisfies certain properties pertaining to scalability and additivity and takes the value zero only if the input vector is zero. A pseudonorm or seminorm satisfies the same properties, except that it may have a zero value for some nonzero vectors. [^1]

Recently, during our research on adversarial attacks, we needed to quantitatively measure the "perturbation size" between adversarial examples and their corresponding benign images. In fact, in machine learning, adversarial examples and other images alike can essentially be represented as vectors, stored and computed as NumPy arrays. This article briefly describes some $\ell_p$ norm calculations and implementations.

:::note What are adversarial examples?
Adversarial examples are a type of vulnerability in neural network models. For an image classification model, adversarial examples are produced by adding imperceptible perturbations to the input images, so that the model misclassifies the contents of the images.
:::

The $\ell_p$ norm is actually a "family of norms" on a vector space [^2]. In my line of research, $\ell_p$ norms are also often used to measure the magnitude of the "perturbation" of an adversarial example. We define the $\ell_p$ norm as,

$$ \ell_p:\ L_p(\vec{x})=\left(\sum_{i=1}^n |x_i|^p\right)^{1/p}, $$

in which $p$ can be $1$, $2$, or $\infty$. These are naturally called the $\ell_1$ norm, $\ell_2$ norm, and $\ell_{\infty}$ norm.

Technically, the $\ell_0$ "norm" is not a norm (because, by definition, $p$ in $1/p$ cannot be 0). Still, this norm represents **the number of non-zero elements in the vector**. In the context of adversarial attacks, it represents the number of non-zero elements in the "perturbation" vector.

The $\ell_1$ norm represents **the sum of the absolute values of a vector's components**. A helpful picture: in a vector space, you walk from the start of a vector to its end while only moving parallel to the coordinate axes, and the distance you travel is the $\ell_1$ norm of the vector.

For a two-dimensional vector with components $a$ and $b$, $\ell_1$ can be calculated according to the following equation,

$$ \ell_1(\vec{v})=|\vec{v}|_1=|a|+|b| $$

much the way a New York taxicab travels along its route. Therefore, the $\ell_1$ norm is also known as the Taxicab norm or Manhattan norm. Generally, it is formulated as,
$$
\ell_1(\vec{x})=\|\vec{x}\|_1=\sum_{i=1}^n |x_i|.
$$

The $\ell_2$ norm is one of the more commonly used measures of vector size in the field of machine learning. $\ell_2$ norm, also known as the Euclidean norm, represents the shortest distance required to travel from one point to another.

In the same two-dimensional example, the $\ell_2$ norm is calculated according to the following equation,

$$ \ell_2(\vec{v})=|\vec{v}|_2=\sqrt{(|a|^2+|b|^2)} $$

More generally, it is formulated as, $$ \ell_2(\vec{x})=\|\vec{x}\|_2=\sqrt{|x_1|^2+|x_2|^2+\cdots+|x_n|^2}. $$

The $\ell_\infty$ norm is the easiest to understand: it is the length (size) of the element with the largest absolute value in the vector.

$$ \ell_\infty(\vec{v})=|\vec{v}|_\infty=\max(|a|,|b|) $$

For example, given a vector $\vec{v}=[-10,3,5]$, the $\ell_\infty$ norm of the vector is $10$.
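The example above can be checked directly with NumPy; this sketch computes all four norms of $\vec v=[-10,3,5]$:

```python
import numpy as np
from numpy.linalg import norm

v = np.array([-10.0, 3.0, 5.0])
l0   = norm(v, 0)        # number of non-zero entries: 3
l1   = norm(v, 1)        # |-10| + |3| + |5| = 18
l2   = norm(v)           # sqrt(100 + 9 + 25) = sqrt(134)
linf = norm(v, np.inf)   # largest absolute value: 10
```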

In my research, I tend to use the $\ell_p$ norm to measure the perturbation size of adversarial examples. However, instead of relying on any framework, we implement our attacks from scratch, which means nothing automatically outputs distance values for the $\ell_p$ norms, so I need NumPy to calculate them instead.

For an image `img` and its adversarial example `adv`, we can easily compute the perturbation `perturb`.

```python
# perturb is a numpy array
perturb = adv - img
```

Then, we can use NumPy to compute the $\ell_p$ norms of the perturbation `perturb`.

```python
# import numpy and relevant libraries
import numpy as np
from numpy.linalg import norm

# flatten the image array first: vector norms such as ord=0
# are only defined for 1-D inputs
v = perturb.ravel()
_l0 = norm(v, 0)         # L0
_l1 = norm(v, 1)         # L1
_l2 = norm(v)            # L2 (the default)
_linf = norm(v, np.inf)  # L∞
```

In fact, the implementation of `numpy.linalg.norm` is just a vector operation that uses the definition of $\ell_p$. For example, $\ell_1(x)$ is just,

`_l1_x = np.sum(np.abs(x))`

and $\ell_\infty(x)$ is just,

`_linf_x = np.max(np.abs(x))`

and so on.

However, in production, barring special cases, we should directly use `numpy.linalg.norm`. On top of that, NumPy's official documentation also gives methods for calculating $\ell_p$ norms in different cases for matrices and vectors.
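As a quick sanity check that `numpy.linalg.norm` matches the manual definitions, a small sketch with a made-up vector:

```python
import numpy as np
from numpy.linalg import norm

x = np.array([3.0, -4.0, 0.0, 12.0])
manual_l1   = np.sum(np.abs(x))       # 19
manual_linf = np.max(np.abs(x))       # 12
manual_l2   = np.sqrt(np.sum(x ** 2)) # 13
```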

- NumPy Documentation - `numpy.linalg.norm`
- Medium - L0 Norm, L1 Norm, L2 Norm & L-Infinity Norm
- Gentle Introduction to Vector Norms in Machine Learning

[^1]: Wikipedia - Norm
[^2]: Wikipedia - $\ell_p$ Space

This is the index of a OneDrive public folder, hosted on Cloudflare Workers.

Live demo: 📚 Toby's OneDrive Index

It uses Cloudflare servers as download proxies, in order to get higher speeds in mainland China.

It can also cache OneDrive files using the Cloudflare CDN. There are two caching methods:

- Full: files are first transferred to Cloudflare servers and then to clients, but large files may encounter timeout errors due to Cloudflare Worker execution constraints.
- Chunk: streaming and caching, but Content-Length can't be returned correctly.

The cache feature is enabled in the configuration, where the caching method and the cache path can also be configured.

The latest features are listed below. We'll try to keep this list up-to-date.

- File icon rendered according to file type. Emoji as folder icon when available (if the first character of the folder name is an emoji).
- Use Font Awesome icons instead of material design icons (For better design consistency).
- **Add breadcrumbs for better directory navigation.**
- Support file previewing:
  - Images: `.png`, `.jpg`, `.gif`.
  - Plain text: `.txt`.
  - Markdown: `.md`, `.mdown`, `.markdown`.
  - Code: `.js`, `.py`, `.c`, `.json` ...
  - **PDF: Lazy loading, loading progress and built-in PDF viewer.**
  - **Music / Audio:** `.mp3`, `.aac`, `.wav`, `.oga`.
  - **Videos:** `.mp4`, `.flv`, `.webm`, `.m3u8`.
  - ...
- Code syntax highlight in GitHub style. (With PrismJS.)
- Image preview supports Medium style zoom effect.
- Token cached and refreshed with Google Firebase Realtime Database. (~~Because I can't afford Cloudflare Workers KV storage.~~ 😢)
- Route lazy loading with the help of Turbolinks®. (Somewhat buggy when going from `folder` to `file preview`, but not user-experience degrading.)
- ...

- CSS animations all the way.
- Package source code with wrangler and webpack.
- Convert all CDN assets to load with jsDelivr.
- No external JS scripts, **all scripts are loaded with webpack!** (Other than some libraries.)
- ...

- Cloudflare Workers
- Google Firebase
- OneDrive developer platform
- Microsoft Azure App Service documentation
- turbolinks/turbolinks
- francoischalifour/medium-zoom
- PrismJS/prism
- sindresorhus/github-markdown-css

📁 LIB-HFI © Jason Liu. Released under the MIT License.

Authored and maintained by Jason Liu.

In this past year, the code in the repository became applications, and then became legend.

The contemplation and the sacrifices, even the joy of eureka moments, have been washed away by the flood of time.

No, it is better this way; most of it need not be remembered.

Because it was not abstract; having lost its precision, it became something ambiguous in the depths of the mind.

One day, we as we are now may also become an abstraction in someone else's hands.

:::note
This article explains the differences between the `apt` command and `apt-get`. It also lists several common `apt` commands and the corresponding `apt-get` commands they replace.
:::
First, let's go back to 2016. Back then, Tsai Ing-wen was inaugurated as president and Park Geun-hye was removed from the presidency; Kim Jong-un was still setting off fireworks, and the UK was still a European country; the top-spec iPhone cost only 6988 yuan, and you could turn off the TV after the first two laps of a Formula 1 race; Frank Lee was still a middle schooler, and Donald Trump was still selling houses... But most worth mentioning is the release of Ubuntu 16.04, in which the package manager changed from `apt-get` to `apt`. In fact, `apt` had been introduced as early as 2014, but people only began to notice it after the release of Ubuntu 16.04 in 2016.

Nowadays, instead of `apt-get install <package>`, people more often see `apt install <package>`. As a result, many other Debian-derived Linux distributions have begun to follow Ubuntu's lead and encourage users to use `apt` rather than `apt-get`.

At this point, some may wonder: "What's the difference between `apt` and `apt-get`?", "Why make a new package manager when one already exists?", "In what ways is `apt` better than `apt-get`?", "Which one is better in practice?" These are exactly the questions this article addresses.

`apt` vs `apt-get`

:::warning About `apt` in Linux Mint
A few years ago, Linux Mint made a Python module called `apt`. It actually uses `apt-get` underneath, but adds some handy options. The `apt` discussed in this article is not the same as the one from Linux Mint.
:::

Before exploring the differences between `apt` and `apt-get`, let's go over some background on these commands.

What is `apt`? Debian is, so to speak, the ancestor of distributions such as Ubuntu, Linux Mint, and elementary OS. It has a reliable package system, and every piece of software in the system is a package in binary or source form. To manage this system, Debian uses a toolset called APT (Advanced Packaging Tools). Please don't confuse it with the `apt` command; they are not the same thing.

In Debian-derived Linux distributions, there are various tools that can call APT to install, remove, and update packages. The aforementioned `apt` and `apt-get` are CLI tools of this kind. Besides them, Aptitude is another such tool, offering both a CLI and a GUI.

If you browse the wiki a bit, you may also find commands like `apt-cache`. And that is exactly where the problem lies.

You see, these commands are rather low-level; if you are an ordinary Linux user, there are parts of their functionality you may never use. Moreover, most of the commonly used package management commands are scattered across `apt-get` and `apt-cache`.

This problem was solved with the introduction of `apt`. `apt` includes the commonly used functionality of `apt-get` and `apt-cache`, while leaving out some of their ambiguous and rarely used features. `apt` can also manage the `apt.conf` file.

With `apt`, users no longer have to juggle `apt-get` and `apt-cache` to manage packages. `apt` has a more rigorous structure, and the package management commands you need are essentially all there.

*Bottom line:* `apt` = the most commonly used options of `apt-get` and `apt-cache`.

What is different between `apt` and `apt-get`? With `apt`, users have a unified, curated set of essential tools at hand, instead of losing themselves in piles of options. The very goal of `apt` is to provide end users with a pleasant package management experience.

For example, unlike with `apt-get`, users see a progress bar when installing or removing software with `apt`.

As another example, when users update `apt`'s package repositories, `apt` also tells them the number of packages that can be upgraded.

Of course, similar functionality can be achieved with `apt-get` plus extra options, but in `apt` these features are enabled by default.

What is different between the commands of `apt` and `apt-get`? Although the options of `apt` and `apt-get` have much in common, this does not mean that `apt` is backward compatible with `apt-get`. In other words, simply replacing "apt-get" with "apt" in an `apt-get` command will not necessarily work.

Let's first look at which common `apt-get` and `apt-cache` commands the `apt` commands replace.

| `apt` command | replaced `apt-get` / `apt-cache` command | function |
|---|---|---|
| `apt install` | `apt-get install` | install a package |
| `apt remove` | `apt-get remove` | remove a package |
| `apt purge` | `apt-get purge` | remove a package along with its configuration files |
| `apt update` | `apt-get update` | refresh the repository index |
| `apt upgrade` | `apt-get upgrade` | upgrade all upgradable packages |
| `apt autoremove` | `apt-get autoremove` | remove unused dependencies |
| `apt full-upgrade` | `apt-get dist-upgrade` | upgrade all upgradable packages and automatically handle (remove) dependencies |
| `apt search` | `apt-cache search` | search for a package |
| `apt show` | `apt-cache show` | show details of a package |

`apt` also has a few commands of its own.

| new apt command | function of the command |
|---|---|
| `apt list` | list packages, showing their status (e.g. installed, upgradable) |
| `apt edit-sources` | edit the source list |

It is worth pointing out that `apt` is under long-term maintenance, so we will see new options added in the future.

I have not found any news that `apt-get` will be discontinued, and it is unlikely to be. After all, `apt-get` still offers more functionality.

For lower-level operations such as scripting, `apt-get` is still the more commonly used tool.

Having read this far, you may be wondering whether to use `apt` or `apt-get`. As an ordinary Linux user, I would choose `apt` (~~pacman~~ 🤦).

`apt` is the command recognized and recommended by most Debian-derived Linux distributions, and it provides all the options I need to manage packages. Most importantly, its options are simpler.

Unless I need the extra functionality in `apt-get`, I can find no reason to keep using it.

I hope this article has made the differences between `apt` and `apt-get` clear. To summarize the discussion:

- The `apt` command is a subset of the union of the `apt-get` and `apt-cache` commands, containing the options necessary for everyday package management;
- Although `apt-get` is not deprecated, for ordinary users this article recommends `apt` over `apt-get`.

- First, the operating-system partition is cloned. (Thanks to copy-on-write, this takes no extra space.)
- Then, the update is applied to the cloned copy of the operating system. (This is what happens while the "preparing update" progress bar is running.)
- Next, the system verifies this partition file by file: every file in the partition goes through an MD5 check, ensuring nothing went wrong during the update. (This is also why that progress bar takes so long.)
- The phone then reboots from the operating-system partition that received the update.
- Only after the system boots successfully and all checks pass (the second progress bar) is the old operating-system partition deleted.

This series of steps makes OTA system updates theoretically almost incapable of bricking a device. Even if the update is interrupted, the device can boot from the old operating-system partition, and we can simply retry the update.

However, Ruby is, after all, a scripting language, and its runtime speed is worrying. So the French developer Pierre Peltier, taking colorls as a reference, used the lower-level, statically compiled language Rust to write a replacement that has the speed of ls together with the pretty colors and icons of colorls -- lsd.

=> Installation

*P.S. You can add `alias ls='lsd'` to your shell's configuration file to invoke lsd via ls.*