SparSamp: Provably Secure Steganography Based on Sparse Sampling (USENIX Security 2025)

June 01, 2025

Paper title: SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling

SparSamp_Title

paper cite code

Before reading this post, it is recommended to first read the Discop blog post, which also covers the background on provably secure steganography.

This post focuses only on the steganographic construction itself and does not discuss engineering issues such as inconsistent token sequences between embedding and extraction.

Discop base

Discop pioneered the steganographic construction based on “distribution copies” (or equivalently, on cyclic rotation).

Let’s first review the basic construction of Discop (Discop base).

At each time step $t$, sampling consumes a pseudo-random number $r^{(t)}\in \left[0,1\right)$ generated by a PRNG.

The rotation distance corresponding to a message $\mathbf{m}=\left\{m_0,\dots,m_{l_m-1}\right\}$ of length $l_m$ is

$$d_{\mathbf{m}}=\frac{\mathrm{int}(\mathbf{m})}{2^{l_m}},\tag{1}$$

where $\mathrm{int}(\mathbf{m})$ denotes the integer whose binary representation is $\mathbf{m}$.

The value obtained by rotating $r^{(t)}$ to the right by $d_{\mathbf{m}}$ within the interval $\left[0,1\right)$ is

$$r^{(t)}_{\mathbf{m}}=\left(r^{(t)}+d_{\mathbf{m}}\right)\bmod 1.\tag{2}$$
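As a toy illustration of formulas $(1)$ and $(2)$ (a minimal sketch under assumed values, not code from either paper), all rotated copies of a single PRNG output can be enumerated as follows:

```python
def mrns(r: float, l_m: int) -> list[float]:
    """All 2**l_m rotated copies of the PRNG output r:
    one rotation d = int(m) / 2**l_m per l_m-bit message m."""
    N = 2 ** l_m
    return [(r + j / N) % 1.0 for j in range(N)]

# r = 0.575 and l_m = 3 are the values reused in the example below.
print([round(x, 3) for x in mrns(0.575, 3)])
# [0.575, 0.7, 0.825, 0.95, 0.075, 0.2, 0.325, 0.45]
```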

If multiple $r^{(t)}_{\mathbf{m}}$ (arrows in the figure below) fall into the interval corresponding to the same token, decoding may be ambiguous.

SparSamp fig4 ambiguity

To prevent ambiguity, Discop base introduces the concept of disputed ranges. If $r^{(t)}$ falls into a disputed range (the gray-shaded area in the figure below), the encoder must fall back to embedding less information. One can derive that the embedding rate of Discop base is only approximately the min-entropy of the distribution (for the detailed derivation, see Q1. Detailed description of embedded rate calculation in the Discop blog post), which leaves a large gap to the theoretical limit (the entropy of the distribution).

Discop_disputed_ranges_en

To improve the embedding rate (i.e., Utilization), Discop decomposes the original single round of multivariate sampling into multiple rounds of binary sampling via a Huffman tree, raising the embedding rate to approximately 92%–95% of the theoretical limit.

SparSamp presents another solution. Before introducing the basic idea of SparSamp, here are some definitions from the SparSamp paper.

Some definitions in SparSamp

Message-derived random numbers (MRNs): A set consisting of the pseudo-random number $r^{(t)}$ and the numbers obtained by rotating $r^{(t)}$, divided into two types:

  • eMRN: The MRN corresponding to the message
  • oMRNs: All MRNs other than eMRN

Sampling interval: The (cyclic shift) distance between adjacent MRNs.

Decoding ambiguity: One token corresponds to multiple MRNs.

Basic idea of SparSamp

SparSamp also draws on the idea behind arithmetic-coding-based steganography: at time step $t$, multiple MRNs may fall into the interval of the current token $w_t$; the receiver can gradually narrow down the candidates using the subsequent tokens $w_{t+1}, w_{t+2}, \dots$ and eventually pin down the eMRN corresponding to the message.

Now, rewrite formula $(1)$ as

$$d_{\mathbf{m}}=\frac{k_{\mathbf{m}}}{N_m},\tag{3}$$

where $N_m$ is the number of current candidate messages (i.e., messages still possible), and $k_{\mathbf{m}}$ is the index (starting from 0) of the message $\mathbf{m}$ among the current candidates, satisfying $k_{\mathbf{m}}\in\left[0,N_m\right)$.

Formula $(1)$ can be regarded as the special case of formula $(3)$ with $N_m=2^{l_m}$ and $k_{\mathbf{m}}=\mathrm{int}(\mathbf{m})$, i.e., when the number of candidates is a power of two for some $l_m\in \mathbb{Z}^+$.

Let’s directly look at an example of the SparSamp message embedding process:

SparSamp_example

  1. The initial context is $\mathcal{C}$. Suppose we want to embed the 3-bit message $\mathbf{m} =$ 111, which is the 7th (counting from 0) among $\{000, \dots, 111\}$. Then $N_m = 2^3 = 8$ and $k_{\mathbf{m}} = 7$. Consume one pseudo-random number generated by the PRNG (assumed to be 0.575), and compute the MRNs corresponding to the different candidate messages via formula $(2)$.
  2. The eMRN corresponding to 111 falls into the interval of token c, so output c as the next token. Since the oMRNs corresponding to 110 and 000 also fall into this interval, we must continue.
  3. The current context is $\mathcal{C}\|\text{c}$. There are now 3 surviving MRNs, and $\mathbf{m}$ is the 1st (counting from 0) in the previous step's ordering $\{110, 111, 000\}$. Then $N_m = 3$ and $k_{\mathbf{m}} = 1$. Consume one pseudo-random number generated by the PRNG (assumed to be 0.754), and compute the MRNs corresponding to the different candidate messages via formula $(2)$.
  4. The eMRN corresponding to 111 falls into the interval of token a, so output a as the next token. Since the eMRN is the only MRN falling into this interval, the algorithm stops. (The algorithm stops when the message is uniquely determined, i.e., there is no decoding ambiguity.)
  5. The final content is $\mathcal{C}\|\text{c a}$.
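Here is a self-contained sketch that reproduces this walk-through in code. The two toy next-token distributions are my assumptions, chosen only to be consistent with the intervals in the example (token c covering $[0.3, 0.6)$ at step 1, token a covering $[0, 0.3)$ at step 2); the real distributions would come from the language model:

```python
import math

# Assumed toy next-token distributions, consistent with the example.
step_dists = [
    [("a", 0.3), ("c", 0.3), ("b", 0.4)],   # P(token | C):      c -> [0.3, 0.6)
    [("a", 0.3), ("b", 0.4), ("c", 0.3)],   # P(token | C || c): a -> [0.0, 0.3)
]
prns = [0.575, 0.754]                        # PRNG outputs from the example

def interval_containing(dist, x):
    """Token whose probability interval within [0, 1) contains x."""
    lo = 0.0
    for tok, p in dist:
        if lo <= x < lo + p:
            return tok, lo, lo + p
        lo += p
    raise ValueError(x)

k, N = 7, 8                                  # message 111 among 2**3 candidates
stego = []
for dist, r in zip(step_dists, prns):
    e_pos = (r + k / N) % 1.0                # position of the eMRN, formula (2)
    tok, SE0, SE1 = interval_containing(dist, e_pos)
    stego.append(tok)
    t0 = math.ceil((SE0 - r) * N)            # surviving rotation indices [t0, t1)
    t1 = math.ceil((SE1 - r) * N)
    k = k - N - t0 if k + r * N >= N else k - t0   # re-index the eMRN
    N = t1 - t0                              # surviving candidate count
    if N == 1:                               # eMRN uniquely determined: stop
        break

print(stego)  # ['c', 'a']
```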

The message extraction process is then straightforward: each time a token is read, some MRNs can be excluded, until only the eMRN corresponding to the message remains.
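The receiver's side can be sketched the same way (same assumed toy data as in the embedding sketch above): start from all $2^{l_m}$ candidates and, after each received token, keep only those whose MRN lands in that token's interval, re-indexing the survivors exactly as the sender does:

```python
import math

step_dists = [
    [("a", 0.3), ("c", 0.3), ("b", 0.4)],    # assumed toy distributions
    [("a", 0.3), ("b", 0.4), ("c", 0.3)],
]
prns = [0.575, 0.754]                        # shared PRNG outputs
stego = ["c", "a"]                           # received tokens

def interval_of(dist, tok):
    """Probability interval [SE0, SE1) of a given token."""
    lo = 0.0
    for t, p in dist:
        if t == tok:
            return lo, lo + p
        lo += p
    raise KeyError(tok)

l_m = 3
N = 2 ** l_m
cands = {msg: msg for msg in range(N)}       # original message -> current index
for tok, dist, r in zip(stego, step_dists, prns):
    SE0, SE1 = interval_of(dist, tok)
    t0 = math.ceil((SE0 - r) * N)
    t1 = math.ceil((SE1 - r) * N)
    survivors = {}
    for msg, idx in cands.items():
        if SE0 <= (r + idx / N) % 1.0 < SE1:          # this MRN stays alive
            j = idx - N if idx + r * N >= N else idx  # unwrap the rotation
            survivors[msg] = j - t0                   # new index among survivors
    cands, N = survivors, t1 - t0
    if N == 1:
        break

(msg,) = cands
print(format(msg, "03b"))  # -> 111
```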

SparSamp achieves $O(1)$ time

At each time step $t$, since the MRNs are uniformly spaced on $\left[0,1\right)$, there is no need to check one by one which token's interval each MRN falls into. It suffices to determine which token's interval the eMRN falls into, and how many oMRNs on its left and on its right fall into the same interval.

From another perspective, analogous to a block cipher, the original message is split into groups of length $l_m$, which are processed in turn (a trivial sketch of this grouping follows below). Each group spans multiple time steps, and at each time step $k_{\mathbf{m}}$ and $N_m$ in formula $(3)$ are updated.
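For concreteness, the grouping step might look like this (a trivial sketch; the helper name is mine, not the paper's):

```python
def split_into_groups(bits: str, l_m: int) -> list[str]:
    """Split a bit-string message into groups of at most l_m bits;
    each group is embedded with N = 2**l_m candidate messages."""
    return [bits[i:i + l_m] for i in range(0, len(bits), l_m)]

print(split_into_groups("1110100", 3))  # ['111', '010', '0']
```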

The algorithm for updating $k_{\mathbf{m}}$ and $N_m$ in the paper (Algorithm 2):

SparSamp_algo2_sparse.png

Plugging in the numbers from the example above:

initial: k = 7, N = 8
r = 0.575
SE0 = 0.3, SE1 = 0.6

t0 = ceil((0.3 - 0.575) * 8) = -2
t1 = ceil((0.6 - 0.575) * 8) = 1

branch: k + r * N = 7 + 0.575 * 8 = 11.6 >= N = 8  (the eMRN wrapped past 1)
update: k = k - N - t0 = 7 - 8 - (-2) = 1
        N = t1 - t0 = 1 - (-2) = 3
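Packaged as a function, the update rule reads as follows (a sketch of Algorithm 2 as I understand it; the function name and signature are mine):

```python
import math

def update_k_N(k: int, N: int, r: float, SE0: float, SE1: float):
    """Given the eMRN index k among N candidates, the PRNG output r, and
    the emitted token's interval [SE0, SE1), return the eMRN's new index
    and the new number of candidate messages."""
    t0 = math.ceil((SE0 - r) * N)   # first surviving rotation index (unwrapped)
    t1 = math.ceil((SE1 - r) * N)   # one past the last surviving index
    if k + r * N >= N:              # the eMRN wrapped around past 1.0
        k = k - N - t0
    else:
        k = k - t0
    return k, t1 - t0

print(update_k_N(7, 8, 0.575, 0.3, 0.6))  # -> (1, 3), matching the numbers above
```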

Experiments and evaluations

Effect of the group length $l_m$ on embedding rate, embedding and extraction speed, and extraction accuracy, for GPT-2 with Top-$p$, $p = 1.00$ (no truncation):

SparSamp_expr_speed.png

It can be seen that:

  • The embedding and extraction speeds of SparSamp (both in bits/s) are essentially equal.
  • Utilization increases with $l_m$, but for $l_m\ge 1024$ the message can no longer be extracted correctly, so $l_m$ is at most 1023.

Comparison of Utilization, embedding and extraction speeds for GPT-2 Top-$p$, $p\in[0.80,1.00]$ (reflecting high-entropy scenarios):

SparSamp_expr_utilization_speed_cmp

Note: Here, $l_m$ of SparSamp is set to 64. If $l_m$ is set to 1023, Utilization can approach 99%-100%, and the embedding and extraction speeds will be slightly lower.

Capacity efficiency

Comparison of Utilization of different steganography algorithms for GPT-2 Top-$k$, $k\in[0,100]$ (reflecting low-entropy scenarios):

SparSamp_expr_utilization_cmp

It can be seen that the Utilization of SparSamp is very close to that of the steganography algorithm based on arithmetic coding (close to 1).

Time efficiency

SparSamp_expr_speed_cmp.png

Explanation of abbreviations in the table:

  • ATST: Average Token Sampling Time, in s/token
  • SITR: Sampling-to-Inference Time Ratio

It can be seen that:

  • Compared with ordinary (non-steganographic) generation, i.e., random sampling, SparSamp introduces very little additional time (as expected from its theoretical $O(1)$ per-step complexity), and it remains fast even at $p = 1.00$ (i.e., no truncation).

Security

Measured by average KLD and maximum KLD: like Discop, SparSamp does not change the distribution (a more rigorous statement: it does not change the distribution by an amount non-negligible in the key length), so there is not much to say here.

Conclusion

The paper proposes a provably secure steganographic construction based on sparse sampling. Its advantages:

  1. High speed: Compared with general (non-steganographic) generation, the additional time introduced in each step of sampling is only $O(1)$.
  2. High utilization: The embedding rate is very close to the theoretical limit, i.e., Utilization is very close to 100%.
  3. Provably secure.

Personally, I consider this construction a clever combination of Discop base and arithmetic-coding-based steganography.


Authors: Yaofei Wang, Gang Pei, Kejiang Chen, Jinyang Ding, Chao Pan, Weilong Pang, Donghui Hu, and Weiming Zhang

Potentially Inappropriate Wording in the Paper

1 Introduction

While these methods successfully maintain the original distribution, their embedding capacity is significantly lower than AC-based steganography’s, particularly in low entropy situations.

This is a bit imprecise. In fact, Discop's Utilization is about 92%–96%, which is not much lower than AC's (about 99%–100%) and hardly "significantly lower"; this is also reflected in Table 4 of the SparSamp paper itself.

In addition, Discop [16], iMEC [14], and Meteor [28] introduce high computational complexity, with embedding time consumption that is even ten or hundreds of times longer than the inference time of the model, which is unacceptable.

Section V.C.(2) of the Discop paper compares the running time of Discop with that of ordinary generation (random sampling); the results in its Table III show that Discop does not introduce nearly that much additional time.

This post is licensed under CC BY-NC-SA 4.0. Please credit ComyDream when reposting.
Permalink: http://comydream.github.io/2025/06/01/sparsamp-sec25/