Intro
天之道,损有余而补不足;人之道,损不足以奉有余。 —-老子《道德经》
大概意思是自然的规律是多的地方补不足的地方,比如水往低处流。而人道与此相反。
下面就来说说这个神奇的算法。
采样
假设给定一个盒子,盒子中放有各种颜色的若干个相同大小的小球,对小球进行随机采样,即每次从盒子中摸出一个,记录颜色;放回,然后继续。。。
那如何利用计算机模拟这一随机采样的过程呢?
抽象成一个如下问题,给定如下一个集合 [a:1,b:2,c:3,d:4],对该集合中的字母按其权重采样,一个直观的方法是对权重值加和,统计每个字母所占比例,计算对应的CDF(cumulative distribution function),如下表:
val | a:1 | b:2 | c:3 | d:4 |
prob | 0.1 | 0.2 | 0.3 | 0.4 |
cum | 0.1 | 0.3 | 0.6 | 1.0 |
令p=rand()产生一个0~1的随机值,然后看p落在CDF的哪个范围内,就取该值返回。如p=0.4,属于0.3~0.6,那么就取c作为采样值返回。通过二分查找,该取样的时间复杂度为O(logN)。
Alias
别名表的方法可以让采样在
采样值通过两次rand()来决定。
- 第一轮 生成一个随机数,确定一个采样值
- 第二轮 生成另一个随机数,根据alias表中的概率,判断选择该值还是其别名。
val | a:1 | b:2 | c:3 | d:4 |
alias | 3 | 4 | 4 | - |
prob | 0.4 | 0.8 | 0.6 | 1.0 |
图解:
代码实现
下面是借鉴的line中的部分代码:
double *prob;
long long *alias;
int num_edges;
void InitAliasTable()
{
prob = (double *)malloc(num_edges*sizeof(double));
alias = (long long *)malloc(num_edges*sizeof(long long));
if (alias == NULL || prob == NULL)
{
printf("Error: memory allocation failed!\n");
exit(1);
}
double *norm_prob = (double*)malloc(num_edges*sizeof(double));
long long *large_block = (long long*)malloc(num_edges*sizeof(long long));
long long *small_block = (long long*)malloc(num_edges*sizeof(long long));
if (norm_prob == NULL || large_block == NULL || small_block == NULL)
{
printf("Error: memory allocation failed!\n");
exit(1);
}
double sum = 0;
long long cur_small_block, cur_large_block;
long long num_small_block = 0, num_large_block = 0;
for (long long k = 0; k != num_edges; k++)
sum += edge_weight[k];
for (long long k = 0; k != num_edges; k++)
norm_prob[k] = edge_weight[k] * num_edges / sum;
for (long long k = num_edges - 1; k >= 0; k--)
{
if (norm_prob[k]<1)
small_block[num_small_block++] = k; //比随机采样要小的概率选中
else
large_block[num_large_block++] = k; //比随机采样要大的概率选中
}
while (num_small_block && num_large_block) //损有余而补不足
{
cur_small_block = small_block[--num_small_block]; //不足
cur_large_block = large_block[--num_large_block]; //有余
prob[cur_small_block] = norm_prob[cur_small_block];
alias[cur_small_block] = cur_large_block;
norm_prob[cur_large_block] = norm_prob[cur_large_block] + norm_prob[cur_small_block] - 1;
if (norm_prob[cur_large_block] < 1)
small_block[num_small_block++] = cur_large_block;
else
large_block[num_large_block++] = cur_large_block;
}
while (num_large_block) prob[large_block[--num_large_block]] = 1;
while (num_small_block) prob[small_block[--num_small_block]] = 1;
free(norm_prob);
free(small_block);
free(large_block);
}
long long SampleAnEdge(double rand_value1, double rand_value2)
{
long long k = (long long)num_edges * rand_value1;
return rand_value2 < prob[k] ? k : alias[k];
}