riscv-vector

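An introduction to the RISC-V vector extension.

The vector instruction set addresses the following problem: in a regular scalar instruction set, each instruction operates on only one destination register, so repeating the same operation many times takes many instructions. With vector instructions, a single instruction can operate on multiple registers or memory addresses.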

Term

ELEN

The maximum size in bits of a vector element that any operation can produce or consume, ELEN ≥ 8, which must be a power of 2.

In other words, the largest element size that can be operated on. It must be at least 8.

VLEN

The number of bits in a vector register, VLEN>=ELEN, which must be a power of 2.

The number of bits in a vector register is VLEN. VLEN >= ELEN is required.

SEW

Selected element width.

The currently selected element width. It divides a vector register into multiple elements.

LMUL

Vector register group multiplier.

Groups multiple vector registers into one register group. Legal values are 1/8, 1/4, 1/2, 1, 2, 4, 8.

VLMAX

The maximum number of elements that can be operated on with a single vector instruction given the current SEW and LMUL settings.

The maximum number of elements a single instruction can operate on.

VLMAX = LMUL * VLEN / SEW
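
To make the formula concrete, here is a minimal Python sketch. The function name `vlmax` is my own choice for illustration and is not part of any toolchain; `Fraction` is used only so that fractional LMUL values stay exact.

```python
from fractions import Fraction

def vlmax(vlen, sew, lmul):
    """Return VLMAX = LMUL * VLEN / SEW for one vtype setting."""
    # Fraction keeps fractional LMUL values (1/8, 1/4, 1/2) exact.
    return int(Fraction(lmul) * vlen / sew)

print(vlmax(128, 8, 1))                # 16
print(vlmax(128, 32, Fraction(1, 4)))  # 1
print(vlmax(128, 32, 2))               # 8
```

With VLEN=128, the three calls correspond to one full register of bytes, a quarter register of 32-bit elements, and a two-register group of 32-bit elements.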

vl

vector length.

The number of elements the current instruction operates on. vl <= VLMAX.

vstart

The index of the first element the current instruction operates on.

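Explanation

VLEN is fixed once the CPU design is finished.

SEW and LMUL are configured by software at run time, through instructions that write the vtype register, and they can be changed.

Once SEW and LMUL are configured, VLMAX is determined, and so is the maximum number of elements that subsequent vector instructions can operate on. Subsequent vector instructions may operate on fewer than VLMAX elements; the count is given by the vl register.

The total width one instruction operates on is LMUL * VLEN. Each element within that width is SEW bits wide.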

Example

Assume VLEN=128 bits.

With VLEN=SLEN and LMUL=1, let us look at what SEW means.

Byte        F E D C B A 9 8 7 6 5 4 3 2 1 0
SEW=8 bit   F E D C B A 9 8 7 6 5 4 3 2 1 0
SEW=16 bit    7   6   5   4   3   2   1   0
SEW=32 bit        3       2       1       0
SEW=64 bit                1               0
SEW=128 bit                               0

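As shown, with SEW=8 a vector register is divided into 16 elements of 8 bits each. With SEW=128 it holds a single 128-bit element.

Next, the meaning of LMUL.

With VLEN=SLEN and LMUL=1/4, only a quarter of a vector register is valid, so the largest usable SEW is 32 bits. One might ask how a valid range in the middle of the register is expressed. That is what vstart and vl are for: vstart is the index at which processing starts, and vl is the index at which it ends (exclusive).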

Byte        F E D C B A 9 8 7 6 5 4 3 2 1 0
SEW=8 bit   - - - - - - - - - - - - 3 2 1 0
SEW=16 bit    -   -   -   -   -   -   1   0
SEW=32 bit        -       -       -       0

With VLEN=SLEN, SEW=32 and LMUL=2, each instruction operates on 2 vector registers, and the element indices are laid out as follows.

Byte        F E D C B A 9 8 7 6 5 4 3 2 1 0
v2*n              3       2       1       0
v2*n+1            7       6       5       4

Prestart, Active, Inactive, Body, and Tail

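These concepts apply to elements.

Assume VLEN=32, LMUL=2, SEW=16, so the instruction covers 4 elements. If vstart is set to 1 and vl to 2, then by the definitions below element 0 is prestart, element 1 is body, and elements 2 and 3 are tail.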

for element index x
prestart(x) = (0 <= x < vstart)
body(x) = (vstart <= x < vl)
tail(x) = (vl <= x < max(VLMAX,VLEN/SEW))
mask(x) = unmasked || v0.mask[x] == 1
active(x) = body(x) && mask(x)
inactive(x) = body(x) && !mask(x)
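
The definitions above can be turned into a small Python sketch. The function `classify` and its `mask` parameter (a list of v0 mask bits, or `None` for an unmasked instruction) are names I made up for illustration.

```python
def classify(x, vstart, vl, mask=None):
    """Classify element index x following the definitions above.

    mask=None models an unmasked instruction; otherwise mask[x] is the
    v0 mask bit of element x.
    """
    if x < vstart:
        return "prestart"
    if x >= vl:
        return "tail"
    enabled = mask is None or mask[x] == 1
    return "active" if enabled else "inactive"

# VLEN=32, LMUL=2, SEW=16 gives VLMAX=4; vstart=1 and vl=2 as in the text.
print([classify(x, 1, 2) for x in range(4)])
# ['prestart', 'active', 'tail', 'tail']
```

Element 1 is the only body element here, and because the instruction is unmasked it is also active.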

Programmer’s model

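The vector extension adds 32 vector registers and 7 unprivileged CSRs, plus corresponding fields in mstatus/vsstatus.

As can be seen, the vector extension changes little in the privileged architecture. Most of the CSRs exist so that vector instructions still fit the 32-bit instruction encoding.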

vector register

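There are 32 vector registers, v0 to v31. They resemble the scalar registers, except that v0 is not hard-wired to 0; instead it serves as the default mask register. Each register is VLEN bits wide.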

privilege

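A VS field is added at mstatus[10:9]. It is analogous to the FS field.

When mstatus.VS is Off, executing any vector instruction or accessing a vector CSR raises an illegal instruction exception.

When mstatus.VS is Initial or Clean, executing any instruction that changes vector state (including the vector CSRs) sets mstatus.VS to Dirty. Implementations may also change mstatus.VS from Initial or Clean to Dirty at any time, even when no vector state has changed.

If mstatus.VS is Dirty, mstatus.SD is 1.

A V bit is added to misa.

Likewise, if the hypervisor extension is implemented, a VS field is also added to vsstatus.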

unprivileged

Address Privilege Name Description
0x008 URW vstart Vector start position
0x009 URW vxsat Fixed-Point Saturate Flag
0x00A URW vxrm Fixed-Point Rounding Mode
0x00F URW vcsr Vector control and status register
0xC20 URO vl Vector length
0xC21 URO vtype Vector data type register
0xC22 URO vlenb VLEN/8 (vector register length in bytes)

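vstart: the index of the first element the instruction processes.

vxsat: the fixed-point saturation flag.

vxrm: the fixed-point rounding mode.

vcsr: holds just vxsat and vxrm. I do not see why a separate register is needed for this; presumably it lets both fields be accessed with one CSR instruction, much as fcsr does for frm and fflags.

vl: the number of elements the instruction operates on.

vtype: the SEW/LMUL settings.

vlenb: tells software the hardware's VLEN, in bytes.

Besides SEW/LMUL, vtype also holds vta (vector tail agnostic) and vma (vector mask agnostic).

A vector operation leaves many holes, such as tail elements and inactive elements. Whether those positions keep their old values or are filled with a fixed value is a matter of policy. Preserving the old values costs the hardware more, so different policies are offered to balance software and hardware needs.

undisturbed: the old value is kept.

agnostic: the old value may be kept, or the element may be overwritten with all 1s.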

vector instruction

The instructions fall into the following types:

  • configuration setting instructions
  • vector loads and stores instructions
  • vector integer arithmetic instructions
  • vector fixed-point arithmetic instructions
  • vector floating-point arithmetic instructions
  • vector reduction instructions
  • vector mask instructions
  • vector permutation instructions

configuration setting instructions

Three configuration instructions are provided.

vsetvli rd, rs1, vtypei # rd = new vl, rs1 = AVL, vtypei = new vtype setting
vsetivli rd, uimm, vtypei # rd = new vl, uimm = AVL, vtypei = new vtype setting
vsetvl rd, rs1, rs2 # rd = new vl, rs1 = AVL, rs2 = new vtype value

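They perform the following actions:

  • Write the new vtype setting (SEW/LMUL/vta/vma) to vtype.
  • Compute the new vl from the new vtype and the AVL, and write it to vl.
  • Write the new vl to rd.
  • If the setting is not supported, set vill in vtype to 1, clear the other vtype fields, and clear vl.

The special part here is AVL (application vector length). It is the requested length and may exceed VLMAX, so the computed vl does not necessarily equal AVL.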

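For example, assume VLEN=64, SEW=8, LMUL=2, AVL=32. One instruction can then operate on VLEN*LMUL/SEW = 64*2/8 = 16 elements, so vl is set to 16 and the request takes 32/16 = 2 passes to complete.

The computation of vl must obey the following rules.

  • vl = AVL if AVL <= VLMAX. This is straightforward: if the requested element count does not exceed the maximum, this pass handles AVL elements.
  • ceil(AVL / 2) <= vl <= VLMAX if AVL < (2 * VLMAX). This one is harder to grasp. My reading is that when VLMAX < AVL < 2 * VLMAX the work must take two passes, and the implementation may decide how to split them. For example, with VLMAX=16 and AVL=28, the splits 16+12, 14+14 and 15+13 are all allowed.
  • vl = VLMAX if AVL >= (2 * VLMAX). With at least twice VLMAX requested, the pass must take the maximum.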
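
The three rules can be sketched in Python as follows. Both function names are hypothetical, and `strip_mine` assumes an implementation that always picks the largest allowed vl, which is only one of the choices the rules permit.

```python
def vl_range(avl, vlmax):
    """Return the (min, max) vl that the rules above allow for a given AVL."""
    if avl <= vlmax:
        return avl, avl
    if avl < 2 * vlmax:
        return -(-avl // 2), vlmax  # ceil(AVL / 2) .. VLMAX
    return vlmax, vlmax

def strip_mine(avl, vlmax):
    """Chunk sizes when every pass takes the largest allowed vl (an assumption)."""
    chunks = []
    while avl > 0:
        vl = vl_range(avl, vlmax)[1]
        chunks.append(vl)
        avl -= vl
    return chunks

print(vl_range(28, 16))    # (14, 16)
print(strip_mine(32, 16))  # [16, 16]
print(strip_mine(28, 16))  # [16, 12]
```

The loop in `strip_mine` mirrors what software does: call vsetvli with the remaining count, process vl elements, and subtract vl until nothing is left.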

vector loads and stores instructions

Vector loads and stores come in the following kinds:

  • unit-stride
  • unit-stride, whole register
  • unit-stride, mask, EEW=8
  • unit-stride fault-only-first
  • strided
  • indexed-unordered
  • indexed-ordered
  • unit-stride segment
  • unit-stride fault-only-first segment
  • strided segment
  • indexed-unordered segment
  • indexed-ordered segment

A load/store instruction can also specify EEW, the element width that is actually read or written. EMUL then follows from the formula:

EMUL = (EEW / SEW) * LMUL

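That is, the element count vl is determined by the original SEW/LMUL setting (at most VLMAX = VLEN/SEW*LMUL). Even when the instruction specifies a different EEW, which changes the width of each element, the number of elements vl stays the same.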
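
A small Python sketch of the EMUL formula, using the SEW/LMUL/EEW combinations from the scenarios in this article. The function name `emul` is mine.

```python
from fractions import Fraction

def emul(eew, sew, lmul):
    """EMUL = (EEW / SEW) * LMUL, kept exact with Fraction."""
    return Fraction(eew, sew) * Fraction(lmul)

print(emul(8, 8, 1))    # 1
print(emul(16, 8, 1))   # 2
print(emul(16, 64, 1))  # 1/4
```

An EMUL of 2 means the access spans two registers, and an EMUL of 1/4 means it touches only a quarter of one register.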

unit-stride

Unit stride means consecutive load/store accesses: the stride is fixed at one element.

# Vector unit-stride loads and stores
# vd destination, rs1 base address, vm is mask encoding (v0.t or <missing>)
vle8.v vd, (rs1), vm # 8-bit unit-stride load
vle16.v vd, (rs1), vm # 16-bit unit-stride load
vle32.v vd, (rs1), vm # 32-bit unit-stride load
vle64.v vd, (rs1), vm # 64-bit unit-stride load
# vs3 store data, rs1 base address, vm is mask encoding (v0.t or <missing>)
vse8.v vs3, (rs1), vm # 8-bit unit-stride store
vse16.v vs3, (rs1), vm # 16-bit unit-stride store
vse32.v vs3, (rs1), vm # 32-bit unit-stride store
vse64.v vs3, (rs1), vm # 64-bit unit-stride store

For example, assume VLEN=128.

Scenario: SEW=8, LMUL=1, EEW=SEW

li a0, 0x90000000
li a1, 512
vsetvli t0, a1, e8, m1, tu, mu // vl=VLEN/SEW*LMUL=16, t0=16
vle8.v v8, (a0) // EEW=8, so EMUL=LMUL=1
// So this instruction reads addresses 90000000 ~ 9000000f; the element width is 8 bits and there are 16 elements
// v8=0x0f0e0d0c0b0a09080706050403020100
// MR(1) 0x90000000
// MR(1) 0x90000001
// MR(1) 0x90000002
// MR(1) 0x90000003
// MR(1) 0x90000004
// MR(1) 0x90000005
// MR(1) 0x90000006
// MR(1) 0x90000007
// MR(1) 0x90000008
// MR(1) 0x90000009
// MR(1) 0x9000000a
// MR(1) 0x9000000b
// MR(1) 0x9000000c
// MR(1) 0x9000000d
// MR(1) 0x9000000e
// MR(1) 0x9000000f

Scenario: SEW=8, LMUL=1, EEW=2*SEW, showing the case where EEW is larger than SEW

li a0, 0x90000000
li a1, 512
vsetvli t0, a1, e8, m1, tu, mu // vl=VLEN/SEW*LMUL=16, t0=16
vle16.v v8, (a0) // EEW=16, so EMUL=(EEW/SEW)*LMUL=2
// So this instruction reads addresses 90000000 ~ 9000001f; the element width is 16 bits and there are 16 elements
// v9=0x1f1e1d1c1b1a19181716151413121110
// v8=0x0f0e0d0c0b0a09080706050403020100
// MR(2) 0x90000000
// MR(2) 0x90000002
// MR(2) 0x90000004
// MR(2) 0x90000006
// MR(2) 0x90000008
// MR(2) 0x9000000a
// MR(2) 0x9000000c
// MR(2) 0x9000000e
// MR(2) 0x90000010
// MR(2) 0x90000012
// MR(2) 0x90000014
// MR(2) 0x90000016
// MR(2) 0x90000018
// MR(2) 0x9000001a
// MR(2) 0x9000001c
// MR(2) 0x9000001e

Scenario: SEW=64, LMUL=1, EEW=(1/4)*SEW, showing the case where EEW is smaller than SEW

li a0, 0x90000000
li a1, 512
vsetvli t0, a1, e64, m1, tu, mu // vl=VLEN/SEW*LMUL=2, t0=2
vle16.v v8, (a0) // EEW=16, so EMUL=(EEW/SEW)*LMUL=1/4
// So this instruction reads addresses 90000000 ~ 90000003; the element width is 16 bits and there are 2 elements
// v8 = 0x03020100
// MR(2) 0x90000000
// MR(2) 0x90000002

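The examples above show that:

  • The number of elements a unit-stride access touches is the vl computed by vsetvli.
  • The element width of a unit-stride access is the EEW in the instruction. It may differ from SEW, and EEW takes precedence.
  • A unit-stride access places lower addresses in the lower bits of the register.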
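
These three observations can be modeled in a few lines of Python. This is only a sketch: `unit_stride_load` is a name I made up, masking is ignored, and `mem` is a hypothetical memory image in which the byte at offset k holds the value k (which is what the traces above suggest).

```python
def unit_stride_load(mem, base, eew, vl):
    """Return the destination register group as one integer.

    Element 0 lands in the lowest bits, matching the traces above.
    """
    nbytes = eew // 8
    value = 0
    for i in range(vl):
        addr = base + i * nbytes
        elem = int.from_bytes(mem[addr:addr + nbytes], "little")
        value |= elem << (i * eew)
    return value

mem = bytes(range(32))  # hypothetical memory: byte at offset k holds k

print(f"{unit_stride_load(mem, 0, 8, 16):032x}")
# 0f0e0d0c0b0a09080706050403020100
print(f"{unit_stride_load(mem, 0, 16, 2):08x}")
# 03020100
```

The first call corresponds to the vle8.v scenario with vl=16, the second to the vle16.v scenario with vl=2.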

unit-stride, whole register

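These instructions load or store a complete register. The nf field of the encoding specifies how many vector registers are loaded/stored. NFIELDS currently supports 1/2/4/8.

These instructions ignore the vtype and vl registers and depend only on nf and EEW. evl = NFIELDS * VLEN / EEW.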
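
As a quick check of the evl formula, here is a one-function Python sketch (the name `whole_reg_evl` is mine). The store case assumes EEW=8, as this article describes for whole-register stores.

```python
def whole_reg_evl(nfields, vlen, eew):
    """Effective vector length of a whole-register load/store."""
    return nfields * vlen // eew

print(whole_reg_evl(1, 128, 64))  # 2
print(whole_reg_evl(1, 128, 8))   # 16
print(whole_reg_evl(4, 128, 32))  # 16
```

With VLEN=128, a one-register load at EEW=64 moves 2 elements, while a one-register store at EEW=8 moves 16.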

# Format of whole register load and store instructions
vl<NFIELDS>re<EEW>.v vd, (rs1)
vs<NFIELDS>r.v vs3, (rs1)

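The odd part is that the load can specify an EEW while the store cannot and only supports EEW=8. I do not understand why the load is allowed to specify it and the store is not.

For example, assume VLEN=128.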

vl1re64.v v8, (a0)
// v8=0x0f0e0d0c0b0a09080706050403020100
// MR(8) 0x90000000
// MR(8) 0x90000008
vs1r.v v8, (a0)
// MW(1) 0x90000000
// MW(1) 0x90000001
// MW(1) 0x90000002
// MW(1) 0x90000003
// MW(1) 0x90000004
// MW(1) 0x90000005
// MW(1) 0x90000006
// MW(1) 0x90000007
// MW(1) 0x90000008
// MW(1) 0x90000009
// MW(1) 0x9000000a
// MW(1) 0x9000000b
// MW(1) 0x9000000c
// MW(1) 0x9000000d
// MW(1) 0x9000000e
// MW(1) 0x9000000f

unit-stride, mask, EEW=8

These instructions are similar to vle/vse: they also load/store values from/to memory, with EEW fixed at 8. The difference is that evl = ceil(vl/8), that is, vl/8 rounded up.

# Vector unit-stride mask load/store
vlm.v vd, (rs1)
vsm.v vs3, (rs1)

For example, assume VLEN=128.

li a1, 127
vsetvli t0, a1, e8, m8, tu, mu // t0 = 127
vlm.v v8, (a0)
// Because vl=127, the number of bytes this instruction loads is ceil(vl/8) = 16
// v8=0x0f0e0d0c0b0a09080706050403020100
// MR(1) 0x90000000
// MR(1) 0x90000001
// MR(1) 0x90000002
// MR(1) 0x90000003
// MR(1) 0x90000004
// MR(1) 0x90000005
// MR(1) 0x90000006
// MR(1) 0x90000007
// MR(1) 0x90000008
// MR(1) 0x90000009
// MR(1) 0x9000000a
// MR(1) 0x9000000b
// MR(1) 0x9000000c
// MR(1) 0x9000000d
// MR(1) 0x9000000e
// MR(1) 0x9000000f

unit-stride fault-only-first

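A trap is taken only when element 0 raises an exception. If any later element raises an exception, no trap is taken; instead vl is reduced to the index of that element.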

# Vector unit-stride fault-only-first loads
# vd destination, rs1 base address, vm is mask encoding (v0.t or <missing>)
vle8ff.v vd, (rs1), vm # 8-bit unit-stride fault-only-first load
vle16ff.v vd, (rs1), vm # 16-bit unit-stride fault-only-first load
vle32ff.v vd, (rs1), vm # 32-bit unit-stride fault-only-first load
vle64ff.v vd, (rs1), vm # 64-bit unit-stride fault-only-first load

strided

Accesses with a specified stride.

# Vector strided loads and stores
# vd destination, rs1 base address, rs2 byte stride
vlse8.v vd, (rs1), rs2, vm # 8-bit strided load
vlse16.v vd, (rs1), rs2, vm # 16-bit strided load
vlse32.v vd, (rs1), rs2, vm # 32-bit strided load
vlse64.v vd, (rs1), rs2, vm # 64-bit strided load
# vs3 store data, rs1 base address, rs2 byte stride
vsse8.v vs3, (rs1), rs2, vm # 8-bit strided store
vsse16.v vs3, (rs1), rs2, vm # 16-bit strided store
vsse32.v vs3, (rs1), rs2, vm # 32-bit strided store
vsse64.v vs3, (rs1), rs2, vm # 64-bit strided store

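Some special cases for these instructions:

  • Negative and zero strides are supported.
  • Element accesses are unordered with respect to each other.
  • When rs2 = x0, an implementation is allowed, but not required, to perform fewer memory accesses than the number of active elements.
  • When rs2 != x0 but x[rs2] = 0, the implementation must perform one memory access per active element; these accesses may be unordered.

For example, assume VLEN=128.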

li a1, 512
vsetvli t0, a1, e32, m1, tu, mu // vl = 128/32 * 1 = 4
li a2, 2 // stride is 2
vlse8.v v8, (a0), a2 // EEW=8
// v8 = 0x06040200
// MR(1) 0x90000000
// MR(1) 0x90000002
// MR(1) 0x90000004
// MR(1) 0x90000006
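
The trace above can be reproduced with a short Python model. `strided_load` is a hypothetical helper that ignores masking, and `mem` is an assumed memory image where the byte at offset k holds k.

```python
def strided_load(mem, base, stride, eew, vl):
    """Model a strided load: element i is read from base + i * stride (bytes)."""
    nbytes = eew // 8
    value = 0
    for i in range(vl):
        addr = base + i * stride
        elem = int.from_bytes(mem[addr:addr + nbytes], "little")
        value |= elem << (i * eew)
    return value

mem = bytes(range(16))  # hypothetical memory: byte at offset k holds k

print(f"{strided_load(mem, 0, 2, 8, 4):08x}")
# 06040200
```

The stride is a byte count, not an element count, so with EEW=8 and a stride of 2 every other byte is read.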

indexed-unordered

The offset of each element is given by an index vector.

# Vector indexed loads and stores
# Vector indexed-unordered load instructions
# vd destination, rs1 base address, vs2 byte offsets
vluxei8.v vd, (rs1), vs2, vm # unordered 8-bit indexed load of SEW data
vluxei16.v vd, (rs1), vs2, vm # unordered 16-bit indexed load of SEW data
vluxei32.v vd, (rs1), vs2, vm # unordered 32-bit indexed load of SEW data
vluxei64.v vd, (rs1), vs2, vm # unordered 64-bit indexed load of SEW data
# Vector indexed-unordered store instructions
# vs3 store data, rs1 base address, vs2 byte offsets
vsuxei8.v vs3, (rs1), vs2, vm # unordered 8-bit indexed store of SEW data
vsuxei16.v vs3, (rs1), vs2, vm # unordered 16-bit indexed store of SEW data
vsuxei32.v vs3, (rs1), vs2, vm # unordered 32-bit indexed store of SEW data
vsuxei64.v vs3, (rs1), vs2, vm # unordered 64-bit indexed store of SEW data

For example, assume VLEN=128.

li a1, 512
vsetvli t0, a1, e64, m1 // vl = 2
li a2, 0x0e0a0c0608040002ul
vmv.v.x v1, a2 // v1 = 0x0e0a0c06080400020e0a0c0608040002
vsetvli t0, a1, e16, m1 // vl = 8
vluxei8.v v8, (a0), v1
// v8 = 0x0f0e0b0a0d0c07060908050401000302
// MR(2) 0x90000002
// MR(2) 0x90000000
// MR(2) 0x90000004
// MR(2) 0x90000008
// MR(2) 0x90000006
// MR(2) 0x9000000c
// MR(2) 0x9000000a
// MR(2) 0x9000000e

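A few points to note:

  • The index is a byte offset added to the base address, not an element count.
  • The width of each vs2 element is given by the instruction. The width of each data access is SEW.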
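
A Python sketch that reproduces the vluxei8.v example: 8-bit byte offsets taken from the index register and 16-bit (SEW) data elements. `indexed_load` is a made-up helper, masking is ignored, and `mem` is an assumed memory image where the byte at offset k holds k.

```python
def indexed_load(mem, base, offsets, sew, vl):
    """Model an indexed load: element i is read from base + offsets[i] (bytes)."""
    nbytes = sew // 8
    value = 0
    for i in range(vl):
        addr = base + offsets[i]
        elem = int.from_bytes(mem[addr:addr + nbytes], "little")
        value |= elem << (i * sew)
    return value

mem = bytes(range(16))  # hypothetical memory: byte at offset k holds k

# The low 8 bytes of v1, read as 8-bit indices, element 0 first.
offsets = list((0x0e0a0c0608040002).to_bytes(8, "little"))
print(offsets)  # [2, 0, 4, 8, 6, 12, 10, 14]

print(f"{indexed_load(mem, 0, offsets, 16, 8):032x}")
# 0f0e0b0a0d0c07060908050401000302
```

The offsets list matches the order of the MR(2) accesses in the trace, and the result matches the v8 value shown there.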

indexed-ordered

# Vector indexed-ordered load instructions
# vd destination, rs1 base address, vs2 byte offsets
vloxei8.v vd, (rs1), vs2, vm # ordered 8-bit indexed load of SEW data
vloxei16.v vd, (rs1), vs2, vm # ordered 16-bit indexed load of SEW data
vloxei32.v vd, (rs1), vs2, vm # ordered 32-bit indexed load of SEW data
vloxei64.v vd, (rs1), vs2, vm # ordered 64-bit indexed load of SEW data
# Vector indexed-ordered store instructions
# vs3 store data, rs1 base address, vs2 byte offsets
vsoxei8.v vs3, (rs1), vs2, vm # ordered 8-bit indexed store of SEW data
vsoxei16.v vs3, (rs1), vs2, vm # ordered 16-bit indexed store of SEW data
vsoxei32.v vs3, (rs1), vs2, vm # ordered 32-bit indexed store of SEW data
vsoxei64.v vs3, (rs1), vs2, vm # ordered 64-bit indexed store of SEW data

unit-stride segment

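These instructions load segments into consecutive vector registers, or store segments from vector registers.

The difference from the earlier instructions is that those place data horizontally (along one register), while segment instructions place data vertically (across registers).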

# Format
vlseg<nf>e<eew>.v vd, (rs1), vm # Unit-stride segment load template
vsseg<nf>e<eew>.v vs3, (rs1), vm # Unit-stride segment store template

The total number of elements accessed by the instruction is vl * nf.

For example, assume VLEN=128.

li a1, 512
vsetvli t0, a1, e32, m1, tu, mu // vl = 4
vlseg3e8.v v8, (a0) // access width is EEW=8; 3 registers are written, placed vertically
// v8 = 0x09060300
// v9 = 0x0a070401
// v10 = 0x0b080502
// MR(1) 0x90000000
// MR(1) 0x90000001
// MR(1) 0x90000002
// MR(1) 0x90000003
// MR(1) 0x90000004
// MR(1) 0x90000005
// MR(1) 0x90000006
// MR(1) 0x90000007
// MR(1) 0x90000008
// MR(1) 0x90000009
// MR(1) 0x9000000a
// MR(1) 0x9000000b
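
The vertical placement can be modeled in Python as below. `segment_load` is a hypothetical helper for the unit-stride, unmasked case, and `mem` is an assumed memory image where the byte at offset k holds k.

```python
def segment_load(mem, base, nf, eew, vl):
    """Model a unit-stride segment load; returns one integer per register."""
    nbytes = eew // 8
    regs = [0] * nf
    for i in range(vl):        # segment index
        for f in range(nf):    # field within the segment
            addr = base + (i * nf + f) * nbytes
            elem = int.from_bytes(mem[addr:addr + nbytes], "little")
            regs[f] |= elem << (i * eew)
    return regs

mem = bytes(range(12))  # hypothetical memory: byte at offset k holds k

print([f"{r:08x}" for r in segment_load(mem, 0, 3, 8, 4)])
# ['09060300', '0a070401', '0b080502']
```

Field 0 of every segment goes to the first register, field 1 to the second, and so on, which is the array-of-structures to structure-of-arrays conversion.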

unit-stride fault-only-first segment

# Template for vector fault-only-first unit-stride segment loads.
vlseg<nf>e<eew>ff.v vd, (rs1), vm # Unit-stride fault-only-first segment loads

strided segment

# Format
vlsseg<nf>e<eew>.v vd, (rs1), rs2, vm # Strided segment loads
vssseg<nf>e<eew>.v vs3, (rs1), rs2, vm # Strided segment stores

indexed-unordered segment

# Format
vluxseg<nf>ei<eew>.v vd, (rs1), vs2, vm # Indexed-unordered segment loads
vsuxseg<nf>ei<eew>.v vs3, (rs1), vs2, vm # Indexed-unordered segment stores

indexed-ordered segment

# Format
vloxseg<nf>ei<eew>.v vd, (rs1), vs2, vm # Indexed-ordered segment loads
vsoxseg<nf>ei<eew>.v vs3, (rs1), vs2, vm # Indexed-ordered segment stores

vector integer arithmetic instructions

The integer arithmetic instructions fall into the following groups.

  • Vector Single-Width Integer Add and Subtract
  • Vector Widening Integer Add/Subtract
  • Vector Integer Extension
  • Vector Integer Add-with-Carry / Subtract-with-Borrow Instructions
  • Vector Bitwise Logical Instructions
  • Vector Single-Width Shift Instructions
  • Vector Narrowing Integer Right Shift Instructions
  • Vector Integer Compare Instructions
  • Vector Integer Min/Max Instructions
  • Vector Single-Width Integer Multiply Instructions
  • Vector Integer Divide Instructions
  • Vector Widening Integer Multiply Instructions
  • Vector Single-Width Integer Multiply-Add Instructions
  • Vector Widening Integer Multiply-Add Instructions
  • Vector Integer Merge Instructions
  • Vector Integer Move Instructions

Vector Single-Width Integer Add and Subtract

The operation width is the SEW in vtype. Overflow beyond SEW bits is ignored.

# Integer adds.
vadd.vv vd, vs2, vs1, vm # Vector-vector vd[i] = vs2[i] + vs1[i]
vadd.vx vd, vs2, rs1, vm # vector-scalar vd[i] = vs2[i] + rs1
vadd.vi vd, vs2, imm, vm # vector-immediate vd[i] = vs2[i] + imm

# Integer subtract
vsub.vv vd, vs2, vs1, vm # Vector-vector vd[i] = vs2[i] - vs1[i]
vsub.vx vd, vs2, rs1, vm # vector-scalar vd[i] = vs2[i] - rs1

# Integer reverse subtract
vrsub.vx vd, vs2, rs1, vm # vd[i] = x[rs1] - vs2[i]
vrsub.vi vd, vs2, imm, vm # vd[i] = imm - vs2[i]

For example, assume VLEN=128.

vsetvli t0, a1, e64, m1
li a2, 0xff88060504030280ul
vmv.v.x v1, a2 // v1 = 0xff88060504030280ff88060504030280
vmv.v.x v2, a2 // v2 = 0xff88060504030280ff88060504030280
vsetvli t0, a1, e8, m1 // SEW = 8, vl=16
vadd.vv v8, v1, v2 // v8 = 0xfe100c0a08060400fe100c0a08060400
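
The result above can be checked with a small Python model of an unmasked vadd.vv. The helper names `split`, `join` and `vadd_vv` are mine; registers are represented as Python integers with element 0 in the low bits.

```python
def split(value, sew, n):
    """Split an integer register image into n elements of sew bits, element 0 first."""
    mask = (1 << sew) - 1
    return [(value >> (i * sew)) & mask for i in range(n)]

def join(elems, sew):
    """Pack elements back into one integer, element 0 in the low bits."""
    return sum(e << (i * sew) for i, e in enumerate(elems))

def vadd_vv(vs2, vs1, sew, vl):
    """Element-wise add; overflow beyond SEW bits is dropped."""
    mask = (1 << sew) - 1
    sums = [(a + b) & mask for a, b in zip(split(vs2, sew, vl), split(vs1, sew, vl))]
    return join(sums, sew)

v1 = 0xff88060504030280ff88060504030280
print(f"{vadd_vv(v1, v1, 8, 16):032x}")
# fe100c0a08060400fe100c0a08060400
```

Note how 0x80 + 0x80 wraps to 0x00 and 0xff + 0xff wraps to 0xfe, since the carry out of each 8-bit element is discarded.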

Vector Widening Integer Add/Subtract

# Widening unsigned integer add/subtract, 2*SEW = SEW +/- SEW
vwaddu.vv vd, vs2, vs1, vm # vector-vector
vwaddu.vx vd, vs2, rs1, vm # vector-scalar
vwsubu.vv vd, vs2, vs1, vm # vector-vector
vwsubu.vx vd, vs2, rs1, vm # vector-scalar
# Widening signed integer add/subtract, 2*SEW = SEW +/- SEW
vwadd.vv vd, vs2, vs1, vm # vector-vector
vwadd.vx vd, vs2, rs1, vm # vector-scalar
vwsub.vv vd, vs2, vs1, vm # vector-vector
vwsub.vx vd, vs2, rs1, vm # vector-scalar
# Widening unsigned integer add/subtract, 2*SEW = 2*SEW +/- SEW
vwaddu.wv vd, vs2, vs1, vm # vector-vector
vwaddu.wx vd, vs2, rs1, vm # vector-scalar
vwsubu.wv vd, vs2, vs1, vm # vector-vector
vwsubu.wx vd, vs2, rs1, vm # vector-scalar
# Widening signed integer add/subtract, 2*SEW = 2*SEW +/- SEW
vwadd.wv vd, vs2, vs1, vm # vector-vector
vwadd.wx vd, vs2, rs1, vm # vector-scalar
vwsub.wv vd, vs2, vs1, vm # vector-vector
vwsub.wx vd, vs2, rs1, vm # vector-scalar

For example, assume VLEN=128.

vsetvli t0, a1, e64, m1
li a2, 0xff88060504030280ul
vmv.v.x v1, a2 // v1 = 0xff88060504030280ff88060504030280
vmv.v.x v2, a2 // v2 = 0xff88060504030280ff88060504030280
vsetvli t0, a1, e8, m1 // SEW = 8, vl=16
vwaddu.vv v8, v1, v2 // destination is the register group v8-v9 (EMUL=2)
// v8 = 0x01fe0110000c000a0008000600040100
// v9 = 0x01fe0110000c000a0008000600040100
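
A Python sketch of the unmasked widening add, to show why the result needs two registers. `vwaddu_vv` is a hypothetical name; the 256-bit result is split into its low half (v8) and high half (v9) by hand.

```python
def vwaddu_vv(vs2, vs1, sew, vl):
    """Unsigned widening add: each SEW-bit pair produces a 2*SEW-bit sum."""
    mask = (1 << sew) - 1
    out = 0
    for i in range(vl):
        a = (vs2 >> (i * sew)) & mask
        b = (vs1 >> (i * sew)) & mask
        out |= (a + b) << (i * 2 * sew)
    return out

v1 = 0xff88060504030280ff88060504030280
wide = vwaddu_vv(v1, v1, 8, 16)      # 16 elements of 16 bits = 256 bits
v8 = wide & ((1 << 128) - 1)         # low register of the group
v9 = wide >> 128                     # high register of the group
print(f"{v8:032x}")  # 01fe0110000c000a0008000600040100
print(f"{v9:032x}")  # 01fe0110000c000a0008000600040100
```

Unlike the single-width add, 0x80 + 0x80 is kept as 0x0100 and 0xff + 0xff as 0x01fe, because the destination elements are twice as wide.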

Vector Integer Extension

The source EEW of these instructions is 1/2, 1/4 or 1/8 of SEW. The destination EEW equals SEW.

vzext.vf2 vd, vs2, vm # Zero-extend SEW/2 source to SEW destination
vsext.vf2 vd, vs2, vm # Sign-extend SEW/2 source to SEW destination
vzext.vf4 vd, vs2, vm # Zero-extend SEW/4 source to SEW destination
vsext.vf4 vd, vs2, vm # Sign-extend SEW/4 source to SEW destination
vzext.vf8 vd, vs2, vm # Zero-extend SEW/8 source to SEW destination
vsext.vf8 vd, vs2, vm # Sign-extend SEW/8 source to SEW destination

For example, assume VLEN=128.

vsetvli t0, a1, e64, m1
li a2, 0xff88060504030280ul
vmv.v.x v1, a2 // v1 = 0xff88060504030280ff88060504030280
vsetvli t0, a1, e16, m1 // SEW = 16, vl=8
vsext.vf2 v8, v1 // v8 = 0xffffff8800060005000400030002ff80
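
The sign extension in this example can be modeled in Python as follows. `vsext` is a hypothetical helper for the unmasked case; it reads `vl` narrow elements from the low bits of the source and writes `vl` wide elements.

```python
def vsext(src, src_bits, dst_bits, vl):
    """Sign-extend vl elements from src_bits to dst_bits each."""
    src_mask = (1 << src_bits) - 1
    dst_mask = (1 << dst_bits) - 1
    out = 0
    for i in range(vl):
        x = (src >> (i * src_bits)) & src_mask
        if x >> (src_bits - 1):   # sign bit set: make the value negative
            x -= 1 << src_bits
        out |= (x & dst_mask) << (i * dst_bits)
    return out

v1 = 0xff88060504030280ff88060504030280
print(f"{vsext(v1, 8, 16, 8):032x}")
# ffffff8800060005000400030002ff80
```

Only the low 8 bytes of v1 are consumed, because 8 destination elements at SEW=16 need just 8 source elements at SEW/2.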

Vector Integer Add-with-Carry / Subtract-with-Borrow Instructions

# Produce sum with carry.
# vd[i] = vs2[i] + vs1[i] + v0.mask[i]
vadc.vvm vd, vs2, vs1, v0 # Vector-vector

# vd[i] = vs2[i] + x[rs1] + v0.mask[i]
vadc.vxm vd, vs2, rs1, v0 # Vector-scalar

# vd[i] = vs2[i] + imm + v0.mask[i]
vadc.vim vd, vs2, imm, v0 # Vector-immediate

# Produce carry out in mask register format
# vd.mask[i] = carry_out(vs2[i] + vs1[i] + v0.mask[i])
vmadc.vvm vd, vs2, vs1, v0 # Vector-vector

# vd.mask[i] = carry_out(vs2[i] + x[rs1] + v0.mask[i])
vmadc.vxm vd, vs2, rs1, v0 # Vector-scalar

# vd.mask[i] = carry_out(vs2[i] + imm + v0.mask[i])
vmadc.vim vd, vs2, imm, v0 # Vector-immediate

# vd.mask[i] = carry_out(vs2[i] + vs1[i])
vmadc.vv vd, vs2, vs1 # Vector-vector, no carry-in

# vd.mask[i] = carry_out(vs2[i] + x[rs1])
vmadc.vx vd, vs2, rs1 # Vector-scalar, no carry-in

# vd.mask[i] = carry_out(vs2[i] + imm)
vmadc.vi vd, vs2, imm # Vector-immediate, no carry-in

# Produce difference with borrow.
# vd[i] = vs2[i] - vs1[i] - v0.mask[i]
vsbc.vvm vd, vs2, vs1, v0 # Vector-vector

# vd[i] = vs2[i] - x[rs1] - v0.mask[i]
vsbc.vxm vd, vs2, rs1, v0 # Vector-scalar

# Produce borrow out in mask register format
# vd.mask[i] = borrow_out(vs2[i] - vs1[i] - v0.mask[i])
vmsbc.vvm vd, vs2, vs1, v0 # Vector-vector

# vd.mask[i] = borrow_out(vs2[i] - x[rs1] - v0.mask[i])
vmsbc.vxm vd, vs2, rs1, v0 # Vector-scalar

# vd.mask[i] = borrow_out(vs2[i] - vs1[i])
vmsbc.vv vd, vs2, vs1 # Vector-vector, no borrow-in

# vd.mask[i] = borrow_out(vs2[i] - x[rs1])
vmsbc.vx vd, vs2, rs1 # Vector-scalar, no borrow-in
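
One use of these instructions is multi-word arithmetic: the carry-out mask of the low limbs becomes the carry-in of the high limbs. Below is a per-element Python sketch; the function names mirror the mnemonics but are my own, and each call models a single element.

```python
def vadc(a, b, carry_in, sew):
    """vadc: low SEW bits of a + b + carry_in."""
    return (a + b + carry_in) & ((1 << sew) - 1)

def vmadc(a, b, carry_in, sew):
    """vmadc: carry out of a + b + carry_in, as a 0/1 mask bit."""
    return (a + b + carry_in) >> sew

# Add 0x01ff + 0x0001 using two 8-bit limbs, low limb first.
lo = vadc(0xFF, 0x01, 0, 8)
carry = vmadc(0xFF, 0x01, 0, 8)
hi = vadc(0x01, 0x00, carry, 8)
print(hex(hi << 8 | lo))  # 0x200
```

In real code the carry has to be computed with vmadc before vadc overwrites a source operand, since both need the original inputs.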

Vector Bitwise Logical Instructions

# Bitwise logical operations.
vand.vv vd, vs2, vs1, vm # Vector-vector
vand.vx vd, vs2, rs1, vm # vector-scalar
vand.vi vd, vs2, imm, vm # vector-immediate
vor.vv vd, vs2, vs1, vm # Vector-vector
vor.vx vd, vs2, rs1, vm # vector-scalar
vor.vi vd, vs2, imm, vm # vector-immediate
vxor.vv vd, vs2, vs1, vm # Vector-vector
vxor.vx vd, vs2, rs1, vm # vector-scalar
vxor.vi vd, vs2, imm, vm # vector-immediate

Vector Single-Width Shift Instructions

# Bit shift operations
vsll.vv vd, vs2, vs1, vm # Vector-vector
vsll.vx vd, vs2, rs1, vm # vector-scalar
vsll.vi vd, vs2, uimm, vm # vector-immediate
vsrl.vv vd, vs2, vs1, vm # Vector-vector
vsrl.vx vd, vs2, rs1, vm # vector-scalar
vsrl.vi vd, vs2, uimm, vm # vector-immediate
vsra.vv vd, vs2, vs1, vm # Vector-vector
vsra.vx vd, vs2, rs1, vm # vector-scalar
vsra.vi vd, vs2, uimm, vm # vector-immediate

Vector Narrowing Integer Right Shift Instructions

# Narrowing shift right logical, SEW = (2*SEW) >> SEW
vnsrl.wv vd, vs2, vs1, vm # vector-vector
vnsrl.wx vd, vs2, rs1, vm # vector-scalar
vnsrl.wi vd, vs2, uimm, vm # vector-immediate
# Narrowing shift right arithmetic, SEW = (2*SEW) >> SEW
vnsra.wv vd, vs2, vs1, vm # vector-vector
vnsra.wx vd, vs2, rs1, vm # vector-scalar
vnsra.wi vd, vs2, uimm, vm # vector-immediate

Vector Integer Compare Instructions

# Set if equal
vmseq.vv vd, vs2, vs1, vm # Vector-vector
vmseq.vx vd, vs2, rs1, vm # vector-scalar
vmseq.vi vd, vs2, imm, vm # vector-immediate

# Set if not equal
vmsne.vv vd, vs2, vs1, vm # Vector-vector
vmsne.vx vd, vs2, rs1, vm # vector-scalar
vmsne.vi vd, vs2, imm, vm # vector-immediate

# Set if less than, unsigned
vmsltu.vv vd, vs2, vs1, vm # Vector-vector
vmsltu.vx vd, vs2, rs1, vm # Vector-scalar

# Set if less than, signed
vmslt.vv vd, vs2, vs1, vm # Vector-vector
vmslt.vx vd, vs2, rs1, vm # vector-scalar

# Set if less than or equal, unsigned
vmsleu.vv vd, vs2, vs1, vm # Vector-vector
vmsleu.vx vd, vs2, rs1, vm # vector-scalar
vmsleu.vi vd, vs2, imm, vm # Vector-immediate

# Set if less than or equal, signed
vmsle.vv vd, vs2, vs1, vm # Vector-vector
vmsle.vx vd, vs2, rs1, vm # vector-scalar
vmsle.vi vd, vs2, imm, vm # vector-immediate

# Set if greater than, unsigned
vmsgtu.vx vd, vs2, rs1, vm # Vector-scalar
vmsgtu.vi vd, vs2, imm, vm # Vector-immediate

# Set if greater than, signed
vmsgt.vx vd, vs2, rs1, vm # Vector-scalar
vmsgt.vi vd, vs2, imm, vm # Vector-immediate

# Following two instructions are not provided directly
# Set if greater than or equal, unsigned
# vmsgeu.vx vd, vs2, rs1, vm # Vector-scalar
# Set if greater than or equal, signed
# vmsge.vx vd, vs2, rs1, vm # Vector-scalar

Vector Integer Min/Max Instructions

# Unsigned minimum
vminu.vv vd, vs2, vs1, vm # Vector-vector
vminu.vx vd, vs2, rs1, vm # vector-scalar
# Signed minimum
vmin.vv vd, vs2, vs1, vm # Vector-vector
vmin.vx vd, vs2, rs1, vm # vector-scalar
# Unsigned maximum
vmaxu.vv vd, vs2, vs1, vm # Vector-vector
vmaxu.vx vd, vs2, rs1, vm # vector-scalar
# Signed maximum
vmax.vv vd, vs2, vs1, vm # Vector-vector
vmax.vx vd, vs2, rs1, vm # vector-scalar

Vector Single-Width Integer Multiply Instructions

# Signed multiply, returning low bits of product
vmul.vv vd, vs2, vs1, vm # Vector-vector
vmul.vx vd, vs2, rs1, vm # vector-scalar
# Signed multiply, returning high bits of product
vmulh.vv vd, vs2, vs1, vm # Vector-vector
vmulh.vx vd, vs2, rs1, vm # vector-scalar
# Unsigned multiply, returning high bits of product
vmulhu.vv vd, vs2, vs1, vm # Vector-vector
vmulhu.vx vd, vs2, rs1, vm # vector-scalar
# Signed(vs2)-Unsigned multiply, returning high bits of product
vmulhsu.vv vd, vs2, vs1, vm # Vector-vector
vmulhsu.vx vd, vs2, rs1, vm # vector-scalar

Vector Integer Divide Instructions

# Unsigned divide.
vdivu.vv vd, vs2, vs1, vm # Vector-vector
vdivu.vx vd, vs2, rs1, vm # vector-scalar
# Signed divide
vdiv.vv vd, vs2, vs1, vm # Vector-vector
vdiv.vx vd, vs2, rs1, vm # vector-scalar
# Unsigned remainder
vremu.vv vd, vs2, vs1, vm # Vector-vector
vremu.vx vd, vs2, rs1, vm # vector-scalar
# Signed remainder
vrem.vv vd, vs2, vs1, vm # Vector-vector
vrem.vx vd, vs2, rs1, vm # vector-scalar

Vector Widening Integer Multiply Instructions

# Widening signed-integer multiply
vwmul.vv vd, vs2, vs1, vm # vector-vector
vwmul.vx vd, vs2, rs1, vm # vector-scalar
# Widening unsigned-integer multiply
vwmulu.vv vd, vs2, vs1, vm # vector-vector
vwmulu.vx vd, vs2, rs1, vm # vector-scalar
# Widening signed(vs2)-unsigned integer multiply
vwmulsu.vv vd, vs2, vs1, vm # vector-vector
vwmulsu.vx vd, vs2, rs1, vm # vector-scalar

Vector Single-Width Integer Multiply-Add Instructions

# Integer multiply-add, overwrite addend
vmacc.vv vd, vs1, vs2, vm # vd[i] = +(vs1[i] * vs2[i]) + vd[i]
vmacc.vx vd, rs1, vs2, vm # vd[i] = +(x[rs1] * vs2[i]) + vd[i]
# Integer multiply-sub, overwrite minuend
vnmsac.vv vd, vs1, vs2, vm # vd[i] = -(vs1[i] * vs2[i]) + vd[i]
vnmsac.vx vd, rs1, vs2, vm # vd[i] = -(x[rs1] * vs2[i]) + vd[i]
# Integer multiply-add, overwrite multiplicand
vmadd.vv vd, vs1, vs2, vm # vd[i] = (vs1[i] * vd[i]) + vs2[i]
vmadd.vx vd, rs1, vs2, vm # vd[i] = (x[rs1] * vd[i]) + vs2[i]
# Integer multiply-sub, overwrite multiplicand
vnmsub.vv vd, vs1, vs2, vm # vd[i] = -(vs1[i] * vd[i]) + vs2[i]
vnmsub.vx vd, rs1, vs2, vm # vd[i] = -(x[rs1] * vd[i]) + vs2[i]

Vector Widening Integer Multiply-Add Instructions

# Widening unsigned-integer multiply-add, overwrite addend
vwmaccu.vv vd, vs1, vs2, vm # vd[i] = +(vs1[i] * vs2[i]) + vd[i]
vwmaccu.vx vd, rs1, vs2, vm # vd[i] = +(x[rs1] * vs2[i]) + vd[i]
# Widening signed-integer multiply-add, overwrite addend
vwmacc.vv vd, vs1, vs2, vm # vd[i] = +(vs1[i] * vs2[i]) + vd[i]
vwmacc.vx vd, rs1, vs2, vm # vd[i] = +(x[rs1] * vs2[i]) + vd[i]
# Widening signed-unsigned-integer multiply-add, overwrite addend
vwmaccsu.vv vd, vs1, vs2, vm # vd[i] = +(signed(vs1[i]) * unsigned(vs2[i])) + vd[i]
vwmaccsu.vx vd, rs1, vs2, vm # vd[i] = +(signed(x[rs1]) * unsigned(vs2[i])) + vd[i]
# Widening unsigned-signed-integer multiply-add, overwrite addend
vwmaccus.vx vd, rs1, vs2, vm # vd[i] = +(unsigned(x[rs1]) * signed(vs2[i])) + vd[i]

Vector Integer Merge Instructions

vmerge.vvm vd, vs2, vs1, v0 # vd[i] = v0.mask[i] ? vs1[i] : vs2[i]
vmerge.vxm vd, vs2, rs1, v0 # vd[i] = v0.mask[i] ? x[rs1] : vs2[i]
vmerge.vim vd, vs2, imm, v0 # vd[i] = v0.mask[i] ? imm : vs2[i]

Vector Integer Move Instructions

vmv.v.v vd, vs1 # vd[i] = vs1[i]
vmv.v.x vd, rs1 # vd[i] = x[rs1]
vmv.v.i vd, imm # vd[i] = imm

vector fixed-point arithmetic instructions

The fixed-point arithmetic instructions fall into the following groups.

  • Vector Single-Width Saturating Add and Subtract
  • Vector Single-Width Averaging Add and Subtract
  • Vector Single-Width Fractional Multiply with Rounding and Saturation
  • Vector Single-Width Scaling Shift Instructions
  • Vector Narrowing Fixed-Point Clip Instructions

vector floating-point arithmetic instructions

The floating-point arithmetic instructions fall into the following groups.

  • Vector Floating-Point Exception Flags
  • Vector Single-Width Floating-Point Add/Subtract Instructions
  • Vector Widening Floating-Point Add/Subtract Instructions
  • Vector Single-Width Floating-Point Multiply/Divide Instructions
  • Vector Widening Floating-Point Multiply
  • Vector Single-Width Floating-Point Fused Multiply-Add Instructions
  • Vector Widening Floating-Point Fused Multiply-Add Instructions
  • Vector Floating-Point Square-Root Instruction
  • Vector Floating-Point Reciprocal Square-Root Estimate Instruction
  • Vector Floating-Point Reciprocal Estimate Instruction
  • Vector Floating-Point MIN/MAX Instructions
  • Vector Floating-Point Sign-Injection Instructions
  • Vector Floating-Point Compare Instructions
  • Vector Floating-Point Classify Instruction
  • Vector Floating-Point Merge Instruction
  • Vector Floating-Point Move Instruction
  • Single-Width Floating-Point/Integer Type-Convert Instructions
  • Widening Floating-Point/Integer Type-Convert Instructions
  • Narrowing Floating-Point/Integer Type-Convert Instructions

vector reduction instructions

The reduction instructions fall into the following groups.

  • Vector Single-Width Integer Reduction Instructions
  • Vector Widening Integer Reduction Instructions
  • Vector Single-Width Floating-Point Reduction Instructions
  • Vector Widening Floating-Point Reduction Instructions

These instructions share the following property:

  • A reduction raises an illegal instruction exception if vstart != 0.

Vector Single-Width Integer Reduction Instructions

The source and destination element widths of these instructions are both SEW.

# Simple reductions, where [*] denotes all active elements:
vredsum.vs vd, vs2, vs1, vm # vd[0] = sum( vs1[0] , vs2[*] )
vredmaxu.vs vd, vs2, vs1, vm # vd[0] = maxu( vs1[0] , vs2[*] )
vredmax.vs vd, vs2, vs1, vm # vd[0] = max( vs1[0] , vs2[*] )
vredminu.vs vd, vs2, vs1, vm # vd[0] = minu( vs1[0] , vs2[*] )
vredmin.vs vd, vs2, vs1, vm # vd[0] = min( vs1[0] , vs2[*] )
vredand.vs vd, vs2, vs1, vm # vd[0] = and( vs1[0] , vs2[*] )
vredor.vs vd, vs2, vs1, vm # vd[0] = or( vs1[0] , vs2[*] )
vredxor.vs vd, vs2, vs1, vm # vd[0] = xor( vs1[0] , vs2[*] )
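
As a rough model of the sum reduction, here is a Python sketch. `vredsum` is named after the mnemonic but is my own function; `mask` is a hypothetical list of v0 bits (or `None` when unmasked), and the wrap-around at SEW bits reflects the single-width behaviour described above.

```python
def vredsum(vs1_0, vs2, sew, mask=None):
    """vd[0] = vs1[0] + sum of the active elements of vs2, truncated to SEW bits."""
    total = vs1_0
    for i, x in enumerate(vs2):
        if mask is None or mask[i]:
            total += x
    return total & ((1 << sew) - 1)

print(vredsum(10, [1, 2, 3, 4], 8))                     # 20
print(vredsum(10, [1, 2, 3, 4], 8, mask=[1, 0, 1, 0]))  # 14
print(vredsum(250, [10], 8))                            # 4
```

The scalar seed comes from element 0 of vs1, which makes it easy to chain reductions across strip-mined chunks.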

Vector Widening Integer Reduction Instructions

# Unsigned sum reduction into double-width accumulator
vwredsumu.vs vd, vs2, vs1, vm # 2*SEW = 2*SEW + sum(zero-extend(SEW))
# Signed sum reduction into double-width accumulator
vwredsum.vs vd, vs2, vs1, vm # 2*SEW = 2*SEW + sum(sign-extend(SEW))

Vector Single-Width Floating-Point Reduction Instructions

# Simple reductions.
vfredosum.vs vd, vs2, vs1, vm # Ordered sum
vfredusum.vs vd, vs2, vs1, vm # Unordered sum
vfredmax.vs vd, vs2, vs1, vm # Maximum value
vfredmin.vs vd, vs2, vs1, vm # Minimum value

Vector Widening Floating-Point Reduction Instructions

# Simple reductions.
vfwredosum.vs vd, vs2, vs1, vm # Ordered sum
vfwredusum.vs vd, vs2, vs1, vm # Unordered sum

vector mask instructions

The mask instructions fall into the following groups.

  • Vector Mask-Register Logical Instructions
  • Vector count population in mask vcpop.m
  • vfirst find-first-set mask bit
  • vmsbf.m set-before-first mask bit
  • vmsif.m set-including-first mask bit
  • vmsof.m set-only-first mask bit
  • Vector Iota Instruction
  • Vector Element Index Instruction

Vector Mask-Register Logical Instructions

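These instructions have the following properties:

  • Every element of a mask register is 1 bit wide.
  • They always operate on a single vector register, independent of vlmul in vtype, and they do not change vlmul.
  • They are always unmasked, so there are no inactive elements.

vmand.mm vd, vs2, vs1 # vd.mask[i] = vs2.mask[i] && vs1.mask[i]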
vmnand.mm vd, vs2, vs1 # vd.mask[i] = !(vs2.mask[i] && vs1.mask[i])
vmandn.mm vd, vs2, vs1 # vd.mask[i] = vs2.mask[i] && !vs1.mask[i]
vmxor.mm vd, vs2, vs1 # vd.mask[i] = vs2.mask[i] ^^ vs1.mask[i]
vmor.mm vd, vs2, vs1 # vd.mask[i] = vs2.mask[i] || vs1.mask[i]
vmnor.mm vd, vs2, vs1 # vd.mask[i] = !(vs2.mask[i] || vs1.mask[i])
vmorn.mm vd, vs2, vs1 # vd.mask[i] = vs2.mask[i] || !vs1.mask[i]
vmxnor.mm vd, vs2, vs1 # vd.mask[i] = !(vs2.mask[i] ^^ vs1.mask[i])

Vector count population in mask vcpop.m

vcpop.m rd, vs2, v0.t # x[rd] = sum_i ( vs2.mask[i] && v0.mask[i] )

vfirst find-first-set mask bit

vfirst.m rd, vs2, vm

vmsbf.m set-before-first mask bit

vmsbf.m vd, vs2, vm

vmsif.m set-including-first mask bit

vmsif.m vd, vs2, vm

vmsof.m set-only-first mask bit

vmsof.m vd, vs2, vm
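
The four find-first style instructions are easiest to see side by side. The Python sketch below models the unmasked case on a list of mask bits with element 0 first; the function names follow the mnemonics but are my own.

```python
def vfirst(bits):
    """Index of the first set bit, or -1 if none is set."""
    return bits.index(1) if 1 in bits else -1

def vmsbf(bits):
    """Set-before-first: 1 for every element before the first set bit."""
    f = vfirst(bits)
    n = len(bits) if f < 0 else f
    return [1 if i < n else 0 for i in range(len(bits))]

def vmsif(bits):
    """Set-including-first: like vmsbf, but the first set bit is included."""
    f = vfirst(bits)
    n = len(bits) if f < 0 else f + 1
    return [1 if i < n else 0 for i in range(len(bits))]

def vmsof(bits):
    """Set-only-first: 1 only at the first set bit."""
    f = vfirst(bits)
    return [1 if i == f else 0 for i in range(len(bits))]

bits = [0, 0, 1, 0, 1]
print(vfirst(bits))  # 2
print(vmsbf(bits))   # [1, 1, 0, 0, 0]
print(vmsif(bits))   # [1, 1, 1, 0, 0]
print(vmsof(bits))   # [0, 0, 1, 0, 0]
```

When no bit is set, vmsbf and vmsif produce all ones and vmsof produces all zeros, which the sketch handles through the `f < 0` branch.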

Vector Iota Instruction

viota.m vd, vs2, vm

Vector Element Index Instruction

vid.v vd, vm # Write element ID to destination.
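
For viota.m and vid.v, a short Python sketch of the unmasked case may help. Again the function names are mine and lists hold element 0 first.

```python
def viota(bits):
    """vd[i] = number of set mask bits in vs2 at indices below i."""
    out, count = [], 0
    for b in bits:
        out.append(count)
        count += b
    return out

def vid(vl):
    """vd[i] = i for each element index below vl."""
    return list(range(vl))

print(viota([1, 0, 0, 1, 0, 1]))  # [0, 1, 1, 1, 2, 2]
print(vid(4))                     # [0, 1, 2, 3]
```

viota is an exclusive prefix sum over the mask bits, so the value at each set bit is the position that element would take after compaction.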

vector permutation instructions

The permutation instructions fall into the following groups.

  • Integer Scalar Move Instructions
  • Floating-Point Scalar Move Instructions
  • Vector Slide Instructions
  • Vector Register Gather Instructions
  • Vector Compress Instruction
  • Whole Vector Register Move

Integer Scalar Move Instructions

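These instructions ignore LMUL and move a single element of SEW width.

For vmv.x.s, if SEW > XLEN, only the low XLEN bits are moved.

For vmv.x.s, if SEW < XLEN, the value is sign-extended to XLEN.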

vmv.x.s rd, vs2 # x[rd] = vs2[0] (vs1=0)
vmv.s.x vd, rs1 # vd[0] = x[rs1] (vs2=0)

Floating-Point Scalar Move Instructions

vfmv.f.s rd, vs2 # f[rd] = vs2[0] (rs1=0)
vfmv.s.f vd, rs1 # vd[0] = f[rs1] (vs2=0)

Vector Slide Instructions

# slideup
vslideup.vx vd, vs2, rs1, vm # vd[i+rs1] = vs2[i]
vslideup.vi vd, vs2, uimm, vm # vd[i+uimm] = vs2[i]

# slidedown
vslidedown.vx vd, vs2, rs1, vm # vd[i] = vs2[i+rs1]
vslidedown.vi vd, vs2, uimm, vm # vd[i] = vs2[i+uimm]

# slide1up
vslide1up.vx vd, vs2, rs1, vm # vd[0]=x[rs1], vd[i+1] = vs2[i]
vfslide1up.vf vd, vs2, rs1, vm # vd[0]=f[rs1], vd[i+1] = vs2[i]

# slide1down
vslide1down.vx vd, vs2, rs1, vm # vd[i] = vs2[i+1], vd[vl-1]=x[rs1]
vfslide1down.vf vd, vs2, rs1, vm # vd[i] = vs2[i+1], vd[vl-1]=f[rs1]
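
The slide instructions can be sketched in Python on plain lists. This is a simplified model under stated assumptions: unmasked, vstart=0, and vl equal to VLMAX (taken as the list length). The function names follow the mnemonics but are my own.

```python
def vslideup(vd_old, vs2, offset):
    """vd[i+offset] = vs2[i]; elements below offset keep their old value."""
    return [vd_old[i] if i < offset else vs2[i - offset] for i in range(len(vs2))]

def vslidedown(vs2, offset):
    """vd[i] = vs2[i+offset]; reads past the end of the source give 0."""
    n = len(vs2)
    return [vs2[i + offset] if i + offset < n else 0 for i in range(n)]

def vslide1up(vs2, scalar):
    """vd[0] = scalar, vd[i+1] = vs2[i]."""
    return [scalar] + vs2[:-1]

def vslide1down(vs2, scalar):
    """vd[i] = vs2[i+1], vd[vl-1] = scalar."""
    return vs2[1:] + [scalar]

vs2 = [10, 11, 12, 13]
print(vslideup([0, 1, 2, 3], vs2, 1))  # [0, 10, 11, 12]
print(vslidedown(vs2, 1))              # [11, 12, 13, 0]
print(vslide1up(vs2, 99))              # [99, 10, 11, 12]
print(vslide1down(vs2, 99))            # [11, 12, 13, 99]
```

The asymmetry is worth noting: slideup leaves the low destination elements untouched, while slidedown fills with zeros once the source index runs past VLMAX.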

Vector Register Gather Instructions

vrgather.vv vd, vs2, vs1, vm # vd[i] = (vs1[i] >= VLMAX) ? 0 : vs2[vs1[i]];
vrgatherei16.vv vd, vs2, vs1, vm # vd[i] = (vs1[i] >= VLMAX) ? 0 : vs2[vs1[i]];

vrgather.vx vd, vs2, rs1, vm # vd[i] = (x[rs1] >= VLMAX) ? 0 : vs2[x[rs1]]
vrgather.vi vd, vs2, uimm, vm # vd[i] = (uimm >= VLMAX) ? 0 : vs2[uimm]

Vector Compress Instruction

vcompress.vm vd, vs2, vs1 # Compress into vd elements of vs2 where vs1 is enabled
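
Gather and compress are both data-dependent permutations, and a Python sketch on lists shows the difference. The function names mirror the mnemonics but are my own; `vcompress` returns only the packed elements, and whatever follows them in the real destination register is treated as tail.

```python
def vrgather(vs2, indices, vlmax):
    """vd[i] = vs2[indices[i]], or 0 when the index is out of range."""
    return [vs2[j] if j < vlmax else 0 for j in indices]

def vcompress(vs2, mask):
    """Pack the elements of vs2 whose mask bit is set, in order."""
    return [x for x, m in zip(vs2, mask) if m]

vs2 = [10, 11, 12, 13]
print(vrgather(vs2, [3, 0, 7, 1], 4))  # [13, 10, 0, 11]
print(vcompress(vs2, [1, 0, 1, 1]))    # [10, 12, 13]
```

Gather reads from arbitrary positions chosen by an index vector, while compress keeps the original order and only drops the deselected elements.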

Whole Vector Register Move

vmv<nr>r.v vd, vs2 # General form

vmv1r.v v1, v2 # Copy v1=v2
vmv2r.v v10, v12 # Copy v10=v12; v11=v13
vmv4r.v v4, v8 # Copy v4=v8; v5=v9; v6=v10; v7=v11
vmv8r.v v0, v8 # Copy v0=v8; v1=v9; ...; v7=v15