当前位置：首页 > news >正文

FasterNet中Pconv的实现、效果与作用分析

news 2025/7/11 7:44:09

发表时间：2023年3月7日
论文地址：https://arxiv.org/abs/2303.03667
项目地址：https://github.com/JierunChen/FasterNet
在这里插入图片描述

FasterNet-t0在GPU、CPU和ARM处理器上分别比MobileViT-XXS快2.8×、3.3×和2.4×，而准确率要高2.9%。我们的大型FasterNet-L实现了令人印象深刻的83.5%的前1精度，与新兴的Swin-B相当，同时在GPU上有36%的推理吞吐量，并在CPU上节省了37%的计算时间。FasterNet作者提到的其核心在于PConv模块，其不仅减少了FLOPs（降低了冗余计算，其与ghostnet一样，认为conv中存在冗余），同时降低了mac（大部分输入直达输入），故而在取得了高性能的延时能力，如在gpu上fps高，在cpu与arm设备上延时最低。为此对PConv的设计与实现进行深入分析。

1、论文信息

1.1 模块设计

Pconv与常规卷积、分组卷积相比，只对输入通道的少部分做密集卷积（常规卷积），剩余部分直通到输出。该操作大幅度降低了卷积的运算量（如将输入通道分成4份，只对其中一份进行卷积，剩余的3份直通到下一层），也降低了内存访问成本（如C_in为400，只对其四分之一进行卷积，内存访问则为100wh+100wh，内存访问成本为200wh，为原来的1/4）
在这里插入图片描述

Pconv对应实现代码如下所示，可以看到就是split=》conv=》cat操作

class Partial_conv3(nn.Module):def __init__(self, dim, n_div, forward):super().__init__()self.dim_conv3 = dim // n_divself.dim_untouched = dim - self.dim_conv3self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)if forward == 'slicing':self.forward = self.forward_slicingelif forward == 'split_cat':self.forward = self.forward_split_catelse:raise NotImplementedErrordef forward_slicing(self, x: Tensor) -> Tensor:# only for inferencex = x.clone()   # !!! Keep the original input intact for the residual connection laterx[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])return xdef forward_split_cat(self, x: Tensor) -> Tensor:# for training/inferencex1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)x1 = self.partial_conv3(x1)x = torch.cat((x1, x2), 1)return x

在论文中提到了与PWcov结合、或是T-shaped Conv，然而在代码层面实际上跟PConv没有任何关系。只是在FasterNet Block中与Conv1x1进行结合conv1x1实现通道间信息交互
在这里插入图片描述

1.2 模型结构

Faster的模型结构如下所示，可以看到Pconv只是其中的一小部分。作者将Pconv与conv1x1+BN+Relu+残差联合在一起形成FasterNet Block，FasterNet Block才是模型的主要成分。然后模型中参考了VIT模型设计中的很多设计（如PatchEmbed、mlp），只是没有Transformer模块。

PatchEmbed在模型输入层中可以看到，而mlp操作其实就是Pconv后面的Conv1x1+bn+relu+Conv1x1
在这里插入图片描述
具体模型结构如下所示（一共有t0、t1、t2、s、m、l等版本，可以看到数据在经过Embedding层后即完成了1/4下采样；后续的每一个Stage（即FasterNet Block）仅是实现特征提取；最后经过Merging层（即conv2+bn层）实现对数据的下采样
在这里插入图片描述

1.3 结构对比

模块性能对比 这里对比了conv、分组卷积、深度分离卷积、PConv。对应的feature map在像素点量上是逐步减半的（如：96x56x56的像素量是192x28x28的一半），可以发现只有DWConv的FLOPs是减半，其他方法是没有减少的。这里可以发现，DWConv是性价比最高的结构，PConv是第二的（观察fps与latency）。唯独在ARM (Cortex-A72,using a single thread)架构下，PConv比DWConv要强

注：1、PConv在r为1/4时，FLOPs与group为1/16的分组卷积是一样的，但内存访问量是不同的。
注：2、DWConv是全分组卷积（ksize为3，分组数为通道数，仅实现空间信息交互）+点卷积组成（ksize为1，实现通道信息交互）

作者通过对Conv进行拟合，发现PConv是loss最低的。这里是因为GConv与PConv都无法实现全局的通道信息交互，所以需要PWConv。然后为了同等对比，所以DWConv也被迫加上了一个PWConv，这些loss在值差异上只有0.001~0.002，实际上是没有区别的，具体参考ddb_conv、RepConv进行融合输出值差异

内存访问成本对比： 公式2是Pconv的，公式3是conv的，但c’是c的1/4，故而说Pconv的内存访问成本是conv的1/4 这里是假定了模型输入输出的通道数都为c，所以是2c，否则是(c_in+c_out)

1.3 模型效果

宏观对比如下，可以发现FasterNet在GPU上达到了最高的fps，在cpu与arm上达到了最低的延时。
在这里插入图片描述
以下图表表示了FasterNet在轻量级与重量级模型中都取得了最近性能。

2、代码实现与分析

2.1 Pconv代码

Pconv的实现代码经过简化后如下所示，可以发现就是简单的split+cat操作。23年博主也做过类似尝试（用pconv全量替换掉conv），并没有训练出好效果

class Partial_conv3(nn.Module):def __init__(self, dim, n_div, forward):super().__init__()self.dim_conv3 = dim // n_divself.dim_untouched = dim - self.dim_conv3self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)def forward(self, x: Tensor) -> Tensor:# only for inferencex = x.clone()   # !!! Keep the original input intact for the residual connection laterx[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])return x

2.2 Faster Block代码

spatial_mixing对象为pconv层
mlp对象为Faster Block模块中的非pconv层
forword代码如下：

    def forward(self, x: Tensor) -> Tensor:shortcut = xx = self.spatial_mixing(x)x = shortcut + self.drop_path(self.mlp(x))return x

完整实现代码如下

class MLPBlock(nn.Module):def __init__(self,dim,n_div,mlp_ratio,drop_path,layer_scale_init_value,act_layer,norm_layer,pconv_fw_type):super().__init__()self.dim = dimself.mlp_ratio = mlp_ratioself.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()self.n_div = n_divmlp_hidden_dim = int(dim * mlp_ratio)mlp_layer: List[nn.Module] = [nn.Conv2d(dim, mlp_hidden_dim, 1, bias=False),norm_layer(mlp_hidden_dim),act_layer(),nn.Conv2d(mlp_hidden_dim, dim, 1, bias=False)]self.mlp = nn.Sequential(*mlp_layer)self.spatial_mixing = Partial_conv3(dim,n_div,pconv_fw_type)if layer_scale_init_value > 0:self.layer_scale = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)self.forward = self.forward_layer_scaleelse:self.forward = self.forwarddef forward(self, x: Tensor) -> Tensor:shortcut = xx = self.spatial_mixing(x)x = shortcut + self.drop_path(self.mlp(x))return xdef forward_layer_scale(self, x: Tensor) -> Tensor:shortcut = xx = self.spatial_mixing(x)x = shortcut + self.drop_path(self.layer_scale.unsqueeze(-1).unsqueeze(-1) * self.mlp(x))return x

此外还有一个BasicStage类，其主要就是实现多层MLPBlock（即Faster Block）的堆叠

2.3 PatchEmbed与PatchMerging

PatchEmbed是类似于vit模型中的图像切patch，将空间信息转移到通道上。
PatchMerging是基于conv的stride实现特征图的分辨率降低，同时实现通道的增加。


class PatchEmbed(nn.Module):def __init__(self, patch_size, patch_stride, in_chans, embed_dim, norm_layer):super().__init__()self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_stride, bias=False)if norm_layer is not None:self.norm = norm_layer(embed_dim)else:self.norm = nn.Identity()def forward(self, x: Tensor) -> Tensor:x = self.norm(self.proj(x))return xclass PatchMerging(nn.Module):def __init__(self, patch_size2, patch_stride2, dim, norm_layer):super().__init__()self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=patch_size2, stride=patch_stride2, bias=False)if norm_layer is not None:self.norm = norm_layer(2 * dim)else:self.norm = nn.Identity()def forward(self, x: Tensor) -> Tensor:x = self.norm(self.reduction(x))return x

2.4 模型代码

class FasterNet(nn.Module):def __init__(self,in_chans=3,num_classes=1000,embed_dim=96,depths=(1, 2, 8, 2),mlp_ratio=2.,n_div=4,patch_size=4,patch_stride=4,patch_size2=2,  # for subsequent layerspatch_stride2=2,patch_norm=True,feature_dim=1280,drop_path_rate=0.1,layer_scale_init_value=0,norm_layer='BN',act_layer='RELU',fork_feat=False,init_cfg=None,pretrained=None,pconv_fw_type='split_cat',**kwargs):super().__init__()if norm_layer == 'BN':norm_layer = nn.BatchNorm2delse:raise NotImplementedErrorif act_layer == 'GELU':act_layer = nn.GELUelif act_layer == 'RELU':act_layer = partial(nn.ReLU, inplace=True)else:raise NotImplementedErrorif not fork_feat:self.num_classes = num_classesself.num_stages = len(depths)self.embed_dim = embed_dimself.patch_norm = patch_normself.num_features = int(embed_dim * 2 ** (self.num_stages - 1))self.mlp_ratio = mlp_ratioself.depths = depths# split image into non-overlapping patchesself.patch_embed = PatchEmbed(patch_size=patch_size,patch_stride=patch_stride,in_chans=in_chans,embed_dim=embed_dim,norm_layer=norm_layer if self.patch_norm else None)# stochastic depth decay ruledpr = [x.item()for x in torch.linspace(0, drop_path_rate, sum(depths))]# build layersstages_list = []for i_stage in range(self.num_stages):stage = BasicStage(dim=int(embed_dim * 2 ** i_stage),n_div=n_div,depth=depths[i_stage],mlp_ratio=self.mlp_ratio,drop_path=dpr[sum(depths[:i_stage]):sum(depths[:i_stage + 1])],layer_scale_init_value=layer_scale_init_value,norm_layer=norm_layer,act_layer=act_layer,pconv_fw_type=pconv_fw_type)stages_list.append(stage)# patch merging layerif i_stage < self.num_stages - 1:stages_list.append(PatchMerging(patch_size2=patch_size2,patch_stride2=patch_stride2,dim=int(embed_dim * 2 ** i_stage),norm_layer=norm_layer))self.stages = nn.Sequential(*stages_list)self.fork_feat = fork_featif self.fork_feat:self.forward = self.forward_det# add a norm layer for each outputself.out_indices = [0, 2, 4, 6]for i_emb, i_layer in enumerate(self.out_indices):if i_emb == 0 and os.environ.get('FORK_LAST3', None):raise NotImplementedErrorelse:layer = norm_layer(int(embed_dim * 2 ** i_emb))layer_name = f'norm{i_layer}'self.add_module(layer_name, layer)else:self.forward = self.forward_cls# Classifier headself.avgpool_pre_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),nn.Conv2d(self.num_features, feature_dim, 1, bias=False),act_layer())self.head = nn.Linear(feature_dim, num_classes) \if num_classes > 0 else nn.Identity()self.apply(self.cls_init_weights)self.init_cfg = copy.deepcopy(init_cfg)if self.fork_feat and (self.init_cfg is not None or pretrained is not None):self.init_weights()def cls_init_weights(self, m):if isinstance(m, nn.Linear):trunc_normal_(m.weight, std=.02)if isinstance(m, nn.Linear) and m.bias is not None:nn.init.constant_(m.bias, 0)elif isinstance(m, (nn.Conv1d, nn.Conv2d)):trunc_normal_(m.weight, std=.02)if m.bias is not None:nn.init.constant_(m.bias, 0)elif isinstance(m, (nn.LayerNorm, nn.GroupNorm)):nn.init.constant_(m.bias, 0)nn.init.constant_(m.weight, 1.0)# init for mmdetection by loading imagenet pre-trained weightsdef init_weights(self, pretrained=None):logger = get_root_logger()if self.init_cfg is None and pretrained is None:logger.warn(f'No pre-trained weights for 'f'{self.__class__.__name__}, 'f'training start from scratch')passelse:assert 'checkpoint' in self.init_cfg, f'Only support ' \f'specify `Pretrained` in ' \f'`init_cfg` in ' \f'{self.__class__.__name__} 'if self.init_cfg is not None:ckpt_path = self.init_cfg['checkpoint']elif pretrained is not None:ckpt_path = pretrainedckpt = _load_checkpoint(ckpt_path, logger=logger, map_location='cpu')if 'state_dict' in ckpt:_state_dict = ckpt['state_dict']elif 'model' in ckpt:_state_dict = ckpt['model']else:_state_dict = ckptstate_dict = _state_dictmissing_keys, unexpected_keys = \self.load_state_dict(state_dict, False)# show for debugprint('missing_keys: ', missing_keys)print('unexpected_keys: ', unexpected_keys)def forward_cls(self, x):# output only the features of last layer for image classificationx = self.patch_embed(x)x = self.stages(x)x = self.avgpool_pre_head(x)  # B C 1 1x = torch.flatten(x, 1)x = self.head(x)return xdef forward_det(self, x: Tensor) -> Tensor:# output the features of four stages for dense predictionx = self.patch_embed(x)outs = []for idx, stage in enumerate(self.stages):x = stage(x)if self.fork_feat and idx in self.out_indices:norm_layer = getattr(self, f'norm{idx}')x_out = norm_layer(x)outs.append(x_out)return outs

2.5 完整模型代码

完整模型代码只是用于3.2中的FLOPs分析

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import torch
import torch.nn as nn
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from functools import partial
from typing import List
from torch import Tensor
import copy
import ostry:from mmdet.models.builder import BACKBONES as det_BACKBONESfrom mmdet.utils import get_root_loggerfrom mmcv.runner import _load_checkpointhas_mmdet = True
except ImportError:print("If for detection, please install mmdetection first")has_mmdet = Falseclass Partial_conv3(nn.Module):def __init__(self, dim, n_div, forward):super().__init__()self.dim_conv3 = dim // n_divself.dim_untouched = dim - self.dim_conv3self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)if forward == 'slicing':self.forward = self.forward_slicingelif forward == 'split_cat':self.forward = self.forward_split_catelse:raise NotImplementedErrordef forward_slicing(self, x: Tensor) -> Tensor:# only for inferencex = x.clone()   # !!! Keep the original input intact for the residual connection laterx[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])return xdef forward_split_cat(self, x: Tensor) -> Tensor:# for training/inferencex1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)x1 = self.partial_conv3(x1)x = torch.cat((x1, x2), 1)return xclass MLPBlock(nn.Module):def __init__(self,dim,n_div,mlp_ratio,drop_path,layer_scale_init_value,act_layer,norm_layer,pconv_fw_type):super().__init__()self.dim = dimself.mlp_ratio = mlp_ratioself.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()self.n_div = n_divmlp_hidden_dim = int(dim * mlp_ratio)mlp_layer: List[nn.Module] = [nn.Conv2d(dim, mlp_hidden_dim, 1, bias=False),norm_layer(mlp_hidden_dim),act_layer(),nn.Conv2d(mlp_hidden_dim, dim, 1, bias=False)]self.mlp = nn.Sequential(*mlp_layer)self.spatial_mixing = Partial_conv3(dim,n_div,pconv_fw_type)if layer_scale_init_value > 0:self.layer_scale = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)self.forward = self.forward_layer_scaleelse:self.forward = self.forwarddef forward(self, x: Tensor) -> Tensor:shortcut = xx = self.spatial_mixing(x)x = shortcut + self.drop_path(self.mlp(x))return xdef forward_layer_scale(self, x: Tensor) -> Tensor:shortcut = xx = self.spatial_mixing(x)x = shortcut + self.drop_path(self.layer_scale.unsqueeze(-1).unsqueeze(-1) * self.mlp(x))return xclass BasicStage(nn.Module):def __init__(self,dim,depth,n_div,mlp_ratio,drop_path,layer_scale_init_value,norm_layer,act_layer,pconv_fw_type):super().__init__()blocks_list = [MLPBlock(dim=dim,n_div=n_div,mlp_ratio=mlp_ratio,drop_path=drop_path[i],layer_scale_init_value=layer_scale_init_value,norm_layer=norm_layer,act_layer=act_layer,pconv_fw_type=pconv_fw_type)for i in range(depth)]self.blocks = nn.Sequential(*blocks_list)def forward(self, x: Tensor) -> Tensor:x = self.blocks(x)return xclass PatchEmbed(nn.Module):def __init__(self, patch_size, patch_stride, in_chans, embed_dim, norm_layer):super().__init__()self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_stride, bias=False)if norm_layer is not None:self.norm = norm_layer(embed_dim)else:self.norm = nn.Identity()def forward(self, x: Tensor) -> Tensor:x = self.norm(self.proj(x))return xclass PatchMerging(nn.Module):def __init__(self, patch_size2, patch_stride2, dim, norm_layer):super().__init__()self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=patch_size2, stride=patch_stride2, bias=False)if norm_layer is not None:self.norm = norm_layer(2 * dim)else:self.norm = nn.Identity()def forward(self, x: Tensor) -> Tensor:x = self.norm(self.reduction(x))return xclass FasterNet(nn.Module):def __init__(self,in_chans=3,num_classes=1000,embed_dim=96,depths=(1, 2, 8, 2),mlp_ratio=2.,n_div=4,patch_size=4,patch_stride=4,patch_size2=2,  # for subsequent layerspatch_stride2=2,patch_norm=True,feature_dim=1280,drop_path_rate=0.1,layer_scale_init_value=0,norm_layer='BN',act_layer='RELU',fork_feat=False,init_cfg=None,pretrained=None,pconv_fw_type='split_cat',**kwargs):super().__init__()if norm_layer == 'BN':norm_layer = nn.BatchNorm2delse:raise NotImplementedErrorif act_layer == 'GELU':act_layer = nn.GELUelif act_layer == 'RELU':act_layer = partial(nn.ReLU, inplace=True)else:raise NotImplementedErrorif not fork_feat:self.num_classes = num_classesself.num_stages = len(depths)self.embed_dim = embed_dimself.patch_norm = patch_normself.num_features = int(embed_dim * 2 ** (self.num_stages - 1))self.mlp_ratio = mlp_ratioself.depths = depths# split image into non-overlapping patchesself.patch_embed = PatchEmbed(patch_size=patch_size,patch_stride=patch_stride,in_chans=in_chans,embed_dim=embed_dim,norm_layer=norm_layer if self.patch_norm else None)# stochastic depth decay ruledpr = [x.item()for x in torch.linspace(0, drop_path_rate, sum(depths))]# build layersstages_list = []for i_stage in range(self.num_stages):stage = BasicStage(dim=int(embed_dim * 2 ** i_stage),n_div=n_div,depth=depths[i_stage],mlp_ratio=self.mlp_ratio,drop_path=dpr[sum(depths[:i_stage]):sum(depths[:i_stage + 1])],layer_scale_init_value=layer_scale_init_value,norm_layer=norm_layer,act_layer=act_layer,pconv_fw_type=pconv_fw_type)stages_list.append(stage)# patch merging layerif i_stage < self.num_stages - 1:stages_list.append(PatchMerging(patch_size2=patch_size2,patch_stride2=patch_stride2,dim=int(embed_dim * 2 ** i_stage),norm_layer=norm_layer))self.stages = nn.Sequential(*stages_list)self.fork_feat = fork_featif self.fork_feat:self.forward = self.forward_det# add a norm layer for each outputself.out_indices = [0, 2, 4, 6]for i_emb, i_layer in enumerate(self.out_indices):if i_emb == 0 and os.environ.get('FORK_LAST3', None):raise NotImplementedErrorelse:layer = norm_layer(int(embed_dim * 2 ** i_emb))layer_name = f'norm{i_layer}'self.add_module(layer_name, layer)else:self.forward = self.forward_cls# Classifier headself.avgpool_pre_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),nn.Conv2d(self.num_features, feature_dim, 1, bias=False),act_layer())self.head = nn.Linear(feature_dim, num_classes) \if num_classes > 0 else nn.Identity()self.apply(self.cls_init_weights)self.init_cfg = copy.deepcopy(init_cfg)if self.fork_feat and (self.init_cfg is not None or pretrained is not None):self.init_weights()def cls_init_weights(self, m):if isinstance(m, nn.Linear):trunc_normal_(m.weight, std=.02)if isinstance(m, nn.Linear) and m.bias is not None:nn.init.constant_(m.bias, 0)elif isinstance(m, (nn.Conv1d, nn.Conv2d)):trunc_normal_(m.weight, std=.02)if m.bias is not None:nn.init.constant_(m.bias, 0)elif isinstance(m, (nn.LayerNorm, nn.GroupNorm)):nn.init.constant_(m.bias, 0)nn.init.constant_(m.weight, 1.0)# init for mmdetection by loading imagenet pre-trained weightsdef init_weights(self, pretrained=None):logger = get_root_logger()if self.init_cfg is None and pretrained is None:logger.warn(f'No pre-trained weights for 'f'{self.__class__.__name__}, 'f'training start from scratch')passelse:assert 'checkpoint' in self.init_cfg, f'Only support ' \f'specify `Pretrained` in ' \f'`init_cfg` in ' \f'{self.__class__.__name__} 'if self.init_cfg is not None:ckpt_path = self.init_cfg['checkpoint']elif pretrained is not None:ckpt_path = pretrainedckpt = _load_checkpoint(ckpt_path, logger=logger, map_location='cpu')if 'state_dict' in ckpt:_state_dict = ckpt['state_dict']elif 'model' in ckpt:_state_dict = ckpt['model']else:_state_dict = ckptstate_dict = _state_dictmissing_keys, unexpected_keys = \self.load_state_dict(state_dict, False)# show for debugprint('missing_keys: ', missing_keys)print('unexpected_keys: ', unexpected_keys)def forward_cls(self, x):# output only the features of last layer for image classificationx = self.patch_embed(x)x = self.stages(x)x = self.avgpool_pre_head(x)  # B C 1 1x = torch.flatten(x, 1)x = self.head(x)return xdef forward_det(self, x: Tensor) -> Tensor:# output the features of four stages for dense predictionx = self.patch_embed(x)outs = []for idx, stage in enumerate(self.stages):x = stage(x)if self.fork_feat and idx in self.out_indices:norm_layer = getattr(self, f'norm{idx}')x_out = norm_layer(x)outs.append(x_out)return outs

3、相关分析

3.1 PConv可以取代Conv么？

不可以，其仅是实现了对于C_in与C_out相等时，conv的平替；同时，其只有局部空间信息的交互，大部分通道数据是直连输出，因此会是输入数据直传到网络深层。故而需要密集全连接的卷积层进行通道间信息交互。

在整个论文实验中，也没有将FasterNet中pconv替换为Conv的对比，pconv。或许FasterNet的优势仅是因为其结构设计（尤其是对输入进行PatchEmbed，将空间大小降低为原来的1/16），也就是是使用Conv替代pconv，在acc与延时上或许依旧占据优势。

同样，对于PWConv也没有等效对比，将FasterNet中pconv替换为PWConv或许还能再度迎来性能提升。毕竟在作者实验中，PWConv在gpu上推理速度比pconv更具优势，拟合能力与pconv不相上下。

3.2 FasterNet中的FLOPs分布

基于以下代码构建了一个简易的FasterNet模型，并输出了每一层的flops

if __name__=="__main__":model=FasterNet( depths=(1, 1, 1, 1),)from fvcore.nn import flop_count_table, FlopCountAnalysis, ActivationCountAnalysis    x = torch.randn(1, 3, 256, 256)# model = SAFMN(dim=36, n_blocks=12, ffn_scale=2.0, upscaling_factor=2)print(f'params: {sum(map(lambda x: x.numel(), model.parameters()))}')print(flop_count_table(FlopCountAnalysis(model, x), activations=ActivationCountAnalysis(model, x)))output = model(x)print(output.shape)

代码运行输出效果如下，可以发现模型关键模块FasterBlock中flops的大头在blocks.0.mlp上，spatial_mixing.partial_conv3（即pconv）只占据了模块10%的计算量为0.21m。

| module                                            | #parameters or shape   | #flops     | #activations   |
|:--------------------------------------------------|:-----------------------|:-----------|:---------------|
| model                                             | 7.4M                   | 0.948G     | 3.136M         |
|  patch_embed                                      |  4.8K                  |  20.84M    |  0.393M        |
|   patch_embed.proj                                |   4.608K               |   18.874M  |   0.393M       |
|    patch_embed.proj.weight                        |    (96, 3, 4, 4)       |            |                |
|   patch_embed.norm                                |   0.192K               |   1.966M   |   0            |
|    patch_embed.norm.weight                        |    (96,)               |            |                |
|    patch_embed.norm.bias                          |    (96,)               |            |                |
|  stages                                           |  5.131M                |  0.924G    |  2.74M         |
|   stages.0.blocks.0                               |   42.432K              |   0.176G   |   1.278M       |
|    stages.0.blocks.0.mlp                          |    37.248K             |    0.155G  |    1.18M       |
|    stages.0.blocks.0.spatial_mixing.partial_conv3 |    5.184K              |    21.234M |    98.304K     |
|   stages.1                                        |   74.112K              |   76.481M  |   0.197M       |
|    stages.1.reduction                             |    73.728K             |    75.497M |    0.197M      |
|    stages.1.norm                                  |    0.384K              |    0.983M  |    0           |
|   stages.2.blocks.0                               |   0.169M               |   0.174G   |   0.639M       |
|    stages.2.blocks.0.mlp                          |    0.148M              |    0.153G  |    0.59M       |
|    stages.2.blocks.0.spatial_mixing.partial_conv3 |    20.736K             |    21.234M |    49.152K     |
|   stages.3                                        |   0.296M               |   75.989M  |   98.304K      |
|    stages.3.reduction                             |    0.295M              |    75.497M |    98.304K     |
|    stages.3.norm                                  |    0.768K              |    0.492M  |    0           |
|   stages.4.blocks.0                               |   0.674M               |   0.173G   |   0.319M       |
|    stages.4.blocks.0.mlp                          |    0.591M              |    0.152G  |    0.295M      |
|    stages.4.blocks.0.spatial_mixing.partial_conv3 |    82.944K             |    21.234M |    24.576K     |
|   stages.5                                        |   1.181M               |   75.743M  |   49.152K      |
|    stages.5.reduction                             |    1.18M               |    75.497M |    49.152K     |
|    stages.5.norm                                  |    1.536K              |    0.246M  |    0           |
|   stages.6.blocks.0                               |   2.694M               |   0.173G   |   0.16M        |
|    stages.6.blocks.0.mlp                          |    2.362M              |    0.151G  |    0.147M      |
|    stages.6.blocks.0.spatial_mixing.partial_conv3 |    0.332M              |    21.234M |    12.288K     |
|  avgpool_pre_head                                 |  0.983M                |  1.032M    |  1.28K         |
|   avgpool_pre_head.1                              |   0.983M               |   0.983M   |   1.28K        |
|    avgpool_pre_head.1.weight                      |    (1280, 768, 1, 1)   |            |                |
|   avgpool_pre_head.0                              |                        |   49.152K  |   0            |
|  head                                             |  1.281M                |  1.28M     |  1K            |
|   head.weight                                     |   (1000, 1280)         |            |                |
|   head.bias                                       |   (1000,)              |            |                |

3.3 将PConv替换为Conv的FLops变化

将原来的Partial_conv3类代码替换为以下代码

class Partial_conv3(nn.Module):def __init__(self, dim, n_div, forward):super().__init__()self.conv = nn.Conv2d(dim, dim, 3, 1, 1, bias=False)def forward(self, x: Tensor) -> Tensor:# only for inferencex = x.clone()   # !!! Keep the original input intact for the residual connection laterx = self.conv(x)return x

再次运行以下代码后

if __name__=="__main__":model=FasterNet( depths=(1, 1, 1, 1),)from fvcore.nn import flop_count_table, FlopCountAnalysis, ActivationCountAnalysis    x = torch.randn(1, 3, 256, 256)# model = SAFMN(dim=36, n_blocks=12, ffn_scale=2.0, upscaling_factor=2)print(f'params: {sum(map(lambda x: x.numel(), model.parameters()))}')print(flop_count_table(FlopCountAnalysis(model, x), activations=ActivationCountAnalysis(model, x)))output = model(x)print(output.shape)

这里可以发现flops为2.22g，相比与原来的0.98g翻了一倍。在新的FasterBlock中，spatial_mixing.conv中flops的占比达到了70%，为0.34g，相比于原来的21m为16倍。

| module                                   | #parameters or shape   | #flops     | #activations   |
|:-----------------------------------------|:-----------------------|:-----------|:---------------|
| model                                    | 14.009M                | 2.222G     | 3.689M         |
|  patch_embed                             |  4.8K                  |  20.84M    |  0.393M        |
|   patch_embed.proj                       |   4.608K               |   18.874M  |   0.393M       |
|    patch_embed.proj.weight               |    (96, 3, 4, 4)       |            |                |
|   patch_embed.norm                       |   0.192K               |   1.966M   |   0            |
|    patch_embed.norm.weight               |    (96,)               |            |                |
|    patch_embed.norm.bias                 |    (96,)               |            |                |
|  stages                                  |  11.74M                |  2.199G    |  3.293M        |
|   stages.0.blocks.0                      |   0.12M                |   0.495G   |   1.573M       |
|    stages.0.blocks.0.mlp                 |    37.248K             |    0.155G  |    1.18M       |
|    stages.0.blocks.0.spatial_mixing.conv |    82.944K             |    0.34G   |    0.393M      |
|   stages.1                               |   74.112K              |   76.481M  |   0.197M       |
|    stages.1.reduction                    |    73.728K             |    75.497M |    0.197M      |
|    stages.1.norm                         |    0.384K              |    0.983M  |    0           |
|   stages.2.blocks.0                      |   0.48M                |   0.493G   |   0.786M       |
|    stages.2.blocks.0.mlp                 |    0.148M              |    0.153G  |    0.59M       |
|    stages.2.blocks.0.spatial_mixing.conv |    0.332M              |    0.34G   |    0.197M      |
|   stages.3                               |   0.296M               |   75.989M  |   98.304K      |
|    stages.3.reduction                    |    0.295M              |    75.497M |    98.304K     |
|    stages.3.norm                         |    0.768K              |    0.492M  |    0           |
|   stages.4.blocks.0                      |   1.918M               |   0.492G   |   0.393M       |
|    stages.4.blocks.0.mlp                 |    0.591M              |    0.152G  |    0.295M      |
|    stages.4.blocks.0.spatial_mixing.conv |    1.327M              |    0.34G   |    98.304K     |
|   stages.5                               |   1.181M               |   75.743M  |   49.152K      |
|    stages.5.reduction                    |    1.18M               |    75.497M |    49.152K     |
|    stages.5.norm                         |    1.536K              |    0.246M  |    0           |
|   stages.6.blocks.0                      |   7.671M               |   0.491G   |   0.197M       |
|    stages.6.blocks.0.mlp                 |    2.362M              |    0.151G  |    0.147M      |
|    stages.6.blocks.0.spatial_mixing.conv |    5.308M              |    0.34G   |    49.152K     |
|  avgpool_pre_head                        |  0.983M                |  1.032M    |  1.28K         |
|   avgpool_pre_head.1                     |   0.983M               |   0.983M   |   1.28K        |
|    avgpool_pre_head.1.weight             |    (1280, 768, 1, 1)   |            |                |
|   avgpool_pre_head.0                     |                        |   49.152K  |   0            |
|  head                                    |  1.281M                |  1.28M     |  1K            |
|   head.weight                            |   (1000, 1280)         |            |                |
|   head.bias                              |   (1000,)              |            |                |
torch.Size([1, 1000])

3.3 整体结论

基于3.1-3.3的分析，可以发现我们不能直接用pconv取代模型中所有的conv层，但可以在部分层中取代个别flops较大的conv中。pconv只是近似conv的一个选择，其仅是在FasterNet的架构设计下发挥作用，直接平替到其他模型中必然存在水土不服（需要额外的PWConv层实现信息交互）。

但是，FasterNet却为我们提供了一个强大的backbone，其在轻量级与重量级模型中均达到了最佳精度下的最快速度，可以用于图像分类、目标检测中。然后在我们的实验中，或许可以将FasterNet中的Pconv替换为DWConv，这样也许能再次提升backbone能力的提升。毕竟作者没有做这个对比，也说不定是发现Pconv不如DWConv后隐匿了这一部分实验数据