diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b2acd0d395ca425f6c7130776db2cb619f303c12..f792258f07caf3f3b7cbccc13eb580904d722953 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -436,3 +436,131 @@ support enabled just fine as always. No difference can be noted in hugetlbfs other than there will be less overall fragmentation. All usual features belonging to hugetlbfs are preserved and unaffected. libhugetlbfs will also work fine as usual. + +THP zero subpages reclaim +========================= +THP may lead to memory bloat, which may cause OOM, because a huge +page may contain zero subpages that users never actually accessed. +To avoid this, a mechanism to reclaim these zero subpages is introduced:: + + echo reclaim > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim + echo swap > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim + echo disable > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim + +reclaim + means the huge page will be split and the zero subpages will be + reclaimed. + +swap + currently does nothing; it is reserved for combining with swap backends, + for example, zswap and zram. + +disable + means do nothing. + +The default mode is inherited from the parent memcg. The default mode of the +root memcg is disable. + +We also add a global interface; if you don't want to configure every memory +cgroup individually, you can use this one:: + + /sys/kernel/mm/transparent_hugepage/reclaim + +memcg + The default mode. It means every mem cgroup will use its own configuration. + +reclaim + means every mem cgroup will use reclaim. + +swap + means every mem cgroup will use swap. + +disable + means every mem cgroup will use disable. + +Or you can configure it with the boot parameter ``tr``, like ``tr=reclaim`` or +``tr=disable``. + +If the mode is reclaim or swap, a new huge page will be added to a reclaim +queue in the mem_cgroup, and the queue is scanned during memory reclaim. +The queue stats can be checked like this:: + + cat /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat + +queue_length + means the queue length on each node. + +split_hugepage + means the huge pages split by thp reclaim on each node. + +reclaim_subpage + means the zero subpages reclaimed by thp reclaim on each node. + +split_failed + means the thp splits that failed in thp reclaim on each node. + +total_<stat> + hierarchical version of <stat>, which in addition to the memcg's + own value includes the sum of all hierarchical children's values of + <stat>. This adds the values of all the nodes together. These are + accumulated values, except total_queue_length, which is + the instantaneous value at show time. + +We also add a control interface to set configs for thp reclaim:: + + /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_ctrl + +threshold + means a huge page which contains at least threshold zero pages will + be split (estimated by checking some discrete unsigned long values). + The default value of threshold is 16, and it is inherited from the parent memcg. + The range of this value is (0, HPAGE_PMD_NR], which means the value must + be less than or equal to HPAGE_PMD_NR (512 on x86), and be greater than 0. + We can set the reclaim threshold to 8 like this:: + + echo "threshold 8" > memory.thp_reclaim_ctrl + +reclaim + triggers action immediately for the huge pages in the reclaim queue.
+ The action depends on the thp reclaim config (reclaim, swap or disable; + disable means just remove the huge page from the queue). + This control takes two values, 1 and 2. 1 means just reclaim the current + memcg, and 2 means reclaim the current memcg and all its children memcgs. + Like this:: + + echo "reclaim 1" > memory.thp_reclaim_ctrl + echo "reclaim 2" > memory.thp_reclaim_ctrl + +proactive + Enable or disable proactive reclaim for the memory cgroup. See below. + +Only one of the configs mentioned above can be set at a time. + +We also provide proactive reclaim. It can be triggered by:: + + /sys/kernel/mm/transparent_hugepage/reclaim_proactive + +Set 1 to enable, and 0 to disable. If it's enabled, proactive reclaim +will run every 60000ms (by default). We can change the running frequency by:: + + /sys/kernel/mm/transparent_hugepage/reclaim_proactive_sleep_ms + +If a memory cgroup does not want to enable proactive reclaim (the default is +disabled), we can disable it by:: + + echo "proactive 0" > memory.thp_reclaim_ctrl + +And enable it by:: + + echo "proactive 1" > memory.thp_reclaim_ctrl + +We can limit the number of pages each memcg scans for proactive reclaim by:: + + /sys/kernel/mm/transparent_hugepage/reclaim_proactive_scan + +We also provide the boot parameter ``tr.proactive``. ``tr.proactive=0`` +means disable proactive reclaim, ``tr.proactive=1`` means enable +proactive reclaim at system boot, ``tr.proactive=2`` means the root +memory cgroup will enable proactive reclaim, and ``tr.proactive=3`` +means both of the above. diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst index b4de8aa402618676ffdf74008c70d1f93857d1b5..63c3467e4358122cc3fb8de50637ddb782625561 100644 --- a/Documentation/dev-tools/kfence.rst +++ b/Documentation/dev-tools/kfence.rst @@ -85,7 +85,7 @@ Note: On architectures that support huge pages, KFENCE will ensure that the pool is using pages of size ``PAGE_SIZE``. This will result in additional page tables being allocated. -TLB revocer issue +TLB recover issue ~~~~~~~~~~~~~~~~~ For some arch like x86, kernel virtual address directly mapping to physical @@ -162,6 +162,13 @@ or writing to ``/sys/module/kfence/parameters/order0_page``. Error reports ~~~~~~~~~~~~~ +By default, KFENCE will only report faults in dmesg. If users want to panic +the kernel, set ``kfence.fault=panic`` on the boot command line, or write "panic" +to ``/sys/module/kfence/parameters/fault``. + +The default value is "report". Users can switch between "report" and "panic" +at any time. + A typical out-of-bounds access looks like this:: ================================================================== diff --git a/Documentation/driver-api/auxiliary_bus.rst b/Documentation/driver-api/auxiliary_bus.rst index 5dd7804631ef760fa9ad71a1b6f15743114291e8..fff96c7ba7a858ca002587e422b19cfdd894dae2 100644 --- a/Documentation/driver-api/auxiliary_bus.rst +++ b/Documentation/driver-api/auxiliary_bus.rst @@ -1,5 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0-only +.. _auxiliary_bus: + ============= Auxiliary Bus ============= @@ -150,7 +152,7 @@ and shutdown notifications using the standard conventions.
struct auxiliary_driver { int (*probe)(struct auxiliary_device *, const struct auxiliary_device_id *id); - int (*remove)(struct auxiliary_device *); + void (*remove)(struct auxiliary_device *); void (*shutdown)(struct auxiliary_device *); int (*suspend)(struct auxiliary_device *, pm_message_t); int (*resume)(struct auxiliary_device *); diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst index e9b65035cd472fce140a33ff1e243acbafc0c536..a1b32fcd0d76fb8d9fa9ca6c34e848b722e30b74 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst @@ -12,6 +12,8 @@ Contents - `Enabling the driver and kconfig options`_ - `Devlink info`_ - `Devlink parameters`_ +- `mlx5 subfunction`_ +- `mlx5 function attributes`_ - `Devlink health reporters`_ - `mlx5 tracepoints`_ @@ -97,6 +99,11 @@ Enabling the driver and kconfig options | Provides low-level InfiniBand/RDMA and `RoCE `_ support. +**CONFIG_MLX5_SF=(y/n)** + +| Build support for subfunctions. +| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option +| will enable support for creating subfunction devices. **External options** ( Choose if the corresponding mlx5 feature is required ) @@ -176,6 +183,214 @@ User command examples: values: cmode driverinit value true +mlx5 subfunction +================ +mlx5 supports subfunction management using the devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst `) interface. + +A subfunction has its own function capabilities and its own resources. This +means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These +queues are neither shared with nor stolen from the parent PCI function. + +When a subfunction is RDMA capable, it has its own QP1, GID table and rdma +resources, neither shared with nor stolen from the parent PCI function. + +A subfunction has a dedicated window in PCI BAR space that is not shared +with the other subfunctions or the parent PCI function. This ensures that all +devices (netdev, rdma, vdpa etc.) of the subfunction access only the assigned +PCI BAR space. + +A subfunction supports eswitch representation through which it supports tc +offloads. The user configures the eswitch to send/receive packets from/to +the subfunction port. + +Subfunctions share PCI level resources such as PCI MSI-X IRQs with +other subfunctions and/or with their parent PCI function.
+ +Example mlx5 software, system and device view:: + + _______ + | admin | + | user |---------- + |_______| | + | | + ____|____ __|______ _________________ + | | | | | | + | devlink | | tc tool | | user | + | tool | |_________| | applications | + |_________| | |_________________| + | | | | + | | | | Userspace + +---------|-------------|-------------------|----------|--------------------+ + | | +----------+ +----------+ Kernel + | | | netdev | | rdma dev | + | | +----------+ +----------+ + (devlink port add/del | ^ ^ + port function set) | | | + | | +---------------| + _____|___ | | _______|_______ + | | | | | mlx5 class | + | devlink | +------------+ | | drivers | + | kernel | | rep netdev | | |(mlx5_core,ib) | + |_________| +------------+ | |_______________| + | | | ^ + (devlink ops) | | (probe/remove) + _________|________ | | ____|________ + | subfunction | | +---------------+ | subfunction | + | management driver|----- | subfunction |---| driver | + | (mlx5_core) | | auxiliary dev | | (mlx5_core) | + |__________________| +---------------+ |_____________| + | ^ + (sf add/del, vhca events) | + | (device add/del) + _____|____ ____|________ + | | | subfunction | + | PCI NIC |---- activate/deactivate events---->| host driver | + |__________| | (mlx5_core) | + |_____________| + +A subfunction is created using the devlink port interface. + +- Change device to switchdev mode:: + + $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev + +- Add a devlink port of subfunction flavour:: + + $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 + pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:00:00 state inactive opstate detached + +- Show a devlink port of the subfunction:: + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:00:00 state inactive opstate detached + +- Delete a devlink port of subfunction after use:: + + $ devlink port del pci/0000:06:00.0/32768 + +mlx5 function attributes +======================== +The mlx5 driver provides a mechanism to set up PCI VF/SF function attributes in +a unified way for SmartNIC and non-SmartNIC. + +This is supported only when the eswitch mode is set to switchdev. Port function +configuration of the PCI VF/SF is supported through devlink eswitch port. + +Port function attributes should be set before the PCI VF/SF is enumerated by the +driver. + +MAC address setup +----------------- +The mlx5 driver provides a mechanism to set up the MAC address of the PCI VF/SF. + +The configured MAC address of the PCI VF/SF will be used by netdevice and rdma +device created for the PCI VF/SF.
+ +- Get the MAC address of the VF identified by its unique devlink port index:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 + +- Set the MAC address of the VF identified by its unique devlink port index:: + + $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55 + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:11:22:33:44:55 + +- Get the MAC address of the SF identified by its unique devlink port index:: + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:00:00 + +- Set the MAC address of the SF identified by its unique devlink port index:: + + $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:88:88 + +SF state setup +-------------- +To use the SF, the user must activate the SF using the SF function state +attribute. + +- Get the state of the SF identified by its unique devlink port index:: + + $ devlink port show ens2f0npf0sf88 + pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:88:88 state inactive opstate detached + +- Activate the function and verify its state is active:: + + $ devlink port function set ens2f0npf0sf88 state active + + $ devlink port show ens2f0npf0sf88 + pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:88:88 state active opstate detached + +Upon function activation, the PF driver instance gets the event from the device +that a particular SF was activated. It's the cue to put the device on the bus, probe +it and instantiate the devlink instance and class specific auxiliary devices +for it. + +- Show the auxiliary device and port of the subfunction:: + + $ devlink dev show + devlink dev show auxiliary/mlx5_core.sf.4 + + $ devlink port show auxiliary/mlx5_core.sf.4/1 + auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false + + $ rdma link show mlx5_0/1 + link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88 + + $ rdma dev show + 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112 + 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112 + +- Subfunction auxiliary device and class device hierarchy:: + + mlx5_core.sf.4 + (subfunction auxiliary device) + /\ + / \ + / \ + / \ + / \ + mlx5_core.eth.4 mlx5_core.rdma.4 + (sf eth aux dev) (sf rdma aux dev) + | | + | | + p0sf88 mlx5_0 + (sf netdev) (sf rdma device) + +Additionally, the SF port also gets the event when the driver attaches to the +auxiliary device of the subfunction. This results in changing the operational +state of the function. This provides visibility to the user to decide when it is +safe to delete the SF port for graceful termination of the subfunction.
+ +- Show the SF port operational state:: + + $ devlink port show ens2f0npf0sf88 + pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:88:88 state active opstate attached + Devlink health reporters ======================== diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst new file mode 100644 index 0000000000000000000000000000000000000000..e99b41599465110f09573e8693781808d236dba3 --- /dev/null +++ b/Documentation/networking/devlink/devlink-port.rst @@ -0,0 +1,199 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _devlink_port: + +============ +Devlink Port +============ + +``devlink-port`` is a port that exists on the device. It has a logically +separate ingress/egress point of the device. A devlink port can be any one +of many flavours. A devlink port flavour along with port attributes +describe what a port represents. + +A device driver that intends to publish a devlink port sets the +devlink port attributes and registers the devlink port. + +Devlink port flavours are described below. + +.. list-table:: List of devlink port flavours + :widths: 33 90 + + * - Flavour + - Description + * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL`` + - Any kind of physical port. This can be an eswitch physical port or any + other physical port on the device. + * - ``DEVLINK_PORT_FLAVOUR_DSA`` + - This indicates a DSA interconnect port. + * - ``DEVLINK_PORT_FLAVOUR_CPU`` + - This indicates a CPU port applicable only to DSA. + * - ``DEVLINK_PORT_FLAVOUR_PCI_PF`` + - This indicates an eswitch port representing a port of PCI + physical function (PF). + * - ``DEVLINK_PORT_FLAVOUR_PCI_VF`` + - This indicates an eswitch port representing a port of PCI + virtual function (VF). + * - ``DEVLINK_PORT_FLAVOUR_PCI_SF`` + - This indicates an eswitch port representing a port of PCI + subfunction (SF). + * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL`` + - This indicates a virtual port for the PCI virtual function. + +Devlink port can have a different type based on the link layer described below. + +.. list-table:: List of devlink port types + :widths: 23 90 + + * - Type + - Description + * - ``DEVLINK_PORT_TYPE_ETH`` + - Driver should set this port type when a link layer of the port is + Ethernet. + * - ``DEVLINK_PORT_TYPE_IB`` + - Driver should set this port type when a link layer of the port is + InfiniBand. + * - ``DEVLINK_PORT_TYPE_AUTO`` + - This type is indicated by the user when driver should detect the port + type automatically. + +PCI controllers +--------------- +In most cases a PCI device has only one controller. A controller consists of +potentially multiple physical, virtual functions and subfunctions. A function +consists of one or more ports. This port is represented by the devlink eswitch +port. + +A PCI device connected to multiple CPUs or multiple PCI root complexes or a +SmartNIC, however, may have multiple controllers. For a device with multiple +controllers, each controller is distinguished by a unique controller number. +An eswitch is on the PCI device which supports ports of multiple controllers. 
+ +An example view of a system with two controllers:: + + --------------------------------------------------------- + | | + | --------- --------- ------- ------- | + ----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| | + | server | | ------- ----/---- ---/----- ------- ---/--- ---/--- | + | pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ | + | connect | | ------- ------- | + ----------- | | controller_num=1 (no eswitch) | + ------|-------------------------------------------------- + (internal wire) + | + --------------------------------------------------------- + | devlink eswitch ports and reps | + | ----------------------------------------------------- | + | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | | + | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | | + | ----------------------------------------------------- | + | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | | + | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | | + | ----------------------------------------------------- | + | | + | | + ----------- | --------- --------- ------- ------- | + | smartNIC| | | vf(s) | | sf(s) | |vf(s)| |sf(s)| | + | pci rc |==| ------- ----/---- ---/----- ------- ---/--- ---/--- | + | connect | | | pf0 |______/________/ | pf1 |___/_______/ | + ----------- | ------- ------- | + | | + | local controller_num=0 (eswitch) | + --------------------------------------------------------- + +In the above example, the external controller (identified by controller number = 1) +doesn't have the eswitch. Local controller (identified by controller number = 0) +has the eswitch. The Devlink instance on the local controller has eswitch +devlink ports for both the controllers. + +Function configuration +====================== + +A user can configure the function attribute before enumerating the PCI +function. Usually it means, user should configure function attribute +before a bus specific device for the function is created. However, when +SRIOV is enabled, virtual function devices are created on the PCI bus. +Hence, function attribute should be configured before binding virtual +function device to the driver. For subfunctions, this means user should +configure port function attribute before activating the port function. + +A user may set the hardware address of the function using +'devlink port function set hw_addr' command. For Ethernet port function +this means a MAC address. + +Subfunction +============ + +Subfunction is a lightweight function that has a parent PCI function on which +it is deployed. Subfunction is created and deployed in unit of 1. Unlike +SRIOV VFs, a subfunction doesn't require its own PCI virtual function. +A subfunction communicates with the hardware through the parent PCI function. + +To use a subfunction, 3 steps setup sequence is followed. +(1) create - create a subfunction; +(2) configure - configure subfunction attributes; +(3) deploy - deploy the subfunction; + +Subfunction management is done using devlink port user interface. +User performs setup on the subfunction management device. + +(1) Create +---------- +A subfunction is created using a devlink port interface. A user adds the +subfunction by adding a devlink port of subfunction flavour. The devlink +kernel code calls down to subfunction management driver (devlink ops) and asks +it to create a subfunction devlink port. Driver then instantiates the +subfunction port and any associated objects such as health reporters and +representor netdevice. 
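+ +For illustration, this step might look like the mlx5 example shown earlier in +this patch (the PCI device name and sfnum below are placeholder values):: + + $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88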
+ +(2) Configure +------------- +A subfunction devlink port is created but it is not active yet. That means the +entities are created on the devlink side, the e-switch port representor is created, +but the subfunction device itself is not created. A user might use the e-switch port +representor to do settings, putting it into a bridge, adding TC rules, etc. A user +might as well configure the hardware address (such as MAC address) of the +subfunction while the subfunction is inactive. + +(3) Deploy +---------- +Once a subfunction is configured, the user must activate it to use it. Upon +activation, the subfunction management driver asks the subfunction management +device to instantiate the subfunction device on a particular PCI function. +A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst `. +At this point a matching subfunction driver binds to the subfunction's auxiliary device. + +Terms and Definitions +===================== + +.. list-table:: Terms and Definitions + :widths: 22 90 + + * - Term + - Definitions + * - ``PCI device`` + - A physical PCI device having one or more PCI buses consists of one or + more PCI controllers. + * - ``PCI controller`` + - A controller consists of potentially multiple physical functions, + virtual functions and subfunctions. + * - ``Port function`` + - An object to manage the function of a port. + * - ``Subfunction`` + - A lightweight function that has a parent PCI function on which it is + deployed. + * - ``Subfunction device`` + - A bus device of the subfunction, usually on an auxiliary bus. + * - ``Subfunction driver`` + - A device driver for the subfunction auxiliary device. + * - ``Subfunction management device`` + - A PCI physical function that supports subfunction management. + * - ``Subfunction management driver`` + - A device driver for a PCI physical function that supports + subfunction management using the devlink port interface. + * - ``Subfunction host driver`` + - A device driver for a PCI physical function that hosts subfunction + devices. In most cases it is the same as the subfunction management driver. When + a subfunction is used on an external controller, the subfunction management and + host drivers are different. diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index d82874760ae2627d0d7e9d0e3bcf429c9af789aa..aab79667f97b5e96b84dfcabf3ecbd22ac96aaeb 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -18,6 +18,7 @@ general. devlink-info devlink-flash devlink-params + devlink-port devlink-region devlink-resource devlink-reload diff --git a/Documentation/virt/coco/csv-guest.rst b/Documentation/virt/coco/csv-guest.rst new file mode 100644 index 0000000000000000000000000000000000000000..23cba2a5fd7c093bea25d6c1cc0f1ce41846d015 --- /dev/null +++ b/Documentation/virt/coco/csv-guest.rst @@ -0,0 +1,33 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================================================== +CSV Guest API Documentation +=================================================================== + +1. General description +====================== + +The CSV guest driver exposes IOCTL interfaces via the /dev/csv-guest misc +device to allow userspace to get certain CSV guest-specific details. + +2. API description +================== + +In this section, for each supported IOCTL, the following information is +provided along with a generic description. + +:Input parameters: Parameters passed to the IOCTL and related details.
+:Output: Details about output data and return value (with details about + the non common error values). + +2.1 CSV_CMD_GET_REPORT +----------------------- + +:Input parameters: struct csv_report_req +:Output: Upon successful execution, CSV_REPORT data is copied to + csv_report_req.report_data and return 0. Return -EINVAL for invalid + operands, -EIO on VMMCALL failure or standard error number on other + common failures. + +The CSV_CMD_GET_REPORT IOCTL can be used by the attestation software to get +the CSV_REPORT from the CSV module using VMMCALL[KVM_HC_VM_ATTESTATION]. diff --git a/Documentation/x86/microcode.rst b/Documentation/x86/microcode.rst index a320d37982ed6dcfb772b78df934a490d2299f4d..1cc734412397175db742b3ca43abe21c7391b6e0 100644 --- a/Documentation/x86/microcode.rst +++ b/Documentation/x86/microcode.rst @@ -34,6 +34,8 @@ on Intel: kernel/x86/microcode/GenuineIntel.bin on AMD : kernel/x86/microcode/AuthenticAMD.bin +on Hygon: + kernel/x86/microcode/HygonGenuine.bin During BSP (BootStrapping Processor) boot (pre-SMP), the kernel scans the microcode file in the initrd. If microcode matching the @@ -68,6 +70,10 @@ here for future reference only). cd $TMPDIR mkdir -p $DSTDIR + if [ -d /lib/firmware/hygon-ucode ]; then + cat /lib/firmware/hygon-ucode/microcode_hygon*.bin > $DSTDIR/HygonGenuine.bin + fi + if [ -d /lib/firmware/amd-ucode ]; then cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin fi @@ -119,7 +125,8 @@ currently supported. Here's an example:: - CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin" + CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 \ + amd-ucode/microcode_amd_fam15h.bin hygon-ucode/microcode_hygon_fam18h.bin" CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" This basically means, you have the following tree structure locally:: @@ -129,6 +136,10 @@ This basically means, you have the following tree structure locally:: ... | |-- microcode_amd_fam15h.bin ... + |-- hygon-ucode + ... + | |-- microcode_hygon_fam18h.bin + ... |-- intel-ucode ... | |-- 06-3a-09 diff --git a/MAINTAINERS b/MAINTAINERS index c09c8a767ef0f557defd75df1b0c8aa382cf9540..c27561e3e2fddeb4cd2fc892afb6ea96947f9cbd 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -143,6 +143,12 @@ Maintainers List first. When adding to this list, please keep the entries in alphabetical order. +VTOA +M: Xuan Zhuo +M: slb_opensource@linux.alibaba.com +S: Maintained +F: net/vtoa/ + TCP RT M: Cambda Zhu M: Xuan Zhuo diff --git a/anolis/configs/custom-overrides/arm64/CONFIG_KVM b/anolis/configs/custom-overrides/arm64/CONFIG_KVM new file mode 100644 index 0000000000000000000000000000000000000000..c221222ab1c9c641e8484e218d0728e7cb5c80da --- /dev/null +++ b/anolis/configs/custom-overrides/arm64/CONFIG_KVM @@ -0,0 +1 @@ +CONFIG_KVM=m \ No newline at end of file diff --git a/arch/arm/include/asm/efi.h b/arch/arm/include/asm/efi.h index 3ee4f43819850b3ad1e262f3ef02f0b16a0226ba..e8444c60c86e1fd550c9e7f3dcfec108cf4c621b 100644 --- a/arch/arm/include/asm/efi.h +++ b/arch/arm/include/asm/efi.h @@ -24,13 +24,6 @@ int efi_set_mapping_permissions(struct mm_struct *mm, efi_memory_desc_t *md); #define arch_efi_call_virt_setup() efi_virtmap_load() #define arch_efi_call_virt_teardown() efi_virtmap_unload() -#define arch_efi_call_virt(p, f, args...) 
\ -({ \ - efi_##f##_t *__f; \ - __f = p->f; \ - __f(args); \ -}) - #define ARCH_EFI_IRQ_FLAGS_MASK \ (PSR_J_BIT | PSR_E_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT | \ PSR_T_BIT | MODE_MASK) diff --git a/arch/arm64/configs/anolis-debug_defconfig b/arch/arm64/configs/anolis-debug_defconfig index 669edebc47a64f2da181ae87da2ea914f8140c3a..3445acab3378db0df6a0d0c6a4f4165f929d9c97 100644 --- a/arch/arm64/configs/anolis-debug_defconfig +++ b/arch/arm64/configs/anolis-debug_defconfig @@ -2,7 +2,7 @@ # Automatically generated file; DO NOT EDIT. # Linux/arm64 5.10.134 Kernel Configuration # -CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.5.0 20210514 (Anolis 8.5.0-10.0.1)" +CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.5.0 20210514 (Anolis 8.5.0-10.0.3)" CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=80500 CONFIG_LD_VERSION=230000000 @@ -122,7 +122,8 @@ CONFIG_RCU_NEED_SEGCBLIST=y CONFIG_RCU_NOCB_CPU=y # end of RCU Subsystem -# CONFIG_IKCONFIG is not set +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=20 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 @@ -161,6 +162,7 @@ CONFIG_CGROUP_DEVICE=y CONFIG_SCHED_SLI=y CONFIG_RICH_CONTAINER=y # CONFIG_RICH_CONTAINER_CG_SWITCH is not set +# CONFIG_MAX_PID_PER_NS is not set CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y @@ -172,7 +174,6 @@ CONFIG_TIME_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y -# CONFIG_MAX_PID_PER_NS is not set CONFIG_NET_NS=y CONFIG_CHECKPOINT_RESTORE=y CONFIG_SCHED_AUTOGROUP=y @@ -288,6 +289,7 @@ CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=4 CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_ARCH_PROC_KCORE_TEXT=y +CONFIG_BROKEN_GAS_INST=y CONFIG_KASAN_SHADOW_OFFSET=0xdfffa00000000000 # @@ -440,8 +442,6 @@ CONFIG_KUSER_HELPERS=y # CONFIG_ARM64_HW_AFDBM=y CONFIG_ARM64_PAN=y -CONFIG_AS_HAS_LSE_ATOMICS=y -CONFIG_ARM64_LSE_ATOMICS=y CONFIG_ARM64_USE_LSE_ATOMICS=y CONFIG_ARM64_VHE=y # end of ARMv8.1 architectural features @@ -457,17 +457,12 @@ CONFIG_ARM64_CNP=y # # ARMv8.3 architectural features # -# CONFIG_ARM64_PTR_AUTH is not set -CONFIG_CC_HAS_SIGN_RETURN_ADDRESS=y -CONFIG_AS_HAS_PAC=y # end of ARMv8.3 architectural features # # ARMv8.4 architectural features # # CONFIG_ARM64_AMU_EXTN is not set -CONFIG_AS_HAS_ARMV8_4=y -CONFIG_ARM64_TLB_RANGE=y CONFIG_ARM64_MPAM=y # end of ARMv8.4 architectural features @@ -586,6 +581,7 @@ CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y CONFIG_ACPI_CPPC_CPUFREQ=y CONFIG_ARM_SCPI_CPUFREQ=m # CONFIG_ARM_QCOM_CPUFREQ_HW is not set +# CONFIG_RISV_THEAD_LIGHT_CPUFREQ is not set # end of CPU Frequency scaling # end of CPU Power Management @@ -627,6 +623,7 @@ CONFIG_UEFI_CPER_ARM=y CONFIG_EFI_EARLYCON=y CONFIG_EFI_CUSTOM_SSDT_OVERLAYS=y CONFIG_YITIAN_CPER_RAWDATA=y +# CONFIG_EFI_COCO_SECRET is not set CONFIG_ARM_PSCI_FW=y # CONFIG_ARM_PSCI_CHECKER is not set CONFIG_HAVE_ARM_SMCCC=y @@ -692,7 +689,7 @@ CONFIG_ACPI_PCC=y # CONFIG_PMIC_OPREGION is not set CONFIG_IRQ_BYPASS_MANAGER=y CONFIG_VIRTUALIZATION=y -CONFIG_KVM=m +CONFIG_KVM=y CONFIG_HAVE_KVM_IRQCHIP=y CONFIG_HAVE_KVM_IRQFD=y CONFIG_HAVE_KVM_IRQ_ROUTING=y @@ -705,6 +702,7 @@ CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL=y CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y CONFIG_HAVE_KVM_IRQ_BYPASS=y CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE=y +CONFIG_KVM_XFER_TO_GUEST_WORK=y CONFIG_KVM_ARM_PMU=y CONFIG_HAVE_LIVEPATCH=y CONFIG_LIVEPATCH=y @@ -860,7 +858,7 @@ CONFIG_BLK_DEV_BSG=y CONFIG_BLK_DEV_BSGLIB=y CONFIG_BLK_DEV_INTEGRITY=y CONFIG_BLK_DEV_INTEGRITY_T10=y -# CONFIG_BLK_DEV_ZONED is not set +CONFIG_BLK_DEV_ZONED=y CONFIG_BLK_DEV_THROTTLING=y # CONFIG_BLK_DEV_THROTTLING_LOW 
is not set # CONFIG_BLK_CMDLINE_PARSER is not set @@ -869,6 +867,7 @@ CONFIG_BLK_WBT=y CONFIG_BLK_CGROUP_IOCOST=y CONFIG_BLK_WBT_MQ=y CONFIG_BLK_DEBUG_FS=y +CONFIG_BLK_DEBUG_FS_ZONED=y # CONFIG_BLK_SED_OPAL is not set # CONFIG_BLK_INLINE_ENCRYPTION is not set @@ -953,6 +952,8 @@ CONFIG_ARCH_USE_QUEUED_RWLOCKS=y CONFIG_QUEUED_RWLOCKS=y CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y +CONFIG_CK_KABI_RESERVE=y +CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y CONFIG_FREEZER=y # @@ -1098,6 +1099,7 @@ CONFIG_SMC=m CONFIG_SMC_DIAG=m CONFIG_XDP_SOCKETS=y CONFIG_XDP_SOCKETS_DIAG=m +CONFIG_VTOA=m CONFIG_HOOKERS=m CONFIG_INET=y CONFIG_IP_MULTICAST=y @@ -1181,7 +1183,7 @@ CONFIG_IPV6_NDISC_NODETYPE=y CONFIG_IPV6_TUNNEL=m CONFIG_IPV6_GRE=m CONFIG_IPV6_MULTIPLE_TABLES=y -# CONFIG_IPV6_SUBTREES is not set +CONFIG_IPV6_SUBTREES=y CONFIG_IPV6_MROUTE=y CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y CONFIG_IPV6_PIMSM_V2=y @@ -1206,7 +1208,7 @@ CONFIG_NETFILTER_INGRESS=y CONFIG_NETFILTER_NETLINK=m CONFIG_NETFILTER_FAMILY_BRIDGE=y CONFIG_NETFILTER_FAMILY_ARP=y -# CONFIG_NETFILTER_NETLINK_ACCT is not set +CONFIG_NETFILTER_NETLINK_ACCT=m CONFIG_NETFILTER_NETLINK_QUEUE=m CONFIG_NETFILTER_NETLINK_LOG=m CONFIG_NETFILTER_NETLINK_OSF=m @@ -1255,6 +1257,7 @@ CONFIG_NF_TABLES_INET=y CONFIG_NF_TABLES_NETDEV=y CONFIG_NFT_NUMGEN=m CONFIG_NFT_CT=m +CONFIG_NFT_FLOW_OFFLOAD=m CONFIG_NFT_COUNTER=m CONFIG_NFT_CONNLIMIT=m CONFIG_NFT_LOG=m @@ -1262,7 +1265,7 @@ CONFIG_NFT_LIMIT=m CONFIG_NFT_MASQ=m CONFIG_NFT_REDIR=m CONFIG_NFT_NAT=m -# CONFIG_NFT_TUNNEL is not set +CONFIG_NFT_TUNNEL=m CONFIG_NFT_OBJREF=m CONFIG_NFT_QUEUE=m CONFIG_NFT_QUOTA=m @@ -1274,14 +1277,15 @@ CONFIG_NFT_FIB=m CONFIG_NFT_FIB_INET=m CONFIG_NFT_XFRM=m CONFIG_NFT_SOCKET=m -# CONFIG_NFT_OSF is not set +CONFIG_NFT_OSF=m CONFIG_NFT_TPROXY=m # CONFIG_NFT_SYNPROXY is not set CONFIG_NF_DUP_NETDEV=m CONFIG_NFT_DUP_NETDEV=m CONFIG_NFT_FWD_NETDEV=m CONFIG_NFT_FIB_NETDEV=m -# CONFIG_NF_FLOW_TABLE is not set +CONFIG_NF_FLOW_TABLE_INET=m +CONFIG_NF_FLOW_TABLE=m CONFIG_NETFILTER_XTABLES=y # @@ -1304,7 +1308,7 @@ CONFIG_NETFILTER_XT_TARGET_DSCP=m CONFIG_NETFILTER_XT_TARGET_HL=m CONFIG_NETFILTER_XT_TARGET_HMARK=m CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m -# CONFIG_NETFILTER_XT_TARGET_LED is not set +CONFIG_NETFILTER_XT_TARGET_LED=m CONFIG_NETFILTER_XT_TARGET_LOG=m CONFIG_NETFILTER_XT_TARGET_MARK=m CONFIG_NETFILTER_XT_NAT=m @@ -1347,13 +1351,13 @@ CONFIG_NETFILTER_XT_MATCH_HL=m # CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set CONFIG_NETFILTER_XT_MATCH_IPRANGE=m CONFIG_NETFILTER_XT_MATCH_IPVS=m -# CONFIG_NETFILTER_XT_MATCH_L2TP is not set +CONFIG_NETFILTER_XT_MATCH_L2TP=m CONFIG_NETFILTER_XT_MATCH_LENGTH=m CONFIG_NETFILTER_XT_MATCH_LIMIT=m CONFIG_NETFILTER_XT_MATCH_MAC=m CONFIG_NETFILTER_XT_MATCH_MARK=m CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m -# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set +CONFIG_NETFILTER_XT_MATCH_NFACCT=m CONFIG_NETFILTER_XT_MATCH_OSF=m CONFIG_NETFILTER_XT_MATCH_OWNER=m CONFIG_NETFILTER_XT_MATCH_POLICY=m @@ -1369,7 +1373,7 @@ CONFIG_NETFILTER_XT_MATCH_STATE=m CONFIG_NETFILTER_XT_MATCH_STATISTIC=m CONFIG_NETFILTER_XT_MATCH_STRING=m CONFIG_NETFILTER_XT_MATCH_TCPMSS=m -# CONFIG_NETFILTER_XT_MATCH_TIME is not set +CONFIG_NETFILTER_XT_MATCH_TIME=m CONFIG_NETFILTER_XT_MATCH_U32=m # end of Core Netfilter Configuration @@ -1419,7 +1423,7 @@ CONFIG_IP_VS_LBLC=m CONFIG_IP_VS_LBLCR=m CONFIG_IP_VS_DH=m CONFIG_IP_VS_SH=m -# CONFIG_IP_VS_MH is not set +CONFIG_IP_VS_MH=m CONFIG_IP_VS_SED=m CONFIG_IP_VS_NQ=m @@ -1451,6 +1455,7 @@ CONFIG_NFT_REJECT_IPV4=m 
CONFIG_NFT_DUP_IPV4=m CONFIG_NFT_FIB_IPV4=m CONFIG_NF_TABLES_ARP=y +CONFIG_NF_FLOW_TABLE_IPV4=m CONFIG_NF_DUP_IPV4=m CONFIG_NF_LOG_ARP=m CONFIG_NF_LOG_IPV4=m @@ -1490,6 +1495,7 @@ CONFIG_NF_TABLES_IPV6=y CONFIG_NFT_REJECT_IPV6=m CONFIG_NFT_DUP_IPV6=m CONFIG_NFT_FIB_IPV6=m +CONFIG_NF_FLOW_TABLE_IPV6=m CONFIG_NF_DUP_IPV6=m CONFIG_NF_REJECT_IPV6=m CONFIG_NF_LOG_IPV6=m @@ -1642,7 +1648,6 @@ CONFIG_DEFAULT_NET_SCH="fq_codel" # CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=m -CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m @@ -1671,7 +1676,7 @@ CONFIG_NET_ACT_GACT=m CONFIG_GACT_PROB=y CONFIG_NET_ACT_MIRRED=m CONFIG_NET_ACT_SAMPLE=m -# CONFIG_NET_ACT_IPT is not set +CONFIG_NET_ACT_IPT=m CONFIG_NET_ACT_NAT=m CONFIG_NET_ACT_PEDIT=m CONFIG_NET_ACT_SIMP=m @@ -1685,8 +1690,9 @@ CONFIG_NET_ACT_BPF=m CONFIG_NET_ACT_SKBMOD=m # CONFIG_NET_ACT_IFE is not set CONFIG_NET_ACT_TUNNEL_KEY=m +CONFIG_NET_ACT_CT=m # CONFIG_NET_ACT_GATE is not set -# CONFIG_NET_TC_SKB_EXT is not set +CONFIG_NET_TC_SKB_EXT=y CONFIG_NET_SCH_FIFO=y CONFIG_DCB=y CONFIG_DNS_RESOLVER=m @@ -1945,6 +1951,7 @@ CONFIG_YENTA_TOSHIBA=y # # Generic Driver Options # +CONFIG_AUXILIARY_BUS=y # CONFIG_UEVENT_HELPER is not set CONFIG_DEVTMPFS=y CONFIG_DEVTMPFS_MOUNT=y @@ -2147,8 +2154,8 @@ CONFIG_CDROM_PKTCDVD_BUFFERS=8 # CONFIG_ATA_OVER_ETH is not set CONFIG_VIRTIO_BLK=m CONFIG_BLK_DEV_RBD=m -CONFIG_BLK_DEV_UBLK=m # CONFIG_BLK_DEV_RSXX is not set +CONFIG_BLK_DEV_UBLK=m # # NVME Support @@ -2494,6 +2501,7 @@ CONFIG_DM_VERITY=m CONFIG_DM_SWITCH=m CONFIG_DM_LOG_WRITES=m CONFIG_DM_INTEGRITY=m +# CONFIG_DM_ZONED is not set CONFIG_TARGET_CORE=m CONFIG_TCM_IBLOCK=m CONFIG_TCM_FILEIO=m @@ -2692,6 +2700,7 @@ CONFIG_MLX5_EN_RXNFC=y CONFIG_MLX5_MPFS=y CONFIG_MLX5_ESWITCH=y CONFIG_MLX5_CLS_ACT=y +CONFIG_MLX5_TC_CT=y CONFIG_MLX5_CORE_EN_DCB=y CONFIG_MLX5_CORE_IPOIB=y # CONFIG_MLX5_FPGA_IPSEC is not set @@ -4115,6 +4124,7 @@ CONFIG_DRM_CIRRUS_QEMU=m # CONFIG_DRM_LIMA is not set # CONFIG_DRM_PANFROST is not set # CONFIG_DRM_TIDSS is not set +# CONFIG_DRM_VERISILICON is not set # CONFIG_DRM_LEGACY is not set CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y @@ -4768,6 +4778,7 @@ CONFIG_INFINIBAND_HNS=m # CONFIG_INFINIBAND_HNS_HIP08 is not set CONFIG_INFINIBAND_BNXT_RE=m CONFIG_INFINIBAND_QEDR=m +CONFIG_INFINIBAND_ERDMA=m CONFIG_RDMA_RXE=m CONFIG_RDMA_SIW=m CONFIG_INFINIBAND_IPOIB=m @@ -5213,6 +5224,12 @@ CONFIG_QCOM_KRYO_L2_ACCESSORS=y # # CONFIG_XILINX_VCU is not set # end of Xilinx SoC drivers + +# +# prefetch tuning drivers +# +CONFIG_ARM64_PREFETCH_TUNING=m +# end of prefetch tuning drivers # end of SOC (System On Chip) specific Drivers # CONFIG_PM_DEVFREQ is not set @@ -5239,6 +5256,7 @@ CONFIG_PWM_SYSFS=y # CONFIG_PWM_FSL_FTM is not set # CONFIG_PWM_HIBVT is not set # CONFIG_PWM_PCA9685 is not set +# CONFIG_PWM_LIGHT is not set # # IRQ chip support @@ -5299,6 +5317,7 @@ CONFIG_PHY_HI6220_USB=m # CONFIG_PHY_QCOM_USB_HS_28NM is not set # CONFIG_PHY_QCOM_USB_SS is not set # CONFIG_PHY_QCOM_IPQ806X_USB is not set +# CONFIG_PHY_DW_DPHY is not set # CONFIG_PHY_TUSB1210 is not set # end of PHY Subsystem @@ -5320,9 +5339,9 @@ CONFIG_QCOM_L3_PMU=y CONFIG_THUNDERX2_PMU=m CONFIG_XGENE_PMU=y CONFIG_ARM_SPE_PMU=m +CONFIG_ALIBABA_UNCORE_DRW_PMU=m CONFIG_HISI_PMU=y CONFIG_DWC_UNCORE_PCIE_PMU=y -CONFIG_ALIBABA_UNCORE_DRW_PMU=m # end of Performance monitor support CONFIG_RAS=y @@ -5413,6 +5432,7 @@ CONFIG_XFS_WARN=y # CONFIG_BTRFS_FS is not set # CONFIG_NILFS2_FS is not set # CONFIG_F2FS_FS is not set +# CONFIG_ZONEFS_FS is not set CONFIG_FS_DAX=y 
CONFIG_FS_POSIX_ACL=y CONFIG_EXPORTFS=y @@ -5710,7 +5730,7 @@ CONFIG_SECURITYFS=y CONFIG_SECURITY_NETWORK=y CONFIG_SECURITY_INFINIBAND=y CONFIG_SECURITY_NETWORK_XFRM=y -# CONFIG_SECURITY_PATH is not set +CONFIG_SECURITY_PATH=y CONFIG_LSM_MMAP_MIN_ADDR=65535 CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y CONFIG_HARDENED_USERCOPY=y @@ -5736,7 +5756,8 @@ CONFIG_INTEGRITY=y CONFIG_INTEGRITY_SIGNATURE=y CONFIG_INTEGRITY_ASYMMETRIC_KEYS=y CONFIG_INTEGRITY_TRUSTED_KEYRING=y -# CONFIG_INTEGRITY_PLATFORM_KEYRING is not set +CONFIG_INTEGRITY_PLATFORM_KEYRING=y +CONFIG_LOAD_UEFI_KEYS=y CONFIG_INTEGRITY_AUDIT=y CONFIG_IMA=y CONFIG_IMA_MEASURE_PCR_IDX=10 @@ -5745,9 +5766,9 @@ CONFIG_IMA_LSM_RULES=y CONFIG_IMA_SIG_TEMPLATE=y CONFIG_IMA_DEFAULT_TEMPLATE="ima-sig" # CONFIG_IMA_DEFAULT_HASH_SHA1 is not set -# CONFIG_IMA_DEFAULT_HASH_SHA256 is not set -CONFIG_IMA_DEFAULT_HASH_SM3=y -CONFIG_IMA_DEFAULT_HASH="sm3" +CONFIG_IMA_DEFAULT_HASH_SHA256=y +# CONFIG_IMA_DEFAULT_HASH_SM3 is not set +CONFIG_IMA_DEFAULT_HASH="sha256" CONFIG_IMA_WRITE_POLICY=y CONFIG_IMA_READ_POLICY=y CONFIG_IMA_APPRAISE=y @@ -5816,6 +5837,7 @@ CONFIG_CRYPTO_RNG_DEFAULT=m CONFIG_CRYPTO_AKCIPHER2=y CONFIG_CRYPTO_AKCIPHER=y CONFIG_CRYPTO_KPP2=y +CONFIG_CRYPTO_KPP=m CONFIG_CRYPTO_ACOMP2=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_MANAGER2=y @@ -5835,7 +5857,7 @@ CONFIG_CRYPTO_SIMD=y # Public-key cryptography # CONFIG_CRYPTO_RSA=y -# CONFIG_CRYPTO_DH is not set +CONFIG_CRYPTO_DH=m # CONFIG_CRYPTO_ECDH is not set # CONFIG_CRYPTO_ECRDSA is not set CONFIG_CRYPTO_SM2=y @@ -5880,8 +5902,8 @@ CONFIG_CRYPTO_VMAC=m # CONFIG_CRYPTO_CRC32C=y CONFIG_CRYPTO_CRC32=m -# CONFIG_CRYPTO_XXHASH is not set -# CONFIG_CRYPTO_BLAKE2B is not set +CONFIG_CRYPTO_XXHASH=m +CONFIG_CRYPTO_BLAKE2B=m # CONFIG_CRYPTO_BLAKE2S is not set CONFIG_CRYPTO_CRCT10DIF=y CONFIG_CRYPTO_GHASH=y @@ -5937,7 +5959,7 @@ CONFIG_CRYPTO_LZO=y # CONFIG_CRYPTO_842 is not set # CONFIG_CRYPTO_LZ4 is not set # CONFIG_CRYPTO_LZ4HC is not set -# CONFIG_CRYPTO_ZSTD is not set +CONFIG_CRYPTO_ZSTD=m # # Random Number Generation @@ -6047,7 +6069,6 @@ CONFIG_CRYPTO_LIB_POLY1305_GENERIC=m CONFIG_CRYPTO_LIB_POLY1305=m CONFIG_CRYPTO_LIB_CHACHA20POLY1305=m CONFIG_CRYPTO_LIB_SHA256=y -CONFIG_CRYPTO_LIB_SM4=y # end of Crypto library routines CONFIG_LIB_MEMNEQ=y @@ -6061,7 +6082,7 @@ CONFIG_CRC32_SLICEBY8=y # CONFIG_CRC32_SLICEBY4 is not set # CONFIG_CRC32_SARWATE is not set # CONFIG_CRC32_BIT is not set -# CONFIG_CRC64 is not set +CONFIG_CRC64=m # CONFIG_CRC4 is not set CONFIG_CRC7=m CONFIG_LIBCRC32C=m @@ -6076,6 +6097,7 @@ CONFIG_ZLIB_DEFLATE=y CONFIG_LZO_COMPRESS=y CONFIG_LZO_DECOMPRESS=y CONFIG_LZ4_DECOMPRESS=y +CONFIG_ZSTD_COMPRESS=m CONFIG_ZSTD_DECOMPRESS=y CONFIG_XZ_DEC=y CONFIG_XZ_DEC_X86=y @@ -6094,7 +6116,7 @@ CONFIG_DECOMPRESS_LZO=y CONFIG_DECOMPRESS_LZ4=y CONFIG_DECOMPRESS_ZSTD=y CONFIG_GENERIC_ALLOCATOR=y -CONFIG_REED_SOLOMON=m +CONFIG_REED_SOLOMON=y CONFIG_REED_SOLOMON_ENC8=y CONFIG_REED_SOLOMON_DEC8=y CONFIG_TEXTSEARCH=y @@ -6380,7 +6402,7 @@ CONFIG_DEBUG_LIST=y # CONFIG_DEBUG_PLIST is not set CONFIG_DEBUG_SG=y CONFIG_DEBUG_NOTIFIERS=y -# CONFIG_BUG_ON_DATA_CORRUPTION is not set +CONFIG_BUG_ON_DATA_CORRUPTION=y # end of Debug kernel data structures CONFIG_DEBUG_CREDENTIALS=y @@ -6480,8 +6502,10 @@ CONFIG_CORESIGHT_SOURCE_ETM4X=m CONFIG_ETM4X_IMPDEF_FEATURE=y CONFIG_CORESIGHT_STM=m CONFIG_CORESIGHT_CPU_DEBUG=m +# CONFIG_CORESIGHT_CPU_DEBUG_DEFAULT_ON is not set CONFIG_CORESIGHT_CTI=m CONFIG_CORESIGHT_CTI_INTEGRATION_REGS=y +# CONFIG_CORESIGHT_TRBE is not set # end of arm64 Debugging # @@ -6551,6 +6575,3 
@@ CONFIG_TEST_BPF=m # CONFIG_MEMTEST is not set # end of Kernel Testing and Coverage # end of Kernel hacking -CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y -CONFIG_CK_KABI_RESERVE=y -CONFIG_AUXILIARY_BUS=y diff --git a/arch/arm64/configs/anolis_defconfig b/arch/arm64/configs/anolis_defconfig index 84a762785ce221480737317ef143e2e8abd7b85b..0e49d4927660dc503f07f1c83d9e3bf390cb27fa 100644 --- a/arch/arm64/configs/anolis_defconfig +++ b/arch/arm64/configs/anolis_defconfig @@ -2,7 +2,7 @@ # Automatically generated file; DO NOT EDIT. # Linux/arm64 5.10.134 Kernel Configuration # -CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.5.0 20210514 (Anolis 8.5.0-10.0.1)" +CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.5.0 20210514 (Anolis 8.5.0-10.0.3)" CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=80500 CONFIG_LD_VERSION=230000000 @@ -119,7 +119,8 @@ CONFIG_RCU_NEED_SEGCBLIST=y CONFIG_RCU_NOCB_CPU=y # end of RCU Subsystem -# CONFIG_IKCONFIG is not set +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=20 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 @@ -158,6 +159,7 @@ CONFIG_CGROUP_DEVICE=y CONFIG_SCHED_SLI=y CONFIG_RICH_CONTAINER=y # CONFIG_RICH_CONTAINER_CG_SWITCH is not set +# CONFIG_MAX_PID_PER_NS is not set CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y @@ -169,7 +171,6 @@ CONFIG_TIME_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y -# CONFIG_MAX_PID_PER_NS is not set CONFIG_NET_NS=y CONFIG_CHECKPOINT_RESTORE=y CONFIG_SCHED_AUTOGROUP=y @@ -285,6 +286,7 @@ CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=4 CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_ARCH_PROC_KCORE_TEXT=y +CONFIG_BROKEN_GAS_INST=y # # Platform selection @@ -436,8 +438,6 @@ CONFIG_KUSER_HELPERS=y # CONFIG_ARM64_HW_AFDBM=y CONFIG_ARM64_PAN=y -CONFIG_AS_HAS_LSE_ATOMICS=y -CONFIG_ARM64_LSE_ATOMICS=y CONFIG_ARM64_USE_LSE_ATOMICS=y CONFIG_ARM64_VHE=y # end of ARMv8.1 architectural features @@ -453,17 +453,12 @@ CONFIG_ARM64_CNP=y # # ARMv8.3 architectural features # -# CONFIG_ARM64_PTR_AUTH is not set -CONFIG_CC_HAS_SIGN_RETURN_ADDRESS=y -CONFIG_AS_HAS_PAC=y # end of ARMv8.3 architectural features # # ARMv8.4 architectural features # # CONFIG_ARM64_AMU_EXTN is not set -CONFIG_AS_HAS_ARMV8_4=y -CONFIG_ARM64_TLB_RANGE=y CONFIG_ARM64_MPAM=y # end of ARMv8.4 architectural features @@ -582,6 +577,7 @@ CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y CONFIG_ACPI_CPPC_CPUFREQ=y CONFIG_ARM_SCPI_CPUFREQ=m # CONFIG_ARM_QCOM_CPUFREQ_HW is not set +# CONFIG_RISV_THEAD_LIGHT_CPUFREQ is not set # end of CPU Frequency scaling # end of CPU Power Management @@ -623,6 +619,7 @@ CONFIG_UEFI_CPER_ARM=y CONFIG_EFI_EARLYCON=y CONFIG_EFI_CUSTOM_SSDT_OVERLAYS=y CONFIG_YITIAN_CPER_RAWDATA=y +# CONFIG_EFI_COCO_SECRET is not set CONFIG_ARM_PSCI_FW=y # CONFIG_ARM_PSCI_CHECKER is not set CONFIG_HAVE_ARM_SMCCC=y @@ -687,7 +684,7 @@ CONFIG_ACPI_PCC=y # CONFIG_PMIC_OPREGION is not set CONFIG_IRQ_BYPASS_MANAGER=y CONFIG_VIRTUALIZATION=y -CONFIG_KVM=m +CONFIG_KVM=y CONFIG_HAVE_KVM_IRQCHIP=y CONFIG_HAVE_KVM_IRQFD=y CONFIG_HAVE_KVM_IRQ_ROUTING=y @@ -700,6 +697,7 @@ CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL=y CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y CONFIG_HAVE_KVM_IRQ_BYPASS=y CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE=y +CONFIG_KVM_XFER_TO_GUEST_WORK=y CONFIG_KVM_ARM_PMU=y CONFIG_HAVE_LIVEPATCH=y CONFIG_LIVEPATCH=y @@ -856,7 +854,7 @@ CONFIG_BLK_DEV_BSG=y CONFIG_BLK_DEV_BSGLIB=y CONFIG_BLK_DEV_INTEGRITY=y CONFIG_BLK_DEV_INTEGRITY_T10=y -# CONFIG_BLK_DEV_ZONED is not set +CONFIG_BLK_DEV_ZONED=y CONFIG_BLK_DEV_THROTTLING=y # CONFIG_BLK_DEV_THROTTLING_LOW is not set # CONFIG_BLK_CMDLINE_PARSER is 
not set @@ -865,6 +863,7 @@ CONFIG_BLK_WBT=y CONFIG_BLK_CGROUP_IOCOST=y CONFIG_BLK_WBT_MQ=y CONFIG_BLK_DEBUG_FS=y +CONFIG_BLK_DEBUG_FS_ZONED=y # CONFIG_BLK_SED_OPAL is not set # CONFIG_BLK_INLINE_ENCRYPTION is not set @@ -973,6 +972,8 @@ CONFIG_ARCH_USE_QUEUED_RWLOCKS=y CONFIG_QUEUED_RWLOCKS=y CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y +CONFIG_CK_KABI_RESERVE=y +CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y CONFIG_FREEZER=y # @@ -1118,6 +1119,7 @@ CONFIG_SMC=m CONFIG_SMC_DIAG=m CONFIG_XDP_SOCKETS=y CONFIG_XDP_SOCKETS_DIAG=m +CONFIG_VTOA=m CONFIG_HOOKERS=m CONFIG_INET=y CONFIG_IP_MULTICAST=y @@ -1201,7 +1203,7 @@ CONFIG_IPV6_NDISC_NODETYPE=y CONFIG_IPV6_TUNNEL=m CONFIG_IPV6_GRE=m CONFIG_IPV6_MULTIPLE_TABLES=y -# CONFIG_IPV6_SUBTREES is not set +CONFIG_IPV6_SUBTREES=y CONFIG_IPV6_MROUTE=y CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y CONFIG_IPV6_PIMSM_V2=y @@ -1226,7 +1228,7 @@ CONFIG_NETFILTER_INGRESS=y CONFIG_NETFILTER_NETLINK=m CONFIG_NETFILTER_FAMILY_BRIDGE=y CONFIG_NETFILTER_FAMILY_ARP=y -# CONFIG_NETFILTER_NETLINK_ACCT is not set +CONFIG_NETFILTER_NETLINK_ACCT=m CONFIG_NETFILTER_NETLINK_QUEUE=m CONFIG_NETFILTER_NETLINK_LOG=m CONFIG_NETFILTER_NETLINK_OSF=m @@ -1275,6 +1277,7 @@ CONFIG_NF_TABLES_INET=y CONFIG_NF_TABLES_NETDEV=y CONFIG_NFT_NUMGEN=m CONFIG_NFT_CT=m +CONFIG_NFT_FLOW_OFFLOAD=m CONFIG_NFT_COUNTER=m CONFIG_NFT_CONNLIMIT=m CONFIG_NFT_LOG=m @@ -1282,7 +1285,7 @@ CONFIG_NFT_LIMIT=m CONFIG_NFT_MASQ=m CONFIG_NFT_REDIR=m CONFIG_NFT_NAT=m -# CONFIG_NFT_TUNNEL is not set +CONFIG_NFT_TUNNEL=m CONFIG_NFT_OBJREF=m CONFIG_NFT_QUEUE=m CONFIG_NFT_QUOTA=m @@ -1294,14 +1297,15 @@ CONFIG_NFT_FIB=m CONFIG_NFT_FIB_INET=m CONFIG_NFT_XFRM=m CONFIG_NFT_SOCKET=m -# CONFIG_NFT_OSF is not set +CONFIG_NFT_OSF=m CONFIG_NFT_TPROXY=m # CONFIG_NFT_SYNPROXY is not set CONFIG_NF_DUP_NETDEV=m CONFIG_NFT_DUP_NETDEV=m CONFIG_NFT_FWD_NETDEV=m CONFIG_NFT_FIB_NETDEV=m -# CONFIG_NF_FLOW_TABLE is not set +CONFIG_NF_FLOW_TABLE_INET=m +CONFIG_NF_FLOW_TABLE=m CONFIG_NETFILTER_XTABLES=y # @@ -1324,7 +1328,7 @@ CONFIG_NETFILTER_XT_TARGET_DSCP=m CONFIG_NETFILTER_XT_TARGET_HL=m CONFIG_NETFILTER_XT_TARGET_HMARK=m CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m -# CONFIG_NETFILTER_XT_TARGET_LED is not set +CONFIG_NETFILTER_XT_TARGET_LED=m CONFIG_NETFILTER_XT_TARGET_LOG=m CONFIG_NETFILTER_XT_TARGET_MARK=m CONFIG_NETFILTER_XT_NAT=m @@ -1367,13 +1371,13 @@ CONFIG_NETFILTER_XT_MATCH_HL=m # CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set CONFIG_NETFILTER_XT_MATCH_IPRANGE=m CONFIG_NETFILTER_XT_MATCH_IPVS=m -# CONFIG_NETFILTER_XT_MATCH_L2TP is not set +CONFIG_NETFILTER_XT_MATCH_L2TP=m CONFIG_NETFILTER_XT_MATCH_LENGTH=m CONFIG_NETFILTER_XT_MATCH_LIMIT=m CONFIG_NETFILTER_XT_MATCH_MAC=m CONFIG_NETFILTER_XT_MATCH_MARK=m CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m -# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set +CONFIG_NETFILTER_XT_MATCH_NFACCT=m CONFIG_NETFILTER_XT_MATCH_OSF=m CONFIG_NETFILTER_XT_MATCH_OWNER=m CONFIG_NETFILTER_XT_MATCH_POLICY=m @@ -1389,7 +1393,7 @@ CONFIG_NETFILTER_XT_MATCH_STATE=m CONFIG_NETFILTER_XT_MATCH_STATISTIC=m CONFIG_NETFILTER_XT_MATCH_STRING=m CONFIG_NETFILTER_XT_MATCH_TCPMSS=m -# CONFIG_NETFILTER_XT_MATCH_TIME is not set +CONFIG_NETFILTER_XT_MATCH_TIME=m CONFIG_NETFILTER_XT_MATCH_U32=m # end of Core Netfilter Configuration @@ -1439,7 +1443,7 @@ CONFIG_IP_VS_LBLC=m CONFIG_IP_VS_LBLCR=m CONFIG_IP_VS_DH=m CONFIG_IP_VS_SH=m -# CONFIG_IP_VS_MH is not set +CONFIG_IP_VS_MH=m CONFIG_IP_VS_SED=m CONFIG_IP_VS_NQ=m @@ -1471,6 +1475,7 @@ CONFIG_NFT_REJECT_IPV4=m CONFIG_NFT_DUP_IPV4=m CONFIG_NFT_FIB_IPV4=m 
CONFIG_NF_TABLES_ARP=y +CONFIG_NF_FLOW_TABLE_IPV4=m CONFIG_NF_DUP_IPV4=m CONFIG_NF_LOG_ARP=m CONFIG_NF_LOG_IPV4=m @@ -1510,6 +1515,7 @@ CONFIG_NF_TABLES_IPV6=y CONFIG_NFT_REJECT_IPV6=m CONFIG_NFT_DUP_IPV6=m CONFIG_NFT_FIB_IPV6=m +CONFIG_NF_FLOW_TABLE_IPV6=m CONFIG_NF_DUP_IPV6=m CONFIG_NF_REJECT_IPV6=m CONFIG_NF_LOG_IPV6=m @@ -1662,7 +1668,6 @@ CONFIG_DEFAULT_NET_SCH="fq_codel" # CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=m -CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m @@ -1691,7 +1696,7 @@ CONFIG_NET_ACT_GACT=m CONFIG_GACT_PROB=y CONFIG_NET_ACT_MIRRED=m CONFIG_NET_ACT_SAMPLE=m -# CONFIG_NET_ACT_IPT is not set +CONFIG_NET_ACT_IPT=m CONFIG_NET_ACT_NAT=m CONFIG_NET_ACT_PEDIT=m CONFIG_NET_ACT_SIMP=m @@ -1705,8 +1710,9 @@ CONFIG_NET_ACT_BPF=m CONFIG_NET_ACT_SKBMOD=m # CONFIG_NET_ACT_IFE is not set CONFIG_NET_ACT_TUNNEL_KEY=m +CONFIG_NET_ACT_CT=m # CONFIG_NET_ACT_GATE is not set -# CONFIG_NET_TC_SKB_EXT is not set +CONFIG_NET_TC_SKB_EXT=y CONFIG_NET_SCH_FIFO=y CONFIG_DCB=y CONFIG_DNS_RESOLVER=m @@ -1965,6 +1971,7 @@ CONFIG_YENTA_TOSHIBA=y # # Generic Driver Options # +CONFIG_AUXILIARY_BUS=y # CONFIG_UEVENT_HELPER is not set CONFIG_DEVTMPFS=y CONFIG_DEVTMPFS_MOUNT=y @@ -2166,8 +2173,8 @@ CONFIG_CDROM_PKTCDVD_BUFFERS=8 # CONFIG_ATA_OVER_ETH is not set CONFIG_VIRTIO_BLK=m CONFIG_BLK_DEV_RBD=m -CONFIG_BLK_DEV_UBLK=m # CONFIG_BLK_DEV_RSXX is not set +CONFIG_BLK_DEV_UBLK=m # # NVME Support @@ -2513,6 +2520,7 @@ CONFIG_DM_VERITY=m CONFIG_DM_SWITCH=m CONFIG_DM_LOG_WRITES=m CONFIG_DM_INTEGRITY=m +# CONFIG_DM_ZONED is not set CONFIG_TARGET_CORE=m CONFIG_TCM_IBLOCK=m CONFIG_TCM_FILEIO=m @@ -2711,6 +2719,7 @@ CONFIG_MLX5_EN_RXNFC=y CONFIG_MLX5_MPFS=y CONFIG_MLX5_ESWITCH=y CONFIG_MLX5_CLS_ACT=y +CONFIG_MLX5_TC_CT=y CONFIG_MLX5_CORE_EN_DCB=y CONFIG_MLX5_CORE_IPOIB=y # CONFIG_MLX5_FPGA_IPSEC is not set @@ -4134,6 +4143,7 @@ CONFIG_DRM_CIRRUS_QEMU=m # CONFIG_DRM_LIMA is not set # CONFIG_DRM_PANFROST is not set # CONFIG_DRM_TIDSS is not set +# CONFIG_DRM_VERISILICON is not set # CONFIG_DRM_LEGACY is not set CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y @@ -4787,6 +4797,7 @@ CONFIG_INFINIBAND_HNS=m # CONFIG_INFINIBAND_HNS_HIP08 is not set CONFIG_INFINIBAND_BNXT_RE=m CONFIG_INFINIBAND_QEDR=m +CONFIG_INFINIBAND_ERDMA=m CONFIG_RDMA_RXE=m CONFIG_RDMA_SIW=m CONFIG_INFINIBAND_IPOIB=m @@ -5231,6 +5242,12 @@ CONFIG_QCOM_KRYO_L2_ACCESSORS=y # # CONFIG_XILINX_VCU is not set # end of Xilinx SoC drivers + +# +# prefetch tuning drivers +# +CONFIG_ARM64_PREFETCH_TUNING=m +# end of prefetch tuning drivers # end of SOC (System On Chip) specific Drivers # CONFIG_PM_DEVFREQ is not set @@ -5257,6 +5274,7 @@ CONFIG_PWM_SYSFS=y # CONFIG_PWM_FSL_FTM is not set # CONFIG_PWM_HIBVT is not set # CONFIG_PWM_PCA9685 is not set +# CONFIG_PWM_LIGHT is not set # # IRQ chip support @@ -5317,6 +5335,7 @@ CONFIG_PHY_HI6220_USB=m # CONFIG_PHY_QCOM_USB_HS_28NM is not set # CONFIG_PHY_QCOM_USB_SS is not set # CONFIG_PHY_QCOM_IPQ806X_USB is not set +# CONFIG_PHY_DW_DPHY is not set # CONFIG_PHY_TUSB1210 is not set # end of PHY Subsystem @@ -5338,9 +5357,9 @@ CONFIG_QCOM_L3_PMU=y CONFIG_THUNDERX2_PMU=m CONFIG_XGENE_PMU=y CONFIG_ARM_SPE_PMU=m +CONFIG_ALIBABA_UNCORE_DRW_PMU=m CONFIG_HISI_PMU=y CONFIG_DWC_UNCORE_PCIE_PMU=y -CONFIG_ALIBABA_UNCORE_DRW_PMU=m # end of Performance monitor support CONFIG_RAS=y @@ -5431,6 +5450,7 @@ CONFIG_XFS_POSIX_ACL=y # CONFIG_BTRFS_FS is not set # CONFIG_NILFS2_FS is not set # CONFIG_F2FS_FS is not set +# CONFIG_ZONEFS_FS is not set CONFIG_FS_DAX=y CONFIG_FS_POSIX_ACL=y CONFIG_EXPORTFS=y @@ 
-5728,7 +5748,7 @@ CONFIG_SECURITYFS=y CONFIG_SECURITY_NETWORK=y CONFIG_SECURITY_INFINIBAND=y CONFIG_SECURITY_NETWORK_XFRM=y -# CONFIG_SECURITY_PATH is not set +CONFIG_SECURITY_PATH=y CONFIG_LSM_MMAP_MIN_ADDR=65535 CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y CONFIG_HARDENED_USERCOPY=y @@ -5754,7 +5774,8 @@ CONFIG_INTEGRITY=y CONFIG_INTEGRITY_SIGNATURE=y CONFIG_INTEGRITY_ASYMMETRIC_KEYS=y CONFIG_INTEGRITY_TRUSTED_KEYRING=y -# CONFIG_INTEGRITY_PLATFORM_KEYRING is not set +CONFIG_INTEGRITY_PLATFORM_KEYRING=y +CONFIG_LOAD_UEFI_KEYS=y CONFIG_INTEGRITY_AUDIT=y CONFIG_IMA=y CONFIG_IMA_MEASURE_PCR_IDX=10 @@ -5763,9 +5784,9 @@ CONFIG_IMA_LSM_RULES=y CONFIG_IMA_SIG_TEMPLATE=y CONFIG_IMA_DEFAULT_TEMPLATE="ima-sig" # CONFIG_IMA_DEFAULT_HASH_SHA1 is not set -# CONFIG_IMA_DEFAULT_HASH_SHA256 is not set -CONFIG_IMA_DEFAULT_HASH_SM3=y -CONFIG_IMA_DEFAULT_HASH="sm3" +CONFIG_IMA_DEFAULT_HASH_SHA256=y +# CONFIG_IMA_DEFAULT_HASH_SM3 is not set +CONFIG_IMA_DEFAULT_HASH="sha256" CONFIG_IMA_WRITE_POLICY=y CONFIG_IMA_READ_POLICY=y CONFIG_IMA_APPRAISE=y @@ -5834,6 +5855,7 @@ CONFIG_CRYPTO_RNG_DEFAULT=m CONFIG_CRYPTO_AKCIPHER2=y CONFIG_CRYPTO_AKCIPHER=y CONFIG_CRYPTO_KPP2=y +CONFIG_CRYPTO_KPP=m CONFIG_CRYPTO_ACOMP2=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_MANAGER2=y @@ -5853,7 +5875,7 @@ CONFIG_CRYPTO_SIMD=y # Public-key cryptography # CONFIG_CRYPTO_RSA=y -# CONFIG_CRYPTO_DH is not set +CONFIG_CRYPTO_DH=m # CONFIG_CRYPTO_ECDH is not set # CONFIG_CRYPTO_ECRDSA is not set CONFIG_CRYPTO_SM2=y @@ -5898,8 +5920,8 @@ CONFIG_CRYPTO_VMAC=m # CONFIG_CRYPTO_CRC32C=y CONFIG_CRYPTO_CRC32=m -# CONFIG_CRYPTO_XXHASH is not set -# CONFIG_CRYPTO_BLAKE2B is not set +CONFIG_CRYPTO_XXHASH=m +CONFIG_CRYPTO_BLAKE2B=m # CONFIG_CRYPTO_BLAKE2S is not set CONFIG_CRYPTO_CRCT10DIF=y CONFIG_CRYPTO_GHASH=y @@ -5955,7 +5977,7 @@ CONFIG_CRYPTO_LZO=y # CONFIG_CRYPTO_842 is not set # CONFIG_CRYPTO_LZ4 is not set # CONFIG_CRYPTO_LZ4HC is not set -# CONFIG_CRYPTO_ZSTD is not set +CONFIG_CRYPTO_ZSTD=m # # Random Number Generation @@ -6065,7 +6087,6 @@ CONFIG_CRYPTO_LIB_POLY1305_GENERIC=m CONFIG_CRYPTO_LIB_POLY1305=m CONFIG_CRYPTO_LIB_CHACHA20POLY1305=m CONFIG_CRYPTO_LIB_SHA256=y -CONFIG_CRYPTO_LIB_SM4=y # end of Crypto library routines CONFIG_LIB_MEMNEQ=y @@ -6079,7 +6100,7 @@ CONFIG_CRC32_SLICEBY8=y # CONFIG_CRC32_SLICEBY4 is not set # CONFIG_CRC32_SARWATE is not set # CONFIG_CRC32_BIT is not set -# CONFIG_CRC64 is not set +CONFIG_CRC64=m # CONFIG_CRC4 is not set CONFIG_CRC7=m CONFIG_LIBCRC32C=m @@ -6094,6 +6115,7 @@ CONFIG_ZLIB_DEFLATE=y CONFIG_LZO_COMPRESS=y CONFIG_LZO_DECOMPRESS=y CONFIG_LZ4_DECOMPRESS=y +CONFIG_ZSTD_COMPRESS=m CONFIG_ZSTD_DECOMPRESS=y CONFIG_XZ_DEC=y CONFIG_XZ_DEC_X86=y @@ -6112,7 +6134,7 @@ CONFIG_DECOMPRESS_LZO=y CONFIG_DECOMPRESS_LZ4=y CONFIG_DECOMPRESS_ZSTD=y CONFIG_GENERIC_ALLOCATOR=y -CONFIG_REED_SOLOMON=m +CONFIG_REED_SOLOMON=y CONFIG_REED_SOLOMON_ENC8=y CONFIG_REED_SOLOMON_DEC8=y CONFIG_TEXTSEARCH=y @@ -6361,7 +6383,7 @@ CONFIG_DEBUG_LIST=y # CONFIG_DEBUG_PLIST is not set # CONFIG_DEBUG_SG is not set # CONFIG_DEBUG_NOTIFIERS is not set -# CONFIG_BUG_ON_DATA_CORRUPTION is not set +CONFIG_BUG_ON_DATA_CORRUPTION=y # end of Debug kernel data structures # CONFIG_DEBUG_CREDENTIALS is not set @@ -6458,8 +6480,10 @@ CONFIG_CORESIGHT_SOURCE_ETM4X=m CONFIG_ETM4X_IMPDEF_FEATURE=y CONFIG_CORESIGHT_STM=m CONFIG_CORESIGHT_CPU_DEBUG=m +# CONFIG_CORESIGHT_CPU_DEBUG_DEFAULT_ON is not set CONFIG_CORESIGHT_CTI=m CONFIG_CORESIGHT_CTI_INTEGRATION_REGS=y +# CONFIG_CORESIGHT_TRBE is not set # end of arm64 Debugging # @@ -6519,6 +6543,3 @@ 
CONFIG_TEST_BPF=m # CONFIG_MEMTEST is not set # end of Kernel Testing and Coverage # end of Kernel hacking -CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y -CONFIG_CK_KABI_RESERVE=y -CONFIG_AUXILIARY_BUS=y diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h index 973b144152711ace4bc6ee85c44926cebd253274..2672580b8c82cb4a326dba00a801e1aeefdeb459 100644 --- a/arch/arm64/include/asm/efi.h +++ b/arch/arm64/include/asm/efi.h @@ -27,12 +27,9 @@ int efi_set_mapping_permissions(struct mm_struct *mm, efi_memory_desc_t *md); __efi_fpsimd_begin(); \ }) +#undef arch_efi_call_virt #define arch_efi_call_virt(p, f, args...) \ -({ \ - efi_##f##_t *__f; \ - __f = p->f; \ - __efi_rt_asm_wrapper(__f, #f, args); \ -}) + __efi_rt_asm_wrapper((p)->f, #f, args) #define arch_efi_call_virt_teardown() \ ({ \ diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h index edbfac9ba9a059e87478d99db1f02679f3c6d06d..708a353b63a31b6f65b2676d2f3a47b3b30177f1 100644 --- a/arch/arm64/kernel/image-vars.h +++ b/arch/arm64/kernel/image-vars.h @@ -27,9 +27,6 @@ __efistub_primary_entry_offset = primary_entry - _text; */ __efistub_memcmp = __pi_memcmp; __efistub_memchr = __pi_memchr; -__efistub_memcpy = __pi_memcpy; -__efistub_memmove = __pi_memmove; -__efistub_memset = __pi_memset; __efistub_strlen = __pi_strlen; __efistub_strnlen = __pi_strnlen; __efistub_strcmp = __pi_strcmp; @@ -37,12 +34,6 @@ __efistub_strncmp = __pi_strncmp; __efistub_strrchr = __pi_strrchr; __efistub___clean_dcache_area_poc = __pi___clean_dcache_area_poc; -#ifdef CONFIG_KASAN -__efistub___memcpy = __pi_memcpy; -__efistub___memmove = __pi_memmove; -__efistub___memset = __pi_memset; -#endif - __efistub__text = _text; __efistub__end = _end; __efistub__edata = _edata; diff --git a/arch/riscv/include/asm/efi.h b/arch/riscv/include/asm/efi.h index 7542282f1141c200f56d70b3ae01dfa49eb7b155..3a6bb2268f1118fbf920d9d842f9c6f7e8cbae51 100644 --- a/arch/riscv/include/asm/efi.h +++ b/arch/riscv/include/asm/efi.h @@ -23,8 +23,6 @@ int efi_set_mapping_permissions(struct mm_struct *mm, efi_memory_desc_t *md); #define arch_efi_call_virt_setup() efi_virtmap_load() #define arch_efi_call_virt_teardown() efi_virtmap_unload() -#define arch_efi_call_virt(p, f, args...) 
p->f(args) - #define ARCH_EFI_IRQ_FLAGS_MASK (SR_IE | SR_SPIE) /* on RISC-V, the FDT may be located anywhere in system RAM */ diff --git a/arch/riscv/kernel/image-vars.h b/arch/riscv/kernel/image-vars.h index 8c212efb37a64cf0ff33a4fa98a2ac74f625174a..bf1c73b5a5b929d1c7068ea915622cc1b3cdcff4 100644 --- a/arch/riscv/kernel/image-vars.h +++ b/arch/riscv/kernel/image-vars.h @@ -25,21 +25,12 @@ */ __efistub_memcmp = memcmp; __efistub_memchr = memchr; -__efistub_memcpy = memcpy; -__efistub_memmove = memmove; -__efistub_memset = memset; __efistub_strlen = strlen; __efistub_strnlen = strnlen; __efistub_strcmp = strcmp; __efistub_strncmp = strncmp; __efistub_strrchr = strrchr; -#ifdef CONFIG_KASAN -__efistub___memcpy = memcpy; -__efistub___memmove = memmove; -__efistub___memset = memset; -#endif - __efistub__start = _start; __efistub__start_kernel = _start_kernel; __efistub__end = _end; diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 5674792726cd9107f1f0cccace2cd4eada98db3b..a32b8334ff10ab132398e2b03154e0f671b68bfe 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -329,6 +329,7 @@ static void __no_sanitize_address pcpu_delegate(struct pcpu *pcpu, CALL_ON_STACK(__pcpu_delegate, stack, 2, func, data); /* Stop target cpu (if func returns this stops the current cpu). */ pcpu_sigp_retry(pcpu, SIGP_STOP, 0); + pcpu_sigp_retry(pcpu, SIGP_CPU_RESET, 0); /* Restart func on the target cpu and stop the current cpu. */ mem_assign_absolute(lc->restart_stack, stack); mem_assign_absolute(lc->restart_fn, (unsigned long) func); diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 872557aff7351869cf22b13e96bdbc56b4591225..71485e36ba9cc5da753948bbda364392c7578b5e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1328,14 +1328,14 @@ config X86_REBOOTFIXUPS config MICROCODE bool "CPU microcode loading support" default y - depends on CPU_SUP_AMD || CPU_SUP_INTEL + depends on CPU_SUP_AMD || CPU_SUP_INTEL || CPU_SUP_HYGON help - If you say Y here, you will be able to update the microcode on - Intel and AMD processors. The Intel support is for the IA32 family, + If you say Y here, you will be able to update the microcode on Intel, + AMD and Hygon processors. The Intel support is for the IA32 family, e.g. Pentium Pro, Pentium II, Pentium III, Pentium 4, Xeon etc. The - AMD support is for families 0x10 and later. You will obviously need - the actual microcode binary data itself which is not shipped with - the Linux kernel. + AMD support is for families 0x10 and later. The Hygon support is for + families 0x18 and later. You will obviously need the actual microcode + binary data itself which is not shipped with the Linux kernel. The preferred method to load microcode from a detached initrd is described in Documentation/x86/microcode.rst. For that you need to enable @@ -1365,6 +1365,14 @@ config MICROCODE_AMD If you select this option, microcode patch loading support for AMD processors will be enabled. +config MICROCODE_HYGON + bool "Hygon microcode loading support" + depends on CPU_SUP_HYGON && MICROCODE + select MICROCODE_AMD + help + If you select this option, microcode patch loading support for Hygon + processors will be enabled. 
+ config MICROCODE_OLD_INTERFACE bool "Ancient loading interface (DEPRECATED)" default n diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c index 918a7606f53c0308621ef0b50670222dde3cf79c..2d81d3cc72a1fa1e5633fd121f40e9d45536c90b 100644 --- a/arch/x86/boot/compressed/tdx.c +++ b/arch/x86/boot/compressed/tdx.c @@ -26,7 +26,7 @@ static inline unsigned int tdx_io_in(int size, u16 port) .r14 = port, }; - if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) + if (__tdx_hypercall_ret(&args)) return UINT_MAX; return args.r11; @@ -43,7 +43,7 @@ static inline void tdx_io_out(int size, u16 port, u32 value) .r15 = value, }; - __tdx_hypercall(&args, 0); + __tdx_hypercall(&args); } static inline u8 tdx_inb(u16 port) diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S index 112b2375d021bb190fb6491bf05a41f52c2ea4fe..b22f34b8684a725b92c2b044f1a4efbbff959773 100644 --- a/arch/x86/boot/compressed/vmlinux.lds.S +++ b/arch/x86/boot/compressed/vmlinux.lds.S @@ -34,6 +34,7 @@ SECTIONS _text = .; /* Text */ *(.text) *(.text.*) + *(.noinstr.text) _etext = . ; } .rodata : { diff --git a/arch/x86/coco/tdx/tdcall.S b/arch/x86/coco/tdx/tdcall.S index f5a44e648c91fb6580db5fa01ae92c9aac7d2b2c..99bf315697c9efb62b5ac8f50cb606699fab1d73 100644 --- a/arch/x86/coco/tdx/tdcall.S +++ b/arch/x86/coco/tdx/tdcall.S @@ -13,6 +13,12 @@ /* * Bitmasks of exposed registers (with VMM). */ +#define TDX_RDX BIT(2) +#define TDX_RBX BIT(3) +#define TDX_RSI BIT(6) +#define TDX_RDI BIT(7) +#define TDX_R8 BIT(8) +#define TDX_R9 BIT(9) #define TDX_R10 BIT(10) #define TDX_R11 BIT(11) #define TDX_R12 BIT(12) @@ -27,9 +33,11 @@ * details can be found in TDX GHCI specification, section * titled "TDCALL [TDG.VP.VMCALL] leaf". */ -#define TDVMCALL_EXPOSE_REGS_MASK ( TDX_R10 | TDX_R11 | \ - TDX_R12 | TDX_R13 | \ - TDX_R14 | TDX_R15 ) +#define TDVMCALL_EXPOSE_REGS_MASK \ + ( TDX_RDX | TDX_RBX | TDX_RSI | TDX_RDI | TDX_R8 | TDX_R9 | \ + TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15 ) + +.section .noinstr.text, "ax" /* * __tdx_module_call() - Used by TDX guests to request services from @@ -77,12 +85,12 @@ SYM_FUNC_START(__tdx_module_call) SYM_FUNC_END(__tdx_module_call) /* - * __tdx_hypercall() - Make hypercalls to a TDX VMM using TDVMCALL leaf - * of TDCALL instruction + * TDX_HYPERCALL - Make hypercalls to a TDX VMM using TDVMCALL leaf of TDCALL + * instruction * * Transforms values in function call argument struct tdx_hypercall_args @args * into the TDCALL register ABI. After TDCALL operation, VMM output is saved - * back in @args. + * back in @args, if \ret is 1. * *------------------------------------------------------------------------- * TD VMCALL ABI: @@ -97,26 +105,18 @@ SYM_FUNC_END(__tdx_module_call) * specification. Non zero value indicates vendor * specific ABI. * R11 - VMCALL sub function number - * RBX, RBP, RDI, RSI - Used to pass VMCALL sub function specific arguments. + * RBX, RDX, RDI, RSI - Used to pass VMCALL sub function specific arguments. * R8-R9, R12-R15 - Same as above. * * Output Registers: * * RAX - TDCALL instruction status (Not related to hypercall * output). - * R10 - Hypercall output error code. - * R11-R15 - Hypercall sub function specific output values. - * - *------------------------------------------------------------------------- - * - * __tdx_hypercall() function ABI: + * RBX, RDX, RDI, RSI - Hypercall sub function specific output values. + * R8-R15 - Same as above. 
* - * @args (RDI) - struct tdx_hypercall_args for input and output - * @flags (RSI) - TDX_HCALL_* flags - * - * On successful completion, return the hypercall error code. */ -SYM_FUNC_START(__tdx_hypercall) +.macro TDX_HYPERCALL ret:req FRAME_BEGIN /* Save callee-saved GPRs as mandated by the x86_64 ABI */ @@ -124,38 +124,37 @@ SYM_FUNC_START(__tdx_hypercall) push %r14 push %r13 push %r12 + push %rbx + + /* Free RDI to be used as TDVMCALL arguments */ + movq %rdi, %rax + + /* Copy hypercall registers from arg struct: */ + movq TDX_HYPERCALL_r8(%rax), %r8 + movq TDX_HYPERCALL_r9(%rax), %r9 + movq TDX_HYPERCALL_r10(%rax), %r10 + movq TDX_HYPERCALL_r11(%rax), %r11 + movq TDX_HYPERCALL_r12(%rax), %r12 + movq TDX_HYPERCALL_r13(%rax), %r13 + movq TDX_HYPERCALL_r14(%rax), %r14 + movq TDX_HYPERCALL_r15(%rax), %r15 + movq TDX_HYPERCALL_rdi(%rax), %rdi + movq TDX_HYPERCALL_rsi(%rax), %rsi + movq TDX_HYPERCALL_rbx(%rax), %rbx + movq TDX_HYPERCALL_rdx(%rax), %rdx + + push %rax /* Mangle function call ABI into TDCALL ABI: */ /* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */ xor %eax, %eax - /* Copy hypercall registers from arg struct: */ - movq TDX_HYPERCALL_r10(%rdi), %r10 - movq TDX_HYPERCALL_r11(%rdi), %r11 - movq TDX_HYPERCALL_r12(%rdi), %r12 - movq TDX_HYPERCALL_r13(%rdi), %r13 - movq TDX_HYPERCALL_r14(%rdi), %r14 - movq TDX_HYPERCALL_r15(%rdi), %r15 - movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx - /* - * For the idle loop STI needs to be called directly before the TDCALL - * that enters idle (EXIT_REASON_HLT case). STI instruction enables - * interrupts only one instruction later. If there is a window between - * STI and the instruction that emulates the HALT state, there is a - * chance for interrupts to happen in this window, which can delay the - * HLT operation indefinitely. Since this is the not the desired - * result, conditionally call STI before TDCALL. - */ - testq $TDX_HCALL_ISSUE_STI, %rsi - jz .Lskip_sti - sti -.Lskip_sti: tdcall /* - * RAX==0 indicates a failure of the TDVMCALL mechanism itself and that + * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that * something has gone horribly wrong with the TDX module. * * The return status of the hypercall operation is in a separate @@ -163,32 +162,43 @@ SYM_FUNC_START(__tdx_hypercall) * and are handled by callers. */ testq %rax, %rax - jne .Lpanic + jne .Lpanic\@ + + pop %rax + + .if \ret + movq %r8, TDX_HYPERCALL_r8(%rax) + movq %r9, TDX_HYPERCALL_r9(%rax) + movq %r10, TDX_HYPERCALL_r10(%rax) + movq %r11, TDX_HYPERCALL_r11(%rax) + movq %r12, TDX_HYPERCALL_r12(%rax) + movq %r13, TDX_HYPERCALL_r13(%rax) + movq %r14, TDX_HYPERCALL_r14(%rax) + movq %r15, TDX_HYPERCALL_r15(%rax) + movq %rdi, TDX_HYPERCALL_rdi(%rax) + movq %rsi, TDX_HYPERCALL_rsi(%rax) + movq %rbx, TDX_HYPERCALL_rbx(%rax) + movq %rdx, TDX_HYPERCALL_rdx(%rax) + .endif /* TDVMCALL leaf return code is in R10 */ movq %r10, %rax - /* Copy hypercall result registers to arg struct if needed */ - testq $TDX_HCALL_HAS_OUTPUT, %rsi - jz .Lout - - movq %r10, TDX_HYPERCALL_r10(%rdi) - movq %r11, TDX_HYPERCALL_r11(%rdi) - movq %r12, TDX_HYPERCALL_r12(%rdi) - movq %r13, TDX_HYPERCALL_r13(%rdi) - movq %r14, TDX_HYPERCALL_r14(%rdi) - movq %r15, TDX_HYPERCALL_r15(%rdi) -.Lout: /* * Zero out registers exposed to the VMM to avoid speculative execution * with VMM-controlled values. This needs to include all registers - * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15 - * context will be restored. 
+ * present in TDVMCALL_EXPOSE_REGS_MASK, except RBX, and R12-R15 which + * will be restored. */ + xor %r8d, %r8d + xor %r9d, %r9d xor %r10d, %r10d xor %r11d, %r11d + xor %rdi, %rdi + xor %rdx, %rdx /* Restore callee-saved GPRs as mandated by the x86_64 ABI */ + pop %rbx pop %r12 pop %r13 pop %r14 @@ -197,8 +207,32 @@ SYM_FUNC_START(__tdx_hypercall) FRAME_END RET -.Lpanic: +.Lpanic\@: call __tdx_hypercall_failed /* __tdx_hypercall_failed never returns */ - jmp .Lpanic + jmp .Lpanic\@ +.endm + +/* + * + * __tdx_hypercall() function ABI: + * + * @args (RDI) - struct tdx_hypercall_args for input + * + * On successful completion, return the hypercall error code. + */ +SYM_FUNC_START(__tdx_hypercall) + TDX_HYPERCALL ret=0 SYM_FUNC_END(__tdx_hypercall) + +/* + * + * __tdx_hypercall_ret() function ABI: + * + * @args (RDI) - struct tdx_hypercall_args for input and output + * + * On successful completion, return the hypercall error code. + */ +SYM_FUNC_START(__tdx_hypercall_ret) + TDX_HYPERCALL ret=1 +SYM_FUNC_END(__tdx_hypercall_ret) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index bc6825c6e7bc53ab85cc89e87f1377ecac2e7a73..c9717d96921ab6de692bd4f63777b235ede2d14f 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -20,9 +20,14 @@ #define TDX_GET_VEINFO 3 #define TDX_GET_REPORT 4 #define TDX_ACCEPT_PAGE 6 +#define TDX_WR 8 + +/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */ +#define TDCS_NOTIFY_ENABLES 0x9100000000000010 /* TDX hypercall Leaf IDs */ #define TDVMCALL_MAP_GPA 0x10001 +#define TDVMCALL_REPORT_FATAL_ERROR 0x10003 /* MMIO direction */ #define EPT_READ 0 @@ -38,6 +43,7 @@ #define VE_GET_PORT_NUM(e) ((e) >> 16) #define VE_IS_IO_STRING(e) ((e) & BIT(4)) +#define ATTR_DEBUG BIT(0) #define ATTR_SEPT_VE_DISABLE BIT(28) /* TDX Module call error codes */ @@ -61,12 +67,13 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15) .r15 = r15, }; - return __tdx_hypercall(&args, 0); + return __tdx_hypercall(&args); } /* Called from __tdx_hypercall() for unrecoverable failure */ -void __tdx_hypercall_failed(void) +noinstr void __tdx_hypercall_failed(void) { + instrumentation_begin(); panic("TDVMCALL failed. TDX module bug?"); } @@ -76,7 +83,7 @@ void __tdx_hypercall_failed(void) * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and * guest sides of these calls. 
*/ -static u64 hcall_func(u64 exit_reason) +static __always_inline u64 hcall_func(u64 exit_reason) { return exit_reason; } @@ -93,7 +100,7 @@ long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2, .r14 = p4, }; - return __tdx_hypercall(&args, 0); + return __tdx_hypercall(&args); } EXPORT_SYMBOL_GPL(tdx_kvm_hypercall); #endif @@ -141,6 +148,41 @@ int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport) } EXPORT_SYMBOL_GPL(tdx_mcall_get_report0); +static void __noreturn tdx_panic(const char *msg) +{ + struct tdx_hypercall_args args = { + .r10 = TDX_HYPERCALL_STANDARD, + .r11 = TDVMCALL_REPORT_FATAL_ERROR, + .r12 = 0, /* Error code: 0 is Panic */ + }; + union { + /* Define register order according to the GHCI */ + struct { u64 r14, r15, rbx, rdi, rsi, r8, r9, rdx; }; + + char str[64]; + } message; + + /* VMM assumes '\0' in byte 65, if the message took all 64 bytes */ + strncpy(message.str, msg, 64); + + args.r8 = message.r8; + args.r9 = message.r9; + args.r14 = message.r14; + args.r15 = message.r15; + args.rdi = message.rdi; + args.rsi = message.rsi; + args.rbx = message.rbx; + args.rdx = message.rdx; + + /* + * This hypercall should never return and it is not safe + * to keep the guest running. Call it forever if it + * happens to return. + */ + while (1) + __tdx_hypercall(&args); +} + static void tdx_parse_tdinfo(u64 *cc_mask) { struct tdx_module_output out; @@ -172,8 +214,15 @@ static void tdx_parse_tdinfo(u64 *cc_mask) * TD-private memory. Only VMM-shared memory (MMIO) will #VE. */ td_attr = out.rdx; - if (!(td_attr & ATTR_SEPT_VE_DISABLE)) - panic("TD misconfiguration: SEPT_VE_DISABLE attibute must be set.\n"); + if (!(td_attr & ATTR_SEPT_VE_DISABLE)) { + const char *msg = "TD misconfiguration: SEPT_VE_DISABLE attribute must be set."; + + /* Relax SEPT_VE_DISABLE check for debug TD. */ + if (td_attr & ATTR_DEBUG) + pr_warn("%s\n", msg); + else + tdx_panic(msg); + } } /* @@ -221,7 +270,7 @@ static int ve_instr_len(struct ve_info *ve) } } -static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti) +static u64 __cpuidle __halt(const bool irq_disabled) { struct tdx_hypercall_args args = { .r10 = TDX_HYPERCALL_STANDARD, @@ -241,20 +290,14 @@ static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti) * can keep the vCPU in virtual HLT, even if an IRQ is * pending, without hanging/breaking the guest. */ - return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0); + return __tdx_hypercall(&args); } static int handle_halt(struct ve_info *ve) { - /* - * Since non safe halt is mainly used in CPU offlining - * and the guest will always stay in the halt state, don't - * call the STI instruction (set do_sti as false). - */ const bool irq_disabled = irqs_disabled(); - const bool do_sti = false; - if (__halt(irq_disabled, do_sti)) + if (__halt(irq_disabled)) return -EIO; return ve_instr_len(ve); @@ -262,18 +305,12 @@ static int handle_halt(struct ve_info *ve) void __cpuidle tdx_safe_halt(void) { - /* - * For do_sti=true case, __tdx_hypercall() function enables - * interrupts using the STI instruction before the TDCALL. So - * set irq_disabled as false. - */ const bool irq_disabled = false; - const bool do_sti = true; /* * Use WARN_ONCE() to report the failure. 
*/ - if (__halt(irq_disabled, do_sti)) + if (__halt(irq_disabled)) WARN_ONCE(1, "HLT instruction emulation failed\n"); } @@ -290,7 +327,7 @@ static int read_msr(struct pt_regs *regs, struct ve_info *ve) * can be found in TDX Guest-Host-Communication Interface * (GHCI), section titled "TDG.VP.VMCALL". */ - if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) + if (__tdx_hypercall_ret(&args)) return -EIO; regs->ax = lower_32_bits(args.r11); @@ -312,7 +349,7 @@ static int write_msr(struct pt_regs *regs, struct ve_info *ve) * can be found in TDX Guest-Host-Communication Interface * (GHCI) section titled "TDG.VP.VMCALL". */ - if (__tdx_hypercall(&args, 0)) + if (__tdx_hypercall(&args)) return -EIO; return ve_instr_len(ve); @@ -344,7 +381,7 @@ static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve) * ABI can be found in TDX Guest-Host-Communication Interface * (GHCI), section titled "VP.VMCALL". */ - if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) + if (__tdx_hypercall_ret(&args)) return -EIO; /* @@ -371,7 +408,7 @@ static bool mmio_read(int size, unsigned long addr, unsigned long *val) .r15 = *val, }; - if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT)) + if (__tdx_hypercall_ret(&args)) return false; *val = args.r11; return true; @@ -505,7 +542,7 @@ static bool handle_in(struct pt_regs *regs, int size, int port) * in TDX Guest-Host-Communication Interface (GHCI) section titled * "TDG.VP.VMCALL". */ - success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT); + success = !__tdx_hypercall_ret(&args); /* Update part of the register affected by the emulated instruction */ regs->ax &= ~mask; @@ -629,6 +666,11 @@ static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve) } } +static inline bool is_private_gpa(u64 gpa) +{ + return gpa == cc_mkenc(gpa); +} + /* * Handle the kernel #VE. * @@ -647,6 +689,8 @@ static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve) case EXIT_REASON_CPUID: return handle_cpuid(regs, ve); case EXIT_REASON_EPT_VIOLATION: + if (is_private_gpa(ve->gpa)) + panic("Unexpected EPT-violation on private memory."); return handle_mmio(regs, ve); case EXIT_REASON_IO_INSTRUCTION: return handle_io(regs, ve); @@ -813,6 +857,9 @@ void __init tdx_early_init(void) tdx_parse_tdinfo(&cc_mask); cc_set_mask(cc_mask); + /* Kernel does not use NOTIFY_ENABLES and does not need random #VEs */ + tdx_module_call(TDX_WR, 0, TDCS_NOTIFY_ENABLES, 0, -1ULL, NULL); + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) swiotlb_force = SWIOTLB_FORCE; diff --git a/arch/x86/configs/anolis-debug_defconfig b/arch/x86/configs/anolis-debug_defconfig index 0db7a0dc9bef874f96f6b35343c63593aedc397b..d60b8ae51b7dfca3c48cdc3b2e1837eb41907a45 100644 --- a/arch/x86/configs/anolis-debug_defconfig +++ b/arch/x86/configs/anolis-debug_defconfig @@ -2,9 +2,9 @@ # Automatically generated file; DO NOT EDIT. 
# Linux/x86 5.10.134 Kernel Configuration # -CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.4.1 20200928 (Anolis 8.4.1-1.0.1)" +CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.5.0 20210514 (Anolis 8.5.0-10.0.3)" CONFIG_CC_IS_GCC=y -CONFIG_GCC_VERSION=80401 +CONFIG_GCC_VERSION=80500 CONFIG_LD_VERSION=230000000 CONFIG_CLANG_VERSION=0 CONFIG_LLD_VERSION=0 @@ -31,10 +31,10 @@ CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_HAVE_KERNEL_ZSTD=y -# CONFIG_KERNEL_GZIP is not set +CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set -CONFIG_KERNEL_XZ=y +# CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set # CONFIG_KERNEL_ZSTD is not set @@ -139,7 +139,8 @@ CONFIG_RCU_NOCB_CPU=y # end of RCU Subsystem CONFIG_BUILD_BIN2C=y -# CONFIG_IKCONFIG is not set +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=21 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 @@ -171,7 +172,7 @@ CONFIG_GROUP_IDENTITY=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y CONFIG_CGROUP_PIDS=y -# CONFIG_CGROUP_IOASIDS is not set +CONFIG_CGROUP_IOASIDS=y CONFIG_CGROUP_RDMA=y CONFIG_CGROUP_FREEZER=y CONFIG_CGROUP_HUGETLB=y @@ -181,6 +182,7 @@ CONFIG_CGROUP_DEVICE=y CONFIG_SCHED_SLI=y CONFIG_RICH_CONTAINER=y # CONFIG_RICH_CONTAINER_CG_SWITCH is not set +# CONFIG_MAX_PID_PER_NS is not set CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y @@ -192,7 +194,6 @@ CONFIG_TIME_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y -# CONFIG_MAX_PID_PER_NS is not set CONFIG_NET_NS=y CONFIG_CHECKPOINT_RESTORE=y CONFIG_SCHED_AUTOGROUP=y @@ -360,6 +361,7 @@ CONFIG_PARAVIRT_TIME_ACCOUNTING=y CONFIG_PARAVIRT_CLOCK=y # CONFIG_JAILHOUSE_GUEST is not set # CONFIG_ACRN_GUEST is not set +CONFIG_INTEL_TDX_GUEST=y # CONFIG_MK8 is not set # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set @@ -410,7 +412,7 @@ CONFIG_PERF_EVENTS_INTEL_RAPL=m CONFIG_PERF_EVENTS_INTEL_CSTATE=m CONFIG_PERF_EVENTS_AMD_POWER=m CONFIG_PERF_EVENTS_AMD_UNCORE=y -# CONFIG_PERF_EVENTS_AMD_BRS is not set +CONFIG_PERF_EVENTS_AMD_BRS=y # end of Performance monitoring CONFIG_X86_16BIT=y @@ -421,16 +423,17 @@ CONFIG_I8K=m CONFIG_MICROCODE=y CONFIG_MICROCODE_INTEL=y CONFIG_MICROCODE_AMD=y +# CONFIG_MICROCODE_HYGON is not set # CONFIG_MICROCODE_OLD_INTERFACE is not set CONFIG_X86_MSR=y CONFIG_X86_CPUID=y # CONFIG_X86_5LEVEL is not set CONFIG_X86_DIRECT_GBPAGES=y CONFIG_X86_CPA_STATISTICS=y +CONFIG_X86_MEM_ENCRYPT=y CONFIG_AMD_MEM_ENCRYPT=y # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set CONFIG_NUMA=y -CONFIG_NUMA_AWARE_SPINLOCKS=y CONFIG_AMD_NUMA=y CONFIG_X86_64_ACPI_NUMA=y CONFIG_NUMA_EMU=y @@ -602,6 +605,7 @@ CONFIG_ACPI_CONFIGFS=m CONFIG_ACPI_PCC=y CONFIG_PMIC_OPREGION=y CONFIG_X86_PM_TIMER=y +CONFIG_ACPI_PRMT=y CONFIG_SFI=y # @@ -627,7 +631,7 @@ CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y # CONFIG_X86_INTEL_PSTATE=y # CONFIG_X86_PCC_CPUFREQ is not set -# CONFIG_X86_AMD_PSTATE is not set +CONFIG_X86_AMD_PSTATE=y CONFIG_X86_ACPI_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ_CPB=y CONFIG_X86_POWERNOW_K8=m @@ -745,7 +749,7 @@ CONFIG_KVM_XFER_TO_GUEST_WORK=y CONFIG_VIRTUALIZATION=y CONFIG_KVM=m CONFIG_KVM_INTEL=m -# CONFIG_KVM_INTEL_TDX is not set +CONFIG_KVM_INTEL_TDX=y CONFIG_X86_SGX_KVM=y CONFIG_KVM_AMD=m CONFIG_KVM_AMD_SEV=y @@ -846,6 +850,7 @@ CONFIG_OLD_SIGSUSPEND3=y CONFIG_COMPAT_OLD_SIGACTION=y CONFIG_COMPAT_32BIT_TIME=y CONFIG_HAVE_ARCH_VMAP_STACK=y +CONFIG_VMAP_STACK=y CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y CONFIG_STRICT_KERNEL_RWX=y CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y @@ -860,6 
+865,7 @@ CONFIG_HAVE_STATIC_CALL=y CONFIG_HAVE_STATIC_CALL_INLINE=y CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y CONFIG_DYNAMIC_SIGFRAME=y +CONFIG_HAVE_ARCH_NODE_DEV_GROUP=y # # GCOV-based kernel profiling @@ -972,6 +978,8 @@ CONFIG_QUEUED_RWLOCKS=y CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y +CONFIG_CK_KABI_RESERVE=y +CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y CONFIG_FREEZER=y # @@ -1122,6 +1130,7 @@ CONFIG_SMC=m CONFIG_SMC_DIAG=m CONFIG_XDP_SOCKETS=y CONFIG_XDP_SOCKETS_DIAG=m +CONFIG_VTOA=m CONFIG_HOOKERS=m CONFIG_INET=y CONFIG_IP_MULTICAST=y @@ -1174,7 +1183,7 @@ CONFIG_TCP_CONG_VENO=m CONFIG_TCP_CONG_YEAH=m CONFIG_TCP_CONG_ILLINOIS=m CONFIG_TCP_CONG_DCTCP=m -# CONFIG_TCP_CONG_CDG is not set +CONFIG_TCP_CONG_CDG=m CONFIG_TCP_CONG_BBR=m CONFIG_DEFAULT_CUBIC=y # CONFIG_DEFAULT_RENO is not set @@ -1230,7 +1239,7 @@ CONFIG_NETFILTER_INGRESS=y CONFIG_NETFILTER_NETLINK=m CONFIG_NETFILTER_FAMILY_BRIDGE=y CONFIG_NETFILTER_FAMILY_ARP=y -# CONFIG_NETFILTER_NETLINK_ACCT is not set +CONFIG_NETFILTER_NETLINK_ACCT=m CONFIG_NETFILTER_NETLINK_QUEUE=m CONFIG_NETFILTER_NETLINK_LOG=m CONFIG_NETFILTER_NETLINK_OSF=m @@ -1279,6 +1288,7 @@ CONFIG_NF_TABLES_INET=y CONFIG_NF_TABLES_NETDEV=y CONFIG_NFT_NUMGEN=m CONFIG_NFT_CT=m +CONFIG_NFT_FLOW_OFFLOAD=m CONFIG_NFT_COUNTER=m CONFIG_NFT_CONNLIMIT=m CONFIG_NFT_LOG=m @@ -1286,7 +1296,7 @@ CONFIG_NFT_LIMIT=m CONFIG_NFT_MASQ=m CONFIG_NFT_REDIR=m CONFIG_NFT_NAT=m -# CONFIG_NFT_TUNNEL is not set +CONFIG_NFT_TUNNEL=m CONFIG_NFT_OBJREF=m CONFIG_NFT_QUEUE=m CONFIG_NFT_QUOTA=m @@ -1298,14 +1308,15 @@ CONFIG_NFT_FIB=m CONFIG_NFT_FIB_INET=m CONFIG_NFT_XFRM=m CONFIG_NFT_SOCKET=m -# CONFIG_NFT_OSF is not set +CONFIG_NFT_OSF=m CONFIG_NFT_TPROXY=m # CONFIG_NFT_SYNPROXY is not set CONFIG_NF_DUP_NETDEV=m CONFIG_NFT_DUP_NETDEV=m CONFIG_NFT_FWD_NETDEV=m CONFIG_NFT_FIB_NETDEV=m -# CONFIG_NF_FLOW_TABLE is not set +CONFIG_NF_FLOW_TABLE_INET=m +CONFIG_NF_FLOW_TABLE=m CONFIG_NETFILTER_XTABLES=y # @@ -1328,7 +1339,7 @@ CONFIG_NETFILTER_XT_TARGET_DSCP=m CONFIG_NETFILTER_XT_TARGET_HL=m CONFIG_NETFILTER_XT_TARGET_HMARK=m CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m -# CONFIG_NETFILTER_XT_TARGET_LED is not set +CONFIG_NETFILTER_XT_TARGET_LED=m CONFIG_NETFILTER_XT_TARGET_LOG=m CONFIG_NETFILTER_XT_TARGET_MARK=m CONFIG_NETFILTER_XT_NAT=m @@ -1371,13 +1382,13 @@ CONFIG_NETFILTER_XT_MATCH_HL=m # CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set CONFIG_NETFILTER_XT_MATCH_IPRANGE=m CONFIG_NETFILTER_XT_MATCH_IPVS=m -# CONFIG_NETFILTER_XT_MATCH_L2TP is not set +CONFIG_NETFILTER_XT_MATCH_L2TP=m CONFIG_NETFILTER_XT_MATCH_LENGTH=m CONFIG_NETFILTER_XT_MATCH_LIMIT=m CONFIG_NETFILTER_XT_MATCH_MAC=m CONFIG_NETFILTER_XT_MATCH_MARK=m CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m -# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set +CONFIG_NETFILTER_XT_MATCH_NFACCT=m CONFIG_NETFILTER_XT_MATCH_OSF=m CONFIG_NETFILTER_XT_MATCH_OWNER=m CONFIG_NETFILTER_XT_MATCH_POLICY=m @@ -1393,7 +1404,7 @@ CONFIG_NETFILTER_XT_MATCH_STATE=m CONFIG_NETFILTER_XT_MATCH_STATISTIC=m CONFIG_NETFILTER_XT_MATCH_STRING=m CONFIG_NETFILTER_XT_MATCH_TCPMSS=m -# CONFIG_NETFILTER_XT_MATCH_TIME is not set +CONFIG_NETFILTER_XT_MATCH_TIME=m CONFIG_NETFILTER_XT_MATCH_U32=m # end of Core Netfilter Configuration @@ -1443,7 +1454,7 @@ CONFIG_IP_VS_LBLC=m CONFIG_IP_VS_LBLCR=m CONFIG_IP_VS_DH=m CONFIG_IP_VS_SH=m -# CONFIG_IP_VS_MH is not set +CONFIG_IP_VS_MH=m CONFIG_IP_VS_SED=m CONFIG_IP_VS_NQ=m @@ -1475,6 +1486,7 @@ CONFIG_NFT_REJECT_IPV4=m CONFIG_NFT_DUP_IPV4=m CONFIG_NFT_FIB_IPV4=m CONFIG_NF_TABLES_ARP=y 
+CONFIG_NF_FLOW_TABLE_IPV4=m CONFIG_NF_DUP_IPV4=m CONFIG_NF_LOG_ARP=m CONFIG_NF_LOG_IPV4=m @@ -1495,7 +1507,7 @@ CONFIG_IP_NF_TARGET_MASQUERADE=m CONFIG_IP_NF_TARGET_NETMAP=m CONFIG_IP_NF_TARGET_REDIRECT=m CONFIG_IP_NF_MANGLE=m -# CONFIG_IP_NF_TARGET_CLUSTERIP is not set +CONFIG_IP_NF_TARGET_CLUSTERIP=m CONFIG_IP_NF_TARGET_ECN=m CONFIG_IP_NF_TARGET_TTL=m CONFIG_IP_NF_RAW=m @@ -1514,6 +1526,7 @@ CONFIG_NF_TABLES_IPV6=y CONFIG_NFT_REJECT_IPV6=m CONFIG_NFT_DUP_IPV6=m CONFIG_NFT_FIB_IPV6=m +CONFIG_NF_FLOW_TABLE_IPV6=m CONFIG_NF_DUP_IPV6=m CONFIG_NF_REJECT_IPV6=m CONFIG_NF_LOG_IPV6=m @@ -1528,7 +1541,7 @@ CONFIG_IP6_NF_MATCH_MH=m CONFIG_IP6_NF_MATCH_RPFILTER=m CONFIG_IP6_NF_MATCH_RT=m # CONFIG_IP6_NF_MATCH_SRH is not set -# CONFIG_IP6_NF_TARGET_HL is not set +CONFIG_IP6_NF_TARGET_HL=m CONFIG_IP6_NF_FILTER=m CONFIG_IP6_NF_TARGET_REJECT=m CONFIG_IP6_NF_TARGET_SYNPROXY=m @@ -1568,7 +1581,24 @@ CONFIG_BRIDGE_EBT_SNAT=m CONFIG_BRIDGE_EBT_LOG=m CONFIG_BRIDGE_EBT_NFLOG=m # CONFIG_BPFILTER is not set -# CONFIG_IP_DCCP is not set +CONFIG_IP_DCCP=m +CONFIG_INET_DCCP_DIAG=m + +# +# DCCP CCIDs Configuration +# +# CONFIG_IP_DCCP_CCID2_DEBUG is not set +CONFIG_IP_DCCP_CCID3=y +# CONFIG_IP_DCCP_CCID3_DEBUG is not set +CONFIG_IP_DCCP_TFRC_LIB=y +# end of DCCP CCIDs Configuration + +# +# DCCP Kernel Hacking +# +# CONFIG_IP_DCCP_DEBUG is not set +# end of DCCP Kernel Hacking + CONFIG_IP_SCTP=m # CONFIG_SCTP_DBG_OBJCNT is not set # CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5 is not set @@ -1662,17 +1692,16 @@ CONFIG_NET_SCH_PLUG=m CONFIG_NET_SCH_DEFAULT=y # CONFIG_DEFAULT_FQ is not set # CONFIG_DEFAULT_CODEL is not set -# CONFIG_DEFAULT_FQ_CODEL is not set +CONFIG_DEFAULT_FQ_CODEL=y # CONFIG_DEFAULT_SFQ is not set -CONFIG_DEFAULT_PFIFO_FAST=y -CONFIG_DEFAULT_NET_SCH="pfifo_fast" +# CONFIG_DEFAULT_PFIFO_FAST is not set +CONFIG_DEFAULT_NET_SCH="fq_codel" # # Classification # CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=m -CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m @@ -1701,7 +1730,7 @@ CONFIG_NET_ACT_GACT=m CONFIG_GACT_PROB=y CONFIG_NET_ACT_MIRRED=m CONFIG_NET_ACT_SAMPLE=m -# CONFIG_NET_ACT_IPT is not set +CONFIG_NET_ACT_IPT=m CONFIG_NET_ACT_NAT=m CONFIG_NET_ACT_PEDIT=m CONFIG_NET_ACT_SIMP=m @@ -1715,11 +1744,12 @@ CONFIG_NET_ACT_BPF=m CONFIG_NET_ACT_SKBMOD=m # CONFIG_NET_ACT_IFE is not set CONFIG_NET_ACT_TUNNEL_KEY=m +CONFIG_NET_ACT_CT=m # CONFIG_NET_ACT_GATE is not set -# CONFIG_NET_TC_SKB_EXT is not set +CONFIG_NET_TC_SKB_EXT=y CONFIG_NET_SCH_FIFO=y CONFIG_DCB=y -CONFIG_DNS_RESOLVER=m +CONFIG_DNS_RESOLVER=y # CONFIG_BATMAN_ADV is not set CONFIG_OPENVSWITCH=m CONFIG_OPENVSWITCH_GRE=m @@ -2014,6 +2044,7 @@ CONFIG_YENTA_TOSHIBA=y # # Generic Driver Options # +CONFIG_AUXILIARY_BUS=y # CONFIG_UEVENT_HELPER is not set CONFIG_DEVTMPFS=y CONFIG_DEVTMPFS_MOUNT=y @@ -2193,8 +2224,8 @@ CONFIG_CDROM_PKTCDVD_BUFFERS=8 CONFIG_XEN_BLKDEV_FRONTEND=m CONFIG_VIRTIO_BLK=y CONFIG_BLK_DEV_RBD=m -CONFIG_BLK_DEV_UBLK=m # CONFIG_BLK_DEV_RSXX is not set +CONFIG_BLK_DEV_UBLK=m # # NVME Support @@ -2749,6 +2780,7 @@ CONFIG_MLX5_EN_RXNFC=y CONFIG_MLX5_MPFS=y CONFIG_MLX5_ESWITCH=y CONFIG_MLX5_CLS_ACT=y +CONFIG_MLX5_TC_CT=y CONFIG_MLX5_CORE_EN_DCB=y # CONFIG_MLX5_CORE_IPOIB is not set # CONFIG_MLX5_FPGA_IPSEC is not set @@ -3547,6 +3579,7 @@ CONFIG_TCG_INFINEON=m # CONFIG_TCG_XEN is not set CONFIG_TCG_CRB=y # CONFIG_TCG_VTPM_PROXY is not set +CONFIG_TCG_HYGON=m CONFIG_TCG_TIS_ST33ZP24=m CONFIG_TCG_TIS_ST33ZP24_I2C=m CONFIG_TELCLOCK=m @@ -4153,9 +4186,6 @@ CONFIG_MFD_INTEL_LPSS_ACPI=m 
CONFIG_MFD_INTEL_LPSS_PCI=m # CONFIG_MFD_INTEL_PMC_BXT is not set CONFIG_MFD_INTEL_PMT=m -CONFIG_INTEL_PMT_CLASS=m -CONFIG_INTEL_PMT_CRASHLOG=m -CONFIG_INTEL_PMT_TELEMETRY=m # CONFIG_MFD_IQS62X is not set # CONFIG_MFD_JANZ_CMODIO is not set # CONFIG_MFD_KEMPLD is not set @@ -4393,6 +4423,7 @@ CONFIG_DRM_CIRRUS_QEMU=m # CONFIG_DRM_GM12U320 is not set # CONFIG_DRM_XEN is not set # CONFIG_DRM_VBOXVIDEO is not set +# CONFIG_DRM_VERISILICON is not set # CONFIG_DRM_LEGACY is not set CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y @@ -5177,7 +5208,7 @@ CONFIG_INTEL_IDXD_SVM=y # CONFIG_INTEL_IDXD_PERFMON is not set CONFIG_INTEL_IOATDMA=m # CONFIG_PLX_DMA is not set -# CONFIG_AMD_PTDMA is not set +CONFIG_AMD_PTDMA=y # CONFIG_QCOM_HIDMA_MGMT is not set # CONFIG_QCOM_HIDMA is not set CONFIG_DW_DMAC_CORE=y @@ -5233,7 +5264,10 @@ CONFIG_VFIO_MDEV=m # CONFIG_VFIO_MDEV_IDXD is not set CONFIG_IRQ_BYPASS_MANAGER=m CONFIG_VIRT_DRIVERS=y +# CONFIG_VBOXGUEST is not set +# CONFIG_NITRO_ENCLAVES is not set CONFIG_EFI_SECRET=m +CONFIG_TDX_GUEST_DRIVER=m CONFIG_VIRTIO=y CONFIG_VIRTIO_MENU=y CONFIG_VIRTIO_PCI=y @@ -5376,6 +5410,9 @@ CONFIG_INTEL_SPEED_SELECT_INTERFACE=m CONFIG_INTEL_TURBO_MAX_3=y # CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set CONFIG_INTEL_PMC_CORE=m +CONFIG_INTEL_PMT_CLASS=m +CONFIG_INTEL_PMT_TELEMETRY=m +CONFIG_INTEL_PMT_CRASHLOG=m # CONFIG_INTEL_PUNIT_IPC is not set # CONFIG_INTEL_SCU_PCI is not set # CONFIG_INTEL_SCU_PLATFORM is not set @@ -5493,6 +5530,11 @@ CONFIG_HYPERV_IOMMU=y # # CONFIG_XILINX_VCU is not set # end of Xilinx SoC drivers + +# +# prefetch tuning drivers +# +# end of prefetch tuning drivers # end of SOC (System On Chip) specific Drivers # CONFIG_PM_DEVFREQ is not set @@ -5867,6 +5909,7 @@ CONFIG_PWM_LPSS=m CONFIG_PWM_LPSS_PCI=m CONFIG_PWM_LPSS_PLATFORM=m # CONFIG_PWM_PCA9685 is not set +# CONFIG_PWM_LIGHT is not set # # IRQ chip support @@ -6214,7 +6257,7 @@ CONFIG_CEPH_FS=m # CONFIG_CEPH_FSCACHE is not set CONFIG_CEPH_FS_POSIX_ACL=y # CONFIG_CEPH_FS_SECURITY_LABEL is not set -CONFIG_CIFS=m +CONFIG_CIFS=y # CONFIG_CIFS_STATS2 is not set CONFIG_CIFS_ALLOW_INSECURE_LEGACY=y CONFIG_CIFS_WEAK_PW_HASH=y @@ -6225,8 +6268,6 @@ CONFIG_CIFS_DEBUG=y # CONFIG_CIFS_DEBUG2 is not set # CONFIG_CIFS_DEBUG_DUMP_KEYS is not set CONFIG_CIFS_DFS_UPCALL=y -# CONFIG_CIFS_SMB_DIRECT is not set -# CONFIG_CIFS_FSCACHE is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set CONFIG_NLS=y @@ -6319,9 +6360,9 @@ CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1 CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS=9 CONFIG_SECURITY_SELINUX_SID2STR_CACHE_SIZE=256 CONFIG_SECURITY_SMACK=y -# CONFIG_SECURITY_SMACK_BRINGUP is not set -# CONFIG_SECURITY_SMACK_NETFILTER is not set -# CONFIG_SECURITY_SMACK_APPEND_SIGNALS is not set +CONFIG_SECURITY_SMACK_BRINGUP=y +CONFIG_SECURITY_SMACK_NETFILTER=y +CONFIG_SECURITY_SMACK_APPEND_SIGNALS=y # CONFIG_SECURITY_TOMOYO is not set # CONFIG_SECURITY_APPARMOR is not set # CONFIG_SECURITY_LOADPIN is not set @@ -6342,12 +6383,12 @@ CONFIG_IMA_LSM_RULES=y CONFIG_IMA_SIG_TEMPLATE=y CONFIG_IMA_DEFAULT_TEMPLATE="ima-sig" # CONFIG_IMA_DEFAULT_HASH_SHA1 is not set -# CONFIG_IMA_DEFAULT_HASH_SHA256 is not set +CONFIG_IMA_DEFAULT_HASH_SHA256=y # CONFIG_IMA_DEFAULT_HASH_SHA512 is not set -CONFIG_IMA_DEFAULT_HASH_SM3=y -CONFIG_IMA_DEFAULT_HASH="sm3" -# CONFIG_IMA_WRITE_POLICY is not set -# CONFIG_IMA_READ_POLICY is not set +# CONFIG_IMA_DEFAULT_HASH_SM3 is not set +CONFIG_IMA_DEFAULT_HASH="sha256" +CONFIG_IMA_WRITE_POLICY=y +CONFIG_IMA_READ_POLICY=y CONFIG_IMA_APPRAISE=y # CONFIG_IMA_ARCH_POLICY is not 
set CONFIG_IMA_APPRAISE_BUILD_POLICY=y @@ -6358,7 +6399,7 @@ CONFIG_IMA_APPRAISE_BUILD_POLICY=y CONFIG_IMA_APPRAISE_BOOTPARAM=y # CONFIG_IMA_APPRAISE_MODSIG is not set CONFIG_IMA_TRUSTED_KEYRING=y -# CONFIG_IMA_KEYRINGS_PERMIT_SIGNED_BY_BUILTIN_OR_SECONDARY is not set +CONFIG_IMA_KEYRINGS_PERMIT_SIGNED_BY_BUILTIN_OR_SECONDARY=y CONFIG_IMA_BLACKLIST_KEYRING=y CONFIG_IMA_LOAD_X509=y CONFIG_IMA_X509_PATH="/etc/keys/x509_ima.der" @@ -6368,7 +6409,7 @@ CONFIG_IMA_QUEUE_EARLY_BOOT_KEYS=y # CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set CONFIG_EVM=y CONFIG_EVM_ATTR_FSUUID=y -# CONFIG_EVM_EXTRA_SMACK_XATTRS is not set +CONFIG_EVM_EXTRA_SMACK_XATTRS=y # CONFIG_EVM_ADD_XATTRS is not set CONFIG_EVM_LOAD_X509=y CONFIG_EVM_X509_PATH="/etc/keys/x509_evm.der" @@ -6449,7 +6490,7 @@ CONFIG_CRYPTO_CURVE25519_X86=m # # Authenticated Encryption with Associated Data # -CONFIG_CRYPTO_CCM=m +CONFIG_CRYPTO_CCM=y CONFIG_CRYPTO_GCM=y CONFIG_CRYPTO_CHACHA20POLY1305=m # CONFIG_CRYPTO_AEGIS128 is not set @@ -6478,7 +6519,7 @@ CONFIG_CRYPTO_ESSIV=m # # Hash modes # -CONFIG_CRYPTO_CMAC=m +CONFIG_CRYPTO_CMAC=y CONFIG_CRYPTO_HMAC=y CONFIG_CRYPTO_XCBC=m CONFIG_CRYPTO_VMAC=m @@ -6490,8 +6531,8 @@ CONFIG_CRYPTO_CRC32C=y CONFIG_CRYPTO_CRC32C_INTEL=m CONFIG_CRYPTO_CRC32=m CONFIG_CRYPTO_CRC32_PCLMUL=m -# CONFIG_CRYPTO_XXHASH is not set -# CONFIG_CRYPTO_BLAKE2B is not set +CONFIG_CRYPTO_XXHASH=m +CONFIG_CRYPTO_BLAKE2B=m # CONFIG_CRYPTO_BLAKE2S is not set CONFIG_CRYPTO_BLAKE2S_X86=m CONFIG_CRYPTO_CRCT10DIF=y @@ -6499,7 +6540,7 @@ CONFIG_CRYPTO_CRCT10DIF_PCLMUL=m CONFIG_CRYPTO_GHASH=y CONFIG_CRYPTO_POLY1305=m CONFIG_CRYPTO_POLY1305_X86_64=m -CONFIG_CRYPTO_MD4=m +CONFIG_CRYPTO_MD4=y CONFIG_CRYPTO_MD5=y CONFIG_CRYPTO_MICHAEL_MIC=m CONFIG_CRYPTO_RMD128=m @@ -6572,7 +6613,7 @@ CONFIG_CRYPTO_LZO=y # CONFIG_CRYPTO_842 is not set # CONFIG_CRYPTO_LZ4 is not set # CONFIG_CRYPTO_LZ4HC is not set -# CONFIG_CRYPTO_ZSTD is not set +CONFIG_CRYPTO_ZSTD=m # # Random Number Generation @@ -6604,6 +6645,9 @@ CONFIG_CRYPTO_DEV_CCP_DD=m CONFIG_CRYPTO_DEV_SP_CCP=y CONFIG_CRYPTO_DEV_CCP_CRYPTO=m CONFIG_CRYPTO_DEV_SP_PSP=y +CONFIG_HYGON_PSP2CPU_CMD=y +CONFIG_TDM_DEV_HYGON=y +CONFIG_TDM_KERNEL_GUARD=m # CONFIG_CRYPTO_DEV_CCP_DEBUGFS is not set CONFIG_CRYPTO_DEV_QAT=m CONFIG_CRYPTO_DEV_QAT_DH895xCC=m @@ -6668,7 +6712,7 @@ CONFIG_ARCH_USE_SYM_ANNOTATIONS=y # Crypto library routines # CONFIG_CRYPTO_LIB_AES=y -CONFIG_CRYPTO_LIB_ARC4=m +CONFIG_CRYPTO_LIB_ARC4=y CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S=y CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y CONFIG_CRYPTO_ARCH_HAVE_LIB_CHACHA=m @@ -6677,14 +6721,13 @@ CONFIG_CRYPTO_LIB_CHACHA=m CONFIG_CRYPTO_ARCH_HAVE_LIB_CURVE25519=m CONFIG_CRYPTO_LIB_CURVE25519_GENERIC=m CONFIG_CRYPTO_LIB_CURVE25519=m -CONFIG_CRYPTO_LIB_DES=m +CONFIG_CRYPTO_LIB_DES=y CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11 CONFIG_CRYPTO_ARCH_HAVE_LIB_POLY1305=m CONFIG_CRYPTO_LIB_POLY1305_GENERIC=m CONFIG_CRYPTO_LIB_POLY1305=m CONFIG_CRYPTO_LIB_CHACHA20POLY1305=m CONFIG_CRYPTO_LIB_SHA256=y -CONFIG_CRYPTO_LIB_SM4=y # end of Crypto library routines CONFIG_LIB_MEMNEQ=y @@ -6698,8 +6741,8 @@ CONFIG_CRC32_SLICEBY8=y # CONFIG_CRC32_SLICEBY4 is not set # CONFIG_CRC32_SARWATE is not set # CONFIG_CRC32_BIT is not set -# CONFIG_CRC64 is not set -# CONFIG_CRC4 is not set +CONFIG_CRC64=m +CONFIG_CRC4=m CONFIG_CRC7=m CONFIG_LIBCRC32C=m CONFIG_CRC8=m @@ -6710,6 +6753,7 @@ CONFIG_ZLIB_DEFLATE=y CONFIG_LZO_COMPRESS=y CONFIG_LZO_DECOMPRESS=y CONFIG_LZ4_DECOMPRESS=y +CONFIG_ZSTD_COMPRESS=m CONFIG_ZSTD_DECOMPRESS=y CONFIG_XZ_DEC=y CONFIG_XZ_DEC_X86=y @@ -6728,7 +6772,7 @@ 
CONFIG_DECOMPRESS_LZO=y CONFIG_DECOMPRESS_LZ4=y CONFIG_DECOMPRESS_ZSTD=y CONFIG_GENERIC_ALLOCATOR=y -CONFIG_REED_SOLOMON=m +CONFIG_REED_SOLOMON=y CONFIG_REED_SOLOMON_ENC8=y CONFIG_REED_SOLOMON_DEC8=y CONFIG_TEXTSEARCH=y @@ -6921,7 +6965,7 @@ CONFIG_KASAN_GENERIC=y # CONFIG_KASAN_OUTLINE is not set CONFIG_KASAN_INLINE=y CONFIG_KASAN_STACK=1 -# CONFIG_KASAN_VMALLOC is not set +CONFIG_KASAN_VMALLOC=y # CONFIG_TEST_KASAN_MODULE is not set CONFIG_HAVE_ARCH_KFENCE=y CONFIG_KFENCE=y @@ -6936,8 +6980,8 @@ CONFIG_DEBUG_SHIRQ=y # # Debug Oops, Lockups and Hangs # -# CONFIG_PANIC_ON_OOPS is not set -CONFIG_PANIC_ON_OOPS_VALUE=0 +CONFIG_PANIC_ON_OOPS=y +CONFIG_PANIC_ON_OOPS_VALUE=1 CONFIG_PANIC_TIMEOUT=1 CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y @@ -7195,6 +7239,3 @@ CONFIG_TEST_LIVEPATCH=m # CONFIG_HYPERV_TESTING is not set # end of Kernel Testing and Coverage # end of Kernel hacking -CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y -CONFIG_CK_KABI_RESERVE=y -CONFIG_AUXILIARY_BUS=y diff --git a/arch/x86/configs/anolis_defconfig b/arch/x86/configs/anolis_defconfig index 3efd11fd435f4c642a519fb34b71d5211f01f455..e22cfef53f610af772f6519b80e1a0521c7f99a5 100644 --- a/arch/x86/configs/anolis_defconfig +++ b/arch/x86/configs/anolis_defconfig @@ -2,9 +2,9 @@ # Automatically generated file; DO NOT EDIT. # Linux/x86 5.10.134 Kernel Configuration # -CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.4.1 20200928 (Anolis 8.4.1-1.0.1)" +CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.5.0 20210514 (Anolis 8.5.0-10.0.3)" CONFIG_CC_IS_GCC=y -CONFIG_GCC_VERSION=80401 +CONFIG_GCC_VERSION=80500 CONFIG_LD_VERSION=230000000 CONFIG_CLANG_VERSION=0 CONFIG_LLD_VERSION=0 @@ -70,7 +70,6 @@ CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y CONFIG_GENERIC_IRQ_RESERVATION_MODE=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y -CONFIG_DEVICE_MSI=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set # end of IRQ subsystem @@ -138,6 +137,7 @@ CONFIG_RCU_NOCB_CPU=y CONFIG_BUILD_BIN2C=y CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=21 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 @@ -179,6 +179,7 @@ CONFIG_CGROUP_DEVICE=y CONFIG_SCHED_SLI=y CONFIG_RICH_CONTAINER=y # CONFIG_RICH_CONTAINER_CG_SWITCH is not set +# CONFIG_MAX_PID_PER_NS is not set CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y @@ -190,7 +191,6 @@ CONFIG_TIME_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y -# CONFIG_MAX_PID_PER_NS is not set CONFIG_NET_NS=y CONFIG_CHECKPOINT_RESTORE=y CONFIG_SCHED_AUTOGROUP=y @@ -357,6 +357,7 @@ CONFIG_PARAVIRT_TIME_ACCOUNTING=y CONFIG_PARAVIRT_CLOCK=y # CONFIG_JAILHOUSE_GUEST is not set # CONFIG_ACRN_GUEST is not set +CONFIG_INTEL_TDX_GUEST=y # CONFIG_MK8 is not set # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set @@ -418,16 +419,17 @@ CONFIG_I8K=m CONFIG_MICROCODE=y CONFIG_MICROCODE_INTEL=y CONFIG_MICROCODE_AMD=y +# CONFIG_MICROCODE_HYGON is not set # CONFIG_MICROCODE_OLD_INTERFACE is not set CONFIG_X86_MSR=y CONFIG_X86_CPUID=y # CONFIG_X86_5LEVEL is not set CONFIG_X86_DIRECT_GBPAGES=y -# CONFIG_X86_CPA_STATISTICS is not set +CONFIG_X86_CPA_STATISTICS=y +CONFIG_X86_MEM_ENCRYPT=y CONFIG_AMD_MEM_ENCRYPT=y # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set CONFIG_NUMA=y -CONFIG_NUMA_AWARE_SPINLOCKS=y CONFIG_AMD_NUMA=y CONFIG_X86_64_ACPI_NUMA=y CONFIG_NUMA_EMU=y @@ -597,6 +599,7 @@ CONFIG_ACPI_ADXL=y CONFIG_ACPI_PCC=y CONFIG_PMIC_OPREGION=y CONFIG_X86_PM_TIMER=y +CONFIG_ACPI_PRMT=y CONFIG_SFI=y # @@ -856,6 +859,7 @@ CONFIG_HAVE_STATIC_CALL=y CONFIG_HAVE_STATIC_CALL_INLINE=y CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y 
CONFIG_DYNAMIC_SIGFRAME=y +CONFIG_HAVE_ARCH_NODE_DEV_GROUP=y # # GCOV-based kernel profiling @@ -972,6 +976,8 @@ CONFIG_QUEUED_RWLOCKS=y CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y +CONFIG_CK_KABI_RESERVE=y +CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y CONFIG_FREEZER=y # @@ -1122,6 +1128,7 @@ CONFIG_SMC=m CONFIG_SMC_DIAG=m CONFIG_XDP_SOCKETS=y CONFIG_XDP_SOCKETS_DIAG=m +CONFIG_VTOA=m CONFIG_HOOKERS=m CONFIG_INET=y CONFIG_IP_MULTICAST=y @@ -1158,7 +1165,6 @@ CONFIG_INET_DIAG=m CONFIG_INET_TCP_DIAG=m CONFIG_INET_UDP_DIAG=m CONFIG_INET_RAW_DIAG=m -CONFIG_INET_DCCP_DIAG=m # CONFIG_INET_DIAG_DESTROY is not set CONFIG_TCP_CONG_ADVANCED=y CONFIG_TCP_CONG_BIC=m @@ -1280,6 +1286,7 @@ CONFIG_NF_TABLES_INET=y CONFIG_NF_TABLES_NETDEV=y CONFIG_NFT_NUMGEN=m CONFIG_NFT_CT=m +CONFIG_NFT_FLOW_OFFLOAD=m CONFIG_NFT_COUNTER=m CONFIG_NFT_CONNLIMIT=m CONFIG_NFT_LOG=m @@ -1306,11 +1313,8 @@ CONFIG_NF_DUP_NETDEV=m CONFIG_NFT_DUP_NETDEV=m CONFIG_NFT_FWD_NETDEV=m CONFIG_NFT_FIB_NETDEV=m -CONFIG_NF_FLOW_TABLE=m CONFIG_NF_FLOW_TABLE_INET=m -CONFIG_NF_FLOW_TABLE_IPV4=m -CONFIG_NF_FLOW_TABLE_IPV6=m -CONFIG_NFT_FLOW_OFFLOAD=m +CONFIG_NF_FLOW_TABLE=m CONFIG_NETFILTER_XTABLES=y # @@ -1480,6 +1484,7 @@ CONFIG_NFT_REJECT_IPV4=m CONFIG_NFT_DUP_IPV4=m CONFIG_NFT_FIB_IPV4=m CONFIG_NF_TABLES_ARP=y +CONFIG_NF_FLOW_TABLE_IPV4=m CONFIG_NF_DUP_IPV4=m CONFIG_NF_LOG_ARP=m CONFIG_NF_LOG_IPV4=m @@ -1519,6 +1524,7 @@ CONFIG_NF_TABLES_IPV6=y CONFIG_NFT_REJECT_IPV6=m CONFIG_NFT_DUP_IPV6=m CONFIG_NFT_FIB_IPV6=m +CONFIG_NF_FLOW_TABLE_IPV6=m CONFIG_NF_DUP_IPV6=m CONFIG_NF_REJECT_IPV6=m CONFIG_NF_LOG_IPV6=m @@ -1574,8 +1580,23 @@ CONFIG_BRIDGE_EBT_LOG=m CONFIG_BRIDGE_EBT_NFLOG=m # CONFIG_BPFILTER is not set CONFIG_IP_DCCP=m +CONFIG_INET_DCCP_DIAG=m + +# +# DCCP CCIDs Configuration +# +# CONFIG_IP_DCCP_CCID2_DEBUG is not set CONFIG_IP_DCCP_CCID3=y +# CONFIG_IP_DCCP_CCID3_DEBUG is not set CONFIG_IP_DCCP_TFRC_LIB=y +# end of DCCP CCIDs Configuration + +# +# DCCP Kernel Hacking +# +# CONFIG_IP_DCCP_DEBUG is not set +# end of DCCP Kernel Hacking + CONFIG_IP_SCTP=m # CONFIG_SCTP_DBG_OBJCNT is not set # CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5 is not set @@ -1669,17 +1690,16 @@ CONFIG_NET_SCH_PLUG=m CONFIG_NET_SCH_DEFAULT=y # CONFIG_DEFAULT_FQ is not set # CONFIG_DEFAULT_CODEL is not set -# CONFIG_DEFAULT_FQ_CODEL is not set +CONFIG_DEFAULT_FQ_CODEL=y # CONFIG_DEFAULT_SFQ is not set -CONFIG_DEFAULT_PFIFO_FAST=y -CONFIG_DEFAULT_NET_SCH="pfifo_fast" +# CONFIG_DEFAULT_PFIFO_FAST is not set +CONFIG_DEFAULT_NET_SCH="fq_codel" # # Classification # CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=m -CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m @@ -1718,11 +1738,11 @@ CONFIG_NET_ACT_CSUM=m CONFIG_NET_ACT_VLAN=m CONFIG_NET_ACT_BPF=m # CONFIG_NET_ACT_CONNMARK is not set -CONFIG_NET_ACT_CT=m # CONFIG_NET_ACT_CTINFO is not set CONFIG_NET_ACT_SKBMOD=m # CONFIG_NET_ACT_IFE is not set CONFIG_NET_ACT_TUNNEL_KEY=m +CONFIG_NET_ACT_CT=m # CONFIG_NET_ACT_GATE is not set CONFIG_NET_TC_SKB_EXT=y CONFIG_NET_SCH_FIFO=y @@ -2022,6 +2042,7 @@ CONFIG_YENTA_TOSHIBA=y # # Generic Driver Options # +CONFIG_AUXILIARY_BUS=y # CONFIG_UEVENT_HELPER is not set CONFIG_DEVTMPFS=y CONFIG_DEVTMPFS_MOUNT=y @@ -2200,8 +2221,8 @@ CONFIG_CDROM_PKTCDVD_BUFFERS=8 CONFIG_XEN_BLKDEV_FRONTEND=m CONFIG_VIRTIO_BLK=y CONFIG_BLK_DEV_RBD=m -CONFIG_BLK_DEV_UBLK=m # CONFIG_BLK_DEV_RSXX is not set +CONFIG_BLK_DEV_UBLK=m # # NVME Support @@ -2756,6 +2777,7 @@ CONFIG_MLX5_EN_RXNFC=y CONFIG_MLX5_MPFS=y 
CONFIG_MLX5_ESWITCH=y CONFIG_MLX5_CLS_ACT=y +CONFIG_MLX5_TC_CT=y CONFIG_MLX5_CORE_EN_DCB=y # CONFIG_MLX5_CORE_IPOIB is not set # CONFIG_MLX5_FPGA_IPSEC is not set @@ -3553,6 +3575,7 @@ CONFIG_TCG_INFINEON=m # CONFIG_TCG_XEN is not set CONFIG_TCG_CRB=y # CONFIG_TCG_VTPM_PROXY is not set +CONFIG_TCG_HYGON=m CONFIG_TCG_TIS_ST33ZP24=m CONFIG_TCG_TIS_ST33ZP24_I2C=m CONFIG_TELCLOCK=m @@ -4159,9 +4182,6 @@ CONFIG_MFD_INTEL_LPSS_ACPI=m CONFIG_MFD_INTEL_LPSS_PCI=m # CONFIG_MFD_INTEL_PMC_BXT is not set CONFIG_MFD_INTEL_PMT=m -CONFIG_INTEL_PMT_CLASS=m -CONFIG_INTEL_PMT_CRASHLOG=m -CONFIG_INTEL_PMT_TELEMETRY=m # CONFIG_MFD_IQS62X is not set # CONFIG_MFD_JANZ_CMODIO is not set # CONFIG_MFD_KEMPLD is not set @@ -4399,6 +4419,7 @@ CONFIG_DRM_CIRRUS_QEMU=m # CONFIG_DRM_GM12U320 is not set # CONFIG_DRM_XEN is not set # CONFIG_DRM_VBOXVIDEO is not set +# CONFIG_DRM_VERISILICON is not set # CONFIG_DRM_LEGACY is not set CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y @@ -5238,7 +5259,10 @@ CONFIG_VFIO_MDEV=m # CONFIG_VFIO_MDEV_IDXD is not set CONFIG_IRQ_BYPASS_MANAGER=m CONFIG_VIRT_DRIVERS=y +# CONFIG_VBOXGUEST is not set +# CONFIG_NITRO_ENCLAVES is not set CONFIG_EFI_SECRET=m +CONFIG_TDX_GUEST_DRIVER=m CONFIG_VIRTIO=y CONFIG_VIRTIO_MENU=y CONFIG_VIRTIO_PCI=y @@ -5381,6 +5405,9 @@ CONFIG_INTEL_SPEED_SELECT_INTERFACE=m CONFIG_INTEL_TURBO_MAX_3=y # CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set CONFIG_INTEL_PMC_CORE=m +CONFIG_INTEL_PMT_CLASS=m +CONFIG_INTEL_PMT_TELEMETRY=m +CONFIG_INTEL_PMT_CRASHLOG=m # CONFIG_INTEL_PUNIT_IPC is not set # CONFIG_INTEL_SCU_PCI is not set # CONFIG_INTEL_SCU_PLATFORM is not set @@ -5495,6 +5522,11 @@ CONFIG_HYPERV_IOMMU=y # # CONFIG_XILINX_VCU is not set # end of Xilinx SoC drivers + +# +# prefetch tuning drivers +# +# end of prefetch tuning drivers # end of SOC (System On Chip) specific Drivers # CONFIG_PM_DEVFREQ is not set @@ -5869,6 +5901,7 @@ CONFIG_PWM_LPSS=m CONFIG_PWM_LPSS_PCI=m CONFIG_PWM_LPSS_PLATFORM=m # CONFIG_PWM_PCA9685 is not set +# CONFIG_PWM_LIGHT is not set # # IRQ chip support @@ -6444,7 +6477,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECC=m CONFIG_CRYPTO_ECDH=m # CONFIG_CRYPTO_ECRDSA is not set -CONFIG_CRYPTO_ENGINE=m CONFIG_CRYPTO_SM2=y # CONFIG_CRYPTO_CURVE25519 is not set CONFIG_CRYPTO_CURVE25519_X86=m @@ -6607,6 +6639,9 @@ CONFIG_CRYPTO_DEV_CCP_DD=m CONFIG_CRYPTO_DEV_SP_CCP=y CONFIG_CRYPTO_DEV_CCP_CRYPTO=m CONFIG_CRYPTO_DEV_SP_PSP=y +CONFIG_HYGON_PSP2CPU_CMD=y +CONFIG_TDM_DEV_HYGON=y +CONFIG_TDM_KERNEL_GUARD=m # CONFIG_CRYPTO_DEV_CCP_DEBUGFS is not set CONFIG_CRYPTO_DEV_QAT=m CONFIG_CRYPTO_DEV_QAT_DH895xCC=m @@ -6687,7 +6722,6 @@ CONFIG_CRYPTO_LIB_POLY1305_GENERIC=m CONFIG_CRYPTO_LIB_POLY1305=m CONFIG_CRYPTO_LIB_CHACHA20POLY1305=m CONFIG_CRYPTO_LIB_SHA256=y -CONFIG_CRYPTO_LIB_SM4=y # end of Crypto library routines CONFIG_LIB_MEMNEQ=y @@ -6713,6 +6747,7 @@ CONFIG_ZLIB_DEFLATE=y CONFIG_LZO_COMPRESS=y CONFIG_LZO_DECOMPRESS=y CONFIG_LZ4_DECOMPRESS=y +CONFIG_ZSTD_COMPRESS=m CONFIG_ZSTD_DECOMPRESS=y CONFIG_XZ_DEC=y CONFIG_XZ_DEC_X86=y @@ -6723,7 +6758,6 @@ CONFIG_XZ_DEC_ARMTHUMB=y CONFIG_XZ_DEC_SPARC=y CONFIG_XZ_DEC_BCJ=y # CONFIG_XZ_DEC_TEST is not set -CONFIG_ZSTD_COMPRESS=m CONFIG_DECOMPRESS_GZIP=y CONFIG_DECOMPRESS_BZIP2=y CONFIG_DECOMPRESS_LZMA=y @@ -6732,7 +6766,7 @@ CONFIG_DECOMPRESS_LZO=y CONFIG_DECOMPRESS_LZ4=y CONFIG_DECOMPRESS_ZSTD=y CONFIG_GENERIC_ALLOCATOR=y -CONFIG_REED_SOLOMON=m +CONFIG_REED_SOLOMON=y CONFIG_REED_SOLOMON_ENC8=y CONFIG_REED_SOLOMON_DEC8=y CONFIG_TEXTSEARCH=y @@ -6821,7 +6855,6 @@ CONFIG_DEBUG_INFO_BTF=y # CONFIG_GDB_SCRIPTS is not set 
CONFIG_ENABLE_MUST_CHECK=y CONFIG_FRAME_WARN=2048 -CONFIG_FRAME_VECTOR=y CONFIG_STRIP_ASM_SYMS=y # CONFIG_READABLE_ASM is not set # CONFIG_HEADERS_INSTALL is not set @@ -7147,6 +7180,3 @@ CONFIG_TEST_LIVEPATCH=m # CONFIG_HYPERV_TESTING is not set # end of Kernel Testing and Coverage # end of Kernel hacking -CONFIG_CK_KABI_SIZE_ALIGN_CHECKS=y -CONFIG_CK_KABI_RESERVE=y -CONFIG_AUXILIARY_BUS=y diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h index 455066a06f607eb5ba54f5fb68097f0ec289e7d5..f53447965d39e3de273d4da43c1e52e06dc2f710 100644 --- a/arch/x86/include/asm/amd_nb.h +++ b/arch/x86/include/asm/amd_nb.h @@ -84,6 +84,10 @@ u16 amd_nb_num(void); bool amd_nb_has_feature(unsigned int feature); struct amd_northbridge *node_to_amd_nb(int node); +bool hygon_f18h_m4h(void); +u16 hygon_nb_num(void); +int get_df_id(struct pci_dev *misc, u8 *id); + static inline u16 amd_pci_dev_to_node_id(struct pci_dev *pdev) { struct pci_dev *misc; @@ -121,6 +125,10 @@ static inline bool amd_gart_present(void) #define node_to_amd_nb(x) NULL #define amd_gart_present(x) false +#define hygon_f18h_m4h false +#define hygon_nb_num(x) 0 +#define get_df_id(x, y) NULL + #endif diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index ddc166dd5b2ea86f078f135e2ec813527742f64f..ae4b792cc6ab735202273839ca0de364266c3dc9 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -328,6 +328,9 @@ /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */ #define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */ +#define X86_FEATURE_FZRM (12*32+10) /* "" Fast zero-length REP MOVSB */ +#define X86_FEATURE_FSRS (12*32+11) /* "" Fast short REP STOSB */ +#define X86_FEATURE_FSRC (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */ /* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */ #define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */ diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index bc9758ef292ef1a10fc6542f391bbdb9b82db058..ea690f793e6d1f6b3ccbfb032cdcf091ed03f846 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -81,8 +81,6 @@ extern unsigned long efi_fw_vendor, efi_config_table; kernel_fpu_end(); \ }) -#define arch_efi_call_virt(p, f, args...) p->f(args) - #else /* !CONFIG_X86_32 */ #define EFI_LOADER_SIGNATURE "EL64" @@ -112,6 +110,7 @@ struct efi_scratch { efi_switch_mm(&efi_mm); \ }) +#undef arch_efi_call_virt #define arch_efi_call_virt(p, f, args...) 
\ efi_call((void *)p->f, args) \ diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h index d54bce0bb78c0127436b0624d4da34ac3dc1cbc7..a6b598045634cfec18f3e2435b30c618afe08a3c 100644 --- a/arch/x86/include/asm/microcode.h +++ b/arch/x86/include/asm/microcode.h @@ -78,6 +78,12 @@ static inline struct microcode_ops * __init init_amd_microcode(void) static inline void __exit exit_amd_microcode(void) {} #endif +#ifdef CONFIG_MICROCODE_HYGON +extern const struct microcode_ops * __init init_hygon_microcode(void); +#else +#define init_hygon_microcode() NULL +#endif + #define MAX_UCODE_COUNT 128 #define QCHAR(a, b, c, d) ((a) + ((b) << 8) + ((c) << 16) + ((d) << 24)) @@ -87,6 +93,9 @@ static inline void __exit exit_amd_microcode(void) {} #define CPUID_AMD1 QCHAR('A', 'u', 't', 'h') #define CPUID_AMD2 QCHAR('e', 'n', 't', 'i') #define CPUID_AMD3 QCHAR('c', 'A', 'M', 'D') +#define CPUID_HYGON1 QCHAR('H', 'y', 'g', 'o') +#define CPUID_HYGON2 QCHAR('n', 'G', 'e', 'n') +#define CPUID_HYGON3 QCHAR('u', 'i', 'n', 'e') #define CPUID_IS(a, b, c, ebx, ecx, edx) \ (!((ebx ^ (a))|(edx ^ (b))|(ecx ^ (c)))) @@ -113,6 +122,9 @@ static inline int x86_cpuid_vendor(void) if (CPUID_IS(CPUID_AMD1, CPUID_AMD2, CPUID_AMD3, ebx, ecx, edx)) return X86_VENDOR_AMD; + if (CPUID_IS(CPUID_HYGON1, CPUID_HYGON2, CPUID_HYGON3, ebx, ecx, edx)) + return X86_VENDOR_HYGON; + return X86_VENDOR_UNKNOWN; } diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h index e53f26228fbbd91e0511fc34add1e6290ea4b84c..2631e01f6e0fb274f78a260a51cec358118c43d6 100644 --- a/arch/x86/include/asm/shared/tdx.h +++ b/arch/x86/include/asm/shared/tdx.h @@ -7,9 +7,6 @@ #define TDX_HYPERCALL_STANDARD 0 -#define TDX_HCALL_HAS_OUTPUT BIT(0) -#define TDX_HCALL_ISSUE_STI BIT(1) - #define TDX_CPUID_LEAF_ID 0x21 #define TDX_IDENT "IntelTDX " @@ -22,16 +19,23 @@ * This is a software only structure and not part of the TDX module/VMM ABI. 
*/ struct tdx_hypercall_args { + u64 r8; + u64 r9; u64 r10; u64 r11; u64 r12; u64 r13; u64 r14; u64 r15; + u64 rdi; + u64 rsi; + u64 rbx; + u64 rdx; }; /* Used to request services from the VMM */ -u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags); +u64 __tdx_hypercall(struct tdx_hypercall_args *args); +u64 __tdx_hypercall_ret(struct tdx_hypercall_args *args); /* Called from __tdx_hypercall() for unrecoverable failure */ void __tdx_hypercall_failed(void); diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c index 117932965ca4e2a9ee2991c618e516a4eeba9e5a..28912610c4872e0d433ad7d557de3f439d7eac95 100644 --- a/arch/x86/kernel/acpi/cstate.c +++ b/arch/x86/kernel/acpi/cstate.c @@ -206,6 +206,7 @@ static int __init ffh_cstate_init(void) if (c->x86_vendor != X86_VENDOR_INTEL && c->x86_vendor != X86_VENDOR_AMD && + c->x86_vendor != X86_VENDOR_HYGON && c->x86_vendor != X86_VENDOR_CENTAUR && c->x86_vendor != X86_VENDOR_ZHAOXIN) return -1; diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c index 1b576d0e3e1fe5ff2b2b1010699ce3a83182e0d6..0f68ce410eccc8bb109141974d7b5a96aee3d289 100644 --- a/arch/x86/kernel/amd_nb.c +++ b/arch/x86/kernel/amd_nb.c @@ -31,10 +31,16 @@ #define PCI_DEVICE_ID_AMD_19H_M40H_ROOT 0x14b5 #define PCI_DEVICE_ID_AMD_19H_M40H_DF_F4 0x167d +#define PCI_DEVICE_ID_HYGON_18H_M05H_ROOT 0x14a0 +#define PCI_DEVICE_ID_HYGON_18H_M04H_DF_F1 0x1491 +#define PCI_DEVICE_ID_HYGON_18H_M05H_DF_F1 0x14b1 +#define PCI_DEVICE_ID_HYGON_18H_M05H_DF_F4 0x14b4 + /* Protect the PCI config register pairs used for SMN and DF indirect access. */ static DEFINE_MUTEX(smn_mutex); static u32 *flush_words; +static u16 nb_num; static const struct pci_device_id amd_root_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_17H_ROOT) }, @@ -91,16 +97,22 @@ static const struct pci_device_id amd_nb_link_ids[] = { static const struct pci_device_id hygon_root_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_AMD_17H_ROOT) }, + { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_AMD_17H_M30H_ROOT) }, + { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_HYGON_18H_M05H_ROOT) }, {} }; static const struct pci_device_id hygon_nb_misc_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_AMD_17H_DF_F3) }, + { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_AMD_17H_M30H_DF_F3) }, + { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_HYGON_18H_M05H_DF_F3) }, {} }; static const struct pci_device_id hygon_nb_link_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_AMD_17H_DF_F4) }, + { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_AMD_17H_M30H_DF_F4) }, + { PCI_DEVICE(PCI_VENDOR_ID_HYGON, PCI_DEVICE_ID_HYGON_18H_M05H_DF_F4) }, {} }; @@ -235,6 +247,209 @@ int amd_df_indirect_read(u16 node, u8 func, u16 reg, u8 instance_id, u32 *lo) } EXPORT_SYMBOL_GPL(amd_df_indirect_read); +bool hygon_f18h_m4h(void) +{ + if (boot_cpu_data.x86_vendor != X86_VENDOR_HYGON) + return false; + + if (boot_cpu_data.x86 == 0x18 && + boot_cpu_data.x86_model >= 0x4 && + boot_cpu_data.x86_model <= 0xf) + return true; + + return false; +} +EXPORT_SYMBOL_GPL(hygon_f18h_m4h); + +u16 hygon_nb_num(void) +{ + return nb_num; +} +EXPORT_SYMBOL_GPL(hygon_nb_num); + +static int get_df1_register(struct pci_dev *misc, int offset, u32 *value) +{ + struct pci_dev *df_f1 = NULL; + u32 device; + int err; + + switch (boot_cpu_data.x86_model) { + case 0x4: + device = PCI_DEVICE_ID_HYGON_18H_M04H_DF_F1; + break; + case 0x5: + if (misc->device == PCI_DEVICE_ID_HYGON_18H_M05H_DF_F3) + device = 
PCI_DEVICE_ID_HYGON_18H_M05H_DF_F1; + else + device = PCI_DEVICE_ID_HYGON_18H_M04H_DF_F1; + break; + case 0x6: + device = PCI_DEVICE_ID_HYGON_18H_M05H_DF_F1; + break; + default: + return -ENODEV; + } + + while ((df_f1 = pci_get_device(misc->vendor, device, df_f1))) + if (pci_domain_nr(df_f1->bus) == pci_domain_nr(misc->bus) && + df_f1->bus->number == misc->bus->number && + PCI_SLOT(df_f1->devfn) == PCI_SLOT(misc->devfn)) + break; + + if (!df_f1) { + pr_warn("Error getting DF F1 device.\n"); + return -ENODEV; + } + + err = pci_read_config_dword(df_f1, offset, value); + if (err) + pr_warn("Error reading DF F1 register.\n"); + + return err; +} + +int get_df_id(struct pci_dev *misc, u8 *id) +{ + u32 value; + int ret; + + /* F1x200[23:20]: DF ID */ + ret = get_df1_register(misc, 0x200, &value); + *id = (value >> 20) & 0xf; + + return ret; +} +EXPORT_SYMBOL_GPL(get_df_id); + +static u8 get_socket_num(struct pci_dev *misc) +{ + u32 value; + int ret; + + /* F1x200[7:0]: Which socket is present. */ + ret = get_df1_register(misc, 0x200, &value); + + return ret ? 0 : hweight8(value & 0xff); +} + +static int northbridge_init_f18h_m4h(const struct pci_device_id *root_ids, + const struct pci_device_id *misc_ids, + const struct pci_device_id *link_ids) +{ + struct pci_dev *root, *misc, *link; + struct pci_dev *root_first = NULL; + struct amd_northbridge *nb; + u16 roots_per_socket = 0; + u16 miscs_per_socket = 0; + u16 socket_num = 0; + u16 root_count = 0; + u16 misc_count = 0; + int err = -ENODEV; + u8 i, j, m, n; + u8 id; + + pr_info("Hygon Fam%xh Model%xh NB driver.\n", + boot_cpu_data.x86, boot_cpu_data.x86_model); + + misc = next_northbridge(NULL, misc_ids); + if (misc != NULL) { + socket_num = get_socket_num(misc); + pr_info("Socket number: %d\n", socket_num); + if (!socket_num) { + err = -ENODEV; + goto ret; + } + } else { + err = -ENODEV; + goto ret; + } + + misc = NULL; + while ((misc = next_northbridge(misc, misc_ids)) != NULL) + misc_count++; + + root = NULL; + while ((root = next_northbridge(root, root_ids)) != NULL) + root_count++; + + if (!root_count || !misc_count) { + err = -ENODEV; + goto ret; + } + + /* + * There should be _exactly_ N roots for each DF/SMN + * interface, and M DF/SMN interfaces in one socket. + */ + roots_per_socket = root_count / socket_num; + miscs_per_socket = misc_count / socket_num; + + if (!roots_per_socket || !miscs_per_socket) { + err = -ENODEV; + goto ret; + } + + nb = kcalloc(misc_count, sizeof(struct amd_northbridge), GFP_KERNEL); + if (!nb) { + err = -ENOMEM; + goto ret; + } + + amd_northbridges.nb = nb; + amd_northbridges.num = misc_count; + + link = misc = root = NULL; + j = m = n = 0; + for (i = 0; i < amd_northbridges.num; i++) { + misc = next_northbridge(misc, misc_ids); + link = next_northbridge(link, link_ids); + + /* Only save the first PCI root device for each socket. */ + if (!(i % miscs_per_socket)) { + root_first = next_northbridge(root, root_ids); + root = root_first; + j = 1; + } + + if (get_df_id(misc, &id)) { + err = -ENODEV; + goto err; + } + pr_info("DF ID: %d\n", id); + + if (id < 4) { + /* Add the devices with id<4 from the tail. */ + node_to_amd_nb(misc_count - m - 1)->misc = misc; + node_to_amd_nb(misc_count - m - 1)->link = link; + node_to_amd_nb(misc_count - m - 1)->root = root_first; + m++; + } else { + node_to_amd_nb(n)->misc = misc; + node_to_amd_nb(n)->link = link; + node_to_amd_nb(n)->root = root_first; + n++; + } + + /* Skip the redundant PCI root devices per socket. 
*/ + while (j < roots_per_socket) { + root = next_northbridge(root, root_ids); + j++; + } + } + nb_num = n; + + return 0; + +err: + kfree(nb); + amd_northbridges.nb = NULL; + +ret: + pr_err("Hygon Fam%xh Model%xh northbridge init failed(%d)!\n", + boot_cpu_data.x86, boot_cpu_data.x86_model, err); + return err; +} + int amd_cache_northbridges(void) { const struct pci_device_id *misc_ids = amd_nb_misc_ids; @@ -254,6 +469,11 @@ int amd_cache_northbridges(void) root_ids = hygon_root_ids; misc_ids = hygon_nb_misc_ids; link_ids = hygon_nb_link_ids; + + if (boot_cpu_data.x86_model >= 0x4 && + boot_cpu_data.x86_model <= 0xf) + return northbridge_init_f18h_m4h(root_ids, + misc_ids, link_ids); } misc = NULL; diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 4e3cf2f856f59bc7ce3de275b9bde0609c443494..6cb20150956975666d4494a27e34e7fd47181b4e 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -86,12 +86,18 @@ static void __used common(void) OFFSET(TDX_MODULE_r11, tdx_module_output, r11); BLANK(); + OFFSET(TDX_HYPERCALL_r8, tdx_hypercall_args, r8); + OFFSET(TDX_HYPERCALL_r9, tdx_hypercall_args, r9); OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10); OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11); OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12); OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13); OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14); OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15); + OFFSET(TDX_HYPERCALL_rdi, tdx_hypercall_args, rdi); + OFFSET(TDX_HYPERCALL_rsi, tdx_hypercall_args, rsi); + OFFSET(TDX_HYPERCALL_rbx, tdx_hypercall_args, rbx); + OFFSET(TDX_HYPERCALL_rdx, tdx_hypercall_args, rdx); BLANK(); OFFSET(BP_scratch, boot_params, scratch); diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 19cd0f6e89b2f31d2753fa6c7c723c82f8639222..2e624e4a10350e32474478b62f5484e76329654f 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1885,6 +1885,8 @@ static int ib_prctl_set(struct task_struct *task, unsigned long ctrl) if (ctrl == PR_SPEC_FORCE_DISABLE) task_set_spec_ib_force_disable(task); task_update_spec_tif(task); + if (task == current) + indirect_branch_prediction_barrier(); break; default: return -ERANGE; diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c index b458b0fd98bf6cb967a87bcf5e656415de3e2893..f33d55338da48efb35bff42d90e5f70603827d44 100644 --- a/arch/x86/kernel/cpu/cacheinfo.c +++ b/arch/x86/kernel/cpu/cacheinfo.c @@ -693,11 +693,30 @@ void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c, int cpu) if (!cpuid_edx(0x80000006)) return; - /* - * LLC is at the core complex level. - * Core complex ID is ApicId[3] for these processors. - */ - per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; + if (c->x86_model < 0x5) { + /* + * LLC is at the core complex level. + * Core complex ID is ApicId[3] for these processors. + */ + per_cpu(cpu_llc_id, cpu) = c->apicid >> 3; + } else { + /* + * LLC ID is calculated from the number of threads + * sharing the cache. 
+ */ + u32 eax, ebx, ecx, edx, num_sharing_cache = 0; + u32 llc_index = find_num_cache_leaves(c) - 1; + + cpuid_count(0x8000001d, llc_index, &eax, &ebx, &ecx, &edx); + if (eax) + num_sharing_cache = ((eax >> 14) & 0xfff) + 1; + + if (num_sharing_cache) { + int bits = get_count_order(num_sharing_cache); + + per_cpu(cpu_llc_id, cpu) = c->apicid >> bits; + } + } } void init_amd_cacheinfo(struct cpuinfo_x86 *c) diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c index 774ca6bfda9f4c463d4368f0ba8063cab05c4193..f752adaf80546557230fa041667ec59294d26403 100644 --- a/arch/x86/kernel/cpu/hygon.c +++ b/arch/x86/kernel/cpu/hygon.c @@ -81,16 +81,29 @@ static void hygon_get_topology(struct cpuinfo_x86 *c) if (smp_num_siblings > 1) c->x86_max_cores /= smp_num_siblings; - /* - * In case leaf B is available, use it to derive - * topology information. - */ - err = detect_extended_topology(c); - if (!err) - c->x86_coreid_bits = get_count_order(c->x86_max_cores); - - /* Socket ID is ApicId[6] for these processors. */ - c->phys_proc_id = c->apicid >> APICID_SOCKET_ID_BIT; + switch (c->x86_model) { + case 0x0 ... 0x3: + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) + break; + /* Socket ID is ApicId[6] for these processors. */ + c->phys_proc_id = c->apicid >> APICID_SOCKET_ID_BIT; + break; + case 0x4: + case 0x5: + case 0x6: + /* + * In case leaf 0xB is available, use it to derive + * topology information. + */ + err = detect_extended_topology(c); + if (!err) + c->x86_coreid_bits = + get_count_order(c->x86_max_cores); + __max_die_per_package = nodes_per_socket; + break; + default: + break; + } cacheinfo_hygon_init_llc_id(c, cpu); } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) { diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c index 09f7c652346a9847fd0095dfb0ff7f7e3e3307d3..673b4204a7a01c58fc62816c44189f1b4412f4ad 100644 --- a/arch/x86/kernel/cpu/mce/amd.c +++ b/arch/x86/kernel/cpu/mce/amd.c @@ -197,10 +197,10 @@ static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks); * A list of the banks enabled on each logical CPU. Controls which respective * descriptors to initialize later in mce_threshold_create_device(). */ -static DEFINE_PER_CPU(unsigned int, bank_map); +static DEFINE_PER_CPU(u64, bank_map); /* Map of banks that have more than MCA_MISC0 available. */ -static DEFINE_PER_CPU(u32, smca_misc_banks_map); +static DEFINE_PER_CPU(u64, smca_misc_banks_map); static void amd_threshold_interrupt(void); static void amd_deferred_error_interrupt(void); @@ -229,7 +229,7 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu) return; if (low & MASK_BLKPTR_LO) - per_cpu(smca_misc_banks_map, cpu) |= BIT(bank); + per_cpu(smca_misc_banks_map, cpu) |= BIT_ULL(bank); } @@ -492,7 +492,7 @@ static u32 smca_get_block_address(unsigned int bank, unsigned int block, if (!block) return MSR_AMD64_SMCA_MCx_MISC(bank); - if (!(per_cpu(smca_misc_banks_map, cpu) & BIT(bank))) + if (!(per_cpu(smca_misc_banks_map, cpu) & BIT_ULL(bank))) return 0; return MSR_AMD64_SMCA_MCx_MISCy(bank, block - 1); @@ -536,7 +536,7 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr, int new; if (!block) - per_cpu(bank_map, cpu) |= (1 << bank); + per_cpu(bank_map, cpu) |= BIT_ULL(bank); memset(&b, 0, sizeof(b)); b.cpu = cpu; @@ -691,13 +691,21 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) u8 cs_mask, cs_id = 0; bool hash_enabled = false; - /* Read D18F0x1B4 (DramOffset), check if base 1 is used. 
*/ - if (amd_df_indirect_read(nid, 0, 0x1B4, umc, &tmp)) + /* Read DramOffset, check if base 1 is used. */ + if (hygon_f18h_m4h() && + amd_df_indirect_read(nid, 0, 0x214, umc, &tmp)) + goto out_err; + else if (amd_df_indirect_read(nid, 0, 0x1B4, umc, &tmp)) goto out_err; /* Remove HiAddrOffset from normalized address, if enabled: */ if (tmp & BIT(0)) { - u64 hi_addr_offset = (tmp & GENMASK_ULL(31, 20)) << 8; + u64 hi_addr_offset; + + if (hygon_f18h_m4h()) + hi_addr_offset = (tmp & GENMASK_ULL(31, 18)) << 8; + else + hi_addr_offset = (tmp & GENMASK_ULL(31, 20)) << 8; if (norm_addr >= hi_addr_offset) { ret_addr -= hi_addr_offset; @@ -716,6 +724,9 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) goto out_err; } + intlv_num_sockets = 0; + if (hygon_f18h_m4h()) + intlv_num_sockets = (tmp >> 2) & 0x3; lgcy_mmio_hole_en = tmp & BIT(1); intlv_num_chan = (tmp >> 4) & 0xF; intlv_addr_sel = (tmp >> 8) & 0x7; @@ -732,7 +743,8 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) if (amd_df_indirect_read(nid, 0, 0x114 + (8 * base), umc, &tmp)) goto out_err; - intlv_num_sockets = (tmp >> 8) & 0x1; + if (!hygon_f18h_m4h()) + intlv_num_sockets = (tmp >> 8) & 0x1; intlv_num_dies = (tmp >> 10) & 0x3; dram_limit_addr = ((tmp & GENMASK_ULL(31, 12)) << 16) | GENMASK_ULL(27, 0); @@ -742,6 +754,10 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) switch (intlv_num_chan) { case 0: intlv_num_chan = 0; break; case 1: intlv_num_chan = 1; break; + case 2: + if (hygon_f18h_m4h()) + intlv_num_chan = 2; + break; case 3: intlv_num_chan = 2; break; case 5: intlv_num_chan = 3; break; case 7: intlv_num_chan = 4; break; @@ -768,8 +784,9 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) /* Add a bit if sockets are interleaved. */ num_intlv_bits += intlv_num_sockets; - /* Assert num_intlv_bits <= 4 */ - if (num_intlv_bits > 4) { + /* Assert num_intlv_bits in the correct range. */ + if ((hygon_f18h_m4h() && num_intlv_bits > 7) || + (!hygon_f18h_m4h() && num_intlv_bits > 4)) { pr_err("%s: Invalid interleave bits %d.\n", __func__, num_intlv_bits); goto out_err; @@ -788,7 +805,10 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) if (amd_df_indirect_read(nid, 0, 0x50, umc, &tmp)) goto out_err; - cs_fabric_id = (tmp >> 8) & 0xFF; + if (hygon_f18h_m4h()) + cs_fabric_id = (tmp >> 8) & 0x7FF; + else + cs_fabric_id = (tmp >> 8) & 0xFF; die_id_bit = 0; /* If interleaved over more than 1 channel: */ @@ -808,8 +828,13 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) /* If interleaved over more than 1 die. */ if (intlv_num_dies) { sock_id_bit = die_id_bit + intlv_num_dies; - die_id_shift = (tmp >> 24) & 0xF; - die_id_mask = (tmp >> 8) & 0xFF; + if (hygon_f18h_m4h()) { + die_id_shift = (tmp >> 12) & 0xF; + die_id_mask = tmp & 0x7FF; + } else { + die_id_shift = (tmp >> 24) & 0xF; + die_id_mask = (tmp >> 8) & 0xFF; + } cs_id |= ((cs_fabric_id & die_id_mask) >> die_id_shift) << die_id_bit; } @@ -817,7 +842,10 @@ int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *sys_addr) /* If interleaved over more than 1 socket. 
*/ if (intlv_num_sockets) { socket_id_shift = (tmp >> 28) & 0xF; - socket_id_mask = (tmp >> 16) & 0xFF; + if (hygon_f18h_m4h()) + socket_id_mask = (tmp >> 16) & 0x7FF; + else + socket_id_mask = (tmp >> 16) & 0xFF; cs_id |= ((cs_fabric_id & socket_id_mask) >> socket_id_shift) << sock_id_bit; } @@ -1041,7 +1069,7 @@ static void amd_threshold_interrupt(void) return; for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank) { - if (!(per_cpu(bank_map, cpu) & (1 << bank))) + if (!(per_cpu(bank_map, cpu) & BIT_ULL(bank))) continue; first_block = bp[bank]->blocks; @@ -1518,7 +1546,7 @@ int mce_threshold_create_device(unsigned int cpu) return -ENOMEM; for (bank = 0; bank < numbanks; ++bank) { - if (!(this_cpu_read(bank_map) & (1 << bank))) + if (!(this_cpu_read(bank_map) & BIT_ULL(bank))) continue; err = threshold_create_bank(bp, cpu, bank); if (err) { diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index 3f6b137ef4e6e564bff9fe4e8fd5ad2913d5de2a..7983c21bdd0e94a5051abbed1b87b5b531322966 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -458,11 +458,14 @@ apply_microcode_early_amd(u32 cpuid_1_eax, void *ucode, size_t size, bool save_p static bool get_builtin_microcode(struct cpio_data *cp, unsigned int family) { #ifdef CONFIG_X86_64 - char fw_name[36] = "amd-ucode/microcode_amd.bin"; + char fw_name[40] = "amd-ucode/microcode_amd.bin"; - if (family >= 0x15) + if (x86_cpuid_vendor() == X86_VENDOR_AMD && family >= 0x15) snprintf(fw_name, sizeof(fw_name), "amd-ucode/microcode_amd_fam%.2xh.bin", family); + else if (x86_cpuid_vendor() == X86_VENDOR_HYGON) + snprintf(fw_name, sizeof(fw_name), + "hygon-ucode/microcode_hygon_fam%.2xh.bin", family); return get_builtin_firmware(cp, fw_name); #else @@ -479,11 +482,18 @@ static void __load_ucode_amd(unsigned int cpuid_1_eax, struct cpio_data *ret) if (IS_ENABLED(CONFIG_X86_32)) { uci = (struct ucode_cpu_info *)__pa_nodebug(ucode_cpu_info); - path = (const char *)__pa_nodebug(ucode_path); + if (x86_cpuid_vendor() == X86_VENDOR_HYGON) + path = (const char *)__pa_nodebug( + "kernel/x86/microcode/HygonGenuine.bin"); + else + path = (const char *)__pa_nodebug(ucode_path); use_pa = true; } else { uci = ucode_cpu_info; - path = ucode_path; + if (x86_cpuid_vendor() == X86_VENDOR_HYGON) + path = "kernel/x86/microcode/HygonGenuine.bin"; + else + path = ucode_path; use_pa = false; } @@ -546,8 +556,14 @@ int __init save_microcode_in_initrd_amd(unsigned int cpuid_1_eax) struct cont_desc desc = { 0 }; enum ucode_state ret; struct cpio_data cp; + const char *path; - cp = find_microcode_in_initrd(ucode_path, false); + if (x86_cpuid_vendor() == X86_VENDOR_HYGON) + path = "kernel/x86/microcode/HygonGenuine.bin"; + else + path = ucode_path; + + cp = find_microcode_in_initrd(path, false); if (!(cp.data && cp.size)) return -EINVAL; @@ -888,7 +904,7 @@ load_microcode_amd(bool save, u8 family, const u8 *data, size_t size) static enum ucode_state request_microcode_amd(int cpu, struct device *device, bool refresh_fw) { - char fw_name[36] = "amd-ucode/microcode_amd.bin"; + char fw_name[40] = "amd-ucode/microcode_amd.bin"; struct cpuinfo_x86 *c = &cpu_data(cpu); bool bsp = c->cpu_index == boot_cpu_data.cpu_index; enum ucode_state ret = UCODE_NFOUND; @@ -898,8 +914,12 @@ static enum ucode_state request_microcode_amd(int cpu, struct device *device, if (!refresh_fw || !bsp) return UCODE_OK; - if (c->x86 >= 0x15) - snprintf(fw_name, sizeof(fw_name), "amd-ucode/microcode_amd_fam%.2xh.bin", c->x86); + if 
(x86_cpuid_vendor() == X86_VENDOR_AMD && c->x86 >= 0x15) + snprintf(fw_name, sizeof(fw_name), + "amd-ucode/microcode_amd_fam%.2xh.bin", c->x86); + else if (x86_cpuid_vendor() == X86_VENDOR_HYGON) + snprintf(fw_name, sizeof(fw_name), + "hygon-ucode/microcode_hygon_fam%.2xh.bin", c->x86); if (request_firmware_direct(&fw, (const char *)fw_name, device)) { pr_debug("failed to load file %s\n", fw_name); @@ -956,6 +976,22 @@ struct microcode_ops * __init init_amd_microcode(void) return µcode_amd_ops; } +#ifdef CONFIG_MICROCODE_HYGON +const struct microcode_ops * __init init_hygon_microcode(void) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + + if (c->x86_vendor != X86_VENDOR_HYGON) + return NULL; + + if (ucode_new_rev) + pr_info_once("microcode updated early to new patch_level=0x%08x\n", + ucode_new_rev); + + return µcode_amd_ops; +} +#endif + void __exit exit_amd_microcode(void) { cleanup(); diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index cba0fce815b998abf4325866a4d44445a3f343a6..ac996f8d54874a611121c0d710052606549947ff 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -41,7 +41,11 @@ #define DRIVER_VERSION "2.2" +#ifdef CONFIG_MICROCODE_HYGON +static const struct microcode_ops *microcode_ops; +#else static struct microcode_ops *microcode_ops; +#endif static bool dis_ucode_ldr = true; bool ucode_rollback = false; int enable_rollback = 0; @@ -150,7 +154,8 @@ static bool __init check_loader_disabled_bsp(void) if (native_cpuid_ecx(1) & BIT(31)) return *res; - if (x86_cpuid_vendor() == X86_VENDOR_AMD) { + if (x86_cpuid_vendor() == X86_VENDOR_AMD || + x86_cpuid_vendor() == X86_VENDOR_HYGON) { if (amd_check_current_patch_level()) return *res; } @@ -183,6 +188,10 @@ void __init load_ucode_bsp(void) intel = false; break; + case X86_VENDOR_HYGON: + intel = false; + break; + default: return; } @@ -223,6 +232,9 @@ void load_ucode_ap(void) if (x86_family(cpuid_1_eax) >= 0x10) load_ucode_amd_ap(cpuid_1_eax); break; + case X86_VENDOR_HYGON: + load_ucode_amd_ap(cpuid_1_eax); + break; default: break; } @@ -242,6 +254,9 @@ static int __init save_microcode_in_initrd(void) if (c->x86 >= 0x10) ret = save_microcode_in_initrd_amd(cpuid_eax(1)); break; + case X86_VENDOR_HYGON: + ret = save_microcode_in_initrd_amd(cpuid_eax(1)); + break; default: break; } @@ -335,6 +350,9 @@ void reload_early_microcode(void) if (family >= 0x10) reload_ucode_amd(); break; + case X86_VENDOR_HYGON: + reload_ucode_amd(); + break; default: break; } @@ -862,6 +880,8 @@ int __init microcode_init(void) microcode_ops = init_intel_microcode(); else if (c->x86_vendor == X86_VENDOR_AMD) microcode_ops = init_amd_microcode(); + else if (c->x86_vendor == X86_VENDOR_HYGON) + microcode_ops = init_hygon_microcode(); else pr_err("no support for this CPU vendor\n"); diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 91288da29599516f1f5c068748afc7836ab2354e..8678864ce7123c35cbb5d0ab09e9ffcd91095e16 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -96,6 +96,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c) unsigned int ht_mask_width, core_plus_mask_width, die_plus_mask_width; unsigned int core_select_mask, core_level_siblings; unsigned int die_select_mask, die_level_siblings; + bool die_level_present = false; int leaf; leaf = detect_extended_topology_leaf(c); @@ -126,6 +127,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c) die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax); } if 
(LEAFB_SUBTYPE(ecx) == DIE_TYPE) { + die_level_present = true; die_level_siblings = LEVEL_MAX_SIBLINGS(ebx); die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax); } @@ -139,8 +141,12 @@ int detect_extended_topology(struct cpuinfo_x86 *c) c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, ht_mask_width) & core_select_mask; - c->cpu_die_id = apic->phys_pkg_id(c->initial_apicid, - core_plus_mask_width) & die_select_mask; + + if (die_level_present) { + c->cpu_die_id = apic->phys_pkg_id(c->initial_apicid, + core_plus_mask_width) & die_select_mask; + } + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, die_plus_mask_width); /* diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 53c2476c665751d9f58b9bf49f23e39b7f986a75..55ddf22ccd0affa407234253751401104238246d 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -511,7 +511,8 @@ void kvm_set_cpu_caps(void) kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD); kvm_cpu_cap_mask(CPUID_7_1_EAX, - F(AVX512_BF16) | F(AVX_VNNI) + F(AVX512_BF16) | F(AVX_VNNI) | + F(FZRM) | F(FSRS) | F(FSRC) ); kvm_cpu_cap_mask(CPUID_D_1_EAX, diff --git a/block/blk-exec.c b/block/blk-exec.c index 85324d53d072f0809c567f98214e1bc552c145c4..3fc1f23355a98fa88fb77ae5dfe0f9df3031892d 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -30,6 +30,16 @@ static void blk_end_sync_rq(struct request *rq, blk_status_t error) complete(waiting); } +bool blk_rq_is_poll(struct request *rq) +{ + if (!rq->mq_hctx) + return false; + if (rq->mq_hctx->type != HCTX_TYPE_POLL) + return false; + return true; +} +EXPORT_SYMBOL_GPL(blk_rq_is_poll); + /** * blk_execute_rq_nowait - insert a request into queue for execution * @q: queue to insert the request in diff --git a/block/ioctl.c b/block/ioctl.c index e7eed7dadb5cf5867b21efc59a125dc9dc43ca87..53bc1f5397122028614e27e24ee5329d9b1e56b8 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -265,13 +265,24 @@ int blkdev_compat_ptr_ioctl(struct block_device *bdev, fmode_t mode, EXPORT_SYMBOL(blkdev_compat_ptr_ioctl); #endif -static int blkdev_pr_register(struct block_device *bdev, +static bool blkdev_pr_allowed(struct block_device *bdev, fmode_t mode) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + /* + * Only allow unprivileged reservations if the file descriptor is open + * for writing. 
+ */ + return mode & FMODE_WRITE; +} + +static int blkdev_pr_register(struct block_device *bdev, fmode_t mode, struct pr_registration __user *arg) { const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops; struct pr_registration reg; - if (!capable(CAP_SYS_ADMIN)) + if (!blkdev_pr_allowed(bdev, mode)) return -EPERM; if (!ops || !ops->pr_register) return -EOPNOTSUPP; @@ -283,13 +294,13 @@ static int blkdev_pr_register(struct block_device *bdev, return ops->pr_register(bdev, reg.old_key, reg.new_key, reg.flags); } -static int blkdev_pr_reserve(struct block_device *bdev, +static int blkdev_pr_reserve(struct block_device *bdev, fmode_t mode, struct pr_reservation __user *arg) { const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops; struct pr_reservation rsv; - if (!capable(CAP_SYS_ADMIN)) + if (!blkdev_pr_allowed(bdev, mode)) return -EPERM; if (!ops || !ops->pr_reserve) return -EOPNOTSUPP; @@ -301,13 +312,13 @@ static int blkdev_pr_reserve(struct block_device *bdev, return ops->pr_reserve(bdev, rsv.key, rsv.type, rsv.flags); } -static int blkdev_pr_release(struct block_device *bdev, +static int blkdev_pr_release(struct block_device *bdev, fmode_t mode, struct pr_reservation __user *arg) { const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops; struct pr_reservation rsv; - if (!capable(CAP_SYS_ADMIN)) + if (!blkdev_pr_allowed(bdev, mode)) return -EPERM; if (!ops || !ops->pr_release) return -EOPNOTSUPP; @@ -319,13 +330,13 @@ static int blkdev_pr_release(struct block_device *bdev, return ops->pr_release(bdev, rsv.key, rsv.type); } -static int blkdev_pr_preempt(struct block_device *bdev, +static int blkdev_pr_preempt(struct block_device *bdev, fmode_t mode, struct pr_preempt __user *arg, bool abort) { const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops; struct pr_preempt p; - if (!capable(CAP_SYS_ADMIN)) + if (!blkdev_pr_allowed(bdev, mode)) return -EPERM; if (!ops || !ops->pr_preempt) return -EOPNOTSUPP; @@ -337,13 +348,13 @@ static int blkdev_pr_preempt(struct block_device *bdev, return ops->pr_preempt(bdev, p.old_key, p.new_key, p.type, abort); } -static int blkdev_pr_clear(struct block_device *bdev, +static int blkdev_pr_clear(struct block_device *bdev, fmode_t mode, struct pr_clear __user *arg) { const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops; struct pr_clear c; - if (!capable(CAP_SYS_ADMIN)) + if (!blkdev_pr_allowed(bdev, mode)) return -EPERM; if (!ops || !ops->pr_clear) return -EOPNOTSUPP; @@ -564,17 +575,17 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode, case BLKTRACETEARDOWN: return blk_trace_ioctl(bdev, cmd, argp); case IOC_PR_REGISTER: - return blkdev_pr_register(bdev, argp); + return blkdev_pr_register(bdev, mode, argp); case IOC_PR_RESERVE: - return blkdev_pr_reserve(bdev, argp); + return blkdev_pr_reserve(bdev, mode, argp); case IOC_PR_RELEASE: - return blkdev_pr_release(bdev, argp); + return blkdev_pr_release(bdev, mode, argp); case IOC_PR_PREEMPT: - return blkdev_pr_preempt(bdev, argp, false); + return blkdev_pr_preempt(bdev, mode, argp, false); case IOC_PR_PREEMPT_ABORT: - return blkdev_pr_preempt(bdev, argp, true); + return blkdev_pr_preempt(bdev, mode, argp, true); case IOC_PR_CLEAR: - return blkdev_pr_clear(bdev, argp); + return blkdev_pr_clear(bdev, mode, argp); default: return -ENOIOCTLCMD; } diff --git a/drivers/acpi/acpica/evevent.c b/drivers/acpi/acpica/evevent.c index 9efca54c51ac50c6638f0f19c41a8f3efb8ef696..10cd64e7a1eb473407a3f7e6d108da469dcc5cfa 100644 --- a/drivers/acpi/acpica/evevent.c +++ b/drivers/acpi/acpica/evevent.c @@ 
-140,9 +140,9 @@ static acpi_status acpi_ev_fixed_event_initialize(void) if (acpi_gbl_fixed_event_info[i].enable_register_id != 0xFF) { status = - acpi_write_bit_register(acpi_gbl_fixed_event_info - [i].enable_register_id, - ACPI_DISABLE_EVENT); + acpi_write_bit_register(acpi_gbl_fixed_event_info[i].enable_register_id, + (i == ACPI_EVENT_PCIE_WAKE) ? + ACPI_ENABLE_EVENT : ACPI_DISABLE_EVENT); if (ACPI_FAILURE(status)) { return (status); } @@ -185,6 +185,11 @@ u32 acpi_ev_fixed_event_detect(void) return (int_status); } + if (fixed_enable & ACPI_BITMASK_PCIEXP_WAKE_DISABLE) + fixed_enable &= ~ACPI_BITMASK_PCIEXP_WAKE_DISABLE; + else + fixed_enable |= ACPI_BITMASK_PCIEXP_WAKE_DISABLE; + ACPI_DEBUG_PRINT((ACPI_DB_INTERRUPTS, "Fixed Event Block: Enable %08X Status %08X\n", fixed_enable, fixed_status)); @@ -248,9 +253,9 @@ static u32 acpi_ev_fixed_event_dispatch(u32 event) * and disable the event to prevent further interrupts. */ if (!acpi_gbl_fixed_event_handlers[event].handler) { - (void)acpi_write_bit_register(acpi_gbl_fixed_event_info[event]. - enable_register_id, - ACPI_DISABLE_EVENT); + (void)acpi_write_bit_register(acpi_gbl_fixed_event_info[event].enable_register_id, + event == ACPI_EVENT_PCIE_WAKE ? + ACPI_ENABLE_EVENT : ACPI_DISABLE_EVENT); ACPI_ERROR((AE_INFO, "No installed handler for fixed event - %s (%u), disabling", diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c index 6a20bb5059c1d815a33ab83149dca348065e5a44..e89895622a72fc9251a90937adfdcbbe5d82f8d8 100644 --- a/drivers/acpi/acpica/hwsleep.c +++ b/drivers/acpi/acpica/hwsleep.c @@ -311,6 +311,18 @@ acpi_status acpi_hw_legacy_wake(u8 sleep_state) [ACPI_EVENT_SLEEP_BUTTON]. status_register_id, ACPI_CLEAR_STATUS); + /* Enable pcie wake event if support */ + if ((acpi_gbl_FADT.flags & ACPI_FADT_PCI_EXPRESS_WAKE)) { + (void) + acpi_write_bit_register(acpi_gbl_fixed_event_info + [ACPI_EVENT_PCIE_WAKE]. + enable_register_id, ACPI_DISABLE_EVENT); + (void) + acpi_write_bit_register(acpi_gbl_fixed_event_info + [ACPI_EVENT_PCIE_WAKE]. 
+ status_register_id, ACPI_CLEAR_STATUS); + } + acpi_hw_execute_sleep_method(METHOD_PATHNAME__SST, ACPI_SST_WORKING); return_ACPI_STATUS(status); } diff --git a/drivers/acpi/acpica/utglobal.c b/drivers/acpi/acpica/utglobal.c index e6dcbdc3fc6ec8e3d58f19ba892ae8ab5a709545..0dc81b85112c2f21b36dd0a96a95075d8fed0883 100644 --- a/drivers/acpi/acpica/utglobal.c +++ b/drivers/acpi/acpica/utglobal.c @@ -186,6 +186,10 @@ struct acpi_fixed_event_info acpi_gbl_fixed_event_info[ACPI_NUM_FIXED_EVENTS] = ACPI_BITREG_RT_CLOCK_ENABLE, ACPI_BITMASK_RT_CLOCK_STATUS, ACPI_BITMASK_RT_CLOCK_ENABLE}, + /* ACPI_EVENT_PCIE_WAKE */ {ACPI_BITREG_PCIEXP_WAKE_STATUS, + ACPI_BITREG_PCIEXP_WAKE_DISABLE, + ACPI_BITMASK_PCIEXP_WAKE_STATUS, + ACPI_BITMASK_PCIEXP_WAKE_DISABLE}, }; #endif /* !ACPI_REDUCED_HARDWARE */ diff --git a/drivers/acpi/irq.c b/drivers/acpi/irq.c index 935a3277f49614dc670e2b24fdc7e2227d3e3b18..9b31bdbc16e440d278b983c69cb2c31db287191e 100644 --- a/drivers/acpi/irq.c +++ b/drivers/acpi/irq.c @@ -16,7 +16,8 @@ enum acpi_irq_model_id acpi_irq_model; -static struct fwnode_handle *acpi_gsi_domain_id; +static struct fwnode_handle *(*acpi_get_gsi_domain_id)(u32 gsi); +static u32 (*acpi_gsi_to_irq_fallback)(u32 gsi); /** * acpi_gsi_to_irq() - Retrieve the linux irq number for a given GSI @@ -30,14 +31,18 @@ static struct fwnode_handle *acpi_gsi_domain_id; */ int acpi_gsi_to_irq(u32 gsi, unsigned int *irq) { - struct irq_domain *d = irq_find_matching_fwnode(acpi_gsi_domain_id, - DOMAIN_BUS_ANY); + struct irq_domain *d; + d = irq_find_matching_fwnode(acpi_get_gsi_domain_id(gsi), + DOMAIN_BUS_ANY); *irq = irq_find_mapping(d, gsi); /* - * *irq == 0 means no mapping, that should - * be reported as a failure + * *irq == 0 means no mapping, that should be reported as a + * failure, unless there is an arch-specific fallback handler. */ + if (!*irq && acpi_gsi_to_irq_fallback) + *irq = acpi_gsi_to_irq_fallback(gsi); + return (*irq > 0) ? 
0 : -EINVAL; } EXPORT_SYMBOL_GPL(acpi_gsi_to_irq); @@ -77,7 +82,8 @@ int acpi_register_partitioned_percpu_gsi(struct device *dev, u32 gsi, { struct irq_fwspec fwspec; - if (WARN_ON(!acpi_gsi_domain_id)) { + fwspec.fwnode = acpi_get_gsi_domain_id(gsi); + if (WARN_ON(!fwspec.fwnode)) { pr_warn("GSI: No registered irqchip, giving up\n"); return -EINVAL; } @@ -85,7 +91,6 @@ int acpi_register_partitioned_percpu_gsi(struct device *dev, u32 gsi, pack_fwspec(&fwspec, gsi, trigger, polarity); fwspec.param[fwspec.param_count] = processor_container_uid; fwspec.param_count++; - fwspec.fwnode = acpi_gsi_domain_id; return irq_create_fwspec_mapping(&fwspec); } @@ -105,13 +110,13 @@ int acpi_register_gsi(struct device *dev, u32 gsi, int trigger, { struct irq_fwspec fwspec; - if (WARN_ON(!acpi_gsi_domain_id)) { + fwspec.fwnode = acpi_get_gsi_domain_id(gsi); + if (WARN_ON(!fwspec.fwnode)) { pr_warn("GSI: No registered irqchip, giving up\n"); return -EINVAL; } pack_fwspec(&fwspec, gsi, trigger, polarity); - fwspec.fwnode = acpi_gsi_domain_id; return irq_create_fwspec_mapping(&fwspec); } @@ -123,8 +128,8 @@ EXPORT_SYMBOL_GPL(acpi_register_gsi); */ void acpi_unregister_gsi(u32 gsi) { - struct irq_domain *d = irq_find_matching_fwnode(acpi_gsi_domain_id, - DOMAIN_BUS_ANY); + struct irq_domain *d = irq_find_matching_fwnode(acpi_get_gsi_domain_id(gsi), + DOMAIN_BUS_ANY); int irq = irq_find_mapping(d, gsi); irq_dispose_mapping(irq); @@ -143,7 +148,8 @@ EXPORT_SYMBOL_GPL(acpi_unregister_gsi); * The referenced device fwhandle or NULL on failure */ static struct fwnode_handle * -acpi_get_irq_source_fwhandle(const struct acpi_resource_source *source) +acpi_get_irq_source_fwhandle(const struct acpi_resource_source *source, + u32 gsi) { struct fwnode_handle *result; struct acpi_device *device; @@ -151,7 +157,7 @@ acpi_get_irq_source_fwhandle(const struct acpi_resource_source *source) acpi_status status; if (!source->string_length) - return acpi_gsi_domain_id; + return acpi_get_gsi_domain_id(gsi); status = acpi_get_handle(NULL, source->string_ptr, &handle); if (WARN_ON(ACPI_FAILURE(status))) @@ -239,7 +245,7 @@ static acpi_status acpi_irq_parse_one_cb(struct acpi_resource *ares, ctx->index -= irq->interrupt_count; return AE_OK; } - fwnode = acpi_gsi_domain_id; + fwnode = acpi_get_gsi_domain_id(irq->interrupts[ctx->index]); acpi_irq_parse_one_match(fwnode, irq->interrupts[ctx->index], irq->triggering, irq->polarity, irq->shareable, ctx); @@ -252,7 +258,8 @@ static acpi_status acpi_irq_parse_one_cb(struct acpi_resource *ares, ctx->index -= eirq->interrupt_count; return AE_OK; } - fwnode = acpi_get_irq_source_fwhandle(&eirq->resource_source); + fwnode = acpi_get_irq_source_fwhandle(&eirq->resource_source, + eirq->interrupts[ctx->index]); acpi_irq_parse_one_match(fwnode, eirq->interrupts[ctx->index], eirq->triggering, eirq->polarity, eirq->shareable, ctx); @@ -336,10 +343,20 @@ EXPORT_SYMBOL_GPL(acpi_irq_get); * GSI interrupts */ void __init acpi_set_irq_model(enum acpi_irq_model_id model, - struct fwnode_handle *fwnode) + struct fwnode_handle *(*fn)(u32)) { acpi_irq_model = model; - acpi_gsi_domain_id = fwnode; + acpi_get_gsi_domain_id = fn; +} + +/** + * acpi_set_gsi_to_irq_fallback - Register a GSI transfer + * callback to fallback to arch specified implementation. 
+ * @fn: arch-specific fallback handler + */ +void __init acpi_set_gsi_to_irq_fallback(u32 (*fn)(u32)) +{ + acpi_gsi_to_irq_fallback = fn; } /** @@ -357,8 +374,14 @@ struct irq_domain *acpi_irq_create_hierarchy(unsigned int flags, const struct irq_domain_ops *ops, void *host_data) { - struct irq_domain *d = irq_find_matching_fwnode(acpi_gsi_domain_id, - DOMAIN_BUS_ANY); + struct irq_domain *d; + + /* This only works for the GIC model... */ + if (acpi_irq_model != ACPI_IRQ_MODEL_GIC) + return NULL; + + d = irq_find_matching_fwnode(acpi_get_gsi_domain_id(0), + DOMAIN_BUS_ANY); if (!d) return NULL; diff --git a/drivers/android/binder.c b/drivers/android/binder.c index a5d5247c4f3e8d9a8ce55bbd991d144e8bfc6a3b..c90865a1396f62a06c2f513de8d6fbf37823ab4f 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -2619,6 +2619,9 @@ static int binder_translate_fd_array(struct binder_fd_array_object *fda, struct binder_proc *proc = thread->proc; struct binder_proc *target_proc = t->to_proc; + if (fda->num_fds == 0) + return 0; + fd_buf_size = sizeof(u32) * fda->num_fds; if (fda->num_fds >= SIZE_MAX / sizeof(u32)) { binder_user_error("%d:%d got transaction with invalid number of fds (%lld)\n", diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c index 95ca4f934d283db8489befafc90da6b762634e90..536d5d7745d24b3b75ce84c051902fad47b627f5 100644 --- a/drivers/android/binder_alloc.c +++ b/drivers/android/binder_alloc.c @@ -213,7 +213,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate, if (mm) { mmap_read_lock(mm); - vma = alloc->vma; + vma = vma_lookup(mm, alloc->vma_addr); } if (!vma && need_mm) { @@ -313,16 +313,15 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate, static inline void binder_alloc_set_vma(struct binder_alloc *alloc, struct vm_area_struct *vma) { - if (vma) + unsigned long vm_start = 0; + + if (vma) { + vm_start = vma->vm_start; alloc->vma_vm_mm = vma->vm_mm; - /* - * If we see alloc->vma is not NULL, buffer data structures set up - * completely. Look at smp_rmb side binder_alloc_get_vma. - * We also want to guarantee new alloc->vma_vm_mm is always visible - * if alloc->vma is set. 
- */ - smp_wmb(); - alloc->vma = vma; + } + + mmap_assert_write_locked(alloc->vma_vm_mm); + alloc->vma_addr = vm_start; } static inline struct vm_area_struct *binder_alloc_get_vma( @@ -330,11 +329,9 @@ static inline struct vm_area_struct *binder_alloc_get_vma( { struct vm_area_struct *vma = NULL; - if (alloc->vma) { - /* Look at description in binder_alloc_set_vma */ - smp_rmb(); - vma = alloc->vma; - } + if (alloc->vma_addr) + vma = vma_lookup(alloc->vma_vm_mm, alloc->vma_addr); + return vma; } @@ -808,7 +805,8 @@ void binder_alloc_deferred_release(struct binder_alloc *alloc) buffers = 0; mutex_lock(&alloc->mutex); - BUG_ON(alloc->vma); + BUG_ON(alloc->vma_addr && + vma_lookup(alloc->vma_vm_mm, alloc->vma_addr)); while ((n = rb_first(&alloc->allocated_buffers))) { buffer = rb_entry(n, struct binder_buffer, rb_node); diff --git a/drivers/android/binder_alloc.h b/drivers/android/binder_alloc.h index 6e8e001381af4ba2c2f73a2ab7bbca1feff5397f..31e41c77755141a99b26668a4a8037c7e796a438 100644 --- a/drivers/android/binder_alloc.h +++ b/drivers/android/binder_alloc.h @@ -95,7 +95,7 @@ struct binder_lru_page { */ struct binder_alloc { struct mutex mutex; - struct vm_area_struct *vma; + unsigned long vma_addr; struct mm_struct *vma_vm_mm; void __user *buffer; struct list_head buffers; diff --git a/drivers/android/binder_alloc_selftest.c b/drivers/android/binder_alloc_selftest.c index c2b323bc3b3a53043fa5783fb4f809de93b24644..43a881073a42832feb96c906c53e0ad34d2b2970 100644 --- a/drivers/android/binder_alloc_selftest.c +++ b/drivers/android/binder_alloc_selftest.c @@ -287,7 +287,7 @@ void binder_selftest_alloc(struct binder_alloc *alloc) if (!binder_selftest_run) return; mutex_lock(&binder_selftest_lock); - if (!binder_selftest_run || !alloc->vma) + if (!binder_selftest_run || !alloc->vma_addr) goto done; pr_info("STARTED\n"); binder_selftest_alloc_offset(alloc, end_offset, 0); diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig index 030cb32da980fc4797a91469630b8e6a19edea8c..5f62dcdc546b4163b03f30d32e0562ef4f423915 100644 --- a/drivers/ata/Kconfig +++ b/drivers/ata/Kconfig @@ -552,6 +552,14 @@ config SATA_VITESSE If unsure, say N. +config SATA_ZHAOXIN + tristate "ZhaoXin SATA support" + depends on PCI + help + This option enables support for ZhaoXin Serial ATA. + + If unsure, say N. 
+ comment "PATA SFF controllers with BMDMA" config PATA_ALI diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile index b8aebfb14e825040af4a228b475e0be3c8b028ac..f0f3c7145e2782333562fbfa085d28f0c21917d6 100644 --- a/drivers/ata/Makefile +++ b/drivers/ata/Makefile @@ -44,6 +44,7 @@ obj-$(CONFIG_SATA_SIL) += sata_sil.o obj-$(CONFIG_SATA_SIS) += sata_sis.o obj-$(CONFIG_SATA_SVW) += sata_svw.o obj-$(CONFIG_SATA_ULI) += sata_uli.o +obj-$(CONFIG_SATA_ZHAOXIN) += sata_zhaoxin.o obj-$(CONFIG_SATA_VIA) += sata_via.o obj-$(CONFIG_SATA_VITESSE) += sata_vsc.o diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index ff2add0101fe56e5d02149f51088ac29f906d316..652f654cf6352314c0c6146ff9eed1dffe33e596 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -316,7 +316,6 @@ static const struct pci_device_id ahci_pci_tbl[] = { { PCI_VDEVICE(INTEL, 0x1d02), board_ahci }, /* PBG AHCI */ { PCI_VDEVICE(INTEL, 0x1d04), board_ahci }, /* PBG RAID */ { PCI_VDEVICE(INTEL, 0x1d06), board_ahci }, /* PBG RAID */ - { PCI_VDEVICE(INTEL, 0x2826), board_ahci }, /* PBG RAID */ { PCI_VDEVICE(INTEL, 0x2323), board_ahci }, /* DH89xxCC AHCI */ { PCI_VDEVICE(INTEL, 0x1e02), board_ahci }, /* Panther Point AHCI */ { PCI_VDEVICE(INTEL, 0x1e03), board_ahci_mobile }, /* Panther M AHCI */ @@ -396,8 +395,9 @@ static const struct pci_device_id ahci_pci_tbl[] = { { PCI_VDEVICE(INTEL, 0xa10f), board_ahci }, /* Sunrise Point-H RAID */ { PCI_VDEVICE(INTEL, 0x2822), board_ahci }, /* Lewisburg RAID*/ { PCI_VDEVICE(INTEL, 0x2823), board_ahci }, /* Lewisburg AHCI*/ - { PCI_VDEVICE(INTEL, 0x2826), board_ahci }, /* Lewisburg RAID*/ - { PCI_VDEVICE(INTEL, 0x2827), board_ahci }, /* Lewisburg RAID*/ + { PCI_VDEVICE(INTEL, 0x2826), board_ahci }, /* *burg SATA0 'RAID' */ + { PCI_VDEVICE(INTEL, 0x2827), board_ahci }, /* *burg SATA1 'RAID' */ + { PCI_VDEVICE(INTEL, 0x282f), board_ahci }, /* *burg SATA2 'RAID' */ { PCI_VDEVICE(INTEL, 0xa182), board_ahci }, /* Lewisburg AHCI*/ { PCI_VDEVICE(INTEL, 0xa186), board_ahci }, /* Lewisburg RAID*/ { PCI_VDEVICE(INTEL, 0xa1d2), board_ahci }, /* Lewisburg RAID*/ diff --git a/drivers/ata/sata_zhaoxin.c b/drivers/ata/sata_zhaoxin.c new file mode 100644 index 0000000000000000000000000000000000000000..ef8c73a37667e99a23128b4206dd9d558b64f669 --- /dev/null +++ b/drivers/ata/sata_zhaoxin.c @@ -0,0 +1,384 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * sata_zhaoxin.c - ZhaoXin Serial ATA controllers + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "sata_zx" +#define DRV_VERSION "2.6.1" + +enum board_ids_enum { + zx100s, +}; + +enum { + SATA_CHAN_ENAB = 0x40, /* SATA channel enable */ + SATA_INT_GATE = 0x41, /* SATA interrupt gating */ + SATA_NATIVE_MODE = 0x42, /* Native mode enable */ + PATA_UDMA_TIMING = 0xB3, /* PATA timing for DMA/ cable detect */ + PATA_PIO_TIMING = 0xAB, /* PATA timing register */ + + PORT0 = (1 << 1), + PORT1 = (1 << 0), + ALL_PORTS = PORT0 | PORT1, + + NATIVE_MODE_ALL = (1 << 7) | (1 << 6) | (1 << 5) | (1 << 4), + + SATA_EXT_PHY = (1 << 6), /* 0==use PATA, 1==ext phy */ +}; + +static int zx_init_one(struct pci_dev *pdev, const struct pci_device_id *ent); +static int zx_scr_read(struct ata_link *link, unsigned int scr, u32 *val); +static int zx_scr_write(struct ata_link *link, unsigned int scr, u32 val); +static int zx_hardreset(struct ata_link *link, unsigned int *class, + unsigned long deadline); + +static void zx_tf_load(struct ata_port *ap, const struct ata_taskfile *tf); + +static const struct 
pci_device_id zx_pci_tbl[] = { + { PCI_VDEVICE(ZHAOXIN, 0x9002), zx100s }, + { PCI_VDEVICE(ZHAOXIN, 0x9003), zx100s }, + + { } /* terminate list */ +}; + +static struct pci_driver zx_pci_driver = { + .name = DRV_NAME, + .id_table = zx_pci_tbl, + .probe = zx_init_one, +#ifdef CONFIG_PM_SLEEP + .suspend = ata_pci_device_suspend, + .resume = ata_pci_device_resume, +#endif + .remove = ata_pci_remove_one, +}; + +static struct scsi_host_template zx_sht = { + ATA_BMDMA_SHT(DRV_NAME), +}; + +static struct ata_port_operations zx_base_ops = { + .inherits = &ata_bmdma_port_ops, + .sff_tf_load = zx_tf_load, +}; + +static struct ata_port_operations zx_ops = { + .inherits = &zx_base_ops, + .hardreset = zx_hardreset, + .scr_read = zx_scr_read, + .scr_write = zx_scr_write, +}; + +static struct ata_port_info zx100s_port_info = { + .flags = ATA_FLAG_SATA | ATA_FLAG_SLAVE_POSS, + .pio_mask = ATA_PIO4, + .mwdma_mask = ATA_MWDMA2, + .udma_mask = ATA_UDMA6, + .port_ops = &zx_ops, +}; + + +static int zx_hardreset(struct ata_link *link, unsigned int *class, + unsigned long deadline) +{ + int rc; + + rc = sata_std_hardreset(link, class, deadline); + if (!rc || rc == -EAGAIN) { + struct ata_port *ap = link->ap; + int pmp = link->pmp; + int tmprc; + + if (pmp) { + ap->ops->sff_dev_select(ap, pmp); + tmprc = ata_sff_wait_ready(&ap->link, deadline); + } else { + tmprc = ata_sff_wait_ready(link, deadline); + } + if (tmprc) + ata_link_err(link, "COMRESET failed for wait (errno=%d)\n", + rc); + else + ata_link_err(link, "wait for bsy success\n"); + + ata_link_err(link, "COMRESET success (errno=%d) ap=%d link %d\n", + rc, link->ap->port_no, link->pmp); + } else { + ata_link_err(link, "COMRESET failed (errno=%d) ap=%d link %d\n", + rc, link->ap->port_no, link->pmp); + } + return rc; +} + +static int zx_scr_read(struct ata_link *link, unsigned int scr, u32 *val) +{ + static const u8 ipm_tbl[] = { 1, 2, 6, 0 }; + struct pci_dev *pdev = to_pci_dev(link->ap->host->dev); + int slot = 2 * link->ap->port_no + link->pmp; + u32 v = 0; + u8 raw; + + switch (scr) { + case SCR_STATUS: + pci_read_config_byte(pdev, 0xA0 + slot, &raw); + + /* read the DET field, bit0 and 1 of the config byte */ + v |= raw & 0x03; + + /* read the SPD field, bit4 of the configure byte */ + v |= raw & 0x30; + + /* read the IPM field, bit2 and 3 of the config byte */ + v |= ((ipm_tbl[(raw >> 2) & 0x3])<<8); + break; + + case SCR_ERROR: + /* devices other than 5287 uses 0xA8 as base */ + WARN_ON(pdev->device != 0x9002 && pdev->device != 0x9003); + pci_write_config_byte(pdev, 0x42, slot); + pci_read_config_dword(pdev, 0xA8, &v); + break; + + case SCR_CONTROL: + pci_read_config_byte(pdev, 0xA4 + slot, &raw); + + /* read the DET field, bit0 and bit1 */ + v |= ((raw & 0x02) << 1) | (raw & 0x01); + + /* read the IPM field, bit2 and bit3 */ + v |= ((raw >> 2) & 0x03) << 8; + + break; + + default: + return -EINVAL; + } + + *val = v; + return 0; +} + +static int zx_scr_write(struct ata_link *link, unsigned int scr, u32 val) +{ + struct pci_dev *pdev = to_pci_dev(link->ap->host->dev); + int slot = 2 * link->ap->port_no + link->pmp; + u32 v = 0; + + WARN_ON(pdev == NULL); + + switch (scr) { + case SCR_ERROR: + /* devices 0x9002 uses 0xA8 as base */ + WARN_ON(pdev->device != 0x9002 && pdev->device != 0x9003); + pci_write_config_byte(pdev, 0x42, slot); + pci_write_config_dword(pdev, 0xA8, val); + return 0; + + case SCR_CONTROL: + /* set the DET field */ + v |= ((val & 0x4) >> 1) | (val & 0x1); + + /* set the IPM field */ + v |= ((val >> 8) & 0x3) << 2; + + + 
pci_write_config_byte(pdev, 0xA4 + slot, v); + + + return 0; + + default: + return -EINVAL; + } +} + + +/** + * zx_tf_load - send taskfile registers to host controller + * @ap: Port to which output is sent + * @tf: ATA taskfile register set + * + * Outputs ATA taskfile to standard ATA host controller. + * + * This is to fix the internal bug of zx chipsets, which will + * reset the device register after changing the IEN bit on ctl + * register. + */ +static void zx_tf_load(struct ata_port *ap, const struct ata_taskfile *tf) +{ + struct ata_taskfile ttf; + + if (tf->ctl != ap->last_ctl) { + ttf = *tf; + ttf.flags |= ATA_TFLAG_DEVICE; + tf = &ttf; + } + ata_sff_tf_load(ap, tf); +} + +static const unsigned int zx_bar_sizes[] = { + 8, 4, 8, 4, 16, 256 +}; + +static const unsigned int zx100s_bar_sizes0[] = { + 8, 4, 8, 4, 16, 0 +}; + +static const unsigned int zx100s_bar_sizes1[] = { + 8, 4, 0, 0, 16, 0 +}; + +static int zx_prepare_host(struct pci_dev *pdev, struct ata_host **r_host) +{ + const struct ata_port_info *ppi0[] = { + &zx100s_port_info, NULL + }; + const struct ata_port_info *ppi1[] = { + &zx100s_port_info, &ata_dummy_port_info + }; + struct ata_host *host; + int i, rc; + + if (pdev->device == 0x9002) + rc = ata_pci_bmdma_prepare_host(pdev, ppi0, &host); + else if (pdev->device == 0x9003) + rc = ata_pci_bmdma_prepare_host(pdev, ppi1, &host); + else + rc = -EINVAL; + + if (rc) + return rc; + + *r_host = host; + + /* 9002 hosts four sata ports as M/S of the two channels */ + /* 9003 hosts two sata ports as M/S of the one channel */ + for (i = 0; i < host->n_ports; i++) + ata_slave_link_init(host->ports[i]); + + return 0; +} + +static void zx_configure(struct pci_dev *pdev, int board_id) +{ + u8 tmp8; + + pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &tmp8); + dev_info(&pdev->dev, "routed to hard irq line %d\n", + (int) (tmp8 & 0xf0) == 0xf0 ? 0 : tmp8 & 0x0f); + + /* make sure SATA channels are enabled */ + pci_read_config_byte(pdev, SATA_CHAN_ENAB, &tmp8); + if ((tmp8 & ALL_PORTS) != ALL_PORTS) { + dev_dbg(&pdev->dev, "enabling SATA channels (0x%x)\n", + (int)tmp8); + tmp8 |= ALL_PORTS; + pci_write_config_byte(pdev, SATA_CHAN_ENAB, tmp8); + } + + /* make sure interrupts for each channel sent to us */ + pci_read_config_byte(pdev, SATA_INT_GATE, &tmp8); + if ((tmp8 & ALL_PORTS) != ALL_PORTS) { + dev_dbg(&pdev->dev, "enabling SATA channel interrupts (0x%x)\n", + (int) tmp8); + tmp8 |= ALL_PORTS; + pci_write_config_byte(pdev, SATA_INT_GATE, tmp8); + } + + /* make sure native mode is enabled */ + pci_read_config_byte(pdev, SATA_NATIVE_MODE, &tmp8); + if ((tmp8 & NATIVE_MODE_ALL) != NATIVE_MODE_ALL) { + dev_dbg(&pdev->dev, + "enabling SATA channel native mode (0x%x)\n", + (int) tmp8); + tmp8 |= NATIVE_MODE_ALL; + pci_write_config_byte(pdev, SATA_NATIVE_MODE, tmp8); + } +} + +static int zx_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) +{ + unsigned int i; + int rc; + struct ata_host *host = NULL; + int board_id = (int) ent->driver_data; + const unsigned int *bar_sizes; + int legacy_mode = 0; + + ata_print_version_once(&pdev->dev, DRV_VERSION); + + if (pdev->device == 0x9002 || pdev->device == 0x9003) { + if ((pdev->class >> 8) == PCI_CLASS_STORAGE_IDE) { + u8 tmp8, mask; + + /* TODO: What if one channel is in native mode ... 
*/ + pci_read_config_byte(pdev, PCI_CLASS_PROG, &tmp8); + mask = (1 << 2) | (1 << 0); + if ((tmp8 & mask) != mask) + legacy_mode = 1; + } + if (legacy_mode) + return -EINVAL; + } + + rc = pcim_enable_device(pdev); + if (rc) + return rc; + + if (board_id == zx100s && pdev->device == 0x9002) + bar_sizes = &zx100s_bar_sizes0[0]; + else if (board_id == zx100s && pdev->device == 0x9003) + bar_sizes = &zx100s_bar_sizes1[0]; + else + bar_sizes = &zx_bar_sizes[0]; + + for (i = 0; i < ARRAY_SIZE(zx_bar_sizes); i++) { + if ((pci_resource_start(pdev, i) == 0) || + (pci_resource_len(pdev, i) < bar_sizes[i])) { + if (bar_sizes[i] == 0) + continue; + + dev_err(&pdev->dev, + "invalid PCI BAR %u (sz 0x%llx, val 0x%llx)\n", + i, + (unsigned long long)pci_resource_start(pdev, i), + (unsigned long long)pci_resource_len(pdev, i)); + + return -ENODEV; + } + } + + switch (board_id) { + case zx100s: + rc = zx_prepare_host(pdev, &host); + break; + default: + rc = -EINVAL; + } + if (rc) + return rc; + + zx_configure(pdev, board_id); + + pci_set_master(pdev); + return ata_host_activate(host, pdev->irq, ata_bmdma_interrupt, + IRQF_SHARED, &zx_sht); +} + +module_pci_driver(zx_pci_driver); + +MODULE_AUTHOR("Yanchen:YanchenSun@zhaoxin.com"); +MODULE_DESCRIPTION("SCSI low-level driver for ZX SATA controllers"); +MODULE_LICENSE("GPL"); +MODULE_DEVICE_TABLE(pci, zx_pci_tbl); +MODULE_VERSION(DRV_VERSION); diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c index 53fac7686d9c11e51e6f34bc0c1685977d3f0621..f3610501da20881a8e3b0f19d17bf2380bf5a041 100644 --- a/drivers/base/auxiliary.c +++ b/drivers/base/auxiliary.c @@ -82,13 +82,12 @@ static int auxiliary_bus_remove(struct device *dev) { struct auxiliary_driver *auxdrv = to_auxiliary_drv(dev->driver); struct auxiliary_device *auxdev = to_auxiliary_dev(dev); - int ret = 0; if (auxdrv->remove) - ret = auxdrv->remove(auxdev); + auxdrv->remove(auxdev); dev_pm_domain_detach(dev, true); - return ret; + return 0; } static void auxiliary_bus_shutdown(struct device *dev) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 9b54eec9b17eb077dec68777e78c1226ccaef38a..c520016dd4cff141afe43f05e42ae11dd738e4d8 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -228,7 +228,8 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, bool unmap = false; u32 type; - BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems); + if (req_op(req) != REQ_OP_DISCARD) + BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems); switch (req_op(req)) { case REQ_OP_READ: diff --git a/drivers/bluetooth/btsdio.c b/drivers/bluetooth/btsdio.c index 199e8f7d426d96ae524be2842a8f08b258cd73ff..7050a16e7efebd2119d4b7c7aace75da1cdf3d98 100644 --- a/drivers/bluetooth/btsdio.c +++ b/drivers/bluetooth/btsdio.c @@ -352,6 +352,7 @@ static void btsdio_remove(struct sdio_func *func) BT_DBG("func %p", func); + cancel_work_sync(&data->work); if (!data) return; diff --git a/drivers/clk/st/clkgen-fsyn.c b/drivers/clk/st/clkgen-fsyn.c index f1adc858b5907d4ee480be6b73d9e0d51a10f1e0..0e58a7cda427fb0bdd1cd4408eb28ab62bbefc5d 100644 --- a/drivers/clk/st/clkgen-fsyn.c +++ b/drivers/clk/st/clkgen-fsyn.c @@ -942,9 +942,10 @@ static void __init st_of_quadfs_setup(struct device_node *np, clk = st_clk_register_quadfs_pll(pll_name, clk_parent_name, data, reg, lock); - if (IS_ERR(clk)) + if (IS_ERR(clk)) { + kfree(lock); goto err_exit; - else + } else pr_debug("%s: parent %s rate %u\n", __clk_get_name(clk), __clk_get_name(clk_get_parent(clk)), diff --git a/drivers/edac/amd64_edac.c 
b/drivers/edac/amd64_edac.c index 96388ae74a7e3ea85d5da7bb5e22d8c0198f9652..e1b1d7cf755f8b422c8462049839fb6649e3f0bd 100644 --- a/drivers/edac/amd64_edac.c +++ b/drivers/edac/amd64_edac.c @@ -98,6 +98,17 @@ int __amd64_write_pci_cfg_dword(struct pci_dev *pdev, int offset, return err; } +static u32 get_umc_base_f18h_m4h(u16 node, u8 channel) +{ + struct pci_dev *f3 = node_to_amd_nb(node)->misc; + u8 df_id; + + get_df_id(f3, &df_id); + df_id -= 4; + + return get_umc_base(channel) + (0x80000000 + (0x10000000 * df_id)); +} + /* * Select DCT to which PCI cfg accesses are routed */ @@ -865,7 +876,10 @@ static void __dump_misc_regs_df(struct amd64_pvt *pvt) u32 i, tmp, umc_base; for_each_umc(i) { - umc_base = get_umc_base(i); + if (hygon_f18h_m4h()) + umc_base = get_umc_base_f18h_m4h(pvt->mc_node_id, i); + else + umc_base = get_umc_base(i); umc = &pvt->umc[i]; edac_dbg(1, "UMC%d DIMM cfg: 0x%x\n", i, umc->dimm_cfg); @@ -985,11 +999,17 @@ static void read_umc_base_mask(struct amd64_pvt *pvt) u32 mask_reg, mask_reg_sec; u32 *base, *base_sec; u32 *mask, *mask_sec; + u32 umc_base; int cs, umc; for_each_umc(umc) { - umc_base_reg = get_umc_base(umc) + UMCCH_BASE_ADDR; - umc_base_reg_sec = get_umc_base(umc) + UMCCH_BASE_ADDR_SEC; + if (hygon_f18h_m4h()) + umc_base = get_umc_base_f18h_m4h(pvt->mc_node_id, umc); + else + umc_base = get_umc_base(umc); + + umc_base_reg = umc_base + UMCCH_BASE_ADDR; + umc_base_reg_sec = umc_base + UMCCH_BASE_ADDR_SEC; for_each_chip_select(cs, umc, pvt) { base = &pvt->csels[umc].csbases[cs]; @@ -1007,8 +1027,8 @@ static void read_umc_base_mask(struct amd64_pvt *pvt) umc, cs, *base_sec, base_reg_sec); } - umc_mask_reg = get_umc_base(umc) + UMCCH_ADDR_MASK; - umc_mask_reg_sec = get_umc_base(umc) + get_umc_reg(UMCCH_ADDR_MASK_SEC); + umc_mask_reg = umc_base + UMCCH_ADDR_MASK; + umc_mask_reg_sec = umc_base + get_umc_reg(UMCCH_ADDR_MASK_SEC); for_each_chip_select_mask(cs, umc, pvt) { mask = &pvt->csels[umc].csmasks[cs]; @@ -1096,7 +1116,8 @@ static void determine_memory_type_df(struct amd64_pvt *pvt) * Check if the system supports the "DDR Type" field in UMC Config * and has DDR5 DIMMs in use. 
*/ - if (fam_type->flags.zn_regs_v2 && ((umc->umc_cfg & GENMASK(2, 0)) == 0x1)) { + if ((fam_type->flags.zn_regs_v2 || hygon_f18h_m4h()) && + ((umc->umc_cfg & GENMASK(2, 0)) == 0x1)) { if (umc->dimm_cfg & BIT(5)) umc->dram_type = MEM_LRDDR5; else if (umc->dimm_cfg & BIT(4)) @@ -2425,6 +2446,16 @@ static struct amd64_family_type family_types[] = { .dbam_to_cs = f17_addr_mask_to_cs_size, } }, + [F18_M06H_CPUS] = { + .ctl_name = "F18h_M06h", + .f0_id = PCI_DEVICE_ID_HYGON_18H_M06H_DF_F0, + .f6_id = PCI_DEVICE_ID_HYGON_18H_M06H_DF_F6, + .max_mcs = 2, + .ops = { + .early_channel_count = f17_early_channel_count, + .dbam_to_cs = f17_addr_mask_to_cs_size, + } + }, [F19_CPUS] = { .ctl_name = "F19h", .f0_id = PCI_DEVICE_ID_AMD_19H_DF_F0, @@ -2706,6 +2737,9 @@ static inline void decode_bus_error(int node_id, struct mce *m) */ static int find_umc_channel(struct mce *m) { + if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON && + boot_cpu_data.x86 == 0x18) + return (m->ipid & GENMASK(23, 0)) >> 20; return (m->ipid & GENMASK(31, 0)) >> 20; } @@ -2829,6 +2863,14 @@ static void free_mc_sibling_devs(struct amd64_pvt *pvt) } } +static void determine_ecc_sym_sz_f18h_m4h(struct amd64_pvt *pvt, int channel) +{ + if (pvt->umc[channel].ecc_ctrl & BIT(8)) + pvt->ecc_sym_sz = 16; + else if (pvt->umc[channel].ecc_ctrl & BIT(7)) + pvt->ecc_sym_sz = 8; +} + static void determine_ecc_sym_sz(struct amd64_pvt *pvt) { pvt->ecc_sym_sz = 4; @@ -2839,6 +2881,15 @@ static void determine_ecc_sym_sz(struct amd64_pvt *pvt) for_each_umc(i) { /* Check enabled channels only: */ if (pvt->umc[i].sdp_ctrl & UMC_SDP_INIT) { + if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON && + boot_cpu_data.x86 == 0x18 && + (boot_cpu_data.x86_model == 0x4 || + boot_cpu_data.x86_model == 0x5) && + (pvt->umc[i].umc_cfg & GENMASK(2, 0)) == 0x1) { + determine_ecc_sym_sz_f18h_m4h(pvt, i); + return; + } + if (pvt->umc[i].ecc_ctrl & BIT(9)) { pvt->ecc_sym_sz = 16; return; @@ -2873,8 +2924,11 @@ static void __read_mc_regs_df(struct amd64_pvt *pvt) /* Read registers from each UMC */ for_each_umc(i) { + if (hygon_f18h_m4h()) + umc_base = get_umc_base_f18h_m4h(pvt->mc_node_id, i); + else + umc_base = get_umc_base(i); - umc_base = get_umc_base(i); umc = &pvt->umc[i]; amd_smn_read(nid, umc_base + get_umc_reg(UMCCH_DIMM_CFG), &umc->dimm_cfg); @@ -3484,13 +3538,31 @@ static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt) pvt->ops = &family_types[F17_M70H_CPUS].ops; break; } - fallthrough; - case 0x18: fam_type = &family_types[F17_CPUS]; pvt->ops = &family_types[F17_CPUS].ops; + break; - if (pvt->fam == 0x18) - family_types[F17_CPUS].ctl_name = "F18h"; + case 0x18: + if (pvt->model == 0x4) { + fam_type = &family_types[F17_M30H_CPUS]; + pvt->ops = &family_types[F17_M30H_CPUS].ops; + family_types[F17_M30H_CPUS].max_mcs = 3; + family_types[F17_M30H_CPUS].ctl_name = "F18h_M04h"; + break; + } else if (pvt->model == 0x5) { + fam_type = &family_types[F17_M30H_CPUS]; + pvt->ops = &family_types[F17_M30H_CPUS].ops; + family_types[F17_M30H_CPUS].max_mcs = 1; + family_types[F17_M30H_CPUS].ctl_name = "F18h_M05h"; + break; + } else if (pvt->model == 0x6) { + fam_type = &family_types[F18_M06H_CPUS]; + pvt->ops = &family_types[F18_M06H_CPUS].ops; + break; + } + fam_type = &family_types[F17_CPUS]; + pvt->ops = &family_types[F17_CPUS].ops; + family_types[F17_CPUS].ctl_name = "F18h"; break; case 0x19: @@ -3768,6 +3840,7 @@ static int __init amd64_edac_init(void) { const char *owner; int err = -ENODEV; + u16 instance_num; int i; owner = edac_get_owner(); @@ -3782,8 
+3855,13 @@ static int __init amd64_edac_init(void) opstate_init(); + if (hygon_f18h_m4h()) + instance_num = hygon_nb_num(); + else + instance_num = amd_nb_num(); + err = -ENOMEM; - ecc_stngs = kcalloc(amd_nb_num(), sizeof(ecc_stngs[0]), GFP_KERNEL); + ecc_stngs = kcalloc(instance_num, sizeof(ecc_stngs[0]), GFP_KERNEL); if (!ecc_stngs) goto err_free; @@ -3791,7 +3869,7 @@ static int __init amd64_edac_init(void) if (!msrs) goto err_free; - for (i = 0; i < amd_nb_num(); i++) { + for (i = 0; i < instance_num; i++) { err = probe_one_instance(i); if (err) { /* unwind properly */ @@ -3838,6 +3916,7 @@ static int __init amd64_edac_init(void) static void __exit amd64_edac_exit(void) { + u16 instance_num; int i; if (pci_ctl) @@ -3849,7 +3928,12 @@ static void __exit amd64_edac_exit(void) else amd_unregister_ecc_decoder(decode_bus_error); - for (i = 0; i < amd_nb_num(); i++) + if (hygon_f18h_m4h()) + instance_num = hygon_nb_num(); + else + instance_num = amd_nb_num(); + + for (i = 0; i < instance_num; i++) remove_one_instance(i); kfree(ecc_stngs); diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h index 5a273d589e3042806f26d522ceaead52ef74785b..3c82cedc4eec9cf0aef9ae99b218869b90edf332 100644 --- a/drivers/edac/amd64_edac.h +++ b/drivers/edac/amd64_edac.h @@ -129,6 +129,9 @@ #define PCI_DEVICE_ID_AMD_19H_M10H_DF_F0 0x14ad #define PCI_DEVICE_ID_AMD_19H_M10H_DF_F6 0x14b3 +#define PCI_DEVICE_ID_HYGON_18H_M06H_DF_F0 0x14b0 +#define PCI_DEVICE_ID_HYGON_18H_M06H_DF_F6 0x14b6 + /* * Function 1 - Address Map */ @@ -302,6 +305,7 @@ enum amd_families { F17_M30H_CPUS, F17_M60H_CPUS, F17_M70H_CPUS, + F18_M06H_CPUS, F19_CPUS, F19_M10H_CPUS, NUM_FAMILIES, diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c index 4682a492b9875a323fb7e47a94d5d72b29d4f66e..ba9da97f16f70cb8ed028757c334e5991e189672 100644 --- a/drivers/edac/mce_amd.c +++ b/drivers/edac/mce_amd.c @@ -1002,8 +1002,13 @@ static void decode_smca_error(struct mce *m) if (xec < smca_mce_descs[bank_type].num_descs) pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]); - if (bank_type == SMCA_UMC && xec == 0 && decode_dram_ecc) - decode_dram_ecc(topology_die_id(m->extcpu), m); + if (bank_type == SMCA_UMC && xec == 0 && decode_dram_ecc) { + if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON && + boot_cpu_data.x86 == 0x18) + decode_dram_ecc(topology_logical_die_id(m->extcpu), m); + else + decode_dram_ecc(topology_die_id(m->extcpu), m); + } } static inline void amd_decode_err_code(u16 ec) diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig index 62ee2d87ccf41d64f2211719b3132db5d741ad55..7192bcde0038b92122392686a2167d495b4cbbac 100644 --- a/drivers/firmware/efi/Kconfig +++ b/drivers/firmware/efi/Kconfig @@ -104,6 +104,35 @@ config EFI_RUNTIME_WRAPPERS config EFI_GENERIC_STUB bool +config EFI_ZBOOT + def_bool y + depends on EFI_GENERIC_STUB && !ARM + select HAVE_KERNEL_GZIP + select HAVE_KERNEL_LZ4 + select HAVE_KERNEL_LZMA + select HAVE_KERNEL_LZO + select HAVE_KERNEL_XZ + select HAVE_KERNEL_ZSTD + +config EFI_ZBOOT_SIGNED + bool "Sign the EFI decompressor for UEFI secure boot" + depends on EFI_ZBOOT + help + Use the 'sbsign' command line tool (which must exist on the host + path) to sign both the EFI decompressor PE/COFF image, as well as the + encapsulated PE/COFF image, which is subsequently compressed and + wrapped by the former image. 
+ +config EFI_ZBOOT_SIGNING_CERT + string "Certificate to use for signing the compressed EFI boot image" + depends on EFI_ZBOOT_SIGNED + default "" + +config EFI_ZBOOT_SIGNING_KEY + string "Private key to use for signing the compressed EFI boot image" + depends on EFI_ZBOOT_SIGNED + default "" + config EFI_ARMSTUB_DTB_LOADER bool "Enable the DTB loader" depends on EFI_GENERIC_STUB && !RISCV diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile index a2ae9c3b95793821c305c4c3e8d125cbce783bd7..cc910712f1057273ac5133bfda184f1b459b0880 100644 --- a/drivers/firmware/efi/libstub/Makefile +++ b/drivers/firmware/efi/libstub/Makefile @@ -61,7 +61,7 @@ efi-deps-y := fdt_rw.c fdt_ro.c fdt_wip.c fdt.c fdt_empty_tree.c fdt_sw.c $(obj)/lib-%.o: $(srctree)/lib/%.c FORCE $(call if_changed_rule,cc_o_c) -lib-$(CONFIG_EFI_GENERIC_STUB) += efi-stub.o fdt.o string.o \ +lib-$(CONFIG_EFI_GENERIC_STUB) += efi-stub.o fdt.o string.o intrinsics.o \ $(patsubst %.c,lib-%.o,$(efi-deps-y)) lib-$(CONFIG_ARM) += arm32-stub.o @@ -70,6 +70,12 @@ lib-$(CONFIG_X86) += x86-stub.o lib-$(CONFIG_RISCV) += riscv-stub.o CFLAGS_arm32-stub.o := -DTEXT_OFFSET=$(TEXT_OFFSET) +zboot-obj-$(CONFIG_RISCV) := lib-clz_ctz.o lib-ashldi3.o +lib-$(CONFIG_EFI_ZBOOT) += zboot.o $(zboot-obj-y) + +extra-y := $(lib-y) +lib-y := $(patsubst %.o,%.stub.o,$(lib-y)) + # Even when -mbranch-protection=none is set, Clang will generate a # .note.gnu.property for code-less object files (like lib/ctype.c), # so work around this by explicitly removing the unwanted section. @@ -109,9 +115,6 @@ STUBCOPY_RELOC-$(CONFIG_ARM) := R_ARM_ABS # a verification pass to see if any absolute relocations exist in any of the # object files. # -extra-y := $(lib-y) -lib-y := $(patsubst %.o,%.stub.o,$(lib-y)) - STUBCOPY_FLAGS-$(CONFIG_ARM64) += --prefix-alloc-sections=.init \ --prefix-symbols=__efistub_ STUBCOPY_RELOC-$(CONFIG_ARM64) := R_AARCH64_ABS diff --git a/drivers/firmware/efi/libstub/Makefile.zboot b/drivers/firmware/efi/libstub/Makefile.zboot new file mode 100644 index 0000000000000000000000000000000000000000..d0be2de2c8cf7dff07caa3dc8423b94cd0808001 --- /dev/null +++ b/drivers/firmware/efi/libstub/Makefile.zboot @@ -0,0 +1,70 @@ +# SPDX-License-Identifier: GPL-2.0 + +# to be include'd by arch/$(ARCH)/boot/Makefile after setting +# EFI_ZBOOT_PAYLOAD, EFI_ZBOOT_BFD_TARGET and EFI_ZBOOT_MACH_TYPE + +comp-type-$(CONFIG_KERNEL_GZIP) := gzip +comp-type-$(CONFIG_KERNEL_LZ4) := lz4 +comp-type-$(CONFIG_KERNEL_LZMA) := lzma +comp-type-$(CONFIG_KERNEL_LZO) := lzo +comp-type-$(CONFIG_KERNEL_XZ) := xzkern +comp-type-$(CONFIG_KERNEL_ZSTD) := zstd22 + +# in GZIP, the appended le32 carrying the uncompressed size is part of the +# format, but in other cases, we just append it at the end for convenience, +# causing the original tools to complain when checking image integrity. +# So disregard it when calculating the payload size in the zimage header. 
+zboot-method-y := $(comp-type-y) +zboot-size-len-y := 4 + +zboot-method-$(CONFIG_KERNEL_GZIP) := gzip +zboot-size-len-$(CONFIG_KERNEL_GZIP) := 0 + +quiet_cmd_sbsign = SBSIGN $@ + cmd_sbsign = sbsign --out $@ $< \ + --key $(CONFIG_EFI_ZBOOT_SIGNING_KEY) \ + --cert $(CONFIG_EFI_ZBOOT_SIGNING_CERT) + +$(obj)/$(EFI_ZBOOT_PAYLOAD).signed: $(obj)/$(EFI_ZBOOT_PAYLOAD) FORCE + $(call if_changed,sbsign) + +ZBOOT_PAYLOAD-y := $(EFI_ZBOOT_PAYLOAD) +ZBOOT_PAYLOAD-$(CONFIG_EFI_ZBOOT_SIGNED) := $(EFI_ZBOOT_PAYLOAD).signed + +$(obj)/vmlinuz: $(obj)/$(ZBOOT_PAYLOAD-y) FORCE + $(call if_changed,$(zboot-method-y)) + +OBJCOPYFLAGS_vmlinuz.o := -I binary -O $(EFI_ZBOOT_BFD_TARGET) \ + --rename-section .data=.gzdata,load,alloc,readonly,contents +$(obj)/vmlinuz.o: $(obj)/vmlinuz FORCE + $(call if_changed,objcopy) + +AFLAGS_zboot-header.o += -DMACHINE_TYPE=IMAGE_FILE_MACHINE_$(EFI_ZBOOT_MACH_TYPE) \ + -DZBOOT_EFI_PATH="\"$(realpath $(obj)/vmlinuz.efi.elf)\"" \ + -DZBOOT_SIZE_LEN=$(zboot-size-len-y) \ + -DCOMP_TYPE="\"$(comp-type-y)\"" + +$(obj)/zboot-header.o: $(srctree)/drivers/firmware/efi/libstub/zboot-header.S FORCE + $(call if_changed_rule,as_o_S) + +ZBOOT_DEPS := $(obj)/zboot-header.o $(objtree)/drivers/firmware/efi/libstub/lib.a + +LDFLAGS_vmlinuz.efi.elf := -T $(srctree)/drivers/firmware/efi/libstub/zboot.lds +$(obj)/vmlinuz.efi.elf: $(obj)/vmlinuz.o $(ZBOOT_DEPS) FORCE + $(call if_changed,ld) + +ZBOOT_EFI-y := vmlinuz.efi +ZBOOT_EFI-$(CONFIG_EFI_ZBOOT_SIGNED) := vmlinuz.efi.unsigned + +OBJCOPYFLAGS_$(ZBOOT_EFI-y) := -O binary +$(obj)/$(ZBOOT_EFI-y): $(obj)/vmlinuz.efi.elf FORCE + $(call if_changed,objcopy) + +targets += zboot-header.o vmlinuz vmlinuz.o vmlinuz.efi.elf vmlinuz.efi + +ifneq ($(CONFIG_EFI_ZBOOT_SIGNED),) +$(obj)/vmlinuz.efi: $(obj)/vmlinuz.efi.unsigned FORCE + $(call if_changed,sbsign) + +targets += $(EFI_ZBOOT_PAYLOAD).signed vmlinuz.efi.unsigned +endif diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h index 2d7abcd99de9b9d30cd4f23cb51d29ef6ac153b5..fc48e8e4c4906ee91cece96d1a913b90f049bd47 100644 --- a/drivers/firmware/efi/libstub/efistub.h +++ b/drivers/firmware/efi/libstub/efistub.h @@ -168,6 +168,23 @@ struct efi_boot_memmap { typedef struct efi_generic_dev_path efi_device_path_protocol_t; +union efi_device_path_to_text_protocol { + struct { + efi_char16_t *(__efiapi *convert_device_node_to_text)( + const efi_device_path_protocol_t *, + bool, bool); + efi_char16_t *(__efiapi *convert_device_path_to_text)( + const efi_device_path_protocol_t *, + bool, bool); + }; + struct { + u32 convert_device_node_to_text; + u32 convert_device_path_to_text; + } mixed_mode; +}; + +typedef union efi_device_path_to_text_protocol efi_device_path_to_text_protocol_t; + typedef void *efi_event_t; /* Note that notifications won't work in mixed mode */ typedef void (__efiapi *efi_event_notify_t)(efi_event_t, void *); @@ -251,13 +268,17 @@ union efi_boot_services { efi_handle_t *); efi_status_t (__efiapi *install_configuration_table)(efi_guid_t *, void *); - void *load_image; - void *start_image; + efi_status_t (__efiapi *load_image)(bool, efi_handle_t, + efi_device_path_protocol_t *, + void *, unsigned long, + efi_handle_t *); + efi_status_t (__efiapi *start_image)(efi_handle_t, unsigned long *, + efi_char16_t **); efi_status_t __noreturn (__efiapi *exit)(efi_handle_t, efi_status_t, unsigned long, efi_char16_t *); - void *unload_image; + efi_status_t (__efiapi *unload_image)(efi_handle_t); efi_status_t (__efiapi *exit_boot_services)(efi_handle_t, unsigned long); void 
*get_next_monotonic_count; @@ -274,11 +295,11 @@ union efi_boot_services { void *locate_handle_buffer; efi_status_t (__efiapi *locate_protocol)(efi_guid_t *, void *, void **); - void *install_multiple_protocol_interfaces; - void *uninstall_multiple_protocol_interfaces; + efi_status_t (__efiapi *install_multiple_protocol_interfaces)(efi_handle_t *, ...); + efi_status_t (__efiapi *uninstall_multiple_protocol_interfaces)(efi_handle_t, ...); void *calculate_crc32; - void *copy_mem; - void *set_mem; + void (__efiapi *copy_mem)(void *, const void *, unsigned long); + void (__efiapi *set_mem)(void *, unsigned long, unsigned char); void *create_event_ex; }; struct { diff --git a/drivers/firmware/efi/libstub/file.c b/drivers/firmware/efi/libstub/file.c index dd95f330fe6e173ef8bed69f2b59b7e2cbaddb57..42b3338273aaa0780a13e3c96fc9ce4793d2ef4c 100644 --- a/drivers/firmware/efi/libstub/file.c +++ b/drivers/firmware/efi/libstub/file.c @@ -66,10 +66,27 @@ static efi_status_t efi_open_file(efi_file_protocol_t *volume, static efi_status_t efi_open_volume(efi_loaded_image_t *image, efi_file_protocol_t **fh) { + struct efi_vendor_dev_path *dp = image->file_path; + efi_guid_t li_proto = LOADED_IMAGE_PROTOCOL_GUID; efi_guid_t fs_proto = EFI_FILE_SYSTEM_GUID; efi_simple_file_system_protocol_t *io; efi_status_t status; + // If we are using EFI zboot, we should look for the file system + // protocol on the parent image's handle instead + if (IS_ENABLED(CONFIG_EFI_ZBOOT) && + image->parent_handle != NULL && + dp->header.type == EFI_DEV_MEDIA && + dp->header.sub_type == EFI_DEV_MEDIA_VENDOR && + !efi_guidcmp(dp->vendorguid, LINUX_EFI_ZBOOT_MEDIA_GUID)) { + status = efi_bs_call(handle_protocol, image->parent_handle, + &li_proto, (void *)&image); + if (status != EFI_SUCCESS) { + efi_err("Failed to locate parent image handle\n"); + return status; + } + } + status = efi_bs_call(handle_protocol, image->device_handle, &fs_proto, (void **)&io); if (status != EFI_SUCCESS) { diff --git a/drivers/firmware/efi/libstub/intrinsics.c b/drivers/firmware/efi/libstub/intrinsics.c new file mode 100644 index 0000000000000000000000000000000000000000..a04ab39292b62d2bf53d69e461967befc4dd1c0f --- /dev/null +++ b/drivers/firmware/efi/libstub/intrinsics.c @@ -0,0 +1,30 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include + +#include "efistub.h" + +#ifdef CONFIG_KASAN +#undef memcpy +#undef memmove +#undef memset +void *__memcpy(void *__dest, const void *__src, size_t __n) __alias(memcpy); +void *__memmove(void *__dest, const void *__src, size_t count) __alias(memmove); +void *__memset(void *s, int c, size_t count) __alias(memset); +#endif + +void *memcpy(void *dst, const void *src, size_t len) +{ + efi_bs_call(copy_mem, dst, src, len); + return dst; +} + +extern void *memmove(void *dst, const void *src, size_t len) __alias(memcpy); + +void *memset(void *dst, int c, size_t len) +{ + efi_bs_call(set_mem, dst, len, c & U8_MAX); + return dst; +} diff --git a/drivers/firmware/efi/libstub/zboot-header.S b/drivers/firmware/efi/libstub/zboot-header.S new file mode 100644 index 0000000000000000000000000000000000000000..b69c609ba8fd236a7ba35e0f430fa74f2373abad --- /dev/null +++ b/drivers/firmware/efi/libstub/zboot-header.S @@ -0,0 +1,138 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include + +#ifdef CONFIG_64BIT + .set .Lextra_characteristics, 0x0 + .set .Lpe_opt_magic, PE_OPT_MAGIC_PE32PLUS +#else + .set .Lextra_characteristics, IMAGE_FILE_32BIT_MACHINE + .set .Lpe_opt_magic, PE_OPT_MAGIC_PE32 +#endif + + .section ".head", "a" 
+ .globl __efistub_efi_zboot_header +__efistub_efi_zboot_header: +.Ldoshdr: + .long MZ_MAGIC + .ascii "zimg" // image type + .long __efistub__gzdata_start - .Ldoshdr // payload offset + .long __efistub__gzdata_size - ZBOOT_SIZE_LEN // payload size + .long 0, 0 // reserved + .asciz COMP_TYPE // compression type + .org .Ldoshdr + 0x3c + .long .Lpehdr - .Ldoshdr // PE header offset + +.Lpehdr: + .long PE_MAGIC + .short MACHINE_TYPE + .short .Lsection_count + .long 0 + .long 0 + .long 0 + .short .Lsection_table - .Loptional_header + .short IMAGE_FILE_DEBUG_STRIPPED | \ + IMAGE_FILE_EXECUTABLE_IMAGE | \ + IMAGE_FILE_LINE_NUMS_STRIPPED |\ + .Lextra_characteristics + +.Loptional_header: + .short .Lpe_opt_magic + .byte 0, 0 + .long _etext - .Lefi_header_end + .long __data_size + .long 0 + .long __efistub_efi_zboot_entry - .Ldoshdr + .long .Lefi_header_end - .Ldoshdr + +#ifdef CONFIG_64BIT + .quad 0 +#else + .long _etext - .Ldoshdr, 0x0 +#endif + .long 4096 + .long 512 + .short 0, 0 + .short LINUX_EFISTUB_MAJOR_VERSION // MajorImageVersion + .short LINUX_EFISTUB_MINOR_VERSION // MinorImageVersion + .short 0, 0 + .long 0 + .long _end - .Ldoshdr + + .long .Lefi_header_end - .Ldoshdr + .long 0 + .short IMAGE_SUBSYSTEM_EFI_APPLICATION + .short 0 + .quad 0, 0, 0, 0 + .long 0 + .long (.Lsection_table - .) / 8 + + .quad 0 // ExportTable + .quad 0 // ImportTable + .quad 0 // ResourceTable + .quad 0 // ExceptionTable + .quad 0 // CertificationTable + .quad 0 // BaseRelocationTable +#ifdef CONFIG_DEBUG_EFI + .long .Lefi_debug_table - .Ldoshdr // DebugTable + .long .Lefi_debug_table_size +#endif + +.Lsection_table: + .ascii ".text\0\0\0" + .long _etext - .Lefi_header_end + .long .Lefi_header_end - .Ldoshdr + .long _etext - .Lefi_header_end + .long .Lefi_header_end - .Ldoshdr + + .long 0, 0 + .short 0, 0 + .long IMAGE_SCN_CNT_CODE | \ + IMAGE_SCN_MEM_READ | \ + IMAGE_SCN_MEM_EXECUTE + + .ascii ".data\0\0\0" + .long __data_size + .long _etext - .Ldoshdr + .long __data_rawsize + .long _etext - .Ldoshdr + + .long 0, 0 + .short 0, 0 + .long IMAGE_SCN_CNT_INITIALIZED_DATA | \ + IMAGE_SCN_MEM_READ | \ + IMAGE_SCN_MEM_WRITE + + .set .Lsection_count, (. - .Lsection_table) / 40 + +#ifdef CONFIG_DEBUG_EFI + .section ".rodata", "a" + .align 2 +.Lefi_debug_table: + // EFI_IMAGE_DEBUG_DIRECTORY_ENTRY + .long 0 // Characteristics + .long 0 // TimeDateStamp + .short 0 // MajorVersion + .short 0 // MinorVersion + .long IMAGE_DEBUG_TYPE_CODEVIEW // Type + .long .Lefi_debug_entry_size // SizeOfData + .long 0 // RVA + .long .Lefi_debug_entry - .Ldoshdr // FileOffset + + .set .Lefi_debug_table_size, . - .Lefi_debug_table + .previous + +.Lefi_debug_entry: + // EFI_IMAGE_DEBUG_CODEVIEW_NB10_ENTRY + .ascii "NB10" // Signature + .long 0 // Unknown + .long 0 // Unknown2 + .long 0 // Unknown3 + + .asciz ZBOOT_EFI_PATH + + .set .Lefi_debug_entry_size, . 
- .Lefi_debug_entry +#endif + + .p2align 12 +.Lefi_header_end: diff --git a/drivers/firmware/efi/libstub/zboot.c b/drivers/firmware/efi/libstub/zboot.c new file mode 100644 index 0000000000000000000000000000000000000000..8c23617ac4870bf73fe9704abff2c827f781c61b --- /dev/null +++ b/drivers/firmware/efi/libstub/zboot.c @@ -0,0 +1,289 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include + +#include "efistub.h" + +static unsigned char zboot_heap[SZ_256K] __aligned(64); +static unsigned long free_mem_ptr, free_mem_end_ptr; + +#define STATIC static +#if defined(CONFIG_KERNEL_GZIP) +#include "../../../../lib/decompress_inflate.c" +#elif defined(CONFIG_KERNEL_LZ4) +#include "../../../../lib/decompress_unlz4.c" +#elif defined(CONFIG_KERNEL_LZMA) +#include "../../../../lib/decompress_unlzma.c" +#elif defined(CONFIG_KERNEL_LZO) +#include "../../../../lib/decompress_unlzo.c" +#elif defined(CONFIG_KERNEL_XZ) +#undef memcpy +#define memcpy memcpy +#undef memmove +#define memmove memmove +#include "../../../../lib/decompress_unxz.c" +#elif defined(CONFIG_KERNEL_ZSTD) +#include "../../../../lib/decompress_unzstd.c" +#endif + +extern char efi_zboot_header[]; +extern char _gzdata_start[], _gzdata_end[]; + +static void log(efi_char16_t str[]) +{ + efi_call_proto(efi_table_attr(efi_system_table, con_out), + output_string, L"EFI decompressor: "); + efi_call_proto(efi_table_attr(efi_system_table, con_out), + output_string, str); + efi_call_proto(efi_table_attr(efi_system_table, con_out), + output_string, L"\n"); +} + +static void error(char *x) +{ + log(L"error() called from decompressor library\n"); +} + +// Local version to avoid pulling in memcmp() +static bool guids_eq(const efi_guid_t *a, const efi_guid_t *b) +{ + const u32 *l = (u32 *)a; + const u32 *r = (u32 *)b; + + return l[0] == r[0] && l[1] == r[1] && l[2] == r[2] && l[3] == r[3]; +} + +static efi_status_t __efiapi +load_file(efi_load_file_protocol_t *this, efi_device_path_protocol_t *rem, + bool boot_policy, unsigned long *bufsize, void *buffer) +{ + struct efi_vendor_dev_path *vendor_dp; + bool decompress = false; + unsigned long size; + int ret; + + if (rem == NULL || bufsize == NULL) + return EFI_INVALID_PARAMETER; + + if (boot_policy) + return EFI_UNSUPPORTED; + + // Look for our vendor media device node in the remaining file path + if (rem->type == EFI_DEV_MEDIA && + rem->sub_type == EFI_DEV_MEDIA_VENDOR) { + vendor_dp = container_of(rem, struct efi_vendor_dev_path, header); + if (!guids_eq(&vendor_dp->vendorguid, &LINUX_EFI_ZBOOT_MEDIA_GUID)) + return EFI_NOT_FOUND; + + decompress = true; + rem = (void *)(vendor_dp + 1); + } + + if (rem->type != EFI_DEV_END_PATH || + rem->sub_type != EFI_DEV_END_ENTIRE) + return EFI_NOT_FOUND; + + // The uncompressed size of the payload is appended to the raw bit + // stream, and may therefore appear misaligned in memory + size = decompress ? get_unaligned_le32(_gzdata_end - 4) + : (_gzdata_end - _gzdata_start); + if (buffer == NULL || *bufsize < size) { + *bufsize = size; + return EFI_BUFFER_TOO_SMALL; + } + + if (decompress) { + ret = __decompress(_gzdata_start, _gzdata_end - _gzdata_start, + NULL, NULL, buffer, 0, NULL, error); + if (ret < 0) { + log(L"Decompression failed"); + return EFI_DEVICE_ERROR; + } + } else { + memcpy(buffer, _gzdata_start, size); + } + + return EFI_SUCCESS; +} + +// Return the length in bytes of the device path up to the first end node. 
+static int device_path_length(const efi_device_path_protocol_t *dp) +{ + int len = 0; + + while (dp->type != EFI_DEV_END_PATH) { + len += dp->length; + dp = (void *)((u8 *)dp + dp->length); + } + return len; +} + +static void append_rel_offset_node(efi_device_path_protocol_t **dp, + unsigned long start, unsigned long end) +{ + struct efi_rel_offset_dev_path *rodp = (void *)*dp; + + rodp->header.type = EFI_DEV_MEDIA; + rodp->header.sub_type = EFI_DEV_MEDIA_REL_OFFSET; + rodp->header.length = sizeof(struct efi_rel_offset_dev_path); + rodp->reserved = 0; + rodp->starting_offset = start; + rodp->ending_offset = end; + + *dp = (void *)(rodp + 1); +} + +static void append_ven_media_node(efi_device_path_protocol_t **dp, + efi_guid_t *guid) +{ + struct efi_vendor_dev_path *vmdp = (void *)*dp; + + vmdp->header.type = EFI_DEV_MEDIA; + vmdp->header.sub_type = EFI_DEV_MEDIA_VENDOR; + vmdp->header.length = sizeof(struct efi_vendor_dev_path); + vmdp->vendorguid = *guid; + + *dp = (void *)(vmdp + 1); +} + +static void append_end_node(efi_device_path_protocol_t **dp) +{ + (*dp)->type = EFI_DEV_END_PATH; + (*dp)->sub_type = EFI_DEV_END_ENTIRE; + (*dp)->length = sizeof(struct efi_generic_dev_path); + + ++*dp; +} + +asmlinkage efi_status_t __efiapi +efi_zboot_entry(efi_handle_t handle, efi_system_table_t *systab) +{ + efi_device_path_protocol_t *parent_dp, *dpp, *lf2_dp, *li_dp; + efi_load_file2_protocol_t zboot_load_file2; + efi_loaded_image_t *parent, *child; + unsigned long exit_data_size; + efi_handle_t child_handle; + efi_handle_t zboot_handle; + efi_char16_t *exit_data; + efi_status_t status; + void *dp_alloc; + int dp_len; + + WRITE_ONCE(efi_system_table, systab); + + free_mem_ptr = (unsigned long)&zboot_heap; + free_mem_end_ptr = free_mem_ptr + sizeof(zboot_heap); + + exit_data = NULL; + exit_data_size = 0; + + status = efi_bs_call(handle_protocol, handle, + &LOADED_IMAGE_PROTOCOL_GUID, (void **)&parent); + if (status != EFI_SUCCESS) { + log(L"Failed to locate parent's loaded image protocol"); + return status; + } + + status = efi_bs_call(handle_protocol, handle, + &LOADED_IMAGE_DEVICE_PATH_PROTOCOL_GUID, + (void **)&parent_dp); + if (status != EFI_SUCCESS) { + log(L"Failed to locate parent's loaded image device path protocol"); + return status; + } + + // Allocate some pool memory for device path protocol data + dp_len = parent_dp ? 
device_path_length(parent_dp) : 0; + status = efi_bs_call(allocate_pool, EFI_LOADER_DATA, + 2 * (dp_len + sizeof(struct efi_rel_offset_dev_path) + + sizeof(struct efi_generic_dev_path)) + + sizeof(struct efi_vendor_dev_path), + (void **)&dp_alloc); + if (status != EFI_SUCCESS) { + log(L"Failed to allocate device path pool memory"); + return status; + } + + // Create a device path describing the compressed payload in this image + // <...parent_dp...>/Offset(, ) + lf2_dp = memcpy(dp_alloc, parent_dp, dp_len); + dpp = (void *)((u8 *)lf2_dp + dp_len); + append_rel_offset_node(&dpp, + (unsigned long)(_gzdata_start - efi_zboot_header), + (unsigned long)(_gzdata_end - efi_zboot_header - 1)); + append_end_node(&dpp); + + // Create a device path describing the decompressed payload in this image + // <...parent_dp...>/Offset(, )/VenMedia(ZBOOT_MEDIA_GUID) + dp_len += sizeof(struct efi_rel_offset_dev_path); + li_dp = memcpy(dpp, lf2_dp, dp_len); + dpp = (void *)((u8 *)li_dp + dp_len); + append_ven_media_node(&dpp, &LINUX_EFI_ZBOOT_MEDIA_GUID); + append_end_node(&dpp); + + zboot_handle = NULL; + zboot_load_file2.load_file = load_file; + status = efi_bs_call(install_multiple_protocol_interfaces, + &zboot_handle, + &EFI_DEVICE_PATH_PROTOCOL_GUID, lf2_dp, + &EFI_LOAD_FILE2_PROTOCOL_GUID, &zboot_load_file2, + NULL); + if (status != EFI_SUCCESS) { + log(L"Failed to install LoadFile2 protocol and device path"); + goto free_dpalloc; + } + + status = efi_bs_call(load_image, false, handle, li_dp, NULL, 0, + &child_handle); + if (status != EFI_SUCCESS) { + log(L"Failed to load image"); + goto uninstall_lf2; + } + + status = efi_bs_call(handle_protocol, child_handle, + &LOADED_IMAGE_PROTOCOL_GUID, (void **)&child); + if (status != EFI_SUCCESS) { + log(L"Failed to locate child's loaded image protocol"); + goto unload_image; + } + + // Copy the kernel command line + child->load_options = parent->load_options; + child->load_options_size = parent->load_options_size; + + status = efi_bs_call(start_image, child_handle, &exit_data_size, + &exit_data); + if (status != EFI_SUCCESS) { + log(L"StartImage() returned with error"); + if (exit_data_size > 0) + log(exit_data); + + // If StartImage() returns EFI_SECURITY_VIOLATION, the image is + // not unloaded so we need to do it by hand. + if (status == EFI_SECURITY_VIOLATION) +unload_image: + efi_bs_call(unload_image, child_handle); + } + +uninstall_lf2: + efi_bs_call(uninstall_multiple_protocol_interfaces, + zboot_handle, + &EFI_DEVICE_PATH_PROTOCOL_GUID, lf2_dp, + &EFI_LOAD_FILE2_PROTOCOL_GUID, &zboot_load_file2, + NULL); + +free_dpalloc: + efi_bs_call(free_pool, dp_alloc); + + efi_bs_call(exit, handle, status, exit_data_size, exit_data); + + // Free ExitData in case Exit() returned with a failure code, + // but return the original status code. + log(L"Exit() returned with failure code"); + if (exit_data != NULL) + efi_bs_call(free_pool, exit_data); + return status; +} diff --git a/drivers/firmware/efi/libstub/zboot.lds b/drivers/firmware/efi/libstub/zboot.lds new file mode 100644 index 0000000000000000000000000000000000000000..509996988a9b680c86f2dedbe076268f36afb9c2 --- /dev/null +++ b/drivers/firmware/efi/libstub/zboot.lds @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +ENTRY(__efistub_efi_zboot_header); + +SECTIONS +{ + .text : ALIGN(4096) { + *(.head) + *(.text* .init.text*) + } + + .rodata : ALIGN(8) { + __efistub__gzdata_start = .; + *(.gzdata) + __efistub__gzdata_end = .; + *(.rodata* .init.rodata* .srodata*) + _etext = ALIGN(4096); + . 
= _etext; + } + + .data : ALIGN(4096) { + *(.data* .init.data*) + _edata = ALIGN(512); + . = _edata; + } + + .bss : { + *(.bss* .init.bss*) + _end = ALIGN(512); + . = _end; + } + + /DISCARD/ : { + *(.modinfo .init.modinfo) + } +} + +PROVIDE(__efistub__gzdata_size = ABSOLUTE(. - __efistub__gzdata_start)); + +PROVIDE(__data_rawsize = ABSOLUTE(_edata - _etext)); +PROVIDE(__data_size = ABSOLUTE(_end - _etext)); diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c index f56414a06ec416c74cdcfccfa34e817bf1def64f..9bc4a1cd9ac65eea8f41563ba76abf9d54963b36 100644 --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c @@ -831,6 +831,8 @@ static int dpu_crtc_atomic_check(struct drm_crtc *crtc, struct drm_rect crtc_rect = { 0 }; pstates = kzalloc(sizeof(*pstates) * DPU_STAGE_MAX * 4, GFP_KERNEL); + if (!pstates) + return -ENOMEM; if (!state->enable || !state->active) { DPU_DEBUG("crtc%d -> enable %d, active %d, skip atomic_check\n", diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c index 5c42877fd6fbf6fc00f5439ab869e64114c60822..de402657091e6a6e6ffd21a068a430dc06e581a9 100644 --- a/drivers/gpu/drm/radeon/cik.c +++ b/drivers/gpu/drm/radeon/cik.c @@ -8108,6 +8108,7 @@ int cik_irq_process(struct radeon_device *rdev) if (queue_thermal) schedule_work(&rdev->pm.dpm.thermal.work); rdev->ih.rptr = rptr; + WREG32(IH_RB_RPTR, rptr); atomic_set(&rdev->ih.lock, 0); /* make sure wptr hasn't changed while processing */ diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c index 14d90dc376e7165a613cf1ed9e9a7862e9893577..11e3e99a9f0194f57dda68f50e6e5439894f0387 100644 --- a/drivers/gpu/drm/radeon/evergreen.c +++ b/drivers/gpu/drm/radeon/evergreen.c @@ -4919,6 +4919,7 @@ int evergreen_irq_process(struct radeon_device *rdev) if (queue_thermal && rdev->pm.dpm_enabled) schedule_work(&rdev->pm.dpm.thermal.work); rdev->ih.rptr = rptr; + WREG32(IH_RB_RPTR, rptr); atomic_set(&rdev->ih.lock, 0); /* make sure wptr hasn't changed while processing */ diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c index d9a33ca768f345c2dcb34aa0bec01e705945d92a..cd5418ccf20ee2a00076eb760280e0ee7a3799b7 100644 --- a/drivers/gpu/drm/radeon/r600.c +++ b/drivers/gpu/drm/radeon/r600.c @@ -4331,6 +4331,7 @@ int r600_irq_process(struct radeon_device *rdev) if (queue_thermal && rdev->pm.dpm_enabled) schedule_work(&rdev->pm.dpm.thermal.work); rdev->ih.rptr = rptr; + WREG32(IH_RB_RPTR, rptr); atomic_set(&rdev->ih.lock, 0); /* make sure wptr hasn't changed while processing */ diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c index 93dcab548a835abb604c5ba0068ea3c4c7e02d75..914b861df92cc6f85c424092a174cf31a86fc549 100644 --- a/drivers/gpu/drm/radeon/si.c +++ b/drivers/gpu/drm/radeon/si.c @@ -6443,6 +6443,7 @@ int si_irq_process(struct radeon_device *rdev) if (queue_thermal && rdev->pm.dpm_enabled) schedule_work(&rdev->pm.dpm.thermal.work); rdev->ih.rptr = rptr; + WREG32(IH_RB_RPTR, rptr); atomic_set(&rdev->ih.lock, 0); /* make sure wptr hasn't changed while processing */ diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c b/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c index e58112997c88136e4ecce5c238e99d366f2f5b6c..0e963fd7db17e73ffb1bfe7deac81843bd08b908 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c @@ -182,7 +182,8 @@ void vmw_kms_cursor_snoop(struct vmw_surface *srf, if (cmd->dma.guest.ptr.offset % PAGE_SIZE || box->x != 0 || box->y != 0 || 
box->z != 0 || box->srcx != 0 || box->srcy != 0 || box->srcz != 0 || - box->d != 1 || box_count != 1) { + box->d != 1 || box_count != 1 || + box->w > 64 || box->h > 64) { /* TODO handle none page aligned offsets */ /* TODO handle more dst & src != 0 */ /* TODO handle more then one copy */ diff --git a/drivers/hid/hid-roccat.c b/drivers/hid/hid-roccat.c index 26373b82fe812510a6755a15ef96755e3396231e..6da80e442fdd1065ed5b5e8b871e402f74b099cf 100644 --- a/drivers/hid/hid-roccat.c +++ b/drivers/hid/hid-roccat.c @@ -257,6 +257,8 @@ int roccat_report_event(int minor, u8 const *data) if (!new_value) return -ENOMEM; + mutex_lock(&device->cbuf_lock); + report = &device->cbuf[device->cbuf_end]; /* passing NULL is safe */ @@ -276,6 +278,8 @@ int roccat_report_event(int minor, u8 const *data) reader->cbuf_start = (reader->cbuf_start + 1) % ROCCAT_CBUF_SIZE; } + mutex_unlock(&device->cbuf_lock); + wake_up_interruptible(&device->wait); return 0; } diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index 4e239bd75b1dae2cf8f3108b0cdefbba78d4afad..7c4d870e9b2028b2fca0799f57968d1273b25d6c 100644 --- a/drivers/hwmon/k10temp.c +++ b/drivers/hwmon/k10temp.c @@ -76,6 +76,11 @@ static DEFINE_MUTEX(nb_smu_ind_mutex); #define ZEN_CUR_TEMP_SHIFT 21 #define ZEN_CUR_TEMP_RANGE_SEL_MASK BIT(19) +struct hygon_private { + u32 index_2nd; + u32 offset_2nd; +}; + struct k10temp_data { struct pci_dev *pdev; void (*read_htcreg)(struct pci_dev *pdev, u32 *regval); @@ -85,6 +90,7 @@ struct k10temp_data { u32 show_temp; bool is_zen; u32 ccd_offset; + void *priv; }; #define TCTL_BIT 0 @@ -191,6 +197,23 @@ static int k10temp_read_labels(struct device *dev, return 0; } +static void hygon_read_temp(struct k10temp_data *data, int channel, + u32 *regval) +{ + struct hygon_private *h_priv; + + h_priv = (struct hygon_private *)data->priv; + if ((channel - 2) < h_priv->index_2nd) + amd_smn_read(amd_pci_dev_to_node_id(data->pdev), + ZEN_CCD_TEMP(data->ccd_offset, channel - 2), + regval); + else + amd_smn_read(amd_pci_dev_to_node_id(data->pdev), + ZEN_CCD_TEMP(h_priv->offset_2nd, + channel - 2 - h_priv->index_2nd), + regval); +} + static int k10temp_read_temp(struct device *dev, u32 attr, int channel, long *val) { @@ -211,7 +234,10 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel, *val = 0; break; case 2 ... 
13: /* Tccd{1-12} */ - amd_smn_read(amd_pci_dev_to_node_id(data->pdev), + if (hygon_f18h_m4h()) + hygon_read_temp(data, channel, &regval); + else + amd_smn_read(amd_pci_dev_to_node_id(data->pdev), ZEN_CCD_TEMP(data->ccd_offset, channel - 2), &regval); *val = (regval & ZEN_CCD_TEMP_MASK) * 125 - 49000; @@ -378,14 +404,48 @@ static void k10temp_get_ccd_support(struct pci_dev *pdev, } } +static void k10temp_get_ccd_support_2nd(struct pci_dev *pdev, + struct k10temp_data *data, int limit) +{ + struct hygon_private *h_priv; + u32 regval; + int i; + + h_priv = (struct hygon_private *)data->priv; + for (i = h_priv->index_2nd; i < limit; i++) { + amd_smn_read(amd_pci_dev_to_node_id(pdev), + ZEN_CCD_TEMP(h_priv->offset_2nd, + i - h_priv->index_2nd), + &regval); + if (regval & ZEN_CCD_TEMP_VALID) + data->show_temp |= BIT(TCCD_BIT(i)); + } +} + static int k10temp_probe(struct pci_dev *pdev, const struct pci_device_id *id) { int unreliable = has_erratum_319(pdev); struct device *dev = &pdev->dev; + struct hygon_private *h_priv; struct k10temp_data *data; struct device *hwmon_dev; + u8 df_id; int i; + if (hygon_f18h_m4h()) { + if (get_df_id(pdev, &df_id)) { + pr_err("Get DF ID failed.\n"); + return -ENODEV; + } + + /* + * The temperature should be read from the devices + * with id < 4. + */ + if (df_id >= 4) + return 0; + } + if (unreliable) { if (!force) { dev_err(dev, @@ -408,7 +468,7 @@ static int k10temp_probe(struct pci_dev *pdev, const struct pci_device_id *id) (boot_cpu_data.x86_model & 0xf0) == 0x70)) { data->read_htcreg = read_htcreg_nb_f15; data->read_tempreg = read_tempreg_nb_f15; - } else if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) { + } else if (boot_cpu_data.x86 == 0x17) { data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; data->read_tempreg = read_tempreg_nb_zen; data->is_zen = true; @@ -429,6 +489,27 @@ static int k10temp_probe(struct pci_dev *pdev, const struct pci_device_id *id) k10temp_get_ccd_support(pdev, data, 8); break; } + } else if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON && + boot_cpu_data.x86 == 0x18) { + data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; + data->read_tempreg = read_tempreg_nb_zen; + data->is_zen = true; + + switch (boot_cpu_data.x86_model) { + case 0x4: + case 0x5: + data->ccd_offset = 0x154; + data->priv = devm_kzalloc(dev, sizeof(*h_priv), + GFP_KERNEL); + if (!data->priv) + return -ENOMEM; + h_priv = (struct hygon_private *)data->priv; + h_priv->offset_2nd = 0x2f8; + h_priv->index_2nd = 3; + k10temp_get_ccd_support(pdev, data, h_priv->index_2nd); + k10temp_get_ccd_support_2nd(pdev, data, 8); + break; + } } else if (boot_cpu_data.x86 == 0x19) { data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; data->read_tempreg = read_tempreg_nb_zen; @@ -494,6 +575,8 @@ static const struct pci_device_id k10temp_id_table[] = { { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_19H_M40H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_19H_M50H_DF_F3) }, { PCI_VDEVICE(HYGON, PCI_DEVICE_ID_AMD_17H_DF_F3) }, + { PCI_VDEVICE(HYGON, PCI_DEVICE_ID_AMD_17H_M30H_DF_F3) }, + { PCI_VDEVICE(HYGON, PCI_DEVICE_ID_HYGON_18H_M05H_DF_F3) }, {} }; MODULE_DEVICE_TABLE(pci, k10temp_id_table); diff --git a/drivers/hwmon/xgene-hwmon.c b/drivers/hwmon/xgene-hwmon.c index 37a946647a69f3b15c5a2d278f40621e089d5f95..9c37e2afc5575ab010c6e33d031d0d3ed75189a5 100644 --- a/drivers/hwmon/xgene-hwmon.c +++ b/drivers/hwmon/xgene-hwmon.c @@ -759,6 +759,7 @@ static int xgene_hwmon_remove(struct platform_device *pdev) { struct xgene_hwmon_dev *ctx = platform_get_drvdata(pdev); +
cancel_work_sync(&ctx->workq); hwmon_device_unregister(ctx->hwmon_dev); kfifo_free(&ctx->async_msg_fifo); if (acpi_disabled) diff --git a/drivers/i2c/busses/i2c-piix4.c b/drivers/i2c/busses/i2c-piix4.c index 8c1b31ed0c429a73240e4f5509dcf18d26efc467..defce8c11a9b9180efd8528b4e57441ba71a8849 100644 --- a/drivers/i2c/busses/i2c-piix4.c +++ b/drivers/i2c/busses/i2c-piix4.c @@ -924,8 +924,7 @@ static int piix4_probe(struct pci_dev *dev, const struct pci_device_id *id) bool notify_imc = false; is_sb800 = true; - if ((dev->vendor == PCI_VENDOR_ID_AMD || - dev->vendor == PCI_VENDOR_ID_HYGON) && + if (dev->vendor == PCI_VENDOR_ID_AMD && dev->device == PCI_DEVICE_ID_AMD_KERNCZ_SMBUS) { u8 imc; diff --git a/drivers/i2c/busses/i2c-xgene-slimpro.c b/drivers/i2c/busses/i2c-xgene-slimpro.c index a7c24396f7643beca9f6dafcaa33be8ada06b7f8..145aaa1b1bb6fc1f5457c1cb2715637b02d05c5b 100644 --- a/drivers/i2c/busses/i2c-xgene-slimpro.c +++ b/drivers/i2c/busses/i2c-xgene-slimpro.c @@ -309,6 +309,9 @@ static int slimpro_i2c_blkwr(struct slimpro_i2c_dev *ctx, u32 chip, u32 msg[3]; int rc; + if (writelen > I2C_SMBUS_BLOCK_MAX) + return -EINVAL; + memcpy(ctx->dma_buffer, data, writelen); paddr = dma_map_single(ctx->dev, ctx->dma_buffer, writelen, DMA_TO_DEVICE); diff --git a/drivers/infiniband/hw/erdma/Makefile b/drivers/infiniband/hw/erdma/Makefile index b272645f13217214295bf3191ce03db998b65ece..37759e9cdbc6194b84a4785e5142b8cc4048c33f 100644 --- a/drivers/infiniband/hw/erdma/Makefile +++ b/drivers/infiniband/hw/erdma/Makefile @@ -3,4 +3,5 @@ obj-$(CONFIG_INFINIBAND_ERDMA) += erdma.o erdma-y :=\ erdma_cm.o erdma_cq.o erdma_main.o erdma_qp.o erdma_verbs.o erdma_compat.o\ - erdma_cmdq.o erdma_eq.o erdma_ioctl.o erdma_stats.o + erdma_cmdq.o erdma_eq.o erdma_ioctl.o erdma_stats.o erdma_cmd.o\ + erdma_debugfs.o diff --git a/drivers/infiniband/hw/erdma/erdma.h b/drivers/infiniband/hw/erdma/erdma.h index a96a6bd5a2f674f37b3b987531a9296ac87ff418..7367443fa1af8fc0f6b1f09be20139cbbee1aafd 100644 --- a/drivers/infiniband/hw/erdma/erdma.h +++ b/drivers/infiniband/hw/erdma/erdma.h @@ -246,6 +246,8 @@ struct erdma_dev { struct dma_pool *db_pool; struct dma_pool *resp_pool; + + struct dentry *dbg_root; }; static inline void *get_queue_entry(void *qbuf, u32 idx, u32 depth, u32 shift) @@ -311,4 +313,16 @@ void erdma_ceq_completion_handler(struct erdma_eq_cb *ceq_cb); void erdma_chrdev_destroy(void); int erdma_chrdev_init(void); +int erdma_query_resource(struct erdma_dev *dev, u32 mod, u32 op, u32 index, + void *out, u32 len); +int erdma_query_ext_attr(struct erdma_dev *dev, void *out); +int erdma_set_dack_count(struct erdma_dev *dev, u32 value); + +void erdma_debugfs_register(void); +void erdma_debugfs_unregister(void); + +int erdma_debugfs_files_create(struct erdma_dev *dev); +void erdma_debugfs_files_destroy(struct erdma_dev *dev); +extern struct dentry *erdma_debugfs_root; + #endif diff --git a/drivers/infiniband/hw/erdma/erdma_cmd.c b/drivers/infiniband/hw/erdma/erdma_cmd.c new file mode 100644 index 0000000000000000000000000000000000000000..c9c36fe49675feb438eac03578a534e51a4fe12c --- /dev/null +++ b/drivers/infiniband/hw/erdma/erdma_cmd.c @@ -0,0 +1,81 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause + +/* Authors: Cheng Xu */ +/* Kai Shen */ +/* Copyright (c) 2020-2022, Alibaba Group. 
*/ + +#include + +#include "erdma.h" + +int erdma_query_resource(struct erdma_dev *dev, u32 mod, u32 op, u32 index, + void *out, u32 len) +{ + struct erdma_cmdq_query_req req; + dma_addr_t dma_addr; + void *resp; + int err; + + erdma_cmdq_build_reqhdr(&req.hdr, mod, op); + + resp = dma_pool_alloc(dev->resp_pool, GFP_KERNEL, &dma_addr); + if (!resp) + return -ENOMEM; + + req.index = index; + req.target_addr = dma_addr; + req.target_length = ERDMA_HW_RESP_SIZE; + + err = erdma_post_cmd_wait(&dev->cmdq, &req, sizeof(req), NULL, NULL); + if (err) + goto out; + + if (out) + memcpy(out, resp, len); + +out: + dma_pool_free(dev->resp_pool, resp, dma_addr); + + return err; +} + +int erdma_query_ext_attr(struct erdma_dev *dev, void *out) +{ + BUILD_BUG_ON(sizeof(struct erdma_cmdq_query_ext_attr_resp) > + ERDMA_HW_RESP_SIZE); + + return erdma_query_resource( + dev, CMDQ_SUBMOD_COMMON, CMDQ_OPCODE_GET_EXT_ATTR, 0, out, + sizeof(struct erdma_cmdq_query_ext_attr_resp)); +} + +static int erdma_set_ext_attr(struct erdma_dev *dev, struct erdma_ext_attr *attr) +{ + struct erdma_cmdq_set_ext_attr_req req; + int ret; + + erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_COMMON, + CMDQ_OPCODE_SET_EXT_ATTR); + + if (attr->attr_mask & ERDMA_EXT_ATTR_DACK_COUNT_MASK) + req.dack_count = attr->dack_count; + + req.attr_mask = attr->attr_mask; + + ret = erdma_post_cmd_wait(&dev->cmdq, &req, sizeof(req), NULL, NULL); + + return ret; +} + +int erdma_set_dack_count(struct erdma_dev *dev, u32 value) +{ + struct erdma_ext_attr attr; + + if (value > 0xff) + return -EINVAL; + + attr.attr_mask = ERDMA_EXT_ATTR_DACK_COUNT_MASK; + attr.dack_count = (u8)value; + + return erdma_set_ext_attr(dev, &attr); +} diff --git a/drivers/infiniband/hw/erdma/erdma_debugfs.c b/drivers/infiniband/hw/erdma/erdma_debugfs.c new file mode 100644 index 0000000000000000000000000000000000000000..24ea4033a5cb77a9512a3b0c59a156090c6d88f9 --- /dev/null +++ b/drivers/infiniband/hw/erdma/erdma_debugfs.c @@ -0,0 +1,133 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause + +/* Authors: Cheng Xu */ +/* Kai Shen */ +/* Copyright (c) 2020-2022, Alibaba Group. 
*/ + +#include +#include +#include + +#include "erdma.h" + +struct dentry *erdma_debugfs_root; +EXPORT_SYMBOL(erdma_debugfs_root); + +static ssize_t dack_read(struct file *filp, char __user *buf, size_t count, + loff_t *pos) +{ + struct erdma_cmdq_query_ext_attr_resp resp; + struct erdma_dev *dev; + char cbuf[20]; + int ret; + + dev = filp->private_data; + ret = erdma_query_ext_attr(dev, &resp); + if (ret) + return ret; + + ret = snprintf(cbuf, sizeof(cbuf), "0x%x\n", resp.dack_count); + + return simple_read_from_buffer(buf, count, pos, cbuf, ret); +} + +static ssize_t dack_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct erdma_dev *dev; + u32 var; + int ret; + + dev = filp->private_data; + + if (kstrtouint_from_user(buf, count, 0, &var)) + return -EFAULT; + + ret = erdma_set_dack_count(dev, var); + if (ret) + return ret; + + return count; +} + +static const struct file_operations dack_fops = { + .owner = THIS_MODULE, + .open = simple_open, + .read = dack_read, + .write = dack_write, +}; + +static ssize_t cap_read(struct file *filp, char __user *buf, size_t count, + loff_t *pos) +{ + struct erdma_cmdq_query_ext_attr_resp resp; + struct erdma_dev *dev; + char cbuf[40]; + int ret; + + dev = filp->private_data; + ret = erdma_query_ext_attr(dev, &resp); + if (ret) + return ret; + + ret = snprintf(cbuf, sizeof(cbuf), "cap 0x%lx\next_cap 0x%x\n", + dev->attrs.cap_flags, resp.cap_mask); + + return simple_read_from_buffer(buf, count, pos, cbuf, ret); +} + +static const struct file_operations cap_fops = { + .owner = THIS_MODULE, + .open = simple_open, + .read = cap_read, +}; + +int erdma_debugfs_files_create(struct erdma_dev *dev) +{ + struct dentry *ent; + + if (!erdma_debugfs_root) + return 0; + + dev->dbg_root = debugfs_create_dir(dev_name(&dev->pdev->dev), erdma_debugfs_root); + if (!dev->dbg_root) { + dev_err(&dev->pdev->dev, "erdma: Cannot create debugfs dir, aborting\n"); + return -ENOMEM; + } + + ent = debugfs_create_file("delay_ack", 0600, dev->dbg_root, dev, + &dack_fops); + if (!ent) + goto err_out; + + ent = debugfs_create_file("cap", 0400, dev->dbg_root, dev, + &cap_fops); + if (!ent) + goto err_out; + + return 0; + +err_out: + debugfs_remove_recursive(dev->dbg_root); + + return -ENOMEM; +} + +void erdma_debugfs_files_destroy(struct erdma_dev *dev) +{ + if (erdma_debugfs_root) + debugfs_remove_recursive(dev->dbg_root); +} + +void erdma_debugfs_register(void) +{ + erdma_debugfs_root = debugfs_create_dir("erdma", NULL); + + if (IS_ERR_OR_NULL(erdma_debugfs_root)) + erdma_debugfs_root = NULL; +} + +void erdma_debugfs_unregister(void) +{ + debugfs_remove(erdma_debugfs_root); +} diff --git a/drivers/infiniband/hw/erdma/erdma_ioctl.c b/drivers/infiniband/hw/erdma/erdma_ioctl.c index 4491fcc659c2815b63a8e24f4df2f253d428fd28..87dd59f6001b73ddbec3e170e228377e7f1d4b1d 100644 --- a/drivers/infiniband/hw/erdma/erdma_ioctl.c +++ b/drivers/infiniband/hw/erdma/erdma_ioctl.c @@ -21,68 +21,6 @@ static dev_t erdma_char_dev; #define ERDMA_CHRDEV_NAME "erdma" -int erdma_set_ext_attr(struct erdma_dev *dev, struct erdma_ext_attr *attr) -{ - struct erdma_cmdq_set_ext_attr_req req; - int ret; - - erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_COMMON, - CMDQ_OPCODE_SET_EXT_ATTR); - - if (attr->attr_mask & ERDMA_EXT_ATTR_DACK_COUNT_MASK) - req.dack_count = attr->dack_count; - - req.attr_mask = attr->attr_mask; - - ret = erdma_post_cmd_wait(&dev->cmdq, &req, sizeof(req), NULL, NULL); - - return ret; -} - -int erdma_set_dack_count(struct erdma_dev *dev, u32 value) -{ - 
struct erdma_ext_attr attr; - - if (value > 0xff) - return -EINVAL; - - attr.attr_mask = ERDMA_EXT_ATTR_DACK_COUNT_MASK; - attr.dack_count = (u8)value; - - return erdma_set_ext_attr(dev, &attr); -} - -static int erdma_query_resource(struct erdma_dev *dev, u32 mod, u32 op, - u32 index, void *out, u32 len) -{ - struct erdma_cmdq_query_req req; - dma_addr_t dma_addr; - void *resp; - int err; - - erdma_cmdq_build_reqhdr(&req.hdr, mod, op); - - resp = dma_pool_alloc(dev->resp_pool, GFP_KERNEL, &dma_addr); - if (!resp) - return -ENOMEM; - - req.index = index; - req.target_addr = dma_addr; - req.target_length = ERDMA_HW_RESP_SIZE; - - err = erdma_post_cmd_wait(&dev->cmdq, &req, sizeof(req), NULL, NULL); - if (err) - goto out; - - if (out) - memcpy(out, resp, len); - -out: - dma_pool_free(dev->resp_pool, resp, dma_addr); - - return err; -} - static int erdma_query_qpc(struct erdma_dev *dev, u32 qpn, void *out) { BUILD_BUG_ON(sizeof(struct erdma_cmdq_query_qpc_resp) > @@ -113,16 +51,6 @@ static int erdma_query_eqc(struct erdma_dev *dev, u32 eqn, void *out) sizeof(struct erdma_cmdq_query_eqc_resp)); } -static int erdma_query_ext_attr(struct erdma_dev *dev, void *out) -{ - BUILD_BUG_ON(sizeof(struct erdma_cmdq_query_ext_attr_resp) > - ERDMA_HW_RESP_SIZE); - - return erdma_query_resource(dev, CMDQ_SUBMOD_COMMON, - CMDQ_OPCODE_GET_EXT_ATTR, 0, out, - sizeof(struct erdma_cmdq_query_ext_attr_resp)); -} - static int erdma_ioctl_conf_cmd(struct erdma_dev *edev, struct erdma_ioctl_msg *msg) { diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c index edd04a22280be206c8bd8185b69d5249901e0732..ab47b7b422da12f241a1bf12d5bea5f1571f79ce 100644 --- a/drivers/infiniband/hw/erdma/erdma_main.c +++ b/drivers/infiniband/hw/erdma/erdma_main.c @@ -101,6 +101,13 @@ static int erdma_enum_and_get_netdev(struct erdma_dev *dev) return ret; } +static void erdma_device_unregister(struct erdma_dev *dev) +{ + unregister_netdevice_notifier(&dev->netdev_nb); + + ib_unregister_device(&dev->ibdev); +} + static int erdma_device_register(struct erdma_dev *dev) { struct ib_device *ibdev = &dev->ibdev; @@ -277,20 +284,24 @@ static int erdma_device_init(struct erdma_dev *dev, struct pci_dev *pdev) int ret; erdma_dwqe_resource_init(dev); + ret = erdma_hw_resp_pool_init(dev); if (ret) return ret; ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(ERDMA_PCI_WIDTH)); - if (ret) { - erdma_hw_resp_pool_destroy(dev); - return ret; - } + if (ret) + goto destroy_pool; dma_set_max_seg_size(&pdev->dev, UINT_MAX); return 0; + +destroy_pool: + erdma_hw_resp_pool_destroy(dev); + + return ret; } static void erdma_device_uninit(struct erdma_dev *dev) @@ -752,10 +763,16 @@ static int erdma_ib_device_add(struct pci_dev *pdev) if (ret) goto free_wq; + ret = erdma_debugfs_files_create(dev); + if (ret) + goto device_unregister; + dev->ibdev.use_cq_dim = true; return 0; +device_unregister: + erdma_device_unregister(dev); free_wq: destroy_workqueue(dev->reflush_wq); free_pool: @@ -777,7 +794,8 @@ static void erdma_ib_device_remove(struct pci_dev *pdev) unregister_netdevice_notifier(&dev->netdev_nb); - ib_unregister_device(&dev->ibdev); + erdma_debugfs_files_destroy(dev); + erdma_device_unregister(dev); WARN_ON(atomic_read(&dev->num_ctx)); WARN_ON(atomic_read(&dev->num_cep)); @@ -827,6 +845,8 @@ static __init int erdma_init_module(void) { int ret; + erdma_debugfs_register(); + ret = erdma_compat_init(); if (ret) return ret; @@ -863,6 +883,7 @@ static void __exit erdma_exit_module(void) erdma_chrdev_destroy(); 
erdma_cm_exit(); erdma_compat_exit(); + erdma_debugfs_unregister(); } module_init(erdma_init_module); diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index 90384f6b83231e41bcabc9ab0bde00fe777387be..60f7eecb7c6f6ae508ff889520c4a1e45ad60d64 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -2591,6 +2591,9 @@ static void __init free_iommu_resources(void) /* SB IOAPIC is always on this device in AMD systems */ #define IOAPIC_SB_DEVID ((0x00 << 8) | PCI_DEVFN(0x14, 0)) +/* SB IOAPIC for Hygon family 18h model 4h is on the device 0xb */ +#define IOAPIC_SB_DEVID_FAM18H_M4H ((0x00 << 8) | PCI_DEVFN(0xb, 0)) + static bool __init check_ioapic_information(void) { const char *fw_bug = FW_BUG; @@ -2616,7 +2619,12 @@ static bool __init check_ioapic_information(void) pr_err("%s: IOAPIC[%d] not in IVRS table\n", fw_bug, id); ret = false; - } else if (devid == IOAPIC_SB_DEVID) { + } else if (devid == IOAPIC_SB_DEVID || + (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON && + boot_cpu_data.x86 == 0x18 && + boot_cpu_data.x86_model >= 0x4 && + boot_cpu_data.x86_model <= 0xf && + devid == IOAPIC_SB_DEVID_FAM18H_M4H)) { has_sb_ioapic = true; ret = true; } diff --git a/drivers/irqchip/irq-gic-phytium-2500.c b/drivers/irqchip/irq-gic-phytium-2500.c index ac08283929454ceaf039e89ff7464c3bff12bad2..d357f35f7c1a063f0ca36014f728637d8445dd0b 100644 --- a/drivers/irqchip/irq-gic-phytium-2500.c +++ b/drivers/irqchip/irq-gic-phytium-2500.c @@ -2421,11 +2421,17 @@ static void __init gic_acpi_setup_kvm_info(void) gic_set_kvm_info(&gic_v3_kvm_info); } +static struct fwnode_handle *gsi_domain_handle; + +static struct fwnode_handle *gic_v3_get_gsi_domain_id(u32 gsi) +{ + return gsi_domain_handle; +} + static int __init gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_distributor *dist; - struct fwnode_handle *domain_handle; size_t size; int i, err; int skt; @@ -2486,18 +2492,18 @@ gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) if (err) goto out_redist_unmap; - domain_handle = irq_domain_alloc_fwnode(&dist->base_address); - if (!domain_handle) { + gsi_domain_handle = irq_domain_alloc_fwnode(&dist->base_address); + if (!gsi_domain_handle) { err = -ENOMEM; goto out_redist_unmap; } err = gic_init_bases(acpi_data.dist_base, acpi_data.redist_regs, - acpi_data.nr_redist_regions, 0, domain_handle); + acpi_data.nr_redist_regions, 0, gsi_domain_handle); if (err) goto out_fwhandle_free; - acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, domain_handle); + acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, gic_v3_get_gsi_domain_id); if (static_branch_likely(&supports_deactivate_key)) gic_acpi_setup_kvm_info(); @@ -2505,7 +2511,7 @@ gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) return 0; out_fwhandle_free: - irq_domain_free_fwnode(domain_handle); + irq_domain_free_fwnode(gsi_domain_handle); out_redist_unmap: for (i = 0; i < acpi_data.nr_redist_regions; i++) if (acpi_data.redist_regs[i].redist_base) diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index ff0f8d561b71e76979b5e9b815edf0d3b6c0cbd7..98ff90528da6300703f9d5b88583bf57ca0be8db 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -2270,11 +2270,17 @@ static void __init gic_acpi_setup_kvm_info(void) gic_set_kvm_info(&gic_v3_kvm_info); } +static struct fwnode_handle *gsi_domain_handle; + +static struct fwnode_handle *gic_v3_get_gsi_domain_id(u32 gsi) +{ + return gsi_domain_handle; +} + static int __init 
gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_distributor *dist; - struct fwnode_handle *domain_handle; size_t size; int i, err; @@ -2305,18 +2311,18 @@ gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) if (err) goto out_redist_unmap; - domain_handle = irq_domain_alloc_fwnode(&dist->base_address); - if (!domain_handle) { + gsi_domain_handle = irq_domain_alloc_fwnode(&dist->base_address); + if (!gsi_domain_handle) { err = -ENOMEM; goto out_redist_unmap; } err = gic_init_bases(acpi_data.dist_base, acpi_data.redist_regs, - acpi_data.nr_redist_regions, 0, domain_handle); + acpi_data.nr_redist_regions, 0, gsi_domain_handle); if (err) goto out_fwhandle_free; - acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, domain_handle); + acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, gic_v3_get_gsi_domain_id); if (static_branch_likely(&supports_deactivate_key)) gic_acpi_setup_kvm_info(); @@ -2324,7 +2330,7 @@ gic_acpi_init(union acpi_subtable_headers *header, const unsigned long end) return 0; out_fwhandle_free: - irq_domain_free_fwnode(domain_handle); + irq_domain_free_fwnode(gsi_domain_handle); out_redist_unmap: for (i = 0; i < acpi_data.nr_redist_regions; i++) if (acpi_data.redist_regs[i].redist_base) diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index 205cbd24ff20916028a09328d3db21514a7a88c5..ba5b15dc081615193231cdf2a66f5fa5f1fe4bcb 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -1683,11 +1683,17 @@ static void __init gic_acpi_setup_kvm_info(void) gic_set_kvm_info(&gic_v2_kvm_info); } +static struct fwnode_handle *gsi_domain_handle; + +static struct fwnode_handle *gic_v2_get_gsi_domain_id(u32 gsi) +{ + return gsi_domain_handle; +} + static int __init gic_v2_acpi_init(union acpi_subtable_headers *header, const unsigned long end) { struct acpi_madt_generic_distributor *dist; - struct fwnode_handle *domain_handle; struct gic_chip_data *gic = &gic_data[0]; int count, ret; @@ -1725,22 +1731,22 @@ static int __init gic_v2_acpi_init(union acpi_subtable_headers *header, /* * Initialize GIC instance zero (no multi-GIC support). */ - domain_handle = irq_domain_alloc_fwnode(&dist->base_address); - if (!domain_handle) { + gsi_domain_handle = irq_domain_alloc_fwnode(&dist->base_address); + if (!gsi_domain_handle) { pr_err("Unable to allocate domain handle\n"); gic_teardown(gic); return -ENOMEM; } - ret = __gic_init_bases(gic, domain_handle); + ret = __gic_init_bases(gic, gsi_domain_handle); if (ret) { pr_err("Failed to initialise GIC\n"); - irq_domain_free_fwnode(domain_handle); + irq_domain_free_fwnode(gsi_domain_handle); gic_teardown(gic); return ret; } - acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, domain_handle); + acpi_set_irq_model(ACPI_IRQ_MODEL_GIC, gic_v2_get_gsi_domain_id); if (IS_ENABLED(CONFIG_ARM_GIC_V2M)) gicv2m_init(NULL, gic_data[0].domain); diff --git a/drivers/memstick/host/r592.c b/drivers/memstick/host/r592.c index eaa2a94d18be4e46b187de7b5d5f9ad92245caa3..dd06c18495eb6f96c52fd33c4a9995a6e7af9c3e 100644 --- a/drivers/memstick/host/r592.c +++ b/drivers/memstick/host/r592.c @@ -828,7 +828,7 @@ static void r592_remove(struct pci_dev *pdev) /* Stop the processing thread. 
That ensures that we won't take any more requests */ kthread_stop(dev->io_thread); - + del_timer_sync(&dev->detect_timer); r592_enable_device(dev, false); while (!error && dev->req) { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index 3e371d24c4628143f335b14cff1ef6f99954c030..ad45d20f9d44d31221829493dd2633a1d46797d7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -6,6 +6,7 @@ config MLX5_CORE tristate "Mellanox 5th generation network adapters (ConnectX series) core driver" depends on PCI + select AUXILIARY_BUS select NET_DEVLINK depends on VXLAN || !VXLAN depends on MLXFW || !MLXFW @@ -202,3 +203,22 @@ config MLX5_SW_STEERING default y help Build support for software-managed steering in the NIC. + +config MLX5_SF + bool "Mellanox Technologies subfunction device support using auxiliary device" + depends on MLX5_CORE && MLX5_CORE_EN + default n + help + Build support for subfunction device in the NIC. A Mellanox subfunction + device can support RDMA, netdevice and vdpa device. + It is similar to a SRIOV VF but it doesn't require SRIOV support. + +config MLX5_SF_MANAGER + bool + depends on MLX5_SF && MLX5_ESWITCH + default y + help + Build support for subfunction port in the NIC. A Mellanox subfunction + port is managed through devlink. A subfunction supports RDMA, netdevice + and vdpa device. It is similar to a SRIOV VF but it doesn't require + SRIOV support. diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile index 2d477f9a8cb7df040be03d04dd9491995d40ca8e..767af7486c45353673eb8a5e1ef3bcec5c4b109e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile @@ -85,3 +85,12 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o steering/dr_ste.o steering/dr_send.o \ steering/dr_cmd.o steering/dr_fw.o \ steering/dr_action.o steering/fs_dr.o +# +# SF device +# +mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o sf/dev/driver.o + +# +# SF manager +# +mlx5_core-$(CONFIG_MLX5_SF_MANAGER) += sf/cmd.o sf/hw_table.o sf/devlink.o diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c index 39c17e9039157659cc9e4c1f85f07bc3508ade0f..b00a1c8b67cef5f92314a9209982a1891043d6cd 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -338,6 +338,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op, case MLX5_CMD_OP_DEALLOC_MEMIC: case MLX5_CMD_OP_PAGE_FAULT_RESUME: case MLX5_CMD_OP_QUERY_ESW_FUNCTIONS: + case MLX5_CMD_OP_DEALLOC_SF: return MLX5_CMD_STAT_OK; case MLX5_CMD_OP_QUERY_HCA_CAP: @@ -469,6 +470,9 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op, case MLX5_CMD_OP_ALLOC_MEMIC: case MLX5_CMD_OP_MODIFY_XRQ: case MLX5_CMD_OP_RELEASE_XRQ_ERROR: + case MLX5_CMD_OP_QUERY_VHCA_STATE: + case MLX5_CMD_OP_MODIFY_VHCA_STATE: + case MLX5_CMD_OP_ALLOC_SF: *status = MLX5_DRIVER_STATUS_ABORTED; *synd = MLX5_DRIVER_SYND; return -EIO; @@ -662,6 +666,10 @@ const char *mlx5_command_str(int command) MLX5_COMMAND_STR_CASE(DESTROY_UMEM); MLX5_COMMAND_STR_CASE(RELEASE_XRQ_ERROR); MLX5_COMMAND_STR_CASE(MODIFY_XRQ); + MLX5_COMMAND_STR_CASE(QUERY_VHCA_STATE); + MLX5_COMMAND_STR_CASE(MODIFY_VHCA_STATE); + MLX5_COMMAND_STR_CASE(ALLOC_SF); + MLX5_COMMAND_STR_CASE(DEALLOC_SF); default:
return "unknown command opcode"; } } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/dev.c index 1972ddd127044132fd725bb60e2811f698466a62..8ddf469b2d05a5cbd48836a9e92dad9c0808a1b8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c @@ -37,6 +37,7 @@ static LIST_HEAD(intf_list); static LIST_HEAD(mlx5_dev_list); /* intf dev list mutex */ static DEFINE_MUTEX(mlx5_intf_mutex); +static DEFINE_IDA(mlx5_adev_ida); struct mlx5_device_context { struct list_head list; @@ -50,6 +51,39 @@ enum { MLX5_INTERFACE_ATTACHED, }; +static const struct mlx5_adev_device { + const char *suffix; + bool (*is_supported)(struct mlx5_core_dev *dev); +} mlx5_adev_devices[1] = {}; + +int mlx5_adev_idx_alloc(void) +{ + return ida_alloc(&mlx5_adev_ida, GFP_KERNEL); +} + +void mlx5_adev_idx_free(int idx) +{ + ida_free(&mlx5_adev_ida, idx); +} + +int mlx5_adev_init(struct mlx5_core_dev *dev) +{ + struct mlx5_priv *priv = &dev->priv; + + priv->adev = kcalloc(ARRAY_SIZE(mlx5_adev_devices), + sizeof(struct mlx5_adev *), GFP_KERNEL); + if (!priv->adev) + return -ENOMEM; + + return 0; +} + +void mlx5_adev_cleanup(struct mlx5_core_dev *dev) +{ + struct mlx5_priv *priv = &dev->priv; + + kfree(priv->adev); +} void mlx5_add_device(struct mlx5_interface *intf, struct mlx5_priv *priv) { @@ -135,15 +169,99 @@ static void mlx5_attach_interface(struct mlx5_interface *intf, struct mlx5_priv } } -void mlx5_attach_device(struct mlx5_core_dev *dev) +static void adev_release(struct device *dev) +{ + struct mlx5_adev *mlx5_adev = + container_of(dev, struct mlx5_adev, adev.dev); + struct mlx5_priv *priv = &mlx5_adev->mdev->priv; + int idx = mlx5_adev->idx; + + kfree(mlx5_adev); + priv->adev[idx] = NULL; +} + +static struct mlx5_adev *add_adev(struct mlx5_core_dev *dev, int idx) +{ + const char *suffix = mlx5_adev_devices[idx].suffix; + struct auxiliary_device *adev; + struct mlx5_adev *madev; + int ret; + + madev = kzalloc(sizeof(*madev), GFP_KERNEL); + if (!madev) + return ERR_PTR(-ENOMEM); + + adev = &madev->adev; + adev->id = dev->priv.adev_idx; + adev->name = suffix; + adev->dev.parent = dev->device; + adev->dev.release = adev_release; + madev->mdev = dev; + madev->idx = idx; + + ret = auxiliary_device_init(adev); + if (ret) { + kfree(madev); + return ERR_PTR(ret); + } + + ret = auxiliary_device_add(adev); + if (ret) { + auxiliary_device_uninit(adev); + return ERR_PTR(ret); + } + return madev; +} + +static void del_adev(struct auxiliary_device *adev) +{ + auxiliary_device_delete(adev); + auxiliary_device_uninit(adev); +} + +int mlx5_attach_device(struct mlx5_core_dev *dev) { struct mlx5_priv *priv = &dev->priv; + struct auxiliary_device *adev; + struct auxiliary_driver *adrv; struct mlx5_interface *intf; + int ret = 0, i; mutex_lock(&mlx5_intf_mutex); + for (i = 0; i < ARRAY_SIZE(mlx5_adev_devices); i++) { + if (!priv->adev[i]) { + bool is_supported = false; + + if (mlx5_adev_devices[i].is_supported) + is_supported = mlx5_adev_devices[i].is_supported(dev); + + if (!is_supported) + continue; + + priv->adev[i] = add_adev(dev, i); + if (IS_ERR(priv->adev[i])) { + ret = PTR_ERR(priv->adev[i]); + priv->adev[i] = NULL; + } + } else { + adev = &priv->adev[i]->adev; + adrv = to_auxiliary_drv(adev->dev.driver); + + if (adrv->resume) + ret = adrv->resume(adev); + } + if (ret) { + mlx5_core_warn(dev, "Device[%d] (%s) failed to load\n", + i, mlx5_adev_devices[i].suffix); + + break; + } + } + list_for_each_entry(intf, &intf_list, list) 
mlx5_attach_interface(intf, priv); mutex_unlock(&mlx5_intf_mutex); + return ret; } static void mlx5_detach_interface(struct mlx5_interface *intf, struct mlx5_priv *priv) @@ -171,9 +289,29 @@ static void mlx5_detach_interface(struct mlx5_interface *intf, struct mlx5_priv void mlx5_detach_device(struct mlx5_core_dev *dev) { struct mlx5_priv *priv = &dev->priv; + struct auxiliary_device *adev; + struct auxiliary_driver *adrv; struct mlx5_interface *intf; + pm_message_t pm = {}; + int i; mutex_lock(&mlx5_intf_mutex); + for (i = ARRAY_SIZE(mlx5_adev_devices) - 1; i >= 0; i--) { + if (!priv->adev[i]) + continue; + + adev = &priv->adev[i]->adev; + adrv = to_auxiliary_drv(adev->dev.driver); + + if (adrv->suspend) { + adrv->suspend(adev, pm); + continue; + } + + del_adev(&priv->adev[i]->adev); + priv->adev[i] = NULL; + } + list_for_each_entry(intf, &intf_list, list) mlx5_detach_interface(intf, priv); mutex_unlock(&mlx5_intf_mutex); @@ -193,16 +331,30 @@ bool mlx5_device_registered(struct mlx5_core_dev *dev) return found; } -void mlx5_register_device(struct mlx5_core_dev *dev) +int mlx5_register_device(struct mlx5_core_dev *dev) { struct mlx5_priv *priv = &dev->priv; struct mlx5_interface *intf; + int ret; + + mutex_lock(&mlx5_intf_mutex); + dev->priv.flags &= ~MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV; + ret = mlx5_rescan_drivers_locked(dev); + mutex_unlock(&mlx5_intf_mutex); + if (ret) + goto add_err; mutex_lock(&mlx5_intf_mutex); list_add_tail(&priv->dev_list, &mlx5_dev_list); list_for_each_entry(intf, &intf_list, list) mlx5_add_device(intf, priv); mutex_unlock(&mlx5_intf_mutex); + + return 0; + +add_err: + mlx5_unregister_device(dev); + return ret; } void mlx5_unregister_device(struct mlx5_core_dev *dev) @@ -214,6 +366,9 @@ void mlx5_unregister_device(struct mlx5_core_dev *dev) list_for_each_entry_reverse(intf, &intf_list, list) mlx5_remove_device(intf, priv); list_del(&priv->dev_list); + + dev->priv.flags |= MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV; + mlx5_rescan_drivers_locked(dev); mutex_unlock(&mlx5_intf_mutex); } @@ -246,6 +401,77 @@ void mlx5_unregister_interface(struct mlx5_interface *intf) } EXPORT_SYMBOL(mlx5_unregister_interface); +static int add_drivers(struct mlx5_core_dev *dev) +{ + struct mlx5_priv *priv = &dev->priv; + int i, ret = 0; + + for (i = 0; i < ARRAY_SIZE(mlx5_adev_devices); i++) { + bool is_supported = false; + + if (priv->adev[i]) + continue; + + if (mlx5_adev_devices[i].is_supported) + is_supported = mlx5_adev_devices[i].is_supported(dev); + + if (!is_supported) + continue; + + priv->adev[i] = add_adev(dev, i); + if (IS_ERR(priv->adev[i])) { + mlx5_core_warn(dev, "Device[%d] (%s) failed to load\n", + i, mlx5_adev_devices[i].suffix); + /* We continue to rescan drivers and leave to the caller + * to make decision if to release everything or continue. + */ + ret = PTR_ERR(priv->adev[i]); + priv->adev[i] = NULL; + } + } + return ret; +} + +static void delete_drivers(struct mlx5_core_dev *dev) +{ + struct mlx5_priv *priv = &dev->priv; + bool delete_all; + int i; + + delete_all = priv->flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV; + + for (i = ARRAY_SIZE(mlx5_adev_devices) - 1; i >= 0; i--) { + bool is_supported = false; + + if (!priv->adev[i]) + continue; + + if (mlx5_adev_devices[i].is_supported && !delete_all) + is_supported = mlx5_adev_devices[i].is_supported(dev); + + if (is_supported) + continue; + + del_adev(&priv->adev[i]->adev); + priv->adev[i] = NULL; + } +} + +/* This function is used after mlx5_core_dev is reconfigured. 
+ */ +int mlx5_rescan_drivers_locked(struct mlx5_core_dev *dev) +{ + struct mlx5_priv *priv = &dev->priv; + + lockdep_assert_held(&mlx5_intf_mutex); + + delete_drivers(dev); + if (priv->flags & MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV) + return 0; + + return add_drivers(dev); +} + /* Must be called with intf_mutex held */ static bool mlx5_has_added_dev_by_protocol(struct mlx5_core_dev *mdev, int protocol) { @@ -299,24 +525,55 @@ void mlx5_remove_dev_by_protocol(struct mlx5_core_dev *dev, int protocol) } } -static u32 mlx5_gen_pci_id(struct mlx5_core_dev *dev) +static u32 mlx5_gen_pci_id(const struct mlx5_core_dev *dev) { return (u32)((pci_domain_nr(dev->pdev->bus) << 16) | (dev->pdev->bus->number << 8) | PCI_SLOT(dev->pdev->devfn)); } -/* Must be called with intf_mutex held */ +static int next_phys_dev(struct device *dev, const void *data) +{ + struct mlx5_adev *madev = container_of(dev, struct mlx5_adev, adev.dev); + struct mlx5_core_dev *mdev = madev->mdev; + const struct mlx5_core_dev *curr = data; + + if (!mlx5_core_is_pf(mdev)) + return 0; + + if (mdev == curr) + return 0; + + if (mlx5_gen_pci_id(mdev) != mlx5_gen_pci_id(curr)) + return 0; + + return 1; +} + +/* This function is called in two flows: + * 1. During initialization of mlx5_core_dev and we don't need to lock it. + * 2. During LAG configure stage and caller holds &mlx5_intf_mutex. + */ struct mlx5_core_dev *mlx5_get_next_phys_dev(struct mlx5_core_dev *dev) { struct mlx5_core_dev *res = NULL; struct mlx5_core_dev *tmp_dev; + struct auxiliary_device *adev; + struct mlx5_adev *madev; struct mlx5_priv *priv; u32 pci_id; if (!mlx5_core_is_pf(dev)) return NULL; + adev = auxiliary_find_device(NULL, dev, &next_phys_dev); + if (adev) { + madev = container_of(adev, struct mlx5_adev, adev); + + put_device(&adev->dev); + return madev->mdev; + } + pci_id = mlx5_gen_pci_id(dev); list_for_each_entry(priv, &mlx5_dev_list, dev_list) { tmp_dev = container_of(priv, struct mlx5_core_dev, priv); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c index 0e699330ae77c042329173fddf59e202a01aef04..d72707cd6cb9dc938172f6427baab36bddb4b249 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c @@ -7,6 +7,8 @@ #include "fw_reset.h" #include "fs_core.h" #include "eswitch.h" +#include "sf/dev/dev.h" +#include "sf/sf.h" static int mlx5_devlink_flash_update(struct devlink *devlink, struct devlink_flash_update_params *params, @@ -52,7 +54,7 @@ mlx5_devlink_info_get(struct devlink *devlink, struct devlink_info_req *req, u32 running_fw, stored_fw; int err; - err = devlink_info_driver_name_put(req, DRIVER_NAME); + err = devlink_info_driver_name_put(req, KBUILD_MODNAME); if (err) return err; @@ -136,6 +138,17 @@ static int mlx5_devlink_reload_down(struct devlink *devlink, bool netns_change, struct netlink_ext_ack *extack) { struct mlx5_core_dev *dev = devlink_priv(devlink); + bool sf_dev_allocated; + + sf_dev_allocated = mlx5_sf_dev_allocated(dev); + if (sf_dev_allocated) { + /* Reload results in deleting the SF device, which further results in + * unregistering the devlink instance while holding devlink_mutex. + * Hence, do not support reload.
+ */ + NL_SET_ERR_MSG_MOD(extack, "reload is unsupported when SFs are allocated\n"); + return -EOPNOTSUPP; + } if (mlx5_lag_is_active(dev)) { NL_SET_ERR_MSG_MOD(extack, "reload is unsupported in Lag mode\n"); @@ -192,6 +205,12 @@ static const struct devlink_ops mlx5_devlink_ops = { .eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get, .port_function_hw_addr_get = mlx5_devlink_port_function_hw_addr_get, .port_function_hw_addr_set = mlx5_devlink_port_function_hw_addr_set, +#endif +#ifdef CONFIG_MLX5_SF_MANAGER + .port_new = mlx5_devlink_sf_port_new, + .port_del = mlx5_devlink_sf_port_del, + .port_fn_state_get = mlx5_devlink_sf_port_fn_state_get, + .port_fn_state_set = mlx5_devlink_sf_port_fn_state_set, #endif .flash_update = mlx5_devlink_flash_update, .info_get = mlx5_devlink_info_get, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c index 6a1b1363ac16a8d0049ade5da6ad5ccf8a955221..d3817dd07e3dc368a20e93485ef30dfd37dbd9f7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c @@ -40,9 +40,7 @@ void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv, { struct mlx5_core_dev *mdev = priv->mdev; - strlcpy(drvinfo->driver, DRIVER_NAME, sizeof(drvinfo->driver)); - strlcpy(drvinfo->version, DRIVER_VERSION, - sizeof(drvinfo->version)); + strlcpy(drvinfo->driver, KBUILD_MODNAME, sizeof(drvinfo->driver)); snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version), "%d.%d.%04d (%.16s)", fw_rev_maj(mdev), fw_rev_min(mdev), fw_rev_sub(mdev), diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index cfc3bfcb04a2f684a25567f5c6cb6f583be0f903..cd6ce1fe2092067f274ad56a79bd3ea1f1e8a6a5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -5673,11 +5673,15 @@ static struct mlx5_interface mlx5e_interface = { .protocol = MLX5_INTERFACE_PROTOCOL_ETH, }; -void mlx5e_init(void) +int mlx5e_init(void) { + int err; + mlx5e_ipsec_build_inverse_table(); mlx5e_build_ptys2ethtool_map(); - mlx5_register_interface(&mlx5e_interface); + err = mlx5_register_interface(&mlx5e_interface); + + return err; } void mlx5e_cleanup(void) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c index b991f03c7e9917c6e69810628f8a6ae938e34785..5a13d47d2c09dd6bb58e51b1d3e96c8a781e046f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c @@ -64,7 +64,6 @@ static void mlx5e_rep_get_drvinfo(struct net_device *dev, strlcpy(drvinfo->driver, mlx5e_rep_driver_name, sizeof(drvinfo->driver)); - strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version)); snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version), "%d.%d.%04d (%.16s)", fw_rev_maj(mdev), fw_rev_min(mdev), diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c index 4f4f79ca37a81e11308c42c0bc7e46ab5333d049..01f5debebf2ed2ecc20d8c41f8d1d806417fb6c2 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c @@ -465,7 +465,7 @@ int mlx5_eq_table_init(struct mlx5_core_dev *dev) for (i = 0; i < MLX5_EVENT_TYPE_MAX; i++) ATOMIC_INIT_NOTIFIER_HEAD(&eq_table->nh[i]); - eq_table->irq_table = dev->priv.irq_table; + eq_table->irq_table = mlx5_irq_table_get(dev); return 0; } @@ -593,6 +593,9 @@ static 
void gather_async_events_mask(struct mlx5_core_dev *dev, u64 mask[4]) async_event_mask |= (1ull << MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED); + if (MLX5_CAP_GEN_MAX(dev, vhca_state)) + async_event_mask |= (1ull << MLX5_EVENT_TYPE_VHCA_STATE_CHANGE); + mask[0] = async_event_mask; if (MLX5_CAP_GEN(dev, event_cap)) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c index 4c74e2690d57bc85bd26f33ae6aa001c432de109..26b37a0f87629cac18d8adc613e92235f01d4a43 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c @@ -150,7 +150,7 @@ static void esw_acl_egress_ofld_groups_destroy(struct mlx5_vport *vport) static bool esw_acl_egress_needed(const struct mlx5_eswitch *esw, u16 vport_num) { - return mlx5_eswitch_is_vf_vport(esw, vport_num); + return mlx5_eswitch_is_vf_vport(esw, vport_num) || mlx5_esw_is_sf_vport(esw, vport_num); } int esw_acl_egress_ofld_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c index ffff11baa3d04245a4564398ba890f04d6705840..cb1e181f4c6abd7710adb7de60dc1ab1ac1db544 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c @@ -122,3 +122,44 @@ struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u1 vport = mlx5_eswitch_get_vport(esw, vport_num); return vport->dl_port; } + +int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port, + u16 vport_num, u32 sfnum) +{ + struct mlx5_core_dev *dev = esw->dev; + struct netdev_phys_item_id ppid = {}; + unsigned int dl_port_index; + struct mlx5_vport *vport; + struct devlink *devlink; + u16 pfnum; + int err; + + vport = mlx5_eswitch_get_vport(esw, vport_num); + if (IS_ERR(vport)) + return PTR_ERR(vport); + + pfnum = PCI_FUNC(dev->pdev->devfn); + mlx5_esw_get_port_parent_id(dev, &ppid); + memcpy(dl_port->attrs.switch_id.id, &ppid.id[0], ppid.id_len); + dl_port->attrs.switch_id.id_len = ppid.id_len; + devlink_port_attrs_pci_sf_set(dl_port, 0, pfnum, sfnum); + devlink = priv_to_devlink(dev); + dl_port_index = mlx5_esw_vport_to_devlink_port_index(dev, vport_num); + err = devlink_port_register(devlink, dl_port, dl_port_index); + if (err) + return err; + + vport->dl_port = dl_port; + return 0; +} + +void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num) +{ + struct mlx5_vport *vport; + + vport = mlx5_eswitch_get_vport(esw, vport_num); + if (IS_ERR(vport)) + return; + devlink_port_unregister(vport->dl_port); + vport->dl_port = NULL; +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c index 78cc6f0bbc72b173bd80db84c5e356784f350b3b..ea1b645113399f105200f3f36e9673deb0479b59 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c @@ -1274,8 +1274,8 @@ static void esw_vport_cleanup(struct mlx5_eswitch *esw, struct mlx5_vport *vport esw_vport_cleanup_acl(esw, vport); } -static int esw_enable_vport(struct mlx5_eswitch *esw, u16 vport_num, - enum mlx5_eswitch_vport_event enabled_events) +int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num, + enum mlx5_eswitch_vport_event enabled_events) { struct mlx5_vport *vport; int ret; @@ -1317,7 +1317,7 @@ 
static int esw_enable_vport(struct mlx5_eswitch *esw, u16 vport_num, return ret; } -static void esw_disable_vport(struct mlx5_eswitch *esw, u16 vport_num) +void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num) { struct mlx5_vport *vport; @@ -1373,9 +1373,15 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev) { int outlen = MLX5_ST_SZ_BYTES(query_esw_functions_out); u32 in[MLX5_ST_SZ_DW(query_esw_functions_in)] = {}; + u16 max_sf_vports; u32 *out; int err; + max_sf_vports = mlx5_sf_max_functions(dev); + /* Device interface is array of 64-bits */ + if (max_sf_vports) + outlen += DIV_ROUND_UP(max_sf_vports, BITS_PER_TYPE(__be64)) * sizeof(__be64); + out = kvzalloc(outlen, GFP_KERNEL); if (!out) return ERR_PTR(-ENOMEM); @@ -1383,7 +1389,7 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev) MLX5_SET(query_esw_functions_in, in, opcode, MLX5_CMD_OP_QUERY_ESW_FUNCTIONS); - err = mlx5_cmd_exec_inout(dev, query_esw_functions, in, out); + err = mlx5_cmd_exec(dev, in, sizeof(in), out, outlen); if (!err) return out; @@ -1433,7 +1439,7 @@ int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num, { int err; - err = esw_enable_vport(esw, vport_num, enabled_events); + err = mlx5_esw_vport_enable(esw, vport_num, enabled_events); if (err) return err; @@ -1444,14 +1450,14 @@ int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num, return err; err_rep: - esw_disable_vport(esw, vport_num); + mlx5_esw_vport_disable(esw, vport_num); return err; } void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num) { esw_offloads_unload_rep(esw, vport_num); - esw_disable_vport(esw, vport_num); + mlx5_esw_vport_disable(esw, vport_num); } void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs) @@ -1574,6 +1580,15 @@ mlx5_eswitch_update_num_of_vfs(struct mlx5_eswitch *esw, int num_vfs) kvfree(out); } +static void mlx5_esw_mode_change_notify(struct mlx5_eswitch *esw, u16 mode) +{ + struct mlx5_esw_event_info info = {}; + + info.new_mode = mode; + + blocking_notifier_call_chain(&esw->n_head, 0, &info); +} + /** * mlx5_eswitch_enable_locked - Enable eswitch * @esw: Pointer to eswitch @@ -1635,6 +1650,8 @@ int mlx5_eswitch_enable_locked(struct mlx5_eswitch *esw, int mode, int num_vfs) mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS", esw->esw_funcs.num_vfs, esw->enabled_vports); + mlx5_esw_mode_change_notify(esw, mode); + return 0; abort: @@ -1692,6 +1709,11 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw, bool clear_vf) esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS", esw->esw_funcs.num_vfs, esw->enabled_vports); + /* Notify eswitch users that it is exiting from current mode. + * So that it can do necessary cleanup before the eswitch is disabled. 
+ */ + mlx5_esw_mode_change_notify(esw, MLX5_ESWITCH_NONE); + mlx5_eswitch_event_handlers_unregister(esw); if (esw->mode == MLX5_ESWITCH_LEGACY) @@ -1793,6 +1815,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev) esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE; dev->priv.eswitch = esw; + BLOCKING_INIT_NOTIFIER_HEAD(&esw->n_head); return 0; abort: if (esw->work_queue) @@ -1881,7 +1904,8 @@ static bool is_port_function_supported(const struct mlx5_eswitch *esw, u16 vport_num) { return vport_num == MLX5_VPORT_PF || - mlx5_eswitch_is_vf_vport(esw, vport_num); + mlx5_eswitch_is_vf_vport(esw, vport_num) || + mlx5_esw_is_sf_vport(esw, vport_num); } int mlx5_devlink_port_function_hw_addr_get(struct devlink *devlink, @@ -2480,4 +2504,12 @@ bool mlx5_esw_multipath_prereq(struct mlx5_core_dev *dev0, dev1->priv.eswitch->mode == MLX5_ESWITCH_OFFLOADS); } +int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&esw->n_head, nb); +} +void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(&esw->n_head, nb); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h index 59c674f157a8c10e706615840a8d77eb83b5a66e..3058393761ef0d25060e8c353fa911d2ee03f750 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h @@ -43,6 +43,7 @@ #include #include "lib/mpfs.h" #include "lib/fs_chains.h" +#include "sf/sf.h" #include "en/tc_ct.h" #ifdef CONFIG_MLX5_ESWITCH @@ -277,6 +278,7 @@ struct mlx5_eswitch { struct { u32 large_group_num; } params; + struct blocking_notifier_head n_head; }; void esw_offloads_disable(struct mlx5_eswitch *esw); @@ -499,6 +501,40 @@ static inline u16 mlx5_eswitch_first_host_vport_num(struct mlx5_core_dev *dev) MLX5_VPORT_PF : MLX5_VPORT_FIRST_VF; } +static inline int mlx5_esw_sf_start_idx(const struct mlx5_eswitch *esw) +{ + /* PF and VF vports indices start from 0 to max_vfs */ + return MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev); +} + +static inline int mlx5_esw_sf_end_idx(const struct mlx5_eswitch *esw) +{ + return mlx5_esw_sf_start_idx(esw) + mlx5_sf_max_functions(esw->dev); +} + +static inline int +mlx5_esw_sf_vport_num_to_index(const struct mlx5_eswitch *esw, u16 vport_num) +{ + return vport_num - mlx5_sf_start_function_id(esw->dev) + + MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev); +} + +static inline u16 +mlx5_esw_sf_vport_index_to_num(const struct mlx5_eswitch *esw, int idx) +{ + return mlx5_sf_start_function_id(esw->dev) + idx - + (MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev)); +} + +static inline bool +mlx5_esw_is_sf_vport(const struct mlx5_eswitch *esw, u16 vport_num) +{ + return mlx5_sf_supported(esw->dev) && + vport_num >= mlx5_sf_start_function_id(esw->dev) && + (vport_num < (mlx5_sf_start_function_id(esw->dev) + + mlx5_sf_max_functions(esw->dev))); +} + static inline bool mlx5_eswitch_is_funcs_handler(const struct mlx5_core_dev *dev) { return mlx5_core_is_ecpf_esw_manager(dev); @@ -527,6 +563,10 @@ static inline int mlx5_eswitch_vport_num_to_index(struct mlx5_eswitch *esw, if (vport_num == MLX5_VPORT_UPLINK) return mlx5_eswitch_uplink_idx(esw); + if (mlx5_esw_is_sf_vport(esw, vport_num)) + return mlx5_esw_sf_vport_num_to_index(esw, vport_num); + + /* PF and VF vports start from 0 to max_vfs */ return vport_num; } @@ -540,6 +580,12 @@ static inline u16 
mlx5_eswitch_index_to_vport_num(struct mlx5_eswitch *esw, if (index == mlx5_eswitch_uplink_idx(esw)) return MLX5_VPORT_UPLINK; + /* SF vports indices are after VFs and before ECPF */ + if (mlx5_sf_supported(esw->dev) && + index > mlx5_core_max_vfs(esw->dev)) + return mlx5_esw_sf_vport_index_to_num(esw, index); + + /* PF and VF vports start from 0 to max_vfs */ return index; } @@ -625,6 +671,11 @@ void mlx5e_tc_clean_fdb_peer_flows(struct mlx5_eswitch *esw); for ((vport) = (nvfs); \ (vport) >= (esw)->first_host_vport; (vport)--) +#define mlx5_esw_for_each_sf_rep(esw, i, rep) \ + for ((i) = mlx5_esw_sf_start_idx(esw); \ + (rep) = &(esw)->offloads.vport_reps[(i)], \ + (i) < mlx5_esw_sf_end_idx(esw); (i++)) + struct mlx5_eswitch *mlx5_devlink_eswitch_get(struct devlink *devlink); struct mlx5_vport *__must_check mlx5_eswitch_get_vport(struct mlx5_eswitch *esw, u16 vport_num); @@ -638,6 +689,10 @@ mlx5_eswitch_enable_pf_vf_vports(struct mlx5_eswitch *esw, enum mlx5_eswitch_vport_event enabled_events); void mlx5_eswitch_disable_pf_vf_vports(struct mlx5_eswitch *esw); +int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num, + enum mlx5_eswitch_vport_event enabled_events); +void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num); + int esw_vport_create_offloads_acl_tables(struct mlx5_eswitch *esw, struct mlx5_vport *vport); @@ -656,6 +711,9 @@ esw_get_max_restore_tag(struct mlx5_eswitch *esw); int esw_offloads_load_rep(struct mlx5_eswitch *esw, u16 vport_num); void esw_offloads_unload_rep(struct mlx5_eswitch *esw, u16 vport_num); +int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num); +void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num); + int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num, enum mlx5_eswitch_vport_event enabled_events); void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num); @@ -667,6 +725,26 @@ void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs); int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, u16 vport_num); void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_eswitch *esw, u16 vport_num); struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u16 vport_num); + +int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port, + u16 vport_num, u32 sfnum); +void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num); + +int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port, + u16 vport_num, u32 sfnum); +void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num); + +/** + * mlx5_esw_event_info - Indicates eswitch mode changed/changing. + * + * @new_mode: New mode of eswitch. 
+ */ +struct mlx5_esw_event_info { + u16 new_mode; +}; + +int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *n); +void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *n); #else /* CONFIG_MLX5_ESWITCH */ /* eswitch API stubs */ static inline int mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c index ccc7dd3e738a48ede918113f769a0bed7b84300b..447788234a670c2d7d83a768f2d5f01a6d739663 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c @@ -1801,11 +1801,22 @@ static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw, esw->offloads.rep_ops[rep_type]->unload(rep); } +static void __unload_reps_sf_vport(struct mlx5_eswitch *esw, u8 rep_type) +{ + struct mlx5_eswitch_rep *rep; + int i; + + mlx5_esw_for_each_sf_rep(esw, i, rep) + __esw_offloads_unload_rep(esw, rep, rep_type); +} + static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type) { struct mlx5_eswitch_rep *rep; int i; + __unload_reps_sf_vport(esw, rep_type); + mlx5_esw_for_each_vf_rep_reverse(esw, i, rep, esw->esw_funcs.num_vfs) __esw_offloads_unload_rep(esw, rep, rep_type); @@ -1823,7 +1834,7 @@ static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type) __esw_offloads_unload_rep(esw, rep, rep_type); } -static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num) +int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num) { struct mlx5_eswitch_rep *rep; int rep_type; @@ -1847,7 +1858,7 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num) return err; } -static void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num) +void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num) { struct mlx5_eswitch_rep *rep; int rep_type; @@ -2824,3 +2835,35 @@ u32 mlx5_eswitch_get_vport_metadata_for_match(struct mlx5_eswitch *esw, return vport->metadata << (32 - ESW_SOURCE_PORT_METADATA_BITS); } EXPORT_SYMBOL(mlx5_eswitch_get_vport_metadata_for_match); + +int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port, + u16 vport_num, u32 sfnum) +{ + int err; + + err = mlx5_esw_vport_enable(esw, vport_num, MLX5_VPORT_UC_ADDR_CHANGE); + if (err) + return err; + + err = mlx5_esw_devlink_sf_port_register(esw, dl_port, vport_num, sfnum); + if (err) + goto devlink_err; + + err = mlx5_esw_offloads_rep_load(esw, vport_num); + if (err) + goto rep_err; + return 0; + +rep_err: + mlx5_esw_devlink_sf_port_unregister(esw, vport_num); +devlink_err: + mlx5_esw_vport_disable(esw, vport_num); + return err; +} + +void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num) +{ + mlx5_esw_offloads_rep_unload(esw, vport_num); + mlx5_esw_devlink_sf_port_unregister(esw, vport_num); + mlx5_esw_vport_disable(esw, vport_num); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/events.c b/drivers/net/ethernet/mellanox/mlx5/core/events.c index 3ce17c3d7a0014082b74a5ac00428935a76f48b9..5523d218e5fb75c6739456bd6743a1c4ba729c0e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/events.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/events.c @@ -110,6 +110,8 @@ static const char *eqe_type_str(u8 type) return "MLX5_EVENT_TYPE_CMD"; case MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED: return 
"MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED"; + case MLX5_EVENT_TYPE_VHCA_STATE_CHANGE: + return "MLX5_EVENT_TYPE_VHCA_STATE_CHANGE"; case MLX5_EVENT_TYPE_PAGE_REQUEST: return "MLX5_EVENT_TYPE_PAGE_REQUEST"; case MLX5_EVENT_TYPE_PAGE_FAULT: @@ -403,3 +405,8 @@ int mlx5_notifier_call_chain(struct mlx5_events *events, unsigned int event, voi { return atomic_notifier_call_chain(&events->nh, event, data); } + +void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work) +{ + queue_work(dev->priv.events->wq, work); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c index cac8f085b16d74fabff10d162ea1625cfb9b9c0a..97d96fc38a655f0e9527f16fc1747a1eb11f9ef8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c @@ -39,7 +39,7 @@ static void mlx5i_get_drvinfo(struct net_device *dev, struct mlx5e_priv *priv = mlx5i_epriv(dev); mlx5e_ethtool_get_drvinfo(priv, drvinfo); - strlcpy(drvinfo->driver, DRIVER_NAME "[ib_ipoib]", + strlcpy(drvinfo->driver, KBUILD_MODNAME "[ib_ipoib]", sizeof(drvinfo->driver)); } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 8246b6285d5a44fc0d4a002ea43dbdb36a1aab0d..76cc201c9f7c4c2e2a9087aae7d882f613f3956d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -73,17 +73,18 @@ #include "ecpf.h" #include "lib/hv_vhca.h" #include "diag/rsc_dump.h" +#include "sf/vhca_event.h" +#include "sf/dev/dev.h" +#include "sf/sf.h" MODULE_AUTHOR("Eli Cohen "); MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver"); MODULE_LICENSE("Dual BSD/GPL"); -MODULE_VERSION(DRIVER_VERSION); unsigned int mlx5_core_debug_mask; module_param_named(debug_mask, mlx5_core_debug_mask, uint, 0644); MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0"); -#define MLX5_DEFAULT_PROF 2 static unsigned int prof_sel = MLX5_DEFAULT_PROF; module_param_named(prof_sel, prof_sel, uint, 0444); MODULE_PARM_DESC(prof_sel, "profile selector. 
Valid range 0 - 2"); @@ -228,7 +229,7 @@ static void mlx5_set_driver_version(struct mlx5_core_dev *dev) strncat(string, ",", remaining_size); remaining_size = max_t(int, 0, driver_ver_sz - strlen(string)); - strncat(string, DRIVER_NAME, remaining_size); + strncat(string, KBUILD_MODNAME, remaining_size); remaining_size = max_t(int, 0, driver_ver_sz - strlen(string)); strncat(string, ",", remaining_size); @@ -313,7 +314,7 @@ static int request_bar(struct pci_dev *pdev) return -ENODEV; } - err = pci_request_regions(pdev, DRIVER_NAME); + err = pci_request_regions(pdev, KBUILD_MODNAME); if (err) dev_err(&pdev->dev, "Couldn't get PCI resources, aborting\n"); @@ -568,6 +569,8 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx) if (MLX5_CAP_GEN_MAX(dev, mkey_by_name)) MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1); + mlx5_vhca_state_cap_handle(dev, set_hca_cap); + return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE); } @@ -885,6 +888,24 @@ static int mlx5_init_once(struct mlx5_core_dev *dev) goto err_eswitch_cleanup; } + err = mlx5_vhca_event_init(dev); + if (err) { + mlx5_core_err(dev, "Failed to init vhca event notifier %d\n", err); + goto err_fpga_cleanup; + } + + err = mlx5_sf_hw_table_init(dev); + if (err) { + mlx5_core_err(dev, "Failed to init SF HW table %d\n", err); + goto err_sf_hw_table_cleanup; + } + + err = mlx5_sf_table_init(dev); + if (err) { + mlx5_core_err(dev, "Failed to init SF table %d\n", err); + goto err_sf_table_cleanup; + } + dev->dm = mlx5_dm_create(dev); if (IS_ERR(dev->dm)) mlx5_core_warn(dev, "Failed to init device memory%d\n", err); @@ -895,6 +916,12 @@ static int mlx5_init_once(struct mlx5_core_dev *dev) return 0; +err_sf_table_cleanup: + mlx5_sf_hw_table_cleanup(dev); +err_sf_hw_table_cleanup: + mlx5_vhca_event_cleanup(dev); +err_fpga_cleanup: + mlx5_fpga_cleanup(dev); err_eswitch_cleanup: mlx5_eswitch_cleanup(dev->priv.eswitch); err_sriov_cleanup: @@ -926,6 +953,9 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev) mlx5_hv_vhca_destroy(dev->hv_vhca); mlx5_fw_tracer_destroy(dev->tracer); mlx5_dm_cleanup(dev); + mlx5_sf_table_cleanup(dev); + mlx5_sf_hw_table_cleanup(dev); + mlx5_vhca_event_cleanup(dev); mlx5_fpga_cleanup(dev); mlx5_eswitch_cleanup(dev->priv.eswitch); mlx5_sriov_cleanup(dev); @@ -1136,15 +1166,28 @@ static int mlx5_load(struct mlx5_core_dev *dev) goto err_sriov; } + mlx5_vhca_event_start(dev); + + err = mlx5_sf_hw_table_create(dev); + if (err) { + mlx5_core_err(dev, "sf table create failed %d\n", err); + goto err_vhca; + } + err = mlx5_ec_init(dev); if (err) { mlx5_core_err(dev, "Failed to init embedded CPU\n"); goto err_ec; } + mlx5_sf_dev_table_create(dev); + return 0; err_ec: + mlx5_sf_hw_table_destroy(dev); +err_vhca: + mlx5_vhca_event_stop(dev); mlx5_sriov_detach(dev); err_sriov: mlx5_cleanup_fs(dev); @@ -1172,8 +1215,11 @@ static int mlx5_load(struct mlx5_core_dev *dev) static void mlx5_unload(struct mlx5_core_dev *dev) { + mlx5_sf_dev_table_destroy(dev); mlx5_ec_cleanup(dev); mlx5_sriov_detach(dev); + mlx5_sf_hw_table_destroy(dev); + mlx5_vhca_event_stop(dev); mlx5_cleanup_fs(dev); mlx5_accel_ipsec_cleanup(dev); mlx5_accel_tls_cleanup(dev); @@ -1223,14 +1269,21 @@ int mlx5_load_one(struct mlx5_core_dev *dev, bool boot) err = mlx5_devlink_register(priv_to_devlink(dev), dev->device); if (err) goto err_devlink_reg; - mlx5_register_device(dev); + + err = mlx5_register_device(dev); } else { - mlx5_attach_device(dev); + err = mlx5_attach_device(dev); } + if (err) + goto err_register; + 
mutex_unlock(&dev->intf_state_mutex); return 0; +err_register: + if (boot) + mlx5_devlink_unregister(priv_to_devlink(dev)); err_devlink_reg: clear_bit(MLX5_INTERFACE_STATE_UP, &dev->intf_state); mlx5_unload(dev); @@ -1277,7 +1330,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup) mutex_unlock(&dev->intf_state_mutex); } -static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) +int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) { struct mlx5_priv *priv = &dev->priv; int err; @@ -1286,7 +1339,9 @@ static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) INIT_LIST_HEAD(&priv->ctx_list); spin_lock_init(&priv->ctx_lock); + lockdep_register_key(&dev->lock_key); mutex_init(&dev->intf_state_mutex); + lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key); mutex_init(&priv->bfregs.reg_head.lock); mutex_init(&priv->bfregs.wc_head.lock); @@ -1307,8 +1362,14 @@ static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) if (err) goto err_pagealloc_init; + err = mlx5_adev_init(dev); + if (err) + goto err_adev_init; + return 0; +err_adev_init: + mlx5_pagealloc_cleanup(dev); err_pagealloc_init: mlx5_health_cleanup(dev); err_health_init: @@ -1318,13 +1379,15 @@ static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) mutex_destroy(&priv->bfregs.wc_head.lock); mutex_destroy(&priv->bfregs.reg_head.lock); mutex_destroy(&dev->intf_state_mutex); + lockdep_unregister_key(&dev->lock_key); return err; } -static void mlx5_mdev_uninit(struct mlx5_core_dev *dev) +void mlx5_mdev_uninit(struct mlx5_core_dev *dev) { struct mlx5_priv *priv = &dev->priv; + mlx5_adev_cleanup(dev); mlx5_pagealloc_cleanup(dev); mlx5_health_cleanup(dev); debugfs_remove_recursive(dev->priv.dbg_root); @@ -1333,6 +1396,7 @@ static void mlx5_mdev_uninit(struct mlx5_core_dev *dev) mutex_destroy(&priv->bfregs.wc_head.lock); mutex_destroy(&priv->bfregs.reg_head.lock); mutex_destroy(&dev->intf_state_mutex); + lockdep_unregister_key(&dev->lock_key); } #define MLX5_IB_MOD "mlx5_ib" @@ -1355,6 +1419,10 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id) dev->coredev_type = id->driver_data & MLX5_PCI_DEV_IS_VF ? 
MLX5_COREDEV_VF : MLX5_COREDEV_PF; + dev->priv.adev_idx = mlx5_adev_idx_alloc(); + if (dev->priv.adev_idx < 0) + return dev->priv.adev_idx; + err = mlx5_mdev_init(dev, prof_sel); if (err) goto mdev_init_err; @@ -1389,6 +1457,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id) pci_init_err: mlx5_mdev_uninit(dev); mdev_init_err: + mlx5_adev_idx_free(dev->priv.adev_idx); mlx5_devlink_free(devlink); return err; @@ -1405,6 +1474,7 @@ static void remove_one(struct pci_dev *pdev) mlx5_unload_one(dev, true); mlx5_pci_close(dev); mlx5_mdev_uninit(dev); + mlx5_adev_idx_free(dev->priv.adev_idx); mlx5_devlink_free(devlink); } @@ -1612,13 +1682,17 @@ void mlx5_disable_device(struct mlx5_core_dev *dev) void mlx5_recover_device(struct mlx5_core_dev *dev) { - mlx5_pci_disable_device(dev); - if (mlx5_pci_slot_reset(dev->pdev) == PCI_ERS_RESULT_RECOVERED) - mlx5_pci_resume(dev->pdev); + if (!mlx5_core_is_sf(dev)) { + mlx5_pci_disable_device(dev); + if (mlx5_pci_slot_reset(dev->pdev) != PCI_ERS_RESULT_RECOVERED) + return; + } + + mlx5_pci_resume(dev->pdev); } static struct pci_driver mlx5_core_driver = { - .name = DRIVER_NAME, + .name = KBUILD_MODNAME, .id_table = mlx5_core_pci_table, .probe = init_one, .remove = remove_one, @@ -1644,6 +1718,9 @@ static int __init init(void) { int err; + WARN_ONCE(strcmp(MLX5_ADEV_NAME, KBUILD_MODNAME), + "mlx5_core name not in sync with kernel module name"); + get_random_bytes(&sw_owner_id, sizeof(sw_owner_id)); mlx5_core_verify_params(); @@ -1654,12 +1731,20 @@ static int __init init(void) if (err) goto err_debug; -#ifdef CONFIG_MLX5_CORE_EN - mlx5e_init(); -#endif + err = mlx5_sf_driver_register(); + if (err) + goto err_sf; + + err = mlx5e_init(); + if (err) + goto err_en; return 0; +err_en: + mlx5_sf_driver_unregister(); +err_sf: + pci_unregister_driver(&mlx5_core_driver); err_debug: mlx5_unregister_debugfs(); return err; @@ -1667,9 +1752,8 @@ static int __init init(void) static void __exit cleanup(void) { -#ifdef CONFIG_MLX5_CORE_EN mlx5e_cleanup(); -#endif + mlx5_sf_driver_unregister(); pci_unregister_driver(&mlx5_core_driver); mlx5_unregister_debugfs(); } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h index 8cec85ab419d0ece156c5242485287ed37381b37..a5ffc68f62ddc38f706590f513e2b0dbefc8c866 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h @@ -42,9 +42,6 @@ #include #include -#define DRIVER_NAME "mlx5_core" -#define DRIVER_VERSION "5.0-0" - extern uint mlx5_core_debug_mask; #define mlx5_core_dbg(__dev, format, ...) 
\ @@ -120,6 +117,8 @@ enum mlx5_semaphore_space_address { MLX5_SEMAPHORE_SW_RESET = 0x20, }; +#define MLX5_DEFAULT_PROF 2 + int mlx5_query_hca_caps(struct mlx5_core_dev *dev); int mlx5_query_board_id(struct mlx5_core_dev *dev); int mlx5_cmd_init_hca(struct mlx5_core_dev *dev, uint32_t *sw_owner_id); @@ -175,18 +174,24 @@ struct cpumask * mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx); struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table); int mlx5_irq_get_num_comp(struct mlx5_irq_table *table); +struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev); int mlx5_events_init(struct mlx5_core_dev *dev); void mlx5_events_cleanup(struct mlx5_core_dev *dev); void mlx5_events_start(struct mlx5_core_dev *dev); void mlx5_events_stop(struct mlx5_core_dev *dev); +int mlx5_adev_idx_alloc(void); +void mlx5_adev_idx_free(int idx); +void mlx5_adev_cleanup(struct mlx5_core_dev *dev); +int mlx5_adev_init(struct mlx5_core_dev *dev); + void mlx5_add_device(struct mlx5_interface *intf, struct mlx5_priv *priv); void mlx5_remove_device(struct mlx5_interface *intf, struct mlx5_priv *priv); -void mlx5_attach_device(struct mlx5_core_dev *dev); +int mlx5_attach_device(struct mlx5_core_dev *dev); void mlx5_detach_device(struct mlx5_core_dev *dev); bool mlx5_device_registered(struct mlx5_core_dev *dev); -void mlx5_register_device(struct mlx5_core_dev *dev); +int mlx5_register_device(struct mlx5_core_dev *dev); void mlx5_unregister_device(struct mlx5_core_dev *dev); void mlx5_add_dev_by_protocol(struct mlx5_core_dev *dev, int protocol); void mlx5_remove_dev_by_protocol(struct mlx5_core_dev *dev, int protocol); @@ -215,8 +220,13 @@ int mlx5_firmware_flash(struct mlx5_core_dev *dev, const struct firmware *fw, int mlx5_fw_version_query(struct mlx5_core_dev *dev, u32 *running_ver, u32 *stored_ver); -void mlx5e_init(void); +#ifdef CONFIG_MLX5_CORE_EN +int mlx5e_init(void); void mlx5e_cleanup(void); +#else +static inline int mlx5e_init(void){ return 0; } +static inline void mlx5e_cleanup(void){} +#endif static inline bool mlx5_sriov_is_enabled(struct mlx5_core_dev *dev) { @@ -235,6 +245,17 @@ static inline int mlx5_lag_is_lacp_owner(struct mlx5_core_dev *dev) MLX5_CAP_GEN(dev, lag_master); } +int mlx5_rescan_drivers_locked(struct mlx5_core_dev *dev); +static inline int mlx5_rescan_drivers(struct mlx5_core_dev *dev) +{ + int ret; + + mlx5_dev_list_lock(); + ret = mlx5_rescan_drivers_locked(dev); + mlx5_dev_list_unlock(); + return ret; +} + void mlx5_reload_interface(struct mlx5_core_dev *mdev, int protocol); void mlx5_lag_update(struct mlx5_core_dev *dev); @@ -248,6 +269,15 @@ enum { u8 mlx5_get_nic_state(struct mlx5_core_dev *dev); void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state); +static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev) +{ + return dev->coredev_type == MLX5_COREDEV_SF; +} + +int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx); +void mlx5_mdev_uninit(struct mlx5_core_dev *dev); void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup); int mlx5_load_one(struct mlx5_core_dev *dev, bool boot); + +void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work); #endif /* __MLX5_CORE_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c index 6fd9749203944c47b19c27584beb2fcb5103af90..a61e09aff1523c9d8916f6eb407bf8623a0f9d08 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 
@@ -30,6 +30,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev) { struct mlx5_irq_table *irq_table; + if (mlx5_core_is_sf(dev)) + return 0; + irq_table = kvzalloc(sizeof(*irq_table), GFP_KERNEL); if (!irq_table) return -ENOMEM; @@ -40,6 +43,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev) void mlx5_irq_table_cleanup(struct mlx5_core_dev *dev) { + if (mlx5_core_is_sf(dev)) + return; + kvfree(dev->priv.irq_table); } @@ -268,6 +274,9 @@ int mlx5_irq_table_create(struct mlx5_core_dev *dev) int nvec; int err; + if (mlx5_core_is_sf(dev)) + return 0; + nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() + MLX5_IRQ_VEC_COMP_BASE; nvec = min_t(int, nvec, num_eqs); @@ -319,6 +328,9 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev) struct mlx5_irq_table *table = dev->priv.irq_table; int i; + if (mlx5_core_is_sf(dev)) + return; + /* free_irq requires that affinity and rmap will be cleared * before calling it. This is why there is asymmetry with set_rmap * which should be called after alloc_irq but before request_irq. @@ -332,3 +344,11 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev) kfree(table->irq); } +struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev) +{ +#ifdef CONFIG_MLX5_SF + if (mlx5_core_is_sf(dev)) + return dev->priv.parent_mdev->priv.irq_table; +#endif + return dev->priv.irq_table; +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c new file mode 100644 index 0000000000000000000000000000000000000000..a8d75c2f0275402539739c8e33063ac851f711ac --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#include +#include "priv.h" + +int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id) +{ + u32 out[MLX5_ST_SZ_DW(alloc_sf_out)] = {}; + u32 in[MLX5_ST_SZ_DW(alloc_sf_in)] = {}; + + MLX5_SET(alloc_sf_in, in, opcode, MLX5_CMD_OP_ALLOC_SF); + MLX5_SET(alloc_sf_in, in, function_id, function_id); + + return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); +} + +int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id) +{ + u32 out[MLX5_ST_SZ_DW(dealloc_sf_out)] = {}; + u32 in[MLX5_ST_SZ_DW(dealloc_sf_in)] = {}; + + MLX5_SET(dealloc_sf_in, in, opcode, MLX5_CMD_OP_DEALLOC_SF); + MLX5_SET(dealloc_sf_in, in, function_id, function_id); + + return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); +} + +int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id) +{ + u32 out[MLX5_ST_SZ_DW(enable_hca_out)] = {}; + u32 in[MLX5_ST_SZ_DW(enable_hca_in)] = {}; + + MLX5_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA); + MLX5_SET(enable_hca_in, in, function_id, func_id); + MLX5_SET(enable_hca_in, in, embedded_cpu_function, 0); + return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); +} + +int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id) +{ + u32 out[MLX5_ST_SZ_DW(disable_hca_out)] = {}; + u32 in[MLX5_ST_SZ_DW(disable_hca_in)] = {}; + + MLX5_SET(disable_hca_in, in, opcode, MLX5_CMD_OP_DISABLE_HCA); + MLX5_SET(disable_hca_in, in, function_id, func_id); + MLX5_SET(disable_hca_in, in, embedded_cpu_function, 0); + return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out)); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c new file mode 100644 index 0000000000000000000000000000000000000000..4a2e46309728b6083b9e5ba379a223312daed93e ---
/dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c @@ -0,0 +1,271 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#include +#include +#include "mlx5_core.h" +#include "dev.h" +#include "sf/vhca_event.h" +#include "sf/sf.h" +#include "sf/mlx5_ifc_vhca_event.h" +#include "ecpf.h" + +struct mlx5_sf_dev_table { + struct xarray devices; + unsigned int max_sfs; + phys_addr_t base_address; + u64 sf_bar_length; + struct notifier_block nb; + struct mlx5_core_dev *dev; +}; + +static bool mlx5_sf_dev_supported(const struct mlx5_core_dev *dev) +{ + return MLX5_CAP_GEN(dev, sf) && mlx5_vhca_event_supported(dev); +} + +bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev) +{ + struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table; + + return table && !xa_empty(&table->devices); +} + +static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev); + struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev); + + return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum); +} +static DEVICE_ATTR_RO(sfnum); + +static struct attribute *sf_device_attrs[] = { + &dev_attr_sfnum.attr, + NULL, +}; + +static const struct attribute_group sf_attr_group = { + .attrs = sf_device_attrs, +}; + +static const struct attribute_group *sf_attr_groups[2] = { + &sf_attr_group, + NULL +}; + +static void mlx5_sf_dev_release(struct device *device) +{ + struct auxiliary_device *adev = container_of(device, struct auxiliary_device, dev); + struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev); + + mlx5_adev_idx_free(adev->id); + kfree(sf_dev); +} + +static void mlx5_sf_dev_remove(struct mlx5_sf_dev *sf_dev) +{ + auxiliary_device_delete(&sf_dev->adev); + auxiliary_device_uninit(&sf_dev->adev); +} + +static void mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u32 sfnum) +{ + struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table; + struct mlx5_sf_dev *sf_dev; + struct pci_dev *pdev; + int err; + int id; + + id = mlx5_adev_idx_alloc(); + if (id < 0) { + err = id; + goto add_err; + } + + sf_dev = kzalloc(sizeof(*sf_dev), GFP_KERNEL); + if (!sf_dev) { + mlx5_adev_idx_free(id); + err = -ENOMEM; + goto add_err; + } + pdev = dev->pdev; + sf_dev->adev.id = id; + sf_dev->adev.name = MLX5_SF_DEV_ID_NAME; + sf_dev->adev.dev.release = mlx5_sf_dev_release; + sf_dev->adev.dev.parent = &pdev->dev; + sf_dev->adev.dev.groups = sf_attr_groups; + sf_dev->sfnum = sfnum; + sf_dev->parent_mdev = dev; + + if (!table->max_sfs) { + mlx5_adev_idx_free(id); + kfree(sf_dev); + err = -EOPNOTSUPP; + goto add_err; + } + sf_dev->bar_base_addr = table->base_address + (sf_index * table->sf_bar_length); + + err = auxiliary_device_init(&sf_dev->adev); + if (err) { + mlx5_adev_idx_free(id); + kfree(sf_dev); + goto add_err; + } + + err = auxiliary_device_add(&sf_dev->adev); + if (err) { + put_device(&sf_dev->adev.dev); + goto add_err; + } + + err = xa_insert(&table->devices, sf_index, sf_dev, GFP_KERNEL); + if (err) + goto xa_err; + return; + +xa_err: + mlx5_sf_dev_remove(sf_dev); +add_err: + mlx5_core_err(dev, "SF DEV: fail device add for index=%d sfnum=%d err=%d\n", + sf_index, sfnum, err); +} + +static void mlx5_sf_dev_del(struct mlx5_core_dev *dev, struct mlx5_sf_dev *sf_dev, u16 sf_index) +{ + struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table; + + xa_erase(&table->devices, sf_index); + mlx5_sf_dev_remove(sf_dev); +} + 
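+/* Descriptive note on the handler below: the SF auxiliary device lifecycle + * follows the vhca state events, i.e. the device is added when its vhca + * becomes ACTIVE, and removed when the state returns to ALLOCATED/INVALID + * or a teardown is requested. + */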
+static int +mlx5_sf_dev_state_change_handler(struct notifier_block *nb, unsigned long event_code, void *data) +{ + struct mlx5_sf_dev_table *table = container_of(nb, struct mlx5_sf_dev_table, nb); + const struct mlx5_vhca_state_event *event = data; + struct mlx5_sf_dev *sf_dev; + u16 sf_index; + + sf_index = event->function_id - MLX5_CAP_GEN(table->dev, sf_base_id); + sf_dev = xa_load(&table->devices, sf_index); + switch (event->new_vhca_state) { + case MLX5_VHCA_STATE_INVALID: + case MLX5_VHCA_STATE_ALLOCATED: + if (sf_dev) + mlx5_sf_dev_del(table->dev, sf_dev, sf_index); + break; + case MLX5_VHCA_STATE_TEARDOWN_REQUEST: + if (sf_dev) + mlx5_sf_dev_del(table->dev, sf_dev, sf_index); + else + mlx5_core_err(table->dev, + "SF DEV: teardown state for invalid dev index=%d fn_id=0x%x\n", + sf_index, event->sw_function_id); + break; + case MLX5_VHCA_STATE_ACTIVE: + if (!sf_dev) + mlx5_sf_dev_add(table->dev, sf_index, event->sw_function_id); + break; + default: + break; + } + return 0; +} + +static int mlx5_sf_dev_vhca_arm_all(struct mlx5_sf_dev_table *table) +{ + struct mlx5_core_dev *dev = table->dev; + u16 max_functions; + u16 function_id; + int err = 0; + int i; + + max_functions = mlx5_sf_max_functions(dev); + function_id = MLX5_CAP_GEN(dev, sf_base_id); + /* Arm the vhca context as the vhca event notifier */ + for (i = 0; i < max_functions; i++) { + err = mlx5_vhca_event_arm(dev, function_id); + if (err) + return err; + + function_id++; + } + return 0; +} + +void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_dev_table *table; + unsigned int max_sfs; + int err; + + if (!mlx5_sf_dev_supported(dev) || !mlx5_vhca_event_supported(dev)) + return; + + table = kzalloc(sizeof(*table), GFP_KERNEL); + if (!table) { + err = -ENOMEM; + goto table_err; + } + + table->nb.notifier_call = mlx5_sf_dev_state_change_handler; + table->dev = dev; + if (MLX5_CAP_GEN(dev, max_num_sf)) + max_sfs = MLX5_CAP_GEN(dev, max_num_sf); + else + max_sfs = 1 << MLX5_CAP_GEN(dev, log_max_sf); + table->sf_bar_length = 1 << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12); + table->base_address = pci_resource_start(dev->pdev, 2); + table->max_sfs = max_sfs; + xa_init(&table->devices); + dev->priv.sf_dev_table = table; + + err = mlx5_vhca_event_notifier_register(dev, &table->nb); + if (err) + goto vhca_err; + err = mlx5_sf_dev_vhca_arm_all(table); + if (err) + goto arm_err; + mlx5_core_dbg(dev, "SF DEV: max sf devices=%d\n", max_sfs); + return; + +arm_err: + mlx5_vhca_event_notifier_unregister(dev, &table->nb); +vhca_err: + table->max_sfs = 0; + kfree(table); + dev->priv.sf_dev_table = NULL; +table_err: + mlx5_core_err(dev, "SF DEV table create err = %d\n", err); +} + +static void mlx5_sf_dev_destroy_all(struct mlx5_sf_dev_table *table) +{ + struct mlx5_sf_dev *sf_dev; + unsigned long index; + + xa_for_each(&table->devices, index, sf_dev) { + xa_erase(&table->devices, index); + mlx5_sf_dev_remove(sf_dev); + } +} + +void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table; + + if (!table) + return; + + mlx5_vhca_event_notifier_unregister(dev, &table->nb); + + /* Now that event handler is not running, it is safe to destroy + * the sf device without race. 
+ */ + mlx5_sf_dev_destroy_all(table); + + WARN_ON(!xa_empty(&table->devices)); + kfree(table); + dev->priv.sf_dev_table = NULL; +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h new file mode 100644 index 0000000000000000000000000000000000000000..4de02902aef11874c15bf9b2fb1517791c984866 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#ifndef __MLX5_SF_DEV_H__ +#define __MLX5_SF_DEV_H__ + +#ifdef CONFIG_MLX5_SF + +#include + +#define MLX5_SF_DEV_ID_NAME "sf" + +struct mlx5_sf_dev { + struct auxiliary_device adev; + struct mlx5_core_dev *parent_mdev; + struct mlx5_core_dev *mdev; + phys_addr_t bar_base_addr; + u32 sfnum; +}; + +void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev); +void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev); + +int mlx5_sf_driver_register(void); +void mlx5_sf_driver_unregister(void); + +bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev); + +#else + +static inline void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev) +{ +} + +static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev) +{ +} + +static inline int mlx5_sf_driver_register(void) +{ + return 0; +} + +static inline void mlx5_sf_driver_unregister(void) +{ +} + +static inline bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev) +{ + return 0; +} + +#endif + +#endif diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c new file mode 100644 index 0000000000000000000000000000000000000000..5138b182307bf6e9f535c72bc1e672a6d3adad7c --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c @@ -0,0 +1,103 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#include +#include +#include "mlx5_core.h" +#include "dev.h" +#include "devlink.h" + +static int mlx5_sf_dev_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id) +{ + struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev); + struct mlx5_core_dev *mdev; + struct devlink *devlink; + int err; + + devlink = mlx5_devlink_alloc(); + if (!devlink) + return -ENOMEM; + + mdev = devlink_priv(devlink); + mdev->device = &adev->dev; + mdev->pdev = sf_dev->parent_mdev->pdev; + mdev->bar_addr = sf_dev->bar_base_addr; + mdev->iseg_base = sf_dev->bar_base_addr; + mdev->coredev_type = MLX5_COREDEV_SF; + mdev->priv.parent_mdev = sf_dev->parent_mdev; + mdev->priv.adev_idx = adev->id; + sf_dev->mdev = mdev; + + err = mlx5_mdev_init(mdev, MLX5_DEFAULT_PROF); + if (err) { + mlx5_core_warn(mdev, "mlx5_mdev_init on err=%d\n", err); + goto mdev_err; + } + + mdev->iseg = ioremap(mdev->iseg_base, sizeof(*mdev->iseg)); + if (!mdev->iseg) { + mlx5_core_warn(mdev, "remap error\n"); + err = -ENOMEM; + goto remap_err; + } + + err = mlx5_load_one(mdev, true); + if (err) { + mlx5_core_warn(mdev, "mlx5_load_one err=%d\n", err); + goto load_one_err; + } + return 0; + +load_one_err: + iounmap(mdev->iseg); +remap_err: + mlx5_mdev_uninit(mdev); +mdev_err: + mlx5_devlink_free(devlink); + return err; +} + +static void mlx5_sf_dev_remove(struct auxiliary_device *adev) +{ + struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev); + struct devlink *devlink; + + devlink = priv_to_devlink(sf_dev->mdev); + mlx5_drain_health_wq(sf_dev->mdev); + 
mlx5_unload_one(sf_dev->mdev, true); + iounmap(sf_dev->mdev->iseg); + mlx5_mdev_uninit(sf_dev->mdev); + mlx5_devlink_free(devlink); +} + +static void mlx5_sf_dev_shutdown(struct auxiliary_device *adev) +{ + struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev); + + mlx5_unload_one(sf_dev->mdev, false); +} + +static const struct auxiliary_device_id mlx5_sf_dev_id_table[] = { + { .name = MLX5_ADEV_NAME "." MLX5_SF_DEV_ID_NAME, }, + { }, +}; + +MODULE_DEVICE_TABLE(auxiliary, mlx5_sf_dev_id_table); + +static struct auxiliary_driver mlx5_sf_driver = { + .name = MLX5_SF_DEV_ID_NAME, + .probe = mlx5_sf_dev_probe, + .remove = mlx5_sf_dev_remove, + .shutdown = mlx5_sf_dev_shutdown, + .id_table = mlx5_sf_dev_id_table, +}; + +int mlx5_sf_driver_register(void) +{ + return auxiliary_driver_register(&mlx5_sf_driver); +} + +void mlx5_sf_driver_unregister(void) +{ + auxiliary_driver_unregister(&mlx5_sf_driver); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c new file mode 100644 index 0000000000000000000000000000000000000000..96c4509e58381c1f63a807d8c76b1ea4432e5c29 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c @@ -0,0 +1,560 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#include +#include "eswitch.h" +#include "priv.h" +#include "sf/dev/dev.h" +#include "mlx5_ifc_vhca_event.h" +#include "vhca_event.h" +#include "ecpf.h" + +struct mlx5_sf { + struct devlink_port dl_port; + unsigned int port_index; + u16 id; + u16 hw_fn_id; + u16 hw_state; +}; + +struct mlx5_sf_table { + struct mlx5_core_dev *dev; /* To refer from notifier context. */ + struct xarray port_indices; /* port index based lookup. */ + refcount_t refcount; + struct completion disable_complete; + struct mutex sf_state_lock; /* Serializes sf state among user cmds & vhca event handler. 
*/ + struct notifier_block esw_nb; + struct notifier_block vhca_nb; + u8 ecpu: 1; +}; + +static struct mlx5_sf * +mlx5_sf_lookup_by_index(struct mlx5_sf_table *table, unsigned int port_index) +{ + return xa_load(&table->port_indices, port_index); +} + +static struct mlx5_sf * +mlx5_sf_lookup_by_function_id(struct mlx5_sf_table *table, unsigned int fn_id) +{ + unsigned long index; + struct mlx5_sf *sf; + + xa_for_each(&table->port_indices, index, sf) { + if (sf->hw_fn_id == fn_id) + return sf; + } + return NULL; +} + +static int mlx5_sf_id_insert(struct mlx5_sf_table *table, struct mlx5_sf *sf) +{ + return xa_insert(&table->port_indices, sf->port_index, sf, GFP_KERNEL); +} + +static void mlx5_sf_id_erase(struct mlx5_sf_table *table, struct mlx5_sf *sf) +{ + xa_erase(&table->port_indices, sf->port_index); +} + +static struct mlx5_sf * +mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *extack) +{ + unsigned int dl_port_index; + struct mlx5_sf *sf; + u16 hw_fn_id; + int id_err; + int err; + + id_err = mlx5_sf_hw_table_sf_alloc(table->dev, sfnum); + if (id_err < 0) { + err = id_err; + goto id_err; + } + + sf = kzalloc(sizeof(*sf), GFP_KERNEL); + if (!sf) { + err = -ENOMEM; + goto alloc_err; + } + sf->id = id_err; + hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sf->id); + dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, hw_fn_id); + sf->port_index = dl_port_index; + sf->hw_fn_id = hw_fn_id; + sf->hw_state = MLX5_VHCA_STATE_ALLOCATED; + + err = mlx5_sf_id_insert(table, sf); + if (err) + goto insert_err; + + return sf; + +insert_err: + kfree(sf); +alloc_err: + mlx5_sf_hw_table_sf_free(table->dev, id_err); +id_err: + if (err == -EEXIST) + NL_SET_ERR_MSG_MOD(extack, "SF already exist. Choose different sfnum"); + return ERR_PTR(err); +} + +static void mlx5_sf_free(struct mlx5_sf_table *table, struct mlx5_sf *sf) +{ + mlx5_sf_id_erase(table, sf); + mlx5_sf_hw_table_sf_free(table->dev, sf->id); + kfree(sf); +} + +static struct mlx5_sf_table *mlx5_sf_table_try_get(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_table *table = dev->priv.sf_table; + + if (!table) + return NULL; + + return refcount_inc_not_zero(&table->refcount) ? 
table : NULL; +} + +static void mlx5_sf_table_put(struct mlx5_sf_table *table) +{ + if (refcount_dec_and_test(&table->refcount)) + complete(&table->disable_complete); +} + +static enum devlink_port_fn_state mlx5_sf_to_devlink_state(u8 hw_state) +{ + switch (hw_state) { + case MLX5_VHCA_STATE_ACTIVE: + case MLX5_VHCA_STATE_IN_USE: + return DEVLINK_PORT_FN_STATE_ACTIVE; + case MLX5_VHCA_STATE_INVALID: + case MLX5_VHCA_STATE_ALLOCATED: + case MLX5_VHCA_STATE_TEARDOWN_REQUEST: + default: + return DEVLINK_PORT_FN_STATE_INACTIVE; + } +} + +static enum devlink_port_fn_opstate mlx5_sf_to_devlink_opstate(u8 hw_state) +{ + switch (hw_state) { + case MLX5_VHCA_STATE_IN_USE: + case MLX5_VHCA_STATE_TEARDOWN_REQUEST: + return DEVLINK_PORT_FN_OPSTATE_ATTACHED; + case MLX5_VHCA_STATE_INVALID: + case MLX5_VHCA_STATE_ALLOCATED: + case MLX5_VHCA_STATE_ACTIVE: + default: + return DEVLINK_PORT_FN_OPSTATE_DETACHED; + } +} + +static bool mlx5_sf_is_active(const struct mlx5_sf *sf) +{ + return sf->hw_state == MLX5_VHCA_STATE_ACTIVE || sf->hw_state == MLX5_VHCA_STATE_IN_USE; +} + +int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port, + enum devlink_port_fn_state *state, + enum devlink_port_fn_opstate *opstate, + struct netlink_ext_ack *extack) +{ + struct mlx5_core_dev *dev = devlink_priv(devlink); + struct mlx5_sf_table *table; + struct mlx5_sf *sf; + int err = 0; + + table = mlx5_sf_table_try_get(dev); + if (!table) + return -EOPNOTSUPP; + + sf = mlx5_sf_lookup_by_index(table, dl_port->index); + if (!sf) { + err = -EOPNOTSUPP; + goto sf_err; + } + mutex_lock(&table->sf_state_lock); + *state = mlx5_sf_to_devlink_state(sf->hw_state); + *opstate = mlx5_sf_to_devlink_opstate(sf->hw_state); + mutex_unlock(&table->sf_state_lock); +sf_err: + mlx5_sf_table_put(table); + return err; +} + +static int mlx5_sf_activate(struct mlx5_core_dev *dev, struct mlx5_sf *sf, + struct netlink_ext_ack *extack) +{ + int err; + + if (mlx5_sf_is_active(sf)) + return 0; + if (sf->hw_state != MLX5_VHCA_STATE_ALLOCATED) { + NL_SET_ERR_MSG_MOD(extack, "SF is inactivated but it is still attached"); + return -EBUSY; + } + + err = mlx5_cmd_sf_enable_hca(dev, sf->hw_fn_id); + if (err) + return err; + + sf->hw_state = MLX5_VHCA_STATE_ACTIVE; + return 0; +} + +static int mlx5_sf_deactivate(struct mlx5_core_dev *dev, struct mlx5_sf *sf) +{ + int err; + + if (!mlx5_sf_is_active(sf)) + return 0; + + err = mlx5_cmd_sf_disable_hca(dev, sf->hw_fn_id); + if (err) + return err; + + sf->hw_state = MLX5_VHCA_STATE_TEARDOWN_REQUEST; + return 0; +} + +static int mlx5_sf_state_set(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, + struct mlx5_sf *sf, + enum devlink_port_fn_state state, + struct netlink_ext_ack *extack) +{ + int err = 0; + + mutex_lock(&table->sf_state_lock); + if (state == mlx5_sf_to_devlink_state(sf->hw_state)) + goto out; + if (state == DEVLINK_PORT_FN_STATE_ACTIVE) + err = mlx5_sf_activate(dev, sf, extack); + else if (state == DEVLINK_PORT_FN_STATE_INACTIVE) + err = mlx5_sf_deactivate(dev, sf); + else + err = -EINVAL; +out: + mutex_unlock(&table->sf_state_lock); + return err; +} + +int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port, + enum devlink_port_fn_state state, + struct netlink_ext_ack *extack) +{ + struct mlx5_core_dev *dev = devlink_priv(devlink); + struct mlx5_sf_table *table; + struct mlx5_sf *sf; + int err; + + table = mlx5_sf_table_try_get(dev); + if (!table) { + NL_SET_ERR_MSG_MOD(extack, + "Port state set is only supported in eswitch 
switchdev mode or SF ports are disabled."); + return -EOPNOTSUPP; + } + sf = mlx5_sf_lookup_by_index(table, dl_port->index); + if (!sf) { + err = -ENODEV; + goto out; + } + + err = mlx5_sf_state_set(dev, table, sf, state, extack); +out: + mlx5_sf_table_put(table); + return err; +} + +static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, + const struct devlink_port_new_attrs *new_attr, + struct netlink_ext_ack *extack, + unsigned int *new_port_index) +{ + struct mlx5_eswitch *esw = dev->priv.eswitch; + struct mlx5_sf *sf; + u16 hw_fn_id; + int err; + + sf = mlx5_sf_alloc(table, new_attr->sfnum, extack); + if (IS_ERR(sf)) + return PTR_ERR(sf); + + hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id); + err = mlx5_esw_offloads_sf_vport_enable(esw, &sf->dl_port, hw_fn_id, new_attr->sfnum); + if (err) + goto esw_err; + *new_port_index = sf->port_index; + return 0; + +esw_err: + mlx5_sf_free(table, sf); + return err; +} + +static int +mlx5_sf_new_check_attr(struct mlx5_core_dev *dev, const struct devlink_port_new_attrs *new_attr, + struct netlink_ext_ack *extack) +{ + if (new_attr->flavour != DEVLINK_PORT_FLAVOUR_PCI_SF) { + NL_SET_ERR_MSG_MOD(extack, "Driver supports only SF port addition"); + return -EOPNOTSUPP; + } + if (new_attr->port_index_valid) { + NL_SET_ERR_MSG_MOD(extack, + "Driver does not support user defined port index assignment"); + return -EOPNOTSUPP; + } + if (!new_attr->sfnum_valid) { + NL_SET_ERR_MSG_MOD(extack, + "User must provide unique sfnum. Driver does not support auto assignment"); + return -EOPNOTSUPP; + } + if (new_attr->controller_valid && new_attr->controller) { + NL_SET_ERR_MSG_MOD(extack, "External controller is unsupported"); + return -EOPNOTSUPP; + } + if (new_attr->pfnum != PCI_FUNC(dev->pdev->devfn)) { + NL_SET_ERR_MSG_MOD(extack, "Invalid pfnum supplied"); + return -EOPNOTSUPP; + } + return 0; +} + +int mlx5_devlink_sf_port_new(struct devlink *devlink, + const struct devlink_port_new_attrs *new_attr, + struct netlink_ext_ack *extack, + unsigned int *new_port_index) +{ + struct mlx5_core_dev *dev = devlink_priv(devlink); + struct mlx5_sf_table *table; + int err; + + err = mlx5_sf_new_check_attr(dev, new_attr, extack); + if (err) + return err; + + table = mlx5_sf_table_try_get(dev); + if (!table) { + NL_SET_ERR_MSG_MOD(extack, + "Port add is only supported in eswitch switchdev mode or SF ports are disabled."); + return -EOPNOTSUPP; + } + err = mlx5_sf_add(dev, table, new_attr, extack, new_port_index); + mlx5_sf_table_put(table); + return err; +} + +static void mlx5_sf_dealloc(struct mlx5_sf_table *table, struct mlx5_sf *sf) +{ + if (sf->hw_state == MLX5_VHCA_STATE_ALLOCATED) { + mlx5_sf_free(table, sf); + } else if (mlx5_sf_is_active(sf)) { + /* Even if its active, it is treated as in_use because by the time, + * it is disabled here, it may getting used. So it is safe to + * always look for the event to ensure that it is recycled only after + * firmware gives confirmation that it is detached by the driver. 
+ */ + mlx5_cmd_sf_disable_hca(table->dev, sf->hw_fn_id); + mlx5_sf_hw_table_sf_deferred_free(table->dev, sf->id); + kfree(sf); + } else { + mlx5_sf_hw_table_sf_deferred_free(table->dev, sf->id); + kfree(sf); + } +} + +int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index, + struct netlink_ext_ack *extack) +{ + struct mlx5_core_dev *dev = devlink_priv(devlink); + struct mlx5_eswitch *esw = dev->priv.eswitch; + struct mlx5_sf_table *table; + struct mlx5_sf *sf; + int err = 0; + + table = mlx5_sf_table_try_get(dev); + if (!table) { + NL_SET_ERR_MSG_MOD(extack, + "Port del is only supported in eswitch switchdev mode or SF ports are disabled."); + return -EOPNOTSUPP; + } + sf = mlx5_sf_lookup_by_index(table, port_index); + if (!sf) { + err = -ENODEV; + goto sf_err; + } + + mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id); + mlx5_sf_id_erase(table, sf); + + mutex_lock(&table->sf_state_lock); + mlx5_sf_dealloc(table, sf); + mutex_unlock(&table->sf_state_lock); +sf_err: + mlx5_sf_table_put(table); + return err; +} + +static bool mlx5_sf_state_update_check(const struct mlx5_sf *sf, u8 new_state) +{ + if (sf->hw_state == MLX5_VHCA_STATE_ACTIVE && new_state == MLX5_VHCA_STATE_IN_USE) + return true; + + if (sf->hw_state == MLX5_VHCA_STATE_IN_USE && new_state == MLX5_VHCA_STATE_ACTIVE) + return true; + + if (sf->hw_state == MLX5_VHCA_STATE_TEARDOWN_REQUEST && + new_state == MLX5_VHCA_STATE_ALLOCATED) + return true; + + return false; +} + +static int mlx5_sf_vhca_event(struct notifier_block *nb, unsigned long opcode, void *data) +{ + struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, vhca_nb); + const struct mlx5_vhca_state_event *event = data; + bool update = false; + struct mlx5_sf *sf; + + table = mlx5_sf_table_try_get(table->dev); + if (!table) + return 0; + + mutex_lock(&table->sf_state_lock); + sf = mlx5_sf_lookup_by_function_id(table, event->function_id); + if (!sf) + goto sf_err; + + /* When driver is attached or detached to a function, an event + * notifies such state change. + */ + update = mlx5_sf_state_update_check(sf, event->new_vhca_state); + if (update) + sf->hw_state = event->new_vhca_state; +sf_err: + mutex_unlock(&table->sf_state_lock); + mlx5_sf_table_put(table); + return 0; +} + +static void mlx5_sf_table_enable(struct mlx5_sf_table *table) +{ + if (!mlx5_sf_max_functions(table->dev)) + return; + + init_completion(&table->disable_complete); + refcount_set(&table->refcount, 1); +} + +static void mlx5_sf_deactivate_all(struct mlx5_sf_table *table) +{ + struct mlx5_eswitch *esw = table->dev->priv.eswitch; + unsigned long index; + struct mlx5_sf *sf; + + /* At this point, no new user commands can start and no vhca event can + * arrive. It is safe to destroy all user created SFs. + */ + xa_for_each(&table->port_indices, index, sf) { + mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id); + mlx5_sf_id_erase(table, sf); + mlx5_sf_dealloc(table, sf); + } +} + +static void mlx5_sf_table_disable(struct mlx5_sf_table *table) +{ + if (!mlx5_sf_max_functions(table->dev)) + return; + + if (!refcount_read(&table->refcount)) + return; + + /* Balances with refcount_set; drop the reference so that new user cmd cannot start + * and new vhca event handler cannnot run. 
+ */ + mlx5_sf_table_put(table); + wait_for_completion(&table->disable_complete); + + mlx5_sf_deactivate_all(table); +} + +static int mlx5_sf_esw_event(struct notifier_block *nb, unsigned long event, void *data) +{ + struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, esw_nb); + const struct mlx5_esw_event_info *mode = data; + + switch (mode->new_mode) { + case MLX5_ESWITCH_OFFLOADS: + mlx5_sf_table_enable(table); + break; + case MLX5_ESWITCH_NONE: + mlx5_sf_table_disable(table); + break; + default: + break; + }; + + return 0; +} + +static bool mlx5_sf_table_supported(const struct mlx5_core_dev *dev) +{ + return dev->priv.eswitch && MLX5_ESWITCH_MANAGER(dev) && mlx5_sf_supported(dev); +} + +int mlx5_sf_table_init(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_table *table; + int err; + + if (!mlx5_sf_table_supported(dev) || !mlx5_vhca_event_supported(dev)) + return 0; + + table = kzalloc(sizeof(*table), GFP_KERNEL); + if (!table) + return -ENOMEM; + + mutex_init(&table->sf_state_lock); + table->dev = dev; + xa_init(&table->port_indices); + dev->priv.sf_table = table; + refcount_set(&table->refcount, 0); + table->esw_nb.notifier_call = mlx5_sf_esw_event; + err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb); + if (err) + goto reg_err; + + table->vhca_nb.notifier_call = mlx5_sf_vhca_event; + err = mlx5_vhca_event_notifier_register(table->dev, &table->vhca_nb); + if (err) + goto vhca_err; + + return 0; + +vhca_err: + mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb); +reg_err: + mutex_destroy(&table->sf_state_lock); + kfree(table); + dev->priv.sf_table = NULL; + return err; +} + +void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_table *table = dev->priv.sf_table; + + if (!table) + return; + + mlx5_vhca_event_notifier_unregister(table->dev, &table->vhca_nb); + mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb); + WARN_ON(refcount_read(&table->refcount)); + mutex_destroy(&table->sf_state_lock); + WARN_ON(!xa_empty(&table->port_indices)); + kfree(table); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c new file mode 100644 index 0000000000000000000000000000000000000000..a5a0f60bef66b94f1171dae22b9e7ecd1a174048 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c @@ -0,0 +1,231 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* Copyright (c) 2020 Mellanox Technologies Ltd */ +#include +#include "vhca_event.h" +#include "priv.h" +#include "sf.h" +#include "mlx5_ifc_vhca_event.h" +#include "vhca_event.h" +#include "mlx5_core.h" + +struct mlx5_sf_hw { + u32 usr_sfnum; + u8 allocated: 1; + u8 pending_delete: 1; +}; + +struct mlx5_sf_hw_table { + struct mlx5_core_dev *dev; + struct mlx5_sf_hw *sfs; + int max_local_functions; + struct mutex table_lock; /* Serializes sf deletion and vhca state change handler. 
*/ + struct notifier_block vhca_nb; +}; + +u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id) +{ + return sw_id + mlx5_sf_start_function_id(dev); +} + +static u16 mlx5_sf_hw_to_sw_id(const struct mlx5_core_dev *dev, u16 hw_id) +{ + return hw_id - mlx5_sf_start_function_id(dev); +} + +int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + int sw_id = -ENOSPC; + u16 hw_fn_id; + int err; + int i; + + if (!table->max_local_functions) + return -EOPNOTSUPP; + + mutex_lock(&table->table_lock); + /* Check if sf with same sfnum already exists or not. */ + for (i = 0; i < table->max_local_functions; i++) { + if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum) { + err = -EEXIST; + goto exist_err; + } + } + + /* Find the free entry and allocate the entry from the array */ + for (i = 0; i < table->max_local_functions; i++) { + if (!table->sfs[i].allocated) { + table->sfs[i].usr_sfnum = usr_sfnum; + table->sfs[i].allocated = true; + sw_id = i; + break; + } + } + if (sw_id == -ENOSPC) { + err = -ENOSPC; + goto exist_err; + } + + hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sw_id); + err = mlx5_cmd_alloc_sf(table->dev, hw_fn_id); + if (err) + goto err; + + err = mlx5_modify_vhca_sw_id(dev, hw_fn_id, usr_sfnum); + if (err) + goto vhca_err; + + mutex_unlock(&table->table_lock); + return sw_id; + +vhca_err: + mlx5_cmd_dealloc_sf(table->dev, hw_fn_id); +err: + table->sfs[i].allocated = false; +exist_err: + mutex_unlock(&table->table_lock); + return err; +} + +static void _mlx5_sf_hw_id_free(struct mlx5_core_dev *dev, u16 id) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + u16 hw_fn_id; + + hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, id); + mlx5_cmd_dealloc_sf(table->dev, hw_fn_id); + table->sfs[id].allocated = false; + table->sfs[id].pending_delete = false; +} + +void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + + mutex_lock(&table->table_lock); + _mlx5_sf_hw_id_free(dev, id); + mutex_unlock(&table->table_lock); +} + +void mlx5_sf_hw_table_sf_deferred_free(struct mlx5_core_dev *dev, u16 id) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + u32 out[MLX5_ST_SZ_DW(query_vhca_state_out)] = {}; + u16 hw_fn_id; + u8 state; + int err; + + hw_fn_id = mlx5_sf_sw_to_hw_id(dev, id); + mutex_lock(&table->table_lock); + err = mlx5_cmd_query_vhca_state(dev, hw_fn_id, out, sizeof(out)); + if (err) + goto err; + state = MLX5_GET(query_vhca_state_out, out, vhca_state_context.vhca_state); + if (state == MLX5_VHCA_STATE_ALLOCATED) { + mlx5_cmd_dealloc_sf(table->dev, hw_fn_id); + table->sfs[id].allocated = false; + } else { + table->sfs[id].pending_delete = true; + } +err: + mutex_unlock(&table->table_lock); +} + +static void mlx5_sf_hw_dealloc_all(struct mlx5_sf_hw_table *table) +{ + int i; + + for (i = 0; i < table->max_local_functions; i++) { + if (table->sfs[i].allocated) + _mlx5_sf_hw_id_free(table->dev, i); + } +} + +int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_hw_table *table; + struct mlx5_sf_hw *sfs; + int max_functions; + + if (!mlx5_sf_supported(dev) || !mlx5_vhca_event_supported(dev)) + return 0; + + max_functions = mlx5_sf_max_functions(dev); + table = kzalloc(sizeof(*table), GFP_KERNEL); + if (!table) + return -ENOMEM; + + sfs = kcalloc(max_functions, sizeof(*sfs), GFP_KERNEL); + if (!sfs) + goto table_err; + + mutex_init(&table->table_lock); + table->dev = dev; + table->sfs 
= sfs; + table->max_local_functions = max_functions; + dev->priv.sf_hw_table = table; + mlx5_core_dbg(dev, "SF HW table: max sfs = %d\n", max_functions); + return 0; + +table_err: + kfree(table); + return -ENOMEM; +} + +void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + + if (!table) + return; + + mutex_destroy(&table->table_lock); + kfree(table->sfs); + kfree(table); +} + +static int mlx5_sf_hw_vhca_event(struct notifier_block *nb, unsigned long opcode, void *data) +{ + struct mlx5_sf_hw_table *table = container_of(nb, struct mlx5_sf_hw_table, vhca_nb); + const struct mlx5_vhca_state_event *event = data; + struct mlx5_sf_hw *sf_hw; + u16 sw_id; + + if (event->new_vhca_state != MLX5_VHCA_STATE_ALLOCATED) + return 0; + + sw_id = mlx5_sf_hw_to_sw_id(table->dev, event->function_id); + sf_hw = &table->sfs[sw_id]; + + mutex_lock(&table->table_lock); + /* SF driver notified through firmware that SF is finally detached. + * Hence recycle the sf hardware id for reuse. + */ + if (sf_hw->allocated && sf_hw->pending_delete) + _mlx5_sf_hw_id_free(table->dev, sw_id); + mutex_unlock(&table->table_lock); + return 0; +} + +int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + + if (!table) + return 0; + + table->vhca_nb.notifier_call = mlx5_sf_hw_vhca_event; + return mlx5_vhca_event_notifier_register(table->dev, &table->vhca_nb); +} + +void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev) +{ + struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table; + + if (!table) + return; + + mlx5_vhca_event_notifier_unregister(table->dev, &table->vhca_nb); + /* Dealloc SFs whose firmware event has been missed. */ + mlx5_sf_hw_dealloc_all(table); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h new file mode 100644 index 0000000000000000000000000000000000000000..4fc870140d710115d7a39588852bc2094e958340 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h @@ -0,0 +1,82 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#ifndef __MLX5_IFC_VHCA_EVENT_H__ +#define __MLX5_IFC_VHCA_EVENT_H__ + +enum mlx5_ifc_vhca_state { + MLX5_VHCA_STATE_INVALID = 0x0, + MLX5_VHCA_STATE_ALLOCATED = 0x1, + MLX5_VHCA_STATE_ACTIVE = 0x2, + MLX5_VHCA_STATE_IN_USE = 0x3, + MLX5_VHCA_STATE_TEARDOWN_REQUEST = 0x4, +}; + +struct mlx5_ifc_vhca_state_context_bits { + u8 arm_change_event[0x1]; + u8 reserved_at_1[0xb]; + u8 vhca_state[0x4]; + u8 reserved_at_10[0x10]; + + u8 sw_function_id[0x20]; + + u8 reserved_at_40[0x40]; +}; + +struct mlx5_ifc_query_vhca_state_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 reserved_at_40[0x40]; + + struct mlx5_ifc_vhca_state_context_bits vhca_state_context; +}; + +struct mlx5_ifc_query_vhca_state_in_bits { + u8 opcode[0x10]; + u8 uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 embedded_cpu_function[0x1]; + u8 reserved_at_41[0xf]; + u8 function_id[0x10]; + + u8 reserved_at_60[0x20]; +}; + +struct mlx5_ifc_vhca_state_field_select_bits { + u8 reserved_at_0[0x1e]; + u8 sw_function_id[0x1]; + u8 arm_change_event[0x1]; +}; + +struct mlx5_ifc_modify_vhca_state_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 reserved_at_40[0x40]; +}; + +struct mlx5_ifc_modify_vhca_state_in_bits { + u8 opcode[0x10]; + u8 
uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 embedded_cpu_function[0x1]; + u8 reserved_at_41[0xf]; + u8 function_id[0x10]; + + struct mlx5_ifc_vhca_state_field_select_bits vhca_state_field_select; + + struct mlx5_ifc_vhca_state_context_bits vhca_state_context; +}; + +#endif diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h new file mode 100644 index 0000000000000000000000000000000000000000..cb02a51d09861b92ab61c9aaf43c7962fd65cf82 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#ifndef __MLX5_SF_PRIV_H__ +#define __MLX5_SF_PRIV_H__ + +#include + +int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id); +int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id); + +int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id); +int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id); + +u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id); + +int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum); +void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id); +void mlx5_sf_hw_table_sf_deferred_free(struct mlx5_core_dev *dev, u16 id); + +#endif diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h new file mode 100644 index 0000000000000000000000000000000000000000..0b6aea1e6a947940c1e5f19d3b904efa7a422df2 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h @@ -0,0 +1,100 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#ifndef __MLX5_SF_H__ +#define __MLX5_SF_H__ + +#include + +static inline u16 mlx5_sf_start_function_id(const struct mlx5_core_dev *dev) +{ + return MLX5_CAP_GEN(dev, sf_base_id); +} + +#ifdef CONFIG_MLX5_SF + +static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev) +{ + return MLX5_CAP_GEN(dev, sf); +} + +static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev) +{ + if (!mlx5_sf_supported(dev)) + return 0; + if (MLX5_CAP_GEN(dev, max_num_sf)) + return MLX5_CAP_GEN(dev, max_num_sf); + else + return 1 << MLX5_CAP_GEN(dev, log_max_sf); +} + +#else + +static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev) +{ + return false; +} + +static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev) +{ + return 0; +} + +#endif + +#ifdef CONFIG_MLX5_SF_MANAGER + +int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev); +void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev); + +int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev); +void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev); + +int mlx5_sf_table_init(struct mlx5_core_dev *dev); +void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev); + +int mlx5_devlink_sf_port_new(struct devlink *devlink, + const struct devlink_port_new_attrs *add_attr, + struct netlink_ext_ack *extack, + unsigned int *new_port_index); +int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index, + struct netlink_ext_ack *extack); +int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port, + enum devlink_port_fn_state *state, + enum devlink_port_fn_opstate *opstate, + struct netlink_ext_ack *extack); +int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port, + enum 
devlink_port_fn_state state, + struct netlink_ext_ack *extack); +#else + +static inline int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev) +{ + return 0; +} + +static inline void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev) +{ +} + +static inline int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev) +{ + return 0; +} + +static inline void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev) +{ +} + +static inline int mlx5_sf_table_init(struct mlx5_core_dev *dev) +{ + return 0; +} + +static inline void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev) +{ +} + +#endif + +#endif diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c new file mode 100644 index 0000000000000000000000000000000000000000..28b14b05086f636a1b2923252e3f0b74742bf0ec --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c @@ -0,0 +1,188 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#include +#include "mlx5_ifc_vhca_event.h" +#include "mlx5_core.h" +#include "vhca_event.h" +#include "ecpf.h" + +struct mlx5_vhca_state_notifier { + struct mlx5_core_dev *dev; + struct mlx5_nb nb; + struct blocking_notifier_head n_head; +}; + +struct mlx5_vhca_event_work { + struct work_struct work; + struct mlx5_vhca_state_notifier *notifier; + struct mlx5_vhca_state_event event; +}; + +int mlx5_cmd_query_vhca_state(struct mlx5_core_dev *dev, u16 function_id, u32 *out, u32 outlen) +{ + u32 in[MLX5_ST_SZ_DW(query_vhca_state_in)] = {}; + + MLX5_SET(query_vhca_state_in, in, opcode, MLX5_CMD_OP_QUERY_VHCA_STATE); + MLX5_SET(query_vhca_state_in, in, function_id, function_id); + MLX5_SET(query_vhca_state_in, in, embedded_cpu_function, 0); + + return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen); +} + +static int mlx5_cmd_modify_vhca_state(struct mlx5_core_dev *dev, u16 function_id, + u32 *in, u32 inlen) +{ + u32 out[MLX5_ST_SZ_DW(modify_vhca_state_out)] = {}; + + MLX5_SET(modify_vhca_state_in, in, opcode, MLX5_CMD_OP_MODIFY_VHCA_STATE); + MLX5_SET(modify_vhca_state_in, in, function_id, function_id); + MLX5_SET(modify_vhca_state_in, in, embedded_cpu_function, 0); + + return mlx5_cmd_exec(dev, in, inlen, out, sizeof(out)); +} + +int mlx5_modify_vhca_sw_id(struct mlx5_core_dev *dev, u16 function_id, u32 sw_fn_id) +{ + u32 out[MLX5_ST_SZ_DW(modify_vhca_state_out)] = {}; + u32 in[MLX5_ST_SZ_DW(modify_vhca_state_in)] = {}; + + MLX5_SET(modify_vhca_state_in, in, opcode, MLX5_CMD_OP_MODIFY_VHCA_STATE); + MLX5_SET(modify_vhca_state_in, in, function_id, function_id); + MLX5_SET(modify_vhca_state_in, in, embedded_cpu_function, 0); + MLX5_SET(modify_vhca_state_in, in, vhca_state_field_select.sw_function_id, 1); + MLX5_SET(modify_vhca_state_in, in, vhca_state_context.sw_function_id, sw_fn_id); + + return mlx5_cmd_exec_inout(dev, modify_vhca_state, in, out); +} + +int mlx5_vhca_event_arm(struct mlx5_core_dev *dev, u16 function_id) +{ + u32 in[MLX5_ST_SZ_DW(modify_vhca_state_in)] = {}; + + MLX5_SET(modify_vhca_state_in, in, vhca_state_context.arm_change_event, 1); + MLX5_SET(modify_vhca_state_in, in, vhca_state_field_select.arm_change_event, 1); + + return mlx5_cmd_modify_vhca_state(dev, function_id, in, sizeof(in)); +} + +static void +mlx5_vhca_event_notify(struct mlx5_core_dev *dev, struct mlx5_vhca_state_event *event) +{ + u32 out[MLX5_ST_SZ_DW(query_vhca_state_out)] = {}; + int err; + + err = mlx5_cmd_query_vhca_state(dev, event->function_id, out, sizeof(out)); + if (err) + return; + + 
event->sw_function_id = MLX5_GET(query_vhca_state_out, out, + vhca_state_context.sw_function_id); + event->new_vhca_state = MLX5_GET(query_vhca_state_out, out, + vhca_state_context.vhca_state); + + mlx5_vhca_event_arm(dev, event->function_id); + + blocking_notifier_call_chain(&dev->priv.vhca_state_notifier->n_head, 0, event); +} + +static void mlx5_vhca_state_work_handler(struct work_struct *_work) +{ + struct mlx5_vhca_event_work *work = container_of(_work, struct mlx5_vhca_event_work, work); + struct mlx5_vhca_state_notifier *notifier = work->notifier; + struct mlx5_core_dev *dev = notifier->dev; + + mlx5_vhca_event_notify(dev, &work->event); + kfree(work); +} + +static int +mlx5_vhca_state_change_notifier(struct notifier_block *nb, unsigned long type, void *data) +{ + struct mlx5_vhca_state_notifier *notifier = + mlx5_nb_cof(nb, struct mlx5_vhca_state_notifier, nb); + struct mlx5_vhca_event_work *work; + struct mlx5_eqe *eqe = data; + + work = kzalloc(sizeof(*work), GFP_ATOMIC); + if (!work) + return NOTIFY_DONE; + INIT_WORK(&work->work, &mlx5_vhca_state_work_handler); + work->notifier = notifier; + work->event.function_id = be16_to_cpu(eqe->data.vhca_state.function_id); + mlx5_events_work_enqueue(notifier->dev, &work->work); + return NOTIFY_OK; +} + +void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap) +{ + if (!mlx5_vhca_event_supported(dev)) + return; + + MLX5_SET(cmd_hca_cap, set_hca_cap, vhca_state, 1); + MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_allocated, 1); + MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_active, 1); + MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_in_use, 1); + MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_teardown_request, 1); +} + +int mlx5_vhca_event_init(struct mlx5_core_dev *dev) +{ + struct mlx5_vhca_state_notifier *notifier; + + if (!mlx5_vhca_event_supported(dev)) + return 0; + + notifier = kzalloc(sizeof(*notifier), GFP_KERNEL); + if (!notifier) + return -ENOMEM; + + dev->priv.vhca_state_notifier = notifier; + notifier->dev = dev; + BLOCKING_INIT_NOTIFIER_HEAD(¬ifier->n_head); + MLX5_NB_INIT(¬ifier->nb, mlx5_vhca_state_change_notifier, VHCA_STATE_CHANGE); + return 0; +} + +void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev) +{ + if (!mlx5_vhca_event_supported(dev)) + return; + + kfree(dev->priv.vhca_state_notifier); + dev->priv.vhca_state_notifier = NULL; +} + +void mlx5_vhca_event_start(struct mlx5_core_dev *dev) +{ + struct mlx5_vhca_state_notifier *notifier; + + if (!dev->priv.vhca_state_notifier) + return; + + notifier = dev->priv.vhca_state_notifier; + mlx5_eq_notifier_register(dev, ¬ifier->nb); +} + +void mlx5_vhca_event_stop(struct mlx5_core_dev *dev) +{ + struct mlx5_vhca_state_notifier *notifier; + + if (!dev->priv.vhca_state_notifier) + return; + + notifier = dev->priv.vhca_state_notifier; + mlx5_eq_notifier_unregister(dev, ¬ifier->nb); +} + +int mlx5_vhca_event_notifier_register(struct mlx5_core_dev *dev, struct notifier_block *nb) +{ + if (!dev->priv.vhca_state_notifier) + return -EOPNOTSUPP; + return blocking_notifier_chain_register(&dev->priv.vhca_state_notifier->n_head, nb); +} + +void mlx5_vhca_event_notifier_unregister(struct mlx5_core_dev *dev, struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(&dev->priv.vhca_state_notifier->n_head, nb); +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h new file mode 100644 index 
0000000000000000000000000000000000000000..013cdfe90616fe3b76e06b3ea0d72fa14b76320d --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* Copyright (c) 2020 Mellanox Technologies Ltd */ + +#ifndef __MLX5_VHCA_EVENT_H__ +#define __MLX5_VHCA_EVENT_H__ + +#ifdef CONFIG_MLX5_SF + +struct mlx5_vhca_state_event { + u16 function_id; + u16 sw_function_id; + u8 new_vhca_state; +}; + +static inline bool mlx5_vhca_event_supported(const struct mlx5_core_dev *dev) +{ + return MLX5_CAP_GEN_MAX(dev, vhca_state); +} + +void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap); +int mlx5_vhca_event_init(struct mlx5_core_dev *dev); +void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev); +void mlx5_vhca_event_start(struct mlx5_core_dev *dev); +void mlx5_vhca_event_stop(struct mlx5_core_dev *dev); +int mlx5_vhca_event_notifier_register(struct mlx5_core_dev *dev, struct notifier_block *nb); +void mlx5_vhca_event_notifier_unregister(struct mlx5_core_dev *dev, struct notifier_block *nb); +int mlx5_modify_vhca_sw_id(struct mlx5_core_dev *dev, u16 function_id, u32 sw_fn_id); +int mlx5_vhca_event_arm(struct mlx5_core_dev *dev, u16 function_id); +int mlx5_cmd_query_vhca_state(struct mlx5_core_dev *dev, u16 function_id, + u32 *out, u32 outlen); +#else + +static inline void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap) +{ +} + +static inline int mlx5_vhca_event_init(struct mlx5_core_dev *dev) +{ + return 0; +} + +static inline void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev) +{ +} + +static inline void mlx5_vhca_event_start(struct mlx5_core_dev *dev) +{ +} + +static inline void mlx5_vhca_event_stop(struct mlx5_core_dev *dev) +{ +} + +#endif + +#endif diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c index fc91bbf7d0c37abd8808bb5bb7b72280d67717fc..46ecfb023d8799c0815e6ef4f7b7b0e4bebaa1e4 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c @@ -36,6 +36,7 @@ #include #include #include "mlx5_core.h" +#include "sf/sf.h" /* Mutex to hold while enabling or disabling RoCE */ static DEFINE_MUTEX(mlx5_roce_en_lock); @@ -1158,6 +1159,6 @@ EXPORT_SYMBOL_GPL(mlx5_query_nic_system_image_guid); */ u16 mlx5_eswitch_get_total_vports(const struct mlx5_core_dev *dev) { - return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev); + return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev) + mlx5_sf_max_functions(dev); } EXPORT_SYMBOL_GPL(mlx5_eswitch_get_total_vports); diff --git a/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.c b/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.c index 017d68f1e1232c394aa4b2ac209300256fbbab51..972c571b41587a14518b71d8bacce2172f82a081 100644 --- a/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.c +++ b/drivers/net/ethernet/mellanox/mlxfw/mlxfw_mfa2_tlv_multi.c @@ -31,6 +31,8 @@ mlxfw_mfa2_tlv_next(const struct mlxfw_mfa2_file *mfa2_file, if (tlv->type == MLXFW_MFA2_TLV_MULTI_PART) { multi = mlxfw_mfa2_tlv_multi_get(mfa2_file, tlv); + if (!multi) + return NULL; tlv_len = NLA_ALIGN(tlv_len + be16_to_cpu(multi->total_len)); } diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c index ad655f0a4965ce87c60e9784262d88a774420879..e1aa56be9cc0b1cf19392922dec26d67701caa74 100644 --- a/drivers/net/ethernet/qualcomm/emac/emac.c +++ 
b/drivers/net/ethernet/qualcomm/emac/emac.c @@ -728,9 +728,15 @@ static int emac_remove(struct platform_device *pdev) struct net_device *netdev = dev_get_drvdata(&pdev->dev); struct emac_adapter *adpt = netdev_priv(netdev); + netif_carrier_off(netdev); + netif_tx_disable(netdev); + unregister_netdev(netdev); netif_napi_del(&adpt->rx_q.napi); + free_irq(adpt->irq.irq, &adpt->irq); + cancel_work_sync(&adpt->work_thread); + emac_clks_teardown(adpt); put_device(&adpt->phydev->mdio.dev); diff --git a/drivers/net/ethernet/xircom/xirc2ps_cs.c b/drivers/net/ethernet/xircom/xirc2ps_cs.c index 3e337142b5161c1c097788b236fb05336ac55672..56cef59c1c872c96c4e3a89a9d1d2482a3e36026 100644 --- a/drivers/net/ethernet/xircom/xirc2ps_cs.c +++ b/drivers/net/ethernet/xircom/xirc2ps_cs.c @@ -503,6 +503,11 @@ static void xirc2ps_detach(struct pcmcia_device *link) { struct net_device *dev = link->priv; + struct local_info *local = netdev_priv(dev); + + netif_carrier_off(dev); + netif_tx_disable(dev); + cancel_work_sync(&local->tx_timeout_task); dev_dbg(&link->dev, "detach\n"); diff --git a/drivers/nfc/st-nci/ndlc.c b/drivers/nfc/st-nci/ndlc.c index 5d74c674368a5481701b0dec38b1af1114f4de28..8ccf5a86ad1bb3b8a6d5324495db1c1bc2b2514d 100644 --- a/drivers/nfc/st-nci/ndlc.c +++ b/drivers/nfc/st-nci/ndlc.c @@ -286,13 +286,15 @@ EXPORT_SYMBOL(ndlc_probe); void ndlc_remove(struct llt_ndlc *ndlc) { - st_nci_remove(ndlc->ndev); - /* cancel timers */ del_timer_sync(&ndlc->t1_timer); del_timer_sync(&ndlc->t2_timer); ndlc->t2_active = false; ndlc->t1_active = false; + /* cancel work */ + cancel_work_sync(&ndlc->sm_work); + + st_nci_remove(ndlc->ndev); skb_queue_purge(&ndlc->rcv_q); skb_queue_purge(&ndlc->send_q); diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile index d7f6a87687b8d9448970119eadaa3ce5747e8252..cbc509784b2e731a4bba7cec3868a22c95730121 100644 --- a/drivers/nvme/host/Makefile +++ b/drivers/nvme/host/Makefile @@ -9,7 +9,7 @@ obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o obj-$(CONFIG_NVME_FC) += nvme-fc.o obj-$(CONFIG_NVME_TCP) += nvme-tcp.o -nvme-core-y := core.o +nvme-core-y := core.o ioctl.o nvme-core-$(CONFIG_TRACING) += trace.o nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o nvme-core-$(CONFIG_NVM) += lightnvm.o diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index a1f977d9c609f85181c3d753f370a50401438591..7e9a594d86dcc5475e24703f74399dd9287950bd 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -84,10 +84,14 @@ static LIST_HEAD(nvme_subsystems); static DEFINE_MUTEX(nvme_subsystems_lock); static DEFINE_IDA(nvme_instance_ida); -static dev_t nvme_chr_devt; +static dev_t nvme_ctrl_base_chr_devt; static struct class *nvme_class; static struct class *nvme_subsys_class; +static DEFINE_IDA(nvme_ns_chr_minor_ida); +static dev_t nvme_ns_chr_devt; +static struct class *nvme_ns_chr_class; + static void nvme_put_subsystem(struct nvme_subsystem *subsys); static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl, unsigned nsid); @@ -122,7 +126,7 @@ static void nvme_set_queue_dying(struct nvme_ns *ns) nvme_update_bdev_size(ns->disk); } -static void nvme_queue_scan(struct nvme_ctrl *ctrl) +void nvme_queue_scan(struct nvme_ctrl *ctrl) { /* * Only new queue scan work when admin and IO queues are both alive @@ -561,7 +565,12 @@ static void nvme_free_ns_head(struct kref *ref) kfree(head); } -static void nvme_put_ns_head(struct nvme_ns_head *head) +bool nvme_tryget_ns_head(struct nvme_ns_head *head) +{ + return kref_get_unless_zero(&head->ref); +} + +void 
nvme_put_ns_head(struct nvme_ns_head *head) { kref_put(&head->ref, nvme_free_ns_head); } @@ -592,13 +601,8 @@ static inline void nvme_clear_nvme_request(struct request *req) req->rq_flags |= RQF_DONTPREP; } -static inline unsigned int nvme_req_op(struct nvme_command *cmd) -{ - return nvme_is_write(cmd) ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN; -} - -static inline void nvme_init_request(struct request *req, - struct nvme_command *cmd) +/* initialize a passthrough request */ +void nvme_init_request(struct request *req, struct nvme_command *cmd) { if (req->q->queuedata) req->timeout = NVME_IO_TIMEOUT; @@ -609,18 +613,7 @@ static inline void nvme_init_request(struct request *req, nvme_clear_nvme_request(req); nvme_req(req)->cmd = cmd; } - -struct request *nvme_alloc_request(struct request_queue *q, - struct nvme_command *cmd, blk_mq_req_flags_t flags) -{ - struct request *req; - - req = blk_mq_alloc_request(q, nvme_req_op(cmd), flags); - if (!IS_ERR(req)) - nvme_init_request(req, cmd); - return req; -} -EXPORT_SYMBOL_GPL(nvme_alloc_request); +EXPORT_SYMBOL_GPL(nvme_init_request); /* * For something we're not in a state to send to the device the default action @@ -683,19 +676,6 @@ bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq, } EXPORT_SYMBOL_GPL(__nvme_check_ready); -struct request *nvme_alloc_request_qid(struct request_queue *q, - struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid) -{ - struct request *req; - - req = blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), flags, - qid ? qid - 1 : 0); - if (!IS_ERR(req)) - nvme_init_request(req, cmd); - return req; -} -EXPORT_SYMBOL_GPL(nvme_alloc_request_qid); - static int nvme_toggle_streams(struct nvme_ctrl *ctrl, bool enable) { struct nvme_command c; @@ -1069,11 +1049,14 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, int ret; if (qid == NVME_QID_ANY) - req = nvme_alloc_request(q, cmd, flags); + req = blk_mq_alloc_request(q, nvme_req_op(cmd), flags); else - req = nvme_alloc_request_qid(q, cmd, flags, qid); + req = blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), flags, + qid ? 
qid - 1 : 0); + if (IS_ERR(req)) return PTR_ERR(req); + nvme_init_request(req, cmd); if (timeout) req->timeout = timeout; @@ -1108,40 +1091,6 @@ int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, } EXPORT_SYMBOL_GPL(nvme_submit_sync_cmd); -static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf, - unsigned len, u32 seed, bool write) -{ - struct bio_integrity_payload *bip; - int ret = -ENOMEM; - void *buf; - - buf = kmalloc(len, GFP_KERNEL); - if (!buf) - goto out; - - ret = -EFAULT; - if (write && copy_from_user(buf, ubuf, len)) - goto out_free_meta; - - bip = bio_integrity_alloc(bio, GFP_KERNEL, 1); - if (IS_ERR(bip)) { - ret = PTR_ERR(bip); - goto out_free_meta; - } - - bip->bip_iter.bi_size = len; - bip->bip_iter.bi_sector = seed; - ret = bio_integrity_add_page(bio, virt_to_page(buf), len, - offset_in_page(buf)); - if (ret == len) - return buf; - ret = -ENOMEM; -out_free_meta: - kfree(buf); -out: - return ERR_PTR(ret); -} - static u32 nvme_known_admin_effects(u8 opcode) { switch (opcode) { @@ -1228,72 +1177,6 @@ void nvme_execute_passthru_rq(struct request *rq, u32 *effects) } EXPORT_SYMBOL_NS_GPL(nvme_execute_passthru_rq, NVME_TARGET_PASSTHRU); -static int nvme_submit_user_cmd(struct request_queue *q, - struct nvme_command *cmd, void __user *ubuffer, - unsigned bufflen, void __user *meta_buffer, unsigned meta_len, - u32 meta_seed, u64 *result, unsigned timeout) -{ - struct nvme_ctrl *ctrl; - bool write = nvme_is_write(cmd); - struct nvme_ns *ns = q->queuedata; - struct gendisk *disk = ns ? ns->disk : NULL; - struct request *req; - struct bio *bio = NULL; - void *meta = NULL; - u32 effects; - int ret; - - req = nvme_alloc_request(q, cmd, 0); - if (IS_ERR(req)) - return PTR_ERR(req); - - if (timeout) - req->timeout = timeout; - nvme_req(req)->flags |= NVME_REQ_USERCMD; - - if (ubuffer && bufflen) { - ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, - GFP_KERNEL); - if (ret) - goto out; - bio = req->bio; - bio->bi_disk = disk; - if (disk && meta_buffer && meta_len) { - meta = nvme_add_user_metadata(bio, meta_buffer, meta_len, - meta_seed, write); - if (IS_ERR(meta)) { - ret = PTR_ERR(meta); - goto out_unmap; - } - req->cmd_flags |= REQ_INTEGRITY; - } - } - - ctrl = nvme_req(req)->ctrl; - nvme_execute_passthru_rq(req, &effects); - if (nvme_req(req)->flags & NVME_REQ_CANCELLED) - ret = -EINTR; - else - ret = nvme_req(req)->status; - if (result) - *result = le64_to_cpu(nvme_req(req)->result.u64); - if (meta && !ret && !write) { - if (copy_to_user(meta_buffer, meta, meta_len)) - ret = -EFAULT; - } - kfree(meta); - out_unmap: - if (bio) - blk_rq_unmap_user(bio); - out: - blk_mq_free_request(req); - - if (effects) - nvme_passthru_end(ctrl, effects); - - return ret; -} - static void nvme_keep_alive_end_io(struct request *rq, blk_status_t status) { struct nvme_ctrl *ctrl = rq->end_io_data; @@ -1323,10 +1206,11 @@ static int nvme_keep_alive(struct nvme_ctrl *ctrl) { struct request *rq; - rq = nvme_alloc_request(ctrl->admin_q, &ctrl->ka_cmd, + rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd), BLK_MQ_REQ_RESERVED); if (IS_ERR(rq)) return PTR_ERR(rq); + nvme_init_request(rq, &ctrl->ka_cmd); rq->timeout = ctrl->kato * HZ; rq->end_io_data = ctrl; @@ -1649,170 +1533,6 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl) queue_work(nvme_wq, &ctrl->async_event_work); } -/* - * Convert integer values from ioctl structures to user pointers, silently - * ignoring the upper bits in the compat case to match behaviour of 32-bit - * kernels. 
- */ -static void __user *nvme_to_user_ptr(uintptr_t ptrval) -{ - if (in_compat_syscall()) - ptrval = (compat_uptr_t)ptrval; - return (void __user *)ptrval; -} - -static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) -{ - struct nvme_user_io io; - struct nvme_command c; - unsigned length, meta_len; - void __user *metadata; - - if (copy_from_user(&io, uio, sizeof(io))) - return -EFAULT; - if (io.flags) - return -EINVAL; - - switch (io.opcode) { - case nvme_cmd_write: - case nvme_cmd_read: - case nvme_cmd_compare: - break; - default: - return -EINVAL; - } - - length = (io.nblocks + 1) << ns->lba_shift; - - if ((io.control & NVME_RW_PRINFO_PRACT) && - ns->ms == sizeof(struct t10_pi_tuple)) { - /* - * Protection information is stripped/inserted by the - * controller. - */ - if (nvme_to_user_ptr(io.metadata)) - return -EINVAL; - meta_len = 0; - metadata = NULL; - } else { - meta_len = (io.nblocks + 1) * ns->ms; - metadata = nvme_to_user_ptr(io.metadata); - } - - if (ns->features & NVME_NS_EXT_LBAS) { - length += meta_len; - meta_len = 0; - } else if (meta_len) { - if ((io.metadata & 3) || !io.metadata) - return -EINVAL; - } - - memset(&c, 0, sizeof(c)); - c.rw.opcode = io.opcode; - c.rw.flags = io.flags; - c.rw.nsid = cpu_to_le32(ns->head->ns_id); - c.rw.slba = cpu_to_le64(io.slba); - c.rw.length = cpu_to_le16(io.nblocks); - c.rw.control = cpu_to_le16(io.control); - c.rw.dsmgmt = cpu_to_le32(io.dsmgmt); - c.rw.reftag = cpu_to_le32(io.reftag); - c.rw.apptag = cpu_to_le16(io.apptag); - c.rw.appmask = cpu_to_le16(io.appmask); - - return nvme_submit_user_cmd(ns->queue, &c, - nvme_to_user_ptr(io.addr), length, - metadata, meta_len, lower_32_bits(io.slba), NULL, 0); -} - -static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, - struct nvme_passthru_cmd __user *ucmd) -{ - struct nvme_passthru_cmd cmd; - struct nvme_command c; - unsigned timeout = 0; - u64 result; - int status; - - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; - if (copy_from_user(&cmd, ucmd, sizeof(cmd))) - return -EFAULT; - if (cmd.flags) - return -EINVAL; - - memset(&c, 0, sizeof(c)); - c.common.opcode = cmd.opcode; - c.common.flags = cmd.flags; - c.common.nsid = cpu_to_le32(cmd.nsid); - c.common.cdw2[0] = cpu_to_le32(cmd.cdw2); - c.common.cdw2[1] = cpu_to_le32(cmd.cdw3); - c.common.cdw10 = cpu_to_le32(cmd.cdw10); - c.common.cdw11 = cpu_to_le32(cmd.cdw11); - c.common.cdw12 = cpu_to_le32(cmd.cdw12); - c.common.cdw13 = cpu_to_le32(cmd.cdw13); - c.common.cdw14 = cpu_to_le32(cmd.cdw14); - c.common.cdw15 = cpu_to_le32(cmd.cdw15); - - if (cmd.timeout_ms) - timeout = msecs_to_jiffies(cmd.timeout_ms); - - status = nvme_submit_user_cmd(ns ? 
ns->queue : ctrl->admin_q, &c, - nvme_to_user_ptr(cmd.addr), cmd.data_len, - nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, - 0, &result, timeout); - - if (status >= 0) { - if (put_user(result, &ucmd->result)) - return -EFAULT; - } - - return status; -} - -static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, - struct nvme_passthru_cmd64 __user *ucmd) -{ - struct nvme_passthru_cmd64 cmd; - struct nvme_command c; - unsigned timeout = 0; - int status; - - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; - if (copy_from_user(&cmd, ucmd, sizeof(cmd))) - return -EFAULT; - if (cmd.flags) - return -EINVAL; - - memset(&c, 0, sizeof(c)); - c.common.opcode = cmd.opcode; - c.common.flags = cmd.flags; - c.common.nsid = cpu_to_le32(cmd.nsid); - c.common.cdw2[0] = cpu_to_le32(cmd.cdw2); - c.common.cdw2[1] = cpu_to_le32(cmd.cdw3); - c.common.cdw10 = cpu_to_le32(cmd.cdw10); - c.common.cdw11 = cpu_to_le32(cmd.cdw11); - c.common.cdw12 = cpu_to_le32(cmd.cdw12); - c.common.cdw13 = cpu_to_le32(cmd.cdw13); - c.common.cdw14 = cpu_to_le32(cmd.cdw14); - c.common.cdw15 = cpu_to_le32(cmd.cdw15); - - if (cmd.timeout_ms) - timeout = msecs_to_jiffies(cmd.timeout_ms); - - status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c, - nvme_to_user_ptr(cmd.addr), cmd.data_len, - nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, - 0, &cmd.result, timeout); - - if (status >= 0) { - if (put_user(cmd.result, &ucmd->result)) - return -EFAULT; - } - - return status; -} - /* * Issue ioctl requests on the first available path. Note that unlike normal * block layer requests we will not retry failed request on another controller. @@ -1843,136 +1563,12 @@ void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx) srcu_read_unlock(&head->srcu, idx); } -static bool is_ctrl_ioctl(unsigned int cmd) +static int nvme_ns_open(struct nvme_ns *ns) { - if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD) - return true; - if (is_sed_ioctl(cmd)) - return true; - return false; -} - -static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, - void __user *argp, - struct nvme_ns_head *head, - int srcu_idx) -{ - struct nvme_ctrl *ctrl = ns->ctrl; - int ret; - - nvme_get_ctrl(ns->ctrl); - nvme_put_ns_from_disk(head, srcu_idx); - switch (cmd) { - case NVME_IOCTL_ADMIN_CMD: - ret = nvme_user_cmd(ctrl, NULL, argp); - break; - case NVME_IOCTL_ADMIN64_CMD: - ret = nvme_user_cmd64(ctrl, NULL, argp); - break; - default: - ret = sed_ioctl(ctrl->opal_dev, cmd, argp); - break; - } - nvme_put_ctrl(ctrl); - return ret; -} - -static int nvme_ioctl(struct block_device *bdev, fmode_t mode, - unsigned int cmd, unsigned long arg) -{ - struct nvme_ns_head *head = NULL; - void __user *argp = (void __user *)arg; - struct nvme_ns *ns; - int srcu_idx, ret; - - ns = nvme_get_ns_from_disk(bdev->bd_disk, &head, &srcu_idx); - if (unlikely(!ns)) - return -EWOULDBLOCK; - - /* - * Handle ioctls that apply to the controller instead of the namespace - * seperately and drop the ns SRCU reference early. This avoids a - * deadlock when deleting namespaces using the passthrough interface. 
- */ - if (is_ctrl_ioctl(cmd)) - return nvme_handle_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); - - switch (cmd) { - case NVME_IOCTL_ID: - force_successful_syscall_return(); - ret = ns->head->ns_id; - break; - case NVME_IOCTL_IO_CMD: - ret = nvme_user_cmd(ns->ctrl, ns, argp); - break; - case NVME_IOCTL_SUBMIT_IO: - ret = nvme_submit_io(ns, argp); - break; - case NVME_IOCTL_IO64_CMD: - ret = nvme_user_cmd64(ns->ctrl, ns, argp); - break; - default: - if (ns->ndev) - ret = nvme_nvm_ioctl(ns, cmd, arg); - else - ret = -ENOTTY; - } - - nvme_put_ns_from_disk(head, srcu_idx); - return ret; -} - -#ifdef CONFIG_COMPAT -struct nvme_user_io32 { - __u8 opcode; - __u8 flags; - __u16 control; - __u16 nblocks; - __u16 rsvd; - __u64 metadata; - __u64 addr; - __u64 slba; - __u32 dsmgmt; - __u32 reftag; - __u16 apptag; - __u16 appmask; -} __attribute__((__packed__)); - -#define NVME_IOCTL_SUBMIT_IO32 _IOW('N', 0x42, struct nvme_user_io32) - -static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode, - unsigned int cmd, unsigned long arg) -{ - /* - * Corresponds to the difference of NVME_IOCTL_SUBMIT_IO - * between 32 bit programs and 64 bit kernel. - * The cause is that the results of sizeof(struct nvme_user_io), - * which is used to define NVME_IOCTL_SUBMIT_IO, - * are not same between 32 bit compiler and 64 bit compiler. - * NVME_IOCTL_SUBMIT_IO32 is for 64 bit kernel handling - * NVME_IOCTL_SUBMIT_IO issued from 32 bit programs. - * Other IOCTL numbers are same between 32 bit and 64 bit. - * So there is nothing to do regarding to other IOCTL numbers. - */ - if (cmd == NVME_IOCTL_SUBMIT_IO32) - return nvme_ioctl(bdev, mode, NVME_IOCTL_SUBMIT_IO, arg); - - return nvme_ioctl(bdev, mode, cmd, arg); -} -#else -#define nvme_compat_ioctl NULL -#endif /* CONFIG_COMPAT */ - -static int nvme_open(struct block_device *bdev, fmode_t mode) -{ - struct nvme_ns *ns = bdev->bd_disk->private_data; - -#ifdef CONFIG_NVME_MULTIPATH /* should never be called due to GENHD_FL_HIDDEN */ - if (WARN_ON_ONCE(ns->head->disk)) + if (WARN_ON_ONCE(nvme_ns_head_multipath(ns->head))) goto fail; -#endif if (!kref_get_unless_zero(&ns->kref)) goto fail; if (!try_module_get(ns->ctrl->ops->module)) @@ -1986,15 +1582,24 @@ static int nvme_open(struct block_device *bdev, fmode_t mode) return -ENXIO; } -static void nvme_release(struct gendisk *disk, fmode_t mode) +static void nvme_ns_release(struct nvme_ns *ns) { - struct nvme_ns *ns = disk->private_data; module_put(ns->ctrl->ops->module); nvme_put_ns(ns); } -static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo) +static int nvme_open(struct block_device *bdev, fmode_t mode) +{ + return nvme_ns_open(bdev->bd_disk->private_data); +} + +static void nvme_release(struct gendisk *disk, fmode_t mode) +{ + nvme_ns_release(disk->private_data); +} + +int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo) { /* some standard values */ geo->heads = 1 << 6; @@ -2319,11 +1924,10 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_id_ns *id) if (blk_queue_is_zoned(ns->queue)) { ret = nvme_revalidate_zones(ns); if (ret && !nvme_first_scan(ns->disk)) - return ret; + goto out; } -#ifdef CONFIG_NVME_MULTIPATH - if (ns->head->disk) { + if (nvme_ns_head_multipath(ns->head)) { blk_mq_freeze_queue(ns->head->disk->queue); nvme_update_disk_info(ns->head->disk, ns, id); blk_stack_limits(&ns->head->disk->queue->limits, @@ -2332,11 +1936,19 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_id_ns *id) nvme_update_bdev_size(ns->head->disk); 
blk_mq_unfreeze_queue(ns->head->disk->queue); } -#endif return 0; out_unfreeze: blk_mq_unfreeze_queue(ns->disk->queue); +out: + /* + * If probing fails due an unsupported feature, hide the block device, + * but still allow other access. + */ + if (ret == -ENODEV) { + ns->disk->flags |= GENHD_FL_HIDDEN; + ret = 0; + } return ret; } @@ -2432,7 +2044,7 @@ static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release); } -static const struct pr_ops nvme_pr_ops = { +const struct pr_ops nvme_pr_ops = { .pr_register = nvme_pr_register, .pr_reserve = nvme_pr_reserve, .pr_release = nvme_pr_release, @@ -2462,10 +2074,9 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len, EXPORT_SYMBOL_GPL(nvme_sec_submit); #endif /* CONFIG_BLK_SED_OPAL */ -static const struct block_device_operations nvme_fops = { +static const struct block_device_operations nvme_bdev_ops = { .owner = THIS_MODULE, .ioctl = nvme_ioctl, - .compat_ioctl = nvme_compat_ioctl, .open = nvme_open, .release = nvme_release, .getgeo = nvme_getgeo, @@ -2473,34 +2084,6 @@ static const struct block_device_operations nvme_fops = { .pr_ops = &nvme_pr_ops, }; -#ifdef CONFIG_NVME_MULTIPATH -static int nvme_ns_head_open(struct block_device *bdev, fmode_t mode) -{ - struct nvme_ns_head *head = bdev->bd_disk->private_data; - - if (!kref_get_unless_zero(&head->ref)) - return -ENXIO; - return 0; -} - -static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode) -{ - nvme_put_ns_head(disk->private_data); -} - -const struct block_device_operations nvme_ns_head_ops = { - .owner = THIS_MODULE, - .submit_bio = nvme_ns_head_submit_bio, - .open = nvme_ns_head_open, - .release = nvme_ns_head_release, - .ioctl = nvme_ioctl, - .compat_ioctl = nvme_compat_ioctl, - .getgeo = nvme_getgeo, - .report_zones = nvme_report_zones, - .pr_ops = &nvme_pr_ops, -}; -#endif /* CONFIG_NVME_MULTIPATH */ - static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled) { unsigned long timeout = @@ -3407,77 +2990,13 @@ static int nvme_dev_release(struct inode *inode, struct file *file) return 0; } -static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp) -{ - struct nvme_ns *ns; - int ret; - - down_read(&ctrl->namespaces_rwsem); - if (list_empty(&ctrl->namespaces)) { - ret = -ENOTTY; - goto out_unlock; - } - - ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list); - if (ns != list_last_entry(&ctrl->namespaces, struct nvme_ns, list)) { - dev_warn(ctrl->device, - "NVME_IOCTL_IO_CMD not supported when multiple namespaces present!\n"); - ret = -EINVAL; - goto out_unlock; - } - - dev_warn(ctrl->device, - "using deprecated NVME_IOCTL_IO_CMD ioctl on the char device!\n"); - kref_get(&ns->kref); - up_read(&ctrl->namespaces_rwsem); - - ret = nvme_user_cmd(ctrl, ns, argp); - nvme_put_ns(ns); - return ret; - -out_unlock: - up_read(&ctrl->namespaces_rwsem); - return ret; -} - -static long nvme_dev_ioctl(struct file *file, unsigned int cmd, - unsigned long arg) -{ - struct nvme_ctrl *ctrl = file->private_data; - void __user *argp = (void __user *)arg; - - switch (cmd) { - case NVME_IOCTL_ADMIN_CMD: - return nvme_user_cmd(ctrl, NULL, argp); - case NVME_IOCTL_ADMIN64_CMD: - return nvme_user_cmd64(ctrl, NULL, argp); - case NVME_IOCTL_IO_CMD: - return nvme_dev_user_cmd(ctrl, argp); - case NVME_IOCTL_RESET: - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; - dev_warn(ctrl->device, "resetting controller\n"); - return nvme_reset_ctrl_sync(ctrl); - case 
NVME_IOCTL_SUBSYS_RESET: - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; - return nvme_reset_subsystem(ctrl); - case NVME_IOCTL_RESCAN: - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; - nvme_queue_scan(ctrl); - return 0; - default: - return -ENOTTY; - } -} - static const struct file_operations nvme_dev_fops = { .owner = THIS_MODULE, .open = nvme_dev_open, .release = nvme_dev_release, .unlocked_ioctl = nvme_dev_ioctl, .compat_ioctl = compat_ptr_ioctl, + .uring_cmd = nvme_dev_uring_cmd, }; static ssize_t nvme_sysfs_reset(struct device *dev, @@ -3509,7 +3028,7 @@ static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev) { struct gendisk *disk = dev_to_disk(dev); - if (disk->fops == &nvme_fops) + if (disk->fops == &nvme_bdev_ops) return nvme_get_ns_from_dev(dev)->head; else return disk->private_data; @@ -3618,7 +3137,7 @@ static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj, } #ifdef CONFIG_NVME_MULTIPATH if (a == &dev_attr_ana_grpid.attr || a == &dev_attr_ana_state.attr) { - if (dev_to_disk(dev)->fops != &nvme_fops) /* per-path attr */ + if (dev_to_disk(dev)->fops != &nvme_bdev_ops) /* per-path attr */ return 0; if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl)) return 0; @@ -3876,7 +3395,7 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_subsystem *subsys, lockdep_assert_held(&subsys->lock); list_for_each_entry(h, &subsys->nsheads, entry) { - if (h->ns_id == nsid && kref_get_unless_zero(&h->ref)) + if (h->ns_id == nsid && nvme_tryget_ns_head(h)) return h; } @@ -3898,6 +3417,73 @@ static int nvme_subsys_check_duplicate_ids(struct nvme_subsystem *subsys, return 0; } +static void nvme_cdev_rel(struct device *dev) +{ + ida_simple_remove(&nvme_ns_chr_minor_ida, MINOR(dev->devt)); +} + +void nvme_cdev_del(struct cdev *cdev, struct device *cdev_device) +{ + cdev_device_del(cdev, cdev_device); + put_device(cdev_device); +} + +int nvme_cdev_add(struct cdev *cdev, struct device *cdev_device, + const struct file_operations *fops, struct module *owner) +{ + int minor, ret; + + minor = ida_simple_get(&nvme_ns_chr_minor_ida, 0, 0, GFP_KERNEL); + if (minor < 0) + return minor; + cdev_device->devt = MKDEV(MAJOR(nvme_ns_chr_devt), minor); + cdev_device->class = nvme_ns_chr_class; + cdev_device->release = nvme_cdev_rel; + device_initialize(cdev_device); + cdev_init(cdev, fops); + cdev->owner = owner; + ret = cdev_device_add(cdev, cdev_device); + if (ret) + put_device(cdev_device); + + return ret; +} + +static int nvme_ns_chr_open(struct inode *inode, struct file *file) +{ + return nvme_ns_open(container_of(inode->i_cdev, struct nvme_ns, cdev)); +} + +static int nvme_ns_chr_release(struct inode *inode, struct file *file) +{ + nvme_ns_release(container_of(inode->i_cdev, struct nvme_ns, cdev)); + return 0; +} + +static const struct file_operations nvme_ns_chr_fops = { + .owner = THIS_MODULE, + .open = nvme_ns_chr_open, + .release = nvme_ns_chr_release, + .unlocked_ioctl = nvme_ns_chr_ioctl, + .compat_ioctl = compat_ptr_ioctl, + .uring_cmd = nvme_ns_chr_uring_cmd, + .uring_cmd_iopoll = nvme_ns_chr_uring_cmd_iopoll, +}; + +static int nvme_add_ns_cdev(struct nvme_ns *ns) +{ + int ret; + + ns->cdev_device.parent = ns->ctrl->device; + ret = dev_set_name(&ns->cdev_device, "ng%dn%d", + ns->ctrl->instance, ns->head->instance); + if (ret) + return ret; + + return nvme_cdev_add(&ns->cdev, &ns->cdev_device, &nvme_ns_chr_fops, + ns->ctrl->ops->module); +} + static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl, unsigned nsid, struct nvme_ns_ids *ids) { @@ -4045,8 
+3631,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid, struct nvme_ns *ns; struct gendisk *disk; struct nvme_id_ns *id; - char disk_name[DISK_NAME_LEN]; - int node = ctrl->numa_node, flags = GENHD_FL_EXT_DEVT, ret; + int node = ctrl->numa_node; if (nvme_identify_ns(ctrl, nsid, ids, &id)) return; @@ -4070,28 +3655,32 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid, ns->ctrl = ctrl; kref_init(&ns->kref); - ret = nvme_init_ns_head(ns, nsid, ids, id->nmic & NVME_NS_NMIC_SHARED); - if (ret) + if (nvme_init_ns_head(ns, nsid, ids, id->nmic & NVME_NS_NMIC_SHARED)) goto out_free_queue; - nvme_set_disk_name(disk_name, ns, ctrl, &flags); disk = alloc_disk_node(0, node); if (!disk) goto out_unlink_ns; - disk->fops = &nvme_fops; + disk->fops = &nvme_bdev_ops; disk->private_data = ns; disk->queue = ns->queue; - disk->flags = flags; - memcpy(disk->disk_name, disk_name, DISK_NAME_LEN); + disk->flags = GENHD_FL_EXT_DEVT; + /* + * Without the multipath code enabled, multiple controller per + * subsystems are visible as devices and thus we cannot use the + * subsystem instance. + */ + if (!nvme_mpath_set_disk_name(ns, disk->disk_name, &disk->flags)) + sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, + ns->head->instance); ns->disk = disk; if (nvme_update_ns_info(ns, id)) goto out_put_disk; if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) { - ret = nvme_nvm_register(ns, disk_name, node); - if (ret) { + if (nvme_nvm_register(ns, disk->disk_name, node)) { dev_warn(ctrl->device, "LightNVM init failure\n"); goto out_put_disk; } @@ -4103,6 +3692,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid, nvme_get_ctrl(ctrl); device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups); + if (!nvme_ns_head_multipath(ns->head)) + nvme_add_ns_cdev(ns); nvme_mpath_add_disk(ns, id); nvme_fault_inject_init(&ns->fault_inject, ns->disk->disk_name); @@ -4147,6 +3738,8 @@ static void nvme_ns_remove(struct nvme_ns *ns) synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */ if (ns->disk->flags & GENHD_FL_UP) { + if (!nvme_ns_head_multipath(ns->head)) + nvme_cdev_del(&ns->cdev, &ns->cdev_device); del_gendisk(ns->disk); blk_cleanup_queue(ns->queue); if (blk_get_integrity(ns->disk)) @@ -4713,7 +4306,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, device_initialize(&ctrl->ctrl_device); ctrl->device = &ctrl->ctrl_device; - ctrl->device->devt = MKDEV(MAJOR(nvme_chr_devt), ctrl->instance); + ctrl->device->devt = MKDEV(MAJOR(nvme_ctrl_base_chr_devt), + ctrl->instance); ctrl->device->class = nvme_class; ctrl->device->parent = ctrl->dev; ctrl->device->groups = nvme_dev_attr_groups; @@ -4923,7 +4517,8 @@ static int __init nvme_core_init(void) if (!nvme_delete_wq) goto destroy_reset_wq; - result = alloc_chrdev_region(&nvme_chr_devt, 0, NVME_MINORS, "nvme"); + result = alloc_chrdev_region(&nvme_ctrl_base_chr_devt, 0, + NVME_MINORS, "nvme"); if (result < 0) goto destroy_delete_wq; @@ -4939,12 +4534,28 @@ static int __init nvme_core_init(void) result = PTR_ERR(nvme_subsys_class); goto destroy_class; } + + result = alloc_chrdev_region(&nvme_ns_chr_devt, 0, NVME_MINORS, + "nvme-generic"); + if (result < 0) + goto destroy_subsys_class; + + nvme_ns_chr_class = class_create(THIS_MODULE, "nvme-generic"); + if (IS_ERR(nvme_ns_chr_class)) { + result = PTR_ERR(nvme_ns_chr_class); + goto unregister_generic_ns; + } + return 0; +unregister_generic_ns: + unregister_chrdev_region(nvme_ns_chr_devt, NVME_MINORS); +destroy_subsys_class: + 
class_destroy(nvme_subsys_class); destroy_class: class_destroy(nvme_class); unregister_chrdev: - unregister_chrdev_region(nvme_chr_devt, NVME_MINORS); + unregister_chrdev_region(nvme_ctrl_base_chr_devt, NVME_MINORS); destroy_delete_wq: destroy_workqueue(nvme_delete_wq); destroy_reset_wq: @@ -4957,12 +4568,15 @@ static int __init nvme_core_init(void) static void __exit nvme_core_exit(void) { + class_destroy(nvme_ns_chr_class); class_destroy(nvme_subsys_class); class_destroy(nvme_class); - unregister_chrdev_region(nvme_chr_devt, NVME_MINORS); + unregister_chrdev_region(nvme_ns_chr_devt, NVME_MINORS); + unregister_chrdev_region(nvme_ctrl_base_chr_devt, NVME_MINORS); destroy_workqueue(nvme_delete_wq); destroy_workqueue(nvme_reset_wq); destroy_workqueue(nvme_wq); + ida_destroy(&nvme_ns_chr_minor_ida); ida_destroy(&nvme_instance_ida); } diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c new file mode 100644 index 0000000000000000000000000000000000000000..ab348dee23cf5863ec6b50608f755764e18c23e5 --- /dev/null +++ b/drivers/nvme/host/ioctl.c @@ -0,0 +1,845 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2011-2014, Intel Corporation. + * Copyright (c) 2017-2021 Christoph Hellwig. + */ +#include /* for force_successful_syscall_return */ +#include +#include +#include "nvme.h" + +/* + * Convert integer values from ioctl structures to user pointers, silently + * ignoring the upper bits in the compat case to match behaviour of 32-bit + * kernels. + */ +static void __user *nvme_to_user_ptr(uintptr_t ptrval) +{ + if (in_compat_syscall()) + ptrval = (compat_uptr_t)ptrval; + return (void __user *)ptrval; +} + +static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf, + unsigned len, u32 seed, bool write) +{ + struct bio_integrity_payload *bip; + int ret = -ENOMEM; + void *buf; + + buf = kmalloc(len, GFP_KERNEL); + if (!buf) + goto out; + + ret = -EFAULT; + if (write && copy_from_user(buf, ubuf, len)) + goto out_free_meta; + + bip = bio_integrity_alloc(bio, GFP_KERNEL, 1); + if (IS_ERR(bip)) { + ret = PTR_ERR(bip); + goto out_free_meta; + } + + bip->bip_iter.bi_size = len; + bip->bip_iter.bi_sector = seed; + ret = bio_integrity_add_page(bio, virt_to_page(buf), len, + offset_in_page(buf)); + if (ret == len) + return buf; + ret = -ENOMEM; +out_free_meta: + kfree(buf); +out: + return ERR_PTR(ret); +} + +static int nvme_finish_user_metadata(struct request *req, void __user *ubuf, + void *meta, unsigned len, int ret) +{ + if (!ret && req_op(req) == REQ_OP_DRV_IN && + copy_to_user(ubuf, meta, len)) + ret = -EFAULT; + kfree(meta); + return ret; +} + +static struct request *nvme_alloc_user_request(struct request_queue *q, + struct nvme_command *cmd, void __user *ubuffer, + unsigned bufflen, void __user *meta_buffer, unsigned meta_len, + u32 meta_seed, void **metap, unsigned timeout, bool vec, + unsigned int rq_flags, blk_mq_req_flags_t blk_flags) +{ + bool write = nvme_is_write(cmd); + struct nvme_ns *ns = q->queuedata; + struct gendisk *disk = ns ? 
ns->disk : NULL; + struct request *req; + struct bio *bio = NULL; + void *meta = NULL; + int ret; + + req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags); + if (IS_ERR(req)) + return req; + nvme_init_request(req, cmd); + + if (timeout) + req->timeout = timeout; + nvme_req(req)->flags |= NVME_REQ_USERCMD; + + if (ubuffer && bufflen) { + if (!vec) + ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, + GFP_KERNEL); + else { + struct iovec fast_iov[UIO_FASTIOV]; + struct iovec *iov = fast_iov; + struct iov_iter iter; + + ret = import_iovec(rq_data_dir(req), ubuffer, bufflen, + UIO_FASTIOV, &iov, &iter); + if (ret < 0) + goto out; + ret = blk_rq_map_user_iov(q, req, NULL, &iter, + GFP_KERNEL); + kfree(iov); + } + if (ret) + goto out; + bio = req->bio; + bio->bi_disk = disk; + if (disk && meta_buffer && meta_len) { + meta = nvme_add_user_metadata(bio, meta_buffer, meta_len, + meta_seed, write); + if (IS_ERR(meta)) { + ret = PTR_ERR(meta); + goto out_unmap; + } + req->cmd_flags |= REQ_INTEGRITY; + *metap = meta; + } + } + + return req; + +out_unmap: + if (bio) + blk_rq_unmap_user(bio); +out: + blk_mq_free_request(req); + return ERR_PTR(ret); +} + +static int nvme_submit_user_cmd(struct request_queue *q, + struct nvme_command *cmd, void __user *ubuffer, + unsigned bufflen, void __user *meta_buffer, unsigned meta_len, + u32 meta_seed, u64 *result, unsigned timeout, bool vec) +{ + struct nvme_ctrl *ctrl; + struct request *req; + void *meta = NULL; + struct bio *bio; + u32 effects; + int ret; + + req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer, + meta_len, meta_seed, &meta, timeout, vec, 0, 0); + if (IS_ERR(req)) + return PTR_ERR(req); + + bio = req->bio; + ctrl = nvme_req(req)->ctrl; + + nvme_execute_passthru_rq(req, &effects); + if (nvme_req(req)->flags & NVME_REQ_CANCELLED) + ret = -EINTR; + else + ret = nvme_req(req)->status; + if (result) + *result = le64_to_cpu(nvme_req(req)->result.u64); + if (meta) + ret = nvme_finish_user_metadata(req, meta_buffer, meta, + meta_len, ret); + if (bio) + blk_rq_unmap_user(bio); + blk_mq_free_request(req); + + if (effects) + nvme_passthru_end(ctrl, effects); + + return ret; +} + +static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) +{ + struct nvme_user_io io; + struct nvme_command c; + unsigned length, meta_len; + void __user *metadata; + + if (copy_from_user(&io, uio, sizeof(io))) + return -EFAULT; + if (io.flags) + return -EINVAL; + + switch (io.opcode) { + case nvme_cmd_write: + case nvme_cmd_read: + case nvme_cmd_compare: + break; + default: + return -EINVAL; + } + + length = (io.nblocks + 1) << ns->lba_shift; + + if ((io.control & NVME_RW_PRINFO_PRACT) && + ns->ms == sizeof(struct t10_pi_tuple)) { + /* + * Protection information is stripped/inserted by the + * controller. 
+ */ + if (nvme_to_user_ptr(io.metadata)) + return -EINVAL; + meta_len = 0; + metadata = NULL; + } else { + meta_len = (io.nblocks + 1) * ns->ms; + metadata = nvme_to_user_ptr(io.metadata); + } + + if (ns->features & NVME_NS_EXT_LBAS) { + length += meta_len; + meta_len = 0; + } else if (meta_len) { + if ((io.metadata & 3) || !io.metadata) + return -EINVAL; + } + + memset(&c, 0, sizeof(c)); + c.rw.opcode = io.opcode; + c.rw.flags = io.flags; + c.rw.nsid = cpu_to_le32(ns->head->ns_id); + c.rw.slba = cpu_to_le64(io.slba); + c.rw.length = cpu_to_le16(io.nblocks); + c.rw.control = cpu_to_le16(io.control); + c.rw.dsmgmt = cpu_to_le32(io.dsmgmt); + c.rw.reftag = cpu_to_le32(io.reftag); + c.rw.apptag = cpu_to_le16(io.apptag); + c.rw.appmask = cpu_to_le16(io.appmask); + + return nvme_submit_user_cmd(ns->queue, &c, + nvme_to_user_ptr(io.addr), length, + metadata, meta_len, lower_32_bits(io.slba), NULL, 0, + false); +} + +static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl, + struct nvme_ns *ns, __u32 nsid) +{ + if (ns && nsid != ns->head->ns_id) { + dev_err(ctrl->device, + "%s: nsid (%u) in cmd does not match nsid (%u)" + "of namespace\n", + current->comm, nsid, ns->head->ns_id); + return false; + } + + return true; +} + +static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, + struct nvme_passthru_cmd __user *ucmd) +{ + struct nvme_passthru_cmd cmd; + struct nvme_command c; + unsigned timeout = 0; + u64 result; + int status; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + if (copy_from_user(&cmd, ucmd, sizeof(cmd))) + return -EFAULT; + if (cmd.flags) + return -EINVAL; + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid)) + return -EINVAL; + + memset(&c, 0, sizeof(c)); + c.common.opcode = cmd.opcode; + c.common.flags = cmd.flags; + c.common.nsid = cpu_to_le32(cmd.nsid); + c.common.cdw2[0] = cpu_to_le32(cmd.cdw2); + c.common.cdw2[1] = cpu_to_le32(cmd.cdw3); + c.common.cdw10 = cpu_to_le32(cmd.cdw10); + c.common.cdw11 = cpu_to_le32(cmd.cdw11); + c.common.cdw12 = cpu_to_le32(cmd.cdw12); + c.common.cdw13 = cpu_to_le32(cmd.cdw13); + c.common.cdw14 = cpu_to_le32(cmd.cdw14); + c.common.cdw15 = cpu_to_le32(cmd.cdw15); + + if (cmd.timeout_ms) + timeout = msecs_to_jiffies(cmd.timeout_ms); + + status = nvme_submit_user_cmd(ns ? 
ns->queue : ctrl->admin_q, &c, + nvme_to_user_ptr(cmd.addr), cmd.data_len, + nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, + 0, &result, timeout, false); + + if (status >= 0) { + if (put_user(result, &ucmd->result)) + return -EFAULT; + } + + return status; +} + +static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, + struct nvme_passthru_cmd64 __user *ucmd, bool vec) +{ + struct nvme_passthru_cmd64 cmd; + struct nvme_command c; + unsigned timeout = 0; + int status; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + if (copy_from_user(&cmd, ucmd, sizeof(cmd))) + return -EFAULT; + if (cmd.flags) + return -EINVAL; + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid)) + return -EINVAL; + + memset(&c, 0, sizeof(c)); + c.common.opcode = cmd.opcode; + c.common.flags = cmd.flags; + c.common.nsid = cpu_to_le32(cmd.nsid); + c.common.cdw2[0] = cpu_to_le32(cmd.cdw2); + c.common.cdw2[1] = cpu_to_le32(cmd.cdw3); + c.common.cdw10 = cpu_to_le32(cmd.cdw10); + c.common.cdw11 = cpu_to_le32(cmd.cdw11); + c.common.cdw12 = cpu_to_le32(cmd.cdw12); + c.common.cdw13 = cpu_to_le32(cmd.cdw13); + c.common.cdw14 = cpu_to_le32(cmd.cdw14); + c.common.cdw15 = cpu_to_le32(cmd.cdw15); + + if (cmd.timeout_ms) + timeout = msecs_to_jiffies(cmd.timeout_ms); + + status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c, + nvme_to_user_ptr(cmd.addr), cmd.data_len, + nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, + 0, &cmd.result, timeout, vec); + + if (status >= 0) { + if (put_user(cmd.result, &ucmd->result)) + return -EFAULT; + } + + return status; +} + +struct nvme_uring_data { + __u64 metadata; + __u64 addr; + __u32 data_len; + __u32 metadata_len; + __u32 timeout_ms; +}; + +/* + * This overlays struct io_uring_cmd pdu. + * Expect build errors if this grows larger than that. + */ +struct nvme_uring_cmd_pdu { + union { + struct bio *bio; + struct request *req; + }; + void *meta; /* kernel-resident buffer */ + void __user *meta_buffer; + u32 meta_len; +}; + +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( + struct io_uring_cmd *ioucmd) +{ + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; +} + +static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) +{ + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + struct request *req = pdu->req; + struct bio *bio = req->bio; + int status; + u64 result; + + if (nvme_req(req)->flags & NVME_REQ_CANCELLED) + status = -EINTR; + else + status = nvme_req(req)->status; + + result = le64_to_cpu(nvme_req(req)->result.u64); + + if (pdu->meta) + status = nvme_finish_user_metadata(req, pdu->meta_buffer, + pdu->meta, pdu->meta_len, status); + if (bio) + blk_rq_unmap_user(bio); + blk_mq_free_request(req); + + io_uring_cmd_done(ioucmd, status, result); +} + +static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err) +{ + struct io_uring_cmd *ioucmd = req->end_io_data; + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + /* extract bio before reusing the same field for request */ + struct bio *bio = pdu->bio; + void *cookie = READ_ONCE(ioucmd->cookie); + + pdu->req = req; + req->bio = bio; + + /* + * For iopoll, complete it directly. + * Otherwise, move the completion to task work. 
+ */ + if (cookie != NULL && blk_rq_is_poll(req)) + nvme_uring_task_cb(ioucmd); + else + io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); +} + +static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, + struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec) +{ + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + const struct nvme_uring_cmd *cmd = ioucmd->cmd; + struct request_queue *q = ns ? ns->queue : ctrl->admin_q; + struct nvme_uring_data d; + struct nvme_command c; + struct request *req; + unsigned int rq_flags = 0; + blk_mq_req_flags_t blk_flags = 0; + void *meta = NULL; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + + c.common.opcode = READ_ONCE(cmd->opcode); + c.common.flags = READ_ONCE(cmd->flags); + if (c.common.flags) + return -EINVAL; + + c.common.command_id = 0; + c.common.nsid = cpu_to_le32(cmd->nsid); + if (!nvme_validate_passthru_nsid(ctrl, ns, le32_to_cpu(c.common.nsid))) + return -EINVAL; + + c.common.cdw2[0] = cpu_to_le32(READ_ONCE(cmd->cdw2)); + c.common.cdw2[1] = cpu_to_le32(READ_ONCE(cmd->cdw3)); + c.common.metadata = 0; + c.common.dptr.prp1 = c.common.dptr.prp2 = 0; + c.common.cdw10 = cpu_to_le32(READ_ONCE(cmd->cdw10)); + c.common.cdw11 = cpu_to_le32(READ_ONCE(cmd->cdw11)); + c.common.cdw12 = cpu_to_le32(READ_ONCE(cmd->cdw12)); + c.common.cdw13 = cpu_to_le32(READ_ONCE(cmd->cdw13)); + c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14)); + c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15)); + + d.metadata = READ_ONCE(cmd->metadata); + d.addr = READ_ONCE(cmd->addr); + d.data_len = READ_ONCE(cmd->data_len); + d.metadata_len = READ_ONCE(cmd->metadata_len); + d.timeout_ms = READ_ONCE(cmd->timeout_ms); + + if (issue_flags & IO_URING_F_NONBLOCK) { + rq_flags = REQ_NOWAIT; + blk_flags = BLK_MQ_REQ_NOWAIT; + } + if (issue_flags & IO_URING_F_IOPOLL) + rq_flags |= REQ_HIPRI; + +retry: + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), + d.data_len, nvme_to_user_ptr(d.metadata), + d.metadata_len, 0, &meta, d.timeout_ms ? 
+ msecs_to_jiffies(d.timeout_ms) : 0, vec, rq_flags, + blk_flags); + if (IS_ERR(req)) + return PTR_ERR(req); + req->end_io_data = ioucmd; + + if (issue_flags & IO_URING_F_IOPOLL && rq_flags & REQ_HIPRI) { + if (unlikely(!req->bio)) { + /* we can't poll this, so alloc regular req instead */ + blk_mq_free_request(req); + rq_flags &= ~REQ_HIPRI; + goto retry; + } else { + WRITE_ONCE(ioucmd->cookie, req); + req->bio->bi_opf |= REQ_HIPRI; + } + } + /* to free bio on completion, as req->bio will be null at that time */ + pdu->bio = req->bio; + pdu->meta = meta; + pdu->meta_buffer = nvme_to_user_ptr(d.metadata); + pdu->meta_len = d.metadata_len; + + blk_execute_rq_nowait(req->q, NULL, req, 0, nvme_uring_cmd_end_io); + return -EIOCBQUEUED; +} + +static bool is_ctrl_ioctl(unsigned int cmd) +{ + if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD) + return true; + if (is_sed_ioctl(cmd)) + return true; + return false; +} + +static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd, + void __user *argp) +{ + switch (cmd) { + case NVME_IOCTL_ADMIN_CMD: + return nvme_user_cmd(ctrl, NULL, argp); + case NVME_IOCTL_ADMIN64_CMD: + return nvme_user_cmd64(ctrl, NULL, argp, false); + default: + return sed_ioctl(ctrl->opal_dev, cmd, argp); + } +} + +#ifdef COMPAT_FOR_U64_ALIGNMENT +struct nvme_user_io32 { + __u8 opcode; + __u8 flags; + __u16 control; + __u16 nblocks; + __u16 rsvd; + __u64 metadata; + __u64 addr; + __u64 slba; + __u32 dsmgmt; + __u32 reftag; + __u16 apptag; + __u16 appmask; +} __attribute__((__packed__)); +#define NVME_IOCTL_SUBMIT_IO32 _IOW('N', 0x42, struct nvme_user_io32) +#endif /* COMPAT_FOR_U64_ALIGNMENT */ + +static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd, + void __user *argp) +{ + switch (cmd) { + case NVME_IOCTL_ID: + force_successful_syscall_return(); + return ns->head->ns_id; + case NVME_IOCTL_IO_CMD: + return nvme_user_cmd(ns->ctrl, ns, argp); + /* + * struct nvme_user_io can have different padding on some 32-bit ABIs. + * Just accept the compat version as all fields that are used are the + * same size and at the same offset. 
+ */ +#ifdef COMPAT_FOR_U64_ALIGNMENT + case NVME_IOCTL_SUBMIT_IO32: +#endif + case NVME_IOCTL_SUBMIT_IO: + return nvme_submit_io(ns, argp); + case NVME_IOCTL_IO64_CMD: + return nvme_user_cmd64(ns->ctrl, ns, argp, false); + case NVME_IOCTL_IO64_CMD_VEC: + return nvme_user_cmd64(ns->ctrl, ns, argp, true); + default: + if (!ns->ndev) + return -ENOTTY; + return nvme_nvm_ioctl(ns, cmd, argp); + } +} + +static int __nvme_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *arg) +{ + if (is_ctrl_ioctl(cmd)) + return nvme_ctrl_ioctl(ns->ctrl, cmd, arg); + return nvme_ns_ioctl(ns, cmd, arg); +} + +int nvme_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg) +{ + struct nvme_ns *ns = bdev->bd_disk->private_data; + + return __nvme_ioctl(ns, cmd, (void __user *)arg); +} + +long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct nvme_ns *ns = + container_of(file_inode(file)->i_cdev, struct nvme_ns, cdev); + + return __nvme_ioctl(ns, cmd, (void __user *)arg); +} + +static int nvme_uring_cmd_checks(unsigned int issue_flags) +{ + + /* NVMe passthrough requires big SQE/CQE support */ + if ((issue_flags & (IO_URING_F_SQE128|IO_URING_F_CQE32)) != + (IO_URING_F_SQE128|IO_URING_F_CQE32)) + return -EOPNOTSUPP; + return 0; +} + +static int nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd, + unsigned int issue_flags) +{ + struct nvme_ctrl *ctrl = ns->ctrl; + int ret; + + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu)); + + ret = nvme_uring_cmd_checks(issue_flags); + if (ret) + return ret; + + switch (ioucmd->cmd_op) { + case NVME_URING_CMD_IO: + ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags, false); + break; + case NVME_URING_CMD_IO_VEC: + ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags, true); + break; + default: + ret = -ENOTTY; + } + + return ret; +} + +int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) +{ + struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev, + struct nvme_ns, cdev); + + return nvme_ns_uring_cmd(ns, ioucmd, issue_flags); +} + +int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd) +{ + struct request *req; + int ret = 0; + struct nvme_ns *ns; + struct request_queue *q; + + req = READ_ONCE(ioucmd->cookie); + ns = container_of(file_inode(ioucmd->file)->i_cdev, + struct nvme_ns, cdev); + q = ns->queue; + if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) + ret = blk_poll(q, request_to_qc_t(req->mq_hctx, req), true); + return ret; +} +#ifdef CONFIG_NVME_MULTIPATH +static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, + void __user *argp, struct nvme_ns_head *head, int srcu_idx) +{ + struct nvme_ctrl *ctrl = ns->ctrl; + int ret; + + nvme_get_ctrl(ns->ctrl); + nvme_put_ns_from_disk(head, srcu_idx); + ret = nvme_ctrl_ioctl(ns->ctrl, cmd, argp); + + nvme_put_ctrl(ctrl); + return ret; +} + +int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg) +{ + struct nvme_ns_head *head = NULL; + void __user *argp = (void __user *)arg; + struct nvme_ns *ns; + int srcu_idx, ret; + + ns = nvme_get_ns_from_disk(bdev->bd_disk, &head, &srcu_idx); + if (unlikely(!ns)) + return -EWOULDBLOCK; + + /* + * Handle ioctls that apply to the controller instead of the namespace + * separately and drop the ns SRCU reference early. This avoids a + * deadlock when deleting namespaces using the passthrough interface. 
+ */ + if (is_ctrl_ioctl(cmd)) + ret = nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); + else { + ret = nvme_ns_ioctl(ns, cmd, argp); + nvme_put_ns_from_disk(head, srcu_idx); + } + + return ret; +} + +long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + struct cdev *cdev = file_inode(file)->i_cdev; + struct nvme_ns_head *head = + container_of(cdev, struct nvme_ns_head, cdev); + void __user *argp = (void __user *)arg; + struct nvme_ns *ns; + int srcu_idx, ret; + + srcu_idx = srcu_read_lock(&head->srcu); + ns = nvme_find_path(head); + if (!ns) { + srcu_read_unlock(&head->srcu, srcu_idx); + return -EWOULDBLOCK; + } + + if (is_ctrl_ioctl(cmd)) + return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); + + ret = nvme_ns_ioctl(ns, cmd, argp); + nvme_put_ns_from_disk(head, srcu_idx); + + return ret; +} + +int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, + unsigned int issue_flags) +{ + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev); + int srcu_idx = srcu_read_lock(&head->srcu); + struct nvme_ns *ns = nvme_find_path(head); + int ret = -EINVAL; + + if (ns) + ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); + srcu_read_unlock(&head->srcu, srcu_idx); + return ret; +} + +int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd) +{ + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev); + int srcu_idx = srcu_read_lock(&head->srcu); + struct nvme_ns *ns = nvme_find_path(head); + struct request *req; + int ret = 0; + struct request_queue *q; + + if (ns) { + req = READ_ONCE(ioucmd->cookie); + q = ns->queue; + if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) + ret = blk_poll(q, request_to_qc_t(req->mq_hctx, req), true); + } + srcu_read_unlock(&head->srcu, srcu_idx); + return ret; +} +#endif /* CONFIG_NVME_MULTIPATH */ + +int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) +{ + struct nvme_ctrl *ctrl = ioucmd->file->private_data; + int ret; + + /* IOPOLL not supported yet */ + if (issue_flags & IO_URING_F_IOPOLL) + return -EOPNOTSUPP; + + ret = nvme_uring_cmd_checks(issue_flags); + if (ret) + return ret; + + switch (ioucmd->cmd_op) { + case NVME_URING_CMD_ADMIN: + ret = nvme_uring_cmd_io(ctrl, NULL, ioucmd, issue_flags, false); + break; + case NVME_URING_CMD_ADMIN_VEC: + ret = nvme_uring_cmd_io(ctrl, NULL, ioucmd, issue_flags, true); + break; + default: + ret = -ENOTTY; + } + + return ret; +} + +static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp) +{ + struct nvme_ns *ns; + int ret; + + down_read(&ctrl->namespaces_rwsem); + if (list_empty(&ctrl->namespaces)) { + ret = -ENOTTY; + goto out_unlock; + } + + ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list); + if (ns != list_last_entry(&ctrl->namespaces, struct nvme_ns, list)) { + dev_warn(ctrl->device, + "NVME_IOCTL_IO_CMD not supported when multiple namespaces present!\n"); + ret = -EINVAL; + goto out_unlock; + } + + dev_warn(ctrl->device, + "using deprecated NVME_IOCTL_IO_CMD ioctl on the char device!\n"); + kref_get(&ns->kref); + up_read(&ctrl->namespaces_rwsem); + + ret = nvme_user_cmd(ctrl, ns, argp); + nvme_put_ns(ns); + return ret; + +out_unlock: + up_read(&ctrl->namespaces_rwsem); + return ret; +} + +long nvme_dev_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + struct nvme_ctrl *ctrl = file->private_data; + void __user *argp = (void __user *)arg; + 
+ switch (cmd) { + case NVME_IOCTL_ADMIN_CMD: + return nvme_user_cmd(ctrl, NULL, argp); + case NVME_IOCTL_ADMIN64_CMD: + return nvme_user_cmd64(ctrl, NULL, argp, false); + case NVME_IOCTL_IO_CMD: + return nvme_dev_user_cmd(ctrl, argp); + case NVME_IOCTL_RESET: + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + dev_warn(ctrl->device, "resetting controller\n"); + return nvme_reset_ctrl_sync(ctrl); + case NVME_IOCTL_SUBSYS_RESET: + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + return nvme_reset_subsystem(ctrl); + case NVME_IOCTL_RESCAN: + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + nvme_queue_scan(ctrl); + return 0; + default: + return -ENOTTY; + } +} diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c index 470cef3abec3dbc8b7d49a62bb61faf6b0c45718..cbed54bd07f4ba3f58ac65c8ff0b566574d440bb 100644 --- a/drivers/nvme/host/lightnvm.c +++ b/drivers/nvme/host/lightnvm.c @@ -653,9 +653,10 @@ static struct request *nvme_nvm_alloc_request(struct request_queue *q, nvme_nvm_rqtocmd(rqd, ns, cmd); - rq = nvme_alloc_request(q, (struct nvme_command *)cmd, 0); + rq = blk_mq_alloc_request(q, nvme_req_op((struct nvme_command *)cmd), 0); if (IS_ERR(rq)) return rq; + nvme_init_request(rq, (struct nvme_command *)cmd); rq->cmd_flags &= ~REQ_FAILFAST_DRIVER; @@ -767,11 +768,12 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q, DECLARE_COMPLETION_ONSTACK(wait); int ret = 0; - rq = nvme_alloc_request(q, (struct nvme_command *)vcmd, 0); + rq = blk_mq_alloc_request(q, nvme_req_op((struct nvme_command *)vcmd), 0); if (IS_ERR(rq)) { ret = -ENOMEM; goto err_cmd; } + nvme_init_request(rq, (struct nvme_command *)vcmd); if (timeout) rq->timeout = timeout; @@ -931,15 +933,15 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin, return ret; } -int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg) +int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *argp) { switch (cmd) { case NVME_NVM_IOCTL_ADMIN_VIO: - return nvme_nvm_user_vcmd(ns, 1, (void __user *)arg); + return nvme_nvm_user_vcmd(ns, 1, argp); case NVME_NVM_IOCTL_IO_VIO: - return nvme_nvm_user_vcmd(ns, 0, (void __user *)arg); + return nvme_nvm_user_vcmd(ns, 0, argp); case NVME_NVM_IOCTL_SUBMIT_VIO: - return nvme_nvm_submit_vio(ns, (void __user *)arg); + return nvme_nvm_submit_vio(ns, argp); default: return -ENOTTY; } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 24e48bc73f7efcbd1fcaa86c119527cdb1441c20..029fbe30ae577e4a8e87638a77d210c2a0bd852d 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -50,19 +50,19 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys) * and those that have a single controller and use the controller node * directly. 
*/ -void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, - struct nvme_ctrl *ctrl, int *flags) -{ - if (!multipath) { - sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); - } else if (ns->head->disk) { - sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance, - ctrl->instance, ns->head->instance); - *flags = GENHD_FL_HIDDEN; - } else { - sprintf(disk_name, "nvme%dn%d", ctrl->subsys->instance, - ns->head->instance); +bool nvme_mpath_set_disk_name(struct nvme_ns *ns, char *disk_name, int *flags) +{ + if (!multipath) + return false; + if (!ns->head->disk) { + sprintf(disk_name, "nvme%dn%d", ns->ctrl->subsys->instance, + ns->head->instance); + return true; } + sprintf(disk_name, "nvme%dc%dn%d", ns->ctrl->subsys->instance, + ns->ctrl->instance, ns->head->instance); + *flags = GENHD_FL_HIDDEN; + return true; } void nvme_failover_req(struct request *req) @@ -293,7 +293,7 @@ static bool nvme_available_path(struct nvme_ns_head *head) return false; } -blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) +static blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) { struct nvme_ns_head *head = bio->bi_disk->private_data; struct device *dev = disk_to_dev(head->disk); @@ -334,6 +334,71 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) return ret; } +static int nvme_ns_head_open(struct block_device *bdev, fmode_t mode) +{ + if (!nvme_tryget_ns_head(bdev->bd_disk->private_data)) + return -ENXIO; + return 0; +} + +static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode) +{ + nvme_put_ns_head(disk->private_data); +} + +const struct block_device_operations nvme_ns_head_ops = { + .owner = THIS_MODULE, + .submit_bio = nvme_ns_head_submit_bio, + .open = nvme_ns_head_open, + .release = nvme_ns_head_release, + .ioctl = nvme_ns_head_ioctl, + .getgeo = nvme_getgeo, + .report_zones = nvme_report_zones, + .pr_ops = &nvme_pr_ops, +}; + +static inline struct nvme_ns_head *cdev_to_ns_head(struct cdev *cdev) +{ + return container_of(cdev, struct nvme_ns_head, cdev); +} + +static int nvme_ns_head_chr_open(struct inode *inode, struct file *file) +{ + if (!nvme_tryget_ns_head(cdev_to_ns_head(inode->i_cdev))) + return -ENXIO; + return 0; +} + +static int nvme_ns_head_chr_release(struct inode *inode, struct file *file) +{ + nvme_put_ns_head(cdev_to_ns_head(inode->i_cdev)); + return 0; +} + +static const struct file_operations nvme_ns_head_chr_fops = { + .owner = THIS_MODULE, + .open = nvme_ns_head_chr_open, + .release = nvme_ns_head_chr_release, + .unlocked_ioctl = nvme_ns_head_chr_ioctl, + .compat_ioctl = compat_ptr_ioctl, + .uring_cmd = nvme_ns_head_chr_uring_cmd, + .uring_cmd_iopoll = nvme_ns_head_chr_uring_cmd_iopoll, +}; + +static int nvme_add_ns_head_cdev(struct nvme_ns_head *head) +{ + int ret; + + head->cdev_device.parent = &head->subsys->dev; + ret = dev_set_name(&head->cdev_device, "ng%dn%d", + head->subsys->instance, head->instance); + if (ret) + return ret; + ret = nvme_cdev_add(&head->cdev, &head->cdev_device, + &nvme_ns_head_chr_fops, THIS_MODULE); + return ret; +} + static void nvme_requeue_work(struct work_struct *work) { struct nvme_ns_head *head = @@ -412,9 +477,11 @@ static void nvme_mpath_set_live(struct nvme_ns *ns) if (!head->disk) return; - if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) + if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { device_add_disk(&head->subsys->dev, head->disk, nvme_ns_id_attr_groups); + nvme_add_ns_head_cdev(head); + } mutex_lock(&head->lock); if (nvme_path_is_optimized(ns)) { @@ -715,8 +782,10 @@ void 
nvme_mpath_remove_disk(struct nvme_ns_head *head) { if (!head->disk) return; - if (head->disk->flags & GENHD_FL_UP) + if (head->disk->flags & GENHD_FL_UP) { + nvme_cdev_del(&head->cdev, &head->cdev_device); del_gendisk(head->disk); + } blk_set_queue_dying(head->disk->queue); /* make sure all pending bios are cleaned up */ kblockd_schedule_work(&head->requeue_work); @@ -789,4 +858,3 @@ void nvme_mpath_uninit(struct nvme_ctrl *ctrl) kfree(ctrl->ana_log_buf); ctrl->ana_log_buf = NULL; } - diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index c5a8feef4df1153e9f520e2731b327dfe593093a..0d4fcad763fba08eab4330fd01aa1179b1847177 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -412,8 +412,12 @@ struct nvme_ns_head { bool shared; int instance; struct nvme_effects_log *effects; -#ifdef CONFIG_NVME_MULTIPATH + + struct cdev cdev; + struct device cdev_device; + struct gendisk *disk; +#ifdef CONFIG_NVME_MULTIPATH struct bio_list requeue_list; spinlock_t requeue_lock; struct work_struct requeue_work; @@ -424,6 +428,11 @@ struct nvme_ns_head { #endif }; +static inline bool nvme_ns_head_multipath(struct nvme_ns_head *head) +{ + return IS_ENABLED(CONFIG_NVME_MULTIPATH) && head->disk; +} + enum nvme_ns_features { NVME_NS_EXT_LBAS = 1 << 0, /* support extended LBA format */ NVME_NS_METADATA_SUPPORTED = 1 << 1, /* support getting generated md */ @@ -458,6 +467,9 @@ struct nvme_ns { #define NVME_NS_DEAD 1 #define NVME_NS_ANA_PENDING 2 + struct cdev cdev; + struct device cdev_device; + struct nvme_fault_inject fault_inject; }; @@ -676,11 +688,13 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl); int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout); void nvme_start_freeze(struct nvme_ctrl *ctrl); +static inline unsigned int nvme_req_op(struct nvme_command *cmd) +{ + return nvme_is_write(cmd) ? 
REQ_OP_DRV_OUT : REQ_OP_DRV_IN; +} + #define NVME_QID_ANY -1 -struct request *nvme_alloc_request(struct request_queue *q, - struct nvme_command *cmd, blk_mq_req_flags_t flags); -struct request *nvme_alloc_request_qid(struct request_queue *q, - struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid); +void nvme_init_request(struct request *req, struct nvme_command *cmd); void nvme_cleanup_cmd(struct request *req); blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, struct nvme_command *cmd); @@ -717,13 +731,37 @@ int nvme_reset_ctrl(struct nvme_ctrl *ctrl); int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); int nvme_delete_ctrl(struct nvme_ctrl *ctrl); +void nvme_queue_scan(struct nvme_ctrl *ctrl); int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi, void *log, size_t size, u64 offset); struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk, struct nvme_ns_head **head, int *srcu_idx); void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx); +bool nvme_tryget_ns_head(struct nvme_ns_head *head); +void nvme_put_ns_head(struct nvme_ns_head *head); +int nvme_cdev_add(struct cdev *cdev, struct device *cdev_device, + const struct file_operations *fops, struct module *owner); +void nvme_cdev_del(struct cdev *cdev, struct device *cdev_device); +int nvme_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg); +long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg); +int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg); +long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd, + unsigned long arg); +long nvme_dev_ioctl(struct file *file, unsigned int cmd, + unsigned long arg); +int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd); +int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd); +int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, + unsigned int issue_flags); +int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, + unsigned int issue_flags); +int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo); +int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); extern const struct attribute_group *nvme_ns_id_attr_groups[]; +extern const struct pr_ops nvme_pr_ops; extern const struct block_device_operations nvme_ns_head_ops; #ifdef CONFIG_NVME_MULTIPATH @@ -735,8 +773,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) void nvme_mpath_unfreeze(struct nvme_subsystem *subsys); void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); -void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, - struct nvme_ctrl *ctrl, int *flags); +bool nvme_mpath_set_disk_name(struct nvme_ns *ns, char *disk_name, int *flags); void nvme_failover_req(struct request *req); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); @@ -750,7 +787,6 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl); bool nvme_mpath_clear_current_path(struct nvme_ns *ns); void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl); struct nvme_ns *nvme_find_path(struct nvme_ns_head *head); -blk_qc_t nvme_ns_head_submit_bio(struct bio *bio); static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) { @@ -778,16 +814,11 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) { return false; } -/* - * Without the multipath code enabled, multiple 
controller per subsystems are - * visible as devices and thus we cannot use the subsystem instance. - */ -static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, - struct nvme_ctrl *ctrl, int *flags) +static inline bool nvme_mpath_set_disk_name(struct nvme_ns *ns, char *disk_name, + int *flags) { - sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance); + return false; } - static inline void nvme_failover_req(struct request *req) { } @@ -882,7 +913,7 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf) int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node); void nvme_nvm_unregister(struct nvme_ns *ns); extern const struct attribute_group nvme_nvm_attr_group; -int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg); +int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *argp); #else static inline int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node) @@ -892,7 +923,7 @@ static inline int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, static inline void nvme_nvm_unregister(struct nvme_ns *ns) {}; static inline int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, - unsigned long arg) + void __user *argp) { return -ENOTTY; } diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 5bed9136250a04e0e83a0a75ddaac3a0ef17c09f..760a9482e01f107c80cdedfbf9e3a34d76ac3a59 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -427,8 +427,9 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, return 0; } -static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req, - unsigned int hctx_idx, unsigned int numa_node) +static int nvme_pci_init_request(struct blk_mq_tag_set *set, + struct request *req, unsigned int hctx_idx, + unsigned int numa_node) { struct nvme_dev *dev = set->driver_data; struct nvme_iod *iod = blk_mq_rq_to_pdu(req); @@ -1362,12 +1363,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) "I/O %d QID %d timeout, aborting\n", req->tag, nvmeq->qid); - abort_req = nvme_alloc_request(dev->ctrl.admin_q, &cmd, - BLK_MQ_REQ_NOWAIT); + abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, nvme_req_op(&cmd), + BLK_MQ_REQ_NOWAIT); if (IS_ERR(abort_req)) { atomic_inc(&dev->ctrl.abort_limit); return BLK_EH_RESET_TIMER; } + nvme_init_request(abort_req, &cmd); abort_req->end_io_data = NULL; blk_execute_rq_nowait(abort_req->q, NULL, abort_req, 0, abort_endio); @@ -1629,7 +1631,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = { .queue_rq = nvme_queue_rq, .complete = nvme_pci_complete_rq, .init_hctx = nvme_admin_init_hctx, - .init_request = nvme_init_request, + .init_request = nvme_pci_init_request, .timeout = nvme_timeout, }; @@ -1638,7 +1640,7 @@ static const struct blk_mq_ops nvme_mq_ops = { .complete = nvme_pci_complete_rq, .commit_rqs = nvme_commit_rqs, .init_hctx = nvme_init_hctx, - .init_request = nvme_init_request, + .init_request = nvme_pci_init_request, .map_queues = nvme_pci_map_queues, .timeout = nvme_timeout, .poll = nvme_poll, @@ -2340,9 +2342,10 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode) cmd.delete_queue.opcode = opcode; cmd.delete_queue.qid = cpu_to_le16(nvmeq->qid); - req = nvme_alloc_request(q, &cmd, BLK_MQ_REQ_NOWAIT); + req = blk_mq_alloc_request(q, nvme_req_op(&cmd), BLK_MQ_REQ_NOWAIT); if (IS_ERR(req)) return PTR_ERR(req); + nvme_init_request(req, &cmd); req->end_io_data = nvmeq; diff --git a/drivers/nvme/host/zns.c 
b/drivers/nvme/host/zns.c index 67e87e9f306f167b4053bf00a05da712610e58c6..72710b72a9d319aac450e9da5bb29a8310167f2a 100644 --- a/drivers/nvme/host/zns.c +++ b/drivers/nvme/host/zns.c @@ -91,7 +91,7 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf) dev_warn(ns->ctrl->device, "zone operations:%x not supported for namespace:%u\n", le16_to_cpu(id->zoc), ns->head->ns_id); - status = -EINVAL; + status = -ENODEV; goto free_data; } @@ -100,7 +100,7 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf) dev_warn(ns->ctrl->device, "invalid zone size:%llu for namespace:%u\n", ns->zsze, ns->head->ns_id); - status = -EINVAL; + status = -ENODEV; goto free_data; } diff --git a/drivers/nvme/target/passthru.c b/drivers/nvme/target/passthru.c index 3a826e1aee1bbe6ad0cc014047202cb27637655e..a96b26c5fbea23daa1f3234c5cb4591ded6b5863 100644 --- a/drivers/nvme/target/passthru.c +++ b/drivers/nvme/target/passthru.c @@ -249,11 +249,12 @@ static void nvmet_passthru_execute_cmd(struct nvmet_req *req) q = ns->queue; } - rq = nvme_alloc_request(q, req->cmd, 0); + rq = blk_mq_alloc_request(q, nvme_req_op(req->cmd), 0); if (IS_ERR(rq)) { status = NVME_SC_INTERNAL; goto out_put_ns; } + nvme_init_request(rq, req->cmd); if (req->sg_cnt) { ret = nvmet_passthru_map_sg(req, rq); diff --git a/drivers/power/supply/bq24190_charger.c b/drivers/power/supply/bq24190_charger.c index 8c3c378dce0d545076017f54a1c5ce1658fb13a4..9a93060e5b5f5f602b012b87af1e4f43518af5c4 100644 --- a/drivers/power/supply/bq24190_charger.c +++ b/drivers/power/supply/bq24190_charger.c @@ -1849,6 +1849,7 @@ static int bq24190_remove(struct i2c_client *client) struct bq24190_dev_info *bdi = i2c_get_clientdata(client); int error; + cancel_delayed_work_sync(&bdi->input_current_limit_work); error = pm_runtime_get_sync(bdi->dev); if (error < 0) { dev_warn(bdi->dev, "pm_runtime_get failed: %i\n", error); diff --git a/drivers/power/supply/da9150-charger.c b/drivers/power/supply/da9150-charger.c index f9314cc0cd75ff19fcb7cc537116de40fb944414..6b987da586556e0acd38cfbe7aa766da931626d5 100644 --- a/drivers/power/supply/da9150-charger.c +++ b/drivers/power/supply/da9150-charger.c @@ -662,6 +662,7 @@ static int da9150_charger_remove(struct platform_device *pdev) if (!IS_ERR_OR_NULL(charger->usb_phy)) usb_unregister_notifier(charger->usb_phy, &charger->otg_nb); + cancel_work_sync(&charger->otg_work); power_supply_unregister(charger->battery); power_supply_unregister(charger->usb); diff --git a/drivers/s390/net/ism.h b/drivers/s390/net/ism.h index 38fe90c2597d1862565711c08f844bbffe874b24..70c5bbda0feaaa346f01a54f1e13fde0c7f4bcbe 100644 --- a/drivers/s390/net/ism.h +++ b/drivers/s390/net/ism.h @@ -5,6 +5,7 @@ #include #include #include +#include #include #include @@ -15,7 +16,6 @@ */ #define ISM_DMB_WORD_OFFSET 1 #define ISM_DMB_BIT_OFFSET (ISM_DMB_WORD_OFFSET * 32) -#define ISM_NR_DMBS 1920 #define ISM_IDENT_MASK 0x00FFFF #define ISM_REG_SBA 0x1 @@ -177,7 +177,7 @@ struct ism_eq_header { struct ism_eq { struct ism_eq_header header; - struct smcd_event entry[15]; + struct ism_event entry[15]; }; struct ism_sba { @@ -189,21 +189,6 @@ struct ism_sba { u16 dmbe_mask[ISM_NR_DMBS]; }; -struct ism_dev { - spinlock_t lock; - struct pci_dev *pdev; - struct smcd_dev *smcd; - - struct ism_sba *sba; - dma_addr_t sba_dma_addr; - DECLARE_BITMAP(sba_bitmap, ISM_NR_DMBS); - - struct ism_eq *ieq; - dma_addr_t ieq_dma_addr; - - int ieq_idx; -}; - #define ISM_CREATE_REQ(dmb, idx, sf, offset) \ ((dmb) | (idx) << 24 | (sf) << 23 | (offset)) diff --git 
a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c index 1adb00ca0a0a4c005a7f3cbe89630bff16440016..57db80f59a5116d36e6cd938a1064cf30c870d96 100644 --- a/drivers/s390/net/ism_drv.c +++ b/drivers/s390/net/ism_drv.c @@ -15,9 +15,6 @@ #include #include #include -#include - -#include #include "ism.h" @@ -34,6 +31,84 @@ static const struct pci_device_id ism_device_table[] = { MODULE_DEVICE_TABLE(pci, ism_device_table); static debug_info_t *ism_debug_info; +static const struct smcd_ops ism_ops; + +#define NO_CLIENT 0xff /* must be >= MAX_CLIENTS */ +static struct ism_client *clients[MAX_CLIENTS]; /* use an array rather than */ + /* a list for fast mapping */ +static u8 max_client; +static DEFINE_SPINLOCK(clients_lock); +struct ism_dev_list { + struct list_head list; + struct mutex mutex; /* protects ism device list */ +}; + +static struct ism_dev_list ism_dev_list = { + .list = LIST_HEAD_INIT(ism_dev_list.list), + .mutex = __MUTEX_INITIALIZER(ism_dev_list.mutex), +}; + +int ism_register_client(struct ism_client *client) +{ + struct ism_dev *ism; + unsigned long flags; + int i, rc = -ENOSPC; + + mutex_lock(&ism_dev_list.mutex); + spin_lock_irqsave(&clients_lock, flags); + for (i = 0; i < MAX_CLIENTS; ++i) { + if (!clients[i]) { + clients[i] = client; + client->id = i; + if (i == max_client) + max_client++; + rc = 0; + break; + } + } + spin_unlock_irqrestore(&clients_lock, flags); + if (i < MAX_CLIENTS) { + /* initialize with all devices that we got so far */ + list_for_each_entry(ism, &ism_dev_list.list, list) { + ism->priv[i] = NULL; + client->add(ism); + } + } + mutex_unlock(&ism_dev_list.mutex); + + return rc; +} +EXPORT_SYMBOL_GPL(ism_register_client); + +int ism_unregister_client(struct ism_client *client) +{ + struct ism_dev *ism; + unsigned long flags; + int rc = 0; + + mutex_lock(&ism_dev_list.mutex); + spin_lock_irqsave(&clients_lock, flags); + clients[client->id] = NULL; + if (client->id + 1 == max_client) + max_client--; + spin_unlock_irqrestore(&clients_lock, flags); + list_for_each_entry(ism, &ism_dev_list.list, list) { + for (int i = 0; i < ISM_NR_DMBS; ++i) { + if (ism->sba_client_arr[i] == client->id) { + pr_err("%s: attempt to unregister client '%s'" + "with registered dmb(s)\n", __func__, + client->name); + rc = -EBUSY; + goto out; + } + } + } +out: + mutex_unlock(&ism_dev_list.mutex); + + return rc; +} +EXPORT_SYMBOL_GPL(ism_unregister_client); static int ism_cmd(struct ism_dev *ism, void *cmd) { @@ -193,15 +268,14 @@ static int ism_read_local_gid(struct ism_dev *ism) if (ret) goto out; - ism->smcd->local_gid = cmd.response.gid; + ism->local_gid = cmd.response.gid; out: return ret; } -static int ism_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid, +static int ism_query_rgid(struct ism_dev *ism, u64 rgid, u32 vid_valid, u32 vid) { - struct ism_dev *ism = smcd->priv; union ism_query_rgid cmd; memset(&cmd, 0, sizeof(cmd)); @@ -215,14 +289,14 @@ static int ism_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid, return ism_cmd(ism, &cmd); } -static void ism_free_dmb(struct ism_dev *ism, struct smcd_dmb *dmb) +static void ism_free_dmb(struct ism_dev *ism, struct ism_dmb *dmb) { clear_bit(dmb->sba_idx, ism->sba_bitmap); dma_free_coherent(&ism->pdev->dev, dmb->dmb_len, dmb->cpu_addr, dmb->dma_addr); } -static int ism_alloc_dmb(struct ism_dev *ism, struct smcd_dmb *dmb) +static int ism_alloc_dmb(struct ism_dev *ism, struct ism_dmb *dmb) { unsigned long bit; @@ -250,9 +324,9 @@ static int ism_alloc_dmb(struct ism_dev *ism, struct smcd_dmb *dmb) return dmb->cpu_addr 
? 0 : -ENOMEM; } -static int ism_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb) +int ism_register_dmb(struct ism_dev *ism, struct ism_dmb *dmb, + struct ism_client *client) { - struct ism_dev *ism = smcd->priv; union ism_reg_dmb cmd; int ret; @@ -277,13 +351,14 @@ static int ism_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb) goto out; } dmb->dmb_tok = cmd.response.dmb_tok; + ism->sba_client_arr[dmb->sba_idx - ISM_DMB_BIT_OFFSET] = client->id; out: return ret; } +EXPORT_SYMBOL_GPL(ism_register_dmb); -static int ism_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb) +int ism_unregister_dmb(struct ism_dev *ism, struct ism_dmb *dmb) { - struct ism_dev *ism = smcd->priv; union ism_unreg_dmb cmd; int ret; @@ -293,6 +368,8 @@ static int ism_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb) cmd.request.dmb_tok = dmb->dmb_tok; + ism->sba_client_arr[dmb->sba_idx - ISM_DMB_BIT_OFFSET] = NO_CLIENT; + ret = ism_cmd(ism, &cmd); if (ret && ret != ISM_ERROR) goto out; @@ -301,10 +378,10 @@ static int ism_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb) out: return ret; } +EXPORT_SYMBOL_GPL(ism_unregister_dmb); -static int ism_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id) +static int ism_add_vlan_id(struct ism_dev *ism, u64 vlan_id) { - struct ism_dev *ism = smcd->priv; union ism_set_vlan_id cmd; memset(&cmd, 0, sizeof(cmd)); @@ -316,9 +393,8 @@ static int ism_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id) return ism_cmd(ism, &cmd); } -static int ism_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id) +static int ism_del_vlan_id(struct ism_dev *ism, u64 vlan_id) { - struct ism_dev *ism = smcd->priv; union ism_set_vlan_id cmd; memset(&cmd, 0, sizeof(cmd)); @@ -330,20 +406,9 @@ static int ism_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id) return ism_cmd(ism, &cmd); } -static int ism_set_vlan_required(struct smcd_dev *smcd) -{ - return ism_cmd_simple(smcd->priv, ISM_SET_VLAN); -} - -static int ism_reset_vlan_required(struct smcd_dev *smcd) -{ - return ism_cmd_simple(smcd->priv, ISM_RESET_VLAN); -} - -static int ism_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq, +static int ism_signal_ieq(struct ism_dev *ism, u64 rgid, u32 trigger_irq, u32 event_code, u64 info) { - struct ism_dev *ism = smcd->priv; union ism_sig_ieq cmd; memset(&cmd, 0, sizeof(cmd)); @@ -364,10 +429,9 @@ static unsigned int max_bytes(unsigned int start, unsigned int len, return min(boundary - (start & (boundary - 1)), len); } -static int ism_move(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx, - bool sf, unsigned int offset, void *data, unsigned int size) +int ism_move(struct ism_dev *ism, u64 dmb_tok, unsigned int idx, bool sf, + unsigned int offset, void *data, unsigned int size) { - struct ism_dev *ism = smcd->priv; unsigned int bytes; u64 dmb_req; int ret; @@ -388,6 +452,7 @@ static int ism_move(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx, return 0; } +EXPORT_SYMBOL_GPL(ism_move); static struct ism_systemeid SYSTEM_EID = { .seid_string = "IBM-SYSZ-ISMSEID00000000", @@ -409,15 +474,14 @@ static void ism_create_system_eid(void) memcpy(&SYSTEM_EID.type, tmp, 4); } -static u8 *ism_get_system_eid(void) +u8 *ism_get_seid(void) { return SYSTEM_EID.seid_string; } +EXPORT_SYMBOL_GPL(ism_get_seid); -static u16 ism_get_chid(struct smcd_dev *smcd) +static u16 ism_get_chid(struct ism_dev *ism) { - struct ism_dev *ism = (struct ism_dev *)smcd->priv; - if (!ism || !ism->pdev) return 0; @@ -426,7 +490,8 @@ static u16 ism_get_chid(struct smcd_dev *smcd) static void 
ism_handle_event(struct ism_dev *ism) { - struct smcd_event *entry; + struct ism_event *entry; + int i; while ((ism->ieq_idx + 1) != READ_ONCE(ism->ieq->header.idx)) { if (++(ism->ieq_idx) == ARRAY_SIZE(ism->ieq->entry)) @@ -434,13 +499,18 @@ static void ism_handle_event(struct ism_dev *ism) entry = &ism->ieq->entry[ism->ieq_idx]; debug_event(ism_debug_info, 2, entry, sizeof(*entry)); - smcd_handle_event(ism->smcd, entry); + spin_lock(&clients_lock); + for (i = 0; i < max_client; ++i) + if (clients[i]) + clients[i]->handle_event(ism, entry); + spin_unlock(&clients_lock); } } static irqreturn_t ism_handle_irq(int irq, void *data) { struct ism_dev *ism = data; + struct ism_client *clt; unsigned long bit, end; unsigned long *bv; u16 dmbemask; @@ -460,7 +530,8 @@ static irqreturn_t ism_handle_irq(int irq, void *data) dmbemask = ism->sba->dmbe_mask[bit + ISM_DMB_BIT_OFFSET]; ism->sba->dmbe_mask[bit + ISM_DMB_BIT_OFFSET] = 0; barrier(); - smcd_handle_irq(ism->smcd, bit + ISM_DMB_BIT_OFFSET, dmbemask); + clt = clients[ism->sba_client_arr[bit]]; + clt->handle_irq(ism, bit + ISM_DMB_BIT_OFFSET, dmbemask); } if (ism->sba->e) { @@ -472,33 +543,40 @@ static irqreturn_t ism_handle_irq(int irq, void *data) return IRQ_HANDLED; } -static const struct smcd_ops ism_ops = { - .query_remote_gid = ism_query_rgid, - .register_dmb = ism_register_dmb, - .unregister_dmb = ism_unregister_dmb, - .add_vlan_id = ism_add_vlan_id, - .del_vlan_id = ism_del_vlan_id, - .set_vlan_required = ism_set_vlan_required, - .reset_vlan_required = ism_reset_vlan_required, - .signal_event = ism_signal_ieq, - .move_data = ism_move, - .get_system_eid = ism_get_system_eid, - .get_chid = ism_get_chid, -}; +static u64 ism_get_local_gid(struct ism_dev *ism) +{ + return ism->local_gid; +} + +static void ism_dev_add_work_func(struct work_struct *work) +{ + struct ism_client *client = container_of(work, struct ism_client, + add_work); + + client->add(client->tgt_ism); + atomic_dec(&client->tgt_ism->add_dev_cnt); + wake_up(&client->tgt_ism->waitq); +} static int ism_dev_init(struct ism_dev *ism) { struct pci_dev *pdev = ism->pdev; - int ret; + unsigned long flags; + int i, ret; ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI); if (ret <= 0) goto out; + ism->sba_client_arr = kzalloc(ISM_NR_DMBS, GFP_KERNEL); + if (!ism->sba_client_arr) + goto free_vectors; + memset(ism->sba_client_arr, NO_CLIENT, ISM_NR_DMBS); + ret = request_irq(pci_irq_vector(pdev, 0), ism_handle_irq, 0, pci_name(pdev), ism); if (ret) - goto free_vectors; + goto free_client_arr; ret = register_sba(ism); if (ret) @@ -512,13 +590,31 @@ static int ism_dev_init(struct ism_dev *ism) if (ret) goto unreg_ieq; - if (!ism_add_vlan_id(ism->smcd, ISM_RESERVED_VLANID)) + if (!ism_add_vlan_id(ism, ISM_RESERVED_VLANID)) /* hardware is V2 capable */ ism_create_system_eid(); - ret = smcd_register_dev(ism->smcd); - if (ret) - goto unreg_ieq; + init_waitqueue_head(&ism->waitq); + atomic_set(&ism->free_clients_cnt, 0); + atomic_set(&ism->add_dev_cnt, 0); + + wait_event(ism->waitq, !atomic_read(&ism->add_dev_cnt)); + spin_lock_irqsave(&clients_lock, flags); + for (i = 0; i < max_client; ++i) + if (clients[i]) { + INIT_WORK(&clients[i]->add_work, + ism_dev_add_work_func); + clients[i]->tgt_ism = ism; + atomic_inc(&ism->add_dev_cnt); + schedule_work(&clients[i]->add_work); + } + spin_unlock_irqrestore(&clients_lock, flags); + + wait_event(ism->waitq, !atomic_read(&ism->add_dev_cnt)); + + mutex_lock(&ism_dev_list.mutex); + list_add(&ism->list, &ism_dev_list.list); + 
mutex_unlock(&ism_dev_list.mutex); query_info(ism); return 0; @@ -529,6 +625,8 @@ static int ism_dev_init(struct ism_dev *ism) unregister_sba(ism); free_irq: free_irq(pci_irq_vector(pdev, 0), ism); +free_client_arr: + kfree(ism->sba_client_arr); free_vectors: pci_free_irq_vectors(pdev); out: @@ -547,6 +645,12 @@ static int ism_probe(struct pci_dev *pdev, const struct pci_device_id *id) spin_lock_init(&ism->lock); dev_set_drvdata(&pdev->dev, ism); ism->pdev = pdev; + ism->dev.parent = &pdev->dev; + device_initialize(&ism->dev); + dev_set_name(&ism->dev, dev_name(&pdev->dev)); + ret = device_add(&ism->dev); + if (ret) + goto err_dev; ret = pci_enable_device_mem(pdev); if (ret) @@ -564,55 +668,80 @@ static int ism_probe(struct pci_dev *pdev, const struct pci_device_id *id) dma_set_max_seg_size(&pdev->dev, SZ_1M); pci_set_master(pdev); - ism->smcd = smcd_alloc_dev(&pdev->dev, dev_name(&pdev->dev), &ism_ops, - ISM_NR_DMBS); - if (!ism->smcd) { - ret = -ENOMEM; - goto err_resource; - } - - ism->smcd->priv = ism; ret = ism_dev_init(ism); if (ret) - goto err_free; + goto err_resource; return 0; -err_free: - smcd_free_dev(ism->smcd); err_resource: + pci_clear_master(pdev); pci_release_mem_regions(pdev); err_disable: pci_disable_device(pdev); err: - kfree(ism); + device_del(&ism->dev); +err_dev: dev_set_drvdata(&pdev->dev, NULL); + kfree(ism); + return ret; } +static void ism_dev_remove_work_func(struct work_struct *work) +{ + struct ism_client *client = container_of(work, struct ism_client, + remove_work); + + client->remove(client->tgt_ism); + atomic_dec(&client->tgt_ism->free_clients_cnt); + wake_up(&client->tgt_ism->waitq); +} + +/* Callers must hold ism_dev_list.mutex */ static void ism_dev_exit(struct ism_dev *ism) { struct pci_dev *pdev = ism->pdev; + unsigned long flags; + int i; + + wait_event(ism->waitq, !atomic_read(&ism->free_clients_cnt)); + spin_lock_irqsave(&clients_lock, flags); + for (i = 0; i < max_client; ++i) + if (clients[i]) { + INIT_WORK(&clients[i]->remove_work, + ism_dev_remove_work_func); + clients[i]->tgt_ism = ism; + atomic_inc(&ism->free_clients_cnt); + schedule_work(&clients[i]->remove_work); + } + spin_unlock_irqrestore(&clients_lock, flags); + + wait_event(ism->waitq, !atomic_read(&ism->free_clients_cnt)); - smcd_unregister_dev(ism->smcd); if (SYSTEM_EID.serial_number[0] != '0' || SYSTEM_EID.type[0] != '0') - ism_del_vlan_id(ism->smcd, ISM_RESERVED_VLANID); + ism_del_vlan_id(ism, ISM_RESERVED_VLANID); unregister_ieq(ism); unregister_sba(ism); free_irq(pci_irq_vector(pdev, 0), ism); + kfree(ism->sba_client_arr); pci_free_irq_vectors(pdev); + list_del_init(&ism->list); } static void ism_remove(struct pci_dev *pdev) { struct ism_dev *ism = dev_get_drvdata(&pdev->dev); + mutex_lock(&ism_dev_list.mutex); ism_dev_exit(ism); + mutex_unlock(&ism_dev_list.mutex); - smcd_free_dev(ism->smcd); + pci_clear_master(pdev); pci_release_mem_regions(pdev); pci_disable_device(pdev); + device_del(&ism->dev); dev_set_drvdata(&pdev->dev, NULL); kfree(ism); } @@ -632,6 +761,8 @@ static int __init ism_init(void) if (!ism_debug_info) return -ENODEV; + memset(clients, 0, sizeof(clients)); + max_client = 0; debug_register_view(ism_debug_info, &debug_hex_ascii_view); ret = pci_register_driver(&ism_driver); if (ret) @@ -642,9 +773,117 @@ static int __init ism_init(void) static void __exit ism_exit(void) { + struct ism_dev *ism; + + mutex_lock(&ism_dev_list.mutex); + list_for_each_entry(ism, &ism_dev_list.list, list) { + ism_dev_exit(ism); + } + mutex_unlock(&ism_dev_list.mutex); + 
pci_unregister_driver(&ism_driver); debug_unregister(ism_debug_info); } module_init(ism_init); module_exit(ism_exit); + +/*************************** SMC-D Implementation *****************************/ + +#if IS_ENABLED(CONFIG_SMC) +static int smcd_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid, + u32 vid) +{ + return ism_query_rgid(smcd->priv, rgid, vid_valid, vid); +} + +static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb, + struct ism_client *client) +{ + return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb, client); +} + +static int smcd_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb) +{ + return ism_unregister_dmb(smcd->priv, (struct ism_dmb *)dmb); +} + +static int smcd_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id) +{ + return ism_add_vlan_id(smcd->priv, vlan_id); +} + +static int smcd_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id) +{ + return ism_del_vlan_id(smcd->priv, vlan_id); +} + +static int smcd_set_vlan_required(struct smcd_dev *smcd) +{ + return ism_cmd_simple(smcd->priv, ISM_SET_VLAN); +} + +static int smcd_reset_vlan_required(struct smcd_dev *smcd) +{ + return ism_cmd_simple(smcd->priv, ISM_RESET_VLAN); +} + +static int smcd_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq, + u32 event_code, u64 info) +{ + return ism_signal_ieq(smcd->priv, rgid, trigger_irq, event_code, info); +} + +static int smcd_move(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx, + bool sf, unsigned int offset, void *data, + unsigned int size) +{ + return ism_move(smcd->priv, dmb_tok, idx, sf, offset, data, size); +} + +static int smcd_supports_v2(void) +{ + return SYSTEM_EID.serial_number[0] != '0' || + SYSTEM_EID.type[0] != '0'; +} + +static u64 smcd_get_local_gid(struct smcd_dev *smcd) +{ + return ism_get_local_gid(smcd->priv); +} + +static u16 smcd_get_chid(struct smcd_dev *smcd) +{ + return ism_get_chid(smcd->priv); +} + +static inline struct device *smcd_get_dev(struct smcd_dev *dev) +{ + struct ism_dev *ism = dev->priv; + + return &ism->dev; +} + +static const struct smcd_ops ism_ops = { + .query_remote_gid = smcd_query_rgid, + .register_dmb = smcd_register_dmb, + .unregister_dmb = smcd_unregister_dmb, + .add_vlan_id = smcd_add_vlan_id, + .del_vlan_id = smcd_del_vlan_id, + .set_vlan_required = smcd_set_vlan_required, + .reset_vlan_required = smcd_reset_vlan_required, + .signal_event = smcd_signal_ieq, + .move_data = smcd_move, + .supports_v2 = smcd_supports_v2, + .get_system_eid = ism_get_seid, + .get_local_gid = smcd_get_local_gid, + .get_chid = smcd_get_chid, + .get_dev = smcd_get_dev, +}; + +const struct smcd_ops *ism_get_smcd_ops(void) +{ + return &ism_ops; +} +EXPORT_SYMBOL_GPL(ism_get_smcd_ops); +#endif diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig index 6f479e1d1800d24b2735e7be62a51fd5a06ffdca..4a0c3f2b77c911cfa8982e4d2000a75c3ab96b00 100644 --- a/drivers/soc/Kconfig +++ b/drivers/soc/Kconfig @@ -24,5 +24,6 @@ source "drivers/soc/xilinx/Kconfig" source "drivers/soc/zte/Kconfig" source "drivers/soc/kendryte/Kconfig" source "drivers/soc/thead/Kconfig" +source "drivers/soc/alibaba/Kconfig" endmenu diff --git a/drivers/soc/Makefile b/drivers/soc/Makefile index acba2ed0dca1b3bb8edb9830c04dfe27ca2e4268..e88df093196edf33e802d6e449db29e29e2b31e9 100644 --- a/drivers/soc/Makefile +++ b/drivers/soc/Makefile @@ -30,3 +30,4 @@ obj-y += xilinx/ obj-$(CONFIG_ARCH_ZX) += zte/ obj-$(CONFIG_SOC_KENDRYTE) += kendryte/ obj-y += thead/ +obj-y += alibaba/ diff --git a/drivers/soc/alibaba/Kconfig b/drivers/soc/alibaba/Kconfig new 
file mode 100644 index 0000000000000000000000000000000000000000..bab330981c848dd8b1cf914ab4294961c59e9af9 --- /dev/null +++ b/drivers/soc/alibaba/Kconfig @@ -0,0 +1,15 @@ +# SPDX-License-Identifier: GPL-2.0-only + +menu "prefetch tuning drivers" + +config ARM64_PREFETCH_TUNING + tristate "arm64 prefetch tuning support" + depends on ARM64 + help + In some scenarios, adjusting prefetch configuration can effectively + improve performance. This driver provides some interfaces related + to prefetch for sysctl to configure. + + This option enables the support for arm64 prefetch tuning interface. + +endmenu diff --git a/drivers/soc/alibaba/Makefile b/drivers/soc/alibaba/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..7859a67ab473855fb62515dac28baaa3e0637032 --- /dev/null +++ b/drivers/soc/alibaba/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only + +obj-$(CONFIG_ARM64_PREFETCH_TUNING) += prefetch_tuning.o diff --git a/drivers/soc/alibaba/prefetch_tuning.c b/drivers/soc/alibaba/prefetch_tuning.c new file mode 100644 index 0000000000000000000000000000000000000000..9959b0943476f4f5b4de946843f83f8257b8c7c5 --- /dev/null +++ b/drivers/soc/alibaba/prefetch_tuning.c @@ -0,0 +1,267 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DEFINE_HW_TUNABLE2(NAME, H, L) \ + static u64 CPUECTLR_MASK_##NAME = GENMASK_ULL((H), (L)); \ + static int CPUECTLR_MAX_##NAME = GENMASK((H) - (L), 0); \ + static int CPUECTLR_SHIFT_##NAME = L; \ + static int global_##NAME = -1 + +#define DEFINE_HW_TUNABLE1(NAME, B) \ + static u64 CPUECTLR_MASK_##NAME = BIT(B); \ + static int CPUECTLR_MAX_##NAME = 1; \ + static int CPUECTLR_SHIFT_##NAME = B; \ + static int global_##NAME = -1 + +#define CPUECTLR_MASK(NAME) CPUECTLR_MASK_##NAME +#define CPUECTLR_MAX(NAME) CPUECTLR_MAX_##NAME +#define CPUECTLR_SHIFT(NAME) CPUECTLR_SHIFT_##NAME + +#define SYSCTL_ENTRY_HW_TUNABLE(NAME) \ + { \ + .procname = #NAME, \ + .data = &global_##NAME, \ + .maxlen = sizeof(global_##NAME), \ + .mode = 0644, \ + .proc_handler = &proc_dointvec_minmax, \ + .extra1 = SYSCTL_ZERO, \ + .extra2 = &CPUECTLR_MAX(NAME), \ + } + +#define DIRTIED_HW_TUNABLE(NAME) (global_##NAME >= 0) + +#define arm64_read_sysreg(v) ({ \ + u64 __ret; \ + isb(); \ + asm volatile ("mrs %0, " __stringify(v) : "=r" (__ret) :: "memory"); \ + __ret; \ +}) + +#define arm64_write_sysreg(v, r) do { \ + u64 __ret = (u64)(r); \ + asm volatile ("msr " __stringify(v) ", %x0" : : "rZ" (__ret)); \ +} while (0) + +#define update_configure(v, NAME) do { \ + if (DIRTIED_HW_TUNABLE(NAME)) { \ + v &= ~CPUECTLR_MASK(NAME); \ + v |= (u64)global_##NAME << CPUECTLR_SHIFT(NAME); \ + } \ +} while (0) + +#define ID_AA64MMFR1_VHE_MASK GENMASK_ULL(11, 8) +#define ID_AA64MMFR1_VHE_VALID 0x1 + +DEFINE_HW_TUNABLE2(cmc_min_ways, 63, 61); +DEFINE_HW_TUNABLE2(inst_res_ways_l2, 60, 58); +DEFINE_HW_TUNABLE2(ws_threshold_l2, 25, 24); +DEFINE_HW_TUNABLE2(ws_threshold_l3, 23, 22); +DEFINE_HW_TUNABLE2(ws_threshold_l4, 21, 20); +DEFINE_HW_TUNABLE2(ws_threshold_dram, 19, 18); +DEFINE_HW_TUNABLE1(prefetch_disable, 15); +DEFINE_HW_TUNABLE1(prefetch_sts_disable, 9); +DEFINE_HW_TUNABLE1(prefetch_sti_disable, 8); + +static int sysctl_update_cpuectlr; + +static struct ctl_table_header *hw_sysctl_header; + +static u64 *old_cpuectlr; +static bool *write_success; + +static void save_cpuectlr(void *dummy) +{ + int cpu = smp_processor_id(); + u64 cpuectlr; + + /* 
0. Get current cpuectlr */ + cpuectlr = arm64_read_sysreg(S3_0_C15_C1_4); /* cpuectlr_el1 by name will fail */ + + old_cpuectlr[cpu] = cpuectlr; +} + +static void update_cpuectlr(void *dummy) +{ + int cpu = smp_processor_id(); + u64 cpuectlr = old_cpuectlr[cpu]; + u64 new_cpuectlr; + + /* 1. update CMC configuration */ + update_configure(cpuectlr, cmc_min_ways); + + /* 2. update instruction partition configuration */ + update_configure(cpuectlr, inst_res_ways_l2); + + /* 3. update stream write configuration */ + update_configure(cpuectlr, ws_threshold_l2); + update_configure(cpuectlr, ws_threshold_l3); + update_configure(cpuectlr, ws_threshold_l4); + update_configure(cpuectlr, ws_threshold_dram); + + /* 4. update global prefetch configuration */ + update_configure(cpuectlr, prefetch_disable); + + /* 5. update store prefetch configuration */ + update_configure(cpuectlr, prefetch_sts_disable); + update_configure(cpuectlr, prefetch_sti_disable); + + /* write register */ + arm64_write_sysreg(S3_0_C15_C1_4, cpuectlr); + + /* read back to verify the write took effect */ + new_cpuectlr = arm64_read_sysreg(S3_0_C15_C1_4); + if (new_cpuectlr != cpuectlr) { + pr_err("CPU #%d write cpuectlr failed: expected %llx, got %llx\n", + cpu, cpuectlr, new_cpuectlr); + write_success[cpu] = false; + return; + } + + pr_debug("CPU #%d original cpuectlr: %llx, updated to %llx\n", cpu, old_cpuectlr[cpu], + cpuectlr); +} + +static void recall_cpuectlr(void *dummy) +{ + int cpu = smp_processor_id(); + u64 cpuectlr; + + cpuectlr = arm64_read_sysreg(S3_0_C15_C1_4); + if (old_cpuectlr[cpu] && old_cpuectlr[cpu] != cpuectlr) { + arm64_write_sysreg(S3_0_C15_C1_4, old_cpuectlr[cpu]); + pr_debug("CPU #%d recall cpuectlr to %llx\n", cpu, old_cpuectlr[cpu]); + } +} + +static int update_cpuectlr_sysctl_handler(struct ctl_table *table, int write, + void *buffer, size_t *length, loff_t *ppos) +{ + int ret; + int cpu; + + ret = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (ret) + return ret; + if (write && sysctl_update_cpuectlr == 1) { + for_each_possible_cpu(cpu) + write_success[cpu] = true; + + on_each_cpu(update_cpuectlr, NULL, 1); + + /* recall and return errno if any core write fails */ + for_each_possible_cpu(cpu) { + if (!write_success[cpu]) { + on_each_cpu(recall_cpuectlr, NULL, 1); + pr_err("update cpuectlr error\n"); + return -EACCES; + } + } + } + return 0; +} + +static struct ctl_table hw_sysctl_table[] = { + SYSCTL_ENTRY_HW_TUNABLE(cmc_min_ways), + SYSCTL_ENTRY_HW_TUNABLE(inst_res_ways_l2), + SYSCTL_ENTRY_HW_TUNABLE(ws_threshold_l2), + SYSCTL_ENTRY_HW_TUNABLE(ws_threshold_l3), + SYSCTL_ENTRY_HW_TUNABLE(ws_threshold_l4), + SYSCTL_ENTRY_HW_TUNABLE(ws_threshold_dram), + SYSCTL_ENTRY_HW_TUNABLE(prefetch_disable), + SYSCTL_ENTRY_HW_TUNABLE(prefetch_sts_disable), + SYSCTL_ENTRY_HW_TUNABLE(prefetch_sti_disable), + { + .procname = "update_cpuectlr", + .data = &sysctl_update_cpuectlr, + .maxlen = sizeof(sysctl_update_cpuectlr), + .mode = 0644, + .proc_handler = update_cpuectlr_sysctl_handler, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, + {}, +}; + +static struct ctl_table hw_sysctl_root[] = { + { + .procname = "kernel", + .mode = 0555, + .child = hw_sysctl_table, + }, + {}, +}; + +static bool interface_init(void) +{ + hw_sysctl_header = register_sysctl_table(hw_sysctl_root); + return !!hw_sysctl_header; +} + +static void interface_exit(void) +{ + unregister_sysctl_table(hw_sysctl_header); +} + +static int __init prefetch_tuning_init(void) +{ + bool is_guest; + +#ifdef CONFIG_ARM64_VHE + u64 id_aa64mmfr1 = arm64_read_sysreg(S3_0_C0_C7_1); + + 
is_guest = ((id_aa64mmfr1 & ID_AA64MMFR1_VHE_MASK) >> ID_AA64MMFR1_VHE_SHIFT) + == ID_AA64MMFR1_VHE_VALID && !is_hyp_mode_available(); +#else + is_guest = false; +#endif + + if (!is_guest) { + pr_err("prefetch_tuning module is only applicable to the guest OS scenario\n"); + return -EPERM; + } + + old_cpuectlr = kmalloc_array(num_possible_cpus(), sizeof(u64), GFP_KERNEL); + if (!old_cpuectlr) + return -ENOMEM; + + write_success = kmalloc_array(num_possible_cpus(), sizeof(bool), GFP_KERNEL); + if (!write_success) { + kfree(old_cpuectlr); + return -ENOMEM; + } + + if (!interface_init()) { + pr_err("Failed to register hw_sysctl_table\n"); + kfree(old_cpuectlr); + kfree(write_success); + return -EPERM; + } + + on_each_cpu(save_cpuectlr, NULL, 1); + + return 0; +} + +static void __exit prefetch_tuning_exit(void) +{ + on_each_cpu(recall_cpuectlr, NULL, 1); + interface_exit(); + kfree(old_cpuectlr); + kfree(write_success); +} + +module_init(prefetch_tuning_init); +module_exit(prefetch_tuning_exit); + +MODULE_DESCRIPTION("Prefetch Tuning Switch for Alibaba Cloud ECS"); +MODULE_LICENSE("GPL v2"); diff --git a/drivers/usb/dwc3/dwc3-qcom.c b/drivers/usb/dwc3/dwc3-qcom.c index 504f8af4d0f80ba319276d9c8347a6c0d5b2f38c..a1c4e3df5626b8d32c068b6bc5f32b718d81d1b9 100644 --- a/drivers/usb/dwc3/dwc3-qcom.c +++ b/drivers/usb/dwc3/dwc3-qcom.c @@ -594,8 +594,10 @@ static int dwc3_qcom_acpi_register_core(struct platform_device *pdev) qcom->dwc3->dev.coherent_dma_mask = dev->coherent_dma_mask; child_res = kcalloc(2, sizeof(*child_res), GFP_KERNEL); - if (!child_res) + if (!child_res) { + platform_device_put(qcom->dwc3); return -ENOMEM; + } res = platform_get_resource(pdev, IORESOURCE_MEM, 0); if (!res) { @@ -631,10 +633,15 @@ static int dwc3_qcom_acpi_register_core(struct platform_device *pdev) } ret = platform_device_add(qcom->dwc3); - if (ret) + if (ret) { dev_err(&pdev->dev, "failed to add device\n"); + goto out; + } + kfree(child_res); + return 0; out: + platform_device_put(qcom->dwc3); kfree(child_res); return ret; } diff --git a/drivers/usb/gadget/legacy/inode.c b/drivers/usb/gadget/legacy/inode.c index 454860d52ce77f96b3ea047d37e90da6dc289fe0..a926baca2b514beb93f73ec1ba5041742535ec4d 100644 --- a/drivers/usb/gadget/legacy/inode.c +++ b/drivers/usb/gadget/legacy/inode.c @@ -229,6 +229,7 @@ static void put_ep (struct ep_data *data) */ static const char *CHIP; +static DEFINE_MUTEX(sb_mutex); /* Serialize superblock operations */ /*----------------------------------------------------------------------*/ @@ -2011,13 +2012,20 @@ gadgetfs_fill_super (struct super_block *sb, struct fs_context *fc) { struct inode *inode; struct dev_data *dev; + int rc; - if (the_device) - return -ESRCH; + mutex_lock(&sb_mutex); + + if (the_device) { + rc = -ESRCH; + goto Done; + } CHIP = usb_get_gadget_udc_name(); - if (!CHIP) - return -ENODEV; + if (!CHIP) { + rc = -ENODEV; + goto Done; + } /* superblock */ sb->s_blocksize = PAGE_SIZE; @@ -2054,13 +2062,17 @@ gadgetfs_fill_super (struct super_block *sb, struct fs_context *fc) * from binding to a controller.
*/ the_device = dev; - return 0; + rc = 0; + goto Done; -Enomem: + Enomem: kfree(CHIP); CHIP = NULL; + rc = -ENOMEM; - return -ENOMEM; + Done: + mutex_unlock(&sb_mutex); + return rc; } /* "mount -t gadgetfs path /dev/gadget" ends up here */ @@ -2082,6 +2094,7 @@ static int gadgetfs_init_fs_context(struct fs_context *fc) static void gadgetfs_kill_sb (struct super_block *sb) { + mutex_lock(&sb_mutex); kill_litter_super (sb); if (the_device) { put_dev (the_device); @@ -2089,6 +2102,7 @@ gadgetfs_kill_sb (struct super_block *sb) } kfree(CHIP); CHIP = NULL; + mutex_unlock(&sb_mutex); } /*----------------------------------------------------------------------*/ diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c index a200c91bd864e9d4b533a524fb0cdb47d1e889a9..180e93d9b2cc350ab1049128295bac27ba3f7247 100644 --- a/drivers/usb/host/xhci-pci.c +++ b/drivers/usb/host/xhci-pci.c @@ -294,6 +294,11 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci) if (pdev->vendor == PCI_VENDOR_ID_ZHAOXIN) xhci->quirks |= XHCI_SUSPEND_DELAY; + if (pdev->vendor == PCI_VENDOR_ID_ZHAOXIN) { + xhci->quirks |= XHCI_LPM_SUPPORT; + xhci->quirks |= XHCI_ZHAOXIN_HOST; + } + /* See https://bugzilla.kernel.org/show_bug.cgi?id=79511 */ if (pdev->vendor == PCI_VENDOR_ID_VIA && pdev->device == 0x3432) @@ -351,6 +356,9 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci) pdev->device == PCI_DEVICE_ID_AMD_YELLOW_CARP_XHCI_8)) xhci->quirks |= XHCI_DEFAULT_PM_RUNTIME_ALLOW; + if (pdev->vendor == PCI_VENDOR_ID_ZHAOXIN && pdev->device == 0x9202) + xhci->quirks |= XHCI_RESET_ON_RESUME; + if (xhci->quirks & XHCI_RESET_ON_RESUME) xhci_dbg_trace(xhci, trace_xhci_dbg_quirks, "QUIRK: Resetting on resume"); diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index 997de5f294f15c0d319f154f73566e63ebdbf632..bc04b4b4994f1d0df77eb991fe37e278a316660b 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -4713,7 +4713,7 @@ static u16 xhci_calculate_u1_timeout(struct xhci_hcd *xhci, } } - if (xhci->quirks & XHCI_INTEL_HOST) + if (xhci->quirks & (XHCI_INTEL_HOST | XHCI_ZHAOXIN_HOST)) timeout_ns = xhci_calculate_intel_u1_timeout(udev, desc); else timeout_ns = udev->u1_params.sel; @@ -4777,7 +4777,7 @@ static u16 xhci_calculate_u2_timeout(struct xhci_hcd *xhci, } } - if (xhci->quirks & XHCI_INTEL_HOST) + if (xhci->quirks & (XHCI_INTEL_HOST | XHCI_ZHAOXIN_HOST)) timeout_ns = xhci_calculate_intel_u2_timeout(udev, desc); else timeout_ns = udev->u2_params.sel; @@ -4874,12 +4874,39 @@ static int xhci_check_intel_tier_policy(struct usb_device *udev, return -E2BIG; } +static int xhci_check_zhaoxin_tier_policy(struct usb_device *udev, + enum usb3_link_state state) +{ + struct usb_device *parent; + unsigned int num_hubs; + char *state_name; + + if (state == USB3_LPM_U1) + state_name = "U1"; + else if (state == USB3_LPM_U2) + state_name = "U2"; + else + state_name = "Unknown"; + /* Don't enable U1/U2 if the device is on an external hub*/ + for (parent = udev->parent, num_hubs = 0; parent->parent; parent = parent->parent) + num_hubs++; + + if (num_hubs < 1) + return 0; + + dev_dbg(&udev->dev, "Disabling %s link state for device below external hub.\n", state_name); + dev_dbg(&udev->dev, "Plug device into root port to decrease power consumption.\n"); + return -E2BIG; +} + static int xhci_check_tier_policy(struct xhci_hcd *xhci, struct usb_device *udev, enum usb3_link_state state) { if (xhci->quirks & XHCI_INTEL_HOST) return xhci_check_intel_tier_policy(udev, state); + else if 
(xhci->quirks & XHCI_ZHAOXIN_HOST) + return xhci_check_zhaoxin_tier_policy(udev, state); else return 0; } diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 54851223c5b3deeb6388b28efcdc9b7e779e39cc..0b59fb9d4bacd43990bd1feddc9138fe0d542cd8 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1888,6 +1888,7 @@ struct xhci_hcd { #define XHCI_SG_TRB_CACHE_SIZE_QUIRK BIT_ULL(39) #define XHCI_NO_SOFT_RETRY BIT_ULL(40) #define XHCI_EP_CTX_BROKEN_DCS BIT_ULL(42) +#define XHCI_ZHAOXIN_HOST BIT_ULL(43) #define XHCI_ZHAOXIN_TRB_FETCH BIT_ULL(44) unsigned int num_active_eps; diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 32c9925de473638fc1ea4fa9aad67db9c30411da..13f669647847202b4eb2f819900a5b7022e179b5 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -219,6 +219,11 @@ static void destroy_indirect_key(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_m mlx5_vdpa_destroy_mkey(mvdev, &mkey->mkey); } +static struct device *get_dma_device(struct mlx5_vdpa_dev *mvdev) +{ + return &mvdev->mdev->pdev->dev; +} + static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr, struct vhost_iotlb *iotlb) { @@ -234,7 +239,7 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr u64 pa; u64 paend; struct scatterlist *sg; - struct device *dma = mvdev->mdev->device; + struct device *dma = get_dma_device(mvdev); for (map = vhost_iotlb_itree_first(iotlb, mr->start, mr->end - 1); map; map = vhost_iotlb_itree_next(map, start, mr->end - 1)) { @@ -293,7 +298,7 @@ static int map_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr static void unmap_direct_mr(struct mlx5_vdpa_dev *mvdev, struct mlx5_vdpa_direct_mr *mr) { - struct device *dma = mvdev->mdev->device; + struct device *dma = get_dma_device(mvdev); destroy_direct_mr(mvdev, mr); dma_unmap_sg_attrs(dma, mr->sg_head.sgl, mr->nsg, DMA_BIDIRECTIONAL, 0); diff --git a/drivers/vdpa/mlx5/core/resources.c b/drivers/vdpa/mlx5/core/resources.c index 96e6421c5d1cf896447b9db566be421d560e3943..6521cbd0f5c2784fea9ee4956a4e3ba9faaf1452 100644 --- a/drivers/vdpa/mlx5/core/resources.c +++ b/drivers/vdpa/mlx5/core/resources.c @@ -246,7 +246,8 @@ int mlx5_vdpa_alloc_resources(struct mlx5_vdpa_dev *mvdev) if (err) goto err_key; - kick_addr = pci_resource_start(mdev->pdev, 0) + offset; + kick_addr = mdev->bar_addr + offset; + res->kick_addr = ioremap(kick_addr, PAGE_SIZE); if (!res->kick_addr) { err = -ENOMEM; diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig index c8d55c844fda5df56f17bf83b6459470b5c42e26..7a4d69f41820be476b912c7669ad65e8c87350c2 100644 --- a/drivers/virt/Kconfig +++ b/drivers/virt/Kconfig @@ -39,4 +39,6 @@ source "drivers/virt/coco/efi_secret/Kconfig" source "drivers/virt/coco/tdx-guest/Kconfig" +source "drivers/virt/coco/csv-guest/Kconfig" + endif diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile index 64eeaa3e97fc0a94af9626ea7c4727418f67f4c7..8768e2e050ee5cc9d178faaeecd58cd3b0b1fea0 100644 --- a/drivers/virt/Makefile +++ b/drivers/virt/Makefile @@ -9,3 +9,4 @@ obj-y += vboxguest/ obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/ obj-$(CONFIG_EFI_SECRET) += coco/efi_secret/ obj-$(CONFIG_INTEL_TDX_GUEST) += coco/tdx-guest/ +obj-$(CONFIG_CSV_GUEST) += coco/csv-guest/ diff --git a/drivers/virt/coco/csv-guest/Kconfig b/drivers/virt/coco/csv-guest/Kconfig new file mode 100644 index 0000000000000000000000000000000000000000..4cbde598e66508482bbe210b8eb7676f2564eac3 --- /dev/null +++ 
b/drivers/virt/coco/csv-guest/Kconfig @@ -0,0 +1,12 @@ +config CSV_GUEST + tristate "HYGON CSV Guest driver" + default m + depends on AMD_MEM_ENCRYPT + help + CSV firmware provides the guest a mechanism to communicate with + the PSP without risk from a malicious hypervisor who wishes to read, + alter, drop or replay the messages sent. The driver provides + userspace interface to communicate with the PSP to request the + attestation report and more. + + If you choose 'M' here, this module will be called csv-guest. diff --git a/drivers/virt/coco/csv-guest/Makefile b/drivers/virt/coco/csv-guest/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..a1c3a1499fc6f6e997d1629df4e1677b31f1a28d --- /dev/null +++ b/drivers/virt/coco/csv-guest/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_CSV_GUEST) += csv-guest.o diff --git a/drivers/virt/coco/csv-guest/csv-guest.c b/drivers/virt/coco/csv-guest/csv-guest.c new file mode 100644 index 0000000000000000000000000000000000000000..d449130f8dcc153c6aa9712278dae7ded5ee0288 --- /dev/null +++ b/drivers/virt/coco/csv-guest/csv-guest.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * + * Userspace interface for CSV guest driver + * + * Copyright (C) Hygon Info Technologies Ltd. + */ +#include +#include +#include +#include + +#include + +#include + +#include "csv-guest.h" + +static long csv_get_report(void __user *argp) +{ + u8 *csv_report; + long ret; + struct csv_report_req req; + + if (copy_from_user(&req, argp, sizeof(struct csv_report_req))) + return -EFAULT; + + if (req.len < CSV_REPORT_INPUT_DATA_LEN) + return -EINVAL; + + csv_report = kzalloc(req.len, GFP_KERNEL); + if (!csv_report) { + ret = -ENOMEM; + goto out; + } + + /* Save user input data */ + if (copy_from_user(csv_report, req.report_data, CSV_REPORT_INPUT_DATA_LEN)) { + ret = -EFAULT; + goto out; + } + + /* Generate CSV_REPORT using "KVM_HC_VM_ATTESTATION" VMMCALL */ + ret = kvm_hypercall2(KVM_HC_VM_ATTESTATION, __pa(csv_report), req.len); + if (ret) + goto out; + + if (copy_to_user(req.report_data, csv_report, req.len)) + ret = -EFAULT; + +out: + kfree(csv_report); + return ret; +} + +static long csv_guest_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case CSV_CMD_GET_REPORT: + return csv_get_report((void __user *)arg); + default: + return -ENOTTY; + } +} + +static void mem_test_init(void) +{ + char head_str[] = "test mem encrypt"; + u64 *va_addr = __va(0x0); + + if (va_addr) { + memset(va_addr, 0x66, PAGE_SIZE); + memcpy(va_addr, head_str, sizeof(head_str)); + clflush_cache_range(va_addr, PAGE_SIZE); + } else + pr_err("Initialize 1 page for csv memory test failed!\n"); +} + +static const struct file_operations csv_guest_fops = { + .owner = THIS_MODULE, + .unlocked_ioctl = csv_guest_ioctl, + .compat_ioctl = csv_guest_ioctl, +}; + +static struct miscdevice csv_guest_dev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "csv-guest", + .fops = &csv_guest_fops, + .mode = 0777, +}; + +static int __init csv_guest_init(void) +{ + // Initialize 1 page for csv memory test + mem_test_init(); + + return misc_register(&csv_guest_dev); +} + +static void __exit csv_guest_exit(void) +{ + misc_deregister(&csv_guest_dev); +} + +MODULE_LICENSE("GPL"); +MODULE_VERSION("1.0.0"); +MODULE_DESCRIPTION("HYGON CSV Guest Driver"); +module_init(csv_guest_init); +module_exit(csv_guest_exit); diff --git a/drivers/virt/coco/csv-guest/csv-guest.h b/drivers/virt/coco/csv-guest/csv-guest.h new file mode 100644 index 
0000000000000000000000000000000000000000..0342d5f16cb3fed9fa61b8fe529ff4e2b4b84399 --- /dev/null +++ b/drivers/virt/coco/csv-guest/csv-guest.h @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * + * Userspace interface for CSV guest driver + * + * Copyright (C) Hygon Info Technologies Ltd. + */ + +#ifndef __VIRT_CSVGUEST_H__ +#define __VIRT_CSVGUEST_H__ + +#include +#include + +/* Length of the user input datas used in VMMCALL */ +#define CSV_REPORT_USER_DATA_LEN 64 +#define CSV_REPORT_MNONCE_LEN 16 +#define CSV_REPORT_HASH_LEN 32 +#define CSV_REPORT_INPUT_DATA_LEN (CSV_REPORT_USER_DATA_LEN + CSV_REPORT_MNONCE_LEN \ + + CSV_REPORT_HASH_LEN) + +/** + * struct csv_report_req - Request struct for CSV_CMD_GET_REPORT IOCTL. + * + * @report_data:User buffer with REPORT_DATA to be included into CSV_REPORT, and it's also + * user buffer to store CSV_REPORT output from VMMCALL[KVM_HC_VM_ATTESTATION]. + * @len: Length of the user buffer. + */ +struct csv_report_req { + u8 *report_data; + int len; +}; + +/* + * CSV_CMD_GET_REPORT - Get CSV_REPORT using VMMCALL[KVM_HC_VM_ATTESTATION] + * + * Return 0 on success, -EIO on VMMCALL execution failure, and + * standard errno on other general error cases. + */ +#define CSV_CMD_GET_REPORT _IOWR('D', 1, struct csv_report_req) + +#endif /* __VIRT_CSVGUEST_H__ */ diff --git a/fs/dax.c b/fs/dax.c index 8ddad60312f512f19a282dd72167406e719b74b3..0dab8b381f97614f5ecfa624d20e2ad6a432b0eb 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1117,7 +1117,7 @@ static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos) return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9; } -static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos, +int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos, size_t size, void **kaddr, pfn_t *pfnp) { const sector_t sector = dax_iomap_sector(iomap, pos); @@ -1156,6 +1156,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos, dax_read_unlock(id); return rc; } +EXPORT_SYMBOL_GPL(dax_iomap_direct_access); /** * dax_iomap_copy_around - Prepare for an unaligned write to a shared/cow page @@ -1728,6 +1729,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, if (ret & VM_FAULT_ERROR) copied = 0; + else if (ops->iomap_save_private) + ops->iomap_save_private(vma, &iomap); /* * The fault is done by now and there's no way back (other * thread may be already happily using PTE we have installed). diff --git a/fs/erofs/data.c b/fs/erofs/data.c index a1b99e6691302db411810ca8fc090433f01387c9..cf1f56ac04fd7d6d0219267cbeb944769052da45 100644 --- a/fs/erofs/data.c +++ b/fs/erofs/data.c @@ -9,6 +9,7 @@ #include #include #include +#include void erofs_unmap_metabuf(struct erofs_buf *buf) { @@ -22,19 +23,31 @@ void erofs_unmap_metabuf(struct erofs_buf *buf) void erofs_put_metabuf(struct erofs_buf *buf) { + pgoff_t index; + if (!buf->page) return; erofs_unmap_metabuf(buf); + + index = buf->page->index; put_page(buf->page); buf->page = NULL; + + if (buf->mapping) { + buf->mapping->a_ops->endpfn(buf->mapping, index, + &buf->iomap, 0); + buf->mapping = NULL; + memset(&buf->iomap, 0, sizeof(buf->iomap)); + } } /* * Derive the block size from inode->i_blkbits to make compatible with * anonymous inode in fscache mode. 
*/ -void *erofs_bread(struct erofs_buf *buf, struct inode *inode, - erofs_blk_t blkaddr, enum erofs_kmap_type type) +void *__erofs_bread(struct super_block *sb, struct erofs_buf *buf, + struct inode *inode, erofs_blk_t blkaddr, + enum erofs_kmap_type type) { erofs_off_t offset = (erofs_off_t)blkaddr << inode->i_blkbits; struct address_space *const mapping = inode->i_mapping; @@ -43,8 +56,25 @@ void *erofs_bread(struct erofs_buf *buf, struct inode *inode, if (!page || page->index != index) { erofs_put_metabuf(buf); - page = read_cache_page_gfp(mapping, index, + if (EROFS_SB(sb)->bootstrap) { + unsigned int nofs_flag; + + nofs_flag = memalloc_nofs_save(); + if (IS_DAX(inode)) { + page = mapping->a_ops->startpfn(mapping, index, + &buf->iomap); + if (!IS_ERR(page)) + buf->mapping = mapping; + } else { + page = read_cache_page(mapping, index, + (filler_t *)mapping->a_ops->readpage, + EROFS_SB(sb)->bootstrap); + } + memalloc_nofs_restore(nofs_flag); + } else { + page = read_cache_page_gfp(mapping, index, mapping_gfp_constraint(mapping, ~__GFP_FS)); + } if (IS_ERR(page)) return page; /* should already be PageUptodate, no need to lock page */ @@ -65,9 +95,19 @@ void *erofs_bread(struct erofs_buf *buf, struct inode *inode, return buf->base + (offset & ~PAGE_MASK); } +void *erofs_bread(struct erofs_buf *buf, struct inode *inode, + erofs_blk_t blkaddr, enum erofs_kmap_type type) +{ + return __erofs_bread(NULL, buf, inode, blkaddr, type); +} + void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb, erofs_blk_t blkaddr, enum erofs_kmap_type type) { + if (EROFS_SB(sb)->bootstrap) + return __erofs_bread(sb, buf, EROFS_SB(sb)->bootstrap->f_inode, + blkaddr, type); + if (erofs_is_fscache_mode(sb)) return erofs_bread(buf, EROFS_SB(sb)->s_fscache->inode, blkaddr, type); @@ -75,8 +115,7 @@ void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb, return erofs_bread(buf, sb->s_bdev->bd_inode, blkaddr, type); } -static int erofs_map_blocks_flatmode(struct inode *inode, - struct erofs_map_blocks *map) +int erofs_map_blocks_flatmode(struct inode *inode, struct erofs_map_blocks *map) { erofs_blk_t nblocks, lastblk; u64 offset = map->m_la; @@ -199,6 +238,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map) int id; map->m_bdev = sb->s_bdev; + map->m_fp = EROFS_SB(sb)->bootstrap; map->m_fscache = EROFS_SB(sb)->s_fscache; if (map->m_deviceid) { @@ -214,6 +254,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map) return 0; } map->m_bdev = dif->bdev; + map->m_fp = dif->blobfile; map->m_fscache = dif->fscache; up_read(&devs->rwsem); } else if (devs->extra_devices && !devs->flatdev) { @@ -230,6 +271,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map) map->m_pa < startoff + length) { map->m_pa -= startoff; map->m_bdev = dif->bdev; + map->m_fp = dif->blobfile; map->m_fscache = dif->fscache; break; } diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c index d37bd31a3adc63b904287b7eea8df1e71ca4258f..d8df9834a0db3c7e3e987d1cf81861cd0318c367 100644 --- a/fs/erofs/inode.c +++ b/fs/erofs/inode.c @@ -5,9 +5,12 @@ * Copyright (C) 2021, Alibaba Cloud */ #include "xattr.h" - +#include #include +const struct file_operations rafs_v6_file_ro_fops; +const struct address_space_operations rafs_v6_aops; + /* * if inode is successfully read, return its inode page (or sometimes * the inode payload page if it's an extended inode) in order to fill @@ -247,6 +250,8 @@ static int erofs_fill_inode(struct inode *inode) { struct erofs_inode *vi = 
EROFS_I(inode); struct erofs_buf buf = __EROFS_BUF_INITIALIZER; + struct super_block *sb = inode->i_sb; + struct erofs_sb_info *sbi = EROFS_SB(sb); void *kaddr; unsigned int ofs; int err = 0; @@ -262,10 +267,14 @@ static int erofs_fill_inode(struct inode *inode) switch (inode->i_mode & S_IFMT) { case S_IFREG: inode->i_op = &erofs_generic_iops; - if (erofs_inode_is_data_compressed(vi->datalayout)) + if (erofs_inode_is_data_compressed(vi->datalayout)) { inode->i_fop = &generic_ro_fops; - else - inode->i_fop = &erofs_file_fops; + } else { + if (sbi->bootstrap) + inode->i_fop = &rafs_v6_file_ro_fops; + else + inode->i_fop = &erofs_file_fops; + } break; case S_IFDIR: inode->i_op = &erofs_dir_iops; @@ -297,11 +306,16 @@ static int erofs_fill_inode(struct inode *inode) err = -EOPNOTSUPP; goto out_unlock; } - inode->i_mapping->a_ops = &erofs_raw_access_aops; + if (sbi->bootstrap && !S_ISREG(inode->i_mode)) { + inode_nohighmem(inode); + inode->i_mapping->a_ops = &rafs_v6_aops; + } else if (inode->i_sb->s_bdev) { + inode->i_mapping->a_ops = &erofs_raw_access_aops; #ifdef CONFIG_EROFS_FS_ONDEMAND - if (erofs_is_fscache_mode(inode->i_sb)) + } else if (erofs_is_fscache_mode(inode->i_sb)) { inode->i_mapping->a_ops = &erofs_fscache_access_aops; #endif + } out_unlock: erofs_put_metabuf(&buf); @@ -389,3 +403,266 @@ const struct inode_operations erofs_fast_symlink_iops = { .listxattr = erofs_listxattr, .get_acl = erofs_get_acl, }; + +static ssize_t rafs_v6_read_chunk(struct super_block *sb, + struct iov_iter *to, u64 off, u64 size, + unsigned int device_id) +{ + struct iov_iter titer; + ssize_t read = 0; + struct erofs_map_dev mdev = { + .m_deviceid = device_id, + .m_pa = off, + }; + int err; + + err = erofs_map_dev(sb, &mdev); + if (err) + return err; + off = mdev.m_pa; + do { + ssize_t ret; + + if (iov_iter_is_pipe(to)) { + iov_iter_pipe(&titer, READ, to->pipe, size - read); + + ret = vfs_iter_read(mdev.m_fp, &titer, &off, 0); + pr_debug("pipe ret %ld off %llu size %llu read %ld\n", + ret, off, size, read); + if (ret <= 0) { + pr_err("%s: failed to read blob ret %ld\n", __func__, ret); + return ret; + } + } else { + struct iovec iovec = iov_iter_iovec(to); + + if (iovec.iov_len > size - read) + iovec.iov_len = size - read; + + pr_debug("%s: off %llu size %llu iov_len %lu blob_index %u\n", + __func__, off, size, iovec.iov_len, device_id); + + /* TODO async */ + iov_iter_init(&titer, READ, &iovec, 1, iovec.iov_len); + ret = vfs_iter_read(mdev.m_fp, &titer, &off, 0); + if (ret <= 0) { + pr_err("%s: failed to read blob ret %ld\n", __func__, ret); + return ret; + } else if (ret < iovec.iov_len) { + return read; + } + } + iov_iter_advance(to, ret); + read += ret; + } while (read < size); + + return read; +} + +static ssize_t rafs_v6_file_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct erofs_map_blocks map = { 0 }; + ssize_t bytes = 0; + u64 total = min_t(u64, iov_iter_count(to), + inode->i_size - iocb->ki_pos); + + while (total) { + erofs_off_t pos = iocb->ki_pos; + u64 delta, size; + ssize_t read; + + if (map.m_la < pos || map.m_la + map.m_llen >= pos) { + int err; + + map.m_la = pos; + err = erofs_map_blocks(inode, &map); + if (err) + return err; + if (map.m_la >= inode->i_size) + break; + } + delta = pos - map.m_la; + size = min_t(u64, map.m_llen - delta, total); + pr_debug("inode i_size %llu pa %llu delta %llu size %llu", + inode->i_size, map.m_pa, delta, size); + read = rafs_v6_read_chunk(inode->i_sb, to, map.m_pa + delta, + size, 
map.m_deviceid); + if (read <= 0 || read < size) { + erofs_err(inode->i_sb, + "short read %ld pos %llu size %llu @ nid %llu", + read, pos, size, EROFS_I(inode)->nid); + return read < 0 ? read : -EIO; + } + iocb->ki_pos += read; + bytes += read; + total -= read; + } + return bytes; +} + +static vm_fault_t rafs_v6_filemap_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = file_inode(vma->vm_file); + pgoff_t npages, orig_pgoff = vmf->pgoff; + erofs_off_t pos; + struct erofs_map_blocks map = {0}; + struct erofs_map_dev mdev; + struct vm_area_struct lower_vma; + int err; + vm_fault_t ret; + + npages = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); + if (unlikely(orig_pgoff >= npages)) + return VM_FAULT_SIGBUS; + + memcpy(&lower_vma, vmf->vma, sizeof(lower_vma)); + WARN_ON_ONCE(lower_vma.vm_private_data != vma->vm_private_data); + + /* TODO: check if chunk is available for us to read. */ + map.m_la = orig_pgoff << PAGE_SHIFT; + pos = map.m_la; + err = erofs_map_blocks(inode, &map); + if (err) + return vmf_error(err); + + mdev = (struct erofs_map_dev) { + .m_deviceid = map.m_deviceid, + .m_pa = map.m_pa, + }; + err = erofs_map_dev(inode->i_sb, &mdev); + if (err) + return vmf_error(err); + + lower_vma.vm_file = mdev.m_fp; + vmf->pgoff = (mdev.m_pa + (pos - map.m_la)) >> PAGE_SHIFT; + vmf->vma = &lower_vma; /* override vma temporarily */ + ret = EROFS_I(inode)->lower_vm_ops->fault(vmf); + vmf->vma = vma; + vmf->pgoff = orig_pgoff; + return ret; +} + +static void rafs_v6_vm_close(struct vm_area_struct *vma) +{ + struct inode *inode; + + if (!vma || !vma->vm_file) { + WARN_ON_ONCE(1); + return; + } + + inode = file_inode(vma->vm_file); + if (EROFS_I(inode)->lower_vm_ops && EROFS_I(inode)->lower_vm_ops->close) + EROFS_I(inode)->lower_vm_ops->close(vma); + + WARN_ON(vma->vm_private_data); +} + +static void rafs_v6_vm_open(struct vm_area_struct *vma) +{ + struct inode *inode; + + if (!vma || !vma->vm_file) { + WARN_ON_ONCE(1); + return; + } + + inode = file_inode(vma->vm_file); + if (EROFS_I(inode)->lower_vm_ops && EROFS_I(inode)->lower_vm_ops->open) + EROFS_I(inode)->lower_vm_ops->open(vma); +} + +static const struct vm_operations_struct rafs_v6_vm_ops = { + .fault = rafs_v6_filemap_fault, + .close = rafs_v6_vm_close, + .open = rafs_v6_vm_open, +}; + +static int rafs_v6_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct inode *inode = file_inode(file); + struct erofs_inode *vi = EROFS_I(inode); + const struct vm_operations_struct *lower_vm_ops; + struct file *realfile = EROFS_I_SB(inode)->bootstrap; + int ret; + + if (!realfile || !realfile->f_op->mmap) { + pr_err("%s: no bootstrap or mmap\n", __func__); + return -EOPNOTSUPP; + } + + ret = call_mmap(EROFS_I_SB(inode)->bootstrap, vma); + if (ret) { + pr_err("%s: call_mmap failed ret %d\n", __func__, ret); + return ret; + } + + /* set fs's vm_ops which is used in fault(). */ + lower_vm_ops = vma->vm_ops; + + if (vi->lower_vm_ops && vi->lower_vm_ops != lower_vm_ops) { + WARN_ON_ONCE(1); + return -EOPNOTSUPP; + } + /* fault() must exist in order to proceed. 
*/ + if (!lower_vm_ops || !lower_vm_ops->fault) { + WARN_ON_ONCE(1); + return -EOPNOTSUPP; + } + vi->lower_vm_ops = lower_vm_ops; + vma->vm_flags &= ~VM_HUGEPAGE; /* dont use huge page */ + vma->vm_ops = &rafs_v6_vm_ops; + return 0; +} + +const struct file_operations rafs_v6_file_ro_fops = { + .llseek = generic_file_llseek, + .read_iter = rafs_v6_file_read_iter, + .mmap = rafs_v6_file_mmap, +// .mmap = generic_file_readonly_mmap, + .splice_read = generic_file_splice_read, +}; + +static int rafs_v6_readpage(struct file *file, struct page *page) +{ + struct kvec iov = { + .iov_base = page_address(page), + }; + struct inode *inode = page->mapping->host; + struct super_block *sb = inode->i_sb; + erofs_off_t pos = page->index << PAGE_SHIFT; + struct erofs_map_blocks map = { .m_la = pos }; + struct kiocb kiocb; + struct iov_iter iter; + int err; + + err = erofs_map_blocks(inode, &map); + if (err) + goto err_out; + + iov.iov_len = min_t(u64, PAGE_SIZE, map.m_plen - (pos - map.m_la)); + init_sync_kiocb(&kiocb, EROFS_SB(sb)->bootstrap); + kiocb.ki_pos = map.m_pa + (pos - map.m_la); +// if (!(kiocb.ki_pos & ~PAGE_MASK) && iov.iov_len == PAGE_SIZE) +// kiocb.ki_flags |= IOCB_DIRECT; + iov_iter_kvec(&iter, READ, &iov, 1, iov.iov_len); + err = kiocb.ki_filp->f_op->read_iter(&kiocb, &iter); + if (err < iov.iov_len) + goto err_out; + if (iov.iov_len < PAGE_SIZE) + memset(iov.iov_base + iov.iov_len, 0, + PAGE_SIZE - iov.iov_len); + SetPageUptodate(page); + unlock_page(page); + return 0; +err_out: + SetPageError(page); + unlock_page(page); + return err; +} + +const struct address_space_operations rafs_v6_aops = { + .readpage = rafs_v6_readpage, +}; diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h index 28a3f68e60086b6388c05c4d75fad5ea1e3e363f..bffcd971f1d3cd37ea360679fea96dca644457f1 100644 --- a/fs/erofs/internal.h +++ b/fs/erofs/internal.h @@ -17,6 +17,7 @@ #include #include #include +#include #include "erofs_fs.h" /* redefine pr_fmt "erofs: " */ @@ -51,6 +52,7 @@ struct erofs_device_info { char *path; struct erofs_fscache *fscache; struct block_device *bdev; + struct file *blobfile; u32 blocks; u32 mapped_blkaddr; @@ -80,6 +82,8 @@ struct erofs_fs_context { struct erofs_dev_context *devs; char *fsid; char *domain_id; + char *bootstrap_path; + char *blob_dir_path; }; struct erofs_domain { @@ -112,6 +116,10 @@ struct erofs_sb_info { /* pseudo inode to manage cached pages */ struct inode *managed_cache; #endif /* CONFIG_EROFS_FS_ZIP */ + struct path blob_dir; + struct file *bootstrap; + char *bootstrap_path; + char *blob_dir_path; struct erofs_dev_context *devs; u64 total_blocks; u32 primarydevice_blocks; @@ -152,11 +160,6 @@ struct erofs_sb_info { /* Mount flags set via mount options or defaults */ #define EROFS_MOUNT_XATTR_USER 0x00000010 #define EROFS_MOUNT_POSIX_ACL 0x00000020 -/* - * Bypass [override_creds|revert_creds] for overlayfs - * when erofs is mounted as lowerfs. 
- */ -#define EROFS_MOUNT_OPT_CREDS 0x80000000 #define clear_opt(opt, option) ((opt)->mount_opt &= ~EROFS_MOUNT_##option) #define set_opt(opt, option) ((opt)->mount_opt |= EROFS_MOUNT_##option) @@ -257,6 +260,8 @@ enum erofs_kmap_type { }; struct erofs_buf { + struct iomap iomap; + struct address_space *mapping; struct page *page; void *base; enum erofs_kmap_type kmap_type; @@ -300,6 +305,7 @@ struct erofs_inode { unsigned int xattr_shared_count; unsigned int *xattr_shared_xattrs; + const struct vm_operations_struct *lower_vm_ops; union { erofs_blk_t raw_blkaddr; @@ -424,6 +430,7 @@ static inline int z_erofs_map_blocks_iter(struct inode *inode, struct erofs_map_dev { struct erofs_fscache *m_fscache; struct block_device *m_bdev; + struct file *m_fp; erofs_off_t m_pa; unsigned int m_deviceid; diff --git a/fs/erofs/super.c b/fs/erofs/super.c index 427f4ec5c09d33c980a60f93aa104ff75632cb70..fb58fa689302be1aaea1de948dc69fc830ff80c8 100644 --- a/fs/erofs/super.c +++ b/fs/erofs/super.c @@ -6,6 +6,7 @@ */ #include #include +#include #include #include #include @@ -156,6 +157,16 @@ static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb, if (IS_ERR(fscache)) return PTR_ERR(fscache); dif->fscache = fscache; + } else if (sbi->blob_dir_path) { + struct file *f; + + f = file_open_root(sbi->blob_dir.dentry, sbi->blob_dir.mnt, + dif->path, O_RDONLY | O_LARGEFILE, 0); + if (IS_ERR(f)) { + erofs_err(sb, "failed to open blob id %s", dif->path); + return PTR_ERR(f); + } + dif->blobfile = f; } else if (!sbi->devs->flatdev) { bdev = blkdev_get_by_path(dif->path, FMODE_READ | FMODE_EXCL, sb->s_type); @@ -327,7 +338,8 @@ enum { Opt_device, Opt_fsid, Opt_domain_id, - Opt_opt_creds, + Opt_bootstrap_path, + Opt_blob_dir_path, Opt_err }; @@ -339,14 +351,15 @@ static const struct constant_table erofs_param_cache_strategy[] = { }; static const struct fs_parameter_spec erofs_fs_parameters[] = { - fsparam_flag_no("user_xattr", Opt_user_xattr), - fsparam_flag_no("acl", Opt_acl), - fsparam_enum("cache_strategy", Opt_cache_strategy, + fsparam_flag_no("user_xattr", Opt_user_xattr), + fsparam_flag_no("acl", Opt_acl), + fsparam_enum("cache_strategy", Opt_cache_strategy, erofs_param_cache_strategy), - fsparam_string("device", Opt_device), - fsparam_string("fsid", Opt_fsid), - fsparam_string("domain_id", Opt_domain_id), - fsparam_string("opt_creds", Opt_opt_creds), + fsparam_string("device", Opt_device), + fsparam_string("fsid", Opt_fsid), + fsparam_string("domain_id", Opt_domain_id), + fsparam_string("bootstrap_path", Opt_bootstrap_path), + fsparam_string("blob_dir_path", Opt_blob_dir_path), {} }; @@ -430,19 +443,32 @@ static int erofs_fc_parse_param(struct fs_context *fc, errorfc(fc, "domain_id option not supported"); #endif break; - case Opt_opt_creds: - if (!strcmp(param->string, "on")) { - set_opt(&ctx->opt, OPT_CREDS); - } else if (!strcmp(param->string, "off")) { - clear_opt(&ctx->opt, OPT_CREDS); - } else { - errorfc(fc, "invalid mount option, using 'opt_creds=[on|off]'"); - return -EINVAL; - } + case Opt_bootstrap_path: + kfree(ctx->bootstrap_path); + ctx->bootstrap_path = kstrdup(param->string, GFP_KERNEL); + if (!ctx->bootstrap_path) + return -ENOMEM; + break; + case Opt_blob_dir_path: + kfree(ctx->blob_dir_path); + ctx->blob_dir_path = kstrdup(param->string, GFP_KERNEL); + if (!ctx->blob_dir_path) + return -ENOMEM; break; default: return -ENOPARAM; } + + if (ctx->blob_dir_path && !ctx->bootstrap_path) { + errorfc(fc, "bootstrap_path required in RAFS mode"); + return -EINVAL; + } + + if 
(ctx->bootstrap_path && ctx->fsid) { + errorfc(fc, "fscache/RAFS modes are mutually exclusive"); + return -EINVAL; + } + return 0; } @@ -551,6 +577,38 @@ static int erofs_fc_fill_pseudo_super(struct super_block *sb, struct fs_context return simple_fill_super(sb, EROFS_SUPER_MAGIC, &empty_descr); } +static int rafs_v6_fill_super(struct super_block *sb) +{ + struct erofs_sb_info *sbi = EROFS_SB(sb); + + if (sbi->bootstrap_path) { + struct file *f; + + f = filp_open(sbi->bootstrap_path, O_RDONLY | O_LARGEFILE, 0); + if (IS_ERR(f)) + return PTR_ERR(f); + if (!S_ISREG(f->f_inode->i_mode)) { + erofs_err(sb, "bootstrap_path %s must be a regular file", + sbi->bootstrap_path); + filp_close(f, NULL); + return -EINVAL; + } + sbi->bootstrap = f; + } + if (sbi->blob_dir_path) { + int ret = kern_path(sbi->blob_dir_path, + LOOKUP_FOLLOW | LOOKUP_DIRECTORY, + &sbi->blob_dir); + + if (ret) { + kfree(sbi->blob_dir_path); + sbi->blob_dir_path = NULL; + return ret; + } + } + return 0; +} + static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc) { struct inode *inode; @@ -575,6 +632,10 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc) ctx->fsid = NULL; sbi->domain_id = ctx->domain_id; ctx->domain_id = NULL; + sbi->bootstrap_path = ctx->bootstrap_path; + ctx->bootstrap_path = NULL; + sbi->blob_dir_path = ctx->blob_dir_path; + ctx->blob_dir_path = NULL; sbi->blkszbits = PAGE_SHIFT; if (erofs_is_fscache_mode(sb)) { @@ -604,12 +665,19 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc) errorfc(fc, "unsupported blksize for fscache mode"); return -EINVAL; } - if (!sb_set_blocksize(sb, 1 << sbi->blkszbits)) { + if (sb->s_bdev && !sb_set_blocksize(sb, 1 << sbi->blkszbits)) { errorfc(fc, "failed to set erofs blksize"); return -EINVAL; + } else { + sb->s_blocksize = 1 << sbi->blkszbits; + sb->s_blocksize_bits = sbi->blkszbits; } } + err = rafs_v6_fill_super(sb); + if (err) + return err; + sb->s_time_gran = 1; sb->s_xattr = erofs_xattr_handlers; sb->s_export_op = &erofs_export_ops; @@ -619,13 +687,6 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc) else sb->s_flags &= ~SB_POSIXACL; - if (test_opt(&sbi->opt, OPT_CREDS)) - sb->s_iflags |= SB_I_OVL_OPT_CREDS; - else - sb->s_iflags &= ~SB_I_OVL_OPT_CREDS; - - sb->s_iflags |= SB_I_OVL_OPT_ACL_RCU; - #ifdef CONFIG_EROFS_FS_ZIP xa_init(&sbi->managed_pslots); #endif @@ -668,6 +729,9 @@ static int erofs_fc_get_tree(struct fs_context *fc) if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && ctx->fsid) return get_tree_nodev(fc, erofs_fc_fill_super); + if (ctx->bootstrap_path && ctx->blob_dir_path) + return get_tree_nodev(fc, erofs_fc_fill_super); + return get_tree_bdev(fc, erofs_fc_fill_super); } @@ -687,11 +751,6 @@ static int erofs_fc_reconfigure(struct fs_context *fc) else fc->sb_flags &= ~SB_POSIXACL; - if (test_opt(&ctx->opt, OPT_CREDS)) - sb->s_iflags |= SB_I_OVL_OPT_CREDS; - else - sb->s_iflags &= ~SB_I_OVL_OPT_CREDS; - sbi->opt = ctx->opt; fc->sb_flags |= SB_RDONLY; @@ -704,6 +763,8 @@ static int erofs_release_device_info(int id, void *ptr, void *data) if (dif->bdev) blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL); + if (dif->blobfile) + filp_close(dif->blobfile, NULL); erofs_fscache_unregister_cookie(dif->fscache); dif->fscache = NULL; kfree(dif->path); @@ -784,15 +845,21 @@ static void erofs_kill_sb(struct super_block *sb) return; } - if (erofs_is_fscache_mode(sb)) - kill_anon_super(sb); - else + if (sb->s_bdev) kill_block_super(sb); + else + kill_anon_super(sb); sbi = EROFS_SB(sb); if (!sbi) 
return; erofs_free_dev_context(sbi->devs); + if (sbi->bootstrap) + filp_close(sbi->bootstrap, NULL); + if (sbi->blob_dir_path) + path_put(&sbi->blob_dir); + kfree(sbi->bootstrap_path); + kfree(sbi->blob_dir_path); erofs_fscache_unregister_fs(sb); kfree(sbi->fsid); kfree(sbi->domain_id); @@ -889,7 +956,7 @@ static int erofs_statfs(struct dentry *dentry, struct kstatfs *buf) struct erofs_sb_info *sbi = EROFS_SB(sb); u64 id = 0; - if (!erofs_is_fscache_mode(sb)) + if (sb->s_bdev) id = huge_encode_dev(sb->s_bdev->bd_dev); buf->f_type = sb->s_magic; @@ -937,11 +1004,6 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root) if (sbi->domain_id) seq_printf(seq, ",domain_id=%s", sbi->domain_id); #endif - if (test_opt(opt, OPT_CREDS)) - seq_puts(seq, ",opt_creds=on"); - else - seq_puts(seq, ",opt_creds=off"); - return 0; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index fa4ac595d253e850b668c35ec82d905dbe98a007..c6809253ed3729a5604913c56ed3bf8ae93b4d5f 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2876,11 +2876,9 @@ static __le16 ext4_group_desc_csum(struct super_block *sb, __u32 block_group, crc = crc16(crc, (__u8 *)gdp, offset); offset += sizeof(gdp->bg_checksum); /* skip checksum */ /* for checksum of struct ext4_group_desc do the rest...*/ - if (ext4_has_feature_64bit(sb) && - offset < le16_to_cpu(sbi->s_es->s_desc_size)) + if (ext4_has_feature_64bit(sb) && offset < sbi->s_desc_size) crc = crc16(crc, (__u8 *)gdp + offset, - le16_to_cpu(sbi->s_es->s_desc_size) - - offset); + sbi->s_desc_size - offset); out: return cpu_to_le16(crc); diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 72a96cbb24a51a003bade8cb37d965c27f956b1d..b200342f32b677d94a0766e86d2d20583b583e4c 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -12,6 +12,7 @@ #include #include #include +#include /* * Default memory range size. A power of 2 so it agrees with common FUSE_INIT @@ -652,9 +653,53 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length, return 0; } +struct virtiofs_dmap_bitmap { + unsigned int magic; /* 0x01020304 */ + struct fuse_conn *fc; + refcount_t count; + unsigned long bitmap[]; +}; + +static const struct vm_operations_struct fuse_dax_vm_ops; + +static void fuse_iomap_save_private(struct vm_area_struct *vma, + struct iomap *iomap) +{ + struct fuse_dax_mapping *dmap = iomap->private; + struct virtiofs_dmap_bitmap *db = vma->vm_private_data; + unsigned int nr; + + /* check if the vma belongs to erofs */ + if (vma->vm_ops == &fuse_dax_vm_ops) + return; + + if (!dmap) + return; + + if (!db || db->magic != 0x01020304) { + WARN_ON(1); + return; + } + + WARN_ON_ONCE(dmap->window_offset % FUSE_DAX_SZ); + nr = dmap->window_offset / FUSE_DAX_SZ; + WARN_ON_ONCE(nr >= db->fc->dax->nr_ranges); + + /* (used by erofs) atomically set the bit in vma->vm_private_data's bitmap */ + if (!test_and_set_bit(nr, db->bitmap)) { + pr_debug("erofs pinned memory range window_offset=0x%llx length=0x%llx\n", + dmap->window_offset, dmap->length); + + /* increase refcnt so that erofs's vma mapping won't be reclaimed. 
*/ + WARN_ON_ONCE(refcount_read(&dmap->refcnt) < 2); + refcount_inc(&dmap->refcnt); + } +} + static const struct iomap_ops fuse_iomap_ops = { .iomap_begin = fuse_iomap_begin, .iomap_end = fuse_iomap_end, + .iomap_save_private = fuse_iomap_save_private, }; static void fuse_wait_dax_page(struct inode *inode) @@ -778,6 +822,47 @@ ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from) return ret; } +static void fuse_dax_endpfn(struct address_space *mapping, + pgoff_t index, struct iomap *iomap, int err) +{ + loff_t pos = (loff_t)index << PAGE_SHIFT; + + fuse_iomap_ops.iomap_end(mapping->host, pos, PAGE_SIZE, + err ? 0 : PAGE_SIZE, 0, iomap); +} + +static struct page *fuse_dax_startpfn(struct address_space *mapping, + pgoff_t index, struct iomap *iomap) + +{ + struct inode *inode = mapping->host; + struct iomap srcmap = { .type = IOMAP_HOLE }; + loff_t pos = (loff_t)index << PAGE_SHIFT; + pfn_t pfn; + struct page *page; + int ret; + + iomap->type = IOMAP_HOLE; + ret = fuse_iomap_ops.iomap_begin(inode, pos, PAGE_SIZE, 0, + iomap, &srcmap); + if (ret) + return ERR_PTR(ret); + + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) + return ERR_PTR(-EIO); + + ret = dax_iomap_direct_access(iomap, pos, PAGE_SIZE, NULL, &pfn); + if (ret < 0) { + fuse_dax_endpfn(mapping, index, iomap, 0); + return ERR_PTR(ret); + } + + page = pfn_t_to_page(pfn); + WARN_ON(page_ref_count(page) < 1); + get_page(page); + return page; +} + static int fuse_dax_writepages(struct address_space *mapping, struct writeback_control *wbc) { @@ -853,15 +938,88 @@ static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf) return __fuse_dax_fault(vmf, PE_SIZE_PTE, true); } +static void fuse_dax_vma_close(struct vm_area_struct *vma) +{ + struct fuse_dax_mapping *dmap; + struct virtiofs_dmap_bitmap *db = vma->vm_private_data; + struct fuse_conn_dax *fcd; + + /* check if the vma belongs erofs */ + if (vma->vm_ops == &fuse_dax_vm_ops) + return; + if (!vma->vm_file || !db || db->magic != 0x01020304) { + WARN_ON(1); + return; + } + + vma->vm_private_data = NULL; + if (!refcount_dec_and_test(&db->count)) + return; + fcd = db->fc->dax; + spin_lock(&fcd->lock); + list_for_each_entry(dmap, &fcd->busy_ranges, busy_list) { + unsigned int nr = dmap->window_offset / FUSE_DAX_SZ; + + WARN_ON_ONCE(dmap->window_offset % FUSE_DAX_SZ); + if (!test_bit(nr, db->bitmap)) + continue; + + if (refcount_dec_and_test(&dmap->refcnt)) { + /* + * refcount should not hit 0. 
This object only goes + * away when fuse connection goes away + */ + WARN_ON_ONCE(1); + } + } + spin_unlock(&fcd->lock); + kfree(db); +} + +static void fuse_dax_vma_open(struct vm_area_struct *vma) +{ + struct virtiofs_dmap_bitmap *db = vma->vm_private_data; + + /* check if the vma belongs erofs */ + if (vma->vm_ops == &fuse_dax_vm_ops) + return; + if (!vma->vm_file || !db || db->magic != 0x01020304) { + WARN_ON(1); + return; + } + refcount_inc(&db->count); + +} + static const struct vm_operations_struct fuse_dax_vm_ops = { .fault = fuse_dax_fault, .huge_fault = fuse_dax_huge_fault, .page_mkwrite = fuse_dax_page_mkwrite, .pfn_mkwrite = fuse_dax_pfn_mkwrite, + .open = fuse_dax_vma_open, + .close = fuse_dax_vma_close, }; int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma) { + if (vma->vm_ops != &fuse_dax_vm_ops && + file != vma->vm_file && + file_inode(vma->vm_file)->i_sb->s_magic == EROFS_SUPER_MAGIC_V1) { + struct fuse_conn *fc = get_fuse_conn(file_inode(file)); + unsigned int size = DIV_ROUND_UP(fc->dax->nr_ranges, + sizeof(unsigned long)); + struct virtiofs_dmap_bitmap *db = + kzalloc(struct_size(db, bitmap, size), GFP_NOFS); + + WARN_ON(vma->vm_private_data); + if (!db) + return -ENOMEM; + db->magic = 0x01020304; + db->fc = fc; + refcount_set(&db->count, 1); + vma->vm_private_data = db; + } + file_accessed(file); vma->vm_ops = &fuse_dax_vm_ops; vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; @@ -1330,6 +1488,8 @@ bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi) } static const struct address_space_operations fuse_dax_file_aops = { + .startpfn = fuse_dax_startpfn, + .endpfn = fuse_dax_endpfn, .writepages = fuse_dax_writepages, .direct_IO = noop_direct_IO, .set_page_dirty = noop_set_page_dirty, diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 0abac1ef2f321e86f84d253f942407729e5595fe..081252307694a73e09d1a0fbc2eeb360e23cb3a2 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -274,17 +274,20 @@ static void flush_bg_queue(struct fuse_conn *fc) } } -static void fuse_update_stats(struct fuse_conn *fc, int opcode, uint64_t send_time) +static void fuse_update_stats(struct fuse_conn *fc, struct fuse_req *req) { + unsigned int opcode = req->in.h.opcode; uint64_t delta_time; - delta_time = get_time_now_us() - send_time; + if (opcode < FUSE_OP_MAX) { + delta_time = get_time_now_us() - req->send_time; - atomic64_add(delta_time, &fc->stats.req_time[FUSE_SUMMARY]); - atomic64_add(delta_time, &fc->stats.req_time[opcode]); + atomic64_add(delta_time, &fc->stats.req_time[FUSE_SUMMARY]); + atomic64_add(delta_time, &fc->stats.req_time[opcode]); - atomic64_inc(&fc->stats.req_cnts[FUSE_SUMMARY]); - atomic64_inc(&fc->stats.req_cnts[opcode]); + atomic64_inc(&fc->stats.req_cnts[FUSE_SUMMARY]); + atomic64_inc(&fc->stats.req_cnts[opcode]); + } } /* @@ -317,7 +320,7 @@ void fuse_request_end(struct fuse_req *req) WARN_ON(test_bit(FR_PENDING, &req->flags)); WARN_ON(test_bit(FR_SENT, &req->flags)); if (test_bit(FR_BACKGROUND, &req->flags)) { - fuse_update_stats(fc, req->in.h.opcode, req->send_time); + fuse_update_stats(fc, req); spin_lock(&fc->bg_lock); clear_bit(FR_BACKGROUND, &req->flags); @@ -453,7 +456,7 @@ static void __fuse_request_send(struct fuse_req *req) /* Pairs with smp_wmb() in fuse_request_end() */ smp_rmb(); - fuse_update_stats(fc, req->in.h.opcode, req->send_time); + fuse_update_stats(fc, req); } } diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c index 807119ae5adf7370622c033dcd714608eea318f1..7648f64a17a82f3e89746aa8509563dcf84318ff 100644 --- a/fs/hfsplus/super.c +++ 
b/fs/hfsplus/super.c @@ -295,11 +295,11 @@ static void hfsplus_put_super(struct super_block *sb) hfsplus_sync_fs(sb, 1); } + iput(sbi->alloc_file); + iput(sbi->hidden_dir); hfs_btree_close(sbi->attr_tree); hfs_btree_close(sbi->cat_tree); hfs_btree_close(sbi->ext_tree); - iput(sbi->alloc_file); - iput(sbi->hidden_dir); kfree(sbi->s_vhdr_buf); kfree(sbi->s_backup_vhdr_buf); unload_nls(sbi->nls); diff --git a/fs/io_uring.c b/fs/io_uring.c index e4ab6d1ae75530000a21202c42913822102b8f29..6dff323e9e655aefb682665b736b0149ee1964f1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2695,7 +2695,12 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, if (!list_empty(&done)) break; - ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (req->opcode == IORING_OP_URING_CMD) { + struct io_uring_cmd *ioucmd = &req->uring_cmd; + + ret = req->file->f_op->uring_cmd_iopoll(ioucmd); + } else + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); if (ret < 0) break; @@ -4017,7 +4022,12 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2) req_set_fail_links(req); if (req->ctx->flags & IORING_SETUP_CQE32) io_req_set_cqe32_extra(req, res2, 0); - io_req_complete(req, ret); + if (req->ctx->flags & IORING_SETUP_IOPOLL) { + /* order with io_poll_complete() checking ->result */ + smp_wmb(); + WRITE_ONCE(req->iopoll_completed, 1); + } else + io_req_complete(req, ret); } EXPORT_SYMBOL_GPL(io_uring_cmd_done); @@ -4050,11 +4060,20 @@ static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) struct file *file = req->file; int ret; - if (!req->file->f_op->uring_cmd) + if ((!req->file->f_op->uring_cmd) || + (ctx->flags & IORING_SETUP_IOPOLL && + !req->file->f_op->uring_cmd_iopoll)) return -EOPNOTSUPP; if (ctx->flags & IORING_SETUP_SQE128) issue_flags |= IO_URING_F_SQE128; + if (ctx->flags & IORING_SETUP_CQE32) + issue_flags |= IO_URING_F_CQE32; + if (ctx->flags & IORING_SETUP_IOPOLL) { + issue_flags |= IO_URING_F_IOPOLL; + req->iopoll_completed = 0; + WRITE_ONCE(ioucmd->cookie, NULL); + } if (req->async_data) ioucmd->cmd = req->async_data; diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index 735ee8a7987056998195864719311a0d03aa8c31..753e0d648ab0268a637ba92a7caa44e42263d77a 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -1246,13 +1246,6 @@ nfsd4_interssc_connect(struct nl4_server *nss, struct svc_rqst *rqstp, return status; } -static void -nfsd4_interssc_disconnect(struct vfsmount *ss_mnt) -{ - nfs_do_sb_deactive(ss_mnt->mnt_sb); - mntput(ss_mnt); -} - /* * Verify COPY destination stateid. 
* @@ -1323,11 +1316,6 @@ nfsd4_cleanup_inter_ssc(struct vfsmount *ss_mnt, struct nfsd_file *src, { } -static void -nfsd4_interssc_disconnect(struct vfsmount *ss_mnt) -{ -} - static struct file *nfs42_ssc_open(struct vfsmount *ss_mnt, struct nfs_fh *src_fh, nfs4_stateid *stateid) @@ -1469,14 +1457,14 @@ static int nfsd4_do_async_copy(void *data) copy->nf_src = kzalloc(sizeof(struct nfsd_file), GFP_KERNEL); if (!copy->nf_src) { copy->nfserr = nfserr_serverfault; - nfsd4_interssc_disconnect(copy->ss_mnt); + /* ss_mnt will be unmounted by the laundromat */ goto do_callback; } copy->nf_src->nf_file = nfs42_ssc_open(copy->ss_mnt, ©->c_fh, ©->stateid); if (IS_ERR(copy->nf_src->nf_file)) { copy->nfserr = nfserr_offload_denied; - nfsd4_interssc_disconnect(copy->ss_mnt); + /* ss_mnt will be unmounted by the laundromat */ goto do_callback; } } @@ -1559,8 +1547,10 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, if (async_copy) cleanup_async_copy(async_copy); status = nfserrno(-ENOMEM); - if (!copy->cp_intra) - nfsd4_interssc_disconnect(copy->ss_mnt); + /* + * source's vfsmount of inter-copy will be unmounted + * by the laundromat + */ goto out; } diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c index d563abc3e13643535cbf9486f7ab00749767f62d..f171563e51065792949f003cecdd18434a76228e 100644 --- a/fs/ntfs/attrib.c +++ b/fs/ntfs/attrib.c @@ -592,9 +592,25 @@ static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name, a = (ATTR_RECORD*)((u8*)ctx->attr + le32_to_cpu(ctx->attr->length)); for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) { - if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + - le32_to_cpu(ctx->mrec->bytes_allocated)) + u8 *mrec_end = (u8 *)ctx->mrec + + le32_to_cpu(ctx->mrec->bytes_allocated); + u8 *name_end; + + /* check whether ATTR_RECORD wrap */ + if ((u8 *)a < (u8 *)ctx->mrec) + break; + + /* check whether Attribute Record Header is within bounds */ + if ((u8 *)a > mrec_end || + (u8 *)a + sizeof(ATTR_RECORD) > mrec_end) + break; + + /* check whether ATTR_RECORD's name is within bounds */ + name_end = (u8 *)a + le16_to_cpu(a->name_offset) + + a->name_length * sizeof(ntfschar); + if (name_end > mrec_end) break; + ctx->attr = a; if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) || a->type == AT_END)) diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c index 47383ea02dbbd07a9ce1cd042e688513e97751d2..84ec3a4054b1b449b711420b247f3ec3c95f5613 100644 --- a/fs/overlayfs/inode.c +++ b/fs/overlayfs/inode.c @@ -299,7 +299,7 @@ int ovl_permission(struct inode *inode, int mask) if (err) return err; - old_cred = ovl_override_creds_opt(inode->i_sb); + old_cred = ovl_override_creds(inode->i_sb); if (!upperinode && !special_file(realinode->i_mode) && mask & MAY_WRITE) { mask &= ~(MAY_WRITE | MAY_APPEND); @@ -307,7 +307,7 @@ int ovl_permission(struct inode *inode, int mask) mask |= MAY_READ; } err = inode_permission(realinode, mask); - ovl_revert_creds(old_cred); + revert_creds(old_cred); return err; } @@ -451,22 +451,18 @@ ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size) struct posix_acl *ovl_get_acl(struct inode *inode, int type, bool rcu) { struct inode *realinode = ovl_inode_real(inode); - struct ovl_fs *ofs = inode->i_sb->s_fs_info; const struct cred *old_cred; struct posix_acl *acl; if (!IS_ENABLED(CONFIG_FS_POSIX_ACL) || !IS_POSIXACL(realinode)) return NULL; - if (rcu) { - if (!ofs->config.opt_acl_rcu) - return ERR_PTR(-ECHILD); + if (rcu) return get_cached_acl_rcu(realinode, type); - } - old_cred = 
ovl_override_creds_opt(inode->i_sb); + old_cred = ovl_override_creds(inode->i_sb); acl = get_acl(realinode, type); - ovl_revert_creds(old_cred); + revert_creds(old_cred); return acl; } diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h index 40695dec1f1d69accc95e3cc713e641cf7d69e47..b9c42694bbe1c96b1c33b79eed891d7c5d2ccd0b 100644 --- a/fs/overlayfs/overlayfs.h +++ b/fs/overlayfs/overlayfs.h @@ -253,8 +253,6 @@ int ovl_want_write(struct dentry *dentry); void ovl_drop_write(struct dentry *dentry); struct dentry *ovl_workdir(struct dentry *dentry); const struct cred *ovl_override_creds(struct super_block *sb); -const struct cred *ovl_override_creds_opt(struct super_block *sb); -void ovl_revert_creds(const struct cred *oldcred); int ovl_can_decode_fh(struct super_block *sb); struct dentry *ovl_indexdir(struct super_block *sb); bool ovl_index_all(struct super_block *sb); diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h index 7ba8c077a31d73969d5d30de70d1fb22c9eda754..d97f866f9b29523f2ee053d457867f4820b0c72c 100644 --- a/fs/overlayfs/ovl_entry.h +++ b/fs/overlayfs/ovl_entry.h @@ -19,8 +19,6 @@ struct ovl_config { bool metacopy; bool userxattr; bool ovl_volatile; - bool opt_creds; - bool opt_acl_rcu; }; struct ovl_sb { diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c index cc1e8025706444c7e4f31f30ce82d98e9672a56c..10b7780e4bdc01e1816026347ff77dd72ff3029c 100644 --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -481,6 +481,8 @@ static int ovl_cache_update_ino(struct path *path, struct ovl_cache_entry *p) } this = lookup_one_len(p->name, dir, p->len); if (IS_ERR_OR_NULL(this) || !this->d_inode) { + /* Mark a stale entry */ + p->is_whiteout = true; if (IS_ERR(this)) { err = PTR_ERR(this); this = NULL; @@ -776,6 +778,9 @@ static int ovl_iterate(struct file *file, struct dir_context *ctx) if (err) goto out; } + } + /* ovl_cache_update_ino() sets is_whiteout on stale entry */ + if (!p->is_whiteout) { if (!dir_emit(ctx, p->name, p->len, p->ino, p->type)) break; } diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index 4dacaaf9a6886ab25443d2a1f4f2226abfa90331..31148bd5851b47485f2cd95fbb1985c710411cf9 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -1924,22 +1924,6 @@ static struct dentry *ovl_get_root(struct super_block *sb, return root; } -static bool ovl_config_opt(struct ovl_fs *ofs, unsigned int opt_mask) -{ - int i; - bool opt = true; - - /* - * Enable the optimization for container scenarios if all lowerfs are - * configured with opt_mask. The optimization is disabled by default. - * ofs->numfs must be at least 2, thus the default "true" won't be - * returned without checking any lowerfs. 
- */ - for (i = 1; opt && i < ofs->numfs; i++) - opt = ofs->fs[i].sb->s_iflags & opt_mask; - return opt; -} - static int ovl_fill_super(struct super_block *sb, void *data, int silent) { struct path upperpath = { }; @@ -2098,9 +2082,6 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags |= SB_POSIXACL; sb->s_iflags |= SB_I_SKIP_SYNC; - ofs->config.opt_creds = ovl_config_opt(ofs, SB_I_OVL_OPT_CREDS); - ofs->config.opt_acl_rcu = ovl_config_opt(ofs, SB_I_OVL_OPT_ACL_RCU); - err = -ENOMEM; root_dentry = ovl_get_root(sb, upperpath.dentry, oe); if (!root_dentry) diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c index e13bfc8fdca6738386f72c5d0a2bdacbd3d83187..f50e84a08f5cc3cae6254a2e1ef7503bebd0146e 100644 --- a/fs/overlayfs/util.c +++ b/fs/overlayfs/util.c @@ -40,21 +40,6 @@ const struct cred *ovl_override_creds(struct super_block *sb) return override_creds(ofs->creator_cred); } -const struct cred *ovl_override_creds_opt(struct super_block *sb) -{ - struct ovl_fs *ofs = sb->s_fs_info; - - if (ofs->config.opt_creds) - return NULL; - return override_creds(ofs->creator_cred); -} - -void ovl_revert_creds(const struct cred *old_cred) -{ - if (old_cred) - revert_creds(old_cred); -} - /* * Check if underlying fs supports file handles and try to determine encoding * type, in order to deduce maximum inode number used by fs. diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 6682a251aa9d53c86d3b6629c3b20b278cd2ad8d..ca75736905e7c8b0625bd2a346dccd845650fbbd 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2517,15 +2517,7 @@ xfs_alloc_fix_freelist( goto out_agbp_relse; } - /* - * Also need to fulfill freespace btree splits by reservaing more - * blocks to perform multiple allocations from a single AG and - * transaction if needed. - */ - if (args->postallocs) - need = xfs_alloc_min_freelist(mp, pag) << 1; - else - need = xfs_alloc_min_freelist(mp, pag); + need = xfs_alloc_min_freelist(mp, pag); if (!xfs_alloc_space_available(args, need, flags | XFS_ALLOC_FLAG_CHECK)) goto out_agbp_relse; @@ -2549,10 +2541,7 @@ xfs_alloc_fix_freelist( xfs_agfl_reset(tp, agbp, pag); /* If there isn't enough total space or single-extent, reject it. 
*/ - if (args->postallocs) - need = xfs_alloc_min_freelist(mp, pag) << 1; - else - need = xfs_alloc_min_freelist(mp, pag); + need = xfs_alloc_min_freelist(mp, pag); if (!xfs_alloc_space_available(args, need, flags)) goto out_agbp_relse; diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index d8b1fad457560a139bff3340c9e266d0691ab11b..6c22b12176b8b65153c6405c906d1a70876c68b4 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -73,7 +73,6 @@ typedef struct xfs_alloc_arg { int datatype; /* mask defining data type treatment */ char wasdel; /* set if allocation was prev delayed */ char wasfromfl; /* set if allocation is from freelist */ - char postallocs; /* number of post-allocations */ struct xfs_owner_info oinfo; /* owner of blocks being allocated */ enum xfs_ag_resv_type resv; /* block reservation to use */ } xfs_alloc_arg_t; diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 4d258c78c3e207e21d5a1b188e6d4ee9d52f5149..74308286899769ee705dbdd01c6a73b63b466141 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3659,8 +3659,6 @@ xfs_bmap_btalloc( args.alignment = 1; args.minalignslop = 0; } - if (ap->minleft == 1) - args.postallocs = 1; args.minleft = ap->minleft; args.wasdel = ap->wasdel; args.resv = XFS_AG_RESV_NONE; diff --git a/include/acpi/actypes.h b/include/acpi/actypes.h index b7b76b9b5cc4cfd233ba0a32175c48280bff5cb0..ac42f3a45a84a3c5be527e360686c7fdc31890e0 100644 --- a/include/acpi/actypes.h +++ b/include/acpi/actypes.h @@ -720,7 +720,8 @@ typedef u32 acpi_event_type; #define ACPI_EVENT_POWER_BUTTON 2 #define ACPI_EVENT_SLEEP_BUTTON 3 #define ACPI_EVENT_RTC 4 -#define ACPI_EVENT_MAX 4 +#define ACPI_EVENT_PCIE_WAKE 5 +#define ACPI_EVENT_MAX 5 #define ACPI_NUM_FIXED_EVENTS ACPI_EVENT_MAX + 1 /* diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 93444d2fac5622d330a40b9e191302477a7d812b..834556f860ec3553cee1ab28b0cf0afae895c08e 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -336,7 +336,8 @@ int acpi_gsi_to_irq (u32 gsi, unsigned int *irq); int acpi_isa_irq_to_gsi (unsigned isa_irq, u32 *gsi); void acpi_set_irq_model(enum acpi_irq_model_id model, - struct fwnode_handle *fwnode); + struct fwnode_handle *(*)(u32)); +void acpi_set_gsi_to_irq_fallback(u32 (*)(u32)); struct irq_domain *acpi_irq_create_hierarchy(unsigned int flags, unsigned int size, diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h index 282fbf7bf9af63193ed35ef730eb666bd0348bf8..282682792c8ac8d3d496d609a2c62da7c74b9361 100644 --- a/include/linux/auxiliary_bus.h +++ b/include/linux/auxiliary_bus.h @@ -20,7 +20,7 @@ struct auxiliary_device { struct auxiliary_driver { int (*probe)(struct auxiliary_device *auxdev, const struct auxiliary_device_id *id); - int (*remove)(struct auxiliary_device *auxdev); + void (*remove)(struct auxiliary_device *auxdev); void (*shutdown)(struct auxiliary_device *auxdev); int (*suspend)(struct auxiliary_device *auxdev, pm_message_t state); int (*resume)(struct auxiliary_device *auxdev); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index aa913a32169d72723336dc6407ae4b95bac25447..446b0290caa0cc216da5168f35b38bd169374157 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -982,6 +982,7 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *, struct request *, int); extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *, struct request *, int, rq_end_io_fn *); +extern bool blk_rq_is_poll(struct request *); /* 
Helper to convert REQ_OP_XXX to its string format XXX */ extern const char *blk_op_str(unsigned int op); diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h index 632c31b6766619233cf1f83c3db6fe8e05cb1f71..57890b357f85175ac3a93be5a369c77b339aef3c 100644 --- a/include/linux/btf_ids.h +++ b/include/linux/btf_ids.h @@ -184,15 +184,4 @@ MAX_BTF_SOCK_TYPE, extern u32 btf_sock_ids[]; #endif -#if IS_ENABLED(CONFIG_SMC) -enum { -#define BTF_SMC_TYPE(name, type) name, -BTF_SMC_TYPE(BTF_SMC_TYPE_SOCK, smc_sock) -BTF_SMC_TYPE(BTF_SMC_TYPE_CONNECTION, smc_connection) -#undef BTF_SMC_TYPE -MAX_BTF_SMC_TYPE -}; -extern u32 btf_smc_ids[]; -#endif - #endif diff --git a/include/linux/dax.h b/include/linux/dax.h index 53261b3b821c140fbc33135b0d0f7e514cdee72b..1172dc74f3d73937ee53e3b8671ae447e39c7943 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -273,6 +273,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); int dax_invalidate_mapping_entry_sync(struct address_space *mapping, pgoff_t index); +int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos, + size_t size, void **kaddr, pfn_t *pfnp); s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap, struct iomap *srcmap); static inline bool dax_mapping(struct address_space *mapping) diff --git a/include/linux/efi.h b/include/linux/efi.h index 5a7aab4d729e12d7974c148ebbeadcb73bfa0166..b54686eb88d2820e6c8754a40512a2604073eb81 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -324,6 +324,9 @@ void efi_native_runtime_setup(void); #define UV_SYSTEM_TABLE_GUID EFI_GUID(0x3b13a7d4, 0x633e, 0x11dd, 0x93, 0xec, 0xda, 0x25, 0x56, 0xd8, 0x95, 0x93) #define LINUX_EFI_CRASH_GUID EFI_GUID(0xcfc8fc79, 0xbe2e, 0x4ddc, 0x97, 0xf0, 0x9f, 0x98, 0xbf, 0xe2, 0x98, 0xa0) #define LOADED_IMAGE_PROTOCOL_GUID EFI_GUID(0x5b1b31a1, 0x9562, 0x11d2, 0x8e, 0x3f, 0x00, 0xa0, 0xc9, 0x69, 0x72, 0x3b) +#define LOADED_IMAGE_DEVICE_PATH_PROTOCOL_GUID EFI_GUID(0xbc62157e, 0x3e33, 0x4fec, 0x99, 0x20, 0x2d, 0x3b, 0x36, 0xd7, 0x50, 0xdf) +#define EFI_DEVICE_PATH_PROTOCOL_GUID EFI_GUID(0x09576e91, 0x6d3f, 0x11d2, 0x8e, 0x39, 0x00, 0xa0, 0xc9, 0x69, 0x72, 0x3b) +#define EFI_DEVICE_PATH_TO_TEXT_PROTOCOL_GUID EFI_GUID(0x8b843e20, 0x8132, 0x4852, 0x90, 0xcc, 0x55, 0x1a, 0x4e, 0x4a, 0x7f, 0x1c) #define EFI_GRAPHICS_OUTPUT_PROTOCOL_GUID EFI_GUID(0x9042a9de, 0x23dc, 0x4a38, 0x96, 0xfb, 0x7a, 0xde, 0xd0, 0x80, 0x51, 0x6a) #define EFI_UGA_PROTOCOL_GUID EFI_GUID(0x982c298b, 0xf4fa, 0x41cb, 0xb8, 0x38, 0x77, 0xaa, 0x68, 0x8f, 0xb8, 0x39) #define EFI_PCI_IO_PROTOCOL_GUID EFI_GUID(0x4cf5b200, 0x68b8, 0x4ca5, 0x9e, 0xec, 0xb2, 0x3e, 0x3f, 0x50, 0x02, 0x9a) @@ -362,6 +365,7 @@ void efi_native_runtime_setup(void); #define LINUX_EFI_TPM_FINAL_LOG_GUID EFI_GUID(0x1e2ed096, 0x30e2, 0x4254, 0xbd, 0x89, 0x86, 0x3b, 0xbe, 0xf8, 0x23, 0x25) #define LINUX_EFI_MEMRESERVE_TABLE_GUID EFI_GUID(0x888eb0c6, 0x8ede, 0x4ff5, 0xa8, 0xf0, 0x9a, 0xee, 0x5c, 0xb9, 0x77, 0xc2) #define LINUX_EFI_INITRD_MEDIA_GUID EFI_GUID(0x5568e427, 0x68fc, 0x4f3d, 0xac, 0x74, 0xca, 0x55, 0x52, 0x31, 0xcc, 0x68) +#define LINUX_EFI_ZBOOT_MEDIA_GUID EFI_GUID(0xe565a30d, 0x47da, 0x4dbd, 0xb3, 0x54, 0x9b, 0xb5, 0xc8, 0x4f, 0x8b, 0xe2) #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID EFI_GUID(0xc451ed2b, 0x9694, 0x45d3, 0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89) #define LINUX_EFI_COCO_SECRET_AREA_GUID EFI_GUID(0xadf956ad, 0xe98c, 0x484c, 0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47) @@ -899,6 +903,7 @@ extern int 
efi_status_to_err(efi_status_t status); #define EFI_DEV_MEDIA_VENDOR 3 #define EFI_DEV_MEDIA_FILE 4 #define EFI_DEV_MEDIA_PROTOCOL 5 +#define EFI_DEV_MEDIA_REL_OFFSET 8 #define EFI_DEV_BIOS_BOOT 0x05 #define EFI_DEV_END_PATH 0x7F #define EFI_DEV_END_PATH2 0xFF @@ -929,12 +934,20 @@ struct efi_vendor_dev_path { u8 vendordata[]; } __packed; +struct efi_rel_offset_dev_path { + struct efi_generic_dev_path header; + u32 reserved; + u64 starting_offset; + u64 ending_offset; +} __packed; + struct efi_dev_path { union { struct efi_generic_dev_path header; struct efi_acpi_dev_path acpi; struct efi_pci_dev_path pci; struct efi_vendor_dev_path vendor; + struct efi_rel_offset_dev_path rel_offset; }; } __packed; @@ -1113,6 +1126,8 @@ static inline void efi_check_for_embedded_firmwares(void) { } efi_status_t efi_random_get_seed(void); +#define arch_efi_call_virt(p, f, args...) ((p)->f(args)) + void efi_retrieve_tpm2_eventlog(void); /* diff --git a/include/linux/fs.h b/include/linux/fs.h index e737a9e95474cf085b47e3251f055eeffdd321c2..193fa2bf7cae95ea9874247a85e7207fe793175c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -366,10 +366,17 @@ typedef struct { typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long); +struct iomap; + struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); + struct page *(*startpfn)(struct address_space *mapping, + pgoff_t index, struct iomap *iomap); + void (*endpfn)(struct address_space *mapping, pgoff_t index, + struct iomap *iomap, int err); + /* Write back some dirty pages from this mapping. */ int (*writepages)(struct address_space *, struct writeback_control *); @@ -1430,10 +1437,6 @@ extern int send_sigurg(struct fown_struct *fown); #define SB_I_SKIP_SYNC 0x00000100 /* Skip superblock at global sync */ -/* hint from lowerfs for overlayfs optimizations (e.g. 
for container scenarios) */ -#define SB_I_OVL_OPT_CREDS 0x40000000 /* bypass [override|revert]_creds */ -#define SB_I_OVL_OPT_ACL_RCU 0x80000000 /* enable RCU'd ACL */ - /* Possible states of 'frozen' field */ enum { SB_UNFROZEN = 0, /* FS is unfrozen */ @@ -1901,6 +1904,7 @@ struct file_operations { loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); + int (*uring_cmd_iopoll)(struct io_uring_cmd *ioucmd); bool may_pollfree; CK_KABI_RESERVE(1) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2a57e42e0ef33f179d23e9fa44c3803ae0055618..61b1cca51c892241edac7b2c1cdb5a42128a439a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -283,6 +283,66 @@ void prep_transhuge_page(struct page *page); void free_transhuge_page(struct page *page); bool is_transparent_hugepage(struct page *page); +#ifdef CONFIG_MEMCG +enum thp_reclaim_mode_item { + THP_RECLAIM_DISABLE, /* disable mode */ + THP_RECLAIM_SWAP, /* swap mode */ + THP_RECLAIM_ZSR, /* reclaim mode, use zero subpages reclaim */ + THP_RECLAIM_MEMCG, /* For global configure */ +}; + +#define HUGEPAGE_RECLAIM_STAT(stat) \ + atomic_long_t stat; \ + atomic_long_t total_##stat + +struct hugepage_reclaim { + spinlock_t reclaim_queue_lock; + struct list_head reclaim_queue; + unsigned long queue_length; + + HUGEPAGE_RECLAIM_STAT(split_hugepage); + HUGEPAGE_RECLAIM_STAT(reclaim_subpage); + HUGEPAGE_RECLAIM_STAT(split_failed); +}; + +struct thp_reclaim_ctrl { + int threshold; + int proactive; +}; + +#define THP_RECLAIM_THRESHOLD_DEFAULT 16 +#define THP_RECLAIM_PROACTIVE_SLEEP_MS 60000 +extern int global_thp_reclaim; +extern int thp_reclaim_proactive; +int tr_get_hugepage(struct hugepage_reclaim *hr_queue, struct page **reclaim, + int threshold, unsigned long time); +unsigned long tr_reclaim_hugepage(struct hugepage_reclaim *hr_queue, + struct lruvec *lruvec, struct page *page); +void __tr_reclaim_memcg(struct mem_cgroup *memcg, unsigned long time, + unsigned int scan, bool proactive); + +static inline void tr_reclaim_memcg(struct mem_cgroup *memcg) +{ + __tr_reclaim_memcg(memcg, 0, 0, false); +} + +static inline struct list_head *hugepage_reclaim_list(struct page *page) +{ + return &page[3].hugepage_reclaim_list; +} + +static inline unsigned long tr_hugepage_time(struct page *page) +{ + return page[3].list_time; +} + +static inline void tr_set_hugepage_time(struct page *page, + unsigned long time) +{ + page[3].list_time = time; +} +#endif + bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 55d4701f754b5e29724d15b15c3b3268f33592d0..faea89e12878c610ad70cc4bd00192df8579390a 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -55,13 +55,19 @@ enum io_uring_cmd_flags { /* ctx state flags, for URING_CMD */ IO_URING_F_SQE128 = 4, + IO_URING_F_CQE32 = 8, + IO_URING_F_IOPOLL = 16, }; struct io_uring_cmd { struct file *file; const void *cmd; - /* callback to defer completions to task context */ - void (*task_work_cb)(struct io_uring_cmd *cmd); + union { + /* callback to defer completions to task context */ + void (*task_work_cb)(struct io_uring_cmd *cmd); + /* used for polled completion */ + void *cookie; + }; u32 cmd_op; u32 pad; u8 pdu[32]; /* available inline for free use */ diff --git 
a/include/linux/iomap.h b/include/linux/iomap.h index e5f692dec41d260fcdac2aa23d947548af37483f..4cd5a30afdf739bcacc9bedfd23940f81a262f9f 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -159,6 +159,10 @@ struct iomap_ops { */ int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length, ssize_t written, unsigned flags, struct iomap *iomap); + + /* Save an object in vma->private_data */ + void (*iomap_save_private)(struct vm_area_struct *vma, + struct iomap *iomap); }; /* diff --git a/include/linux/ism.h b/include/linux/ism.h new file mode 100644 index 0000000000000000000000000000000000000000..ea2bcdae7401235b2ea9050da240b259440a39c5 --- /dev/null +++ b/include/linux/ism.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Internal Shared Memory + * + * Definitions for the ISM module + * + * Copyright IBM Corp. 2022 + */ +#ifndef _ISM_H +#define _ISM_H + +#include + +struct ism_dmb { + u64 dmb_tok; + u64 rgid; + u32 dmb_len; + u32 sba_idx; + u32 vlan_valid; + u32 vlan_id; + void *cpu_addr; + dma_addr_t dma_addr; +}; + +/* Unless we gain unexpected popularity, this limit should hold for a while */ +#define MAX_CLIENTS 8 +#define ISM_NR_DMBS 1920 + +struct ism_dev { + spinlock_t lock; /* protects the ism device */ + struct list_head list; + struct pci_dev *pdev; + + struct ism_sba *sba; + dma_addr_t sba_dma_addr; + DECLARE_BITMAP(sba_bitmap, ISM_NR_DMBS); + u8 *sba_client_arr; /* entries are indices into 'clients' array */ + void *priv[MAX_CLIENTS]; + + struct ism_eq *ieq; + dma_addr_t ieq_dma_addr; + + struct device dev; + u64 local_gid; + int ieq_idx; + + atomic_t free_clients_cnt; + atomic_t add_dev_cnt; + wait_queue_head_t waitq; +}; + +struct ism_event { + u32 type; + u32 code; + u64 tok; + u64 time; + u64 info; +}; + +struct ism_client { + const char *name; + void (*add)(struct ism_dev *dev); + void (*remove)(struct ism_dev *dev); + void (*handle_event)(struct ism_dev *dev, struct ism_event *event); + /* Parameter dmbemask contains a bit vector with updated DMBEs, if sent + * via ism_move_data(). Callback function must handle all active bits + * indicated by dmbemask. + */ + void (*handle_irq)(struct ism_dev *dev, unsigned int bit, u16 dmbemask); + /* Private area - don't touch! 
*/ + struct work_struct remove_work; + struct work_struct add_work; + struct ism_dev *tgt_ism; + u8 id; +}; + +int ism_register_client(struct ism_client *client); +int ism_unregister_client(struct ism_client *client); +static inline void *ism_get_priv(struct ism_dev *dev, + struct ism_client *client) { + return dev->priv[client->id]; +} + +static inline void ism_set_priv(struct ism_dev *dev, struct ism_client *client, + void *priv) { + dev->priv[client->id] = priv; +} + +int ism_register_dmb(struct ism_dev *dev, struct ism_dmb *dmb, + struct ism_client *client); +int ism_unregister_dmb(struct ism_dev *dev, struct ism_dmb *dmb); +int ism_move(struct ism_dev *dev, u64 dmb_tok, unsigned int idx, bool sf, + unsigned int offset, void *data, unsigned int size); +u8 *ism_get_seid(void); + +const struct smcd_ops *ism_get_smcd_ops(void); + +#endif /* _ISM_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d9299cbe435dce60179972ebb91c39d94912e787..3bc53ef6797f5f3450e67b49010a1aee1de13976 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -179,6 +179,9 @@ struct mem_cgroup_per_node { bool on_tree; struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct hugepage_reclaim hugepage_reclaim_queue; +#endif CK_KABI_RESERVE(1) CK_KABI_RESERVE(2) @@ -444,6 +447,8 @@ struct mem_cgroup { #ifdef CONFIG_TRANSPARENT_HUGEPAGE struct deferred_split deferred_split_queue; + int thp_reclaim; + struct thp_reclaim_ctrl tr_ctrl; #endif #ifdef CONFIG_MEMSLI @@ -1206,6 +1211,26 @@ void memcg_check_wmark_min_adj(struct task_struct *curr, void memcg_meminfo(struct mem_cgroup *memcg, struct sysinfo *info, struct sysinfo_ext *ext); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +void tr_add_hugepage(struct page *page); +void tr_del_hugepage(struct page *page); +static inline int tr_get_reclaim_mode(struct mem_cgroup *memcg) +{ + int reclaim = READ_ONCE(global_thp_reclaim); + + return (reclaim != THP_RECLAIM_MEMCG) ? 
reclaim : + READ_ONCE(memcg->thp_reclaim); +} +#else +static inline void tr_add_hugepage(struct page *page) +{ +} + +static inline void tr_del_hugepage(struct page *page) +{ +} +#endif + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h index cf824366a7d1bec48e08dc1aa22b6c06862638bb..63349aeb8bf8fa073c25f9ffe2601c86d8ef7f44 100644 --- a/include/linux/mlx5/device.h +++ b/include/linux/mlx5/device.h @@ -346,6 +346,7 @@ enum mlx5_event { MLX5_EVENT_TYPE_NIC_VPORT_CHANGE = 0xd, MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED = 0xe, + MLX5_EVENT_TYPE_VHCA_STATE_CHANGE = 0xf, MLX5_EVENT_TYPE_DCT_DRAINED = 0x1c, MLX5_EVENT_TYPE_DCT_KEY_VIOLATION = 0x1d, @@ -717,6 +718,11 @@ struct mlx5_eqe_sync_fw_update { u8 sync_rst_state; }; +struct mlx5_eqe_vhca_state { + __be16 ec_function; + __be16 function_id; +} __packed; + union ev_data { __be32 raw[7]; struct mlx5_eqe_cmd cmd; @@ -736,6 +742,7 @@ union ev_data { struct mlx5_eqe_temp_warning temp_warning; struct mlx5_eqe_xrq_err xrq_err; struct mlx5_eqe_sync_fw_update sync_fw_update; + struct mlx5_eqe_vhca_state vhca_state; } __packed; struct mlx5_eqe { diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index ae88362216a4e1a32078dbddaf9a1410e4c0072d..d9d08ed18ac7914d0052acf80a20a83891e0ef19 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -56,6 +57,8 @@ #include #include +#define MLX5_ADEV_NAME "mlx5_core" + enum { MLX5_BOARD_ID_LEN = 64, }; @@ -190,7 +193,8 @@ enum port_state_policy { enum mlx5_coredev_type { MLX5_COREDEV_PF, - MLX5_COREDEV_VF + MLX5_COREDEV_VF, + MLX5_COREDEV_SF, }; struct mlx5_field_desc { @@ -504,6 +508,10 @@ struct mlx5_devcom; struct mlx5_fw_reset; struct mlx5_eq_table; struct mlx5_irq_table; +struct mlx5_vhca_state_notifier; +struct mlx5_sf_dev_table; +struct mlx5_sf_hw_table; +struct mlx5_sf_table; struct mlx5_rate_limit { u32 rate; @@ -534,6 +542,17 @@ struct mlx5_core_roce { struct mlx5_flow_handle *allow_rule; }; +enum { + MLX5_PRIV_FLAGS_DISABLE_IB_ADEV = 1 << 0, + MLX5_PRIV_FLAGS_DISABLE_ALL_ADEV = 1 << 1, +}; + +struct mlx5_adev { + struct auxiliary_device adev; + struct mlx5_core_dev *mdev; + int idx; +}; + struct mlx5_priv { /* IRQ table valid only for real pci devices PF or VF */ struct mlx5_irq_table *irq_table; @@ -571,6 +590,8 @@ struct mlx5_priv { struct list_head dev_list; struct list_head ctx_list; spinlock_t ctx_lock; + struct mlx5_adev **adev; + int adev_idx; struct mlx5_events *events; struct mlx5_flow_steering *steering; @@ -578,6 +599,7 @@ struct mlx5_priv { struct mlx5_eswitch *eswitch; struct mlx5_core_sriov sriov; struct mlx5_lag *lag; + u32 flags; struct mlx5_devcom *devcom; struct mlx5_fw_reset *fw_reset; struct mlx5_core_roce roce; @@ -586,6 +608,15 @@ struct mlx5_priv { struct mlx5_bfreg_data bfregs; struct mlx5_uars_page *uar; +#ifdef CONFIG_MLX5_SF + struct mlx5_vhca_state_notifier *vhca_state_notifier; + struct mlx5_sf_dev_table *sf_dev_table; + struct mlx5_core_dev *parent_mdev; +#endif +#ifdef CONFIG_MLX5_SF_MANAGER + struct mlx5_sf_hw_table *sf_hw_table; + struct mlx5_sf_table *sf_table; +#endif }; enum mlx5_device_state { @@ -694,6 +725,7 @@ struct mlx5_core_dev { enum mlx5_device_state state; /* sync interface state */ struct mutex intf_state_mutex; + struct lock_class_key lock_key; unsigned long intf_state; struct mlx5_priv priv; struct mlx5_profile *profile; @@ -1059,9 +1091,14 @@ enum { }; enum { - 
MLX5_INTERFACE_PROTOCOL_IB = 0, - MLX5_INTERFACE_PROTOCOL_ETH = 1, - MLX5_INTERFACE_PROTOCOL_VDPA = 2, + MLX5_INTERFACE_PROTOCOL_ETH_REP, + MLX5_INTERFACE_PROTOCOL_ETH, + + MLX5_INTERFACE_PROTOCOL_IB_REP, + MLX5_INTERFACE_PROTOCOL_MPIB, + MLX5_INTERFACE_PROTOCOL_IB, + + MLX5_INTERFACE_PROTOCOL_VDPA, }; struct mlx5_interface { diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 6ca97729b54a4952e907c029641cf1d9ce3aa78c..e7beff61a41bdc368083deee6824b3a528bfb612 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -299,6 +299,8 @@ enum { MLX5_CMD_OP_CREATE_UMEM = 0xa08, MLX5_CMD_OP_DESTROY_UMEM = 0xa0a, MLX5_CMD_OP_SYNC_STEERING = 0xb00, + MLX5_CMD_OP_QUERY_VHCA_STATE = 0xb0d, + MLX5_CMD_OP_MODIFY_VHCA_STATE = 0xb0e, MLX5_CMD_OP_MAX }; @@ -1232,7 +1234,15 @@ enum { }; struct mlx5_ifc_cmd_hca_cap_bits { - u8 reserved_at_0[0x30]; + u8 reserved_at_0[0x20]; + + u8 reserved_at_20[0x3]; + u8 event_on_vhca_state_teardown_request[0x1]; + u8 event_on_vhca_state_in_use[0x1]; + u8 event_on_vhca_state_active[0x1]; + u8 event_on_vhca_state_allocated[0x1]; + u8 event_on_vhca_state_invalid[0x1]; + u8 reserved_at_28[0x8]; u8 vhca_id[0x10]; u8 reserved_at_40[0x40]; @@ -1520,7 +1530,8 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 disable_local_lb_uc[0x1]; u8 disable_local_lb_mc[0x1]; u8 log_min_hairpin_wq_data_sz[0x5]; - u8 reserved_at_3e8[0x3]; + u8 reserved_at_3e8[0x2]; + u8 vhca_state[0x1]; u8 log_max_vlan_list[0x5]; u8 reserved_at_3f0[0x3]; u8 log_max_current_mc_list[0x5]; @@ -1590,7 +1601,7 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 max_num_of_monitor_counters[0x10]; u8 num_ppcnt_monitor_counters[0x10]; - u8 reserved_at_640[0x10]; + u8 max_num_sf[0x10]; u8 num_q_monitor_counters[0x10]; u8 reserved_at_660[0x20]; diff --git a/include/linux/mm.h b/include/linux/mm.h index 5d5e9ef8eca50c2efb8a70c338bb9367cf6f17a4..7992e42fd3c47a56bdceca18786afb72e0aeb13b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3149,6 +3149,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void *, size_t *, void drop_slab(void); void drop_slab_node(int nid); +unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 444f90f016874bdc1f2e6e8db94431d0b794de34..b28fc766fb0053de2b6af846a521594524cb3946 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -150,6 +150,15 @@ struct page { /* For both global and memcg */ struct list_head deferred_list; }; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct { /* Third tail page of compound page */ + unsigned long _compound_pad_3; /* compound_head */ + /* The time added in hugepage reclaim list. 
*/ + unsigned long list_time; + /* For zero subpage reclaim */ + struct list_head hugepage_reclaim_list; + }; +#endif struct { /* Page table pages */ unsigned long _pt_pad_1; /* compound_head */ pgtable_t pmd_huge_pte; /* protected by page->ptl */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 8d2ce9e03e68cc14a50c6f2ac7eb20d48e9a3200..967da4afffb3dd3275a67be4f300da2d73dbf690 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -519,8 +519,10 @@ PAGEFLAG_FALSE(Mlocked) __CLEARPAGEFLAG_NOOP(Mlocked) #ifdef CONFIG_ARCH_USES_PG_UNCACHED PAGEFLAG(Uncached, uncached, PF_NO_COMPOUND) +#define __PG_UNCACHED (1UL << PG_uncached) #else PAGEFLAG_FALSE(Uncached) +#define __PG_UNCACHED 0 #endif #ifdef CONFIG_MEMORY_FAILURE diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index d37ea650bec74de2965f63fbb9429cfaf3ba6111..4ba8c07235367ccb0cbeac170ed267ab56df43ee 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -2602,6 +2602,7 @@ #define PCI_VENDOR_ID_ZHAOXIN 0x1d17 #define PCI_VENDOR_ID_HYGON 0x1d94 +#define PCI_DEVICE_ID_HYGON_18H_M05H_DF_F3 0x14b3 #define PCI_VENDOR_ID_HXT 0x1dbf diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 8c3313287d37b379d1a71f94e3ca21bf24ab5dd2..c8b5633a49c618389386ace8621f27e773bbf99d 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1426,6 +1426,7 @@ perf_event_addr_filters(struct perf_event *event) } extern void perf_event_addr_filters_sync(struct perf_event *event); +extern void perf_report_aux_output_id(struct perf_event *event, u64 hw_id); extern int perf_output_begin(struct perf_output_handle *handle, struct perf_sample_data *data, diff --git a/include/net/devlink.h b/include/net/devlink.h index edcc6d6ae529442c60b1b07a11d927f7e3ed0d4c..e0a06911bce4726759999f4537395add126f7222 100644 --- a/include/net/devlink.h +++ b/include/net/devlink.h @@ -97,6 +97,18 @@ struct devlink_port_pci_vf_attrs { u8 external:1; }; +/** + * struct devlink_port_pci_sf_attrs - devlink port's PCI SF attributes + * @controller: Associated controller number + * @sf: Associated PCI SF of the PCI PF for this port. + * @pf: Associated PCI PF number for this port. 
+ */ +struct devlink_port_pci_sf_attrs { + u32 controller; + u32 sf; + u16 pf; +}; + /** * struct devlink_port_attrs - devlink port object * @flavour: flavour of the port @@ -107,6 +119,7 @@ struct devlink_port_pci_vf_attrs { * @phys: physical port attributes * @pci_pf: PCI PF port attributes * @pci_vf: PCI VF port attributes + * @pci_sf: PCI SF port attributes */ struct devlink_port_attrs { u8 split:1, @@ -118,6 +131,7 @@ struct devlink_port_attrs { struct devlink_port_phys_attrs phys; struct devlink_port_pci_pf_attrs pci_pf; struct devlink_port_pci_vf_attrs pci_vf; + struct devlink_port_pci_sf_attrs pci_sf; }; }; @@ -142,6 +156,17 @@ struct devlink_port { struct mutex reporters_lock; /* Protects reporter_list */ }; +struct devlink_port_new_attrs { + enum devlink_port_flavour flavour; + unsigned int port_index; + u32 controller; + u32 sfnum; + u16 pfnum; + u8 port_index_valid:1, + controller_valid:1, + sfnum_valid:1; +}; + struct devlink_sb_pool_info { enum devlink_sb_pool_type pool_type; u32 size; @@ -1350,6 +1375,79 @@ struct devlink_ops { int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port, const u8 *hw_addr, int hw_addr_len, struct netlink_ext_ack *extack); + /** + * port_new() - Add a new port function of a specified flavor + * @devlink: Devlink instance + * @attrs: attributes of the new port + * @extack: extack for reporting error messages + * @new_port_index: index of the new port + * + * Devlink core will call this device driver function upon user request + * to create a new port function of a specified flavor and optional + * attributes + * + * Notes: + * - Called without devlink instance lock being held. Drivers must + * implement own means of synchronization + * - On success, drivers must register a port with devlink core + * + * Return: 0 on success, negative value otherwise. + */ + int (*port_new)(struct devlink *devlink, + const struct devlink_port_new_attrs *attrs, + struct netlink_ext_ack *extack, + unsigned int *new_port_index); + /** + * port_del() - Delete a port function + * @devlink: Devlink instance + * @port_index: port function index to delete + * @extack: extack for reporting error messages + * + * Devlink core will call this device driver function upon user request + * to delete a previously created port function + * + * Notes: + * - Called without devlink instance lock being held. Drivers must + * implement own means of synchronization + * - On success, drivers must unregister the corresponding devlink + * port + * + * Return: 0 on success, negative value otherwise. + */ + int (*port_del)(struct devlink *devlink, unsigned int port_index, + struct netlink_ext_ack *extack); + /** + * port_fn_state_get() - Get the state of a port function + * @devlink: Devlink instance + * @port: The devlink port + * @state: Admin configured state + * @opstate: Current operational state + * @extack: extack for reporting error messages + * + * Reports the admin and operational state of a devlink port function + * + * Return: 0 on success, negative value otherwise. 
+ */ + int (*port_fn_state_get)(struct devlink *devlink, + struct devlink_port *port, + enum devlink_port_fn_state *state, + enum devlink_port_fn_opstate *opstate, + struct netlink_ext_ack *extack); + /** + * port_fn_state_set() - Set the admin state of a port function + * @devlink: Devlink instance + * @port: The devlink port + * @state: Admin state + * @extack: extack for reporting error messages + * + * Set the admin state of a devlink port function + * + * Return: 0 on success, negative value otherwise. + */ + int (*port_fn_state_set)(struct devlink *devlink, + struct devlink_port *port, + enum devlink_port_fn_state state, + struct netlink_ext_ack *extack); }; static inline void *devlink_priv(struct devlink *devlink) @@ -1406,6 +1504,8 @@ void devlink_port_attrs_pci_pf_set(struct devlink_port *devlink_port, u32 contro u16 pf, bool external); void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 controller, u16 pf, u16 vf, bool external); +void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, + u32 controller, u16 pf, u32 sf); int devlink_sb_register(struct devlink *devlink, unsigned int sb_index, u32 size, u16 ingress_pools_count, u16 egress_pools_count, u16 ingress_tc_count, diff --git a/include/net/inet_common.h b/include/net/inet_common.h index cb2818862919b0de464a31ed1e4d9a676f7a56bf..1f2e1993ace155929a7f3a79bfceec670f45a207 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -46,6 +46,7 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len, int inet_getname(struct socket *sock, struct sockaddr *uaddr, int peer); int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg); +int inet_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg); int inet_ctl_sock_create(struct sock **sk, unsigned short family, unsigned short type, unsigned char protocol, struct net *net); diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h index c19dee9b997ca8c6383d78bf6851bdfec50d8a67..f6158ebea3613f42f5e2afc4b8f958854d0adf86 100644 --- a/include/net/netns/smc.h +++ b/include/net/netns/smc.h @@ -22,10 +22,10 @@ struct netns_smc { #endif unsigned int sysctl_autocorking_size; unsigned int sysctl_smcr_buf_type; + unsigned int sysctl_vendor_exp_options; int sysctl_smcr_testlink_time; int sysctl_wmem; int sysctl_rmem; int sysctl_tcp2smc; - int sysctl_allow_different_subnet; }; #endif diff --git a/include/net/smc.h b/include/net/smc.h index 08e63555210595300bb2907ae833bbdd1ac68c13..8018c3a0143a3cc3ed465a703dfad3653555c83d 100644 --- a/include/net/smc.h +++ b/include/net/smc.h @@ -12,7 +12,14 @@ #define _SMC_H #include -#include +#include +#include +#include +#include +#include +#include "linux/ism.h" + +struct sock; #ifdef ATOMIC64_INIT #define KERNEL_HAS_ATOMIC64 @@ -51,20 +58,14 @@ struct smcd_dmb { #define ISM_ERROR 0xFFFF -struct smcd_event { - u32 type; - u32 code; - u64 tok; - u64 time; - u64 info; -}; - struct smcd_dev; +struct ism_client; struct smcd_ops { int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid, u32 vid); - int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb); + int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb, + struct ism_client *client); int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb); int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id); int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id); @@ -75,15 +76,16 @@ struct smcd_ops { int (*move_data)(struct smcd_dev *dev, u64 dmb_tok, unsigned int idx, bool sf, 
unsigned int offset, void *data, unsigned int size); + int (*supports_v2)(void); u8* (*get_system_eid)(void); + u64 (*get_local_gid)(struct smcd_dev *dev); u16 (*get_chid)(struct smcd_dev *dev); + struct device* (*get_dev)(struct smcd_dev *dev); }; struct smcd_dev { const struct smcd_ops *ops; - struct device dev; void *priv; - u64 local_gid; struct list_head list; spinlock_t lock; struct smc_connection **conn; @@ -98,14 +100,6 @@ struct smcd_dev { u8 going_away : 1; }; -struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name, - const struct smcd_ops *ops, int max_dmbs); -int smcd_register_dev(struct smcd_dev *smcd); -void smcd_unregister_dev(struct smcd_dev *smcd); -void smcd_free_dev(struct smcd_dev *smcd); -void smcd_handle_event(struct smcd_dev *dev, struct smcd_event *event); -void smcd_handle_irq(struct smcd_dev *dev, unsigned int bit, u16 dmbemask); - struct smc_wr_rx_hdr { /* common prefix part of LLC and CDC to demultiplex */ union { u8 type; @@ -279,11 +273,24 @@ struct smc_connection { u8 killed : 1; /* abnormal termination */ u8 freed : 1; /* normal termiation */ u8 out_of_sync : 1; /* out of sync with peer */ + u8 unwrap_remaining : 1; /* have remaining data to + * send when RMB unwrapped + */ }; struct smc_sock { /* smc sock container */ - struct sock sk; - struct socket *clcsock; /* internal tcp socket */ + union { + struct tcp_sock tpsk; + struct sock sk; + }; + struct socket *clcsock; /* internal tcp socket */ + unsigned char smc_state; /* smc state used in smc via inet_sk */ + unsigned int isck_smc_negotiation; + struct socket accompany_socket; + struct request_sock *tail_0; + struct request_sock *tail_1; + struct request_sock *reqsk; + unsigned int queued_cnt; void (*clcsk_state_change)(struct sock *sk); /* original stat_change fct. */ void (*clcsk_data_ready)(struct sock *sk); @@ -292,6 +299,7 @@ struct smc_sock { /* smc sock container */ /* original write_space fct. */ void (*clcsk_error_report)(struct sock *sk); /* original error_report fct. 
*/ + void (*original_sk_destruct)(struct sock *sk); struct smc_connection conn; /* smc connection */ struct smc_sock *listen_smc; /* listen parent */ struct work_struct connect_work; /* handle non-blocking connect*/ @@ -301,11 +309,14 @@ struct smc_sock { /* smc sock container */ spinlock_t accept_q_lock; /* protects accept_q */ bool limit_smc_hs; /* put constraint on handshake */ bool use_fallback; /* fallback to tcp */ + bool under_presure; /* under pressure */ int fallback_rsn; /* reason for fallback */ u32 peer_diagnosis; /* decline reason from peer */ atomic_t queued_smc_hs; /* queued smc handshakes */ struct inet_connection_sock_af_ops af_ops; const struct inet_connection_sock_af_ops *ori_af_ops; + /* protocol negotiator ops */ + const struct smc_sock_negotiator_ops *negotiator_ops; /* original af ops */ int sockopt_defer_accept; /* sockopt TCP_DEFER_ACCEPT @@ -325,49 +336,54 @@ struct smc_sock { /* smc sock container */ /* non-blocking connect in * flight */ + u8 ordered : 1; struct mutex clcsock_release_lock; /* protects clcsock of a listen * socket */ }; -#define SMC_SOCK_CLOSED_TIMING (0) - -#ifdef CONFIG_BPF_SYSCALL +#define SMC_NEGOTIATOR_NAME_MAX (16) +#define SMC_SOCK_CLOSED_TIMING (0) /* BPF struct ops for smc protocol negotiator */ struct smc_sock_negotiator_ops { - /* ret for negotiate */ - int (*negotiate)(struct smc_sock *sk); - /* info gathering timing */ - void (*collect_info)(struct sock *sk, int timing); -}; + struct list_head list; -/* Query if current sock should go with SMC protocol - * SK_PASS for yes, otherwise for no. - */ -int smc_sock_should_select_smc(const struct smc_sock *smc); + /* ops name */ + char name[16]; + /* key for name */ + u32 key; + /* init with sk */ + void (*init)(struct sock *sk); -/* At some specific points in time, - * let negotiator can perform info gathering - * on target sock. 
- */ -void smc_sock_perform_collecting_info(const struct sock *sk, int timing); + /* release with sk */ + void (*release)(struct sock *sk); -#else + /* advice for negotiate */ + int (*negotiate)(struct sock *sk); -static inline int smc_sock_should_select_smc(const struct smc_sock *smc) -{ - return SK_PASS; -} + /* info gathering timing */ + void (*collect_info)(struct sock *sk, int timing); -static inline void smc_sock_perform_collecting_info(const struct sock *sk, int timing) -{ + /* module owner */ + struct module *owner; +}; -} +int smc_sock_register_negotiator_ops(struct smc_sock_negotiator_ops *ops); +int smc_sock_update_negotiator_ops(struct smc_sock_negotiator_ops *ops, + struct smc_sock_negotiator_ops *old_ops); +void smc_sock_unregister_negotiator_ops(struct smc_sock_negotiator_ops *ops); +int smc_sock_assign_negotiator_ops(struct smc_sock *smc, const char *name); -#endif /* CONFIG_BPF_SYSCALL */ +#ifdef CONFIG_BPF_SYSCALL +void smc_sock_cleanup_negotiator_ops(struct smc_sock *smc, int in_release); +void smc_sock_clone_negotiator_ops(struct sock *parent, struct sock *child); +#else +static inline void smc_sock_cleanup_negotiator_ops(struct smc_sock *smc, int in_release) {} +static inline void smc_sock_clone_negotiator_ops(struct sock *parent, struct sock *child) {} +#endif #endif /* _SMC_H */ diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h index cf89c318f2ac90a46bd3f61664bd757e2f665d8f..f6008b2fa60fff943bdcf9d34ad28a329eb9fd39 100644 --- a/include/uapi/linux/devlink.h +++ b/include/uapi/linux/devlink.h @@ -200,6 +200,10 @@ enum devlink_port_flavour { DEVLINK_PORT_FLAVOUR_UNUSED, /* Port which exists in the switch, but * is not used in any way. */ + DEVLINK_PORT_FLAVOUR_PCI_SF, /* Represents eswitch port + * for the PCI SF. It is an internal + * port that faces the PCI SF. + */ }; enum devlink_param_cmode { @@ -529,6 +533,7 @@ enum devlink_attr { DEVLINK_ATTR_RELOAD_ACTION_INFO, /* nested */ DEVLINK_ATTR_RELOAD_ACTION_STATS, /* nested */ + DEVLINK_ATTR_PORT_PCI_SF_NUMBER, /* u32 */ /* add new attributes above here, update the policy in devlink.c */ __DEVLINK_ATTR_MAX, @@ -578,9 +583,29 @@ enum devlink_resource_unit { enum devlink_port_function_attr { DEVLINK_PORT_FUNCTION_ATTR_UNSPEC, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, /* binary */ + DEVLINK_PORT_FN_ATTR_STATE, /* u8 */ + DEVLINK_PORT_FN_ATTR_OPSTATE, /* u8 */ __DEVLINK_PORT_FUNCTION_ATTR_MAX, DEVLINK_PORT_FUNCTION_ATTR_MAX = __DEVLINK_PORT_FUNCTION_ATTR_MAX - 1 }; +enum devlink_port_fn_state { + DEVLINK_PORT_FN_STATE_INACTIVE, + DEVLINK_PORT_FN_STATE_ACTIVE, +}; + +/** + * enum devlink_port_fn_opstate - indicates operational state of the function + * @DEVLINK_PORT_FN_OPSTATE_ATTACHED: Driver is attached to the function. + * For graceful tear down of the function, after inactivation of the + * function, user should wait for operational state to turn DETACHED. + * @DEVLINK_PORT_FN_OPSTATE_DETACHED: Driver is detached from the function. + * It is safe to delete the port. 
+ */ +enum devlink_port_fn_opstate { + DEVLINK_PORT_FN_OPSTATE_DETACHED, + DEVLINK_PORT_FN_OPSTATE_ATTACHED, +}; + #endif /* _UAPI_LINUX_DEVLINK_H_ */ diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h index 40b1e51b18c9c8d0fa68a9bbf1957a6154bcb791..0a52769ca0b8ccbaa0a06245469b46a399cf97cd 100644 --- a/include/uapi/linux/in.h +++ b/include/uapi/linux/in.h @@ -80,6 +80,8 @@ enum { #define IPPROTO_RAW IPPROTO_RAW IPPROTO_MPTCP = 262, /* Multipath TCP connection */ #define IPPROTO_MPTCP IPPROTO_MPTCP + IPPROTO_SMC = 263, /* Shared Memory Communications */ +#define IPPROTO_SMC IPPROTO_SMC IPPROTO_MAX }; #endif diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h index d99b5a772698037a9a9fd51a3711a69393a166ef..2f76cba6716637baff53e167a6141b68420d75c3 100644 --- a/include/uapi/linux/nvme_ioctl.h +++ b/include/uapi/linux/nvme_ioctl.h @@ -55,7 +55,10 @@ struct nvme_passthru_cmd64 { __u64 metadata; __u64 addr; __u32 metadata_len; - __u32 data_len; + union { + __u32 data_len; /* for non-vectored io */ + __u32 vec_cnt; /* for vectored io */ + }; __u32 cdw10; __u32 cdw11; __u32 cdw12; @@ -67,6 +70,28 @@ struct nvme_passthru_cmd64 { __u64 result; }; +/* same as struct nvme_passthru_cmd64, minus the 8b result field */ +struct nvme_uring_cmd { + __u8 opcode; + __u8 flags; + __u16 rsvd1; + __u32 nsid; + __u32 cdw2; + __u32 cdw3; + __u64 metadata; + __u64 addr; + __u32 metadata_len; + __u32 data_len; + __u32 cdw10; + __u32 cdw11; + __u32 cdw12; + __u32 cdw13; + __u32 cdw14; + __u32 cdw15; + __u32 timeout_ms; + __u32 rsvd2; +}; + #define nvme_admin_cmd nvme_passthru_cmd #define NVME_IOCTL_ID _IO('N', 0x40) @@ -78,5 +103,12 @@ struct nvme_passthru_cmd64 { #define NVME_IOCTL_RESCAN _IO('N', 0x46) #define NVME_IOCTL_ADMIN64_CMD _IOWR('N', 0x47, struct nvme_passthru_cmd64) #define NVME_IOCTL_IO64_CMD _IOWR('N', 0x48, struct nvme_passthru_cmd64) +#define NVME_IOCTL_IO64_CMD_VEC _IOWR('N', 0x49, struct nvme_passthru_cmd64) + +/* io_uring async commands: */ +#define NVME_URING_CMD_IO _IOWR('N', 0x80, struct nvme_uring_cmd) +#define NVME_URING_CMD_IO_VEC _IOWR('N', 0x81, struct nvme_uring_cmd) +#define NVME_URING_CMD_ADMIN _IOWR('N', 0x82, struct nvme_uring_cmd) +#define NVME_URING_CMD_ADMIN_VEC _IOWR('N', 0x83, struct nvme_uring_cmd) #endif /* _UAPI_LINUX_NVME_IOCTL_H */ diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index e87bbe763d6a2615b5614be3111fe912cdea1dad..33f860d7f17660188b30b5a2f48d5a540147102d 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -1083,6 +1083,21 @@ enum perf_event_type { */ PERF_RECORD_TEXT_POKE = 20, + /* + * Data written to the AUX area by hardware due to aux_output, may need + * to be matched to the event by an architecture-specific hardware ID. + * This records the hardware ID, but requires sample_id to provide the + * event ID. e.g. Intel PT uses this record to disambiguate PEBS-via-PT + * records from multiple events. 
+ * + * struct { + * struct perf_event_header header; + * u64 hw_id; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_AUX_OUTPUT_HW_ID = 21, + PERF_RECORD_MAX, /* non-ABI */ }; diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h index d9b5bd6cef85b0e1bd33c8897a8877b4ca01f0a1..fa4719ce5d62a75b2cebb00bfd3acd049baf08a8 100644 --- a/include/uapi/linux/smc.h +++ b/include/uapi/linux/smc.h @@ -148,6 +148,7 @@ enum { SMC_NLA_LINK_SWC_CNT, /* u64 */ SMC_NLA_LINK_RWR_CNT, /* u64 */ SMC_NLA_LINK_RWC_CNT, /* u64 */ + SMC_NLA_LINK_WWR_CNT, /* u64 */ SMC_NLA_LINK_WWC_CNT, /* u64 */ __SMC_NLA_LINK_MAX, SMC_NLA_LINK_MAX = __SMC_NLA_LINK_MAX - 1 diff --git a/kernel/events/core.c b/kernel/events/core.c index 2115d9a3ef99715f628da3cb61c493c55ebc3e02..84b667a67f5eb01c71dc5d8adb0bc2685b12df69 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9009,6 +9009,37 @@ static void perf_log_itrace_start(struct perf_event *event) perf_output_end(&handle); } +void perf_report_aux_output_id(struct perf_event *event, u64 hw_id) +{ + struct perf_output_handle handle; + struct perf_sample_data sample; + struct perf_aux_event { + struct perf_event_header header; + u64 hw_id; + } rec; + int ret; + + if (event->parent) + event = event->parent; + + rec.header.type = PERF_RECORD_AUX_OUTPUT_HW_ID; + rec.header.misc = 0; + rec.header.size = sizeof(rec); + rec.hw_id = hw_id; + + perf_event_header__init_id(&rec.header, &sample, event); + ret = perf_output_begin(&handle, &sample, event, rec.header.size); + + if (ret) + return; + + perf_output_put(&handle, rec); + perf_event__output_id_sample(event, &handle, &sample); + + perf_output_end(&handle); +} +EXPORT_SYMBOL_GPL(perf_report_aux_output_id); + static int __perf_event_account_interrupt(struct perf_event *event, int throttle) { diff --git a/kernel/relay.c b/kernel/relay.c index b08d936d5fa75b58651ad40192d42373c5ff773c..9cae6bf2e66a2f9f06867cd2f0cb9df81ceee7bf 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -1077,7 +1077,8 @@ static size_t relay_file_read_start_pos(struct rchan_buf *buf) size_t subbuf_size = buf->chan->subbuf_size; size_t n_subbufs = buf->chan->n_subbufs; size_t consumed = buf->subbufs_consumed % n_subbufs; - size_t read_pos = consumed * subbuf_size + buf->bytes_consumed; + size_t read_pos = (consumed * subbuf_size + buf->bytes_consumed) + % (n_subbufs * subbuf_size); read_subbuf = read_pos / subbuf_size; padding = buf->padding[read_subbuf]; diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index 171eb481001d30d02e10116ab847c620259c5517..a048270e9a12f3bf9f55ec6efeb13750788b221b 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -127,7 +127,7 @@ void __init housekeeping_init(void) /* We need at least one CPU to handle housekeeping work */ WARN_ON_ONCE(cpumask_empty(housekeeping_mask)); #ifdef CONFIG_CGROUP_SCHED - if (housekeeping_flags & HK_FLAG_DOMAIN) { + if (dyn_isolcpus_ready && (housekeeping_flags & HK_FLAG_DOMAIN)) { cpumask_copy(dyn_allowed, housekeeping_mask); cpumask_copy(dyn_possible, housekeeping_mask); } @@ -367,6 +367,7 @@ static const struct proc_ops proc_dyn_isolcpus_operations = { .proc_read = seq_read, .proc_write = write_dyn_isolcpus, .proc_lseek = noop_llseek, + .proc_release = single_release, }; static int __init dyn_isolcpus_init(void) diff --git a/kernel/sys.c b/kernel/sys.c index be665d2fb9e71456c1ae60ea44d1528071c2e1e9..2d9e3dea139f81da12a2279a839dda52332320e0 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1551,6 +1551,8 @@ int do_prlimit(struct task_struct *tsk, 
unsigned int resource, if (resource >= RLIM_NLIMITS) return -EINVAL; + resource = array_index_nospec(resource, RLIM_NLIMITS); + if (new_rlim) { if (new_rlim->rlim_cur > new_rlim->rlim_max) return -EINVAL; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index af45be81946dd5ac66331e713af83132c7a8da59..ffd57f90af0fe0373f40b0ffedacfd81daa125fe 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -62,6 +62,19 @@ unsigned long transparent_hugepage_flags __read_mostly = static struct shrinker deferred_split_shrinker; +#ifdef CONFIG_MEMCG +/* + * Zero subpages reclaim for huge pages only works when memcg defined. + */ +int global_thp_reclaim = THP_RECLAIM_MEMCG; +int thp_reclaim_proactive; +unsigned int thp_reclaim_proactive_sleep_ms = THP_RECLAIM_PROACTIVE_SLEEP_MS; +unsigned int thp_reclaim_proactive_scan = 100; +static DEFINE_MUTEX(thp_reclaim_proactive_mutex); +struct delayed_work thp_reclaim_proactive_dwork; +static void thp_reclaim_proactive_func(struct work_struct *work); +#endif + static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; @@ -402,6 +415,139 @@ static struct kobj_attribute hugetext_pad_threshold_attr = hugetext_pad_threshold_store); #endif /* CONFIG_HUGETEXT */ +#ifdef CONFIG_MEMCG +static ssize_t reclaim_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int thp_reclaim = READ_ONCE(global_thp_reclaim); + + if (thp_reclaim == THP_RECLAIM_MEMCG) + return sprintf(buf, "[memcg] reclaim swap disable\n"); + else if (thp_reclaim == THP_RECLAIM_ZSR) + return sprintf(buf, "memcg [reclaim] swap disable\n"); + else if (thp_reclaim == THP_RECLAIM_SWAP) + return sprintf(buf, "memcg reclaim [swap] disable\n"); + else + return sprintf(buf, "memcg reclaim swap [disable]\n"); +} + +static ssize_t reclaim_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + if (!memcmp("memcg", buf, + min(sizeof("memcg")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_MEMCG); + else if (!memcmp("reclaim", buf, + min(sizeof("reclaim")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_ZSR); + else if (!memcmp("swap", buf, + min(sizeof("swap")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_SWAP); + else if (!memcmp("disable", buf, + min(sizeof("disable")-1, count))) + WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_DISABLE); + else + return -EINVAL; + + return count; +} + +static struct kobj_attribute reclaim_attr = + __ATTR(reclaim, 0644, reclaim_show, reclaim_store); + +static ssize_t reclaim_proactive_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", READ_ONCE(thp_reclaim_proactive)); +} + +static ssize_t reclaim_proactive_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int ret, proactive; + + ret = kstrtouint(buf, 0, &proactive); + if (ret || (proactive != 0 && proactive != 1)) + return -EINVAL; + + mutex_lock(&thp_reclaim_proactive_mutex); + if (proactive == thp_reclaim_proactive) + goto unlock; + + thp_reclaim_proactive = proactive; + if (thp_reclaim_proactive) + schedule_delayed_work(&thp_reclaim_proactive_dwork, + msecs_to_jiffies(thp_reclaim_proactive_sleep_ms)); + else + cancel_delayed_work_sync(&thp_reclaim_proactive_dwork); +unlock: + mutex_unlock(&thp_reclaim_proactive_mutex); + + return count; +} + +static struct kobj_attribute reclaim_proactive_attr = + __ATTR(reclaim_proactive, 0644, reclaim_proactive_show, + reclaim_proactive_store); 
+ +static ssize_t reclaim_proactive_sleep_ms_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", thp_reclaim_proactive_sleep_ms); +} + +static ssize_t reclaim_proactive_sleep_ms_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long msecs; + int ret; + + ret = kstrtoul(buf, 10, &msecs); + if (ret || msecs > UINT_MAX) + return -EINVAL; + + WRITE_ONCE(thp_reclaim_proactive_sleep_ms, msecs); + + return count; +} + +static struct kobj_attribute reclaim_proactive_sleep_ms_attr = + __ATTR(reclaim_proactive_sleep_ms, 0644, + reclaim_proactive_sleep_ms_show, + reclaim_proactive_sleep_ms_store); + +static ssize_t reclaim_proactive_scan_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%u\n", READ_ONCE(thp_reclaim_proactive_scan)); +} + +static ssize_t reclaim_proactive_scan_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long scan; + int ret; + + ret = kstrtoul(buf, 10, &scan); + if (ret || scan > UINT_MAX) + return -EINVAL; + + WRITE_ONCE(thp_reclaim_proactive_scan, scan); + + return count; +} + +static struct kobj_attribute reclaim_proactive_scan_attr = + __ATTR(reclaim_proactive_scan, 0644, + reclaim_proactive_scan_show, + reclaim_proactive_scan_store); +#endif + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -413,6 +559,12 @@ static struct attribute *hugepage_attr[] = { #ifdef CONFIG_HUGETEXT &hugetext_enabled_attr.attr, &hugetext_pad_threshold_attr.attr, +#endif +#ifdef CONFIG_MEMCG + &reclaim_attr.attr, + &reclaim_proactive_attr.attr, + &reclaim_proactive_sleep_ms_attr.attr, + &reclaim_proactive_scan_attr.attr, #endif NULL, }; @@ -522,6 +674,12 @@ static int __init hugepage_init(void) if (err) goto err_khugepaged; + INIT_DELAYED_WORK(&thp_reclaim_proactive_dwork, + thp_reclaim_proactive_func); + if (thp_reclaim_proactive) + schedule_delayed_work(&thp_reclaim_proactive_dwork, + msecs_to_jiffies(thp_reclaim_proactive_sleep_ms)); + return 0; err_khugepaged: unregister_shrinker(&deferred_split_shrinker); @@ -637,6 +795,9 @@ void prep_transhuge_page(struct page *page) */ INIT_LIST_HEAD(page_deferred_list(page)); +#ifdef CONFIG_MEMCG + INIT_LIST_HEAD(hugepage_reclaim_list(page)); +#endif set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); } @@ -2578,7 +2739,7 @@ static void __split_huge_page_tail(struct page *head, int tail, (1L << PG_dirty))); /* ->mapping in first tail page is compound_mapcount */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, + VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail); page_tail->mapping = head->mapping; page_tail->index = head->index + tail; @@ -2833,6 +2994,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) VM_BUG_ON_PAGE(!PageLocked(head), head); VM_BUG_ON_PAGE(!PageCompound(head), head); +#ifdef CONFIG_MEMCG + tr_del_hugepage(page); +#endif + if (PageWriteback(head)) return -EBUSY; @@ -2994,6 +3159,7 @@ void deferred_split_huge_page(struct page *page) if (memcg) memcg_set_shrinker_bit(memcg, page_to_nid(page), deferred_split_shrinker.id); + tr_del_hugepage(page); #endif } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); @@ -3190,3 +3356,407 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) update_mmu_cache_pmd(vma, address, pvmw->pmd); } #endif + +#ifdef CONFIG_MEMCG +static inline bool tr_zero_page(struct page 
*page) +{ + void *addr = kmap(page); + bool ret; + + ret = !memchr_inv(addr, 0, PAGE_SIZE) ? true : false; + kunmap(page); + + return ret; +} + +/* + * We'll split the huge page iff it contains at least 1/32 zeros, + * estimate it by checking some discrete long values. + */ +static bool tr_hugepage_estimate_zero(struct page *page, int threshold) +{ + unsigned int i, maybe_zero_pages = 0, offset = 0; + void *addr; + +#define ESTIMATE_SIZE 64U + for (i = 0; i < HPAGE_PMD_NR; i++, page++, offset++) { + if (HPAGE_PMD_NR - i + maybe_zero_pages < threshold) + return false; + + addr = kmap(page); + if (unlikely((offset + 1) * ESTIMATE_SIZE > PAGE_SIZE)) + offset = 0; + if (!memchr_inv((char *)addr + offset * ESTIMATE_SIZE, 0, + ESTIMATE_SIZE)) { + if (++maybe_zero_pages >= threshold) { + kunmap(page); + return true; + } + } + kunmap(page); + } + + return false; +} + +static bool tr_replace_zero_pte(struct page *page, struct vm_area_struct *vma, + unsigned long addr, void *zero_page) +{ + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = addr, + .flags = PVMW_SYNC | PVMW_MIGRATION, + }; + pte_t pte; + + VM_BUG_ON_PAGE(PageTail(page), page); + + while (page_vma_mapped_walk(&pvmw)) { + pte = pte_mkspecial( + pfn_pte(page_to_pfn((struct page *)zero_page), + vma->vm_page_prot)); + + /* + * We're replacing an anonymous page with a zero page, which is + * not anonymous. We need to do proper accounting, otherwise we + * will get wrong rss counters when tearing down the mm. + */ + dec_mm_counter(vma->vm_mm, MM_ANONPAGES); + set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); + + /* No need to invalidate - it was non-present before. */ + update_mmu_cache(vma, pvmw.address, pvmw.pte); + } + + return true; +} + +static void tr_replace_zero_ptes_locked(struct page *page) +{ + struct page *zero_page = ZERO_PAGE(0); + struct rmap_walk_control rwc = { + .rmap_one = tr_replace_zero_pte, + .arg = zero_page, + }; + + rmap_walk_locked(page, &rwc); +} + +static bool tr_replace_zero_page(struct page *page) +{ + struct anon_vma *anon_vma = NULL; + bool unmap_success; + bool ret = true; + + anon_vma = page_get_anon_vma(page); + if (!anon_vma) + return false; + + anon_vma_lock_write(anon_vma); + unmap_success = try_to_unmap(page, TTU_RMAP_LOCKED | TTU_MIGRATION | + TTU_SYNC); + + if (!unmap_success || !tr_zero_page(page)) { + /* remap the non-zero page */ + remove_migration_ptes(page, page, true); + ret = false; + } else + tr_replace_zero_ptes_locked(page); + + anon_vma_unlock_write(anon_vma); + put_anon_vma(anon_vma); + + return ret; +} + +/* + * tr_reclaim_zero_subpages - reclaim the zero subpages and put back the non-zero + * subpages. + * + * The non-zero subpages are put back to the keep_list, and will be put back to + * the lru list. + * + * Return the number of reclaimed zero subpages. + */ +static unsigned long tr_reclaim_zero_subpages(struct list_head *list, + struct list_head *keep_list) +{ + LIST_HEAD(zero_list); + struct page *page; + unsigned long reclaimed = 0; + + while (!list_empty(list)) { + page = lru_to_page(list); + list_del_init(&page->lru); + if (tr_zero_page(page)) { + if (!trylock_page(page)) + goto keep; + + if (!tr_replace_zero_page(page)) { + unlock_page(page); + goto keep; + } + + ClearPageActive(page); + unlock_page(page); + if (put_page_testzero(page)) { + list_add(&page->lru, &zero_list); + reclaimed++; + } + + /* someone may hold the zero page, we just skip it. 
*/ + + continue; + } +keep: + list_add(&page->lru, keep_list); + } + + mem_cgroup_uncharge_list(&zero_list); + free_unref_page_list(&zero_list); + + return reclaimed; +} + +/* Filter unsupported page flags. */ +#define THP_RECLAIM_FLAG_CHECK \ + ((1UL << PG_error) | \ + (1UL << PG_owner_priv_1) | \ + (1UL << PG_arch_1) | \ + (1UL << PG_reserved) | \ + (1UL << PG_private) | \ + (1UL << PG_private_2) | \ + (1UL << PG_writeback) | \ + (1UL << PG_swapcache) | \ + (1UL << PG_mappedtodisk) | \ + (1UL << PG_reclaim) | \ + (1UL << PG_unevictable) | \ + __PG_MLOCKED | __PG_UNCACHED | __PG_HWPOISON) + +#define hugepage_can_reclaim(page) \ + (PageAnon(page) && !PageKsm(page) && \ + !(page->flags & THP_RECLAIM_FLAG_CHECK)) + +#define hr_list_to_page(head) \ + (list_entry((head)->prev, struct page, hugepage_reclaim_list) - 3) + +static inline struct mem_cgroup *hr_queue_to_memcg(struct hugepage_reclaim *hq) +{ + return container_of(hq, struct mem_cgroup_per_node, + hugepage_reclaim_queue)->memcg; +} + +/* + * tr_get_hugepage - get one huge page from huge page reclaim queue + * + * Return -EINVAL if the queue is empty; otherwise, return 0. + * If the queue is not empty, it will check whether the tail page of the + * queue can be reclaimed or not. If the page can be reclaimed, it will + * be stored in reclaim_page; otherwise, just delete the page from the + * queue. + */ +int tr_get_hugepage(struct hugepage_reclaim *hr_queue, struct page **reclaim, + int threshold, unsigned long time) +{ + struct page *page; + unsigned long flags; + int ret = 0; + + if (!spin_trylock_irqsave(&hr_queue->reclaim_queue_lock, flags)) + return ret; + + if (list_empty(&hr_queue->reclaim_queue)) { + ret = -EINVAL; + goto unlock; + } + + page = hr_list_to_page(&hr_queue->reclaim_queue); + + if (time && tr_hugepage_time(page) > time) { + ret = -EINVAL; + goto unlock; + } + + list_del_init(hugepage_reclaim_list(page)); + hr_queue->queue_length--; + + if (!hugepage_can_reclaim(page) || !get_page_unless_zero(page)) + goto unlock; + + if (!trylock_page(page)) { + put_page(page); + goto unlock; + } + + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); + + if (hugepage_can_reclaim(page) && + tr_hugepage_estimate_zero(page, threshold) && + !isolate_lru_page(page)) { + __mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON, + HPAGE_PMD_NR); + /* dec the reference added in isolate_lru_page */ + page_ref_dec(page); + *reclaim = page; + } else { + unlock_page(page); + put_page(page); + } + + return ret; +unlock: + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); + return ret; +} + +unsigned long tr_reclaim_hugepage(struct hugepage_reclaim *hr_queue, + struct lruvec *lruvec, struct page *page) +{ + struct pglist_data *pgdat = page_pgdat(page); + struct mem_cgroup *memcg = hr_queue_to_memcg(hr_queue); + unsigned long reclaimed; + unsigned long flags; + LIST_HEAD(split_list); + LIST_HEAD(keep_list); + int nid = pgdat->node_id; + + /* + * Split the huge page and reclaim the zero subpages. + * And putback the non-zero subpages to the lru list. 
+ */ + if (split_huge_page_to_list(page, &split_list)) { + unlock_page(page); + putback_lru_page(page); + mod_node_page_state(pgdat, NR_ISOLATED_ANON, -HPAGE_PMD_NR); + atomic_long_inc(&hr_queue->split_failed); + do { + hr_queue = &memcg->nodeinfo[nid]->hugepage_reclaim_queue; + atomic_long_inc(&hr_queue->total_split_failed); + } while (!mem_cgroup_is_root(memcg) && + (memcg = parent_mem_cgroup(memcg))); + + return 0; + } + + unlock_page(page); + list_add_tail(&page->lru, &split_list); + reclaimed = tr_reclaim_zero_subpages(&split_list, &keep_list); + atomic_long_inc(&hr_queue->split_hugepage); + atomic_long_add(reclaimed, &hr_queue->reclaim_subpage); + do { + hr_queue = &memcg->nodeinfo[nid]->hugepage_reclaim_queue; + atomic_long_inc(&hr_queue->total_split_hugepage); + atomic_long_add(reclaimed, &hr_queue->total_reclaim_subpage); + } while (!mem_cgroup_is_root(memcg) && + (memcg = parent_mem_cgroup(memcg))); + + spin_lock_irqsave(&lruvec->lru_lock, flags); + move_pages_to_lru(lruvec, &keep_list); + spin_unlock_irqrestore(&lruvec->lru_lock, flags); + mod_node_page_state(pgdat, NR_ISOLATED_ANON, -HPAGE_PMD_NR); + + mem_cgroup_uncharge_list(&keep_list); + free_unref_page_list(&keep_list); + + return reclaimed; +} + +/* + * Trigger the memcg reclaim. + * The huge page allocated before @time will be handled. + * @scan is the limit numbers of huge page can be handled. + * If @scan is 0, it means there is no limit. + */ +void __tr_reclaim_memcg(struct mem_cgroup *memcg, unsigned long time, + unsigned int scan, bool proactive) +{ + struct lruvec *lruvec; + struct hugepage_reclaim *hr_queue; + int threshold, nid; + int nr_to_scan = scan; + + if (tr_get_reclaim_mode(memcg) != THP_RECLAIM_ZSR) + return; + + threshold = READ_ONCE(memcg->tr_ctrl.threshold); + for_each_online_node(nid) { + lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + hr_queue = &memcg->nodeinfo[nid]->hugepage_reclaim_queue; + for ( ; ; ) { + struct page *page = NULL; + + cond_resched(); + + if (proactive && + (!READ_ONCE(thp_reclaim_proactive) || + !READ_ONCE(memcg->tr_ctrl.proactive))) + break; + + if (scan && nr_to_scan == 0) + break; + + if (tr_get_hugepage(hr_queue, &page, threshold, time)) + break; + + if (!page) + continue; + + if (scan) + nr_to_scan--; + + tr_reclaim_hugepage(hr_queue, lruvec, page); + } + } +} + +static int __init setup_thp_reclaim(char *str) +{ + int ret = -1; + + if (!str) + goto out; + + if (!strcmp(str, "reclaim")) + global_thp_reclaim = THP_RECLAIM_ZSR; + else if (!strcmp(str, "swap")) + global_thp_reclaim = THP_RECLAIM_SWAP; + else if (!strcmp(str, "disable")) + global_thp_reclaim = THP_RECLAIM_DISABLE; + else if (!strcmp(str, "memcg")) + global_thp_reclaim = THP_RECLAIM_MEMCG; + else + goto out; + ret = 0; +out: + if (ret) + pr_warn("tr= cannot parse, ignored\n"); + return !ret; +} +__setup("tr=", setup_thp_reclaim); + +static void thp_reclaim_proactive_func(struct work_struct *work) +{ + struct mem_cgroup *memcg; + unsigned long time; + unsigned int scan; + + if (READ_ONCE(global_thp_reclaim) == THP_RECLAIM_DISABLE) + goto resched; + + scan = READ_ONCE(thp_reclaim_proactive_scan); + time = jiffies - msecs_to_jiffies(thp_reclaim_proactive_sleep_ms); + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + if (READ_ONCE(memcg->tr_ctrl.proactive)) + __tr_reclaim_memcg(memcg, time, scan, true); + + if (!READ_ONCE(thp_reclaim_proactive)) + break; + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + +resched: + schedule_delayed_work(&thp_reclaim_proactive_dwork, + 
msecs_to_jiffies(thp_reclaim_proactive_sleep_ms)); +} +#endif diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 6a69fee5380d714a89901ea035bdfada98081e99..b1396611f1b4ab5745db874489278c16a9cd54d1 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -57,6 +57,7 @@ unsigned long kfence_num_objects __read_mostly = CONFIG_KFENCE_NUM_OBJECTS; EXPORT_SYMBOL(kfence_num_objects); static unsigned long kfence_num_objects_snap __read_mostly; /* Used to record upstream ver. */ static int *kfence_node_map; /* Map real node to "virtual kfence node". */ +bool kfence_panic_on_fault; struct kfence_alloc_node_cond { long need; long allocated; @@ -213,6 +214,34 @@ static const struct kernel_param_ops order0_page_param_ops = { }; module_param_cb(order0_page, &order0_page_param_ops, NULL, 0600); +static int param_set_fault(const char *val, const struct kernel_param *kp) +{ + bool mode; + char *s = strstrip((char *)val); + + if (!strcmp(s, "report")) + mode = false; + else if (!strcmp(s, "panic")) + mode = true; + else + return -EINVAL; + + *((bool *)kp->arg) = mode; + + return 0; +} + +static int param_get_fault(char *buffer, const struct kernel_param *kp) +{ + return sprintf(buffer, "%s\n", *(bool *)kp->arg ? "panic" : "report"); +} + +static const struct kernel_param_ops fault_param_ops = { + .set = param_set_fault, + .get = param_get_fault, +}; +module_param_cb(fault, &fault_param_ops, &kfence_panic_on_fault, 0600); + /* * The pool of pages used for guard pages and objects. * Only used in booting init state. Will be cleared after that. diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h index 9ffa2d1dfebfaf86abbc3fafde8f4b7d10173daa..3605c38f7ab008379a8e1147c56736118cadfcb9 100644 --- a/mm/kfence/kfence.h +++ b/mm/kfence/kfence.h @@ -99,6 +99,7 @@ struct kfence_metadata { }; extern unsigned long kfence_num_objects; +extern bool kfence_panic_on_fault; DECLARE_STATIC_KEY_FALSE(kfence_short_canary); /* KFENCE error types for report generation. */ diff --git a/mm/kfence/report.c b/mm/kfence/report.c index 31df24779f35e4acd46274ed84d017a5290da97f..58729cbebcf53b25bc07104a5ef45c3aa1c1584f 100644 --- a/mm/kfence/report.c +++ b/mm/kfence/report.c @@ -274,8 +274,8 @@ void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *r lockdep_on(); - if (panic_on_warn) - panic("panic_on_warn set ...\n"); + if (kfence_panic_on_fault) + panic("kfence.fault=panic set ...\n"); /* We encountered a memory unsafety error, taint the kernel! */ add_taint(TAINT_BAD_PAGE, LOCKDEP_STILL_OK); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9e7276c63a82db493af5ad7c3803da354d50fe85..613c67a9feeb9873a756b772298cfef555117be0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3265,6 +3265,61 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) } #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* + * Need the page lock if the page is not a newly allocated page. + */ +void tr_add_hugepage(struct page *page) +{ + struct mem_cgroup *memcg = page_memcg(page); + struct hugepage_reclaim *hr_queue; + unsigned long flags; + + VM_BUG_ON_PAGE(PageTail(page), page); + + if (tr_get_reclaim_mode(memcg) == THP_RECLAIM_DISABLE) + return; + + /* + * We only want to add anon pages to the queue, but it is not yet known + * whether the page is anon when charging it to the memcg. + * page_mapping() returns NULL if the page is an anon page or if the mapping + * is not yet set. 
+ */ + if (!is_transparent_hugepage(page) || page_mapping(page)) + return; + + hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hugepage_reclaim_queue; + spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags); + if (list_empty(hugepage_reclaim_list(page))) { + list_add(hugepage_reclaim_list(page), &hr_queue->reclaim_queue); + tr_set_hugepage_time(page, jiffies); + hr_queue->queue_length++; + } + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); +} + +void tr_del_hugepage(struct page *page) +{ + struct page *head = compound_head(page); + struct mem_cgroup *memcg = page_memcg(head); + struct hugepage_reclaim *hr_queue; + unsigned long flags; + + if (!memcg || !is_transparent_hugepage(page)) + return; + + hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hugepage_reclaim_queue; + spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags); + if (!list_empty(hugepage_reclaim_list(head))) { + list_del_init(hugepage_reclaim_list(head)); + tr_set_hugepage_time(page, 0); + hr_queue->queue_length--; + } + spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags); +} +#endif + static void commit_charge(struct page *page, struct mem_cgroup *memcg) { VM_BUG_ON_PAGE(page->mem_cgroup, page); @@ -3277,6 +3332,8 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg) * - exclusive reference */ page->mem_cgroup = memcg; + + tr_add_hugepage(page); } #ifdef CONFIG_MEMCG_KMEM @@ -6509,6 +6566,227 @@ static int mem_cgroup_allow_pgcache_sync_write(struct cgroup_subsys_state *css, } #endif /* CONFIG_PAGECACHE_LIMIT */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static int memcg_thp_reclaim_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + int thp_reclaim = READ_ONCE(memcg->thp_reclaim); + + if (thp_reclaim == THP_RECLAIM_ZSR) + seq_puts(m, "[reclaim] swap disable\n"); + else if (memcg->thp_reclaim == THP_RECLAIM_SWAP) + seq_puts(m, "reclaim [swap] disable\n"); + else + seq_puts(m, "reclaim swap [disable]\n"); + + return 0; +} + +static ssize_t memcg_thp_reclaim_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + buf = strstrip(buf); + if (!strcmp(buf, "reclaim")) + WRITE_ONCE(memcg->thp_reclaim, + THP_RECLAIM_ZSR); + else if (!strcmp(buf, "swap")) + WRITE_ONCE(memcg->thp_reclaim, THP_RECLAIM_SWAP); + else if (!strcmp(buf, "disable")) + WRITE_ONCE(memcg->thp_reclaim, THP_RECLAIM_DISABLE); + else + return -EINVAL; + + return nbytes; +} + +static int memcg_thp_reclaim_stat_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + struct mem_cgroup_per_node *mz; + struct mem_cgroup *mi; + int node; + unsigned long val; + + seq_puts(m, "queue_length\t"); + for_each_online_node(node) { + mz = mem_cgroup_nodeinfo(memcg, node); + val = READ_ONCE(mz->hugepage_reclaim_queue.queue_length); + seq_printf(m, "%-24lu", val); + } + seq_puts(m, "\n"); + +#define memcg_stat(e) \ + do { \ + seq_puts(m, #e"\t"); \ + for_each_online_node(node) { \ + mz = mem_cgroup_nodeinfo(memcg, node); \ + val = atomic_long_read( \ + &mz->hugepage_reclaim_queue.e); \ + seq_printf(m, "%-24lu", val); \ + } \ + seq_puts(m, "\n"); \ + } while (0) + + memcg_stat(split_hugepage); + memcg_stat(reclaim_subpage); + memcg_stat(split_failed); + + seq_puts(m, "total_queue_length\t"); + val = 0; + for_each_mem_cgroup_tree(mi, memcg) { + for_each_online_node(node) { + mz = mem_cgroup_nodeinfo(mi, node); + val += READ_ONCE( + 
mz->hugepage_reclaim_queue.queue_length); + } + } + seq_printf(m, "%-24lu\n", val); + +#define memcg_total_stat(e) \ + do { \ + seq_puts(m, "total_"#e"\t"); \ + val = 0; \ + for_each_online_node(node) { \ + mz = mem_cgroup_nodeinfo(memcg, node); \ + val += atomic_long_read( \ + &mz->hugepage_reclaim_queue.total_##e); \ + } \ + seq_printf(m, "%-24lu\n", val); \ + } while (0) + + memcg_total_stat(split_hugepage); + memcg_total_stat(reclaim_subpage); + memcg_total_stat(split_failed); + + return 0; +} + +static inline char *strsep_s(char **s, const char *ct) +{ + char *p; + + while ((p = strsep(s, ct))) { + if (*p) + return p; + } + return NULL; +} + +static int memcg_thp_reclaim_ctrl_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + int thp_reclaim_threshold = READ_ONCE(memcg->tr_ctrl.threshold); + int thp_reclaim_proactive = READ_ONCE(memcg->tr_ctrl.proactive); + + seq_printf(m, "threshold\t%d\n", thp_reclaim_threshold); + seq_printf(m, "proactive\t%d\n", thp_reclaim_proactive); + + return 0; +} + +static inline int get_thp_reclaim_ctrl_value(char *buf, int *value) +{ + char *string = strsep_s(&buf, " \t\n"); + + if (!string) + return -EINVAL; + + return kstrtouint(string, 0, value); +} + +#define CTRL_RECLAIM_MEMCG 1 /* only reclaim the current memcg */ +#define CTRL_RECLAIM_ALL 2 /* reclaim current memcg and all the child memcg */ +static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int ret; + char *key; + + key = strsep_s(&buf, " \t\n"); + if (!key) + return -EINVAL; + + if (!strcmp(key, "threshold")) { + int threshold; + + ret = get_thp_reclaim_ctrl_value(buf, &threshold); + if (ret) + return ret; + + if (threshold > HPAGE_PMD_NR || threshold < 1) + return -EINVAL; + + xchg(&memcg->tr_ctrl.threshold, threshold); + } else if (!strcmp(key, "reclaim")) { + struct mem_cgroup *iter; + int mode; + + ret = get_thp_reclaim_ctrl_value(buf, &mode); + if (ret) + return ret; + + switch (mode) { + case CTRL_RECLAIM_MEMCG: + tr_reclaim_memcg(memcg); + break; + case CTRL_RECLAIM_ALL: + iter = mem_cgroup_iter(memcg, NULL, NULL); + do { + tr_reclaim_memcg(iter); + } while ((iter = mem_cgroup_iter(memcg, iter, NULL))); + break; + default: + return -EINVAL; + } + } else if (!strcmp(key, "proactive")) { + int proactive; + + ret = get_thp_reclaim_ctrl_value(buf, &proactive); + if (ret) + return ret; + + if (proactive != 0 && proactive != 1) + return -EINVAL; + + xchg(&memcg->tr_ctrl.proactive, proactive); + } else + return -EINVAL; + + return nbytes; +} +#endif + +static int thp_reclaim_proactive_memcg_init; +static int __init setup_thp_reclaim_proactive_init(char *str) +{ + int ret = 0; + unsigned int value; + + if (!str) + goto out; + + ret = kstrtouint(str, 0, &value); + if (ret) + goto out; + + if (value < 0 || value > 3) + goto out; + + if (value & 0x1) + thp_reclaim_proactive = 1; + + if (value & 0x2) + thp_reclaim_proactive_memcg_init = 1; +out: + if (ret) + pr_warn("tr.proactive= cannot parse, ignored\n"); + return !ret; +} +__setup("tr.proactive=", setup_thp_reclaim_proactive_init); + static struct cftype mem_cgroup_legacy_files[] = { { .name = "usage_in_bytes", @@ -6807,6 +7085,23 @@ static struct cftype mem_cgroup_legacy_files[] = { .write_u64 = mem_cgroup_allow_pgcache_sync_write, }, #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + { + .name = "thp_reclaim", + .seq_show = memcg_thp_reclaim_show, + .write = 
memcg_thp_reclaim_write, + }, + { + .name = "thp_reclaim_stat", + .seq_show = memcg_thp_reclaim_stat_show, + }, + { + .name = "thp_reclaim_ctrl", + .seq_show = memcg_thp_reclaim_ctrl_show, + .write = memcg_thp_reclaim_ctrl_write, + }, +#endif + { }, /* terminate */ }; @@ -6905,6 +7200,12 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) pn->on_tree = false; pn->memcg = memcg; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + spin_lock_init(&pn->hugepage_reclaim_queue.reclaim_queue_lock); + INIT_LIST_HEAD(&pn->hugepage_reclaim_queue.reclaim_queue); + pn->hugepage_reclaim_queue.queue_length = 0; +#endif + memcg->nodeinfo[node] = pn; return 0; } @@ -7038,6 +7339,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void) spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue); memcg->deferred_split_queue.split_queue_len = 0; + + memcg->thp_reclaim = THP_RECLAIM_DISABLE; + memcg->tr_ctrl.threshold = THP_RECLAIM_THRESHOLD_DEFAULT; + memcg->tr_ctrl.proactive = thp_reclaim_proactive_memcg_init; #endif kidled_memcg_init(memcg); idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); @@ -7075,6 +7380,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) /* Default gap is 0.5% max limit */ memcg->wmark_scale_factor = parent->wmark_scale_factor ? : 50; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + memcg->thp_reclaim = parent->thp_reclaim; + memcg->tr_ctrl.threshold = parent->tr_ctrl.threshold; + memcg->tr_ctrl.proactive = parent->tr_ctrl.proactive; +#endif kidled_memcg_inherit_parent_buckets(parent, memcg); memcg->reap_background = parent->reap_background; #ifdef CONFIG_DUPTEXT @@ -7461,6 +7771,8 @@ static int mem_cgroup_move_account(struct page *page, __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); } + tr_del_hugepage(page); + /* * All state has been migrated, let's switch to the new memcg. * @@ -7481,6 +7793,8 @@ static int mem_cgroup_move_account(struct page *page, page->mem_cgroup = to; + tr_add_hugepage(page); + __unlock_page_memcg(from); ret = 0; @@ -8727,6 +9041,8 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) if (!page->mem_cgroup) return; + tr_del_hugepage(page); + /* * Nobody should be changing or seriously looking at * page->mem_cgroup at this point, we have fully @@ -9157,6 +9473,8 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(oldid, page); mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); + tr_del_hugepage(page); + page->mem_cgroup = NULL; if (!mem_cgroup_is_root(memcg)) diff --git a/mm/mmap.c b/mm/mmap.c index a6390f364505f34cf6a4187a3afd425285108903..2c1248bf47dec572ee38d8c6ebb6cab5e69f5072 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2690,11 +2690,28 @@ static void unmap_region(struct mm_struct *mm, { struct vm_area_struct *next = vma_next(mm, prev); struct mmu_gather tlb; + struct vm_area_struct *cur_vma; lru_add_drain(); tlb_gather_mmu(&tlb, mm, start, end); update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end); + + /* + * Ensure we have no stale TLB entries by the time this mapping is + * removed from the rmap. + * Note that we don't have to worry about nested flushes here because + * we're holding the mm semaphore for removing the mapping - so any + * concurrent flush in this region has to be coming through the rmap, + * and we synchronize against that using the rmap lock. 
+ */ + for (cur_vma = vma; cur_vma; cur_vma = cur_vma->vm_next) { + if ((cur_vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) != 0) { + tlb_flush_mmu(&tlb); + break; + } + } + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, next ? next->vm_start : USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c48a261b2be2155530abd7987d809ac12615cebe..b7a888b73bac062dfded2dfb78983379b90ee31b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1268,6 +1268,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) * deferred_list.next -- ignore value. */ break; + case 3: + /* + * the third tail page: ->mapping is + * hugepage_reclaim_list.next -- ignore value. + */ + break; default: if (page->mapping != TAIL_MAPPING) { bad_page(page, "corrupted mapping in tail page"); diff --git a/mm/vmscan.c b/mm/vmscan.c index c0671df36deabb14e1564f1e1bb8450196048cd9..d81495357268160b98e48106a273ff7453207fb8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1996,8 +1996,8 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, * * Returns the number of pages moved to the given lruvec. */ -static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, - struct list_head *list) +unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, + struct list_head *list) { int nr_pages, nr_moved = 0; LIST_HEAD(pages_to_free); @@ -2747,6 +2747,56 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, } } +#if defined(CONFIG_MEMCG) && defined(CONFIG_TRANSPARENT_HUGEPAGE) +/* + * Try to reclaim the zero subpages for the transparent huge page. + */ + +static inline unsigned long reclaim_hugepages(struct lruvec *lruvec, + unsigned long nr_to_reclaim) +{ + struct mem_cgroup *memcg; + struct hugepage_reclaim *hr_queue; + int nid = lruvec->pgdat->node_id; + int threshold; + unsigned long nr_reclaimed = 0; + + memcg = lruvec_memcg(lruvec); + if (!memcg) + goto out; + + threshold = READ_ONCE(memcg->tr_ctrl.threshold); + hr_queue = &memcg->nodeinfo[nid]->hugepage_reclaim_queue; + + /* Now we only support zsr mode. */ + if (tr_get_reclaim_mode(memcg) != THP_RECLAIM_ZSR) + goto out; + + do { + struct page *page = NULL; + + if (tr_get_hugepage(hr_queue, &page, threshold, 0)) + break; + + if (!page) + continue; + + nr_reclaimed += tr_reclaim_hugepage(hr_queue, lruvec, page); + + cond_resched(); + } while (nr_reclaimed < nr_to_reclaim); + +out: + return nr_reclaimed; +} +#else +static inline unsigned long reclaim_hugepages(struct lruvec *lruvec, + unsigned long nr_to_reclaim) +{ + return 0; +} +#endif + static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; @@ -2850,6 +2900,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) scan_adjusted = true; } blk_finish_plug(&plug); + + /* Trigger zero subpages reclaim only when priority is 0. 
*/ + if (sc->priority == 0 && nr_reclaimed < nr_to_reclaim) + nr_reclaimed += reclaim_hugepages(lruvec, + nr_to_reclaim - nr_reclaimed); + sc->nr_reclaimed += nr_reclaimed; /* diff --git a/net/9p/trans_xen.c b/net/9p/trans_xen.c index 432ac5a16f2e04243aa88eac3ab309558ef28313..7e27f733869b88c4310b236530aaaf4303cf5ee8 100644 --- a/net/9p/trans_xen.c +++ b/net/9p/trans_xen.c @@ -291,6 +291,10 @@ static void xen_9pfs_front_free(struct xen_9pfs_front_priv *priv) write_unlock(&xen_9pfs_lock); for (i = 0; i < priv->num_rings; i++) { + struct xen_9pfs_dataring *ring = &priv->rings[i]; + + cancel_work_sync(&ring->work); + if (!priv->rings[i].intf) break; if (priv->rings[i].irq > 0) diff --git a/net/Kconfig b/net/Kconfig index 0a73d0c2fe4994dc3510a62fdc86bee1b0e1cba3..46d6747faaca9877b96155a5078af3480ff8e6d4 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -67,6 +67,7 @@ source "net/xfrm/Kconfig" source "net/iucv/Kconfig" source "net/smc/Kconfig" source "net/xdp/Kconfig" +source "net/vtoa/Kconfig" source "net/hookers/Kconfig" config INET diff --git a/net/Makefile b/net/Makefile index d2f39c906b194cac34df27f65a6c69df2ddac523..04649a8dd6799c38b027f6b25f66d91b6bf7244c 100644 --- a/net/Makefile +++ b/net/Makefile @@ -55,7 +55,7 @@ obj-$(CONFIG_IUCV) += iucv/ obj-$(CONFIG_SMC) += smc/ ifneq ($(CONFIG_SMC),) ifeq ($(CONFIG_BPF_SYSCALL),y) -obj-y += smc/bpf_smc_struct_ops.o +obj-y += smc/bpf_smc.o endif endif obj-$(CONFIG_RFKILL) += rfkill/ @@ -94,3 +94,4 @@ obj-$(CONFIG_NET_NCSI) += ncsi/ obj-$(CONFIG_XDP_SOCKETS) += xdp/ obj-$(CONFIG_HOOKERS) += hookers/ obj-$(CONFIG_MPTCP) += mptcp/ +obj-$(CONFIG_VTOA) += vtoa/ diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index 71d18d3295f50551893e9cf505e57665154e2e77..d28e263acb6233367760358aeba415b7add4abc5 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -1000,7 +1000,14 @@ static int hci_sock_ioctl(struct socket *sock, unsigned int cmd, if (hci_sock_gen_cookie(sk)) { struct sk_buff *skb; - if (capable(CAP_NET_ADMIN)) + /* Perform careful checks before setting the HCI_SOCK_TRUSTED + * flag. Make sure that not only the current task but also + * the socket opener has the required capability, since + * privileged programs can be tricked into making ioctl calls + * on HCI sockets, and the socket should not be marked as + * trusted simply because the ioctl caller is privileged. 
+ */ + if (sk_capable(sk, CAP_NET_ADMIN)) hci_sock_set_flag(sk, HCI_SOCK_TRUSTED); /* Send event to monitor */ diff --git a/net/core/devlink.c b/net/core/devlink.c index 72047750dcd96f4a43f30b35a8125176fc1f2892..9479e3939764124671df05b97354b20cd3c8b4d0 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -87,6 +87,9 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report); static const struct nla_policy devlink_function_nl_policy[DEVLINK_PORT_FUNCTION_ATTR_MAX + 1] = { [DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY }, + [DEVLINK_PORT_FN_ATTR_STATE] = + NLA_POLICY_RANGE(NLA_U8, DEVLINK_PORT_FN_STATE_INACTIVE, + DEVLINK_PORT_FN_STATE_ACTIVE), }; static LIST_HEAD(devlink_list); @@ -690,6 +693,15 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg, if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_vf.external)) return -EMSGSIZE; break; + case DEVLINK_PORT_FLAVOUR_PCI_SF: + if (nla_put_u32(msg, DEVLINK_ATTR_PORT_CONTROLLER_NUMBER, + attrs->pci_sf.controller) || + nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER, + attrs->pci_sf.pf) || + nla_put_u32(msg, DEVLINK_ATTR_PORT_PCI_SF_NUMBER, + attrs->pci_sf.sf)) + return -EMSGSIZE; + break; case DEVLINK_PORT_FLAVOUR_PHYSICAL: case DEVLINK_PORT_FLAVOUR_CPU: case DEVLINK_PORT_FLAVOUR_DSA: @@ -711,6 +723,83 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg, return 0; } +static int +devlink_port_fn_hw_addr_fill(struct devlink *devlink, const struct devlink_ops *ops, + struct devlink_port *port, struct sk_buff *msg, + struct netlink_ext_ack *extack, bool *msg_updated) +{ + u8 hw_addr[MAX_ADDR_LEN]; + int hw_addr_len; + int err; + + if (!ops->port_function_hw_addr_get) + return 0; + + err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack); + if (err) { + if (err == -EOPNOTSUPP) + return 0; + return err; + } + err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr); + if (err) + return err; + *msg_updated = true; + return 0; +} + +static bool +devlink_port_fn_state_valid(enum devlink_port_fn_state state) +{ + return state == DEVLINK_PORT_FN_STATE_INACTIVE || + state == DEVLINK_PORT_FN_STATE_ACTIVE; +} + +static bool +devlink_port_fn_opstate_valid(enum devlink_port_fn_opstate opstate) +{ + return opstate == DEVLINK_PORT_FN_OPSTATE_DETACHED || + opstate == DEVLINK_PORT_FN_OPSTATE_ATTACHED; +} + +static int +devlink_port_fn_state_fill(struct devlink *devlink, + const struct devlink_ops *ops, + struct devlink_port *port, struct sk_buff *msg, + struct netlink_ext_ack *extack, + bool *msg_updated) +{ + enum devlink_port_fn_opstate opstate; + enum devlink_port_fn_state state; + int err; + + if (!ops->port_fn_state_get) + return 0; + + err = ops->port_fn_state_get(devlink, port, &state, &opstate, extack); + if (err) { + if (err == -EOPNOTSUPP) + return 0; + return err; + } + if (!devlink_port_fn_state_valid(state)) { + WARN_ON_ONCE(1); + NL_SET_ERR_MSG_MOD(extack, "Invalid state read from driver"); + return -EINVAL; + } + if (!devlink_port_fn_opstate_valid(opstate)) { + WARN_ON_ONCE(1); + NL_SET_ERR_MSG_MOD(extack, + "Invalid operational state read from driver"); + return -EINVAL; + } + if (nla_put_u8(msg, DEVLINK_PORT_FN_ATTR_STATE, state) || + nla_put_u8(msg, DEVLINK_PORT_FN_ATTR_OPSTATE, opstate)) + return -EMSGSIZE; + *msg_updated = true; + return 0; +} + static int devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port, struct netlink_ext_ack *extack) @@ -718,36 +807,22 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct 
devlink_port *por struct devlink *devlink = port->devlink; const struct devlink_ops *ops; struct nlattr *function_attr; - bool empty_nest = true; - int err = 0; + bool msg_updated = false; + int err; function_attr = nla_nest_start_noflag(msg, DEVLINK_ATTR_PORT_FUNCTION); if (!function_attr) return -EMSGSIZE; ops = devlink->ops; - if (ops->port_function_hw_addr_get) { - int hw_addr_len; - u8 hw_addr[MAX_ADDR_LEN]; - - err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack); - if (err == -EOPNOTSUPP) { - /* Port function attributes are optional for a port. If port doesn't - * support function attribute, returning -EOPNOTSUPP is not an error. - */ - err = 0; - goto out; - } else if (err) { - goto out; - } - err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr); - if (err) - goto out; - empty_nest = false; - } - + err = devlink_port_fn_hw_addr_fill(devlink, ops, port, msg, + extack, &msg_updated); + if (err) + goto out; + err = devlink_port_fn_state_fill(devlink, ops, port, msg, extack, + &msg_updated); out: - if (err || empty_nest) + if (err || !msg_updated) nla_nest_cancel(msg, function_attr); else nla_nest_end(msg, function_attr); @@ -985,7 +1060,6 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port * const struct devlink_ops *ops; const u8 *hw_addr; int hw_addr_len; - int err; hw_addr = nla_data(attr); hw_addr_len = nla_len(attr); @@ -1010,12 +1084,25 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port * return -EOPNOTSUPP; } - err = ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack); - if (err) - return err; + return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack); +} - devlink_port_notify(port, DEVLINK_CMD_PORT_NEW); - return 0; +static int devlink_port_fn_state_set(struct devlink *devlink, + struct devlink_port *port, + const struct nlattr *attr, + struct netlink_ext_ack *extack) +{ + enum devlink_port_fn_state state; + const struct devlink_ops *ops; + + state = nla_get_u8(attr); + ops = devlink->ops; + if (!ops->port_fn_state_set) { + NL_SET_ERR_MSG_MOD(extack, + "Function does not support state setting"); + return -EOPNOTSUPP; + } + return ops->port_fn_state_set(devlink, port, state, extack); } static int @@ -1033,9 +1120,21 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port, } attr = tb[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR]; - if (attr) + if (attr) { err = devlink_port_function_hw_addr_set(devlink, port, attr, extack); + if (err) + return err; + } + /* Keep this as the last function attribute set, so that when + * multiple port function attributes are set along with state, + * Those can be applied first before activating the state. 
+ */ + attr = tb[DEVLINK_PORT_FN_ATTR_STATE]; + if (attr) + err = devlink_port_fn_state_set(devlink, port, attr, extack); + if (!err) + devlink_port_notify(port, DEVLINK_CMD_PORT_NEW); return err; } @@ -1135,6 +1234,111 @@ static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb, return devlink_port_unsplit(devlink, port_index, info->extack); } +static int devlink_port_new_notifiy(struct devlink *devlink, + unsigned int port_index, + struct genl_info *info) +{ + struct devlink_port *devlink_port; + struct sk_buff *msg; + int err; + + msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!msg) + return -ENOMEM; + + mutex_lock(&devlink->lock); + devlink_port = devlink_port_get_by_index(devlink, port_index); + if (!devlink_port) { + err = -ENODEV; + goto out; + } + + err = devlink_nl_port_fill(msg, devlink, devlink_port, + DEVLINK_CMD_NEW, info->snd_portid, + info->snd_seq, 0, NULL); + if (err) + goto out; + + err = genlmsg_reply(msg, info); + mutex_unlock(&devlink->lock); + return err; + +out: + mutex_unlock(&devlink->lock); + nlmsg_free(msg); + return err; +} + +static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb, + struct genl_info *info) +{ + struct netlink_ext_ack *extack = info->extack; + struct devlink_port_new_attrs new_attrs = {}; + struct devlink *devlink = info->user_ptr[0]; + unsigned int new_port_index; + int err; + + if (!devlink->ops->port_new || !devlink->ops->port_del) + return -EOPNOTSUPP; + + if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] || + !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) { + NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are not specified"); + return -EINVAL; + } + new_attrs.flavour = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_FLAVOUR]); + new_attrs.pfnum = + nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]); + + if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) { + /* Port index of the new port being created by driver. */ + new_attrs.port_index = + nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]); + new_attrs.port_index_valid = true; + } + if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) { + new_attrs.controller = + nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]); + new_attrs.controller_valid = true; + } + if (new_attrs.flavour == DEVLINK_PORT_FLAVOUR_PCI_SF && + info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) { + new_attrs.sfnum = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]); + new_attrs.sfnum_valid = true; + } + + err = devlink->ops->port_new(devlink, &new_attrs, extack, + &new_port_index); + if (err) + return err; + + err = devlink_port_new_notifiy(devlink, new_port_index, info); + if (err && err != -ENODEV) { + /* Fail to send the response; destroy newly created port. 
*/ + devlink->ops->port_del(devlink, new_port_index, extack); + } + return err; +} + +static int devlink_nl_cmd_port_del_doit(struct sk_buff *skb, + struct genl_info *info) +{ + struct netlink_ext_ack *extack = info->extack; + struct devlink *devlink = info->user_ptr[0]; + unsigned int port_index; + + if (!devlink->ops->port_del) + return -EOPNOTSUPP; + + if (!info->attrs[DEVLINK_ATTR_PORT_INDEX]) { + NL_SET_ERR_MSG_MOD(extack, "Port index is not specified"); + return -EINVAL; + } + port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]); + + return devlink->ops->port_del(devlink, port_index, extack); +} + static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink, struct devlink_sb *devlink_sb, enum devlink_command cmd, u32 portid, @@ -7594,6 +7798,10 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = { [DEVLINK_ATTR_RELOAD_ACTION] = NLA_POLICY_RANGE(NLA_U8, DEVLINK_RELOAD_ACTION_DRIVER_REINIT, DEVLINK_RELOAD_ACTION_MAX), [DEVLINK_ATTR_RELOAD_LIMITS] = NLA_POLICY_BITFIELD32(DEVLINK_RELOAD_LIMITS_VALID_MASK), + [DEVLINK_ATTR_PORT_FLAVOUR] = { .type = NLA_U16 }, + [DEVLINK_ATTR_PORT_PCI_PF_NUMBER] = { .type = NLA_U16 }, + [DEVLINK_ATTR_PORT_PCI_SF_NUMBER] = { .type = NLA_U32 }, + [DEVLINK_ATTR_PORT_CONTROLLER_NUMBER] = { .type = NLA_U32 }, }; static const struct genl_small_ops devlink_nl_ops[] = { @@ -7633,6 +7841,18 @@ static const struct genl_small_ops devlink_nl_ops[] = { .flags = GENL_ADMIN_PERM, .internal_flags = DEVLINK_NL_FLAG_NO_LOCK, }, + { + .cmd = DEVLINK_CMD_PORT_NEW, + .doit = devlink_nl_cmd_port_new_doit, + .flags = GENL_ADMIN_PERM, + .internal_flags = DEVLINK_NL_FLAG_NO_LOCK, + }, + { + .cmd = DEVLINK_CMD_PORT_DEL, + .doit = devlink_nl_cmd_port_del_doit, + .flags = GENL_ADMIN_PERM, + .internal_flags = DEVLINK_NL_FLAG_NO_LOCK, + }, { .cmd = DEVLINK_CMD_SB_GET, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, @@ -8370,6 +8590,32 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 contro } EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_vf_set); +/** + * devlink_port_attrs_pci_sf_set - Set PCI SF port attributes + * + * @devlink_port: devlink port + * @controller: associated controller number for the devlink port instance + * @pf: associated PF for the devlink port instance + * @sf: associated SF of a PF for the devlink port instance + */ +void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller, + u16 pf, u32 sf) +{ + struct devlink_port_attrs *attrs = &devlink_port->attrs; + int ret; + + if (WARN_ON(devlink_port->registered)) + return; + ret = __devlink_port_attrs_set(devlink_port, + DEVLINK_PORT_FLAVOUR_PCI_SF); + if (ret) + return; + attrs->pci_sf.controller = controller; + attrs->pci_sf.pf = pf; + attrs->pci_sf.sf = sf; +} +EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set); + static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port, char *name, size_t len) { @@ -8417,6 +8663,10 @@ static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port, n = snprintf(name, len, "pf%uvf%u", attrs->pci_vf.pf, attrs->pci_vf.vf); break; + case DEVLINK_PORT_FLAVOUR_PCI_SF: + n = snprintf(name, len, "pf%usf%u", attrs->pci_sf.pf, + attrs->pci_sf.sf); + break; case DEVLINK_PORT_FLAVOUR_VIRTUAL: return -EOPNOTSUPP; } diff --git a/net/core/stream.c b/net/core/stream.c index a166a32b411fa6d9c37f5aed5da9b85b912a8086..0d4457f54f622281aefa91d2ba96ec1070a6a909 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -73,8 +73,8 @@ int 
sk_stream_wait_connect(struct sock *sk, long *timeo_p) add_wait_queue(sk_sleep(sk), &wait); sk->sk_write_pending++; done = sk_wait_event(sk, timeo_p, - !sk->sk_err && - !((1 << sk->sk_state) & + !READ_ONCE(sk->sk_err) && + !((1 << READ_ONCE(sk->sk_state)) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)), &wait); remove_wait_queue(sk_sleep(sk), &wait); sk->sk_write_pending--; @@ -87,9 +87,9 @@ EXPORT_SYMBOL(sk_stream_wait_connect); * sk_stream_closing - Return 1 if we still have things to send in our buffers. * @sk: socket to verify */ -static inline int sk_stream_closing(struct sock *sk) +static int sk_stream_closing(const struct sock *sk) { - return (1 << sk->sk_state) & + return (1 << READ_ONCE(sk->sk_state)) & (TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_LAST_ACK); } @@ -142,8 +142,8 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p) set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); sk->sk_write_pending++; - sk_wait_event(sk, &current_timeo, sk->sk_err || - (sk->sk_shutdown & SEND_SHUTDOWN) || + sk_wait_event(sk, &current_timeo, READ_ONCE(sk->sk_err) || + (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) || (sk_stream_memory_free(sk) && !vm_wait), &wait); sk->sk_write_pending--; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 480b86dab4cd0f7ad12b300518c4f8579db8b5b1..6889b77db73be58e46a89b9f67bc802fba245ad2 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1010,7 +1010,7 @@ static int inet_compat_routing_ioctl(struct sock *sk, unsigned int cmd, return ip_rt_ioctl(sock_net(sk), cmd, &rt); } -static int inet_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) +int inet_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { void __user *argp = compat_ptr(arg); struct sock *sk = sock->sk; @@ -1025,6 +1025,7 @@ static int inet_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned lon return sk->sk_prot->compat_ioctl(sk, cmd, arg); } } +EXPORT_SYMBOL_GPL(inet_compat_ioctl); #endif /* CONFIG_COMPAT */ const struct proto_ops inet_stream_ops = { diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index eaf2308c355a6ed9f9ecafc660f296fbbd1ea698..1e15cc76cd78da5d4a3a8343a26ae587e6424631 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -258,7 +258,7 @@ static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock, sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); ret = sk_wait_event(sk, &timeo, !list_empty(&psock->ingress_msg) || - !skb_queue_empty(&sk->sk_receive_queue), &wait); + !skb_queue_empty_lockless(&sk->sk_receive_queue), &wait); sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); remove_wait_queue(sk_sleep(sk), &wait); return ret; diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index f6409b8262e51a400edcac3e0c68696dcf7972ae..14db046e8039ef6a1efe398a1124cad689f205e6 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -639,6 +639,7 @@ int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udpv6_sendmsg, sk, msg, size); } +EXPORT_SYMBOL_GPL(inet6_sendmsg); INDIRECT_CALLABLE_DECLARE(int udpv6_recvmsg(struct sock *, struct msghdr *, size_t, int, int, int *)); @@ -659,6 +660,7 @@ int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, msg->msg_namelen = addr_len; return err; } +EXPORT_SYMBOL_GPL(inet6_recvmsg); const struct proto_ops inet6_stream_ops = { .family = PF_INET6, diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c index 99a37c411323edcf59733486a07a8a1e1ce6b3a6..01e26698285a05c4e14d293d9fd8b40b41a688b8 100644 --- a/net/llc/af_llc.c +++ 
b/net/llc/af_llc.c @@ -582,7 +582,8 @@ static int llc_ui_wait_for_disc(struct sock *sk, long timeout) add_wait_queue(sk_sleep(sk), &wait); while (1) { - if (sk_wait_event(sk, &timeout, sk->sk_state == TCP_CLOSE, &wait)) + if (sk_wait_event(sk, &timeout, + READ_ONCE(sk->sk_state) == TCP_CLOSE, &wait)) break; rc = -ERESTARTSYS; if (signal_pending(current)) @@ -602,7 +603,8 @@ static bool llc_ui_wait_for_conn(struct sock *sk, long timeout) add_wait_queue(sk_sleep(sk), &wait); while (1) { - if (sk_wait_event(sk, &timeout, sk->sk_state != TCP_SYN_SENT, &wait)) + if (sk_wait_event(sk, &timeout, + READ_ONCE(sk->sk_state) != TCP_SYN_SENT, &wait)) break; if (signal_pending(current) || !timeout) break; @@ -621,7 +623,7 @@ static int llc_ui_wait_for_busy_core(struct sock *sk, long timeout) while (1) { rc = 0; if (sk_wait_event(sk, &timeout, - (sk->sk_shutdown & RCV_SHUTDOWN) || + (READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN) || (!llc_data_accept_state(llc->state) && !llc->remote_busy_flag && !llc->p_flag), &wait)) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 385ecc950aef379df970dd8d0b5c809dbc5cd7e2..8b15a400b7c11a686515283f927138f4438b9bbd 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3371,7 +3371,8 @@ static int nf_tables_newrule(struct net *net, struct sock *nlsk, return 0; err2: - nf_tables_rule_release(&ctx, rule); + nft_rule_expr_deactivate(&ctx, rule, NFT_TRANS_PREPARE); + nf_tables_rule_destroy(&ctx, rule); err1: for (i = 0; i < n; i++) { if (info[i].ops) { diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c index 79fbf37291f389835691eb08eab55e61ad7a7aa3..51e3953b414c0497f7d99fd4c249e49d5415e065 100644 --- a/net/netfilter/nfnetlink_osf.c +++ b/net/netfilter/nfnetlink_osf.c @@ -269,6 +269,7 @@ bool nf_osf_find(const struct sk_buff *skb, struct nf_osf_hdr_ctx ctx; const struct tcphdr *tcp; struct tcphdr _tcph; + bool found = false; memset(&ctx, 0, sizeof(ctx)); @@ -283,10 +284,11 @@ bool nf_osf_find(const struct sk_buff *skb, data->genre = f->genre; data->version = f->version; + found = true; break; } - return true; + return found; } EXPORT_SYMBOL_GPL(nf_osf_find); diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c index e5c8a295e64066971869b83ece1daee60a7f2ddc..5c04da4cfbad0c963bd4467b1be2abe86bae9b96 100644 --- a/net/netrom/af_netrom.c +++ b/net/netrom/af_netrom.c @@ -400,6 +400,11 @@ static int nr_listen(struct socket *sock, int backlog) struct sock *sk = sock->sk; lock_sock(sk); + if (sock->state != SS_UNCONNECTED) { + release_sock(sk); + return -EINVAL; + } + if (sk->sk_state != TCP_LISTEN) { memset(&nr_sk(sk)->user_addr, 0, AX25_ADDR_LEN); sk->sk_max_ack_backlog = backlog; diff --git a/net/rds/message.c b/net/rds/message.c index 799034e0f513d988334280186cbdf255fbf50eb7..b363ef13c75ef680084e53ed85ef209e0f704964 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -104,9 +104,9 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs, spin_lock_irqsave(&q->lock, flags); head = &q->zcookie_head; if (!list_empty(head)) { - info = list_entry(head, struct rds_msg_zcopy_info, - rs_zcookie_next); - if (info && rds_zcookie_add(info, cookie)) { + info = list_first_entry(head, struct rds_msg_zcopy_info, + rs_zcookie_next); + if (rds_zcookie_add(info, cookie)) { spin_unlock_irqrestore(&q->lock, flags); kfree(rds_info_from_znotifier(znotif)); /* caller invokes rds_wake_sk_sleep() */ diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 
d762e89ab74f7ec4390c64975608de26385e34ea..8f25d35dae7b9029161c8db280e728e248692a88 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -503,17 +503,6 @@ config NET_CLS_BASIC To compile this code as a module, choose M here: the module will be called cls_basic. -config NET_CLS_TCINDEX - tristate "Traffic-Control Index (TCINDEX)" - select NET_CLS - help - Say Y here if you want to be able to classify packets based on - traffic control indices. You will want this feature if you want - to implement Differentiated Services together with DSMARK. - - To compile this code as a module, choose M here: the - module will be called cls_tcindex. - config NET_CLS_ROUTE4 tristate "Routing decision (ROUTE)" depends on INET diff --git a/net/sched/Makefile b/net/sched/Makefile index 66bbf9a98f9ea123edd69c84615b37266d537fed..4311fdb211197ec073f70f2adbbe26024e38c16c 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -69,7 +69,6 @@ obj-$(CONFIG_NET_CLS_U32) += cls_u32.o obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o obj-$(CONFIG_NET_CLS_FW) += cls_fw.o obj-$(CONFIG_NET_CLS_RSVP) += cls_rsvp.o -obj-$(CONFIG_NET_CLS_TCINDEX) += cls_tcindex.o obj-$(CONFIG_NET_CLS_RSVP6) += cls_rsvp6.o obj-$(CONFIG_NET_CLS_BASIC) += cls_basic.o obj-$(CONFIG_NET_CLS_FLOW) += cls_flow.o diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 35ee6d8226e615bfc6642c7b6ec3ac88daa93125..caf1a05bfbde4e5999c2c0eb7ea0da4fcb8c2c92 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -1086,6 +1086,9 @@ static int fl_set_geneve_opt(const struct nlattr *nla, struct fl_flow_key *key, if (option_len > sizeof(struct geneve_opt)) data_len = option_len - sizeof(struct geneve_opt); + if (key->enc_opts.len > FLOW_DIS_TUN_OPTS_MAX - 4) + return -ERANGE; + opt = (struct geneve_opt *)&key->enc_opts.data[key->enc_opts.len]; memset(opt, 0xff, option_len); opt->length = data_len / 4; diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c deleted file mode 100644 index dc87feaa3cb35a9e41a1209db82ebc5cef399462..0000000000000000000000000000000000000000 --- a/net/sched/cls_tcindex.c +++ /dev/null @@ -1,764 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * net/sched/cls_tcindex.c Packet classifier for skb->tc_index - * - * Written 1998,1999 by Werner Almesberger, EPFL ICA - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* - * Passing parameters to the root seems to be done more awkwardly than really - * necessary. At least, u32 doesn't seem to use such dirty hacks. To be - * verified. FIXME. 
- */ - -#define PERFECT_HASH_THRESHOLD 64 /* use perfect hash if not bigger */ -#define DEFAULT_HASH_SIZE 64 /* optimized for diffserv */ - - -struct tcindex_data; - -struct tcindex_filter_result { - struct tcf_exts exts; - struct tcf_result res; - struct tcindex_data *p; - struct rcu_work rwork; -}; - -struct tcindex_filter { - u16 key; - struct tcindex_filter_result result; - struct tcindex_filter __rcu *next; - struct rcu_work rwork; -}; - - -struct tcindex_data { - struct tcindex_filter_result *perfect; /* perfect hash; NULL if none */ - struct tcindex_filter __rcu **h; /* imperfect hash; */ - struct tcf_proto *tp; - u16 mask; /* AND key with mask */ - u32 shift; /* shift ANDed key to the right */ - u32 hash; /* hash table size; 0 if undefined */ - u32 alloc_hash; /* allocated size */ - u32 fall_through; /* 0: only classify if explicit match */ - refcount_t refcnt; /* a temporary refcnt for perfect hash */ - struct rcu_work rwork; -}; - -static inline int tcindex_filter_is_set(struct tcindex_filter_result *r) -{ - return tcf_exts_has_actions(&r->exts) || r->res.classid; -} - -static void tcindex_data_get(struct tcindex_data *p) -{ - refcount_inc(&p->refcnt); -} - -static void tcindex_data_put(struct tcindex_data *p) -{ - if (refcount_dec_and_test(&p->refcnt)) { - kfree(p->perfect); - kfree(p->h); - kfree(p); - } -} - -static struct tcindex_filter_result *tcindex_lookup(struct tcindex_data *p, - u16 key) -{ - if (p->perfect) { - struct tcindex_filter_result *f = p->perfect + key; - - return tcindex_filter_is_set(f) ? f : NULL; - } else if (p->h) { - struct tcindex_filter __rcu **fp; - struct tcindex_filter *f; - - fp = &p->h[key % p->hash]; - for (f = rcu_dereference_bh_rtnl(*fp); - f; - fp = &f->next, f = rcu_dereference_bh_rtnl(*fp)) - if (f->key == key) - return &f->result; - } - - return NULL; -} - - -static int tcindex_classify(struct sk_buff *skb, const struct tcf_proto *tp, - struct tcf_result *res) -{ - struct tcindex_data *p = rcu_dereference_bh(tp->root); - struct tcindex_filter_result *f; - int key = (skb->tc_index & p->mask) >> p->shift; - - pr_debug("tcindex_classify(skb %p,tp %p,res %p),p %p\n", - skb, tp, res, p); - - f = tcindex_lookup(p, key); - if (!f) { - struct Qdisc *q = tcf_block_q(tp->chain->block); - - if (!p->fall_through) - return -1; - res->classid = TC_H_MAKE(TC_H_MAJ(q->handle), key); - res->class = 0; - pr_debug("alg 0x%x\n", res->classid); - return 0; - } - *res = f->res; - pr_debug("map 0x%x\n", res->classid); - - return tcf_exts_exec(skb, &f->exts, res); -} - - -static void *tcindex_get(struct tcf_proto *tp, u32 handle) -{ - struct tcindex_data *p = rtnl_dereference(tp->root); - struct tcindex_filter_result *r; - - pr_debug("tcindex_get(tp %p,handle 0x%08x)\n", tp, handle); - if (p->perfect && handle >= p->alloc_hash) - return NULL; - r = tcindex_lookup(p, handle); - return r && tcindex_filter_is_set(r) ? 
r : NULL; -} - -static int tcindex_init(struct tcf_proto *tp) -{ - struct tcindex_data *p; - - pr_debug("tcindex_init(tp %p)\n", tp); - p = kzalloc(sizeof(struct tcindex_data), GFP_KERNEL); - if (!p) - return -ENOMEM; - - p->mask = 0xffff; - p->hash = DEFAULT_HASH_SIZE; - p->fall_through = 1; - refcount_set(&p->refcnt, 1); /* Paired with tcindex_destroy_work() */ - - rcu_assign_pointer(tp->root, p); - return 0; -} - -static void __tcindex_destroy_rexts(struct tcindex_filter_result *r) -{ - tcf_exts_destroy(&r->exts); - tcf_exts_put_net(&r->exts); - tcindex_data_put(r->p); -} - -static void tcindex_destroy_rexts_work(struct work_struct *work) -{ - struct tcindex_filter_result *r; - - r = container_of(to_rcu_work(work), - struct tcindex_filter_result, - rwork); - rtnl_lock(); - __tcindex_destroy_rexts(r); - rtnl_unlock(); -} - -static void __tcindex_destroy_fexts(struct tcindex_filter *f) -{ - tcf_exts_destroy(&f->result.exts); - tcf_exts_put_net(&f->result.exts); - kfree(f); -} - -static void tcindex_destroy_fexts_work(struct work_struct *work) -{ - struct tcindex_filter *f = container_of(to_rcu_work(work), - struct tcindex_filter, - rwork); - - rtnl_lock(); - __tcindex_destroy_fexts(f); - rtnl_unlock(); -} - -static int tcindex_delete(struct tcf_proto *tp, void *arg, bool *last, - bool rtnl_held, struct netlink_ext_ack *extack) -{ - struct tcindex_data *p = rtnl_dereference(tp->root); - struct tcindex_filter_result *r = arg; - struct tcindex_filter __rcu **walk; - struct tcindex_filter *f = NULL; - - pr_debug("tcindex_delete(tp %p,arg %p),p %p\n", tp, arg, p); - if (p->perfect) { - if (!r->res.class) - return -ENOENT; - } else { - int i; - - for (i = 0; i < p->hash; i++) { - walk = p->h + i; - for (f = rtnl_dereference(*walk); f; - walk = &f->next, f = rtnl_dereference(*walk)) { - if (&f->result == r) - goto found; - } - } - return -ENOENT; - -found: - rcu_assign_pointer(*walk, rtnl_dereference(f->next)); - } - tcf_unbind_filter(tp, &r->res); - /* all classifiers are required to call tcf_exts_destroy() after rcu - * grace period, since converted-to-rcu actions are relying on that - * in cleanup() callback - */ - if (f) { - if (tcf_exts_get_net(&f->result.exts)) - tcf_queue_work(&f->rwork, tcindex_destroy_fexts_work); - else - __tcindex_destroy_fexts(f); - } else { - tcindex_data_get(p); - - if (tcf_exts_get_net(&r->exts)) - tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work); - else - __tcindex_destroy_rexts(r); - } - - *last = false; - return 0; -} - -static void tcindex_destroy_work(struct work_struct *work) -{ - struct tcindex_data *p = container_of(to_rcu_work(work), - struct tcindex_data, - rwork); - - tcindex_data_put(p); -} - -static inline int -valid_perfect_hash(struct tcindex_data *p) -{ - return p->hash > (p->mask >> p->shift); -} - -static const struct nla_policy tcindex_policy[TCA_TCINDEX_MAX + 1] = { - [TCA_TCINDEX_HASH] = { .type = NLA_U32 }, - [TCA_TCINDEX_MASK] = { .type = NLA_U16 }, - [TCA_TCINDEX_SHIFT] = { .type = NLA_U32 }, - [TCA_TCINDEX_FALL_THROUGH] = { .type = NLA_U32 }, - [TCA_TCINDEX_CLASSID] = { .type = NLA_U32 }, -}; - -static int tcindex_filter_result_init(struct tcindex_filter_result *r, - struct tcindex_data *p, - struct net *net) -{ - memset(r, 0, sizeof(*r)); - r->p = p; - return tcf_exts_init(&r->exts, net, TCA_TCINDEX_ACT, - TCA_TCINDEX_POLICE); -} - -static void tcindex_free_perfect_hash(struct tcindex_data *cp); - -static void tcindex_partial_destroy_work(struct work_struct *work) -{ - struct tcindex_data *p = container_of(to_rcu_work(work), - struct 
tcindex_data, - rwork); - - rtnl_lock(); - if (p->perfect) - tcindex_free_perfect_hash(p); - kfree(p); - rtnl_unlock(); -} - -static void tcindex_free_perfect_hash(struct tcindex_data *cp) -{ - int i; - - for (i = 0; i < cp->hash; i++) - tcf_exts_destroy(&cp->perfect[i].exts); - kfree(cp->perfect); -} - -static int tcindex_alloc_perfect_hash(struct net *net, struct tcindex_data *cp) -{ - int i, err = 0; - - cp->perfect = kcalloc(cp->hash, sizeof(struct tcindex_filter_result), - GFP_KERNEL | __GFP_NOWARN); - if (!cp->perfect) - return -ENOMEM; - - for (i = 0; i < cp->hash; i++) { - err = tcf_exts_init(&cp->perfect[i].exts, net, - TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); - if (err < 0) - goto errout; - cp->perfect[i].p = cp; - } - - return 0; - -errout: - tcindex_free_perfect_hash(cp); - return err; -} - -static int -tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, - u32 handle, struct tcindex_data *p, - struct tcindex_filter_result *r, struct nlattr **tb, - struct nlattr *est, bool ovr, struct netlink_ext_ack *extack) -{ - struct tcindex_filter_result new_filter_result, *old_r = r; - struct tcindex_data *cp = NULL, *oldp; - struct tcindex_filter *f = NULL; /* make gcc behave */ - struct tcf_result cr = {}; - int err, balloc = 0; - struct tcf_exts e; - bool update_h = false; - - err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); - if (err < 0) - return err; - err = tcf_exts_validate(net, tp, tb, est, &e, ovr, true, extack); - if (err < 0) - goto errout; - - err = -ENOMEM; - /* tcindex_data attributes must look atomic to classifier/lookup so - * allocate new tcindex data and RCU assign it onto root. Keeping - * perfect hash and hash pointers from old data. - */ - cp = kzalloc(sizeof(*cp), GFP_KERNEL); - if (!cp) - goto errout; - - cp->mask = p->mask; - cp->shift = p->shift; - cp->hash = p->hash; - cp->alloc_hash = p->alloc_hash; - cp->fall_through = p->fall_through; - cp->tp = tp; - refcount_set(&cp->refcnt, 1); /* Paired with tcindex_destroy_work() */ - - if (tb[TCA_TCINDEX_HASH]) - cp->hash = nla_get_u32(tb[TCA_TCINDEX_HASH]); - - if (tb[TCA_TCINDEX_MASK]) - cp->mask = nla_get_u16(tb[TCA_TCINDEX_MASK]); - - if (tb[TCA_TCINDEX_SHIFT]) { - cp->shift = nla_get_u32(tb[TCA_TCINDEX_SHIFT]); - if (cp->shift > 16) { - err = -EINVAL; - goto errout; - } - } - if (!cp->hash) { - /* Hash not specified, use perfect hash if the upper limit - * of the hashing index is below the threshold. - */ - if ((cp->mask >> cp->shift) < PERFECT_HASH_THRESHOLD) - cp->hash = (cp->mask >> cp->shift) + 1; - else - cp->hash = DEFAULT_HASH_SIZE; - } - - if (p->perfect) { - int i; - - if (tcindex_alloc_perfect_hash(net, cp) < 0) - goto errout; - cp->alloc_hash = cp->hash; - for (i = 0; i < min(cp->hash, p->hash); i++) - cp->perfect[i].res = p->perfect[i].res; - balloc = 1; - } - cp->h = p->h; - - err = tcindex_filter_result_init(&new_filter_result, cp, net); - if (err < 0) - goto errout_alloc; - if (old_r) - cr = r->res; - - err = -EBUSY; - - /* Hash already allocated, make sure that we still meet the - * requirements for the allocated hash. 
- */ - if (cp->perfect) { - if (!valid_perfect_hash(cp) || - cp->hash > cp->alloc_hash) - goto errout_alloc; - } else if (cp->h && cp->hash != cp->alloc_hash) { - goto errout_alloc; - } - - err = -EINVAL; - if (tb[TCA_TCINDEX_FALL_THROUGH]) - cp->fall_through = nla_get_u32(tb[TCA_TCINDEX_FALL_THROUGH]); - - if (!cp->perfect && !cp->h) - cp->alloc_hash = cp->hash; - - /* Note: this could be as restrictive as if (handle & ~(mask >> shift)) - * but then, we'd fail handles that may become valid after some future - * mask change. While this is extremely unlikely to ever matter, - * the check below is safer (and also more backwards-compatible). - */ - if (cp->perfect || valid_perfect_hash(cp)) - if (handle >= cp->alloc_hash) - goto errout_alloc; - - - err = -ENOMEM; - if (!cp->perfect && !cp->h) { - if (valid_perfect_hash(cp)) { - if (tcindex_alloc_perfect_hash(net, cp) < 0) - goto errout_alloc; - balloc = 1; - } else { - struct tcindex_filter __rcu **hash; - - hash = kcalloc(cp->hash, - sizeof(struct tcindex_filter *), - GFP_KERNEL); - - if (!hash) - goto errout_alloc; - - cp->h = hash; - balloc = 2; - } - } - - if (cp->perfect) { - r = cp->perfect + handle; - } else { - /* imperfect area is updated in-place using rcu */ - update_h = !!tcindex_lookup(cp, handle); - r = &new_filter_result; - } - - if (r == &new_filter_result) { - f = kzalloc(sizeof(*f), GFP_KERNEL); - if (!f) - goto errout_alloc; - f->key = handle; - f->next = NULL; - err = tcindex_filter_result_init(&f->result, cp, net); - if (err < 0) { - kfree(f); - goto errout_alloc; - } - } - - if (tb[TCA_TCINDEX_CLASSID]) { - cr.classid = nla_get_u32(tb[TCA_TCINDEX_CLASSID]); - tcf_bind_filter(tp, &cr, base); - } - - if (old_r && old_r != r) { - err = tcindex_filter_result_init(old_r, cp, net); - if (err < 0) { - kfree(f); - goto errout_alloc; - } - } - - oldp = p; - r->res = cr; - tcf_exts_change(&r->exts, &e); - - rcu_assign_pointer(tp->root, cp); - - if (update_h) { - struct tcindex_filter __rcu **fp; - struct tcindex_filter *cf; - - f->result.res = r->res; - tcf_exts_change(&f->result.exts, &r->exts); - - /* imperfect area bucket */ - fp = cp->h + (handle % cp->hash); - - /* lookup the filter, guaranteed to exist */ - for (cf = rcu_dereference_bh_rtnl(*fp); cf; - fp = &cf->next, cf = rcu_dereference_bh_rtnl(*fp)) - if (cf->key == handle) - break; - - f->next = cf->next; - - cf = rcu_replace_pointer(*fp, f, 1); - tcf_exts_get_net(&cf->result.exts); - tcf_queue_work(&cf->rwork, tcindex_destroy_fexts_work); - } else if (r == &new_filter_result) { - struct tcindex_filter *nfp; - struct tcindex_filter __rcu **fp; - - f->result.res = r->res; - tcf_exts_change(&f->result.exts, &r->exts); - - fp = cp->h + (handle % cp->hash); - for (nfp = rtnl_dereference(*fp); - nfp; - fp = &nfp->next, nfp = rtnl_dereference(*fp)) - ; /* nothing */ - - rcu_assign_pointer(*fp, f); - } else { - tcf_exts_destroy(&new_filter_result.exts); - } - - if (oldp) - tcf_queue_work(&oldp->rwork, tcindex_partial_destroy_work); - return 0; - -errout_alloc: - if (balloc == 1) - tcindex_free_perfect_hash(cp); - else if (balloc == 2) - kfree(cp->h); - tcf_exts_destroy(&new_filter_result.exts); -errout: - kfree(cp); - tcf_exts_destroy(&e); - return err; -} - -static int -tcindex_change(struct net *net, struct sk_buff *in_skb, - struct tcf_proto *tp, unsigned long base, u32 handle, - struct nlattr **tca, void **arg, bool ovr, - bool rtnl_held, struct netlink_ext_ack *extack) -{ - struct nlattr *opt = tca[TCA_OPTIONS]; - struct nlattr *tb[TCA_TCINDEX_MAX + 1]; - struct 
tcindex_data *p = rtnl_dereference(tp->root); - struct tcindex_filter_result *r = *arg; - int err; - - pr_debug("tcindex_change(tp %p,handle 0x%08x,tca %p,arg %p),opt %p," - "p %p,r %p,*arg %p\n", - tp, handle, tca, arg, opt, p, r, *arg); - - if (!opt) - return 0; - - err = nla_parse_nested_deprecated(tb, TCA_TCINDEX_MAX, opt, - tcindex_policy, NULL); - if (err < 0) - return err; - - return tcindex_set_parms(net, tp, base, handle, p, r, tb, - tca[TCA_RATE], ovr, extack); -} - -static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker, - bool rtnl_held) -{ - struct tcindex_data *p = rtnl_dereference(tp->root); - struct tcindex_filter *f, *next; - int i; - - pr_debug("tcindex_walk(tp %p,walker %p),p %p\n", tp, walker, p); - if (p->perfect) { - for (i = 0; i < p->hash; i++) { - if (!p->perfect[i].res.class) - continue; - if (walker->count >= walker->skip) { - if (walker->fn(tp, p->perfect + i, walker) < 0) { - walker->stop = 1; - return; - } - } - walker->count++; - } - } - if (!p->h) - return; - for (i = 0; i < p->hash; i++) { - for (f = rtnl_dereference(p->h[i]); f; f = next) { - next = rtnl_dereference(f->next); - if (walker->count >= walker->skip) { - if (walker->fn(tp, &f->result, walker) < 0) { - walker->stop = 1; - return; - } - } - walker->count++; - } - } -} - -static void tcindex_destroy(struct tcf_proto *tp, bool rtnl_held, - struct netlink_ext_ack *extack) -{ - struct tcindex_data *p = rtnl_dereference(tp->root); - int i; - - pr_debug("tcindex_destroy(tp %p),p %p\n", tp, p); - - if (p->perfect) { - for (i = 0; i < p->hash; i++) { - struct tcindex_filter_result *r = p->perfect + i; - - /* tcf_queue_work() does not guarantee the ordering we - * want, so we have to take this refcnt temporarily to - * ensure 'p' is freed after all tcindex_filter_result - * here. Imperfect hash does not need this, because it - * uses linked lists rather than an array. - */ - tcindex_data_get(p); - - tcf_unbind_filter(tp, &r->res); - if (tcf_exts_get_net(&r->exts)) - tcf_queue_work(&r->rwork, - tcindex_destroy_rexts_work); - else - __tcindex_destroy_rexts(r); - } - } - - for (i = 0; p->h && i < p->hash; i++) { - struct tcindex_filter *f, *next; - bool last; - - for (f = rtnl_dereference(p->h[i]); f; f = next) { - next = rtnl_dereference(f->next); - tcindex_delete(tp, &f->result, &last, rtnl_held, NULL); - } - } - - tcf_queue_work(&p->rwork, tcindex_destroy_work); -} - - -static int tcindex_dump(struct net *net, struct tcf_proto *tp, void *fh, - struct sk_buff *skb, struct tcmsg *t, bool rtnl_held) -{ - struct tcindex_data *p = rtnl_dereference(tp->root); - struct tcindex_filter_result *r = fh; - struct nlattr *nest; - - pr_debug("tcindex_dump(tp %p,fh %p,skb %p,t %p),p %p,r %p\n", - tp, fh, skb, t, p, r); - pr_debug("p->perfect %p p->h %p\n", p->perfect, p->h); - - nest = nla_nest_start_noflag(skb, TCA_OPTIONS); - if (nest == NULL) - goto nla_put_failure; - - if (!fh) { - t->tcm_handle = ~0; /* whatever ... 
*/ - if (nla_put_u32(skb, TCA_TCINDEX_HASH, p->hash) || - nla_put_u16(skb, TCA_TCINDEX_MASK, p->mask) || - nla_put_u32(skb, TCA_TCINDEX_SHIFT, p->shift) || - nla_put_u32(skb, TCA_TCINDEX_FALL_THROUGH, p->fall_through)) - goto nla_put_failure; - nla_nest_end(skb, nest); - } else { - if (p->perfect) { - t->tcm_handle = r - p->perfect; - } else { - struct tcindex_filter *f; - struct tcindex_filter __rcu **fp; - int i; - - t->tcm_handle = 0; - for (i = 0; !t->tcm_handle && i < p->hash; i++) { - fp = &p->h[i]; - for (f = rtnl_dereference(*fp); - !t->tcm_handle && f; - fp = &f->next, f = rtnl_dereference(*fp)) { - if (&f->result == r) - t->tcm_handle = f->key; - } - } - } - pr_debug("handle = %d\n", t->tcm_handle); - if (r->res.class && - nla_put_u32(skb, TCA_TCINDEX_CLASSID, r->res.classid)) - goto nla_put_failure; - - if (tcf_exts_dump(skb, &r->exts) < 0) - goto nla_put_failure; - nla_nest_end(skb, nest); - - if (tcf_exts_dump_stats(skb, &r->exts) < 0) - goto nla_put_failure; - } - - return skb->len; - -nla_put_failure: - nla_nest_cancel(skb, nest); - return -1; -} - -static void tcindex_bind_class(void *fh, u32 classid, unsigned long cl, - void *q, unsigned long base) -{ - struct tcindex_filter_result *r = fh; - - if (r && r->res.classid == classid) { - if (cl) - __tcf_bind_filter(q, &r->res, base); - else - __tcf_unbind_filter(q, &r->res); - } -} - -static struct tcf_proto_ops cls_tcindex_ops __read_mostly = { - .kind = "tcindex", - .classify = tcindex_classify, - .init = tcindex_init, - .destroy = tcindex_destroy, - .get = tcindex_get, - .change = tcindex_change, - .delete = tcindex_delete, - .walk = tcindex_walk, - .dump = tcindex_dump, - .bind_class = tcindex_bind_class, - .owner = THIS_MODULE, -}; - -static int __init init_tcindex(void) -{ - return register_tcf_proto_ops(&cls_tcindex_ops); -} - -static void __exit exit_tcindex(void) -{ - unregister_tcf_proto_ops(&cls_tcindex_ops); -} - -module_init(init_tcindex) -module_exit(exit_tcindex) -MODULE_LICENSE("GPL"); diff --git a/net/smc/Makefile b/net/smc/Makefile index 59a4f49f186aa75c8162e36d0945d0930682d1a3..d652a9c13e6b20e30dee6ffe13fee55483e36f56 100644 --- a/net/smc/Makefile +++ b/net/smc/Makefile @@ -5,4 +5,5 @@ obj-$(CONFIG_SMC_DIAG) += smc_diag.o smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o smc-y += smc_tracepoint.o smc_proc.o smc_dim.o +smc-y += smc_inet.o smc-$(CONFIG_SYSCTL) += smc_sysctl.o diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index aec9075684e3983f62491f941e55cad2105854fe..b8d57c7ed895b8b8bd944660f6d912e780375a17 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -35,6 +36,9 @@ #include #include +#include +#include +#include #include "smc_netns.h" #include "smc.h" @@ -53,6 +57,7 @@ #include "smc_tracepoint.h" #include "smc_sysctl.h" #include "smc_proc.h" +#include "smc_inet.h" static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group * creation on server @@ -68,6 +73,12 @@ struct workqueue_struct *smc_close_wq; /* wq for close work */ static void smc_tcp_listen_work(struct work_struct *); static void smc_connect_work(struct work_struct *); +static void smc_inet_sock_state_change(struct sock *sk); +static int smc_inet_sock_do_handshake(struct sock *sk, bool sk_locked, bool sync); + +static void __smc_inet_sock_sort_csk_queue(struct sock *parent, int *tcp_cnt, int *smc_cnt); +static int 
smc_inet_sock_sort_csk_queue(struct sock *parent); + /* default use reserve_mode */ bool reserve_mode = true; module_param(reserve_mode, bool, 0444); @@ -78,6 +89,43 @@ u16 rsvd_ports_base = SMC_IWARP_RSVD_PORTS_BASE; module_param(rsvd_ports_base, ushort, 0444); MODULE_PARM_DESC(rsvd_ports_base, "base of rsvd ports for reserve_mode"); +static int smc_sock_should_select_smc(const struct smc_sock *smc) +{ + const struct smc_sock_negotiator_ops *ops; + int ret; + + rcu_read_lock(); + ops = READ_ONCE(smc->negotiator_ops); + + /* No negotiator_ops supplied or no negotiate func set, + * always pass it. + */ + if (!ops || !ops->negotiate) { + rcu_read_unlock(); + return SK_PASS; + } + + ret = ops->negotiate((struct sock *)&smc->sk); + rcu_read_unlock(); + return ret; +} + +static void smc_sock_perform_collecting_info(const struct smc_sock *smc, int timing) +{ + const struct smc_sock_negotiator_ops *ops; + + rcu_read_lock(); + ops = READ_ONCE(smc->negotiator_ops); + + if (!ops || !ops->collect_info) { + rcu_read_unlock(); + return; + } + + ops->collect_info((struct sock *)&smc->sk, timing); + rcu_read_unlock(); +} + int smc_nl_dump_hs_limitation(struct sk_buff *skb, struct netlink_callback *cb) { struct smc_nl_dmp_ctx *cb_ctx = smc_nl_dmp_ctx(cb); @@ -168,7 +216,8 @@ static struct sock *smc_tcp_syn_recv_sock(const struct sock *sk, static bool smc_hs_congested(const struct sock *sk) { - const struct smc_sock *smc; + struct smc_sock *smc; + int tcp_cnt, smc_cnt; smc = smc_clcsock_user_data(sk); @@ -181,6 +230,23 @@ static bool smc_hs_congested(const struct sock *sk) if (!smc_sock_should_select_smc(smc)) return true; + /* only works for inet sock */ + if (smc_sock_is_inet_sock(&smc->sk)) { + __smc_inet_sock_sort_csk_queue(&smc->sk, &tcp_cnt, &smc_cnt); + smc_cnt += atomic_read(&smc->queued_smc_hs); + if (!smc_inet_sock_is_under_presure(&smc->sk)) { + if (smc_cnt > (sk->sk_max_ack_backlog >> 1)) { + smc_inet_sock_under_presure(&smc->sk); + return true; + } + } else { + if (smc_cnt > (sk->sk_max_ack_backlog >> 2)) + return true; + /* leave the pressure state */ + smc_inet_sock_leave_presure(&smc->sk); + } + } + return false; } @@ -298,16 +364,16 @@ static int __smc_release(struct smc_sock *smc) sock_set_flag(sk, SOCK_DEAD); sk->sk_shutdown |= SHUTDOWN_MASK; } else { - if (sk->sk_state != SMC_CLOSED) { - if (sk->sk_state != SMC_LISTEN && - sk->sk_state != SMC_INIT) + if (smc_sk_state(sk) != SMC_CLOSED) { + if (smc_sk_state(sk) != SMC_LISTEN && + smc_sk_state(sk) != SMC_INIT) sock_put(sk); /* passive closing */ - if (sk->sk_state == SMC_LISTEN) { + if (smc_sk_state(sk) == SMC_LISTEN) { /* wake up clcsock accept */ rc = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR); } - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sk->sk_state_change(sk); } smc_restore_fallback_changes(smc); @@ -315,7 +381,7 @@ static int __smc_release(struct smc_sock *smc) sk->sk_prot->unhash(sk); - if (sk->sk_state == SMC_CLOSED) { + if (smc_sk_state(sk) == SMC_CLOSED) { if (smc->clcsock) { release_sock(sk); smc_clcsock_release(smc); @@ -340,10 +406,10 @@ static int smc_release(struct socket *sock) sock_hold(sk); /* sock_put below */ smc = smc_sk(sk); - /* trigger info gathering if needed.*/ - smc_sock_perform_collecting_info(sk, SMC_SOCK_CLOSED_TIMING); + old_state = smc_sk_state(sk); - old_state = sk->sk_state; + /* trigger info gathering if needed.*/ + smc_sock_perform_collecting_info(smc, SMC_SOCK_CLOSED_TIMING); /* cleanup for a dangling non-blocking connect */ if (smc->connect_nonblock && old_state == SMC_INIT) @@ 
-352,7 +418,7 @@ static int smc_release(struct socket *sock) if (smc->connect_nonblock && cancel_work_sync(&smc->connect_work)) sock_put(&smc->sk); /* sock_hold in smc_connect for passive closing */ - if (sk->sk_state == SMC_LISTEN) + if (smc_sk_state(sk) == SMC_LISTEN) /* smc_close_non_accepted() is called and acquires * sock lock for child sockets again */ @@ -361,7 +427,7 @@ static int smc_release(struct socket *sock) lock_sock(sk); if ((old_state == SMC_INIT || smc->conn.killed) && - sk->sk_state == SMC_ACTIVE && !smc->use_fallback) + smc_sk_state(sk) == SMC_ACTIVE && !smc->use_fallback) smc_close_active_abort(smc); rc = __smc_release(smc); @@ -379,7 +445,12 @@ static int smc_release(struct socket *sock) static void smc_destruct(struct sock *sk) { - if (sk->sk_state != SMC_CLOSED) + if (smc_sk(sk)->original_sk_destruct) + smc_sk(sk)->original_sk_destruct(sk); + + smc_sock_cleanup_negotiator_ops(smc_sk(sk), /* in release */ 1); + + if (smc_sk_state(sk) != SMC_CLOSED) return; if (!sock_flag(sk, SOCK_DEAD)) return; @@ -387,10 +458,59 @@ static void smc_destruct(struct sock *sk) sk_refcnt_debug_dec(sk); } +static inline void smc_sock_init_common(struct sock *sk) +{ + struct smc_sock *smc = smc_sk(sk); + + smc_sk_set_state(sk, SMC_INIT); + INIT_DELAYED_WORK(&smc->conn.tx_work, smc_tx_work); + spin_lock_init(&smc->conn.send_lock); + mutex_init(&smc->clcsock_release_lock); +} + +static void smc_sock_init_passive(struct sock *par, struct sock *sk) +{ + struct smc_sock *parent = smc_sk(par); + struct sock *clcsk; + + smc_sock_init_common(sk); + smc_sk(sk)->listen_smc = parent; + + smc_sock_clone_negotiator_ops(par, sk); + + clcsk = smc_sock_is_inet_sock(sk) ? sk : smc_sk(sk)->clcsock->sk; + + if (tcp_sk(clcsk)->syn_smc) { + smc_sk(sk)->smc_negotiated = 1; + atomic_inc(&parent->queued_smc_hs); + /* memory barrier */ + smp_mb__after_atomic(); + } +} + +static void smc_sock_init(struct sock *sk, struct net *net) +{ + struct smc_sock *smc = smc_sk(sk); + + smc_sock_init_common(sk); + INIT_WORK(&smc->connect_work, smc_connect_work); + INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work); + INIT_LIST_HEAD(&smc->accept_q); + spin_lock_init(&smc->accept_q_lock); + WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem)); + WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem)); + smc->limit_smc_hs = net->smc.limit_smc_hs; + + /* already set (for inet sock), save the original */ + if (sk->sk_destruct) + smc->original_sk_destruct = sk->sk_destruct; + sk_refcnt_debug_inc(sk); + sk->sk_destruct = smc_destruct; +} + static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, int protocol) { - struct smc_sock *smc; struct proto *prot; struct sock *sk; @@ -400,23 +520,9 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, return NULL; sock_init_data(sock, sk); /* sets sk_refcnt to 1 */ - sk->sk_state = SMC_INIT; - sk->sk_destruct = smc_destruct; sk->sk_protocol = protocol; - WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem)); - WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem)); - smc = smc_sk(sk); - INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work); - INIT_WORK(&smc->connect_work, smc_connect_work); - INIT_DELAYED_WORK(&smc->conn.tx_work, smc_tx_work); - INIT_LIST_HEAD(&smc->accept_q); - spin_lock_init(&smc->accept_q_lock); - spin_lock_init(&smc->conn.send_lock); + smc_sock_init(sk, net); sk->sk_prot->hash(sk); - sk_refcnt_debug_inc(sk); - mutex_init(&smc->clcsock_release_lock); - smc_init_saved_callbacks(smc); - return sk; } @@ -449,7 +555,7 @@ 
static int smc_bind(struct socket *sock, struct sockaddr *uaddr, /* Check if socket is already active */ rc = -EINVAL; - if (sk->sk_state != SMC_INIT || smc->connect_nonblock) + if (smc_sk_state(sk) != SMC_INIT || smc->connect_nonblock) goto out_rel; smc->clcsock->sk->sk_reuse = sk->sk_reuse; @@ -465,6 +571,10 @@ static int smc_bind(struct socket *sock, struct sockaddr *uaddr, static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk, unsigned long mask) { + /* no need for inet smc */ + if (smc_sock_is_inet_sock(nsk)) + return; + /* options we don't get control via setsockopt for */ nsk->sk_type = osk->sk_type; nsk->sk_sndbuf = osk->sk_sndbuf; @@ -636,20 +746,22 @@ static int smcr_clnt_conf_first_link(struct smc_sock *smc) smc_llc_link_active(link); smcr_lgr_set_type(link->lgr, SMC_LGR_SINGLE); - /* optional 2nd link, receive ADD LINK request from server */ - qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME, - SMC_LLC_ADD_LINK); - if (!qentry) { - struct smc_clc_msg_decline dclc; - - rc = smc_clc_wait_msg(smc, &dclc, sizeof(dclc), - SMC_CLC_DECLINE, CLC_WAIT_TIME_SHORT); - if (rc == -EAGAIN) - rc = 0; /* no DECLINE received, go with one link */ - return rc; + if (link->lgr->max_links > 1) { + /* optional 2nd link, receive ADD LINK request from server */ + qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME, + SMC_LLC_ADD_LINK); + if (!qentry) { + struct smc_clc_msg_decline dclc; + + rc = smc_clc_wait_msg(smc, &dclc, sizeof(dclc), + SMC_CLC_DECLINE, CLC_WAIT_TIME_SHORT); + if (rc == -EAGAIN) + rc = 0; /* no DECLINE received, go with one link */ + return rc; + } + smc_llc_flow_qentry_clr(&link->lgr->llc_flow_lcl); + smc_llc_cli_add_link(link, qentry); } - smc_llc_flow_qentry_clr(&link->lgr->llc_flow_lcl); - smc_llc_cli_add_link(link, qentry); return 0; } @@ -742,6 +854,16 @@ static void smc_link_save_peer_info(struct smc_link *link, memcpy(link->peer_mac, ini->peer_mac, sizeof(link->peer_mac)); link->peer_psn = ntoh24(clc->r0.psn); link->peer_mtu = clc->r0.qp_mtu; + link->credits_enable = (ini->vendor_opt_valid && ini->credits_en && + clc->r0.init_credits) ? 1 : 0; + if (link->credits_enable) { + atomic_set(&link->peer_rq_credits, clc->r0.init_credits); + /* set peer rq credits watermark, if less than init_credits * 2/3, + * then credit announcement is needed. 
+ */ + link->peer_cr_watermark_low = + max(clc->r0.init_credits * 2 / 3, 1); + } } static void smc_stat_inc_fback_rsn_cnt(struct smc_sock *smc, @@ -906,16 +1028,29 @@ static int smc_switch_to_fallback(struct smc_sock *smc, int reason_code) { int rc = 0; + /* no need protected by clcsock_release_lock, move head */ + smc->use_fallback = true; + smc->fallback_rsn = reason_code; + smc_stat_fallback(smc); + trace_smc_switch_to_fallback(smc, reason_code); + + /* inet sock */ + if (smc_sock_is_inet_sock(&smc->sk)) { + write_lock_bh(&smc->sk.sk_callback_lock); + smc_inet_sock_switch_negotiation_state_locked(&smc->sk, + isck_smc_negotiation_load(smc), + SMC_NEGOTIATION_NO_SMC); + write_unlock_bh(&smc->sk.sk_callback_lock); + return 0; + } + + /* smc sock */ mutex_lock(&smc->clcsock_release_lock); if (!smc->clcsock) { rc = -EBADF; goto out; } - smc->use_fallback = true; - smc->fallback_rsn = reason_code; - smc_stat_fallback(smc); - trace_smc_switch_to_fallback(smc, reason_code); if (smc->sk.sk_socket && smc->sk.sk_socket->file) { smc->clcsock->file = smc->sk.sk_socket->file; smc->clcsock->file->private_data = smc->clcsock; @@ -942,14 +1077,14 @@ static int smc_connect_fallback(struct smc_sock *smc, int reason_code) rc = smc_switch_to_fallback(smc, reason_code); if (rc) { /* fallback fails */ this_cpu_inc(net->smc.smc_stats->clnt_hshake_err_cnt); - if (smc->sk.sk_state == SMC_INIT) + if (smc_sk_state(&smc->sk) == SMC_INIT && !smc_sock_is_inet_sock(&smc->sk)) sock_put(&smc->sk); /* passive closing */ return rc; } smc_copy_sock_settings_to_clc(smc); smc->connect_nonblock = 0; - if (smc->sk.sk_state == SMC_INIT) - smc->sk.sk_state = SMC_ACTIVE; + if (smc_sk_state(&smc->sk) == SMC_INIT) + smc_sk_set_state(&smc->sk, SMC_ACTIVE); return 0; } @@ -962,15 +1097,16 @@ static int smc_connect_decline_fallback(struct smc_sock *smc, int reason_code, if (reason_code < 0) { /* error, fallback is not possible */ this_cpu_inc(net->smc.smc_stats->clnt_hshake_err_cnt); - if (smc->sk.sk_state == SMC_INIT) + if (smc_sk_state(&smc->sk) == SMC_INIT && !smc_sock_is_inet_sock(&smc->sk)) sock_put(&smc->sk); /* passive closing */ return reason_code; } + if (reason_code != SMC_CLC_DECL_PEERDECL) { rc = smc_clc_send_decline(smc, reason_code, version); if (rc < 0) { this_cpu_inc(net->smc.smc_stats->clnt_hshake_err_cnt); - if (smc->sk.sk_state == SMC_INIT) + if (smc_sk_state(&smc->sk) == SMC_INIT && !smc_sock_is_inet_sock(&smc->sk)) sock_put(&smc->sk); /* passive closing */ return rc; } @@ -1099,6 +1235,20 @@ static int smc_find_proposal_devices(struct smc_sock *smc, ini->smcr_version &= ~SMC_V1; /* else RDMA is supported for this connection */ + /* make sure SMC_V1 ibdev still available */ + if (ini->smcr_version & SMC_V1) { + mutex_lock(&smc_ib_devices.mutex); + if (list_empty(&ini->ib_dev->list)) { + ini->ib_dev = NULL; + ini->ib_port = 0; + ini->smcr_version &= ~SMC_V1; + } else { + /* put in __smc_connect */ + smc_ib_get_pending_device(ini->ib_dev); + } + mutex_unlock(&smc_ib_devices.mutex); + } + ini->smc_type_v1 = smc_indicated_type(ini->smcd_version & SMC_V1, ini->smcr_version & SMC_V1); @@ -1118,6 +1268,20 @@ static int smc_find_proposal_devices(struct smc_sock *smc, ini->smcr_version &= ~SMC_V2; ini->check_smcrv2 = false; + /* make sure SMC_V2 ibdev still available */ + if (ini->smcr_version & SMC_V2) { + mutex_lock(&smc_ib_devices.mutex); + if (list_empty(&ini->smcrv2.ib_dev_v2->list)) { + ini->smcrv2.ib_dev_v2 = NULL; + ini->smcrv2.ib_port_v2 = 0; + ini->smcr_version &= ~SMC_V2; + } else { + /* put in __smc_connect */ + 
smc_ib_get_pending_device(ini->smcrv2.ib_dev_v2); + } + mutex_unlock(&smc_ib_devices.mutex); + } + ini->smc_type_v2 = smc_indicated_type(ini->smcd_version & SMC_V2, ini->smcr_version & SMC_V2); @@ -1143,7 +1307,7 @@ static int smc_connect_ism_vlan_cleanup(struct smc_sock *smc, #define SMC_CLC_MAX_ACCEPT_LEN \ (sizeof(struct smc_clc_msg_accept_confirm_v2) + \ - sizeof(struct smc_clc_first_contact_ext) + \ + sizeof(struct smc_clc_first_contact_ext_v2x) + \ sizeof(struct smc_clc_msg_trail)) /* CLC handshake during connect */ @@ -1199,6 +1363,7 @@ static int smc_connect_rdma_v2_prepare(struct smc_sock *smc, struct smc_clc_first_contact_ext *fce = (struct smc_clc_first_contact_ext *) (((u8 *)clc_v2) + sizeof(*clc_v2)); + int rc; if (!ini->first_contact_peer || aclc->hdr.version == SMC_V1) return 0; @@ -1217,6 +1382,14 @@ static int smc_connect_rdma_v2_prepare(struct smc_sock *smc, return SMC_CLC_DECL_NOINDIRECT; } } + + if (fce->release > SMC_RELEASE) + return SMC_CLC_DECL_VERSMISMAT; + ini->release_ver = fce->release; + rc = smc_clc_cli_v2x_features_validate(smc, fce, ini); + if (rc) + return rc; + return 0; } @@ -1235,6 +1408,8 @@ static int smc_connect_rdma(struct smc_sock *smc, memcpy(ini->peer_systemid, aclc->r0.lcl.id_for_peer, SMC_SYSTEMID_LEN); memcpy(ini->peer_gid, aclc->r0.lcl.gid, SMC_GID_SIZE); memcpy(ini->peer_mac, aclc->r0.lcl.mac, ETH_ALEN); + ini->max_conns = SMC_RMBS_PER_LGR_MAX; + ini->max_links = SMC_LINKS_ADD_LNK_MAX; reason_code = smc_connect_rdma_v2_prepare(smc, aclc, ini); if (reason_code) @@ -1297,6 +1472,11 @@ static int smc_connect_rdma(struct smc_sock *smc, goto connect_abort; } } else { + if (smc_llc_announce_credits(link, SMC_LLC_RESP, true)) { + reason_code = SMC_CLC_DECL_CREDITSERR; + goto connect_abort; + } + /* reg sendbufs if they were vzalloced */ if (smc->conn.sndbuf_desc->is_vm) { if (smcr_lgr_reg_sndbufs(link, smc->conn.sndbuf_desc)) { @@ -1339,8 +1519,8 @@ static int smc_connect_rdma(struct smc_sock *smc, smc_copy_sock_settings_to_clc(smc); smc->connect_nonblock = 0; - if (smc->sk.sk_state == SMC_INIT) - smc->sk.sk_state = SMC_ACTIVE; + if (smc_sk_state(&smc->sk) == SMC_INIT) + smc_sk_set_state(&smc->sk, SMC_ACTIVE); return 0; connect_abort: @@ -1385,6 +1565,18 @@ static int smc_connect_ism(struct smc_sock *smc, struct smc_clc_msg_accept_confirm_v2 *aclc_v2 = (struct smc_clc_msg_accept_confirm_v2 *)aclc; + if (ini->first_contact_peer) { + struct smc_clc_first_contact_ext *fce = + smc_get_clc_first_contact_ext(aclc_v2, true); + + if (fce->release > SMC_RELEASE) + return SMC_CLC_DECL_VERSMISMAT; + ini->release_ver = fce->release; + rc = smc_clc_cli_v2x_features_validate(smc, fce, ini); + if (rc) + return rc; + } + rc = smc_v2_determine_accepted_chid(aclc_v2, ini); if (rc) return rc; @@ -1419,15 +1611,15 @@ static int smc_connect_ism(struct smc_sock *smc, } rc = smc_clc_send_confirm(smc, ini->first_contact_local, - aclc->hdr.version, eid, NULL); + aclc->hdr.version, eid, ini); if (rc) goto connect_abort; mutex_unlock(&smc_server_lgr_pending); smc_copy_sock_settings_to_clc(smc); smc->connect_nonblock = 0; - if (smc->sk.sk_state == SMC_INIT) - smc->sk.sk_state = SMC_ACTIVE; + if (smc_sk_state(&smc->sk) == SMC_INIT) + smc_sk_set_state(&smc->sk, SMC_ACTIVE); return 0; connect_abort: @@ -1546,6 +1738,10 @@ static int __smc_connect(struct smc_sock *smc) if (rc) goto vlan_cleanup; + if (ini->smcrv2.ib_dev_v2) + smc_ib_put_pending_device(ini->smcrv2.ib_dev_v2); + if (ini->ib_dev) + smc_ib_put_pending_device(ini->ib_dev); 
SMC_STAT_CLNT_SUCC_INC(sock_net(smc->clcsock->sk), aclc); smc_connect_ism_vlan_cleanup(smc, ini); kfree(buf); @@ -1556,6 +1752,10 @@ static int __smc_connect(struct smc_sock *smc) smc_connect_ism_vlan_cleanup(smc, ini); kfree(buf); fallback: + if (ini->smcrv2.ib_dev_v2) + smc_ib_put_pending_device(ini->smcrv2.ib_dev_v2); + if (ini->ib_dev) + smc_ib_put_pending_device(ini->ib_dev); kfree(ini); return smc_connect_decline_fallback(smc, rc, version); } @@ -1583,7 +1783,7 @@ static void smc_connect_work(struct work_struct *work) release_sock(smc->clcsock->sk); lock_sock(&smc->sk); if (rc != 0 || smc->sk.sk_err) { - smc->sk.sk_state = SMC_CLOSED; + smc_sk_set_state(&smc->sk, SMC_CLOSED); if (rc == -EPIPE || rc == -EAGAIN) smc->sk.sk_err = EPIPE; else if (signal_pending(current)) @@ -1614,6 +1814,7 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, struct sock *sk = sock->sk; struct smc_sock *smc; int rc = -EINVAL; + int cur; smc = smc_sk(sk); @@ -1629,10 +1830,10 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, rc = -EINVAL; goto out; case SS_CONNECTED: - rc = sk->sk_state == SMC_ACTIVE ? -EISCONN : -EINVAL; + rc = smc_sk_state(sk) == SMC_ACTIVE ? -EISCONN : -EINVAL; goto out; case SS_CONNECTING: - if (sk->sk_state == SMC_ACTIVE) + if (smc_sk_state(sk) == SMC_ACTIVE) goto connected; break; case SS_UNCONNECTED: @@ -1640,7 +1841,7 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, break; } - switch (sk->sk_state) { + switch (smc_sk_state(sk)) { default: goto out; case SMC_CLOSED: @@ -1659,19 +1860,40 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, rc = -EBADF; goto out; } - smc_copy_sock_settings_to_clc(smc); - /* accept out connection as SMC connection */ - if (smc_sock_should_select_smc(smc) == SK_PASS) { - tcp_sk(smc->clcsock->sk)->syn_smc = 1; - } else { - tcp_sk(smc->clcsock->sk)->syn_smc = 0; - smc_switch_to_fallback(smc, /* just a chooice */ 0); - } if (smc->connect_nonblock) { rc = -EALREADY; goto out; } + + smc_copy_sock_settings_to_clc(smc); + + if (smc_sock_should_select_smc(smc) != SK_PASS) { + tcp_sk(smc->clcsock->sk)->syn_smc = 0; + smc_switch_to_fallback(smc, /* active fallback */ SMC_CLC_DECL_ACTIVE); + goto do_tcp_connect; + } + + if (smc_sock_is_inet_sock(sk)) { + if (smc_inet_sock_set_syn_smc(sk)) { + if (flags & O_NONBLOCK) { + smc->connect_nonblock = 1; + /* To ensure that userspace will not be awakened by TCP sock events + * before the SMC handshake is completed or totally falls back/fails. + */ + sk->sk_wq = &smc->accompany_socket.wq; + smc_clcsock_replace_cb(&sk->sk_state_change, + smc_inet_sock_state_change, + &smc->clcsk_state_change); + } + } else { + smc_switch_to_fallback(smc, SMC_CLC_DECL_ACTIVE); + } + } else { + tcp_sk(smc->clcsock->sk)->syn_smc = 1; + } + +do_tcp_connect: rc = kernel_connect(smc->clcsock, addr, alen, flags); if (rc && rc != -EINPROGRESS) goto out; @@ -1680,6 +1902,26 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, sock->state = rc ? SS_CONNECTING : SS_CONNECTED; goto out; } + + /* for inet sock */ + if (smc_sock_is_inet_sock(sk)) { + if (flags & O_NONBLOCK) { + rc = -EINPROGRESS; + } else { + rc = 0; + cur = smc_inet_sock_switch_negotiation_state(sk, SMC_NEGOTIATION_TBD, + tcp_sk(sk)->syn_smc ? 
+ SMC_NEGOTIATION_PREPARE_SMC : + SMC_NEGOTIATION_NO_SMC); + if (cur == SMC_NEGOTIATION_PREPARE_SMC) + rc = smc_inet_sock_do_handshake(sk, /* sk_locked */ true, + true); + if (rc) + goto connected; + } + goto out; + } + sock_hold(&smc->sk); /* sock put in passive closing */ if (flags & O_NONBLOCK) { @@ -1727,12 +1969,12 @@ static int smc_clcsock_accept(struct smc_sock *lsmc, struct smc_sock **new_smc) lock_sock(lsk); if (rc < 0 && rc != -EAGAIN) lsk->sk_err = -rc; - if (rc < 0 || lsk->sk_state == SMC_CLOSED) { + if (rc < 0 || smc_sk_state(lsk) == SMC_CLOSED) { new_sk->sk_prot->unhash(new_sk); mutex_lock(&lsmc->clcsock_release_lock); if (new_clcsock) sock_release(new_clcsock); - new_sk->sk_state = SMC_CLOSED; + smc_sk_set_state(new_sk, SMC_CLOSED); sock_set_flag(new_sk, SOCK_DEAD); mutex_unlock(&lsmc->clcsock_release_lock); sock_put(new_sk); /* final */ @@ -1772,7 +2014,8 @@ static void smc_accept_enqueue(struct sock *parent, struct sock *sk) sock_hold(sk); /* sock_put in smc_accept_unlink () */ spin_lock(&par->accept_q_lock); list_add_tail(&smc_sk(sk)->accept_q, &par->accept_q); - sk_acceptq_added(parent); + if (!smc_sock_is_inet_sock(sk)) + sk_acceptq_added(parent); spin_unlock(&par->accept_q_lock); } @@ -1783,7 +2026,8 @@ static void smc_accept_unlink(struct sock *sk) spin_lock(&par->accept_q_lock); list_del_init(&smc_sk(sk)->accept_q); - sk_acceptq_removed(&smc_sk(sk)->listen_smc->sk); + if (!smc_sock_is_inet_sock(sk)) + sk_acceptq_removed(&smc_sk(sk)->listen_smc->sk); spin_unlock(&par->accept_q_lock); sock_put(sk); /* sock_hold in smc_accept_enqueue */ } @@ -1806,7 +2050,11 @@ struct sock *smc_accept_dequeue(struct sock *parent, new_sk = (struct sock *)isk; smc_accept_unlink(new_sk); - if (new_sk->sk_state == SMC_CLOSED) { + if (smc_sk_state(new_sk) == SMC_CLOSED) { + if (smc_sock_is_inet_sock(parent)) { + tcp_close(new_sk, 0); + continue; + } new_sk->sk_prot->unhash(new_sk); if (isk->clcsock) { sock_release(isk->clcsock); @@ -1835,13 +2083,26 @@ void smc_close_non_accepted(struct sock *sk) sock_hold(sk); /* sock_put below */ lock_sock(sk); - if (!sk->sk_lingertime) - /* wait for peer closing */ - sk->sk_lingertime = SMC_MAX_STREAM_WAIT_TIMEOUT; - __smc_release(smc); + if (smc_sock_is_inet_sock(sk)) { + if (!smc_inet_sock_check_fallback(sk) && smc_sk_state(sk) != SMC_CLOSED) { + smc_close_active(smc); + sock_set_flag(sk, SOCK_DEAD); + if (smc_sk_state(sk) == SMC_CLOSED) + smc_conn_free(&smc->conn); + } + } else { + if (!sk->sk_lingertime) + /* wait for peer closing */ + sk->sk_lingertime = SMC_MAX_STREAM_WAIT_TIMEOUT; + __smc_release(smc); + } release_sock(sk); sock_put(sk); /* sock_hold above */ - sock_put(sk); /* final sock_put */ + + if (smc_sock_is_inet_sock(sk)) + tcp_close(sk, 0); + else + sock_put(sk); /* final sock_put */ } static int smcr_serv_conf_first_link(struct smc_sock *smc) @@ -1887,10 +2148,12 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc) smc_llc_link_active(link); smcr_lgr_set_type(link->lgr, SMC_LGR_SINGLE); - down_write(&link->lgr->llc_conf_mutex); - /* initial contact - try to establish second link */ - smc_llc_srv_add_link(link, NULL); - up_write(&link->lgr->llc_conf_mutex); + if (link->lgr->max_links > 1) { + down_write(&link->lgr->llc_conf_mutex); + /* initial contact - try to establish second link */ + smc_llc_srv_add_link(link, NULL); + up_write(&link->lgr->llc_conf_mutex); + } return 0; } @@ -1903,7 +2166,12 @@ static void smc_listen_out(struct smc_sock *new_smc) if (new_smc->smc_negotiated) atomic_dec(&lsmc->queued_smc_hs); - if 
(lsmc->sk.sk_state == SMC_LISTEN) { + if (smc_sock_is_inet_sock(newsmcsk)) + smc_inet_sock_switch_negotiation_state(newsmcsk, + SMC_NEGOTIATION_PREPARE_SMC, + SMC_NEGOTIATION_SMC); + + if (smc_sk_state(&lsmc->sk) == SMC_LISTEN) { lock_sock_nested(&lsmc->sk, SINGLE_DEPTH_NESTING); smc_accept_enqueue(&lsmc->sk, newsmcsk); release_sock(&lsmc->sk); @@ -1921,8 +2189,8 @@ static void smc_listen_out_connected(struct smc_sock *new_smc) { struct sock *newsmcsk = &new_smc->sk; - if (newsmcsk->sk_state == SMC_INIT) - newsmcsk->sk_state = SMC_ACTIVE; + if (smc_sk_state(newsmcsk) == SMC_INIT) + smc_sk_set_state(newsmcsk, SMC_ACTIVE); smc_listen_out(new_smc); } @@ -1934,9 +2202,9 @@ static void smc_listen_out_err(struct smc_sock *new_smc) struct net *net = sock_net(newsmcsk); this_cpu_inc(net->smc.smc_stats->srv_hshake_err_cnt); - if (newsmcsk->sk_state == SMC_INIT) + if (smc_sk_state(newsmcsk) == SMC_INIT) sock_put(&new_smc->sk); /* passive closing */ - newsmcsk->sk_state = SMC_CLOSED; + smc_sk_set_state(newsmcsk, SMC_CLOSED); smc_listen_out(new_smc); } @@ -2013,6 +2281,10 @@ static int smc_listen_v2_check(struct smc_sock *new_smc, } } + ini->release_ver = pclc_v2_ext->hdr.flag.release; + if (pclc_v2_ext->hdr.flag.release > SMC_RELEASE) + ini->release_ver = SMC_RELEASE; + out: if (!ini->smcd_version && !ini->smcr_version) return rc; @@ -2048,8 +2320,10 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc, return rc; /* create send buffer and rmb */ - if (smc_buf_create(new_smc, false)) + if (smc_buf_create(new_smc, false)) { + smc_conn_abort(new_smc, ini->first_contact_local); return SMC_CLC_DECL_MEM; + } return 0; } @@ -2259,18 +2533,33 @@ static void smc_find_rdma_v2_device_serv(struct smc_sock *new_smc, smc_find_ism_store_rc(rc, ini); goto not_found; } + /* make sure SMC_V2 ibdev still available */ + mutex_lock(&smc_ib_devices.mutex); + if (list_empty(&ini->smcrv2.ib_dev_v2->list)) { + smc_find_ism_store_rc(SMC_CLC_DECL_NOSMCRDEV, ini); + goto not_found; + } else { + /* put below or in smc_listen_work */ + smc_ib_get_pending_device(ini->smcrv2.ib_dev_v2); + } + mutex_unlock(&smc_ib_devices.mutex); + if (!ini->smcrv2.uses_gateway) memcpy(ini->smcrv2.nexthop_mac, pclc->lcl.mac, ETH_ALEN); smcr_version = ini->smcr_version; ini->smcr_version = SMC_V2; rc = smc_listen_rdma_init(new_smc, ini); - if (!rc) + if (!rc) { rc = smc_listen_rdma_reg(new_smc, ini->first_contact_local); + if (rc) + smc_conn_abort(new_smc, ini->first_contact_local); + } if (!rc) return; ini->smcr_version = smcr_version; smc_find_ism_store_rc(rc, ini); + smc_ib_put_pending_device(ini->smcrv2.ib_dev_v2); not_found: ini->smcr_version &= ~SMC_V2; @@ -2296,6 +2585,18 @@ static int smc_find_rdma_v1_device_serv(struct smc_sock *new_smc, /* no RDMA device found */ return SMC_CLC_DECL_NOSMCDEV; } + /* make sure SMC_V1 ibdev still available */ + mutex_lock(&smc_ib_devices.mutex); + if (list_empty(&ini->ib_dev->list)) { + ini->ib_dev = NULL; + ini->ib_port = 0; + mutex_unlock(&smc_ib_devices.mutex); + return SMC_CLC_DECL_NOSMCDEV; + } + /* put in smc_listen_work */ + smc_ib_get_pending_device(ini->ib_dev); + mutex_unlock(&smc_ib_devices.mutex); + rc = smc_listen_rdma_init(new_smc, ini); if (rc) return rc; @@ -2307,7 +2608,6 @@ static int smc_listen_find_device(struct smc_sock *new_smc, struct smc_clc_msg_proposal *pclc, struct smc_init_info *ini) { - struct net *net = sock_net(&new_smc->sk); int prfx_rc; /* check for ISM device matching V2 proposed device */ @@ -2315,12 +2615,10 @@ static int smc_listen_find_device(struct smc_sock 
*new_smc, if (ini->ism_dev[0]) return 0; - if (!net->smc.sysctl_allow_different_subnet) { - /* check for matching IP prefix and subnet length (V1) */ - prfx_rc = smc_listen_prfx_check(new_smc, pclc); - if (prfx_rc) - smc_find_ism_store_rc(prfx_rc, ini); - } + /* check for matching IP prefix and subnet length (V1) */ + prfx_rc = smc_listen_prfx_check(new_smc, pclc); + if (prfx_rc) + smc_find_ism_store_rc(prfx_rc, ini); /* get vlan id from IP device */ if (smc_vlan_by_tcpsk(new_smc->clcsock, ini)) @@ -2393,7 +2691,7 @@ static void smc_listen_work(struct work_struct *work) u8 accept_version; int rc = 0; - if (new_smc->listen_smc->sk.sk_state != SMC_LISTEN) + if (smc_sk_state(&new_smc->listen_smc->sk) != SMC_LISTEN) return smc_listen_out_err(new_smc); if (new_smc->use_fallback) { @@ -2445,6 +2743,10 @@ static void smc_listen_work(struct work_struct *work) if (rc) goto out_decl; + rc = smc_clc_srv_v2x_features_validate(new_smc, pclc, ini); + if (rc) + goto out_decl; + mutex_lock(&smc_server_lgr_pending); smc_close_init(new_smc); smc_rx_init(new_smc); @@ -2458,7 +2760,7 @@ static void smc_listen_work(struct work_struct *work) /* send SMC Accept CLC message */ accept_version = ini->is_smcd ? ini->smcd_version : ini->smcr_version; rc = smc_clc_send_accept(new_smc, ini->first_contact_local, - accept_version, ini->negotiated_eid); + accept_version, ini->negotiated_eid, ini); if (rc) goto out_unlock; @@ -2477,6 +2779,18 @@ static void smc_listen_work(struct work_struct *work) goto out_decl; } + rc = smc_clc_v2x_features_confirm_check(cclc, ini); + if (rc) { + if (!ini->is_smcd) + goto out_unlock; + goto out_decl; + } + + /* fce smc release version is needed in smc_listen_rdma_finish, + * so save fce info here. + */ + smc_conn_save_peer_info_fce(new_smc, cclc); + /* finish worker */ if (!ini->is_smcd) { rc = smc_listen_rdma_finish(new_smc, cclc, @@ -2485,6 +2799,10 @@ static void smc_listen_work(struct work_struct *work) goto out_unlock; mutex_unlock(&smc_server_lgr_pending); } + if (ini->smcrv2.ib_dev_v2) + smc_ib_put_pending_device(ini->smcrv2.ib_dev_v2); + if (ini->ib_dev) + smc_ib_put_pending_device(ini->ib_dev); smc_conn_save_peer_info(new_smc, cclc); smc_listen_out_connected(new_smc); if (newclcsock->sk) @@ -2492,6 +2810,10 @@ static void smc_listen_work(struct work_struct *work) goto out_free; out_unlock: + if (ini->smcrv2.ib_dev_v2) + smc_ib_put_pending_device(ini->smcrv2.ib_dev_v2); + if (ini->ib_dev) + smc_ib_put_pending_device(ini->ib_dev); mutex_unlock(&smc_server_lgr_pending); out_decl: smc_listen_decline(new_smc, rc, ini ? 
ini->first_contact_local : 0, @@ -2510,21 +2832,15 @@ static void smc_tcp_listen_work(struct work_struct *work) int rc = 0; lock_sock(lsk); - while (lsk->sk_state == SMC_LISTEN) { + while (smc_sk_state(lsk) == SMC_LISTEN) { rc = smc_clcsock_accept(lsmc, &new_smc); if (rc) /* clcsock accept queue empty or error */ goto out; if (!new_smc) continue; - if (tcp_sk(new_smc->clcsock->sk)->syn_smc) { - new_smc->smc_negotiated = 1; - atomic_inc(&lsmc->queued_smc_hs); - /* memory barrier */ - smp_mb__after_atomic(); - } + smc_sock_init_passive(lsk, &new_smc->sk); - new_smc->listen_smc = lsmc; new_smc->use_fallback = lsmc->use_fallback; new_smc->fallback_rsn = lsmc->fallback_rsn; sock_hold(lsk); /* sock_put in smc_listen_work */ @@ -2551,7 +2867,7 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock) if (!lsmc) goto out; lsmc->clcsk_data_ready(listen_clcsock); - if (lsmc->sk.sk_state == SMC_LISTEN) { + if (smc_sk_state(&lsmc->sk) == SMC_LISTEN) { sock_hold(&lsmc->sk); /* sock_put in smc_tcp_listen_work() */ if (!queue_work(smc_tcp_ls_wq, &lsmc->tcp_listen_work)) sock_put(&lsmc->sk); @@ -2560,6 +2876,36 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock) read_unlock_bh(&listen_clcsock->sk_callback_lock); } +static inline void smc_init_listen(struct smc_sock *smc) +{ + struct sock *clcsk; + + clcsk = smc_sock_is_inet_sock(&smc->sk) ? &smc->sk : smc->clcsock->sk; + + /* save original sk_data_ready function and establish + * smc-specific sk_data_ready function + */ + write_lock_bh(&clcsk->sk_callback_lock); + clcsk->sk_user_data = + (void *)((uintptr_t)smc | SK_USER_DATA_NOCOPY); + smc_clcsock_replace_cb(&clcsk->sk_data_ready, + smc_clcsock_data_ready, &smc->clcsk_data_ready); + write_unlock_bh(&clcsk->sk_callback_lock); + + if (!smc_sock_is_inet_sock(&smc->sk)) { + /* save original ops */ + smc->ori_af_ops = inet_csk(clcsk)->icsk_af_ops; + + smc->af_ops = *smc->ori_af_ops; + smc->af_ops.syn_recv_sock = smc_tcp_syn_recv_sock; + + inet_csk(clcsk)->icsk_af_ops = &smc->af_ops; + } + + if (smc->limit_smc_hs) + tcp_sk(clcsk)->smc_hs_congested = smc_hs_congested; +} + static int smc_listen(struct socket *sock, int backlog) { struct sock *sk = sock->sk; @@ -2570,12 +2916,12 @@ static int smc_listen(struct socket *sock, int backlog) lock_sock(sk); rc = -EINVAL; - if ((sk->sk_state != SMC_INIT && sk->sk_state != SMC_LISTEN) || + if ((smc_sk_state(sk) != SMC_INIT && smc_sk_state(sk) != SMC_LISTEN) || smc->connect_nonblock || sock->state != SS_UNCONNECTED) goto out; rc = 0; - if (sk->sk_state == SMC_LISTEN) { + if (smc_sk_state(sk) == SMC_LISTEN) { sk->sk_max_ack_backlog = backlog; goto out; } @@ -2586,26 +2932,7 @@ static int smc_listen(struct socket *sock, int backlog) if (!smc->use_fallback) tcp_sk(smc->clcsock->sk)->syn_smc = 1; - /* save original sk_data_ready function and establish - * smc-specific sk_data_ready function - */ - write_lock_bh(&smc->clcsock->sk->sk_callback_lock); - smc->clcsock->sk->sk_user_data = - (void *)((uintptr_t)smc | SK_USER_DATA_NOCOPY); - smc_clcsock_replace_cb(&smc->clcsock->sk->sk_data_ready, - smc_clcsock_data_ready, &smc->clcsk_data_ready); - write_unlock_bh(&smc->clcsock->sk->sk_callback_lock); - - /* save original ops */ - smc->ori_af_ops = inet_csk(smc->clcsock->sk)->icsk_af_ops; - - smc->af_ops = *smc->ori_af_ops; - smc->af_ops.syn_recv_sock = smc_tcp_syn_recv_sock; - - inet_csk(smc->clcsock->sk)->icsk_af_ops = &smc->af_ops; - - if (smc->limit_smc_hs) - tcp_sk(smc->clcsock->sk)->smc_hs_congested = smc_hs_congested; + smc_init_listen(smc); rc = 
kernel_listen(smc->clcsock, backlog); if (rc) { @@ -2618,27 +2945,26 @@ static int smc_listen(struct socket *sock, int backlog) } sk->sk_max_ack_backlog = backlog; sk->sk_ack_backlog = 0; - sk->sk_state = SMC_LISTEN; + smc_sk_set_state(sk, SMC_LISTEN); out: release_sock(sk); return rc; } -static int smc_accept(struct socket *sock, struct socket *new_sock, - int flags, bool kern) +static struct sock *__smc_accept(struct sock *sk, struct socket *new_sock, + int flags, int *err, bool kern) { - struct sock *sk = sock->sk, *nsk; DECLARE_WAITQUEUE(wait, current); + struct sock *nsk = NULL; struct smc_sock *lsmc; long timeo; int rc = 0; lsmc = smc_sk(sk); - sock_hold(sk); /* sock_put below */ lock_sock(sk); - if (lsmc->sk.sk_state != SMC_LISTEN) { + if (smc_sk_state(&lsmc->sk) != SMC_LISTEN) { rc = -EINVAL; release_sock(sk); goto out; @@ -2691,8 +3017,21 @@ static int smc_accept(struct socket *sock, struct socket *new_sock, } out: - sock_put(sk); /* sock_hold above */ - return rc; + *err = rc; + return nsk; +} + +static int smc_accept(struct socket *sock, struct socket *new_sock, + int flags, bool kern) +{ + struct sock *sk = sock->sk; + int error; + + sock_hold(sk); + __smc_accept(sk, new_sock, flags, &error, kern); + sock_put(sk); + + return error; } static int smc_getname(struct socket *sock, struct sockaddr *addr, @@ -2700,8 +3039,8 @@ static int smc_getname(struct socket *sock, struct sockaddr *addr, { struct smc_sock *smc; - if (peer && (sock->sk->sk_state != SMC_ACTIVE) && - (sock->sk->sk_state != SMC_APPCLOSEWAIT1)) + if (peer && (smc_sk_state(sock->sk) != SMC_ACTIVE) && + (smc_sk_state(sock->sk) != SMC_APPCLOSEWAIT1)) return -ENOTCONN; smc = smc_sk(sock->sk); @@ -2713,17 +3052,15 @@ static int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; struct smc_sock *smc; - int rc = -EPIPE; + int rc; smc = smc_sk(sk); lock_sock(sk); - if ((sk->sk_state != SMC_ACTIVE) && - (sk->sk_state != SMC_APPCLOSEWAIT1) && - (sk->sk_state != SMC_INIT)) - goto out; + /* SMC does not support connect with fastopen */ if (msg->msg_flags & MSG_FASTOPEN) { - if (sk->sk_state == SMC_INIT && !smc->connect_nonblock) { + /* not connected yet, fallback */ + if (smc_sk_state(sk) == SMC_INIT && !smc->connect_nonblock) { rc = smc_switch_to_fallback(smc, SMC_CLC_DECL_OPTUNSUPP); if (rc) goto out; @@ -2731,6 +3068,11 @@ static int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) rc = -EINVAL; goto out; } + } else if ((smc_sk_state(sk) != SMC_ACTIVE) && + (smc_sk_state(sk) != SMC_APPCLOSEWAIT1) && + (smc_sk_state(sk) != SMC_INIT)) { + rc = -EPIPE; + goto out; } if (smc->use_fallback) { @@ -2753,17 +3095,17 @@ static int smc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, smc = smc_sk(sk); lock_sock(sk); - if (sk->sk_state == SMC_CLOSED && (sk->sk_shutdown & RCV_SHUTDOWN)) { + if (smc_sk_state(sk) == SMC_CLOSED && (sk->sk_shutdown & RCV_SHUTDOWN)) { /* socket was connected before, no more data to read */ rc = 0; goto out; } - if ((sk->sk_state == SMC_INIT) || - (sk->sk_state == SMC_LISTEN) || - (sk->sk_state == SMC_CLOSED)) + if ((smc_sk_state(sk) == SMC_INIT) || + (smc_sk_state(sk) == SMC_LISTEN) || + (smc_sk_state(sk) == SMC_CLOSED)) goto out; - if (sk->sk_state == SMC_PEERFINCLOSEWAIT) { + if (smc_sk_state(sk) == SMC_PEERFINCLOSEWAIT) { rc = 0; goto out; } @@ -2805,14 +3147,14 @@ static __poll_t smc_poll(struct file *file, struct socket *sock, mask = smc->clcsock->ops->poll(file, smc->clcsock, wait); sk->sk_err = smc->clcsock->sk->sk_err; } else { - 
if (sk->sk_state != SMC_CLOSED) + if (smc_sk_state(sk) != SMC_CLOSED) sock_poll_wait(file, sock, wait); if (sk->sk_err) mask |= EPOLLERR; if ((sk->sk_shutdown == SHUTDOWN_MASK) || - (sk->sk_state == SMC_CLOSED)) + (smc_sk_state(sk) == SMC_CLOSED)) mask |= EPOLLHUP; - if (sk->sk_state == SMC_LISTEN) { + if (smc_sk_state(sk) == SMC_LISTEN) { /* woken up by sk_data_ready in smc_listen_work() */ mask |= smc_accept_poll(sk); } else if (smc->use_fallback) { /* as result of connect_work()*/ @@ -2820,7 +3162,7 @@ static __poll_t smc_poll(struct file *file, struct socket *sock, wait); sk->sk_err = smc->clcsock->sk->sk_err; } else { - if ((sk->sk_state != SMC_INIT && + if ((smc_sk_state(sk) != SMC_INIT && atomic_read(&smc->conn.sndbuf_space)) || sk->sk_shutdown & SEND_SHUTDOWN) { mask |= EPOLLOUT | EPOLLWRNORM; @@ -2832,7 +3174,7 @@ static __poll_t smc_poll(struct file *file, struct socket *sock, mask |= EPOLLIN | EPOLLRDNORM; if (sk->sk_shutdown & RCV_SHUTDOWN) mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP; - if (sk->sk_state == SMC_APPCLOSEWAIT1) + if (smc_sk_state(sk) == SMC_APPCLOSEWAIT1) mask |= EPOLLIN; if (smc->conn.urg_state == SMC_URG_VALID) mask |= EPOLLPRI; @@ -2859,29 +3201,29 @@ static int smc_shutdown(struct socket *sock, int how) lock_sock(sk); if (sock->state == SS_CONNECTING) { - if (sk->sk_state == SMC_ACTIVE) + if (smc_sk_state(sk) == SMC_ACTIVE) sock->state = SS_CONNECTED; - else if (sk->sk_state == SMC_PEERCLOSEWAIT1 || - sk->sk_state == SMC_PEERCLOSEWAIT2 || - sk->sk_state == SMC_APPCLOSEWAIT1 || - sk->sk_state == SMC_APPCLOSEWAIT2 || - sk->sk_state == SMC_APPFINCLOSEWAIT) + else if (smc_sk_state(sk) == SMC_PEERCLOSEWAIT1 || + smc_sk_state(sk) == SMC_PEERCLOSEWAIT2 || + smc_sk_state(sk) == SMC_APPCLOSEWAIT1 || + smc_sk_state(sk) == SMC_APPCLOSEWAIT2 || + smc_sk_state(sk) == SMC_APPFINCLOSEWAIT) sock->state = SS_DISCONNECTING; } rc = -ENOTCONN; - if ((sk->sk_state != SMC_ACTIVE) && - (sk->sk_state != SMC_PEERCLOSEWAIT1) && - (sk->sk_state != SMC_PEERCLOSEWAIT2) && - (sk->sk_state != SMC_APPCLOSEWAIT1) && - (sk->sk_state != SMC_APPCLOSEWAIT2) && - (sk->sk_state != SMC_APPFINCLOSEWAIT)) + if ((smc_sk_state(sk) != SMC_ACTIVE) && + (smc_sk_state(sk) != SMC_PEERCLOSEWAIT1) && + (smc_sk_state(sk) != SMC_PEERCLOSEWAIT2) && + (smc_sk_state(sk) != SMC_APPCLOSEWAIT1) && + (smc_sk_state(sk) != SMC_APPCLOSEWAIT2) && + (smc_sk_state(sk) != SMC_APPFINCLOSEWAIT)) goto out; if (smc->use_fallback) { rc = kernel_sock_shutdown(smc->clcsock, how); sk->sk_shutdown = smc->clcsock->sk->sk_shutdown; if (sk->sk_shutdown == SHUTDOWN_MASK) { - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sk->sk_socket->state = SS_UNCONNECTED; sock_put(sk); } @@ -2889,10 +3231,10 @@ static int smc_shutdown(struct socket *sock, int how) } switch (how) { case SHUT_RDWR: /* shutdown in both directions */ - old_state = sk->sk_state; + old_state = smc_sk_state(sk); rc = smc_close_active(smc); if (old_state == SMC_ACTIVE && - sk->sk_state == SMC_PEERCLOSEWAIT1) + smc_sk_state(sk) == SMC_PEERCLOSEWAIT1) do_shutdown = false; break; case SHUT_WR: @@ -2908,7 +3250,7 @@ static int smc_shutdown(struct socket *sock, int how) /* map sock_shutdown_cmd constants to sk_shutdown value range */ sk->sk_shutdown |= how + 1; - if (sk->sk_state == SMC_CLOSED) + if (smc_sk_state(sk) == SMC_CLOSED) sock->state = SS_UNCONNECTED; else sock->state = SS_DISCONNECTING; @@ -2982,38 +3324,61 @@ static int __smc_setsockopt(struct socket *sock, int level, int optname, return rc; } -static int smc_setsockopt(struct socket *sock, int level, 
int optname, - sockptr_t optval, unsigned int optlen) +/* When an unsupported sockopt is found, + * SMC should try it best to fallback. If fallback is not possible, + * an error should be explicitly returned. + */ +static inline bool smc_is_unsupport_tcp_sockopt(int optname) { - struct sock *sk = sock->sk; - struct smc_sock *smc; - int val, rc; + switch (optname) { + case TCP_FASTOPEN: + case TCP_FASTOPEN_CONNECT: + case TCP_FASTOPEN_KEY: + case TCP_FASTOPEN_NO_COOKIE: + case TCP_ULP: + return true; + } + return false; +} - if (level == SOL_TCP && optname == TCP_ULP) - return -EOPNOTSUPP; - else if (level == SOL_SMC) - return __smc_setsockopt(sock, level, optname, optval, optlen); +/* Return true if smc might modify the semantics of + * the imcoming TCP options. Specifically, it includes + * unsupported TCP options. + */ +static inline bool smc_need_override_tcp_sockopt(struct sock *sk, int optname) +{ + switch (optname) { + case TCP_NODELAY: + case TCP_CORK: + if (smc_sk_state(sk) == SMC_INIT || + smc_sk_state(sk) == SMC_LISTEN || + smc_sk_state(sk) == SMC_CLOSED) + return false; + fallthrough; + case TCP_DEFER_ACCEPT: + return true; + default: + break; + } + return smc_is_unsupport_tcp_sockopt(optname); +} + +static int smc_setsockopt_takeover(struct socket *sock, int level, int optname, + sockptr_t optval, unsigned int optlen) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int val, rc = 0; smc = smc_sk(sk); - /* generic setsockopts reaching us here always apply to the - * CLC socket - */ - mutex_lock(&smc->clcsock_release_lock); - if (!smc->clcsock) { - mutex_unlock(&smc->clcsock_release_lock); - return -EBADF; - } - if (unlikely(!smc->clcsock->ops->setsockopt)) - rc = -EOPNOTSUPP; - else - rc = smc->clcsock->ops->setsockopt(smc->clcsock, level, optname, - optval, optlen); - if (smc->clcsock->sk->sk_err) { - sk->sk_err = smc->clcsock->sk->sk_err; - sk->sk_error_report(sk); - } - mutex_unlock(&smc->clcsock_release_lock); + /* Obviously, the logic bellow requires the level to be TCP_SOL */ + if (level != SOL_TCP) + return 0; + + /* fast path, just go away if no extra action needed */ + if (!smc_need_override_tcp_sockopt(sk, optname)) + return 0; if (optlen < sizeof(int)) return -EINVAL; @@ -3021,24 +3386,13 @@ static int smc_setsockopt(struct socket *sock, int level, int optname, return -EFAULT; lock_sock(sk); - if (rc || smc->use_fallback) + if (smc->use_fallback) goto out; switch (optname) { - case TCP_FASTOPEN: - case TCP_FASTOPEN_CONNECT: - case TCP_FASTOPEN_KEY: - case TCP_FASTOPEN_NO_COOKIE: - /* option not supported by SMC */ - if (sk->sk_state == SMC_INIT && !smc->connect_nonblock) { - rc = smc_switch_to_fallback(smc, SMC_CLC_DECL_OPTUNSUPP); - } else { - rc = -EINVAL; - } - break; case TCP_NODELAY: - if (sk->sk_state != SMC_INIT && - sk->sk_state != SMC_LISTEN && - sk->sk_state != SMC_CLOSED) { + if (smc_sk_state(sk) != SMC_INIT && + smc_sk_state(sk) != SMC_LISTEN && + smc_sk_state(sk) != SMC_CLOSED) { if (val) { SMC_STAT_INC(smc, ndly_cnt); smc_tx_pending(&smc->conn); @@ -3047,9 +3401,9 @@ static int smc_setsockopt(struct socket *sock, int level, int optname, } break; case TCP_CORK: - if (sk->sk_state != SMC_INIT && - sk->sk_state != SMC_LISTEN && - sk->sk_state != SMC_CLOSED) { + if (smc_sk_state(sk) != SMC_INIT && + smc_sk_state(sk) != SMC_LISTEN && + smc_sk_state(sk) != SMC_CLOSED) { if (!val) { SMC_STAT_INC(smc, cork_cnt); smc_tx_pending(&smc->conn); @@ -3061,14 +3415,56 @@ static int smc_setsockopt(struct socket *sock, int level, int optname, 
smc->sockopt_defer_accept = val; break; default: + if (smc_is_unsupport_tcp_sockopt(optname)) { + /* option not supported by SMC */ + if (smc_sk_state(sk) == SMC_INIT && !smc->connect_nonblock) + rc = smc_switch_to_fallback(smc, SMC_CLC_DECL_OPTUNSUPP); + else + rc = -EINVAL; + } break; } out: release_sock(sk); - return rc; } +static int smc_setsockopt(struct socket *sock, int level, int optname, + sockptr_t optval, unsigned int optlen) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int rc; + + if (level == SOL_TCP && optname == TCP_ULP) + return -EOPNOTSUPP; + else if (level == SOL_SMC) + return __smc_setsockopt(sock, level, optname, optval, optlen); + + smc = smc_sk(sk); + + /* generic setsockopts reaching us here always apply to the + * CLC socket + */ + mutex_lock(&smc->clcsock_release_lock); + if (!smc->clcsock) { + mutex_unlock(&smc->clcsock_release_lock); + return -EBADF; + } + if (unlikely(!smc->clcsock->ops->setsockopt)) + rc = -EOPNOTSUPP; + else + rc = smc->clcsock->ops->setsockopt(smc->clcsock, level, optname, + optval, optlen); + if (smc->clcsock->sk->sk_err) { + sk->sk_err = smc->clcsock->sk->sk_err; + sk->sk_error_report(sk); + } + mutex_unlock(&smc->clcsock_release_lock); + + return rc ?: smc_setsockopt_takeover(sock, level, optname, optval, optlen); +} + static int smc_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen) { @@ -3117,24 +3513,24 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd, } switch (cmd) { case SIOCINQ: /* same as FIONREAD */ - if (smc->sk.sk_state == SMC_LISTEN) { + if (smc_sk_state(&smc->sk) == SMC_LISTEN) { release_sock(&smc->sk); return -EINVAL; } - if (smc->sk.sk_state == SMC_INIT || - smc->sk.sk_state == SMC_CLOSED) + if (smc_sk_state(&smc->sk) == SMC_INIT || + smc_sk_state(&smc->sk) == SMC_CLOSED) answ = 0; else answ = atomic_read(&smc->conn.bytes_to_rcv); break; case SIOCOUTQ: /* output queue size (not send + not acked) */ - if (smc->sk.sk_state == SMC_LISTEN) { + if (smc_sk_state(&smc->sk) == SMC_LISTEN) { release_sock(&smc->sk); return -EINVAL; } - if (smc->sk.sk_state == SMC_INIT || - smc->sk.sk_state == SMC_CLOSED) + if (smc_sk_state(&smc->sk) == SMC_INIT || + smc_sk_state(&smc->sk) == SMC_CLOSED) answ = 0; else answ = smc->conn.sndbuf_desc->len - @@ -3142,23 +3538,23 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd, break; case SIOCOUTQNSD: /* output queue size (not send only) */ - if (smc->sk.sk_state == SMC_LISTEN) { + if (smc_sk_state(&smc->sk) == SMC_LISTEN) { release_sock(&smc->sk); return -EINVAL; } - if (smc->sk.sk_state == SMC_INIT || - smc->sk.sk_state == SMC_CLOSED) + if (smc_sk_state(&smc->sk) == SMC_INIT || + smc_sk_state(&smc->sk) == SMC_CLOSED) answ = 0; else answ = smc_tx_prepared_sends(&smc->conn); break; case SIOCATMARK: - if (smc->sk.sk_state == SMC_LISTEN) { + if (smc_sk_state(&smc->sk) == SMC_LISTEN) { release_sock(&smc->sk); return -EINVAL; } - if (smc->sk.sk_state == SMC_INIT || - smc->sk.sk_state == SMC_CLOSED) { + if (smc_sk_state(&smc->sk) == SMC_INIT || + smc_sk_state(&smc->sk) == SMC_CLOSED) { answ = 0; } else { smc_curs_copy(&cons, &conn->local_tx_ctrl.cons, conn); @@ -3185,7 +3581,7 @@ static ssize_t smc_sendpage(struct socket *sock, struct page *page, smc = smc_sk(sk); lock_sock(sk); - if (sk->sk_state != SMC_ACTIVE) { + if (smc_sk_state(sk) != SMC_ACTIVE) { release_sock(sk); goto out; } @@ -3220,17 +3616,17 @@ static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos, smc = smc_sk(sk); lock_sock(sk); - if (sk->sk_state == 
SMC_CLOSED && (sk->sk_shutdown & RCV_SHUTDOWN)) { + if (smc_sk_state(sk) == SMC_CLOSED && (sk->sk_shutdown & RCV_SHUTDOWN)) { /* socket was connected before, no more data to read */ rc = 0; goto out; } - if (sk->sk_state == SMC_INIT || - sk->sk_state == SMC_LISTEN || - sk->sk_state == SMC_CLOSED) + if (smc_sk_state(sk) == SMC_INIT || + smc_sk_state(sk) == SMC_LISTEN || + smc_sk_state(sk) == SMC_CLOSED) goto out; - if (sk->sk_state == SMC_PEERFINCLOSEWAIT) { + if (smc_sk_state(sk) == SMC_PEERFINCLOSEWAIT) { rc = 0; goto out; } @@ -3279,8 +3675,8 @@ static const struct proto_ops smc_sock_ops = { .splice_read = smc_splice_read, }; -static int __smc_create(struct net *net, struct socket *sock, int protocol, - int kern, struct socket *clcsock) +static int smc_create(struct net *net, struct socket *sock, int protocol, + int kern) { int family = (protocol == SMCPROTO_SMC6) ? PF_INET6 : PF_INET; struct smc_sock *smc; @@ -3307,95 +3703,23 @@ static int __smc_create(struct net *net, struct socket *sock, int protocol, smc->use_fallback = false; /* assume rdma capability first */ smc->fallback_rsn = 0; - /* default behavior from limit_smc_hs in every net namespace */ - smc->limit_smc_hs = net->smc.limit_smc_hs; - - rc = 0; - if (!clcsock) { - rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, - &smc->clcsock); - if (rc) { - sk_common_release(sk); - goto out; - } - } else { - smc->clcsock = clcsock; + rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, + &smc->clcsock); + if (rc) { + sk_common_release(sk); + goto out; } out: return rc; } -static int smc_create(struct net *net, struct socket *sock, int protocol, - int kern) -{ - return __smc_create(net, sock, protocol, kern, NULL); -} - static const struct net_proto_family smc_sock_family_ops = { .family = PF_SMC, .owner = THIS_MODULE, .create = smc_create, }; -static int smc_ulp_init(struct sock *sk) -{ - struct socket *tcp = sk->sk_socket; - struct net *net = sock_net(sk); - struct socket *smcsock; - int protocol, ret; - - /* only TCP can be replaced */ - if (tcp->type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP || - (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)) - return -ESOCKTNOSUPPORT; - /* don't handle wq now */ - if (tcp->state != SS_UNCONNECTED || !tcp->file || tcp->wq.fasync_list) - return -ENOTCONN; - - if (sk->sk_family == AF_INET) - protocol = SMCPROTO_SMC; - else - protocol = SMCPROTO_SMC6; - - smcsock = sock_alloc(); - if (!smcsock) - return -ENFILE; - - smcsock->type = SOCK_STREAM; - __module_get(THIS_MODULE); /* tried in __tcp_ulp_find_autoload */ - ret = __smc_create(net, smcsock, protocol, 1, tcp); - if (ret) { - sock_release(smcsock); /* module_put() which ops won't be NULL */ - return ret; - } - - /* replace tcp socket to smc */ - smcsock->file = tcp->file; - smcsock->file->private_data = smcsock; - smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */ - smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */ - tcp->file = NULL; - - return ret; -} - -static void smc_ulp_clone(const struct request_sock *req, struct sock *newsk, - const gfp_t priority) -{ - struct inet_connection_sock *icsk = inet_csk(newsk); - - /* don't inherit ulp ops to child when listen */ - icsk->icsk_ulp_ops = NULL; -} - -static struct tcp_ulp_ops smc_ulp_ops __read_mostly = { - .name = "smc", - .owner = THIS_MODULE, - .init = smc_ulp_init, - .clone = smc_ulp_clone, -}; - static int smc_net_reserve_ports(struct net *net) { struct smc_ib_device *smcibdev; @@ -3487,121 
+3811,994 @@ static struct pernet_operations smc_net_stat_ops = { .exit = smc_net_stat_exit, }; -static int __init smc_init(void) +static int __smc_inet_connect_work_locked(struct smc_sock *smc) { - int rc, i; + int rc; - if (reserve_mode) { - pr_info_ratelimited("smc: load SMC module with reserve_mode\n"); - if (rsvd_ports_base > - (U16_MAX - SMC_IWARP_RSVD_PORTS_NUM)) { - pr_info_ratelimited("smc: reserve_mode with invalid " - "ports base\n"); - return -EINVAL; - } + rc = __smc_connect(smc); + if (rc < 0) + smc->sk.sk_err = -rc; + + smc_inet_sock_switch_negotiation_state(&smc->sk, SMC_NEGOTIATION_PREPARE_SMC, + (smc->use_fallback || + smc_sk_state(&smc->sk) == SMC_INIT) ? + SMC_NEGOTIATION_NO_SMC : SMC_NEGOTIATION_SMC); + + /* reset to this */ + if (smc->sk.sk_socket) { + wake_up_interruptible_all(&smc->accompany_socket.wq.wait); + smc->sk.sk_wq = &smc->sk.sk_socket->wq; } - rc = register_pernet_subsys(&smc_net_ops); - if (rc) - return rc; + /* make smc_negotiation can be seen */ + smp_wmb(); - rc = register_pernet_subsys(&smc_net_stat_ops); - if (rc) - goto out_pernet_subsys; + if (!sock_flag(&smc->sk, SOCK_DEAD)) { + if (smc->sk.sk_err) + smc->sk.sk_state_change(&smc->sk); + else + smc->sk.sk_write_space(&smc->sk); + } - smc_ism_init(); - smc_clc_init(); + /* sock hold in smc_inet_sock_state_change() or smc_inet_connect() */ + sock_put(&smc->sk); + return rc; +} - rc = smc_nl_init(); - if (rc) - goto out_pernet_subsys_stat; +static void smc_inet_connect_work(struct work_struct *work) +{ + struct smc_sock *smc = container_of(work, struct smc_sock, + connect_work); - rc = smc_pnet_init(); - if (rc) - goto out_nl; + lock_sock(&smc->sk); + __smc_inet_connect_work_locked(smc); + release_sock(&smc->sk); +} - rc = -ENOMEM; +static void smc_inet_listen_work(struct work_struct *work) +{ + struct smc_sock *smc = container_of(work, struct smc_sock, + smc_listen_work); + struct sock *sk = &smc->sk; - smc_tcp_ls_wq = alloc_workqueue("smc_tcp_ls_wq", 0, 0); - if (!smc_tcp_ls_wq) - goto out_pnet; + /* Initialize accompanying socket */ + smc_inet_sock_init_accompany_socket(sk); - smc_hs_wq = alloc_workqueue("smc_hs_wq", 0, 0); - if (!smc_hs_wq) - goto out_alloc_tcp_ls_wq; + /* current smc sock has not bee accept yet. 
*/ + sk->sk_wq = &smc_sk(sk)->accompany_socket.wq; - smc_close_wq = alloc_workqueue("smc_close_wq", 0, 0); - if (!smc_close_wq) - goto out_alloc_hs_wq; + smc_listen_work(work); + /* sock hold in smc_inet_sock_do_handshake() */ + sock_put(&smc->sk); +} - rc = smc_core_init(); - if (rc) { - pr_err("%s: smc_core_init fails with %d\n", __func__, rc); - goto out_alloc_wqs; - } +static int smc_inet_sock_do_handshake(struct sock *sk, bool sk_locked, bool sync) +{ + struct smc_sock *smc = smc_sk(sk); + int rc = 0; - rc = smc_llc_init(); - if (rc) { - pr_err("%s: smc_llc_init fails with %d\n", __func__, rc); - goto out_core; + if (smc_inet_sock_is_active_open(sk)) { + INIT_WORK(&smc->connect_work, smc_inet_connect_work); + /* protected sk during smc_inet_connect_work/__smc_inet_connect_work_locked */ + sock_hold(sk); + if (!sync) { + if (unlikely(!queue_work(smc_hs_wq, &smc->connect_work))) + sock_put(sk); /* sock hold above */ + return 0; + } + if (sk_locked) + return __smc_inet_connect_work_locked(smc); + lock_sock(sk); + rc = __smc_inet_connect_work_locked(smc); + release_sock(sk); + return rc; } - rc = smc_cdc_init(); - if (rc) { - pr_err("%s: smc_cdc_init fails with %d\n", __func__, rc); - goto out_core; - } + INIT_WORK(&smc->smc_listen_work, smc_inet_listen_work); + /* protected sk during smc_inet_listen_work */ + sock_hold(sk); + /* protected listen_smc during smc_inet_listen_work */ + sock_hold(&smc->listen_smc->sk); - rc = proto_register(&smc_proto, 1); - if (rc) { - pr_err("%s: proto_register(v4) fails with %d\n", __func__, rc); - goto out_core; + if (!sync) { + if (unlikely(!queue_work(smc_hs_wq, &smc->smc_listen_work))) + sock_put(sk); /* sock hold above */ + } else { + smc_inet_listen_work(&smc->smc_listen_work); } + /* listen work has no retval */ + return 0; +} - rc = proto_register(&smc_proto6, 1); - if (rc) { - pr_err("%s: proto_register(v6) fails with %d\n", __func__, rc); - goto out_proto; - } +static void smc_inet_sock_state_change(struct sock *sk) +{ + struct smc_sock *smc = smc_sk(sk); + int cur; - rc = sock_register(&smc_sock_family_ops); - if (rc) { - pr_err("%s: sock_register fails with %d\n", __func__, rc); - goto out_proto6; - } + if (sk->sk_err || (1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED)) { + write_lock_bh(&sk->sk_callback_lock); - for (i = 0; i < SMC_HTABLE_SIZE; i++) { - INIT_HLIST_HEAD(&smc_v4_hashinfo.ht[i]); - INIT_HLIST_HEAD(&smc_v6_hashinfo.ht[i]); - } + /* cause by release */ + if (unlikely(sk->sk_state_change != smc_inet_sock_state_change)) + goto out_unlock; - rc = smc_ib_register_client(); - if (rc) { - pr_err("%s: ib_register fails with %d\n", __func__, rc); - goto out_sock; + cur = smc_inet_sock_switch_negotiation_state_locked(sk, SMC_NEGOTIATION_TBD, + (tcp_sk(sk)->syn_smc && + !sk->sk_err) ? 
+ SMC_NEGOTIATION_PREPARE_SMC : + SMC_NEGOTIATION_NO_SMC); + + /* resume sk_state_change when cur changed */ + if (cur != SMC_NEGOTIATION_TBD) + sk->sk_state_change = smc->clcsk_state_change; + + if (cur == SMC_NEGOTIATION_PREPARE_SMC) { + smc_inet_sock_do_handshake(sk, /* not locked */ false, /* async */ false); + } else if (cur == SMC_NEGOTIATION_NO_SMC) { + /* resume sk_wq */ + sk->sk_wq = &sk->sk_socket->wq; + /* flush all sleeper on accompany_socket.wq */ + wake_up_interruptible_all(&smc->accompany_socket.wq.wait); + } +out_unlock: + write_unlock_bh(&sk->sk_callback_lock); } - rc = tcp_register_ulp(&smc_ulp_ops); - if (rc) { - pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc); - goto out_ib; - } + smc->clcsk_state_change(sk); +} - rc = smc_proc_init(); - if (rc) { - pr_err("%s: smc_proc_init fails with %d\n", __func__, rc); - goto out_ulp; - } +int smc_inet_init_sock(struct sock *sk) +{ + struct smc_sock *smc = smc_sk(sk); + int rc; + + /* Call tcp init sock first */ + rc = smc_inet_get_tcp_prot(sk->sk_family)->init(sk); + if (rc) + return rc; + + /* init common smc sock */ + smc_sock_init(sk, sock_net(sk)); + + /* IPPROTO_SMC does not exist in network, we MUST + * reset it to IPPROTO_TCP before connect. + */ + sk->sk_protocol = IPPROTO_TCP; + + /* Initialize smc_sock state */ + smc_sk_set_state(sk, SMC_INIT); + + /* built link */ + smc->clcsock = &smc->accompany_socket; + + /* Initialize negotiation state, see more details in + * enum smc_inet_sock_negotiation_state. + */ + isck_smc_negotiation_store(smc, SMC_NEGOTIATION_TBD); - static_branch_enable(&tcp_have_smc); return 0; +} -out_ulp: - tcp_unregister_ulp(&smc_ulp_ops); -out_ib: - smc_ib_unregister_client(); -out_sock: - sock_unregister(PF_SMC); +void smc_inet_sock_proto_release_cb(struct sock *sk) +{ + tcp_release_cb(sk); + + /* smc_release_cb only works for socks who identified + * as SMC. Note listen sock will also return here. + */ + if (!smc_inet_sock_check_smc(sk)) + return; + + smc_release_cb(sk); +} + +int smc_inet_connect(struct socket *sock, struct sockaddr *addr, + int alen, int flags) +{ + /* Initialize accompanying socket */ + smc_inet_sock_init_accompany_socket(sock->sk); + return smc_connect(sock, addr, alen, flags); +} + +int smc_inet_setsockopt(struct socket *sock, int level, int optname, + sockptr_t optval, unsigned int optlen) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + bool fallback; + int rc; + + smc = smc_sk(sk); + fallback = smc_inet_sock_check_fallback(sk); + + if (level == SOL_SMC) + return __smc_setsockopt(sock, level, optname, optval, optlen); + + /* Note that we always need to check if it's an unsupported + * options before set it to the given value via sock_common_setsockopt(). + * This is because if we set it after we found it is not supported to smc and + * we have no idea to fallback, we have to report this error to userspace. + * However, the user might find it is set correctly via sock_common_getsockopt(). 
+ */ + if (!fallback && level == SOL_TCP && smc_is_unsupport_tcp_sockopt(optname)) { + /* can not fallback, but with not-supported option */ + if (smc_inet_sock_try_fallback_fast(sk, /* try best */ 0)) + return -EOPNOTSUPP; + fallback = true; + } + + /* call original setsockopt */ + rc = sock_common_setsockopt(sock, level, optname, optval, optlen); + if (rc) + return rc; + + /* already be fallback */ + if (fallback) + return 0; + + /* deliver to smc if needed */ + return smc_setsockopt_takeover(sock, level, optname, optval, optlen); +} + +int smc_inet_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen) +{ + if (level == SOL_SMC) + return __smc_getsockopt(sock, level, optname, optval, optlen); + + /* smc_getsockopt is just a wrap on sock_common_getsockopt + * So we don't need to reuse it. + */ + return sock_common_getsockopt(sock, level, optname, optval, optlen); +} + +int smc_inet_ioctl(struct socket *sock, unsigned int cmd, + unsigned long arg) +{ + struct sock *sk = sock->sk; + int rc; + + if (smc_inet_sock_check_fallback(sk)) +fallback: + return smc_call_inet_sock_ops(sk, inet_ioctl, inet6_ioctl, sock, cmd, arg); + + rc = smc_ioctl(sock, cmd, arg); + if (unlikely(smc_sk(sk)->use_fallback)) + goto fallback; + + return rc; +} + +int smc_inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int rc; + + smc = smc_sk(sk); + + /* Send before connected, might be fastopen or user's incorrect usage, but + * whatever, in either case, we do not need to replace it with SMC any more. + * If it dues to user's incorrect usage, then it is also an error for TCP. + * Users should correct that error themselves. + */ + if (!smc_inet_sock_access_before(sk)) + goto no_smc; + + rc = smc_sendmsg(sock, msg, len); + if (likely(!smc->use_fallback)) + return rc; + + /* Fallback during smc_sendmsg */ +no_smc: + return smc_call_inet_sock_ops(sk, inet_sendmsg, inet6_sendmsg, sock, msg, len); +} + +int smc_inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, + int flags) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int rc; + + smc = smc_sk(sk); + + /* Recv before connection goes established, it's okay for TCP but not + * support in SMC(see smc_recvmsg), we should try our best to fallback + * if passible. 
+ */ + if (!smc_inet_sock_access_before(sk)) + goto no_smc; + + rc = smc_recvmsg(sock, msg, len, flags); + if (likely(!smc->use_fallback)) + return rc; + + /* Fallback during smc_recvmsg */ +no_smc: + return smc_call_inet_sock_ops(sk, inet_recvmsg, inet6_recvmsg, sock, msg, len, flags); +} + +ssize_t smc_inet_sendpage(struct socket *sock, struct page *page, + int offset, size_t size, int flags) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int rc; + + smc = smc_sk(sk); + + /* same reason with smc_recvmsg */ + if (!smc_inet_sock_access_before(sk)) + goto no_smc; + + rc = smc_sendpage(sock, page, offset, size, flags); + if (likely(!smc->use_fallback)) + return rc; + + /* Fallback during smc_sendpage */ +no_smc: + return inet_sendpage(sock, page, offset, size, flags); +} + +ssize_t smc_inet_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int rc; + + smc = smc_sk(sk); + + if (!smc_inet_sock_access_before(sk)) + goto no_smc; + + rc = smc_splice_read(sock, ppos, pipe, len, flags); + if (likely(!smc->use_fallback)) + return rc; + + /* Fallback during smc_splice_read */ +no_smc: + return tcp_splice_read(sock, ppos, pipe, len, flags); +} + +static inline __poll_t smc_inet_listen_poll(struct file *file, struct socket *sock, + poll_table *wait) +{ + __poll_t mask; + + mask = tcp_poll(file, sock, wait); + /* no tcp sock */ + if (!(smc_inet_sock_sort_csk_queue(sock->sk) & SMC_REQSK_TCP)) + mask &= ~(EPOLLIN | EPOLLRDNORM); + mask |= smc_accept_poll(sock->sk); + return mask; +} + +__poll_t smc_inet_poll(struct file *file, struct socket *sock, poll_table *wait) +{ + struct sock *sk = sock->sk; + __poll_t mask; + + if (smc_inet_sock_check_fallback_fast(sk)) +no_smc: + return tcp_poll(file, sock, wait); + + /* special case */ + if (inet_sk_state_load(sk) == TCP_LISTEN) + return smc_inet_listen_poll(file, sock, wait); + + mask = smc_poll(file, sock, wait); + if (smc_sk(sk)->use_fallback) + goto no_smc; + + return mask; +} + +int smc_inet_shutdown(struct socket *sock, int how) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int rc; + + smc = smc_sk(sk); + + /* All state changes of sock are handled by inet_shutdown, + * smc only needs to be responsible for + * executing the corresponding semantics. + */ + rc = inet_shutdown(sock, how); + if (rc) + return rc; + + /* shutdown during SMC_NEGOTIATION_TBD, we can force it to be + * fallback. 
+ */ + if (!smc_inet_sock_try_fallback_fast(sk, /* force it to no_smc */ 1)) + return 0; + + /* executing the corresponding semantics if can not be fallback */ + lock_sock(sk); + switch (how) { + case SHUT_RDWR: /* shutdown in both directions */ + rc = smc_close_active(smc); + break; + case SHUT_WR: + rc = smc_close_shutdown_write(smc); + break; + case SHUT_RD: + rc = 0; + /* nothing more to do because peer is not involved */ + break; + } + release_sock(sk); + return rc; +} + +int smc_inet_release(struct socket *sock) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int old_state; + + if (!sk) + return 0; + + smc = smc_sk(sk); + + /* trigger info gathering if needed.*/ + smc_sock_perform_collecting_info(smc, SMC_SOCK_CLOSED_TIMING); + + if (!smc_inet_sock_try_fallback_fast(sk, /* force it to no_smc */ 1)) + goto out; + + old_state = smc_sk_state(sk); + + /* cleanup for a dangling non-blocking connect */ + if (smc->connect_nonblock && old_state == SMC_INIT) { + sk->sk_err = ECONNABORTED; + sk->sk_error_report(sk); + } + + if (smc->connect_nonblock && cancel_work_sync(&smc->connect_work)) + sock_put(&smc->sk); /* sock_hold in smc_connect for passive closing */ + + if (smc_sk_state(sk) == SMC_LISTEN) + /* smc_close_non_accepted() is called and acquires + * sock lock for child sockets again + */ + lock_sock_nested(sk, SINGLE_DEPTH_NESTING); + else + lock_sock(sk); + + if ((old_state == SMC_INIT || smc->conn.killed) && + smc_sk_state(sk) == SMC_ACTIVE && !smc->use_fallback) + smc_close_active_abort(smc); + + /* ret of smc_close_active do not need return to userspace */ + smc_close_active(smc); + sock_set_flag(sk, SOCK_DEAD); + + if (smc_sk_state(sk) == SMC_CLOSED) + smc_conn_free(&smc->conn); + + release_sock(sk); +out: + /* release tcp sock */ + return smc_call_inet_sock_ops(sk, inet_release, inet6_release, sock); +} + +static inline struct request_sock *smc_inet_reqsk_get_safe_tail_0(struct sock *parent) +{ + struct request_sock_queue *queue = &inet_csk(parent)->icsk_accept_queue; + struct request_sock *req = queue->rskq_accept_head; + + if (req && smc_sk(req->sk)->ordered && tcp_sk(req->sk)->syn_smc == 0) + return smc_sk(parent)->tail_0; + + return NULL; +} + +static inline struct request_sock *smc_inet_reqsk_get_safe_tail_1(struct sock *parent) +{ + struct request_sock_queue *queue = &inet_csk(parent)->icsk_accept_queue; + struct request_sock *tail_0 = smc_inet_reqsk_get_safe_tail_0(parent); + struct request_sock *req; + + if (tail_0) + req = tail_0->dl_next; + else + req = queue->rskq_accept_head; + + if (req && smc_sk(req->sk)->ordered && tcp_sk(req->sk)->syn_smc) + return smc_sk(parent)->tail_1; + + return NULL; +} + +static inline void smc_reqsk_queue_remove_locked(struct request_sock_queue *queue) +{ + struct request_sock *req; + + req = queue->rskq_accept_head; + if (req) { + WRITE_ONCE(queue->rskq_accept_head, req->dl_next); + if (!queue->rskq_accept_head) + queue->rskq_accept_tail = NULL; + } +} + +static inline void smc_reqsk_queue_add_locked(struct request_sock_queue *queue, + struct request_sock *req) +{ + req->dl_next = NULL; + if (!queue->rskq_accept_head) + WRITE_ONCE(queue->rskq_accept_head, req); + else + queue->rskq_accept_tail->dl_next = req; + queue->rskq_accept_tail = req; +} + +static inline void smc_reqsk_queue_join_locked(struct request_sock_queue *to, + struct request_sock_queue *from) +{ + if (reqsk_queue_empty(from)) + return; + + if (reqsk_queue_empty(to)) { + to->rskq_accept_head = from->rskq_accept_head; + to->rskq_accept_tail = from->rskq_accept_tail; 
+ } else { + to->rskq_accept_tail->dl_next = from->rskq_accept_head; + to->rskq_accept_tail = from->rskq_accept_tail; + } + + from->rskq_accept_head = NULL; + from->rskq_accept_tail = NULL; +} + +static inline void smc_reqsk_queue_cut_locked(struct request_sock_queue *queue, + struct request_sock *tail, + struct request_sock_queue *split) +{ + if (!tail) { + split->rskq_accept_tail = queue->rskq_accept_tail; + split->rskq_accept_head = queue->rskq_accept_head; + queue->rskq_accept_tail = NULL; + queue->rskq_accept_head = NULL; + return; + } + + if (tail == queue->rskq_accept_tail) { + split->rskq_accept_tail = NULL; + split->rskq_accept_head = NULL; + return; + } + + split->rskq_accept_head = tail->dl_next; + split->rskq_accept_tail = queue->rskq_accept_tail; + queue->rskq_accept_tail = tail; + tail->dl_next = NULL; +} + +static inline void __smc_inet_sock_sort_csk_queue(struct sock *parent, int *tcp_cnt, int *smc_cnt) +{ + struct request_sock_queue queue_smc, queue_free; + struct smc_sock *par = smc_sk(parent); + struct request_sock_queue *queue; + struct request_sock *req; + int cnt0, cnt1; + + queue = &inet_csk(parent)->icsk_accept_queue; + + spin_lock_bh(&queue->rskq_lock); + + par->tail_0 = smc_inet_reqsk_get_safe_tail_0(parent); + par->tail_1 = smc_inet_reqsk_get_safe_tail_1(parent); + + cnt0 = par->tail_0 ? smc_sk(par->tail_0->sk)->queued_cnt : 0; + cnt1 = par->tail_1 ? smc_sk(par->tail_1->sk)->queued_cnt : 0; + + smc_reqsk_queue_cut_locked(queue, par->tail_0, &queue_smc); + smc_reqsk_queue_cut_locked(&queue_smc, par->tail_1, &queue_free); + + /* scan all queue_free and re-add it */ + while ((req = queue_free.rskq_accept_head)) { + smc_sk(req->sk)->ordered = 1; + smc_reqsk_queue_remove_locked(&queue_free); + /* It's not good at timecast, but better to understand */ + if (tcp_sk(req->sk)->syn_smc) { + smc_reqsk_queue_add_locked(&queue_smc, req); + cnt1++; + } else { + smc_reqsk_queue_add_locked(queue, req); + cnt0++; + } + } + /* update tail */ + par->tail_0 = queue->rskq_accept_tail; + par->tail_1 = queue_smc.rskq_accept_tail; + + /* join queue */ + smc_reqsk_queue_join_locked(queue, &queue_smc); + + if (par->tail_0) { + smc_sk(par->tail_0->sk)->queued_cnt += cnt0; + cnt0 = smc_sk(par->tail_0->sk)->queued_cnt; + } + + if (par->tail_1) { + smc_sk(par->tail_1->sk)->queued_cnt += cnt1; + cnt1 = smc_sk(par->tail_1->sk)->queued_cnt; + } + + *tcp_cnt = cnt0; + *smc_cnt = cnt1; + + spin_unlock_bh(&queue->rskq_lock); +} + +static int smc_inet_sock_sort_csk_queue(struct sock *parent) +{ + int smc_cnt, tcp_cnt; + int mask = 0; + + __smc_inet_sock_sort_csk_queue(parent, &tcp_cnt, &smc_cnt); + if (tcp_cnt) + mask |= SMC_REQSK_TCP; + if (smc_cnt) + mask |= SMC_REQSK_SMC; + + return mask; +} + +static int smc_inet_sock_reverse_ordered_csk_queue(struct sock *parent) +{ + struct request_sock_queue *queue, queue_smc, queue_free; + struct smc_sock *par = smc_sk(parent); + int mask = SMC_REQSK_TCP; + + queue = &inet_csk(parent)->icsk_accept_queue; + spin_lock_bh(&queue->rskq_lock); + + par->tail_0 = smc_inet_reqsk_get_safe_tail_0(parent); + par->tail_1 = smc_inet_reqsk_get_safe_tail_1(parent); + + smc_reqsk_queue_cut_locked(queue, par->tail_0, &queue_smc); + smc_reqsk_queue_cut_locked(&queue_smc, par->tail_1, &queue_free); + + /* has smc reqsk */ + if (!reqsk_queue_empty(&queue_smc)) + mask = SMC_REQSK_SMC; + + smc_reqsk_queue_join_locked(&queue_smc, queue); + smc_reqsk_queue_join_locked(&queue_smc, &queue_free); + smc_reqsk_queue_join_locked(queue, &queue_smc); + + if (par->tail_0) + par->tail_1 = 
NULL; + spin_unlock_bh(&queue->rskq_lock); + return mask; +} + +/* Wait for an incoming connection, avoid race conditions. This must be called + * with the socket locked. + */ +static int smc_inet_csk_wait_for_connect(struct sock *sk, long *timeo) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + DEFINE_WAIT(wait); + int err; + + lock_sock(sk); + + /* True wake-one mechanism for incoming connections: only + * one process gets woken up, not the 'whole herd'. + * Since we do not 'race & poll' for established sockets + * anymore, the common case will execute the loop only once. + * + * Subtle issue: "add_wait_queue_exclusive()" will be added + * after any current non-exclusive waiters, and we know that + * it will always _stay_ after any new non-exclusive waiters + * because all non-exclusive waiters are added at the + * beginning of the wait-queue. As such, it's ok to "drop" + * our exclusiveness temporarily when we get woken up without + * having to remove and re-insert us on the wait queue. + */ + for (;;) { + prepare_to_wait_exclusive(sk_sleep(sk), &wait, + TASK_INTERRUPTIBLE); + release_sock(sk); + if (reqsk_queue_empty(&icsk->icsk_accept_queue)) + *timeo = schedule_timeout(*timeo); + sched_annotate_sleep(); + lock_sock(sk); + err = 0; + if (!reqsk_queue_empty(&icsk->icsk_accept_queue)) + break; + if (!smc_accept_queue_empty(sk)) + break; + err = -EINVAL; + if (sk->sk_state != TCP_LISTEN) + break; + err = sock_intr_errno(*timeo); + if (signal_pending(current)) + break; + err = -EAGAIN; + if (!*timeo) + break; + } + finish_wait(sk_sleep(sk), &wait); + release_sock(sk); + return err; +} + +struct sock *__smc_inet_csk_accept(struct sock *sk, int flags, int *err, bool kern, int next_state) +{ + struct sock *child; + int cur; + + child = inet_csk_accept(sk, flags | O_NONBLOCK, err, kern); + if (child) { + /* depends on syn_smc if next_state not specify */ + if (next_state == SMC_NEGOTIATION_TBD) + next_state = tcp_sk(child)->syn_smc ? 
SMC_NEGOTIATION_PREPARE_SMC : + SMC_NEGOTIATION_NO_SMC; + + cur = smc_inet_sock_switch_negotiation_state(child, SMC_NEGOTIATION_TBD, + next_state); + switch (cur) { + case SMC_NEGOTIATION_NO_SMC: + smc_sk_set_state(child, SMC_ACTIVE); + smc_switch_to_fallback(smc_sk(child), SMC_CLC_DECL_PEERNOSMC); + break; + case SMC_NEGOTIATION_PREPARE_SMC: + /* init as passive open smc sock */ + smc_sock_init_passive(sk, child); + break; + default: + break; + } + } + return child; +} + +struct sock *smc_inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) +{ + struct sock *child; + long timeo; + + timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK); + +again: + /* has smc sock */ + if (!smc_accept_queue_empty(sk)) { + child = __smc_accept(sk, NULL, flags | O_NONBLOCK, err, kern); + if (child) + return child; + } + + child = __smc_inet_csk_accept(sk, flags | O_NONBLOCK, err, kern, SMC_NEGOTIATION_TBD); + if (child) { + /* not smc sock */ + if (smc_inet_sock_check_fallback_fast(child)) + return child; + /* smc sock */ + smc_inet_sock_do_handshake(child, /* sk not locked */ false, /* sync */ false); + *err = -EAGAIN; + child = NULL; + } + + if (*err == -EAGAIN && timeo) { + *err = smc_inet_csk_wait_for_connect(sk, &timeo); + if (*err == 0) + goto again; + } + + return NULL; +} + +static void smc_inet_tcp_listen_work(struct work_struct *work) +{ + struct smc_sock *lsmc = container_of(work, struct smc_sock, + tcp_listen_work); + struct sock *lsk = &lsmc->sk; + struct sock *child; + int error = 0; + + while (smc_sk_state(lsk) == SMC_LISTEN && + (smc_inet_sock_reverse_ordered_csk_queue(lsk) & SMC_REQSK_SMC)) { + child = __smc_inet_csk_accept(lsk, O_NONBLOCK, &error, 1, + SMC_NEGOTIATION_PREPARE_SMC); + if (!child || error) + break; + + /* run handshake for child + * If child is a fallback connection, run a sync handshake to eliminate + * the impact of queue_work(). 
+ */ + smc_inet_sock_do_handshake(child, /* sk not locked */ false, + !tcp_sk(child)->syn_smc); + + /* Minimize handling fallback connections in workqueue as much as possible */ + if (!tcp_sk(child)->syn_smc) + break; + } +} + +static void smc_inet_sock_data_ready(struct sock *sk) +{ + struct smc_sock *smc = smc_sk(sk); + int mask; + + if (inet_sk_state_load(sk) == TCP_LISTEN) { + mask = smc_inet_sock_sort_csk_queue(sk); + if (mask & SMC_REQSK_TCP || !smc_accept_queue_empty(sk)) + smc->clcsk_data_ready(sk); + if (mask & SMC_REQSK_SMC) + queue_work(smc_tcp_ls_wq, &smc->tcp_listen_work); + } else { + write_lock_bh(&sk->sk_callback_lock); + sk->sk_data_ready = smc->clcsk_data_ready; + write_unlock_bh(&sk->sk_callback_lock); + smc->clcsk_data_ready(sk); + } +} + +int smc_inet_listen(struct socket *sock, int backlog) +{ + struct sock *sk = sock->sk; + bool need_init = false; + struct smc_sock *smc; + + smc = smc_sk(sk); + + write_lock_bh(&sk->sk_callback_lock); + /* still wish to accept smc sock */ + if (isck_smc_negotiation_load(smc) == SMC_NEGOTIATION_TBD) { + need_init = tcp_sk(sk)->syn_smc = 1; + isck_smc_negotiation_set_flags(smc, SMC_NEGOTIATION_LISTEN_FLAG); + } + write_unlock_bh(&sk->sk_callback_lock); + + if (need_init) { + lock_sock(sk); + if (smc_sk_state(sk) == SMC_INIT) { + smc_init_listen(smc); + INIT_WORK(&smc->tcp_listen_work, smc_inet_tcp_listen_work); + smc_clcsock_replace_cb(&sk->sk_data_ready, smc_inet_sock_data_ready, + &smc->clcsk_data_ready); + smc_sk_set_state(sk, SMC_LISTEN); + } + release_sock(sk); + } + return inet_listen(sock, backlog); +} + +static int __init smc_init(void) +{ + int rc, i; + + if (reserve_mode) { + pr_info_ratelimited("smc: load SMC module with reserve_mode\n"); + if (rsvd_ports_base > + (U16_MAX - SMC_IWARP_RSVD_PORTS_NUM)) { + pr_info_ratelimited("smc: reserve_mode with invalid " + "ports base\n"); + return -EINVAL; + } + } + + rc = register_pernet_subsys(&smc_net_ops); + if (rc) + return rc; + + rc = register_pernet_subsys(&smc_net_stat_ops); + if (rc) + goto out_pernet_subsys; + + rc = smc_ism_init(); + if (rc) + goto out_pernet_subsys_stat; + smc_clc_init(); + + rc = smc_nl_init(); + if (rc) + goto out_ism; + + rc = smc_pnet_init(); + if (rc) + goto out_nl; + + rc = -ENOMEM; + + smc_tcp_ls_wq = alloc_workqueue("smc_tcp_ls_wq", 0, 0); + if (!smc_tcp_ls_wq) + goto out_pnet; + + smc_hs_wq = alloc_workqueue("smc_hs_wq", 0, 0); + if (!smc_hs_wq) + goto out_alloc_tcp_ls_wq; + + smc_close_wq = alloc_workqueue("smc_close_wq", 0, 0); + if (!smc_close_wq) + goto out_alloc_hs_wq; + + rc = smc_core_init(); + if (rc) { + pr_err("%s: smc_core_init fails with %d\n", __func__, rc); + goto out_alloc_wqs; + } + + rc = smc_llc_init(); + if (rc) { + pr_err("%s: smc_llc_init fails with %d\n", __func__, rc); + goto out_core; + } + + rc = smc_cdc_init(); + if (rc) { + pr_err("%s: smc_cdc_init fails with %d\n", __func__, rc); + goto out_core; + } + + rc = proto_register(&smc_proto, 1); + if (rc) { + pr_err("%s: proto_register(v4) fails with %d\n", __func__, rc); + goto out_core; + } + + rc = proto_register(&smc_proto6, 1); + if (rc) { + pr_err("%s: proto_register(v6) fails with %d\n", __func__, rc); + goto out_proto; + } + + rc = sock_register(&smc_sock_family_ops); + if (rc) { + pr_err("%s: sock_register fails with %d\n", __func__, rc); + goto out_proto6; + } + + for (i = 0; i < SMC_HTABLE_SIZE; i++) { + INIT_HLIST_HEAD(&smc_v4_hashinfo.ht[i]); + INIT_HLIST_HEAD(&smc_v6_hashinfo.ht[i]); + } + + rc = smc_ib_register_client(); + if (rc) { + pr_err("%s: ib_register 
fails with %d\n", __func__, rc); + goto out_sock; + } + + rc = smc_proc_init(); + if (rc) { + pr_err("%s: smc_proc_init fails with %d\n", __func__, rc); + goto out_ib; + } + + /* init smc inet sock related proto and proto_ops */ + rc = smc_inet_sock_init(); + if (!rc) { + /* registe smc inet proto */ + rc = proto_register(&smc_inet_prot, 1); + if (rc) { + pr_err("%s: proto_register smc_inet_prot fails with %d\n", __func__, rc); + goto out_proc; + } + /* no return value */ + inet_register_protosw(&smc_inet_protosw); + } + + static_branch_enable(&tcp_have_smc); + return 0; +out_proc: + smc_proc_exit(); +out_ib: + smc_ib_unregister_client(); +out_sock: + sock_unregister(PF_SMC); out_proto6: proto_unregister(&smc_proto6); out_proto: @@ -3618,6 +4815,9 @@ static int __init smc_init(void) smc_pnet_exit(); out_nl: smc_nl_exit(); +out_ism: + smc_clc_exit(); + smc_ism_exit(); out_pernet_subsys_stat: unregister_pernet_subsys(&smc_net_stat_ops); out_pernet_subsys: @@ -3629,16 +4829,18 @@ static int __init smc_init(void) static void __exit smc_exit(void) { static_branch_disable(&tcp_have_smc); - tcp_unregister_ulp(&smc_ulp_ops); + inet_unregister_protosw(&smc_inet_protosw); smc_proc_exit(); sock_unregister(PF_SMC); smc_core_exit(); smc_ib_unregister_client(); + smc_ism_exit(); destroy_workqueue(smc_close_wq); destroy_workqueue(smc_tcp_ls_wq); destroy_workqueue(smc_hs_wq); proto_unregister(&smc_proto6); proto_unregister(&smc_proto); + proto_unregister(&smc_inet_prot); smc_pnet_exit(); smc_nl_exit(); smc_clc_exit(); @@ -3654,5 +4856,8 @@ MODULE_AUTHOR("Ursula Braun "); MODULE_DESCRIPTION("smc socket address family"); MODULE_LICENSE("GPL"); MODULE_ALIAS_NETPROTO(PF_SMC); -MODULE_ALIAS_TCP_ULP("smc"); +/* It seems that this macro has different + * understanding of enum type(IPPROTO_SMC or SOCK_STREAM) + */ +MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET, 263, 1); MODULE_ALIAS_GENL_FAMILY(SMC_GENL_FAMILY_NAME); diff --git a/net/smc/bpf_smc.c b/net/smc/bpf_smc.c new file mode 100644 index 0000000000000000000000000000000000000000..5c569b1f0df916b7710dd48455a5c0abf064b9bd --- /dev/null +++ b/net/smc/bpf_smc.c @@ -0,0 +1,352 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Support eBPF for Shared Memory Communications over RDMA (SMC-R) and RoCE + * + * Copyright IBM Corp. 2016, 2018 + * + * Author(s): D. 
Wythe + */ + +#include +#include +#include +#include +#include +#include +#include +#include "smc.h" + +struct bpf_struct_ops bpf_smc_sock_negotiator_ops; + +static DEFINE_SPINLOCK(smc_sock_negotiator_list_lock); +static LIST_HEAD(smc_sock_negotiator_list); +static u32 smc_sock_id, sock_id; + +/* required smc_sock_negotiator_list_lock locked */ +static struct smc_sock_negotiator_ops *smc_negotiator_ops_get_by_key(u32 key) +{ + struct smc_sock_negotiator_ops *ops; + + list_for_each_entry_rcu(ops, &smc_sock_negotiator_list, list) { + if (ops->key == key) + return ops; + } + + return NULL; +} + +/* required smc_sock_negotiator_list_lock locked */ +static struct smc_sock_negotiator_ops * +smc_negotiator_ops_get_by_name(const char *name) +{ + struct smc_sock_negotiator_ops *ops; + + list_for_each_entry_rcu(ops, &smc_sock_negotiator_list, list) { + if (strcmp(ops->name, name) == 0) + return ops; + } + + return NULL; +} + +static int smc_sock_validate_negotiator_ops(struct smc_sock_negotiator_ops *ops) +{ + /* not required yet */ + return 0; +} + +/* register ops */ +int smc_sock_register_negotiator_ops(struct smc_sock_negotiator_ops *ops) +{ + int ret; + + ret = smc_sock_validate_negotiator_ops(ops); + if (ret) + return ret; + + /* calt key by name hash */ + ops->key = jhash(ops->name, sizeof(ops->name), strlen(ops->name)); + + spin_lock(&smc_sock_negotiator_list_lock); + if (smc_negotiator_ops_get_by_key(ops->key)) { + pr_notice("smc: %s negotiator already registered\n", ops->name); + ret = -EEXIST; + } else { + list_add_tail_rcu(&ops->list, &smc_sock_negotiator_list); + } + spin_unlock(&smc_sock_negotiator_list_lock); + return ret; +} +EXPORT_SYMBOL_GPL(smc_sock_register_negotiator_ops); + +/* unregister ops */ +void smc_sock_unregister_negotiator_ops(struct smc_sock_negotiator_ops *ops) +{ + spin_lock(&smc_sock_negotiator_list_lock); + list_del_rcu(&ops->list); + spin_unlock(&smc_sock_negotiator_list_lock); + + /* Wait for outstanding readers to complete before the + * ops gets removed entirely. 
+ */ + synchronize_rcu(); +} +EXPORT_SYMBOL_GPL(smc_sock_unregister_negotiator_ops); + +int smc_sock_update_negotiator_ops(struct smc_sock_negotiator_ops *ops, + struct smc_sock_negotiator_ops *old_ops) +{ + struct smc_sock_negotiator_ops *existing; + int ret; + + ret = smc_sock_validate_negotiator_ops(ops); + if (ret) + return ret; + + ops->key = jhash(ops->name, sizeof(ops->name), strlen(ops->name)); + if (unlikely(!ops->key)) + return -EINVAL; + + spin_lock(&smc_sock_negotiator_list_lock); + existing = smc_negotiator_ops_get_by_key(old_ops->key); + if (!existing || strcmp(existing->name, ops->name)) { + ret = -EINVAL; + } else if (existing != old_ops) { + pr_notice("invalid old negotiator to replace\n"); + ret = -EINVAL; + } else { + list_add_tail_rcu(&ops->list, &smc_sock_negotiator_list); + list_del_rcu(&existing->list); + } + + spin_unlock(&smc_sock_negotiator_list_lock); + if (ret) + return ret; + + synchronize_rcu(); + return 0; +} +EXPORT_SYMBOL_GPL(smc_sock_update_negotiator_ops); + +/* assign ops to sock */ +int smc_sock_assign_negotiator_ops(struct smc_sock *smc, const char *name) +{ + struct smc_sock_negotiator_ops *ops; + int ret = -EINVAL; + + /* already set */ + if (READ_ONCE(smc->negotiator_ops)) + smc_sock_cleanup_negotiator_ops(smc, /* in release */ 0); + + /* Just for clear negotiator_ops */ + if (!name || !strlen(name)) + return 0; + + rcu_read_lock(); + ops = smc_negotiator_ops_get_by_name(name); + if (likely(ops)) { + if (unlikely(!bpf_try_module_get(ops, ops->owner))) { + ret = -EACCES; + } else { + WRITE_ONCE(smc->negotiator_ops, ops); + /* make sure ops can be seen */ + smp_wmb(); + if (ops->init) + ops->init(&smc->sk); + ret = 0; + } + } + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL_GPL(smc_sock_assign_negotiator_ops); + +/* reset ops to sock */ +void smc_sock_cleanup_negotiator_ops(struct smc_sock *smc, int in_release) +{ + const struct smc_sock_negotiator_ops *ops; + + ops = READ_ONCE(smc->negotiator_ops); + + /* not all smc sock has negotiator_ops */ + if (!ops) + return; + + might_sleep(); + + /* Just ensure data integrity */ + WRITE_ONCE(smc->negotiator_ops, NULL); + /* make sure NULL can be seen */ + smp_wmb(); + /* If the cleanup was not caused by the release of the sock, + * it means that we may need to wait for the readers of ops + * to complete. 
+ */ + if (unlikely(!in_release)) + synchronize_rcu(); + if (ops->release) + ops->release(&smc->sk); + bpf_module_put(ops, ops->owner); +} +EXPORT_SYMBOL_GPL(smc_sock_cleanup_negotiator_ops); + +void smc_sock_clone_negotiator_ops(struct sock *parent, struct sock *child) +{ + const struct smc_sock_negotiator_ops *ops; + + rcu_read_lock(); + ops = READ_ONCE(smc_sk(parent)->negotiator_ops); + if (ops && bpf_try_module_get(ops, ops->owner)) { + smc_sk(child)->negotiator_ops = ops; + if (ops->init) + ops->init(child); + } + rcu_read_unlock(); +} +EXPORT_SYMBOL_GPL(smc_sock_clone_negotiator_ops); + +static int bpf_smc_negotiator_init(struct btf *btf) +{ + s32 type_id; + + type_id = btf_find_by_name_kind(btf, "sock", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + sock_id = type_id; + + type_id = btf_find_by_name_kind(btf, "smc_sock", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + smc_sock_id = type_id; + + return 0; +} + +/* register ops */ +static int bpf_smc_negotiator_reg(void *kdata) +{ + return smc_sock_register_negotiator_ops(kdata); +} + +/* unregister ops */ +static void bpf_smc_negotiator_unreg(void *kdata) +{ + smc_sock_unregister_negotiator_ops(kdata); +} + +static int bpf_smc_negotiator_check_member(const struct btf_type *t, + const struct btf_member *member) +{ + return 0; +} + +static int bpf_smc_negotiator_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + const struct smc_sock_negotiator_ops *uops; + struct smc_sock_negotiator_ops *ops; + u32 moff; + + uops = (const struct smc_sock_negotiator_ops *)udata; + ops = (struct smc_sock_negotiator_ops *)kdata; + + moff = btf_member_bit_offset(t, member) / 8; + + /* init name */ + if (moff == offsetof(struct smc_sock_negotiator_ops, name)) { + if (bpf_obj_name_cpy(ops->name, uops->name, + sizeof(uops->name)) <= 0) + return -EINVAL; + return 1; + } + + return 0; +} + +BPF_CALL_1(bpf_smc_skc_to_tcp_sock, struct sock *, sk) +{ + if (sk && sk_fullsock(sk)) { + if (inet_sk(sk)->is_icsk) + return (unsigned long)sk; + return (unsigned long)((struct smc_sock *)(sk))->clcsock->sk; + } + return (unsigned long)NULL; +} + +const struct bpf_func_proto bpf_smc_skc_to_tcp_sock_proto = { + .func = bpf_smc_skc_to_tcp_sock, + .gpl_only = false, + .ret_type = RET_PTR_TO_BTF_ID_OR_NULL, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, + .ret_btf_id = &btf_sock_ids[BTF_SOCK_TYPE_TCP], +}; + +static const struct bpf_func_proto * +smc_negotiator_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + const struct btf_member *m; + const struct btf_type *t; + u32 midx, moff; + + midx = prog->expected_attach_type; + t = bpf_smc_sock_negotiator_ops.type; + m = &btf_type_member(t)[midx]; + + moff = btf_member_bit_offset(t, m) / 8; + + switch (func_id) { + case BPF_FUNC_skc_to_tcp_sock: + return &bpf_smc_skc_to_tcp_sock_proto; + default: + return bpf_base_func_proto(func_id); + } +} + +static bool smc_negotiator_prog_is_valid_access(int off, int size, enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS) + return false; + if (type != BPF_READ) + return false; + if (off % size != 0) + return false; + + if (!btf_ctx_access(off, size, type, prog, info)) + return false; + + /* promote it to smc_sock */ + if (base_type(info->reg_type) == PTR_TO_BTF_ID && + info->btf_id == sock_id) + info->btf_id = smc_sock_id; + + return true; +} + +static int 
smc_negotiator_prog_btf_struct_access(struct bpf_verifier_log *log, + const struct btf_type *t, int off, + int size, enum bpf_access_type atype, + u32 *next_btf_id) +{ + if (atype == BPF_READ) + return btf_struct_access(log, t, off, size, atype, next_btf_id); + return -EACCES; +} + +static const struct bpf_verifier_ops bpf_smc_negotiator_verifier_ops = { + .get_func_proto = smc_negotiator_prog_func_proto, + .is_valid_access = smc_negotiator_prog_is_valid_access, + .btf_struct_access = smc_negotiator_prog_btf_struct_access, +}; + +struct bpf_struct_ops bpf_smc_sock_negotiator_ops = { + .verifier_ops = &bpf_smc_negotiator_verifier_ops, + .init = bpf_smc_negotiator_init, + .check_member = bpf_smc_negotiator_check_member, + .init_member = bpf_smc_negotiator_init_member, + .reg = bpf_smc_negotiator_reg, + .unreg = bpf_smc_negotiator_unreg, + .name = "smc_sock_negotiator_ops", +}; diff --git a/net/smc/bpf_smc_struct_ops.c b/net/smc/bpf_smc_struct_ops.c deleted file mode 100644 index 15fd1b506a169fea6339020758c83e5a1dd3a22c..0000000000000000000000000000000000000000 --- a/net/smc/bpf_smc_struct_ops.c +++ /dev/null @@ -1,152 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 - -#include -#include -#include -#include -#include -#include -#include - -extern struct bpf_struct_ops smc_sock_negotiator_ops; - -DEFINE_RWLOCK(smc_sock_negotiator_ops_rwlock); -struct smc_sock_negotiator_ops *negotiator; - -/* convert sk to smc_sock */ -static inline struct smc_sock *smc_sk(const struct sock *sk) -{ - return (struct smc_sock *)sk; -} - -/* register ops */ -static inline void smc_reg_passive_sk_ops(struct smc_sock_negotiator_ops *ops) -{ - write_lock_bh(&smc_sock_negotiator_ops_rwlock); - negotiator = ops; - write_unlock_bh(&smc_sock_negotiator_ops_rwlock); -} - -/* unregister ops */ -static inline void smc_unreg_passive_sk_ops(struct smc_sock_negotiator_ops *ops) -{ - write_lock_bh(&smc_sock_negotiator_ops_rwlock); - if (negotiator == ops) - negotiator = NULL; - write_unlock_bh(&smc_sock_negotiator_ops_rwlock); -} - -int smc_sock_should_select_smc(const struct smc_sock *smc) -{ - int ret = SK_PASS; - - read_lock_bh(&smc_sock_negotiator_ops_rwlock); - if (negotiator && negotiator->negotiate) - ret = negotiator->negotiate((struct smc_sock *)smc); - read_unlock_bh(&smc_sock_negotiator_ops_rwlock); - return ret; -} -EXPORT_SYMBOL_GPL(smc_sock_should_select_smc); - -void smc_sock_perform_collecting_info(const struct sock *sk, int timing) -{ - read_lock_bh(&smc_sock_negotiator_ops_rwlock); - if (negotiator && negotiator->collect_info) - negotiator->collect_info((struct sock *)sk, timing); - read_unlock_bh(&smc_sock_negotiator_ops_rwlock); -} -EXPORT_SYMBOL_GPL(smc_sock_perform_collecting_info); - -/* define global smc ID for smc_struct_ops */ -BTF_ID_LIST_GLOBAL(btf_smc_ids) -#define BTF_SMC_TYPE(name, type) BTF_ID(struct, type) -BTF_SMC_TYPE(BTF_SMC_TYPE_SOCK, smc_sock) -BTF_SMC_TYPE(BTF_SMC_TYPE_CONNECTION, smc_connection) -#undef BTF_SMC_TYPE - -static int bpf_smc_passive_sk_init(struct btf *btf) -{ - return 0; -} - -/* register ops by BPF */ -static int bpf_smc_passive_sk_ops_reg(void *kdata) -{ - struct smc_sock_negotiator_ops *ops = kdata; - - /* at least one ops need implement */ - if (!ops->negotiate || !ops->collect_info) { - pr_err("At least one ops need implement.\n"); - return -EINVAL; - } - - smc_reg_passive_sk_ops(ops); - /* always success now */ - return 0; -} - -/* unregister ops by BPF */ -static void bpf_smc_passive_sk_ops_unreg(void *kdata) -{ - smc_unreg_passive_sk_ops(kdata); -} - -static int 
bpf_smc_passive_sk_ops_check_member(const struct btf_type *t, - const struct btf_member *member) -{ - return 0; -} - -static int bpf_smc_passive_sk_ops_init_member(const struct btf_type *t, - const struct btf_member *member, - void *kdata, const void *udata) -{ - return 0; -} - -static const struct bpf_func_proto * -smc_passive_sk_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) -{ - return bpf_base_func_proto(func_id); -} - -static bool smc_passive_sk_ops_prog_is_valid_access(int off, int size, enum bpf_access_type type, - const struct bpf_prog *prog, - struct bpf_insn_access_aux *info) -{ - if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS) - return false; - if (type != BPF_READ) - return false; - if (off % size != 0) - return false; - - return btf_ctx_access(off, size, type, prog, info); -} - -static int smc_passive_sk_ops_prog_struct_access(struct bpf_verifier_log *log, - const struct btf_type *t, int off, - int size, enum bpf_access_type atype, - u32 *next_btf_id) -{ - if (atype == BPF_READ) - return btf_struct_access(log, t, off, size, atype, next_btf_id); - - return -EACCES; -} - -static const struct bpf_verifier_ops bpf_smc_passive_sk_verifier_ops = { - .get_func_proto = smc_passive_sk_prog_func_proto, - .is_valid_access = smc_passive_sk_ops_prog_is_valid_access, - .btf_struct_access = smc_passive_sk_ops_prog_struct_access -}; - -struct bpf_struct_ops bpf_smc_sock_negotiator_ops = { - .verifier_ops = &bpf_smc_passive_sk_verifier_ops, - .init = bpf_smc_passive_sk_init, - .check_member = bpf_smc_passive_sk_ops_check_member, - .init_member = bpf_smc_passive_sk_ops_init_member, - .reg = bpf_smc_passive_sk_ops_reg, - .unreg = bpf_smc_passive_sk_ops_unreg, - .name = "smc_sock_negotiator_ops", -}; diff --git a/net/smc/smc.h b/net/smc/smc.h index 1be0ce2c9b681b3e0686e109887953eeae947fbb..e91c5ef1c12d0749466879584d7b22b2d94b3b63 100644 --- a/net/smc/smc.h +++ b/net/smc/smc.h @@ -21,7 +21,9 @@ #define SMC_V1 1 /* SMC version V1 */ #define SMC_V2 2 /* SMC version V2 */ -#define SMC_RELEASE 0 +#define SMC_RELEASE_0 0 +#define SMC_RELEASE_1 1 +#define SMC_RELEASE SMC_RELEASE_1 /* the latest release version */ #define SMC_MAX_ISM_DEVS 8 /* max # of proposed non-native ISM * devices */ @@ -33,6 +35,27 @@ extern struct proto smc_proto6; extern bool reserve_mode; extern u16 rsvd_ports_base; +static __always_inline bool smc_sock_is_inet_sock(struct sock *sk) +{ + return inet_sk(sk)->is_icsk; +} + +#define smc_sk_state(sk) ({ \ + struct sock *__sk = (sk); \ + smc_sock_is_inet_sock(__sk) ? 
\ + smc_sk(__sk)->smc_state : (__sk)->sk_state; \ +}) + +#define smc_sk_set_state(sk, state) do { \ + struct sock *__sk = (sk); \ + unsigned char __state = (state); \ + if (smc_sock_is_inet_sock(__sk)) \ + smc_sk(__sk)->smc_state = (__state); \ + else \ + (__sk)->sk_state = (__state); \ +} while (0) + + enum smc_state { /* possible states of an SMC socket */ SMC_ACTIVE = 1, SMC_INIT = 2, diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c index 1e806869e561603d50a950dc1aa0edaec32a25b0..c25bc9b64f3ac2f3836ff7f8fe9f52e6c61d30b4 100644 --- a/net/smc/smc_cdc.c +++ b/net/smc/smc_cdc.c @@ -32,6 +32,11 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd, struct smc_sock *smc; int diff; + if (unlikely(link->lgr->use_rwwi)) { + pr_err_once("smc: unexpected cdc msg tx work completion when rwwi enabled.\n"); + return; + } + sndbuf_desc = conn->sndbuf_desc; smc = container_of(conn, struct smc_sock, conn); bh_lock_sock(&smc->sk); @@ -69,6 +74,59 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd, bh_unlock_sock(&smc->sk); } +void smc_cdc_tx_handler_rwwi(struct ib_wc *wc) +{ + struct smc_link *link = wc->qp->qp_context; + struct smc_link_group *lgr = link->lgr; + struct smc_connection *conn = NULL; + union smc_wr_rwwi_tx_id wr_id; + struct smc_sock *smc = NULL; + int diff; + + if (unlikely(!lgr->use_rwwi)) { + pr_err_once("smc: unexpected rwwi msg tx work completion when rwwi disabled.\n"); + return; + } + + wr_id.data = wc->wr_id; + + read_lock_bh(&lgr->conns_lock); + conn = smc_lgr_find_conn(wr_id.token, lgr); + read_unlock_bh(&lgr->conns_lock); + if (!conn) + return; + + smc = container_of(conn, struct smc_sock, conn); + bh_lock_sock(&smc->sk); + + if (!wc->status) { + diff = wr_id.inflight_sent; + /* sndbuf_space is decreased in smc_sendmsg */ + smp_mb__before_atomic(); + atomic_add(diff, &conn->sndbuf_space); + /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */ + smp_mb__after_atomic(); + + smc_curs_add(conn->sndbuf_desc->len, &conn->tx_curs_fin, diff); + smc_curs_add(conn->sndbuf_desc->len, &conn->local_tx_ctrl_fin, diff); + } + + if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) { + if (sock_owned_by_user(&smc->sk)) + conn->tx_in_release_sock = true; + else + smc_tx_pending(conn); + + if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq))) + wake_up(&conn->cdc_pend_tx_wq); + } + + WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0); + + smc_tx_sndbuf_nonfull(smc); + bh_unlock_sock(&smc->sk); +} + int smc_cdc_get_free_slot(struct smc_connection *conn, struct smc_link *link, struct smc_wr_buf **wr_buf, @@ -113,25 +171,36 @@ int smc_cdc_msg_send(struct smc_connection *conn, struct smc_cdc_tx_pend *pend) { struct smc_link *link = conn->lnk; + struct smc_cdc_msg *cdc_msg = (struct smc_cdc_msg *)wr_buf; union smc_host_cursor cfed; + u8 saved_credits = 0; int rc; + if (unlikely(link->lgr->use_rwwi)) { + pr_err_once("smc: send unexpected cdc msg when rwwi enabled.\n"); + return -EINVAL; + } + smc_cdc_add_pending_send(conn, pend); conn->tx_cdc_seq++; conn->local_tx_ctrl.seqno = conn->tx_cdc_seq; - smc_host_msg_to_cdc((struct smc_cdc_msg *)wr_buf, conn, &cfed); + smc_host_msg_to_cdc(cdc_msg, conn, &cfed); + if (smc_wr_rx_credits_need_announce_frequent(link)) + saved_credits = (u8)smc_wr_rx_get_credits(link); + cdc_msg->credits = saved_credits; atomic_inc(&conn->cdc_pend_tx_wr); smp_mb__after_atomic(); /* Make sure cdc_pend_tx_wr added before post */ rc = smc_wr_tx_send(link, (struct smc_wr_tx_pend_priv *)pend); - if (!rc) { + if (likely(!rc)) { 
smc_curs_copy(&conn->rx_curs_confirmed, &cfed, conn); conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0; } else { conn->tx_cdc_seq--; conn->local_tx_ctrl.seqno = conn->tx_cdc_seq; + smc_wr_rx_put_credits(link, saved_credits); atomic_dec(&conn->cdc_pend_tx_wr); } @@ -318,19 +387,11 @@ static void smc_cdc_msg_validate(struct smc_sock *smc, struct smc_cdc_msg *cdc, } } -static void smc_cdc_msg_recv_action(struct smc_sock *smc, - struct smc_cdc_msg *cdc) +static void __smc_cdc_msg_recv_action(struct smc_sock *smc, + int diff_prod, int diff_cons) { - union smc_host_cursor cons_old, prod_old; struct smc_connection *conn = &smc->conn; - int diff_cons, diff_prod; - smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn); - smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn); - smc_cdc_msg_to_host(&conn->local_rx_ctrl, cdc, conn); - - diff_cons = smc_curs_diff(conn->peer_rmbe_size, &cons_old, - &conn->local_rx_ctrl.cons); if (diff_cons) { /* peer_rmbe_space is decreased during data transfer with RDMA * write @@ -340,9 +401,6 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc, /* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */ smp_mb__after_atomic(); } - - diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old, - &conn->local_rx_ctrl.prod); if (diff_prod) { if (conn->local_rx_ctrl.prod_flags.urg_data_present) smc_cdc_handle_urg_data_arrival(smc, &diff_prod); @@ -392,6 +450,24 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc, } } +static void smc_cdc_msg_recv_action(struct smc_sock *smc, + struct smc_cdc_msg *cdc) +{ + union smc_host_cursor cons_old, prod_old; + struct smc_connection *conn = &smc->conn; + int diff_cons, diff_prod; + + smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn); + smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn); + smc_cdc_msg_to_host(&conn->local_rx_ctrl, cdc, conn); + + diff_cons = smc_curs_diff(conn->peer_rmbe_size, &cons_old, + &conn->local_rx_ctrl.cons); + diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old, + &conn->local_rx_ctrl.prod); + __smc_cdc_msg_recv_action(smc, diff_prod, diff_cons); +} + /* called under tasklet context */ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc) { @@ -443,11 +519,19 @@ static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf) struct smc_link_group *lgr; struct smc_sock *smc; + if (unlikely(link->lgr->use_rwwi)) { + pr_err_once("smc: recv unexpected cdc msg when rwwi enabled.\n"); + return; + } + if (wc->byte_len < offsetof(struct smc_cdc_msg, reserved)) return; /* short message */ if (cdc->len != SMC_WR_TX_SIZE) return; /* invalid message */ + if (cdc->credits) + smc_wr_tx_put_credits(link, cdc->credits, true); + /* lookup connection */ lgr = smc_get_lgr(link); read_lock_bh(&lgr->conns_lock); @@ -469,6 +553,158 @@ static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf) smc_cdc_msg_recv(smc, cdc); } +static void smc_cdc_handle_rwwi_data_msg(struct smc_sock *smc, + union smc_wr_imm_msg *imm_msg, int diff_prod) +{ + struct smc_connection *conn = &smc->conn; + int diff_cons; + + diff_cons = imm_msg->data.diff_cons; + if (diff_cons) + smc_curs_add(conn->peer_rmbe_size, &conn->local_rx_ctrl.cons, diff_cons); + /* cause this imm_data contains no conn_state_flags and prod_flags info, clean them */ + memset(&conn->local_rx_ctrl.conn_state_flags, 0, + sizeof(struct smc_cdc_conn_state_flags)); + memset(&conn->local_rx_ctrl.prod_flags, 0, + sizeof(struct smc_cdc_producer_flags)); + + __smc_cdc_msg_recv_action(smc, diff_prod, diff_cons); +} + +static void 
smc_cdc_handle_rwwi_data_with_flags_msg(struct smc_sock *smc, + union smc_wr_imm_msg *imm_msg, int diff_prod) +{ + struct smc_connection *conn = &smc->conn; + struct smc_cdc_producer_flags *pflags; + int diff_cons; + + diff_cons = imm_msg->data_with_flags.diff_cons; + if (diff_cons) + smc_curs_add(conn->peer_rmbe_size, &conn->local_rx_ctrl.cons, diff_cons); + /* clean prod_flags that are not carried by this imm_data */ + memset(&conn->local_rx_ctrl.prod_flags, 0, + sizeof(struct smc_cdc_producer_flags)); + pflags = &conn->local_rx_ctrl.prod_flags; + pflags->write_blocked = imm_msg->data_with_flags.write_blocked; + pflags->urg_data_present = imm_msg->data_with_flags.urg_data_present; + pflags->urg_data_pending = imm_msg->data_with_flags.urg_data_pending; + /* cause this imm_data contains no conn_state_flags info, clean it */ + memset(&conn->local_rx_ctrl.conn_state_flags, 0, + sizeof(struct smc_cdc_conn_state_flags)); + + __smc_cdc_msg_recv_action(smc, diff_prod, diff_cons); +} + +static void smc_cdc_handle_rwwi_data_cr_msg(struct smc_sock *smc, + union smc_wr_imm_msg *imm_msg, int diff_prod) +{ + struct smc_connection *conn = &smc->conn; + int diff_cons; + + if (imm_msg->data_cr.credits) + smc_wr_tx_put_credits(conn->lnk, imm_msg->data_cr.credits, true); + + diff_cons = imm_msg->data_cr.diff_cons; + if (diff_cons) + smc_curs_add(conn->peer_rmbe_size, &conn->local_rx_ctrl.cons, diff_cons); + /* cause this imm_data contains no conn_state_flags and prod_flags info, clean them */ + memset(&conn->local_rx_ctrl.conn_state_flags, 0, + sizeof(struct smc_cdc_conn_state_flags)); + memset(&conn->local_rx_ctrl.prod_flags, 0, + sizeof(struct smc_cdc_producer_flags)); + + __smc_cdc_msg_recv_action(smc, diff_prod, diff_cons); +} + +static void smc_cdc_handle_rwwi_data_with_flags_cr_msg(struct smc_sock *smc, + union smc_wr_imm_msg *imm_msg, int diff_prod) +{ + struct smc_connection *conn = &smc->conn; + struct smc_cdc_producer_flags *pflags; + int diff_cons; + + if (imm_msg->data_with_flags_cr.credits) + smc_wr_tx_put_credits(conn->lnk, imm_msg->data_with_flags_cr.credits, true); + + diff_cons = imm_msg->data_with_flags_cr.diff_cons; + if (diff_cons) + smc_curs_add(conn->peer_rmbe_size, &conn->local_rx_ctrl.cons, diff_cons); + /* clean prod_flags that are not carried by this imm_data */ + memset(&conn->local_rx_ctrl.prod_flags, 0, + sizeof(struct smc_cdc_producer_flags)); + pflags = &conn->local_rx_ctrl.prod_flags; + pflags->write_blocked = imm_msg->data_with_flags_cr.write_blocked; + pflags->urg_data_present = imm_msg->data_with_flags_cr.urg_data_present; + pflags->urg_data_pending = imm_msg->data_with_flags_cr.urg_data_pending; + /* cause this imm_data contains no conn_state_flags info, clean it */ + memset(&conn->local_rx_ctrl.conn_state_flags, 0, + sizeof(struct smc_cdc_conn_state_flags)); + + __smc_cdc_msg_recv_action(smc, diff_prod, diff_cons); +} + +static void smc_cdc_handle_rwwi_ctrl_msg(struct smc_sock *smc, + union smc_wr_imm_msg *imm_msg, int diff_prod) +{ + struct smc_connection *conn = &smc->conn; + + conn->local_rx_ctrl.prod_flags = imm_msg->ctrl.pflags; + conn->local_rx_ctrl.conn_state_flags = imm_msg->ctrl.csflags; + /* this imm_data contains no diff_cons info, clean it */ + __smc_cdc_msg_recv_action(smc, diff_prod, 0); +} + +void smc_cdc_rx_handler_rwwi(struct ib_wc *wc) +{ + struct smc_link *link = wc->qp->qp_context; + struct smc_link_group *lgr = link->lgr; + struct smc_connection *conn = NULL; + union smc_wr_imm_msg imm_msg; + struct smc_sock *smc = NULL; + int diff_prod; + + if
(unlikely(!link->lgr->use_rwwi)) { + pr_err_once("smc: recv unexpected rwwi msg when rwwi disabled.\n"); + return; + } + + imm_msg.imm_data = be32_to_cpu(wc->ex.imm_data); + read_lock_bh(&lgr->conns_lock); + conn = smc_lgr_find_conn(imm_msg.hdr.token, lgr); + read_unlock_bh(&lgr->conns_lock); + if (!conn) + return; + + smc = container_of(conn, struct smc_sock, conn); + + sock_hold(&smc->sk); + bh_lock_sock(&smc->sk); + diff_prod = wc->byte_len; + if (diff_prod) + smc_curs_add(conn->rmb_desc->len, &conn->local_rx_ctrl.prod, diff_prod); + + switch (imm_msg.hdr.opcode) { + case SMC_WR_OP_DATA: + smc_cdc_handle_rwwi_data_msg(smc, &imm_msg, diff_prod); + break; + case SMC_WR_OP_DATA_WITH_FLAGS: + smc_cdc_handle_rwwi_data_with_flags_msg(smc, &imm_msg, diff_prod); + break; + case SMC_WR_OP_CTRL: + smc_cdc_handle_rwwi_ctrl_msg(smc, &imm_msg, diff_prod); + break; + case SMC_WR_OP_DATA_CR: + smc_cdc_handle_rwwi_data_cr_msg(smc, &imm_msg, diff_prod); + break; + case SMC_WR_OP_DATA_WITH_FLAGS_CR: + smc_cdc_handle_rwwi_data_with_flags_cr_msg(smc, &imm_msg, diff_prod); + break; + } + + bh_unlock_sock(&smc->sk); + sock_put(&smc->sk); /* no free sk in softirq-context */ +} + static struct smc_wr_rx_handler smc_cdc_rx_handlers[] = { { .handler = smc_cdc_rx_handler, diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h index 696cc11f2303b95318f6750479bb8abffde3ca24..039c57ebb2b16459676bed0be9d72cb7dea1958d 100644 --- a/net/smc/smc_cdc.h +++ b/net/smc/smc_cdc.h @@ -47,7 +47,8 @@ struct smc_cdc_msg { union smc_cdc_cursor cons; /* piggy backed "ack" */ struct smc_cdc_producer_flags prod_flags; struct smc_cdc_conn_state_flags conn_state_flags; - u8 reserved[18]; + u8 credits; /* credits synced by every cdc msg */ + u8 reserved[17]; }; /* SMC-D cursor format */ @@ -301,5 +302,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn, struct smc_wr_buf *wr_buf); int smc_cdc_init(void) __init; void smcd_cdc_rx_init(struct smc_connection *conn); - +void smc_cdc_rx_handler_rwwi(struct ib_wc *wc); +void smc_cdc_tx_handler_rwwi(struct ib_wc *wc); #endif /* SMC_CDC_H */ diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c index 026a5078acfd3d4ec94fcda502fd2d0ffc37aaa1..48945395ae524e6addca2c241e263a4fa05f9d19 100644 --- a/net/smc/smc_clc.c +++ b/net/smc/smc_clc.c @@ -39,6 +39,9 @@ static const char SMC_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xd9'}; /* eye catcher "SMCD" EBCDIC for CLC messages */ static const char SMCD_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xc4'}; +/* ALIBABA OUI */ +static const u8 SMC_VENDOR_OUI_ALIBABA[3] = {0xFC, 0xA6, 0x4C}; + static u8 smc_hostname[SMC_MAX_HOSTNAME_LEN]; struct smc_clc_eid_table { @@ -391,9 +394,7 @@ smc_clc_msg_acc_conf_valid(struct smc_clc_msg_accept_confirm_v2 *clc_v2) return false; } else { if (hdr->typev1 == SMC_TYPE_D && - ntohs(hdr->length) != SMCD_CLC_ACCEPT_CONFIRM_LEN_V2 && - (ntohs(hdr->length) != SMCD_CLC_ACCEPT_CONFIRM_LEN_V2 + - sizeof(struct smc_clc_first_contact_ext))) + ntohs(hdr->length) < SMCD_CLC_ACCEPT_CONFIRM_LEN_V2) return false; if (hdr->typev1 == SMC_TYPE_R && ntohs(hdr->length) < SMCR_CLC_ACCEPT_CONFIRM_LEN_V2) @@ -420,13 +421,35 @@ smc_clc_msg_decl_valid(struct smc_clc_msg_decline *dclc) return true; } -static void smc_clc_fill_fce(struct smc_clc_first_contact_ext *fce, int *len) +static int smc_clc_fill_fce(struct smc_clc_first_contact_ext_v2x *fce, + struct smc_init_info *ini) { + int ret = sizeof(*fce); + memset(fce, 0, sizeof(*fce)); - fce->os_type = SMC_CLC_OS_LINUX; - fce->release = SMC_RELEASE; - memcpy(fce->hostname, smc_hostname, 
sizeof(smc_hostname)); - (*len) += sizeof(*fce); + fce->fce_v20.os_type = SMC_CLC_OS_LINUX; + fce->fce_v20.release = ini->release_ver; + memcpy(fce->fce_v20.hostname, smc_hostname, sizeof(smc_hostname)); + if (ini->is_smcd && ini->release_ver < SMC_RELEASE_1) { + ret = sizeof(struct smc_clc_first_contact_ext); + goto out; + } + + if (ini->release_ver >= SMC_RELEASE_1) { + if (!ini->is_smcd) { + fce->max_conns = ini->max_conns; + fce->max_links = ini->max_links; + } + + if (ini->vendor_opt_valid) { + fce->vendor_exp_options.valid = 1; + fce->vendor_exp_options.credits_en = ini->credits_en; + fce->vendor_exp_options.rwwi_en = ini->rwwi_en; + } + } + +out: + return ret; } /* check if received message has a correct header length and contains valid @@ -807,6 +830,24 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info, u8 version) return len > 0 ? 0 : len; } +inline struct smc_clc_vendor_opt_ali +smc_clc_vendor_opts_ali_get_config(struct smc_sock *smc) +{ + /* smc_clc_vendor_opt_ali is defined in network byte order, + * but sysctl_vendor_exp_options is assigned in host byte order. + * So convert sysctl_vendor_exp_options to __be32 so that + * sysctl_vendor_exp_options can be configured compatibly on + * either a big-endian or a little-endian host. + */ + __be32 vendor_opts_val = + cpu_to_be32(sock_net(&smc->sk)->smc.sysctl_vendor_exp_options); + + BUILD_BUG_ON_MSG(sizeof(struct smc_clc_vendor_opt_ali) > sizeof(unsigned int), + "struct smc_clc_vendor_opt_ali size cannot exceed 4 bytes"); + + return *((struct smc_clc_vendor_opt_ali *)&vendor_opts_val); +} + /* send CLC PROPOSAL message across internal TCP socket */ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) { @@ -819,6 +860,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) struct smc_clc_v2_extension *v2_ext; struct smc_clc_msg_smcd *pclc_smcd; struct smc_clc_msg_trail *trl; + struct smcd_dev *smcd; int len, i, plen, rc; int reason_code = 0; struct kvec vec[8]; @@ -874,7 +916,9 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) if (smcd_indicated(ini->smc_type_v1)) { /* add SMC-D specifics */ if (ini->ism_dev[0]) { - pclc_smcd->ism.gid = htonll(ini->ism_dev[0]->local_gid); + smcd = ini->ism_dev[0]; + pclc_smcd->ism.gid = + htonll(smcd->ops->get_local_gid(smcd)); pclc_smcd->ism.chid = htons(smc_ism_get_chid(ini->ism_dev[0])); } @@ -884,6 +928,8 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) } else { struct smc_clc_eid_entry *ueident; u16 v2_ext_offset; + struct smc_clc_vendor_opt_ali vendor_config = + smc_clc_vendor_opts_ali_get_config(smc); v2_ext->hdr.flag.release = SMC_RELEASE; v2_ext_offset = sizeof(*pclc_smcd) - @@ -893,6 +939,16 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) pclc_prfx->ipv6_prefixes_cnt * sizeof(ipv6_prfx[0]); pclc_smcd->v2_ext_offset = htons(v2_ext_offset); + + if (vendor_config.valid) { + memcpy(pclc_smcd->vendor_oui, SMC_VENDOR_OUI_ALIBABA, + sizeof(SMC_VENDOR_OUI_ALIBABA)); + pclc_smcd->vendor_exp_options.valid = 1; + if (vendor_config.credits_en) + pclc_smcd->vendor_exp_options.credits_en = 1; + if (vendor_config.rwwi_en) + pclc_smcd->vendor_exp_options.rwwi_en = 1; + } plen += sizeof(*v2_ext); read_lock(&smc_clc_eid_table.lock); @@ -920,8 +976,9 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) plen += sizeof(*smcd_v2_ext); if (ini->ism_offered_cnt) { for (i = 1; i <= ini->ism_offered_cnt; i++) { + smcd = ini->ism_dev[i]; gidchids[i -
1].gid = - htonll(ini->ism_dev[i]->local_gid); + htonll(smcd->ops->get_local_gid(smcd)); gidchids[i - 1].chid = htons(smc_ism_get_chid(ini->ism_dev[i])); } @@ -929,8 +986,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) sizeof(struct smc_clc_smcd_gid_chid); } } - if (smcr_indicated(ini->smc_type_v2)) + if (smcr_indicated(ini->smc_type_v2)) { memcpy(v2_ext->roce, ini->smcrv2.ib_gid_v2, SMC_GID_SIZE); + v2_ext->max_conns = SMC_CONN_PER_LGR_MAX; + v2_ext->max_links = SMC_LINKS_PER_LGR_PREFER; + } pclc_base->hdr.length = htons(plen); memcpy(trl->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)); @@ -989,12 +1049,12 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, { struct smc_connection *conn = &smc->conn; struct smc_clc_msg_accept_confirm *clc; - struct smc_clc_first_contact_ext fce; + struct smc_clc_first_contact_ext_v2x fce; struct smc_clc_fce_gid_ext gle; struct smc_clc_msg_trail trl; struct kvec vec[5]; struct msghdr msg; - int i, len; + int i, len, fce_len; /* send SMC Confirm CLC msg */ clc = (struct smc_clc_msg_accept_confirm *)clc_v2; @@ -1006,7 +1066,8 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, memcpy(clc->hdr.eyecatcher, SMCD_EYECATCHER, sizeof(SMCD_EYECATCHER)); clc->hdr.typev1 = SMC_TYPE_D; - clc->d0.gid = conn->lgr->smcd->local_gid; + clc->d0.gid = + conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd); clc->d0.token = conn->rmb_desc->token; clc->d0.dmbe_size = conn->rmbe_size_short; clc->d0.dmbe_idx = 0; @@ -1019,8 +1080,10 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, if (eid && eid[0]) memcpy(clc_v2->d1.eid, eid, SMC_MAX_EID_LEN); len = SMCD_CLC_ACCEPT_CONFIRM_LEN_V2; - if (first_contact) - smc_clc_fill_fce(&fce, &len); + if (first_contact) { + fce_len = smc_clc_fill_fce(&fce, ini); + len += fce_len; + } clc_v2->hdr.length = htons(len); } memcpy(trl.eyecatcher, SMCD_EYECATCHER, @@ -1046,9 +1109,13 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, switch (clc->hdr.type) { case SMC_CLC_ACCEPT: clc->r0.qp_mtu = link->path_mtu; + if (first_contact && ini->vendor_opt_valid && ini->credits_en) + clc->r0.init_credits = (u8)link->wr_rx_cnt; break; case SMC_CLC_CONFIRM: clc->r0.qp_mtu = min(link->path_mtu, link->peer_mtu); + if (first_contact && link->credits_enable) + clc->r0.init_credits = (u8)link->wr_rx_cnt; break; } clc->r0.rmbe_size = conn->rmbe_size_short; @@ -1064,15 +1131,14 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, memcpy(clc_v2->r1.eid, eid, SMC_MAX_EID_LEN); len = SMCR_CLC_ACCEPT_CONFIRM_LEN_V2; if (first_contact) { - smc_clc_fill_fce(&fce, &len); - fce.v2_direct = !link->lgr->uses_gateway; - memset(&gle, 0, sizeof(gle)); + fce_len = smc_clc_fill_fce(&fce, ini); + len += fce_len; + fce.fce_v20.v2_direct = !link->lgr->uses_gateway; if (ini && clc->hdr.type == SMC_CLC_CONFIRM) { + memset(&gle, 0, sizeof(gle)); gle.gid_cnt = ini->smcrv2.gidlist.len; len += sizeof(gle); len += gle.gid_cnt * sizeof(gle.gid[0]); - } else { - len += sizeof(gle.reserved); } } clc_v2->hdr.length = htons(len); @@ -1095,7 +1161,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, sizeof(trl); if (version > SMC_V1 && first_contact) { vec[i].iov_base = &fce; - vec[i++].iov_len = sizeof(fce); + vec[i++].iov_len = fce_len; if (!conn->lgr->is_smcd) { if (clc->hdr.type == SMC_CLC_CONFIRM) { vec[i].iov_base = &gle; @@ -1103,9 +1169,6 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, vec[i].iov_base = &ini->smcrv2.gidlist.list; vec[i++].iov_len = gle.gid_cnt * 
sizeof(gle.gid[0]); - } else { - vec[i].iov_base = &gle.reserved; - vec[i++].iov_len = sizeof(gle.reserved); } } } @@ -1142,7 +1205,7 @@ int smc_clc_send_confirm(struct smc_sock *smc, bool clnt_first_contact, /* send CLC ACCEPT message across internal TCP socket */ int smc_clc_send_accept(struct smc_sock *new_smc, bool srv_first_contact, - u8 version, u8 *negotiated_eid) + u8 version, u8 *negotiated_eid, struct smc_init_info *ini) { struct smc_clc_msg_accept_confirm_v2 aclc_v2; int len; @@ -1150,13 +1213,153 @@ int smc_clc_send_accept(struct smc_sock *new_smc, bool srv_first_contact, memset(&aclc_v2, 0, sizeof(aclc_v2)); aclc_v2.hdr.type = SMC_CLC_ACCEPT; len = smc_clc_send_confirm_accept(new_smc, &aclc_v2, srv_first_contact, - version, negotiated_eid, NULL); + version, negotiated_eid, ini); if (len < ntohs(aclc_v2.hdr.length)) len = len >= 0 ? -EPROTO : -new_smc->clcsock->sk->sk_err; return len > 0 ? 0 : len; } +void smc_clc_vendor_opt_validate(struct smc_sock *smc, + struct smc_clc_msg_proposal *pclc, + struct smc_init_info *ini) +{ + struct smc_clc_msg_smcd *prop_smcd = smc_get_clc_msg_smcd(pclc); + struct smc_clc_vendor_opt_ali vendor_config = + smc_clc_vendor_opts_ali_get_config(smc); + + if (!prop_smcd || !vendor_config.valid) + return; + + if (memcmp(prop_smcd->vendor_oui, SMC_VENDOR_OUI_ALIBABA, + sizeof(SMC_VENDOR_OUI_ALIBABA))) + return; + + if (!prop_smcd->vendor_exp_options.valid) + return; + + ini->vendor_opt_valid = 1; + + if (vendor_config.credits_en) + ini->credits_en = prop_smcd->vendor_exp_options.credits_en; + else + ini->credits_en = 0; + + if (vendor_config.rwwi_en) + ini->rwwi_en = prop_smcd->vendor_exp_options.rwwi_en; + else + ini->rwwi_en = 0; +} + +int smc_clc_srv_v2x_features_validate(struct smc_sock *smc, + struct smc_clc_msg_proposal *pclc, + struct smc_init_info *ini) +{ + struct smc_clc_v2_extension *pclc_v2_ext; + + /* default max conn is SMC_RMBS_PER_LGR_MAX(255), + * which is the default value in smc v1 and v2.0. 
+ */ + ini->max_conns = SMC_RMBS_PER_LGR_MAX; + ini->max_links = SMC_LINKS_ADD_LNK_MAX; + ini->vendor_opt_valid = 0; + + if ((!(ini->smcd_version & SMC_V2) && !(ini->smcr_version & SMC_V2)) || + ini->release_ver < SMC_RELEASE_1) + return 0; + + pclc_v2_ext = smc_get_clc_v2_ext(pclc); + if (!pclc_v2_ext) + return SMC_CLC_DECL_NOV2EXT; + + if (ini->smcr_version & SMC_V2) { + ini->max_conns = min_t(u8, pclc_v2_ext->max_conns, SMC_CONN_PER_LGR_MAX); + if (!ini->max_conns) + return SMC_CLC_DECL_MAXCONNERR; + + ini->max_links = min_t(u8, pclc_v2_ext->max_links, SMC_LINKS_PER_LGR_PREFER); + if (!ini->max_links) + return SMC_CLC_DECL_MAXLINKERR; + } + + smc_clc_vendor_opt_validate(smc, pclc, ini); + + return 0; +} + +int smc_clc_cli_v2x_features_validate(struct smc_sock *smc, + struct smc_clc_first_contact_ext *fce, + struct smc_init_info *ini) +{ + struct smc_clc_first_contact_ext_v2x *fce_v2x = + (struct smc_clc_first_contact_ext_v2x *)fce; + struct smc_clc_vendor_opt_ali vendor_config = + smc_clc_vendor_opts_ali_get_config(smc); + + if (ini->release_ver < SMC_RELEASE_1) + return 0; + + if (!ini->is_smcd) { + if (fce_v2x->max_conns > SMC_CONN_PER_LGR_MAX) + return SMC_CLC_DECL_MAXCONNERR; + ini->max_conns = fce_v2x->max_conns; + + if (fce_v2x->max_links > SMC_LINKS_ADD_LNK_MAX) + return SMC_CLC_DECL_MAXLINKERR; + ini->max_links = fce_v2x->max_links; + } + + if ((!vendor_config.valid && fce_v2x->vendor_exp_options.valid) || + (!vendor_config.credits_en && + fce_v2x->vendor_exp_options.credits_en) || + (!vendor_config.rwwi_en && fce_v2x->vendor_exp_options.rwwi_en)) + return SMC_CLC_DECL_VENDORERR; + + if (fce_v2x->vendor_exp_options.valid) { + ini->vendor_opt_valid = 1; + ini->credits_en = fce_v2x->vendor_exp_options.credits_en; + ini->rwwi_en = fce_v2x->vendor_exp_options.rwwi_en; + } + + return 0; +} + +int smc_clc_v2x_features_confirm_check(struct smc_clc_msg_accept_confirm *cclc, + struct smc_init_info *ini) +{ + struct smc_clc_msg_accept_confirm_v2 *clc_v2 = + (struct smc_clc_msg_accept_confirm_v2 *)cclc; + struct smc_clc_first_contact_ext *fce = + smc_get_clc_first_contact_ext(clc_v2, ini->is_smcd); + struct smc_clc_first_contact_ext_v2x *fce_v2x = + (struct smc_clc_first_contact_ext_v2x *)fce; + + if (cclc->hdr.version == SMC_V1 || + !(cclc->hdr.typev2 & SMC_FIRST_CONTACT_MASK)) + return 0; + + if (ini->release_ver != fce->release) + return SMC_CLC_DECL_RELEASEERR; + + if (fce->release < SMC_RELEASE_1) + return 0; + + if (!ini->is_smcd) { + if (fce_v2x->max_conns != ini->max_conns) + return SMC_CLC_DECL_MAXCONNERR; + if (fce_v2x->max_links != ini->max_links) + return SMC_CLC_DECL_MAXLINKERR; + } + + if (ini->vendor_opt_valid && + (!fce_v2x->vendor_exp_options.valid || + fce_v2x->vendor_exp_options.credits_en != ini->credits_en || + fce_v2x->vendor_exp_options.rwwi_en != ini->rwwi_en)) + return SMC_CLC_DECL_VENDORERR; + + return 0; +} + void smc_clc_get_hostname(u8 **host) { *host = &smc_hostname[0]; @@ -1164,6 +1367,8 @@ void smc_clc_get_hostname(u8 **host) void __init smc_clc_init(void) { + static const char def_ueid[] = "SMCV2-DEFAULT-UEID"; + char ueid[SMC_MAX_EID_LEN + 1] = { 0 }; struct new_utsname *u; memset(smc_hostname, _S, sizeof(smc_hostname)); /* ASCII blanks */ @@ -1175,6 +1380,10 @@ void __init smc_clc_init(void) rwlock_init(&smc_clc_eid_table.lock); smc_clc_eid_table.ueid_cnt = 0; smc_clc_eid_table.seid_enabled = 1; + + memset(ueid, ' ', SMC_MAX_EID_LEN); /* fill with space */ + memcpy(ueid, def_ueid, strlen(def_ueid)); + smc_clc_ueid_add(ueid); } void smc_clc_exit(void) diff 
--git a/net/smc/smc_clc.h b/net/smc/smc_clc.h index 5fee545c9a1096e9564905062ce2569f58de080a..1edd211a42f965f25abfef44752ce1a525ce404f 100644 --- a/net/smc/smc_clc.h +++ b/net/smc/smc_clc.h @@ -35,6 +35,7 @@ #define SMC_CLC_DECL_TIMEOUT_AL 0x02020000 /* timeout w4 QP add link */ #define SMC_CLC_DECL_CNFERR 0x03000000 /* configuration error */ #define SMC_CLC_DECL_PEERNOSMC 0x03010000 /* peer did not indicate SMC */ +#define SMC_CLC_DECL_ACTIVE 0x03010001 /* local active fallback */ #define SMC_CLC_DECL_IPSEC 0x03020000 /* IPsec usage */ #define SMC_CLC_DECL_NOSMCDEV 0x03030000 /* no SMC device found (R or D) */ #define SMC_CLC_DECL_NOSMCDDEV 0x03030001 /* no SMC-D device found */ @@ -45,6 +46,10 @@ #define SMC_CLC_DECL_NOSEID 0x03030006 /* peer sent no SEID */ #define SMC_CLC_DECL_NOSMCD2DEV 0x03030007 /* no SMC-Dv2 device found */ #define SMC_CLC_DECL_NOUEID 0x03030008 /* peer sent no UEID */ +#define SMC_CLC_DECL_RELEASEERR 0x03030009 /* release version negotiate failed */ +#define SMC_CLC_DECL_MAXCONNERR 0x0303000a /* max connections negotiate failed */ +#define SMC_CLC_DECL_MAXLINKERR 0x0303000b /* max links negotiate failed */ +#define SMC_CLC_DECL_VENDORERR 0x0303000c /* vendor opts negotiate failed */ #define SMC_CLC_DECL_MODEUNSUPP 0x03040000 /* smc modes do not match (R or D)*/ #define SMC_CLC_DECL_RMBE_EC 0x03050000 /* peer has eyecatcher in RMBE */ #define SMC_CLC_DECL_OPTUNSUPP 0x03060000 /* fastopen sockopt not supported */ @@ -63,6 +68,7 @@ #define SMC_CLC_DECL_ERR_RTOK 0x09990001 /* rtoken handling failed */ #define SMC_CLC_DECL_ERR_RDYLNK 0x09990002 /* ib ready link failed */ #define SMC_CLC_DECL_ERR_REGBUF 0x09990003 /* reg rdma bufs failed */ +#define SMC_CLC_DECL_CREDITSERR 0x09990004 /* announce credits failed */ #define SMC_FIRST_CONTACT_MASK 0b10 /* first contact bit within typev2 */ @@ -133,7 +139,9 @@ struct smc_clc_smcd_gid_chid { struct smc_clc_v2_extension { struct smc_clnt_opts_area_hdr hdr; u8 roce[16]; /* RoCEv2 GID */ - u8 reserved[16]; + u8 max_conns; + u8 max_links; + u8 reserved[14]; u8 user_eids[][SMC_MAX_EID_LEN]; }; @@ -144,10 +152,29 @@ struct smc_clc_msg_proposal_prefix { /* prefix part of clc proposal message*/ u8 ipv6_prefixes_cnt; /* number of IPv6 prefixes in prefix array */ } __aligned(4); +/* Alibaba vendor experimental options */ +struct smc_clc_vendor_opt_ali { +#if defined(__BIG_ENDIAN_BITFIELD) + u8 valid : 1, + credits_en : 1, + rwwi_en : 1, + reserved0 : 5; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u8 reserved0 : 5, + rwwi_en : 1, + credits_en : 1, + valid : 1; +#endif + u8 reserved[3]; +}; + struct smc_clc_msg_smcd { /* SMC-D GID information */ struct smc_clc_smcd_gid_chid ism; /* ISM native GID+CHID of requestor */ __be16 v2_ext_offset; /* SMC Version 2 Extension Offset */ - u8 reserved[28]; + u8 vendor_oui[3]; + u8 reserved0; + struct smc_clc_vendor_opt_ali vendor_exp_options; + u8 reserved[20]; }; struct smc_clc_smcd_v2_extension { @@ -190,7 +217,7 @@ struct smcr_clc_msg_accept_confirm { /* SMCR accept/confirm */ u8 qp_mtu : 4, rmbe_size : 4; #endif - u8 reserved; + u8 init_credits; /* QP rq init credits for rq flowctrl */ __be64 rmb_dma_addr; /* RMB virtual address */ u8 reserved2; u8 psn[3]; /* packet sequence number */ @@ -231,8 +258,16 @@ struct smc_clc_first_contact_ext { u8 hostname[SMC_MAX_HOSTNAME_LEN]; }; +struct smc_clc_first_contact_ext_v2x { + struct smc_clc_first_contact_ext fce_v20; + u8 max_conns; /* for SMC-R only */ + u8 max_links; /* for SMC-R only */ + u8 reserved3[2]; + struct smc_clc_vendor_opt_ali 
vendor_exp_options; + u8 reserved4[8]; +} __packed; + struct smc_clc_fce_gid_ext { - u8 reserved[16]; u8 gid_cnt; u8 reserved2[3]; u8 gid[][SMC_GID_SIZE]; @@ -370,6 +405,27 @@ smc_get_clc_smcd_v2_ext(struct smc_clc_v2_extension *prop_v2ext) ntohs(prop_v2ext->hdr.smcd_v2_ext_offset)); } +static inline struct smc_clc_first_contact_ext * +smc_get_clc_first_contact_ext(struct smc_clc_msg_accept_confirm_v2 *clc_v2, + bool is_smcd) +{ + int clc_v2_len; + + if (clc_v2->hdr.version == SMC_V1 || + !(clc_v2->hdr.typev2 & SMC_FIRST_CONTACT_MASK)) + return NULL; + + if (is_smcd) + clc_v2_len = + offsetofend(struct smc_clc_msg_accept_confirm_v2, d1); + else + clc_v2_len = + offsetofend(struct smc_clc_msg_accept_confirm_v2, r1); + + return (struct smc_clc_first_contact_ext *)(((u8 *)clc_v2) + + clc_v2_len); +} + struct smcd_dev; struct smc_init_info; @@ -382,7 +438,15 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini); int smc_clc_send_confirm(struct smc_sock *smc, bool clnt_first_contact, u8 version, u8 *eid, struct smc_init_info *ini); int smc_clc_send_accept(struct smc_sock *smc, bool srv_first_contact, - u8 version, u8 *negotiated_eid); + u8 version, u8 *negotiated_eid, struct smc_init_info *ini); +int smc_clc_srv_v2x_features_validate(struct smc_sock *smc, + struct smc_clc_msg_proposal *pclc, + struct smc_init_info *ini); +int smc_clc_cli_v2x_features_validate(struct smc_sock *smc, + struct smc_clc_first_contact_ext *fce, + struct smc_init_info *ini); +int smc_clc_v2x_features_confirm_check(struct smc_clc_msg_accept_confirm *cclc, + struct smc_init_info *ini); void smc_clc_init(void) __init; void smc_clc_exit(void); void smc_clc_get_hostname(u8 **host); diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c index 31db7438857c9f81fd336bfce74c5c92b7d275c6..ebc2d040cc6919b9a47b27eaf1f206a656131c0f 100644 --- a/net/smc/smc_close.c +++ b/net/smc/smc_close.c @@ -17,14 +17,19 @@ #include "smc.h" #include "smc_tx.h" +#include "smc_wr.h" #include "smc_cdc.h" #include "smc_close.h" +#include "smc_inet.h" /* release the clcsock that is assigned to the smc_sock */ void smc_clcsock_release(struct smc_sock *smc) { struct socket *tcp; + if (smc_sock_is_inet_sock(&smc->sk)) + return; + if (smc->listen_smc && current_work() != &smc->smc_listen_work) cancel_work_sync(&smc->smc_listen_work); mutex_lock(&smc->clcsock_release_lock); @@ -67,8 +72,8 @@ static void smc_close_stream_wait(struct smc_sock *smc, long timeout) rc = sk_wait_event(sk, &timeout, !smc_tx_prepared_sends(&smc->conn) || - sk->sk_err == ECONNABORTED || - sk->sk_err == ECONNRESET || + READ_ONCE(sk->sk_err) == ECONNABORTED || + READ_ONCE(sk->sk_err) == ECONNRESET || smc->conn.killed, &wait); if (rc) @@ -89,7 +94,10 @@ static int smc_close_wr(struct smc_connection *conn) { conn->local_tx_ctrl.conn_state_flags.peer_done_writing = 1; - return smc_cdc_get_slot_and_msg_send(conn); + if (conn->lgr->use_rwwi) + return smc_tx_rdma_write_with_no_data_rwwi(conn); + else + return smc_cdc_get_slot_and_msg_send(conn); } static int smc_close_final(struct smc_connection *conn) @@ -101,14 +109,20 @@ static int smc_close_final(struct smc_connection *conn) if (conn->killed) return -EPIPE; - return smc_cdc_get_slot_and_msg_send(conn); + if (conn->lgr->use_rwwi) + return smc_tx_rdma_write_with_no_data_rwwi(conn); + else + return smc_cdc_get_slot_and_msg_send(conn); } int smc_close_abort(struct smc_connection *conn) { conn->local_tx_ctrl.conn_state_flags.peer_conn_abort = 1; - return smc_cdc_get_slot_and_msg_send(conn); + if (conn->lgr->use_rwwi) 
+ return smc_tx_rdma_write_with_no_data_rwwi(conn); + else + return smc_cdc_get_slot_and_msg_send(conn); } static void smc_close_cancel_work(struct smc_sock *smc) @@ -129,41 +143,49 @@ void smc_close_active_abort(struct smc_sock *smc) struct sock *sk = &smc->sk; bool release_clcsock = false; - if (sk->sk_state != SMC_INIT && smc->clcsock && smc->clcsock->sk) { - sk->sk_err = ECONNABORTED; - if (smc->clcsock && smc->clcsock->sk) + if (smc_sk_state(sk) != SMC_INIT) { + /* sock locked */ + if (smc_sock_is_inet_sock(sk)) { + sk->sk_err = ECONNABORTED; + /* This barrier is coupled with smp_rmb() in tcp_poll() */ + smp_wmb(); + sk->sk_error_report(sk); + } else if (smc->clcsock && smc->clcsock->sk) { + sk->sk_err = ECONNABORTED; tcp_abort(smc->clcsock->sk, ECONNABORTED); + } } - switch (sk->sk_state) { + + switch (smc_sk_state(sk)) { case SMC_ACTIVE: case SMC_APPCLOSEWAIT1: case SMC_APPCLOSEWAIT2: - sk->sk_state = SMC_PEERABORTWAIT; + smc_sk_set_state(sk, SMC_PEERABORTWAIT); smc_close_cancel_work(smc); - if (sk->sk_state != SMC_PEERABORTWAIT) + if (smc_sk_state(sk) != SMC_PEERABORTWAIT) break; - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sock_put(sk); /* (postponed) passive closing */ break; case SMC_PEERCLOSEWAIT1: case SMC_PEERCLOSEWAIT2: case SMC_PEERFINCLOSEWAIT: - sk->sk_state = SMC_PEERABORTWAIT; + smc_sk_set_state(sk, SMC_PEERABORTWAIT); smc_close_cancel_work(smc); - if (sk->sk_state != SMC_PEERABORTWAIT) + if (smc_sk_state(sk) != SMC_PEERABORTWAIT) break; - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); smc_conn_free(&smc->conn); release_clcsock = true; sock_put(sk); /* passive closing */ break; case SMC_PROCESSABORT: case SMC_APPFINCLOSEWAIT: - sk->sk_state = SMC_PEERABORTWAIT; + smc_sk_set_state(sk, SMC_PEERABORTWAIT); smc_close_cancel_work(smc); - if (sk->sk_state != SMC_PEERABORTWAIT) + if (smc_sk_state(sk) != SMC_PEERABORTWAIT) break; - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); smc_conn_free(&smc->conn); release_clcsock = true; break; @@ -204,14 +226,14 @@ int smc_close_active(struct smc_sock *smc) 0 : sock_flag(sk, SOCK_LINGER) ? sk->sk_lingertime : SMC_MAX_STREAM_WAIT_TIMEOUT; - old_state = sk->sk_state; + old_state = smc_sk_state(sk); again: - switch (sk->sk_state) { + switch (smc_sk_state(sk)) { case SMC_INIT: - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); break; case SMC_LISTEN: - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sk->sk_state_change(sk); /* wake up accept */ if (smc->clcsock && smc->clcsock->sk) { write_lock_bh(&smc->clcsock->sk->sk_callback_lock); @@ -231,10 +253,10 @@ int smc_close_active(struct smc_sock *smc) release_sock(sk); cancel_delayed_work_sync(&conn->tx_work); lock_sock(sk); - if (sk->sk_state == SMC_ACTIVE) { + if (smc_sk_state(sk) == SMC_ACTIVE) { /* send close request */ rc = smc_close_final(conn); - sk->sk_state = SMC_PEERCLOSEWAIT1; + smc_sk_set_state(sk, SMC_PEERCLOSEWAIT1); /* actively shutdown clcsock before peer close it, * prevent peer from entering TIME_WAIT state. 
@@ -256,7 +278,7 @@ int smc_close_active(struct smc_sock *smc) /* just shutdown wr done, send close request */ rc = smc_close_final(conn); } - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); break; case SMC_APPCLOSEWAIT1: case SMC_APPCLOSEWAIT2: @@ -265,18 +287,18 @@ int smc_close_active(struct smc_sock *smc) release_sock(sk); cancel_delayed_work_sync(&conn->tx_work); lock_sock(sk); - if (sk->sk_state != SMC_APPCLOSEWAIT1 && - sk->sk_state != SMC_APPCLOSEWAIT2) + if (smc_sk_state(sk) != SMC_APPCLOSEWAIT1 && + smc_sk_state(sk) != SMC_APPCLOSEWAIT2) goto again; /* confirm close from peer */ rc = smc_close_final(conn); if (smc_cdc_rxed_any_close(conn)) { /* peer has closed the socket already */ - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sock_put(sk); /* postponed passive closing */ } else { /* peer has just issued a shutdown write */ - sk->sk_state = SMC_PEERFINCLOSEWAIT; + smc_sk_set_state(sk, SMC_PEERFINCLOSEWAIT); } break; case SMC_PEERCLOSEWAIT1: @@ -293,17 +315,17 @@ int smc_close_active(struct smc_sock *smc) break; case SMC_PROCESSABORT: rc = smc_close_abort(conn); - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); break; case SMC_PEERABORTWAIT: - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); break; case SMC_CLOSED: /* nothing to do, add tracing in future patch */ break; } - if (old_state != sk->sk_state) + if (old_state != smc_sk_state(sk)) sk->sk_state_change(sk); return rc; } @@ -314,33 +336,33 @@ static void smc_close_passive_abort_received(struct smc_sock *smc) &smc->conn.local_tx_ctrl.conn_state_flags; struct sock *sk = &smc->sk; - switch (sk->sk_state) { + switch (smc_sk_state(sk)) { case SMC_INIT: case SMC_ACTIVE: case SMC_APPCLOSEWAIT1: - sk->sk_state = SMC_PROCESSABORT; + smc_sk_set_state(sk, SMC_PROCESSABORT); sock_put(sk); /* passive closing */ break; case SMC_APPFINCLOSEWAIT: - sk->sk_state = SMC_PROCESSABORT; + smc_sk_set_state(sk, SMC_PROCESSABORT); break; case SMC_PEERCLOSEWAIT1: case SMC_PEERCLOSEWAIT2: if (txflags->peer_done_writing && !smc_close_sent_any_close(&smc->conn)) /* just shutdown, but not yet closed locally */ - sk->sk_state = SMC_PROCESSABORT; + smc_sk_set_state(sk, SMC_PROCESSABORT); else - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sock_put(sk); /* passive closing */ break; case SMC_APPCLOSEWAIT2: case SMC_PEERFINCLOSEWAIT: - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sock_put(sk); /* passive closing */ break; case SMC_PEERABORTWAIT: - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); break; case SMC_PROCESSABORT: /* nothing to do, add tracing in future patch */ @@ -364,7 +386,7 @@ static void smc_close_passive_work(struct work_struct *work) int old_state; lock_sock(sk); - old_state = sk->sk_state; + old_state = smc_sk_state(sk); rxflags = &conn->local_rx_ctrl.conn_state_flags; if (rxflags->peer_conn_abort) { @@ -376,19 +398,19 @@ static void smc_close_passive_work(struct work_struct *work) goto wakeup; } - switch (sk->sk_state) { + switch (smc_sk_state(sk)) { case SMC_INIT: - sk->sk_state = SMC_APPCLOSEWAIT1; + smc_sk_set_state(sk, SMC_APPCLOSEWAIT1); break; case SMC_ACTIVE: - sk->sk_state = SMC_APPCLOSEWAIT1; + smc_sk_set_state(sk, SMC_APPCLOSEWAIT1); /* postpone sock_put() for passive closing to cover * received SEND_SHUTDOWN as well */ break; case SMC_PEERCLOSEWAIT1: if (rxflags->peer_done_writing) - sk->sk_state = SMC_PEERCLOSEWAIT2; + smc_sk_set_state(sk, SMC_PEERCLOSEWAIT2); fallthrough; /* to check for closing */ case 
SMC_PEERCLOSEWAIT2: @@ -397,16 +419,16 @@ static void smc_close_passive_work(struct work_struct *work) if (sock_flag(sk, SOCK_DEAD) && smc_close_sent_any_close(conn)) { /* smc_release has already been called locally */ - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); } else { /* just shutdown, but not yet closed locally */ - sk->sk_state = SMC_APPFINCLOSEWAIT; + smc_sk_set_state(sk, SMC_APPFINCLOSEWAIT); } sock_put(sk); /* passive closing */ break; case SMC_PEERFINCLOSEWAIT: if (smc_cdc_rxed_any_close(conn)) { - sk->sk_state = SMC_CLOSED; + smc_sk_set_state(sk, SMC_CLOSED); sock_put(sk); /* passive closing */ } break; @@ -428,9 +450,9 @@ static void smc_close_passive_work(struct work_struct *work) sk->sk_data_ready(sk); /* wakeup blocked rcvbuf consumers */ sk->sk_write_space(sk); /* wakeup blocked sndbuf producers */ - if (old_state != sk->sk_state) { + if (old_state != smc_sk_state(sk)) { sk->sk_state_change(sk); - if ((sk->sk_state == SMC_CLOSED) && + if ((smc_sk_state(sk) == SMC_CLOSED) && (sock_flag(sk, SOCK_DEAD) || !sk->sk_socket)) { smc_conn_free(conn); if (smc->clcsock) @@ -455,19 +477,19 @@ int smc_close_shutdown_write(struct smc_sock *smc) 0 : sock_flag(sk, SOCK_LINGER) ? sk->sk_lingertime : SMC_MAX_STREAM_WAIT_TIMEOUT; - old_state = sk->sk_state; + old_state = smc_sk_state(sk); again: - switch (sk->sk_state) { + switch (smc_sk_state(sk)) { case SMC_ACTIVE: smc_close_stream_wait(smc, timeout); release_sock(sk); cancel_delayed_work_sync(&conn->tx_work); lock_sock(sk); - if (sk->sk_state != SMC_ACTIVE) + if (smc_sk_state(sk) != SMC_ACTIVE) goto again; /* send close wr request */ rc = smc_close_wr(conn); - sk->sk_state = SMC_PEERCLOSEWAIT1; + smc_sk_set_state(sk, SMC_PEERCLOSEWAIT1); break; case SMC_APPCLOSEWAIT1: /* passive close */ @@ -476,11 +498,11 @@ int smc_close_shutdown_write(struct smc_sock *smc) release_sock(sk); cancel_delayed_work_sync(&conn->tx_work); lock_sock(sk); - if (sk->sk_state != SMC_APPCLOSEWAIT1) + if (smc_sk_state(sk) != SMC_APPCLOSEWAIT1) goto again; /* confirm close from peer */ rc = smc_close_wr(conn); - sk->sk_state = SMC_APPCLOSEWAIT2; + smc_sk_set_state(sk, SMC_APPCLOSEWAIT2); break; case SMC_APPCLOSEWAIT2: case SMC_PEERFINCLOSEWAIT: @@ -493,7 +515,7 @@ int smc_close_shutdown_write(struct smc_sock *smc) break; } - if (old_state != sk->sk_state) + if (old_state != smc_sk_state(sk)) sk->sk_state_change(sk); return rc; } diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c index 7c5c5ad81324bb276f578b1c6b9e6fa7bd30d943..e97cf71cebcaf15ad0f1c50c22c737587b7bfe1f 100644 --- a/net/smc/smc_core.c +++ b/net/smc/smc_core.c @@ -127,6 +127,7 @@ static int smcr_lgr_conn_assign_link(struct smc_connection *conn, bool first) int i, j; /* do link balancing */ + conn->lnk = NULL; /* reset conn->lnk first */ for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) { struct smc_link *lnk = &conn->lgr->lnk[i]; @@ -167,6 +168,7 @@ static int smc_lgr_register_conn(struct smc_connection *conn, bool first) { struct smc_sock *smc = container_of(conn, struct smc_sock, conn); static atomic_t nexttoken = ATOMIC_INIT(0); + int i; int rc; if (!conn->lgr->is_smcd) { @@ -180,10 +182,26 @@ static int smc_lgr_register_conn(struct smc_connection *conn, bool first) * in this link group */ sock_hold(&smc->sk); /* sock_put in smc_lgr_unregister_conn() */ - while (!conn->alert_token_local) { - conn->alert_token_local = atomic_inc_return(&nexttoken); - if (smc_lgr_find_conn(conn->alert_token_local, conn->lgr)) - conn->alert_token_local = 0; + if (conn->lgr->use_rwwi) { + for (i = 1; i <= 
SMC_MAX_TOKEN_LOCAL; i++) { + if (!smc_lgr_find_conn(i, conn->lgr)) { + conn->alert_token_local = i; + break; + } + } + if (!conn->alert_token_local) { + atomic_dec(&conn->lnk->conn_cnt); + conn->lnk = NULL; + conn->lgr = NULL; + sock_put(&smc->sk); + return SMC_CLC_DECL_INTERR; + } + } else { + while (!conn->alert_token_local) { + conn->alert_token_local = atomic_inc_return(&nexttoken); + if (smc_lgr_find_conn(conn->alert_token_local, conn->lgr)) + conn->alert_token_local = 0; + } } smc_lgr_add_alert_token(conn); conn->lgr->conns_num++; @@ -453,6 +471,9 @@ static int smc_nl_fill_lgr_link(struct smc_link_group *lgr, if (nla_put_u64_64bit(skb, SMC_NLA_LINK_RWC_CNT, stats->r_wc_cnt, SMC_NLA_LINK_UNSPEC)) goto errstats; + if (nla_put_u64_64bit(skb, SMC_NLA_LINK_WWR_CNT, + stats->rw_wr_cnt, SMC_NLA_LINK_UNSPEC)) + goto errstats; if (nla_put_u64_64bit(skb, SMC_NLA_LINK_WWC_CNT, stats->rw_wc_cnt, SMC_NLA_LINK_UNSPEC)) goto errstats; @@ -534,6 +555,7 @@ static int smc_nl_fill_smcd_lgr(struct smc_link_group *lgr, struct netlink_callback *cb) { char smc_pnet[SMC_MAX_PNETID_LEN + 1]; + struct smcd_dev *smcd = lgr->smcd; struct nlattr *attrs; void *nlh; @@ -549,8 +571,9 @@ static int smc_nl_fill_smcd_lgr(struct smc_link_group *lgr, if (nla_put_u32(skb, SMC_NLA_LGR_D_ID, *((u32 *)&lgr->id))) goto errattr; - if (nla_put_u64_64bit(skb, SMC_NLA_LGR_D_GID, lgr->smcd->local_gid, - SMC_NLA_LGR_D_PAD)) + if (nla_put_u64_64bit(skb, SMC_NLA_LGR_D_GID, + smcd->ops->get_local_gid(smcd), + SMC_NLA_LGR_D_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_LGR_D_PEER_GID, lgr->peer_gid, SMC_NLA_LGR_D_PAD)) @@ -822,6 +845,21 @@ static void smcr_link_iw_extension(struct iw_ext_conn_param *iw_param, struct so } else { iw_param->sk_addr.saddr_v6 = clcsk->sk_v6_rcv_saddr; iw_param->sk_addr.daddr_v6 = clcsk->sk_v6_daddr; + + /* Workaround for IPv6 + */ + if (ipv6_addr_v4mapped(&iw_param->sk_addr.saddr_v6) && + ipv6_addr_v4mapped(&iw_param->sk_addr.daddr_v6)) { + __be32 saddr_v4, daddr_v4; + + saddr_v4 = iw_param->sk_addr.saddr_v6.s6_addr32[3]; + daddr_v4 = iw_param->sk_addr.daddr_v6.s6_addr32[3]; + memset(&iw_param->sk_addr.saddr_v6, 0, sizeof(struct in6_addr)); + memset(&iw_param->sk_addr.daddr_v6, 0, sizeof(struct in6_addr)); + iw_param->sk_addr.family = PF_INET; + iw_param->sk_addr.saddr_v4 = saddr_v4; + iw_param->sk_addr.daddr_v4 = daddr_v4; + } #endif } @@ -843,14 +881,6 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk, lnk->smcibdev = ini->ib_dev; lnk->ibport = ini->ib_port; } - - if (!lnk->smcibdev->ibdev) { - /* check if smcibdev still available */ - memset(lnk, 0, sizeof(struct smc_link)); - lnk->state = SMC_LNK_UNUSED; - return SMC_CLC_DECL_NOSMCRDEV; - } - get_device(&lnk->smcibdev->ibdev->dev); atomic_inc(&lnk->smcibdev->lnk_cnt); refcount_set(&lnk->refcnt, 1); /* link refcnt is set to 1 */ @@ -879,6 +909,10 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk, &ini->smcrv2 : NULL); if (rc) goto out; + + if (smc_ib_is_iwarp(lnk->smcibdev->ibdev, lnk->ibport)) + memcpy(lnk->eiwarp_gid, ini->smcrv2.eiwarp_gid, SMC_GID_SIZE); + rc = smc_llc_link_init(lnk); if (rc) goto out; @@ -920,13 +954,12 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk, /* create a new SMC link group */ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini) { - struct smc_ib_device *ibdev; struct smc_link_group *lgr; struct list_head *lgr_list; + struct smcd_dev *smcd; struct smc_link *lnk; spinlock_t *lgr_lock; u8 link_idx; - int ibport; int rc = 0; int i; 
@@ -970,7 +1003,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini) lgr->conns_all = RB_ROOT; if (ini->is_smcd) { /* SMC-D specific settings */ - get_device(&ini->ism_dev[ini->ism_selected]->dev); + smcd = ini->ism_dev[ini->ism_selected]; + get_device(smcd->ops->get_dev(smcd)); lgr->peer_gid = ini->ism_peer_gid[ini->ism_selected]; lgr->smcd = ini->ism_dev[ini->ism_selected]; lgr_list = &ini->ism_dev[ini->ism_selected]->lgr_list; @@ -980,6 +1014,9 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini) atomic_inc(&ini->ism_dev[ini->ism_selected]->lgr_cnt); } else { /* SMC-R specific settings */ + struct smc_ib_device *ibdev; + int ibport; + lgr->role = smc->listen_smc ? SMC_SERV : SMC_CLNT; lgr->smc_version = ini->smcr_version; memcpy(lgr->peer_systemid, ini->peer_systemid, @@ -991,23 +1028,26 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini) lgr->uses_gateway = ini->smcrv2.uses_gateway; memcpy(lgr->nexthop_mac, ini->smcrv2.nexthop_mac, ETH_ALEN); + lgr->max_conns = ini->max_conns; + lgr->max_links = ini->max_links; + lgr->credits_en = ini->vendor_opt_valid && ini->credits_en; + /* use_rwwi is limited for single link lgr */ + lgr->use_rwwi = ini->vendor_opt_valid && ini->rwwi_en && + lgr->max_links <= 1; } else { ibdev = ini->ib_dev; ibport = ini->ib_port; + lgr->max_conns = SMC_RMBS_PER_LGR_MAX; + lgr->max_links = SMC_LINKS_ADD_LNK_MAX; + lgr->credits_en = 0; + lgr->use_rwwi = 0; } + memcpy(lgr->pnet_id, ibdev->pnetid[ibport - 1], + SMC_MAX_PNETID_LEN); + rc = smc_lgr_link_stats_init(lgr); if (rc) goto free_wq; - - mutex_lock(&smc_ib_devices.mutex); - if (list_empty(&ibdev->list) || - test_bit(ibport, ibdev->ports_going_away)) { - /* ibdev unavailable */ - rc = SMC_CLC_DECL_NOSMCRDEV; - goto free_stats; - } - memcpy(lgr->pnet_id, ibdev->pnetid[ibport - 1], - SMC_MAX_PNETID_LEN); rc = smc_wr_alloc_lgr_mem(lgr); if (rc) goto free_stats; @@ -1032,16 +1072,12 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini) spin_lock_bh(lgr_lock); list_add_tail(&lgr->list, lgr_list); spin_unlock_bh(lgr_lock); - if (!ini->is_smcd) - mutex_unlock(&smc_ib_devices.mutex); return 0; free_stats: if (!ini->is_smcd) smc_lgr_link_stats_free(lgr); free_wq: - if (!ini->is_smcd) - mutex_unlock(&smc_ib_devices.mutex); destroy_workqueue(lgr->tx_wq); free_lgr: kfree(lgr); @@ -1050,16 +1086,10 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini) smc_ism_put_vlan(ini->ism_dev[ini->ism_selected], ini->vlan_id); out: if (rc < 0) { - switch (rc) { - case -ENOMEM: + if (rc == -ENOMEM) rc = SMC_CLC_DECL_MEM; - break; - case SMC_CLC_DECL_NOSMCRDEV: - break; - default: + else rc = SMC_CLC_DECL_INTERR; - break; - } } return rc; } @@ -1114,8 +1144,8 @@ static int smc_switch_cursor(struct smc_sock *smc, struct smc_cdc_tx_pend *pend, /* recalculate, value is used by tx_rdma_writes() */ atomic_set(&smc->conn.peer_rmbe_space, smc_write_space(conn)); - if (smc->sk.sk_state != SMC_INIT && - smc->sk.sk_state != SMC_CLOSED) { + if (smc_sk_state(&smc->sk) != SMC_INIT && + smc_sk_state(&smc->sk) != SMC_CLOSED) { rc = smcr_cdc_msg_send_validation(conn, pend, wr_buf); if (!rc) { queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work, 0); @@ -1176,17 +1206,17 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr, continue; smc = container_of(conn, struct smc_sock, conn); /* conn->lnk not yet set in SMC_INIT state */ - if (smc->sk.sk_state == SMC_INIT) + if (smc_sk_state(&smc->sk) == SMC_INIT) continue; - if 
(smc->sk.sk_state == SMC_CLOSED || - smc->sk.sk_state == SMC_PEERCLOSEWAIT1 || - smc->sk.sk_state == SMC_PEERCLOSEWAIT2 || - smc->sk.sk_state == SMC_APPFINCLOSEWAIT || - smc->sk.sk_state == SMC_APPCLOSEWAIT1 || - smc->sk.sk_state == SMC_APPCLOSEWAIT2 || - smc->sk.sk_state == SMC_PEERFINCLOSEWAIT || - smc->sk.sk_state == SMC_PEERABORTWAIT || - smc->sk.sk_state == SMC_PROCESSABORT) { + if (smc_sk_state(&smc->sk) == SMC_CLOSED || + smc_sk_state(&smc->sk) == SMC_PEERCLOSEWAIT1 || + smc_sk_state(&smc->sk) == SMC_PEERCLOSEWAIT2 || + smc_sk_state(&smc->sk) == SMC_APPFINCLOSEWAIT || + smc_sk_state(&smc->sk) == SMC_APPCLOSEWAIT1 || + smc_sk_state(&smc->sk) == SMC_APPCLOSEWAIT2 || + smc_sk_state(&smc->sk) == SMC_PEERFINCLOSEWAIT || + smc_sk_state(&smc->sk) == SMC_PEERABORTWAIT || + smc_sk_state(&smc->sk) == SMC_PROCESSABORT) { spin_lock_bh(&conn->send_lock); smc_switch_link_and_count(conn, to_lnk); spin_unlock_bh(&conn->send_lock); @@ -1495,7 +1525,8 @@ static void __smc_lgr_free(struct smc_link_group *lgr) if (!atomic_dec_return(&lgr_cnt)) wake_up(&lgrs_deleted); } - smc_lgr_link_stats_free(lgr); + if (!lgr->is_smcd) + smc_lgr_link_stats_free(lgr); kfree(lgr); } @@ -1517,7 +1548,7 @@ static void smc_lgr_free(struct smc_link_group *lgr) destroy_workqueue(lgr->tx_wq); if (lgr->is_smcd) { smc_ism_put_vlan(lgr->smcd, lgr->vlan_id); - put_device(&lgr->smcd->dev); + put_device(lgr->smcd->ops->get_dev(lgr->smcd)); } smc_lgr_put(lgr); /* theoretically last lgr_put */ } @@ -1592,7 +1623,7 @@ static void __smc_lgr_terminate(struct smc_link_group *lgr, bool soft) if (lgr->terminating) return; /* lgr already terminating */ /* cancel free_work sync, will terminate when lgr->freeing is set */ - cancel_delayed_work_sync(&lgr->free_work); + cancel_delayed_work(&lgr->free_work); lgr->terminating = 1; /* kill remaining link group connections */ @@ -1750,7 +1781,7 @@ void smcr_lgr_set_type(struct smc_link_group *lgr, enum smc_lgr_type new_type) lgr_type = "ASYMMETRIC_LOCAL"; break; } - pr_warn_ratelimited("smc: SMC-R lg %*phN state changed: " + pr_info_ratelimited("smc: SMC-R lg %*phN state changed: " "%s, pnetid %.16s\n", SMC_LGR_ID_SIZE, &lgr->id, lgr_type, lgr->pnet_id); } @@ -1781,6 +1812,7 @@ void smcr_port_add(struct smc_ib_device *smcibdev, u8 ibport) { struct smc_link_group *lgr, *n; + spin_lock_bh(&smc_lgr_list.lock); list_for_each_entry_safe(lgr, n, &smc_lgr_list.list, list) { struct smc_link *link; @@ -1791,11 +1823,15 @@ void smcr_port_add(struct smc_ib_device *smcibdev, u8 ibport) !rdma_dev_access_netns(smcibdev->ibdev, lgr->net)) continue; + if (lgr->type == SMC_LGR_SINGLE && lgr->max_links <= 1) + continue; + /* trigger local add link processing */ link = smc_llc_usable_link(lgr); if (link) smc_llc_add_link_local(link); } + spin_unlock_bh(&smc_lgr_list.lock); } /* link is down - switch connections to alternate link, @@ -2015,7 +2051,7 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini) (ini->smcd_version == SMC_V2 || lgr->vlan_id == ini->vlan_id) && (role == SMC_CLNT || ini->is_smcd || - (lgr->conns_num < SMC_RMBS_PER_LGR_MAX && + (lgr->conns_num < lgr->max_conns && !bitmap_full(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)))) { /* link group found */ ini->first_contact_local = 0; @@ -2727,6 +2763,7 @@ static int smc_core_reboot_event(struct notifier_block *this, { smc_lgrs_shutdown(); smc_ib_unregister_client(); + smc_ism_exit(); return 0; } diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h index e0c258777ab4876a8fe39ef2e4a57fdb6226bf26..57605e0563705dbbe66d9920eed2e77145068257 
100644 --- a/net/smc/smc_core.h +++ b/net/smc/smc_core.h @@ -23,7 +23,13 @@ #include "smc_stats.h" #define SMC_RMBS_PER_LGR_MAX 255 /* max. # of RMBs per link group */ - +#define SMC_CONN_PER_LGR_MAX 32 /* max. # of connections per link group. + * Correspondingly, SMC_WR_BUF_CNT should not be less than + * 2 * SMC_CONN_PER_LGR_MAX, since on average every connection + * has at least two rq/sq credits; otherwise the send path + * may have to wait for credits. + */ +#define SMC_MAX_TOKEN_LOCAL 255 struct smc_lgr_list { /* list of link group definition */ struct list_head list; spinlock_t lock; /* protects list of link groups */ @@ -81,6 +87,8 @@ struct smc_rdma_wr { /* work requests per message #define SMC_LGR_ID_SIZE 4 +#define SMC_LINKFLAG_ANNOUNCE_PENDING 0 + struct smc_link { struct iw_ext_conn_param iw_conn_param; struct smc_ib_device *smcibdev; /* ib-device */ @@ -109,7 +117,11 @@ struct smc_link { unsigned long *wr_tx_mask; /* bit mask of used indexes */ u32 wr_tx_cnt; /* number of WR send buffers */ wait_queue_head_t wr_tx_wait; /* wait for free WR send buf */ - atomic_t wr_tx_refcnt; /* tx refs to link */ + struct { + struct percpu_ref wr_tx_refs; + } ____cacheline_aligned_in_smp; + struct completion tx_ref_comp; + atomic_t tx_inflight_credit; struct smc_wr_buf *wr_rx_bufs; /* WR recv payload buffers */ struct ib_recv_wr *wr_rx_ibs; /* WR recv meta data */ @@ -122,10 +134,24 @@ struct smc_link { struct ib_reg_wr wr_reg; /* WR register memory region */ wait_queue_head_t wr_reg_wait; /* wait for wr_reg result */ - atomic_t wr_reg_refcnt; /* reg refs to link */ + struct { + struct percpu_ref wr_reg_refs; + } ____cacheline_aligned_in_smp; + struct completion reg_ref_comp; enum smc_wr_reg_state wr_reg_state; /* state of wr_reg request */ + atomic_t peer_rq_credits; /* credits for peer rq flowctrl */ + atomic_t local_rq_credits; /* credits for local rq flowctrl */ + u8 credits_enable; /* credits enable flag, set during negotiation */ + u8 local_cr_watermark_high; /* local rq credits watermark */ + u8 peer_cr_watermark_low; /* peer rq credits watermark */ + u8 credits_update_limit; /* credits update limit for cdc msg */ + struct work_struct credits_announce_work; /* work for credits announcement */ + unsigned long flags; /* link flags, SMC_LINKFLAG_ANNOUNCE_PENDING etc. */ + u8 gid[SMC_GID_SIZE];/* gid matching used vlan id*/ + u8 eiwarp_gid[SMC_GID_SIZE]; + /* gid of eRDMA iWARP device */ u8 sgid_index; /* gid index for vlan id */ u32 peer_qpn; /* QP number of peer */ enum ib_mtu path_mtu; /* used mtu */ @@ -158,6 +184,8 @@ struct smc_link { */ #define SMC_LINKS_PER_LGR_MAX 3 #define SMC_SINGLE_LINK 0 +#define SMC_LINKS_PER_LGR_PREFER 1 /* prefer 1 link per lgr */ +#define SMC_LINKS_ADD_LNK_MAX 2 /* tx/rx buffer list element for sndbufs list and rmbs list of a lgr */ struct smc_buf_desc { @@ -326,6 +354,13 @@ struct smc_link_group { __be32 saddr; /* net namespace */ struct net *net; + u8 max_conns; + /* max conn can be assigned to lgr */ + u8 max_links; + /* max links can be added in lgr */ + u8 credits_en; + /* is credits enabled by vendor opts negotiation */ + u8 use_rwwi; /* use RDMA WRITE with Imm or not */ }; struct { /* SMC-D */ u64 peer_gid; @@ -357,6 +392,7 @@ struct smc_init_info_smcrv2 { struct smc_ib_device *ib_dev_v2; u8 ib_port_v2; u8 ib_gid_v2[SMC_GID_SIZE]; + u8 eiwarp_gid[SMC_GID_SIZE]; /* Additional output fields when clc_sk and daddr is set as well */ u8 uses_gateway; @@ -369,6 +405,12 @@ struct smc_init_info { u8 is_smcd; u8 smc_type_v1; u8 smc_type_v2; + u8
release_ver; + u8 max_conns; + u8 max_links; + u8 vendor_opt_valid : 1; + u8 credits_en : 1; + u8 rwwi_en : 1; u8 first_contact_peer; u8 first_contact_local; unsigned short vlan_id; diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c index 6edc739f8e080cac899bae710514e260668dda00..a07f18379ac67cbe9529a223d2dd89551c6307d3 100644 --- a/net/smc/smc_diag.c +++ b/net/smc/smc_diag.c @@ -21,6 +21,7 @@ #include "smc.h" #include "smc_core.h" +#include "smc_inet.h" struct smc_diag_dump_ctx { int pos[2]; @@ -34,24 +35,42 @@ static struct smc_diag_dump_ctx *smc_dump_context(struct netlink_callback *cb) static void smc_diag_msg_common_fill(struct smc_diag_msg *r, struct sock *sk) { struct smc_sock *smc = smc_sk(sk); + struct sock *clcsk; + bool is_v4, is_v6; + + if (smc_sock_is_inet_sock(sk)) + clcsk = sk; + else if (smc->clcsock) + clcsk = smc->clcsock->sk; + else + return; memset(r, 0, sizeof(*r)); r->diag_family = sk->sk_family; sock_diag_save_cookie(sk, r->id.idiag_cookie); - if (!smc->clcsock) - return; - r->id.idiag_sport = htons(smc->clcsock->sk->sk_num); - r->id.idiag_dport = smc->clcsock->sk->sk_dport; - r->id.idiag_if = smc->clcsock->sk->sk_bound_dev_if; - if (sk->sk_protocol == SMCPROTO_SMC) { - r->id.idiag_src[0] = smc->clcsock->sk->sk_rcv_saddr; - r->id.idiag_dst[0] = smc->clcsock->sk->sk_daddr; + + r->id.idiag_sport = htons(clcsk->sk_num); + r->id.idiag_dport = clcsk->sk_dport; + r->id.idiag_if = clcsk->sk_bound_dev_if; + + is_v4 = smc_sock_is_inet_sock(sk) ? clcsk->sk_family == AF_INET : + sk->sk_protocol == SMCPROTO_SMC; #if IS_ENABLED(CONFIG_IPV6) - } else if (sk->sk_protocol == SMCPROTO_SMC6) { - memcpy(&r->id.idiag_src, &smc->clcsock->sk->sk_v6_rcv_saddr, - sizeof(smc->clcsock->sk->sk_v6_rcv_saddr)); - memcpy(&r->id.idiag_dst, &smc->clcsock->sk->sk_v6_daddr, - sizeof(smc->clcsock->sk->sk_v6_daddr)); + is_v6 = smc_sock_is_inet_sock(sk) ? 
clcsk->sk_family == AF_INET6 : + sk->sk_protocol == SMCPROTO_SMC6; +#else + is_v6 = false; +#endif + + if (is_v4) { + r->id.idiag_src[0] = clcsk->sk_rcv_saddr; + r->id.idiag_dst[0] = clcsk->sk_daddr; +#if IS_ENABLED(CONFIG_IPV6) + } else if (is_v6) { + memcpy(&r->id.idiag_src, &clcsk->sk_v6_rcv_saddr, + sizeof(clcsk->sk_v6_rcv_saddr)); + memcpy(&r->id.idiag_dst, &clcsk->sk_v6_daddr, + sizeof(clcsk->sk_v6_daddr)); #endif } } @@ -86,7 +105,7 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb, r = nlmsg_data(nlh); smc_diag_msg_common_fill(r, sk); - r->diag_state = sk->sk_state; + r->diag_state = smc_sk_state(sk); if (smc->use_fallback) r->diag_mode = SMC_DIAG_MODE_FALLBACK_TCP; else if (smc_conn_lgr_valid(&smc->conn) && smc->conn.lgr->is_smcd) @@ -167,12 +186,13 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb, !list_empty(&smc->conn.lgr->list)) { struct smc_connection *conn = &smc->conn; struct smcd_diag_dmbinfo dinfo; + struct smcd_dev *smcd = conn->lgr->smcd; memset(&dinfo, 0, sizeof(dinfo)); dinfo.linkid = *((u32 *)conn->lgr->id); dinfo.peer_gid = conn->lgr->peer_gid; - dinfo.my_gid = conn->lgr->smcd->local_gid; + dinfo.my_gid = smcd->ops->get_local_gid(smcd); dinfo.token = conn->rmb_desc->token; dinfo.peer_token = conn->peer_token; @@ -188,6 +208,75 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb, return -EMSGSIZE; } +static int smc_diag_dump_inet_proto(struct inet_hashinfo *hashinfo, struct sk_buff *skb, + struct netlink_callback *cb, int p_type) +{ + struct smc_diag_dump_ctx *cb_ctx = smc_dump_context(cb); + struct net *net = sock_net(skb->sk); + int snum = cb_ctx->pos[p_type]; + struct nlattr *bc = NULL; + int rc = 0, num = 0, i; + struct sock *sk; + + for (i = 0; i < INET_LHTABLE_SIZE; i++) { + struct inet_listen_hashbucket *ilb; + struct hlist_nulls_node *node; + + ilb = &hashinfo->listening_hash[i]; + spin_lock(&ilb->lock); + sk_nulls_for_each(sk, node, &ilb->nulls_head) { + if (!net_eq(sock_net(sk), net)) + continue; + if (sk->sk_prot != &smc_inet_prot) + continue; + if (num < snum) + goto next_ls; + rc = __smc_diag_dump(sk, skb, cb, nlmsg_data(cb->nlh), bc); + if (rc < 0) { + spin_unlock(&ilb->lock); + goto out; + } +next_ls: + num++; + } + spin_unlock(&ilb->lock); + } + + for (i = 0; i <= hashinfo->ehash_mask; i++) { + struct inet_ehash_bucket *head = &hashinfo->ehash[i]; + spinlock_t *lock = inet_ehash_lockp(hashinfo, i); + struct hlist_nulls_node *node; + + if (hlist_nulls_empty(&head->chain)) + continue; + + spin_lock_bh(lock); + sk_nulls_for_each(sk, node, &head->chain) { + if (!net_eq(sock_net(sk), net)) + continue; + if (sk->sk_state == TCP_TIME_WAIT) + continue; + if (sk->sk_state == TCP_NEW_SYN_RECV) + continue; + if (sk->sk_prot != &smc_inet_prot) + continue; + if (num < snum) + goto next; + rc = __smc_diag_dump(sk, skb, cb, nlmsg_data(cb->nlh), bc); + if (rc < 0) { + spin_unlock_bh(lock); + goto out; + } +next: + num++; + } + spin_unlock_bh(lock); + } +out: + cb_ctx->pos[p_type] += num; + return rc; +} + static int smc_diag_dump_proto(struct proto *prot, struct sk_buff *skb, struct netlink_callback *cb, int p_type) { @@ -219,7 +308,7 @@ static int smc_diag_dump_proto(struct proto *prot, struct sk_buff *skb, out: read_unlock(&prot->h.smc_hash->lock); - cb_ctx->pos[p_type] = num; + cb_ctx->pos[p_type] += num; return rc; } @@ -229,7 +318,13 @@ static int smc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb) rc = smc_diag_dump_proto(&smc_proto, skb, cb, SMCPROTO_SMC); if (!rc) - smc_diag_dump_proto(&smc_proto6, 
skb, cb, SMCPROTO_SMC6); + rc = smc_diag_dump_proto(&smc_proto6, skb, cb, SMCPROTO_SMC6); + if (!rc) + rc = smc_diag_dump_inet_proto(smc_inet_prot.h.hashinfo, skb, cb, SMCPROTO_SMC); +#if IS_ENABLED(CONFIG_IPV6) + if (!rc) + rc = smc_diag_dump_inet_proto(smc_inet6_prot.h.hashinfo, skb, cb, SMCPROTO_SMC6); +#endif return skb->len; } diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index a0833f6cd454345a53e5aeeaba9079e19c0cca25..e2c31411f529b97ddc772c869f4ab8e2b997b47f 100644 --- a/net/smc/smc_ib.c +++ b/net/smc/smc_ib.c @@ -198,13 +198,26 @@ int smc_ib_ready_link(struct smc_link *lnk) static int smc_ib_fill_mac(struct smc_ib_device *smcibdev, u8 ibport) { + struct ib_device *ibdev = smcibdev->ibdev; const struct ib_gid_attr *attr; + struct net_device *ndev; int rc; attr = rdma_get_gid_attr(smcibdev->ibdev, ibport, 0); if (IS_ERR(attr)) return -ENODEV; + if (smc_ib_is_iwarp(ibdev, ibport)) { + if (!ibdev->ops.get_netdev) + return -ENODEV; + ndev = ibdev->ops.get_netdev(ibdev, ibport); + if (!ndev) + return -ENODEV; + ether_addr_copy(smcibdev->mac[ibport - 1], ndev->dev_addr); + dev_put(ndev); + return 0; + } + rc = rdma_read_gid_l2_fields(attr, NULL, smcibdev->mac[ibport - 1]); rdma_put_gid_attr(attr); return rc; @@ -269,13 +282,62 @@ static int smc_ib_determine_gid_rcu(const struct net_device *ndev, u8 gid[], u8 *sgid_index, struct smc_init_info_smcrv2 *smcrv2) { - if (!smcrv2) { + if (!smcrv2 && attr->gid_type == IB_GID_TYPE_ROCE) { if (gid) memcpy(gid, &attr->gid, SMC_GID_SIZE); if (sgid_index) *sgid_index = attr->index; return 0; } + + /* Note: This is a tricky workaround for eRDMA iWARP. + * + * eRDMA iWARP has only one GID of type IB_GID_TYPE_IB which is + * converted from the MAC address. But RoCEv2 devices have GIDs of + * type IB_GID_TYPE_ROCE_UDP_ENCAP which are converted from the + * IPv4 or IPv6 address. The IPv4 GID is used by the SMCRv2 protocol + * to record the IPv4 address of the peer. + * + * So in order to make SMCRv2 work well with eRDMA iWARP, we + * do not record the real MAC GID of eRDMA iWARP, but convert the + * IPv4 address of the net_device corresponding to eRDMA iWARP + * into an IPv4 GID and record it. + * + * A prerequisite for this is that an eRDMA iWARP device will only be selected + * when its corresponding net_device is the net_device + * of the clcsock, which means that we are not allowed to bind an eRDMA + * iWARP device with another ethernet device through pnet_id.
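+ * For example (illustrative addresses only): a clcsock with source address 192.168.0.1 is recorded here as the v4-mapped GID ::ffff:192.168.0.1, produced by the ipv6_addr_set_v4mapped() call below.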
+ */ + if (smcrv2 && attr->gid_type == IB_GID_TYPE_IB) { + struct in_device *in_dev = __in_dev_get_rcu(ndev); + const struct in_ifaddr *ifa; + bool ip_match = false; + + if (!in_dev || smcrv2->saddr == cpu_to_be32(INADDR_NONE)) + goto out; + in_dev_for_each_ifa_rcu(ifa, in_dev) { + if (ifa->ifa_address != smcrv2->saddr) + continue; + ip_match = true; + break; + } + if (!ip_match) + goto out; + if (smcrv2->daddr && smc_ib_find_route(smcrv2->saddr, + smcrv2->daddr, + smcrv2->nexthop_mac, + &smcrv2->uses_gateway)) + goto out; + + if (gid) + ipv6_addr_set_v4mapped(smcrv2->saddr, (struct in6_addr *)gid); + if (sgid_index) + *sgid_index = attr->index; + + memcpy(smcrv2->eiwarp_gid, &attr->gid, SMC_GID_SIZE); + return 0; + } + if (smcrv2 && attr->gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP && smc_ib_gid_to_ipv4((u8 *)&attr->gid) != cpu_to_be32(INADDR_NONE)) { struct in_device *in_dev = __in_dev_get_rcu(ndev); @@ -313,29 +375,40 @@ int smc_ib_determine_gid(struct smc_ib_device *smcibdev, u8 ibport, unsigned short vlan_id, u8 gid[], u8 *sgid_index, struct smc_init_info_smcrv2 *smcrv2) { + struct ib_device *ibdev = smcibdev->ibdev; const struct ib_gid_attr *attr; const struct net_device *ndev; + bool iwarp_ndev = false; int i; + iwarp_ndev = smc_ib_is_iwarp(ibdev, ibport) && ibdev->ops.get_netdev; + for (i = 0; i < smcibdev->pattr[ibport - 1].gid_tbl_len; i++) { - attr = rdma_get_gid_attr(smcibdev->ibdev, ibport, i); + attr = rdma_get_gid_attr(ibdev, ibport, i); if (IS_ERR(attr)) continue; rcu_read_lock(); - ndev = rdma_read_gid_attr_ndev_rcu(attr); - if (smc_ib_is_iwarp(smcibdev->ibdev, ibport) || - (!IS_ERR(ndev) && + if (iwarp_ndev) + ndev = ibdev->ops.get_netdev(ibdev, ibport); + else + ndev = rdma_read_gid_attr_ndev_rcu(attr); + + if (ndev && !IS_ERR(ndev) && ((!vlan_id && !is_vlan_dev(ndev)) || (vlan_id && is_vlan_dev(ndev) && - vlan_dev_vlan_id(ndev) == vlan_id)))) { + vlan_dev_vlan_id(ndev) == vlan_id))) { if (!smc_ib_determine_gid_rcu(ndev, attr, gid, sgid_index, smcrv2)) { + if (iwarp_ndev) + dev_put((struct net_device *)ndev); rcu_read_unlock(); rdma_put_gid_attr(attr); return 0; } } + if (ndev && iwarp_ndev) + dev_put((struct net_device *)ndev); rcu_read_unlock(); rdma_put_gid_attr(attr); } @@ -356,7 +429,9 @@ static bool smc_ib_check_link_gid(u8 gid[SMC_GID_SIZE], bool smcrv2, continue; rcu_read_lock(); - if ((!smcrv2 && attr->gid_type == IB_GID_TYPE_ROCE) || + if ((smcrv2 && attr->gid_type == IB_GID_TYPE_IB && + smc_ib_is_iwarp(smcibdev->ibdev, ibport)) || + (!smcrv2 && attr->gid_type == IB_GID_TYPE_ROCE) || (smcrv2 && attr->gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP && !(ipv6_addr_type((const struct in6_addr *)&attr->gid) & IPV6_ADDR_LINKLOCAL))) @@ -371,6 +446,7 @@ static bool smc_ib_check_link_gid(u8 gid[SMC_GID_SIZE], bool smcrv2, /* check all links if the gid is still defined on smcibdev */ static void smc_ib_gid_check(struct smc_ib_device *smcibdev, u8 ibport) { + bool is_iwarp = smc_ib_is_iwarp(smcibdev->ibdev, ibport); struct smc_link_group *lgr; int i; @@ -385,7 +461,9 @@ static void smc_ib_gid_check(struct smc_ib_device *smcibdev, u8 ibport) if (lgr->lnk[i].state == SMC_LNK_UNUSED || lgr->lnk[i].smcibdev != smcibdev) continue; - if (!smc_ib_check_link_gid(lgr->lnk[i].gid, + if (!smc_ib_check_link_gid(is_iwarp ? 
+ lgr->lnk[i].eiwarp_gid : + lgr->lnk[i].gid, lgr->smc_version == SMC_V2, smcibdev, ibport)) smcr_port_err(smcibdev, ibport); @@ -707,9 +785,11 @@ static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv) port_idx = ibevent->element.qp->port - 1; if (port_idx >= SMC_MAX_PORTS) break; - set_bit(port_idx, &smcibdev->port_event_mask); - if (!test_and_set_bit(port_idx, smcibdev->ports_going_away)) - schedule_work(&smcibdev->port_event_work); + if (!smc_ib_port_active(smcibdev, port_idx + 1)) { + set_bit(port_idx, &smcibdev->port_event_mask); + if (!test_and_set_bit(port_idx, smcibdev->ports_going_away)) + schedule_work(&smcibdev->port_event_work); + } break; default: break; @@ -738,10 +818,16 @@ int smc_ib_create_queue_pair(struct smc_link *lnk) .srq = NULL, .cap = { /* include unsolicited rdma_writes as well, - * there are max. 2 RDMA_WRITE per 1 WR_SEND + * there are max. 2 RDMA_WRITE per 1 WR_SEND. + * RDMA_WRITE consumes send queue entities, + * without recv queue entities.When using rwwi, + * the max num of inflight llc msg is SMC_WR_BUF_CNT, + * and the max num of inflight cdc msg is SMC_WR_BUF_CNT, + * 1 cdc msg has max 2 RDMA_WRITE_WITH_IMM, + * so SMC_WR_BUF_CNT * 3 is enough for sq depth. */ .max_send_wr = SMC_WR_BUF_CNT * 3, - .max_recv_wr = SMC_WR_BUF_CNT * 3, + .max_recv_wr = SMC_WR_BUF_CNT, .max_send_sge = SMC_IB_MAX_SEND_SGE, .max_recv_sge = 1, .max_inline_data = 0, @@ -1053,10 +1139,18 @@ void smc_ib_ndev_change(struct net_device *ndev, unsigned long event) bool smc_ib_is_iwarp(struct ib_device *ibdev, u8 ibport) { - struct ib_port_immutable immutable; + return rdma_protocol_iwarp(ibdev, ibport); +} - ibdev->ops.get_port_immutable(ibdev, ibport, &immutable); - return immutable.core_cap_flags & RDMA_CORE_CAP_PROT_IWARP; +void smc_ib_get_pending_device(struct smc_ib_device *smcibdev) +{ + refcount_inc(&smcibdev->lnk_pending_cnt); +} + +void smc_ib_put_pending_device(struct smc_ib_device *smcibdev) +{ + if (refcount_dec_and_test(&smcibdev->lnk_pending_cnt)) + wake_up(&smcibdev->lnks_pending); } /* Reserve socket ports of each net namespace which can be accessed @@ -1153,7 +1247,9 @@ static int smc_ib_add_dev(struct ib_device *ibdev) } INIT_WORK(&smcibdev->port_event_work, smc_ib_port_event_work); atomic_set(&smcibdev->lnk_cnt, 0); + refcount_set(&smcibdev->lnk_pending_cnt, 1); init_waitqueue_head(&smcibdev->lnks_deleted); + init_waitqueue_head(&smcibdev->lnks_pending); mutex_init(&smcibdev->mutex); mutex_lock(&smc_ib_devices.mutex); list_add_tail(&smcibdev->list, &smc_ib_devices.list); @@ -1165,7 +1261,7 @@ static int smc_ib_add_dev(struct ib_device *ibdev) /* trigger reading of the port attributes */ port_cnt = smcibdev->ibdev->phys_port_cnt; - pr_warn_ratelimited("smc: adding ib device %s with port count %d\n", + pr_info_ratelimited("smc: adding ib device %s with port count %d\n", smcibdev->ibdev->name, port_cnt); for (i = 0; i < min_t(size_t, port_cnt, SMC_MAX_PORTS); @@ -1176,7 +1272,7 @@ static int smc_ib_add_dev(struct ib_device *ibdev) smcibdev->pnetid[i])) smc_pnetid_by_table_ib(smcibdev, i + 1); smc_copy_netdev_ifindex(smcibdev, i); - pr_warn_ratelimited("smc: ib device %s port %d has pnetid " + pr_info_ratelimited("smc: ib device %s port %d has pnetid " "%.16s%s\n", smcibdev->ibdev->name, i + 1, smcibdev->pnetid[i], @@ -1196,7 +1292,10 @@ static void smc_ib_remove_dev(struct ib_device *ibdev, void *client_data) mutex_lock(&smc_ib_devices.mutex); list_del_init(&smcibdev->list); /* remove from smc_ib_devices */ mutex_unlock(&smc_ib_devices.mutex); - 
pr_warn_ratelimited("smc: removing ib device %s\n", + smc_ib_put_pending_device(smcibdev); + wait_event(smcibdev->lnks_pending, /* wait for no pending usage */ + !refcount_read(&smcibdev->lnk_pending_cnt)); + pr_info_ratelimited("smc: removing ib device %s\n", smcibdev->ibdev->name); smc_smcr_terminate_all(smcibdev); smc_iw_release_ports(smcibdev); diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h index 709c383e776aad97e2018c6d7099e40dfcfb4351..78d9cc6463dbcf2bad0ffd93cf47a02830d101ec 100644 --- a/net/smc/smc_ib.h +++ b/net/smc/smc_ib.h @@ -59,7 +59,9 @@ struct smc_ib_device { /* ib-device infos for smc */ unsigned long port_event_mask; DECLARE_BITMAP(ports_going_away, SMC_MAX_PORTS); atomic_t lnk_cnt; /* number of links on ibdev */ + refcount_t lnk_pending_cnt;/* number of links attempt to use ibdev */ wait_queue_head_t lnks_deleted; /* wait 4 removal of all links*/ + wait_queue_head_t lnks_pending; /* wait 4 pending establish of links */ struct mutex mutex; /* protect dev setup+cleanup */ atomic_t lnk_cnt_by_port[SMC_MAX_PORTS]; /* number of links per port */ @@ -124,5 +126,7 @@ int smc_ib_find_route(__be32 saddr, __be32 daddr, u8 nexthop_mac[], u8 *uses_gateway); bool smc_ib_is_valid_local_systemid(void); bool smc_ib_is_iwarp(struct ib_device *ibdev, u8 ibport); +void smc_ib_get_pending_device(struct smc_ib_device *smcibdev); +void smc_ib_put_pending_device(struct smc_ib_device *smcibdev); int smcr_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb); #endif diff --git a/net/smc/smc_inet.c b/net/smc/smc_inet.c new file mode 100644 index 0000000000000000000000000000000000000000..badd6036ebd6683feffc12af14bcb59ca3b41071 --- /dev/null +++ b/net/smc/smc_inet.c @@ -0,0 +1,421 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Shared Memory Communications over RDMA (SMC-R) and RoCE + * + * AF_SMC protocol family socket handler keeping the AF_INET sock address type + * applies to SOCK_STREAM sockets only + * offers an alternative communication option for TCP-protocol sockets + * applicable with RoCE-cards only + * + * Initial restrictions: + * - support for alternate links postponed + * + * Copyright IBM Corp. 
2016, 2018 + * + */ + +#include +#include + +#include "smc_inet.h" + +static struct timewait_sock_ops smc_timewait_sock_ops = { + .twsk_obj_size = sizeof(struct tcp_timewait_sock), + .twsk_unique = tcp_twsk_unique, + .twsk_destructor = tcp_twsk_destructor, +}; + +struct proto smc_inet_prot = { + .name = "SMC", + .owner = THIS_MODULE, + .close = tcp_close, + .pre_connect = NULL, + .connect = tcp_v4_connect, + .disconnect = tcp_disconnect, + .accept = smc_inet_csk_accept, + .ioctl = tcp_ioctl, + .init = smc_inet_init_sock, + .destroy = tcp_v4_destroy_sock, + .shutdown = tcp_shutdown, + .setsockopt = tcp_setsockopt, + .getsockopt = tcp_getsockopt, + .keepalive = tcp_set_keepalive, + .recvmsg = tcp_recvmsg, + .sendmsg = tcp_sendmsg, + .sendpage = tcp_sendpage, + .backlog_rcv = tcp_v4_do_rcv, + .release_cb = smc_inet_sock_proto_release_cb, + .hash = inet_hash, + .unhash = inet_unhash, + .get_port = inet_csk_get_port, + .enter_memory_pressure = tcp_enter_memory_pressure, + .leave_memory_pressure = tcp_leave_memory_pressure, + .stream_memory_free = tcp_stream_memory_free, + .sockets_allocated = &tcp_sockets_allocated, + .orphan_count = &tcp_orphan_count, + .memory_allocated = &tcp_memory_allocated, + .memory_pressure = &tcp_memory_pressure, + .sysctl_mem = sysctl_tcp_mem, + .sysctl_wmem_offset = offsetof(struct net, ipv4.sysctl_tcp_wmem), + .sysctl_rmem_offset = offsetof(struct net, ipv4.sysctl_tcp_rmem), + .max_header = MAX_TCP_HEADER, + .obj_size = sizeof(struct smc_sock), + .slab_flags = SLAB_TYPESAFE_BY_RCU, + .twsk_prot = &smc_timewait_sock_ops, + /* tcp_conn_request will use tcp_request_sock_ops */ + .rsk_prot = NULL, + .h.hashinfo = &tcp_hashinfo, + .no_autobind = true, + .diag_destroy = tcp_abort, +}; +EXPORT_SYMBOL_GPL(smc_inet_prot); + +const struct proto_ops smc_inet_stream_ops = { + .family = PF_INET, + .flags = PROTO_CMSG_DATA_ONLY, + .owner = THIS_MODULE, + .release = smc_inet_release, + .bind = inet_bind, + .connect = smc_inet_connect, + .socketpair = sock_no_socketpair, + .accept = inet_accept, + .getname = inet_getname, + .poll = smc_inet_poll, + .ioctl = smc_inet_ioctl, + .gettstamp = sock_gettstamp, + .listen = smc_inet_listen, + .shutdown = smc_inet_shutdown, + .setsockopt = smc_inet_setsockopt, + .getsockopt = smc_inet_getsockopt, + .sendmsg = smc_inet_sendmsg, + .recvmsg = smc_inet_recvmsg, +#ifdef CONFIG_MMU + .mmap = tcp_mmap, +#endif + .sendpage = smc_inet_sendpage, + .splice_read = smc_inet_splice_read, + .read_sock = tcp_read_sock, + .sendmsg_locked = tcp_sendmsg_locked, + .sendpage_locked = tcp_sendpage_locked, + .peek_len = tcp_peek_len, +#ifdef CONFIG_COMPAT + .compat_ioctl = inet_compat_ioctl, +#endif + .set_rcvlowat = tcp_set_rcvlowat, +}; + +struct inet_protosw smc_inet_protosw = { + .type = SOCK_STREAM, + .protocol = IPPROTO_SMC, + .prot = &smc_inet_prot, + .ops = &smc_inet_stream_ops, + .flags = INET_PROTOSW_ICSK, +}; + +#if IS_ENABLED(CONFIG_IPV6) +struct proto smc_inet6_prot = { + .name = "SMCv6", + .owner = THIS_MODULE, + .close = tcp_close, + .pre_connect = NULL, + .connect = NULL, + .disconnect = tcp_disconnect, + .accept = smc_inet_csk_accept, + .ioctl = tcp_ioctl, + .init = smc_inet_init_sock, + .destroy = NULL, + .shutdown = tcp_shutdown, + .setsockopt = tcp_setsockopt, + .getsockopt = tcp_getsockopt, + .keepalive = tcp_set_keepalive, + .recvmsg = tcp_recvmsg, + .sendmsg = tcp_sendmsg, + .sendpage = tcp_sendpage, + .backlog_rcv = NULL, + .release_cb = smc_inet_sock_proto_release_cb, + .hash = NULL, + .unhash = inet_unhash, + .get_port = 
inet_csk_get_port, + .enter_memory_pressure = tcp_enter_memory_pressure, + .leave_memory_pressure = tcp_leave_memory_pressure, + .stream_memory_free = tcp_stream_memory_free, + .sockets_allocated = &tcp_sockets_allocated, + .memory_allocated = &tcp_memory_allocated, + .memory_pressure = &tcp_memory_pressure, + .orphan_count = &tcp_orphan_count, + .sysctl_mem = sysctl_tcp_mem, + .sysctl_wmem_offset = offsetof(struct net, ipv4.sysctl_tcp_wmem), + .sysctl_rmem_offset = offsetof(struct net, ipv4.sysctl_tcp_rmem), + .max_header = MAX_TCP_HEADER, + .obj_size = sizeof(struct smc_sock), + .slab_flags = SLAB_TYPESAFE_BY_RCU, + .twsk_prot = &smc_timewait_sock_ops, + /* tcp_conn_request will use tcp_request_sock_ops */ + .rsk_prot = NULL, + .h.hashinfo = &tcp_hashinfo, + .no_autobind = true, + .diag_destroy = tcp_abort, +}; +EXPORT_SYMBOL_GPL(smc_inet6_prot); + +const struct proto_ops smc_inet6_stream_ops = { + .family = PF_INET6, + .flags = PROTO_CMSG_DATA_ONLY, + .owner = THIS_MODULE, + .release = smc_inet_release, + .bind = inet6_bind, + .connect = smc_inet_connect, /* ok */ + .socketpair = sock_no_socketpair, /* a do nothing */ + .accept = inet_accept, /* ok */ + .getname = inet6_getname, + .poll = smc_inet_poll, /* ok */ + .ioctl = smc_inet_ioctl, /* must change */ + .gettstamp = sock_gettstamp, + .listen = smc_inet_listen, /* ok */ + .shutdown = smc_inet_shutdown, /* ok */ + .setsockopt = smc_inet_setsockopt, /* ok */ + .getsockopt = smc_inet_getsockopt, /* ok */ + .sendmsg = smc_inet_sendmsg, /* retpoline's sake */ + .recvmsg = smc_inet_recvmsg, /* retpoline's sake */ +#ifdef CONFIG_MMU + .mmap = tcp_mmap, +#endif + .sendpage = smc_inet_sendpage, + .sendmsg_locked = tcp_sendmsg_locked, + .sendpage_locked = tcp_sendpage_locked, + .splice_read = smc_inet_splice_read, + .read_sock = tcp_read_sock, + .peek_len = tcp_peek_len, +#ifdef CONFIG_COMPAT + .compat_ioctl = inet6_compat_ioctl, +#endif + .set_rcvlowat = tcp_set_rcvlowat, +}; + +struct inet_protosw smc_inet6_protosw = { + .type = SOCK_STREAM, + .protocol = IPPROTO_SMC, + .prot = &smc_inet6_prot, + .ops = &smc_inet6_stream_ops, + .flags = INET_PROTOSW_ICSK, +}; +#endif + +int smc_inet_sock_switch_negotiation_state_locked(struct sock *sk, int except, int target) +{ + struct smc_sock *smc = smc_sk(sk); + int cur; + + cur = isck_smc_negotiation_load(smc); + if (cur != except) + return cur; + + switch (cur) { + case SMC_NEGOTIATION_TBD: + switch (target) { + case SMC_NEGOTIATION_PREPARE_SMC: + /* same as passive closing */ + sock_hold(sk); + fallthrough; + case SMC_NEGOTIATION_NO_SMC: + isck_smc_negotiation_store(smc, target); + return target; + default: + break; + } + break; + case SMC_NEGOTIATION_PREPARE_SMC: + switch (target) { + case SMC_NEGOTIATION_NO_SMC: + sock_put(sk); /* sock hold in SMC_NEGOTIATION_PREPARE_SMC */ + fallthrough; + case SMC_NEGOTIATION_SMC: + isck_smc_negotiation_store(smc, target); + return target; + default: + break; + } + break; + default: + break; + } + + return cur; +} + +int smc_inet_sock_init(void) +{ + struct proto *tcp_v4prot; +#if IS_ENABLED(CONFIG_IPV6) + struct proto *tcp_v6prot; +#endif + + tcp_v4prot = smc_inet_get_tcp_prot(PF_INET); + if (unlikely(!tcp_v4prot)) + return -EINVAL; + +#if IS_ENABLED(CONFIG_IPV6) + tcp_v6prot = smc_inet_get_tcp_prot(PF_INET6); + if (unlikely(!tcp_v6prot)) + return -EINVAL; +#endif + + /* INET sock has a issues here. 
twsk will hold a reference to this module, + * so it may look like the SMC module cannot be unloaded right after the test program ends, + * but eventually twsk will release its reference to the module. + * This may affect some old test cases if they try to remove the module immediately after + * completing their test. + */ + + /* Complete the full prot and proto_ops to + * ensure consistency with TCP. Some symbols here have not been exported, + * so we have to assign them here. + */ + smc_inet_prot.pre_connect = tcp_v4prot->pre_connect; + +#if IS_ENABLED(CONFIG_IPV6) + smc_inet6_prot.pre_connect = tcp_v6prot->pre_connect; + smc_inet6_prot.connect = tcp_v6prot->connect; + smc_inet6_prot.init = tcp_v6prot->init; + smc_inet6_prot.destroy = tcp_v6prot->destroy; + smc_inet6_prot.backlog_rcv = tcp_v6prot->backlog_rcv; + smc_inet6_prot.hash = tcp_v6prot->hash; +#endif + return 0; +} + +static int smc_inet_clcsock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + + smc = smc_sk(sock->sk); + + if (current_work() == &smc->smc_listen_work) + return tcp_sendmsg(sk, msg, len); + + /* smc_inet_clcsock_sendmsg only works for the smc handshake; + * fallback sendmsg should be processed by smc_inet_sendmsg. + * See more details in smc_inet_sendmsg(). + */ + if (smc->use_fallback) + return -EOPNOTSUPP; + + /* It is difficult for us to determine whether the current sk is locked. + * Therefore, we rely on the connect_work() implementation, which + * always runs with the sock locked. + */ + return tcp_sendmsg_locked(sk, msg, len); +} + +static int smc_inet_clcsock_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, + int flags) +{ + struct sock *sk = sock->sk; + struct smc_sock *smc; + int addr_len, err; + long timeo; + + smc = smc_sk(sock->sk); + + /* smc_inet_clcsock_recvmsg only works for the smc handshake; + * fallback recvmsg should be processed by smc_inet_recvmsg. + */ + if (smc->use_fallback) + return -EOPNOTSUPP; + + if (likely(!(flags & MSG_ERRQUEUE))) + sock_rps_record_flow(sk); + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + + if (current_work() == &smc->smc_listen_work) { + err = tcp_recvmsg(sk, msg, len, flags & MSG_DONTWAIT, + flags & ~MSG_DONTWAIT, &addr_len); + } else { + /* Locked, see more details in smc_inet_clcsock_sendmsg() */ + release_sock(sock->sk); + err = tcp_recvmsg(sk, msg, len, flags & MSG_DONTWAIT, + flags & ~MSG_DONTWAIT, &addr_len); + lock_sock(sock->sk); + /* since we released the sock above, the state might have changed */ + if (smc_sk_state(&smc->sk) != SMC_INIT) + err = -EPIPE; + } + + if (err >= 0) + msg->msg_namelen = addr_len; + + return err; +} + +static ssize_t smc_inet_clcsock_sendpage(struct socket *sock, struct page *page, int offset, + size_t size, int flags) +{ + /* fallback sendpage should be processed by smc_inet_sendpage. */ + return -EOPNOTSUPP; +} + +static ssize_t smc_inet_clcsock_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + /* fallback splice_read should be processed by smc_inet_splice_read. */ + return -EOPNOTSUPP; +} + +static int smc_inet_clcsock_connect(struct socket *sock, struct sockaddr *addr, + int alen, int flags) +{ + /* smc_connect will lock the sock->sk */ + return __inet_stream_connect(sock, addr, alen, flags, 0); +} + +static int smc_inet_clcsock_shutdown(struct socket *sock, int how) +{ + /* shutdown can be called from smc_close_active, we should + * not fail it.
+ */ + return 0; +} + +static int smc_inet_clcsock_release(struct socket *sock) +{ + /* shutdown can be called from smc_close_active, we should + * not fail it. + */ + return 0; +} + +static int smc_inet_clcsock_getname(struct socket *sock, struct sockaddr *addr, + int peer) +{ + return sock->sk->sk_family == PF_INET ? inet_getname(sock, addr, peer) : +#if IS_ENABLED(CONFIG_IPV6) + inet6_getname(sock, addr, peer); +#else + -EINVAL; +#endif +} + +static __poll_t smc_inet_clcsock_poll(struct file *file, struct socket *sock, + poll_table *wait) +{ + return 0; +} + +const struct proto_ops smc_inet_clcsock_ops = { + .family = PF_UNSPEC, + .flags = PROTO_CMSG_DATA_ONLY, + /* This is not a real ops; its lifecycle is bound to the SMC module. */ + .owner = NULL, + .release = smc_inet_clcsock_release, + .getname = smc_inet_clcsock_getname, + .connect = smc_inet_clcsock_connect, + .shutdown = smc_inet_clcsock_shutdown, + .sendmsg = smc_inet_clcsock_sendmsg, + .recvmsg = smc_inet_clcsock_recvmsg, + .sendpage = smc_inet_clcsock_sendpage, + .splice_read = smc_inet_clcsock_splice_read, + .poll = smc_inet_clcsock_poll, +}; diff --git a/net/smc/smc_inet.h b/net/smc/smc_inet.h new file mode 100644 index 0000000000000000000000000000000000000000..ec235d6646fe836c1b8ac6e7cd57dc808fc90367 --- /dev/null +++ b/net/smc/smc_inet.h @@ -0,0 +1,284 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Shared Memory Communications over RDMA (SMC-R) and RoCE + * + * Definitions for the SMC module (socket related) + * + * Copyright IBM Corp. 2016 + * + */ +#ifndef __SMC_INET +#define __SMC_INET + +#include +#include +#include +#include +/* MUST be included after net/tcp.h to avoid a warning */ +#include + +#include +#include "smc.h" + +extern struct proto smc_inet_prot; +extern struct proto smc_inet6_prot; + +extern const struct proto_ops smc_inet_stream_ops; +extern const struct proto_ops smc_inet6_stream_ops; + +extern struct inet_protosw smc_inet_protosw; +extern struct inet_protosw smc_inet6_protosw; + +extern const struct proto_ops smc_inet_clcsock_ops; + +enum smc_inet_sock_negotiation_state { + /* When creating an AF_SMC sock, the state field will be initialized to 0 by default, + * which is only for logical compatibility with that situation + * and will never be used. + */ + SMC_NEGOTIATION_COMPATIBLE_WITH_AF_SMC = 0, + + /* It is still uncertain whether this connection is an SMC connection or not. + * This state always appears when actively opening an SMC connection, because it is + * unclear whether the server supports the SMC protocol and is willing to use SMC. + */ + SMC_NEGOTIATION_TBD = 0x10, + + /* This state indicates that this connection is definitely not an SMC connection, + * and it is absolutely impossible for it to become an SMC connection again. A final + * state. + */ + SMC_NEGOTIATION_NO_SMC = 0x20, + + /* This state indicates that this connection is an SMC connection, and it is + * absolutely impossible for it to become a non-SMC connection again. A final state. + */ + SMC_NEGOTIATION_SMC = 0x40, + + /* This state indicates that this connection is in the process of the SMC handshake. + * It is mainly used to eliminate the ambiguity of syn_smc: when syn_smc is 1, + * it may mean that the remote peer supports SMC, or it may just indicate that + * the local side supports SMC.
+ */ + SMC_NEGOTIATION_PREPARE_SMC = 0x80, + + /* flags */ + SMC_NEGOTIATION_LISTEN_FLAG = 0x01, +}; + +static __always_inline void isck_smc_negotiation_store(struct smc_sock *smc, + enum smc_inet_sock_negotiation_state state) +{ + smc->isck_smc_negotiation = (state | (smc->isck_smc_negotiation & 0x0f)); +} + +static __always_inline int isck_smc_negotiation_load(struct smc_sock *smc) +{ + return smc->isck_smc_negotiation & 0xf0; +} + +static __always_inline void isck_smc_negotiation_set_flags(struct smc_sock *smc, int flags) +{ + smc->isck_smc_negotiation = (smc->isck_smc_negotiation | (flags & 0x0f)); +} + +static __always_inline int isck_smc_negotiation_get_flags(struct smc_sock *smc) +{ + return smc->isck_smc_negotiation & 0x0f; +} + +static inline int smc_inet_sock_set_syn_smc(struct sock *sk) +{ + int rc = 0; + + /* already set */ + if (unlikely(tcp_sk(sk)->syn_smc)) + return 1; + + read_lock_bh(&sk->sk_callback_lock); + /* Only set syn_smc while the negotiation state is still SMC_NEGOTIATION_TBD; + * this prevents socks that have already fallen back from being enabled again. + * For example, setsockopt might actively fall back before connect() is called. + */ + if (isck_smc_negotiation_load(smc_sk(sk)) == SMC_NEGOTIATION_TBD) { + tcp_sk(sk)->syn_smc = 1; + rc = 1; + } + read_unlock_bh(&sk->sk_callback_lock); + return rc; +} + +static inline int smc_inet_sock_try_fallback_fast(struct sock *sk, int abort) +{ + struct smc_sock *smc = smc_sk(sk); + int syn_smc = 1; + + write_lock_bh(&sk->sk_callback_lock); + switch (isck_smc_negotiation_load(smc)) { + case SMC_NEGOTIATION_TBD: + if (!abort && tcp_sk(sk)->syn_smc) + break; + /* fallback is meaningless for listen socks */ + if (unlikely(inet_sk_state_load(sk) == TCP_LISTEN)) + break; + /* In the implementation of INET sock, syn_smc will only be determined after + * smc_inet_connect or smc_inet_listen, which means that if there is + * no syn_smc set, we can easily fall back.
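+ * (Note: a listening sock breaks out above and is never marked NO_SMC here; and since + * SMC_NEGOTIATION_NO_SMC is a final state, a sock that falls back here can never switch back to SMC.)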
+ */ + isck_smc_negotiation_store(smc, SMC_NEGOTIATION_NO_SMC); + fallthrough; + case SMC_NEGOTIATION_NO_SMC: + if (smc->clcsk_state_change) + sk->sk_state_change = smc->clcsk_state_change; + syn_smc = 0; + default: + break; + } + write_unlock_bh(&sk->sk_callback_lock); + + return syn_smc; +} + +static __always_inline bool smc_inet_sock_check_smc(struct sock *sk) +{ + if (!tcp_sk(sk)->syn_smc) + return false; + + return isck_smc_negotiation_load(smc_sk(sk)) == SMC_NEGOTIATION_SMC; +} + +static __always_inline bool smc_inet_sock_check_fallback_fast(struct sock *sk) +{ + return !tcp_sk(sk)->syn_smc; +} + +static __always_inline bool smc_inet_sock_check_fallback(struct sock *sk) +{ + return isck_smc_negotiation_load(smc_sk(sk)) == SMC_NEGOTIATION_NO_SMC; +} + +static inline int smc_inet_sock_access_before(struct sock *sk) +{ + if (smc_inet_sock_check_fallback(sk)) + return 0; + + if (unlikely(isck_smc_negotiation_load(smc_sk(sk)) == SMC_NEGOTIATION_TBD)) + return smc_inet_sock_try_fallback_fast(sk, /* try best */ 0); + + return 1; +} + +static __always_inline bool smc_inet_sock_is_active_open(struct sock *sk) +{ + return !(isck_smc_negotiation_get_flags(smc_sk(sk)) & SMC_NEGOTIATION_LISTEN_FLAG); +} + +/* obtain TCP proto via sock family */ +static __always_inline struct proto *smc_inet_get_tcp_prot(int family) +{ + switch (family) { + case AF_INET: + return &tcp_prot; + case AF_INET6: + return &tcpv6_prot; + default: + pr_warn_once("smc: %s(unknown family %d)\n", __func__, family); + break; + } + return NULL; +} + +int smc_inet_sock_switch_negotiation_state_locked(struct sock *sk, int except, int target); + +static __always_inline int smc_inet_sock_switch_negotiation_state(struct sock *sk, + int except, int target) +{ + int rc; + + write_lock_bh(&sk->sk_callback_lock); + rc = smc_inet_sock_switch_negotiation_state_locked(sk, except, target); + write_unlock_bh(&sk->sk_callback_lock); + return rc; +} + +static __always_inline void smc_inet_sock_init_accompany_socket(struct sock *sk) +{ + struct smc_sock *smc = smc_sk(sk); + + smc->accompany_socket.sk = sk; + init_waitqueue_head(&smc->accompany_socket.wq.wait); + smc->accompany_socket.ops = &smc_inet_clcsock_ops; + smc->accompany_socket.state = SS_UNCONNECTED; + + smc->clcsock = &smc->accompany_socket; +} + +#if IS_ENABLED(CONFIG_IPV6) +#define smc_call_inet_sock_ops(sk, inet, inet6, ...) ({ \ + (sk)->sk_family == PF_INET ? inet(__VA_ARGS__) : \ + inet6(__VA_ARGS__); \ +}) +#else +#define smc_call_inet_sock_ops(sk, inet, inet6, ...) inet(__VA_ARGS__) +#endif + +#define SMC_REQSK_SMC 0x01 +#define SMC_REQSK_TCP 0x02 + +static inline bool smc_inet_sock_is_under_presure(const struct sock *sk) +{ + return READ_ONCE(smc_sk(sk)->under_presure); +} + +static inline void smc_inet_sock_under_presure(struct sock *sk) +{ + WRITE_ONCE(smc_sk(sk)->under_presure, 1); +} + +static inline void smc_inet_sock_leave_presure(struct sock *sk) +{ + WRITE_ONCE(smc_sk(sk)->under_presure, 0); +} + +/* This function initializes the inet related structures. + * If initialization is successful, it returns 0; + * otherwise, it returns a non-zero value. 
+ */ +int smc_inet_sock_init(void); + +int smc_inet_init_sock(struct sock *sk); +void smc_inet_sock_proto_release_cb(struct sock *sk); + +int smc_inet_connect(struct socket *sock, struct sockaddr *addr, + int alen, int flags); + +int smc_inet_setsockopt(struct socket *sock, int level, int optname, + sockptr_t optval, unsigned int optlen); + +int smc_inet_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen); + +int smc_inet_ioctl(struct socket *sock, unsigned int cmd, + unsigned long arg); + +int smc_inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len); + +int smc_inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, + int flags); + +ssize_t smc_inet_sendpage(struct socket *sock, struct page *page, + int offset, size_t size, int flags); + +ssize_t smc_inet_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags); + +__poll_t smc_inet_poll(struct file *file, struct socket *sock, poll_table *wait); + +struct sock *smc_inet_csk_accept(struct sock *sk, int flags, int *err, bool kern); +int smc_inet_listen(struct socket *sock, int backlog); + +int smc_inet_shutdown(struct socket *sock, int how); +int smc_inet_release(struct socket *sock); + +#endif // __SMC_INET diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c index 98de2bc6148332b7a4e5dd1b6401480111d5d658..261803e34fb528fe2c666cd29121ad7246df9c31 100644 --- a/net/smc/smc_ism.c +++ b/net/smc/smc_ism.c @@ -16,6 +16,7 @@ #include "smc_ism.h" #include "smc_pnet.h" #include "smc_netlink.h" +#include "linux/ism.h" struct smcd_dev_list smcd_dev_list = { .list = LIST_HEAD_INIT(smcd_dev_list.list), @@ -25,6 +26,22 @@ struct smcd_dev_list smcd_dev_list = { static bool smc_ism_v2_capable; static u8 smc_ism_v2_system_eid[SMC_MAX_EID_LEN]; +#if IS_ENABLED(CONFIG_ISM) +static void smcd_register_dev(struct ism_dev *ism); +static void smcd_unregister_dev(struct ism_dev *ism); +static void smcd_handle_event(struct ism_dev *ism, struct ism_event *event); +static void smcd_handle_irq(struct ism_dev *ism, unsigned int dmbno, + u16 dmbemask); + +static struct ism_client smc_ism_client = { + .name = "SMC-D", + .add = smcd_register_dev, + .remove = smcd_unregister_dev, + .handle_event = smcd_handle_event, + .handle_irq = smcd_handle_irq, +}; +#endif + /* Test if an ISM communication is possible - same CPC */ int smc_ism_cantalk(u64 peer_gid, unsigned short vlan_id, struct smcd_dev *smcd) { @@ -182,6 +199,7 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc) int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len, struct smc_buf_desc *dmb_desc) { +#if IS_ENABLED(CONFIG_ISM) struct smcd_dmb dmb; int rc; @@ -190,7 +208,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len, dmb.sba_idx = dmb_desc->sba_idx; dmb.vlan_id = lgr->vlan_id; dmb.rgid = lgr->peer_gid; - rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb); + rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, &smc_ism_client); if (!rc) { dmb_desc->sba_idx = dmb.sba_idx; dmb_desc->token = dmb.dmb_tok; @@ -199,6 +217,9 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len, dmb_desc->len = dmb.dmb_len; } return rc; +#else + return 0; +#endif } static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd, @@ -209,9 +230,11 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd, struct smc_pci_dev smc_pci_dev; struct nlattr *port_attrs; struct nlattr *attrs; + struct ism_dev *ism; int use_cnt = 0; void *nlh; + ism = 
smcd->priv; nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, &smc_gen_nl_family, NLM_F_MULTI, SMC_NETLINK_GET_DEV_SMCD); @@ -226,7 +249,7 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd, if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0)) goto errattr; memset(&smc_pci_dev, 0, sizeof(smc_pci_dev)); - smc_set_pci_values(to_pci_dev(smcd->dev.parent), &smc_pci_dev); + smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev); if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid)) goto errattr; if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid)) @@ -292,10 +315,11 @@ int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; } +#if IS_ENABLED(CONFIG_ISM) struct smc_ism_event_work { struct work_struct work; struct smcd_dev *smcd; - struct smcd_event event; + struct ism_event event; }; #define ISM_EVENT_REQUEST 0x0001 @@ -335,24 +359,6 @@ static void smcd_handle_sw_event(struct smc_ism_event_work *wrk) } } -int smc_ism_signal_shutdown(struct smc_link_group *lgr) -{ - int rc; - union smcd_sw_event_info ev_info; - - if (lgr->peer_shutdown) - return 0; - - memcpy(ev_info.uid, lgr->id, SMC_LGR_ID_SIZE); - ev_info.vlan_id = lgr->vlan_id; - ev_info.code = ISM_EVENT_REQUEST; - rc = lgr->smcd->ops->signal_event(lgr->smcd, lgr->peer_gid, - ISM_EVENT_REQUEST_IR, - ISM_EVENT_CODE_SHUTDOWN, - ev_info.info); - return rc; -} - /* worker for SMC-D events */ static void smc_ism_event_work(struct work_struct *work) { @@ -372,44 +378,25 @@ static void smc_ism_event_work(struct work_struct *work) kfree(wrk); } -static void smcd_release(struct device *dev) -{ - struct smcd_dev *smcd = container_of(dev, struct smcd_dev, dev); - - kfree(smcd->conn); - kfree(smcd); -} - -struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name, - const struct smcd_ops *ops, int max_dmbs) +static struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name, + const struct smcd_ops *ops, int max_dmbs) { struct smcd_dev *smcd; - smcd = kzalloc(sizeof(*smcd), GFP_KERNEL); + smcd = devm_kzalloc(parent, sizeof(*smcd), GFP_KERNEL); if (!smcd) return NULL; - smcd->conn = kcalloc(max_dmbs, sizeof(struct smc_connection *), - GFP_KERNEL); - if (!smcd->conn) { - kfree(smcd); + smcd->conn = devm_kcalloc(parent, max_dmbs, + sizeof(struct smc_connection *), GFP_KERNEL); + if (!smcd->conn) return NULL; - } smcd->event_wq = alloc_ordered_workqueue("ism_evt_wq-%s)", WQ_MEM_RECLAIM, name); - if (!smcd->event_wq) { - kfree(smcd->conn); - kfree(smcd); + if (!smcd->event_wq) return NULL; - } - smcd->dev.parent = parent; - smcd->dev.release = smcd_release; - device_initialize(&smcd->dev); - dev_set_name(&smcd->dev, name); smcd->ops = ops; - if (smc_pnetid_by_dev_port(parent, 0, smcd->pnetid)) - smc_pnetid_by_table_smcd(smcd); spin_lock_init(&smcd->lock); spin_lock_init(&smcd->lgr_lock); @@ -418,18 +405,30 @@ struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name, init_waitqueue_head(&smcd->lgrs_deleted); return smcd; } -EXPORT_SYMBOL_GPL(smcd_alloc_dev); -int smcd_register_dev(struct smcd_dev *smcd) +static void smcd_register_dev(struct ism_dev *ism) { - int rc; + const struct smcd_ops *ops = ism_get_smcd_ops(); + struct smcd_dev *smcd; + + if (!ops) + return; + + smcd = smcd_alloc_dev(&ism->pdev->dev, dev_name(&ism->pdev->dev), ops, + ISM_NR_DMBS); + if (!smcd) + return; + smcd->priv = ism; + ism_set_priv(ism, &smc_ism_client, smcd); + if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid)) + 
smc_pnetid_by_table_smcd(smcd); mutex_lock(&smcd_dev_list.mutex); if (list_empty(&smcd_dev_list.list)) { u8 *system_eid = NULL; system_eid = smcd->ops->get_system_eid(); - if (system_eid[24] != '0' || system_eid[28] != '0') { + if (smcd->ops->supports_v2()) { smc_ism_v2_capable = true; memcpy(smc_ism_v2_system_eid, system_eid, SMC_MAX_EID_LEN); @@ -442,44 +441,29 @@ int smcd_register_dev(struct smcd_dev *smcd) list_add(&smcd->list, &smcd_dev_list.list); mutex_unlock(&smcd_dev_list.mutex); - pr_warn_ratelimited("smc: adding smcd device %s with pnetid %.16s%s\n", - dev_name(&smcd->dev), smcd->pnetid, + pr_info_ratelimited("smc: adding smcd device %s with pnetid %.16s%s\n", + dev_name(&ism->dev), smcd->pnetid, smcd->pnetid_by_user ? " (user defined)" : ""); - rc = device_add(&smcd->dev); - if (rc) { - mutex_lock(&smcd_dev_list.mutex); - list_del(&smcd->list); - mutex_unlock(&smcd_dev_list.mutex); - } - - return rc; + return; } -EXPORT_SYMBOL_GPL(smcd_register_dev); -void smcd_unregister_dev(struct smcd_dev *smcd) +static void smcd_unregister_dev(struct ism_dev *ism) { - pr_warn_ratelimited("smc: removing smcd device %s\n", - dev_name(&smcd->dev)); + struct smcd_dev *smcd = ism_get_priv(ism, &smc_ism_client); + + pr_info_ratelimited("smc: removing smcd device %s\n", + dev_name(&ism->dev)); + smcd->going_away = 1; + smc_smcd_terminate_all(smcd); mutex_lock(&smcd_dev_list.mutex); list_del_init(&smcd->list); mutex_unlock(&smcd_dev_list.mutex); - smcd->going_away = 1; - smc_smcd_terminate_all(smcd); destroy_workqueue(smcd->event_wq); - - device_del(&smcd->dev); -} -EXPORT_SYMBOL_GPL(smcd_unregister_dev); - -void smcd_free_dev(struct smcd_dev *smcd) -{ - put_device(&smcd->dev); } -EXPORT_SYMBOL_GPL(smcd_free_dev); /* SMCD Device event handler. Called from ISM device interrupt handler. - * Parameters are smcd device pointer, + * Parameters are ism device pointer, * - event->type (0 --> DMB, 1 --> GID), * - event->code (event code), * - event->tok (either DMB token when event type 0, or GID when event type 1) @@ -489,8 +473,9 @@ EXPORT_SYMBOL_GPL(smcd_free_dev); * Context: * - Function called in IRQ context from ISM device driver event handler. */ -void smcd_handle_event(struct smcd_dev *smcd, struct smcd_event *event) +static void smcd_handle_event(struct ism_dev *ism, struct ism_event *event) { + struct smcd_dev *smcd = ism_get_priv(ism, &smc_ism_client); struct smc_ism_event_work *wrk; if (smcd->going_away) @@ -504,17 +489,18 @@ void smcd_handle_event(struct smcd_dev *smcd, struct smcd_event *event) wrk->event = *event; queue_work(smcd->event_wq, &wrk->work); } -EXPORT_SYMBOL_GPL(smcd_handle_event); /* SMCD Device interrupt handler. Called from ISM device interrupt handler. - * Parameters are smcd device pointer, DMB number, and the DMBE bitmask. + * Parameters are the ism device pointer, DMB number, and the DMBE bitmask. * Find the connection and schedule the tasklet for this connection. * * Context: * - Function called in IRQ context from ISM device driver IRQ handler. 
*/ -void smcd_handle_irq(struct smcd_dev *smcd, unsigned int dmbno, u16 dmbemask) +static void smcd_handle_irq(struct ism_dev *ism, unsigned int dmbno, + u16 dmbemask) { + struct smcd_dev *smcd = ism_get_priv(ism, &smc_ism_client); struct smc_connection *conn = NULL; unsigned long flags; @@ -524,10 +510,44 @@ void smcd_handle_irq(struct smcd_dev *smcd, unsigned int dmbno, u16 dmbemask) tasklet_schedule(&conn->rx_tsklet); spin_unlock_irqrestore(&smcd->lock, flags); } -EXPORT_SYMBOL_GPL(smcd_handle_irq); +#endif + +int smc_ism_signal_shutdown(struct smc_link_group *lgr) +{ + int rc = 0; +#if IS_ENABLED(CONFIG_ISM) + union smcd_sw_event_info ev_info; + + if (lgr->peer_shutdown) + return 0; + + memcpy(ev_info.uid, lgr->id, SMC_LGR_ID_SIZE); + ev_info.vlan_id = lgr->vlan_id; + ev_info.code = ISM_EVENT_REQUEST; + rc = lgr->smcd->ops->signal_event(lgr->smcd, lgr->peer_gid, + ISM_EVENT_REQUEST_IR, + ISM_EVENT_CODE_SHUTDOWN, + ev_info.info); +#endif + return rc; +} -void __init smc_ism_init(void) +int smc_ism_init(void) { + int rc = 0; + +#if IS_ENABLED(CONFIG_ISM) smc_ism_v2_capable = false; memset(smc_ism_v2_system_eid, 0, SMC_MAX_EID_LEN); + + rc = ism_register_client(&smc_ism_client); +#endif + return rc; +} + +void smc_ism_exit(void) +{ +#if IS_ENABLED(CONFIG_ISM) + ism_unregister_client(&smc_ism_client); +#endif } diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h index d6b2db604fe8fcc01bd3817b2420de8cd0f1bd6b..832b2f42d79f342f16a30f85bdf22fc8e5a00887 100644 --- a/net/smc/smc_ism.h +++ b/net/smc/smc_ism.h @@ -42,7 +42,8 @@ int smc_ism_signal_shutdown(struct smc_link_group *lgr); void smc_ism_get_system_eid(u8 **eid); u16 smc_ism_get_chid(struct smcd_dev *dev); bool smc_ism_is_v2_capable(void); -void smc_ism_init(void); +int smc_ism_init(void); +void smc_ism_exit(void); int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb); static inline int smc_ism_write(struct smcd_dev *smcd, u64 dmb_tok, diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c index bc1304b243abb8a9f6d95251ae6c1a53af637acf..d6bcd3e90b16f88bd872759b77fa1cc3dcc8ffc2 100644 --- a/net/smc/smc_llc.c +++ b/net/smc/smc_llc.c @@ -52,14 +52,13 @@ struct smc_llc_msg_confirm_link { /* type 0x01 */ u8 link_num; u8 link_uid[SMC_LGR_ID_SIZE]; u8 max_links; - u8 reserved[9]; + u8 max_conns; + u8 reserved[8]; }; #define SMC_LLC_FLAG_ADD_LNK_REJ 0x40 #define SMC_LLC_REJ_RSN_NO_ALT_PATH 1 -#define SMC_LLC_ADD_LNK_MAX_LINKS 2 - struct smc_llc_msg_add_link { /* type 0x02 */ struct smc_llc_hdr hd; u8 sender_mac[ETH_ALEN]; @@ -75,7 +74,8 @@ struct smc_llc_msg_add_link { /* type 0x02 */ reserved3 : 4; #endif u8 initial_psn[3]; - u8 reserved[8]; + u8 init_credits; /* QP rq init credits for rq flowctrl */ + u8 reserved[7]; }; struct smc_llc_msg_add_link_cont_rt { @@ -170,6 +170,12 @@ struct smc_llc_msg_delete_rkey { /* type 0x09 */ u8 reserved2[4]; }; +struct smc_llc_msg_announce_credits { /* type 0x0A */ + struct smc_llc_hdr hd; + u8 credits; + u8 reserved[39]; +}; + struct smc_llc_msg_delete_rkey_v2 { /* type 0x29 */ struct smc_llc_hdr hd; u8 num_rkeys; @@ -189,6 +195,7 @@ union smc_llc_msg { struct smc_llc_msg_delete_rkey delete_rkey; struct smc_llc_msg_test_link test_link; + struct smc_llc_msg_announce_credits announce_credits; struct { struct smc_llc_hdr hdr; u8 data[SMC_LLC_DATA_LEN]; @@ -242,11 +249,11 @@ static void smc_llc_flow_parallel(struct smc_link_group *lgr, u8 flow_type, } /* drop parallel or already-in-progress llc requests */ if (flow_type != msg_type) - pr_warn_once("smc: SMC-R lg %*phN dropped parallel " - "LLC msg: 
msg %d flow %d role %d\n", - SMC_LGR_ID_SIZE, &lgr->id, - qentry->msg.raw.hdr.common.type, - flow_type, lgr->role); + pr_warn_ratelimited("smc: SMC-R lg %*phN dropped parallel " + "LLC msg: msg %d flow %d role %d\n", + SMC_LGR_ID_SIZE, &lgr->id, + qentry->msg.raw.hdr.common.type, + flow_type, lgr->role); kfree(qentry); } @@ -359,11 +366,11 @@ struct smc_llc_qentry *smc_llc_wait(struct smc_link_group *lgr, smc_llc_flow_qentry_clr(flow)); return NULL; } - pr_warn_once("smc: SMC-R lg %*phN dropped unexpected LLC msg: " - "msg %d exp %d flow %d role %d flags %x\n", - SMC_LGR_ID_SIZE, &lgr->id, rcv_msg, exp_msg, - flow->type, lgr->role, - flow->qentry->msg.raw.hdr.flags); + pr_warn_ratelimited("smc: SMC-R lg %*phN dropped unexpected LLC msg: " + "msg %d exp %d flow %d role %d flags %x\n", + SMC_LGR_ID_SIZE, &lgr->id, rcv_msg, exp_msg, + flow->type, lgr->role, + flow->qentry->msg.raw.hdr.flags); smc_llc_flow_qentry_del(flow); } out: @@ -469,7 +476,12 @@ int smc_llc_send_confirm_link(struct smc_link *link, hton24(confllc->sender_qp_num, link->roce_qp->qp_num); confllc->link_num = link->link_id; memcpy(confllc->link_uid, link->link_uid, SMC_LGR_ID_SIZE); - confllc->max_links = SMC_LLC_ADD_LNK_MAX_LINKS; + confllc->max_links = SMC_LINKS_ADD_LNK_MAX; + if (link->lgr->smc_version == SMC_V2 && + link->lgr->peer_smc_release >= SMC_RELEASE_1) { + confllc->max_conns = link->lgr->max_conns; + confllc->max_links = link->lgr->max_links; + } /* send llc message */ rc = smc_wr_tx_send(link, pend); put_out: @@ -576,7 +588,10 @@ static struct smc_buf_desc *smc_llc_get_next_rmb(struct smc_link_group *lgr, { struct smc_buf_desc *buf_next; - if (!buf_pos || list_is_last(&buf_pos->list, &lgr->rmbs[*buf_lst])) { + if (!buf_pos) + return _smc_llc_get_next_rmb(lgr, buf_lst); + + if (list_is_last(&buf_pos->list, &lgr->rmbs[*buf_lst])) { (*buf_lst)++; return _smc_llc_get_next_rmb(lgr, buf_lst); } @@ -612,6 +627,8 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext, goto out; buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst); for (i = 0; i < ext->num_rkeys; i++) { + while (buf_pos && !(buf_pos)->used) + buf_pos = smc_llc_get_next_rmb(lgr, &buf_lst, buf_pos); if (!buf_pos) break; rmb = buf_pos; @@ -621,8 +638,6 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext, cpu_to_be64((uintptr_t)rmb->cpu_addr) : cpu_to_be64((u64)sg_dma_address(rmb->sgt[lnk_idx].sgl)); buf_pos = smc_llc_get_next_rmb(lgr, &buf_lst, buf_pos); - while (buf_pos && !(buf_pos)->used) - buf_pos = smc_llc_get_next_rmb(lgr, &buf_lst, buf_pos); } len += i * sizeof(ext->rt[0]); out: @@ -672,11 +687,16 @@ int smc_llc_send_add_link(struct smc_link *link, u8 mac[], u8 gid[], addllc->link_num = link_new->link_id; hton24(addllc->sender_qp_num, link_new->roce_qp->qp_num); hton24(addllc->initial_psn, link_new->psn_initial); - if (reqresp == SMC_LLC_REQ) + if (reqresp == SMC_LLC_REQ) { addllc->qp_mtu = link_new->path_mtu; - else + if (link_new->lgr->credits_en) + addllc->init_credits = (u8)link_new->wr_rx_cnt; + } else { addllc->qp_mtu = min(link_new->path_mtu, link_new->peer_mtu); + if (link_new->credits_enable) + addllc->init_credits = (u8)link_new->wr_rx_cnt; + } } if (ext && link_new) len += smc_llc_fill_ext_v2(ext, link, link_new); @@ -752,6 +772,46 @@ static int smc_llc_send_test_link(struct smc_link *link, u8 user_data[16]) return rc; } +/* send credits announce request or response */ +int smc_llc_announce_credits(struct smc_link *link, + enum smc_llc_reqresp reqresp, bool force) +{ + struct smc_llc_msg_announce_credits 
*announce_credits; + struct smc_wr_tx_pend_priv *pend; + struct smc_wr_buf *wr_buf; + int rc; + u8 saved_credits = 0; + + if (!link->credits_enable || + (!force && !smc_wr_rx_credits_need_announce(link))) + return 0; + + saved_credits = (u8)smc_wr_rx_get_credits(link); + if (!saved_credits) + /* maybe synced by cdc msg */ + return 0; + + rc = smc_llc_add_pending_send(link, &wr_buf, &pend); + if (rc) { + smc_wr_rx_put_credits(link, saved_credits); + return rc; + } + + announce_credits = (struct smc_llc_msg_announce_credits *)wr_buf; + memset(announce_credits, 0, sizeof(*announce_credits)); + announce_credits->hd.common.type = SMC_LLC_ANNOUNCE_CREDITS; + announce_credits->hd.length = sizeof(struct smc_llc_msg_announce_credits); + if (reqresp == SMC_LLC_RESP) + announce_credits->hd.flags |= SMC_LLC_FLAG_RESP; + announce_credits->credits = saved_credits; + /* send llc message */ + rc = smc_wr_tx_send(link, pend); + if (rc) + smc_wr_rx_put_credits(link, saved_credits); + + return rc; +} + /* schedule an llc send on link, may wait for buffers */ static int smc_llc_send_message(struct smc_link *link, void *llcbuf) { @@ -846,6 +906,8 @@ static int smc_llc_add_link_cont(struct smc_link *link, addc_llc->num_rkeys = *num_rkeys_todo; n = *num_rkeys_todo; for (i = 0; i < min_t(u8, n, SMC_LLC_RKEYS_PER_CONT_MSG); i++) { + while (*buf_pos && !(*buf_pos)->used) + *buf_pos = smc_llc_get_next_rmb(lgr, buf_lst, *buf_pos); if (!*buf_pos) { addc_llc->num_rkeys = addc_llc->num_rkeys - *num_rkeys_todo; @@ -862,8 +924,6 @@ static int smc_llc_add_link_cont(struct smc_link *link, (*num_rkeys_todo)--; *buf_pos = smc_llc_get_next_rmb(lgr, buf_lst, *buf_pos); - while (*buf_pos && !(*buf_pos)->used) - *buf_pos = smc_llc_get_next_rmb(lgr, buf_lst, *buf_pos); } addc_llc->hd.common.llc_type = SMC_LLC_ADD_LINK_CONT; addc_llc->hd.length = sizeof(struct smc_llc_msg_add_link_cont); @@ -1020,6 +1080,16 @@ static void smc_llc_save_add_link_info(struct smc_link *link, memcpy(link->peer_mac, add_llc->sender_mac, ETH_ALEN); link->peer_psn = ntoh24(add_llc->initial_psn); link->peer_mtu = add_llc->qp_mtu; + link->credits_enable = + (link->lgr->credits_en && add_llc->init_credits) ? 1 : 0; + if (link->credits_enable) { + atomic_set(&link->peer_rq_credits, add_llc->init_credits); + /* set peer rq credits watermark, if less than init_credits * 2/3, + * then credit announcement is needed. 
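+ * For example, with init_credits of 16 the watermark is max(16 * 2 / 3, 1) = 10, so an announcement becomes necessary once fewer than 10 peer rq credits remain.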
+ */ + link->peer_cr_watermark_low = + max(add_llc->init_credits * 2 / 3, 1); + } } /* as an SMC client, process an add link request */ @@ -1028,6 +1098,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry) struct smc_llc_msg_add_link *llc = &qentry->msg.add_link; enum smc_lgr_type lgr_new_t = SMC_LGR_SYMMETRIC; struct smc_link_group *lgr = smc_get_lgr(link); + struct smc_ib_device *ibdev_selected = NULL; struct smc_init_info *ini = NULL; struct smc_link *lnk_new = NULL; int lnk_idx, rc = 0; @@ -1041,6 +1112,9 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry) goto out_reject; } + if (lgr->type == SMC_LGR_SINGLE && lgr->max_links <= 1) + goto out_reject; + ini->vlan_id = lgr->vlan_id; if (lgr->smc_version == SMC_V2) { ini->check_smcrv2 = true; @@ -1064,14 +1138,29 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry) ini->ib_dev = link->smcibdev; ini->ib_port = link->ibport; } + + mutex_lock(&smc_ib_devices.mutex); + if (lgr->smc_version == SMC_V2) + ibdev_selected = ini->smcrv2.ib_dev_v2; + else if (lgr->smc_version < SMC_V2) + ibdev_selected = ini->ib_dev; + if (list_empty(&ibdev_selected->list)) { + rc = -ENODEV; + ibdev_selected = NULL; + mutex_unlock(&smc_ib_devices.mutex); + goto out_reject; + } + smc_ib_get_pending_device(ibdev_selected); /* put below */ + mutex_unlock(&smc_ib_devices.mutex); + lnk_idx = smc_llc_alloc_alt_link(lgr, lgr_new_t); if (lnk_idx < 0) - goto out_reject; + goto out_pending_dev; lnk_new = &lgr->lnk[lnk_idx]; lnk_new->iw_conn_param = link->iw_conn_param; rc = smcr_link_init(lgr, lnk_new, lnk_idx, ini); if (rc) - goto out_reject; + goto out_pending_dev; smc_llc_save_add_link_info(lnk_new, llc); lnk_new->link_id = llc->link_num; /* SMC server assigns link id */ smc_llc_link_set_uid(lnk_new); @@ -1099,11 +1188,15 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry) } } rc = smc_llc_cli_conf_link(link, ini, lnk_new, lgr_new_t); - if (!rc) + if (!rc) { + smc_ib_put_pending_device(ibdev_selected); goto out; + } out_clear_lnk: lnk_new->state = SMC_LNK_INACTIVE; smcr_link_clear(lnk_new, false); +out_pending_dev: + smc_ib_put_pending_device(ibdev_selected); out_reject: smc_llc_cli_add_link_reject(qentry); out: @@ -1166,6 +1259,9 @@ static void smc_llc_cli_add_link_invite(struct smc_link *link, lgr->type == SMC_LGR_ASYMMETRIC_PEER) goto out; + if (lgr->type == SMC_LGR_SINGLE && lgr->max_links <= 1) + goto out; + ini = kzalloc(sizeof(*ini), GFP_KERNEL); if (!ini) goto out; @@ -1393,6 +1489,7 @@ int smc_llc_srv_add_link(struct smc_link *link, struct smc_llc_qentry *req_qentry) { enum smc_lgr_type lgr_new_t = SMC_LGR_SYMMETRIC; + struct smc_ib_device *ibdev_selected = NULL; struct smc_link_group *lgr = link->lgr; struct smc_llc_msg_add_link *add_llc; struct smc_llc_qentry *qentry = NULL; @@ -1411,6 +1508,9 @@ int smc_llc_srv_add_link(struct smc_link *link, goto out; } + if (lgr->type == SMC_LGR_SINGLE && lgr->max_links <= 1) + goto out; + /* ignore client add link recommendation, start new flow */ ini->vlan_id = lgr->vlan_id; if (lgr->smc_version == SMC_V2) { @@ -1433,16 +1533,31 @@ int smc_llc_srv_add_link(struct smc_link *link, ini->ib_dev = link->smcibdev; ini->ib_port = link->ibport; } + + mutex_lock(&smc_ib_devices.mutex); + if (lgr->smc_version == SMC_V2) + ibdev_selected = ini->smcrv2.ib_dev_v2; + else if (lgr->smc_version < SMC_V2) + ibdev_selected = ini->ib_dev; + if (list_empty(&ibdev_selected->list)) { + rc = -ENODEV; + ibdev_selected = NULL; + 
mutex_unlock(&smc_ib_devices.mutex); + goto out; + } + smc_ib_get_pending_device(ibdev_selected); /* put below */ + mutex_unlock(&smc_ib_devices.mutex); + lnk_idx = smc_llc_alloc_alt_link(lgr, lgr_new_t); if (lnk_idx < 0) { rc = 0; - goto out; + goto out_dev; } lgr->lnk[lnk_idx].iw_conn_param = link->iw_conn_param; rc = smcr_link_init(lgr, &lgr->lnk[lnk_idx], lnk_idx, ini); if (rc) - goto out; + goto out_dev; link_new = &lgr->lnk[lnk_idx]; rc = smcr_buf_map_lgr(link_new); @@ -1492,6 +1607,8 @@ int smc_llc_srv_add_link(struct smc_link *link, rc = smc_llc_srv_conf_link(link, link_new, lgr_new_t); if (rc) goto out_err; + + smc_ib_put_pending_device(ibdev_selected); kfree(ini); return 0; out_err: @@ -1499,6 +1616,8 @@ int smc_llc_srv_add_link(struct smc_link *link, link_new->state = SMC_LNK_INACTIVE; smcr_link_clear(link_new, false); } +out_dev: + smc_ib_put_pending_device(ibdev_selected); out: kfree(ini); if (send_req_add_link_resp) @@ -1939,6 +2058,10 @@ static void smc_llc_event_handler(struct smc_llc_qentry *qentry) smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt); } return; + case SMC_LLC_ANNOUNCE_CREDITS: + if (smc_link_active(link)) + smc_wr_tx_put_credits(link, llc->announce_credits.credits, true); + break; case SMC_LLC_REQ_ADD_LINK: /* handle response here, smc_llc_flow_stop() cannot be called * in tasklet context @@ -2024,6 +2147,10 @@ static void smc_llc_rx_response(struct smc_link *link, case SMC_LLC_CONFIRM_RKEY_CONT: /* not used because max links is 3 */ break; + case SMC_LLC_ANNOUNCE_CREDITS: + if (smc_link_active(link)) + smc_wr_tx_put_credits(link, qentry->msg.announce_credits.credits, true); + break; default: smc_llc_protocol_violation(link->lgr, qentry->msg.raw.hdr.common.type); @@ -2117,6 +2244,27 @@ static void smc_llc_testlink_work(struct work_struct *work) schedule_delayed_work(&link->llc_testlink_wrk, next_interval); } +static void smc_llc_announce_credits_work(struct work_struct *work) +{ + struct smc_link *link = container_of(work, + struct smc_link, credits_announce_work); + int rc, retry = 0, agains = 0; + +again: + do { + rc = smc_llc_announce_credits(link, SMC_LLC_RESP, false); + } while ((rc == -EBUSY) && smc_link_sendable(link) && + (retry++ < SMC_LLC_ANNOUNCE_CR_MAX_RETRY)); + + if (smc_wr_rx_credits_need_announce(link) && + smc_link_sendable(link) && agains <= 5 && !rc) { + agains++; + goto again; + } + + clear_bit(SMC_LINKFLAG_ANNOUNCE_PENDING, &link->flags); +} + void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc) { struct net *net = sock_net(smc->clcsock->sk); @@ -2152,12 +2300,13 @@ int smc_llc_link_init(struct smc_link *link) { init_completion(&link->llc_testlink_resp); INIT_DELAYED_WORK(&link->llc_testlink_wrk, smc_llc_testlink_work); + INIT_WORK(&link->credits_announce_work, smc_llc_announce_credits_work); return 0; } void smc_llc_link_active(struct smc_link *link) { - pr_warn_ratelimited("smc: SMC-R lg %*phN link added: id %*phN, " + pr_info_ratelimited("smc: SMC-R lg %*phN link added: id %*phN, " "peerid %*phN, ibdev %s, ibport %d\n", SMC_LGR_ID_SIZE, &link->lgr->id, SMC_LGR_ID_SIZE, &link->link_uid, @@ -2175,7 +2324,7 @@ void smc_llc_link_active(struct smc_link *link) void smc_llc_link_clear(struct smc_link *link, bool log) { if (log) - pr_warn_ratelimited("smc: SMC-R lg %*phN link removed: id %*phN" + pr_info_ratelimited("smc: SMC-R lg %*phN link removed: id %*phN" ", peerid %*phN, ibdev %s, ibport %d\n", SMC_LGR_ID_SIZE, &link->lgr->id, SMC_LGR_ID_SIZE, &link->link_uid, @@ -2183,6 +2332,7 @@ void smc_llc_link_clear(struct smc_link 
*link, bool log) link->smcibdev->ibdev->name, link->ibport); complete(&link->llc_testlink_resp); cancel_delayed_work_sync(&link->llc_testlink_wrk); + cancel_work_sync(&link->credits_announce_work); } /* register a new rtoken at the remote peer (for all links) */ @@ -2297,6 +2447,10 @@ static struct smc_wr_rx_handler smc_llc_rx_handlers[] = { .handler = smc_llc_rx_handler, .type = SMC_LLC_DELETE_RKEY }, + { + .handler = smc_llc_rx_handler, + .type = SMC_LLC_ANNOUNCE_CREDITS + }, /* V2 types */ { .handler = smc_llc_rx_handler, diff --git a/net/smc/smc_llc.h b/net/smc/smc_llc.h index 7e7a3162c68b3d9923a9a5ed4315ec67c4c5fb85..d0c941e20beec6bce638709f1520d1f8cad2f387 100644 --- a/net/smc/smc_llc.h +++ b/net/smc/smc_llc.h @@ -21,6 +21,8 @@ #define SMC_LLC_WAIT_TIME (2 * HZ) #define SMC_LLC_TESTLINK_DEFAULT_TIME (30 * HZ) +#define SMC_LLC_ANNOUNCE_CR_MAX_RETRY (1) + enum smc_llc_reqresp { SMC_LLC_REQ, SMC_LLC_RESP @@ -36,6 +38,7 @@ enum smc_llc_msg_type { SMC_LLC_TEST_LINK = 0x07, SMC_LLC_CONFIRM_RKEY_CONT = 0x08, SMC_LLC_DELETE_RKEY = 0x09, + SMC_LLC_ANNOUNCE_CREDITS = 0X0A, /* V2 types */ SMC_LLC_CONFIRM_LINK_V2 = 0x21, SMC_LLC_ADD_LINK_V2 = 0x22, @@ -87,6 +90,8 @@ int smc_llc_send_add_link(struct smc_link *link, u8 mac[], u8 gid[], int smc_llc_send_delete_link(struct smc_link *link, u8 link_del_id, enum smc_llc_reqresp reqresp, bool orderly, u32 reason); +int smc_llc_announce_credits(struct smc_link *link, + enum smc_llc_reqresp reqresp, bool force); void smc_llc_srv_delete_link_local(struct smc_link *link, u8 del_link_id); void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc); void smc_llc_lgr_clear(struct smc_link_group *lgr); diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c index 1ed4bbccaf314d0dc2069972175549d3aab346d1..c3579dcd9014d4a265a3d4bf1560f6ff3dbc1859 100644 --- a/net/smc/smc_pnet.c +++ b/net/smc/smc_pnet.c @@ -102,7 +102,7 @@ static int smc_pnet_remove_by_pnetid(struct net *net, char *pnet_name) struct smc_pnetentry *pnetelem, *tmp_pe; struct smc_pnettable *pnettable; struct smc_ib_device *ibdev; - struct smcd_dev *smcd_dev; + struct smcd_dev *smcd; struct smc_net *sn; int rc = -ENOENT; int ibport; @@ -120,7 +120,7 @@ static int smc_pnet_remove_by_pnetid(struct net *net, char *pnet_name) list_del(&pnetelem->list); if (pnetelem->type == SMC_PNET_ETH && pnetelem->ndev) { dev_put(pnetelem->ndev); - pr_warn_ratelimited("smc: net device %s " + pr_info_ratelimited("smc: net device %s " "erased user defined " "pnetid %.16s\n", pnetelem->eth_name, @@ -144,7 +144,7 @@ static int smc_pnet_remove_by_pnetid(struct net *net, char *pnet_name) (!pnet_name || smc_pnet_match(pnet_name, ibdev->pnetid[ibport]))) { - pr_warn_ratelimited("smc: ib device %s ibport " + pr_info_ratelimited("smc: ib device %s ibport " "%d erased user defined " "pnetid %.16s\n", ibdev->ibdev->name, @@ -160,16 +160,17 @@ static int smc_pnet_remove_by_pnetid(struct net *net, char *pnet_name) mutex_unlock(&smc_ib_devices.mutex); /* remove smcd devices */ mutex_lock(&smcd_dev_list.mutex); - list_for_each_entry(smcd_dev, &smcd_dev_list.list, list) { - if (smcd_dev->pnetid_by_user && + list_for_each_entry(smcd, &smcd_dev_list.list, list) { + if (smcd->pnetid_by_user && (!pnet_name || - smc_pnet_match(pnet_name, smcd_dev->pnetid))) { - pr_warn_ratelimited("smc: smcd device %s " + smc_pnet_match(pnet_name, smcd->pnetid))) { + pr_info_ratelimited("smc: smcd device %s " "erased user defined pnetid " - "%.16s\n", dev_name(&smcd_dev->dev), - smcd_dev->pnetid); - memset(smcd_dev->pnetid, 0, SMC_MAX_PNETID_LEN); - 
smcd_dev->pnetid_by_user = false; + "%.16s\n", + dev_name(smcd->ops->get_dev(smcd)), + smcd->pnetid); + memset(smcd->pnetid, 0, SMC_MAX_PNETID_LEN); + smcd->pnetid_by_user = false; rc = 0; } } @@ -198,7 +199,7 @@ static int smc_pnet_add_by_ndev(struct net_device *ndev) dev_hold(ndev); pnetelem->ndev = ndev; rc = 0; - pr_warn_ratelimited("smc: adding net device %s with " + pr_info_ratelimited("smc: adding net device %s with " "user defined pnetid %.16s\n", pnetelem->eth_name, pnetelem->pnet_name); @@ -229,7 +230,7 @@ static int smc_pnet_remove_by_ndev(struct net_device *ndev) dev_put(pnetelem->ndev); pnetelem->ndev = NULL; rc = 0; - pr_warn_ratelimited("smc: removing net device %s with " + pr_info_ratelimited("smc: removing net device %s with " "user defined pnetid %.16s\n", pnetelem->eth_name, pnetelem->pnet_name); @@ -329,8 +330,8 @@ static struct smcd_dev *smc_pnet_find_smcd(char *smcd_name) mutex_lock(&smcd_dev_list.mutex); list_for_each_entry(smcd_dev, &smcd_dev_list.list, list) { - if (!strncmp(dev_name(&smcd_dev->dev), smcd_name, - IB_DEVICE_NAME_MAX - 1)) + if (!strncmp(dev_name(smcd_dev->ops->get_dev(smcd_dev)), + smcd_name, IB_DEVICE_NAME_MAX - 1)) goto out; } smcd_dev = NULL; @@ -389,7 +390,7 @@ static int smc_pnet_add_eth(struct smc_pnettable *pnettable, struct net *net, goto out_put; } if (ndev) - pr_warn_ratelimited("smc: net device %s " + pr_info_ratelimited("smc: net device %s " "applied user defined pnetid %.16s\n", new_pe->eth_name, new_pe->pnet_name); return 0; @@ -407,7 +408,8 @@ static int smc_pnet_add_ib(struct smc_pnettable *pnettable, char *ib_name, struct smc_ib_device *ib_dev; bool smcddev_applied = true; bool ibdev_applied = true; - struct smcd_dev *smcd_dev; + struct smcd_dev *smcd; + struct device *dev; bool new_ibdev; /* try to apply the pnetid to active devices */ @@ -415,20 +417,22 @@ static int smc_pnet_add_ib(struct smc_pnettable *pnettable, char *ib_name, if (ib_dev) { ibdev_applied = smc_pnet_apply_ib(ib_dev, ib_port, pnet_name); if (ibdev_applied) - pr_warn_ratelimited("smc: ib device %s ibport %d " + pr_info_ratelimited("smc: ib device %s ibport %d " "applied user defined pnetid " "%.16s\n", ib_dev->ibdev->name, ib_port, ib_dev->pnetid[ib_port - 1]); } - smcd_dev = smc_pnet_find_smcd(ib_name); - if (smcd_dev) { - smcddev_applied = smc_pnet_apply_smcd(smcd_dev, pnet_name); - if (smcddev_applied) - pr_warn_ratelimited("smc: smcd device %s " + smcd = smc_pnet_find_smcd(ib_name); + if (smcd) { + smcddev_applied = smc_pnet_apply_smcd(smcd, pnet_name); + if (smcddev_applied) { + dev = smcd->ops->get_dev(smcd); + pr_info_ratelimited("smc: smcd device %s " "applied user defined pnetid " - "%.16s\n", dev_name(&smcd_dev->dev), - smcd_dev->pnetid); + "%.16s\n", dev_name(dev), + smcd->pnetid); + } } /* Apply fails when a device has a hardware-defined pnetid set, do not * add a pnet table entry in that case. 
@@ -867,10 +871,6 @@ int smc_pnet_net_init(struct net *net) rwlock_init(&pnetids_ndev->lock); smc_pnet_create_pnetids_list(net); - - /* disable handshake limitation by default */ - net->smc.limit_smc_hs = 0; - return 0; } @@ -1176,7 +1176,7 @@ int smc_pnetid_by_table_ib(struct smc_ib_device *smcibdev, u8 ib_port) */ int smc_pnetid_by_table_smcd(struct smcd_dev *smcddev) { - const char *ib_name = dev_name(&smcddev->dev); + const char *ib_name = dev_name(smcddev->ops->get_dev(smcddev)); struct smc_pnettable *pnettable; struct smc_pnetentry *tmp_pe; struct smc_net *sn; diff --git a/net/smc/smc_proc.c b/net/smc/smc_proc.c index 43cb5c6cd6b8987bde43f37ba32ee76139239319..cf20e8afa8849f7efcdf6408cfbabde908fb736f 100644 --- a/net/smc/smc_proc.c +++ b/net/smc/smc_proc.c @@ -144,12 +144,12 @@ static void _conn_show(struct seq_file *seq, struct smc_sock *smc, int protocol) } seq_printf(seq, CONN_SK_FM, fb ? 'Y' : 'N', fb ? smc->fallback_rsn : 0, - sk, clcsock->sk, fb ? clcsock->sk->sk_state : sk->sk_state, sock_i_ino(sk)); + sk, clcsock->sk, fb ? clcsock->sk->sk_state : smc_sk_state(sk), sock_i_ino(sk)); lgr = smc->conn.lgr; lnk = smc->conn.lnk; - if (!fb && sk->sk_state == SMC_ACTIVE && lgr && lnk) { + if (!fb && smc_sk_state(sk) == SMC_ACTIVE && lgr && lnk) { for (i = 0; i < SMC_LGR_ID_SIZE; i++) seq_printf(seq, "%02X", lgr->id[i]); @@ -280,6 +280,53 @@ static struct proc_ops dim_file_ops = { .proc_release = single_release, }; +static int proc_show_links(struct seq_file *seq, void *v) +{ + struct smc_link_group *lgr, *lg; + struct smc_link *lnk; + int i = 0, j = 0; + + seq_printf(seq, LINK_HDR_FM, "grp", "type", "role", "mxlnk", "idx", + "mxcon", "gconn", "conn", "state", "qpn_l", "qpn_r", "tx", + "rx", "cr-e", "cr-l", "cr-r", "cr_h", "cr_l", "flags", + "rwwi"); + + spin_lock_bh(&smc_lgr_list.lock); + list_for_each_entry_safe(lgr, lg, &smc_lgr_list.list, list) { + for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) { + lnk = &lgr->lnk[i]; + if (!smc_link_usable(lnk)) + continue; + for (j = 0; j < SMC_LGR_ID_SIZE; j++) + seq_printf(seq, "%02X", lgr->id[j]); + seq_printf(seq, LINK_INFO_FM, lgr->is_smcd ? "D" : "R", + lgr->role == SMC_CLNT ? "C" : "S", lgr->max_links, + i, lgr->max_conns, lgr->conns_num, + atomic_read(&lnk->conn_cnt), lnk->state, + lnk->roce_qp ? 
lnk->roce_qp->qp_num : 0, lnk->peer_qpn, + lnk->wr_tx_cnt, lnk->wr_rx_cnt, lnk->credits_enable, + atomic_read(&lnk->local_rq_credits), + atomic_read(&lnk->peer_rq_credits), + lnk->local_cr_watermark_high, + lnk->peer_cr_watermark_low, lnk->flags, lgr->use_rwwi); + } + } + spin_unlock_bh(&smc_lgr_list.lock); + return 0; +} + +static int proc_open_links(struct inode *inode, struct file *file) +{ + single_open(file, proc_show_links, NULL); + return 0; +} + +static struct proc_ops link_file_ops = { +.proc_open = proc_open_links, +.proc_read = seq_read, +.proc_release = single_release, +}; + static int __net_init smc_proc_dir_init(struct net *net) { struct proc_dir_entry *proc_net_smc; @@ -297,12 +344,18 @@ static int __net_init smc_proc_dir_init(struct net *net) goto err_entry; } - if (!proc_create("dim", 0444, proc_net_smc, &dim_file_ops)) + if (!proc_create("links", 0444, proc_net_smc, &link_file_ops)) goto err_entry; + if (!proc_create("dim", 0444, proc_net_smc, &dim_file_ops)) + goto err_link; + net->proc_net_smc = proc_net_smc; return 0; +err_link: + remove_proc_entry("links", proc_net_smc); + err_entry: for (i -= 1; i >= 0; i--) remove_proc_entry(smc_proc[i].name, proc_net_smc); @@ -318,6 +371,7 @@ static void __net_exit smc_proc_dir_exit(struct net *net) struct proc_dir_entry *proc_net_smc = net->proc_net_smc; remove_proc_entry("dim", proc_net_smc); + remove_proc_entry("links", proc_net_smc); for (i = 0; i < ARRAY_SIZE(smc_proc); i++) remove_proc_entry(smc_proc[i].name, proc_net_smc); diff --git a/net/smc/smc_proc.h b/net/smc/smc_proc.h index ec59ca03e1633f352e4479d4790aecd778c7fc23..0e6775b51d4c49bdffe055ef5666b6bfba973c05 100644 --- a/net/smc/smc_proc.h +++ b/net/smc/smc_proc.h @@ -16,6 +16,13 @@ #define CONN_SK_FM (" %c %-8X %pK %pK %2d %-16lu ") #define CONN_LGR_FM (" %c %-8s %d %-4X %-4X %-8X %-8X\n") +#define LINK_HDR_FM \ + ("%-9s%-6s%-6s%-7s%-5s%-7s%-7s%-6s%-7s%-7s%-7s%-4s%-4s%-6s%-6s%-6s%-6s%-6s%-7s%-6s\n") +#define LINK_INFO_FM \ + (" %-6s%-6s%-7d%-5d%-7d%-7d%-6d%-7d%-7d%-7d%-4d%-4d%-6u%-6d%-6d%-6u%-6u%-7lu%-6u\n") + +extern struct smc_lgr_list smc_lgr_list; + struct smc_proc_private { struct seq_net_private p; int num, bucket, offset; diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c index 17c5aee7ee4f2019fe27dd688b6bbc2eb69b29f2..9856ec6cde302a0148877041d219940178eaaed0 100644 --- a/net/smc/smc_rx.c +++ b/net/smc/smc_rx.c @@ -13,6 +13,7 @@ #include #include #include +#include #include @@ -40,7 +41,7 @@ static void smc_rx_wake_up(struct sock *sk) EPOLLRDNORM | EPOLLRDBAND); sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN); if ((sk->sk_shutdown == SHUTDOWN_MASK) || - (sk->sk_state == SMC_CLOSED)) + (smc_sk_state(sk) == SMC_CLOSED)) sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_HUP); rcu_read_unlock(); } @@ -115,9 +116,9 @@ static void smc_rx_pipe_buf_release(struct pipe_inode_info *pipe, struct smc_connection *conn; struct sock *sk = &smc->sk; - if (sk->sk_state == SMC_CLOSED || - sk->sk_state == SMC_PEERFINCLOSEWAIT || - sk->sk_state == SMC_APPFINCLOSEWAIT) + if (smc_sk_state(sk) == SMC_CLOSED || + smc_sk_state(sk) == SMC_PEERFINCLOSEWAIT || + smc_sk_state(sk) == SMC_APPFINCLOSEWAIT) goto out; conn = &smc->conn; lock_sock(sk); @@ -263,9 +264,9 @@ int smc_rx_wait(struct smc_sock *smc, long *timeo, sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); add_wait_queue(sk_sleep(sk), &wait); rc = sk_wait_event(sk, timeo, - sk->sk_err || + READ_ONCE(sk->sk_err) || cflags->peer_conn_abort || - sk->sk_shutdown & RCV_SHUTDOWN || + READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN || conn->killed || fcrit(conn), &wait); 
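The wait condition above now samples sk_err and sk_shutdown with READ_ONCE(), i.e. without holding the socket lock. A minimal sketch of the store side this pairs with (illustrative only; the helper name is an assumption, not code from this series):

	/* hypothetical writer: publish RCV_SHUTDOWN with WRITE_ONCE(), then
	 * wake any task sleeping in sk_wait_event(); assumes <net/sock.h>
	 */
	static void example_rcv_shutdown(struct sock *sk)
	{
		WRITE_ONCE(sk->sk_shutdown, sk->sk_shutdown | RCV_SHUTDOWN);
		sk->sk_state_change(sk);	/* default: sock_def_wakeup() */
	}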
@@ -312,7 +313,7 @@ static int smc_rx_recv_urg(struct smc_sock *smc, struct msghdr *msg, int len, return rc ? -EFAULT : len; } - if (sk->sk_state == SMC_CLOSED || sk->sk_shutdown & RCV_SHUTDOWN) + if (smc_sk_state(sk) == SMC_CLOSED || sk->sk_shutdown & RCV_SHUTDOWN) return 0; return -EAGAIN; @@ -357,7 +358,7 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg, return -EINVAL; /* future work for sk.sk_family == AF_SMC */ sk = &smc->sk; - if (sk->sk_state == SMC_LISTEN) + if (smc_sk_state(sk) == SMC_LISTEN) return -ENOTCONN; if (flags & MSG_OOB) return smc_rx_recv_urg(smc, msg, len, flags); @@ -394,7 +395,7 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg, if (read_done) { if (sk->sk_err || - sk->sk_state == SMC_CLOSED || + smc_sk_state(sk) == SMC_CLOSED || !timeo || signal_pending(current)) break; @@ -403,7 +404,7 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg, read_done = sock_error(sk); break; } - if (sk->sk_state == SMC_CLOSED) { + if (smc_sk_state(sk) == SMC_CLOSED) { if (!sock_flag(sk, SOCK_DONE)) { /* This occurs when user tries to read * from never connected socket. diff --git a/net/smc/smc_stats.h b/net/smc/smc_stats.h index 311de65b6fce9fa7c38d65064bd40c5b94a4367d..d63fb1297501f8fc511bc40aa6bd039c36666177 100644 --- a/net/smc/smc_stats.h +++ b/net/smc/smc_stats.h @@ -89,11 +89,12 @@ struct smc_stats { }; struct smc_link_ib_stats { - u64 s_wr_cnt; - u64 s_wc_cnt; - u64 r_wr_cnt; - u64 r_wc_cnt; - u64 rw_wc_cnt; + u64 s_wr_cnt; /* send request cnt*/ + u64 s_wc_cnt; /* send tx complete cnt */ + u64 r_wr_cnt; /* recv wr cnt */ + u64 r_wc_cnt; /* recv complete cnt */ + u64 rw_wr_cnt; /* rdma write(with imm) request cnt */ + u64 rw_wc_cnt; /* rdma write(with imm) tx complete cnt */ }; struct smc_link_stats { @@ -129,6 +130,8 @@ do { \ } else { \ if (op == IB_WR_SEND || op == IB_WR_REG_MR) \ SMC_LINK_STAT_IB(_lnk_stats, s, wr); \ + else if (op == IB_WR_RDMA_WRITE_WITH_IMM) \ + SMC_LINK_STAT_IB(_lnk_stats, rw, wr); \ } \ } \ while (0) @@ -289,8 +292,9 @@ while (0) #define SMC_STAT_SERV_SUCC_INC(net, _ini) \ do { \ typeof(_ini) i = (_ini); \ - bool is_v2 = (i->smcd_version & SMC_V2); \ bool is_smcd = (i->is_smcd); \ + u8 version = is_smcd ? 
i->smcd_version : i->smcr_version; \ + bool is_v2 = (version & SMC_V2); \ typeof(net->smc.smc_stats) smc_stats = (net)->smc.smc_stats; \ if (is_v2 && is_smcd) \ this_cpu_inc(smc_stats->smc[SMC_TYPE_D].srv_v2_succ_cnt); \ diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c index d6324144cdf337ed3b9d704c873f8c501886b5c4..3f53694d5549cac57654d4487ac9fac723a055bd 100644 --- a/net/smc/smc_sysctl.c +++ b/net/smc/smc_sysctl.c @@ -70,15 +70,6 @@ static struct ctl_table smc_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "allow_different_subnet", - .data = &init_net.smc.sysctl_allow_different_subnet, - .maxlen = sizeof(init_net.smc.sysctl_allow_different_subnet), - .mode = 0644, - .proc_handler = proc_dointvec_minmax, - .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, - }, { .procname = "limit_handshake", .data = &init_net.smc.limit_smc_hs, @@ -88,6 +79,13 @@ static struct ctl_table smc_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, + { + .procname = "vendor_exp_options", + .data = &init_net.smc.sysctl_vendor_exp_options, + .maxlen = sizeof(init_net.smc.sysctl_vendor_exp_options), + .mode = 0644, + .proc_handler = proc_douintvec, + }, { } }; @@ -113,11 +111,13 @@ int __net_init smc_sysctl_net_init(struct net *net) net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE; net->smc.sysctl_smcr_buf_type = SMCR_PHYS_CONT_BUFS; + net->smc.sysctl_vendor_exp_options = ~0U; net->smc.sysctl_smcr_testlink_time = SMC_LLC_TESTLINK_DEFAULT_TIME; net->smc.sysctl_wmem = 262144; /* 256 KiB */ net->smc.sysctl_rmem = 262144; /* 256 KiB */ net->smc.sysctl_tcp2smc = 0; - net->smc.sysctl_allow_different_subnet = 1; + /* enable handshake limitation by default */ + net->smc.limit_smc_hs = 1; return 0; err_reg: diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c index 1c0c2a411e972a6feed68907a27ad0c97040d397..645abd73453cd029f73125513b0e5c9632797902 100644 --- a/net/smc/smc_tx.c +++ b/net/smc/smc_tx.c @@ -113,8 +113,8 @@ static int smc_tx_wait(struct smc_sock *smc, int flags) break; /* at least 1 byte of free & no urgent data */ set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); sk_wait_event(sk, &timeo, - sk->sk_err || - (sk->sk_shutdown & SEND_SHUTDOWN) || + READ_ONCE(sk->sk_err) || + (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) || smc_cdc_rxed_any_close(conn) || (atomic_read(&conn->sndbuf_space) && !conn->urg_tx_pend), @@ -199,7 +199,7 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len) goto out_err; } - if (sk->sk_state == SMC_INIT) + if (smc_sk_state(sk) == SMC_INIT) return -ENOTCONN; if (len > conn->sndbuf_desc->len) @@ -338,7 +338,8 @@ static int smc_tx_rdma_write(struct smc_connection *conn, int peer_rmbe_offset, struct smc_link *link = conn->lnk; int rc; - rdma_wr->wr.wr_id = smc_wr_tx_get_next_wr_id(link); + if (!lgr->use_rwwi) + rdma_wr->wr.wr_id = smc_wr_tx_get_next_wr_id(link); rdma_wr->wr.num_sge = num_sges; rdma_wr->remote_addr = lgr->rtokens[conn->rtoken_idx][link->link_idx].dma_addr + @@ -356,6 +357,118 @@ static int smc_tx_rdma_write(struct smc_connection *conn, int peer_rmbe_offset, rc = ib_post_send(link->roce_qp, &rdma_wr->wr, NULL); if (rc) smcr_link_down_cond_sched(link); + else if (lgr->use_rwwi) + SMC_LINK_STAT_WR(&lgr->lnk_stats[link->link_idx], rdma_wr->wr.opcode, 0); + return rc; +} + +static int __smcr_tx_rdma_writes_rwwi(struct smc_connection *conn, int dst_off, + int dst_len, int num_sges, struct ib_rdma_wr *wr) +{ + struct smc_cdc_producer_flags *pflags; + bool update_rx_curs_confirmed = true; + struct smc_link *link = 
conn->lnk; + union smc_host_cursor cons_old; + union smc_wr_rwwi_tx_id wr_id; + union smc_wr_imm_msg imm_msg; + union smc_host_cursor cfed; + u8 urg_flags, prod_flags; + u8 saved_credits = 0; + bool cr_flag = false; + u8 conn_state_flags; + int diff_cons, rc; + + BUILD_BUG_ON_MSG(sizeof(union smc_wr_imm_msg) > sizeof(__be32), + "sizeof(union smc_wr_imm_msg) can not exceed the size of imm_data(__be32)"); + + pflags = &conn->local_tx_ctrl.prod_flags; + imm_msg.imm_data = 0; + wr_id.data = 0; + + atomic_inc(&conn->cdc_pend_tx_wr); + smp_mb__after_atomic(); /* Make sure cdc_pend_tx_wr added before post */ + cr_flag = smc_wr_rx_credits_need_announce_frequent(link); + imm_msg.hdr.token = conn->local_tx_ctrl.token; + imm_msg.hdr.opcode = cr_flag ? SMC_WR_OP_DATA_CR : SMC_WR_OP_DATA; + + conn_state_flags = *((u8 *)&conn->local_tx_ctrl.conn_state_flags); + prod_flags = *((u8 *)pflags); + urg_flags = (*((u8 *)pflags) & SMC_PROD_FLAGS_MASK); + /* transfer priority: + * 1. transfer conn_state_flags and prod_flags (except urg_flags) + * when they are set. + * 2. transfer urg_flags when it is set. + * Diff_cons should be transferred together to ensure subsequent sending + * with urg_flags always set. + * 3. transfer diff_cons without flags. + */ + if (urg_flags) + imm_msg.hdr.opcode = cr_flag ? + SMC_WR_OP_DATA_WITH_FLAGS_CR : SMC_WR_OP_DATA_WITH_FLAGS; + /* conn_state_flags or prod_flags (except urg_flags) is set */ + if (conn_state_flags || prod_flags != urg_flags) + imm_msg.hdr.opcode = SMC_WR_OP_CTRL; + + smc_curs_copy(&cfed, &conn->local_tx_ctrl.cons, conn); + smc_curs_copy(&cons_old, &conn->rx_curs_confirmed, conn); + diff_cons = smc_curs_diff(conn->rmb_desc->len, &cons_old, + &conn->local_tx_ctrl.cons); + switch (imm_msg.hdr.opcode) { + case SMC_WR_OP_DATA: + if (diff_cons > SMC_DATA_MAX_DIFF_CONS) + diff_cons = SMC_DATA_MAX_DIFF_CONS; + imm_msg.data.diff_cons = diff_cons; + break; + case SMC_WR_OP_CTRL: + imm_msg.ctrl.pflags = conn->local_tx_ctrl.prod_flags; + imm_msg.ctrl.csflags = conn->local_tx_ctrl.conn_state_flags; + update_rx_curs_confirmed = false; + break; + case SMC_WR_OP_DATA_WITH_FLAGS: + if (diff_cons > SMC_DATA_WITH_FLAGS_MAX_DIFF_CONS) + diff_cons = SMC_DATA_WITH_FLAGS_MAX_DIFF_CONS; + imm_msg.data_with_flags.write_blocked = pflags->write_blocked; + imm_msg.data_with_flags.urg_data_present = pflags->urg_data_present; + imm_msg.data_with_flags.urg_data_pending = pflags->urg_data_pending; + imm_msg.data_with_flags.diff_cons = diff_cons; + break; + case SMC_WR_OP_DATA_CR: + if (diff_cons > SMC_DATA_CR_MAX_DIFF_CONS) + diff_cons = SMC_DATA_CR_MAX_DIFF_CONS; + imm_msg.data_cr.diff_cons = diff_cons; + saved_credits = (u8)smc_wr_rx_get_credits(link); + imm_msg.data_cr.credits = saved_credits; + break; + case SMC_WR_OP_DATA_WITH_FLAGS_CR: + if (diff_cons > SMC_DATA_WITH_FLAGS_CR_MAX_DIFF_CONS) + diff_cons = SMC_DATA_WITH_FLAGS_CR_MAX_DIFF_CONS; + imm_msg.data_with_flags_cr.write_blocked = pflags->write_blocked; + imm_msg.data_with_flags_cr.urg_data_present = pflags->urg_data_present; + imm_msg.data_with_flags_cr.urg_data_pending = pflags->urg_data_pending; + imm_msg.data_with_flags_cr.diff_cons = diff_cons; + saved_credits = (u8)smc_wr_rx_get_credits(link); + imm_msg.data_with_flags_cr.credits = saved_credits; + break; + } + + wr_id.rwwi_flag = 1; + wr_id.token = conn->alert_token_local; + wr_id.inflight_sent = dst_len; + wr->wr.wr_id = wr_id.data; + wr->wr.ex.imm_data = cpu_to_be32(imm_msg.imm_data); + + rc = smc_tx_rdma_write(conn, dst_off, num_sges, wr); + if (!rc) { + /* do not update 
rx_curs_confirmed if all flags equal to 0, + * since diff_cons will not be carried by imm_data in this case. + */ + if (update_rx_curs_confirmed) + smc_curs_add(conn->rmb_desc->len, &conn->rx_curs_confirmed, diff_cons); + } else { + smc_wr_rx_put_credits(link, saved_credits); + atomic_dec(&conn->cdc_pend_tx_wr); + } + return rc; } @@ -375,55 +488,128 @@ static inline void smc_tx_advance_cursors(struct smc_connection *conn, smc_curs_add(conn->sndbuf_desc->len, sent, len); } -/* SMC-R helper for smc_tx_rdma_writes() */ -static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len, - size_t src_off, size_t src_len, - size_t dst_off, size_t dst_len, - struct smc_rdma_wr *wr_rdma_buf) +static inline int __smc_get_free_slot_rwwi(struct smc_link *link) +{ + if (!smc_link_sendable(link)) + return -ENOLINK; + + if (atomic_dec_if_positive(&link->tx_inflight_credit) < 0) + return -EBUSY; + + if (smc_wr_tx_get_credit(link)) + return 0; + + atomic_inc(&link->tx_inflight_credit); + return -EBUSY; +} + +static int smc_tx_get_free_slot_rwwi(struct smc_link *link, + struct smc_connection *conn) +{ + struct smc_link_group *lgr = link->lgr; + int rc; + + if (in_softirq() || lgr->terminating) { + rc = __smc_get_free_slot_rwwi(link); + if (rc) + return rc; + } else { + rc = wait_event_interruptible_timeout(link->wr_tx_wait, + !smc_link_sendable(link) || + lgr->terminating || + (__smc_get_free_slot_rwwi(link) != -EBUSY), + SMC_WR_TX_WAIT_FREE_SLOT_TIME); + if (!rc) { + /* timeout - terminate link */ + smcr_link_down_cond_sched(link); + return -EPIPE; + } + if (!smc_link_sendable(link) || lgr->terminating) + return -EPIPE; + } + if (conn->killed) { + atomic_inc(&link->tx_inflight_credit); + return -EPIPE; + } + return 0; +} + +void smc_tx_put_free_slot_rwwi(struct smc_link *link, bool complete) +{ + if (!complete) + smc_wr_tx_put_credits(link, 1, true); + + atomic_inc(&link->tx_inflight_credit); +} + +static int smc_tx_fill_wr(struct smc_connection *conn, int *src_off, + int src_len, int dst_len, struct ib_rdma_wr *wr, + struct ib_sge *sge, bool use_rwwi) { struct smc_link *link = conn->lnk; dma_addr_t dma_addr = sg_dma_address(conn->sndbuf_desc->sgt[link->link_idx].sgl); u64 virt_addr = (uintptr_t)conn->sndbuf_desc->cpu_addr; + int src_len_sum = src_len; + u64 base_addr = dma_addr; + int srcchunk; + int num_sges; + + if (dst_len < link->qp_attr.cap.max_inline_data) { + base_addr = virt_addr; + wr->wr.send_flags |= IB_SEND_INLINE; + } else { + wr->wr.send_flags &= ~IB_SEND_INLINE; + } + + num_sges = 0; + for (srcchunk = 0; srcchunk < 2; srcchunk++) { + sge[srcchunk].addr = conn->sndbuf_desc->is_vm ? 
+ (virt_addr + *src_off) : (base_addr + *src_off); + sge[srcchunk].length = src_len; + + if (conn->sndbuf_desc->is_vm) { + sge[srcchunk].lkey = + conn->sndbuf_desc->mr[link->link_idx]->lkey; + } else { + sge[srcchunk].lkey = conn->lnk->roce_pd->local_dma_lkey; + } + num_sges++; + + *src_off += src_len; + if (*src_off >= conn->sndbuf_desc->len) + *src_off -= conn->sndbuf_desc->len; + /* modulo in send ring */ + if (src_len_sum == dst_len) + break; /* either on 1st or 2nd iteration */ + /* prepare next (== 2nd) iteration */ + src_len = dst_len - src_len; /* remainder */ + src_len_sum += src_len; + } + + return num_sges; +} + +/* SMC-R helper for smc_tx_rdma_writes() */ +static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len, + size_t src_off, size_t src_len, + size_t dst_off, size_t dst_len, + struct smc_rdma_wr *wr_rdma_buf) +{ int src_len_sum = src_len, dst_len_sum = dst_len; int sent_count = src_off; - int srcchunk, dstchunk; + int *src_offset; + int dstchunk; int num_sges; int rc; + src_offset = (int *)&src_off; for (dstchunk = 0; dstchunk < 2; dstchunk++) { struct ib_rdma_wr *wr = &wr_rdma_buf->wr_tx_rdma[dstchunk]; struct ib_sge *sge = wr->wr.sg_list; - u64 base_addr = dma_addr; - - if (dst_len < link->qp_attr.cap.max_inline_data) { - base_addr = virt_addr; - wr->wr.send_flags |= IB_SEND_INLINE; - } else { - wr->wr.send_flags &= ~IB_SEND_INLINE; - } - - num_sges = 0; - for (srcchunk = 0; srcchunk < 2; srcchunk++) { - sge[srcchunk].addr = conn->sndbuf_desc->is_vm ? - (virt_addr + src_off) : (base_addr + src_off); - sge[srcchunk].length = src_len; - if (conn->sndbuf_desc->is_vm) - sge[srcchunk].lkey = - conn->sndbuf_desc->mr[link->link_idx]->lkey; - num_sges++; - src_off += src_len; - if (src_off >= conn->sndbuf_desc->len) - src_off -= conn->sndbuf_desc->len; - /* modulo in send ring */ - if (src_len_sum == dst_len) - break; /* either on 1st or 2nd iteration */ - /* prepare next (== 2nd) iteration */ - src_len = dst_len - src_len; /* remainder */ - src_len_sum += src_len; - } + num_sges = smc_tx_fill_wr(conn, src_offset, src_len, dst_len, wr, sge, false); rc = smc_tx_rdma_write(conn, dst_off, num_sges, wr); if (rc) return rc; @@ -440,6 +626,141 @@ static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len, return 0; } +static int smcr_tx_rdma_writes_rwwi(struct smc_connection *conn) +{ + union smc_host_cursor sent, prep, prod, cons; + int to_send, rmbespace, src_off, num_sges; + struct smc_cdc_producer_flags *pflags; + size_t len, src_len, dst_len; /* current chunk values */ + struct ib_rdma_wr wr; + struct ib_sge sge[2]; + int rc; + + conn->unwrap_remaining = 0; + /* source: sndbuf */ + smc_curs_copy(&sent, &conn->tx_curs_sent, conn); + smc_curs_copy(&prep, &conn->tx_curs_prep, conn); + /* cf. wmem_alloc - (snd_max - snd_una) */ + to_send = smc_curs_diff(conn->sndbuf_desc->len, &sent, &prep); + if (to_send <= 0) + return 0; + + /* destination: RMBE */ + /* cf. snd_wnd */ + rmbespace = atomic_read(&conn->peer_rmbe_space); + if (rmbespace <= 0) { + struct smc_sock *smc = container_of(conn, struct smc_sock, + conn); + SMC_STAT_RMB_TX_PEER_FULL(smc, !conn->lnk); + return 0; + } + smc_curs_copy(&prod, &conn->local_tx_ctrl.prod, conn); + smc_curs_copy(&cons, &conn->local_rx_ctrl.cons, conn); + + /* if usable snd_wnd closes ask peer to advertise once it opens again */ + pflags = &conn->local_tx_ctrl.prod_flags; + pflags->write_blocked = (to_send >= rmbespace); + /* cf. 
usable snd_wnd */ + len = min(to_send, rmbespace); + + if (prod.wrap == cons.wrap) { + /* the filled destination area is unwrapped, + * hence the available free destination space is wrapped + * and we need 2 destination chunks of sum len; start with 1st + * which is limited by what's available in sndbuf + */ + dst_len = min_t(size_t, conn->peer_rmbe_size - prod.count, len); + } else { + /* the filled destination area is wrapped, + * hence the available free destination space is unwrapped + * and we need a single destination chunk of entire len + */ + dst_len = len; + } + /* dst_len determines the maximum src_len */ + if (sent.count + dst_len <= conn->sndbuf_desc->len) { + /* unwrapped src case: single chunk of entire dst_len */ + src_len = dst_len; + } else { + /* wrapped src case: 2 chunks of sum dst_len; start with 1st: */ + src_len = conn->sndbuf_desc->len - sent.count; + } + + /* if the filled destination area is unwrapped, set unwrap_remaining flag and + * the remaining data will send the next time. + */ + if (dst_len < len) + conn->unwrap_remaining = 1; + + /* update urg_data_present in advance since this info needs + * to transfer to remote by write with imm + */ + if (conn->urg_tx_pend && dst_len == to_send) + pflags->urg_data_present = 1; + + src_off = sent.count; + memset(&wr, 0, sizeof(wr)); + wr.wr.send_flags = IB_SEND_SIGNALED; + wr.wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM; + num_sges = smc_tx_fill_wr(conn, &src_off, src_len, dst_len, &wr, sge, true); + wr.wr.sg_list = sge; + rc = __smcr_tx_rdma_writes_rwwi(conn, prod.count, dst_len, num_sges, &wr); + if (rc) + return rc; + + smc_tx_advance_cursors(conn, &prod, &sent, dst_len); + /* update connection's cursors with advanced local cursors */ + smc_curs_copy(&conn->local_tx_ctrl.prod, &prod, conn); + /* dst: peer RMBE */ + smc_curs_copy(&conn->tx_curs_sent, &sent, conn);/* src: local sndbuf */ + + return 0; +} + +int smc_tx_rdma_write_with_no_data_rwwi(struct smc_connection *conn) +{ + struct smc_link *link; + struct ib_rdma_wr wr; + bool again = false; + int rc; + + if (!smc_conn_lgr_valid(conn)) + return -EPIPE; + +again: + link = conn->lnk; + if (!smc_wr_tx_link_hold(link)) + return -ENOLINK; + + memset(&wr, 0, sizeof(wr)); + wr.wr.send_flags = IB_SEND_SIGNALED; + wr.wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM; + + rc = smc_tx_get_free_slot_rwwi(link, conn); + if (rc) + goto put_out; + + spin_lock_bh(&conn->send_lock); + if (link != conn->lnk) { + /* link of connection changed, try again one time*/ + spin_unlock_bh(&conn->send_lock); + smc_tx_put_free_slot_rwwi(link, false); + smc_wr_tx_link_put(link); + if (again) + return -ENOLINK; + again = true; + goto again; + } + + rc = __smcr_tx_rdma_writes_rwwi(conn, 0, 0, 0, &wr); + if (rc) + smc_tx_put_free_slot_rwwi(link, false); + spin_unlock_bh(&conn->send_lock); +put_out: + smc_wr_tx_link_put(link); + return rc; +} + /* SMC-D helper for smc_tx_rdma_writes() */ static int smcd_tx_rdma_writes(struct smc_connection *conn, size_t len, size_t src_off, size_t src_len, @@ -625,6 +946,65 @@ static int smcr_tx_sndbuf_nonempty(struct smc_connection *conn) return rc; } +static int smcr_tx_sndbuf_nonempty_rwwi(struct smc_connection *conn) +{ + struct smc_cdc_producer_flags *pflags = &conn->local_tx_ctrl.prod_flags; + struct smc_link *link = conn->lnk; + int rc; + + if (!link || !smc_wr_tx_link_hold(link)) + return -ENOLINK; + +again: + /*count the num of infight io and limit it*/ + rc = smc_tx_get_free_slot_rwwi(link, conn); + if (rc < 0) { + smc_wr_tx_link_put(link); + if (rc == -EBUSY) { + struct 
smc_sock *smc = + container_of(conn, struct smc_sock, conn); + + if (smc->sk.sk_err == ECONNABORTED) + return sock_error(&smc->sk); + if (conn->killed) + return -EPIPE; + rc = 0; + mod_delayed_work(conn->lgr->tx_wq, &conn->tx_work, + SMC_TX_WORK_DELAY); + } + return rc; + } + + spin_lock_bh(&conn->send_lock); + if (link != conn->lnk) { + /* link of connection changed, tx_work will restart */ + smc_tx_put_free_slot_rwwi(link, false); + rc = -ENOLINK; + goto out_unlock; + } + + rc = smcr_tx_rdma_writes_rwwi(conn); + if (rc) { + smc_tx_put_free_slot_rwwi(link, false); + goto out_unlock; + } + + if (pflags->urg_data_present) { + pflags->urg_data_pending = 0; + pflags->urg_data_present = 0; + } + + if (unlikely(conn->unwrap_remaining)) { + spin_unlock_bh(&conn->send_lock); + goto again; + } + +out_unlock: + spin_unlock_bh(&conn->send_lock); + smc_wr_tx_link_put(link); + return rc; +} + static int smcd_tx_sndbuf_nonempty(struct smc_connection *conn) { struct smc_cdc_producer_flags *pflags = &conn->local_tx_ctrl.prod_flags; @@ -664,10 +1044,14 @@ static int __smc_tx_sndbuf_nonempty(struct smc_connection *conn) rc = -EPIPE; /* connection being aborted */ goto out; } - if (conn->lgr->is_smcd) + if (conn->lgr->is_smcd) { rc = smcd_tx_sndbuf_nonempty(conn); - else - rc = smcr_tx_sndbuf_nonempty(conn); + } else { + if (conn->lgr->use_rwwi) + rc = smcr_tx_sndbuf_nonempty_rwwi(conn); + else + rc = smcr_tx_sndbuf_nonempty(conn); + } if (!rc) { /* trigger socket release if connection is closing */ @@ -718,6 +1102,9 @@ void smc_tx_pending(struct smc_connection *conn) if (smc->sk.sk_err) return; + if (smc_tx_prepared_sends(conn) <= 0) + return; + rc = smc_tx_sndbuf_nonempty(conn); if (!rc && conn->local_rx_ctrl.prod_flags.write_blocked && !atomic_read(&conn->bytes_to_rcv)) @@ -764,11 +1151,21 @@ void smc_tx_consumer_update(struct smc_connection *conn, bool force) if (conn->killed || conn->local_rx_ctrl.conn_state_flags.peer_conn_abort) return; - if ((smc_cdc_get_slot_and_msg_send(conn) < 0) && - !conn->killed) { - queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work, - SMC_TX_WORK_DELAY); - return; + + if (conn->lgr->use_rwwi) { + if ((smc_tx_rdma_write_with_no_data_rwwi(conn) < 0) && + !conn->killed) { + queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work, + SMC_TX_WORK_DELAY); + return; + } + } else { + if ((smc_cdc_get_slot_and_msg_send(conn) < 0) && + !conn->killed) { + queue_delayed_work(conn->lgr->tx_wq, &conn->tx_work, + SMC_TX_WORK_DELAY); + return; + } } } if (conn->local_rx_ctrl.prod_flags.write_blocked && diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h index 34b578498b1f1cd78a75fc9bed698e9bd1080dae..71a8e446b765c1945512b16a60afcd60bd3be4bc 100644 --- a/net/smc/smc_tx.h +++ b/net/smc/smc_tx.h @@ -38,5 +38,6 @@ void smc_tx_sndbuf_nonfull(struct smc_sock *smc); void smc_tx_consumer_update(struct smc_connection *conn, bool force); int smcd_tx_ism_write(struct smc_connection *conn, void *data, size_t len, u32 offset, int signal); - +void smc_tx_put_free_slot_rwwi(struct smc_link *link, bool complete); +int smc_tx_rdma_write_with_no_data_rwwi(struct smc_connection *conn); #endif /* SMC_TX_H */ diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c index 0eff796ca5a9dcac514437a006127ee595d3b161..cede8da85bc880ec3130db480b73082782a0d553 100644 --- a/net/smc/smc_wr.c +++ b/net/smc/smc_wr.c @@ -31,6 +31,8 @@ #include "smc.h" #include "smc_wr.h" #include "smc_dim.h" +#include "smc_cdc.h" +#include "smc_tx.h" #define SMC_WR_MAX_POLL_CQE 10 /* max. # of compl. 
queue elements in 1 poll */ @@ -75,13 +77,22 @@ static inline int smc_wr_tx_find_pending_index(struct smc_link *link, u64 wr_id) return link->wr_tx_cnt; } -static inline void smc_wr_tx_process_cqe(struct ib_wc *wc) +static inline void smc_wr_tx_process_cqe(struct ib_wc *wc, bool is_rwwi) { struct smc_wr_tx_pend pnd_snd; + union smc_wr_rwwi_tx_id wr_id; struct smc_link *link; u32 pnd_snd_idx; link = wc->qp->qp_context; + if (is_rwwi) { + wr_id.data = wc->wr_id; + smc_tx_put_free_slot_rwwi(link, true); + if (wc->status) + smcr_link_down_cond_sched(link); + smc_cdc_tx_handler_rwwi(wc); + goto wake_tx_wait; + } if (wc->opcode == IB_WC_REG_MR) { if (wc->status) @@ -131,7 +142,9 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc) } if (pnd_snd.handler) pnd_snd.handler(&pnd_snd.priv, link, wc->status); - wake_up(&link->wr_tx_wait); +wake_tx_wait: + if (wq_has_sleeper(&link->wr_tx_wait)) + wake_up(&link->wr_tx_wait); } /*---------------------------- request submission ---------------------------*/ @@ -141,11 +154,16 @@ static inline int smc_wr_tx_get_free_slot_index(struct smc_link *link, u32 *idx) *idx = link->wr_tx_cnt; if (!smc_link_sendable(link)) return -ENOLINK; + + if (!smc_wr_tx_get_credit(link)) + return -EBUSY; + for_each_clear_bit(*idx, link->wr_tx_mask, link->wr_tx_cnt) { if (!test_and_set_bit(*idx, link->wr_tx_mask)) return 0; } *idx = link->wr_tx_cnt; + smc_wr_tx_put_credits(link, 1, false); return -EBUSY; } @@ -251,7 +269,7 @@ int smc_wr_tx_put_slot(struct smc_link *link, memset(&link->wr_tx_bufs[idx], 0, sizeof(link->wr_tx_bufs[idx])); test_and_clear_bit(idx, link->wr_tx_mask); - wake_up(&link->wr_tx_wait); + smc_wr_tx_put_credits(link, 1, true); return 1; } else if (link->lgr->smc_version == SMC_V2 && pend->idx == link->wr_tx_cnt) { @@ -338,7 +356,7 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr) int rc; link->wr_reg_state = POSTED; - link->wr_reg.wr.wr_id = (u64)(uintptr_t)mr; + link->wr_reg.wr.wr_id = smc_wr_tx_get_next_wr_id(link); link->wr_reg.mr = mr; link->wr_reg.key = mr->rkey; rc = ib_post_send(link->roce_qp, &link->wr_reg.wr, NULL); @@ -348,12 +366,11 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr) SMC_LINK_STAT_WR(&link->lgr->lnk_stats[link->link_idx], link->wr_reg.wr.opcode, 0); - atomic_inc(&link->wr_reg_refcnt); + percpu_ref_get(&link->wr_reg_refs); rc = wait_event_interruptible_timeout(link->wr_reg_wait, (link->wr_reg_state != POSTED), SMC_WR_REG_MR_WAIT_TIME); - if (atomic_dec_and_test(&link->wr_reg_refcnt)) - wake_up_all(&link->wr_reg_wait); + percpu_ref_put(&link->wr_reg_refs); if (!rc) { /* timeout - terminate link */ smcr_link_down_cond_sched(link); @@ -426,7 +443,10 @@ static inline void smc_wr_rx_process_cqe(struct ib_wc *wc) if (wc->status == IB_WC_SUCCESS) { link->wr_rx_tstamp = jiffies; - smc_wr_rx_demultiplex(wc); + if (wc->wc_flags & IB_WC_WITH_IMM) + smc_cdc_rx_handler_rwwi(wc); + else + smc_wr_rx_demultiplex(wc); smc_wr_rx_post(link); /* refill WR RX */ } else { /* handle status errors */ @@ -441,6 +461,12 @@ static inline void smc_wr_rx_process_cqe(struct ib_wc *wc) break; } } + + if (smc_wr_rx_credits_need_announce(link) && + !test_bit(SMC_LINKFLAG_ANNOUNCE_PENDING, &link->flags)) { + set_bit(SMC_LINKFLAG_ANNOUNCE_PENDING, &link->flags); + schedule_work(&link->credits_announce_work); + } } static void smc_wr_tasklet_fn(struct tasklet_struct *t) @@ -458,12 +484,17 @@ static void smc_wr_tasklet_fn(struct tasklet_struct *t) for (i = 0; i < rc; i++) { link = wc[i].qp->qp_context; lnk_stats = 
&link->lgr->lnk_stats[link->link_idx]; - if (smc_wr_id_is_rx(wc[i].wr_id)) { - SMC_LINK_STAT_WC(lnk_stats, wc[i].opcode, 1); - smc_wr_rx_process_cqe(&wc[i]); - } else { + if (SMC_WR_IS_TX_RWWI(wc[i].wr_id)) { SMC_LINK_STAT_WC(lnk_stats, wc[i].opcode, 0); - smc_wr_tx_process_cqe(&wc[i]); + smc_wr_tx_process_cqe(&wc[i], true); + } else { + if (smc_wr_id_is_rx(wc[i].wr_id)) { + SMC_LINK_STAT_WC(lnk_stats, wc[i].opcode, 1); + smc_wr_rx_process_cqe(&wc[i]); + } else { + SMC_LINK_STAT_WC(lnk_stats, wc[i].opcode, 0); + smc_wr_tx_process_cqe(&wc[i], false); + } } } @@ -498,6 +529,8 @@ int smc_wr_rx_post_init(struct smc_link *link) for (i = 0; i < link->wr_rx_cnt; i++) rc = smc_wr_rx_post(link); + /* credits have already been announced to peer */ + atomic_set(&link->local_rq_credits, 0); return rc; } @@ -534,7 +567,7 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk) lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT, lnk->qp_attr.cap.max_send_wr); - lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3, + lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT, lnk->qp_attr.cap.max_recv_wr); lnk_stats->qpn = lnk->roce_qp->qp_num; } @@ -620,6 +653,8 @@ static void smc_wr_init_sge(struct smc_link *lnk) void smc_wr_free_link(struct smc_link *lnk) { + int rx_buf_size = (lnk->lgr->smc_version == SMC_V2) ? + SMC_WR_BUF_V2_SIZE : SMC_WR_BUF_SIZE; struct ib_device *ibdev; if (!lnk->smcibdev) @@ -630,12 +665,14 @@ void smc_wr_free_link(struct smc_link *lnk) smc_wr_wakeup_tx_wait(lnk); smc_wr_tx_wait_no_pending_sends(lnk); - wait_event(lnk->wr_reg_wait, (!atomic_read(&lnk->wr_reg_refcnt))); - wait_event(lnk->wr_tx_wait, (!atomic_read(&lnk->wr_tx_refcnt))); + percpu_ref_kill(&lnk->wr_reg_refs); + wait_for_completion(&lnk->reg_ref_comp); + percpu_ref_kill(&lnk->wr_tx_refs); + wait_for_completion(&lnk->tx_ref_comp); if (lnk->wr_rx_dma_addr) { ib_dma_unmap_single(ibdev, lnk->wr_rx_dma_addr, - SMC_WR_BUF_SIZE * lnk->wr_rx_cnt, + rx_buf_size * lnk->wr_rx_cnt, DMA_FROM_DEVICE); lnk->wr_rx_dma_addr = 0; } @@ -714,7 +751,7 @@ int smc_wr_alloc_link_mem(struct smc_link *link) link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL); if (!link->wr_tx_bufs) goto no_mem; - link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, rx_buf_size, + link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT, rx_buf_size, GFP_KERNEL); if (!link->wr_rx_bufs) goto no_mem_wr_tx_bufs; @@ -722,7 +759,7 @@ int smc_wr_alloc_link_mem(struct smc_link *link) GFP_KERNEL); if (!link->wr_tx_ibs) goto no_mem_wr_rx_bufs; - link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3, + link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_rx_ibs[0]), GFP_KERNEL); if (!link->wr_rx_ibs) @@ -741,7 +778,7 @@ int smc_wr_alloc_link_mem(struct smc_link *link) GFP_KERNEL); if (!link->wr_tx_sges) goto no_mem_wr_tx_rdma_sges; - link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3, sizeof(link->wr_rx_sges[0]), + link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_rx_sges[0]), GFP_KERNEL); if (!link->wr_rx_sges) goto no_mem_wr_tx_sges; @@ -823,6 +860,20 @@ void smc_wr_add_dev(struct smc_ib_device *smcibdev) } } +static void smcr_wr_tx_refs_free(struct percpu_ref *ref) +{ + struct smc_link *lnk = container_of(ref, struct smc_link, wr_tx_refs); + + complete(&lnk->tx_ref_comp); +} + +static void smcr_wr_reg_refs_free(struct percpu_ref *ref) +{ + struct smc_link *lnk = container_of(ref, struct smc_link, wr_reg_refs); + + complete(&lnk->reg_ref_comp); +} + int smc_wr_create_link(struct smc_link *lnk) { int rx_buf_size = (lnk->lgr->smc_version == SMC_V2) ? 
@@ -860,9 +911,26 @@ int smc_wr_create_link(struct smc_link *lnk) smc_wr_init_sge(lnk); bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT); init_waitqueue_head(&lnk->wr_tx_wait); - atomic_set(&lnk->wr_tx_refcnt, 0); + rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL); + if (rc) + goto dma_unmap; + init_completion(&lnk->tx_ref_comp); init_waitqueue_head(&lnk->wr_reg_wait); - atomic_set(&lnk->wr_reg_refcnt, 0); + rc = percpu_ref_init(&lnk->wr_reg_refs, smcr_wr_reg_refs_free, 0, GFP_KERNEL); + if (rc) + goto dma_unmap; + init_completion(&lnk->reg_ref_comp); + atomic_set(&lnk->tx_inflight_credit, lnk->wr_tx_cnt); + atomic_set(&lnk->peer_rq_credits, 0); + atomic_set(&lnk->local_rq_credits, 0); + lnk->flags = 0; + lnk->local_cr_watermark_high = max(lnk->wr_rx_cnt / 3, 1U); + lnk->peer_cr_watermark_low = 0; + + /* if credits accumlated less than 10% of wr_rx_cnt(at least 5), + * will not be announced by cdc msg. + */ + lnk->credits_update_limit = max(lnk->wr_rx_cnt / 10, 5U); return rc; dma_unmap: @@ -873,7 +941,7 @@ int smc_wr_create_link(struct smc_link *lnk) lnk->wr_tx_v2_dma_addr = 0; } ib_dma_unmap_single(ibdev, lnk->wr_rx_dma_addr, - SMC_WR_BUF_SIZE * lnk->wr_rx_cnt, + rx_buf_size * lnk->wr_rx_cnt, DMA_FROM_DEVICE); lnk->wr_rx_dma_addr = 0; out: diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h index fca395701be3b852d2cde1672e65784a95db23d4..18700de14e76378e6ea67ca664478076377cd12e 100644 --- a/net/smc/smc_wr.h +++ b/net/smc/smc_wr.h @@ -19,7 +19,12 @@ #include "smc.h" #include "smc_core.h" -#define SMC_WR_BUF_CNT 16 /* # of ctrl buffers per link */ +#define SMC_WR_BUF_CNT 64 /* # of ctrl buffers per link, SMC_WR_BUF_CNT + * should not be less than 2 * SMC_RMBS_PER_LGR_MAX, + * since every connection at least has two rq/sq + * credits in average, otherwise may result in + * waiting for credits in sending process. 
+ */ #define SMC_WR_TX_WAIT_FREE_SLOT_TIME (10 * HZ) @@ -46,12 +51,166 @@ struct smc_wr_rx_handler { u8 type; }; +/* SMC_WR_OP_xx should not exceed 7, as op field is 3bits */ +enum { + SMC_WR_OP_DATA = 0, + SMC_WR_OP_DATA_WITH_FLAGS, + SMC_WR_OP_CTRL, + SMC_WR_OP_RSVD, + SMC_WR_OP_DATA_CR = 6, + SMC_WR_OP_DATA_WITH_FLAGS_CR = 7 +}; + +/* used to replace member 'data' in union smc_wr_imm_msg + * when imm_data carries flags info + */ +struct smc_wr_imm_data_msg { +#if defined(__BIG_ENDIAN_BITFIELD) + u32 token : 8; + u32 opcode : 3; + u32 diff_cons : 21; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u32 diff_cons : 21; + u32 opcode : 3; + u32 token : 8; +#endif +} __packed; + +/* the value of SMC_PROD_FLAGS_MASK is related to + * the definition of struct smc_wr_imm_data_with_flags_msg + */ +struct smc_wr_imm_data_with_flags_msg { +#if defined(__BIG_ENDIAN_BITFIELD) + u32 token : 8; + u32 opcode : 3; + u32 write_blocked : 1; + u32 urg_data_pending : 1; + u32 urg_data_present : 1; + u32 reserved : 1; + u32 diff_cons : 17; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u32 diff_cons : 17; + u32 reserved : 1; + u32 urg_data_present : 1; + u32 urg_data_pending : 1; + u32 write_blocked : 1; + u32 opcode : 3; + u32 token : 8; +#endif +} __packed; + +struct smc_wr_imm_ctrl_msg { + struct smc_cdc_conn_state_flags csflags; + struct smc_cdc_producer_flags pflags; +#if defined(__BIG_ENDIAN_BITFIELD) + u8 opcode : 3; + u8 reserved : 5; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u8 reserved : 5; + u8 opcode : 3; +#endif + u8 token; +} __packed; + +struct smc_wr_imm_data_cr_msg { +#if defined(__BIG_ENDIAN_BITFIELD) + u32 token : 8; + u32 opcode : 3; + u32 credits : 8; + u32 diff_cons : 13; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u32 diff_cons : 13; + u32 credits : 8; + u32 opcode : 3; + u32 token : 8; +#endif +} __packed; + +struct smc_wr_imm_data_with_flags_cr_msg { +#if defined(__BIG_ENDIAN_BITFIELD) + u32 token : 8; + u32 opcode : 3; + u32 write_blocked : 1; + u32 urg_data_pending : 1; + u32 urg_data_present : 1; + u32 credits : 8; + u32 diff_cons : 10; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u32 diff_cons : 10; + u32 credits : 8; + u32 urg_data_present : 1; + u32 urg_data_pending : 1; + u32 write_blocked : 1; + u32 opcode : 3; + u32 token : 8; +#endif +} __packed; + +struct smc_wr_imm_header { +#if defined(__BIG_ENDIAN_BITFIELD) + u32 token : 8; + u32 opcode : 3; + u32 data : 21; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u32 data : 21; + u32 opcode : 3; + u32 token : 8; +#endif +} __packed; + +/* the 11-bit (token and opcode) definition of struct + * smc_wr_imm_xxx_msg should be consistent with + * that of struct smc_wr_imm_header + */ +union smc_wr_imm_msg { + u32 imm_data; + struct smc_wr_imm_header hdr; + struct smc_wr_imm_data_msg data; + struct smc_wr_imm_data_with_flags_msg data_with_flags; + struct smc_wr_imm_ctrl_msg ctrl; + struct smc_wr_imm_data_cr_msg data_cr; + struct smc_wr_imm_data_with_flags_cr_msg data_with_flags_cr; +}; + +union smc_wr_rwwi_tx_id { + u64 data; + struct { +#if defined(__BIG_ENDIAN_BITFIELD) + u64 rwwi_flag : 1; + u64 reserved : 7; + u64 token : 8; + u64 inflight_cons : 24; + u64 inflight_sent : 24; +#elif defined(__LITTLE_ENDIAN_BITFIELD) + u64 inflight_sent : 24; + u64 inflight_cons : 24; + u64 token : 8; + u64 reserved : 7; + u64 rwwi_flag : 1; +#endif + }; +} __packed; + +/* diff_cons only holds 17 bits in DATA_WITH_FLAGS imm_data, + * and holds 21 bits in DATA imm_data, + * so diff_cons value is limited in one WRITE_WITH_IMM + * it is related to the definition of 
union smc_wr_imm_msg + */ +#define SMC_DATA_MAX_DIFF_CONS ((1 << 21) - 1) +#define SMC_DATA_WITH_FLAGS_MAX_DIFF_CONS ((1 << 17) - 1) +#define SMC_DATA_CR_MAX_DIFF_CONS ((1 << 13) - 1) +#define SMC_DATA_WITH_FLAGS_CR_MAX_DIFF_CONS ((1 << 10) - 1) +#define SMC_PROD_FLAGS_MASK 0xE0 +#define SMC_WR_ID_SEQ_MASK ((uint64_t)0x7FFFFFFFFFFFFFFFULL) +#define SMC_WR_ID_FLAG_RWWI (((uint64_t)1) << 63) +#define SMC_WR_IS_TX_RWWI(wr_id) ((wr_id) & SMC_WR_ID_FLAG_RWWI) + /* Only used by RDMA write WRs. * All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly */ static inline long smc_wr_tx_get_next_wr_id(struct smc_link *link) { - return atomic_long_add_return(2, &link->wr_tx_id); + return atomic_long_add_return(2, &link->wr_tx_id) & SMC_WR_ID_SEQ_MASK; } static inline void smc_wr_tx_set_wr_id(atomic_long_t *wr_tx_id, long val) @@ -63,14 +222,13 @@ static inline bool smc_wr_tx_link_hold(struct smc_link *link) { if (!smc_link_sendable(link)) return false; - atomic_inc(&link->wr_tx_refcnt); + percpu_ref_get(&link->wr_tx_refs); return true; } static inline void smc_wr_tx_link_put(struct smc_link *link) { - if (atomic_dec_and_test(&link->wr_tx_refcnt)) - wake_up_all(&link->wr_tx_wait); + percpu_ref_put(&link->wr_tx_refs); } static inline void smc_wr_wakeup_tx_wait(struct smc_link *lnk) @@ -83,6 +241,67 @@ static inline void smc_wr_wakeup_reg_wait(struct smc_link *lnk) wake_up(&lnk->wr_reg_wait); } +/* get one tx credit, and peer rq credits dec */ +static inline int smc_wr_tx_get_credit(struct smc_link *link) +{ + return !link->credits_enable || + atomic_dec_if_positive(&link->peer_rq_credits) >= 0; +} + +/* put tx credits, when some failures occurred after tx credits got + * or receive announce credits msgs + */ +static inline void smc_wr_tx_put_credits(struct smc_link *link, int credits, + bool wakeup) +{ + if (link->credits_enable && credits) { + atomic_add(credits, &link->peer_rq_credits); + if (wakeup && wq_has_sleeper(&link->wr_tx_wait)) + wake_up_nr(&link->wr_tx_wait, credits); + } +} + +/* to check whether peer rq credits is lower than watermark. */ +static inline int smc_wr_tx_credits_need_announce(struct smc_link *link) +{ + return link->credits_enable && atomic_read(&link->peer_rq_credits) <= + link->peer_cr_watermark_low; +} + +/* get local rq credits and set credits to zero. + * may called when announcing credits. + */ +static inline int smc_wr_rx_get_credits(struct smc_link *link) +{ + return link->credits_enable ? + atomic_fetch_and(0, &link->local_rq_credits) : 0; +} + +/* called when post_recv a rqe */ +static inline void smc_wr_rx_put_credits(struct smc_link *link, int credits) +{ + if (link->credits_enable && credits) + atomic_add(credits, &link->local_rq_credits); +} + +/* to check whether local rq credits is higher than watermark. */ +static inline int smc_wr_rx_credits_need_announce(struct smc_link *link) +{ + return link->credits_enable && atomic_read(&link->local_rq_credits) >= + link->local_cr_watermark_high; +} + +static inline int smc_wr_rx_credits_need_announce_frequent(struct smc_link *link) +{ + /* announce when local rq credits accumulated more than credits_update_limit, or + * peer rq credits is empty. As peer credits empty and local credits is less than + * credits_update_limit, may results in credits deadlock. 
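+ * Example (illustrative figures, assuming the defaults set up in
+ * smc_wr_create_link() with wr_rx_cnt = 64): credits_update_limit is
+ * max(64 / 10, 5U) = 6, so with credits enabled on the link, credits
+ * are carried in the next message once at least 6 local RQ credits
+ * have accumulated, or immediately once the peer's RQ credits are
+ * exhausted.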
+ */ + return link->credits_enable && (atomic_read(&link->local_rq_credits) >= + link->credits_update_limit || + !atomic_read(&link->peer_rq_credits)); +} + /* post a new receive work request to fill a completed old work request entry */ static inline int smc_wr_rx_post(struct smc_link *link) { @@ -93,13 +312,15 @@ static inline int smc_wr_rx_post(struct smc_link *link) u32 index; link->wr_rx_id += 2; - wr_id = link->wr_rx_id; /* tasklet context, thus not atomic */ + wr_id = link->wr_rx_id & SMC_WR_ID_SEQ_MASK; /* tasklet context, thus not atomic */ temp_wr_id = wr_id / 2; index = do_div(temp_wr_id, link->wr_rx_cnt); link->wr_rx_ibs[index].wr_id = wr_id; rc = ib_post_recv(link->roce_qp, &link->wr_rx_ibs[index], NULL); - if (!rc) + if (!rc) { SMC_LINK_STAT_WR(lnk_stats, 0, 1); + smc_wr_rx_put_credits(link, 1); + } return rc; } diff --git a/net/socket.c b/net/socket.c index 96860a0f93308b4ddb86a25c5bfe460ea1f49023..3c361db340c7c16f40a1015da0c907770fbd690a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1371,8 +1371,7 @@ int __sock_create(struct net *net, int family, int type, int protocol, if (!kern && (family == AF_INET || family == AF_INET6) && type == SOCK_STREAM && (protocol == IPPROTO_IP || protocol == IPPROTO_TCP) && net->smc.sysctl_tcp2smc) { - protocol = (family == AF_INET) ? SMCPROTO_SMC : SMCPROTO_SMC6; - family = AF_SMC; + protocol = IPPROTO_SMC; } #endif diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 38256aabf4f1d03b0ba3cc6883df57b36d5ca02f..9334817a699f540fd1023480328be07acc6d0c7b 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -300,9 +300,9 @@ static void tsk_rej_rx_queue(struct sock *sk, int error) tipc_sk_respond(sk, skb, error); } -static bool tipc_sk_connected(struct sock *sk) +static bool tipc_sk_connected(const struct sock *sk) { - return sk->sk_state == TIPC_ESTABLISHED; + return READ_ONCE(sk->sk_state) == TIPC_ESTABLISHED; } /* tipc_sk_type_connectionless - check if the socket is datagram socket diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index 13f3143609f9ef8e358010f5e9d0564c13a4e87f..94f1da32a4ab3462a0d49d47222a45146bf663ab 100644 --- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -176,7 +176,7 @@ static void tipc_conn_close(struct tipc_conn *con) conn_put(con); } -static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s) +static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s, struct socket *sock) { struct tipc_conn *con; int ret; @@ -202,10 +202,11 @@ static struct tipc_conn *tipc_conn_alloc(struct tipc_topsrv *s) } con->conid = ret; s->idr_in_use++; - spin_unlock_bh(&s->idr_lock); set_bit(CF_CONNECTED, &con->flags); con->server = s; + con->sock = sock; + spin_unlock_bh(&s->idr_lock); return con; } @@ -460,7 +461,7 @@ static void tipc_topsrv_accept(struct work_struct *work) ret = kernel_accept(lsock, &newsock, O_NONBLOCK); if (ret < 0) return; - con = tipc_conn_alloc(srv); + con = tipc_conn_alloc(srv, newsock); if (IS_ERR(con)) { ret = PTR_ERR(con); sock_release(newsock); @@ -472,7 +473,6 @@ static void tipc_topsrv_accept(struct work_struct *work) newsk->sk_data_ready = tipc_conn_data_ready; newsk->sk_write_space = tipc_conn_write_space; newsk->sk_user_data = con; - con->sock = newsock; write_unlock_bh(&newsk->sk_callback_lock); /* Wake up receive process in case of 'SYN+' message */ @@ -570,7 +570,7 @@ bool tipc_topsrv_kern_subscr(struct net *net, u32 port, u32 type, u32 lower, sub.filter = filter; *(u32 *)&sub.usr_handle = port; - con = tipc_conn_alloc(tipc_topsrv(net)); + con = tipc_conn_alloc(tipc_topsrv(net), NULL); if 
(IS_ERR(con)) return false; diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index bcd6f01594eec227d68110b524310a65bbdf630a..b52a3e64b3c90a1efe3931918b820ac550fb41fc 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -92,7 +92,8 @@ int wait_on_pending_writer(struct sock *sk, long *timeo) break; } - if (sk_wait_event(sk, timeo, !sk->sk_write_pending, &wait)) + if (sk_wait_event(sk, timeo, + !READ_ONCE(sk->sk_write_pending), &wait)) break; } remove_wait_queue(sk_sleep(sk), &wait); diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 28721e9575b75fba23b074fc3319028cf50ad9f5..429345f34905b12b9ed4d66211e99b072559bd07 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -111,6 +111,7 @@ #include #include #include +#include #include #include diff --git a/net/vtoa/Kconfig b/net/vtoa/Kconfig new file mode 100644 index 0000000000000000000000000000000000000000..86ebedf515827386bfe4abb31a92a060f3501d06 --- /dev/null +++ b/net/vtoa/Kconfig @@ -0,0 +1,15 @@ +# SPDX-License-Identifier: GPL-2.0-only + +config VTOA + tristate "VTOA: VPC TCP option address for AliYun" + depends on NETFILTER + depends on HOOKERS + default m + help + 1. Support classic toa, obtains the source address and port from the + specific tcp option. + + 2. Support vpc toa, obtains the source address and port, vip, vport, + and vpc tunnel id as well. + + 3. Support IPv4 and IPv6. diff --git a/net/vtoa/Makefile b/net/vtoa/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..57d59fac30158d74ec43b35d21673f45e9acb3ae --- /dev/null +++ b/net/vtoa/Makefile @@ -0,0 +1,2 @@ +obj-m = vtoa.o +vtoa-objs := vtoa_main.o vtoa_ctl.o diff --git a/net/vtoa/vtoa.h b/net/vtoa/vtoa.h new file mode 100644 index 0000000000000000000000000000000000000000..6b16ba863df299aa7a72df83c8b8441c12759c09 --- /dev/null +++ b/net/vtoa/vtoa.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __NET__TOA_H__ +#define __NET__TOA_H__ +#include +#include +#include + +#include + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +#include +#include +#endif + +#define SK_TOA_DATA(sock) ((void *)((sock)->sk_toa_data)) + +#define NIPQUAD(addr) ({\ + (uint8_t *)_a = (uint8_t *)&(addr); \ + _a[0], _a[1], _a[2], _a[3]}}) + +#ifdef TOA_DEBUG +#define TOA_DBG(msg...) pr_debug("[DEBUG] TOA: " msg) +#else +#define TOA_DBG(msg...) +#endif + +#define TOA_INFO(msg...) \ + do { \ + if (net_ratelimit()) \ + pr_info("TOA: " msg); \ + } while (0) + +#define TCPOPT_TOA 254 +#define TCPOPT_TOA_VIP 250 +#define TCPOPT_VTOA 252 +#define TCPOPT_V6VTOA 249 + +/* MUST be 4n !!!! 
*/ +/* |opcode|size|ip+port| = 1 + 1 + 6 */ +#define TCPOLEN_TOA 8 +/* |opcode|size|ip+port| = 1 + 1 + 6 */ +#define TCPOLEN_TOA_VIP 8 +/* |opcode|size|cport+cip+vid+vip+vport+pad[2]| = 1 + 1 + 16 + 2 */ +#define TCPOLEN_VTOA 20 +/* |opcode|size|cport+v6cip+vid+v6vip| = 1 + 1 + 2 + 16 + 4 + 16 */ +#define TCPOLEN_V6VTOA 40 + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +#define TCPOPT_TOA_V6 253 +/* |opcode|size|port|ipv6| = 1 + 1 + 2 + 16 */ +#define TCPOLEN_TOA_V6 20 +#endif + +/* MUST be 4 bytes alignment */ +struct toa_data { + __u8 optcode; + __u8 optsize; + __be16 port; + union { + __be32 ip; +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + struct in6_addr in6; +#endif + }; +}; + +/* MUST be 4 bytes alignment */ + +struct toa_vip_data { + __u8 optcode; + __u8 optsize; + __be16 port; + __be32 ip; +}; + +/* MUST be 4 bytes alignment */ +struct vtoa_data { + __u8 optcode; + __u8 optsize; + __be16 cport; + __be32 cip; + __be32 vid; + __be32 vip; + __be16 vport; + __u8 pad[2]; +}; + +struct v6vtoa_data { + u8 opcode; + u8 opsize; + __be16 cport; + __be32 cip[4]; + __be32 vid:24, + reserved:8; + __be32 vip[4]; +} __packed; + +/* we define this because gcc cannot take address of bit-field structure member 'vid' */ +#define OFFSETOF_VID(xptr) (offsetof(struct v6vtoa_data, cip) + sizeof(xptr->cip)) +#define SIZEOF_VID 3 +#define OFFSETOF_RESERVED(xptr) (offsetof(struct v6vtoa_data, cip) + sizeof(xptr->cip) + SIZEOF_VID) + +#define IPV6_PREFIX_4BYTES 4 +#define IPV6_PREFIX_7BYTES 7 + +#define VID_BE_UNFOLD(vni) ((vni) << 8) + +/* statistics about toa in proc /proc/net/vtoa_stat */ +enum { + SYN_RECV_SOCK_TOA_CNT = 1, + SYN_RECV_SOCK_NO_TOA_CNT, + GETNAME_TOA_OK_CNT_V4, + GETNAME_V6VTOA_OK_CNT, +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + GETNAME_TOA_OK_CNT_V6, + GETNAME_TOA_OK_CNT_MAPPED, +#endif + GETNAME_TOA_MISMATCH_CNT, + GETNAME_TOA_BYPASS_CNT, + GETNAME_TOA_EMPTY_CNT, + TOA_STAT_LAST +}; + +struct toa_stats_entry { + char *name; + int entry; +}; + +#define TOA_STAT_ITEM(_name, _entry) { \ + .name = _name, \ + .entry = _entry, \ +} + +#define TOA_STAT_END { \ + NULL, \ + 0, \ +} + +struct toa_stat_mib { + unsigned long mibs[TOA_STAT_LAST]; +}; + +#define TOA_INC_STATS(mib, field) \ + (per_cpu_ptr(mib, smp_processor_id())->mibs[field]++) + +extern int sysctl_v6vtoa_info_mode; +extern int v6vtoa_vip_prefixlen_learned; +extern struct in6_addr v6vtoa_vip_prefix; +#endif diff --git a/net/vtoa/vtoa_ctl.c b/net/vtoa/vtoa_ctl.c new file mode 100644 index 0000000000000000000000000000000000000000..b20077dcca3bc791706a7474048e979bd17bbb20 --- /dev/null +++ b/net/vtoa/vtoa_ctl.c @@ -0,0 +1,399 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2019 Alibaba Group Holding Limited. All Rights Reserved. 
*/ + +#define KMSG_COMPONENT "VTOA" +#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt +#include + +#include "vtoa.h" +#include "vtoa_ctl.h" + +/* mode 0: default mode, save: cport + cip + * mode 1: save: cport + cip + vip-4bytes + * mode 2: save: cport + cip + vid + vip-7bytes + */ +int sysctl_v6vtoa_info_mode; +static int v6vtoa_info_mode_int_min; +static int v6vtoa_info_mode_int_max = 2; + +int v6vtoa_vip_prefixlen_learned; +struct in6_addr v6vtoa_vip_prefix = IN6ADDR_ANY_INIT; + +static DEFINE_MUTEX(__v6vtoa_info_mode_mutex); +static DEFINE_MUTEX(__vtoa_mutex); + +static int sysctl_set_v6vtoa_info_mode(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int *valp = table->data, *min = table->extra1, *max = table->extra2; + int val_old, ret; + + if (mutex_lock_interruptible(&__v6vtoa_info_mode_mutex)) + return -ERESTARTSYS; + + /* backup the value first */ + val_old = *valp; + + ret = proc_dointvec(table, write, buffer, lenp, ppos); + if (write && (*valp < *min || *valp > *max)) { + /* Restore the correct value */ + *valp = val_old; + ret = -EINVAL; + goto out; + } + + if (*valp == val_old) + goto out; + + memset(&v6vtoa_vip_prefix, 0, sizeof(v6vtoa_vip_prefix)); + v6vtoa_vip_prefixlen_learned = 0; + pr_info("reset v6vtoa_vip_prifix_learned after v6vtoa_info_mode set to %d!\n", *valp); +out: + mutex_unlock(&__v6vtoa_info_mode_mutex); + return ret; +} + +/* SLB_VTOA sysctl table (under the /proc/sys/net/ipv4/slb_vtoa/) */ +static struct ctl_table vtoa_vars[] = { + { + .procname = "v6vtoa_info_mode", + .data = &sysctl_v6vtoa_info_mode, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = sysctl_set_v6vtoa_info_mode, + .extra1 = &v6vtoa_info_mode_int_min, + .extra2 = &v6vtoa_info_mode_int_max, + }, + { } +}; + +static struct ctl_table_header *sysctl_header; + +#define GET_CMDID(cmd) ((cmd) - VTOA_BASE_CTL) +#define GET_VS_ARG_LEN (sizeof(struct vtoa_get_vs)) +#define GET_VS4RDS_ARG_LEN (sizeof(struct vtoa_get_vs4rds)) +#define GET_V6VS_ARG_LEN (sizeof(struct v6vtoa_get_vs)) + +static const unsigned char get_arglen[GET_CMDID(VTOA_SO_GET_MAX) + 1] = { + [GET_CMDID(VTOA_SO_GET_VS)] = GET_VS_ARG_LEN, + [GET_CMDID(VTOA_SO_GET_VS4RDS)] = GET_VS4RDS_ARG_LEN, + [GET_CMDID(HYBRID_VTOA_SO_GET_VS)] = GET_V6VS_ARG_LEN, +}; + +static int +do_vtoa_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) +{ + int ret = 0; + struct vtoa_data *tdata = SK_TOA_DATA(sk); + struct toa_vip_data *tdata_vip = SK_TOA_DATA(sk) + TCPOLEN_TOA; + + if (*len < get_arglen[GET_CMDID(cmd)]) { + pr_err("get_ctl: len %u < %u\n", + *len, get_arglen[GET_CMDID(cmd)]); + return -EINVAL; + } + + switch (cmd) { + case VTOA_SO_GET_VS: + { + struct vtoa_get_vs vs = { {0, 0, 0} }; + + /* VPC */ + if (tdata->optcode == TCPOPT_VTOA && + tdata->optsize == TCPOLEN_VTOA) { + vs.vs.vid = tdata->vid; + vs.vs.vaddr = tdata->vip; + vs.vs.vport = tdata->vport; + /* FNAT:cip+vip */ + } else if (tdata->optcode == TCPOPT_TOA && + tdata->optsize == TCPOLEN_TOA && + tdata_vip->optcode == TCPOPT_TOA_VIP && + tdata_vip->optsize == TCPOLEN_TOA_VIP) { + vs.vs.vid = 0; + vs.vs.vaddr = tdata_vip->ip; + vs.vs.vport = tdata_vip->port; + /* FNAT:vip */ + } else if (tdata->optcode == TCPOPT_TOA_VIP && + tdata->optsize == TCPOLEN_TOA_VIP) { + tdata_vip = (void *)tdata; + vs.vs.vid = 0; + vs.vs.vaddr = tdata_vip->ip; + vs.vs.vport = tdata_vip->port; + } else { + ret = -ESRCH; + break; + } + if (copy_to_user(user, &vs, sizeof(vs))) { + pr_err("%s err: copy to user.\n", __func__); + ret = -EFAULT; + } + break; + } + 
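+	/*
+	 * HYBRID_VTOA_SO_GET_VS mirrors VTOA_SO_GET_VS but fills struct
+	 * v6vtoa_get_vs, so the caller also learns the VPC id and a possibly
+	 * IPv6 vip (vaddr_af tells which family).  For a TCPOPT_V6VTOA option
+	 * that did not fit into sk_toa_data, the vip (and the vid in mode 2)
+	 * is rebuilt from the learned prefix according to
+	 * sysctl_v6vtoa_info_mode; in mode 0 it is not available.
+	 * Userspace reaches this handler via the netfilter sockopt interface
+	 * on an accepted socket, roughly (sketch only, the optname value and
+	 * struct layout come from vtoa_ctl.h, error handling omitted):
+	 *
+	 *	struct v6vtoa_get_vs vs;
+	 *	socklen_t len = sizeof(vs);
+	 *	getsockopt(fd, IPPROTO_IP, HYBRID_VTOA_SO_GET_VS, &vs, &len);
+	 */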
case HYBRID_VTOA_SO_GET_VS: + { + struct v6vtoa_get_vs v6vs = { {0, {{0}}} }; + struct v6vtoa_data *v6vtdata = SK_TOA_DATA(sk); + + v6vs.vs.vaddr_af = AF_INET; + /* VPC */ + if (tdata->optcode == TCPOPT_VTOA && + tdata->optsize == TCPOLEN_VTOA) { + v6vs.vs.vid = tdata->vid; //in host order + v6vs.vs.vaddr.ip = tdata->vip; //in network order + v6vs.vs.vport = tdata->vport; + /* FNAT:cip+vip */ + } else if (tdata->optcode == TCPOPT_TOA && + tdata->optsize == TCPOLEN_TOA && + tdata_vip->optcode == TCPOPT_TOA_VIP && + tdata_vip->optsize == TCPOLEN_TOA_VIP) { + v6vs.vs.vid = 0; + v6vs.vs.vaddr.ip = tdata_vip->ip; + v6vs.vs.vport = tdata_vip->port; + /* FNAT:vip */ + } else if (tdata->optcode == TCPOPT_TOA_VIP && + tdata->optsize == TCPOLEN_TOA_VIP) { + tdata_vip = (void *)tdata; + v6vs.vs.vid = 0; + v6vs.vs.vaddr.ip = tdata_vip->ip; + v6vs.vs.vport = tdata_vip->port; + /* FNAT: v6vtoa */ + } else if (tdata->optcode == TCPOPT_V6VTOA && + tdata->optsize == TCPOLEN_V6VTOA) { + v6vs.vs.vid = 0xffffff; + v6vs.vs.vaddr_af = AF_INET6; + v6vs.vs.vport = 0; + + if (sizeof(sk->sk_toa_data) >= TCPOLEN_V6VTOA) { + v6vs.vs.vid = ntohl(VID_BE_UNFOLD(v6vtdata->vid)); + memcpy(&v6vs.vs.vaddr, v6vtdata->vip, + sizeof(v6vtdata->vip)); + + } else if (sysctl_v6vtoa_info_mode == 0) { + /* mode 0: default mode, save: cport + cip */ + pr_info("warning get_v6vs: vid and vip not available in mode 0.\n"); + } else if (sysctl_v6vtoa_info_mode == 1) { + memcpy(&v6vs.vs.vaddr, + &v6vtoa_vip_prefix, + IPV6_PREFIX_4BYTES); + memcpy((char *)&v6vs.vs.vaddr + IPV6_PREFIX_4BYTES, + (char *)v6vtdata + OFFSETOF_VID(v6vtdata), + sizeof(v6vtdata->vip) - IPV6_PREFIX_4BYTES); + + } else if (sysctl_v6vtoa_info_mode == 2) { + v6vs.vs.vid = ntohl(VID_BE_UNFOLD(v6vtdata->vid)); + memcpy(&v6vs.vs.vaddr, + &v6vtoa_vip_prefix, + IPV6_PREFIX_7BYTES); + memcpy((char *)&v6vs.vs.vaddr + IPV6_PREFIX_7BYTES, + (char *)v6vtdata + OFFSETOF_RESERVED(v6vtdata), + sizeof(v6vtdata->vip) - IPV6_PREFIX_7BYTES); + + } else { + pr_err("err get_v6vs: unexpected mode %d.\n", + sysctl_v6vtoa_info_mode); + ret = -EFAULT; + return ret; + } + + if (copy_to_user(user, &v6vs, sizeof(v6vs))) { + pr_err("get_v6vs err: copy to user.\n"); + ret = -EFAULT; + } + + return ret; + } + + ret = -ESRCH; + break; + } + + case VTOA_SO_GET_VS4RDS: + { + char arg[sizeof(struct vtoa_get_vs4rds) + sizeof(struct vtoa_vs)]; + struct vtoa_get_vs4rds *vs4rds = (void *)arg; + + if (*len != sizeof(struct vtoa_get_vs4rds) + sizeof(struct vtoa_vs)) { + ret = -EINVAL; + break; + } + /* VPC */ + if (tdata->optcode == TCPOPT_VTOA && tdata->optsize == TCPOLEN_VTOA) { + vs4rds->entrytable->vid = tdata->vid; + vs4rds->entrytable->vaddr = tdata->vip; + vs4rds->entrytable->vport = tdata->vport; + /* FNAT:cip+vip */ + } else if (tdata->optcode == TCPOPT_TOA && + tdata->optsize == TCPOLEN_TOA && + tdata_vip->optcode == TCPOPT_TOA_VIP && + tdata_vip->optsize == TCPOLEN_TOA_VIP) { + vs4rds->entrytable->vid = 0; + vs4rds->entrytable->vaddr = tdata_vip->ip; + vs4rds->entrytable->vport = tdata_vip->port; + /* FNAT:vip */ + } else if (tdata->optcode == TCPOPT_TOA_VIP && + tdata->optsize == TCPOLEN_TOA_VIP) { + tdata_vip = (void *)tdata; + vs4rds->entrytable->vid = 0; + vs4rds->entrytable->vaddr = tdata_vip->ip; + vs4rds->entrytable->vport = tdata_vip->port; + } else { + ret = -ESRCH; + break; + } + if (copy_to_user(((struct vtoa_get_vs4rds *)user)->entrytable, + vs4rds->entrytable, sizeof(struct vtoa_vs))) { + pr_err("%s err: copy to user.\n", __func__); + ret = -EFAULT; + } + break; + } + default: + ret 
= -EINVAL; + } + + return ret; +} + +static int +do_v6vtoa_get_ctl(struct sock *sk, int cmd, void __user *user, int *len) +{ + struct v6vtoa_data *v6vtdata = SK_TOA_DATA(sk); + struct vtoa_data *tdata = SK_TOA_DATA(sk); + struct v6vtoa_get_vs v6vs = { {0, {{0}}} }; + int ret = 0; + + if (*len < get_arglen[GET_CMDID(cmd)]) { + pr_err("get_ctl: len %u < %u\n", + *len, get_arglen[GET_CMDID(cmd)]); + return -EINVAL; + } + + if (cmd != HYBRID_VTOA_SO_GET_VS) + return -EINVAL; + + if (tdata->optcode != TCPOPT_V6VTOA || tdata->optsize != TCPOLEN_V6VTOA) + return -ESRCH; + + /* default vid is invalid, in cpu order */ + v6vs.vs.vid = 0xffffff; + v6vs.vs.vaddr_af = AF_INET6; + v6vs.vs.vport = 0; + + if (sizeof(sk->sk_toa_data) >= TCPOLEN_V6VTOA) { + v6vs.vs.vid = ntohl(VID_BE_UNFOLD(v6vtdata->vid)); + memcpy(&v6vs.vs.vaddr, v6vtdata->vip, sizeof(v6vtdata->vip)); + + } else if (sysctl_v6vtoa_info_mode == 0) { + /* mode 0: default mode, save: cport + cip */ + pr_info("warning get_v6vs: vid and vip not available in mode 0.\n"); + + } else if (sysctl_v6vtoa_info_mode == 1) { + memcpy(&v6vs.vs.vaddr, &v6vtoa_vip_prefix, IPV6_PREFIX_4BYTES); + memcpy((char *)&v6vs.vs.vaddr + IPV6_PREFIX_4BYTES, + (char *)v6vtdata + OFFSETOF_VID(v6vtdata), + sizeof(v6vtdata->vip) - IPV6_PREFIX_4BYTES); + + } else if (sysctl_v6vtoa_info_mode == 2) { + v6vs.vs.vid = ntohl(VID_BE_UNFOLD(v6vtdata->vid)); + memcpy(&v6vs.vs.vaddr, &v6vtoa_vip_prefix, IPV6_PREFIX_7BYTES); + memcpy((char *)&v6vs.vs.vaddr + IPV6_PREFIX_7BYTES, + (char *)v6vtdata + OFFSETOF_RESERVED(v6vtdata), + sizeof(v6vtdata->vip) - IPV6_PREFIX_7BYTES); + + } else { + pr_err("err get_v6vs: unexpected mode %d.\n", sysctl_v6vtoa_info_mode); + ret = -EFAULT; + return ret; + } + + if (copy_to_user(user, &v6vs, sizeof(v6vs))) { + pr_err("get_v6vs err: copy to user.\n"); + ret = -EFAULT; + } + + return ret; +} + +static struct nf_sockopt_ops vtoa_sockopts = { + .pf = PF_INET, + .get_optmin = VTOA_BASE_CTL, + .get_optmax = VTOA_SO_GET_MAX + 1, + .get = do_vtoa_get_ctl, + .owner = THIS_MODULE, +}; + +static struct nf_sockopt_ops v6vtoa_sockopts = { + .pf = PF_INET6, + .get_optmin = VTOA_BASE_CTL, + .get_optmax = VTOA_SO_GET_MAX + 1, + .get = do_v6vtoa_get_ctl, + .owner = THIS_MODULE, +}; + +static int vtoa_init_sysctl(struct net *net) +{ + struct ctl_table *table; + + table = kmemdup(vtoa_vars, sizeof(vtoa_vars), GFP_KERNEL); + if (!table) + goto out; + + sysctl_header = register_net_sysctl(net, "net/ipv4/slb_vtoa", table); + if (!sysctl_header) { + pr_err("can't register to sysctl.\n"); + goto out_register; + } + return 0; + +out_register: + kfree(table); +out: + return -ENOMEM; +} + +static void vtoa_cleanup_sysctl(struct net *net) +{ + struct ctl_table *table; + + if (sysctl_header) { + table = sysctl_header->ctl_table_arg; + unregister_net_sysctl_table(sysctl_header); + + kfree(table); + } +} + +int __init vtoa_ctl_init(void) +{ + int ret; + + ret = nf_register_sockopt(&vtoa_sockopts); + if (ret < 0) { + pr_err("cannot register vtoa_sockopts.\n"); + goto register_sockopt_fail; + } + + ret = nf_register_sockopt(&v6vtoa_sockopts); + if (ret < 0) + pr_err("cannot register ipv6 vtoa_sockopts.\n"); + + ret = vtoa_init_sysctl(&init_net); + if (ret < 0) + goto register_sysctl_fail; + + pr_info("vtoa init finish.\n"); + return 0; + +register_sysctl_fail: + nf_unregister_sockopt(&v6vtoa_sockopts); + nf_unregister_sockopt(&vtoa_sockopts); +register_sockopt_fail: + return ret; +} + +void __exit vtoa_ctl_cleanup(void) +{ + vtoa_cleanup_sysctl(&init_net); + 
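+	/* sockopt handlers go last, mirroring vtoa_ctl_init() in reverse */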
nf_unregister_sockopt(&v6vtoa_sockopts); + nf_unregister_sockopt(&vtoa_sockopts); +} diff --git a/net/vtoa/vtoa_ctl.h b/net/vtoa/vtoa_ctl.h new file mode 100644 index 0000000000000000000000000000000000000000..186b54bffcc2bce0703d40c1b9ea92a529e83dc5 --- /dev/null +++ b/net/vtoa/vtoa_ctl.h @@ -0,0 +1,80 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef VTOA_CTL_H_INCLUDE +#define VTOA_CTL_H_INCLUDE + +#include "vtoa.h" + +union vtoa_ipaddr { + u32 all[4]; + u32 ip; + u32 ip6[4]; + struct in_addr in; + struct in6_addr in6; +}; + +struct v6vtoa_vs { + /* VPC ID */ + __u32 vid; + /* vip */ + union vtoa_ipaddr vaddr; + __u16 vaddr_af; + /* vport */ + __be16 vport; +}; + +struct v6vtoa_get_vs { + struct v6vtoa_vs vs; +}; + +struct v6vtoa_get_vs4rds { + /* which connection*/ + __u16 protocol; + /* client address */ + union vtoa_ipaddr caddr; + __be16 cport; + /* destination address */ + union vtoa_ipaddr daddr; + __be16 dport; + /* the virtual servers */ + struct v6vtoa_vs entrytable[0]; +}; + +struct vtoa_vs { + /* VPC ID */ + __u32 vid; + /* vip */ + __be32 vaddr; + /* vport */ + __be16 vport; +}; + +struct vtoa_get_vs { + struct vtoa_vs vs; +}; + +struct vtoa_get_vs4rds { + /* which connection*/ + __u16 protocol; + /* client address */ + __be32 caddr; + __be16 cport; + /* destination address */ + __be32 daddr; + __be16 dport; + + /* the virtual servers */ + struct vtoa_vs entrytable[0]; +}; + +#define VTOA_BASE_CTL (64 + 1024 + 64 + 64 + 64 + 64) + +#define VTOA_SO_GET_VS (VTOA_BASE_CTL + 1) +#define VTOA_SO_GET_VS4RDS (VTOA_BASE_CTL + 2) +#define HYBRID_VTOA_SO_GET_VS (VTOA_BASE_CTL + 3) +#define VTOA_SO_GET_MAX (HYBRID_VTOA_SO_GET_VS) + +int vtoa_ctl_init(void); +void vtoa_ctl_cleanup(void); + +#endif diff --git a/net/vtoa/vtoa_main.c b/net/vtoa/vtoa_main.c new file mode 100644 index 0000000000000000000000000000000000000000..d6e5476cd9a39825a535eda0cc0034d31b8287d5 --- /dev/null +++ b/net/vtoa/vtoa_main.c @@ -0,0 +1,568 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2019 Alibaba Group Holding Limited. All Rights Reserved. */ + +#define KMSG_COMPONENT "VTOA" +#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt + +#include "vtoa.h" +#include "vtoa_ctl.h" + +/* Statistics of toa in proc /proc/net/vtoa_stats */ +struct toa_stats_entry toa_stats[] = { + TOA_STAT_ITEM("syn_recv_sock_toa", SYN_RECV_SOCK_TOA_CNT), + TOA_STAT_ITEM("syn_recv_sock_no_toa", SYN_RECV_SOCK_NO_TOA_CNT), + TOA_STAT_ITEM("getname_toa_ok_v4", GETNAME_TOA_OK_CNT_V4), + TOA_STAT_ITEM("getname_v6vtoa_ok", GETNAME_V6VTOA_OK_CNT), +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + TOA_STAT_ITEM("getname_toa_ok_v6", GETNAME_TOA_OK_CNT_V6), + TOA_STAT_ITEM("getname_toa_ok_mapped", GETNAME_TOA_OK_CNT_MAPPED), +#endif + TOA_STAT_ITEM("getname_toa_mismatch", GETNAME_TOA_MISMATCH_CNT), + TOA_STAT_ITEM("getname_toa_bypass", GETNAME_TOA_BYPASS_CNT), + TOA_STAT_ITEM("getname_toa_empty", GETNAME_TOA_EMPTY_CNT), + TOA_STAT_END +}; + +struct toa_stat_mib *ext_stats; + +/* Parse TCP options in skb, try to get client ip, port + * @param skb [in] received skb, it should be a ack/get-ack packet. + * @return NULL if we don't get client ip/port; + * value of toa_data in ret_ptr if we get client ip/port. 
+ */ +static int get_toa_data(struct sk_buff *skb, void *sk_toa_data, int sk_toa_datalen) +{ + struct tcphdr *th; + int length; + unsigned char *ptr; + + if (!skb) + return 0; + + th = tcp_hdr(skb); + length = (th->doff * 4) - sizeof(struct tcphdr); + ptr = (unsigned char *)(th + 1); + + while (length > 0) { + int opcode = *ptr++; + int opsize; + + switch (opcode) { + case TCPOPT_EOL: + return 0; + + /* Ref: RFC 793 section 3.1 */ + case TCPOPT_NOP: + length--; + continue; + } + + opsize = *ptr++; + + /* "silly options" */ + if (opsize < 2) + return 0; + + /* don't parse partial options */ + if (opsize > length) + return 0; + + if ((opcode == TCPOPT_TOA && opsize == TCPOLEN_TOA)) { + struct toa_data *tdata; + struct toa_vip_data *tdata_vip; + + memset(sk_toa_data, 0, sizeof(struct toa_data)); + memcpy(sk_toa_data, ptr - 2, TCPOLEN_TOA); + tdata = sk_toa_data; + + TOA_DBG("find toa data: ip = %u.%u.%u.%u, port = %u\n", + NIPQUAD(tdata->ip), + ntohs(tdata->port)); + + /* TOA_VIP: vip parse */ + length -= opsize; + ptr += (opsize - 2); + + opcode = *ptr++; + opsize = *ptr++; + + /* "silly options" */ + if (opsize < 2) + return 0; + + /* don't parse partial options */ + if (opsize > length) + return 0; + + if (TCPOPT_TOA_VIP == opcode && TCPOLEN_TOA_VIP == opsize) { + sk_toa_data += TCPOLEN_TOA; + memset(sk_toa_data, 0, sizeof(struct toa_vip_data)); + memcpy(sk_toa_data, ptr - 2, TCPOLEN_TOA_VIP); + tdata_vip = sk_toa_data; + + TOA_DBG("find toa data: ip = %u.%u.%u.%u, port = %u\n", + NIPQUAD(tdata_vip->ip), ntohs(tdata_vip->port)); + } + + return 1; + + } else if (opcode == TCPOPT_TOA_VIP && opsize == TCPOLEN_TOA_VIP) { + struct toa_vip_data *tdata; + + memset(sk_toa_data, 0, sizeof(struct toa_vip_data)); + memcpy(sk_toa_data, ptr - 2, TCPOLEN_TOA_VIP); + tdata = sk_toa_data; + + TOA_DBG("find toa data: ip = %u.%u.%u.%u, port = %u\n", + NIPQUAD(tdata->ip), ntohs(tdata->port)); + return 1; + + } else if (opcode == TCPOPT_VTOA && opsize == TCPOLEN_VTOA) { + struct vtoa_data *vtdata; + + memset(sk_toa_data, 0, sizeof(struct vtoa_data)); + memcpy(sk_toa_data, ptr - 2, TCPOLEN_VTOA); + + vtdata = sk_toa_data; + + TOA_DBG("find vtoa data: cip:cport->vid:vip:vport\n" + "%u.%u.%u.%u:%u->%u:%u.%u.%u.%u:%u\n", + NIPQUAD(vtdata->cip), + ntohs(vtdata->cport), + vtdata->vid, + NIPQUAD(vtdata->vip), + ntohs(vtdata->vport) + ); + return 1; + + } else if (opcode == TCPOPT_V6VTOA && opsize == TCPOLEN_V6VTOA) { +#ifdef TOA_DEBUG + struct in6_addr dbg_v6vip = IN6ADDR_ANY_INIT; + struct v6vtoa_data *saved; + __be32 dbg_vid = 0; +#endif + struct v6vtoa_data *v6vtdata = (struct v6vtoa_data *)(ptr - 2); + + memset(sk_toa_data, 0, sk_toa_datalen); + + if (sk_toa_datalen >= TCPOLEN_V6VTOA) { + memcpy(sk_toa_data, v6vtdata, TCPOLEN_V6VTOA); + + } else if (sk_toa_datalen == 32) { + memcpy(sk_toa_data, v6vtdata, OFFSETOF_VID(v6vtdata)); + + if (sysctl_v6vtoa_info_mode == 0) { + /* mode 0: default mode, save: cport + + * cip, do nothing + */ + } else if (sysctl_v6vtoa_info_mode == 1) { + /* mode 1: save: cport + cip + vip, + * learn vip prefix-length 4bytes + */ + memcpy((char *)sk_toa_data + OFFSETOF_VID(v6vtdata), + (char *)v6vtdata->vip + IPV6_PREFIX_4BYTES, + sk_toa_datalen - OFFSETOF_VID(v6vtdata)); //12 bytes + if (v6vtoa_vip_prefixlen_learned == 0) { + memcpy(&v6vtoa_vip_prefix, (char *)v6vtdata->vip, + IPV6_PREFIX_4BYTES); + v6vtoa_vip_prefixlen_learned = IPV6_PREFIX_4BYTES; + + TOA_INFO("v6vtoa origin data: cip:cport->vid:vip " + "[%pI6]:%u -> %u:[%pI6]\n" + "saved: [%pI6]/%d\n", + (struct in6_addr 
*)(v6vtdata->cip), + ntohs(v6vtdata->cport), + ntohl(VID_BE_UNFOLD(v6vtdata->vid)), + (struct in6_addr *)(v6vtdata->vip), + &v6vtoa_vip_prefix, + v6vtoa_vip_prefixlen_learned); + } + } else if (sysctl_v6vtoa_info_mode == 2) { + /* mode 2: save: cport + cip + vid + vip, + * learn vip prefix-length 7bytes + * network order vid 1193046 in + * memory(low->high address): 0x12 0x34 + * 0x56 0x00 + */ + memcpy((char *)sk_toa_data + OFFSETOF_VID(v6vtdata), + (char *)v6vtdata + OFFSETOF_VID(v6vtdata), + SIZEOF_VID); + memcpy((char *)sk_toa_data + OFFSETOF_RESERVED(v6vtdata), + (char *)v6vtdata->vip + IPV6_PREFIX_7BYTES, + sk_toa_datalen - OFFSETOF_RESERVED(v6vtdata)); + if (v6vtoa_vip_prefixlen_learned == 0) { + memcpy(&v6vtoa_vip_prefix, (char *)v6vtdata->vip, + IPV6_PREFIX_7BYTES); + v6vtoa_vip_prefixlen_learned = IPV6_PREFIX_7BYTES; + + TOA_INFO("v6vtoa origin data: cip:cport->vid:vip " + "[%pI6]:%u -> %u:[%pI6]\n" + "saved: [%pI6]/%d\n", + (struct in6_addr *)(v6vtdata->cip), + ntohs(v6vtdata->cport), + ntohl(VID_BE_UNFOLD(v6vtdata->vid)), + (struct in6_addr *)(v6vtdata->vip), + &v6vtoa_vip_prefix, + v6vtoa_vip_prefixlen_learned); + } + } + } +#ifdef TOA_DEBUG + saved = (struct v6vtoa_data *)sk_toa_data; + if (sysctl_v6vtoa_info_mode == 1) { + memcpy(&dbg_v6vip, &v6vtoa_vip_prefix, 4); + memcpy((char *)&dbg_v6vip + 4, (char *)&saved->vip - 4, + sizeof(v6vtdata->vip) - 4); + } else if (sysctl_v6vtoa_info_mode == 2) { + memcpy((char *)&dbg_vid + 1, (char *)&saved->vip - 4, 3); + memcpy(&dbg_v6vip, &v6vtoa_vip_prefix, 7); + memcpy((char *)&dbg_v6vip + 7, (char *)&saved->vip + 3 - 4, + sizeof(v6vtdata->vip) - 7); + } +#endif + TOA_DBG("v6vtoa origin data: cip:cport->vid:vip [%pI6]:%u -> %u:[%pI6]\n" + "saved: [%pI6]:%u -> %u:[%pI6]\n", + (struct in6_addr *)(v6vtdata->cip), ntohs(v6vtdata->cport), + ntohl(VID_BE_UNFOLD(v6vtdata->vid)), + (struct in6_addr *)(v6vtdata->vip), + (struct in6_addr *)(saved->cip), ntohs(saved->cport), + ntohl(dbg_vid), &dbg_v6vip); + + return 1; + } +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + else if (opcode == TCPOPT_TOA_V6 && opsize == TCPOLEN_TOA_V6) { + struct toa_data *tdata; + + memset(sk_toa_data, 0, sizeof(struct toa_data)); + memcpy(sk_toa_data, ptr - 2, TCPOLEN_TOA_V6); + tdata = (struct toa_data *)sk_toa_data; + TOA_DBG("find toa data: ipv6 = %pI6, port = %u\n", + &tdata->in6, ntohs(tdata->port)); + return 1; + } +#endif + ptr += opsize - 2; + length -= opsize; + } + return 0; +} + +/* get client ip from socket + * @param sock [in] the socket to getpeername() or getsockname() + * @param uaddr [out] the place to put client ip, port + * @param uaddr_len [out] length of @uaddr + * @peer [in] if(peer), try to get remote address; if(!peer), + * try to get local address + * @return: return what the original inet_getname() returns. 
+ */ +static int inet_getname_toa(struct socket *sock, struct sockaddr *uaddr, + int peer, int *p_retval) +{ + int retval = *p_retval; + struct sock *sk = sock->sk; + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + u8 *option = SK_TOA_DATA(sk); + + if (retval < 0 || !peer) { + TOA_INC_STATS(ext_stats, GETNAME_TOA_EMPTY_CNT); + return retval; + } + + if (TCPOPT_TOA == option[0] && TCPOLEN_TOA == option[1]) { + struct toa_data *tdata = SK_TOA_DATA(sk); + + TOA_INC_STATS(ext_stats, GETNAME_TOA_OK_CNT_V4); + sin->sin_port = tdata->port; + sin->sin_addr.s_addr = tdata->ip; + TOA_DBG("%s: set new sockaddr, ip %u.%u.%u.%u -> %u.%u.%u.%u, port %u -> %u\n", + __func__, NIPQUAD(sin->sin_addr.s_addr), + NIPQUAD(tdata->ip), ntohs(sin->sin_port), + ntohs(tdata->port)); + + } else if (TCPOPT_VTOA == option[0] && TCPOLEN_VTOA == option[1]) { + struct vtoa_data *vtdata = SK_TOA_DATA(sk); + + TOA_INC_STATS(ext_stats, GETNAME_TOA_OK_CNT_V4); + sin->sin_port = vtdata->cport; + sin->sin_addr.s_addr = vtdata->cip; + TOA_DBG("%s: set new sockaddr, ip %u.%u.%u.%u -> %u.%u.%u.%u, port %u -> %u\n", + __func__, NIPQUAD(sin->sin_addr.s_addr), + NIPQUAD(vtdata->cip), ntohs(sin->sin_port), + ntohs(vtdata->cport)); + + } else if (TCPOPT_V6VTOA == option[0] && TCPOLEN_V6VTOA == option[1]) { + struct v6vtoa_data *v6vtdata = SK_TOA_DATA(sk); + struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)uaddr; + + TOA_INC_STATS(ext_stats, GETNAME_V6VTOA_OK_CNT); + *p_retval = sizeof(*sin6); //must update *p_retval + retval = *p_retval; + sin6->sin6_family = AF_INET6; //hack to AF_INET6 + sin6->sin6_port = v6vtdata->cport; + sin6->sin6_flowinfo = 0; + sin6->sin6_scope_id = 0; + //memcpy(&sin6->sin6_addr, v6vtdata->cip, sizeof(v6vtdata->cip)); + ipv6_addr_set(&sin6->sin6_addr, v6vtdata->cip[0], v6vtdata->cip[1], + v6vtdata->cip[2], v6vtdata->cip[3]); + TOA_DBG("%s: af: %d, cip [%pI6]:%u\n", sin6->sin6_family, + __func__, &sin6->sin6_addr, ntohs(sin6->sin6_port)); + + } else { /* doesn't belong to us */ +#ifdef TOA_DEBUG + struct toa_data *tdata = SK_TOA_DATA(sk); +#endif + + TOA_INC_STATS(ext_stats, GETNAME_TOA_MISMATCH_CNT); + TOA_DBG("%s: invalid toa data, ip %u.%u.%u.%u port %u opcode %u opsize %u\n", + __func__, NIPQUAD(tdata->ip), ntohs(tdata->port), + tdata->optcode, tdata->optsize); + } + + TOA_DBG("%s called, retval: %d, peer: %d\n", __func__, retval, peer); + return retval; +} + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +static int inet6_getname_toa(struct socket *sock, struct sockaddr *uaddr, + int peer, int *p_retval) +{ + struct sockaddr_in6 *sin = (struct sockaddr_in6 *)uaddr; + struct sock *sk = sock->sk; + int retval = *p_retval; + u8 *option = SK_TOA_DATA(sk); + + if (retval < 0 || !peer) { + TOA_INC_STATS(ext_stats, GETNAME_TOA_EMPTY_CNT); + return retval; + } + + /* set our value if need */ + if (TCPOPT_TOA_V6 == option[0] && TCPOLEN_TOA_V6 == option[1]) { + struct toa_data *tdata = SK_TOA_DATA(sk); + + TOA_INC_STATS(ext_stats, GETNAME_TOA_OK_CNT_V6); + sin->sin6_port = tdata->port; + sin->sin6_addr = tdata->in6; + TOA_DBG("%s: ipv6 = %pI6, port = %u\n", + __func__, &sin->sin6_addr, ntohs(sin->sin6_port)); + + } else if (TCPOPT_TOA == option[0] && TCPOLEN_TOA == option[1]) { + struct toa_data *tdata = SK_TOA_DATA(sk); + + TOA_INC_STATS(ext_stats, GETNAME_TOA_OK_CNT_MAPPED); + sin->sin6_port = tdata->port; + ipv6_addr_set(&sin->sin6_addr, 0, 0, + htonl(0x0000FFFF), tdata->ip); + TOA_DBG("%s: ipv6_mapped = %pI6, port = %u\n", + __func__, &sin->sin6_addr, ntohs(sin->sin6_port)); + + } else if 
(TCPOPT_V6VTOA == option[0] && TCPOLEN_V6VTOA == option[1]) { + struct v6vtoa_data *v6vtdata = SK_TOA_DATA(sk); + struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)uaddr; + + TOA_INC_STATS(ext_stats, GETNAME_V6VTOA_OK_CNT); + sin6->sin6_port = v6vtdata->cport; + memcpy(&sin6->sin6_addr, v6vtdata->cip, sizeof(v6vtdata->cip)); + TOA_DBG("%s: cip [%pI6]:%u -> vid:vip %u:[%pI6]\n", + __func__, &sin6->sin6_addr, sin6->sin6_port, + ntohl(VID_BE_UNFOLD(v6vtdata->vid)), v6vtdata->vip); + + } else { /* doesn't belong to us */ + TOA_INC_STATS(ext_stats, GETNAME_TOA_MISMATCH_CNT); + } + + TOA_DBG("inet_getname_toa called, retval: %d, peer: %d\n", retval, peer); + return retval; +} +#endif + +/* The three way handshake has completed - we got a valid synack - + * now create the new socket. + * We need to save toa data into the new socket. + * @param sk [out] the socket + * @param skb [in] the ack/ack-get packet + * @param req [in] the open request for this connection + * @param dst [out] route cache entry + * @return NULL if fail new socket if succeed. + */ +static struct sock * +tcp_v4_syn_recv_sock_toa(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, struct dst_entry *dst, + struct request_sock *req_unhash, bool *own_req, + struct sock **p_newsock) +{ + struct sock *newsock = *p_newsock; + + if (!sk || !skb) + return NULL; + + /* set our value if need */ + if (newsock) { + if (get_toa_data(skb, newsock->sk_toa_data, sizeof(newsock->sk_toa_data))) + TOA_INC_STATS(ext_stats, SYN_RECV_SOCK_TOA_CNT); + else + TOA_INC_STATS(ext_stats, SYN_RECV_SOCK_NO_TOA_CNT); + } + return newsock; +} + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +static struct sock * +tcp_v6_syn_recv_sock_toa(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, struct dst_entry *dst, + struct request_sock *req_unhash, bool *own_req, + struct sock **p_newsock) +{ + struct sock *newsock = *p_newsock; + + if (!sk || !skb) + return NULL; + + /* set our value if need */ + if (newsock) { + if (get_toa_data(skb, newsock->sk_toa_data, sizeof(newsock->sk_toa_data))) + TOA_INC_STATS(ext_stats, SYN_RECV_SOCK_TOA_CNT); + else + TOA_INC_STATS(ext_stats, SYN_RECV_SOCK_NO_TOA_CNT); + } + return newsock; +} +#endif + +static struct hooker inet_getname_hooker = { + .func = inet_getname_toa, +}; + +static struct hooker inet_tcp_hooker = { + .func = tcp_v4_syn_recv_sock_toa, +}; + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +static struct hooker inet6_getname_hooker = { + .func = inet6_getname_toa, +}; + +static struct hooker inet6_tcp_hooker = { + .func = tcp_v6_syn_recv_sock_toa, +}; +#endif + +extern const struct inet_connection_sock_af_ops ipv6_specific; + +/* replace the functions with our functions */ +static inline int +hook_toa_functions(void) +{ + int ret; + + ret = hooker_install(&inet_stream_ops.getname, &inet_getname_hooker); + ret |= hooker_install(&ipv4_specific.syn_recv_sock, &inet_tcp_hooker); + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + ret |= hooker_install(&inet6_stream_ops.getname, &inet6_getname_hooker); + ret |= hooker_install(&ipv6_specific.syn_recv_sock, &inet6_tcp_hooker); +#endif + return ret; +} + +/* replace the functions to original ones */ +static void +unhook_toa_functions(void) +{ + hooker_uninstall(&inet_getname_hooker); + hooker_uninstall(&inet_tcp_hooker); + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + hooker_uninstall(&inet6_getname_hooker); + hooker_uninstall(&inet6_tcp_hooker); +#endif +} + +/* Statistics of toa in proc 
/proc/net/vtoa_stats */ +static int toa_stats_show(struct seq_file *seq, void *v) +{ + int i, j, cpu_nr; + + /* print CPU first */ + seq_puts(seq, " "); + cpu_nr = num_possible_cpus(); + for (i = 0; i < cpu_nr; i++) + if (cpu_online(i)) + seq_printf(seq, "CPU%d ", i); + seq_putc(seq, '\n'); + + i = 0; + while (toa_stats[i].name) { + seq_printf(seq, "%-25s:", toa_stats[i].name); + for (j = 0; j < cpu_nr; j++) { + if (cpu_online(j)) + seq_printf(seq, "%10lu ", + *(((unsigned long *)per_cpu_ptr(ext_stats, j)) + + toa_stats[i].entry)); + } + seq_putc(seq, '\n'); + i++; + } + return 0; +} + +static int toa_stats_seq_open(struct inode *inode, struct file *file) +{ + return single_open(file, toa_stats_show, NULL); +} + +static const struct proc_ops toa_stats_fops = { + .proc_open = toa_stats_seq_open, + .proc_read = seq_read, + .proc_lseek = seq_lseek, + .proc_release = single_release, +}; + +/* module init */ +static int __init +toa_init(void) +{ + /* alloc statistics array for toa */ + ext_stats = alloc_percpu(struct toa_stat_mib); + if (!ext_stats) + return -ENOMEM; + + if (!proc_create("vtoa_stats", 0, init_net.proc_net, &toa_stats_fops)) { + TOA_INFO("cannot create procfs /proc/net/vtoa_stats.\n"); + goto err_percpu; + } + + /* hook funcs for parse and get toa */ + if (hook_toa_functions()) + goto err_proc; + + if (vtoa_ctl_init() < 0) { + TOA_INFO("vtoa_ctl_init() failed\n"); + goto err_ctl; + } + + return 0; +err_ctl: + unhook_toa_functions(); +err_proc: + remove_proc_entry("vtoa_stats", init_net.proc_net); +err_percpu: + free_percpu(ext_stats); + return -ENODEV; +} + +/* module cleanup*/ +static void __exit +toa_exit(void) +{ + vtoa_ctl_cleanup(); + + unhook_toa_functions(); + remove_proc_entry("vtoa_stats", init_net.proc_net); + free_percpu(ext_stats); +} + +module_init(toa_init); +module_exit(toa_exit); +MODULE_LICENSE("GPL"); diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py index bc87200f9c7cf4c15ddb89c56d1ec6fafd05f157..2288192967fac899305df26147f7b76833742d88 100755 --- a/scripts/spdxcheck.py +++ b/scripts/spdxcheck.py @@ -200,7 +200,7 @@ class id_parser(object): tok = pe.tok.value sys.stdout.write('%s: %d:%d %s: %s\n' %(fname, self.curline, col, pe.txt, tok)) else: - sys.stdout.write('%s: %d:0 %s\n' %(fname, self.curline, col, pe.txt)) + sys.stdout.write('%s: %d:0 %s\n' %(fname, self.curline, pe.txt)) self.spdx_errors += 1 def scan_git_tree(tree): diff --git a/tools/include/uapi/linux/in.h b/tools/include/uapi/linux/in.h index d1b327036ae43686d66019c18533ed1091ccc65b..698d289b6bac51c85dc3515ac0032e931fd473d4 100644 --- a/tools/include/uapi/linux/in.h +++ b/tools/include/uapi/linux/in.h @@ -80,6 +80,8 @@ enum { #define IPPROTO_RAW IPPROTO_RAW IPPROTO_MPTCP = 262, /* Multipath TCP connection */ #define IPPROTO_MPTCP IPPROTO_MPTCP + IPPROTO_SMC = 263, /* Shared Memory Communications */ +#define IPPROTO_SMC IPPROTO_SMC IPPROTO_MAX }; #endif diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index 923dc1c55805f6f37de847d05576c0f6adf2869b..2e554078943482398a538b7fdaa6e47244f56d13 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -1095,6 +1095,21 @@ enum perf_event_type { */ PERF_RECORD_TEXT_POKE = 20, + /* + * Data written to the AUX area by hardware due to aux_output, may need + * to be matched to the event by an architecture-specific hardware ID. + * This records the hardware ID, but requires sample_id to provide the + * event ID. e.g. 
Intel PT uses this record to disambiguate PEBS-via-PT + * records from multiple events. + * + * struct { + * struct perf_event_header header; + * u64 hw_id; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_AUX_OUTPUT_HW_ID = 21, + PERF_RECORD_MAX, /* non-ABI */ }; diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h index 4a24b855d3ce2a5ff22c0d83724aef1f4d5f74f4..4b1711d2e46d5439fbbe07afb4a07775413d52e6 100644 --- a/tools/lib/perf/include/perf/event.h +++ b/tools/lib/perf/include/perf/event.h @@ -279,6 +279,11 @@ struct perf_record_itrace_start { __u32 tid; }; +struct perf_record_aux_output_hw_id { + struct perf_event_header header; + __u64 hw_id; +}; + struct perf_record_thread_map_entry { __u64 pid; char comm[16]; @@ -404,6 +409,7 @@ union perf_event { struct perf_record_auxtrace_error auxtrace_error; struct perf_record_aux aux; struct perf_record_itrace_start itrace_start; + struct perf_record_aux_output_hw_id aux_output_hw_id; struct perf_record_switch context_switch; struct perf_record_thread_map thread_map; struct perf_record_cpu_map cpu_map; diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c index b59c9fe08c1dcb2bfe9bc47ac4298db31efb50ef..87cd69ac6fa9e60d6180c34a85e73777602fb193 100644 --- a/tools/perf/builtin-inject.c +++ b/tools/perf/builtin-inject.c @@ -865,7 +865,8 @@ static int __cmd_inject(struct perf_inject *inject) inject->tool.auxtrace_info = perf_event__process_auxtrace_info; inject->tool.auxtrace = perf_event__process_auxtrace; inject->tool.aux = perf_event__drop_aux; - inject->tool.itrace_start = perf_event__drop_aux, + inject->tool.itrace_start = perf_event__drop_aux; + inject->tool.aux_output_hw_id = perf_event__drop_aux; inject->tool.ordered_events = true; inject->tool.ordering_requires_timestamps = true; /* Allow space in the header for new attributes */ @@ -937,6 +938,7 @@ int cmd_inject(int argc, const char **argv) .lost_samples = perf_event__repipe, .aux = perf_event__repipe, .itrace_start = perf_event__repipe, + .aux_output_hw_id = perf_event__repipe, .context_switch = perf_event__repipe, .throttle = perf_event__repipe, .unthrottle = perf_event__repipe, diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index 2c7182ad4c1465431ffbdde18b4a682481e555e0..08f1c2aeeedb24d661e8ca69b031149233962751 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1429,7 +1429,7 @@ static int record__synthesize(struct record *rec, bool tail) goto out; /* Synthesize id_index before auxtrace_info */ - if (rec->opts.auxtrace_sample_mode) { + if (rec->opts.auxtrace_sample_mode || rec->opts.full_auxtrace) { err = perf_event__synthesize_id_index(tool, process_synthesized_event, session->evlist, machine); diff --git a/tools/perf/scripts/python/flamegraph.py b/tools/perf/scripts/python/flamegraph.py index 65780013f74573e045cdb499e38f24d39c02baff..b6af1dd5f816e47c8a904cc9aca7d39894c81348 100755 --- a/tools/perf/scripts/python/flamegraph.py +++ b/tools/perf/scripts/python/flamegraph.py @@ -13,6 +13,10 @@ # Written by Andreas Gerstmayr # Flame Graphs invented by Brendan Gregg # Works in tandem with d3-flame-graph by Martin Spier +# +# pylint: disable=missing-module-docstring +# pylint: disable=missing-class-docstring +# pylint: disable=missing-function-docstring from __future__ import print_function import sys @@ -20,16 +24,19 @@ import os import io import argparse import json +import subprocess - +# pylint: disable=too-few-public-methods class Node: - def __init__(self, name, 
libtype=""): + def __init__(self, name, libtype): self.name = name + # "root" | "kernel" | "" + # "" indicates user space self.libtype = libtype self.value = 0 self.children = [] - def toJSON(self): + def to_json(self): return { "n": self.name, "l": self.libtype, @@ -41,7 +48,7 @@ class Node: class FlameGraphCLI: def __init__(self, args): self.args = args - self.stack = Node("root") + self.stack = Node("all", "root") if self.args.format == "html" and \ not os.path.isfile(self.args.template): @@ -53,13 +60,21 @@ class FlameGraphCLI: file=sys.stderr) sys.exit(1) - def find_or_create_node(self, node, name, dso): - libtype = "kernel" if dso == "[kernel.kallsyms]" else "" - if name is None: - name = "[unknown]" + @staticmethod + def get_libtype_from_dso(dso): + """ + when kernel-debuginfo is installed, + dso points to /usr/lib/debug/lib/modules/*/vmlinux + """ + if dso and (dso == "[kernel.kallsyms]" or dso.endswith("/vmlinux")): + return "kernel" + return "" + + @staticmethod + def find_or_create_node(node, name, libtype): for child in node.children: - if child.name == name and child.libtype == libtype: + if child.name == name: return child child = Node(name, libtype) @@ -67,30 +82,65 @@ class FlameGraphCLI: return child def process_event(self, event): - node = self.find_or_create_node(self.stack, event["comm"], None) + pid = event.get("sample", {}).get("pid", 0) + # event["dso"] sometimes contains /usr/lib/debug/lib/modules/*/vmlinux + # for user-space processes; let's use pid for kernel or user-space distinction + if pid == 0: + comm = event["comm"] + libtype = "kernel" + else: + comm = "{} ({})".format(event["comm"], pid) + libtype = "" + node = self.find_or_create_node(self.stack, comm, libtype) + if "callchain" in event: - for entry in reversed(event['callchain']): - node = self.find_or_create_node( - node, entry.get("sym", {}).get("name"), event.get("dso")) + for entry in reversed(event["callchain"]): + name = entry.get("sym", {}).get("name", "[unknown]") + libtype = self.get_libtype_from_dso(entry.get("dso")) + node = self.find_or_create_node(node, name, libtype) else: - node = self.find_or_create_node( - node, entry.get("symbol"), event.get("dso")) + name = event.get("symbol", "[unknown]") + libtype = self.get_libtype_from_dso(event.get("dso")) + node = self.find_or_create_node(node, name, libtype) node.value += 1 + def get_report_header(self): + if self.args.input == "-": + # when this script is invoked with "perf script flamegraph", + # no perf.data is created and we cannot read the header of it + return "" + + try: + output = subprocess.check_output(["perf", "report", "--header-only"]) + return output.decode("utf-8") + except Exception as err: # pylint: disable=broad-except + print("Error reading report header: {}".format(err), file=sys.stderr) + return "" + def trace_end(self): - json_str = json.dumps(self.stack, default=lambda x: x.toJSON()) + stacks_json = json.dumps(self.stack, default=lambda x: x.to_json()) if self.args.format == "html": + report_header = self.get_report_header() + options = { + "colorscheme": self.args.colorscheme, + "context": report_header + } + options_json = json.dumps(options) + try: - with io.open(self.args.template, encoding="utf-8") as f: - output_str = f.read().replace("/** @flamegraph_json **/", - json_str) - except IOError as e: - print("Error reading template file: {}".format(e), file=sys.stderr) + with io.open(self.args.template, encoding="utf-8") as template: + output_str = ( + template.read() + .replace("/** @options_json **/", options_json) + 
.replace("/** @flamegraph_json **/", stacks_json) + ) + except IOError as err: + print("Error reading template file: {}".format(err), file=sys.stderr) sys.exit(1) output_fn = self.args.output or "flamegraph.html" else: - output_str = json_str + output_str = stacks_json output_fn = self.args.output or "stacks.json" if output_fn == "-": @@ -101,8 +151,8 @@ class FlameGraphCLI: try: with io.open(output_fn, "w", encoding="utf-8") as out: out.write(output_str) - except IOError as e: - print("Error writing output file: {}".format(e), file=sys.stderr) + except IOError as err: + print("Error writing output file: {}".format(err), file=sys.stderr) sys.exit(1) @@ -115,12 +165,16 @@ if __name__ == "__main__": help="output file name") parser.add_argument("--template", default="/usr/share/d3-flame-graph/d3-flamegraph-base.html", - help="path to flamegraph HTML template") + help="path to flame graph HTML template") + parser.add_argument("--colorscheme", + default="blue-green", + help="flame graph color scheme", + choices=["blue-green", "orange"]) parser.add_argument("-i", "--input", help=argparse.SUPPRESS) - args = parser.parse_args() - cli = FlameGraphCLI(args) + cli_args = parser.parse_args() + cli = FlameGraphCLI(cli_args) process_event = cli.process_event trace_end = cli.trace_end diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c index 7e440fa90c9383ac1ab33cd83e167803477cb7fa..8f20781d2cff117177c8569256ea5aed9852e178 100644 --- a/tools/perf/util/event.c +++ b/tools/perf/util/event.c @@ -57,6 +57,7 @@ static const char *perf_event__names[] = { [PERF_RECORD_BPF_EVENT] = "BPF_EVENT", [PERF_RECORD_CGROUP] = "CGROUP", [PERF_RECORD_TEXT_POKE] = "TEXT_POKE", + [PERF_RECORD_AUX_OUTPUT_HW_ID] = "AUX_OUTPUT_HW_ID", [PERF_RECORD_HEADER_ATTR] = "ATTR", [PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE", [PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA", @@ -237,6 +238,14 @@ int perf_event__process_itrace_start(struct perf_tool *tool __maybe_unused, return machine__process_itrace_start_event(machine, event); } +int perf_event__process_aux_output_hw_id(struct perf_tool *tool __maybe_unused, + union perf_event *event, + struct perf_sample *sample __maybe_unused, + struct machine *machine) +{ + return machine__process_aux_output_hw_id_event(machine, event); +} + int perf_event__process_lost_samples(struct perf_tool *tool __maybe_unused, union perf_event *event, struct perf_sample *sample, @@ -388,6 +397,12 @@ size_t perf_event__fprintf_itrace_start(union perf_event *event, FILE *fp) event->itrace_start.pid, event->itrace_start.tid); } +size_t perf_event__fprintf_aux_output_hw_id(union perf_event *event, FILE *fp) +{ + return fprintf(fp, " hw_id: %#"PRI_lx64"\n", + event->aux_output_hw_id.hw_id); +} + size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp) { bool out = event->header.misc & PERF_RECORD_MISC_SWITCH_OUT; @@ -515,6 +530,9 @@ size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FIL case PERF_RECORD_TEXT_POKE: ret += perf_event__fprintf_text_poke(event, machine, fp); break; + case PERF_RECORD_AUX_OUTPUT_HW_ID: + ret += perf_event__fprintf_aux_output_hw_id(event, fp); + break; default: ret += fprintf(fp, "\n"); } diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h index 8a449362b5961908026c7d4b2379c14fb687edf4..9a4526ed3edb7e21816abb29292819ffade52ccb 100644 --- a/tools/perf/util/event.h +++ b/tools/perf/util/event.h @@ -318,6 +318,10 @@ int perf_event__process_itrace_start(struct perf_tool *tool, union perf_event *event, struct perf_sample *sample, 
struct machine *machine); +int perf_event__process_aux_output_hw_id(struct perf_tool *tool, + union perf_event *event, + struct perf_sample *sample, + struct machine *machine); int perf_event__process_switch(struct perf_tool *tool, union perf_event *event, struct perf_sample *sample, @@ -385,6 +389,7 @@ size_t perf_event__fprintf_mmap2(union perf_event *event, FILE *fp); size_t perf_event__fprintf_task(union perf_event *event, FILE *fp); size_t perf_event__fprintf_aux(union perf_event *event, FILE *fp); size_t perf_event__fprintf_itrace_start(union perf_event *event, FILE *fp); +size_t perf_event__fprintf_aux_output_hw_id(union perf_event *event, FILE *fp); size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp); size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp); size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp); diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c index df515cd8d01846f156e5c971d30649b54579e0f6..ab21c9e250a7b8b0da1f8f6c64c8352c8480be6a 100644 --- a/tools/perf/util/machine.c +++ b/tools/perf/util/machine.c @@ -728,6 +728,14 @@ int machine__process_itrace_start_event(struct machine *machine __maybe_unused, return 0; } +int machine__process_aux_output_hw_id_event(struct machine *machine __maybe_unused, + union perf_event *event) +{ + if (dump_trace) + perf_event__fprintf_aux_output_hw_id(event, stdout); + return 0; +} + int machine__process_switch_event(struct machine *machine __maybe_unused, union perf_event *event) { @@ -1982,6 +1990,8 @@ int machine__process_event(struct machine *machine, union perf_event *event, ret = machine__process_bpf(machine, event, sample); break; case PERF_RECORD_TEXT_POKE: ret = machine__process_text_poke(machine, event, sample); break; + case PERF_RECORD_AUX_OUTPUT_HW_ID: + ret = machine__process_aux_output_hw_id_event(machine, event); break; default: ret = -1; break; diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h index 26368d3c17543bd6607ab7e8e11b64367446ac9a..94a3cd3409cbf664a034fd5d5ebec7a4df17017e 100644 --- a/tools/perf/util/machine.h +++ b/tools/perf/util/machine.h @@ -123,6 +123,8 @@ int machine__process_aux_event(struct machine *machine, union perf_event *event); int machine__process_itrace_start_event(struct machine *machine, union perf_event *event); +int machine__process_aux_output_hw_id_event(struct machine *machine, + union perf_event *event); int machine__process_switch_event(struct machine *machine, union perf_event *event); int machine__process_namespaces_event(struct machine *machine, diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c index 088f13ceee5c88f952515ed0bfbe625c25b470e3..0e865acb14322ce7fdf136603704fbe4b2c67c39 100644 --- a/tools/perf/util/session.c +++ b/tools/perf/util/session.c @@ -492,6 +492,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool) tool->bpf = perf_event__process_bpf; if (tool->text_poke == NULL) tool->text_poke = perf_event__process_text_poke; + if (tool->aux_output_hw_id == NULL) + tool->aux_output_hw_id = perf_event__process_aux_output_hw_id; if (tool->read == NULL) tool->read = process_event_sample_stub; if (tool->throttle == NULL) @@ -980,6 +982,7 @@ static perf_event__swap_op perf_event__swap_ops[] = { [PERF_RECORD_NAMESPACES] = perf_event__namespaces_swap, [PERF_RECORD_CGROUP] = perf_event__cgroup_swap, [PERF_RECORD_TEXT_POKE] = perf_event__text_poke_swap, + [PERF_RECORD_AUX_OUTPUT_HW_ID] = perf_event__all64_swap, [PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap, 
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap, [PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap, @@ -1528,6 +1531,8 @@ static int machines__deliver_event(struct machines *machines, return tool->bpf(tool, event, sample, machine); case PERF_RECORD_TEXT_POKE: return tool->text_poke(tool, event, sample, machine); + case PERF_RECORD_AUX_OUTPUT_HW_ID: + return tool->aux_output_hw_id(tool, event, sample, machine); default: ++evlist->stats.nr_unknown_events; return -1; diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h index bbbc0dcd461ff453b66791cf78f019aad4da5139..ef873f2cc38f2d9af4145761282d31229aa0cbd1 100644 --- a/tools/perf/util/tool.h +++ b/tools/perf/util/tool.h @@ -53,6 +53,7 @@ struct perf_tool { lost_samples, aux, itrace_start, + aux_output_hw_id, context_switch, throttle, unthrottle, diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_smc.c b/tools/testing/selftests/bpf/prog_tests/bpf_smc.c index b57326c265447a52939da94d753a748808f313d5..5e01d88adb436fa8145965c8600732862920abc5 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_smc.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_smc.c @@ -3,31 +3,22 @@ #include #include #include +#include "network_helpers.h" #include "bpf_smc.skel.h" -void test_bpf_smc(void) +#define SOL_SMC 286 +#define SMC_NEGOTIATOR 2 + +void test_load(void) { struct bpf_smc *smc_skel; struct bpf_link *link; - int err; - smc_skel = bpf_smc__open(); + smc_skel = bpf_smc__open_and_load(); if (!ASSERT_OK_PTR(smc_skel, "skel_open")) return; - err = bpf_map__set_type(smc_skel->maps.negotiator_map, BPF_MAP_TYPE_HASH); - if (!ASSERT_OK(err, "bpf_map__set_type")) - goto error; - - err = bpf_map__set_max_entries(smc_skel->maps.negotiator_map, 1); - if (!ASSERT_OK(err, "bpf_map__set_type")) - goto error; - - err = bpf_smc__load(smc_skel); - if (!ASSERT_OK(err, "skel_load")) - goto error; - - link = bpf_map__attach_struct_ops(smc_skel->maps.ops); + link = bpf_map__attach_struct_ops(smc_skel->maps.anolis_smc); if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) goto error; @@ -35,3 +26,9 @@ void test_bpf_smc(void) error: bpf_smc__destroy(smc_skel); } + +void test_bpf_smc(void) +{ + if (test__start_subtest("load")) + test_load(); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_smc.c b/tools/testing/selftests/bpf/progs/bpf_smc.c index 4f4a5416c3c9526b7eb3693fb52b0a890bf74d7c..1d91d9dc9c8dfe7e540a78e014a044507f270659 100644 --- a/tools/testing/selftests/bpf/progs/bpf_smc.c +++ b/tools/testing/selftests/bpf/progs/bpf_smc.c @@ -1,226 +1,175 @@ // SPDX-License-Identifier: GPL-2.0-only -#include -#include -#include -#include -#include +#include "vmlinux.h" #include #include #include #define AF_SMC (43) +#define AF_INET (2) #define SMC_LISTEN (10) #define SMC_SOCK_CLOSED_TIMING (0) -extern unsigned long CONFIG_HZ __kconfig; -#define HZ CONFIG_HZ + +#define min(a, b) ((a) < (b) ? (a) : (b)) char _license[] SEC("license") = "GPL"; -#define max(a, b) ((a) > (b) ? 
(a) : (b)) - -struct sock_common { - unsigned char skc_state; - unsigned short skc_family; - __u16 skc_num; -} __attribute__((preserve_access_index)); - -struct sock { - struct sock_common __sk_common; - int sk_sndbuf; -} __attribute__((preserve_access_index)); - -struct inet_sock { - struct sock sk; -} __attribute__((preserve_access_index)); - -struct inet_connection_sock { - struct inet_sock icsk_inet; -} __attribute__((preserve_access_index)); - -struct tcp_sock { - struct inet_connection_sock inet_conn; - __u32 rcv_nxt; - __u32 snd_nxt; - __u32 snd_una; - __u32 delivered; - __u8 syn_data:1, /* SYN includes data */ - syn_fastopen:1, /* SYN includes Fast Open option */ - syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */ - syn_fastopen_ch:1, /* Active TFO re-enabling probe */ - syn_data_acked:1,/* data in SYN is acked by SYN-ACK */ - save_syn:1, /* Save headers of SYN packet */ - is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */ - syn_smc:1; /* SYN includes SMC */ -} __attribute__((preserve_access_index)); - -struct socket { - struct sock *sk; -} __attribute__((preserve_access_index)); - -union smc_host_cursor { - struct { - __u16 reserved; - __u16 wrap; - __u32 count; - }; -} __attribute__((preserve_access_index)); - -struct smc_connection { - union smc_host_cursor tx_curs_sent; - union smc_host_cursor rx_curs_confirmed; -} __attribute__((preserve_access_index)); - -struct smc_sock { - struct sock sk; - struct socket *clcsock; /* internal tcp socket */ - struct smc_connection conn; - int use_fallback; -} __attribute__((preserve_access_index)); - -static __always_inline struct tcp_sock *tcp_sk(const struct sock *sk) -{ - return (struct tcp_sock *)sk; -} static __always_inline struct smc_sock *smc_sk(struct sock *sk) { return (struct smc_sock *)sk; } +struct smc_strategy { + /* 0 for deny; 1 for auto; 2 for allow */ + __u8 mode; + /* reserver */ + __u8 reserved1; + /* how many rounds for long cc */ + __u16 rtt_threshold; + /* low = 0, hi = 1, N = 2 */ + __u16 smc_productivity[4]; + __u16 tcp_productivity[4]; +#define LOW_WATER_LEVEL(domain) domain##_productivity[0] +#define HI_WATER_LEVEL(domain) domain##_productivity[1] +#define EVERY_N(domain) domain##_productivity[2] + /* max value of credits, limit the totol smc-r */ + __u32 max_credits; + /* Initial value of credits */ + __u32 initial_credits; + /* max burst in one slice */ + __u32 max_pacing_burst; + /* fixed pacing delta */ + __u64 pacing_delta; +}; + +/* maps for smc_strategy */ +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 1024); + __type(key, __u16); + __type(value, struct smc_strategy); +} smc_strategies SEC(".maps"); + struct smc_prediction { + /* count to allow smc */ + __u32 credits; + __u32 pacing_burst; + /* count for smc conn */ + __u16 count_total_smc_conn; + __u16 count_matched_smc_conn; + /* count fot tcp conn */ + __u16 count_total_tcp_conn; + __u16 count_matched_tcp_conn; + /* last pick timestamp */ + __u64 last_tstamp; /* protection for smc_prediction */ struct bpf_spin_lock lock; - /* start of time slice */ - __u64 start_tstamp; - /* delta of pacing */ - __u64 pacing_delta; - /* N of closed connections determined as long connections - * in current time slice - */ - __u32 closed_long_cc; - /* N of closed connections in this time slice */ - __u32 closed_total_cc; - /* N of incoming connections determined as long connections - * in current time slice - */ - __u32 incoming_long_cc; - /* last splice rate of long cc */ - __u32 last_rate_of_lcc; }; -#define 
SMC_PREDICTION_MIN_PACING_DELTA (1llu) -#define SMC_PREDICTION_MAX_PACING_DELTA (HZ << 3) -#define SMC_PREDICTION_MAX_LONGCC_PER_SPLICE (8) -#define SMC_PREDICTION_MAX_PORT (64) -#define SMC_PREDICTION_MAX_SPLICE_GAP (1) -#define SMC_PREDICTION_LONGCC_RATE_THRESHOLD (13189) -#define SMC_PREDICTION_LONGCC_PACKETS_THRESHOLD (100) -#define SMC_PREDICTION_LONGCC_BYTES_THRESHOLD \ - (SMC_PREDICTION_LONGCC_PACKETS_THRESHOLD * 1024) - struct { __uint(type, BPF_MAP_TYPE_HASH); - __uint(max_entries, SMC_PREDICTION_MAX_PORT); + __uint(max_entries, 1024); __type(key, __u16); __type(value, struct smc_prediction); -} negotiator_map SEC(".maps"); +} smc_predictors SEC(".maps"); - -static inline __u32 smc_prediction_calt_rate(struct smc_prediction *smc_predictor) +static inline struct smc_prediction *smc_prediction_get(__u16 key, struct smc_strategy *strategy) { - if (!smc_predictor->closed_total_cc) - return smc_predictor->last_rate_of_lcc; - - return (smc_predictor->closed_long_cc << 14) / smc_predictor->closed_total_cc; -} - -static inline struct smc_prediction *smc_prediction_get(__u16 key, __u64 tstamp) -{ - struct smc_prediction zero = {}, *smc_predictor; - __u32 gap; - int err; + struct smc_prediction *smc_predictor; - smc_predictor = bpf_map_lookup_elem(&negotiator_map, &key); + smc_predictor = bpf_map_lookup_elem(&smc_predictors, &key); if (!smc_predictor) { - zero.start_tstamp = bpf_jiffies64(); - zero.pacing_delta = SMC_PREDICTION_MIN_PACING_DELTA; - err = bpf_map_update_elem(&negotiator_map, &key, &zero, 0); - if (err) - return NULL; - smc_predictor = bpf_map_lookup_elem(&negotiator_map, &key); - if (!smc_predictor) - return NULL; - } - - if (tstamp) { - bpf_spin_lock(&smc_predictor->lock); - gap = (tstamp - smc_predictor->start_tstamp) / smc_predictor->pacing_delta; - /* new splice */ - if (gap > 0) { - smc_predictor->start_tstamp = tstamp; - smc_predictor->last_rate_of_lcc = - (smc_prediction_calt_rate(smc_predictor) * 7) >> (2 + gap); - smc_predictor->closed_long_cc = 0; - smc_predictor->closed_total_cc = 0; - smc_predictor->incoming_long_cc = 0; - } - bpf_spin_unlock(&smc_predictor->lock); + struct smc_prediction init = { + .credits = strategy->initial_credits, + }; + bpf_map_update_elem(&smc_predictors, &key, &init, BPF_NOEXIST); + smc_predictor = bpf_map_lookup_elem(&smc_predictors, &key); } return smc_predictor; } -/* BPF struct ops for smc protocol negotiator */ -struct smc_sock_negotiator_ops { - /* ret for negotiate */ - int (*negotiate)(struct smc_sock *smc); +static inline struct smc_strategy *smc_strategy_get(__u16 key) +{ + struct smc_strategy *strategy; - /* info gathering timing */ - void (*collect_info)(struct smc_sock *smc, int timing); -}; + strategy = bpf_map_lookup_elem(&smc_strategies, &key); + if (!strategy && key != 0) { + /* search for default */ + key = 0; + strategy = bpf_map_lookup_elem(&smc_strategies, &key); + } + return strategy; +} int SEC("struct_ops/bpf_smc_negotiate") -BPF_PROG(bpf_smc_negotiate, struct smc_sock *smc) +BPF_PROG(bpf_smc_negotiate, struct sock *sk) { struct smc_prediction *smc_predictor; - int err, ret = SK_DROP; + struct smc_strategy *strategy; + struct smc_sock *smc; struct tcp_sock *tp; - struct sock *clcsk; - __u32 rate = 0; + __u64 now; __u16 key; - /* client side */ - if (smc == NULL || smc->sk.__sk_common.skc_state != SMC_LISTEN) { - /* use Global smc_predictor */ - key = 0; - } else { /* server side */ - clcsk = BPF_CORE_READ(smc, clcsock, sk); - if (!clcsk) - goto error; - tp = tcp_sk(clcsk); - err = bpf_core_read(&key, sizeof(__u16), 
- &tp->inet_conn.icsk_inet.sk.__sk_common.skc_num); - if (err) - goto error; - } + if (!sk) + return SK_DROP; - smc_predictor = smc_prediction_get(key, bpf_jiffies64()); - if (!smc_predictor) + smc = smc_sk(sk); + + /* for client side */ + if (!smc->listen_smc && smc->sk.__sk_common.skc_state != SMC_LISTEN) { + /* client always say yes */ return SK_PASS; + } - bpf_spin_lock(&smc_predictor->lock); + /* every full smc sock should contains a tcp sock */ + tp = bpf_skc_to_tcp_sock(sk); - if (smc_predictor->incoming_long_cc == 0) - goto out_locked_pass; + /* local port as key */ + key = tp ? tp->inet_conn.icsk_inet.sk.__sk_common.skc_num : 0; - if (smc_predictor->incoming_long_cc > SMC_PREDICTION_MAX_LONGCC_PER_SPLICE) - goto out_locked_drop; + strategy = smc_strategy_get(key); + if (!strategy) + return SK_DROP; - rate = smc_prediction_calt_rate(smc_predictor); - if (rate < SMC_PREDICTION_LONGCC_RATE_THRESHOLD) - goto out_locked_drop; +#define DENYLIST_MODE (0) +#define AUTO_MODE (1) +#define ALLOWLIST_MODE (2) + switch (strategy->mode) { + case AUTO_MODE: + break; + case ALLOWLIST_MODE: + return SK_PASS; + case DENYLIST_MODE: + default: + return SK_DROP; + } +#undef ALLOWLIST_MODE +#undef AUTO_MODE +#undef DENYLIST_MODE + smc_predictor = smc_prediction_get(key, strategy); + if (!smc_predictor) + return SK_DROP; + + now = bpf_jiffies64(); + + bpf_spin_lock(&smc_predictor->lock); + if (!smc_predictor->credits) + goto out_locked_drop; out_locked_pass: - smc_predictor->incoming_long_cc++; + /* pacing incoming rate */ + if (now - smc_predictor->last_tstamp < strategy->pacing_delta) { +pacing: + if (!smc_predictor->pacing_burst) + goto out_locked_drop; + smc_predictor->pacing_burst--; + } else { + smc_predictor->last_tstamp = now; + smc_predictor->pacing_burst = strategy->max_pacing_burst; + goto pacing; + } + smc_predictor->credits--; bpf_spin_unlock(&smc_predictor->lock); return SK_PASS; out_locked_drop: @@ -232,81 +181,105 @@ BPF_PROG(bpf_smc_negotiate, struct smc_sock *smc) void SEC("struct_ops/bpf_smc_collect_info") BPF_PROG(bpf_smc_collect_info, struct sock *sk, int timing) { + bool match = false, smc_traffic = false; struct smc_prediction *smc_predictor; - int use_fallback, sndbuf, err; + struct smc_strategy *strategy; struct smc_sock *smc; struct tcp_sock *tp; - struct sock *clcsk; - bool match = false; - __u16 wrap, count; - __u32 delivered; + __u64 delta = 0; __u16 key; - /* no info can collect */ - if (sk == NULL) + /* smc sock */ + smc = smc_sk(sk); + if (!smc) return; /* only fouces on closed */ if (timing != SMC_SOCK_CLOSED_TIMING) return; - /* first check the sk type */ - if (sk->__sk_common.skc_family == AF_SMC) { - smc = smc_sk(sk); - clcsk = BPF_CORE_READ(smc, clcsock, sk); - if (!clcsk) - goto error; - tp = tcp_sk(clcsk); - /* check if it's fallback */ - err = bpf_core_read(&use_fallback, sizeof(use_fallback), &smc->use_fallback); - if (err) - goto error; - if (use_fallback) - goto fallback; - err = bpf_core_read(&wrap, sizeof(__u16), &smc->conn.tx_curs_sent.wrap); - if (err) - goto error; - err = bpf_core_read(&count, sizeof(__u16), &smc->conn.tx_curs_sent.count); - if (err) - goto error; - err = bpf_core_read(&sndbuf, sizeof(int), &clcsk->sk_sndbuf); - if (err) - goto error; - match = (count + wrap * sndbuf) > SMC_PREDICTION_LONGCC_BYTES_THRESHOLD; - } else { - smc = NULL; - tp = tcp_sk(sk); - use_fallback = 1; -fallback: - err = bpf_core_read(&delivered, sizeof(delivered), &tp->delivered); - if (err) - goto error; - match = (delivered > SMC_PREDICTION_LONGCC_PACKETS_THRESHOLD); 
-	}
+	/* every full smc sock should contain a tcp sock */
+	tp = bpf_skc_to_tcp_sock(sk);
+	if (!tp)
+		return;
+
+	if (!smc->listen_smc)
+		/* only monitor passively opened sockets for the server */
+		return;
+
+	/* local port as key */
+	key = tp->inet_conn.icsk_inet.sk.__sk_common.skc_num;
+	if (key == 0)
+		return;
 
-	/* whatever, tp is never NULL */
-	err = bpf_core_read(&key, sizeof(__u16), &tp->inet_conn.icsk_inet.sk.__sk_common.skc_num);
-	if (err)
-		goto error;
+	strategy = smc_strategy_get(key);
+	if (!strategy)
+		return;
 
-	smc_predictor = smc_prediction_get(key, 0);
+	smc_predictor = smc_prediction_get(key, strategy);
 	if (!smc_predictor)
-		goto error;
+		return;
+
+	switch (sk->__sk_common.skc_family) {
+	case AF_INET:
+		if (sk != &tp->inet_conn.icsk_inet.sk)
+			return;
+	case AF_SMC:
+		if (!smc->use_fallback) {
+			smc_traffic = true;
+			/* full rtt */
+			match = smc->conn.tx_cdc_seq > strategy->rtt_threshold;
+			break;
+		}
+	default:
+		match = tp->data_segs_out > tp->snd_cwnd * strategy->rtt_threshold;
+		break;
+	}
 
 	bpf_spin_lock(&smc_predictor->lock);
-	smc_predictor->closed_total_cc++;
-	if (match) {
-		/* increase stats */
-		smc_predictor->closed_long_cc++;
-		/* try more aggressive */
-		if (smc_predictor->pacing_delta > SMC_PREDICTION_MIN_PACING_DELTA) {
-			if (use_fallback) {
-				smc_predictor->pacing_delta = max(SMC_PREDICTION_MIN_PACING_DELTA,
-								  (smc_predictor->pacing_delta * 3) >> 2);
-			}
+	if (smc_traffic) {
+		/* matched smc connection */
+		if (match)
+			++smc_predictor->count_matched_smc_conn;
+		/* For every N smc connections, matched connections in
+		 * [0, LOW_WATER_LEVEL) : smc_predictor->credits >> 1;
+		 * [LOW_WATER_LEVEL, HI_WATER_LEVEL) : no impact;
+		 * [HI_WATER_LEVEL, ) : inc smc_predictor->credits
+		 */
+		if (++smc_predictor->count_total_smc_conn >= strategy->EVERY_N(smc)) {
+			/* fast downgrade */
+			if (smc_predictor->count_matched_smc_conn < strategy->LOW_WATER_LEVEL(smc))
+				smc_predictor->credits = smc_predictor->credits >> 1;
+			else if (smc_predictor->count_matched_smc_conn >=
+				 strategy->HI_WATER_LEVEL(smc))
+				/* give credits back */
+				smc_predictor->credits = min(smc_predictor->credits + 1,
+							     strategy->max_credits);
+			/* reset smc counters */
+			smc_predictor->count_total_smc_conn = 0;
+			smc_predictor->count_matched_smc_conn = 0;
+		}
+	} else {
+		if (match)
+			/* matched tcp connection */
+			++smc_predictor->count_matched_tcp_conn;
+		/* For every N tcp connections, matched connections in
+		 * [0, LOW_WATER_LEVEL) : no impact
+		 * [LOW_WATER_LEVEL, HI_WATER_LEVEL) : inc smc_predictor->credits
+		 * [HI_WATER_LEVEL, ) : increase smc_predictor->credits by n.
+		 */
+		if (++smc_predictor->count_total_tcp_conn >= strategy->EVERY_N(tcp)) {
+			if (smc_predictor->count_matched_tcp_conn >= strategy->HI_WATER_LEVEL(tcp))
+				delta = smc_predictor->count_matched_tcp_conn;
+			else if (smc_predictor->count_matched_tcp_conn >=
+				 strategy->LOW_WATER_LEVEL(tcp))
+				delta = 1;
+			smc_predictor->credits = min(smc_predictor->credits + delta,
+						     strategy->max_credits);
+			/* reset tcp counters */
+			smc_predictor->count_total_tcp_conn = 0;
+			smc_predictor->count_matched_tcp_conn = 0;
 		}
-	} else if (!use_fallback) {
-		smc_predictor->pacing_delta <<= 1;
 	}
 	bpf_spin_unlock(&smc_predictor->lock);
 error:
@@ -314,7 +287,8 @@ BPF_PROG(bpf_smc_collect_info, struct sock *sk, int timing)
 }
 
 SEC(".struct_ops")
-struct smc_sock_negotiator_ops ops = {
+struct smc_sock_negotiator_ops anolis_smc = {
+	.name = "anolis",
 	.negotiate = (void *)bpf_smc_negotiate,
 	.collect_info = (void *)bpf_smc_collect_info,
 };
diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
index 77bb62feb87261c39863264696207d3ccb011bce..e5dc245f3002125a5703c5cd74cbf8c43e63d17b 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -351,7 +351,7 @@ static int copyfd_io_poll(int infd, int peerfd, int outfd)
 		char rbuf[8192];
 		ssize_t len;
 
-		if (fds.events == 0)
+		if (fds.events == 0 || quit)
 			break;
 
 		switch (poll(&fds, 1, poll_timeout)) {
@@ -435,7 +435,7 @@ static int copyfd_io_poll(int infd, int peerfd, int outfd)
 	}
 
 	/* leave some time for late join/announce */
-	if (cfg_join || cfg_remove)
+	if (cfg_join || (cfg_remove && !quit))
 		usleep(cfg_wait);
 
 	close(peerfd);
@@ -834,7 +834,6 @@ static void parse_opts(int argc, char **argv)
 		case 'j':
 			cfg_join = true;
 			cfg_mode = CFG_MODE_POLL;
-			cfg_wait = 400000;
 			break;
 		case 'r':
 			cfg_remove = true;
diff --git a/tools/testing/selftests/net/mptcp/simult_flows.sh b/tools/testing/selftests/net/mptcp/simult_flows.sh
index 8fcb2892781826ec3c8b5bb65441f439dc4d0ed6..37ad2bd3aa4e77f573bf052aa3aea54b81a48b1d 100755
--- a/tools/testing/selftests/net/mptcp/simult_flows.sh
+++ b/tools/testing/selftests/net/mptcp/simult_flows.sh
@@ -50,7 +50,7 @@ setup()
 	sout=$(mktemp)
 	cout=$(mktemp)
 	capout=$(mktemp)
-	size=$((2048 * 4096))
+	size=$((2048 * 4096 * 4))
 	dd if=/dev/zero of=$small bs=4096 count=20 >/dev/null 2>&1
 	dd if=/dev/zero of=$large bs=4096 count=$((size / 4096)) >/dev/null 2>&1
 
@@ -126,7 +126,11 @@ do_transfer()
 	local cin=$1
 	local sin=$2
 	local max_time=$3
+	local reverse=$4
 	local port
+	local srv_args="-j"
+	local cl_args=""
+
 	port=$((10000+$test_cnt))
 	test_cnt=$((test_cnt+1))
 
@@ -157,14 +161,19 @@ do_transfer()
 		sleep 1
 	fi
 
-	ip netns exec ${ns3} ./mptcp_connect -jt $timeout -l -p $port 0.0.0.0 < "$sin" > "$sout" &
+	if [ "$reverse" = true ]; then
+		srv_args=""
+		cl_args="-j"
+	fi
+
+	ip netns exec ${ns3} ./mptcp_connect $srv_args -t $timeout -l -p $port 0.0.0.0 < "$sin" > "$sout" &
 	local spid=$!
 
 	wait_local_port_listen "${ns3}" "${port}"
 
 	local start
 	start=$(date +%s%3N)
-	ip netns exec ${ns1} ./mptcp_connect -jt $timeout -p $port 10.0.3.3 < "$cin" > "$cout" &
+	ip netns exec ${ns1} ./mptcp_connect $cl_args -t $timeout -p $port 10.0.3.3 < "$cin" > "$cout" &
 	local cpid=$!
 	wait $cpid
 
@@ -236,12 +245,12 @@ run_test()
 	tc -n $ns2 qdisc add dev ns2eth1 root netem rate ${rate1}mbit $delay1
 	tc -n $ns2 qdisc add dev ns2eth2 root netem rate ${rate2}mbit $delay2
 
-	# time is measure in ms
-	local time=$((size * 8 * 1000 / (( $rate1 + $rate2) * 1024 *1024) ))
+	# time is measured in ms; account for header overhead, with DSS+ACK64 present
+	local time=$((size * 8 * 1000 * 1514 / (( $rate1 + $rate2) * 1024 * 1024 * 1424) ))
 
 	# mptcp_connect will do some sleeps to allow the mp_join handshake
 	# completion
-	time=$((time + 1350))
+	time=$((time + 350))
 
 	printf "%-50s" "$msg"
 	do_transfer $small $large $((time * 11 / 10))
diff --git a/tools/testing/selftests/tdx/Makefile b/tools/testing/selftests/tdx/Makefile
index 8dd43517cd55d31c353cf2929727d99789c8e05a..306e9c4d5ef7ca519260c164e49eb93f972fb830 100644
--- a/tools/testing/selftests/tdx/Makefile
+++ b/tools/testing/selftests/tdx/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 
-CFLAGS += -O3 -Wl,-no-as-needed -Wall -static
+CFLAGS += -O3 -Wl,-no-as-needed -Wall $(KHDR_INCLUDES) -static
 
 TEST_GEN_PROGS := tdx_guest_test
 
diff --git a/tools/testing/selftests/tdx/tdx_guest_test.c b/tools/testing/selftests/tdx/tdx_guest_test.c
index 2a2afd856798b4bcf74820709796c1b5f0187579..81d8cb88ea1ad9bd358523beb68d32b73b28e0f3 100644
--- a/tools/testing/selftests/tdx/tdx_guest_test.c
+++ b/tools/testing/selftests/tdx/tdx_guest_test.c
@@ -12,8 +12,8 @@
 #include 
 #include 
+#include <linux/tdx-guest.h>
 #include "../kselftest_harness.h"
-#include "../../../../include/uapi/linux/tdx-guest.h"
 
 #define TDX_GUEST_DEVNAME "/dev/tdx_guest"
 #define HEX_DUMP_SIZE 8