Tags

  • x86_cleanups_for_6.5

     - Address -Wmissing-prototypes warnings
     - Remove repeated 'the' in comments
     - Remove unused current_untag_mask()
     - Document urgent tip branch timing
     - Clean up MSR kernel-doc notation
     - Clean up paravirt_ops doc
     - Update Srivatsa S. Bhat's maintained areas
     - Remove unused extern declaration acpi_copy_wakeup_routine()
    
  • x86_tdx_for_6.5

     - Fix a race window where load_unaligned_zeropad() could cause
       a fatal shutdown during TDX private<=>shared conversion
     - Annotate sites where VM "exit reasons" are reused as hypercall
       numbers.
    
  • x86_platform_for_6.5

    Add UV platform support for sub-NUMA clustering
    
  • x86_irq_for_6.5

    Add Hyper-V interrupts to /proc/stat
    
  • x86_cpu_for_v6.5

    - Compute the purposeful misalignment of zen_untrain_ret automatically
      and assert __x86_return_thunk's alignment so that future changes to
      the symbol macros do not accidentally break them.
    
    - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
      pointless
    
  • x86_cc_for_v6.5

    - Add support for unaccepted memory as specified in the UEFI spec v2.9.
      The gist of it is that Intel TDX and AMD SEV-SNP confidential
      computing guests define the notion of accepting memory before using
      it, which prevents a whole class of attacks against such guests,
      like memory replay.
    
      There are a couple of strategies for how memory can be accepted;
      the current implementation accepts memory on demand.
    
  • x86_cache_for_v6.5

    - Implement a rename operation in resctrlfs to facilitate handling
      of application containers with dynamically changing task lists
    
    - When reading the tasks file, show only the PIDs of tasks in the
      current namespace, instead of also showing the PIDs from the init
      namespace
    
    - Other fixes and improvements
    
  • x86_build_for_v6.5

    - Remove relocation information from vmlinux as it is not needed by
      other tooling, which results in a slimmer binary. This is important
      for distros which have to distribute vmlinux blobs with their
      kernel packages, as that extraneous data bloats them for no good
      reason
    
  • x86_alternatives_for_v6.5

    - Up until now the Fast Short Rep Mov optimizations implied the presence
      of the ERMS CPUID flag. AMD decoupled them with a BIOS setting so decouple
      that dependency in the kernel code too
    
    - Teach the alternatives machinery to handle relocations
    
    - Make debug_alternative accept flags in order to see only the set of
      patching one is interested in
    
    - Other fixes, cleanups and optimizations to the patching code
    
  • ras_core_for_v6.5

    - Add initial support for RAS hardware found on AMD server GPUs (MI200).
      Those GPUs and CPUs are connected together through the coherent fabric
      and the GPU memory controllers report errors through x86's MCA so EDAC
      needs to support them. The amd64_edac driver now supports HBM (High
      Bandwidth Memory) and thus such heterogeneous memory controller
      systems
    
    - Other small cleanups and improvements
    
  • x86-core-2023-06-26

    A set of fixes for kexec(), reboot and shutdown issues
    
     - Ensure that the WBINVD in stop_this_cpu() has been completed before the
       control CPU proceeds.
    
       stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs
       in a HLT loop.
    
       The control CPU sends an IPI to the APs and waits for their CPU online bits
       to be cleared. Once they all are marked "offline" it proceeds.
    
       But stop_this_cpu() clears the CPU online bit before issuing WBINVD,
       which means there is no guarantee that the AP has reached the HLT loop.
    
       This was reported to cause intermittent reboot/shutdown failures due to
       some dubious interaction with the firmware.
    
       This is not only a problem of WBINVD. The code to actually "stop" the
       CPU which runs between clearing the online bit and reaching the HLT loop
       can cause large enough delays on its own (think virtualization). That's
       especially dangerous for kexec() as kexec() expects that all APs are in
       a safe state and not executing code while the boot CPU jumps to the new
       kernel. There are more issues vs. kexec() which are addressed separately.
    
       Cure this by implementing an explicit synchronization point right before
       the AP reaches HLT. This guarantees that the AP has completed the full
       stop procedure.
    
     - Fix the condition for WBINVD in stop_this_cpu().
    
       The WBINVD in stop_this_cpu() is required for ensuring that when
       switching to or from memory encryption no dirty data is left in the
       cache lines which might cause a write back in the wrong mode later.
    
       This checks CPUID directly because the feature bit might have been
       cleared due to a command line option.
    
       But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel
       CPUs return the content of the highest supported leaf when a non-existing
       leaf is read, while AMD CPUs return all zeros for unsupported leaves.

       So on Intel CPUs the result of the test is a lottery, while on AMD it is
       correct, but only by chance.
    
       While harmless it's incorrect and causes the conditional wbinvd() to be
       issued where not required, which caused the above issue to be unearthed.
    
     - Make kexec() robust against AP code execution
    
       Ashok observed triple faults when doing kexec() on a system which had
       been booted with "nosmt".
    
       It turned out that the SMT siblings which had been brought up partially
       are parked in mwait_play_dead() to enable power savings.
    
       mwait_play_dead() is monitoring the thread flags of the AP's idle task,
       which has been chosen as it's unlikely to be written to.
    
       But kexec() can overwrite the previous kernel text and data including
       page tables etc. When it overwrites the cache lines monitored by an AP
       that AP resumes execution after the MWAIT on eventually overwritten
       text, stack and page tables, which obviously might end up in a triple
       fault easily.
    
       Make this more robust in several steps:
    
        1) Use an explicit per CPU cache line for monitoring.
    
        2) Write a command to these cache lines to kick APs out of MWAIT before
           proceeding with kexec(), shutdown or reboot.
    
           The APs confirm the wakeup by writing status back and then enter a
           HLT loop.
    
        3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs
           in INIT state.
    
           HLT is not a guarantee that an AP won't wake up and resume
           execution. HLT is woken up by NMI and SMI. SMI puts the CPU back
           into HLT (+/- firmware bugs), but NMI is delivered to the CPU which
           executes the NMI handler. Same issue as the MWAIT scenario described
           above.
    
           Sending an INIT/INIT sequence to the APs puts them into wait for
           STARTUP state, which is safe against NMI.
    
        There is still an issue remaining which can't be fixed: #MCE
    
        If the AP sits in HLT and receives a broadcast #MCE it will try to
        handle it with the obvious consequences.
    
        INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to
        shut down the machine.
    
        So there is a choice between fire (HLT) and frying pan (INIT). Frying
        pan has been chosen as it's at least preventing the NMI issue.
    
        On systems which are not using INIT/INIT/STARTUP there is not much
        which can be done right now, but at least the obvious and easy to
        trigger MWAIT issue has been addressed.
    
  • x86-boot-2023-06-26

    Updates for the x86 boot process:
    
     - Initialize FPU late.
    
       Right now FPU is initialized very early during boot. There is no real
       requirement to do so. The only requirement is to have it done before
       alternatives are patched.
    
       That's done in check_bugs() which does way more than what the function
       name suggests.
    
       So first rename check_bugs() to arch_cpu_finalize_init() which makes it
       clear what this is about.
    
       Move the invocation of arch_cpu_finalize_init() earlier in
       start_kernel() as it has to be done before fork_init() which needs to
       know the FPU register buffer size.
    
       With those prerequisites the FPU initialization can be moved into
       arch_cpu_finalize_init(), which removes it from the early and fragile
       part of the x86 bringup.
    
  • timers-core-2023-06-26

    Time, timekeeping and related device driver updates:
    
     - Core:
    
       - A set of fixes, cleanups and enhancements to the posix timer code:
    
         - Prevent another possible live lock scenario in the exit() path,
           which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures.
    
          - Fix a loop termination issue which was reported by
            syzkaller/KASAN in the posix timer ID allocation code.
    
           That triggered a deeper look into the posix-timer code which
           unearthed more small issues.
    
         - Add missing READ/WRITE_ONCE() annotations
    
         - Fix or remove completely outdated comments
    
         - Document places which are subtle and completely undocumented.
    
       - Add missing hrtimer modes to the trace event decoder
    
       - Small cleanups and enhancements all over the place
    
     - Drivers:
    
         - Rework the Hyper-V clocksource and sched clock setup code
    
         - Remove a deprecated clocksource driver
    
         - Small fixes and enhancements all over the place
    
  • smp-core-2023-06-26

    A large update for SMP management:
    
      - Parallel CPU bringup
    
        The reason why people are interested in parallel bringup is to shorten
        the (kexec) reboot time of cloud servers to reduce the downtime of the
        VM tenants.
    
        The current fully serialized bringup does the following per AP:
    
          1) Prepare callbacks (allocate, initialize, create threads)
          2) Kick the AP alive (e.g. INIT/SIPI on x86)
          3) Wait for the AP to report alive state
          4) Let the AP continue through the atomic bringup
          5) Let the AP run the threaded bringup to full online state
    
        There are two significant delays:
    
          #3 The time for an AP to report alive state in start_secondary() on
             x86 has been measured in the range between 350us and 3.5ms
             depending on vendor and CPU type, BIOS microcode size etc.
    
          #4 The atomic bringup does the microcode update. This has been
             measured to take up to ~8ms on the primary threads depending on
             the microcode patch size to apply.
    
        On a two socket SKL server with 56 cores (112 threads) the boot CPU
        spends on current mainline about 800ms busy waiting for the APs to come
        up and apply microcode. That's more than 80% of the actual onlining
        procedure.
    
        This can be reduced significantly by splitting the bringup mechanism
        into two parts:
    
          1) Run the prepare callbacks and kick the AP alive for each AP
             which needs to be brought up.

             The APs wake up, do their firmware initialization and run the
             low level kernel startup code including microcode loading in
             parallel up to the first synchronization point. (#1 and #2
             above)

          2) Run the rest of the bringup code strictly serialized per CPU
             (#3 - #5 above) as it's done today.

             Parallelizing that stage of the CPU bringup might be possible
             in theory, but it's questionable whether the required surgery
             would be justified for a pretty small gain.
    
        If the system is large enough the first AP is already waiting at the
        first synchronization point when the boot CPU finished the wake-up of
        the last AP. That reduces the AP bringup time on that SKL from ~800ms
        to ~80ms, i.e. by a factor ~10x.
    
        The actual gain varies wildly depending on the system, CPU, microcode
        patch size and other factors. There are some opportunities to reduce
        the overhead further, but that needs some deep surgery in the x86 CPU
        bringup code.
    
        For now this is only enabled on x86, but the core functionality
        obviously works for all SMP capable architectures.
    
      - Enhancements for SMP function call tracing so it is possible to locate
        the scheduling and the actual execution points. That makes it possible
        to measure IPI delivery time precisely.
    
  • irq-core-2023-06-26

    Updates for the interrupt subsystem:
    
     - Core:
    
       - Convert the interrupt descriptor storage to a maple tree to overcome
         the limitations of the radix tree plus fixed size bitmap. This makes
         it possible to handle really large servers with a huge number of
         guests without imposing a huge memory overhead on everyone.
    
       - Implement optional retriggering of interrupts which utilize the
         fasteoi handler to work around a GICv3 architecture issue.
    
     - Drivers:
    
       - A set of fixes and updates for the Loongson/Loongarch related drivers.
    
       - Workaround for an ASR8601 integration hiccup which ends up with CPU
         numbering which can't be represented in the GIC implementation.
    
       - The usual set of boring fixes and updates all over the place.
    
  • core-debugobjects-2023-06-26

    A single update for debug objects:
    
      - Recheck whether debug objects is enabled before reporting a problem to
        avoid spamming the logs with messages which are caused by a concurrent
        OOM.
    
  • v6.4

    6995e2de · Linux 6.4
    
  • v6.3.9-danctnix1

    DanctNIX kernel v6.3.9-danctnix1