Apple Silicon PMU Counters: Analysis and Implementation Guide

Overview

This document analyzes the Performance Monitoring Unit (PMU) counter system on Apple Silicon processors (M1, M2, and later), based on research into Apple's private kperf API. The analysis covers the hardware architecture, counter limitations, compatibility rules, and practical implementation considerations.

Problem Statement

Apple Silicon processors provide PMU counters for tracking microarchitectural events, but Apple does not publicly document:

  • The maximum number of counters that can be monitored simultaneously
  • Why certain counters are incompatible with each other
  • The algorithm for counter allocation
  • How counter ordering affects compatibility

This lack of documentation forces developers to rely on trial-and-error or reverse-engineering to use PMU counters effectively.

System Analysis

Fundamental Components

graph TD
    subgraph Hardware
        A[Fixed Counters]
        B[Programmable Counters]
    end

    subgraph Software
        C[kperf Framework]
        D[Counter Database]
        E[Allocation Algorithm]
    end

    subgraph User Space
        F[Instruments App]
        G[Custom Tools]
    end

    A --> C
    B --> C
    D --> E
    C --> E
    E --> F
    E --> G

PMU Counter Architecture

Fixed Counters (2)

Apple Silicon provides two fixed counters that are always available:

  • Cycles (FIXED_CYCLES): Mask 0b0000000001
  • Instructions (FIXED_INSTRUCTIONS): Mask 0b0000000010

Each fixed counter has its own dedicated slot (bit 0 for cycles, bit 1 for instructions), so both can be combined with any other counter.

Programmable Counters (8)

The remaining 8 slots are shared among 58 programmable events. Each event carries a 10-bit mask in which bit i is set if the event can be counted in slot i. Programmable events use only bits 2-9.

Counter Categories by Mask

Mask Type           Bit Pattern   Counters   Allocation Rule
Fixed Cycle         0000000001    1          Unique slot
Fixed Instruction   0000000010    1          Unique slot
Group M             0010000000    6          Single slot (bit 7)
Group G             0011100000    18         3 slots (bits 5-7)
General             1111111100    33         8 slots (bits 2-9)
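
As a minimal sketch of how to read these masks, the helper below lists the slot indices whose bits are set. The names `mask_slots` and `PMU_SLOT_COUNT` are illustrative and not part of kperf; the 10-slot width is taken from the mask width described above.

```c
#include <stdint.h>
#include <stddef.h>

#define PMU_SLOT_COUNT 10  /* assumption: one slot per bit of the 10-bit mask */

/* Write the slot indices permitted by `mask` into `out` and return how
 * many were written. Bit i set means slot i is a candidate. */
static size_t mask_slots(uint32_t mask, int out[PMU_SLOT_COUNT])
{
    size_t n = 0;
    for (int slot = 0; slot < PMU_SLOT_COUNT; slot++) {
        if (mask & (1u << slot))
            out[n++] = slot;
    }
    return n;
}
```

For example, mask 0010000000 (0x080) yields the single slot 7, while 0011100000 (0x0E0) yields slots 5, 6, and 7.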

Group M Counters (6 counters - incompatible in pairs)

INST_ALL
INST_INT_ALU
INST_INT_ST
INST_LDST
INST_SIMD_ALU
RETIRE_UOP

Group G Counters (18 counters - incompatible in quadruples)

BRANCH_CALL_INDIR_MISPRED_NONSPEC
BRANCH_COND_MISPRED_NONSPEC
BRANCH_INDIR_MISPRED_NONSPEC
BRANCH_MISPRED_NONSPEC
BRANCH_RET_INDIR_MISPRED_NONSPEC
INST_BARRIER
INST_BRANCH
INST_BRANCH_CALL
INST_BRANCH_COND
INST_BRANCH_INDIR
INST_BRANCH_RET
INST_BRANCH_TAKEN
INST_INT_LD
INST_SIMD_LD
INST_SIMD_ST
L1D_CACHE_MISS_LD_NONSPEC
L1D_CACHE_MISS_ST_NONSPEC
L1D_TLB_MISS_NONSPEC
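
The "pairs" and "quadruples" limits follow from a counting argument: n events need at least n distinct slots among the union of their masks. The sketch below checks that condition. It is a necessary test only, not a full scheduler, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Count set bits without relying on compiler builtins. */
static int popcount32(uint32_t x)
{
    int n = 0;
    for (; x; x &= x - 1)   /* clears the lowest set bit each pass */
        n++;
    return n;
}

/* Returns false when the events cannot possibly fit: fewer candidate
 * slots in the union of their masks than there are events. A true
 * result does not guarantee that a valid assignment exists. */
static bool may_fit(const uint32_t *masks, size_t n)
{
    uint32_t all = 0;
    for (size_t i = 0; i < n; i++)
        all |= masks[i];
    return (size_t)popcount32(all) >= n;
}
```

Two Group M masks fail the check because their union has a single bit, and four Group G masks fail because their union has only three.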

Counter Allocation Algorithm

Core Principle

When a counter is added to the monitoring list, it takes the lowest-numbered free slot whose bit is set in its mask. If none of those slots is free, the allocation fails.
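
The rule can be modeled in a few lines. This is a sketch of the behavior described here, not Apple's actual implementation; `alloc_slot` and the `used` bitmap are illustrative names.

```c
#include <stdint.h>

#define PMU_SLOT_COUNT 10  /* assumption: matches the 10-bit mask width */

/* Claim the lowest free slot permitted by `mask`. `used` has bit i set
 * when slot i is already taken. Returns the slot index, or -1 when
 * every permitted slot is occupied. */
static int alloc_slot(uint32_t *used, uint32_t mask)
{
    uint32_t free_slots = mask & ~*used;
    for (int slot = 0; slot < PMU_SLOT_COUNT; slot++) {
        if (free_slots & (1u << slot)) {
            *used |= 1u << slot;
            return slot;
        }
    }
    return -1;
}
```

Starting from an empty bitmap, two general events (mask 0x3FC) receive slots 2 and 3, a Group M event (mask 0x080) receives slot 7, and a second Group M event gets -1.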

Why Order Matters

The allocation algorithm processes counters sequentially. A counter with a wide mask may occupy slots that prevent subsequent counters with specific masks from being allocated.

Example: Ordering Failure Case

graph LR
    subgraph Initial State
        B0["Slot 0: Empty"]
        B1["Slot 1: Empty"]
        B2["Slot 2: Empty"]
        B3["Slot 3: Empty"]
        B4["Slot 4: Empty"]
        B5["Slot 5: Empty"]
        B6["Slot 6: Empty"]
        B7["Slot 7: Empty"]
        B8["Slot 8: Empty"]
        B9["Slot 9: Empty"]
    end

Adding counters in this order fails:

  1. L1D_TLB_ACCESS (mask 1111111100) - occupies slot 2
  2. L1D_TLB_MISS (mask 1111111100) - occupies slot 3
  3. L1D_CACHE_MISS_ST (mask 1111111100) - occupies slot 4
  4. L1D_CACHE_MISS_LD (mask 1111111100) - occupies slot 5
  5. LD_UNIT_UOP (mask 1111111100) - occupies slot 6
  6. ST_UNIT_UOP (mask 1111111100) - occupies slot 7
  7. INST_LDST (mask 0010000000) - needs slot 7, but it is already occupied by ST_UNIT_UOP

Solution: Reorder Counters

Swap ST_UNIT_UOP and INST_LDST so that the single-slot counter is placed first:

  1-5. Same as above
  6. INST_LDST (mask 0010000000) - occupies slot 7
  7. ST_UNIT_UOP (mask 1111111100) - occupies slot 8 (skips occupied slot 7)
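
Both orderings can be replayed with a small simulation. The sketch assumes the first-fit rule described earlier; `place_in_order` and the two arrays are hypothetical names, and the masks are written in hex (0x3FC is 1111111100, 0x080 is 0010000000).

```c
#include <stdint.h>
#include <stddef.h>

/* First-fit allocation over an ordered list of masks. Stores the slot
 * chosen for event i in slots[i] and returns the number of events
 * placed before the first failure (n means every event fit). */
static size_t place_in_order(const uint32_t *masks, size_t n, int *slots)
{
    uint32_t used = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t avail = masks[i] & ~used;
        if (avail == 0)
            return i;                  /* event i has no free slot */
        int slot = 0;
        while (!(avail & (1u << slot)))
            slot++;                    /* find lowest set bit of avail */
        used |= 1u << slot;
        slots[i] = slot;
    }
    return n;
}

/* Six general events followed by INST_LDST (fails at the last event). */
static const uint32_t failing_order[7] = {
    0x3FC, 0x3FC, 0x3FC, 0x3FC, 0x3FC, 0x3FC, 0x080
};

/* Same events with INST_LDST moved ahead of ST_UNIT_UOP (succeeds). */
static const uint32_t working_order[7] = {
    0x3FC, 0x3FC, 0x3FC, 0x3FC, 0x3FC, 0x080, 0x3FC
};
```

With `failing_order` the function returns 6, because the seventh event finds slot 7 taken. With `working_order` it returns 7, placing INST_LDST in slot 7 and ST_UNIT_UOP in slot 8.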

Recommended Ordering Strategy

For predictable behavior, add counters in ascending order of mask value, so that the most constrained counters claim their slots first:

  1. Fixed counters first (FIXED_CYCLES, FIXED_INSTRUCTIONS)
  2. Group M counters (single-slot, mask 0010000000)
  3. Group G counters (three-slot, mask 0011100000)
  4. General counters (wide mask, 1111111100)

graph TD
    A[Start] --> B{Add Fixed Counters}
    B --> C{Add Group M Counters}
    C --> D{Add Group G Counters}
    D --> E{Add General Counters}
    E --> F[Complete]

    B -->|Error| G[Allocation Failed]
    C -->|Error| G
    D -->|Error| G
    E -->|Error| G
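
One way to apply this ordering in a tool is to sort the requested events by mask before handing them to the allocator. The sketch below uses the standard `qsort`; the `event_req` record and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record pairing an event name with its slot mask. */
typedef struct {
    const char *name;
    uint32_t mask;
} event_req;

/* Order by ascending mask value: fixed, Group M, Group G, general. */
static int cmp_mask(const void *a, const void *b)
{
    uint32_t ma = ((const event_req *)a)->mask;
    uint32_t mb = ((const event_req *)b)->mask;
    return (ma > mb) - (ma < mb);
}

static void sort_by_mask(event_req *events, size_t n)
{
    qsort(events, n, sizeof events[0], cmp_mask);
}
```

`qsort` is not stable, so events with identical masks may change relative position. Under the first-fit rule that only changes which slot each of them receives, not whether the whole set fits.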

Implementation Considerations

kperf API Structures

The kpep_event structure contains the critical mask field:

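#include <stdint.h>

typedef uint8_t  u8;          // fixed-width aliases assumed by the struct below
typedef uint32_t u32;

typedef struct kpep_event {
    const char *name;
    const char *description;
    const char *errata;
    const char *alias;        // e.g., "Instructions", "Cycles"
    const char *fallback;
    u32 mask;                 // Critical for compatibility: bit i set means slot i is allowed
    u8 number;
    u8 umask;
    u8 reserved;
    u8 is_fixed;
} kpep_event;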
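
Given the mask field, a tool can bucket events into the categories used in this document. The sketch below hard-codes the mask values from the table above, which is an assumption that may not hold on other chips; `ev_class` and `classify_mask` are hypothetical names.

```c
#include <stdint.h>

typedef enum {
    EV_FIXED,
    EV_GROUP_M,
    EV_GROUP_G,
    EV_GENERAL,
    EV_UNKNOWN
} ev_class;

/* Map a 10-bit slot mask to the category names used in this document. */
static ev_class classify_mask(uint32_t mask)
{
    switch (mask) {
    case 0x001:                 /* 0000000001: cycles */
    case 0x002:                 /* 0000000010: instructions */
        return EV_FIXED;
    case 0x080:                 /* 0010000000: slot 7 only */
        return EV_GROUP_M;
    case 0x0E0:                 /* 0011100000: slots 5-7 */
        return EV_GROUP_G;
    case 0x3FC:                 /* 1111111100: slots 2-9 */
        return EV_GENERAL;
    default:
        return EV_UNKNOWN;      /* a mask not covered by this analysis */
    }
}
```

Returning `EV_UNKNOWN` rather than guessing keeps the tool honest if a future chip reports a mask this analysis has not seen.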

Key Constraints

Constraint            Value   Notes
Maximum counters      10      Based on 10-bit mask width
Fixed counters        2       Always available
Programmable slots    8       Shared among 58 events
Privileges required   sudo    kperf requires root access
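
A tool can check the two cheapest constraints before touching kperf at all. The sketch below uses the POSIX `geteuid` call; `preflight_ok` and `MAX_COUNTERS` are hypothetical names, and the limit of 10 is taken from the table above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_COUNTERS 10  /* assumption: 2 fixed + 8 programmable slots */

/* Returns true when the request is small enough to fit the slot count
 * and the process has root privileges (effective UID 0). */
static bool preflight_ok(size_t requested)
{
    return requested <= MAX_COUNTERS && geteuid() == 0;
}
```

Passing this check does not guarantee a successful allocation, since the mask rules still apply. It only rejects requests that cannot succeed.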

Practical Tools

Lauka

A custom tool created as a result of this research:

  • Forked from the poop tool by Andrew Kelley
  • Incorporates kperf reverse-engineering by ibireme
  • Apple Silicon only (M1, M2, and later)
  • Features:

    • Select events to monitor
    • Display all available events
    • Warm-up runs before measurement
    • Proper counter ordering

Example Output

measurement                 mean ± σ          min … max
wall_time                   591ms ± 7.6ms     583ms … 605ms
peak_rss                    137MB ± 0.3MB     136.6MB … 137.4MB
core_active_cycle           2.51G ± 22.1M     2.48G … 2.54G
inst_all                    3.62G ± 23.9M     2.53G … 3.69G
l1d_cache_miss_ld_nonspec   3.58M ± 31.7K     3.54M … 3.63M
branch_mispred_nonspec      21.4M ± 58.2K     21.3M … 21.5M

Lessons Learned

  1. Research cross-platform: Linux PMU implementations are better documented and can provide insights applicable to Apple Silicon.
  2. Study reverse-engineered code deeply: Early thorough analysis of the kperf structures would have revealed the mask-based allocation system immediately.
  3. Focus on root causes: Spending time on combinatorial analysis (18+ million incompatible cases) was less productive than understanding the underlying allocation algorithm.
  4. Order matters everywhere: Counter ordering affects compatibility even in Apple's own Instruments application.

Last modified: January 12, 2026