Apple Silicon PMU Counters: Analysis and Implementation Guide
Overview
This document analyzes the Performance Monitoring Unit (PMU) counter system on Apple Silicon processors (M1, M2, and later), based on research into Apple's private kperf API. The analysis covers the hardware architecture, counter limitations, compatibility rules, and practical implementation considerations.
Problem Statement
Apple Silicon processors provide PMU counters for tracking microarchitectural events, but Apple does not publicly document:
- The maximum number of counters that can be monitored simultaneously
- Why certain counters are incompatible with each other
- The algorithm for counter allocation
- How counter ordering affects compatibility
This lack of documentation forces developers to rely on trial-and-error or reverse-engineering to use PMU counters effectively.
System Analysis
Fundamental Components
graph TD
subgraph Hardware
A[Fixed Counters]
B[Programmable Counters]
end
subgraph Software
C[kperf Framework]
D[Counter Database]
E[Allocation Algorithm]
end
subgraph User Space
F[Instruments App]
G[Custom Tools]
end
A --> C
B --> C
D --> E
C --> E
E --> F
E --> G
PMU Counter Architecture
Fixed Counters (2)
Apple Silicon provides two fixed counters that are always available:
- Cycles (FIXED_CYCLES): mask 0b0000000001
- Instructions (FIXED_INSTRUCTIONS): mask 0b0000000010
These counters have unique bit masks and are compatible with any other counter.
Programmable Counters (8)
The remaining 8 slots are shared among 58 programmable events. These are allocated using a 10-bit mask system where each bit represents a potential counter slot.
Counter Categories by Mask
| Mask Type | Bit Pattern | Counters | Allocation Rule |
|---|---|---|---|
| Fixed Cycle | 0000000001 | 1 | Unique slot |
| Fixed Instruction | 0000000010 | 1 | Unique slot |
| Group M | 0010000000 | 6 | Single slot (bit 7) |
| Group G | 0011100000 | 18 | 3 slots (bits 5-7) |
| General | 1111111100 | 33 | 8 slots (bits 2-9) |
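The allocation rules in the table follow directly from the bit patterns. As a minimal sketch, the helpers below decode a mask into the number of slots it permits and the lowest slot it can use. The macro and function names are illustrative and not part of kperf; the only assumption is that bit 0 is the rightmost digit of each pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Mask values copied from the table (10 bits, bit 0 = rightmost digit). */
#define MASK_FIXED_CYCLES       0x001u /* 0000000001 */
#define MASK_FIXED_INSTRUCTIONS 0x002u /* 0000000010 */
#define MASK_GROUP_M            0x080u /* 0010000000 */
#define MASK_GROUP_G            0x0E0u /* 0011100000 */
#define MASK_GENERAL            0x3FCu /* 1111111100 */

#define SLOT_COUNT 10

/* Number of slots a mask permits, i.e. the most events sharing
 * exactly this mask that could be counted at the same time. */
int slots_allowed(uint32_t mask) {
    assert(mask < (1u << SLOT_COUNT)); /* masks are 10 bits wide */
    int n = 0;
    for (int slot = 0; slot < SLOT_COUNT; slot++) {
        if (mask & (1u << slot)) n++;
    }
    return n;
}

/* Lowest slot index a mask permits, or -1 for an empty mask. */
int lowest_slot(uint32_t mask) {
    assert(mask < (1u << SLOT_COUNT));
    for (int slot = 0; slot < SLOT_COUNT; slot++) {
        if (mask & (1u << slot)) return slot;
    }
    return -1;
}
```

Here slots_allowed(MASK_GROUP_M) evaluates to 1 and slots_allowed(MASK_GROUP_G) to 3, so no two Group M events and no four Group G events can be active together.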
Group M Counters (6 counters - incompatible in pairs)
INST_ALL
INST_INT_ALU
INST_INT_ST
INST_LDST
INST_SIMD_ALU
RETIRE_UOP
Group G Counters (18 counters - incompatible in quadruples)
BRANCH_CALL_INDIR_MISPRED_NONSPEC
BRANCH_COND_MISPRED_NONSPEC
BRANCH_INDIR_MISPRED_NONSPEC
BRANCH_MISPRED_NONSPEC
BRANCH_RET_INDIR_MISPRED_NONSPEC
INST_BARRIER
INST_BRANCH
INST_BRANCH_CALL
INST_BRANCH_COND
INST_BRANCH_INDIR
INST_BRANCH_RET
INST_BRANCH_TAKEN
INST_INT_LD
INST_SIMD_LD
INST_SIMD_ST
L1D_CACHE_MISS_LD_NONSPEC
L1D_CACHE_MISS_ST_NONSPEC
L1D_TLB_MISS_NONSPEC
Counter Allocation Algorithm
Core Principle
When adding a counter to the monitoring list:
The counter takes the first free slot permitted by its mask, scanning from the lowest bit upward.
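This first-fit rule can be sketched in a few lines of C. This is a model of the behavior described here, not Apple's implementation: allocate_slot and the used bitmap are hypothetical names, and the scan order (bit 0 upward) is taken from the principle stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 10

/* First-fit allocation: scan from bit 0 upward and claim the first slot
 * that the event's mask permits and that is still free in *used.
 * Returns the slot index, or -1 if every permitted slot is taken. */
int allocate_slot(uint32_t mask, uint32_t *used) {
    assert(used != NULL);
    for (int slot = 0; slot < SLOT_COUNT; slot++) {
        uint32_t bit = 1u << slot;
        if ((mask & bit) && !(*used & bit)) {
            *used |= bit; /* mark the slot as occupied */
            return slot;
        }
    }
    return -1;
}
```

Starting from an empty bitmap (used = 0), each call either claims a slot and records it in used, or returns -1 and leaves used unchanged.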
Why Order Matters
The allocation algorithm processes counters sequentially. A counter with a wide mask may occupy slots that prevent subsequent counters with specific masks from being allocated.
Example: Ordering Failure Case
graph LR
subgraph Initial State
B0["Slot 0: Empty"]
B1["Slot 1: Empty"]
B2["Slot 2: Empty"]
B3["Slot 3: Empty"]
B4["Slot 4: Empty"]
B5["Slot 5: Empty"]
B6["Slot 6: Empty"]
B7["Slot 7: Empty"]
B8["Slot 8: Empty"]
B9["Slot 9: Empty"]
end
Adding counters in this order fails:
1. L1D_TLB_ACCESS (mask 1111111100) - occupies slot 2
2. L1D_TLB_MISS (mask 1111111100) - occupies slot 3
3. L1D_CACHE_MISS_ST (mask 1111111100) - occupies slot 4
4. L1D_CACHE_MISS_LD (mask 1111111100) - occupies slot 5
5. LD_UNIT_UOP (mask 1111111100) - occupies slot 6
6. ST_UNIT_UOP (mask 1111111100) - occupies slot 7
7. INST_LDST (mask 0010000000) - needs slot 7, but it's occupied
Solution: Reorder Counters
Swap ST_UNIT_UOP and INST_LDST:
1-5. Same as above
6. INST_LDST (mask 0010000000) - occupies slot 7
7. ST_UNIT_UOP (mask 1111111100) - occupies slot 8 (skips occupied slot 7)
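Both orderings can be replayed against a small model of the first-fit rule. The sketch below is illustrative only: allocate_in_order and the two arrays are hypothetical names, and the masks are the ones quoted in the example (1111111100 for the six general events, 0010000000 for INST_LDST).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT   10
#define MASK_GROUP_M 0x080u /* 0010000000, e.g. INST_LDST */
#define MASK_GENERAL 0x3FCu /* 1111111100, e.g. LD_UNIT_UOP */

/* Place each mask, in the order given, into the lowest free slot it permits.
 * slots[i] receives the slot chosen for masks[i].
 * Returns n on success, or the index of the first mask that cannot be placed. */
size_t allocate_in_order(const uint32_t *masks, size_t n, int *slots) {
    assert(masks != NULL && slots != NULL);
    uint32_t used = 0;
    for (size_t i = 0; i < n; i++) {
        int chosen = -1;
        for (int slot = 0; slot < SLOT_COUNT && chosen < 0; slot++) {
            uint32_t bit = 1u << slot;
            if ((masks[i] & bit) && !(used & bit)) chosen = slot;
        }
        if (chosen < 0) return i; /* no permitted slot is free */
        used |= 1u << chosen;
        slots[i] = chosen;
    }
    return n;
}

/* Six general events followed by INST_LDST: the failing order. */
const uint32_t failing_order[7] = {
    MASK_GENERAL, MASK_GENERAL, MASK_GENERAL, MASK_GENERAL,
    MASK_GENERAL, MASK_GENERAL, MASK_GROUP_M,
};

/* INST_LDST moved ahead of the sixth general event: the working order. */
const uint32_t working_order[7] = {
    MASK_GENERAL, MASK_GENERAL, MASK_GENERAL, MASK_GENERAL,
    MASK_GENERAL, MASK_GROUP_M, MASK_GENERAL,
};
```

Under this model, allocate_in_order(failing_order, 7, slots) stops at index 6 (INST_LDST), while the same call with working_order returns 7, meaning all seven events were placed.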
Recommended Ordering Strategy
For predictable behavior, add counters in ascending order of mask value, so that the most constrained counters claim their slots first:
- Fixed counters first (FIXED_CYCLES, FIXED_INSTRUCTIONS)
- Group M counters (single-slot, mask 0010000000)
- Group G counters (three-slot, mask 0011100000)
- General counters (wide mask, 1111111100)
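One way to apply this ordering in a tool is to sort the requested events by numeric mask value before handing them to the allocator. The sketch below uses the standard qsort; counter_request and sort_for_allocation are hypothetical names introduced for illustration, not kperf types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record pairing an event name with its allocation mask. */
typedef struct {
    const char *name;
    uint32_t mask;
} counter_request;

/* qsort comparator: ascending by numeric mask value, so that
 * fixed (0x001, 0x002) < Group M (0x080) < Group G (0x0E0) < general (0x3FC). */
static int compare_by_mask(const void *a, const void *b) {
    uint32_t ma = ((const counter_request *)a)->mask;
    uint32_t mb = ((const counter_request *)b)->mask;
    return (ma > mb) - (ma < mb);
}

/* Reorder the requests in place into the recommended order. */
void sort_for_allocation(counter_request *requests, size_t n) {
    assert(requests != NULL);
    qsort(requests, n, sizeof requests[0], compare_by_mask);
}
```

qsort is not guaranteed to be stable, so events with identical masks may change their relative order. Under the first-fit model that makes no difference, because such events compete for the same set of slots.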
graph TD
A[Start] --> B{Add Fixed Counters}
B --> C{Add Group M Counters}
C --> D{Add Group G Counters}
D --> E{Add General Counters}
E --> F[Complete]
B -->|Error| G[Allocation Failed]
C -->|Error| G
D -->|Error| G
E -->|Error| G
Implementation Considerations
kperf API Structures
The kpep_event structure contains the critical mask field:
// u32 and u8 are assumed to be aliases for uint32_t and uint8_t.
typedef struct kpep_event {
    const char *name;
    const char *description;
    const char *errata;
    const char *alias;    // e.g., "Instructions", "Cycles"
    const char *fallback;
    u32 mask;             // Critical for compatibility
    u8 number;
    u8 umask;
    u8 reserved;
    u8 is_fixed;
} kpep_event;
Key Constraints
| Constraint | Value | Notes |
|---|---|---|
| Maximum counters | 10 | Based on 10-bit mask width |
| Fixed counters | 2 | Always available |
| Programmable slots | 8 | Shared among 58 events |
| Privileges required | sudo | kperf requires root access |
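The numeric limits in this table lend themselves to a cheap pre-check before any allocation is attempted. The following is a sketch under stated assumptions: event_info is a reduced, hypothetical stand-in for kpep_event that keeps only mask and is_fixed, and is_fixed is assumed to be nonzero for the two fixed counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COUNTERS     10u /* width of the mask */
#define MAX_FIXED         2u
#define MAX_PROGRAMMABLE  8u

/* Reduced, hypothetical stand-in for kpep_event: only the two fields
 * relevant to allocation. */
typedef struct {
    uint32_t mask;
    uint8_t  is_fixed;
} event_info;

/* Check an event list against the limits in the table. Passing this check
 * does not guarantee that allocation succeeds: mask conflicts between
 * events are not examined here. */
bool within_counter_limits(const event_info *events, size_t n) {
    assert(events != NULL || n == 0);
    size_t fixed = 0, programmable = 0;
    for (size_t i = 0; i < n; i++) {
        if (events[i].is_fixed) fixed++;
        else programmable++;
    }
    return n <= MAX_COUNTERS
        && fixed <= MAX_FIXED
        && programmable <= MAX_PROGRAMMABLE;
}
```

A list that passes can still fail during allocation, for example when it contains two Group M events, so this check complements the ordering strategy rather than replacing it.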
Practical Tools
Lauka
A custom tool created as a result of this research:
- Forked from the poop tool by Andrew Kelley
- Incorporates kperf reverse-engineering by ibireme
- Apple Silicon only (M1, M2, and later)
Features:
- Select events to monitor
- Display all available events
- Warm-up capability
- Proper counter ordering
Example Output
measurement mean ± σ min … max
wall_time 591ms ± 7.6ms 583ms … 605ms
peak_rss 137MB ± 0.3MB 136.6MB … 137.4MB
core_active_cycle 2.51G ± 22.1M 2.48G … 2.54G
inst_all 3.62G ± 23.9M 2.53G … 3.69G
l1d_cache_miss_ld_nonspec 3.58M ± 31.7K 3.54M … 3.63M
branch_mispred_nonspec 21.4M ± 58.2K 21.3M … 21.5M
Lessons Learned
- Research cross-platform: Linux PMU implementations are better documented and can provide insights applicable to Apple Silicon.
- Study reverse-engineered code deeply: Early thorough analysis of the kperf structures would have revealed the mask-based allocation system immediately.
- Focus on root causes: Spending time on combinatorial analysis (18+ million incompatible cases) was less productive than understanding the underlying allocation algorithm.
- Order matters everywhere: Counter ordering affects compatibility even in Apple's own Instruments application.
References
- Original article: https://blog.bugsiki.dev/posts/apple-pmu/
- Apple CPU Optimization Guide (requires Apple Developer account)
- kperf reverse-engineering: ibireme's work
- poop tool: Andrew Kelley
- Lauka tool: https://github.com/ (link to be added)