Slide 1
Feb 2016
Linux 4.x Performance: Using BPF Superpowers
Brendan Gregg
Senior Performance Architect
Slide 2
Ten years ago, I gave a talk here about DTrace tools…
Slide 4
Superpowers are coming to Linux
Solve performance issues that were previously impossible. For example, full off-CPU analysis…
Slide 7
Ideal Thread States
A starting point for deeper analysis
Slide 8
Linux Thread States
Based on: TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE
Still a useful starting point
Slide 9
Linux On-CPU Analysis
• I'll start with on-CPU analysis:
• Split into user/kernel states using /proc, mpstat(1), ...
• perf_events ("perf") to analyze further:
  – User & kernel stack sampling (as a CPU flame graph)
  – CPI
  – Should be easy, but…
(figure: CPU flame graph)
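For reference, these commands aren't on the slide, but they are the standard workflow for producing such a CPU flame graph, using the stackcollapse-perf.pl and flamegraph.pl scripts from https://github.com/brendangregg/FlameGraph — sample all CPUs with stacks at 99 Hertz for 30 seconds, then fold and render:

# perf record -F 99 -a -g -- sleep 30
# perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg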
Slide 10
Broken Stacks
Missing Java stacks
Slide 11
Missing Symbols
"[unknown]"
Slide 12
Java Mixed-Mode CPU Flame Graph
• Fixed!
  – Java -XX:+PreserveFramePointer
  – Java perf-map-agent
  – Linux perf_events
(figure annotations: Kernel, Java, JVM, GC)
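In practice (my sketch, not from the slide) the fix looks like this: run Java with frame pointers preserved so perf can walk the stacks, then have perf-map-agent export a /tmp/perf-<pid>.map symbol file that perf_events reads. The helper script path below is from the perf-map-agent repository and may vary:

$ java -XX:+PreserveFramePointer ... MyApp
$ ./perf-map-agent/bin/create-java-perf-map.sh <pid>   # writes /tmp/perf-<pid>.map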
Slide 13
(flame graph axes: y-axis = stack depth, x-axis = samples, alphabetical sort)
Slide 14
Also, CPI Flame Graph
Cycles Per Instruction:
- red == instruction heavy
- blue == cycle heavy (likely mem stalls)
(figure: zoomed)
Slide 15
Linux Off-CPU Analysis
On Linux, the thread state isn't helpful, but the code path is.
Off-CPU analysis: measuring blocked time with stack traces.
Slide 16
Off-CPU Time Flame Graph
(flame graph axes: y-axis = stack depth, x-axis = off-CPU time)
From http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
Slide 17
Off-CPU Time (zoomed): tar(1)
(figure annotations: directory read from disk, file read from disk, fstat from disk, path read from disk, pipe write)
Currently kernel stacks only; user stacks will add more context
Slide 18
Off-CPU Time: more states
(figure annotations: lock contention, sleep, run queue latency)
The flame graph quantifies total time spent in these states
Slide 19
CPU + Off-CPU == See Everything?
Slide 20
Off-CPU Time (zoomed): gzip(1)
Off-CPU analysis doesn't always make sense on its own: what is gzip blocked on?
Slide 21
Wakeup Time Flame Graph
Slide 22
Wakeup Time (zoomed): gzip(1)
gzip(1) is blocked on tar(1)!
tar cf - * | gzip > out.tar.gz
Can't we associate off-CPU with wakeup stacks?
Slide 23
Off-Wake Time Flame Graph
Slide 24
Wakeup stacks are associated and merged in-kernel using BPF
We couldn't do this before
Slide 26
Haven't Solved Everything Yet…
• One wakeup stack is often not enough…
• Who woke the waker?
Slide 27
Chain Graphs
Slide 28
Merging multiple wakeup stacks in kernel using BPF
With enough stacks, all paths lead to metal
Slide 29
Solve Everything
CPU + off-CPU analysis can solve most issues
Flame graph (profiling) types:
1. CPU
2. CPI
3. Off-CPU time
4. Wakeup time
5. Off-wake time
6. Chain
The off-CPU types (3–6) are different analysis views, with more context and increasing measurement cost
BPF makes this all more practical
Slide 30
2. BPF
"One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively."
– Ingo Molnár (Linux developer)
Source: https://lkml.org/lkml/2015/4/14/232
Slide 31
2. BPF
"crazy stuff"
– Alexei Starovoitov (eBPF lead)
Source: http://www.slideshare.net/AlexeiStarovoitov/bpf-inkernel-virtual-machine
Slide 32
BPF
• eBPF == enhanced Berkeley Packet Filter; now just BPF
• Integrated into Linux (in stages: 3.15, 3.19, 4.1, 4.5, …)
• Uses:
  – virtual networking
  – tracing
  – "crazy stuff"
• Front-ends:
  – samples/bpf (raw)
  – bcc: Python, C
  – Linux perf_events
(image: BPF mascot)
Slide 33
BPF for Tracing
• Can do per-event output and in-kernel summary statistics (histograms, etc).
(diagram: a user program 1. generates BPF bytecode and 2. loads it into the kernel, where it attaches to kprobes, uprobes, and tracepoints; the program then 3. asynchronously reads per-event data via perf_output and summary statistics via maps)
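As a concrete illustration of that flow, here is a minimal bcc sketch (my example, not from the talk): the Python front-end compiles the embedded C into BPF bytecode, loads and attaches it to a kprobe, and later asynchronously reads a summary from an in-kernel map:

#!/usr/bin/python
# count_clone.py: count sys_clone() calls in an in-kernel BPF map (bcc sketch)
from bcc import BPF
from time import sleep

b = BPF(text="""
BPF_HASH(counts, u32, u64);                // in-kernel map: key -> count

int do_count(struct pt_regs *ctx) {
    u32 key = 0;
    u64 zero = 0, *val;
    val = counts.lookup_or_init(&key, &zero);
    (*val)++;                              // summarized in kernel context
    return 0;
}
""")
b.attach_kprobe(event="sys_clone", fn_name="do_count")  # generate, load, attach

sleep(5)
for k, v in b["counts"].items():           # async read of the map from user space
    print("sys_clone() calls: %d" % v.value)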
Slide 34
Old way: TCP Retransmits
• tcpdump of all send & receive, dump to file system, post-process
• Overheads add up on 10GbE+
(diagram: tcpdump 1. reads send & receive packets from a kernel buffer and 2. dumps them via the file system to disks; an analyzer then 1. reads the dump, 2. runs a state machine, and 3. prints the retransmits)
Slide 35
New way: BPF TCP Retransmits
• Just trace the retransmit functions
• Negligible overhead
(diagram: tcpretrans (bcc) 1. configures BPF & a kprobe on tcp_retransmit_skb(), then 2. reads and prints the events; send/receive continue as-is)
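A minimal sketch of the approach, assuming bcc (my version; the real tcpretrans(8) also reads addresses, ports, and TCP state): instrument only tcp_retransmit_skb(), so the probe fires once per retransmit rather than once per packet:

#!/usr/bin/python
# retransmit sketch: one event per TCP retransmit, not per packet
from bcc import BPF

b = BPF(text="""
#include <net/sock.h>

int trace_retransmit(struct pt_regs *ctx, struct sock *sk) {
    // skc_num is the local port; a real tool would also read addresses & state
    u16 lport = sk->__sk_common.skc_num;
    bpf_trace_printk("retransmit lport=%d\\n", lport);
    return 0;
}
""")
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")
b.trace_print()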
Slide 36
BPF: TCP Retransmits
# ./tcpretrans
TIME     PID  IP LADDR:LPORT        T> RADDR:RPORT         STATE
01:55:05 0    4  10.153.223.157:22  R> 69.53.245.40:34619  ESTABLISHED
01:55:05 0    4  10.153.223.157:22  R> 69.53.245.40:34619  ESTABLISHED
01:55:17 0    4  10.153.223.157:22  R> 69.53.245.40:22957  ESTABLISHED
[…]
(annotation: includes kernel state)
Slide 37
Old: Off-CPU Time Stack Profiling
• perf_events tracing of sched events, post-process
• Despite buffering, usually high cost (>1M events/sec)
(diagram: perf record 1. asynchronously reads scheduler events from a kernel buffer and 2. dumps them to the file system or a pipe; perf inject 1. reads and 2. rewrites the data; perf report/script then reads, processes, and prints)
Slide 38
New: BPF Off-CPU Time Stacks
• Measure off-CPU time, add to a map with key = stack, value = total time. Async read of the map.
(diagram: offcputime (bcc) 1. configures BPF & a kprobe on the scheduler's finish_task_switch(), 2. asynchronously reads stacks from BPF maps, 3. translates symbols, and 4. prints)
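A simplified sketch of the timing logic (my example, not the tool's code: the real offcputime keys the map by stack trace, via the stack hack on the next slide, rather than by PID): on each context switch, timestamp the thread leaving the CPU and account the blocked interval to the thread coming back on:

#!/usr/bin/python
# off-CPU timing sketch: per-PID blocked time via finish_task_switch()
from bcc import BPF
from time import sleep

b = BPF(text="""
#include <linux/sched.h>

BPF_HASH(start, u32, u64);      // pid -> timestamp when it left the CPU
BPF_HASH(offtime, u32, u64);    // pid -> total ns spent blocked

int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
    u64 ts = bpf_ktime_get_ns();
    u32 prev_pid = prev->pid;
    start.update(&prev_pid, &ts);            // prev is leaving the CPU

    u32 pid = bpf_get_current_pid_tgid();    // current task is switching in
    u64 *tsp = start.lookup(&pid);
    if (tsp) {
        u64 delta = ts - *tsp;
        u64 zero = 0, *total = offtime.lookup_or_init(&pid, &zero);
        (*total) += delta;                   // accumulate off-CPU time
        start.delete(&pid);
    }
    return 0;
}
""")
b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")

sleep(5)                                     # async read of the summary map
for pid, ns in sorted(b["offtime"].items(), key=lambda kv: kv[1].value):
    print("PID %d: %.3f ms off-CPU" % (pid.value, ns.value / 1e6))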
Slide 39
Stack Trace Hack
• For my offcputime tool, I wrote a BPF stack walker:
(code shown on slide)
Slide 40
"Crazy Stuff"
• … using unrolled loops & goto: (code shown on slide; a sketch of the idea follows)
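A hedged reconstruction of the idea (not the slide's exact code): BPF at the time had no loop support and no stack-trace helper, so the walker reads the saved frame-pointer chain with bpf_probe_read() at a fixed, unrolled depth. In a bcc tool this lives in the embedded C text:

// hypothetical sketch of the walker: fetch one frame from the saved RBP chain
static u64 get_frame(u64 *bp) {
    if (*bp) {
        u64 ret = 0;
        // the saved return address sits just above the saved frame pointer
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp + 8)))
            return 0;
        // follow the saved frame pointer to the next frame
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            *bp = 0;
        return ret;
    }
    return 0;
}
// ...then unrolled in the probe handler, since loops aren't allowed:
//     stack[0] = get_frame(&bp); if (stack[0] == 0) goto out;
//     stack[1] = get_frame(&bp); if (stack[1] == 0) goto out;
//     /* ...repeated to a fixed maximum depth... */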
Slide 41
BPF Stack Traces
• Proper BPF stack support just landed in net-next:
Date: Sat, 20 Feb 2016 00:25:05 -0500 (EST)
Subject: Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
From: David Miller <>
From: Alexei Starovoitov <ast@fb.com>
Date: Wed, 17 Feb 2016 19:58:56 -0800
> This patch set introduces new map type to store stack traces and
> corresponding bpf_get_stackid() helper.
...
Series applied, thanks Alexei.
• Allows more than just chain graphs
Slide 42
memleak
• Real-time memory growth and leak analysis:
# ./memleak.py -o 10 60 1
Attaching to kmalloc and kfree, Ctrl+C to quit.
[01:27:34] Top 10 stacks with outstanding allocations:
        72 bytes in 1 allocations from stack
                alloc_fdtable [kernel] (ffffffff8121960f)
                expand_files [kernel] (ffffffff8121986b)
                sys_dup2 [kernel] (ffffffff8121a68d)
[…]
        2048 bytes in 1 allocations from stack
                alloc_fdtable [kernel] (ffffffff812195da)
                expand_files [kernel] (ffffffff8121986b)
                sys_dup2 [kernel] (ffffffff8121a68d)
(annotations: trace for 60s; show kernel allocations older than 10s that were not freed)
• Uses my stack hack, but will switch to BPF stacks soon
• By Sasha Goldshtein. Another bcc tool.
Slide 43
3. bcc
• BPF Compiler Collection
  – https://github.com/iovisor/bcc
• Python front-end, C instrumentation
• Currently beta – in development!
• Some example tracing tools…
Slide 44
execsnoop
• Trace new processes:
# ./execsnoop
PCOMM   PID   RET ARGS
bash    15887 0   /usr/bin/man ls
preconv 15894 0   /usr/bin/preconv -e UTF-8
man     15896 0   /usr/bin/tbl
man     15897 0   /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
man     15898 0   /usr/bin/pager -s
nroff   15900 0   /usr/bin/locale charmap
nroff   15901 0   /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n …
groff   15902 0   /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169 …
groff   15903 0   /usr/bin/grotty
Slide 45
biolatency
• Block device (disk) I/O latency distribution:
# ./biolatency -mT 1 5
Tracing block device I/O... Hit Ctrl-C to end.

06:20:16
    msecs           : count  distribution
      0 -> 1        : 36     |**************************************|
      2 -> 3        : 1      |*                                     |
      4 -> 7        : 3      |***                                   |
      8 -> 15       : 17     |*****************                     |
     16 -> 31       : 33     |**********************************    |
     32 -> 63       : 7      |*******                               |
     64 -> 127      : 6      |******                                |
[…]
Slide 46
ext4slower
• ext4 file system I/O, slower than a threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM   PID  T BYTES  OFF_KB  LAT(ms) FILENAME
06:49:17 bash   3616 R 128    0          7.75 cksum
06:49:17 cksum  3616 R 39552  0          1.34 [
06:49:17 cksum  3616 R 96     0          5.36 2to3-2.7
06:49:17 cksum  3616 R 96     0         14.94 2to3-3.4
06:49:17 cksum  3616 R 10320  0          6.82 411toppm
06:49:17 cksum  3616 R 65536  0          4.01 a2p
06:49:17 cksum  3616 R 55400  0          8.77 ab
06:49:17 cksum  3616 R 36792  0         16.34 aclocal-1.14
06:49:17 cksum  3616 R 15008  0         19.31 acpi_listen
06:49:17 cksum  3616 R 6123   0         17.23 add-apt-repository
06:49:17 cksum  3616 R 6280   0         18.40 addpart
06:49:17 cksum  3616 R 27696  0          2.16 addr2line
06:49:17 cksum  3616 R 58080  0         10.11 ag
06:49:17 cksum  3616 R 906    0          6.30 ec2-meta-data
[…]
Slide 47
bashreadline
• Trace bash interactive commands system-wide:
# ./bashreadline
TIME     PID   COMMAND
05:28:25 21176 ls -l
05:28:28 21176 date
05:28:35 21176 echo hello world
05:28:43 21176 foo this command failed
05:28:45 21176 df -h
05:29:04 3059  echo another shell
05:29:13 21176 echo first shell again
Slide 48
gethostlatency
• Show latency for getaddrinfo/gethostbyname[2] calls:
# ./gethostlatency
TIME     PID   COMM  LATms HOST
06:10:24 28011 wget  90.00 www.iovisor.org
06:10:28 28127 wget   0.00 www.iovisor.org
06:10:41 28404 wget   9.00 www.netflix.com
06:10:48 28544 curl  35.00 www.netflix.com.au
06:11:10 29054 curl  31.00 www.plumgrid.com
06:11:16 29195 curl   3.00 www.facebook.com
06:11:25 29404 curl  72.00 foo
06:11:28 29475 curl   1.00 foo
Slide 49
trace
• Trace custom events. Ad hoc analysis multitool:
# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
TIME     PID  COMM FUNC     -
05:18:23 4490 dd   sys_read read 1048576 bytes
05:18:23 4490 dd   sys_read read 1048576 bytes
05:18:23 4490 dd   sys_read read 1048576 bytes
05:18:23 4490 dd   sys_read read 1048576 bytes
^C
Slide 50
Linux bcc/BPF Tracing Tools
Slide 51
4. Future Work
• All event sources
• Language improvements
• More tools: e.g., TCP
• GUI support
Slide 52
Linux Event Sources
(diagram annotations: some sources marked done, others marked "XXX: todo")
Slide 53
BPF/bcc Language Improvements
Slide 54
More Tools
• e.g., netstat(8)…
$ netstat -s
Ip:
    7962754 total packets received
    8 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    7962746 incoming packets delivered
    8019427 requests sent out
Icmp:
    382 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 125
        timeout in transit: 257
    3410 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 3410
IcmpMsg:
    InType3: 125
    InType11: 257
    OutType3: 3410
Tcp:
    17337 active connections openings
    395515 passive connection openings
    8953 failed connection attempts
    240214 connection resets received
    3 connections established
    7198375 segments received
    7504939 segments send out
    62696 segments retransmited
    10 bad segments received.
    1072 resets sent
    InCsumErrors: 5
Udp:
    759925 packets received
    3412 packets to unknown port received.
    0 packet receive errors
    784370 packets sent
UdpLite:
TcpExt:
    858 invalid SYN cookies received
    8951 resets received for embryonic SYN_RECV sockets
    14 packets pruned from receive queue because of socket buffer overrun
    6177 TCP sockets finished time wait in fast timer
    293 packets rejects in established connections because of timestamp
    733028 delayed acks sent
    89 delayed acks further delayed because of locked socket
    Quick ack mode was activated 13214 times
    336520 packets directly queued to recvmsg prequeue.
    43964 packets directly received from backlog
    11406012 packets directly received from prequeue
    1039165 packets header predicted
    7066 packets header predicted and directly queued to user
    1428960 acknowledgments not containing data received
    1004791 predicted acknowledgments
    1 times recovered from packet loss due to fast retransmit
    5044 times recovered from packet loss due to SACK data
    2 bad SACKs received
    Detected reordering 4 times using SACK
    Detected reordering 11 times using time stamp
    13 congestion windows fully recovered
    11 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 39
    2384 congestion windows recovered after partial ack
    228 timeouts after SACK recovery
    100 timeouts in loss state
    5018 fast retransmits
    39 forward retransmits
    783 retransmits in slow start
    32455 other TCP timeouts
    TCPLossProbes: 30233
    TCPLossProbeRecovery: 19070
    992 sack retransmits failed
    18 times receiver scheduled too late for direct processing
    705 packets collapsed in receive queue due to low socket buffer
    13658 DSACKs sent for old packets
    8 DSACKs sent for out of order packets
    13595 DSACKs received
    33 DSACKs for out of order packets received
    32 connections reset due to unexpected data
    108 connections reset due to early user close
    1608 connections aborted due to timeout
    TCPSACKDiscard: 4
    TCPDSACKIgnoredOld: 1
    TCPDSACKIgnoredNoUndo: 8649
    TCPSpuriousRTOs: 445
    TCPSackShiftFallback: 8588
    TCPRcvCoalesce: 95854
    TCPOFOQueue: 24741
    TCPOFOMerge: 8
    TCPChallengeACK: 1441
    TCPSYNChallenge: 5
    TCPSpuriousRtxHostQueues: 1
    TCPAutoCorking: 4823
IpExt:
    InOctets: 1561561375
    OutOctets: 1509416943
    InNoECTPkts: 8201572
    InECT1Pkts: 2
    InECT0Pkts: 3844
    InCEPkts: 306
Slide 56
Better TCP Tools
• TCP retransmit by type and time
• Congestion algorithm metrics
• etc.
Slide 57
GUI Support
• e.g., Netflix Vector: open source instance analyzer:
Slide 58
Summary
• BPF in Linux 4.x makes many new things possible
  – Stack-based thread state analysis (solve all issues!)
  – Real-time memory growth/leak detection
  – Better TCP metrics
  – etc...
• Get involved: see iovisor/bcc
• So far just a preview of things to come
Slide 59
Links
• iovisor bcc:
  • https://github.com/iovisor/bcc
  • http://www.brendangregg.com/blog/2015-09-22/bcc-linux-4.3-tracing.html
  • http://blogs.microsoft.co.il/sasha/2016/02/14/two-new-ebpf-tools-memleak-and-argdist/
• BPF Off-CPU, Wakeup, Off-Wake & Chain Graphs:
  • http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
  • http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
  • http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
• Linux Performance:
  • http://www.brendangregg.com/linuxperf.html
• Linux perf_events:
  • https://perf.wiki.kernel.org/index.php/Main_Page
  • http://www.brendangregg.com/perf.html
• Flame Graphs:
  • http://techblog.netflix.com/2015/07/java-in-flames.html
  • http://www.brendangregg.com/flamegraphs.html
• Netflix Tech Blog on Vector:
  • http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
• Wordcloud: https://www.jasondavies.com/wordcloud/
Slide 60
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg
Thanks to Alexei Starovoitov (Facebook), Brenden Blanco (PLUMgrid), Daniel Borkmann (Cisco), Wang Nan (Huawei), Sasha Goldshtein (Sela), and other BPF and bcc contributors!
Feb 2016