
Sep 9 2020

Hacking Ethernet out of Fibre Channel cards

A Fibre Channel Card

This story, like another one in the past, started with an eBay purchase I would soon regret. I was scrolling through my favourite eBay supplier when I found a listing titled "Job Lot Of Roughly 350 Various Network Cards". This, of course, piqued my interest. Some close inspection of the photos identified a whole load of cards in the pile that alone were worth the price of the lot, so naturally I bought it.

What I had not mentally planned for was that "Roughly 350" is a lot of anything. This soon became apparent, however, when a pallet truck turned up at my flat in London.

After dealing with the slightly embarrassing logistical challenge of getting a pallet up into my flat, I began to sort the cards into their types and found myself with a lot of two kinds: 4x 1G Intel i20 cards, and a pile of 8G Fibre Channel cards from QLogic (now Marvell) and Emulex (now Broadcom). I had an unreasonable number of these Fibre Channel PCIe cards.

Fibre Channel (FC) is used for connecting systems to remote storage efficiently over a fabric separate from the common Ethernet network. To make things slightly confusing, to the untrained eye FC cards can be mistaken for 10G Ethernet cards, since both take SFP+ optical modules.

SFP+ Optics

After doing some research, it appears that buried in the history of FC there was support for carrying IP traffic over it, presumably so that systems could be connected using FC alone.

There are three RFCs that cover this: RFC2625 (covering the basics), RFC3831 (adding IPv6 support), and a final update to both, RFC4338.

It seems this functionality never really reached full support. IPoFC support used to exist in the QLogic driver and was later removed; there is a public attempt to restore this function to the QLogic driver with some degree of success, just without stability or modern device support.

Surely there is a way to get some use out of these cards? Since I had so many of them, I figured it would not hurt to try to breathe new life into them.

So I grabbed two test machines, installed a single-port QLogic card into each of them, and connected them up with some OM3 cable.

My FC test rig

FC Host Bus Adapters (HBAs) by default look for a target (typically a Storage Area Network (SAN)) to establish a session with. So while connecting two machines together will bring the link up, they will not be able to do anything with each other by default.

For this, one side needs to be in target mode and provide storage. Helpfully, the SCST project has a driver for the QLogic series of FC cards that can make them act as a storage-over-FC target.

After compiling this module and getting it working, I looked into how to use SCST and found, to my delight, that there is a user space module that allows devices to be created entirely in userspace and then exposed to another system over FC.

root@testtop:~# rmmod qla2xxx # Unload the mainline driver
root@testtop:~# modprobe qla2x00tgt # Load the target mode SCST version
root@testtop:~# # Apply basic config to allow the other side use our "disk"
root@testtop:~# scstadmin -config lol.conf 

Collecting current configuration: done.

-> Checking configuration file 'lol.conf' for errors.
	-> Done

-> Applying configuration.
	-> Setting target attribute 'rel_tgt_id' to value '1' for driver/target 'qla2x00t/50:01:43:80:26:68:8e:5c': done.
	-> Adding new group 'net' to driver/target 'qla2x00t/50:01:43:80:26:68:8e:5c': done.
	-> Adding new initiator '50:01:43:80:21:de:b5:d6' to driver/target/group 
'qla2x00t/50:01:43:80:26:68:8e:5c/net': done.
	-> Adding new initiator '21:00:00:24:ff:0e:0e:75' to driver/target/group 
'qla2x00t/50:01:43:80:26:68:8e:5c/net': done.
	-> Enabling driver/target 'qla2x00t/50:01:43:80:26:68:8e:5c': done.
	-> Done, 5 change(s) made.

All done.
root@testtop:~# ./scst-driver & # Start the driver

root@testtop:~# # Attach the virtual disk to the other client.

root@testtop:~# scstadmin -add_lun 0 -driver qla2x00t \
>  -target 50:01:43:80:26:68:8e:5c \
>  -group net -device net3 -attributes read_only=0
Collecting current configuration: done.


-> Making requested changes.
	-> Adding device 'net3' at LUN 0 to driver/target/group 
'qla2x00t/50:01:43:80:26:68:8e:5c/net': done.
	-> Done.
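For reference, the contents of lol.conf are not shown above, but judging from the changes scstadmin reports, it would look roughly like this (a sketch in SCST's config file format, reconstructed from the output rather than copied from the actual file):

# Sketch of lol.conf, reconstructed from the scstadmin output above
TARGET_DRIVER qla2x00t {
	TARGET 50:01:43:80:26:68:8e:5c {
		rel_tgt_id 1
		enabled 1

		GROUP net {
			INITIATOR 50:01:43:80:21:de:b5:d6
			INITIATOR 21:00:00:24:ff:0e:0e:75
		}
	}
}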

This is great because it allows me to create a virtual device using just standard systems programming techniques. The downside is that, while there is documentation for this, the number of in-the-wild code examples that use it is effectively one: the fileio example included in the SCST codebase. That example, however, would prove invaluable, as it is fast (it can read at up to 6Gbit/s) and clean enough to be used as a reference.

As I got deeper into the proof of concept, I realised I could not simply send arbitrary packets back and forth over FC, meaning I would have to implement a SCSI-compliant device just to get network traffic back and forth.

Regardless, I got cracking on with it and decided to see if I could write a virtual device handler in Go. What I was not expecting was that this means basically writing a device from the ground up, which meant getting familiar with the protocol that disk drives speak: SCSI. The slightly annoying bit is that being a disk is surprisingly complex. I now understand why bugs in drive firmware exist: the manuals that give even a medium-depth view of the SCSI command set are over 500 pages long.

One of the most fundamental SCSI commands is INQUIRY: it describes the model and type of the attached device. It is also (apart from the SCSI version of a ping) one of the first commands Linux will send to a SCSI device, and the kernel makes classification decisions based on the INQUIRY response.

You can see the result of this early-stage command by looking in your dmesg for something like:

scsi 4:0:0:0: Direct-Access     ATA      KINGSTON SA400S3 71B1 PQ: 0 ANSI: 5

These are the results of the kernel issuing the INQUIRY command and printing the device class your device responded with (in the above case, a "Direct-access block device"). The device type you report determines how the kernel handles you. Picking a block storage device type results in the kernel automatically sending various information-gathering SCSI commands that can be quite hard to implement correctly. Thankfully, there is a range of other device types you can pick from.

Value | Device type | Command set | Specification
00h | Direct-access block device (e.g. magnetic disk) | SBC Direct Access Commands | SCSI Block Commands (SBC)
01h | Sequential-access device (e.g. magnetic tape) | SSC Sequential Access Commands | SCSI Stream Commands (SSC)
02h | Printer device | SSC Printer Commands | SCSI Stream Commands (SSC)
03h | Processor device | SPC Processor Commands | SCSI Primary Commands (SPC)
04h | Write-once device | SBC Write Once Commands | SCSI Block Commands (SBC)
05h | CD/DVD-ROM device | MMC CD-ROM Commands | SCSI Multimedia Commands (MMC)
06h | Scanner device | SGC Scanner Commands | SCSI Graphics Commands (SGC)
07h | Optical memory device (e.g. some optical disks) | SBC Optical Media Commands | SCSI Block Commands (SBC)
08h | Medium changer (e.g. jukeboxes) | SMC Medium Changer Commands | SCSI Medium Changer Commands (SMC)
09h | Communications device | SSC Communications Commands | SCSI Stream Commands (SSC)
0Ah–0Bh | Defined by ASC IT8 (Graphic arts pre-press devices) | ASC IT8 Prepress Commands |
0Ch | Storage array controller device (e.g. RAID) | SCC Array Controller Commands | SCSI Controller Commands (SCC)
0Dh | Enclosure services device | SES Enclosure Services Commands | SCSI Enclosure Services (SES)
0Eh | Simplified direct-access device (e.g. magnetic disk) | RBC Reduced Block Commands | Reduced Block Commands (RBC)
0Fh | Optical card reader/writer device | OCRW Optical Card Commands | SCSI Specification for Optical Card Reader/Writer (OCRW)
10h | Reserved for bridging expanders | |
11h | Object-based Storage Device | OSD Object-based Storage Commands | Object-based Storage Commands (OSD)
14h | Host managed zoned block device | | Zoned Block Commands (ZBC)

After some playing around, I found that identifying as a scanner resulted in the fewest questions being asked by the Linux kernel: just one or two INQUIRY commands, and then a device node would be created under /dev/sg<n> for your use.

This resulted in a slightly amusing dmesg of:

scsi 6:0:0:0: Scanner           BENJOJO  Network Card lol 350  PQ: 0 ANSI: 6
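That INQUIRY data is just a small fixed-layout buffer. As a reference for how such a response can be put together, here is a minimal sketch in Go (field offsets per the SPC standard INQUIRY format; the strings mirror the dmesg line above, but this is illustrative rather than the actual handler code):

package main

import "fmt"

// buildInquiry returns a minimal 36-byte standard INQUIRY response for a
// scanner-type device (peripheral device type 06h).
func buildInquiry() []byte {
	buf := make([]byte, 36) // 36 bytes is the minimum standard INQUIRY data

	buf[0] = 0x06   // peripheral qualifier 0, device type 06h (scanner)
	buf[2] = 0x06   // version: SPC-4 (shows up as "ANSI: 6" in dmesg)
	buf[3] = 0x02   // response data format
	buf[4] = 36 - 5 // additional length: number of bytes after byte 4

	copy(buf[8:16], padded("BENJOJO", 8))            // T10 vendor ID
	copy(buf[16:32], padded("Network Card lol", 16)) // product ID
	copy(buf[32:36], padded("350", 4))               // product revision

	return buf
}

// padded space-pads s to length n, as SCSI identification strings expect.
func padded(s string, n int) []byte {
	b := []byte(s)
	for len(b) < n {
		b = append(b, ' ')
	}
	return b[:n]
}

func main() {
	fmt.Printf("% x\n", buildInquiry())
}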

Now that we have a minimal virtual SCSI device, we can get to work on making it move Ethernet packets for us. Even if smartctl had some issues parsing it, it was "close enough":

root@black-box:~# smartctl -a /dev/sg1 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               BENJOJO
Product:              Network Card lol
Revision:             350
Compliance:           SPC-4
User Capacity:        52,355,180,385,410,854 bytes [52.3 PB]
Logical block size:   520093954 bytes
scsiModePageOffset: response length too short, resp_len=7 offset=22 bd_len=18
scsiModePageOffset: response length too short, resp_len=7 offset=22 bd_len=18
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

So I decided to build a basic interface that accepts SCSI writes as outbound packets, and allows the client to poll for incoming packets. Like in a previous project, I used TUN/TAP to make a virtual network card on both systems.

Basic Polling Driver

On the FC client, we can use the generic SCSI interface in Linux (sg) to manually fire READ and WRITE commands.
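The client side of the polling design boils down to two loops, roughly like the Go sketch below. Here tapDev stands in for the TUN/TAP file handle, and scsiRead/scsiWrite are hypothetical helpers that wrap SG_IO READ and WRITE commands against the /dev/sg device; this is not the actual driver code, just the shape of it:

package fcnet

import (
	"io"
	"time"
)

// Hypothetical helpers that wrap SG_IO READ/WRITE commands against the
// /dev/sg device exposed by the FC card. Stubs here, for illustration only.
func scsiWrite(frame []byte)   {}
func scsiRead() ([]byte, bool) { return nil, false }

// pollLoop: frames read from the TAP device are sent to the target as SCSI
// WRITEs, and the fake drive is polled with SCSI READs for inbound frames.
func pollLoop(tapDev io.ReadWriter, interval time.Duration) {
	// Outbound path: TAP -> SCSI WRITE
	go func() {
		frame := make([]byte, 9100)
		for {
			n, err := tapDev.Read(frame)
			if err != nil {
				return
			}
			scsiWrite(frame[:n])
		}
	}()

	// Inbound path: poll with SCSI READ -> TAP
	for {
		if frame, ok := scsiRead(); ok {
			tapDev.Write(frame)
			continue // keep draining while packets are queued
		}
		time.Sleep(interval) // 100ms during testing, later dropped to 1ms
	}
}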

This leaves us with a simple polling mode that gives us basic connectivity:

PING 42.42.42.1 (42.42.42.1) 56(84) bytes of data.
64 bytes from 42.42.42.1: icmp_seq=1 ttl=64 time=387 ms
64 bytes from 42.42.42.1: icmp_seq=2 ttl=64 time=194 ms
64 bytes from 42.42.42.1: icmp_seq=3 ttl=64 time=102 ms

But since it was polling the fake drive, it had high latency and abysmally low throughput (less than 80kbit/s). We can quickly improve this by reducing the polling interval from the 100ms used during testing to 1ms.

Polling Max benchmark

This gave us much better latency and throughput; we were nearly at 10Mbit/s with just this change.

However, we are now at the limit of how fast we can reasonably poll for new packets. Ideally we now need to push packets to the other side as they become available. The easy way to do this is to hang a SCSI read request until there is a packet to deliver. However, due to the way SCST's user space mode works (only a single request can be processed at a time), we would have to find a smart way to serve a long-standing SCSI read while still being able to process writes… so I did the easy thing and just made two SCSI devices.

root@black-box:~# dmesg | tail -n 5
[  710.064439] qla2xxx [0000:01:00.0]-500a:6: LOOP UP detected (8 Gbps).
[  710.923684] scsi 6:0:0:0: Scanner           BENJOJO  Network Card lol 350  PQ: 0 ANSI: 6
[  710.924371] scsi 6:0:0:0: Attached scsi generic sg1 type 6
[  710.924979] scsi 6:0:0:1: Scanner           BENJOJO  Network Card lol 350  PQ: 0 ANSI: 6
[  710.925502] scsi 6:0:0:1: Attached scsi generic sg2 type 6

One is designed for SCSI writes (to send packets to the other side), and the other for SCSI reads (to read packets from the other side).

The Ethernet push driver
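On the target side, the idea in Go terms is roughly the sketch below: WRITEs arriving on the first device get fed straight into the TAP interface, while READs on the second device block on a channel until an inbound frame exists to answer them with. The onWrite/onRead functions are hypothetical stand-ins for wherever SCST's user space interface hands us a command, not SCST's actual API:

package fcnet

import "io"

// inbound holds frames read from the TAP device until the client has a
// pending SCSI READ outstanding that we can complete with one of them.
var inbound = make(chan []byte, 256)

// onWrite handles a SCSI WRITE on the first device: the payload is an
// outbound Ethernet frame, so it goes straight out of the TAP interface.
func onWrite(tapDev io.Writer, payload []byte) {
	tapDev.Write(payload)
}

// onRead handles a SCSI READ on the second device. Instead of answering
// immediately, it blocks until a frame arrives; this is what turns the
// polling model into a push model.
func onRead() []byte {
	return <-inbound
}

// tapReader pumps frames from the TAP device into the inbound queue.
func tapReader(tapDev io.Reader) {
	for {
		frame := make([]byte, 9100)
		n, err := tapDev.Read(frame)
		if err != nil {
			return
		}
		inbound <- frame[:n]
	}
}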

This allows us to get low latency with full duplex communications:

root@black-box:~# ping 42.42.42.1
PING 42.42.42.1 (42.42.42.1) 56(84) bytes of data.
64 bytes from 42.42.42.1: icmp_seq=1 ttl=64 time=0.956 ms
64 bytes from 42.42.42.1: icmp_seq=2 ttl=64 time=1.11 ms
64 bytes from 42.42.42.1: icmp_seq=3 ttl=64 time=1.05 ms
^C
--- 42.42.42.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 4ms
rtt min/avg/max/mdev = 0.956/1.036/1.106/0.067 ms

With a throughput of around 100Mbit/s:

root@black-box:~# iperf -c 42.42.42.1
------------------------------------------------------------
Client connecting to 42.42.42.1, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 42.42.42.2 port 52766 connected with 42.42.42.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   105 MBytes  88.2 Mbits/sec

Or with jumbo frames (9000 MTU) enabled:

root@black-box:~# iperf -c 42.42.42.1
------------------------------------------------------------
Client connecting to 42.42.42.1, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 42.42.42.2 port 52752 connected with 42.42.42.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   579 MBytes   485 Mbits/sec

At this point, while we have not reached anywhere near the technical limit of roughly 6Gbit/s, we have basically reached "good enough to use as an out-of-band network" levels of function. I sadly don't have an FC switch to test whether this can go over a fabric, but I don't see any reason why it should not. In a slightly more creative setup, someone could use such a thing as a backup IP network for when systems become totally unreachable.

I suspect that with some effort you could make this faster, but for me personally it's not worth it. Much like the attempted IPoFC restoration, it is likely impractical to get the throughput above gigabit Ethernet without consuming large amounts of CPU in the process, so the value of such a project is minimal. It's also worth pointing out that I did look into modifying the qla2xxx driver itself, but without decent documentation of the QLogic chipsets it seemed reasonably difficult to modify it directly into an Ethernet card.

However, I did manage to make several performance jumps during the project, both with basic things like not emitting debug messages on the hot path, and with slightly more obscure issues I found later involving the performance impact of Go channels. I'm sure you could obtain faster links by chaining many SCSI devices together and then using the multi-queue TUN/TAP feature to avoid blocking on a single TUN/TAP file descriptor.
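For what it's worth, the multi-queue part is just an extra flag when creating the TUN/TAP interface. Here is a minimal Go sketch of opening one queue of a multi-queue TAP device (the interface name fc0 is arbitrary, and the actual driver does not do this):

package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

const (
	tunsetiff     = 0x400454ca // TUNSETIFF ioctl request
	iffTap        = 0x0002     // TAP (Ethernet frames) rather than TUN
	iffNoPi       = 0x1000     // no extra packet information header
	iffMultiQueue = 0x0100     // allow multiple queues on one interface
)

// ifreq mirrors the part of struct ifreq that TUNSETIFF looks at.
type ifreq struct {
	name  [16]byte
	flags uint16
	_     [22]byte // pad to the kernel's struct size
}

// openTapQueue opens one queue of a multi-queue TAP interface. Calling it
// several times with the same name returns multiple file descriptors that
// the kernel load-balances frames across, so each queue could feed its own
// pair of SCSI devices without blocking the others.
func openTapQueue(name string) (*os.File, error) {
	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		return nil, err
	}
	var req ifreq
	copy(req.name[:], name)
	req.flags = iffTap | iffNoPi | iffMultiQueue
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), tunsetiff,
		uintptr(unsafe.Pointer(&req))); errno != 0 {
		f.Close()
		return nil, errno
	}
	return f, nil
}

func main() {
	q, err := openTapQueue("fc0")
	if err != nil {
		fmt.Println("open queue:", err)
		return
	}
	defer q.Close()
	fmt.Println("opened one queue of fc0")
}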

The final full-system flamegraph looked like this (click for interactive); roughly half of the CPU was used for 400Mbit/s with jumbo frames:

Flamegraph of FC Driver

While the same systems with Intel X520 10 gigabit Ethernet cards look like this (click for interactive); CPU load was only a single core for a 10G flow:

Flamegraph of X520 10G iperf

Overall: not great, not terrible.

You can find this rough code over on my GitHub, but it's really best used as a data point for SCST's user space module, since I cannot in good faith recommend that people write virtual SCSI devices in a language like Go.

If you wish to see the internals, you can find them on my GitHub: https://github.com/benjojo/IPoverFC


Since I mentioned at the start that I have a lot of these cards: if anyone is in the London area and is willing to take some of them off my hands, I will happily part with them. I have roughly two Tesco bags of Emulex 8Gbit FC cards and one Tesco bag of QLogic ones. If that's your thing, feel free to email me at fccards@benjojo.co.uk!

If this was interesting to you, you may find other bits you like on the rest of the blog. If you want to stay up to date with my ramblings or projects you can use my blog’s RSS Feed or you can follow me on twitter.