About Me

Quod obstat viae fit via.

Professional Experience

Head of Engineering

Lower East Side Capital, 06/22-present

Responsible for the design, prototyping, development, and validation of all production trading systems with the following properties: low latency, consistency, reproducibility, safety, and scalability. These systems allow Lower East Side Capital to often rank among the top 5 market makers on Binance. They also allow for minimal development time for new exchanges, the average time from conception to production being about 2 days. Last but not least, they suffered from only one production-halting defect in 3 years.

Trading pipeline

Designed and implemented a low-latency trading pipeline for crypto assets in Rust. This pipeline includes market feeds, order management systems, and strategy execution components such as signal execution, position tracking, and risk management.

This design leverages Rust to efficiently handle multi-threading and multi-processing, zero-copy data parsing, dynamic data allocation and copying, and overall memory safety. It also features fully crafted HTTP/1.1, HTTP/2.0, WebSocket, FIX, and SBE implementations, upon which all exchange adapters are implemented.

Depending on the exchange (especially regarding the quality of its protocol), and depending on the strategy, this design operates with tick-to-trade median latencies of about 8us and 99.99% latencies below 200us.

The pipeline can run with the TULIPS user-space TCP/IP stack on top of AWS ENA DPDK driver, further reducing its tick-to-trade latency.

Validation harness

Designed and implemented a trading validation harness that allows scripted trading scenarios to be run against any trading interface to ensure their compliance and periodically collect their latency data to discover any potential regression. It can also be used to speed up the development of new trading interfaces.

Signal compilation

Designed and implemented a signal compiler that takes directed, acyclic signal graphs and generates efficient, statically allocated, almost control-free Rust code that implements the scheduled execution of these signals. Compiled signals can be developed separately from the pipeline and loaded as shared libraries. These signals show 15% improved median latency and up to 40% improved 99.99% latency compared to their interpreted counterparts.

Trading data collection and analysis

Designed and implemented a low-latency trading data collection system that funnels trading events to a separate ancillary thread to be stored as compressed binary files for offline processing. This system comes with a set of analytical tools to generate trading activity insights such as trading latency distributions, internal queuing snapshots, in-flight order states, or trading queue position.

Cloud placement exploration

Designed and implemented graph-based distributed crawlers to determine the best possible instance placement within a given availability zone. The crawlers would use a TCP-based ping system to collect latency data between each other as well as between themselves and external services, and then build the layout of their placement. These results were also used to select the production instances with the best external latency for a given external service.

[ rust, c, c++, user-space tcp/ip, aws, ena ]

Senior Trading Engineer

Galois Capital, 03/21-06/22

Designed and implemented a low-latency trading system for crypto assets. Comprised of a market data acquisition system and an order management system, this architecture leverages the following techniques to achieve low latency: asynchronous I/O, meta-programming, delayed execution, stream-oriented JSON parsing using SIMD instructions, and formally verifiable state machine code generation.

Other responsibilities peripheral to this work are the management of the cloud-based operational infrastructure in multiple geographic locations, the management of the automated task distribution and orchestration, the management of the CI/CD, and the management of the software development process.

[ c++, cmake, python, aws ]

Research Staff Member

AI Platform - Performance and Scale, IBM Research, 03/20-03/21

Led the release of the cloud-native version of IBM Streams as an open-source project under the Apache v2.0 license. The project is now called OpenStreams, and its sources are publicly available.

Distributed Streams Computing, IBM Research, 10/17-03/20

Designed and implemented TULIPS, a multi-threaded, ultra low-latency user-space TCP/IP stack for the IBM Streams product leveraging hardware offload accelerations found on the Mellanox ConnectX-5 device family. This new stack leads to 4x to 6x improved network performance depending on the application workload.

Designed and collaborated on the complete replacement of IBM Streams' distributed platform in favor of Kubernetes, resulting in the removal of 80% of the platform code base without any loss of functionality or guarantees such as fault-tolerance. This new design uses the Operator model to expose IBM Streams' core concepts as first-class citizens within Kubernetes. This work was published and its tooling released as an open-source library.

[ low-latency networking; userspace tcp/ip; kubernetes ]

Senior HW/SW Engineer

Core Performance Team, Tower Research Capital, Inc, 10/15-10/17

Designed and implemented specialized networking hardware to minimize latency of various trading subsystems. Said solutions range from bump-in-the-wire data transcoders to more generic network controllers. Single-handedly authored every aspect of the solution: the hardware architecture, the software interface, and the Linux device driver.

Notably, designed and implemented market data transcoders for CME and LSE used as bump-in-the-wire on shared microwave lines where latency is particularly critical. Both transcoders achieved 200ns wire-to-wire latencies and were designed in record time using the HardCaml framework, effectively making them the first hardware trading systems designed using OCaml to be successfully used in a production environment.

[ verilog; modelsim; quartus; ocaml; c++; linux driver ]

Senior Software Developer

Core Performance Team, Tower Research Capital, Inc, 01/14-10/15

Led the specification and design of the company's tick-to-trade infrastructure worldwide. The role included researching and recommending the proper hardware setup, implementing and distributing the packet logger facility, the automated packet processing system producing daily time series of the trading activities, and latency computation tools to generate tick-to-trade data for each trading team as well as various other network-related metrics. Also designed and implemented a model-driven configuration framework that provides portability, extensibility, and validation to the company's tooling configuration.

[ c++; python; lisp; ibverbs; efvi ]

Research Staff Member

Workload Optimization Group, IBM Research, 07/11-01/14

Explored the use of RDMA-capable, low-latency, high-performance network fabrics to enable next-generation NoSQL data stores and real-time BigData analytics. Revisited common data store issues such as persistence, high-availability, and distributed transactions using the shared memory paradigm.

[ c++; ibverbs; rdma ]

Post-Doctoral Researcher

Workload Optimization Group, IBM Research, 10/10-07/11

Investigated the exploitation of low-level, hardware acceleration mechanisms inside complex, multi-layered application workloads such as business rules management systems and database clients. Investigated solutions to bring zero-copy memory transfers to high levels of commercial software stacks. Implemented an RDMA-based memory update engine over InfiniBand and RoCE using uDAPL.

[ c++; simd; rdma ]

Research Associate

System-Level Synthesis Group, TIMA Laboratory, 09/05-10/10

Designed and implemented DNA-OS, a generic, component-based operating system kernel which targets heterogeneous, multi-core system-on-chips. Using a micro-kernel design, it manages threads, messages, semaphores, dynamic memory, I/Os, drivers, and file system modules. Designed and implemented a component-based, automated environment to generate embedded applications. It includes the support of RISC, DSP, and micro-controller processors while successfully making use of the DNA kernel.

[ c; assembly; arm; mips; sparc; dsp; mpsoc ]

Education

Doctorate in Computer Science

System-Level Synthesis Group, TIMA Laboratory, 09/06-10/10

This dissertation shows that complex, embedded software applications can effectively operate heterogeneous MP-SoC with respect to flexibility, scalability, portability, and Time-To-Market. It presents an improved embedded software design flow that combines an application code generator, GECKO, and a novel software framework, APES, to achieve a high level of efficiency. Our contribution is twofold: 1) an improved embedded software design flow with several tools that enable the automatic construction of minimal and optimized binaries for a given application targeting a given MP-SoC, and 2) a modular and portable set of software components that includes traditional operating system mechanisms as well as the support for multiple processors.

Open-Source Software

Name	Language	Role	Description
ace	C++	Author	Versatile configuration authoring/parsing system
bitstring	OCaml	Maintainer/Author	Bit stream manipulation library
minima.l	C/Lisp	Author	Minimal Lisp interpreter
openstreams	C++/Java/SPL	Maintainer/Author	Streams computing environment
ppx_hardcaml	OCaml	Author	PPX extension of HardCaml
ppx_deriving_hardcaml	OCaml	Author	PPX extension for HardCaml types
tulips	C++	Author	Ultra-low latency user-space TCP/IP stack

Publications

2020

Scott Schneider, Xavier R. Guérin, Shaohan Hu, Kun-Lung Wu: A Cloud Native Platform for Stateful Streaming, ArXiv 2020

2015

Yandong Wang, Li Zhang, Jian Tan, Min Li, Yuqing Gao, Xavier Guérin, Xiaoqiao Meng, Shicong Meng: HydraDB: a resilient RDMA-driven key-value middleware for in-memory cluster computing, SC 2015: 22:1-22:11

2013

Liana L. Fong, Yuqing Gao, Xavier Guérin, Yonggang Liu, T. Salo, Seetharami Seelam, Wei Tan, Sandeep Tata: Toward a scale-out data-management middleware for low-latency enterprise computing, IBM Journal of Research and Development 57(3/4): 6 (2013)

2012

Xavier Guérin, Wei Tan, Yanbin Liu, Seetharami Seelam, Parijat Dube: Evaluation of Multi-core Scalability Bottlenecks in Enterprise Java Workloads, MASCOTS 2012: 308-317

2011

Xavier Guérin, Yanbin Liu, Parijat Dube, Seetharami Seelam, Pierre-Andre Paumelle: Scalability analysis of enterprise Java workloads on a multi-core system, IISWC 2011: 77

2009

Sang-Il Han, Soo-Ik Chae, Lisane B. de Brisolara, Luigi Carro, Katalin Popovici, Xavier Guérin, Ahmed Amine Jerraya, Kai Huang, Lei Li, Xiaolang Yan: Simulink®-based heterogeneous multiprocessor SoC design flow for mixed hardware/software refinement and simulation, Integration 42(2): 227-245 (2009)

Xavier Guérin, Frédéric Pétrot: A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs, ASAP 2009: 153-160 (best paper award)

Alexandre Chagoya-Garzon, Xavier Guérin, Frédéric Rousseau, Frédéric Pétrot, Davide Rossetti, Alessandro Lonardo, Piero Vicini, Pier Stanislao Paolucci: Synthesis of Communication Mechanisms for Multi-tile Systems Based on Heterogeneous Multi-processor System-On-Chips, IEEE International Workshop on Rapid System Prototyping 2009: 48-54

2008

Katalin Popovici, Xavier Guérin, Frédéric Rousseau, Pier Stanislao Paolucci, Ahmed Amine Jerraya: Platform-based software design flow for heterogeneous MPSoC, ACM Trans. Embedded Comput. Syst. 7(4) (2008)

Patrice Gerin, Xavier Guérin, Frédéric Pétrot: Efficient Implementation of Native Software Simulation for MPSoC, DATE 2008: 676-681

2007

Sang-Il Han, Soo-Ik Chae, Lisane B. de Brisolara, Luigi Carro, Ricardo Reis, Xavier Guérin, Ahmed Amine Jerraya: Memory-efficient multithreaded code generation from Simulink for heterogeneous MPSoC, Design Autom. for Emb. Sys. 11(4): 249-283 (2007)

Xavier Guérin, Katalin Popovici, Wassim Youssef, Frédéric Rousseau, Ahmed Amine Jerraya: Flexible Application Software Generation for Heterogeneous Multi-Processor System-on-Chip, COMPSAC (1) 2007: 279-286

Kai Huang, Sang-Il Han, Katalin Popovici, Lisane B. de Brisolara, Xavier Guérin, Lei Li, Xiaolang Yan, Soo-Ik Chae, Luigi Carro, Ahmed Amine Jerraya: Simulink-Based MPSoC Design Flow: Case Study of Motion-JPEG and H.264, DAC 2007: 39-42

Katalin Popovici, Xavier Guérin, Frédéric Rousseau, Pier Stanislao Paolucci, Ahmed Amine Jerraya: Efficient Software Development Platforms for Multimedia Applications at Different Abstraction Levels, IEEE International Workshop on Rapid System Prototyping 2007: 113-122

Lisane B. de Brisolara, Sang-Il Han, Xavier Guérin, Luigi Carro, Ricardo Reis, Soo-Ik Chae, Ahmed Amine Jerraya: Reducing fine-grain communication overhead in multithread code generation for heterogeneous MPSoC, SCOPES 2007: 81-89

2006

Sang-Il Han, Xavier Guérin, Soo-Ik Chae, Ahmed Amine Jerraya: Buffer memory optimization for video codec application modeled in Simulink, DAC 2006: 689-694

Patents

2020

Xavier R. Guérin: Speculative execution in a distributed streaming system, United States US10657091B2

2019

Xavier R. Guérin, Scott Schneider, Xiang Ni: Adaptive locking in elastic threading systems, United States US20190377582A1

Xavier R. Guérin, Shicong Meng: Passive two-phase commit system for high-performance distributed transaction execution, United States US10296371B2

2018

Xavier R. Guérin, Shicong Meng: Multi-way, zero-copy, passive transaction log collection in distributed transaction systems, United States US20150347243A1

2016

Yuqing Gao, Xavier R. Guérin, Xiaoqiao Meng, Tiia Salo: Remote direct memory access (RDMA) high performance producer-consumer message processing, United States 9,495,325

Xavier R. Guérin, Tiia J. Salo: RDMA-optimized high-performance distributed cache, United States 9,378,179

Xavier R. Guérin: Keyboard with macro keys made up of positionally adjustable micro keys, United States 9,335,830

Yuqing Gao, Xavier R. Guérin, Graeme Johnson: High performance, distributed, shared, data grid for distributed Java virtual machine runtime artifacts, United States 9,332,083

Xavier R. Guérin, Yinglong Xia: Scheduling and execution of DAG-structured computation on RDMA-connected clusters, United States 9,300,749

2015

Xavier R. Guérin: Load balancing of distributed services, United States 9,667,711

Parijat Dube, Xavier R. Guérin, Seetharami R. Seelam: Method, apparatus and computer programs providing cluster-wide page management, United States 9,170,950

Xavier R. Guérin, Xiaoqiao Meng, David P. Olshefski, John M. Tracey: Automatic pinning and unpinning of virtual pages for remote direct memory access, United States 9,037,753

Megumi Ito, Michael Dawson, Xavier R. Guérin, Seetharami Seelam: Java native interface array handling in a distributed java virtual machine, United States 8,990,790

2014

Michael H. Dawson, Parijat Dube, Liana L. Fong, Yuqing Gao, Xavier R. Guérin, Michel H. T. Hack, Megumi Ito, Graeme Johnson, Nai K. Ling, Yanbin Liu, Xiaoqiao Meng, Pramod B. Nagaraja, Seetharami R. Seelam, Wei Tan, Li Zhang: Preferential execution of method calls in hybrid systems, United States 8,843,894
