Zhinan Liu

MSEE candidate at UW specializing in embedded systems. Focused on MCU/SoC bring-up, peripheral driver development, and firmware optimization on ESP32 and STM32 platforms. Proficient in C/C++, Python, and assembly (RISC-V, ARM, Xtensa), with knowledge of FreeRTOS and Linux. Experience with communication protocols (I²C, SPI, UART, Bluetooth, BLE). Developed a SIMD-optimized math library for the ESP32-S3 achieving up to 10x integer and 3-5x floating-point speedups over scalar code.


Skills

Languages
  • C, C++, Python, RISC-V ASM, Xtensa ASM, SystemVerilog
MCUs / SoCs & Peripherals
  • ESP32-S3, STM32, Xtensa LX7, ARM Cortex-M7
  • Interfaces: I²C, SPI, UART, USB 3.0, Bluetooth, BLE
Systems
  • Bare-metal, FreeRTOS/RTOS, device drivers, DMA, Embedded Linux/Yocto
Tooling
  • Git, CMake, esp-idf, OpenOCD/JTAG, Docker, Jenkins

Projects

esp_simd

High-level C library wrapping Xtensa SIMD intrinsics for vectorized math on ESP32-S3. Provides safe alignment, saturation handling, and drop-in APIs for esp-idf.

  • Hand-tuned, branchless assembly with zero-overhead loops
  • Vector ops (add, sub, sum, dotp, etc.) for int8/16/32 and float32; benchmarks show roughly 5-10x speedup on integer types and 3-5x on float32 vs. scalar
  • Reproducible benchmarks and unit tests; CMake integration
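The alignment and saturation concerns mentioned above can be illustrated with a small scalar sketch in plain C. The names `is_aligned16`, `sat_add_i8`, and `add_i8_sat_ref` are hypothetical and are not taken from esp_simd. The sketch assumes that 128-bit vector loads expect 16-byte-aligned addresses and that saturation means clamping to the int8 range; the library's actual policy may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if ptr sits on a 16-byte boundary,
 * the boundary assumed here for 128-bit vector loads. */
static bool is_aligned16(const void *ptr)
{
    return ((uintptr_t)ptr & 0xFu) == 0;
}

/* Hypothetical helper: add in a wider type, then clamp to int8_t range. */
static int8_t sat_add_i8(int8_t a, int8_t b)
{
    int16_t sum = (int16_t)((int16_t)a + (int16_t)b);
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* Scalar reference for an element-wise saturating add; a vectorized
 * routine could be unit-tested against a reference like this one. */
static void add_i8_sat_ref(const int8_t *a, const int8_t *b,
                           int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = sat_add_i8(a[i], b[i]);
    }
}
```

A scalar reference of this kind is mainly useful as the ground truth in unit tests, where the vectorized result is compared element by element.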
Scalar vs SIMD ASM (dot product example)
Scalar (baseline)

/*
    C Code:
    int32_t output = 0;
    int8_t *vec1_data = (int8_t*)vec1->data;
    int8_t *vec2_data = (int8_t*)vec2->data;
    for (int i = 0; i < vec1->size; i++){
        int a = (int)vec1_data[i];
        int b = (int)vec2_data[i];
        output +=  a * b;
    }
    *result = output;
    return VECTOR_SUCCESS;
*/

420169d4:   08d8        l32i.n  a13, a8, 0
420169d6:   03e8        l32i.n  a14, a3, 0
420169d8:   0a0c        movi.n  a10, 0
420169da:   0acd        mov.n   a12, a10
420169dc:   0005c6      j   420169f7 
420169df:   00          .byte   00
420169e0:   8daa        add.n   a8, a13, a10
420169e2:   000882      l8ui    a8, a8, 0
420169e5:   238800      sext    a8, a8, 7
420169e8:   beaa        add.n   a11, a14, a10
420169ea:   000bb2      l8ui    a11, a11, 0
420169ed:   23bb00      sext    a11, a11, 7
420169f0:   8288b0      mull    a8, a8, a11
420169f3:   cc8a        add.n   a12, a12, a8
420169f5:   aa1b        addi.n  a10, a10, 1
420169f7:   e53a97      bltu    a10, a9, 420169e0 
420169fa:   04c9        s32i.n  a12, a4, 0
                
SIMD (esp_simd)
 
simd_dotp_i8:
    entry a1, 16                                    // reserve 16 bytes for the stack frame
    extui a6, a5, 0, 4                              // extracts the lowest 4 bits of a5 into a6 (a5 % 16), for tail processing
    srli a5, a5, 4                                  // shift a5 right by 4 to get the number of 16-byte blocks (a5 / 16)
    movi.n a7, 0                                    // zeros a7
    beqz a5, .Ltail_start                           // if no full blocks (a5 == 0), skip SIMD and go to scalar tail

    // SIMD mul-accumulate loop for 16-byte blocks 
    ee.zero.accx                                    // clears the ACCX accumulator
    ee.vld.128.ip     q0, a2, 16                    // loads 16 bytes from a2 into q0, then increments a2 by 16
    loopnez a5, .Lsimd_loop                         // zero-overhead loop: runs the body a5 times
        ee.vld.128.ip     q1, a3, 16                // loads 16 bytes from a3 into q1, then increments a3 by 16
        ee.vmulas.s8.accx.ld.ip q0, a2, 16, q0, q1  // multiply-accumulates q0 and q1 into ACCX, then loads the next 16 bytes from a2 into q0 and increments a2 by 16
    .Lsimd_loop:

    rur.accx_0 a7                                   // read the lower 32 bits of ACCX into a7
    addi a2, a2, -16                                // adjust a2 pointer back to the last processed element (it goes too far due to the last increment in the loop)

    .Ltail_start:                                   // Handle remaining elements that were not part of a full 16-byte block  
    loopnez a6, .Ltail_loop 
        l8ui a8, a2, 0
        l8ui a9, a3, 0
        sext a8, a8, 7
        sext a9, a9, 7
        mull a8, a8, a9
        add a7, a7, a8 
        addi a2, a2, 1
        addi a3, a3, 1
    .Ltail_loop:  
        
    s32i.n a7, a4, 0
    movi.n a2,  0                                   //return exit code 0 (success)
    retw.n 
                

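Notes: the SIMD loop consumes 16 int8 elements (one 128-bit Q register) per iteration, and the remaining size % 16 elements go through the scalar tail loop. [insert alignment strategy, saturation/rounding mode, and measured cycles here].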
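The control flow of the SIMD listing above (full 16-byte blocks first, then a scalar tail) can be modeled in plain C. This is a sketch for reading and testing purposes, not the library's code: the function name `dotp_i8_model` and the `BLOCK_BYTES` constant are hypothetical, and the inner lane loop only stands in for the single vector multiply-accumulate instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 16  /* one 128-bit vector register holds 16 int8 lanes */

/* Plain-C model of the block + tail structure of an int8 dot product. */
static int32_t dotp_i8_model(const int8_t *a, const int8_t *b, size_t n)
{
    size_t blocks = n >> 4;   /* number of full 16-element blocks (n / 16) */
    size_t tail   = n & 0xF;  /* leftover elements (n % 16) */
    int32_t acc = 0;

    /* Block loop: each pass models one vector multiply-accumulate. */
    for (size_t blk = 0; blk < blocks; blk++) {
        for (size_t lane = 0; lane < BLOCK_BYTES; lane++) {
            acc += (int32_t)a[lane] * (int32_t)b[lane];
        }
        a += BLOCK_BYTES;
        b += BLOCK_BYTES;
    }

    /* Tail loop: one element at a time, as in the scalar tail. */
    for (size_t i = 0; i < tail; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Splitting the length into `n >> 4` blocks and `n & 0xF` leftovers matches the `srli`/`extui` pair at the top of the assembly routine, so lengths that are not a multiple of 16 still produce a complete result.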

Benchmarks: esp_simd vs scalar
Figure: esp_simd speedups over scalar across operations and vector sizes
Figure: esp_simd operation runtime for 32 vectors of random length 1–256
Tech: C, C++, Xtensa ASM, esp-idf, CMake

Experience

Researcher

Harborview Medical Center

Research Engineer within HIPRC. Developed automated data pipelines for clinical research; created and maintained a trauma-transfusion database connecting trauma admissions to patient blood use; created and deployed analytical and predictive models from large trauma datasets.

  • Emphasis on robust, reproducible pipelines and production deployment (version control, CI/CD).
Feb 2021 – May 2024

Student Assistant

University of Washington, Lieber Lab

Researched adenovirus-based gene therapy (Hemophilia A, β-Thalassemia). Analyzed off-target CRISPR mutagenesis from large genomic datasets.

  • Built analysis tooling and workflows; collaborated across engineering/research teams.
August 2016 – August 2020

Education

University of Washington

Master of Science, Electrical Engineering

Relevant Coursework: Computer Architecture, Embedded Software Design, Data Structures & Algorithms

Sep 2023 – Dec 2025 (est.)

University of Washington

Bachelor of Science, Biochemistry
Sep 2016 – Jun 2020

Publications

Selected publications