How does NVBit implement instrumentation?

I read the NVBit paper and am wondering how it implements instrumentation.

According to the paper, the original code is modified into instrumented code, which calls a trampoline to execute the user-defined instrumentation function.

  1. Is the instrumented code generated on the GPU side, or is it a new kernel generated on the CPU side and then transferred to the GPU?
  2. Is the trampoline pre-generated on the CPU side and built into the instrumented code, or is it generated dynamically on the GPU side?