On-Chain Data Layout — Learning to Read Bytes

I'm staring at a blob of hexadecimal. Two hundred and forty bytes of it, fetched from a Solana account using an RPC call. No labels, no field names, no commas, no brackets. Just a wall of characters: d0 b6 63 8a ab 8c 91 2b 7c 9a 1e ... stretching across my terminal. This is the data inside a DEX pool account. Somewhere in this sequence of bytes are the pool's reserves, its fee configuration, authority keys, bump seeds, and everything else that defines how the pool operates on-chain. I know the information is there. I have no idea how to read it.

It feels exactly like the first time I opened a disassembled binary in college. Not source code — the machine's version. Opcodes and addresses and register references that meant nothing without documentation and a lot of patience. Except this time, the "machine" is a blockchain, and the "binary" is the state of a DeFi protocol that I need to parse correctly if my arbitrage bot is going to function at all.

This is not optional. My bot needs to read pool state to calculate swap outputs. It needs to know the reserves. It needs to know the fee rate. It needs to know which mint addresses correspond to which token in the pool. All of that information is sitting in these bytes, and if I read the wrong offset — even by a single byte — every number downstream is garbage.

Everything Is a Byte Array

The first thing you learn about Solana that separates it from what most developers expect: account data is a raw byte array. Not JSON. Not a database row. Not a struct you can deserialize with a single function call and trust the output. It's bytes. Just bytes.

Every account on Solana — whether it holds a token balance, stores a pool's configuration, or tracks the state of a lending protocol — has a data field that is a Vec<u8>. A vector of unsigned 8-bit integers. The blockchain itself assigns no meaning to these bytes. It doesn't know that bytes 8 through 39 are a public key, or that bytes 40 through 47 are a u64 representing a reserve amount. The on-chain program that owns the account knows what those bytes mean, because the program wrote them in a specific layout. Everyone else has to figure it out.
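
To make this concrete, here is a minimal Python sketch of fetching that raw byte array over JSON-RPC. The endpoint is the public mainnet RPC, the pool address is a placeholder, and the requests library is assumed; the point is only that what comes back is an unlabeled blob of bytes.

import base64
import requests

# Placeholder account address -- substitute the pool you actually want to read.
RPC_URL = "https://api.mainnet-beta.solana.com"
POOL_ADDRESS = "PoolAccountAddressGoesHere"  # hypothetical

resp = requests.post(RPC_URL, json={
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getAccountInfo",
    "params": [POOL_ADDRESS, {"encoding": "base64"}],
})
value = resp.json()["result"]["value"]

# value["data"] is a [base64_string, "base64"] pair; decoding it yields the
# raw contents -- no field names, no structure, just bytes.
data = base64.b64decode(value["data"][0])
print(len(data), data[:16].hex(" "))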

Think of it like a shipping container at a port. The container is a standardized steel box — same dimensions, same locking mechanism, same handling equipment regardless of contents. The port authority doesn't know if it's carrying electronics, furniture, or frozen fish. Only the shipper and receiver know what's inside and how it's packed. They agreed on a packing layout before the container was sealed. If you're a third party trying to open that container and inventory its contents, you need the packing manifest. Without it, you're just looking at boxes stacked in a metal rectangle.

Solana accounts are those containers. The on-chain program is the shipper who packed them. My bot is the third party trying to read the contents without an explicit manifest. The "manifest" exists — it's the program's source code, which defines the struct layout — but as I'm about to learn, that manifest isn't always accurate.

Borsh: Sequential, No Gaps, No Mercy

The serialization format used by most Solana programs is Borsh — Binary Object Representation Serializer for Hashing. The name sounds academic, but the concept is simple: fields are stored one after another, in the exact order they're defined in the struct, with no padding and no alignment gaps.

This is different from how many programmers think about memory layout. In C, for example, a struct with a u8 field followed by a u64 field will typically insert 7 bytes of padding between them to align the u64 on an 8-byte boundary. The compiler does this for performance reasons — aligned memory access is faster on most hardware. But it means the struct in memory is bigger than the sum of its fields, and the layout depends on the compiler and the target architecture.

Borsh doesn't do any of that. A u8 followed by a u64 in Borsh takes exactly 9 bytes: 1 + 8. No padding. No alignment. The next field starts immediately after the previous one ends. The order in the struct definition is the order in the serialized bytes. Period.
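
Python's struct module makes the contrast easy to see: native C-style alignment ("@") inserts padding, while the packed little-endian format ("<") lays fields out back to back, which is effectively how Borsh sizes these primitive types. This is only an illustration of the sizing, not a Borsh implementation.

import struct

# C-style native alignment pads the u64 to an 8-byte boundary:
# 1 + 7 + 8 = 16 bytes on a typical 64-bit platform.
print(struct.calcsize("@BQ"))                 # 16

# Packed layout, the way Borsh stores it: 1 + 8 = 9 bytes, no padding.
print(struct.calcsize("<BQ"))                 # 9

# A u8 of 5 followed by a u64 of 1000, little-endian, back to back:
print(struct.pack("<BQ", 5, 1000).hex(" "))   # 05 e8 03 00 00 00 00 00 00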

This is both a blessing and a curse. The blessing: the layout is deterministic and platform-independent. If you know the struct definition, you can calculate the exact byte offset of every field with simple addition. The curse: there is zero tolerance for error. If you think a field is 4 bytes but it's actually 8, every offset after that point is wrong by 4 bytes. And Borsh won't tell you. There's no length prefix, no field delimiter, no checksum. You'll read bytes from the wrong positions and get numbers that look plausible but are completely wrong.

It's like reading a Social Security number where the groups have been concatenated without dashes. 123456789 — if you know the format is 3-2-4, you parse it as 123-45-6789. But if you mistakenly think the format is 4-2-3, you get 1234-56-789. Both are valid-looking sequences of digits. Neither will error out. You'll just file the wrong paperwork and discover the mistake weeks later when the IRS sends a notice.

The Anchor Discriminator: Eight Bytes That Name the Type

Most Solana programs built with the Anchor framework start every account's data with an 8-byte discriminator. This is the first 8 bytes of the SHA-256 hash of the string "account:<AccountName>", where AccountName is the name of the Rust struct that defines the account's layout.

For example, if a program defines an account struct called PoolState, the discriminator would be the first 8 bytes of SHA256("account:PoolState"). When the program initializes an account, it writes these 8 bytes at the very beginning of the data field. When it reads the account later, it checks these 8 bytes to verify the account type before proceeding. It's a runtime type check embedded directly in the data.
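
Reproducing the discriminator off-chain takes a few lines. A sketch in Python, using the hypothetical PoolState name from above:

import hashlib

# Anchor's discriminator: first 8 bytes of SHA-256("account:<AccountName>").
def anchor_discriminator(account_name: str) -> bytes:
    return hashlib.sha256(f"account:{account_name}".encode()).digest()[:8]

disc = anchor_discriminator("PoolState")
print(disc.hex(" "))

# Before parsing anything else, check that the account is the type you think:
# if data[:8] != disc: raise ValueError("not a PoolState account")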

Why does this exist? Because Solana programs receive accounts as raw byte arrays. A program that manages multiple account types — pools, configurations, user positions, fee vaults — needs a way to distinguish them. Without a discriminator, a UserPosition account passed where the program expects a PoolState account would be silently interpreted using the wrong layout, likely producing garbage values and potentially opening the door to exploits. The discriminator acts as a safety seal. If the first 8 bytes don't match the expected hash, the program rejects the account immediately.

Think of it like the VIN on a car. The first few characters encode the manufacturer, vehicle type, and manufacturing plant. Before a mechanic orders parts, they check the VIN to confirm the vehicle model. You wouldn't install a transmission designed for a sedan into a pickup truck, even if both vehicles happen to be parked in the same garage. The VIN prevents that mistake. The discriminator prevents the program from treating a fee configuration account as if it were a pool state account.

For my bot, the discriminator has a specific practical implication: the actual pool data starts at byte offset 8, not byte offset 0. Every field offset calculation must account for this 8-byte prefix. Miss it, and every field is shifted by 8 bytes. The public key that should be read at offset 8 gets read from offset 0 instead, grabbing the discriminator bytes plus the first 24 bytes of the actual first field. The resulting "public key" is meaningless but doesn't look obviously wrong — it's still 32 bytes of hex — until you try to use it on-chain and nothing works.

Not all programs use Anchor, of course. Native Solana programs — ones written without the Anchor framework — may not have a discriminator at all. Their data starts directly with the first field at offset 0. Or they might use their own type identification scheme. This is another thing you have to know about each specific program. There is no universal convention. You have to read the source code, or find documentation, or — as I'm going to discover — verify empirically.

Calculating Offsets: Arithmetic With Consequences

Once I understand the basics — Borsh serialization, Anchor discriminator, sequential field layout — calculating offsets feels almost too simple. It's addition.

Take a hypothetical pool account with an Anchor discriminator. The layout might be:

Bytes 0-7:     Discriminator (8 bytes)
Bytes 8-39:    amm_config (Pubkey, 32 bytes)
Bytes 40-71:   pool_creator (Pubkey, 32 bytes)
Bytes 72-103:  token_a_vault (Pubkey, 32 bytes)
Bytes 104-135: token_b_vault (Pubkey, 32 bytes)
Bytes 136-143: reserve_a (u64, 8 bytes)
Bytes 144-151: reserve_b (u64, 8 bytes)
...

Each field's starting offset is the sum of all preceding fields' sizes. amm_config starts at 8 (after the discriminator). pool_creator starts at 8 + 32 = 40. token_a_vault starts at 40 + 32 = 72. And so on. The arithmetic is elementary. A fifth-grader could do it.
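
In code, the offset table is just a running sum over the field sizes. A sketch using the hypothetical layout above:

# Field sizes from the hypothetical layout above; offsets are running sums.
FIELDS = [
    ("discriminator", 8),
    ("amm_config", 32),
    ("pool_creator", 32),
    ("token_a_vault", 32),
    ("token_b_vault", 32),
    ("reserve_a", 8),
    ("reserve_b", 8),
]

OFFSETS = {}
cursor = 0
for name, size in FIELDS:
    OFFSETS[name] = (cursor, size)
    cursor += size

for name, (start, size) in OFFSETS.items():
    print(f"{name:14} bytes {start}-{start + size - 1}")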

The difficulty isn't the arithmetic. The difficulty is knowing the correct type and size of each field.

Is that status field a u8 (1 byte), a u32 (4 bytes), or a u64 (8 bytes)? The answer changes every offset after it. Is there a bump seed stored as a single byte between two public keys? That shifts everything after it by one. Does the struct include an Option<Pubkey>, which in Borsh is serialized as a u8 tag (0 or 1) followed conditionally by 32 bytes? That means the field is either 1 byte or 33 bytes depending on the value, and every subsequent offset depends on whether the option is Some or None.

This is where the precision matters. I'm not reading a JSON response where I can access pool.reserve_a by name. I'm indexing into a byte array: data[136..144] to get the 8 bytes of reserve_a, then converting those bytes from little-endian to a u64. If I get the offset wrong — if reserve_a actually starts at byte 144 instead of 136, because I miscounted a field somewhere — I'm reading the wrong 8 bytes. The conversion to u64 will still succeed. It'll produce a number. That number will be wrong. And my bot will calculate swap outputs based on phantom reserves that don't exist.
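
The extraction itself is a slice and a little-endian conversion. A sketch, reusing the raw data bytes and the OFFSETS table from the earlier sketches:

# Correct read: 8 bytes at the calculated offset, interpreted little-endian.
start, size = OFFSETS["reserve_a"]                      # (136, 8)
reserve_a = int.from_bytes(data[start:start + size], "little")

# A wrong offset does not raise an error -- it just returns a different,
# confidently wrong number built from someone else's bytes.
phantom = int.from_bytes(data[start + 4:start + 12], "little")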

This is what makes byte-level parsing so dangerous. There are almost no runtime errors. Almost every mistake produces a valid-looking result. The bytes don't care what you think they mean.

It's like misreading a UPC barcode at a grocery store self-checkout. The scanner reads the bars, converts them to digits, looks up the product. If the code is smudged and one digit is wrong, the system doesn't crash — it just rings up a different product. You scan a gallon of milk but the register shows a bag of rice at a different price. The system worked perfectly. It just read the wrong data.

When Source Code Lies

Here's where theory crashes into reality. I'm parsing pool accounts for a certain AMM protocol. I find the Rust source code on GitHub, read the struct definition, carefully calculate every offset, and write my parser. I fetch a pool account from the chain, run my parser, and check the results against known values.

The numbers don't match.

I know the fee rate for this specific pool — I can see it in the protocol's UI. My parser says it's something completely different. I check the reserves against what the UI reports. Off by orders of magnitude. The public keys I'm extracting don't match any known program addresses or token mints.

I double-check my arithmetic. It's correct. I re-read the source code. The struct definition clearly shows the field order. I've added up the sizes correctly. And yet the on-chain data doesn't align with my calculated offsets.

The problem — and I spend an embarrassing amount of time finding this — is that the first field after the discriminator isn't what I assumed. I'm looking at the struct definition and seeing what appears to be a 1-byte bump field as the first entry. A bump seed, stored as a u8. One byte. So I calculate: discriminator (8 bytes) + bump (1 byte) = next field starts at offset 9.

But when I actually examine the on-chain data byte by byte, comparing against a pool whose config address I know, that 32-byte config address starts right after the discriminator — not one byte later. There is no bump byte where I expected it. The struct I'm reading in the source code either doesn't match the deployed on-chain program, or I'm misreading the struct hierarchy. The bump field exists in the struct, but it's stored much later — after several public key fields, hundreds of bytes deeper into the account, not at the position I assumed.

This is a lesson that costs me a full day of debugging.

And it's not the last time. Working with a different DEX protocol, I encounter a field near the beginning of the account data. Looking at the source code, my initial reading suggests this field is a 4-byte integer. So I calculate the next field's position based on that assumption. But when I verify against on-chain data, the value I expect to find at that offset is nowhere to be seen — it's sitting several bytes further along. The field is actually an 8-byte integer, not 4. A single wrong type assumption shifts every subsequent offset in my parser, and every field after that point reads garbage.

Both mistakes share the same root cause: I trusted the source code as the ground truth without verifying against the actual deployed program's behavior. Source code can mislead for several reasons, and I'm learning each of them the hard way.

The deployed program may not match the latest source code. Solana programs are deployed as compiled BPF bytecode. The source code on GitHub might have been updated after the last deployment. Fields could have been reordered, resized, added, or removed in the source without a corresponding on-chain upgrade. The bytes on-chain reflect the program that's actually running, not the code in the latest commit.

Rust struct layouts can be deceptive. Nested structs, trait implementations, macros that generate fields, and conditional compilation can all make the actual field order different from what a quick read of the code suggests. A field that appears early in the struct definition might be inherited from a parent type and serialized in a different position. Anchor's #[account] attribute macro adds the discriminator implicitly — you won't see an 8-byte field in the struct definition, but it's there in the serialized data.

Different account types within the same program have different layouts. A program might have five different account structs, each with its own field order and sizes. Using the wrong struct definition for a given account is an easy mistake to make, especially when account names are similar or when the program has been through multiple versions.

Native programs skip conventions. Programs written without Anchor may not follow any of the patterns I've described. No discriminator. Custom serialization instead of Borsh. Fields stored in unexpected orders for performance reasons. Each program is its own world, and assumptions that hold for one program are invalid for the next.

The Cross-Verification Habit

After being burned twice, I develop a practice that I now follow every time I need to parse a new account type: cross-verification against known values.

The method is straightforward. I find an account whose contents I can partially predict. For a pool account, I might know the token mint addresses — they're publicly documented or visible in the protocol's explorer. I might know the fee rate from the protocol's documentation. I might know the authority address from the program's configuration.

Then I fetch the raw account data and search for those known values at the offsets I've calculated. A Solana public key is 32 bytes, and a specific key is a unique sequence. If I expect the SOL mint address to appear at offset 72, I check whether bytes 72 through 103 of the account data match the known SOL mint. If they do, my offset calculation up to that point is correct. If they don't, I search for that 32-byte sequence elsewhere in the data to find where it actually lives.
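
A sketch of that check in Python, using the wrapped SOL mint as the known value. The base58 package is a third-party assumption, and the expected offset of 72 is taken from the example in this paragraph:

from base58 import b58decode   # assumption: third-party base58 package

# Wrapped SOL mint -- a publicly known 32-byte value I can search for.
SOL_MINT = b58decode("So11111111111111111111111111111111111111112")

EXPECTED = 72
if data[EXPECTED:EXPECTED + 32] == SOL_MINT:
    print(f"offset hypothesis holds through byte {EXPECTED}")
else:
    # Find where the known key actually lives; the gap tells me how far
    # off my field-size assumptions are.
    actual = data.find(SOL_MINT)
    print(f"expected at {EXPECTED}, actually found at {actual}")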

It's the same approach a cryptanalyst uses to break an unknown encoding. If you suspect a message contains the word "HELLO" somewhere, you look for its encoded form. Finding it confirms your encoding hypothesis and anchors the rest of your analysis. Without that anchor, you're guessing.

Or think of it like calibrating a tape measure. Before you measure anything that matters — before you cut expensive lumber for a renovation — you check the tape against a known reference. The edge of a standard sheet of plywood is 48 inches. If your tape reads 47.5 at the edge, you know it's off by half an inch. You don't trust the tool; you verify the tool.

In practice, this means my parsing workflow for any new account type looks like this:

  1. Read the source code and calculate theoretical offsets. This gives me a hypothesis.
  2. Fetch actual account data from the chain using an RPC call. This gives me reality.
  3. Identify known values — mint addresses, known config addresses, expected fee rates — that I can predict.
  4. Check those values at the calculated offsets. If they match, the hypothesis is confirmed up to that point.
  5. If they don't match, search the raw data for the known value's byte pattern to find its actual position. The discrepancy reveals where my offset calculation went wrong, which tells me which field has a different size than I assumed.

This process takes extra time — maybe an hour per account type — but it has saved me from deploying a bot with fundamentally broken data parsing. The cost of getting this wrong isn't an error message. It's silent miscalculation. The bot runs, finds "opportunities" based on phantom reserves, builds transactions that fail on-chain, and I waste time and fees debugging what looks like a transaction construction problem when the real problem is a parsing problem three layers down.

The Subtlety of Endianness

A detail that catches many developers the first time: Borsh uses little-endian byte order for integers. If a u64 has the value 1000 (decimal), the bytes in memory are e8 03 00 00 00 00 00 00, not 00 00 00 00 00 00 03 e8. The least significant byte comes first.

This matters because if you're manually inspecting hex dumps — which you do, constantly, when debugging offset issues — the numbers look backwards. A reserve of 1,000,000,000 lamports (1 SOL) is 00 ca 9a 3b 00 00 00 00 in the raw data, not 00 00 00 00 3b 9a ca 00. If you're eyeballing the hex and mentally converting, you need to reverse the byte order first.

It's like reading a phone number that's been written digits-reversed for some encoding scheme. 5551234567 becomes 7654321555. If you try to call the number as printed, you get a wrong number. You have to know the convention and reverse it yourself.

Most programming languages have library functions for this conversion — Rust has u64::from_le_bytes(), Python has int.from_bytes(data, 'little') — so in practice, you write the conversion once and forget about it. But when you're debugging, when you're staring at raw hex trying to figure out why your reserves look wrong, you need to remember that the bytes are backwards. It's one more thing the data doesn't tell you about itself.
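
The round trip in Python, using the 1 SOL reserve from above, is a minimal sketch of that conversion:

value = 1_000_000_000                       # 1 SOL, in lamports

# Borsh stores integers little-endian: least significant byte first.
le = value.to_bytes(8, "little")
print(le.hex(" "))                          # 00 ca 9a 3b 00 00 00 00

# Parsing it back out of account data:
print(int.from_bytes(le, "little"))         # 1000000000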

Option Types and Variable-Length Fields

Most pool account structs use fixed-size types: u64, Pubkey, u8, u16, bool. These are predictable — 8, 32, 1, 2, 1 bytes respectively. But some structs include Option<T> types or vectors, and these introduce variable-length encoding that makes offset calculation harder.

In Borsh, an Option<T> is serialized as a single u8 tag followed by the value:

  • None is serialized as 0x00 — 1 byte total
  • Some(value) is serialized as 0x01 followed by the serialized value — 1 + sizeof(T) bytes total

This means the total size of an Option<Pubkey> field is either 1 byte or 33 bytes, depending on whether it's None or Some. If your offset calculations assume it's always Some (33 bytes), you'll be wrong when it's None. If you assume it's always None (1 byte), you'll be wrong when it's Some. And the offset error cascades to every field after it.

Vectors (Vec<T>) are even more variable. Borsh serializes a vector as a u32 length prefix followed by that many serialized elements. A Vec<u8> with 100 elements takes 4 + 100 = 104 bytes. An empty Vec<u8> takes 4 bytes (just the length prefix, value 0). The total size depends entirely on the data.

In practice, most DEX pool accounts avoid variable-length types for exactly this reason — they make offset calculation difficult and account size unpredictable. But configuration accounts, metadata accounts, and newer protocol designs sometimes use them. When they do, you can't calculate a static offset table. You have to parse sequentially from the beginning, reading each field's size as you go, to find the field you want.
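
A sketch of what that sequential parsing looks like for Borsh's Option and Vec encodings; the field names in the commented walk-through are hypothetical:

def read_option_pubkey(data: bytes, pos: int):
    # Borsh Option<Pubkey>: 1-byte tag, then 32 bytes only when the tag is 1.
    if data[pos] == 0:
        return None, pos + 1
    return data[pos + 1:pos + 33], pos + 33

def read_vec_u8(data: bytes, pos: int):
    # Borsh Vec<u8>: u32 little-endian length prefix, then that many bytes.
    length = int.from_bytes(data[pos:pos + 4], "little")
    start = pos + 4
    return data[start:start + length], start + length

# Hypothetical walk through an account with variable-length fields:
# pos = 8                                    # skip the Anchor discriminator
# maybe_authority, pos = read_option_pubkey(data, pos)
# fee_schedule, pos = read_vec_u8(data, pos)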

What I Can See Now

I've been at this for a week. My bot can parse pool accounts from several different DEX protocols. Each protocol required its own offset map, verified through the cross-verification process I described. Each one had at least one surprise — a field I miscounted, a type I misidentified, a discriminator present where I didn't expect one or absent where I did.

But something has changed in how I look at blockchain data. Before this week, on-chain accounts were opaque blobs. I'd use the protocol's SDK or API to read pool state and never think about what was underneath. Now I see the structure. I can fetch raw account data, scan the hex, and identify the discriminator, the authority keys, the reserves. I can spot when a field doesn't look right — when a Pubkey contains too many zero bytes to be a real address, when a reserve value is suspiciously large or small, when the discriminator doesn't match the expected hash.

It's like learning to read an NFL play diagram after years of just watching the game on TV. The Xs and Os, the route trees, the blocking assignments — they were always there, encoded in the formation the players line up in. You just didn't know what to look for. Now you see the tight end shift to the slot and you know it's a passing play before the ball is snapped. The information was always visible. You just needed the right frame to interpret it.

Reading bytes gives me something I didn't have before: independence from SDKs and APIs. I don't need to wait for a protocol to publish a JavaScript library for reading their pool state. I don't need to trust that someone else's parser is correct. I can fetch the raw data, apply the layout, and extract exactly the fields I need, in exactly the format my bot requires, with zero dependency on external tooling.

And when something goes wrong — when my bot miscalculates a swap output and a transaction fails — I can go back to the bytes. I can fetch the account data at that specific slot, examine the raw reserves, and check whether my parser read them correctly. The bytes are the ground truth. Everything else is interpretation.

There's something I keep coming back to, though. How do you know when you've verified enough? I cross-check against known values, and everything matches — but what about the fields I can't independently verify? What about internal accounting fields, accumulated fees, state flags whose expected values I don't know? I verify what I can and trust the source code for the rest. Is that enough? What does it even mean to fully trust your data parsing when you're working with undocumented binary formats maintained by third parties who can upgrade the program at any time?

I don't have a good answer. I just have a practice: verify what you can, test the rest with live transactions, and always be ready for the layout to change.
