In the last few days I have been mucking around with Mimiron
, my Hearthstone API library and Discord bot. One of the main functions of the bot, and the one it is used for the most, is parsing Hearhstone deck codes and presenting them in a pretty fashion.
I had a working parser for a while, but I wanted to experiment with it so I decided to rewrite it in nom. Here are both parsers.
Deck Code Format
The deck code fromat is explained by HearthSim, the company behind HSReplay As far as I know, there is no official specification of deck codes encoding, so HearthSim's description is what everyone relies on. The page is actually outdated, but their example implementations (in C#m Pythonm and TypeScript) explain it clearly enough.
- The deck code is essentially a
base64
encoded series of integers. - These integers are encoded as unsigned
varint
s, variable width integers[1]. - The first two integers are always
0
and1
. They happen to be a byte each, simplifying matters. - The third integer is the Format.
1
is Wild,2
is Standard,3
is Classic, and4
is Twist.
After Format
, every block of integers starts with a count, then a listing of items, usually Card IDs (often called DBF IDs). Card IDs are unique numerical IDs that uniquely[2] identify each card in the game. The blocks are as follows:
- Hero. Conveniently, this block only ever has one element.
- Single-copy cards.
- Double-copy cards.
- N-copy cards. Each item in this block is a Card ID followed a by a count. This is only used for weird one-off formats or Arena codes.
- A block of blocks? Unclear.
The block of blocks is not clearly documented. Sussing it out from existing parsers, it only has one item, which is the Sideboard block. The Sideboard block starts with a count, as usual, but each item is a pair of Card IDs. The first is the card itself, and the second is the Card ID of the Card it is under. There are, as of this time, only two Sideboard cards in the game.
After that is whatever. It apparently does not matter. It seems to usually be two 0
s? hitherto undefined blocks? WHo knows.
Example
Take this deck code for example:
AAECAdrJBgbHpAaopQbUpQaX4Qat4Qaq6gYM5p4GpKcGqKcG66kGw74GzsAG0MAG0dAGmOEGmOIG5OoGjfgGAAED9bMGx6QG97MGx6QG6N4Gx6QGAAA=
This is the list of numbers it represents, with comments.
0
1
2 # Format
1
107738 # Hero ID
6 # Single Copy IDs
102983 # <-- Note this ID is the same as the Sidebaord Card later.
103080
103124
110743
110765
111914
12 # Double Copy IDs
102246
103332
103336
103659
106307
106574
106576
108625
110744
110872
111972
113677
0 # N-Copy IDs. This is only used in weird formats.
1 # no idea really
3 # Sideboard IDs
104949
102983
104951 # first part of pair: Card IN Sideboard
102983 # The Sideboard Card
110440
102983
0 # who knows? Future proofing maybe?
0
It is a very simple format really.
Desired Output and Shared Elements
This is the Object which I need to fill up.
struct Data {
format: Format,
hero: usize,
cards: Vec<usize>,
sideboard_cards: Vec<(usize, usize)>,
}
And this is the Format
type with its conversion from a usize
.
#[derive(Default)]
enum Format {
Standard,
#[default]
Wild,
Classic,
Twist,
}
impl TryFrom<usize> for Format {
type Error = (); // Petend it is something useful.
fn try_from(value: usize) -> Result<Self, Self::Error> {
Ok(match value {
1 => Self::Wild,
2 => Self::Standard,
3 => Self::Classic,
4 => Self::Twist,
_ => return Err(()),
})
}
}
To be able to implement and test the different parsers without much of the surrounding code interfering with it, I decided to wrap the whole thing into a tiny executable to be able to pass the codes directly on the command line. I'd also need the base64
crate.
This is the venerable Cargo.toml
:
[package]
name = "deck_parser"
version = "0.1.0"
edition = "2021"
[dependencies]
base64 = "0.22.1"
pico-args = "0.5.0"
# more to xome
And this is main.rs
, without the types defined above:
use base64::{prelude::BASE64_STANDARD, Engine};
use pico_args::Arguments;
fn main() {
let mut args = Arguments::from_env();
let raw = args.contains("--raw");
let Ok(decoded) = args.free_from_fn(|s| BASE64_STANDARD.decode(s)) else {
eprintln!("Not base64 encoded");
std::process::exit(1);
};
if raw {
print_nums(&decoded);
} else {
if print_data(&decoded).is_err() {
eprintln!("Invalid Deck Code");
std::process::exit(1);
}
}
}
// Print the List of Numbers
fn print_nums(decoded: &[u8]) {
todo!()
}
// Print the Raw Data obejct.
fn print_data(decoded: &[u8]) -> Result<(), Box<dyn Error>> {
todo!()
}
The --raw
flag is mostly to test the varint
reader, and check it always gets the same output.
Straightforward Parser
This is the first parser that I implemented. I used the integer_encoding
crate because I dod not want to think about how varint
encoding works, and it presents a nice API. It required wrapping the buffer in a std::io::Cursor
, but that was it. It reads right from there.
Consdiering the varint
parsing is relegated to a crate, print_nums
is fairly straightforward.
use integer_encoding::VarIntReader; // way at the top of the file.
use std::io::Cursor;
fn print_nums(decoded: &[u8]) {
let mut buffer = Cursor::new(decoded);
while let Ok(n) = buffer.read_varint::<usize>() {
println!("{n}");
}
}
And one can quickly test that it works with the following command. Should you run it with the deck code earlier, you will find it lists the same numbers.
cargo run -- --raw <insert-deck-code-here>
Excellent. This means that integer_encoding
does its job splendidly.
The second one is a bit more involved, but also fairly straightforward. read_varint
advances the Cursor
, so all that is needed is to assign the values in the correct order. There is a small footgun hiding in the API, however, which is that unsigned varint
s have a different encoding for signed varint
s. Considering literals are, by default, i32
, Rust's type inference works against you here.
fn print_data(decoded: &[u8]) -> Result<(), Box<dyn Error>> {
let mut buffer = Cursor::new(&decoded);
// Format is the third number.
buffer.set_position(2);
let format = buffer
.read_varint::<usize>()?
.try_into()
.unwrap_or_default();
// Hero ID is the fifth number.
buffer.set_position(4);
let hero = buffer.read_varint()?;
let mut cards = Vec::new();
// Single copy cards
let count = buffer.read_varint()?;
for _ in 0_usize..count { // <-- Type Inference footgun!!
let id = buffer.read_varint()?;
cards.push(id);
}
// Double copy cards
let count = buffer.read_varint()?;
for _ in 0_usize..count {
let id = buffer.read_varint()?;
cards.push(id);
cards.push(id); // twice
}
// N-copy cards
let count = buffer.read_varint()?;
for _ in 0_usize..count {
let id = buffer.read_varint()?;
let n = buffer.read_varint()?;
for _ in 0_usize..n {
cards.push(id);
}
}
// Sideboard cards. Are they always available?
let mut sideboard_cards = Vec::new();
if buffer.read_varint::<usize>().is_ok_and(|i| i == 1) {
let count = buffer.read_varint()?;
for _ in 0_usize..count {
let id = buffer.read_varint()?;
let sb_id = buffer.read_varint()?;
sideboard_cards.push((id, sb_id));
}
}
let result = Data {
format,
hero,
cards,
sideboard_cards,
};
println!("{:#?}", result); // Don't forget to dervie Debug
Ok(())
}
Not a very pretty printing. But it works. Good base implementation.
Unsigned Variable Integers
When I started trying to rewrite the parser in nom
, I looked for an extension nom
crate that can read varint
s. I found two! Reading through their source, however, I realized that it is really simple, and decided to bring it in the code base myself.
Unsigned varint
s' encoding is, as it turned out, fairly simple. Honestly it is actually easier to just show the code than explain how the encoding works.
fn parse_varint(input: &[u8]) -> Option<usize> {
// a usize can only be as big as 9 bytes in varint encoding
const MAX_BYTES: usize = 9;
let mut num = 0;
for (idx, &byte) in input.iter().take(MAX_BYTES).enumerate() {
// 7 bits from each byte are used
let byte = byte as usize & 0x7F;
let offset = idx * 7;
num |= byte << offset;
// The last byte's most significant bit is 0.
if byte & 0x80 == 0 {
return Some(num);
}
}
None // Insufficient data or overflow
}
Signed varint
s are a whole other beast, and they do not concern this article outside of the type inference footgun from the previous section.
nom
Parser
This is what actually prompted writing this article.[3] I already used nom
elsewhere in Mimiron
, and I wanted to exercise more with it.
The nom
way is all about small, composable functions. The clearest small composable function to do is parse_varint
. I am not going to use the imperative version shown above, I am writing one using nom
's parser combinators and the standard library's Iterator
combinators.
// no nom codebase is complete without a giant use statement on top
// this includes combinators for all the functions below
use nom::{
bytes::complete::{
take, // takes a number of bytes unconditionally
take_till, // takes until a condition is met and stops before
},
combinator::{
cond, // calls a parser only if a condition is met
map, // maps the result of a parser
map_opt, // maps the result of a parser to an Option
opt, // optional parser, for a value. returns Option
verify, // verifies the result satisfies a condition
},
multi::{
length_count, // gets a number from first parser,
// then applies second parser that many times
many0, // repeats a parser until it fails. returns Vec
},
sequence::{
pair, // call first parser, then second parser.
preceded, // discards result of first parser
tuple, // same as pair, but up to 21 separate parsers
},
IResult, // nom's Result type.
// Returns remaining input and successful results
// or an error with the input inside the Error.
};
fn parse_varint(input: &[u8]) -> IResult<&[u8], usize> {
let is_last = |b| b & 0x80 == 0;
let is_in_bounds = |p: &[u8]| p.len() < 9;
let try_varint = |(p, lb): (&[u8], &[u8])| {
// this is the same imperative code up above with some error checking
p.iter()
.chain(lb)
.enumerate()
.try_fold(0, |acc, (idx, byte)| {
((*byte as usize) & 0x7F)
.checked_shl(idx as u32 * 7)
.map(|n| acc | n)
})
};
map_opt(
pair(verify(take_till(is_last), is_in_bounds), take(1u8)),
try_varint,
)(input)
}
Same, but slightly more unreadable (I am going to have so much fun with this):
fn parse_varint(input: &[u8]) -> IResult<&[u8], usize> {
map_opt(
pair(
verify(take_till(|b| b & 0x80 == 0), |p: &[u8]| p.len() < 9),
take(1u8),
),
|(p, lb): (&[u8], &[u8])| {
p.iter()
.chain(lb)
.enumerate()
.try_fold(0, |acc, (idx, byte)| {
((*byte as usize) & 0x7F)
.checked_shl(idx as u32 * 7)
.map(|n| acc | n)
})
},
)(input)
}
Using this small function to generate the list of numbers, going straight for the unreadable verion:
pub fn print_nums(decoded: &[u8]) {
_ = many0(map(parse_varint, |n| println!("{n}")))(decoded);
}
The second function is a bit more involved, but it generally follows the structure of the previous section's solution. I think it is easy enough to follow.
pub fn print_data(decoded: &[u8]) -> Result<(), Box<dyn Error + '_>> {
// starting from 2 as Format is the third number.
let (rem, format) = map(
parse_varint,
|f| f.try_into().unwrap_or_default()
)(&decoded[2..])?;
// skip 1 byte for Hero ID.
let (rem, _) = parse_varint(rem)?;
let (rem, hero) = parse_varint(rem)?;
let (rem, cards) = map(
tuple((
// single cards
length_count(parse_varint, parse_varint),
// double cards
length_count(parse_varint, map(parse_varint, |id| [id; 2])),
// n-count cards
length_count(
parse_varint,
map(
pair(parse_varint, parse_varint),
(id, n)| [id].repeat(n)
),
),
)),
|(v1, v2, vn)| {
v1.into_iter()
.chain(v2.into_iter().flatten())
.chain(vn.into_iter().flatten())
.collect()
},
)(rem)?;
let (rem, c) = opt(parse_varint)(rem)?;
let (_rem, sideboard_cards) = map(
cond(
c.is_some_and(|c| c == 1),
length_count(parse_varint, pair(parse_varint, parse_varint)),
),
Option::unwrap_or_default,
)(rem)?;
let result = super::Data {
format,
hero,
cards,
sideboard_cards,
};
println!("{result:#?}");
Ok(())
}
But this can be made worse: the whole thing can be collapsed into one function. rustfmt
is pulling a lot of weight here, to be honest.
pub fn print_data(decoded: &[u8]) -> Result<(), Box<dyn Error + '_>> {
let result = map(
tuple((
map(parse_varint, |f| f.try_into().unwrap_or_default()),
preceded(parse_varint, parse_varint),
map(
tuple((
length_count(parse_varint, parse_varint),
length_count(
parse_varint,
map(parse_varint, |id| [id; 2])
),
length_count(
parse_varint,
map(
pair(parse_varint, parse_varint),
|(id, n)| [id].repeat(n)
),
),
)),
|(v1, v2, vn)| {
v1.into_iter()
.chain(v2.into_iter().flatten())
.chain(vn.into_iter().flatten())
.collect()
},
),
preceded(
verify(parse_varint, |i| *i == 1),
length_count(parse_varint, pair(parse_varint, parse_varint)),
),
)),
|(format, hero, cards, sideboard_cards)| Data {
format,
hero,
cards,
sideboard_cards,
},
)(&decoded[2..])?
.1;
println!("{:#?}", result);
Ok(())
}
One tiny thing before the end. Note the +'_
bound on the return type. Since nom
's errors contain the input, which is a reference, converting the errors can sometimes cause Rust's type inference to declare that your functions expect a 'static
lifetime. For example, here, when the +'_
bound is removed, or when converting nom
's errors to anyhow
's errors using the ?
operator.. It is not a footgun, per se, as the compiler yells at you. But the error is frankly inscrutable unless you know what you are looking for.
Final Thoughts
Rewriting in nom was a bit easier than I thought it would be. And even the compact "one-liner" is more readable than I thought. The hardest part is probably how to translate the problem into nom
's API. (Which I struggled with when I considered using nom
for Advent of Code). Who knows? Maybe this year.
If you would like to do this with another Rust parser library (chumsky
or winnow
or combine
or pest
), please do share, and I will happily link it here.
-
This Wikipedia page gives a terrible and obtuse explanation. The implementation turned out to very simple. ↩
-
IDs can be deleted from the game, leading to outdated deck codes out there, but that's out of scope of this article. In
Mimiron
, I work around that by looking them up in the third party database everyone else uses: HearthSim's, and "canonilizing" the IDs. My bot is somewhat unique for relying first on Blizzard's official API. ↩ -
If you are somehow this far into this post and do not know what
nom
is, it is a Rust parser combinator library, probably the first and most famous. Combinators are small higher order functions (read: functions that can take and/or output other functions: function functions, if you will), that can be composed together. Parser Combinator Libraries are, in a way, their own Domain Specific Language.nom
's documentation has a nice intro. ↩