the bloggest of mendes

Plug-based authorisation for Elixir and Phoenix

2020-04-17T09:07:48-07:00

Note: this post was originally published on the Subvisual blog.

Some years ago, most of us here at Subvisual got really-perhaps-a-bit-too-much into Elixir. Ever since then, whenever we are free to choose the technology to work with, we’ve pretty much been going Elixir all the way.

We learned a lot. We laughed a lot. And I copy and pasted some code from different projects a lot. Don’t tell the rest of the development team. Aaaaanyway, I finally got around to open sourcing the copy/pasted code and releasing it as a package.

I called this thingy Dictator. It implements a plug-based authorisation system and allows you to dictate (get it??) what your users can access, by defining policies (hah! get it??). You can be as granular as you want and override pretty much everything. The philosophy behind it is to implement sane defaults but be easily overridable as well. You might even call it convention over configuration. Enough chit-chat, let’s showcase it.

How to use the thing

Very important pre-condition: it assumes you have a current_user or current_resource or similar in your conn.assigns.

Dictator uses the concept of a policy, which is a set of rules you implement to determine what actions your users can take. To do that, you just define a can?/3 function, which receives the current user as the first argument, the action (:new, :index and so on) as the second and finally the resource being accessed as the third. Loading all of those is automagically handled for you.

Let’s assume you want to define a Post policy:

# lib/client_web/policies/post.ex
defmodule ClientWeb.Policies.Post do
  alias Client.Content.Post
  # Assumption: point this at wherever your User schema actually lives
  alias Client.Accounts.User

  use Dictator.Policy, for: Post

  def can?(%User{id: user_id}, action, %Post{user_id: user_id})
    when action in [:edit, :update, :delete, :show], do: true
  def can?(_, action, _) when action in [:index, :new, :create], do: true
  def can?(_, _, _), do: false
end

In this scenario our users can show, edit, update and delete their own posts. But anyone can index and create posts. The last can?/3 clause denies everything else, which prevents users from viewing, editing, updating or deleting posts that don’t belong to them.

This scenario is so common across the different resources and projects I’ve worked on that I extracted it into a Standard policy. To do the above, you can just do the following:

# lib/client_web/policies/post.ex
defmodule ClientWeb.Policies.Post do
  alias Client.Content.Post

  use Dictator.Policies.Standard, for: Post
end

This is a prime example of what I had in mind when building and extracting the code from previous projects: implement the most common use cases and allow edge cases to be overridden.

Once you have defined a policy, simply plug in Dictator.Plug.Authorize and it will even infer the policy to use (provided some conditions are met, but we’ll get to that).

# lib/client_web/controllers/post_controller.ex
defmodule ClientWeb.PostController do
  use ClientWeb, :controller

  plug Dictator.Plug.Authorize

  # ...
end

Tadaaaaaaa. Half-a-dozen lines of code and you’re already bossing around your users. Screw the user is always right, we dictatin’ everything ‘round 'ere.

Well, it seems that so far Dictator does a lot of magic behind the scenes, but fear not. We’ll go through how it loads resources, how it figures out the correct policy, how it determines which action the user is attempting and how we can override the stuff it uses. You and me, on a magic trip across the Land Of Code as if we were building Dictator from scratch.

How the thing loads resources

The first thing we need to do when enforcing a policy is to figure out what the hell we are dealing with. This means figuring out what resource the user wants to access, what action they want to take and what specific policy decides if they can or cannot perform said action.

So let’s start with getting the correct resource. The first piece of the puzzle we need is the module that defines the resource being accessed. Well, that’s easy, when defining the policy the developer needs to specify what resource it is referring to:

defmodule ClientWeb.Policies.Post do
  alias Client.Content.Post

  use Dictator.Policy, for: Post

  # ...
end

Nice work! What a team you and me are! So now we need to get the correct repo. If you dive into policy.ex, you’ll figure out just how lazy a pair of cheaters you and me are. We try two things and then give up.

First we try to use the namespace and see if that module exists (get_repo_from_namespace/1). If you are defining a policy for Client.Content.Post most of the time you’ll have a Client.Repo. So let’s just check if that exists and hope for the best. If that doesn’t work, well, we can just use the :ecto_repo config that we are required to have when using Ecto and hope there is only one Repo defined (get_repo_from_application/1).

Sometimes this isn’t the case. Sometimes our web apps need multiple repos or we even accidentally choose the wrong one (e.g. in the first scenario, if the developer has defined multiple repos we may end up with the wrong one). We really can’t figure out what the developer wants in those cases. Instead let’s just be lazy, raise an error and ask the developer to specify the repo via the :repo key:

defmodule ClientWeb.Policies.Post do
  alias Client.Content.Post

  use Dictator.Policy, for: Post, repo: Client.MyFunkyWeirdRepo

  # ...
end

At this point we have the repo and module for the resource the user is trying to access. We also know the params of the HTTP call. So now we just need to call repo.get(module, params["id"]). Now, this assumes the resource has a primary key named id. For the large majority of the resources we code, this happens to be true and we can default to that. However, developers like to get picky and use different primary keys. We’ll need to accept a :key option:

defmodule ClientWeb.Policies.Post do
  alias Client.Content.Post

  use Dictator.Policy, for: Post, key: :uuid

  # ...
end

Note that this assumes the key has the same name in the params map of the HTTP call. If we have id as the primary key, we expect the params to be %{"id" => id}. If it’s uuid, we expect them to be %{"uuid" => uuid}. This logic is defined in the load_resource/1 function.

But we developers like to complicate things. Sometimes the primary key might be uuid but the HTTP param might be named something different. Sometimes we like to feel smart and have composite primary keys. Well, that’s too much of a hassle to handle and there are way too many edge cases. Let’s just allow the load_resource/1 function to be overridable and say “heh, developers can handle it”:

defmodule ClientWeb.Policies.Post do
  alias Client.Content.Post
  alias Client.Repo

  use Dictator.Policy, for: Post

  def load_resource(params) do
    Repo.get_by(Post, uuid: params["uuid"], id: params["id"])
  end

  # ...
end

You can see that we allow the function to be overridden in the same policy.ex file, via the defoverridable call.

Let’s recap. At this point we know how to find repos, load resources and we’ve allowed developers that use our library to have a bunch of options when use-ing Dictator.Policy:

:repo allows them to specify which repo to use to load resources.
:key allows them to specify a different primary key for the resource.
load_resource/1 is overridable to allow complex queries.

Time to move along to how this Dictator thingy calls the police.

How the thing calls the ~~police~~ policy

The next step on our tour is a detour (get it?? I’m on fire today) to plug/authorize.ex, specifically the extract_policy_module/1 function. The trick to inferring the correct policy is very obvious: use private Phoenix stuff that may or may not be in the documentation and get the controller from that. Obviously. We then use that to generate the policy module. If the controller is ClientWeb.PostController, we’ll transform it to ClientWeb.Policies.Post.

With that in mind, we can again rely on the developers to be picky and define shared policies or to want to reuse them or do weird developer stuff. Which means that they’ll need an override option. Luckily we can easily arrange it. When we are plugging the policy into the controller, developers can provide a :policy key and we’ll only call load_policy/1 if the key isn’t present:

# lib/client_web/controllers/post_controller.ex
defmodule ClientWeb.PostController do
  use ClientWeb, :controller

  plug Dictator.Plug.Authorize, policy: ClientWeb.Policies.Content

  # ...
end

We’ve covered how to load resources and how to select the policy. But we’re missing a couple of things: how to get the current user and how to get the action.

How the thing interacts with Phoenix

Starting with the current user, let’s once again be lazy: we assume there’s a current_user in the conn.assigns. Most of the time there will be. Of course, them developers will not always call it that, so we can - guess what? - give them a :resource_key option to override it when they’re plug-ing the policy in the controller. If your current user in conn.assigns is called current_resource, you can do:

# lib/client_web/controllers/post_controller.ex
defmodule ClientWeb.PostController do
  use ClientWeb, :controller

  plug Dictator.Plug.Authorize, resource_key: :current_resource

  # ...
end

All we need now is the action. authorize.ex has the answer for that: use private Phoenix stuff, again. conn.private.phoenix_action, ez-pz.

For the sake of sanity, let’s add one final option: :only, which specifies the actions for which to enforce the policy. By default, we enforce the policy on all them actions. But a developer might want to only call a policy for the new action:

# lib/client_web/controllers/post_controller.ex
defmodule ClientWeb.PostController do
  use ClientWeb, :controller

  plug Dictator.Plug.Authorize, only: [:new]

  # ...
end

We finally have the current user, the action they want to take and the policy to be enforced. All we have to do in our Authorize plug is to call policy.can?(user, action, resource) and, if they can, return an unchanged conn. If not, well, 401 it and halt everything.

The logic for all these tricks is straightforward and the whole project boils down to two relevant modules (Dictator.Plug.Authorize and Dictator.Policy) with a staggering total of 141 lines of code. Isn’t Elixir awesome?

Overrides for the Standard Policy

I mentioned at the beginning of this post that there’s a very common scenario: when the developer wants to allow users to edit, update and delete their own resources and everyone to read or create new posts.

For that, Dictator comes bundled with the Dictator.Policies.Standard policy. However, this policy makes two assumptions:

the primary key of the user trying to access is id
the foreign key of the resource being accessed is user_id

Of course, this doesn’t happen all the time. So when use-ing the Standard policy, developers have these corresponding override options:

owner_key (e.g. if your user has a uuid field as primary key instead of id).
foreign_key (e.g. if your resource has a manager_id instead of user_id as the foreign key in the relation).

In Summary

Lots of stuff happening, small number of codes. Elixir awesome. Demo here. Please contribute to project: subvisual/dictator.

Have a nice.

Mendes

jobs and timers in neovim: how to watch your builds fail

2019-06-07T09:23:28-07:00

Note: this blog post was originally written for the [Subvisual blog][sv-blog]. You can find the original [here][original-post].

If you’re like me (and for your own sake, I truly hope you are not), you
probably tend to have a lot of builds fail. Even worse, if you really are like
me, you spend most of your time in vim.

If that is not the case, you’re in the clear, there’s nothing wrong with you,
feel free to go, end this blog post now, be free, happy, enjoy the sunlight and
the birds and the trees. Life is good.

… Are we, the sadists, all alone now? Cool. Ok, so you use vim a lot and you make builds
fail. Chances are you would like to know when that happens without ever leaving vim. It’s alright. I got you, mate.

Here’s an asciicast of my nvim. Notice how the status bar includes, on the bottom
right, the status of the CI. Notice how it updates. Damn, that’s neat. You want
that.

[![asciicast][asciicast-svg]][asciicast]

First things first, either make an API wrapper, preferably in Rust or Go,
something compiled and fancy, that allows you to check the [GitHub checks
API][gh-checks-api]. Got it? Good. Now stop being a muppet and use [hub][hub]
instead.

Now that you have hub, you can make use of the hub ci-status command.

$ hub ci-status
success

Coolio.

Now let’s change our custom status bar.

First, we want to check if we’re in a git project:

" system() returns the command's output; its exit code lands in v:shell_error
call system("git rev-parse --git-dir 2> /dev/null")

if v:shell_error == 0
  " call hub
endif

So now we need to call hub. However just doing a system call to hub would
be a blocking operation and we don’t want our vim to block every few
moments for like 5 seconds. So let’s use jobstart.

Start by calling :h jobstart from your (n)vim. You can see that it runs an
asynchronous job and it supports shell commands.

So let’s create a CiStatus function that looks like this:

function! CiStatus()
  let l:callbacks = {
  \ 'on_stdout': function('OnCiStatus'),
  \ }

  call jobstart('hub ci-status', l:callbacks)
endfunction

We define a map of callbacks for stdout and delegate that to a new function
called OnCiStatus. This is a very simple function that gets the output from
hub and converts it to whatever we want, storing it in a g:ci_status
variable. We will later use this variable in our statusline.

function! OnCiStatus(job_id, data, event) dict
  if a:event == "stdout" && a:data[0] != ''
    let g:ci_status = ParseCiStatus(a:data[0])
  endif
endfunction

function! ParseCiStatus(out)
  let l:states = {
    \ 'success': "ci passed",
    \ 'failure': "ci failed",
    \ 'neutral': "ci yet to run",
    \ 'error': "ci errored",
    \ 'cancelled': "ci cancelled",
    \ 'action_required': "ci requires action",
    \ 'pending': "ci running",
    \ 'timed_out': "ci timed out",
    \ 'no status': "no ci",
  \ }

  return l:states[a:out] . ", "
endfunction

There are a couple of things missing though. This runs the hub ci-status job
only once. We want to have it perform constant checks. If we do :h timers, we
can see the new timer API in neovim. There’s a timer_start that takes a period
and a callback to run after that period.

We can then change our OnCiStatus function to call timer_start with that
first CiStatus function again:

function! OnCiStatus(job_id, data, event) dict
  if a:event == "stdout" && a:data[0] != ''
    let g:ci_status = ParseCiStatus(a:data[0])
    call timer_start(30000, 'CiStatus') " relevant new part
  endif
endfunction

Now CiStatus gets called by timer_start every 30 seconds. timer_start,
however, passes the timer_id as an argument to the callback. So we will need
to modify CiStatus to accept an argument (that we can safely ignore):

function! CiStatus(timer_id)
  let l:callbacks = {
  \ 'on_stdout': function('OnCiStatus'),
  \ }

  call jobstart('hub ci-status', l:callbacks)
endfunction

" We also need to change the first CiStatus call to receive an int
" Since we don't care about it, let's just use 0

" system() returns the command's output; its exit code lands in v:shell_error
call system("git rev-parse --git-dir 2> /dev/null")

if v:shell_error == 0
  call CiStatus(0)
endif

All that’s missing now is to take the value of g:ci_status and put it into the
statusline. That’s pretty simple, using some code borrowed from [Kade
Killary][kade-killary].

set statusline=
set statusline+=\ \ \  " Empty space
set statusline+=%< " Where to truncate line
set statusline+=%f " Path to the file in the buffer, as typed or relative to current directory
set statusline+=%{&modified?'\ +':''}
set statusline+=%{&readonly?'\ ':''}
set statusline+=%= " Separation point between left and right aligned items
set statusline+=\ %{get(g:,'ci_status','')} " Our custom CI status check (empty until the first result arrives)
set statusline+=col:\ %c
set statusline+=\ \ \  " Empty space

And that’s that. Cheerios. Hugs n kisses and all that.

[hub]: https://hub.github.com/
[gh-checks-api]: https://developer.github.com/v3/checks/
[asciicast]: https://asciinema.org/a/5ynHiyckpQmQP7oWYI6HsVKKI
[asciicast-svg]: https://asciinema.org/a/5ynHiyckpQmQP7oWYI6HsVKKI.svg
[kade-killary]: https://kadekillary.work/post/statusline-vim/
[original-post]: https://medium.com/subvisual/jobs-and-timers-in-neovim-how-to-watch-your-builds-fail-f18931f2ffb6
[sv-blog]: https://medium.com/subvisual

you must return here with a shrubbery: the pixels camp quizshow qualifier treasure hunt

2019-03-05T06:05:18-08:00

This is the story of how I locked myself inside my room for 29 hours and only left after finishing one of the craziest tech wargames/treasure hunts I have ever taken part in.

Prelude

For those of you who don’t know, Pixels Camp has a quiz show.

To get to the quiz show, you have to ~~be tortured~~ get through the very good and awesome and oh so fun, so amazingly fun qualifiers.

The qualifiers run for 4 weeks. Then there’s a week off. Then the 16 top players get to find a partner to ~~violently murder the quizmaster~~ go on stage and make fools of themselves. In order to do that, you get asked questions and then you fail. Invariably. Eventually consistently.

2 editions ago, me and @naps62 failed so hard we won. The next year, we felt a bit more confident and got swept in the first round. As far as logic goes, leave it by the door of the quiz finals and pick it up afterwards. Think you know the answer? You don’t. Been feeling like a champ? You’re gonna get chewed up and then made fun of by the quizmaster and probably your own coworkers who, incidentally, had been not only watching it live but also recording everything for reaction gifs (yes, oddly specific, I know).

Every year the qualifiers have a treasure hunt. That is one of my favorite things in the world. First because you get mentally challenged. Second because if you fail, you can just blame it on the lack of time and how popular you are.

yeah, I couldn’t do the qualifier because it starts on a Friday night and I’m out at the pub unlike YOU LOSERS AHAHAHAH also does someone have any tips for step 3?

– stage one of doing the annual treasure hunt

Third because it gives you a reason to hate on another human being. And, let’s face it, we all love blaming our beloved quizmaster, @carlosefr, for our own shortcomings.

I hate you quiz master, I haven’t slept properly for 7 weeks, I’ve been putting on weight since 2016, don’t feel like going to the gym and it’s all because of YOU. YOU and YOUR STUPID TREASURE HUNT.

– stage two of the treasure hunt, commonly found on the Pixels Camp Slack

Enough chit chat, let’s go through the solution.

Step 1: The Cipher and The Dial

How does the treasure hunt work? Simple, two steps:

You start on step 1 and get to the next step
You repeat.

There are no rules, but there are patterns. It’s never too complicated. It’s usually a single solution step. If you have an image, you have all it takes to get to the next level. You won’t need to do the hex dump of the image, convert it to decimal, factor out the primes and convert to ASCII to get a riddle to solve. If you have an image, everything you need is in there (sometimes literally in the image). And, again, it’s usually a single step. A multitude of steps like the one I just described would leave you uncertain. You wouldn’t know if you were on the right track. But with the treasure hunt you always know and you should keep this in mind. If you find you are uncertain about your solution, it’s probably wrong.

Right then, how did it start? We went to the challenge page (sidenote: there’s a small chance this link might not be available due to the challenge closing). And we had the explanation for the treasure hunt, a form for submitting the final solution (given in the last step of the hunt) and an image. This image:

Ah, a weird dialect. An unknown alphabet. A ciphertext! I have a background in cryptography, so I knew exactly what to do in this case…

Yep, that’s right. Reverse Google search.

The reverse Google search led nowhere. So what’s next? My cryptographer brain was prepared. Years of study and long nights reading through complex mathematical formulas all built up to this moment.

You see, there’s this little trick known as frequency analysis. Basically it consists in taking every symbol of the ciphertext, drawing it carefully in a notebook and ignoring it because the right thing to do is to google “weird alien fonts”.

After spending some time in questionable websites which had every possible font, the solution: this is the Aurebesh alphabet!

I found this quite amusing. It’s literally the image for “Languages in Star Wars” in Wikipedia.

Translating it, we got:

TO DIAL ANOTHER PLANET, USE THE QM PREFIX. OVER

Perfect, we need to use IPFS. All the IPFS hashes start with Qm. But where do we find the hash? Like I said, all we need is in the picture.

curl’ing the image, we got the following output:

$ curl https://quiz.pixels.camp/challenge/2019-2-not-mojibake-09c6e098-43518c9be33e65de/start.png

# [TRUNCATED]
��������������������������������������������������#�Ř��w�ֶYtEXtcommentBright Pixel Mars Research Facility: QmfXPu3fBiPt6x6F8bsoXyuur1KPbJRqxqC72whtJRuxG���IEND�B`�%

Important sidenote: the original image had transparency, so it wouldn’t show on this page. I uploaded an altered version which might not include the hash. So if you try it yourself with the uploaded image, beware of different results.
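If squinting at raw bytes in a terminal is not your thing, the same comment can be fished out by walking the PNG chunks. This is only a minimal sketch in plain Ruby: png_text_chunks is a made-up helper (not part of any library), it does not verify CRCs, and it ignores compressed zTXt/iTXt chunks.

```ruby
# Every PNG starts with this fixed 8-byte signature.
PNG_SIGNATURE = "\x89PNG\r\n\x1a\n".b

# Hypothetical helper: returns a Hash of keyword => text for every tEXt chunk.
def png_text_chunks(bytes)
  bytes = bytes.b
  raise ArgumentError, "not a PNG" unless bytes.start_with?(PNG_SIGNATURE)

  texts = {}
  offset = PNG_SIGNATURE.bytesize
  while offset + 8 <= bytes.bytesize
    # Chunk layout: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    length, type = bytes.byteslice(offset, 8).unpack("Na4")
    if type == "tEXt"
      # tEXt data is "keyword", a NUL byte, then the text itself.
      keyword, value = bytes.byteslice(offset + 8, length).split("\0", 2)
      texts[keyword] = value
    end
    break if type == "IEND"
    offset += 12 + length
  end
  texts
end
```

Assuming the image was saved locally as start.png (a name picked for this example), png_text_chunks(File.binread("start.png")) would return the comment as a regular string, hash included.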

There we go! We got a hash! Let’s put it into the IPFS gateway: https://gateway.ipfs.io/ipfs/QmfXPu3fBiPt6x6F8bsoXyuur1KPbJRqxqC72whtJRuxG/

Huh, it doesn’t work… Finds nothing. Let’s install IPFS and use the CLI… Nope. Just the same. Wait whaaat? How is this… Is the hash ok? Let’s compare it:

# IPFS webpage hash, 46 bytes
QmTeW79w7QQ6Npa3b1d5tANreCDxF2iDaAPsDvW6KtLmfB
# our hash, 45 bytes
QmfXPu3fBiPt6x6F8bsoXyuur1KPbJRqxqC72whtJRuxG

Ok, we’re missing quite a byte. At this point I wrote a program to brute force the remaining byte. But nothing worked. What do we add?
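The post does not show that brute-force program, so what follows is a hypothetical sketch of the idea rather than the code that was actually run. It assumes the missing character could sit at any position and must come from the Base58 alphabet that Qm-style IPFS hashes use; checking each candidate against a gateway is left out.

```ruby
# Base58 (Bitcoin flavour): no 0, O, I or l.
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Hypothetical helper: every string obtained by inserting exactly one
# Base58 character at any position of the short hash.
def one_char_candidates(short_hash)
  (0..short_hash.length).flat_map do |position|
    BASE58.each_char.map { |char| short_hash.dup.insert(position, char) }
  end.uniq
end

candidates = one_char_candidates("QmfXPu3fBiPt6x6F8bsoXyuur1KPbJRqxqC72whtJRuxG")
puts candidates.size
```

Each candidate is 46 characters long, so every one of them at least looks like a plausible hash.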

The treasure hunt is supposed to be a brain challenge. Having you guess random stuff isn’t the MO. What do we have then? A hash with a byte missing. A message saying to use IPFS. What wrongful assumption are we making?

Turns out the quizmaster is really cheeky. Our assumption that he is telling us to use IPFS is not wrong but incomplete. He’s literally telling us to use the Qm prefix. So you add Qm to your hash and now you have an extra byte. You remove the final G and you go to the IPFS gateway.

It works. A double Qm hash! Oh, how devilish.
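The arithmetic is easy to double check. A tiny Ruby sketch of the fix described above (double_qm is just an illustrative name):

```ruby
found     = "QmfXPu3fBiPt6x6F8bsoXyuur1KPbJRqxqC72whtJRuxG"  # from the PNG comment
reference = "QmTeW79w7QQ6Npa3b1d5tANreCDxF2iDaAPsDvW6KtLmfB" # from the IPFS webpage

# Prepend the extra "Qm" prefix, then drop the last character
# so the length matches a regular hash again.
def double_qm(hash)
  ("Qm" + hash).chop
end

fixed = double_qm(found)
puts found.length     # 45
puts fixed.length     # 46
puts fixed.length == reference.length
```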

Fun fact: later on, after finishing the challenge, discussing this with the quizmaster he told me it was indirectly because of me and another contestant. While discussing past treasure hunts in the #quizshow Slack channel, 3 days before the start of the treasure hunt, this exchange happened:

This forced our quizmaster to make that step harder, which in turn led to the double Qm hash. Oops.

Step 2: A Message in Hebrew

Accessing the file on IPFS we had this:

66 1
66 1
4 14#6 2#8 4#2 6#2 14#4 1
4 2#10 2#2 6#4 10#8 2#10 2#4 1
4 2#2 6#2 2#2 2#8 4#2 6#2 2#2 2#2 6#2 2#4 1
4 2#2 6#2 2#4 2#6 2#2 2#4 6#2 2#2 6#2 2#4 1
4 2#2 6#2 2#2 6#2 2#2 4#2 4#6 2#2 6#2 2#4 1
4 2#10 2#2 2#2 2#4 2#2 4#4 2#4 2#10 2#4 1
4 14#2 2#2 2#2 2#2 2#2 2#2 2#2 2#2 14#4 1
20 2#2 4#2 2#2 8#2 2#20 1
4 4#2 2#4 4#6 10#8 2#2 6#2 4#6 1
8 8#2 2#2 2#2 8#2 6#2 8#2 8#4 1
4 6#2 6#10 6#8 4#4 4#4 2#6 1
4 12#2 2#2 4#2 2#2 2#2 2#6 4#8 2#2 2#4 1
8 2#6 4#4 6#4 8#2 6#4 2#2 2#6 1
6 8#4 4#4 2#10 2#2 6#10 2#6 1
6 4#6 2#12 4#16 12#4 1
4 2#2 4#2 2#10 2#2 2#2 2#2 6#2 2#2 2#2 2#4 2#4 1
6 2#8 2#2 4#2 2#6 2#2 2#2 6#6 2#2 4#4 1
8 6#4 4#2 2#2 2#2 2#2 2#2 2#2 6#4 8#4 1
4 2#2 2#6 4#2 2#2 4#6 6#2 2#2 6#2 6#4 1
8 2#4 2#8 6#2 4#2 4#2 2#2 2#4 2#2 4#4 1
4 2#4 4#2 2#2 2#2 2#2 2#10 14#2 4#6 1
20 2#4 6#4 4#2 4#6 2#2 4#6 1
4 14#2 2#6 6#6 2#2 2#2 2#2 8#6 1
4 2#10 2#4 8#4 2#6 4#6 2#2 2#2 2#4 1
4 2#2 6#2 2#4 2#2 4#2 2#2 4#4 10#6 2#4 1
4 2#2 6#2 2#2 2#4 2#2 4#2 6#6 6#2 4#6 1
4 2#2 6#2 2#8 2#12 2#2 4#4 4#4 2#4 1
4 2#10 2#2 4#2 4#2 2#2 12#4 2#4 2#6 1
4 14#2 16#2 2#4 6#4 2#2 2#6 1
66 1
66 1

My intuition told me to inspect the file carefully. There was no hidden data, no metadata of any kind, no extra whitespaces or invisible characters. Nothing. WYSIWYG.

This was one of the hardest steps, personally. What could it be? The title of the challenge was Mojibake?. I started with esoteric languages. When that didn’t work, I tried to think of encodings that could fit. I tried different text encodings. Then, I focused on the # sign. Googled every cipher I could, to see if one converted text to numbers and if any used the # sign as a delimiter.

When that failed, my attention shifted to HTML entities and later even Hebrew.

@naps62 noticed the sum of each line was always 67. I noticed that every line was a different permutation of 67. I tried to search for ciphers around the number 67.
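The sum is quick to verify. A small Ruby sketch over three lines copied from the file above (the same check works on the whole text):

```ruby
sample = <<~TEXT
  66 1
  4 14#6 2#8 4#2 6#2 14#4 1
  20 2#2 4#2 2#2 8#2 2#20 1
TEXT

# Pull every run of digits out of each line and add them up.
sums = sample.each_line.map { |line| line.scan(/\d+/).sum(&:to_i) }
p sums # => [67, 67, 67]
```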

I went back to esolang and went through every language on the list.

Nothing would fit. I went to bed 7 hours after the challenge started.

I couldn’t stop thinking about it. I barely slept.

I woke up in the middle of the night with the idea of Run-Length Encoding. I had to code an encoder and decoder in C in my first year of university and that moment came to me suddenly. I dismissed it, it wouldn’t make sense. RLE transforms aaabbbbcc into a3b4c2. Looking at the text, we were missing characters between some numbers. The 1 at the end didn’t have any character.
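For reference, the textbook flavour of RLE from that example fits in a few lines of Ruby. This is a sketch of the character-then-count form only (rle_encode and rle_decode are illustrative names), and it assumes the input text contains no digits of its own:

```ruby
# "aaabbbbcc" -> "a3b4c2": each run becomes the character plus its length.
def rle_encode(text)
  text.gsub(/(.)\1*/m) { "#{$1}#{$&.length}" }
end

# "a3b4c2" -> "aaabbbbcc": repeat each character by the count that follows it.
def rle_decode(encoded)
  encoded.gsub(/(\D)(\d+)/m) { $1 * $2.to_i }
end

puts rle_encode("aaabbbbcc") # a3b4c2
puts rle_decode("a3b4c2")    # aaabbbbcc
```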

I woke up again thinking about prime factors. Then again thinking about the LCM of 67. It wouldn’t fit, this was sum, not multiplication.

I don’t know how long I slept in total, but I was very much sleep-deprived going into the second day of the challenge. I grabbed my computer and started reading about number-based ciphers, once more. @pfac, the remaining member of our triple trouble team for every Pixels Camp quiz, had suggested chess moves the night before. Wouldn’t fit.

I was getting frustrated. At this point, I hadn’t left my room for 16 hours.

There were some whispers between other participants of thinking of it like a grid. Then it hit me.

4 2#4: 4 spaces, 2 #, 4 spaces. The 1s at the end were 1 newline.

It’s pretty simple. You can solve it with one line of Ruby. One line that solved this whole ordeal and had haunted me for the last 15 hours. Ready? Here it goes:

puts LINES.gsub(/(\d+)(\D)/m) { $2 * $1.to_i }

So simple it hurts. And here is the result:



    ##############      ##        ####  ######  ##############
    ##          ##  ######    ##########        ##          ##
    ##  ######  ##  ##        ####  ######  ##  ##  ######  ##
    ##  ######  ##    ##      ##  ##    ######  ##  ######  ##
    ##  ######  ##  ######  ##  ####  ####      ##  ######  ##
    ##          ##  ##  ##    ##  ####    ##    ##          ##
    ##############  ##  ##  ##  ##  ##  ##  ##  ##############
                    ##  ####  ##  ########  ##
    ####  ##    ####      ##########        ##  ######  ####
        ########  ##  ##  ########  ######  ########  ########
    ######  ######          ######        ####    ####    ##
    ############  ##  ####  ##  ##  ##      ####        ##  ##
        ##      ####    ######    ########  ######    ##  ##
      ########    ####    ##          ##  ######          ##
      ####      ##            ####                ############
    ##  ####  ##          ##  ##  ##  ######  ##  ##  ##    ##
      ##        ##  ####  ##      ##  ##  ######      ##  ####
        ######    ####  ##  ##  ##  ##  ##  ######    ########
    ##  ##      ####  ##  ####      ######  ##  ######  ######
        ##    ##        ######  ####  ####  ##  ##    ##  ####
    ##    ####  ##  ##  ##  ##          ##############  ####
                    ##    ######    ####  ####      ##  ####
    ##############  ##      ######      ##  ##  ##  ########
    ##          ##    ########    ##      ####      ##  ##  ##
    ##  ######  ##    ##  ####  ##  ####    ##########      ##
    ##  ######  ##  ##    ##  ####  ######      ######  ####
    ##  ######  ##        ##            ##  ####    ####    ##
    ##          ##  ####  ####  ##  ############    ##    ##
    ##############  ################  ##    ######    ##  ##

You know what this is called? Run-Length Encoding. Yeah… I’m not particularly brilliant…

Well, how nice. A QR Code. And here I was searching for ciphers and text in it. Turns out most QR Code readers don’t really like the # sign, so we can change it for the FULL BLOCK character.

puts LINES.gsub(/(\d+)(\D)/m) { ($2 == "#" ? "█" : $2) * $1.to_i }



    ██████████████      ██        ████  ██████  ██████████████
    ██          ██  ██████    ██████████        ██          ██
    ██  ██████  ██  ██        ████  ██████  ██  ██  ██████  ██
    ██  ██████  ██    ██      ██  ██    ██████  ██  ██████  ██
    ██  ██████  ██  ██████  ██  ████  ████      ██  ██████  ██
    ██          ██  ██  ██    ██  ████    ██    ██          ██
    ██████████████  ██  ██  ██  ██  ██  ██  ██  ██████████████
                    ██  ████  ██  ████████  ██
    ████  ██    ████      ██████████        ██  ██████  ████
        ████████  ██  ██  ████████  ██████  ████████  ████████
    ██████  ██████          ██████        ████    ████    ██
    ████████████  ██  ████  ██  ██  ██      ████        ██  ██
        ██      ████    ██████    ████████  ██████    ██  ██
      ████████    ████    ██          ██  ██████          ██
      ████      ██            ████                ████████████
    ██  ████  ██          ██  ██  ██  ██████  ██  ██  ██    ██
      ██        ██  ████  ██      ██  ██  ██████      ██  ████
        ██████    ████  ██  ██  ██  ██  ██  ██████    ████████
    ██  ██      ████  ██  ████      ██████  ██  ██████  ██████
        ██    ██        ██████  ████  ████  ██  ██    ██  ████
    ██    ████  ██  ██  ██  ██          ██████████████  ████
                    ██    ██████    ████  ████      ██  ████
    ██████████████  ██      ██████      ██  ██  ██  ████████
    ██          ██    ████████    ██      ████      ██  ██  ██
    ██  ██████  ██    ██  ████  ██  ████    ██████████      ██
    ██  ██████  ██  ██    ██  ████  ██████      ██████  ████
    ██  ██████  ██        ██            ██  ████    ████    ██
    ██          ██  ████  ████  ██  ████████████    ██    ██
    ██████████████  ████████████████  ██    ██████    ██  ██

And the content?

Looks like a hash, but it’s so tiny… 96f493f1

My experience from the previous treasure hunts immediately told me what this was. It’s a nice taunt. It’s a hash, but it’s tiny. What do you do with it? tinyurl.com/96f493f1.

Step 3: Then Shalt Thou Count To Three

The URL led to a Google Drive folder with this image:

My first step was again to analyse the image. No metadata. No extra content inside it (unlike the previous one). Nothing.

One thing caught our attention. The letters in aaaarrrrggghhhh have different fonts. Some are composed of +, others of -.

I split these into two words.

..a...aaaa..rr.r.rr.gggg.gg.hh..               # letters with +
aa.aaa....rr..r.r..r....g..h..hh               # letters with -

Obviously there was no sense to make of this. I was so sleep deprived it took me quite a while to not think like a complete idiot. The following line of thought is one of my… uh… most brilliant ever, let’s call it that:

I noted the order of the + and -: --+---++++--++-+-++-++++-++-++--

It was obviously a sort of binary code. What binary codes do we know of? I tried tap code and morse code. Nothing.

I was so disheartened. I was chatting with @luisfcorreia and vented:

Could be morse, could be tap code. An IP maybe?

And then he saw it. Yep, it’s a binary code. Literally binary. I wasted away an hour before thinking of the obvious. Was I in a bad mental state? Yes. Was this a sign I needed coffee? Yes, a lot.

32 bits of binary. It’s an IP, he told me.

luis.beers += 1

Let’s look at it:

00100011 11001101 01101111 01101100
35      .205     .111     .108
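
If you'd rather not do the conversion by hand, here is a minimal Ruby sketch of the same decoding. It assumes, as above, that + stands for 1 and - stands for 0; the variable names are mine.

```ruby
# The order of the + and - glyphs, exactly as noted above.
pattern = "--+---++++--++-+-++-++++-++-++--"

# Assumption: "+" is a 1 bit and "-" is a 0 bit.
bits = pattern.tr("+-", "10")

# Split the 32 bits into four octets and read each one as a base-2 number.
ip = bits.scan(/.{8}/).map { |octet| octet.to_i(2) }.join(".")

puts ip
```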

We have cracked the grail.

Step 4: The Sound of Silence #

The IP address had a single file in it. synesthesia.png.

Synesthesia is basically your brain going belly up and you starting to hear colours or seeing sounds. Your senses get all mixed up. What senses can you tickle with a computer? Vision and hearing.

It was obvious to me, the image had a sound file in it. The first qualifier I went through, a few years ago, had content hidden in one of the channels of the image.

I isolated all the channels separately. Every combination: Red, Green, Blue, Alpha, Red & Green, Red & Blue, Green & Blue, RGB (no alpha). Two things caught my eye. The “no alpha” version and the “alpha” version.

# install imagemagick before this
convert synesthesia.png -alpha off no_alpha.png
convert synesthesia.png -channel RBG -fx 0 alpha.png

Let’s start by analysing alpha.png.

I had seen this type of pattern before in a different wargame. Sets of grayscale squares, seemingly at random, are an indicator of a binary file with PNG headers wrapped around it. They seem like random noise, but they really are not. You can clearly see some lines with similar “gray” values. Usually this indicates that there is information concealed.

I suspected the audio file was in the alpha channel and the confirmation came with no_alpha.png:

There’s nothing interesting in this image, except that it does look like an actual image instead of the spaghetti mess that synesthesia.png is. I reverse google searched the image and found the original. The evil quizmaster stole this artist’s intellectual property and compressed the image almost beyond recognition, all in the name of producing a mastercrime of a challenge for us. How thoughtful and sweet.

Having seen the original, I took it as a confirmation I was on the right track. Before going back to the alpha.png image, we need to understand how PNGs work.

Images, in general, are grids of pixels. Each pixel has 3 values ranging from 0 to 255, one for each of red, green and blue. The intensity of each colour combined determines the final colour of the pixel. For each channel we need 8 bits, so 24 in total for a single pixel. PNGs in particular can have 32 bits per pixel. The final 8 bits are for transparency, AKA the alpha channel.

My reasoning was set on the audio file being in there, everything pointed to it. However, when I extracted the alpha channel it was still an image. The problem was that during the extraction, I was converting back to a PNG and stretching out that information again, adding in the headers and in general screwing everything up.

I needed a way to pull out those 8 bit sequences and place them into a single file without converting it into an image. I was almost starting to implement my own code to do so when a dear friend (to whom I owe a beer for preventing me from doing this) hinted at the correct command.

# wrong command
$ convert synesthesia.png -alpha off no_alpha.png

# right command
$ convert -alpha extract synesthesia.png synesthesia.gray

$ file synesthesia.gray
synesthesia.gray: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III,  v2.5,  32 kbps, 8 kHz, Monaural
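
To make the idea concrete, here is a rough Ruby sketch of what pulling the alpha channel out amounts to, as far as I understand it: keep every fourth byte of the decoded pixel data and write those bytes out as-is. The three-pixel buffer below is made up for illustration (a real PNG has to be decompressed into raw RGBA bytes first), and I picked the alpha values so they spell the start of an ID3 header.

```ruby
# Hypothetical decoded pixel buffer: 3 pixels, 4 bytes each (R, G, B, A).
rgba = [
  10, 20, 30, 0x49,  # alpha byte 0x49 = "I"
  40, 50, 60, 0x44,  # alpha byte 0x44 = "D"
  70, 80, 90, 0x33   # alpha byte 0x33 = "3"
].pack("C*")

# Keep only the last byte of every 4-byte pixel: that is the alpha channel.
alpha = rgba.bytes.each_slice(4).map(&:last).pack("C*")

puts alpha
```

Written to a file without any image headers, those bytes are the hidden payload, which is why file recognises it as audio.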

He googled for more imagemagick options. I googled for C vim plugins and libs, ready to get my hands dirty. Let’s ponder about that for a second and never speak of it again.

Step 5: bacon. I really can’t think of a clever name for this section, I just hate the quizmaster so much. #

So now we have our audio file. Let’s open our minds for what we are about to hear, for we have been graced with - nevermind, close it, it’s morse, let’s just plug it into an online decoder

TO AVOID CREW SUSPICION REQUEST INFO WITH SMALL BACON. ONE BACON ONLY.

And what do you get out of this?

…

It’s obvious isn’t it?

You have to make a request…

…

… maybe?

To… some… server? I think?

Seriously, what. the. hell.

Ok, let’s break it down. We have to make a request for info. We have the previous server IP address, so intuition tells us it’s that.

nmap tells us… nothing.

PORT     STATE  SERVICE
22/tcp   open   ssh
80/tcp   open   http
3389/tcp closed ms-wbt-server

Maybe an HTTP request?

$ curl 35.205.111.108/bacon

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

$ curl 35.205.111.108/.bacon
# same response

$ curl -I 35.205.111.108 

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 04 Mar 2019 22:49:33 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Vary: Accept-Encoding

# aka: nothing in the headers

At this point I tried some custom headers. Content-Type: Bacon, Authorization: Bacon, Footloose: Kevin Bacon, Accepts: Bacon. Nothing. Not even Francis: Bacon.

At this point, the quizmaster, ever so helpful, said something that… threw me and @pfac off as apparently we both thought the same.

Hurry… Be quick… QUIC! That’s it!

Aaaaaaaaaand we went in circles for the next hour.

The right tip came with this:

What an oddly specific thing to say, I wond- oh it’s a quote from The Hunt for Red October. I’m sure it’s related to the movie. Turns out someone I now owe three beers to pointed me to the following line:

Give me a ping, Vasily. One ping only, please.

Elementary, now! We need to ping the server. How did I not immediately think of that?

$ ping -c 1 35.205.111.108
PING 35.205.111.108 (35.205.111.108): 56 data bytes

--- 35.205.111.108 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss

Of course it’s not replying to pings. This is actually a good indicator that we are onto the right track.

After googling the ICMP echo request message format, we can see that it has a payload.

At this point I was fairly certain we needed to send bacon (lowercase. Remember, “small bacon”) as the payload.
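
For the curious, this is roughly what such a packet looks like on the wire. The sketch below only builds the bytes of an ICMP echo request (type 8, code 0, checksum, identifier, sequence number, then the payload, per RFC 792) and does not send anything, since raw sockets need root. The method names and the id/seq values are mine, not something the challenge server required.

```ruby
# Internet checksum: one's complement of the one's complement sum of 16-bit words.
def internet_checksum(bytes)
  bytes += "\x00".b if bytes.bytesize.odd?               # pad to whole 16-bit words
  sum = bytes.unpack("n*").sum
  sum = (sum & 0xffff) + (sum >> 16) while sum > 0xffff  # fold the carries back in
  ~sum & 0xffff
end

# Build an echo request; id and seq are arbitrary illustrative values.
def echo_request(payload, id: 1, seq: 1)
  body = payload.b
  header = [8, 0, 0, id, seq].pack("CCnnn")              # checksum field zeroed first
  checksum = internet_checksum(header + body)
  [8, 0, checksum, id, seq].pack("CCnnn") + body
end

packet = echo_request("bacon")

puts "bacon".unpack1("H*")  # the payload as hex
puts packet.unpack1("H*")   # the whole packet as hex
```

A handy sanity check: running the checksum over a finished packet, checksum field included, should give 0.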

ping in macOS supports the -p flag for the payload. We can send 16 bytes of hex. I was sure of this. This was it. We got it. After so many hours hitting the wall. Here we go:

$ ping -c 1 -p "6261636f6e" 35.205.111.108
PATTERN: 0x6261636f6e
PING 35.205.111.108 (35.205.111.108): 56 data bytes

--- 35.205.111.108 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss

$ flip
(ノಠ益ಠ)ノ彡┻━┻

How? Maybe we padded it wrong. Maybe we should use 00000000000000000000006261636f6e as the string…

But still nothing.

How? I was so certain of it.

I’m not even sure how this happened, but @pfac said something like “I’m gonna try this on Linux. Can’t trust these Macs” while I just stood in disbelief.

Aaaaaand of course it worked, how could it not?

To this day, I don’t know why. Running the same command on macOS yielded nothing. After the quiz I asked the quizmaster about this and he went through the server logs. Nothing was coming on the payload of my requests so, yeah. Thanks for that, Apple.

Regardless, there was a different solution for this:

And you know what? This was much better! Want to know why!? It gave you the reply payload! Normal ping didn’t! Poor @pfac had to install Wireshark! And JAVA! EWWWWW.

Step 6: Bear with me man, I lost my train of thought… #

Finally we had it.

I.have.leaked.a.data.breach.in.the.usual.place:.DYS4VbmB

This was fairly obvious. Where else would you leak a data breach?

I've read this and, frankly, it's really embarrassing.

2018-02-11 03:38:23.040: DEBUG: PDU: 079153210100001001000C9153210100002000008A437498CD2EBBCF6510F9ED2E8740C332BB2C9687E96590380F0ABBE7F7B23CED3E83EE693A1A4447A7E7A0F9DB7DD62914C377985E2E83C6E830885E079DDF6F16285B1783DC757148317C87E9E532688C0E83E87510F9FD7681B261F43D8C0E2986EF30BD5C068DD16110BD0EA2BFDF2E50360C1AA3C3E110

Converting to ASCII or decimal gave us nothing. Alright, think. PDU. What is that? Google says it’s a Power Distribution Unit. Ok, this is maybe a kernel panic debug log? How do we decode this? Are there similar examples online? I can’t find any with that format. Maybe if we-

Nah, mate. I just googled “PDU Decoder”. Stick it in here. Apparently that’s a Protocol Data Unit? Anyway, it outputs the final text.

– @pfac

Ah… That’s… fair…

stop the overhead, refresh from your editor

2018-05-26T05:37:40-07:00

tl;dr: I am lazy and I made a script for when I don’t have webpack to refresh my browser for me. I can now refresh it from my editor. It’s available here. This is specific to macOS. Script is explained ahead.

I’m a lazy programmer. If anything requires me to get off my terminal or my vim, I will probably automate it. Like checking the most recent xkcd.

Sometimes I don’t have webpack to refresh my browser for me. This is an issue because it requires me to change focus from the terminal to the browser and then refresh. You may think that automating this is a huge overkill. However, I found that having a shortcut in my editor to do that has significantly decreased the time it takes for me to process everything that happened.

Let’s go through this one step at a time.

First of all, you need to do cmd+tab. You are very likely to code in full screen and in macOS which means you have a cute little animation that takes a few hundred milliseconds to change screens. Since this is a mechanical process, your muscles get used to doing cmd+tab once. This doesn’t take you necessarily to your browser. It takes you to the last window you had open. Oops. Now you instinctively do cmd+tab again. It takes you back to your last open window. Your terminal. Now you’re back where you started. This is a pattern I’ve seen happen recurrently in most developers.

But let’s assume the best case scenario where you effectively went to your browser. Your brain still has to process everything that happened and make sure you opened the right window. Only now will you hit cmd+r which requires a different hand movement and a couple more hundred milliseconds.

There is one final overhead: context switching. I found this happens not only to me but to other developers as well. The act of changing windows and refreshing has made your brain switch context to process the visual changes. This is mostly to make sure you are in the window you wanted to be. It takes a while for your brain to go back to the previous context and figure out what was the task you wanted to check. If you’re tired, this can mean up to a couple more seconds of overhead.

All this cognitive overhead can be solved by muscle memory. Like I said, that first cmd+tab is usually the result of muscle memory. The problem is everything that comes afterwards. We can leverage that and have a shortcut that finds the correct browser window, focuses on it if needed and refreshes in one go.

This is the script:

#!/usr/bin/osascript

tell application "System Events"
    set processList to get the name of every process whose background only is false
    set applicationNameList to {}

    repeat with processName in processList
      set applicationList to file of (application processes where name is processName)

      repeat with applicationAlias in applicationList
        set applicationName to (name of applicationAlias) as string
        set applicationNameList to applicationNameList & applicationName
      end repeat
    end repeat
end tell

set browserLaunched to true

if applicationNameList contains "Firefox Developer Edition.app" then
  set browser to "Firefox Developer Edition"
else if applicationNameList contains "Google Chrome.app" then
  set browser to "Google Chrome"
else if applicationNameList contains "Firefox.app" then
  set browser to "Firefox"
else
  set browser to "Firefox Developer Edition"
  set browserLaunched to false
end if

set numberOfDisplays to (do shell script "system_profiler SPDisplaysDataType -detailLevel | grep -e 'Resolution:' | wc -l | tr -d '[:space:]'") as integer

if browserLaunched and numberOfDisplays > 1 then
  set browserShouldActivate to false
else
  set browserShouldActivate to true
end if

tell application browser
  if browserShouldActivate then activate
end tell

tell application "System Events"
  tell process browser
    keystroke "r" using {command down}
    delay 0.1
  end tell
end tell

It works the following way:

Get the current application list. For those of you familiar with AppleScript, you will wonder why I go through the extra process of getting the application name, instead of using the process name. That’s because “Firefox Developer Edition” and “Firefox” are different applications but use the same “firefox” process.
Get the correct browser. I usually use FF Dev when working, so if it’s open, it’s going to be that one. Otherwise, I’m probably working in Chrome, so that’s the next check. If neither is open, try regular Firefox. Finally, just assume no browser is launched and set the flag to launch it. Those of you who use a different browser or browser stack should just change the order of the if statements.
Check the number of displays. If I’m using two displays, I’ll have the terminal on one and the browser in the other so I don’t need to change focus to it. If I’m using only one display, then I need to put the focus on the browser.
Activate if needed. If no browser is launched, this will launch it. Otherwise, it will just change the window focus.
Refresh the browser.

Save this into a browser.refresh file and put it somewhere in $PATH. The next step is to call it from the editor. I use nvim, so it is a simple one-liner: nnoremap <localleader>r :silent !browser.refresh<CR>.

This automation allows me to resolve all that hassle by pressing ,r.

I found that by using this, by the time the screen finishes switching, the refresh is almost always done. It also prevents me from the annoying cmd+tab dance I do when the browser isn’t the last window I opened.

Take the script. Put it in a file in your $PATH. Add a shortcut to your editor. Stop the overhead.

UPDATE: Some people told me they mostly use Chrome and the browser lookup thing is a bit of a hassle for them. This should work for you, change Chrome to your browser as needed:

#!/usr/bin/osascript

set numberOfDisplays to (do shell script "system_profiler SPDisplaysDataType -detailLevel | grep -e 'Resolution:' | wc -l | tr -d '[:space:]'") as integer

if numberOfDisplays > 1 then
  set browserShouldActivate to false
else
  set browserShouldActivate to true
end if

tell application "Google Chrome"
  if browserShouldActivate then activate
end tell

tell application "System Events"
  tell process "Google Chrome"
    delay 0.1
    keystroke "r" using command down
  end tell
end tell

I always wanted to do a screencast

2018-05-18T10:38:13-07:00

I always wanted to do a screencast. I was also always afraid to do it.

That being said, it’s called Beware of the Software and the first episode is here:

You can find the code for it here.

The screencast is going to be about… uh… computer stuff, let’s call it that. I can’t promise you it will always be about Elixir or distributed systems. But at least the next batch of episodes will be precisely on that. Distributed systems with Elixir. From there, I’m thinking about going through some CS papers. I also take suggestions if you’re willing to give them.

I’m still learning and experimenting with the format so it is far from perfect… Or maybe even “good”. But you can help me make it good.

I’m asking everyone for feedback. Drastically reducing the number of minutes, showing myself while coding, increasing the font size, tips for the mic and voice. Help has been invaluable. Hit me on Twitter with all you have (Svbtle doesn’t allow comments). Rip me a new one if you must, but all feedback is appreciated.

Hope you enjoy!
hack the gibson and all that.

mix format in vim from anywhere (or just in umbrella apps)

2018-04-23T03:31:59-07:00

Spoiler alert: This post is about setting up vim so that you run mix format automatically when you save a file and have it detect the nearest .formatter.exs

I have this weird issue with mix format and umbrella apps.

The issue is that mix format assumes you have a .formatter.exs file in the current directory. If you don’t, it doesn’t look upwards in the file tree. It simply assumes you want to run it with the default config. You can change this behaviour by using the --dot-formatter flag to explicitly point to the formatter file you want to use.

Now, in vim you can also use ale to run mix format on save. If you don’t know ale, take the time to check it out. I set up ale to do precisely this by adding the following line to my (n)vim config:

let g:ale_fixers['elixir'] = ['mix_format']

Most of the time you will be good to go with this and you won’t find any more issues.

But when I’m working with Elixir umbrella apps, I sometimes cd to the apps/<app> directly so that my Ctrl+P doesn’t get all cluttered by similarly named files from different applications. If you have a .formatter.exs file with custom rules and ale set up to run mix format on file save, things get tricky. mix format won’t detect your config unless you explicitly set the --dot-formatter flag.

As of a few weeks ago, my PR to allow custom mix format options in ale has been accepted.

With this, we can attempt to find the nearest .formatter.exs and dynamically pass the location to ale. Here’s a bit of vimscript to do so (ignore my n00bness writing vimscript):

function! LoadNearestFormatter()
  let l:formatters = []
  let l:directory = fnameescape(expand("%:p:h"))

  for l:fmt in findfile(".formatter.exs", l:directory . ";", -1)
     call insert(l:formatters, l:fmt)
  endfor

  call reverse(l:formatters)

  let g:ale_fixers['elixir'] = ['mix_format']

  if len(l:formatters) > 0
    let g:ale_elixir_mix_format_options = "--dot-formatter " . l:formatters[0]
  endif
endfunction

call LoadNearestFormatter()

If you have a better version of this, please let me know.

That bit of code will look for .formatter.exs files along the file tree and if any is found, it passes that along with the correct option.

You can also define a .formatter.exs in your home and use it as fallback for when no other .formatter.exs is present, but I’d advise against this.

@naps62 has the following vim code instead:

let l:git_root = system("git rev-parse --show-toplevel")[:-2]
let l:fmt = findfile(".formatter.exs", l:git_root)
let g:ale_elixir_mix_format_options = "--dot-formatter " . l:fmt

It’s a lot less verbose and it won’t search for .formatter.exs files in your home directory.

And that’s it! Now you can use ale to automatically run mix format inside umbrella apps.

I tend to write here so you can subscribe. I also tweet and do open source sometimes. If you are into that type of things, hit that follow button.

a look into bloom filters with ruby

2018-04-17T06:44:01-07:00

Disclaimer: this blog post was originally written and published in the Subvisual blog in April 2016.

I remember one particular class I had. It was late May and, as pretty much every Spring day in Portugal, the sun decided to greet us with a little too much enthusiasm.

The class was about Reliable Distributed Systems, as part of my Distributed Systems & Cryptography master’s program. Distributed Systems students at University of Minho have their classes every Monday in the mythical 0.05 room. A room conveniently located just a couple of meters away from the coffee machine. A room in front of a beautiful, grassy, green patch right in the middle of the campus. A room where the blazing heat caused by 6 straight hours of direct sunlight meets the noisy embrace of dozens of servers in the back. Of course, eager PhD students have millions of tests, queries and transactions to analyse, which doesn’t help our case at all. And of course they all come back from the weekend anxious to run them all at once while those poor, young and ambitious master’s students are having classes.

During that particular class, the professor was introducing P2P networks and Gnutella. At this point, everyone was in awe. I remember hearing in the distance “So this is how we piracy™…” I still find the lack of terrible corporation related puns a bit disturbing but maybe that noisy server embrace sucked all our humor away.

When the professor mentioned “bloom filters” my senses started to tickle. Maybe it was the late May heat or the fact that it sounded really fancy. But I was very bored, and I decided to check it out.

Bloom Filters #

I began my research by opening up the promised land of essays for university students: Wikipedia.

According to the Wikipedia entry, a bloom filter is a space-efficient probabilistic data structure, which at the time I thought was mostly technical jargon for a funky array that uses hash functions to index boolean values and is supposed to be really really small.

It’s used for testing the inclusion of elements in a set (is 6 in the bloom filter?), and some notorious adopters include Akamai, Bitcoin, Medium and loads of databases. Apparently, Gnutella uses it to check if a super-peer’s connections are sharing requested content. I probably could’ve learned that earlier if I was actually listening to the professor…

Before we delve into its internal behaviour, let’s make sure we get the basic definition and overall behaviour right.

Think of a Bloom Filter as a small blackbox where you can save values but not remove them. Another trait is that you can ask it whether it contains a certain value. If the response is negative, it’s guaranteed that the value is not in the bloom filter. However, if the response is positive, the value is probably in the bloom filter, but occasionally this won’t be true.

At this point, like me in that hot late Spring afternoon, you’re probably thinking:

Why would anyone even like this?

You put values in but you can’t remove them. You query values but you can’t trust the answer. As far as usefulness goes, you’ve probably labeled them already as the Magikarp of Computer Science.

Well, the thing about bloom filters is that they are very efficient, both in space and time. You can see if a value is inside the bloom filter in near-constant lookup and you don’t even need to save the element you are querying. In fact, most bloom filters use only a few bits per element. As we will see further ahead, if your application requires fast inclusion tests and can handle a few occasional false positives, bloom filters are for you. Let’s dissect our Magikarp.

Dissecting a Bloom Filter #

So how does such a peculiar data type work?

Bloom filters implement two operations: add and test. Both operations start by hashing the given value multiple times, either by using a different seed or running different hash functions. The output is a set of indexes or keys that will either be checked for inclusion (if we are testing) or marked as present (if we are adding).

Imagine I give you an empty bloom filter and you want to add subvisual. The string will be hashed 3 times and the 3 corresponding indexes will be filled up. The result should be something similar to this:

Ok, seems good. However, you are now curious and you begin to wonder if rubyconfpt is contained in the structure. You decide to test it.

The string will be hashed the same amount of times and the resulting indexes will be verified.

Even though one of the indexes was indeed filled up, the other two weren’t, so you can conclusively state that rubyconfpt isn’t in the filter. In fact, if as much as a single index reveals an empty entry, you can safely make this conclusion.

Eager for more values, you try adding rubyconfpt next. The resulting indexes will also be marked as present. Any repeated index will have no changes since the universe of possible values inside a bloom filter is only filled or empty.

Now, suppose you want to test if mirrorconf is in the bloom filter. I can assure you it isn’t, but you’re clever and curious. Instead of taking my word for it, you decide to test it anyway.

Even though mirrorconf was never added, the bloom filter is saying it indeed contains it. Well, probably contains. This happens because a bloom filter is a probabilistic data structure. The fact that we have a reduced number of indexes available to fill, along with the natural properties of hash functions, means that eventually collisions will occur. The use of multiple hashed values attempts to reduce the number of collisions, making them sparse but not nonexistent.
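
To put a rough number on “sparse”, there is a well-known approximation for the false positive rate of a bloom filter: (1 - e^(-k * n / m))^k, where m is the number of bits, k the number of hash functions and n the number of elements added so far. It isn’t part of the original walkthrough, and it assumes the hash functions behave independently and uniformly. The sizes below (1024 bits, 3 hashes) are just illustrative values.

```ruby
# Standard approximation for a bloom filter's false positive rate:
#   (1 - e^(-k * n / m))^k
# m = number of bits, k = number of hash functions, n = elements added so far.
def false_positive_rate(m:, k:, n:)
  (1 - Math.exp(-k * n.to_f / m))**k
end

# Watch the rate climb as a 1024-bit, 3-hash filter fills up.
[10, 100, 500].each do |n|
  rate = false_positive_rate(m: 1024, k: 3, n: n)
  puts format("n=%-4d -> %.4f", n, rate)
end
```

The takeaway is the shape of the curve: an almost empty filter rarely lies, and a crowded one lies a lot.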

Diving into Ruby #

We can implement a very simple bloom filter as an array or hash table. This will be a very dumbed-down, inefficient implementation. Let’s call it our Dumbfilter.

Everything I said so far mentioned hash functions, but are they really required? We’ll start by implementing it with a simple array. Every element we may want to add is going to be pushed into it. As a consequence, testing will be done using Array#include?. The resulting code looks something like this:

module DumbFilter
  class Array
    def initialize
      @data = []
    end

    def test(str)
      @data.include? str
    end

    def add(str)
      @data << str
    end
  end
end

If we take some time to think about the issues with this implementation, we can find some very obvious ones. Well, for starters you don’t get to play with hash functions which, at least for me during the symphony of servers orchestrated by my professor, was a big put-off. Besides that, Array#include? scans the array sequentially, so testing takes O(n) time (appending is cheap, but every lookup gets slower as the filter grows), not to mention the O(n) space needed to keep every element around.

Let’s try to improve our dumbfilter by reducing the time complexity. If hash functions are required for efficiency, we can achieve constant lookup by using a hash table. In fact, let’s make use of Ruby’s internal hash functions and just use the Hash#[]= operator to set the accessed value to true.

module DumbFilter
  class Hash
    def initialize
      @data = {}
    end

    def test(str)
      @data[str]
    end

    def add(str)
      @data[str] = true
    end
  end
end

This solution appears to be better since we now have constant access. However we are saving explicit (key, value) tuples and Bloom Filters are space-efficient data structures, so the current solution isn’t exactly what we are looking for. Our milestone will be the few bits per element I mentioned earlier. We can start by saving the values in an array and generating the correct indexes for each string. To do this, let’s start by adding @peterc’s bitarray to our project. We’ll also be using the fnv hash.

In this version we are going to hash a given string, obtaining an integer as a result. That integer has to be limited to the size of our array and we can guarantee that by using the modulo operation: index % size results in a value between 0 and size - 1. After that, adding and testing both become a simple access to the correct index, setting the bit to 1 if requested.

require "fnv"
require "bitarray"

module BloomFilter
  class V1
    def initialize(size: 1024)
      @bits = BitArray.new(size)
      @fnv = FNV.new
      @size = size
    end

    def add(str)
      @bits[i(str)] = 1
    end

    def test(str)
      @bits[i(str)] == 1
    end

    private

    def i(str)
      @fnv.fnv1a_64(str) % @size
    end
  end
end

The main issue with this version is that, over time, the bloom filter will become clogged with multiple false positives due to recurrent collisions. Since our universe of possible values is limited to the array size, bloom filters in particular tend to suffer from this effect. To handle it, we can either use multiple hash functions or the same hash function with different seeds. Let’s implement the latter.

To guarantee that multiple invocations with the same input produce the exact same output, we’ll need to generate the seeds and save them beforehand.

require "securerandom"

# Generate `nr` random 24-bit integer seeds, to be stored and reused on every hash call.
def seed(nr)
  Array.new(nr) { SecureRandom.hex(3).to_i(16) }
end

After generating and saving the seeds, we need to define how hashing will occur for multiple seeds. In our case, we will simply generate an array containing the hash value for every available seed.

This particular implementation uses the MurmurHash function which is internally used by Ruby. By using it, we can later compare results with the actual Hash implementation.

def i(str)
  @seeds.map { |seed| hash(str, seed) % @size }
end

def hash(str, seed)
  MurmurHash3::V32.str_hash(str, seed)
end

Having these three methods, we are now able to generate the same indexes in recurrent calls. Adding should be nothing more than marking every index with 1 and testing should be limited to retrieving the index values and checking if they are all 1. The final versions of the code are available here. Feel free to comment if you have any questions or want to add something.
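
To show how the pieces fit together without installing anything, here is a minimal sketch that sticks to Ruby’s standard library. It is not the final version linked above: I swapped MurmurHash3 for a seeded SHA-256 digest and BitArray for a plain Array of 0s and 1s (which throws away the space savings), and the class and method names are mine.

```ruby
require "digest"
require "securerandom"

module BloomFilter
  class Sketch
    def initialize(size: 1024, hashes: 3)
      @size = size
      @bits = Array.new(size, 0)  # stand-in for BitArray, not space-efficient
      # Seeds are generated once and kept, so the same string always maps
      # to the same indexes.
      @seeds = Array.new(hashes) { SecureRandom.hex(3).to_i(16) }
    end

    # Mark every index derived from the string as present.
    def add(str)
      indexes(str).each { |i| @bits[i] = 1 }
    end

    # True only if every derived index is set. A single empty index means
    # the string was definitely never added.
    def test(str)
      indexes(str).all? { |i| @bits[i] == 1 }
    end

    private

    def indexes(str)
      @seeds.map { |seed| seeded_hash(str, seed) % @size }
    end

    # Stand-in for MurmurHash3: first 8 bytes of a seeded SHA-256 digest.
    def seeded_hash(str, seed)
      Digest::SHA256.digest("#{seed}:#{str}").unpack1("Q>")
    end
  end
end

filter = BloomFilter::Sketch.new
filter.add("subvisual")
puts filter.test("subvisual")  # always true once added
```

Testing a value that was never added will usually come back false, but since the seeds are random I can’t promise which values will end up as false positives.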

In Summary #

By now I hope to have shown you what bloom filters are and how they work.

In the wild, companies like Quora and Medium use them to help tailor your suggestions. Facebook also uses bloom filters on type-ahead queries and bitly uses them for malicious URL checks, among several others.

As for Ruby there seem to be two alternatives that stand out. igrigorik’s bloomfilter-rb, which can work with Redis and act as counting/non-counting filter, and deepfryed’s bloom-filter. Both rely on C extensions.