Now:
- fill in ParFor.read_only/write_only to avoid extra copying to/from the GPU
- Indexing by boolean masks
- LShift/RShift prims 
- Short-Circuit Or/And expressions
- Multiple comparisons chained with short-circuit And

Soon:
- Coarse parallelism for groups of IndexReduce/IndexScan results
- Fine grained tree-structured parallelism for IndexReduce/IndexScan inside CUDA kernels
- Support 'output' parameter of ufuncs 
- PreallocArrays to move locally used allocation of arrays out functions into their calling scope
- Garbage collection (or, at least, statically inferred deallocations)

On pause:
- Adverb semantics for conv
- Code generation for conv

Maybe never?
- Adverb-level vectorization 
- Revive the LLVM backend but using the Python C API at the extension boundary, use as default for Windows

Old:
- Run tiling on perfectly nested code
