Disentangling Task Transfer Learning

(page under construction -- release of models, data, live demo coming soon)
Amir R. Zamir, Alexander Sax*, William B. Shen*
Leonidas Guibas, Jitendra Malik, Silvio Savarese


The paper and supplementary material describing the methodology and evaluation.



Go to the Supervision API to find a transfer strategy for reducing supervision.

API Page


Download best-in-class pretrained models from the paper.

Pretrained Models


Download the data. About 4M images of indoor spaces, each with multiple task annotations.

Dataset
Demo of 20 Tasks

Upload an image and see live results from our trained networks on 20 vision tasks.

Live Demo

Transfer Visualization

Select a target task and source task(s) to see video visualizations of how well they transfer.

Transfer Visualization Page


Would having surface normals simplify the depth estimation of an image? Do visual tasks have a relationship, or are they unrelated? Common sense suggests that visual tasks are interdependent, implying the existence of structure among tasks. However, a proper model is needed for the structure to be actionable, e.g., to reduce the supervision required by utilizing task relationships. We therefore ask: which tasks transfer to an arbitrary target task, and how well? Or, how do we learn a set of tasks collectively, with less total supervision?
These are some of the questions that can be answered by a computational model of the vision tasks space, as proposed in this paper. We explore the task structure utilizing a sampled dictionary of 2D, 2.5D, 3D, and semantic tasks, and modeling their (1st and higher order) transfer behaviors in a latent space. The product can be viewed as a computational task taxonomy (Taskonomy) and a map of the task space. We study the consequences of this structure, e.g., the emerging task relationships, and exploit them to reduce supervision demand. For instance, we show that the total number of labeled datapoints needed to solve a set of 10 tasks can be reduced to 1/4 while keeping performance nearly the same by using features from multiple proxy tasks. Users can employ a provided Binary Integer Programming solver that leverages the taxonomy to find efficient supervision policies for their own use cases.

Process overview. The steps involved in creating the taxonomy.

Supervision API

The provided API uses our results to recommend an effective set of transfers. Using these transfers, you can approach the performance of a fully supervised network with substantially less data.

Example taxonomies. Generated from the API.

Pretrained Models

(click on video thumbnails)

Denoising Autoencoder

Uncorrupted version of corrupted image.

Surface Normals

Pixel-wise surface normals.

Z-buffer Depth

Range estimation.


Colorization

Coloring for grayscale images.


Reshading

Shading function with new lighting.

Room Layout

Orientation, size, and translation of the current room.

Camera Pose (fixated)

Relative camera pose with matched optical centers.

Camera Pose (nonfix.)

Relative camera pose with distinct optical centers.

Vanishing Points

Three Manhattan-world vanishing points.


Curvature

Magnitude of the principal curvatures.

Unsupervised 2D Segm.

Felzenszwalb/graph-cut oversegmentation on an RGB image.

Unsupervised 2.5D Segm.

Felzenszwalb/graph-cut oversegmentation on an RGB-D-Normals-Curvature image.

3D Keypoints

Keypoint estimation from geometric features.

2D Keypoints

Keypoint estimation from texture features.

Occlusion Edges

Edges which occlude parts of the scene.

Texture Edges

Edges computed from the RGB image.


In-painting

Masked centers of images.

Semantic Segmentation

Pixel-level semantic classification.

Object Classification

Knowledge distillation from ImageNet.

Scene Classification

Knowledge distillation from MIT Places.

Jigsaw Puzzle

Inverse permutation of a scrambled image.


Egomotion

Odometry with three camera poses.


Autoencoding

Image compression and decompression.

Point Matching

Classifying pairs of possibly matching images.


3.9 Mil. images, with multiple tags per image.



We provide a large and high-quality dataset of varied indoor scenes.

Complete pixel-level geometric information via aligned meshes.

Globally consistent camera poses. Complete camera intrinsics.

High-definition images.

3x as big as ImageNet.

* If you are interested in using the full dataset (12 TB), then please contact the authors.

Paper & Supplementary Materials

Zamir, Sax*, Shen*, Guibas, Malik, Savarese.
Taskonomy: Disentangling Task Transfer Learning.
CVPR 2018

Please cite the paper if you use the method, models, database, or API.

@ARTICLE {TaskonomyTaskTransfer2017,
 author = {Amir R. Zamir and Alexander Sax* and William B. Shen* and Leonidas J. Guibas and Jitendra Malik and Silvio Savarese},
 title = {Taskonomy: Disentangling Task Transfer Learning},
 journal = {CVPR},
 year = {2018},
}


We create our taxonomy in a three-step process:

  • Train tasks: Train each task using fully supervised learning on a large amount of data.
  • Train transfers: Train all possible pairwise transfers. A transfer takes the learned representation for one task and uses it to predict the output of another task. For example, we trained a network that takes the intermediate representation of the surface-normals network and uses those features for scene classification. We also trained higher-order transfers, which use more than one source task (e.g., normals AND curvature).
  • Normalize results: Comparing transfers is nontrivial. For example, how do we determine whether normals transfers to scene classification better than edge detection transfers to depth estimation? Scene classification uses cross-entropy loss while depth estimation uses mean L1 loss; the average losses of the task-specific networks even have different orders of magnitude! We use a tournament graph of all transfers to a task to determine each pairwise win rate, i.e., how often one transfer achieved a lower loss than another. We then derive a total ordering from this pairwise tournament matrix via the Analytic Hierarchy Process (AHP). The resulting ordering gives each source a priority value between 0 and 1, and these values can be compared across tasks. When computing our taxonomy, we maximize the total priority.
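The normalization step above can be sketched as follows. The win-rate matrix here is made up for illustration (the real matrices come from testing every transfer on held-out images), but the mechanics — build a reciprocal ratio matrix from pairwise win rates and take its principal eigenvector — match the AHP as described:

```python
import numpy as np

# Hypothetical win-rate matrix for 3 candidate source tasks and one target:
# w[i, j] = fraction of held-out images on which the transfer from source i
# achieved a lower target loss than the transfer from source j.
w = np.array([
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
])

# AHP ratio matrix: a[i, j] = w[i, j] / w[j, i] (reciprocal by construction).
a = w / w.T

# The principal eigenvector of a positive reciprocal matrix gives each
# source's priority (Perron-Frobenius guarantees a positive eigenvector).
eigvals, eigvecs = np.linalg.eig(a)
principal = eigvecs[:, np.argmax(eigvals.real)].real
priorities = np.abs(principal) / np.abs(principal).sum()  # normalize to sum to 1

print(priorities)  # source 0 dominates every pairing, so it ranks highest
```

Because the priorities are normalized to a common scale, sources for different targets (with incomparable raw losses) become directly comparable.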

Then we can solve for the taxonomy using a Binary Integer Program (BIP).
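As a minimal sketch of what the BIP decides, the core binary choice — which source tasks to supervise under a budget, so that the selected sources maximize total transfer priority — can be brute-forced on a toy instance. The task names, priorities, and budget below are made up, and the paper's actual program also encodes transfer costs and higher-order sources:

```python
from itertools import combinations

# Hypothetical AHP priorities: priority[source][target].
priority = {
    "normals":   {"depth": 0.45, "scene_cls": 0.20},
    "curvature": {"depth": 0.35, "scene_cls": 0.10},
    "autoenc":   {"depth": 0.20, "scene_cls": 0.70},
}

def best_policy(priority, budget):
    """Brute-force the binary choice of which source tasks to supervise.

    Each target is served by the best selected source; we maximize the
    summed priority over targets subject to a supervision budget.
    """
    sources = list(priority)
    targets = {t for p in priority.values() for t in p}
    best, best_val = None, -1.0
    for k in range(1, budget + 1):
        for subset in combinations(sources, k):
            val = sum(max(priority[s][t] for s in subset) for t in targets)
            if val > best_val:
                best, best_val = subset, val
    return best, best_val

policy, value = best_policy(priority, budget=2)
print(policy, round(value, 2))  # ('normals', 'autoenc') 1.15
```

A real BIP solver replaces this exponential search with an optimized branch-and-bound, which is what makes the approach practical for a full task dictionary.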

Page under construction. More details coming soon.

Taxonomy significance. The green line is our taxonomy; the grey lines show the performance of random feasible solutions (error bars show standard deviation).


We measure the effectiveness of our networks using two different metrics.

  • Gain: The win rate of a network versus standard supervised learning on the same number of data points.
  • Quality: The win rate of a network versus the task-specific networks trained on 120k images.
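Both metrics above are win rates computed over per-image losses on a shared test set. A minimal sketch with made-up losses (the names and numbers are illustrative only):

```python
# Hypothetical per-image losses of two networks on a shared 5-image test set.
transfer_losses = [0.21, 0.35, 0.18, 0.40, 0.23]  # transfer network
scratch_losses  = [0.30, 0.33, 0.22, 0.55, 0.24]  # trained from scratch

def win_rate(a, b):
    """Fraction of test images on which network `a` achieves lower loss than `b`."""
    return sum(la < lb for la, lb in zip(a, b)) / len(a)

gain = win_rate(transfer_losses, scratch_losses)
print(gain)  # 0.8 -> the transfer wins on 4 of 5 images
```

Gain uses a scratch network with the same number of data points as `b`; Quality uses the task-specific network trained on 120k images.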
