Differentiable Shape Rendering
We want to introduce you to a set of differentiable operations called ShapeConvs, which render different shapes from a compact parameterization.
You should check out the CoordConv authors' blog. It's well written and also offers a wonderful video explaining the CoordConv layer.
We visualized the pixel coordinate channels as an image. Each pixel has a different coordinate and therefore a different color. By simply looking at the color of a pixel it is possible to tell where in the image that pixel lies.
All the information each pixel needs to decide whether it should be painted is given as several channels. The rectangle to be drawn is parameterized by four values ((x1,y1),(x2,y2)), representing its upper-left and lower-right corners, normalized to [0,1].
We merge the two pixel coordinate channels and the four rectangle coordinate channels along the feature axis.
(I) coordinates_image = [xc,yc,x1,y1,x2,y2] channelwise
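Step (I) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming pixel coordinates normalized to [0,1]; the exact normalization in the original implementation may differ.

```python
import numpy as np

def coordinates_image(h, w, rect):
    """Stack pixel coordinate channels with broadcast rectangle
    coordinates, CoordConv-style. rect = (x1, y1, x2, y2) in [0, 1].
    Assumed channel layout: [xc, yc, x1, y1, x2, y2]."""
    # Normalized pixel coordinate grids in [0, 1].
    yc, xc = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w),
                         indexing="ij")
    # Broadcast each scalar rectangle coordinate to a full channel.
    rect_channels = [np.full((h, w), v) for v in rect]
    return np.stack([xc, yc, *rect_channels], axis=-1)  # (h, w, 6)

img = coordinates_image(64, 64, (0.25, 0.25, 0.75, 0.75))
```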
Constraint Satisfaction Evaluation
Each pixel (x,y) lies within the rectangle ((x1,y1),(x2,y2)) if

x1 <= x <= x2 and y1 <= y <= y2

This is exactly true if:

(x-x1)*(x2-x) >= 0 and (y-y1)*(y2-y) >= 0
We want to evaluate every pixel on our canvas and check whether or not it should be painted. By using existing neural network operations we keep our operation efficient while the deep learning framework maintains the gradient flow needed for later learning.
The output of our convolutional layer has two channels; each one expresses how strongly the corresponding per-axis constraint is fulfilled or violated.
(II) differences = (x-x1,y-y1) * (x2-x,y2-y)
Now every value greater than zero indicates that the constraint is fulfilled, while negative values indicate a violation.
We apply a Rectified Linear Unit (ReLU) to the convolved feature maps.
Pixels outside of the rectangle, which were negative, are now all set to zero. The pixels inside the rectangle keep the same intensity as before the ReLU operation.
(III) filtered = relu(differences)
The two filtered feature channels get collapsed to a scalar map of the same resolution by multiplying along the feature axis.
If any constraint was violated, the corresponding value is set to 0, so the pixel of the collapsed output will also be 0. Hence every pixel in the collapsed tensor that is greater than 0 lies within the wanted rectangle.
(IV) collapsed = multiply_along_feature_axis(filtered)
Our desired output should be binary, as pixels can either be painted or not. Unfortunately this binarization cannot be implemented directly, because we can't propagate the gradient through hard thresholds. Therefore we use a scaled hyperbolic tangent as a "soft" binarization.
After the inside/outside ReLU we know all values to be greater than or equal to zero. Applying a tanh function squeezes them into the range [0,1).
The desired output of our renderer should always be either zero or one. To avoid grayish values at the edges and get crisp renderings, we scale the input by several orders of magnitude, effectively 'sharpening up' our rendering.
Choosing the right scaling factor is a delicate business. Too small, and our renderings look blurry because the operation is not "binary enough". Too large, and we face the vanishing gradient problem because the activation function saturates.
(V) binary = tanh(1e5 * collapsed)
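The effect of the scaling factor can be checked numerically with a few lines of NumPy:

```python
import numpy as np

x = np.array([0.0, 1e-4, 1e-2])   # small positive "collapsed" values
for k in (1.0, 1e5):              # soft vs. sharpened binarization
    print(k, np.tanh(k * x))
# With k = 1 the tiny positive values stay grayish (on the order of
# the input itself); with k = 1e5 they saturate toward 1.0,
# giving a crisp, near-binary mask.
```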
The final output of our rendering operation is obtained by multiplying the binary mask with the desired color.
(VI) output = color * binary
Below we compare a rectangle rendered with plain NumPy (left) and with our differentiable TensorFlow implementation (right).
1. coordinates_image = [xc,yc,x1,y1,x2,y2] channelwise
2. differences = (x-x1,y-y1) * (x2-x,y2-y)
3. filtered = relu(differences)
4. collapsed = multiply_along_feature_axis(filtered)
5. binary = tanh(1e5 * collapsed)
6. output = color * binary
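The six steps above can be sketched in plain NumPy. This is a framework-free illustration; the actual differentiable version would use the analogous TensorFlow ops so that gradients flow, but the arithmetic is identical.

```python
import numpy as np

def render_rect(h, w, rect, color=1.0, sharpness=1e5):
    """NumPy sketch of the RectConv rendering pipeline (steps 1-6)."""
    x1, y1, x2, y2 = rect                       # normalized to [0, 1]
    # (1) pixel coordinate channels
    y, x = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w),
                       indexing="ij")
    # (2) per-axis constraint products: positive inside, negative outside
    differences = np.stack([(x - x1) * (x2 - x),
                            (y - y1) * (y2 - y)], axis=-1)
    # (3) suppress pixels that violate a constraint
    filtered = np.maximum(differences, 0.0)
    # (4) collapse channels: nonzero only inside the rectangle
    collapsed = np.prod(filtered, axis=-1)
    # (5) sharpened soft binarization
    binary = np.tanh(sharpness * collapsed)
    # (6) colorize
    return color * binary

canvas = render_rect(64, 64, (0.25, 0.25, 0.75, 0.75))
```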
Supervised Rendering
Liu et al. defined the Supervised Rendering task: given an (x,y) coordinate, render a 64x64 pixel image in which a white square of side length 9 is drawn at that position. A combination of CoordConv and regular convolutions solves this remarkably well, while a plain convolutional neural network gives a disappointing result.
Supervised Rectangle Rendering
Initially we only wanted to extend the square rendering toy dataset, but we found rectangle rendering interesting because it encodes not only the position of the rendered shape but also its dimensions.
Each rendered canvas has dimensions 64x64 and contains a single rectangle at a random position with a random shape; every side has a random length in the range [4,40].
Given a rendered rectangle, a convolutional encoder regresses the rectangle's bounding box [x1,y1,x2,y2]. A RectConv operation then renders the encoded rectangle. This compact autoencoding is guided by the Intersection over Union metric.
We use a few vanilla convolutional layers to transform a given rendering into the four required values for the upper-left and lower-right corners of a rectangle. Each convolution has a stride of 2, downsampling the input iteratively until the feature maps have size 4. Then we apply three fully connected layers, the final one returning four scalar values representing the rectangle coordinates (x1,y1,x2,y2).
ReLU is applied after each convolutional and fully connected layer. All biases are initialized with zero except that of the final fully connected layer, which should be initialized in a space-spanning way so that the initial rectangles cover a significant portion of the canvas (we use 25% of the total area). If all four values were very small, the resulting rectangle would also be really small; the probability of hitting a given rectangle would be very low, so training would be very unstable, and the small coverage would yield a very weak gradient signal. Therefore (x1,y1) are initialized with 0.25 and (x2,y2) with 0.75. After the final ReLU, an additional tanh function guarantees valid pixel coordinates.
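As a small sanity check of this initialization (using the values stated above):

```python
# Space-spanning bias initialization: upper-left corner at (0.25, 0.25),
# lower-right corner at (0.75, 0.75), all in normalized coordinates.
x1 = y1 = 0.25
x2 = y2 = 0.75
area = (x2 - x1) * (y2 - y1)
print(area)  # 0.25 -> the initial rectangle covers 25% of the canvas
```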
One difference from the original definition is the loss function. While per-pixel sigmoid activation and cross-entropy loss work well together, we want to think of this problem in a more geometric way: RectConv essentially produces binary shapes, so we argue that an Intersection over Union (IoU) loss better describes the coverage of geometric shapes.
To use IoU as a loss function, its calculation has to be differentiable. We follow the implementation of Atiqur et al. to approximate this metric in an elegant way.
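A minimal NumPy sketch of such a differentiable IoU surrogate, assuming the common soft-mask formulation in which intersection and union are replaced by elementwise products and sums (the exact formulation in the referenced implementation may differ):

```python
import numpy as np

def soft_iou(pred, target, eps=1e-7):
    """Differentiable IoU surrogate for soft binary masks in [0, 1].
    Intersection and union are approximated by elementwise products
    and sums, so every operation has a well-defined gradient."""
    inter = np.sum(pred * target)
    union = np.sum(pred + target - pred * target)
    return inter / (union + eps)

def iou_loss(pred, target):
    """Loss to minimize: 1 at no overlap, 0 at perfect overlap."""
    return 1.0 - soft_iou(pred, target)
```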
Each training batch contains the rectangle coordinates and the rendered images and is created randomly on the fly. Our network sees two million training samples with a batch size of 8.
We use a vanilla Adam optimizer with a base learning rate of 0.0001 and exponentially decay it by a factor of 0.9 every 3000 steps.
The IoU measures how well the initial rectangle and the reconstructed rectangle match each other. An IoU of zero means no overlap at all, while an IoU of one signals perfect coverage.
As one might expect, randomly initialized parameters yield terrible performance. But as training progresses and the parameters are optimized over several thousand steps, the IoU climbs to almost 1 (final value 0.9944).
You can download our trained model here. Have a look at the beginning of the training progress: initially the rectangles are far off, but the network quickly learns to reconstruct the given rectangle remarkably well.
Training is quite fast: the whole process takes about 15 minutes on a GTX 1070. An IoU of almost 1 on randomly generated data is exactly what we anticipated, and RectConv can be used in combination with vanilla neural network layers. Nice!
Now that we can render rectangles, we want to render more shapes. Let's write a differentiable circle renderer; all we need is basic geometry expressed in vector arithmetic.
A pixel (x,y) lies within the circle with center (xc,yc) and radius r if (x-xc)^2 + (y-yc)^2 <= r^2.
1. pixel_coords = [x,y] channelwise
2. circle_coords = [xc,yc,r] channelwise
3. diff_center = sum([xc-x,yc-y]^2)
4. constraint_satisfaction = r^2 - diff_center
5. filtered = relu(constraint_satisfaction)
6. binary = tanh(1e5 * filtered)
7. output = color * binary
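The steps above translate to a short NumPy sketch (framework-free; a TensorFlow version would use the analogous ops):

```python
import numpy as np

def render_circle(h, w, center, radius, color=1.0, sharpness=1e5):
    """NumPy sketch of the differentiable circle renderer (steps 1-7)."""
    xc, yc = center                              # normalized to [0, 1]
    y, x = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w),
                       indexing="ij")
    # Squared distance of each pixel to the circle center.
    diff_center = (xc - x) ** 2 + (yc - y) ** 2
    # Positive inside the circle, negative outside.
    constraint = radius ** 2 - diff_center
    filtered = np.maximum(constraint, 0.0)
    binary = np.tanh(sharpness * filtered)
    return color * binary

canvas = render_circle(64, 64, (0.5, 0.5), 0.25)
```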
Rendering arbitrary triangles is a key ingredient of modern rendering engines: triangles are the most primitive shape, can be computed efficiently, and can approximate any other shape as closely as desired. Here we restrict ourselves to two-dimensional triangles.
To decide for each pixel whether or not it should be painted, we transform its Cartesian coordinates to barycentric coordinates and check the resulting lambdas.
The calculation of the lambdas is straightforward. We evaluate the constraints in two steps: first we check that each lambda is positive, then that it is smaller than 1.
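A possible NumPy sketch of this renderer, assuming the standard barycentric-coordinate formulas and reusing the ReLU/tanh machinery from the rectangle renderer (the vertex names a, b, c are illustrative):

```python
import numpy as np

def render_triangle(h, w, a, b, c, color=1.0, sharpness=1e5):
    """NumPy sketch of a differentiable triangle renderer via
    barycentric coordinates. a, b, c are (x, y) vertices in [0, 1]."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    y, x = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w),
                       indexing="ij")
    # Standard barycentric coordinates of each pixel w.r.t. (a, b, c).
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    l1 = ((by - cy) * (x - cx) + (cx - bx) * (y - cy)) / denom
    l2 = ((cy - ay) * (x - cx) + (ax - cx) * (y - cy)) / denom
    l3 = 1.0 - l1 - l2
    lambdas = np.stack([l1, l2, l3], axis=-1)
    # Positive iff 0 < lambda < 1, evaluated per lambda in two steps.
    filtered = np.maximum(lambdas, 0.0) * np.maximum(1.0 - lambdas, 0.0)
    collapsed = np.prod(filtered, axis=-1)
    binary = np.tanh(sharpness * collapsed)
    return color * binary

canvas = render_triangle(64, 64, (0.5, 0.1), (0.1, 0.9), (0.9, 0.9))
```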
In this article we proposed a new group of efficient, precise, and differentiable shape renderers called ShapeConvs. By broadcasting the representation of a shape onto each pixel, we can express the rendering process as a combination of basic operations manipulating every pixel. We calculate how much each individual pixel violates or fulfills the constraints of the shape. Rectified Linear Units offer a switch that suppresses all pixels outside the desired shape, followed by a soft binarization that only affects the pixels on the inside, yielding a binary mask which can be colored afterward.
Because we use only common basic operations, our approach can be replicated effortlessly in a few lines of code in any modern deep learning framework and simply plugged into existing neural network architectures.
In the future, we would like to render not only 2D shapes but also 3D faces. If we could render arbitrary faces in a differentiable way, we could perhaps infer compact 3D mesh models from images of the same object taken from several viewpoints.